The quality metric for this task is **ROC-AUC**.  
The format of the prediction vector is shown in the file sample_submit.csv.   
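
ROC-AUC depends only on how the predictions rank the applications, not on their absolute values. A minimal sketch in pure Python illustrates this; the helper `auc_from_pairs` and the toy labels are illustrative assumptions, not part of the competition code:

```python
def auc_from_pairs(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): two positives among five applications
y_true = [0, 0, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.20, 0.80]

auc = auc_from_pairs(y_true, proba)
# Rescaling the scores keeps the ordering, so the metric does not change
auc_scaled = auc_from_pairs(y_true, [p * 100 for p in proba])
# A constant prediction (such as an all-zero TARGET column) ranks nothing
auc_constant = auc_from_pairs(y_true, [0] * len(y_true))
print(auc, auc_scaled, auc_constant)
```

A constant submission therefore scores 0.5, which is the level any model has to beat.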
#### Data sources: 

train.csv - "application - target variable" pairs; features must be collected for this sample and a model trained on it;  
test.csv - "application - predicted value" pairs; features must be collected for this sample and predictions built for it;  
bki.csv - credit bureau (BKI) data on the client's previous loans;  
client_profile.csv - the client profile, i.e. what the company knows about the client;  
payments.csv - the client's payment history;  
applications_history.csv - the history of the client's previous applications.  


#### Task description:  
To build a model in this competition, the training sample has to be assembled first.  
The format closely resembles how Data Scientists build algorithms in industry: first analyze the data, then assemble the sample, and only then build models.   
The competition provides 4 types of data sources, which can be interpreted as tables in a database. Some sources are already in aggregated form and ready for modeling. The others have to be converted into a form the model can consume.

In [1]:
import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
import missingno as msno
import lightgbm as lgb
import catboost as cb
from sklearn import metrics
import seaborn as sns
from scipy.stats import probplot, ks_2samp
import warnings
from tqdm import tqdm
from typing import List, Tuple
from scipy.stats import ttest_rel
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, cross_val_score,GroupKFold,TimeSeriesSplit
from classes_and_functions import lgb_param_rnd_test,cb_param_rnd_test,make_cross_validation,make_cross_validation_cb

warnings.simplefilter("ignore")
%matplotlib inline
pd.set_option('display.max_columns', None)

In [3]:
PATCH = r'D:\train\kagle/'
patch_train = PATCH + 'train.csv'
patch_test = PATCH + 'test.csv'
patch_bki = PATCH + 'bki.csv'
patch_client_profile = PATCH + 'client_profile.csv'
patch_payments = PATCH + 'payments.csv'
patch_applications_history = PATCH + 'applications_history.csv'
patch_samble_submit = PATCH + 'sample_submit.csv'


Let's take a look at the data

In [4]:
# submission format
pd.read_csv(patch_samble_submit)


Unnamed: 0,APPLICATION_NUMBER,TARGET
0,123724268,0
1,123456549,0
2,123428178,0
3,123619984,0
4,123671104,0
...,...,...
165136,123487967,0
165137,123536402,0
165138,123718238,0
165139,123631557,0


In [5]:
test = pd.read_csv(patch_test)
test

Unnamed: 0,APPLICATION_NUMBER,NAME_CONTRACT_TYPE
0,123724268,Cash
1,123456549,Cash
2,123428178,Credit Card
3,123619984,Cash
4,123671104,Cash
...,...,...
165136,123487967,Cash
165137,123536402,Cash
165138,123718238,Cash
165139,123631557,Cash


In [6]:

test.NAME_CONTRACT_TYPE.value_counts()

Cash           149432
Credit Card     15709
Name: NAME_CONTRACT_TYPE, dtype: int64

In [7]:
sum(test.APPLICATION_NUMBER.duplicated())

0

No duplicates

In [8]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165141 entries, 0 to 165140
Data columns (total 2 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   APPLICATION_NUMBER  165141 non-null  int64 
 1   NAME_CONTRACT_TYPE  165141 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.5+ MB


In [9]:
test.APPLICATION_NUMBER.sort_values()

85962     123423342
127477    123423343
129290    123423344
55321     123423346
158817    123423347
            ...    
142454    123730847
141662    123730848
75755     123730849
9566      123730850
76629     123730851
Name: APPLICATION_NUMBER, Length: 165141, dtype: int64

All applications in test are unique. Applications are either for cash loans or for credit cards. A prediction is required for 165,141 applications: a probability from 0 to 1 (since the metric is ROC-AUC).
The applications appear to be numbered sequentially, which means the application number carries information about time. The rows themselves, however, are shuffled.

In [10]:
train = pd.read_csv(patch_train)
train

Unnamed: 0,APPLICATION_NUMBER,TARGET,NAME_CONTRACT_TYPE
0,123687442,0,Cash
1,123597908,1,Cash
2,123526683,0,Cash
3,123710391,1,Cash
4,123590329,1,Cash
...,...,...,...
110088,123458312,0,Cash
110089,123672463,0,Cash
110090,123723001,0,Cash
110091,123554358,0,Cash


In [11]:
train.NAME_CONTRACT_TYPE.value_counts()

Cash           99551
Credit Card    10542
Name: NAME_CONTRACT_TYPE, dtype: int64

In [12]:
sum(train.APPLICATION_NUMBER.duplicated())

0

No duplicates

In [13]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110093 entries, 0 to 110092
Data columns (total 3 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   APPLICATION_NUMBER  110093 non-null  int64 
 1   TARGET              110093 non-null  int64 
 2   NAME_CONTRACT_TYPE  110093 non-null  object
dtypes: int64(2), object(1)
memory usage: 2.5+ MB


In [14]:
train.APPLICATION_NUMBER.sort_values()

47058     123423341
6474      123423345
2373      123423349
78690     123423351
21529     123423352
            ...    
102567    123730828
82146     123730830
38743     123730833
38221     123730838
26422     123730843
Name: APPLICATION_NUMBER, Length: 110093, dtype: int64

All applications in train are unique. Applications are either for cash loans or for credit cards. There are 110,093 applications for training, each with a target of 0 or 1.  
The applications appear to be numbered sequentially, which means the application number carries information about time. The rows themselves, however, are shuffled.

In [15]:
train.TARGET.value_counts(),train.TARGET.value_counts(normalize=True)

(0    101196
 1      8897
 Name: TARGET, dtype: int64,
 0    0.919187
 1    0.080813
 Name: TARGET, dtype: float64)

The share of the positive target is very small, only about 8%: roughly 9,000 rows in total.
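
One common way to account for such imbalance in gradient boosting is to weight the positive class by the negative-to-positive ratio. The sketch below only computes that ratio from the counts printed above; passing it as `scale_pos_weight` (a parameter that LightGBM, XGBoost and CatBoost accept) is an assumption, and whether it improves ROC-AUC on this data has not been checked in this notebook:

```python
# Class counts taken from the value_counts() output above
n_neg, n_pos = 101196, 8897

positive_share = n_pos / (n_neg + n_pos)
# Candidate weight for the positive class (assumption: used as scale_pos_weight)
scale_pos_weight = n_neg / n_pos

print(round(positive_share, 4), round(scale_pos_weight, 2))
```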

In [16]:
client = pd.read_csv(patch_client_profile)
client

Unnamed: 0,APPLICATION_NUMBER,GENDER,CHILDRENS,TOTAL_SALARY,AMOUNT_CREDIT,AMOUNT_ANNUITY,EDUCATION_LEVEL,FAMILY_STATUS,REGION_POPULATION,AGE,DAYS_ON_LAST_JOB,OWN_CAR_AGE,FLAG_PHONE,FLAG_EMAIL,FAMILY_SIZE,EXTERNAL_SCORING_RATING_1,EXTERNAL_SCORING_RATING_2,EXTERNAL_SCORING_RATING_3,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,123666076,F,0,157500.0,270000.0,13500.0,Incomplete higher,Civil marriage,0.008068,8560,1549,,1,0,2.0,0.329471,0.236315,0.678568,0.0,0.0,0.0,0.0,1.0,2.0
1,123423688,F,0,270000.0,536917.5,28467.0,Secondary / secondary special,Married,0.020246,23187,365243,,0,0,2.0,,0.442295,0.802745,0.0,0.0,0.0,0.0,1.0,1.0
2,123501780,M,1,427500.0,239850.0,23850.0,Incomplete higher,Married,0.072508,14387,326,18.0,0,0,3.0,0.409017,0.738159,,,,,,,
3,123588799,M,0,112500.0,254700.0,17149.5,Secondary / secondary special,Married,0.019101,14273,1726,12.0,0,0,2.0,,0.308994,0.590233,0.0,0.0,0.0,0.0,0.0,3.0
4,123647485,M,0,130500.0,614574.0,19822.5,Lower secondary,Married,0.022625,22954,365243,,0,0,2.0,,0.739408,0.156640,0.0,0.0,1.0,0.0,0.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,123657254,M,0,216000.0,45000.0,2425.5,Higher education,Married,0.018850,19150,7415,,0,1,2.0,0.555436,0.581592,0.048259,0.0,0.0,0.0,0.0,1.0,3.0
249996,123645397,M,0,103500.0,675000.0,28507.5,Higher education,Married,0.014520,19604,1799,16.0,0,0,2.0,,0.676409,0.726711,0.0,0.0,0.0,0.0,0.0,0.0
249997,123504053,M,0,202500.0,1078200.0,38331.0,Secondary / secondary special,Single / not married,0.031329,8351,124,12.0,0,0,1.0,,0.353665,0.283712,0.0,0.0,0.0,0.0,1.0,4.0
249998,123547316,F,0,135000.0,500211.0,38839.5,Secondary / secondary special,Married,0.030755,13277,1603,,0,1,2.0,0.305746,0.682462,0.639708,0.0,0.0,0.0,0.0,0.0,3.0


The file contains client data tied to a specific application. That is already good: it suggests the data was accurate as of the moment the application was submitted.

In [17]:
client.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 24 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   APPLICATION_NUMBER          250000 non-null  int64  
 1   GENDER                      250000 non-null  object 
 2   CHILDRENS                   250000 non-null  int64  
 3   TOTAL_SALARY                250000 non-null  float64
 4   AMOUNT_CREDIT               250000 non-null  float64
 5   AMOUNT_ANNUITY              249989 non-null  float64
 6   EDUCATION_LEVEL             250000 non-null  object 
 7   FAMILY_STATUS               250000 non-null  object 
 8   REGION_POPULATION           250000 non-null  float64
 9   AGE                         250000 non-null  int64  
 10  DAYS_ON_LAST_JOB            250000 non-null  int64  
 11  OWN_CAR_AGE                 85041 non-null   float64
 12  FLAG_PHONE                  250000 non-null  int64  
 13  FLAG_EMAIL    

In [18]:
sum(client.APPLICATION_NUMBER.duplicated())

0

No duplicate applications

Let's check how well this data covers test and train (test and train together contain 275,234 records, while this file has only 250,000)

In [19]:
all_app = pd.concat((test.APPLICATION_NUMBER , train.APPLICATION_NUMBER))

In [20]:
sum(all_app.duplicated())

0

There are no repeated applications across train and test :)

In [21]:
d_map_app_index_client_test = {i:1 for  i in client.APPLICATION_NUMBER.values}
test['is_client_info'] = test.APPLICATION_NUMBER.map(d_map_app_index_client_test)
test['is_client_info'] = test['is_client_info'].fillna(0)
test['is_client_info'] = test['is_client_info'].astype(int)
test.is_client_info.value_counts(), test.is_client_info.value_counts(normalize = True)

(1    134176
 0     30965
 Name: is_client_info, dtype: int64,
 1    0.812494
 0    0.187506
 Name: is_client_info, dtype: float64)

In [22]:
d_map_app_index_client_train = {i:1 for  i in client.APPLICATION_NUMBER.values}
train['is_client_info'] = train.APPLICATION_NUMBER.map(d_map_app_index_client_train)
train['is_client_info'] = train['is_client_info'].fillna(0)
train['is_client_info'] = train['is_client_info'].astype(int)
train.is_client_info.value_counts(),train.is_client_info.value_counts(normalize = True)

(1    89539
 0    20554
 Name: is_client_info, dtype: int64,
 1    0.813303
 0    0.186697
 Name: is_client_info, dtype: float64)

As we can see, client data is missing for 30,965 applications in test  
In train, 20,554 applications have no client data  
In both cases this is about 18.7%
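
The same coverage flag can be computed more compactly with `Series.isin`, without building an intermediate dictionary. This is a sketch on toy frames; `coverage_flag`, `toy_train` and `toy_client` are hypothetical names introduced for illustration:

```python
import pandas as pd

def coverage_flag(apps: pd.Series, source_apps: pd.Series) -> pd.Series:
    """Return 1 where the application number is present in the source table, else 0."""
    return apps.isin(source_apps).astype(int)

# Toy stand-ins for train and client_profile
toy_train = pd.DataFrame({"APPLICATION_NUMBER": [1, 2, 3, 4]})
toy_client = pd.DataFrame({"APPLICATION_NUMBER": [2, 4, 5]})

toy_train["is_client_info"] = coverage_flag(
    toy_train.APPLICATION_NUMBER, toy_client.APPLICATION_NUMBER
)
print(toy_train.is_client_info.value_counts(normalize=True))
```

The helper works unchanged for the other source tables, since `isin` ignores repeated keys in the source.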


In [23]:
train

Unnamed: 0,APPLICATION_NUMBER,TARGET,NAME_CONTRACT_TYPE,is_client_info
0,123687442,0,Cash,1
1,123597908,1,Cash,0
2,123526683,0,Cash,1
3,123710391,1,Cash,1
4,123590329,1,Cash,0
...,...,...,...,...
110088,123458312,0,Cash,0
110089,123672463,0,Cash,1
110090,123723001,0,Cash,0
110091,123554358,0,Cash,1


In [24]:
bki = pd.read_csv(patch_bki)
bki

Unnamed: 0,APPLICATION_NUMBER,BUREAU_ID,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,123538884,5223613,Active,currency 1,718.0,0,377.0,,19386.810,0,675000.00,320265.495,0.0,0.0,Consumer credit,39.0,
1,123436670,6207544,Closed,currency 1,696.0,0,511.0,511.0,0.000,0,93111.66,0.000,0.0,0.0,Consumer credit,505.0,
2,123589020,6326395,Closed,currency 1,165.0,0,149.0,160.0,,0,36000.00,0.000,0.0,0.0,Consumer credit,150.0,0.0
3,123494590,6606618,Active,currency 1,55.0,0,310.0,,,0,38664.00,37858.500,,0.0,Consumer credit,15.0,
4,123446603,5046832,Active,currency 1,358.0,0,35.0,,,0,67500.00,0.000,0.0,0.0,Credit card,116.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945229,123673441,5235365,Closed,currency 1,2759.0,0,1298.0,1834.0,,0,332725.50,0.000,,0.0,Consumer credit,1707.0,
945230,123539211,5899696,Active,currency 1,359.0,0,1467.0,,,0,1471500.00,1320183.000,0.0,0.0,Consumer credit,47.0,
945231,123686333,5445504,Closed,currency 1,1102.0,0,725.0,370.0,,0,112500.00,0.000,0.0,0.0,Consumer credit,233.0,
945232,123508200,6679628,Active,currency 1,1579.0,0,2085.0,,2339.955,0,108000.00,0.000,0.0,0.0,Credit card,16.0,


In [25]:
bki.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 945234 entries, 0 to 945233
Data columns (total 17 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   APPLICATION_NUMBER      945234 non-null  int64  
 1   BUREAU_ID               945234 non-null  int64  
 2   CREDIT_ACTIVE           945234 non-null  object 
 3   CREDIT_CURRENCY         945234 non-null  object 
 4   DAYS_CREDIT             945234 non-null  float64
 5   CREDIT_DAY_OVERDUE      945234 non-null  int64  
 6   DAYS_CREDIT_ENDDATE     886797 non-null  float64
 7   DAYS_ENDDATE_FACT       596274 non-null  float64
 8   AMT_CREDIT_MAX_OVERDUE  326557 non-null  float64
 9   CNT_CREDIT_PROLONG      945234 non-null  int64  
 10  AMT_CREDIT_SUM          945229 non-null  float64
 11  AMT_CREDIT_SUM_DEBT     803483 non-null  float64
 12  AMT_CREDIT_SUM_LIMIT    619267 non-null  float64
 13  AMT_CREDIT_SUM_OVERDUE  945234 non-null  float64
 14  CREDIT_TYPE         

In [26]:
sum(bki.APPLICATION_NUMBER.duplicated(keep=False))

877092

As we can see, 877,092 of the 945,234 rows belong to applications that occur more than once

Number of unique applications

In [27]:
bki.APPLICATION_NUMBER.nunique()

273131

In [28]:
bki.APPLICATION_NUMBER.value_counts().head(25)

123444199    63
123493043    51
123641404    50
123603494    39
123574982    36
123604794    36
123543964    35
123648693    34
123718325    33
123541514    32
123748735    32
123753600    32
123568901    31
123665184    31
123603381    31
123708472    30
123619148    29
123758609    29
123700006    29
123732165    29
123699638    29
123577547    28
123628296    28
123652611    28
123533986    28
Name: APPLICATION_NUMBER, dtype: int64

Let's look at the most frequent application, which has 63 credit bureau records

In [29]:
bki[bki.APPLICATION_NUMBER==123444199]

Unnamed: 0,APPLICATION_NUMBER,BUREAU_ID,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
13351,123444199,6867146,Closed,currency 1,221.0,0,210.0,217.0,,0,23400.0,0.00,,0.0,Microloan,199.0,0.0
22605,123444199,6838810,Closed,currency 1,35.0,0,15.0,33.0,,0,18900.0,0.00,,0.0,Microloan,14.0,0.0
23775,123444199,6809390,Closed,currency 1,385.0,0,364.0,375.0,0.0,0,18000.0,0.00,0.00,0.0,Microloan,370.0,0.0
60695,123444199,6876163,Closed,currency 1,148.0,0,139.0,139.0,,0,31500.0,0.00,,0.0,Microloan,0.0,0.0
108977,123444199,6858817,Closed,currency 1,314.0,0,294.0,300.0,,0,49500.0,0.00,,0.0,Microloan,284.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
829178,123444199,6883034,Active,currency 1,768.0,0,1059.0,,,0,495000.0,8087.58,486912.42,0.0,Credit card,56.0,
851540,123444199,6885992,Closed,currency 1,315.0,0,300.0,315.0,,0,42750.0,0.00,,0.0,Microloan,299.0,0.0
871239,123444199,6897991,Closed,currency 1,253.0,0,244.0,244.0,0.0,0,45000.0,0.00,0.00,0.0,Microloan,242.0,0.0
898236,123444199,6876234,Closed,currency 1,427.0,0,409.0,409.0,0.0,0,22500.0,0.00,0.00,0.0,Microloan,337.0,67500.0


In [30]:
bki[bki.APPLICATION_NUMBER==123444199].BUREAU_ID.value_counts()

6867146    1
6804488    1
6832096    1
5114367    1
6839868    1
          ..
6847810    1
6844249    1
6824815    1
6895163    1
6831236    1
Name: BUREAU_ID, Length: 63, dtype: int64

Although this application has 63 bureau records, all of them are unique

Let's check the uniqueness of the APPLICATION_NUMBER + BUREAU_ID key; in theory it should be the unique key of the table

In [31]:
sum(bki[['APPLICATION_NUMBER','BUREAU_ID']].duplicated(keep=False))

28

As we can see, 28 rows share the same application number and bureau record ID with another row

In [32]:
bki[bki[['APPLICATION_NUMBER','BUREAU_ID']].duplicated(keep=False)].head(5)



Unnamed: 0,APPLICATION_NUMBER,BUREAU_ID,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
18373,123736638,6234312,Closed,currency 1,1235.0,0,885.0,1039.0,,0,52861.5,0.0,0.0,0.0,Consumer credit,1022.0,5359.5
31486,123769092,6483553,Closed,currency 1,1055.0,0,325.0,325.0,43056.0,0,900000.0,0.0,0.0,0.0,Consumer credit,325.0,0.0
70642,123546507,6690286,Active,currency 1,375.0,0,1451.0,,0.0,0,344389.5,245682.0,0.0,0.0,Consumer credit,10.0,
102496,123546507,6690286,Closed,currency 1,1035.0,0,822.0,853.0,0.0,0,36784.8,0.0,0.0,0.0,Consumer credit,853.0,
158248,123736638,6234312,Closed,currency 1,2047.0,0,12630.0,395.0,,0,495000.0,0.0,,0.0,Credit card,395.0,0.0


In [33]:
bki[(bki.APPLICATION_NUMBER ==123736638) &(bki.BUREAU_ID == 6234312) ]

Unnamed: 0,APPLICATION_NUMBER,BUREAU_ID,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
18373,123736638,6234312,Closed,currency 1,1235.0,0,885.0,1039.0,,0,52861.5,0.0,0.0,0.0,Consumer credit,1022.0,5359.5
158248,123736638,6234312,Closed,currency 1,2047.0,0,12630.0,395.0,,0,495000.0,0.0,,0.0,Credit card,395.0,0.0


Very strange: one application, one bureau record, yet the information differs
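
To see which fields disagree within such duplicated keys, one option is to count distinct values per column inside each key group. A sketch on a toy frame that mimics the pair shown above (`toy_bki` and its values are illustrative, not the real table):

```python
import pandas as pd

key = ["APPLICATION_NUMBER", "BUREAU_ID"]

# Toy data: the first two rows share a key but carry different information
toy_bki = pd.DataFrame({
    "APPLICATION_NUMBER": [10, 10, 11, 12],
    "BUREAU_ID": [7, 7, 8, 9],
    "CREDIT_TYPE": ["Consumer credit", "Credit card", "Consumer credit", "Microloan"],
    "AMT_CREDIT_SUM": [52861.5, 495000.0, 36000.0, 18900.0],
})

# All rows whose composite key occurs more than once
dups = toy_bki[toy_bki.duplicated(subset=key, keep=False)]

# Distinct values per column within each duplicated key; > 1 means the rows conflict
conflicts = dups.groupby(key).nunique()
conflicting_columns = conflicts.columns[(conflicts > 1).any()].tolist()
print(conflicting_columns)
```

How to resolve such conflicts (keep the latest record, keep both, or drop them) is left open here; the sketch only locates them.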

In [34]:
bki['BUREAU_ID'].nunique()

743651

In [35]:
sum(bki['BUREAU_ID'].duplicated(keep=False))

371912

In [36]:
bki.BUREAU_ID.value_counts().head(20)

5183679    7
5929269    7
5794512    7
5800551    6
6772730    6
5205471    6
5954685    6
5676064    6
6031936    6
6435929    6
5316508    6
6429302    6
5642133    6
6643514    6
5101211    6
6491432    6
5904401    6
5944547    6
6265842    6
5285858    6
Name: BUREAU_ID, dtype: int64

This is because each application has several records in the credit bureau history

In [37]:
bki.APPLICATION_NUMBER.nunique()

273131

273,131 unique applications: this is still fewer than the 275,234 needed for train and test

In [38]:
bki_app_uniq = bki.APPLICATION_NUMBER.unique()
d_map_app_index_bki = {i:1 for  i in bki_app_uniq}
test['is_bki_info'] = test.APPLICATION_NUMBER.map(d_map_app_index_bki)
test['is_bki_info'] = test['is_bki_info'].fillna(0)
test['is_bki_info'] = test['is_bki_info'].astype(int)
test.is_bki_info.value_counts(), test.is_bki_info.value_counts(normalize = True)

(1    126469
 0     38672
 Name: is_bki_info, dtype: int64,
 1    0.765824
 0    0.234176
 Name: is_bki_info, dtype: float64)

In [39]:

train['is_bki_info'] = train.APPLICATION_NUMBER.map(d_map_app_index_bki)
train['is_bki_info'] = train['is_bki_info'].fillna(0)
train['is_bki_info'] = train['is_bki_info'].astype(int)
train.is_bki_info.value_counts(), train.is_bki_info.value_counts(normalize = True)

(1    84508
 0    25585
 Name: is_bki_info, dtype: int64,
 1    0.767606
 0    0.232394
 Name: is_bki_info, dtype: float64)

Here too the data does not cover every client: about 23% of the applications in both test and train have no bureau records

In [40]:
payments = pd.read_csv(patch_payments)
payments

Unnamed: 0,PREV_APPLICATION_NUMBER,APPLICATION_NUMBER,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,49011181,123664960,1.0,5,1002.0,1015.0,12156.615,12156.615
1,48683432,123497205,1.0,13,442.0,432.0,18392.535,10047.645
2,48652024,123749925,1.0,10,8.0,23.0,5499.945,5499.945
3,48398897,123550846,0.0,82,398.0,398.0,7082.145,7082.145
4,49867197,123562174,0.0,63,1359.0,1359.0,156.735,156.735
...,...,...,...,...,...,...,...,...
1023927,50029793,123728077,0.0,123,993.0,993.0,2700.000,2700.000
1023928,48418780,123568892,0.0,73,529.0,529.0,232.335,232.335
1023929,49942303,123494001,2.0,24,389.0,393.0,23284.485,23284.485
1023930,50081462,123609565,0.0,4,2671.0,2671.0,9000.000,9000.000


In [41]:
sum(payments.APPLICATION_NUMBER.duplicated(keep=False))

949089

In [42]:
payments.APPLICATION_NUMBER.nunique()

264726

In [43]:
payments.APPLICATION_NUMBER.value_counts()

123619703    37
123486520    33
123449869    31
123682859    31
123632163    31
             ..
123731958     1
123778085     1
123722215     1
123639711     1
123560468     1
Name: APPLICATION_NUMBER, Length: 264726, dtype: int64

In [44]:
payments_app_uniq = payments.APPLICATION_NUMBER.unique()
d_map_app_index_payments = {i:1 for  i in payments_app_uniq}
test['is_payments_info'] = test.APPLICATION_NUMBER.map(d_map_app_index_payments)
test['is_payments_info'] = test['is_payments_info'].fillna(0)
test['is_payments_info'] = test['is_payments_info'].astype(int)
test.is_payments_info.value_counts(), test.is_payments_info.value_counts(normalize = True)

(1    122573
 0     42568
 Name: is_payments_info, dtype: int64,
 1    0.742232
 0    0.257768
 Name: is_payments_info, dtype: float64)

In [45]:
train['is_payments_info'] = train.APPLICATION_NUMBER.map(d_map_app_index_payments)
train['is_payments_info'] = train['is_payments_info'].fillna(0)
train['is_payments_info'] = train['is_payments_info'].astype(int)
train.is_payments_info.value_counts(), train.is_payments_info.value_counts(normalize = True)

(1    81967
 0    28126
 Name: is_payments_info, dtype: int64,
 1    0.744525
 0    0.255475
 Name: is_payments_info, dtype: float64)

In [46]:
applications_history = pd.read_csv(patch_applications_history)
applications_history

Unnamed: 0,PREV_APPLICATION_NUMBER,APPLICATION_NUMBER,NAME_CONTRACT_TYPE,AMOUNT_ANNUITY,AMT_APPLICATION,AMOUNT_CREDIT,AMOUNT_PAYMENT,AMOUNT_GOODS_PAYMENT,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,SELLERPLACE_AREA,CNT_PAYMENT,NAME_YIELD_GROUP,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,49298709,123595216,,1730.430,17145.0,17145.0,0.0,17145.0,Approved,73,Cash through the bank,XAP,,Repeater,Mobile,POS,XNA,35,12.0,middle,365243.0,42.0,300.0,42.0,37.0,0.0
1,50070639,123431468,Cash,25188.615,607500.0,679671.0,,607500.0,Approved,164,XNA,XAP,Unaccompanied,Repeater,XNA,Cash,x-sell,-1,36.0,low_action,365243.0,134.0,916.0,365243.0,365243.0,1.0
2,49791680,123445379,Cash,15060.735,112500.0,136444.5,,112500.0,Approved,301,Cash through the bank,XAP,"Spouse, partner",Repeater,XNA,Cash,x-sell,-1,12.0,high,365243.0,271.0,59.0,365243.0,365243.0,1.0
3,50087457,123499497,Cash,47041.335,450000.0,470790.0,,450000.0,Approved,512,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,-1,12.0,middle,365243.0,482.0,152.0,182.0,177.0,1.0
4,49052479,123525393,Cash,31924.395,337500.0,404055.0,,337500.0,Refused,781,Cash through the bank,HC,,Repeater,XNA,Cash,walk-in,-1,24.0,high,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1670209,49568678,123675354,,14704.290,267295.5,311400.0,0.0,267295.5,Approved,544,Cash through the bank,XAP,,Refreshed,Furniture,POS,XNA,43,30.0,low_normal,365243.0,508.0,362.0,358.0,351.0,0.0
1670210,49625245,123657974,,6622.020,87750.0,64291.5,29250.0,87750.0,Approved,1694,Cash through the bank,XAP,Unaccompanied,New,Furniture,POS,XNA,43,12.0,middle,365243.0,1604.0,1274.0,1304.0,1297.0,0.0
1670211,49927846,123572883,,11520.855,105237.0,102523.5,10525.5,105237.0,Approved,1488,Cash through the bank,XAP,"Spouse, partner",Repeater,Consumer Electronics,POS,XNA,1370,10.0,low_normal,365243.0,1457.0,1187.0,1187.0,1181.0,0.0
1670212,50053796,123723656,Cash,18821.520,180000.0,191880.0,,180000.0,Approved,1185,Cash through the bank,XAP,Family,Repeater,XNA,Cash,x-sell,-1,12.0,low_normal,365243.0,1155.0,825.0,825.0,817.0,1.0


In [47]:
applications_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 26 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   PREV_APPLICATION_NUMBER    1670214 non-null  int64  
 1   APPLICATION_NUMBER         1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE         940717 non-null   object 
 3   AMOUNT_ANNUITY             1297979 non-null  float64
 4   AMT_APPLICATION            1670214 non-null  float64
 5   AMOUNT_CREDIT              1670213 non-null  float64
 6   AMOUNT_PAYMENT             774370 non-null   float64
 7   AMOUNT_GOODS_PAYMENT       1284699 non-null  float64
 8   NAME_CONTRACT_STATUS       1670214 non-null  object 
 9   DAYS_DECISION              1670214 non-null  int64  
 10  NAME_PAYMENT_TYPE          1670214 non-null  object 
 11  CODE_REJECT_REASON         1670214 non-null  object 
 12  NAME_TYPE_SUITE            849809 non-null   object 
 13  NAME_CLIENT_

In [48]:
sum(applications_history.APPLICATION_NUMBER.duplicated(keep=False))

1609756

In [49]:
applications_history.APPLICATION_NUMBER.nunique()

338857

In [50]:
applications_history.APPLICATION_NUMBER.value_counts()

123511207    77
123589020    73
123497019    72
123565751    68
123530122    67
             ..
123458624     1
123635299     1
123750475     1
123564773     1
123514968     1
Name: APPLICATION_NUMBER, Length: 338857, dtype: int64

In [51]:
applications_history_app_uniq = applications_history.APPLICATION_NUMBER.unique()
d_map_app_index_applications_history = {i:1 for  i in applications_history_app_uniq}
test['is_history_info'] = test.APPLICATION_NUMBER.map(d_map_app_index_applications_history)
test['is_history_info'] = test['is_history_info'].fillna(0)
test['is_history_info'] = test['is_history_info'].astype(int)
test.is_history_info.value_counts(), test.is_history_info.value_counts(normalize = True)

(1    157084
 0      8057
 Name: is_history_info, dtype: int64,
 1    0.951211
 0    0.048789
 Name: is_history_info, dtype: float64)

In [52]:

train['is_history_info'] = train.APPLICATION_NUMBER.map(d_map_app_index_applications_history)
train['is_history_info'] = train['is_history_info'].fillna(0)
train['is_history_info'] = train['is_history_info'].astype(int)
train.is_history_info.value_counts(), train.is_history_info.value_counts(normalize = True)

(1    104656
 0      5437
 Name: is_history_info, dtype: int64,
 1    0.950614
 0    0.049386
 Name: is_history_info, dtype: float64)

In [53]:
train['all_info'] = train.is_bki_info+train.is_client_info+train.is_history_info + train.is_payments_info
test['all_info'] = test.is_bki_info+test.is_client_info+test.is_history_info + test.is_payments_info

In [54]:
train.all_info.value_counts(),test.all_info.value_counts()

(4    51666
 3    40629
 2    14608
 1     2903
 0      287
 Name: all_info, dtype: int64,
 4    76630
 3    61863
 2    21906
 1     4381
 0      361
 Name: all_info, dtype: int64)

In [55]:
train.head(3)

Unnamed: 0,APPLICATION_NUMBER,TARGET,NAME_CONTRACT_TYPE,is_client_info,is_bki_info,is_payments_info,is_history_info,all_info
0,123687442,0,Cash,1,1,1,1,4
1,123597908,1,Cash,0,1,1,1,3
2,123526683,0,Cash,1,1,1,1,4


In [56]:
train[train.all_info==0].TARGET.value_counts(),train[train.all_info==0].TARGET.value_counts(normalize = True)

(0    265
 1     22
 Name: TARGET, dtype: int64,
 0    0.923345
 1    0.076655
 Name: TARGET, dtype: float64)

287 applications in train and 361 in test have no information about the client at all; the share of the positive target among them (7.7% in train) is roughly the same as overall (8.1%)

### How to work with the data.  
So far I see two approaches:  
1. Derive flat statistics and features from the auxiliary files and attach them to each application in train  
2. Attach to train every available record for each application as separate columns (OHE-like)  

I will start with the second one, since it is simpler. I will use cross-validation and hold-out validation
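
For comparison, the first approach (flat statistics per application) could look roughly like the sketch below. The toy frames and the feature names `bki_cnt`, `bki_credit_sum_mean` and `bki_overdue_max` are hypothetical; which aggregates are actually useful is not established in this notebook:

```python
import pandas as pd

# Toy stand-ins for train and bki
toy_train = pd.DataFrame({"APPLICATION_NUMBER": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_bki = pd.DataFrame({
    "APPLICATION_NUMBER": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
    "CREDIT_DAY_OVERDUE": [0, 5, 0],
})

# One row of statistics per application
bki_stats = toy_bki.groupby("APPLICATION_NUMBER").agg(
    bki_cnt=("AMT_CREDIT_SUM", "size"),
    bki_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
    bki_overdue_max=("CREDIT_DAY_OVERDUE", "max"),
).reset_index()

# A left join keeps applications without bureau records (their features are NaN)
features = toy_train.merge(bki_stats, on="APPLICATION_NUMBER", how="left")
print(features)
```

This keeps the number of columns fixed regardless of how many records an application has, unlike the second approach.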



## Data processing No. 1

In [57]:
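def get_df(my_df, add):
    """Widen my_df: every record of each source in `add` becomes its own set of columns.

    `add` is a list of (DataFrame, name) pairs; the n-th record found for an
    application is written to columns named '{name}_{n}_{column}'.
    """
    df = my_df.copy()
    for data, name in add:
        # Every source column except the join key becomes a feature
        columns = data.columns.to_list()
        columns.remove('APPLICATION_NUMBER')
        for ind in df.index:
            # All source records for this application (a full scan per row, hence slow)
            my_data = data[data.APPLICATION_NUMBER == df.at[ind, 'APPLICATION_NUMBER']]
            for n, (_, row) in enumerate(my_data.iterrows()):
                for column in columns:
                    df.loc[ind, f'{name}_{n}_{column}'] = row[column]
    return df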
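
The row-by-row loop above scans the whole source table for every application. A vectorized sketch of the same widening is given below, assuming it is acceptable that columns come out grouped by source column rather than by record number; `widen` and the toy frames are hypothetical names, not part of the original notebook:

```python
import pandas as pd

def widen(data: pd.DataFrame, name: str, key: str = "APPLICATION_NUMBER") -> pd.DataFrame:
    """One row per key; the n-th record of a key becomes columns '{name}_{n}_{column}'."""
    # Number the records of each application: 0, 1, 2, ...
    numbered = data.assign(_n=data.groupby(key).cumcount())
    # Move the record number from the index into the column labels
    wide = numbered.set_index([key, "_n"]).unstack("_n")
    wide.columns = [f"{name}_{n}_{column}" for column, n in wide.columns]
    return wide.reset_index()

# Toy stand-ins for train and payments
toy_train = pd.DataFrame({"APPLICATION_NUMBER": [1, 2, 3]})
toy_payments = pd.DataFrame({
    "APPLICATION_NUMBER": [1, 1, 2],
    "AMT_PAYMENT": [10.0, 20.0, 30.0],
})

toy_wide = toy_train.merge(
    widen(toy_payments, "payments"), on="APPLICATION_NUMBER", how="left"
)
print(toy_wide)
```

Applications without records in the source simply get NaN in every widened column, as in the loop version.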




In [58]:
%%time
ready_data = 1
if ready_data==0:
    # build the data
    my_files=[(client,'client'),(bki,'bki'),(payments,'payments'),(applications_history,'applications_history')]
    train_work_1 = get_df(train,my_files)
    train_work_1.to_csv(PATCH + 'train_work_1',index=False)
    test_work_1 = get_df(test,my_files)
    test_work_1.to_csv(PATCH + 'test_work_1',index=False)
else:
    # load the prepared data
    train_work_1 = pd.read_csv(PATCH + 'train_work_1')
    test_work_1 = pd.read_csv(PATCH + 'test_work_1')
    
columns_train = set(train_work_1.columns.to_list())
columns_test  = set(test_work_1.columns.to_list())
diff = columns_test -columns_train 
diff = list(diff)
print(f'column difference: {len(diff)}')
# drop the extra columns from test
for i in diff:
    test_work_1.drop(i,axis=1,inplace=True)
# align the column order
columns = train_work_1.drop('TARGET',axis = 1).columns.to_list()
test_work_1 = test_work_1.reindex(columns=columns)
columns.append('TARGET')
train_work_1 = train_work_1.reindex(columns=columns)

column difference: 787
Wall time: 24min 52s


In [59]:
train_work_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110093 entries, 0 to 110092
Columns: 2408 entries, APPLICATION_NUMBER to TARGET
dtypes: float64(1655), int64(7), object(746)
memory usage: 2.0+ GB


In [60]:
test_work_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165141 entries, 0 to 165140
Columns: 2407 entries, APPLICATION_NUMBER to applications_history_63_NFLAG_INSURED_ON_APPROVAL
dtypes: float64(1654), int64(6), object(747)
memory usage: 3.0+ GB


## baseline

Prepare the data for boosting

In [61]:
%%time
cat_columns = test_work_1.select_dtypes(exclude=[np.number]).columns.to_list()
train_work_1[cat_columns] = train_work_1[cat_columns].astype(str)
test_work_1[cat_columns] = test_work_1[cat_columns].astype(str)
train_work_1[cat_columns] = train_work_1[cat_columns].astype('category')
test_work_1[cat_columns] = test_work_1[cat_columns].astype('category')

Wall time: 16min 18s


#### LGBMClassifier

In [62]:
model_lgb = lgb.LGBMClassifier(objective= "binary" ,metric= "auc", 
                                 n_jobs= 15,device="gpu")

In [63]:
%%time
model_lgb.fit(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'],\
              eval_set = [(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'])],verbose = False)

Wall time: 24.5 s


LGBMClassifier(device='gpu', metric='auc', n_jobs=15, objective='binary')

In [64]:
model_lgb.best_score_

defaultdict(collections.OrderedDict,
            {'valid_0': OrderedDict([('auc', 0.8395130743001351)])})

#### CatBoostClassifier

In [65]:
model_cb = cb.CatBoostClassifier(loss_function= "Logloss",
                                task_type= "GPU",
                                devices='0',     
                                verbose=False , 
                                eval_metric= "AUC",
                                thread_count= 15,
                                cat_features = cat_columns)

In [66]:
%%time
model_cb.fit(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'],\
             eval_set =  [(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'])],early_stopping_rounds  = 90)

Wall time: 14min 19s


<catboost.core.CatBoostClassifier at 0x2b59b2e0df0>

In [67]:
model_cb.best_score_

{'learn': {'Logloss': 0.2471396874403913, 'AUC': 0.7592366635799408},
 'validation': {'Logloss': 0.25486793693740745, 'AUC': 0.7257557809352875}}

#### Hold out

#### LGBMClassifier

In [68]:

x_train,x_test, y_train,y_test = train_test_split(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'],\
                                                 test_size= 0.2,random_state = 41,stratify = train_work_1['TARGET'] )

In [69]:
model_lgb = lgb.LGBMClassifier(objective= "binary" ,metric= "auc", n_jobs= 15)

In [70]:
%%time
model_lgb.fit(x_train,y_train,\
              eval_set = [(x_train,y_train),(x_test,y_test)],verbose = False,early_stopping_rounds  = 90)

Wall time: 34.5 s


LGBMClassifier(metric='auc', n_jobs=15, objective='binary')

In [71]:
model_lgb.best_iteration_,model_lgb.best_score_

(100,
 defaultdict(collections.OrderedDict,
             {'training': OrderedDict([('auc', 0.8552263702131265)]),
              'valid_1': OrderedDict([('auc', 0.7077339631004672)])}))

#### CatBoostClassifier

In [72]:
model_cb = cb.CatBoostClassifier(loss_function= "Logloss",
                                task_type= "GPU",
                                devices='0',     
                                verbose=False , 
                                eval_metric= "AUC",
                                thread_count= 15, )

In [73]:
%%time
model_cb.fit(x_train,y_train,cat_features = cat_columns,\
              eval_set = [(x_test,y_test)],verbose = False,early_stopping_rounds  = 90)

Wall time: 6min 46s


<catboost.core.CatBoostClassifier at 0x2b7b092fd30>

In [74]:
model_cb.best_iteration_,model_cb.best_score_

(418,
 {'learn': {'Logloss': 0.25025214060051776, 'AUC': 0.7509132325649261},
  'validation': {'Logloss': 0.2592167825940097, 'AUC': 0.7069675922393799}})

### Parameter tuning

In [75]:
cv_strategy = StratifiedKFold(n_splits=5,random_state = 41,shuffle =True)
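
The splitter defined above is later passed to `make_cross_validation`, which is imported from `classes_and_functions` and whose body is not shown in this notebook. As a rough sketch of what a comparable routine might do (the function `cross_validate_oof`, the synthetic data and the `LogisticRegression` stand-in are all assumptions made for illustration, not the author's implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cross_validate_oof(X, y, estimator, metric, cv):
    """Fit a fresh clone per fold; return fold scores and out-of-fold predictions."""
    oof = np.zeros(len(y), dtype=float)
    train_scores, valid_scores = [], []
    for train_idx, valid_idx in cv.split(X, y):
        model = clone(estimator)
        model.fit(X[train_idx], y[train_idx])
        train_pred = model.predict_proba(X[train_idx])[:, 1]
        # Each row is predicted exactly once, by a model that never saw it.
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        train_scores.append(metric(y[train_idx], train_pred))
        valid_scores.append(metric(y[valid_idx], oof[valid_idx]))
    return train_scores, valid_scores, oof


# Synthetic stand-in for the notebook's data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=41)

train_scores, valid_scores, oof = cross_validate_oof(
    X, y, LogisticRegression(max_iter=1000), roc_auc_score, cv
)
print(f"CV valid: {np.mean(valid_scores):.4f} +/- {np.std(valid_scores):.3f}")
print(f"OOF-score = {roc_auc_score(y, oof):.4f}")
```

The out-of-fold score is computed once over all rows, so it can differ slightly from the mean of the per-fold scores.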

#### LGBMClassifier

In [76]:
params_lgb = {
        "boosting_type": "gbdt",
        "objective": "binary",
        'num_leaves':7,
        'max_depth':7,
        'learning_rate':0.25,
        "metric": "auc",
        "n_jobs": 15,
        'reg_alpha':20,
        'reg_lambda':20,
        'n_estimators':100 ,  
        "random_state": 27}

In [77]:
%%time
model_lgb = lgb.LGBMClassifier(**params_lgb)
model_lgb.fit(x_train,y_train,\
              eval_set = [(x_train,y_train),(x_test,y_test)],verbose = False,early_stopping_rounds  = 90)
model_lgb.best_iteration_,model_lgb.best_score_

Wall time: 20.4 s


(100,
 defaultdict(collections.OrderedDict,
             {'training': OrderedDict([('auc', 0.7584523555481649)]),
              'valid_1': OrderedDict([('auc', 0.7121078397065457)])}))

In [78]:
%%time
params_lgb = {
        "boosting_type": "gbdt",
        "objective": "binary",
        'num_leaves':7,
        'max_depth':7,
        'learning_rate':0.25,
        "metric": "auc",
        "n_jobs": 15,
        'reg_alpha':20,
        'reg_lambda':20,
        'n_estimators':31 ,  
        "random_state": 27}

model_lgb = lgb.LGBMClassifier(**params_lgb)
model_lgb.fit(x_train,y_train,\
              eval_set = [(x_train,y_train),(x_test,y_test)],verbose = False)
model_lgb.best_iteration_,model_lgb.best_score_

Wall time: 15.2 s


(None,
 defaultdict(collections.OrderedDict,
             {'training': OrderedDict([('auc', 0.7321917310880135)]),
              'valid_1': OrderedDict([('auc', 0.714537245021518)])}))

In [79]:
%%time
res =make_cross_validation(train_work_1.drop('TARGET',axis = 1),\
                           train_work_1['TARGET'],model_lgb, roc_auc_score, cv_strategy)

Fold: 1, train-observations = 88074, valid-observations = 22019
train-score = 0.7333, valid-score = 0.7179
Fold: 2, train-observations = 88074, valid-observations = 22019
train-score = 0.7309, valid-score = 0.7267
Fold: 3, train-observations = 88074, valid-observations = 22019
train-score = 0.7319, valid-score = 0.7178
Fold: 4, train-observations = 88075, valid-observations = 22018
train-score = 0.7339, valid-score = 0.7074
Fold: 5, train-observations = 88075, valid-observations = 22018
train-score = 0.733, valid-score = 0.7163
CV-results train: 0.7326 +/- 0.001
CV-results valid: 0.7172 +/- 0.006
OOF-score = 0.7171
Wall time: 1min 49s


The validation looks decent: the fold scores are stable.

But the score itself is low, so the data needs more work.
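
The summary lines of the LGBMClassifier cross-validation log can be re-derived from the per-fold values. The sketch below copies the fold scores from that log and computes the mean, the spread and the average train/validation gap. It assumes the population standard deviation (NumPy's default, `ddof=0`); how the helper computes its `+/-` figure is not shown in the notebook.

```python
import numpy as np

# Per-fold ROC-AUC values copied from the LGBMClassifier cross-validation log.
train_scores = np.array([0.7333, 0.7309, 0.7319, 0.7339, 0.7330])
valid_scores = np.array([0.7179, 0.7267, 0.7178, 0.7074, 0.7163])

valid_mean = valid_scores.mean()
valid_std = valid_scores.std()  # population std (ddof=0), NumPy's default
gap = train_scores.mean() - valid_mean  # average train/validation gap

print(f"valid: {valid_mean:.4f} +/- {valid_std:.3f}")
print(f"train - valid gap: {gap:.4f}")
```

A spread of well under one AUC point across folds, together with a small train/validation gap, is what supports reading the validation scheme as stable even though the absolute score is modest.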

#### CatBoostClassifier

In [81]:
params_cb = {
        "loss_function": "Logloss",     
        'verbose':False , 
        "eval_metric": "AUC",
        "thread_count": 15, 
        "early_stopping_rounds": 90,
        "random_seed": 27, 
        'cat_features' : cat_columns,
        'max_depth':4,      
        'n_estimators':250 ,
        'learning_rate':0.3, 
        'l2_leaf_reg':2,      
        'min_child_samples':15,    
        'max_bin': 45}

In [82]:
%%time
model_cb = cb.CatBoostClassifier(**params_cb )
model_cb.fit(x_train,y_train,cat_features = cat_columns,\
              eval_set = [(x_train,y_train),(x_test,y_test)],verbose = False,early_stopping_rounds  = 90)
model_cb.best_iteration_, model_cb.best_score_

Wall time: 9min 8s


(82,
 {'learn': {'Logloss': 0.24653579161406874},
  'validation_0': {'Logloss': 0.24726302446252665, 'AUC': 0.7510094884880941},
  'validation_1': {'Logloss': 0.2572150102946844, 'AUC': 0.7143714159706901}})

In [83]:
%%time
params_cb = {
        "loss_function": "Logloss",     
        'verbose':False , 
        "eval_metric": "AUC",
        "thread_count": 15, 
#         "early_stopping_rounds": 90,
        "random_seed": 27, 
        'cat_features' : cat_columns,
        'max_depth':4,      
        'n_estimators':79 ,
        'learning_rate':0.3, 
        'l2_leaf_reg':2,      
        'min_child_samples':15,    
        'max_bin': 45}
model_cb = cb.CatBoostClassifier(**params_cb )
model_cb.fit(x_train,y_train,cat_features = cat_columns,\
              eval_set = [(x_train,y_train),(x_test,y_test)],verbose = False)
model_cb.best_iteration_, model_cb.best_score_

Wall time: 58 s


(78,
 {'learn': {'Logloss': 0.2518067666198074},
  'validation_0': {'Logloss': 0.252131296043299, 'AUC': 0.7337468184181887},
  'validation_1': {'Logloss': 0.2575488376157917, 'AUC': 0.7137777529677596}})

In [84]:
%%time
res_cb =make_cross_validation_cb(train_work_1.drop('TARGET',axis = 1),\
                           train_work_1['TARGET'],model_cb, roc_auc_score, cv_strategy)


Fold: 1, train-observations = 88074, valid-observations = 22019
train-score = 0.7325, valid-score = 0.7158
Fold: 2, train-observations = 88074, valid-observations = 22019
train-score = 0.7297, valid-score = 0.7212
Fold: 3, train-observations = 88074, valid-observations = 22019
train-score = 0.7345, valid-score = 0.7151
Fold: 4, train-observations = 88075, valid-observations = 22018
train-score = 0.7355, valid-score = 0.7039
Fold: 5, train-observations = 88075, valid-observations = 22018
train-score = 0.7347, valid-score = 0.7129
CV-results train: 0.7334 +/- 0.002
CV-results valid: 0.7138 +/- 0.006
OOF-score = 0.7136
Wall time: 5min 17s


In [85]:
for i in res_cb[0]:
    print(i.best_iteration_)

56
56
56
56
56


This is a bare baseline, without any data processing. Unfortunately, the work has to be submitted on time.
The next step is to work on the data to improve the result.
For now, as the assignment requires, I am submitting both variants to the leaderboard as they are.

## Preparing the results

#### LGBMClassifier

In [86]:
%%time
model_lgb.fit(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'],verbose = False)

Wall time: 40.5 s


LGBMClassifier(learning_rate=0.25, max_depth=7, metric='auc', n_estimators=31,
               n_jobs=15, num_leaves=7, objective='binary', random_state=27,
               reg_alpha=20, reg_lambda=20)

In [87]:
pred = model_lgb.predict_proba(test_work_1)[:,1]

In [88]:
submission_lgb_base = pd.DataFrame()
submission_lgb_base['APPLICATION_NUMBER'] = test_work_1['APPLICATION_NUMBER']
submission_lgb_base['TARGET'] = pred
submission_lgb_base.to_csv(PATCH + 'submission_lgb_base_0.csv',index=False)



#### CatBoostClassifier

In [89]:
params_cb = {
        "loss_function": "Logloss",     
        'verbose':False , 
        "eval_metric": "AUC",
        "thread_count": 15, 
#         "early_stopping_rounds": 90,
        "random_seed": 27, 
#         'cat_features' : cat_columns,
        'max_depth':5,      
        'n_estimators':82 ,
        'learning_rate':0.3, 
        'l2_leaf_reg':2,      
        'min_child_samples':15,    
        'max_bin': 45}
model_cb = cb.CatBoostClassifier(**params_cb )

In [90]:
%%time
model_cb.fit(train_work_1.drop('TARGET',axis = 1),train_work_1['TARGET'],verbose = False,cat_features = cat_columns )

Wall time: 1min 7s


<catboost.core.CatBoostClassifier at 0x2b5a4538940>

In [91]:
pred = model_cb.predict_proba(test_work_1)[:,1]

In [92]:
submission_cb_base = pd.DataFrame()
submission_cb_base['APPLICATION_NUMBER'] = test_work_1['APPLICATION_NUMBER']
submission_cb_base['TARGET'] = pred
submission_cb_base.to_csv(PATCH + 'submission_cb_base_0.csv',index=False)
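
Before uploading, a submission frame could be given a quick sanity check. The sketch below assumes the expected layout is exactly the two columns written above, `APPLICATION_NUMBER` and `TARGET`; the contents of sample_submit.csv are not shown in this notebook, and the helper `check_submission` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ["APPLICATION_NUMBER", "TARGET"]:
        problems.append("unexpected columns")
        return problems
    if submission["APPLICATION_NUMBER"].duplicated().any():
        problems.append("duplicate APPLICATION_NUMBER")
    if set(submission["APPLICATION_NUMBER"]) != set(expected_ids):
        problems.append("ids differ from the test set")
    if submission["TARGET"].isna().any():
        problems.append("missing predictions")
    elif not submission["TARGET"].between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


# Toy frames (hypothetical) standing in for the real submissions.
good = pd.DataFrame({"APPLICATION_NUMBER": [1, 2, 3], "TARGET": [0.1, 0.5, 0.9]})
bad = pd.DataFrame({"APPLICATION_NUMBER": [1, 2, 2], "TARGET": [0.1, np.nan, 1.2]})

print(check_submission(good, [1, 2, 3]))
print(check_submission(bad, [1, 2, 3]))
```

In the notebook the same check could be called as `check_submission(submission_cb_base, test_work_1['APPLICATION_NUMBER'])`, assuming `test_work_1` holds one row per test application.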