![title](ltfs.jpg)

# LTFS Data Science FinHack (ML Hackathon)

## L&T Financial Services & Analytics Vidhya presents **‘DataScience FinHack’**.

In this FinHack, you will develop a model for one of our most common real-world challenges, **‘Loan Default Prediction’, and also get a feel for our business!**

If your solution adds real value to our organization, take it from us: the sky is the limit for you!

## Problem Statement

### Vehicle Loan Default Prediction

Financial institutions incur significant losses due to defaults on vehicle loans. This has led to tighter vehicle loan underwriting and higher vehicle loan rejection rates. These institutions have also raised the need for a better credit risk scoring model. This warrants a study to estimate the determinants of vehicle loan default.

A financial institution has hired you to accurately predict the probability of a loanee/borrower defaulting on the first EMI (Equated Monthly Instalment) of a vehicle loan on its due date. The following information about the loan and the loanee is provided in the datasets:

* **Loanee Information (Demographic data like age, income, Identity proof etc.)**
* **Loan Information (Disbursal details, amount, EMI, loan to value ratio etc.)**
* **Bureau data & history (Bureau score, number of active accounts, the status of other loans, credit history etc.)**

Doing so will ensure that clients capable of repayment are not rejected, and it will help identify important determinants that can be used to minimise default rates.

### Data Description

* train.zip contains train.csv and data_dictionary.csv.
* train.csv contains the training data, with the loan details described in the previous section.
* data_dictionary.csv contains a brief description of each variable provided in the training and test sets.
* test.csv contains details of all customers and loans for which participants are to submit the probability of default.
* sample_submission.csv contains the submission format for predictions on the test set. A single CSV file needs to be submitted as the solution.
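
As a rough illustration of the submission step, the sketch below builds a two-column file of IDs and predicted probabilities. The column names `UniqueID` and `loan_default` are taken from the train.csv preview; that sample_submission.csv uses exactly these names is an assumption, and the probabilities here are made up.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-in for test.csv; only the ID column matters for the submission.
test = pd.DataFrame({"UniqueID": [755547, 716994, 709352]})

# Hypothetical predicted default probabilities, one per test row.
pred = np.array([0.12, 0.31, 0.07])

# Assumed layout: one ID column and one probability column.
submission = pd.DataFrame({"UniqueID": test["UniqueID"], "loan_default": pred})

# Written to an in-memory buffer here; pass a filename such as
# "submission.csv" to to_csv to produce an actual file instead.
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

Since the metric is rank-based, the probabilities should be submitted as-is rather than thresholded to 0/1.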
 

### Evaluation Metric

Submissions are evaluated on **area under the ROC curve** between the predicted probability and the observed target.
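
To make the metric concrete, here is a minimal sketch on four made-up predictions. It computes the AUC with scikit-learn's `roc_auc_score` and then recomputes it from the pairwise definition (the fraction of positive/negative pairs in which the positive example gets the higher score). The labels and scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])             # observed target
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probability of default

auc = roc_auc_score(y_true, y_score)

# Pairwise view: for every (defaulter, non-defaulter) pair, count 1 if the
# defaulter is scored higher, 0.5 for a tie, 0 otherwise, then average.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([float(p > n) + 0.5 * float(p == n) for p in pos for n in neg])
```

Three of the four pairs are ordered correctly here, so both values come out at 0.75. Only the ranking of the scores matters, not their calibration.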
 

### Public and Private Split
Test data is further randomly divided into **Public (25%) and Private (75%)** data.
Your initial responses will be checked and scored on the Public data.
The final rankings will be based on your Private score, which will be published once the competition is over.
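
The sketch below simulates why a public score can differ from the private one: the same predictions are scored on a random 25% subset and on the remaining 75%. The data are synthetic and the split procedure is an assumption; the organisers' exact mechanism is not described here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic labels and scores that are mildly informative about the labels.
y_true = rng.integers(0, 2, size=n)
y_score = 0.3 * y_true + rng.random(n)

# Random 25% "public" subset; the other 75% plays the role of "private".
is_public = np.zeros(n, dtype=bool)
is_public[rng.choice(n, size=n // 4, replace=False)] = True

public_auc = roc_auc_score(y_true[is_public], y_score[is_public])
private_auc = roc_auc_score(y_true[~is_public], y_score[~is_public])
```

The two numbers generally differ slightly because the public subset is smaller and therefore noisier, which is one reason to rely on local cross-validation rather than on the public leaderboard alone.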

In [1]:
import numpy as np
import pandas as pd


In [4]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.head()

Unnamed: 0,UniqueID,disbursed_amount,asset_cost,ltv,branch_id,supplier_id,manufacturer_id,Current_pincode_ID,Date.of.Birth,Employment.Type,...,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,AVERAGE.ACCT.AGE,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES,loan_default
0,420825,50578,58400,89.55,67,22807,45,1441,01-01-84,Salaried,...,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0,0
1,537409,47145,65550,73.23,67,22807,45,1502,31-07-85,Self employed,...,0,0,1991,0,0,1,1yrs 11mon,1yrs 11mon,0,1
2,417566,53278,61360,89.63,67,22807,45,1497,24-08-85,Self employed,...,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0,0
3,624493,57513,66113,88.48,67,22807,45,1501,30-12-93,Self employed,...,0,0,31,0,0,0,0yrs 8mon,1yrs 3mon,1,1
4,539055,52378,60300,88.39,67,22807,45,1495,09-12-77,Self employed,...,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,1,1


### Combining Train and Test

In [3]:
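# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True keeps the
# alphabetical column order that append produced here.
df=pd.concat([train,test],ignore_index=True,sort=True)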
df.tail()



Unnamed: 0,AVERAGE.ACCT.AGE,Aadhar_flag,CREDIT.HISTORY.LENGTH,Current_pincode_ID,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,Date.of.Birth,DisbursalDate,Driving_flag,Employee_code_ID,Employment.Type,...,State_ID,UniqueID,VoterID_flag,asset_cost,branch_id,disbursed_amount,loan_default,ltv,manufacturer_id,supplier_id
345541,0yrs 0mon,1,0yrs 0mon,2591,0,14-05-92,28-11-18,0,3823,Self employed,...,4,755547,0,72509,74,44875,,67.58,86,16646
345542,0yrs 0mon,1,0yrs 0mon,2593,0,24-04-84,19-11-18,0,3823,Self employed,...,4,716994,0,72509,74,53549,,78.61,86,16646
345543,4yrs 10mon,1,9yrs 5mon,2593,0,01-01-86,17-11-18,0,3823,Self employed,...,4,709352,0,72509,74,55243,,81.37,86,16646
345544,0yrs 3mon,1,0yrs 3mon,2593,1,01-01-73,15-11-18,0,3823,Self employed,...,4,699795,0,72509,74,55243,,81.37,86,16646
345545,0yrs 0mon,1,0yrs 0mon,2593,0,01-01-74,28-11-18,0,3823,Self employed,...,4,755244,0,72509,74,40274,,60.68,86,16646
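
A minimal sketch of the combine-then-split pattern on toy frames, assuming (as in this notebook) that test rows are the ones with a missing `loan_default`. The tiny frames and the names `train_part` and `test_part` are illustrative only.

```python
import pandas as pd

train = pd.DataFrame({"UniqueID": [1, 2], "ltv": [89.55, 73.23], "loan_default": [0, 1]})
test = pd.DataFrame({"UniqueID": [3], "ltv": [67.58]})

# Stack the two frames so every feature transform is applied once to both.
df = pd.concat([train, test], ignore_index=True, sort=False)

# Test rows carry no target, so a missing loan_default identifies them.
train_part = df[df["loan_default"].notna()].copy()
test_part = df[df["loan_default"].isna()].drop(columns="loan_default")
```

Taking `.copy()` of the filtered frame avoids pandas' chained-assignment warnings if the parts are modified later.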


### Feature Engineering

In [5]:
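def calculateage(v):
    # Caveat: at this point Date.of.Birth still holds two-digit years, so the parser's own
    # pivot rule picks the century and older birth years may resolve to 20xx; the explicit
    # century fix in d() below is applied only afterwards.
    return pd.to_datetime('today').year-pd.to_datetime(v).year
df['age']=df['Date.of.Birth'].apply(calculateage)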
df['acct_age']=df['AVERAGE.ACCT.AGE'].apply(lambda x: int(x.split(" ")[0][:-3])*12+int(x.split(" ")[1][:-3]))
df['credit_history_len']=df['CREDIT.HISTORY.LENGTH'].apply(lambda x: int(x.split(" ")[0][:-3])*12+int(x.split(" ")[1][:-3]))
df['credit_history_len_before_sixM']=df['credit_history_len'].apply(lambda x: x if x-6<0 else x-6)
def d(v):
    if v.split("-")[-1]=='00' or v.split("-")[-1]=='18':
        return "-".join(v.split("-")[:-1])+'-20'+"".join(v.split("-")[-1])
    else:
        return "-".join(v.split("-")[:-1])+'-19'+"".join(v.split("-")[-1])
df['Date.of.Birth']=pd.to_datetime(df['Date.of.Birth'].apply(d),format='%d-%m-%Y',errors='coerce')
print(1)
df['dob_y']=df['Date.of.Birth'].dt.year
df['dob_m']=df['Date.of.Birth'].dt.month
df['dob_d']=df['Date.of.Birth'].dt.day
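# Series.dt.week was removed in pandas 2.0; isocalendar().week is the replacement
df['dob_w']=df['Date.of.Birth'].dt.isocalendar().week.astype(int)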
df['DisbursalDate']=pd.to_datetime(df['DisbursalDate'].apply(d),format='%d-%m-%Y',errors='coerce')
print(1)
df['disb_m']=df['DisbursalDate'].dt.month
df['disb_d']=df['DisbursalDate'].dt.day
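# Series.dt.week was removed in pandas 2.0; isocalendar().week is the replacement
df['disb_w']=df['DisbursalDate'].dt.isocalendar().week.astype(int)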
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['Employment.Type']=df['Employment.Type'].fillna('Other')
print(1)
def alphs(v):
    vv=v.split("-")
    if len(vv)!=1:
        return vv[0]
    else:
        return 'Z'
# np.object was removed in NumPy 1.24; the builtin object is equivalent
df['PERFORM_CNS.SCORE.DESCRIPTION.APLHS']=df['PERFORM_CNS.SCORE.DESCRIPTION'].apply(alphs).astype(object)
df['PERFORM_CNS.SCORE.BINS']=pd.cut(df['PERFORM_CNS.SCORE'],5,labels=['Slim','Doubtful','Fair','Good','Great']).astype(object)
df['ltv.BINS']=pd.cut(df['ltv'],3,labels=['Low','Medium','High']).astype(object)

# collapse the letter-graded descriptions into their risk bands (result assigned back
# rather than replaced in place on a column selection)
df['PERFORM_CNS.SCORE.DESCRIPTION']=df['PERFORM_CNS.SCORE.DESCRIPTION'].replace({
    'A-Very Low Risk':'Very Low Risk','B-Very Low Risk':'Very Low Risk',
    'C-Very Low Risk':'Very Low Risk','D-Very Low Risk':'Very Low Risk',
    'E-Low Risk':'Low Risk','F-Low Risk':'Low Risk',
    'H-Medium Risk':'Medium Risk','I-Medium Risk':'Medium Risk',
    'J-High Risk':'High Risk','K-High Risk':'High Risk',
    'L-Very High Risk':'Very High Risk','M-Very High Risk':'Very High Risk'})

df['Not_Scored']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('Not Scored'),1,0)
df['Very_Low']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('Very Low'),1,0)
df['Very_High']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('Very High'),1,0)
df['No_History']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('No Bureau'),1,0)

df['ACTIVE.ACCTS']=df['PRI.ACTIVE.ACCTS']+df['SEC.ACTIVE.ACCTS']
df['CURRENT.BALANCE']=df['PRI.CURRENT.BALANCE']+df['SEC.CURRENT.BALANCE']
df['DISBURSED.AMOUNT']=df['PRI.DISBURSED.AMOUNT']+df['SEC.DISBURSED.AMOUNT']
df['NO.OF.ACCTS']=df['SEC.NO.OF.ACCTS']+df['PRI.NO.OF.ACCTS']
df['OVERDUE.ACCTS']=df['PRI.OVERDUE.ACCTS']+df['SEC.OVERDUE.ACCTS']
df['SANCTIONED.AMOUNT']=df['PRI.SANCTIONED.AMOUNT']+df['SEC.SANCTIONED.AMOUNT']
df['INSTAL.AMT']=df['PRIMARY.INSTAL.AMT']+df['SEC.INSTAL.AMT']

df['SANCTION_DISBURSED']=df['SANCTIONED.AMOUNT']-df['DISBURSED.AMOUNT']

df['NO_DEACTIVE_ACCOUNTS']=df['NO.OF.ACCTS']-df['ACTIVE.ACCTS']
df['NO.OF.ACC.BEF.SIX.MONTH']=df['NO.OF.ACCTS']-df['NEW.ACCTS.IN.LAST.SIX.MONTHS']
df['OVERDUE.ACC.BEF.SIX.MONTHS']=df['OVERDUE.ACCTS']-df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS']
df['CLEAN.ACC']=df['NO.OF.ACCTS']-(df['ACTIVE.ACCTS']+df['OVERDUE.ACCTS'])


df['asset_value']=df['disbursed_amount']*(df['ltv']/100)
df['value_cost']=df['asset_cost']-df['asset_value']
df['value_per_cost']=df['value_cost']/df['asset_value']
df['extra_finance']=df['asset_cost']*(df['ltv']/100)-df['disbursed_amount']
df['asset_disburse']=(df['asset_cost']-df['disbursed_amount'])/df['disbursed_amount']
df['sixmmonths_dfault']=df['NEW.ACCTS.IN.LAST.SIX.MONTHS']-df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS']

# cast ID columns to object so they are picked up as categorical features later
# (np.object was removed in NumPy 1.24; the builtin object is equivalent)
df['Current_pincode_ID']=df['Current_pincode_ID'].astype(object)
df['Employee_code_ID']=df['Employee_code_ID'].astype(object)
df['State_ID']=df['State_ID'].astype(object)
df['branch_id']=df['branch_id'].astype(object)
df['manufacturer_id']=df['manufacturer_id'].astype(object)
df['supplier_id']=df['supplier_id'].astype(object)

# df['NEW.ACCTS.IN.LAST.SIX.MONTHS']=df['NEW.ACCTS.IN.LAST.SIX.MONTHS'].astype(np.object)
# df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS']=df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS'].astype(np.object)
# df['NO.OF_INQUIRIES']=df['NO.OF_INQUIRIES'].astype(np.object)

for i in ['Aadhar_flag', 'Driving_flag', 'PAN_flag', 'Passport_flag', 'VoterID_flag']:
    df[i]=df[i].astype(object)
df['AVERAGE.ACCT.AGE.YEAR']=df['AVERAGE.ACCT.AGE'].apply(lambda x: int(x.split(" ")[0][:-3]))
df['AVERAGE.ACCT.AGE.MONTH']=df['AVERAGE.ACCT.AGE'].apply(lambda x: int(x.split(" ")[1][:-3]))

df['CREDIT.HISTORY.LENGTH.YEAR']=df['CREDIT.HISTORY.LENGTH'].apply(lambda x: int(x.split(" ")[0][:-3]))
df['CREDIT.HISTORY.LENGTH.MONTH']=df['CREDIT.HISTORY.LENGTH'].apply(lambda x: int(x.split(" ")[1][:-3]))
df['Total_Flag']=df['Driving_flag']+df['Aadhar_flag']+df['PAN_flag']+df['Passport_flag']+df['VoterID_flag']
# note: these add the numeric IDs (e.g. 1998 + 67 = 2065), so distinct pairs can share a code
df['Branch_Emp_code']=(df['Employee_code_ID']+df['branch_id']).astype(object)
df['Pin_Emp_code']=(df['Employee_code_ID']+df['Current_pincode_ID']).astype(object)
df.info()
# ------------------------------------------------------


# 2nd try remove 
# df['Very_Low']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('Very Low'),1,0)
# df['No_History']=np.where(df['PERFORM_CNS.SCORE.DESCRIPTION'].str.contains('No Bureau'),1,0)
# df['AVERAGE.ACCT.AGE.YEAR']=df['AVERAGE.ACCT.AGE'].apply(lambda x: int(x.split(" ")[0][:-3]))
# df['CREDIT.HISTORY.LENGTH.YEAR']=df['CREDIT.HISTORY.LENGTH'].apply(lambda x: int(x.split(" ")[0][:-3]))


1
1
1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 345546 entries, 0 to 345545
Data columns (total 84 columns):
AVERAGE.ACCT.AGE                       345546 non-null object
Aadhar_flag                            345546 non-null object
CREDIT.HISTORY.LENGTH                  345546 non-null object
Current_pincode_ID                     345546 non-null object
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS    345546 non-null int64
Date.of.Birth                          345546 non-null datetime64[ns]
DisbursalDate                          345546 non-null datetime64[ns]
Driving_flag                           345546 non-null object
Employee_code_ID                       345546 non-null object
Employment.Type                        345546 non-null object
MobileNo_Avl_Flag                      345546 non-null int64
NEW.ACCTS.IN.LAST.SIX.MONTHS           345546 non-null int64
NO.OF_INQUIRIES                        345546 non-null int64
PAN_flag                               345546 non-null object
PER
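
The `"<n>yrs <m>mon"` strings in `AVERAGE.ACCT.AGE` and `CREDIT.HISTORY.LENGTH` are converted to months above with per-row lambdas. As an alternative sketch (not the approach used in this notebook), the same conversion can be done in one vectorized step with `Series.str.extract`; the sample values are taken from the previews above.

```python
import pandas as pd

s = pd.Series(["0yrs 0mon", "1yrs 11mon", "4yrs 10mon"])

# Pull the two numbers out with named regex groups, then convert to integers.
parts = s.str.extract(r"(?P<years>\d+)yrs\s+(?P<months>\d+)mon").astype(int)

# Total duration in months.
total_months = parts["years"] * 12 + parts["months"]
```

For these three values the result is 0, 23 and 58 months. The regex version also fails loudly (NaN, then an error on `astype(int)`) if a string does not match the expected pattern.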

In [7]:
df.columns[df.columns.str.startswith('SEC')]

Index(['SEC.ACTIVE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.DISBURSED.AMOUNT',
       'SEC.INSTAL.AMT', 'SEC.NO.OF.ACCTS', 'SEC.OVERDUE.ACCTS',
       'SEC.SANCTIONED.AMOUNT'],
      dtype='object')

In [8]:
print(df.shape)

(345546, 84)


In [9]:
df.drop(['Date.of.Birth','DisbursalDate','UniqueID','MobileNo_Avl_Flag'],axis=1,inplace=True)

#-----------------------2nd try--------------------------------
# df.drop(['Date.of.Birth','DisbursalDate','UniqueID','MobileNo_Avl_Flag','Passport_flag','Driving_flag','PAN_flag'
#         ,'SEC.ACTIVE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.DISBURSED.AMOUNT',
#        'SEC.INSTAL.AMT', 'SEC.NO.OF.ACCTS', 'SEC.OVERDUE.ACCTS'],axis=1,inplace=True)


# df.drop(pre_sec,axis=1,inplace=True)
df.head()

Unnamed: 0,AVERAGE.ACCT.AGE,Aadhar_flag,CREDIT.HISTORY.LENGTH,Current_pincode_ID,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,Driving_flag,Employee_code_ID,Employment.Type,NEW.ACCTS.IN.LAST.SIX.MONTHS,NO.OF_INQUIRIES,...,extra_finance,asset_disburse,sixmmonths_dfault,AVERAGE.ACCT.AGE.YEAR,AVERAGE.ACCT.AGE.MONTH,CREDIT.HISTORY.LENGTH.YEAR,CREDIT.HISTORY.LENGTH.MONTH,Total_Flag,Branch_Emp_code,Pin_Emp_code
0,0yrs 0mon,1,0yrs 0mon,1441,0,0,1998,Salaried,0,0,...,1719.2,0.154652,0,0,0,0,0,1,2065,3439
1,1yrs 11mon,1,1yrs 11mon,1502,1,0,1998,Self employed,0,0,...,857.265,0.390391,-1,1,11,1,11,1,2065,3500
2,0yrs 0mon,1,0yrs 0mon,1497,0,0,1998,Self employed,0,0,...,1718.968,0.151695,0,0,0,0,0,1,2065,3495
3,0yrs 8mon,1,1yrs 3mon,1501,0,0,1998,Self employed,0,1,...,983.7824,0.149531,0,0,8,1,3,1,2065,3499
4,0yrs 0mon,1,0yrs 0mon,1495,0,0,1998,Self employed,0,1,...,921.17,0.151247,0,0,0,0,0,1,2065,3493


In [10]:
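# rows with a target are training data; rows without one came from test.csv
# (.copy() avoids SettingWithCopyWarning when these frames are modified later)
dftrain=df[df['loan_default'].notnull()].copy()
dftest=df[df['loan_default'].isnull()].copy()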
dftrain.head()

Unnamed: 0,AVERAGE.ACCT.AGE,Aadhar_flag,CREDIT.HISTORY.LENGTH,Current_pincode_ID,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,Driving_flag,Employee_code_ID,Employment.Type,NEW.ACCTS.IN.LAST.SIX.MONTHS,NO.OF_INQUIRIES,...,extra_finance,asset_disburse,sixmmonths_dfault,AVERAGE.ACCT.AGE.YEAR,AVERAGE.ACCT.AGE.MONTH,CREDIT.HISTORY.LENGTH.YEAR,CREDIT.HISTORY.LENGTH.MONTH,Total_Flag,Branch_Emp_code,Pin_Emp_code
0,0yrs 0mon,1,0yrs 0mon,1441,0,0,1998,Salaried,0,0,...,1719.2,0.154652,0,0,0,0,0,1,2065,3439
1,1yrs 11mon,1,1yrs 11mon,1502,1,0,1998,Self employed,0,0,...,857.265,0.390391,-1,1,11,1,11,1,2065,3500
2,0yrs 0mon,1,0yrs 0mon,1497,0,0,1998,Self employed,0,0,...,1718.968,0.151695,0,0,0,0,0,1,2065,3495
3,0yrs 8mon,1,1yrs 3mon,1501,0,0,1998,Self employed,0,1,...,983.7824,0.149531,0,0,8,1,3,1,2065,3499
4,0yrs 0mon,1,0yrs 0mon,1495,0,0,1998,Self employed,0,1,...,921.17,0.151247,0,0,0,0,0,1,2065,3493


In [11]:
X,y=dftrain.drop('loan_default',axis=1),dftrain['loan_default']
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.3,random_state = 1994)

In [19]:
categorical_features_indices = np.where(X_train.dtypes =='object')[0]
categorical_features_indices

array([ 0,  1,  2,  3,  5,  6,  7, 10, 12, 20, 28, 29, 31, 34, 35, 47, 48,
       49, 76, 77, 78], dtype=int64)
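
A small sketch of how the positional indices above come about, on a toy frame with hypothetical values. Columns are built with an explicit `dtype=object` because recent pandas versions may infer a dedicated string dtype for text columns, in which case a plain `== object` check would miss them.

```python
import numpy as np
import pandas as pd

X_toy = pd.DataFrame({
    "branch_id": pd.Series([67, 67, 74], dtype=object),   # numeric ID stored as object
    "ltv": [89.55, 73.23, 67.58],                          # genuine numeric feature
    "Employment.Type": pd.Series(
        ["Salaried", "Self employed", "Self employed"], dtype=object),
})

# Positions of all object-typed columns, in column order.
cat_idx = np.where(X_toy.dtypes == object)[0]

# Mapping the positions back to names makes the selection easier to audit.
cat_names = X_toy.columns[cat_idx].tolist()
```

Because the indices are positional, they must be recomputed whenever columns are added, dropped or reordered.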

### Hyperparameter - learning_rate

In [13]:
import catboost

class ModelOptimizer:
    best_score = None
    opt = None
    
    def __init__(self, model, X_train, y_train, categorical_columns_indices=None, n_fold=3, seed=1994, early_stopping_rounds=30, is_stratified=True, is_shuffle=True):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.categorical_columns_indices = categorical_columns_indices
        self.n_fold = n_fold
        self.seed = seed
        self.early_stopping_rounds = early_stopping_rounds
        self.is_stratified = is_stratified
        self.is_shuffle = is_shuffle
        
        
    def update_model(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self.model, k, v)
            
    def evaluate_model(self):
        pass
    
    def optimize(self, param_space, max_evals=10, n_random_starts=2):
        start_time = time.time()
        
        @use_named_args(param_space)
        def _minimize(**params):
            self.model.set_params(**params)
            return self.evaluate_model()
        
        opt = gp_minimize(_minimize, param_space, n_calls=max_evals, n_random_starts=n_random_starts, random_state=2405, n_jobs=-1)
        best_values = opt.x
        optimal_values = dict(zip([param.name for param in param_space], best_values))
        best_score = opt.fun
        self.best_score = best_score
        self.opt = opt
        
        print('optimal_parameters: {}\noptimal score: {}\noptimization time: {}'.format(optimal_values, best_score, time.time() - start_time))
        print('updating model with optimal values')
        self.update_model(**optimal_values)
        plot_convergence(opt)
        return optimal_values
    
class CatboostOptimizer(ModelOptimizer):
    def evaluate_model(self):
        validation_scores = catboost.cv(
        catboost.Pool(self.X_train, 
                      self.y_train, 
                      cat_features=self.categorical_columns_indices),
        self.model.get_params(), 
        nfold=self.n_fold,
        stratified=self.is_stratified,
        seed=self.seed,
        early_stopping_rounds=self.early_stopping_rounds,
        shuffle=self.is_shuffle,
        verbose=100,
        plot=False)
        self.scores = validation_scores
        # select the mean validation AUC by name rather than by column position
        test_scores = validation_scores['test-AUC-mean']
        best_metric = test_scores.max()
        return 1 - best_metric

In [15]:
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
import time

In [16]:
cb = catboost.CatBoostClassifier(n_estimators=4000, # use large n_estimators deliberately to make use of the early stopping
                         loss_function='Logloss',
                         eval_metric='AUC',
                         boosting_type='Ordered', # use permutations
                         random_seed=1994, 
                         use_best_model=True)
cb_optimizer = CatboostOptimizer(cb, X_train, y_train,categorical_columns_indices=categorical_features_indices)
params_space = [Real(0.01, 0.8, name='learning_rate'),]
cb_optimal_values = cb_optimizer.optimize(params_space)

0:	learn: 0.5846680	test: 0.5834369	best: 0.5834369 (0)	total: 3.93s	remaining: 4h 22m 1s
100:	learn: 0.6747684	test: 0.6704210	best: 0.6704210 (100)	total: 4m 21s	remaining: 2h 48m 1s
200:	learn: 0.6860045	test: 0.6736056	best: 0.6736131 (199)	total: 9m 8s	remaining: 2h 52m 41s
300:	learn: 0.6921693	test: 0.6744569	best: 0.6744569 (300)	total: 13m 53s	remaining: 2h 50m 43s
400:	learn: 0.6974692	test: 0.6751834	best: 0.6751834 (400)	total: 18m 39s	remaining: 2h 47m 30s
500:	learn: 0.7014065	test: 0.6754404	best: 0.6754697 (490)	total: 22m 3s	remaining: 2h 34m 5s
600:	learn: 0.7047433	test: 0.6756045	best: 0.6756527 (575)	total: 25m 26s	remaining: 2h 23m 53s
Stopped by overfitting detector  (30 iterations wait)
0:	learn: 0.5846680	test: 0.5834369	best: 0.5834369 (0)	total: 2.17s	remaining: 2h 24m 31s
100:	learn: 0.6601778	test: 0.6605716	best: 0.6605716 (100)	total: 4m 29s	remaining: 2h 53m 34s
200:	learn: 0.6692627	test: 0.6678177	best: 0.6678177 (200)	total: 8m 56s	remaining: 2h 49m
3



0:	learn: 0.5846680	test: 0.5834369	best: 0.5834369 (0)	total: 2.18s	remaining: 2h 25m 36s
100:	learn: 0.6404873	test: 0.6395678	best: 0.6395678 (100)	total: 4m 33s	remaining: 2h 56m 3s
200:	learn: 0.6535273	test: 0.6539822	best: 0.6539822 (200)	total: 9m 4s	remaining: 2h 51m 33s
300:	learn: 0.6604842	test: 0.6607351	best: 0.6607351 (300)	total: 13m 56s	remaining: 2h 51m 23s
400:	learn: 0.6645557	test: 0.6642086	best: 0.6642086 (400)	total: 18m 40s	remaining: 2h 47m 36s
500:	learn: 0.6673193	test: 0.6663282	best: 0.6663282 (500)	total: 23m 11s	remaining: 2h 42m 1s
600:	learn: 0.6692305	test: 0.6676843	best: 0.6676843 (600)	total: 27m 30s	remaining: 2h 35m 35s
700:	learn: 0.6709732	test: 0.6688732	best: 0.6688732 (700)	total: 31m 50s	remaining: 2h 29m 51s
800:	learn: 0.6723604	test: 0.6697781	best: 0.6697781 (800)	total: 35m 47s	remaining: 2h 22m 55s
900:	learn: 0.6735134	test: 0.6704353	best: 0.6704353 (900)	total: 40m	remaining: 2h 17m 37s
1000:	learn: 0.6746846	test: 0.6711162	best: 



0:	learn: 0.5846680	test: 0.5834369	best: 0.5834369 (0)	total: 2.19s	remaining: 2h 26m 14s
100:	learn: 0.6404873	test: 0.6395678	best: 0.6395678 (100)	total: 4m 32s	remaining: 2h 55m 35s
200:	learn: 0.6535273	test: 0.6539822	best: 0.6539822 (200)	total: 9m 17s	remaining: 2h 55m 29s
300:	learn: 0.6604842	test: 0.6607351	best: 0.6607351 (300)	total: 14m 11s	remaining: 2h 54m 24s
400:	learn: 0.6645557	test: 0.6642086	best: 0.6642086 (400)	total: 18m 46s	remaining: 2h 48m 27s
500:	learn: 0.6673193	test: 0.6663282	best: 0.6663282 (500)	total: 23m 19s	remaining: 2h 42m 54s
600:	learn: 0.6692305	test: 0.6676843	best: 0.6676843 (600)	total: 27m 40s	remaining: 2h 36m 33s
700:	learn: 0.6709732	test: 0.6688732	best: 0.6688732 (700)	total: 32m 2s	remaining: 2h 30m 48s
800:	learn: 0.6723604	test: 0.6697781	best: 0.6697781 (800)	total: 36m 15s	remaining: 2h 24m 49s
900:	learn: 0.6735134	test: 0.6704353	best: 0.6704353 (900)	total: 40m 32s	remaining: 2h 19m 25s
1000:	learn: 0.6746846	test: 0.6711162	



0:	learn: 0.5846680	test: 0.5834369	best: 0.5834369 (0)	total: 2.26s	remaining: 2h 30m 33s
100:	learn: 0.6404873	test: 0.6395678	best: 0.6395678 (100)	total: 4m 43s	remaining: 3h 2m 27s
200:	learn: 0.6535273	test: 0.6539822	best: 0.6539822 (200)	total: 9m 18s	remaining: 2h 56m 3s
300:	learn: 0.6604842	test: 0.6607351	best: 0.6607351 (300)	total: 13m 18s	remaining: 2h 43m 36s
400:	learn: 0.6645557	test: 0.6642086	best: 0.6642086 (400)	total: 18m 8s	remaining: 2h 42m 50s
500:	learn: 0.6673193	test: 0.6663282	best: 0.6663282 (500)	total: 22m 44s	remaining: 2h 38m 49s
600:	learn: 0.6692305	test: 0.6676843	best: 0.6676843 (600)	total: 27m 6s	remaining: 2h 33m 20s
700:	learn: 0.6709732	test: 0.6688732	best: 0.6688732 (700)	total: 31m 33s	remaining: 2h 28m 32s
800:	learn: 0.6723604	test: 0.6697781	best: 0.6697781 (800)	total: 35m 47s	remaining: 2h 22m 56s
900:	learn: 0.6735134	test: 0.6704353	best: 0.6704353 (900)	total: 40m 5s	remaining: 2h 17m 54s
1000:	learn: 0.6746846	test: 0.6711162	best
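
The search above takes hours per candidate. The sketch below shows the same idea (score each learning rate by cross-validated AUC and minimise `1 - AUC`) on a scale that runs in seconds. It is a deliberately swapped-in setup, not the notebook's: scikit-learn's `GradientBoostingClassifier` instead of CatBoost, a plain three-value grid instead of Gaussian-process optimisation, and synthetic data from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem standing in for the loan data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1994)

results = {}
for lr in [0.01, 0.1, 0.5]:
    model = GradientBoostingClassifier(n_estimators=30, learning_rate=lr, random_state=0)
    mean_auc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc").mean()
    results[lr] = 1.0 - mean_auc   # loss to minimise, as in evaluate_model()

# Learning rate with the lowest cross-validated loss.
best_lr = min(results, key=results.get)
```

A quick pass like this on a subsample can narrow the range before an expensive search is launched on the full data.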

## Tried and Evaluated

## Journey

* **Started with LGBM and XGB, which had a CV score of 67.7 and a public leaderboard (LB) score of 65.1.**
* **Tried tuning them but couldn't increase the leaderboard score much.**
* **Moved to CatBoost, where a simple model without any feature engineering scored 65.70 on the public LB.**
* **Tuned it and tried various combinations of features, using one-hot encoding, native categoricals, and other methods, but the gains in local CV did not carry over to the public leaderboard score.**
* **Tried feature selection, which also helped only up to a point; 10-fold CV reached 68.2X, which raised the public leaderboard score to 66.62.**
* **One model training was taking at least 1hr, so my lapp**

## Things Tried in Several Other Notebooks Which Didn't Make the Final Cut :p

* **Model Tuning**
* **Curbing the Overfitting**
* **Tried many possible combinations of hyperparameters**
* **Tried other models like XGB and LGBM, which sadly didn't work**

## Evaluation

In [24]:
import matplotlib.pyplot as plt
%matplotlib inline
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score
m=CatBoostClassifier(n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=6,learning_rate=0.029,od_wait=50
                     ,l2_leaf_reg=5,cat_features=categorical_features_indices,bagging_temperature=0.85,random_strength=100,
                     use_best_model=True)
m.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val, y_val)], early_stopping_rounds=100,verbose=100)
p2=m.predict_proba(X_val)[:,-1]

########### gave 68.00

0:	test: 0.6256996	test1: 0.6175770	best: 0.6175770 (0)	total: 963ms	remaining: 48m 8s
100:	test: 0.6939269	test1: 0.6641209	best: 0.6641209 (100)	total: 1m 47s	remaining: 51m 19s
200:	test: 0.7031226	test1: 0.6706726	best: 0.6706726 (200)	total: 3m 35s	remaining: 49m 58s
300:	test: 0.7084578	test1: 0.6734228	best: 0.6734228 (300)	total: 5m 21s	remaining: 48m 7s
400:	test: 0.7124937	test1: 0.6751649	best: 0.6751649 (400)	total: 7m 10s	remaining: 46m 32s
500:	test: 0.7153664	test1: 0.6761197	best: 0.6761197 (500)	total: 8m 59s	remaining: 44m 51s
600:	test: 0.7181662	test1: 0.6767696	best: 0.6767776 (599)	total: 10m 47s	remaining: 43m 5s
700:	test: 0.7200585	test1: 0.6772081	best: 0.6772081 (700)	total: 12m 36s	remaining: 41m 20s
800:	test: 0.7220448	test1: 0.6776259	best: 0.6776259 (800)	total: 14m 26s	remaining: 39m 37s
900:	test: 0.7238743	test1: 0.6779664	best: 0.6779664 (899)	total: 16m 15s	remaining: 37m 52s
1000:	test: 0.7254875	test1: 0.6782294	best: 0.6782387 (991)	total: 18m 7s

n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=7,learning_rate=0.029,od_wait=50
                     ,l2_leaf_reg=10,cat_features=categorical_features_indices,bagging_temperature=0.8,random_strength=100,
                     use_best_model=True

0:	test: 0.6298164	test1: 0.6206030	best: 0.6206030 (0)	total: 1.51s	remaining: 1h 15m 19s
100:	test: 0.6967200	test1: 0.6657370	best: 0.6657370 (100)	total: 2m 33s	remaining: 1h 13m 20s
200:	test: 0.7073686	test1: 0.6716921	best: 0.6716921 (200)	total: 5m 21s	remaining: 1h 14m 33s
300:	test: 0.7132705	test1: 0.6740105	best: 0.6740105 (300)	total: 7m 58s	remaining: 1h 11m 26s
400:	test: 0.7176570	test1: 0.6754946	best: 0.6754946 (400)	total: 10m 44s	remaining: 1h 9m 39s
500:	test: 0.7209912	test1: 0.6764125	best: 0.6764168 (499)	total: 13m 35s	remaining: 1h 7m 49s
600:	test: 0.7242409	test1: 0.6771109	best: 0.6771154 (599)	total: 16m 23s	remaining: 1h 5m 24s
700:	test: 0.7267972	test1: 0.6775394	best: 0.6775394 (700)	total: 19m 2s	remaining: 1h 2m 26s
800:	test: 0.7291595	test1: 0.6779639	best: 0.6779705 (799)	total: 21m 47s	remaining: 59m 48s
900:	test: 0.7314468	test1: 0.6783770	best: 0.6783815 (899)	total: 24m 34s	remaining: 57m 15s
1000:	test: 0.7334292	test1: 0.6784995	best: 0.6785201 (973)	total: 27m 19s	remaining: 54m 34s
1100:	test: 0.7354691	test1: 0.6787310	best: 0.6787310 (1100)	total: 30m 8s	remaining: 51m 59s
1200:	test: 0.7375465	test1: 0.6789353	best: 0.6789427 (1198)	total: 32m 45s	remaining: 49m 4s
1300:	test: 0.7395440	test1: 0.6791223	best: 0.6791410 (1272)	total: 35m 2s	remaining: 45m 45s
1400:	test: 0.7414093	test1: 0.6792434	best: 0.6792525 (1395)	total: 37m 19s	remaining: 42m 36s
1500:	test: 0.7429491	test1: 0.6794451	best: 0.6794451 (1500)	total: 39m 37s	remaining: 39m 34s
1600:	test: 0.7446167	test1: 0.6794840	best: 0.6795185 (1559)	total: 41m 58s	remaining: 36m 40s
1700:	test: 0.7461050	test1: 0.6796050	best: 0.6796050 (1700)	total: 44m 17s	remaining: 33m 49s
1800:	test: 0.7479902	test1: 0.6797087	best: 0.6797108 (1798)	total: 46m 38s	remaining: 31m 2s
1900:	test: 0.7494178	test1: 0.6798259	best: 0.6798259 (1900)	total: 48m 56s	remaining: 28m 17s
2000:	test: 0.7508585	test1: 0.6798631	best: 0.6799069 (1952)	total: 51m 15s	remaining: 25m 35s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6799069447
bestIteration = 1952

Shrink model to first 1953 iterations.

In [11]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# from catboost import CatBoostClassifier
# from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score
# m=CatBoostClassifier(n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=6,learning_rate=0.029,od_wait=50
#                      ,l2_leaf_reg=10,cat_features=categorical_features_indices,bagging_temperature=0.85,random_strength=100,
#                      use_best_model=True)
# m.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val, y_val)], early_stopping_rounds=100,verbose=100)
# p2=m.predict_proba(X_val)[:,-1]

########### gave 68.00

0:	test: 0.6257002	test1: 0.6175802	best: 0.6175802 (0)	total: 1.83s	remaining: 1h 31m 17s
100:	test: 0.6861999	test1: 0.6596716	best: 0.6596716 (100)	total: 2m 29s	remaining: 1h 11m 40s
200:	test: 0.6984332	test1: 0.6675033	best: 0.6675033 (200)	total: 5m	remaining: 1h 9m 42s
300:	test: 0.7034418	test1: 0.6708843	best: 0.6708843 (300)	total: 7m 33s	remaining: 1h 7m 44s
400:	test: 0.7071229	test1: 0.6729311	best: 0.6729311 (400)	total: 10m 36s	remaining: 1h 8m 44s
500:	test: 0.7099471	test1: 0.6741656	best: 0.6741656 (500)	total: 14m 30s	remaining: 1h 12m 22s
600:	test: 0.7122577	test1: 0.6750352	best: 0.6750352 (600)	total: 18m 27s	remaining: 1h 13m 39s
700:	test: 0.7144558	test1: 0.6758022	best: 0.6758022 (700)	total: 22m 21s	remaining: 1h 13m 19s
800:	test: 0.7162215	test1: 0.6762848	best: 0.6762848 (800)	total: 26m 14s	remaining: 1h 12m 2s
900:	test: 0.7176287	test1: 0.6766625	best: 0.6766633 (899)	total: 30m 10s	remaining: 1h 10m 16s
1000:	test: 0.7191127	test1: 0.6770818	best: 0.

import matplotlib.pyplot as plt
%matplotlib inline
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score
m=CatBoostClassifier(n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=6,learning_rate=0.05,od_wait=50
                     ,l2_leaf_reg=10,cat_features=categorical_features_indices,bagging_temperature=1,random_strength=100,
                     use_best_model=True)
m.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_val, y_val)], early_stopping_rounds=100,verbose=100)
p2=m.predict_proba(X_val)[:,-1]

0:	test: 0.6238759	test1: 0.6166456	best: 0.6166456 (0)	total: 1.08s	remaining: 54m 7s
100:	test: 0.7005969	test1: 0.6690637	best: 0.6690637 (100)	total: 1m 35s	remaining: 45m 28s
200:	test: 0.7094896	test1: 0.6736557	best: 0.6736557 (200)	total: 2m 39s	remaining: 37m
300:	test: 0.7143557	test1: 0.6756185	best: 0.6756235 (299)	total: 3m 59s	remaining: 35m 45s
400:	test: 0.7180776	test1: 0.6767544	best: 0.6767544 (400)	total: 5m 29s	remaining: 35m 36s
500:	test: 0.7209041	test1: 0.6773625	best: 0.6773625 (500)	total: 7m 7s	remaining: 35m 34s
600:	test: 0.7235243	test1: 0.6779699	best: 0.6779718 (599)	total: 8m 46s	remaining: 35m 2s
700:	test: 0.7265405	test1: 0.6783837	best: 0.6784066 (694)	total: 10m 25s	remaining: 34m 11s
800:	test: 0.7292407	test1: 0.6785872	best: 0.6786010 (796)	total: 12m 6s	remaining: 33m 14s
900:	test: 0.7318031	test1: 0.6788393	best: 0.6788449 (899)	total: 13m 46s	remaining: 32m 6s
1000:	test: 0.7341236	test1: 0.6791289	best: 0.6791357 (993)	total: 15m 26s	remaining: 30m 49s
1100:	test: 0.7360354	test1: 0.6792755	best: 0.6792755 (1100)	total: 17m 5s	remaining: 29m 28s
1200:	test: 0.7378263	test1: 0.6793381	best: 0.6793531 (1199)	total: 18m 43s	remaining: 28m 3s
1300:	test: 0.7397921	test1: 0.6795331	best: 0.6795331 (1300)	total: 20m 19s	remaining: 26m 33s
1400:	test: 0.7417990	test1: 0.6796844	best: 0.6796990 (1392)	total: 21m 58s	remaining: 25m 4s
1500:	test: 0.7435731	test1: 0.6798866	best: 0.6799281 (1493)	total: 23m 37s	remaining: 23m 35s
1600:	test: 0.7453242	test1: 0.6799940	best: 0.6800087 (1599)	total: 25m 16s	remaining: 22m 5s
1700:	test: 0.7468000	test1: 0.6800143	best: 0.6800710 (1651)	total: 26m 54s	remaining: 20m 33s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6800709593
bestIteration = 1651

Shrink model to first 1652 iterations.
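
The logs above end with a best iteration and a model shrunk to that many trees plus one. The sketch below illustrates the underlying patience rule in plain Python: stop once the validation metric has not improved for `patience` consecutive iterations, then keep the trees up to and including the best one. It is an illustration of the idea, not CatBoost's internal detector, and the function name `best_iteration` and the scores are hypothetical.

```python
def best_iteration(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric."""
    best_idx, best = 0, float("-inf")
    for i, s in enumerate(scores):
        if s > best:
            best_idx, best = i, s          # new best validation score
        elif i - best_idx >= patience:
            return best_idx, i             # waited `patience` iterations without gain
    return best_idx, len(scores) - 1       # ran out of iterations without triggering

# Made-up validation AUC per iteration: it peaks at index 3 and then drifts down.
scores = [0.61, 0.66, 0.67, 0.675, 0.674, 0.673, 0.672]
best_idx, stop_idx = best_iteration(scores, patience=3)

# Iterations are 0-based, so keeping the best one means keeping best_idx + 1 trees.
n_trees_kept = best_idx + 1
```

Here the best index is 3, training stops at index 6, and 4 trees are kept, mirroring how `bestIteration = 1651` leads to a model of 1652 iterations.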

In [63]:
df.columns

Index(['AVERAGE.ACCT.AGE', 'Aadhar_flag', 'CREDIT.HISTORY.LENGTH',
       'Current_pincode_ID', 'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS',
       'Driving_flag', 'Employee_code_ID', 'Employment.Type',
       'NEW.ACCTS.IN.LAST.SIX.MONTHS', 'NO.OF_INQUIRIES', 'PAN_flag',
       'PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION',
       'PRI.ACTIVE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.DISBURSED.AMOUNT',
       'PRI.NO.OF.ACCTS', 'PRI.OVERDUE.ACCTS', 'PRI.SANCTIONED.AMOUNT',
       'PRIMARY.INSTAL.AMT', 'Passport_flag', 'SEC.ACTIVE.ACCTS',
       'SEC.CURRENT.BALANCE', 'SEC.DISBURSED.AMOUNT', 'SEC.INSTAL.AMT',
       'SEC.NO.OF.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.SANCTIONED.AMOUNT',
       'State_ID', 'VoterID_flag', 'asset_cost', 'branch_id',
       'disbursed_amount', 'loan_default', 'ltv', 'manufacturer_id',
       'supplier_id', 'age', 'acct_age', 'credit_history_len',
       'credit_history_len_before_sixM', 'dob_y', 'dob_m', 'dob_d', 'dob_w',
       'disb_m', 'disb_d', 'disb_w', 'PERFORM

In [25]:
sorted(zip(m.feature_importances_,X_train),reverse=True)

[(6.047410210517497, 'branch_id'),
 (5.787161448462624, 'asset_value'),
 (4.994259419681581, 'Employee_code_ID'),
 (4.312546992571475, 'supplier_id'),
 (3.72534727762862, 'dob_y'),
 (3.3864781168895046, 'Current_pincode_ID'),
 (3.267800040923283, 'ltv'),
 (3.1403483869080873, 'disb_d'),
 (3.0213980109815943, 'PERFORM_CNS.SCORE'),
 (2.990357229114383, 'extra_finance'),
 (2.680491400355188, 'value_cost'),
 (2.546525675276578, 'Employment.Type'),
 (2.531500881865133, 'disb_w'),
 (2.2421724086708004, 'Branch_Emp_code'),
 (2.2048309859899895, 'manufacturer_id'),
 (2.0455763646965544, 'State_ID'),
 (2.0288702476535763, 'dob_d'),
 (1.9659156720924318, 'Pin_Emp_code'),
 (1.9040263829153796, 'asset_cost'),
 (1.8898962924176472, 'CLEAN.ACC'),
 (1.8697323133686394, 'NO.OF_INQUIRIES'),
 (1.7532643125188854, 'PERFORM_CNS.SCORE.DESCRIPTION'),
 (1.6727839421650463, 'SANCTION_DISBURSED'),
 (1.612082124979268, 'disbursed_amount'),
 (1.5939762213367175, 'asset_disburse'),
 (1.5856357241341692, 'PRIMARY.

In [12]:
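# Reassigning avoids the SettingWithCopyWarning shown below, which the original
# in-place call (inplace=True) raised because dftest is a slice of another DataFrame.
dftest = dftest.drop('loan_default',axis=1)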

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


## 20-fold Cross Validation

In [31]:
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

errcb=[]         # validation AUC of each fold (higher is better, despite the name)
y_pred_totcb=[]  # test-set probabilities predicted by each fold's model
fold=StratifiedKFold(n_splits=20,shuffle=True,random_state=1994)
for train_index, test_index in fold.split(X,y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # The log below reports a 100-iteration wait, i.e. early_stopping_rounds=100 in fit()
    # takes effect rather than od_wait=50.
    m=CatBoostClassifier(n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=6,learning_rate=0.029,od_wait=50
                     ,l2_leaf_reg=10,cat_features=categorical_features_indices,bagging_temperature=0.85,random_strength=100,
                     use_best_model=True)
    # eval_set lists the training fold first, so the log shows it as "test" and the held-out fold as "test1".
    m.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_test, y_test)], early_stopping_rounds=100,verbose=100)
    preds=m.predict_proba(X_test)[:,-1]
    auc=roc_auc_score(y_test,preds)
    print("err: ",auc)
    errcb.append(auc)
    # Score the real test set with this fold's model; the fold predictions are averaged later.
    p = m.predict_proba(dftest)[:,-1]
    y_pred_totcb.append(p)

0:	test: 0.6255709	test1: 0.6235215	best: 0.6235215 (0)	total: 617ms	remaining: 30m 51s
100:	test: 0.6912090	test1: 0.6663184	best: 0.6663184 (100)	total: 1m 4s	remaining: 30m 44s
200:	test: 0.6996197	test1: 0.6710377	best: 0.6711387 (196)	total: 2m 15s	remaining: 31m 25s
300:	test: 0.7048390	test1: 0.6731636	best: 0.6731636 (300)	total: 3m 25s	remaining: 30m 44s
400:	test: 0.7080067	test1: 0.6743669	best: 0.6743669 (400)	total: 4m 36s	remaining: 29m 53s
500:	test: 0.7102995	test1: 0.6752153	best: 0.6752209 (491)	total: 5m 47s	remaining: 28m 53s
600:	test: 0.7122254	test1: 0.6756048	best: 0.6756048 (600)	total: 6m 58s	remaining: 27m 51s
700:	test: 0.7143447	test1: 0.6761486	best: 0.6761524 (699)	total: 8m 10s	remaining: 26m 49s
800:	test: 0.7161435	test1: 0.6765585	best: 0.6765585 (800)	total: 9m 24s	remaining: 25m 48s
900:	test: 0.7176593	test1: 0.6769492	best: 0.6769492 (900)	total: 10m 38s	remaining: 24m 46s
1000:	test: 0.7189799	test1: 0.6771910	best: 0.6771910 (1000)	total: 11m 51

err:  0.6822089123249053
0:	test: 0.6252943	test1: 0.6254041	best: 0.6254041 (0)	total: 619ms	remaining: 30m 56s
100:	test: 0.6900003	test1: 0.6697481	best: 0.6697481 (100)	total: 1m 6s	remaining: 32m 2s
200:	test: 0.7003300	test1: 0.6770288	best: 0.6770288 (200)	total: 2m 16s	remaining: 31m 46s
300:	test: 0.7049339	test1: 0.6800258	best: 0.6800286 (299)	total: 3m 25s	remaining: 30m 46s
400:	test: 0.7083498	test1: 0.6819269	best: 0.6819396 (399)	total: 4m 35s	remaining: 29m 45s
500:	test: 0.7111396	test1: 0.6831513	best: 0.6831523 (497)	total: 5m 45s	remaining: 28m 44s
600:	test: 0.7131556	test1: 0.6839218	best: 0.6839259 (595)	total: 6m 56s	remaining: 27m 41s
700:	test: 0.7150111	test1: 0.6843911	best: 0.6843911 (700)	total: 8m 8s	remaining: 26m 40s
800:	test: 0.7163601	test1: 0.6848077	best: 0.6848531 (796)	total: 9m 19s	remaining: 25m 35s
900:	test: 0.7178451	test1: 0.6851528	best: 0.6851528 (900)	total: 10m 30s	remaining: 24m 29s
1000:	test: 0.7192884	test1: 0.6856108	best: 0.68561

1600:	test: 0.7250679	test1: 0.6771310	best: 0.6771372 (1598)	total: 19m 15s	remaining: 16m 49s
1700:	test: 0.7260419	test1: 0.6773691	best: 0.6774362 (1689)	total: 20m 30s	remaining: 15m 39s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6774362321
bestIteration = 1689

Shrink model to first 1690 iterations.
err:  0.677436232050502
0:	test: 0.6216150	test1: 0.6103711	best: 0.6103711 (0)	total: 1.01s	remaining: 50m 21s
100:	test: 0.6919819	test1: 0.6546138	best: 0.6546138 (100)	total: 1m 16s	remaining: 36m 25s
200:	test: 0.7023167	test1: 0.6595422	best: 0.6595422 (200)	total: 2m 24s	remaining: 33m 36s
300:	test: 0.7067608	test1: 0.6617835	best: 0.6618193 (298)	total: 3m 33s	remaining: 31m 51s
400:	test: 0.7100749	test1: 0.6630082	best: 0.6630311 (399)	total: 4m 41s	remaining: 30m 27s
500:	test: 0.7128757	test1: 0.6640254	best: 0.6640254 (500)	total: 5m 51s	remaining: 29m 13s
600:	test: 0.7149214	test1: 0.6646851	best: 0.6646967 (595)	total: 7m 2s	remaining: 28m 4s

700:	test: 0.7148588	test1: 0.6848948	best: 0.6848948 (700)	total: 8m 5s	remaining: 26m 31s
800:	test: 0.7163820	test1: 0.6853115	best: 0.6853352 (796)	total: 9m 16s	remaining: 25m 28s
900:	test: 0.7178133	test1: 0.6856260	best: 0.6856352 (898)	total: 10m 31s	remaining: 24m 30s
1000:	test: 0.7192064	test1: 0.6857925	best: 0.6858110 (995)	total: 11m 45s	remaining: 23m 28s
1100:	test: 0.7206271	test1: 0.6860887	best: 0.6861007 (1091)	total: 12m 59s	remaining: 22m 23s
1200:	test: 0.7220098	test1: 0.6862627	best: 0.6863069 (1149)	total: 14m 12s	remaining: 21m 17s
1300:	test: 0.7232216	test1: 0.6864481	best: 0.6864481 (1300)	total: 15m 27s	remaining: 20m 10s
1400:	test: 0.7243796	test1: 0.6865786	best: 0.6865828 (1399)	total: 16m 42s	remaining: 19m 3s
1500:	test: 0.7255056	test1: 0.6866996	best: 0.6867163 (1494)	total: 17m 56s	remaining: 17m 55s
1600:	test: 0.7264869	test1: 0.6867906	best: 0.6867971 (1568)	total: 19m 10s	remaining: 16m 45s
1700:	test: 0.7276142	test1: 0.6868797	best: 0.6869

700:	test: 0.7160550	test1: 0.6754236	best: 0.6754236 (700)	total: 8m 7s	remaining: 26m 37s
800:	test: 0.7178938	test1: 0.6757614	best: 0.6757658 (798)	total: 9m 18s	remaining: 25m 33s
900:	test: 0.7193325	test1: 0.6761872	best: 0.6761872 (900)	total: 10m 30s	remaining: 24m 29s
1000:	test: 0.7207345	test1: 0.6765273	best: 0.6765675 (984)	total: 11m 43s	remaining: 23m 25s
1100:	test: 0.7221588	test1: 0.6768072	best: 0.6768363 (1090)	total: 12m 56s	remaining: 22m 19s
1200:	test: 0.7233876	test1: 0.6771213	best: 0.6771301 (1195)	total: 14m 9s	remaining: 21m 12s
1300:	test: 0.7245437	test1: 0.6772482	best: 0.6772512 (1299)	total: 15m 23s	remaining: 20m 5s
1400:	test: 0.7256834	test1: 0.6773163	best: 0.6773412 (1360)	total: 16m 36s	remaining: 18m 57s
1500:	test: 0.7267296	test1: 0.6774092	best: 0.6774163 (1498)	total: 17m 50s	remaining: 17m 49s
1600:	test: 0.7276818	test1: 0.6776333	best: 0.6776333 (1600)	total: 19m 4s	remaining: 16m 40s
1700:	test: 0.7286675	test1: 0.6777344	best: 0.677790

err:  0.6924930634078361
0:	test: 0.6218397	test1: 0.6117387	best: 0.6117387 (0)	total: 901ms	remaining: 45m 1s
100:	test: 0.6896418	test1: 0.6629047	best: 0.6629047 (100)	total: 1m 25s	remaining: 40m 50s
200:	test: 0.6996642	test1: 0.6699332	best: 0.6699332 (200)	total: 2m 52s	remaining: 40m
300:	test: 0.7046280	test1: 0.6732518	best: 0.6732691 (299)	total: 4m 21s	remaining: 39m 4s
400:	test: 0.7081828	test1: 0.6753162	best: 0.6753162 (400)	total: 5m 48s	remaining: 37m 40s
500:	test: 0.7110512	test1: 0.6763600	best: 0.6763709 (497)	total: 7m 17s	remaining: 36m 21s
600:	test: 0.7133232	test1: 0.6770069	best: 0.6770119 (591)	total: 8m 48s	remaining: 35m 7s
700:	test: 0.7150586	test1: 0.6775941	best: 0.6775941 (700)	total: 10m 15s	remaining: 33m 37s
800:	test: 0.7166268	test1: 0.6780820	best: 0.6780820 (800)	total: 11m 51s	remaining: 32m 32s
900:	test: 0.7181712	test1: 0.6785842	best: 0.6785842 (900)	total: 13m 24s	remaining: 31m 14s
1000:	test: 0.7193674	test1: 0.6789229	best: 0.6789523

In [67]:
# from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score
# errcb=[]
# y_pred_totcb=[]
# from sklearn.model_selection import KFold,StratifiedKFold
# fold=StratifiedKFold(n_splits=10,shuffle=True,random_state=1994)
# i=1
# for train_index, test_index in fold.split(X,y):
#     X_train, X_test = X.iloc[train_index], X.iloc[test_index]
#     y_train, y_test = y[train_index], y[test_index]
#     m=CatBoostClassifier(n_estimators=3000,random_state=1994,eval_metric='AUC',max_depth=6,learning_rate=0.05,od_wait=50
#                      ,l2_leaf_reg=10,cat_features=categorical_features_indices,bagging_temperature=1,random_strength=100,
#                      use_best_model=True)
#     m.fit(X_train,y_train,eval_set=[(X_train,y_train),(X_test, y_test)], early_stopping_rounds=100,verbose=100)
#     preds=m.predict_proba(X_test)[:,-1]
#     print("err: ",roc_auc_score(y_test,preds))
#     errcb.append(roc_auc_score(y_test,preds))
#     p = m.predict_proba(dftest)[:,-1]
#     y_pred_totcb.append(p)

0:	test: 0.6213369	test1: 0.6195963	best: 0.6195963 (0)	total: 1.79s	remaining: 1h 29m 37s
100:	test: 0.6981124	test1: 0.6719014	best: 0.6719014 (100)	total: 2m 11s	remaining: 1h 2m 52s
200:	test: 0.7055669	test1: 0.6762466	best: 0.6762638 (198)	total: 4m 10s	remaining: 58m 3s
300:	test: 0.7104634	test1: 0.6781424	best: 0.6781508 (299)	total: 6m 22s	remaining: 57m 9s
400:	test: 0.7142091	test1: 0.6793192	best: 0.6793330 (394)	total: 8m 12s	remaining: 53m 8s
500:	test: 0.7174067	test1: 0.6801475	best: 0.6801475 (500)	total: 10m	remaining: 49m 57s
600:	test: 0.7201182	test1: 0.6805944	best: 0.6806643 (590)	total: 11m 48s	remaining: 47m 8s
700:	test: 0.7220414	test1: 0.6809836	best: 0.6809836 (700)	total: 13m 35s	remaining: 44m 33s
800:	test: 0.7239852	test1: 0.6812761	best: 0.6812779 (799)	total: 15m 23s	remaining: 42m 14s
900:	test: 0.7259074	test1: 0.6816682	best: 0.6816854 (885)	total: 17m 10s	remaining: 40m 1s
1000:	test: 0.7275491	test1: 0.6819562	best: 0.6819562 (1000)	total: 18m 5

300:	test: 0.7096651	test1: 0.6838672	best: 0.6838672 (300)	total: 5m 20s	remaining: 47m 50s
400:	test: 0.7128202	test1: 0.6844667	best: 0.6844667 (400)	total: 7m 7s	remaining: 46m 12s
500:	test: 0.7156778	test1: 0.6851212	best: 0.6851212 (500)	total: 8m 55s	remaining: 44m 30s
600:	test: 0.7176453	test1: 0.6854359	best: 0.6854797 (596)	total: 10m 43s	remaining: 42m 50s
700:	test: 0.7200977	test1: 0.6855448	best: 0.6855644 (672)	total: 12m 36s	remaining: 41m 19s
800:	test: 0.7221989	test1: 0.6859183	best: 0.6859318 (794)	total: 14m 26s	remaining: 39m 39s
900:	test: 0.7240577	test1: 0.6859862	best: 0.6860232 (875)	total: 16m 17s	remaining: 37m 58s
1000:	test: 0.7261459	test1: 0.6862396	best: 0.6862396 (1000)	total: 18m 8s	remaining: 36m 14s
1100:	test: 0.7278783	test1: 0.6863492	best: 0.6863698 (1077)	total: 19m 59s	remaining: 34m 29s
1200:	test: 0.7297762	test1: 0.6864110	best: 0.6864677 (1154)	total: 21m 51s	remaining: 32m 45s
1300:	test: 0.7315974	test1: 0.6865857	best: 0.6866102 (129

In [68]:
# normal 2 splits
np.mean(errcb)

0.682666383500668

In [15]:
# 10-fold cross-validation with tuning
np.mean(errcb)

0.682823114704624

In [29]:
# 10 folds CV with more tuning
np.mean(errcb)

0.6834699720831957

In [32]:
# final 20 folds CV with tuning
np.mean(errcb)

0.6836636808938612
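
The four `np.mean(errcb)` values recorded above can be put side by side. This is only a small sketch: the numbers are copied from the cell outputs, the labels come from the cell comments, and the names `cv_auc`, `best_run` and `total_gain` are made up for illustration. The runs use different fold counts, so treat the comparison as a rough guide.

```python
# Mean validation AUC per experiment, copied from the cell outputs above.
cv_auc = {
    "normal 2 splits": 0.682666383500668,
    "10 folds, tuned": 0.682823114704624,
    "10 folds, more tuning": 0.6834699720831957,
    "20 folds, tuned": 0.6836636808938612,
}

baseline = cv_auc["normal 2 splits"]
for name, auc in cv_auc.items():
    # Show each run's AUC and its difference from the first run.
    print(f"{name:<24} AUC={auc:.6f}  gain vs baseline={auc - baseline:+.6f}")

best_run = max(cv_auc, key=cv_auc.get)
total_gain = cv_auc[best_run] - baseline
```

The total gain from the first run to the last is roughly 0.001 AUC, so each tuning step moved the local score only slightly.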

In [33]:
# Average the per-fold test predictions into one probability per UniqueID
s=pd.DataFrame({'UniqueID':test['UniqueID'],'loan_default':np.mean(y_pred_totcb,axis=0)})
s.head()


s.to_csv('cb20foldscatMostfeat22Opt_U2.csv',index=False)
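
The submission averages the test-set probabilities from every fold's model, row by row. Below is a pure-Python sketch of that averaging, assuming three hypothetical folds and four test rows; the numbers and the names `fold_preds` and `blended` are made up and are not taken from the notebook.

```python
from statistics import mean

# Hypothetical per-fold probabilities: one list per fold, one entry per test row.
fold_preds = [
    [0.10, 0.40, 0.25, 0.80],
    [0.20, 0.30, 0.25, 0.70],
    [0.30, 0.50, 0.25, 0.90],
]

# zip(*fold_preds) groups the predictions for the same test row across folds,
# so each output value is the mean over folds for that row.
blended = [mean(row) for row in zip(*fold_preds)]
print(blended)
```

With NumPy arrays the same result comes from `np.mean(fold_preds, axis=0)`, which is what the notebook cell does with `y_pred_totcb`.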

# Conclusion

* CatBoost works well when the data mixes categorical and numerical features, and it offers many hyperparameters to tune.
* To reduce overfitting, check your local score, in particular how much the score drops on the validation set.
* K-fold cross-validation helps curb overfitting.
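
One way to watch for overfitting is to read the gap between training and validation AUC straight from the CatBoost log. The sketch below parses four lines copied from the first fold's log in the 20-fold run (timing columns dropped); the regular expression and the names `log`, `pattern` and `gaps` are assumptions made for this illustration.

```python
import re

# Lines copied from the first fold's log. Because eval_set lists the training
# fold first, "test" is the training AUC and "test1" is the validation AUC.
log = """
0: test: 0.6255709 test1: 0.6235215 best: 0.6235215 (0)
100: test: 0.6912090 test1: 0.6663184 best: 0.6663184 (100)
500: test: 0.7102995 test1: 0.6752153 best: 0.6752209 (491)
900: test: 0.7176593 test1: 0.6769492 best: 0.6769492 (900)
"""

pattern = re.compile(r"^(\d+):\s+test:\s+([\d.]+)\s+test1:\s+([\d.]+)", re.MULTILINE)

gaps = []
for iteration, train_auc, valid_auc in pattern.findall(log):
    # Gap between training and validation AUC at this iteration.
    gap = float(train_auc) - float(valid_auc)
    gaps.append((int(iteration), gap))
    print(f"iter {int(iteration):>4}: train={train_auc} valid={valid_auc} gap={gap:.4f}")
```

In these lines the gap widens from about 0.002 to about 0.041, while the validation AUC gains only about 0.011 between iterations 100 and 900. A growing gap with a flattening validation score is the signal that early stopping acts on.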


