**Some Experiements to try out**

1. Feature engineering -- most important, key to get in the top 5.
    * create features like support/oppose ratio, frequency of review and ratings, min ratings for every emp-comp pair, etc.
    * one more imp feature, the last recorded rating/ review, and then maybe having one more feature like difference between the last rating, reviews with the average of all the previous ratings/ reviews. Now this would be of highly decisive nature to decide whether employee is leaving,
    * adding more to that, you can calculate a gradient of the rating list of that employee for a particular company, this gradient might indicate the employee's attitude towards the company.
    * the probability decision threshold is set to be 0.15 now, you can tune it around 0.10-0.20 and see the differnece, I would say decreasing it further might overfit the submission, so better to be on safe side
    
2. Other methods
    * Trying out ensemble learning -- voting ensemble, weighted ensemble, stacking ensemble, in the respective order
    * Hyperparameter tuning, -- the last thing you would want to try, if nothing is working out, go for it
    * h2o.ml -- auto finetuning API, sometimes sucks, but is good for higher scale ensemble. 

### Dependencies

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
from tqdm import tqdm
import math
import gc
import os

In [3]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)
# pd.set_option('display.expand_frame_repr', True)

In [4]:
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random
import datetime
import warnings
warnings.filterwarnings('ignore')
from scipy import stats


# Imports for better visualization
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]

rcParams['figure.figsize'] = (10, 4)
rcParams['figure.dpi'] = 150
#rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['font.size'] = 8
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
rcParams['axes.grid'] = True
rcParams['axes.facecolor'] = '#eeeeee'

### Data

In [5]:
%%time
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

ratings = pd.read_csv('./data/ratings.csv')
remarks = pd.read_csv('./data/remarks.csv')
remarks_support = pd.read_csv('./data/remarks_supp_opp.csv')

CPU times: user 536 ms, sys: 68.4 ms, total: 605 ms
Wall time: 2.44 s


In [6]:
train.head()

Unnamed: 0,id,emp,comp,lastratingdate,left
0,2228,939,phcvroct,17-10-2016,1
1,4349,250,jblrepyr,19-03-2017,0
2,945,134,ewpvmfbc,21-09-2016,0
3,4553,164,wsmblohy,17-03-2017,0
4,941,129,ewpvmfbc,04-04-2016,0


In [7]:
train['left'].value_counts()

0    2932
1     594
Name: left, dtype: int64

In [8]:
test.head()

Unnamed: 0,id,emp,comp,lastratingdate
0,353,509,bnivzbfi,20-03-2017
1,732,22,ocsicwng,17-03-2017
2,3076,2,siexkzzo,09-01-2017
3,2199,885,phcvroct,01-10-2016
4,2145,756,phcvroct,06-02-2017


In [9]:
ratings.head()

Unnamed: 0,emp,comp,Date,rating
0,31,bnivzbfi,01-02-2016,4
1,33,bnivzbfi,01-02-2016,4
2,79,bnivzbfi,01-02-2016,4
3,94,bnivzbfi,01-02-2016,4
4,16,bnivzbfi,01-02-2016,2


In [10]:
remarks.head()

Unnamed: 0,emp,comp,remarkId,txt,remarkDate
0,307,bnivzbfi,fvwadfrj,**********************************************...,20-03-2017
1,382,bnivzbfi,xrrfkgap,*****************************,20-03-2017
2,172,bnivzbfi,xkcrccwi,***************************,20-03-2017
3,135,bnivzbfi,lohhvtmo,***************************,20-03-2017
4,225,bnivzbfi,gpxxmoab,*********************************,20-03-2017


In [11]:
remarks_support.head()

Unnamed: 0,emp,comp,support,oppose,remarkId
0,307,bnivzbfi,True,False,fvwadfrj
1,36,bnivzbfi,True,False,fvwadfrj
2,276,bnivzbfi,True,False,fvwadfrj
3,24,bnivzbfi,True,False,fvwadfrj
4,382,bnivzbfi,True,False,xrrfkgap


### Basic Analysis

In [12]:
## create complete training and testing data

In [13]:
train.shape

(3526, 5)

In [14]:
def get_counts(df):
    print(df.shape)
    for i in df.columns:
        print('unique values of {} --> {} \n'.format(i, df[i].nunique()))

In [15]:
get_counts(train)

(3526, 5)
unique values of id --> 3526 

unique values of emp --> 714 

unique values of comp --> 37 

unique values of lastratingdate --> 439 

unique values of left --> 2 



* Label encoding for companies, 37 is enough! fit on train, and transform on test
* Won't use emp -- but if its allowed to use it then no issues!
* lastratingdate to be changed in datetime -- will use it for count, frequency of review and other time related features

In [16]:
get_counts(ratings)

(221232, 4)
unique values of emp --> 788 

unique values of comp --> 37 

unique values of Date --> 797 

unique values of rating --> 4 



In [17]:
get_counts(remarks)

(82756, 5)
unique values of emp --> 799 

unique values of comp --> 36 

unique values of remarkId --> 41399 

unique values of txt --> 959 

unique values of remarkDate --> 775 



* The remark text is hidden, though we can features like "remark length", features create from "remark length and its support status"
* now again, remarkDate can be used for count, frequency and stuff of their remarks!

In [18]:
get_counts(remarks_support)

(336960, 5)
unique values of emp --> 879 

unique values of comp --> 35 

unique values of support --> 2 

unique values of oppose --> 2 

unique values of remarkId --> 37520 



Obviosuly one has given ratings several times, and probably for more that one companies during their tenure

In [19]:
get_counts(test)

(882, 4)
unique values of id --> 882 

unique values of emp --> 418 

unique values of comp --> 35 

unique values of lastratingdate --> 249 



In [20]:
def check_common_test_train(var):
    return pd.Series(test[var].unique()).isin(train[var].unique()).sum()

In [21]:
check_common_test_train('emp')

344

Well, 344 emp from train and test are common, but we cannot use emp ID as per domain restrictions! If the competition allows it then voila, we are almost at 99% accuracy,

but it might be the case where the emp and company pair is different ( ideally it should be the case!)

Lets check that out.

In [22]:
train['comp_emp_pair'] = train['emp'].astype(str) +'_'+ train['comp'].astype(str)
test['comp_emp_pair'] = test['emp'].astype(str) +'_'+ test['comp'].astype(str)

In [23]:
check_common_test_train('comp_emp_pair')

13

In [24]:
test['comp_emp_pair'].nunique()

882

Yeah our guess was right, they have preserved the testing strategy, dont have any data leak!!! 

As out of 800+ comp_emp pairs only 13 are common in test train, so yes, data quality is good indeed!

In [25]:
del train['comp_emp_pair']
del test['comp_emp_pair']

### Groupby Analysis

In [26]:
ratings.groupby(['emp','comp'])['comp'].agg({'count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,count
emp,comp,Unnamed: 2_level_1
1,azalutpt,2
1,ejeyobsm,181
1,ewpvmfbc,1
1,fqsozvpv,1
1,iqdwmigj,56
...,...,...
991,phcvroct,1
995,phcvroct,6
996,phcvroct,20
997,phcvroct,1


yup, so many employees have served in several different companies, 

**For remarks we cannot just merge those two DFs on remark_id, as they are remarks from two different people on same company or matter, though we can create features out of them**

this only covers 579 emps, we might need to find a workaround to include remarks data from both DFs for remaining emp

### FE 1

In [27]:
company_ratings_stats = ratings.groupby('comp')['rating'].agg({'rating_mean_comp':np.mean,'rating_std_comp':np.std}).reset_index()



In [28]:
company_ratings_stats


Unnamed: 0,comp,rating_mean_comp,rating_std_comp
0,azalutpt,3.456522,0.861691
1,bhqczwkj,3.089277,0.811038
2,bnivzbfi,3.23647,0.99569
3,bucyzegb,3.396226,0.660408
4,dmgwoqhz,3.475316,0.542246
5,ejeyobsm,2.843838,0.990282
6,ewpvmfbc,2.643531,0.885298
7,fjslutlg,3.666667,0.48795
8,fqsozvpv,2.777778,1.038679
9,iqdwmigj,2.487488,1.107501


In [29]:
# def get_last_diff_rating(x):
#     last = x.iloc[-1]
#     mean = x.iloc[:-1].mean()
#     return last-mean

# last_ratings_diff = ratings.groupby(['emp','comp'])['rating'].apply(get_last_diff_rating).reset_index()
# last_ratings_diff.rename(columns={'rating':'last_rating_diff'},inplace=True)

In [30]:
ratings_pair_df = ratings.groupby(['emp','comp']).agg({'rating':'mean','Date':'count'}).reset_index(drop=False)
ratings_pair_df.rename(columns={'rating':'avg_rating','Date':'rating_count'},inplace=True)

# ratings_pair_df = ratings.groupby(['emp','comp'])['rating'].agg({'rating_mean_emp':'mean','count':'count','rating_std_emp':'std'}).reset_index(drop=False)
# ratings_pair_df = ratings_pair_df.merge(last_ratings_diff,on=['emp','comp'],how='left')
ratings_pair_df = ratings_pair_df.merge(company_ratings_stats,on=['comp'],how='left')

In [31]:
remarks['txt_len'] = remarks['txt'].apply(lambda x:len(str(x)))

remarks_pair_df = remarks.groupby(['emp','comp']).agg({'txt_len':'mean','remarkDate':'count'}).reset_index(drop=False)
remarks_pair_df.rename(columns={'txt_len':'txt_len_avg','remarkDate':'remark_count'},inplace=True)
remarks_pair_df

Unnamed: 0,emp,comp,txt_len_avg,remark_count
0,-217,phcvroct,258.000000,2
1,-216,phcvroct,25.000000,6
2,-214,phcvroct,4.000000,2
3,-213,phcvroct,51.000000,2
4,-212,phcvroct,6.000000,2
...,...,...,...,...
3214,968,phcvroct,24.375000,16
3215,969,phcvroct,33.500000,4
3216,970,phcvroct,211.733333,30
3217,996,phcvroct,83.666667,12


In [32]:
bool_sum = lambda x:sum(x.astype(int))

In [33]:
remarks_support.isna().sum()

emp         0
comp        0
support     1
oppose      1
remarkId    1
dtype: int64

Lets just drop this one NaN

In [34]:
remarks_support_pair_df = remarks_support.dropna().groupby(['emp','comp']).agg({'support':bool_sum,'oppose':bool_sum}).reset_index()
remarks_pair_df



Unnamed: 0,emp,comp,txt_len_avg,remark_count
0,-217,phcvroct,258.000000,2
1,-216,phcvroct,25.000000,6
2,-214,phcvroct,4.000000,2
3,-213,phcvroct,51.000000,2
4,-212,phcvroct,6.000000,2
...,...,...,...,...
3214,968,phcvroct,24.375000,16
3215,969,phcvroct,33.500000,4
3216,970,phcvroct,211.733333,30
3217,996,phcvroct,83.666667,12


**merge them with the dfs**

In [35]:
train_new = train.merge(ratings_pair_df,on=['emp','comp'],how='left')
train_new = train_new.merge(remarks_pair_df,on=['emp','comp'],how='left')
train_new = train_new.merge(remarks_support_pair_df,on=['emp','comp'],how='left')

train_new

Unnamed: 0,id,emp,comp,lastratingdate,left,avg_rating,rating_count,rating_mean_comp,rating_std_comp,txt_len_avg,remark_count,support,oppose
0,2228,939,phcvroct,17-10-2016,1,3.000000,1,2.591810,1.092626,,,,
1,4349,250,jblrepyr,19-03-2017,0,3.767857,112,3.028743,0.860911,136.326923,104.0,233.0,29.0
2,945,134,ewpvmfbc,21-09-2016,0,2.923077,26,2.643531,0.885298,45.800000,10.0,32.0,0.0
3,4553,164,wsmblohy,17-03-2017,0,2.411765,17,2.763706,0.865561,21.200000,10.0,19.0,1.0
4,941,129,ewpvmfbc,04-04-2016,0,2.800000,5,2.643531,0.885298,75.000000,4.0,7.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3521,783,87,ocsicwng,20-03-2017,0,2.901408,71,2.855814,0.966641,74.444444,18.0,22.0,5.0
3522,3655,14,jnvpfmup,17-03-2017,0,3.083333,12,3.148148,0.741178,28.500000,4.0,4.0,1.0
3523,1460,53,xccmgbjz,09-03-2017,0,3.000000,2,3.509695,0.695171,,,,
3524,842,168,ocsicwng,01-03-2017,0,3.166667,12,2.855814,0.966641,82.333333,6.0,,


In [36]:
train_new.isna().sum()

id                     0
emp                    0
comp                   0
lastratingdate         0
left                   0
avg_rating             0
rating_count           0
rating_mean_comp       0
rating_std_comp        0
txt_len_avg         1200
remark_count        1200
support             1006
oppose              1006
dtype: int64

These NaNs are saying something indeed, lets check out the target ration there if it helps

In [37]:
train_new[train_new['support'].isna()]['left'].mean()

0.3996023856858847

In [38]:
train_new[train_new['remark_count'].isna()]['left'].mean()

0.34833333333333333

Damn, the general target ration is 0.16, right now its 0.39 and 0.34, so indeed the NaN means something!, Lets impute these NaNs with -9999

In [39]:
test_new = test.merge(ratings_pair_df,on=['emp','comp'],how='left')
test_new = test_new.merge(remarks_pair_df,on=['emp','comp'],how='left')
test_new = test_new.merge(remarks_support_pair_df,on=['emp','comp'],how='left')

test_new

Unnamed: 0,id,emp,comp,lastratingdate,avg_rating,rating_count,rating_mean_comp,rating_std_comp,txt_len_avg,remark_count,support,oppose
0,353,509,bnivzbfi,20-03-2017,4.000000,43,3.236470,0.995690,86.000000,6.0,75.0,2.0
1,732,22,ocsicwng,17-03-2017,2.130435,69,2.855814,0.966641,43.800000,10.0,82.0,6.0
2,3076,2,siexkzzo,09-01-2017,2.444444,18,2.633187,0.934324,,,19.0,8.0
3,2199,885,phcvroct,01-10-2016,2.831325,83,2.591810,1.092626,67.500000,4.0,205.0,7.0
4,2145,756,phcvroct,06-02-2017,2.111111,9,2.591810,1.092626,78.000000,2.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
877,4022,36,ydqdpmvi,05-10-2016,2.662162,74,2.878524,0.724472,26.000000,2.0,47.0,1.0
878,3397,30,fqsozvpv,14-03-2017,4.000000,2,2.777778,1.038679,72.000000,4.0,11.0,0.0
879,1028,15,rujnkvse,30-01-2017,3.250000,4,2.690715,0.997926,,,,
880,3179,127,siexkzzo,17-03-2017,2.814815,27,2.633187,0.934324,,,12.0,8.0


In [40]:
test_new.isna().sum()

id                    0
emp                   0
comp                  0
lastratingdate        0
avg_rating            0
rating_count          0
rating_mean_comp      0
rating_std_comp       0
txt_len_avg         296
remark_count        296
support             230
oppose              230
dtype: int64

In [41]:
train_new.corr()['left']

id                 -0.188754
emp                 0.145559
left                1.000000
avg_rating         -0.058465
rating_count       -0.044714
rating_mean_comp   -0.128823
rating_std_comp     0.199385
txt_len_avg         0.015386
remark_count        0.069445
support             0.053945
oppose              0.047473
Name: left, dtype: float64

In [42]:
## No need on LGBM

# train_new.fillna(int(-9999),inplace=True)
# test_new.fillna(int(-9999),inplace=True)

We can try the following 

In [43]:
# train_new['oppose_support_ration'] = train_new['oppose']/train_new['support']
# test_new['oppose_support_ration'] = test_new['oppose']/test_new['support']

### Preprocessing

In [44]:
train_new['flag'] = 'train'
test_new['flag'] = 'test'

merged = pd.concat([train_new,test_new]).reset_index(drop=True)

In [45]:
from sklearn.preprocessing import LabelEncoder

le1 = LabelEncoder()
le2 = LabelEncoder()

merged['comp'] = le1.fit_transform(merged['comp'])

merged['emp'] = le2.fit_transform(merged['emp'])

In [46]:
type(merged['emp'][0])

numpy.int64

##### DateTime Cleaning

In [47]:
%%time
merged['lastratingdate'] = merged['lastratingdate'].apply(lambda x:pd.to_datetime(x))
remarks['remarkDate'] = remarks['remarkDate'].apply(lambda x:pd.to_datetime(x))
ratings['Date'] = ratings['Date'].apply(lambda x:pd.to_datetime(x))

CPU times: user 28.3 s, sys: 32.5 ms, total: 28.3 s
Wall time: 28.4 s


**undo the merge**

In [48]:
train_new = merged[merged['flag']=='train'].reset_index(drop=True)
test_new = merged[merged['flag']=='test'].reset_index(drop=True)

In [49]:
del train_new['flag']
del test_new['flag']

In [50]:
train_new['left'] = train_new['left'].astype(int)

### Basic Model

Nothing fancy, Basic FE, processing and K-fold experiments

**Vaniall Training**

In [51]:
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [52]:
target_name="left"
id_name="id"
features_to_remove=to_drop = ['id','lastratingdate','left']

features=train_new.columns.tolist()

features=[ fea for fea in  features if fea not in features_to_remove  ]


In [53]:
from sklearn.model_selection import train_test_split
x_t, x_v, y_t, y_v = train_test_split(train_new.drop(features_to_remove,axis=1),train_new[target_name],
                                      test_size=0.2, stratify=train[target_name], random_state=4343)

In [54]:
model = LGBMClassifier(n_estimators=200,max_depth=6,learning_rate=0.1)
# model.fit(train_new.drop(features_to_remove,axis=1),train_new[target_name])

model.fit(x_t,y_t)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=6,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=200, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [55]:
y_hat = model.predict(x_v)

In [56]:
probs = model.predict_proba(x_v)

In [57]:
probab_df = pd.DataFrame()
probab_df['y_true'] = y_v.reset_index(drop=True)
probab_df['y_pred'] = y_hat
probab_df['probab_to_leave'] = probs[::,1]
probab_df

Unnamed: 0,y_true,y_pred,probab_to_leave
0,1,0,0.001248
1,0,0,0.000282
2,0,0,0.000169
3,0,0,0.000370
4,0,0,0.001840
...,...,...,...
701,0,0,0.005017
702,0,0,0.074597
703,0,0,0.027055
704,0,0,0.134250


In [58]:
probab_df[(probab_df['y_true']==1) & (probab_df['y_pred']==0)]['probab_to_leave'].describe()
## the False Negatives

count    47.000000
mean      0.162982
std       0.151711
min       0.001248
25%       0.019004
50%       0.108654
75%       0.282034
max       0.479545
Name: probab_to_leave, dtype: float64

In [59]:
probab_df[(probab_df['y_true']==1) & (probab_df['y_pred']==1)]['probab_to_leave'].describe()
## the True Positives

count    72.000000
mean      0.860931
std       0.138959
min       0.509736
25%       0.770737
50%       0.897459
75%       0.988247
max       0.999230
Name: probab_to_leave, dtype: float64

In [60]:
from sklearn.metrics import accuracy_score, confusion_matrix

cf = confusion_matrix(y_v,y_hat)
(1*cf[0][0] + 5*cf[1][1])/(1*cf[0].sum()+5*cf[1].sum())

0.7859560067681896

Yeah so it seems to be getting to the baseline we have given as reference, 

but we haven't tuned the confidence score yet!

In [63]:
cf_tuned = confusion_matrix(y_v,(probs[::,1]>0.1).astype(int))
(1*cf_tuned[0][0] + 5*cf_tuned[1][1])/(1*cf_tuned[0].sum()+5*cf_tuned[1].sum())

0.8460236886632826

### K-Fold

In [64]:
target_name="left"
id_name="id"
features_to_remove=to_drop = ['id','lastratingdate','left']

features=train_new.columns.tolist()

features=[ fea for fea in  features if fea not in features_to_remove  ]


In [65]:
import lightgbm as lgbm 
import xgboost as xgb 
import catboost as cat 

In [66]:
Experiment_name="basic_model_lgbm"

In [67]:
from sklearn.model_selection import KFold, StratifiedKFold

try: 
    folds=pd.read_csv("./folds_id.csv")
    TRAIN=train_new.merge(folds,on="id",how="left")
    TRAIN.fold.nunique() 
    
except:
#### you run this cell  only for the first time 
    kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=220)
    
    TRAIN.reset_index(drop=True,inplace=True)
    folds=TRAIN[["id"]].copy()
    folds["fold"]=0
    for fold,(tr_indx,val_ind) in enumerate(kfold.split(folds,TRAIN[target_name])) : 
        folds.loc[val_ind,"fold"]=fold
    folds.to_csv("./folds_id.csv",index=False)
    TRAIN=TRAIN.merge(folds,on="id",how="left")

    del folds

In [68]:
def train_function(model,train,test,params,other_params,target_name,features,metric):
    
    folds_num=train.fold.nunique()
    validation=train[[id_name,"fold",target_name]].copy()
    validation["pred_"+target_name]=0
    sub=test[[id_name]].copy()
    sub[target_name]=0
    
    for fold in np.sort(train.fold.unique()):
        print("#"*50+" {} ".format(fold)+"#"*50)
        os.makedirs("model_save/lgbm/{}/{}".format(Experiment_name,str(int(fold))), exist_ok=True)
        X_train=train[train.fold!=fold]
        X_val=train[train.fold==fold]

        train_pred,validation_pred,test_pred=model(X_train,X_val,test,params,other_params)
        
        validation.loc[validation.fold==fold,"pred_"+target_name]=validation_pred
        sub[target_name]+=test_pred/folds_num
        train_score=metric(X_train[target_name],train_pred)
        val_score=metric(X_val[target_name],validation_pred)
        print("train score : {} validation score : {}".format(round(train_score,4),round(val_score,4)))
    
    final_validation_score=metric(validation[target_name],validation["pred_"+target_name])
    print("final validation score : {}".format(final_validation_score))
        
    return sub,validation,final_validation_score

def lgbm_model(X_train,X_val,X_test,params,other_params):
    
    dtrain = lgbm.Dataset(data=X_train[features], label=X_train[target_name], feature_name=features)
    dval = lgbm.Dataset(data=X_val[features], label=X_val[target_name], feature_name=features)
    model = lgbm.train(
        params=params,
        train_set=dtrain,
        num_boost_round=other_params["num_boost_round"],
        valid_sets=(dtrain, dval),
        early_stopping_rounds=other_params["early_stopping_rounds"],
        verbose_eval=other_params["verbose_eval"],
    )        
    
    best_iteration = model.best_iteration
    train_pred=model.predict(X_train[features], num_iteration=best_iteration)
    validation_pred=model.predict(X_val[features], num_iteration=best_iteration)
    test_pred=model.predict(X_test[features], num_iteration=best_iteration)
        
    return train_pred,validation_pred,test_pred

In [69]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

## set the probab threshold here, according to your experiment
def metric(y,y_hat,threshold=0.1):
    cf = confusion_matrix(y,(y_hat>threshold).astype(int))
    return (1*cf[0][0] + 5*cf[1][1])/(1*(cf[0].sum())+5*(cf[1].sum()))

In [92]:
other_params={"num_boost_round":50000000,
              "early_stopping_rounds":200,
              "verbose_eval":1000,
}

lgbm_params = {'max_depth': 26, 'num_leaves': 80, 'min_child_samples': 200, 'scale_pos_weight': 1.0,
 'subsample': 0.9, 'colsample_bytree': 0.6, 'lambda_2': 5.785038883822358, 'metric': 'binary_logloss',
 'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 'learning_rate': 0.3, 'max_bin': 100}

### 0.8427651643510674 -- K-fold score with above 


# lgbm_params = {'max_depth': 19, 'num_leaves': 73, 'min_child_samples': 20, 'scale_pos_weight': 53.59603802481987, 
#  'subsample': 0.9, 'colsample_bytree': 0.7619369064645131, 'lambda_2': 4.525888999514755, 
#  'metric': 'binary_logloss', 'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 
#  'learning_rate': 0.2850075764695483, 'max_bin': 100}
# 0.8312436462216198 -- K-fold score with above, plus the last fold gave score near 0.70, means these params arent good

# lgbm_params = {
#     "bagging_fraction": 0.8,
#     "bagging_freq": 2,
#     "boosting_type": "gbdt",
#     "feature_fraction": 0.8,
#     "learning_rate": 0.01,
#     "max_depth": 3,
#     "num_threads": 16,
#     "objective": "binary",
#     "metric": "binary_logloss",
#     "seed": 2020,
# }
# 0.8375127075567604 -- with this one

In [111]:
%%time
sub,validation,score=train_function(model=lgbm_model,
                                    train=TRAIN,
                                    test=test_new,
                                    params=lgbm_params,
                                    other_params=other_params,
                                    target_name=target_name,
                                    features=features,
                                    metric=metric)

################################################## 0 ##################################################
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[28]	training's binary_logloss: 0.196066	valid_1's binary_logloss: 0.236846
train score : 0.8822 validation score : 0.8494
################################################## 1 ##################################################
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[19]	training's binary_logloss: 0.211415	valid_1's binary_logloss: 0.249822
train score : 0.85 validation score : 0.8493
################################################## 2 ##################################################
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[136]	training's binary_logloss: 0.106494	valid_1's binary_logloss: 0.21847
train score : 0.9316 validation score : 0.8493
##############################

### HyperOpt

In [94]:
import gc

from skopt.space import Real, Integer
from skopt.utils import use_named_args
import itertools
from sklearn.metrics import roc_auc_score, accuracy_score
from skopt import gp_minimize
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [95]:
from sklearn.model_selection import train_test_split
x_t, x_v, y_t, y_v = train_test_split(train_new.drop(features_to_remove,axis=1),train_new[target_name],
                                      test_size=0.2, stratify=train[target_name], random_state=4343)

In [74]:
# from that dimension (`'log-uniform'` for the learning rate)
space  = [Integer(-1, 30, name='max_depth'),
          Integer(6, 80, name='num_leaves'),
          Integer(20, 200, name='min_child_samples'),
          Real(0.6, 0.9, name='subsample'),
          Real(0.6, 0.9, name='colsample_bytree'),
          Real(0.01,0.3,name='learning_rate'),
          Real(0,6,name='lambda_l2'),
          Real(1, 100, name='scale_pos_weight'),
         ]

In [82]:
def objective(values):

    params = {'max_depth': values[0],
              'num_leaves': values[1],
              'min_child_samples': values[2],
              'scale_pos_weight': values[7],
              'subsample': values[3],
              'colsample_bytree': values[4],
              'lambda_2': values[6],
              'metric': 'binary_logloss',
              'nthread': 8,
              'boosting_type': 'gbdt',
              'objective': 'binary',
              'learning_rate': values[5],
              'max_bin': 100,
              #              'min_child_weight': 0,
              #              'min_split_gain': 0,
              #              'subsample_freq': 1,
              }

    print('\nNext set of params.....', params)

    early_stopping_rounds = 200
    num_boost_round = 2000

    # Fit model on feature_set and calculate validation AUROC
    xgtrain = lgb.Dataset(x_t, label=y_t)
    xgvalid = lgb.Dataset(x_v, label=y_v)

    evals_results = {}
    model_lgb = lgb.train(params, xgtrain, valid_sets=[xgtrain, xgvalid],
                          valid_names=['train', 'valid'],
                          evals_result=evals_results,
                          num_boost_round=num_boost_round,
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval=None, feval=None)

#     auc = -roc_auc_score(y_v, model_lgb.predict(x_v))
#     f1 = -f1_score(y_v,(model_lgb.predict(x_v)>0.5).astype(int))
#     print(model_lgb.predict(x_v))
#     accuracy = -accuracy_score(y_v,model_lgb.predict(x_v).argmax(axis=1))
    score = -metric(y_v, model_lgb.predict(x_v))

#     print('\nAUROC.....',-auc,".....iter.....", model_lgb.current_iteration())
#     print('\nF1 SCORE.....',-f1,".....iter.....", model_lgb.current_iteration())
    print('\ncustom SCORE.....', -score,
          ".....iter.....", model_lgb.current_iteration())

    gc.collect()

    return score

In [83]:
%%time
res_gp = gp_minimize(objective, space, n_calls=100, random_state=0,n_random_starts=10)

"Best score=%.4f" % res_gp.fun


Next set of params..... {'max_depth': 17, 'num_leaves': 68, 'min_child_samples': 174, 'scale_pos_weight': 6.614584754426876, 'subsample': 0.8541755216352377, 'colsample_bytree': 0.7870691090357917, 'lambda_2': 1.785207639266834, 'metric': 'binary_logloss', 'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 'learning_rate': 0.12147069511488297, 'max_bin': 100}

custom SCORE..... 0.8384094754653131 .....iter..... 241

Next set of params..... {'max_depth': 7, 'num_leaves': 41, 'min_child_samples': 166, 'scale_pos_weight': 65.16901533306853, 'subsample': 0.7439931517125172, 'colsample_bytree': 0.7178354388302489, 'lambda_2': 2.0243769625036108, 'metric': 'binary_logloss', 'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 'learning_rate': 0.2524628414258395, 'max_bin': 100}

custom SCORE..... 0.8316412859560067 .....iter..... 303

Next set of params..... {'max_depth': 10, 'num_leaves': 77, 'min_child_samples': 45, 'scale_pos_weight': 68.20907348177708, 'subsample': 0

'Best score=-0.8663'

In [84]:
# 0.8663282571912013
# {'max_depth': 26, 'num_leaves': 80, 'min_child_samples': 200, 'scale_pos_weight': 1.0, 'subsample': 0.9, 'colsample_bytree': 0.6, 'lambda_2': 5.785038883822358, 'metric': 'binary_logloss',
#  'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 'learning_rate': 0.3, 'max_bin': 100}

# {'max_depth': 19, 'num_leaves': 73, 'min_child_samples': 20, 'scale_pos_weight': 53.59603802481987, 
#  'subsample': 0.9, 'colsample_bytree': 0.7619369064645131, 'lambda_2': 4.525888999514755, 
#  'metric': 'binary_logloss', 'nthread': 8, 'boosting_type': 'gbdt', 'objective': 'binary', 
#  'learning_rate': 0.2850075764695483, 'max_bin': 100}




### Submission

In [113]:
# sub_h2o = aml.leader.predict(h2o.H2OFrame(test_new[X.columns]))

In [114]:
# sub_h2o = sub_h2o.as_data_frame()
# sub_h2o['id'] = test_new['id'].values
# # sub

In [115]:
# sub_h2o_small = sub_h2o[['id','p1']]

In [116]:
threshold = 0.15

sub['left_label'] = sub['left'].apply(lambda x:int(x>threshold))
# sub_h2o_small['left'] = sub_h2o_small['p1'].apply(lambda x:int(x>threshold))


In [117]:
# sub_h2o_small['left'].mean()

In [119]:
sub['left_label'].mean()

0.26303854875283444

#### Simple Ensemble

In [120]:
# ensemble_preds = sub['left']*0.4 + sub_h2o_small['p1']*0.6
# ensemble_preds = sub['left']*0.5 + sub_h2o_small['p1']*0.5

In [100]:
# ensemble_preds

In [101]:
# sub_h2o_small['left'] = ensemble_preds.apply(lambda x:int(x>threshold))

#### Save and upload

In [352]:
if not os.path.exists('./submissions'):
    os.mkdir('./submissions')

In [102]:
# sub_h2o_small['left'].mean()

In [103]:
# pd.read_csv('./submissions/submission_baseline_0.1_cutoff_H2OML#3_ensemble.csv')['left'].mean()

# Confirms that current run is the benchmark with 0.2789115646258503 target mean

In [263]:
# sub.to_csv('./submissions/submission_baseline_0.15_cutoff_added_support_oppose_sum_fillna.csv',index=False)

## oh damn, 0.15 worked well, but is it overfitting leaderboard??? Don't think so yet... lets see!

# sub_h2o_small.drop('p1',axis=1).to_csv('./submissions/submission_baseline_0.1_cutoff_H2OML#2.csv',index=False)

sub_h2o_small.drop('p1',axis=1).to_csv('./submissions/submission_baseline_0.1_cutoff_H2OML#5_ensemble_50%.csv',index=False)



Ensemble worked pretty well! reached 84.09

okay, so 0.05 is too low cutoff, 0.1 is the optimal, lets stick to that

!! OKAY FILLNA with -9999 didnt work, I mean didnt take us any further, though the score remained almost same, because LGBM takes care of NaNs on it own!