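# Employee Attrition Prediction

A task from a DC competition.

## Task

Given records of the factors that influence employee attrition and whether each employee has left, build a model that predicts which employees are likely to leave.

## Data

The data covers factors that influence attrition (salary, business travel, work-environment satisfaction, job involvement, overtime, promotion, salary-hike percentage, and so on) together with a record of whether each employee has already left.

The data is split into training and test sets, stored in train.csv and test_noLabel.csv respectively. The fields are described below: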

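(1) Age: the employee's age;

(2) Label: whether the employee has left; 1 means the employee has left and 0 means the employee has not left. This is the prediction target;

(3) BusinessTravel: frequency of business travel; Non-Travel means no travel, Travel_Rarely means infrequent travel, and Travel_Frequently means frequent travel;

(4) Department: the employee's department, one of Sales, Research & Development, or Human Resources;

(5) DistanceFromHome: distance between the company and the employee's home, from 1 to 29; 1 is the nearest and 29 the farthest;

(6) Education: the employee's education level, from 1 to 5; 5 is the highest;

(7) EducationField: the employee's field of study, one of Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;

(8) EmployeeNumber: the employee number;

(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 to 4; 1 is the lowest and 4 the highest;

(10) Gender: the employee's gender, Male or Female;

(11) JobInvolvement: job involvement, from 1 to 4; 1 is the lowest and 4 the highest;

(12) JobLevel: job level, from 1 to 5; 1 is the lowest and 5 the highest;

(13) JobRole: the employee's role, one of Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;

(14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is the lowest and 4 the highest;

(15) MaritalStatus: marital status, one of Single, Married, or Divorced;

(16) MonthlyIncome: monthly income, ranging from 1009 to 19999;

(17) NumCompaniesWorked: number of companies the employee has previously worked for;

(18) Over18: whether the employee is over 18;

(19) OverTime: whether the employee works overtime, Yes or No;

(20) PercentSalaryHike: percentage salary increase;

(21) PerformanceRating: performance rating;

(22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is the lowest and 4 the highest;

(23) StandardHours: standard working hours;

(24) StockOptionLevel: stock option level;

(25) TotalWorkingYears: total years of work experience;

(26) TrainingTimesLastYear: amount of training in the previous year, from 0 to 6; 0 means no training and 6 means the most training;

(27) WorkLifeBalance: work-life balance, from 1 to 4; 1 is the lowest and 4 the highest;

(28) YearsAtCompany: years at the current company;

(29) YearsInCurrentRole: years in the current role;

(30) YearsSinceLastPromotion: years since the last promotion;

(31) YearsWithCurrManager: years working with the current manager.

The test data contains 350 records. Unlike the training data, it does not include whether the employee has left. Participants use the model built from the training data, together with the test data, to predict whether each employee in the test set has left.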

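## Scoring

Submissions are scored by accuracy: the higher the accuracy, the better the model separates employees who left from those who stayed.

Reference code for the scoring metric: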

 
```
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]
score = accuracy_score(y_true, y_pred)
```

## Import the required packages

In [1]:
import pandas as pd

```
# Show all columns
pd.set_option('display.max_columns', None)

# Show all rows
pd.set_option('display.max_rows', None)

# Set the maximum displayed width of a value to 100 (the default is 50)
pd.set_option('display.max_colwidth', 100)
```

In [2]:
# Show all columns
pd.set_option('display.max_columns', None)

## Read the data

In [3]:
path='./employee_leave/'
train = pd.read_csv(path+'train.csv')
test = pd.read_csv(path+'test_noLabel.csv')


## Inspect the data

In [7]:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ID                        1100 non-null   int64 
 1   Age                       1100 non-null   int64 
 2   BusinessTravel            1100 non-null   object
 3   Department                1100 non-null   object
 4   DistanceFromHome          1100 non-null   int64 
 5   Education                 1100 non-null   int64 
 6   EducationField            1100 non-null   object
 7   EmployeeNumber            1100 non-null   int64 
 8   EnvironmentSatisfaction   1100 non-null   int64 
 9   Gender                    1100 non-null   object
 10  JobInvolvement            1100 non-null   int64 
 11  JobLevel                  1100 non-null   int64 
 12  JobRole                   1100 non-null   object
 13  JobSatisfaction           1100 non-null   int64 
 14  MaritalStatus           

In [8]:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ID                        350 non-null    int64 
 1   Age                       350 non-null    int64 
 2   BusinessTravel            350 non-null    object
 3   Department                350 non-null    object
 4   DistanceFromHome          350 non-null    int64 
 5   Education                 350 non-null    int64 
 6   EducationField            350 non-null    object
 7   EmployeeNumber            350 non-null    int64 
 8   EnvironmentSatisfaction   350 non-null    int64 
 9   Gender                    350 non-null    object
 10  JobInvolvement            350 non-null    int64 
 11  JobLevel                  350 non-null    int64 
 12  JobRole                   350 non-null    object
 13  JobSatisfaction           350 non-null    int64 
 14  MaritalStatus             

In [9]:
train.describe()

Unnamed: 0,ID,Age,DistanceFromHome,Education,EmployeeNumber,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Label
count,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,...,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0
mean,549.5,36.999091,9.427273,2.922727,1028.157273,2.725455,2.730909,2.054545,2.732727,6483.620909,...,80.0,0.788182,11.221818,2.807273,2.746364,7.011818,4.207273,2.226364,4.123636,0.161818
std,317.686953,9.03723,8.196694,1.022242,598.915204,1.098053,0.706366,1.107805,1.109731,4715.293419,...,0.0,0.843347,7.825548,1.291514,0.701121,6.223093,3.618115,3.31383,3.597996,0.368451
min,0.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1009.0,...,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,274.75,30.0,2.0,2.0,504.25,2.0,2.0,1.0,2.0,2924.5,...,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,0.0
50%,549.5,36.0,7.0,3.0,1026.5,3.0,3.0,2.0,3.0,4857.0,...,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,0.0
75%,824.25,43.0,15.0,4.0,1556.5,4.0,3.0,3.0,4.0,8354.5,...,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0,0.0
max,1099.0,60.0,29.0,5.0,2065.0,4.0,4.0,5.0,4.0,19999.0,...,80.0,3.0,40.0,6.0,4.0,37.0,18.0,15.0,17.0,1.0
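
The `Label` column in the table above has a mean of 0.161818 over 1,100 rows, so roughly 16% of the training employees left. Because the competition score is accuracy, it helps to know what a trivial model that always predicts 0 (stayed) would score on the training data. A quick calculation using only the numbers printed above:

```python
# Figures taken from the train.describe() output above
n_train = 1100
positive_rate = 0.161818  # mean of Label, i.e. share of employees who left

# Recover the class counts from the mean
n_pos = round(n_train * positive_rate)
n_neg = n_train - n_pos

# Accuracy of a model that always predicts the majority class (0)
majority_accuracy = n_neg / n_train

print(n_pos, n_neg)                 # 178 922
print(round(majority_accuracy, 4))  # 0.8382
```

Any model worth submitting should beat this figure on held-out data; an accuracy near 0.84 alone does not show that the model has learned anything about attrition.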


In [10]:
test.describe()

Unnamed: 0,ID,Age,DistanceFromHome,Education,EmployeeNumber,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0,...,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0,350.0
mean,1274.5,36.471429,8.391429,2.868571,1023.285714,2.714286,2.734286,2.068571,2.725714,6479.491429,...,2.745714,80.0,0.817143,11.202857,2.782857,2.808571,6.782857,4.26,1.951429,4.017143
std,101.180532,9.373378,7.685318,1.029583,612.566819,1.067129,0.726669,1.089615,1.083437,4633.609813,...,1.041226,0.0,0.886539,7.470399,1.295238,0.722488,5.489113,3.622336,2.752532,3.38372
min,1100.0,18.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1051.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,1187.25,30.0,2.0,2.0,463.25,2.0,2.0,1.0,2.0,2888.75,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,1274.5,35.0,7.0,3.0,1011.0,3.0,3.0,2.0,3.0,5104.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,1361.75,42.0,11.0,4.0,1584.5,4.0,3.0,3.0,4.0,8260.25,...,4.0,80.0,1.0,15.75,3.0,3.0,9.75,7.0,2.0,7.0
max,1449.0,60.0,29.0,5.0,2068.0,4.0,4.0,5.0,4.0,19973.0,...,4.0,80.0,3.0,37.0,6.0,4.0,29.0,16.0,15.0,14.0


In [16]:
train.head(10)

# head(10) shows the first 10 rows.
# To show every column without an ellipsis in the middle, use:
# pd.set_option('display.max_columns', None)

Unnamed: 0,ID,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Label
0,0,37,Travel_Rarely,Research & Development,1,4,Life Sciences,77,1,Male,2,2,Manufacturing Director,3,Divorced,5993,1,Y,No,18,3,3,80,1,7,2,4,7,5,0,7,0
1,1,54,Travel_Frequently,Research & Development,1,4,Life Sciences,1245,4,Female,3,3,Manufacturing Director,3,Divorced,10502,7,Y,No,17,3,1,80,1,33,2,1,5,4,1,4,0
2,2,34,Travel_Frequently,Research & Development,7,3,Life Sciences,147,1,Male,1,2,Laboratory Technician,3,Single,6074,1,Y,Yes,24,4,4,80,0,9,3,3,9,7,0,6,1
3,3,39,Travel_Rarely,Research & Development,1,1,Life Sciences,1026,4,Female,2,4,Manufacturing Director,4,Married,12742,1,Y,No,16,3,3,80,1,21,3,3,21,6,11,8,0
4,4,28,Travel_Frequently,Research & Development,1,3,Medical,1111,1,Male,2,1,Laboratory Technician,2,Divorced,2596,1,Y,No,15,3,1,80,2,1,2,3,1,0,0,0,1
5,5,24,Travel_Rarely,Sales,4,1,Medical,1445,4,Female,3,2,Sales Executive,3,Married,4162,1,Y,Yes,12,3,3,80,2,5,3,3,5,4,0,3,0
6,6,29,Travel_Rarely,Research & Development,9,5,Other,455,2,Male,2,1,Laboratory Technician,4,Single,3983,0,Y,No,17,3,3,80,0,4,2,3,3,2,2,2,0
7,7,36,Travel_Rarely,Sales,2,2,Medical,513,2,Male,2,3,Sales Executive,3,Married,7596,1,Y,No,13,3,2,80,2,10,2,3,10,9,9,0,0
8,8,33,Travel_Rarely,Research & Development,4,4,Medical,305,3,Female,2,1,Research Scientist,2,Married,2622,6,Y,No,21,4,4,80,0,7,3,3,3,2,1,1,0
9,9,34,Travel_Rarely,Research & Development,2,4,Technical Degree,1383,3,Female,3,2,Healthcare Representative,4,Single,6687,1,Y,No,11,3,4,80,0,14,2,4,14,11,4,11,0


## Convert string features to categorical data with .astype('category')

In [4]:
test['Label'] = -1

In [5]:
test.head(10)

Unnamed: 0,ID,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Label
0,1100,40,Non-Travel,Research & Development,9,4,Other,1449,3,Male,3,2,Laboratory Technician,3,Divorced,3975,3,Y,No,11,3,3,80,2,11,2,4,8,7,0,7,-1
1,1101,53,Travel_Rarely,Research & Development,7,2,Medical,1201,4,Female,3,5,Manager,3,Divorced,18606,3,Y,No,18,3,2,80,1,26,6,3,7,7,4,7,-1
2,1102,42,Travel_Rarely,Research & Development,2,4,Other,477,1,Male,2,2,Healthcare Representative,4,Single,6781,3,Y,No,23,4,2,80,0,14,6,3,1,0,0,0,-1
3,1103,34,Travel_Frequently,Human Resources,11,3,Life Sciences,1289,3,Male,2,2,Human Resources,2,Married,4490,4,Y,No,11,3,4,80,2,14,5,4,10,9,1,8,-1
4,1104,32,Travel_Rarely,Research & Development,1,1,Life Sciences,134,4,Male,3,1,Research Scientist,1,Single,2956,1,Y,No,13,3,4,80,0,1,2,3,1,0,0,0,-1
5,1105,45,Travel_Rarely,Human Resources,4,3,Life Sciences,1744,3,Female,1,3,Human Resources,3,Married,9756,4,Y,No,21,4,3,80,2,9,2,4,5,0,0,3,-1
6,1106,32,Travel_Rarely,Sales,1,4,Marketing,2016,3,Female,3,3,Sales Executive,4,Married,10422,1,Y,No,19,3,3,80,2,14,3,3,14,10,5,7,-1
7,1107,38,Travel_Rarely,Research & Development,25,2,Life Sciences,421,1,Female,2,3,Research Director,2,Married,12061,3,Y,No,17,3,3,80,1,19,2,3,10,8,0,1,-1
8,1108,37,Travel_Rarely,Research & Development,1,3,Life Sciences,391,4,Female,3,1,Research Scientist,4,Single,2115,1,Y,No,12,3,2,80,0,17,3,3,17,12,5,7,-1
9,1109,29,Travel_Frequently,Sales,1,1,Medical,312,2,Female,3,3,Sales Executive,4,Married,7918,1,Y,No,14,3,4,80,1,11,5,3,11,10,4,1,-1


In [6]:
data = pd.concat([train,test])
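
`pd.concat` keeps each frame's original row labels, so the combined frame's index repeats the values 0 to 349 (once from train, once from test). That is harmless for the boolean filtering on `Label` used later, but it matters for label-based selection with `.loc`. A small sketch with toy frames (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for train and test; test rows carry the -1 placeholder label
train_toy = pd.DataFrame({"ID": [0, 1, 2], "Label": [0, 1, 0]})
test_toy = pd.DataFrame({"ID": [3, 4], "Label": [-1, -1]})

# Default behaviour: row labels are carried over, so they repeat
stacked = pd.concat([train_toy, test_toy])
print(stacked.index.tolist())    # [0, 1, 2, 0, 1]
print(stacked.index.is_unique)   # False

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([train_toy, test_toy], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]

# Boolean masks work the same either way
test_rows = stacked[stacked["Label"] == -1]
print(test_rows["ID"].tolist())   # [3, 4]
```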

In [7]:
data.head()

Unnamed: 0,ID,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Label
0,0,37,Travel_Rarely,Research & Development,1,4,Life Sciences,77,1,Male,2,2,Manufacturing Director,3,Divorced,5993,1,Y,No,18,3,3,80,1,7,2,4,7,5,0,7,0
1,1,54,Travel_Frequently,Research & Development,1,4,Life Sciences,1245,4,Female,3,3,Manufacturing Director,3,Divorced,10502,7,Y,No,17,3,1,80,1,33,2,1,5,4,1,4,0
2,2,34,Travel_Frequently,Research & Development,7,3,Life Sciences,147,1,Male,1,2,Laboratory Technician,3,Single,6074,1,Y,Yes,24,4,4,80,0,9,3,3,9,7,0,6,1
3,3,39,Travel_Rarely,Research & Development,1,1,Life Sciences,1026,4,Female,2,4,Manufacturing Director,4,Married,12742,1,Y,No,16,3,3,80,1,21,3,3,21,6,11,8,0
4,4,28,Travel_Frequently,Research & Development,1,3,Medical,1111,1,Male,2,1,Laboratory Technician,2,Divorced,2596,1,Y,No,15,3,1,80,2,1,2,3,1,0,0,0,1


In [12]:
data.columns
# Show the column names

Index(['ID', 'Age', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
       'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Label'],
      dtype='object')

In [10]:
cat_col = [i for i in data.select_dtypes(object).columns if i not in ['ID','Label']]

In [11]:
cat_col

['BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'Over18',
 'OverTime']

In [20]:
for i in cat_col:
    data[i] = data[i].astype('category') 
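
To see what the conversion above does, here is a minimal sketch on a toy frame (hypothetical values). A categorical column stores each distinct string once and represents the rows as integer codes, which is the form LightGBM can consume directly:

```python
import pandas as pd

# Toy frame with one string column and one numeric column
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "Age": [34, 41, 29, 50],
})

# Same conversion as in the notebook loop
toy["OverTime"] = toy["OverTime"].astype("category")

categories = toy["OverTime"].cat.categories.tolist()
codes = toy["OverTime"].cat.codes.tolist()

print(toy.dtypes["OverTime"])  # category
print(categories)              # ['No', 'Yes']
print(codes)                   # [1, 0, 0, 1]
```

Because the notebook converts the columns on the concatenated `data` frame, train and test rows share one set of categories and therefore one code mapping, which is presumably why the two files are combined before this step.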

## Baseline 1: LightGBM + string columns as categorical data

In [21]:
feats = [i for i in data.columns if i not in ['ID','Label']]
feats

['Age',
 'BusinessTravel',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']
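
The feature list above keeps every column except `ID` and `Label`. The earlier `train.describe()` output shows `StandardHours` with a standard deviation of 0 (always 80), so that column cannot contribute to any split. A minimal sketch, on a toy frame with made-up values, of how constant columns could be detected and left out; this is an optional refinement and not something the notebook does:

```python
import pandas as pd

# Toy stand-in for the competition data (values are hypothetical)
toy = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Age": [37, 54, 34, 39],
    "StandardHours": [80, 80, 80, 80],  # constant, as in train.describe()
    "Over18": ["Y", "Y", "Y", "Y"],     # constant in this toy frame
    "Label": [0, 0, 1, 0],
})

# A column with a single distinct value carries no information
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]

# Same exclusion pattern as the notebook, extended with the constant columns
feats_toy = [c for c in toy.columns if c not in ["ID", "Label"] + constant_cols]

print(constant_cols)  # ['StandardHours', 'Over18']
print(feats_toy)      # ['Age']
```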

## Build the LightGBM model and set its parameters

In [49]:
import lightgbm as lgb

model = lgb.LGBMClassifier(
        boosting_type="gbdt", num_leaves=30, reg_alpha=0, reg_lambda=0.,
        max_depth=-1, n_estimators=1500, objective='binary', metric='auc',
        subsample=0.95, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.02, random_state=2017
    )


## Split the data into training and validation sets

In [41]:
from sklearn.model_selection import train_test_split  # split the data into training and validation sets

train_val_data = data[data['Label']!=-1][feats]
train_val_label = data[data['Label']!=-1]['Label']
test_data= data[data['Label']==-1][feats]

train_x, val_x, train_y, val_y = train_test_split(train_val_data, 
                                                    train_val_label, 
                                                    test_size=0.3, 
                                                    random_state=42)
model.fit(train_x,train_y)

LGBMClassifier(colsample_bytree=0.7, learning_rate=0.02,
               metric='binary_logloss', n_estimators=1500, num_leaves=30,
               objective='binary', random_state=2017, reg_alpha=0,
               subsample=0.95, subsample_freq=1)
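
The split above sets aside 30% of the labelled rows as `val_x`/`val_y`, but the notebook never scores the model on them. A minimal sketch of that check, assuming synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for the LightGBM model (both are illustrative substitutions, not the notebook's data or model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the labelled training rows
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.84, 0.16], random_state=42)

# Same split settings as the notebook cell
train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Stand-in model; in the notebook this would be the LGBMClassifier
stand_in = GradientBoostingClassifier(random_state=2017)
stand_in.fit(train_x, train_y)

# Score the held-out rows with the competition metric (accuracy)
val_acc = accuracy_score(val_y, stand_in.predict(val_x))
print(f"validation accuracy: {val_acc:.3f}")
```

With the notebook's own objects, the equivalent call would be `accuracy_score(val_y, model.predict(val_x))`.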

## Predict

In [42]:
test_result = model.predict(test_data)

In [43]:
test_result

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [30]:
test_result.shape

(350,)

In [44]:
predict_result = data[data['Label']==-1][['ID']].copy()  # copy so the column added next does not write into a slice of data

In [45]:
predict_result['Label'] = test_result
predict_result.head()

Unnamed: 0,ID,Label
0,1100,0
1,1101,0
2,1102,0
3,1103,0
4,1104,0


## Save the predictions to a CSV file

In [46]:
predict_result.to_csv('employee_leave_baseline2_auc_to_binary_logloss.csv',index=False )

## Baseline 2: categorical data + 5-fold cross-validation

In [50]:
from sklearn.model_selection import KFold

n_splits=5
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)

train_val_data = data[data['Label']!=-1][feats]
train_val_label = data[data['Label']!=-1]['Label']
test_data= data[data['Label']==-1][feats]

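predict_probab = data[data['Label']==-1][['ID']].copy()
predict_probab['pred'] = 0.0

# KFold.split yields positional indices, so select rows with .iloc
for train_idx, val_idx in kfold.split(train_val_data):
    train_data = train_val_data.iloc[train_idx]
    train_label = train_val_label.iloc[train_idx]
    val_data = train_val_data.iloc[val_idx]
    val_label = train_val_label.iloc[val_idx]
    # Note: fit() accepts early_stopping_rounds in LightGBM 3.x;
    # LightGBM 4.x expects callbacks=[lgb.early_stopping(100)] instead.
    model.fit(train_data, train_label,
              eval_set=[(train_data, train_label),(val_data, val_label)],
              categorical_feature=cat_col,
              eval_metric='auc',
              early_stopping_rounds=100)
    # Accumulate the predicted probability of the positive class (Label == 1)
    predict_probab['pred'] += model.predict_proba(test_data)[:,1]

# Average the accumulated probabilities over the folds
predict_probab['pred'] = predict_probab['pred'] / n_splits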

[1]	training's auc: 0.881078	valid_1's auc: 0.75054
Training until validation scores don't improve for 100 rounds
[2]	training's auc: 0.902693	valid_1's auc: 0.715492
[3]	training's auc: 0.919518	valid_1's auc: 0.718476
[4]	training's auc: 0.922648	valid_1's auc: 0.725905
[5]	training's auc: 0.930066	valid_1's auc: 0.739365
[6]	training's auc: 0.931717	valid_1's auc: 0.739937
[7]	training's auc: 0.936573	valid_1's auc: 0.747937
[8]	training's auc: 0.937902	valid_1's auc: 0.748317
[9]	training's auc: 0.940826	valid_1's auc: 0.743746
[10]	training's auc: 0.944736	valid_1's auc: 0.749333
[11]	training's auc: 0.947434	valid_1's auc: 0.751873
[12]	training's auc: 0.946528	valid_1's auc: 0.760508
[13]	training's auc: 0.949276	valid_1's auc: 0.760381
[14]	training's auc: 0.94985	valid_1's auc: 0.759238
[15]	training's auc: 0.950423	valid_1's auc: 0.757841
[16]	training's auc: 0.952044	valid_1's auc: 0.758349
[17]	training's auc: 0.9545	valid_1's auc: 0.762159
[18]	training's auc: 0.955345	val

[190]	training's auc: 0.99995	valid_1's auc: 0.777143
[191]	training's auc: 0.99995	valid_1's auc: 0.778159
[192]	training's auc: 0.99996	valid_1's auc: 0.778921
[193]	training's auc: 0.99996	valid_1's auc: 0.778794
[194]	training's auc: 0.99997	valid_1's auc: 0.779302
[195]	training's auc: 0.99997	valid_1's auc: 0.77854
[196]	training's auc: 0.99998	valid_1's auc: 0.779556
[197]	training's auc: 0.99998	valid_1's auc: 0.779048
[198]	training's auc: 0.99998	valid_1's auc: 0.778413
[199]	training's auc: 0.99998	valid_1's auc: 0.779302
[200]	training's auc: 0.99998	valid_1's auc: 0.778794
[201]	training's auc: 0.99999	valid_1's auc: 0.778159
[202]	training's auc: 0.99999	valid_1's auc: 0.779175
[203]	training's auc: 0.99999	valid_1's auc: 0.779175
[204]	training's auc: 0.99999	valid_1's auc: 0.779683
[205]	training's auc: 0.99999	valid_1's auc: 0.779302
[206]	training's auc: 0.99999	valid_1's auc: 0.779683
[207]	training's auc: 0.99999	valid_1's auc: 0.779175
[208]	training's auc: 1	valid

[248]	training's auc: 1	valid_1's auc: 0.895682
[249]	training's auc: 1	valid_1's auc: 0.896135
[250]	training's auc: 1	valid_1's auc: 0.895682
[251]	training's auc: 1	valid_1's auc: 0.895984
[252]	training's auc: 1	valid_1's auc: 0.89689
[253]	training's auc: 1	valid_1's auc: 0.897494
[254]	training's auc: 1	valid_1's auc: 0.897041
[255]	training's auc: 1	valid_1's auc: 0.896588
[256]	training's auc: 1	valid_1's auc: 0.89689
[257]	training's auc: 1	valid_1's auc: 0.897041
[258]	training's auc: 1	valid_1's auc: 0.897645
[259]	training's auc: 1	valid_1's auc: 0.897494
[260]	training's auc: 1	valid_1's auc: 0.897494
[261]	training's auc: 1	valid_1's auc: 0.897494
[262]	training's auc: 1	valid_1's auc: 0.898249
[263]	training's auc: 1	valid_1's auc: 0.8984
[264]	training's auc: 1	valid_1's auc: 0.898249
[265]	training's auc: 1	valid_1's auc: 0.897796
[266]	training's auc: 1	valid_1's auc: 0.898098
[267]	training's auc: 1	valid_1's auc: 0.897494
[268]	training's auc: 1	valid_1's auc: 0.898

[459]	training's auc: 1	valid_1's auc: 0.908665
[460]	training's auc: 1	valid_1's auc: 0.908364
[461]	training's auc: 1	valid_1's auc: 0.908062
[462]	training's auc: 1	valid_1's auc: 0.908213
[463]	training's auc: 1	valid_1's auc: 0.908364
[464]	training's auc: 1	valid_1's auc: 0.908364
[465]	training's auc: 1	valid_1's auc: 0.908062
[466]	training's auc: 1	valid_1's auc: 0.908213
[467]	training's auc: 1	valid_1's auc: 0.908816
[468]	training's auc: 1	valid_1's auc: 0.908816
[469]	training's auc: 1	valid_1's auc: 0.908364
[470]	training's auc: 1	valid_1's auc: 0.908364
[471]	training's auc: 1	valid_1's auc: 0.908364
[472]	training's auc: 1	valid_1's auc: 0.908062
[473]	training's auc: 1	valid_1's auc: 0.907911
[474]	training's auc: 1	valid_1's auc: 0.908062
[475]	training's auc: 1	valid_1's auc: 0.908364
[476]	training's auc: 1	valid_1's auc: 0.908514
[477]	training's auc: 1	valid_1's auc: 0.908364
[478]	training's auc: 1	valid_1's auc: 0.908514
[479]	training's auc: 1	valid_1's auc: 0

[678]	training's auc: 1	valid_1's auc: 0.910628
[679]	training's auc: 1	valid_1's auc: 0.910779
[680]	training's auc: 1	valid_1's auc: 0.910477
[681]	training's auc: 1	valid_1's auc: 0.910477
[682]	training's auc: 1	valid_1's auc: 0.910326
[683]	training's auc: 1	valid_1's auc: 0.910477
[684]	training's auc: 1	valid_1's auc: 0.910024
[685]	training's auc: 1	valid_1's auc: 0.910477
[686]	training's auc: 1	valid_1's auc: 0.910326
[687]	training's auc: 1	valid_1's auc: 0.910477
[688]	training's auc: 1	valid_1's auc: 0.910628
[689]	training's auc: 1	valid_1's auc: 0.910024
[690]	training's auc: 1	valid_1's auc: 0.910024
[691]	training's auc: 1	valid_1's auc: 0.909571
[692]	training's auc: 1	valid_1's auc: 0.909118
[693]	training's auc: 1	valid_1's auc: 0.909118
[694]	training's auc: 1	valid_1's auc: 0.909571
[695]	training's auc: 1	valid_1's auc: 0.909722
[696]	training's auc: 1	valid_1's auc: 0.909571
[697]	training's auc: 1	valid_1's auc: 0.908816
[698]	training's auc: 1	valid_1's auc: 0

[207]	training's auc: 1	valid_1's auc: 0.776583
[208]	training's auc: 1	valid_1's auc: 0.776071
[209]	training's auc: 1	valid_1's auc: 0.776583
[210]	training's auc: 1	valid_1's auc: 0.777607
[211]	training's auc: 1	valid_1's auc: 0.777436
[212]	training's auc: 1	valid_1's auc: 0.777607
[213]	training's auc: 1	valid_1's auc: 0.777095
[214]	training's auc: 1	valid_1's auc: 0.777266
[215]	training's auc: 1	valid_1's auc: 0.777948
[216]	training's auc: 1	valid_1's auc: 0.77829
[217]	training's auc: 1	valid_1's auc: 0.778631
[218]	training's auc: 1	valid_1's auc: 0.777778
[219]	training's auc: 1	valid_1's auc: 0.777436
[220]	training's auc: 1	valid_1's auc: 0.777948
[221]	training's auc: 1	valid_1's auc: 0.777266
[222]	training's auc: 1	valid_1's auc: 0.777607
[223]	training's auc: 1	valid_1's auc: 0.77829
[224]	training's auc: 1	valid_1's auc: 0.779143
[225]	training's auc: 1	valid_1's auc: 0.778802
[226]	training's auc: 1	valid_1's auc: 0.77846
[227]	training's auc: 1	valid_1's auc: 0.77

[431]	training's auc: 1	valid_1's auc: 0.78597
[432]	training's auc: 1	valid_1's auc: 0.78597
[433]	training's auc: 1	valid_1's auc: 0.784946
[434]	training's auc: 1	valid_1's auc: 0.785629
[435]	training's auc: 1	valid_1's auc: 0.785288
[436]	training's auc: 1	valid_1's auc: 0.785117
[437]	training's auc: 1	valid_1's auc: 0.784434
[438]	training's auc: 1	valid_1's auc: 0.784434
[439]	training's auc: 1	valid_1's auc: 0.785458
[440]	training's auc: 1	valid_1's auc: 0.785117
[441]	training's auc: 1	valid_1's auc: 0.784434
[442]	training's auc: 1	valid_1's auc: 0.784264
[443]	training's auc: 1	valid_1's auc: 0.784434
[444]	training's auc: 1	valid_1's auc: 0.784264
[445]	training's auc: 1	valid_1's auc: 0.784264
[446]	training's auc: 1	valid_1's auc: 0.784776
[447]	training's auc: 1	valid_1's auc: 0.784093
[448]	training's auc: 1	valid_1's auc: 0.783922
[449]	training's auc: 1	valid_1's auc: 0.785117
[450]	training's auc: 1	valid_1's auc: 0.784434
[451]	training's auc: 1	valid_1's auc: 0.7

[1]	training's auc: 0.867862	valid_1's auc: 0.744792
Training until validation scores don't improve for 100 rounds
[2]	training's auc: 0.907926	valid_1's auc: 0.774004
[3]	training's auc: 0.919711	valid_1's auc: 0.746603
[4]	training's auc: 0.928113	valid_1's auc: 0.747056
[5]	training's auc: 0.937927	valid_1's auc: 0.745924
[6]	training's auc: 0.941796	valid_1's auc: 0.744112
[7]	training's auc: 0.947412	valid_1's auc: 0.745773
[8]	training's auc: 0.949321	valid_1's auc: 0.76087
[9]	training's auc: 0.952279	valid_1's auc: 0.766757
[10]	training's auc: 0.95579	valid_1's auc: 0.761171
[11]	training's auc: 0.954931	valid_1's auc: 0.757095
[12]	training's auc: 0.955084	valid_1's auc: 0.779136
[13]	training's auc: 0.957026	valid_1's auc: 0.769173
[14]	training's auc: 0.956721	valid_1's auc: 0.774758
[15]	training's auc: 0.959006	valid_1's auc: 0.778986
[16]	training's auc: 0.960895	valid_1's auc: 0.783967
[17]	training's auc: 0.960743	valid_1's auc: 0.783514
[18]	training's auc: 0.961745	v