# <center>Pandas Case Study (Challenging)</center>

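### Exercise
#### Employee attrition prediction, with applications to labor negotiations

* Why are our best and most experienced employees leaving prematurely? Using a dataset from Kaggle, we try to predict which valuable employees will leave next. By analyzing the data we look at which factors influence resignations, which of them matter most, and which strong employees are likely to leave.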

### Data files
* pfm_train.csv
* pfm_test.csv

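### Tasks in this section
* Use pandas for data cleaning, data transformation, and feature extraction.
* Use sklearn to build and evaluate a model and to make predictions.
* From the recorded attrition factors and the attrition outcome, build a logistic regression model that predicts which employees are likely to leave.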

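> The data covers factors that influence attrition (salary, business travel, satisfaction with the work environment, job involvement, overtime, promotion, percentage salary increase, and so on) together with a record of whether each employee has left.
> The data is split into a training set and a test set, stored in pfm_train.csv and pfm_test.csv respectively. The training data contains 1,100 records and 31 fields; the main fields are:
* (1) Age: the employee's age;
* (2) Attrition: whether the employee has left; 1 means the employee has left, 0 means the employee has not left. This is the prediction target;
* (3) BusinessTravel: business travel frequency; Non-Travel (no travel), Travel_Rarely (travels rarely), Travel_Frequently (travels frequently);
* (4) Department: the employee's department; Sales, Research & Development, or Human Resources;
* (5) DistanceFromHome: distance between the company and home, from 1 to 29; 1 is the closest and 29 the farthest;
* (6) Education: education level, from 1 to 5; 5 is the highest;
* (7) EducationField: field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;
* (8) EmployeeNumber: employee number;
* (9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 to 4; 1 is the lowest and 4 the highest;
* (10) Gender: Male or Female;
* (11) JobInvolvement: job involvement, from 1 to 4; 1 is the lowest and 4 the highest;
* (12) JobLevel: job level, from 1 to 5; 1 is the lowest and 5 the highest;
* (13) JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;
* (14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is the lowest and 4 the highest;
* (15) MaritalStatus: marital status; Single, Married, or Divorced;
* (16) MonthlyIncome: monthly income, ranging from 1009 to 19999;
* (17) NumCompaniesWorked: number of companies the employee has worked for;
* (18) Over18: whether the employee is over 18;
* (19) OverTime: whether the employee works overtime; Yes or No;
* (20) PercentSalaryHike: percentage salary increase;
* (21) PerformanceRating: performance rating;
* (22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is the lowest and 4 the highest;
* (23) StandardHours: standard working hours;
* (24) StockOptionLevel: stock option level;
* (25) TotalWorkingYears: total years of work experience;
* (26) TrainingTimesLastYear: amount of training in the previous year, from 0 to 6; 0 means no training and 6 the most training;
* (27) WorkLifeBalance: work-life balance, from 1 to 4; 1 is the lowest and 4 the highest;
* (28) YearsAtCompany: years at the current company;
* (29) YearsInCurrentRole: years in the current role;
* (30) YearsSinceLastPromotion: years since the last promotion;
* (31) YearsWithCurrManager: years working with the current manager;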

In [85]:
# Import the packages needed for file handling and analysis
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [86]:
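# Get the current directory. The quoted '__file__' is deliberate: a notebook does not
# define __file__, so the string is resolved relative to the working directory.
current_dir = os.path.dirname(os.path.realpath('__file__'))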
print(current_dir)

C:\Users\wzqcg\PythonCode


In [87]:
# Build the paths of the files to read
train_filename = os.path.join(current_dir, 'data/pfm_train.csv')
print(train_filename)
test_filename = os.path.join(current_dir, 'data/pfm_test.csv')
print(test_filename)


C:\Users\wzqcg\PythonCode\data/pfm_train.csv
C:\Users\wzqcg\PythonCode\data/pfm_test.csv


In [88]:
# Explore the data
# Load the training set and the test set
train_data = pd.read_csv(train_filename)
test_data = pd.read_csv(test_filename)

In [89]:
train_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,37,0,Travel_Rarely,Research & Development,1,4,Life Sciences,77,1,Male,2,2,Manufacturing Director,3,Divorced,5993,1,Y,No,18,3,3,80,1,7,2,4,7,5,0,7
1,54,0,Travel_Frequently,Research & Development,1,4,Life Sciences,1245,4,Female,3,3,Manufacturing Director,3,Divorced,10502,7,Y,No,17,3,1,80,1,33,2,1,5,4,1,4
2,34,1,Travel_Frequently,Research & Development,7,3,Life Sciences,147,1,Male,1,2,Laboratory Technician,3,Single,6074,1,Y,Yes,24,4,4,80,0,9,3,3,9,7,0,6
3,39,0,Travel_Rarely,Research & Development,1,1,Life Sciences,1026,4,Female,2,4,Manufacturing Director,4,Married,12742,1,Y,No,16,3,3,80,1,21,3,3,21,6,11,8
4,28,1,Travel_Frequently,Research & Development,1,3,Medical,1111,1,Male,2,1,Laboratory Technician,2,Divorced,2596,1,Y,No,15,3,1,80,2,1,2,3,1,0,0,0


In [90]:
# Check the shape of each dataset
print('train size:{}'.format(train_data.shape)) # train size:(1100, 31)
print('test size:{}'.format(test_data.shape)) #test size:(350, 30)

train size:(1100, 31)
test size:(350, 30)


In [91]:
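# Check whether the datasets contain missing values (fraction missing per column).
# A notebook cell displays only its last expression, so the output below is for test_data.
train_data.isnull().mean()
test_data.isnull().mean()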

Age                         0.0
BusinessTravel              0.0
Department                  0.0
DistanceFromHome            0.0
Education                   0.0
EducationField              0.0
EmployeeNumber              0.0
EnvironmentSatisfaction     0.0
Gender                      0.0
JobInvolvement              0.0
JobLevel                    0.0
JobRole                     0.0
JobSatisfaction             0.0
MaritalStatus               0.0
MonthlyIncome               0.0
NumCompaniesWorked          0.0
Over18                      0.0
OverTime                    0.0
PercentSalaryHike           0.0
PerformanceRating           0.0
RelationshipSatisfaction    0.0
StandardHours               0.0
StockOptionLevel            0.0
TotalWorkingYears           0.0
TrainingTimesLastYear       0.0
WorkLifeBalance             0.0
YearsAtCompany              0.0
YearsInCurrentRole          0.0
YearsSinceLastPromotion     0.0
YearsWithCurrManager        0.0
dtype: float64
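
Because only the last expression of a cell is displayed, the training-set result is not visible above. A minimal sketch of one way to view both sets side by side follows; `toy_train` and `toy_test` are hypothetical stand-ins for `train_data` and `test_data`, with missing values added on purpose so the output is not all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_data / test_data.
toy_train = pd.DataFrame({"Age": [37, 54, np.nan], "OverTime": ["No", None, "Yes"]})
toy_test = pd.DataFrame({"Age": [28, 41, 33], "OverTime": ["Yes", "No", "No"]})

# One column per dataset, one row per field: fraction of missing values.
missing = pd.concat(
    {"train": toy_train.isnull().mean(), "test": toy_test.isnull().mean()},
    axis=1,
)
print(missing)

# Keep only the fields that have at least one missing value in either set.
has_missing = missing[(missing > 0).any(axis=1)]
print(has_missing)
```

On the real data the second frame would presumably be empty, since the notebook output reports no missing values.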

In [92]:
# Check dtypes and non-null counts to confirm there are no missing values
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 31 columns):
Age                         1100 non-null int64
Attrition                   1100 non-null int64
BusinessTravel              1100 non-null object
Department                  1100 non-null object
DistanceFromHome            1100 non-null int64
Education                   1100 non-null int64
EducationField              1100 non-null object
EmployeeNumber              1100 non-null int64
EnvironmentSatisfaction     1100 non-null int64
Gender                      1100 non-null object
JobInvolvement              1100 non-null int64
JobLevel                    1100 non-null int64
JobRole                     1100 non-null object
JobSatisfaction             1100 non-null int64
MaritalStatus               1100 non-null object
MonthlyIncome               1100 non-null int64
NumCompaniesWorked          1100 non-null int64
Over18                      1100 non-null object
OverTime              

In [93]:
# Display all columns
#pd.options.display.max_columns = None
pd.set_option('expand_frame_repr', False)


In [94]:
train_data.describe()

Unnamed: 0,Age,Attrition,DistanceFromHome,Education,EmployeeNumber,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0
mean,36.999091,0.161818,9.427273,2.922727,1028.157273,2.725455,2.730909,2.054545,2.732727,6483.620909,2.683636,15.235455,3.152727,2.696364,80.0,0.788182,11.221818,2.807273,2.746364,7.011818,4.207273,2.226364,4.123636
std,9.03723,0.368451,8.196694,1.022242,598.915204,1.098053,0.706366,1.107805,1.109731,4715.293419,2.510017,3.628571,0.359888,1.095356,0.0,0.843347,7.825548,1.291514,0.701121,6.223093,3.618115,3.31383,3.597996
min,18.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1009.0,0.0,11.0,3.0,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,0.0,2.0,2.0,504.25,2.0,2.0,1.0,2.0,2924.5,1.0,12.0,3.0,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,0.0,7.0,3.0,1026.5,3.0,3.0,2.0,3.0,4857.0,2.0,14.0,3.0,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,0.0,15.0,4.0,1556.5,4.0,3.0,3.0,4.0,8354.5,4.0,18.0,3.0,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1.0,29.0,5.0,2065.0,4.0,4.0,5.0,4.0,19999.0,9.0,25.0,4.0,4.0,80.0,3.0,40.0,6.0,4.0,37.0,18.0,15.0,17.0


In [95]:
# Summarize the numeric columns and the object columns separately
num_col = train_data.select_dtypes(['number']).columns
train_data[num_col].describe()
#train_data['JobLevel'].plot()

Unnamed: 0,Age,Attrition,DistanceFromHome,Education,EmployeeNumber,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0
mean,36.999091,0.161818,9.427273,2.922727,1028.157273,2.725455,2.730909,2.054545,2.732727,6483.620909,2.683636,15.235455,3.152727,2.696364,80.0,0.788182,11.221818,2.807273,2.746364,7.011818,4.207273,2.226364,4.123636
std,9.03723,0.368451,8.196694,1.022242,598.915204,1.098053,0.706366,1.107805,1.109731,4715.293419,2.510017,3.628571,0.359888,1.095356,0.0,0.843347,7.825548,1.291514,0.701121,6.223093,3.618115,3.31383,3.597996
min,18.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1009.0,0.0,11.0,3.0,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,0.0,2.0,2.0,504.25,2.0,2.0,1.0,2.0,2924.5,1.0,12.0,3.0,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,0.0,7.0,3.0,1026.5,3.0,3.0,2.0,3.0,4857.0,2.0,14.0,3.0,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,0.0,15.0,4.0,1556.5,4.0,3.0,3.0,4.0,8354.5,4.0,18.0,3.0,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1.0,29.0,5.0,2065.0,4.0,4.0,5.0,4.0,19999.0,9.0,25.0,4.0,4.0,80.0,3.0,40.0,6.0,4.0,37.0,18.0,15.0,17.0


In [96]:
obj_col = train_data.select_dtypes(['object']).columns
train_data[obj_col].describe()
# count (non-null values), unique (distinct values), top (most frequent value), freq (frequency of top)

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
count,1100,1100,1100,1100,1100,1100,1100,1100
unique,3,3,6,2,9,3,1,2
top,Travel_Rarely,Research & Development,Life Sciences,Male,Sales Executive,Married,Y,No
freq,787,727,462,653,247,500,1100,794


In [97]:
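# Attrition rate for each value of every numeric column, as a rough look at what affects attrition
for col in train_data.columns:
    if train_data[col].dtype == 'int64':
        print('column:'+col + ':')
        # Numerator: among rows with Attrition == 1, count the rows for each value of the column
        # Denominator: count the rows for each value over the whole training set
        # Sort the resulting rates in descending order
        print((train_data[train_data['Attrition'] == 1.0][col].value_counts()/
               train_data[col].value_counts()).sort_values(ascending = False))
        print('-----------------------')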

column:Age:
21    0.714286
19    0.625000
20    0.500000
58    0.428571
22    0.416667
23    0.400000
26    0.322581
28    0.285714
29    0.272727
31    0.270833
33    0.255319
25    0.250000
24    0.222222
30    0.187500
44    0.181818
55    0.176471
32    0.170213
39    0.166667
52    0.166667
41    0.161290
56    0.153846
53    0.153846
34    0.132075
47    0.125000
51    0.125000
46    0.120000
35    0.118644
37    0.108108
49    0.090909
36    0.072727
45    0.066667
40    0.063830
42    0.058824
27    0.052632
38    0.051282
50    0.043478
43    0.040000
18         NaN
48         NaN
54         NaN
57         NaN
59         NaN
60         NaN
Name: Age, dtype: float64
-----------------------
column:Attrition:
1    1.0
0    NaN
Name: Attrition, dtype: float64
-----------------------
column:DistanceFromHome:
12    0.428571
24    0.400000
22    0.333333
13    0.294118
27    0.272727
25    0.263158
16    0.230769
29    0.217391
20    0.210526
17    0.200000
23    0.200000
9     0.189
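
The `NaN` entries in the output above come from values that never occur among the employees who left, so the numerator has no matching index. Since `Attrition` is coded 0/1, its group mean is the attrition rate, and a `groupby` gives 0.0 for those groups instead. A minimal sketch on a hypothetical toy frame (`toy` is not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data: level 3 has no employee who left.
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 0],
    "JobLevel":  [1, 1, 2, 2, 2, 3],
})

# The mean of a 0/1 label within a group is that group's attrition rate;
# the count shows how many rows each rate is based on.
rate = (
    toy.groupby("JobLevel")["Attrition"]
       .agg(["mean", "count"])
       .rename(columns={"mean": "attrition_rate", "count": "n"})
       .sort_values("attrition_rate", ascending=False)
)
print(rate)
```

Keeping the group size next to the rate helps when reading small groups, where a high rate may rest on only a handful of rows.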

In [98]:
# Attrition rate for each category of every object column, as a rough look at what affects attrition
for i in train_data.columns:
    if train_data[i].dtype == 'O':
        print(i + ':')
        print((train_data[train_data['Attrition'] == 1.0][i].value_counts()/
        train_data[i].value_counts()).sort_values(ascending=False))
        print('-----------------------')
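# Field of study and business-travel frequency both show a clear effect; single employees are more likely to leave; among job roles, sales representatives have the highest turnover.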


BusinessTravel:
Travel_Frequently    0.224390
Travel_Rarely        0.156290
Non-Travel           0.083333
Name: BusinessTravel, dtype: float64
-----------------------
Department:
Human Resources           0.214286
Sales                     0.202417
Research & Development    0.140303
Name: Department, dtype: float64
-----------------------
EducationField:
Human Resources     0.315789
Technical Degree    0.239130
Marketing           0.212598
Life Sciences       0.151515
Medical             0.136499
Other               0.111111
Name: EducationField, dtype: float64
-----------------------
Gender:
Male      0.166922
Female    0.154362
Name: Gender, dtype: float64
-----------------------
JobRole:
Sales Representative         0.403509
Human Resources              0.272727
Laboratory Technician        0.209756
Research Scientist           0.185520
Sales Executive              0.170040
Manufacturing Director       0.079208
Manager                      0.062500
Healthcare Representative    0.050

In [99]:
# The analysis showed that some fields hold a single value; verify this
single_value_feature = []
for col in train_data.columns:
    length = len(train_data[col].unique())
    if length == 1:
        single_value_feature.append(col)
print(single_value_feature)

['Over18', 'StandardHours']
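
The same check can be written without an explicit loop using `DataFrame.nunique()`. A minimal sketch on a hypothetical toy frame (`toy` only mimics three of the columns):

```python
import pandas as pd

# Hypothetical toy frame with two constant columns and one varying column.
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [37, 54, 34],
})

# nunique() counts distinct non-null values per column.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Note that `nunique()` ignores missing values by default, which makes no difference here because the dataset has none.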


In [100]:
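# The values are fairly smooth overall, although some fields are affected by extreme values; a few fields carry essentially no information (EmployeeNumber, StandardHours, Over18)
# Drop the uninformative columns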
train_data = train_data.drop(['StandardHours', 'Over18', 'EmployeeNumber'], axis=1)
print(train_data.shape)

(1100, 28)


In [101]:
# Move Attrition (the label) to the first column for easier indexing
Attrition = train_data['Attrition']
train_data.drop(['Attrition'], axis = 1, inplace = True)
train_data.insert(0, 'Attrition', Attrition)
train_data.head(5)


Unnamed: 0,Attrition,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0,37,Travel_Rarely,Research & Development,1,4,Life Sciences,1,Male,2,2,Manufacturing Director,3,Divorced,5993,1,No,18,3,3,1,7,2,4,7,5,0,7
1,0,54,Travel_Frequently,Research & Development,1,4,Life Sciences,4,Female,3,3,Manufacturing Director,3,Divorced,10502,7,No,17,3,1,1,33,2,1,5,4,1,4
2,1,34,Travel_Frequently,Research & Development,7,3,Life Sciences,1,Male,1,2,Laboratory Technician,3,Single,6074,1,Yes,24,4,4,0,9,3,3,9,7,0,6
3,0,39,Travel_Rarely,Research & Development,1,1,Life Sciences,4,Female,2,4,Manufacturing Director,4,Married,12742,1,No,16,3,3,1,21,3,3,21,6,11,8
4,1,28,Travel_Frequently,Research & Development,1,3,Medical,1,Male,2,1,Laboratory Technician,2,Divorced,2596,1,No,15,3,1,2,1,2,3,1,0,0,0


In [102]:
# Feature processing
# Mainly binning some features and one-hot encoding
# Bin the income
print(train_data['MonthlyIncome'].min()) # 1009
print(train_data['MonthlyIncome'].max()) # 19999
print(test_data['MonthlyIncome'].min()) # 1051
print(test_data['MonthlyIncome'].max()) # 19973

1009
19999
1051
19973


In [103]:
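# For the MonthlyIncome bins to match between train and test, both sets must be cut with the same bin edges; equal-width binning is used here
# pd.cut with bins=10 splits the range into 10 intervals and returns, for each value, the interval it falls into
# Note: this cell bins train_data only, with edges derived from the training range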
train_data['MonthlyIncome'] = pd.cut(train_data['MonthlyIncome'], bins=10)

In [104]:
print(train_data['MonthlyIncome'])

0         (4807.0, 6706.0]
1        (8605.0, 10504.0]
2         (4807.0, 6706.0]
3       (12403.0, 14302.0]
4         (990.01, 2908.0]
5         (2908.0, 4807.0]
6         (2908.0, 4807.0]
7         (6706.0, 8605.0]
8         (990.01, 2908.0]
9         (4807.0, 6706.0]
10        (990.01, 2908.0]
11        (4807.0, 6706.0]
12        (990.01, 2908.0]
13        (2908.0, 4807.0]
14        (2908.0, 4807.0]
15        (2908.0, 4807.0]
16        (4807.0, 6706.0]
17        (990.01, 2908.0]
18       (8605.0, 10504.0]
19        (4807.0, 6706.0]
20        (990.01, 2908.0]
21       (8605.0, 10504.0]
22      (12403.0, 14302.0]
23      (16201.0, 18100.0]
24        (990.01, 2908.0]
25      (10504.0, 12403.0]
26        (2908.0, 4807.0]
27        (990.01, 2908.0]
28      (10504.0, 12403.0]
29        (990.01, 2908.0]
               ...        
1070      (990.01, 2908.0]
1071     (8605.0, 10504.0]
1072      (4807.0, 6706.0]
1073      (4807.0, 6706.0]
1074      (2908.0, 4807.0]
1075      (990.01, 2908.0]
1
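
The cell above bins only the training set. One way to get identical intervals on the test set, sketched here under the assumption that the training range covers the test range (as the min/max printed earlier suggest), is to ask `pd.cut` for its edges with `retbins=True` and reuse them. `toy_train` and `toy_test` are hypothetical stand-ins for the two `MonthlyIncome` columns.

```python
import pandas as pd

# Hypothetical stand-ins for train_data['MonthlyIncome'] and test_data['MonthlyIncome'].
toy_train = pd.Series([1009, 5993, 10502, 19999], name="MonthlyIncome")
toy_test = pd.Series([1051, 6074, 19973], name="MonthlyIncome")

# retbins=True also returns the 11 edges that define the 10 equal-width bins.
train_binned, edges = pd.cut(toy_train, bins=10, retbins=True)

# Reusing the same edges gives the test set exactly the same intervals.
# Test values outside the training range would become NaN.
test_binned = pd.cut(toy_test, bins=edges)

print(train_binned.cat.categories.equals(test_binned.cat.categories))
print(test_binned)
```

Deriving the edges from the training set alone keeps test information out of the preprocessing step; the trade-off is the `NaN` handling for out-of-range test values noted in the comment.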

In [107]:
# Collect the names of the columns whose dtype is 'object'; these are one-hot encoded in the next step
col_object = []
for col in train_data.columns[1:]:
    if train_data[col].dtype == 'object':
        col_object.append(col)
print(col_object)

['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']


In [109]:
# One-hot encode the train dataset (get_dummies also encodes the binned MonthlyIncome, which is now categorical)
train_encode = pd.get_dummies(train_data)
train_encode.head()

Unnamed: 0,Attrition,Age,DistanceFromHome,Education,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Human Resources,Department_Research & Development,Department_Sales,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Female,Gender_Male,JobRole_Healthcare Representative,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,"MonthlyIncome_(990.01, 2908.0]","MonthlyIncome_(2908.0, 4807.0]","MonthlyIncome_(4807.0, 6706.0]","MonthlyIncome_(6706.0, 8605.0]","MonthlyIncome_(8605.0, 10504.0]","MonthlyIncome_(10504.0, 12403.0]","MonthlyIncome_(12403.0, 14302.0]","MonthlyIncome_(14302.0, 16201.0]","MonthlyIncome_(16201.0, 18100.0]","MonthlyIncome_(18100.0, 19999.0]",OverTime_No,OverTime_Yes
0,0,37,1,4,1,2,2,3,1,18,3,3,1,7,2,4,7,5,0,7,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0
1,0,54,1,4,4,3,3,3,7,17,3,1,1,33,2,1,5,4,1,4,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0
2,1,34,7,3,1,1,2,3,1,24,4,4,0,9,3,3,9,7,0,6,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1
3,0,39,1,1,4,2,4,4,1,16,3,3,1,21,3,3,21,6,11,8,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0
4,1,28,1,3,1,2,1,2,1,15,3,1,2,1,2,3,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0


In [111]:
# A small example to help understand one-hot encoding
import pandas as pd
df = pd.DataFrame([
    ['green' , 'A'], 
    ['red' , 'B'], 
    ['blue' , 'A']]) 
df

Unnamed: 0,0,1
0,green,A
1,red,B
2,blue,A


In [114]:
df.columns = ['color', 'class']
pd.get_dummies(df)

Unnamed: 0,color_blue,color_green,color_red,class_A,class_B
0,0,1,0,1,0
1,0,0,1,0,1
2,1,0,0,1,0
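
When the same encoding is later applied to a test set, a category that is absent there produces no column, so the two encoded frames can end up with different layouts. A minimal sketch of one common remedy, reindexing the test columns to the training layout, follows; the frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: the test set contains only one of the three colors.
toy_train = pd.DataFrame({"color": ["green", "red", "blue"], "size": [1, 2, 3]})
toy_test = pd.DataFrame({"color": ["red", "red"], "size": [4, 5]})

# dtype=int gives 0/1 columns regardless of the pandas version's default.
train_enc = pd.get_dummies(toy_train, dtype=int)
test_enc = pd.get_dummies(toy_test, dtype=int)

# Align the test columns with the training columns; categories unseen in
# the test set become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Reindexing also drops any dummy column that exists only in the test set, which is usually what a model fitted on the training layout needs.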


In [115]:
# Save the datasets for later use
#train_data.to_csv('trainwithoutencode.csv')
#train_encode.to_csv('train.csv')

In [116]:
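# 3. Handling feature collinearity
# Restrict to numeric columns so the call also works on pandas versions where corr() no longer skips non-numeric columns
corr = train_data.select_dtypes(['number']).corr()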
corr

Unnamed: 0,Attrition,Age,DistanceFromHome,Education,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
Attrition,1.0,-0.175393,0.088563,-0.046494,-0.097003,-0.122722,-0.168775,-0.125568,0.025889,0.026604,0.046762,-0.051749,-0.138498,-0.187922,-0.043395,-0.048794,-0.143697,-0.163059,-0.07176,-0.158558
Age,-0.175393,1.0,0.007081,0.198558,0.011803,0.066528,0.513882,-0.003744,0.291211,-0.011259,-0.029613,0.063489,-0.002413,0.682879,-0.051702,-0.001042,0.328651,0.231842,0.230587,0.21254
DistanceFromHome,0.088563,0.007081,1.0,0.011437,-0.010308,0.012333,0.01647,-0.009641,-0.016378,0.042627,0.021042,0.018112,0.050356,0.001287,-0.041208,-0.05095,4.4e-05,0.019317,-0.00276,0.008852
Education,-0.046494,0.198558,0.011437,1.0,-0.032698,0.022843,0.084075,-0.010201,0.118484,-0.008828,-4.5e-05,-0.006346,0.035881,0.125672,-0.021629,0.003099,0.074522,0.064363,0.067754,0.07187
EnvironmentSatisfaction,-0.097003,0.011803,-0.010308,-0.032698,1.0,-0.028467,-0.015355,0.000212,-0.010743,-0.008882,-0.025044,0.033515,0.008874,-0.018532,-0.045686,0.026477,-0.012574,0.003572,0.008843,-0.02019
JobInvolvement,-0.122722,0.066528,0.012333,0.022843,-0.028467,1.0,0.005983,-0.016382,0.053557,0.002377,0.000742,0.048363,0.029483,0.01838,-0.018001,-0.025862,-0.032189,0.001194,-0.031097,0.014176
JobLevel,-0.168775,0.513882,0.01647,0.084075,-0.015355,0.005983,1.0,-0.005894,0.157068,-0.066353,-0.046019,0.042156,0.002638,0.78402,-0.03462,0.041258,0.544091,0.411481,0.395195,0.376119
JobSatisfaction,-0.125568,-0.003744,-0.009641,-0.010201,0.000212,-0.016382,-0.005894,1.0,-0.061091,0.019032,-0.011615,-0.033138,0.021123,-0.023343,0.002754,-0.042767,-0.013772,-0.011798,-0.009761,-0.041852
NumCompaniesWorked,0.025889,0.291211,-0.016378,0.118484,-0.010743,0.053557,0.157068,-0.061091,1.0,-0.000606,0.020296,0.067295,0.007861,0.236079,-0.078332,0.003999,-0.109451,-0.073029,-0.038203,-0.101457
PercentSalaryHike,0.026604,-0.011259,0.042627,-0.008828,-0.008882,0.002377,-0.066353,0.019032,-0.000606,1.0,0.764683,-0.024349,0.016313,-0.044268,0.015905,0.017415,-0.069956,-0.031167,-0.03637,-0.044119


In [118]:
corr[corr>0.75] 

Unnamed: 0,Attrition,Age,DistanceFromHome,Education,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
Attrition,1.0,,,,,,,,,,,,,,,,,,,
Age,,1.0,,,,,,,,,,,,,,,,,,
DistanceFromHome,,,1.0,,,,,,,,,,,,,,,,,
Education,,,,1.0,,,,,,,,,,,,,,,,
EnvironmentSatisfaction,,,,,1.0,,,,,,,,,,,,,,,
JobInvolvement,,,,,,1.0,,,,,,,,,,,,,,
JobLevel,,,,,,,1.0,,,,,,,0.78402,,,,,,
JobSatisfaction,,,,,,,,1.0,,,,,,,,,,,,
NumCompaniesWorked,,,,,,,,,1.0,,,,,,,,,,,
PercentSalaryHike,,,,,,,,,,1.0,0.764683,,,,,,,,,


In [119]:
# 'TotalWorkingYears' & 'JobLevel' and 'YearsAtCompany' & 'YearsWithCurrManager' are collinear; dropping one feature from each pair is enough
train_encode.drop(['TotalWorkingYears', 'YearsWithCurrManager'], axis = 1, inplace = True)
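
Reading the pairs off the masked matrix by eye gets harder as the number of columns grows. A minimal sketch of listing them programmatically follows, using a hypothetical toy frame and the same 0.75 threshold as above; it takes absolute values so strong negative correlations are caught as well, which the `corr > 0.75` filter does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is nearly proportional to a, c is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

corr_abs = toy.corr().abs()

# Keep the strict upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = corr_abs.where(mask).stack()

high = pairs[pairs > 0.75]
print(high)
```

Which member of each pair to drop remains a judgment call; the list only shows where the choice has to be made.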


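### Scikit-learn (formerly scikits.learn, also known as sklearn)
* A free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.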

In [120]:
# Modeling and prediction
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [122]:
X = train_encode.iloc[:, 1:]
y = train_encode.iloc[:, 0]

In [124]:
# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
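
The split above is random on every run, so the scores that follow will vary between runs. With roughly 16% positives (the mean of `Attrition` reported by `describe()`), it can also help to keep the class ratio equal in both parts. A minimal sketch with `random_state` and `stratify` on synthetic data follows; `X_toy`, `y_toy`, and the seed values are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 16% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 16 + [0] * 84)

# stratify keeps the class ratio similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(y_tr.mean(), y_te.mean())
```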

In [126]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
train_score = lr.score(X_train, y_train)
print(train_score)

0.8954545454545455




In [127]:
pred = lr.predict(X_test)
print(np.mean(pred == y_test))

0.9045454545454545
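
Accuracy alone can flatter a model on imbalanced labels: the training data has an attrition rate of about 0.16, so always predicting "stays" would already score around 0.84. A minimal sketch of comparing accuracy against that majority-class baseline, and of checking recall on the employees who left, follows; `y_true` and `y_pred` are hypothetical toy arrays, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 employees stayed (0), 2 left (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Accuracy of always predicting the majority class.
baseline = max(y_true.mean(), 1 - y_true.mean())

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # share of leavers that were caught
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(baseline, acc, rec)
print(cm)
```

In this toy case the model matches the baseline accuracy while finding only half of the leavers, which is the kind of gap accuracy by itself would hide.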
