# N/LAB Machine Learning and Advanced Analytics
## Practical 6

### Predicting employee attrition via ensemble tree-based methods

Keeping employees happy and satisfied is a perennial challenge. If an employee in whom you have invested considerable time and money leaves for "greener pastures", you have to spend even more time and money hiring a replacement. We therefore turn to predictive modelling to see whether we can predict employee attrition.

The dataset we are going to use is `WA_Fn-UseC_-HR-Employee-Attrition.csv` **(Please upload the file to the same directory as the notebook)**.

#### The data has the following columns:
```
Age - numeric                                   Attrition - categorical (output feature)
BusinessTravel - categorical                    DailyRate - numeric
Department - categorical                        DistanceFromHome - numeric
Education - categorical                         EducationField - categorical
EmployeeCount - numeric                         EmployeeNumber - numeric
EnvironmentSatisfaction - categorical           Gender - categorical
HourlyRate - numeric                            JobInvolvement - categorical
JobLevel - categorical                          JobRole - categorical
JobSatisfaction - categorical                   MaritalStatus - categorical
MonthlyIncome - numeric                         MonthlyRate - numeric
NumCompaniesWorked - numeric                    Over18 - categorical
OverTime - categorical                          PercentSalaryHike - numeric
PerformanceRating - categorical                 RelationshipSatisfaction - categorical
StandardHours - numeric                         StockOptionLevel - categorical
TotalWorkingYears - numeric                     TrainingTimesLastYear - numeric
WorkLifeBalance - categorical                   YearsAtCompany - numeric
YearsInCurrentRole - numeric                    YearsSinceLastPromotion - numeric
YearsWithCurrManager - numeric
```

**Problem: Predict employee attrition**

In this practical, we will implement a **Bagging** and an **Adaptive Boosting** model after we finish categorical encoding and handling the unbalanced class problem.

In [1]:
#Run `pip install imbalanced-learn category_encoders`, then restart the kernel before use

## Step 1 
Let's import the modules we need and load the data as a pandas DataFrame.

As always, let's call `.head()` or `.describe()` (or both) to check that the load worked as expected.

In [2]:
#Load data as pandas dataframe
import pandas as pd

data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [3]:
#Check data loaded as we expected

data.head(10)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
5,32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,...,3,80,0,8,2,2,7,7,3,6
6,59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,...,1,80,3,12,3,2,1,0,0,0
7,30,No,Travel_Rarely,1358,Research & Development,24,1,Life Sciences,1,11,...,2,80,1,1,2,3,1,0,0,0
8,38,No,Travel_Frequently,216,Research & Development,23,3,Life Sciences,1,12,...,2,80,0,10,2,3,9,7,1,8
9,36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7


In [4]:
#Check missing values
data.isnull().any()

Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL

## Step 2
We're going to do a standard prediction task so we need to split our data into input features and the output feature. **Let's call them X and y as is convention.**

We also need to convert y to an integer binary output feature, where **1 represents "Attrition" and 0 represents "Not Attrition".**

In [5]:
# Define our input features and our output feature
# Call our input features X and our output feature y (the sklearn standard)
X = data.drop(columns = 'Attrition')
y = data.Attrition

# Now we need to encode our output feature to be an integer 0 or 1. 
# This is because we have a binary classification problem and in order to use sklearn's
# built-in evaluation measures we need to have one class defined as 1 (target) and one as 0 (non-target).
# map() returns a new Series, so we avoid modifying a column of `data` in place
y = y.map({"Yes": 1, "No": 0})
y

0       1
1       0
2       1
3       0
4       0
       ..
1465    0
1466    0
1467    0
1468    0
1469    0
Name: Attrition, Length: 1470, dtype: int64

In [6]:
#Check the frequency counts of the values of output feature
y.value_counts()

0    1233
1     237
Name: Attrition, dtype: int64
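
The counts above can also be read as proportions, which makes the imbalance easier to judge. The following is a minimal sketch: the `labels` series is a hypothetical stand-in for `y`, rebuilt from the counts shown above, and `baseline_accuracy` is an illustrative name rather than part of the practical.

```python
import pandas as pd

# Hypothetical stand-in for y, built from the counts above (1 = attrition, 0 = no attrition)
labels = pd.Series([0] * 1233 + [1] * 237, name="Attrition")

# normalize=True returns class proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model that always predicts "no attrition" would already be right about 84% of the time, so accuracy on its own says little here; the per-class precision and recall matter more.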

## Step 3
Now we should think about evaluation.

In this practical, we are going to use the simplest possible form of evaluation: a single train/test split. **You can extend it to a cross-validation version yourself.**

**NOTE 1:** Since we have an unbalanced target class, we need to be careful that we don't accidentally sample all attrition observations into our training set, leaving no positive observations in the testing set. That would make our task way too easy! Therefore we are going to tell the `train_test_split(..)` function to sample in a stratified way based on the class labels.

**NOTE 2:** So you can follow along with the solution later, I'd recommend a `test_size` of `0.20` and a `random_state` of `42`. I'd also use the standard names for the output, i.e. `X_train, X_test, y_train, y_test`.

**StratifiedKFold** is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

The `stratify` parameter of `train_test_split(..)` splits the data so that the class proportions in each resulting set match the proportions in the array passed to `stratify`.
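
For the cross-validation extension, a minimal sketch of how `StratifiedKFold` keeps the class ratio in every fold might look like this. The data here is synthetic and the names `X_demo`, `y_demo` and `fold_positive_rates` are illustrative only; they are not part of the practical's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, illustrative data: 100 rows, 20% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_positive_rates = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Each test fold keeps roughly the same class ratio as the full data
    fold_positive_rates.append(y_demo[test_idx].mean())

print(fold_positive_rates)
```

If you resample with SMOTE inside cross-validation, the resampling should be applied to the training part of each fold only, never to the held-out part.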

In [7]:
# Split data into training and testing sets, stratified on the class labels

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify = y)

## Step 4:  Encoding categorical features into dummy variables

We will use the same **`OneHotEncoder`** that we used before. 

[The documentation for this package](https://contrib.scikit-learn.org/category_encoders/).

**NOTE:** Because the dataset contains categorical features and the method we will use for oversampling, `SMOTE`, only takes continuous features, we need to encode the categorical features before handling the unbalanced class problem using `SMOTE`.

Another solution is to use `SMOTENC`, which supports categorical features. However, the list of categorical features must be specified when initialising a `SMOTENC` object.

In [8]:
from category_encoders.one_hot import OneHotEncoder

#Initialise a OneHotEncoder object
enc = OneHotEncoder(handle_unknown='value')

#Learn and apply the encoding on training set
X_train_enc = enc.fit_transform(X_train)

In [9]:
#Check the data was encoded as we expected (compare with the original data)
X_train_enc.head(10)

Unnamed: 0,BusinessTravel_1,BusinessTravel_2,BusinessTravel_3,Department_1,Department_2,Department_3,EducationField_1,EducationField_2,EducationField_3,EducationField_4,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1194,1,0,0,1,0,0,1,0,0,0,...,3,80,3,29,2,3,3,2,1,2
128,1,0,0,0,1,0,0,1,0,0,...,3,80,1,3,2,3,2,1,2,1
810,1,0,0,1,0,0,0,0,1,0,...,4,80,1,23,3,3,12,9,4,9
478,1,0,0,1,0,0,0,0,0,1,...,3,80,0,7,1,3,7,4,0,6
491,0,1,0,0,1,0,0,0,0,1,...,2,80,1,10,3,3,8,7,4,7
323,1,0,0,0,1,0,0,0,0,1,...,4,80,0,5,4,2,3,2,2,2
258,1,0,0,0,1,0,1,0,0,0,...,2,80,0,1,0,2,1,0,0,0
812,0,1,0,0,1,0,1,0,0,0,...,3,80,1,18,1,3,8,7,0,1
1132,1,0,0,1,0,0,1,0,0,0,...,3,80,1,5,2,3,5,4,1,2
996,1,0,0,1,0,0,0,0,1,0,...,4,80,0,6,3,3,6,2,4,4


In [14]:
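#Apply the encoding learned on the training set to the testing set
#(use transform, not fit_transform, so the test columns match the training columns)
X_test_enc = enc.transform(X_test)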
X_test_enc

Unnamed: 0,BusinessTravel_1,BusinessTravel_2,BusinessTravel_3,Department_1,Department_2,Department_3,EducationField_1,EducationField_2,EducationField_3,EducationField_4,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1061,1,0,0,1,0,0,1,0,0,0,...,3,80,1,1,2,3,1,0,0,0
891,0,1,0,0,1,0,1,0,0,0,...,4,80,1,10,5,3,10,5,7,7
456,0,1,0,1,0,0,1,0,0,0,...,3,80,1,10,3,2,5,4,0,1
922,0,1,0,0,1,0,1,0,0,0,...,4,80,2,26,4,2,25,9,14,13
69,0,1,0,0,1,0,0,1,0,0,...,1,80,1,2,0,2,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1269,0,1,0,0,0,1,1,0,0,0,...,2,80,0,10,5,3,9,7,1,8
1352,0,1,0,0,1,0,1,0,0,0,...,4,80,1,10,5,3,2,0,2,2
1236,0,1,0,1,0,0,0,0,1,0,...,2,80,3,16,3,3,2,2,2,2
1023,0,1,0,0,1,0,1,0,0,0,...,4,80,1,5,3,4,3,2,1,0
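
The reason for fitting the encoder on the training set only, and then merely transforming the test set, is easiest to see on a toy example. The sketch below uses scikit-learn's own `OneHotEncoder` rather than the `category_encoders` one used in this practical, and the `train` and `test` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): the test set lacks "Sales" and contains an unseen value
train = pd.DataFrame({"Department": ["Sales", "R&D", "HR", "R&D"]})
test = pd.DataFrame({"Department": ["R&D", "Legal"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform on the training data learns the category -> column mapping
train_encoded = encoder.fit_transform(train).toarray()

# transform (not fit_transform) reuses that mapping on the test data
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0])  # categories learned from the training set
print(train_encoded.shape, test_encoded.shape)
print(test_encoded)
```

Refitting on the test set would instead build a fresh mapping from whatever categories happen to appear there, so the same column position could mean different things in the two sets.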


## Step 5: Dealing with unbalanced class using SMOTE

We will use the `imblearn` package. Specifically, we will use the implementation found here:

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

To use this package, we initialise a SMOTE object and then call `fit_resample(..)` on the training set to generate a new, balanced training set.

In [28]:
from imblearn.over_sampling import SMOTE

#Initialise a SMOTE object
sm = SMOTE(random_state=42)

#Create the new (balanced) training set using SMOTE
X_train_enc_res, y_train_res = sm.fit_resample(X_train_enc, y_train)

In [29]:
#Check the frequency counts of the output feature in resampled training set

y_train_res.value_counts()

0    986
1    986
Name: Attrition, dtype: int64
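
To give an intuition for what SMOTE does, here is a minimal NumPy sketch of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not `imblearn`'s implementation; the function `smote_sketch` and the toy array `X_min` are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)             # random minority sample
        nn = rng.choice(neighbours[base])  # one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nn] - X_minority[base])
    return synthetic

# Toy minority-class points (made up)
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points)
```

Because the new points are interpolated, one-hot encoded columns can end up with fractional values between 0 and 1, which is the limitation `SMOTENC` is designed to address.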

## Step 6: Implementing Machine Learning Models
Having ensured that all categorical values are encoded and the training data are now balanced, we are now ready to proceed onto building our models. 

### Step 6a: Implementing Bagging Classifier

We'll first implement the **Bagging Classifier**. The documentation can be found here:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. In this practical, we focus on a tree-based Bagging classifier.
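
The description above can be sketched by hand in a few lines: draw bootstrap samples, fit one tree per sample, and take a majority vote. This is an illustrative sketch on synthetic data, with an arbitrary choice of 25 trees and `max_depth=5`; it uses hard voting, whereas `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement, same size as the original data
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Aggregate: majority vote over the individual tree predictions
all_votes = np.array([tree.predict(X_demo) for tree in trees])  # (n_trees, n_samples)
ensemble_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)

print("Ensemble training accuracy:", (ensemble_pred == y_demo).mean())
```

An odd number of trees avoids ties in the vote for a binary problem.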

In [30]:
# Setup the Bagging classifier (BaggingClassifier from sklearn)

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#Set Bagging parameters
bagging_params = {
    # Note: scikit-learn 1.2 renamed 'base_estimator' to 'estimator' (the old name was removed in 1.4)
    'base_estimator': DecisionTreeClassifier(max_depth = 10),
    'n_estimators': 100,
    'max_samples': 1.0,
    'max_features': 1.0,
    'bootstrap': True,
    'bootstrap_features': False,
    'n_jobs': -1,
    'random_state' : 42
}

#Initialise a BaggingClassifier object using the parameters set above
bagging = BaggingClassifier(**bagging_params)

#Fit the BaggingClassifier object on the resampled training data
bagging.fit(X_train_enc_res, y_train_res)

BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=10),
                  n_estimators=100, n_jobs=-1, random_state=42)

In [31]:
#Get prediction on testing data
bagging_pred = bagging.predict(X_test_enc)

In [32]:
#Evaluate the bagging classifier

from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_test, bagging_pred))
print(classification_report(y_test, bagging_pred))

0.8605442176870748
              precision    recall  f1-score   support

           0       0.88      0.96      0.92       247
           1       0.62      0.32      0.42        47

    accuracy                           0.86       294
   macro avg       0.75      0.64      0.67       294
weighted avg       0.84      0.86      0.84       294
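
The per-class precision and recall in a classification report come straight from the confusion matrix. A minimal sketch of that relationship follows; the labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not the practical's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# ravel() on a binary confusion matrix gives tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)

# recall = tp / (tp + fn), precision = tp / (tp + fp), both for class 1
recall = recall_score(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)
print(recall, precision)
```

For attrition, low recall on class 1 means many leavers go undetected even when overall accuracy looks high.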



In [33]:
#Try different base_estimator and n_estimators

In [34]:
#Check the accuracy score of each base estimator
for index, base_estimator in enumerate(bagging.estimators_, start=1):
    print("Accuracy score of the No.{} base estimator: {}".format(index, accuracy_score(y_test, base_estimator.predict(X_test_enc))))

Accuracy score of the No.1 base estimator: 0.7517006802721088
Accuracy score of the No.2 base estimator: 0.7653061224489796
Accuracy score of the No.3 base estimator: 0.7721088435374149
Accuracy score of the No.4 base estimator: 0.8163265306122449
Accuracy score of the No.5 base estimator: 0.8061224489795918
Accuracy score of the No.6 base estimator: 0.7755102040816326
Accuracy score of the No.7 base estimator: 0.7653061224489796
Accuracy score of the No.8 base estimator: 0.7653061224489796
Accuracy score of the No.9 base estimator: 0.7653061224489796
Accuracy score of the No.10 base estimator: 0.7619047619047619
Accuracy score of the No.11 base estimator: 0.8027210884353742
Accuracy score of the No.12 base estimator: 0.7517006802721088
Accuracy score of the No.13 base estimator: 0.7993197278911565
Accuracy score of the No.14 base estimator: 0.7721088435374149
Accuracy score of the No.15 base estimator: 0.7414965986394558
Accuracy score of the No.16 base estimator: 0.7482993197278912
...
Accuracy score of the No.58 base estimator: 0.7585034013605442
Accuracy score of the No.59 base estimator: 0.7585034013605442
Accuracy score of the No.60 base estimator: 0.7517006802721088
Accuracy score of the No.61 base estimator: 0.7687074829931972
Accuracy score of the No.62 base estimator: 0.782312925170068
Accuracy score of the No.63 base estimator: 0.7653061224489796
Accuracy score of the No.64 base estimator: 0.7448979591836735
Accuracy score of the No.65 base estimator: 0.7721088435374149
Accuracy score of the No.66 base estimator: 0.7551020408163265
Accuracy score of the No.67 base estimator: 0.7687074829931972
Accuracy score of the No.68 base estimator: 0.7551020408163265
Accuracy score of the No.69 base estimator: 0.7619047619047619
Accuracy score of the No.70 base estimator: 0.7891156462585034
Accuracy score of the No.71 base estimator: 0.7925170068027211
Accuracy score of the No.72 base estimator: 0.7653061224489796
Accuracy score of the No.73 base estimator: 0.772108843



### Step 6b: Implementing Adaptive Boosting Classifier

Then we implement the **Adaptive Boosting Classifier**. The documentation can be found here:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

**AdaBoost** is also a meta-estimator for machine learning. So it's not an actual machine learning model, but rather a way to combine machine learning models. AdaBoost takes an ensemble of other machine learning classifiers (e.g. logistic regression, decision trees, random forests, neural networks) and combines them in a weighted fashion to create a new classifier. So it is like taking several weak predictive models and combining them to form one super model. In this practical, we focus on the tree-based AdaBoost classifier.
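
To make "combines them in a weighted fashion" concrete, here is a minimal sketch of classic discrete AdaBoost with decision stumps on synthetic data. It is not the `SAMME.R` variant configured in this practical, and names such as `y_signed`, `alphas` and `n_rounds` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
y_signed = np.where(y_demo == 1, 1, -1)  # labels in {-1, +1}

n_rounds = 20
weights = np.full(len(X_demo), 1 / len(X_demo))  # start with uniform row weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this weak learner (clipped to avoid division by zero)
    err = np.clip(weights[pred != y_signed].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote weight

    # Increase the weight of misclassified rows, decrease the rest, renormalise
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the stumps' votes
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
print("Training accuracy:", (ensemble_pred == y_signed).mean())
```

Each round focuses the next stump on the rows the previous ones got wrong, which is the "adaptive" part of the name.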

In [35]:
# Setup the Adaptive Boosting classifier (AdaBoostClassifier from sklearn)

from sklearn.ensemble import AdaBoostClassifier

#Set Adaptive Boosting parameters
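adaBoost_params = {
    # Note: scikit-learn 1.2 renamed 'base_estimator' to 'estimator' (the old name was removed in 1.4)
    'base_estimator': DecisionTreeClassifier(max_depth=5),
    'n_estimators': 100,
    # Note: 'SAMME.R' is deprecated in recent scikit-learn releases in favour of 'SAMME'
    'algorithm': 'SAMME.R',
    'learning_rate': 1.0,
    'random_state' : 42
}

#Initialise an AdaBoostClassifier object using the parameters set above
adaBoost = AdaBoostClassifier(**adaBoost_params)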

#Fit the AdaBoostClassifier object on the resampled training data
adaBoost.fit(X_train_enc_res, y_train_res)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                   n_estimators=100, random_state=42)

In [36]:
#Get prediction on testing data
adaBoost_pred = adaBoost.predict(X_test_enc)

In [37]:
#Evaluate the Adaptive Boosting classifier

from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_test, adaBoost_pred))
print(classification_report(y_test, adaBoost_pred))

0.8299319727891157
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       247
           1       0.46      0.34      0.39        47

    accuracy                           0.83       294
   macro avg       0.67      0.63      0.65       294
weighted avg       0.81      0.83      0.82       294



In [43]:
#Print the feature importance computed by AdaBoost model

# Map each encoded column name to its importance score
feature_importance_dic = dict(zip(X_train_enc_res.columns, adaBoost.feature_importances_))

# key=lambda x: x[1] sorts the (feature, importance) pairs by their second element, i.e. the importance value
sorted(feature_importance_dic.items(), key=lambda x: x[1], reverse=True)

[('DailyRate', nan),
 ('MonthlyRate', 0.07018875637157854),
 ('HourlyRate', 0.05458084975935501),
 ('MonthlyIncome', 0.050477180900378504),
 ('EmployeeNumber', 0.04935556186874991),
 ('DistanceFromHome', 0.04913011566352247),
 ('TotalWorkingYears', 0.044898587189874065),
 ('Age', 0.044084878824472984),
 ('PercentSalaryHike', 0.036933523736501835),
 ('EnvironmentSatisfaction', 0.03561522988157498),
 ('OverTime_1', 0.0341530352146607),
 ('YearsWithCurrManager', 0.033972455400375655),
 ('YearsSinceLastPromotion', 0.030034432784372435),
 ('WorkLifeBalance', 0.029862214143587678),
 ('NumCompaniesWorked', 0.02776532672727396),
 ('JobSatisfaction', 0.024055688316023977),
 ('EducationField_1', 0.023846356320681564),
 ('StockOptionLevel', 0.023347573241083542),
 ('YearsInCurrentRole', 0.021077034806612716),
 ('Education', 0.020850995118490805),
 ('RelationshipSatisfaction', 0.020333511738384692),
 ('TrainingTimesLastYear', 0.018858324385925697),
 ('YearsAtCompany', 0.0187243004726298),
 ('JobIn

In [44]:
#Check the accuracy score of the prediction on test set after each boost.

for index, pred in enumerate(adaBoost.staged_predict(X_test_enc), start=1):
    print("Accuracy score of first {} models: {}".format(index, accuracy_score(y_test, pred)))

Accuracy score of first 1 models: 0.7925170068027211
Accuracy score of first 2 models: 0.8163265306122449
Accuracy score of first 3 models: 0.7857142857142857
Accuracy score of first 4 models: 0.7687074829931972
Accuracy score of first 5 models: 0.782312925170068
Accuracy score of first 6 models: 0.7857142857142857
Accuracy score of first 7 models: 0.8231292517006803
Accuracy score of first 8 models: 0.8129251700680272
Accuracy score of first 9 models: 0.8027210884353742
Accuracy score of first 10 models: 0.8027210884353742
Accuracy score of first 11 models: 0.8027210884353742
Accuracy score of first 12 models: 0.7891156462585034
Accuracy score of first 13 models: 0.7857142857142857
Accuracy score of first 14 models: 0.7857142857142857
Accuracy score of first 15 models: 0.8027210884353742
Accuracy score of first 16 models: 0.7925170068027211
Accuracy score of first 17 models: 0.7755102040816326
Accuracy score of first 18 models: 0.7789115646258503
Accuracy score of first 19 models: 0.7
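
Staged predictions can also be used to choose how many boosting rounds to keep. The sketch below does this on a separate validation split of synthetic data, since picking the number of rounds on the test set would make the test score optimistic. The names `X_val`, `staged_acc` and `best_rounds` are illustrative, and the classifier uses default settings rather than this practical's parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = AdaBoostClassifier(n_estimators=30, random_state=42).fit(X_fit, y_fit)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n boosting rounds
staged_acc = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val)]

# argmax returns the first round that reaches the best validation accuracy
best_rounds = int(np.argmax(staged_acc)) + 1
print(f"Best validation accuracy {max(staged_acc):.3f} after {best_rounds} rounds")
```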

