# HR Analytics: Predicting Employee Churn in Python

Among all of the business domains, HR is still the least disrupted. However, the latest developments in data collection and analysis tools and technologies allow for data-driven decision-making in all dimensions, including HR. This course will provide a solid basis for dealing with employee data and developing a predictive model to analyze employee turnover.

## A. Introduction to HR Analytics

In this chapter you will learn about the problems addressed by HR analytics and explore a sample HR dataset that will be analyzed further throughout the course. You will describe and visualize some of the key variables, then transform and manipulate the dataset to make it ready for analytics.

### 1. Introduction to HR analytics

Hello and welcome to the "HR analytics in Python" course. My name is Hrant Davtyan. I am a Business Analyst teaching Data Science and providing consultancy related to statistics. Among all of the business domains, HR is still the least disrupted. However, the latest developments in data collection and analysis tools and technologies allow for data-driven decision-making in all dimensions, including HR. As a consequence, HR analytics is a growing field, and I believe it is the right time to tap into that industry.

2. What is HR analytics?
00:34 - 00:44
HR analytics is also known as People analytics, and it is nothing more than a data-driven approach to managing people at work.

3. Problems addressed by HR analytics
00:44 - 01:08
There are many problems in HR that can be addressed using a data-driven approach. Among those are decisions related to employee hiring and retention, performance evaluation, collaboration and more. In this course, we will concentrate on predicting employee turnover, which relates to the first two of these: hiring and retention.

4. Employee turnover
01:08 - 01:48
Employee turnover, also known as employee attrition or employee churn, is the process of employees leaving the company. When skilled employees leave, it can be very costly for the company, so firms are interested in predicting turnover in advance. With that information in hand, companies can change their strategy to retain good workers or start hiring new employees in time.

5. Course structure
01:48 - 02:11
In this course, we will use a sample employee dataset, with variables that describe employees in the company, to predict their turnover and understand which features affect it most. The 1st chapter will concentrate on descriptive analytics, where we will transform the dataset and make it ready for developing the predictive model. In the 2nd chapter we will develop an initial model that will then be tuned and improved in the 3rd chapter. The final chapter will introduce techniques for selecting the best model for decision-making.

6. The dataset
02:11 - 03:15
So let's start by taking a quick look at our dataset. The data is provided in CSV format and is located in the working directory. This means we can use the read_csv() function from the pandas library to read it. Once the dataset is read into a new pandas DataFrame called "data", we can use the info() method to get some information on it. As you can see from the output, we have 10 columns and almost 15000 entries, which means the DataFrame holds 10 different variables for almost 15000 employees. Among those 10, only 2 have the type object, while the others are either float or int. Float and int columns are numeric, so they can be used directly in mathematical and statistical computations, while the object columns are categorical variables and need to be transformed before moving on. Therefore, let's take a quick look at our dataset to see what it looks like and what those 2 categorical variables are.

7. The dataset
03:15 - 03:44
We can use the head() method to take a look at the first 5 rows of the DataFrame. As you can see, the last two columns are "department" and "salary", which give information about the department an employee works in and the salary s/he receives, respectively. Both of them describe some category of an employee (belonging to this or that department or salary group), which is the reason they are called categorical variables.

8. Unique values
03:44 - 03:55
In order to understand what values those columns take, we first have to select the relevant column and then use a method called **unique()** to print only the unique values in that column.

**Finding categorical variables**

Categorical variables are variables that take a limited number of values that describe a category. They can be of two types:

- Ordinal: variables with two or more categories that can be ranked or ordered (e.g. “low”, “medium”, “high”)
- Nominal: variables with two or more categories that do not have an intrinsic order (e.g. “men”, “women”)

In this exercise, you will find the categorical variables in the dataset. To do that, first of all, you will import the pandas library and read the CSV file called "turnover.csv". Then, after viewing the first 5 rows and learning (visually) that there are non-numeric values in the DataFrame, you will get some information about the types of variables that are available in the dataset.

In [None]:
# Import pandas (as pd) to read the data
import pandas as pd

# Read "turnover.csv" and save it in a DataFrame called data
data = pd.read_csv("turnover.csv")

# Take a quick look at the first 5 rows of data
print(data.head(5))

# Get some information on the types of variables in data
data.info()

**Observing categoricals**

Remember from the previous exercise that:

- Ordinal variables have two or more categories which can be ranked or ordered
- Nominal variables have two or more categories which do not have an intrinsic order

In your dataset:

- salary is an ordinal variable
- department is a nominal variable

In this exercise, you're going to observe the categorical variables found in the previous exercise. To do that, first of all, you will import the pandas library and read the CSV file called "turnover.csv". Then, you will print the unique values of those variables.

In [None]:
# Import pandas (as pd) to read the data
import pandas as pd

# Read "turnover.csv" file and save it in a DataFrame called data
data = pd.read_csv("turnover.csv")

# Print the unique values of the "department" column
print(data.department.unique())

# Print the unique values of the "salary" column
print(data.salary.unique())

### 2. Transforming categorical variables

**Encoding categories**

You need to help your algorithm understand that you're dealing with categories. You will encode categories of the salary variable, which you know is ordinal based on the values you observed:

- you first have to tell Python that the salary column is actually categorical
- you then have to specify the correct order of categories
- finally, you should encode each category with a numeric value corresponding to its specific position in the order

In [None]:
# Change the type of the "salary" column to categorical
data.salary = data.salary.astype('category')

# Provide the correct order of categories
data.salary = data.salary.cat.reorder_categories(['low', 'medium', 'high'])

# Encode categories
data.salary = data.salary.cat.codes
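To see why the reordering step matters, here is a minimal sketch on a small hypothetical Series (the four values are made up for illustration and are not taken from turnover.csv). It compares the codes pandas assigns by default, where categories are sorted alphabetically, with the codes obtained after the explicit reordering.

```python
import pandas as pd

# Hypothetical salary values standing in for the real "salary" column
salary = pd.Series(["low", "high", "medium", "low"])

# Without reordering, categories are sorted alphabetically: high, low, medium
default_codes = salary.astype("category").cat.codes

# Reordering first makes the codes follow the actual ranking: low, medium, high
ordered = salary.astype("category").cat.reorder_categories(["low", "medium", "high"])
ordered_codes = ordered.cat.codes

print(default_codes.tolist())  # codes under alphabetical category order
print(ordered_codes.tolist())  # codes under the low < medium < high order
```

With the alphabetical default, "high" would receive the smallest code, which would misrepresent the ranking of the salary groups.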

**Getting dummies**

You will now transform the department variable, which you know is nominal based on the values you observed. To do that, you will use so-called dummy variables.

Use `get_dummies()` on the department column of data and save the result inside a new DataFrame called departments.

In [None]:
# Get dummies and save them inside a new DataFrame
departments = pd.get_dummies(data.department)

# Take a quick look at the first 5 rows of the new DataFrame called departments
print(departments.head(5))

**Dummy trap**

A dummy trap is a situation where different dummy variables convey the same information. In this case, if an employee is, say, from the accounting department (i.e. value in the accounting column is 1), then you're certain that s/he is not from any other department (values everywhere else are 0). Thus, you could actually learn about his/her department by looking at all the other departments.

For that reason, whenever *n* dummies are created (in your case, 10), only *n* - 1 (in your case, 9) of them are enough, and the *n*-th column's information is already included.

Therefore, you will get rid of the old department column, drop one of the department dummies to avoid dummy trap, and then join the two DataFrames.

Instructions

- .drop() the accounting column to avoid "dummy trap".
- .drop() the old column department as you do not need it anymore.
- Join the new departments DataFrame to the employee dataset (this has been done for you).

In [None]:
# Drop the "accounting" column to avoid "dummy trap"
departments = departments.drop("accounting", axis=1)

# Drop the old column "department" as you don't need it anymore
data = data.drop("department", axis=1)

# Join the new dataframe "departments" to your employee dataset: done
data = data.join(departments)
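As a possible shortcut, `pd.get_dummies()` accepts a `drop_first=True` argument that drops the first category for you. The sketch below uses a hypothetical three-department Series (the values are illustrative, not the real column) to show that this leaves *n* - 1 dummy columns.

```python
import pandas as pd

# Hypothetical department values; the real column has 10 distinct departments
department = pd.Series(["accounting", "sales", "technical", "sales", "accounting"])

# One dummy column per department (n columns)
all_dummies = pd.get_dummies(department)

# drop_first=True removes the first category in sorted order (n - 1 columns)
reduced_dummies = pd.get_dummies(department, drop_first=True)

print(list(all_dummies.columns))
print(list(reduced_dummies.columns))
```

Dropping the column by name, as in the exercise, makes the choice of reference category explicit, whereas `drop_first=True` always drops whichever category comes first in sorted order.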

### 3. Descriptive statistics

So now our dataset is ready for developing a predictive algorithm. But before that, let's first get some quick descriptive insights.

2. Turnover rate
00:07 - 01:06
The variable that tells us whether an employee has left the company or not is the column **churn**. If the value of this column is 1, the employee has churned; if it is 0, we have not observed turnover for that employee. To calculate the turnover rate we count the number of times this variable has the values 1 and 0 and divide each count by the total. If we multiply the result by 100, the outcome is the percentage of employees who left and stayed. This task is accomplished in 3 steps: first, we get the number of all employees, which is the length of our data; then, we count the 1s and 0s in the column churn; finally, we divide the counted values by the number of employees and multiply by 100 to get percentages. As you can see, around 76% of our employees stayed, while 24% have churned. Thus, we conclude that the turnover rate is 24%.

3. Correlations
01:06 - 01:39
Next, we are interested in learning which variables have a positive or negative linear relationship with our target. To see that, we will first build the correlation matrix using the `corr()` method provided by **pandas** and then visualize the matrix using the `heatmap()` function from seaborn, a statistical visualization library. As you can see, the target variable **churn** has its strongest negative correlation with satisfaction level. This shows that an increase in satisfaction level is associated with a decrease in the probability of turnover.
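The two steps described here can be sketched as follows. This is only an illustration: the small DataFrame and its column names are hypothetical stand-ins for the real employee data, and `plot_correlations` is a helper name introduced for this example.

```python
import pandas as pd

# Hypothetical mini dataset standing in for the employee data
data = pd.DataFrame({
    "satisfaction": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "number_of_projects": [3, 4, 6, 2, 4, 5],
    "churn": [0, 0, 1, 1, 0, 1],
})

# Pairwise (Pearson) correlations between all numeric columns
corr_matrix = data.corr()

# Correlation of each feature with the target, most negative first
churn_corr = corr_matrix["churn"].drop("churn").sort_values()
print(churn_corr)


def plot_correlations(matrix):
    """Draw the correlation matrix as a heatmap (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, vmin=-1, vmax=1)
    plt.show()
```

In a notebook, calling `plot_correlations(corr_matrix)` would display the heatmap; the plotting imports are kept inside the function so that the correlation step runs even where seaborn is not installed.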

4. Let's practice!
01:39 - 01:41
Now it's your turn to practice.

In [None]:
# Include the correlation table exactly as it appears in the image in this folder.

**Percentage of employees who churn**

The column churn provides information about whether an employee has left the company or not:

- if the value of this column is 0, the employee is still with the company
- if the value of this column is 1, then the employee has left the company

Let’s calculate the turnover rate:

- you will first count the number of times the variable churn has the value 1 and the value 0, respectively
- you will then divide both counts by the total, and multiply the result by 100 to get the percentage of employees who left and stayed

In [None]:
# Use len() function to get the total number of observations and save it as the number of employees
n_employees = len(data)

print(type(data))
print(n_employees)
print(data.head(3))
print(data.tail(3))

# Print the number of employees who left/stayed
print(data.churn.value_counts())

# Print the percentage of employees who left/stayed
print(data.churn.value_counts()/n_employees*100)
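As an alternative sketch, `value_counts()` can do the division itself through its `normalize=True` argument. The four-element Series below is a hypothetical stand-in for the real churn column.

```python
import pandas as pd

# Hypothetical churn column: three stayers (0) and one leaver (1)
churn = pd.Series([0, 0, 0, 1])

# normalize=True returns shares instead of counts; multiply by 100 for percentages
percentages = churn.value_counts(normalize=True) * 100
print(percentages)
```

This gives the same percentages as dividing the counts by `len(data)` by hand.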

## B. Predicting employee turnover

This chapter introduces one of the most popular classification techniques: the Decision Tree. You will use it to develop an algorithm that predicts employee turnover.

### 1. Splitting the data
00:00 - 00:09
Hello and welcome to Chapter 2. By now we already know how to transform HR data and make it ready for predictive analytics. Let's now concentrate on the predictive component.

2. Target and features
00:09 - 00:38
Our goal in this course is to predict employee turnover using the data that we have on employees. In business analytics or data science terminology, the variable that one aims to predict is known as the target, while everything else that is used for prediction is called features. In other words, we will be using features to predict the target.

3. Train/test split
00:38 - 01:54
To make an accurate prediction and build an algorithm that can be useful in reality, it is usual practice in analytics to split the data into two components: train and test. The train component is used to conduct calculations and optimizations and to develop the algorithm, while the remaining test component is used to validate it. For that reason, once our data is separated into target and features, the next step is to split both of them into train and test components. One of the most popular Python libraries, widely used by data scientists and business analysts, is called sklearn. In sklearn, there is a built-in function for most analytics tasks, including train/test splitting. As you can see from the code, the function generates 4 outputs. This happens because we split both target and features into train and test, so we end up with train and test components for the target, and similarly for the features. Last but not least, as you can see, the function takes a test_size argument, which is 0.25 in our example. This argument tells sklearn to randomly choose 25% of the data and save it as test, while the remaining 75% is kept for training. In general, when you have quite a big dataset with millions of observations, around 2-3% for test might be enough. But because our datasets in HR are not usually that big, 25% for test is a reasonable choice.

4. Overfitting
01:54 - 02:50
To better understand the reasoning behind the train/test split, let's briefly cover the concept of overfitting. Overfitting is one of the most common problems in analytics. Our first objective is to have an accurate model that can help us make accurate predictions and base decisions on them. Yet a model that is accurate on one dataset might not be as accurate on another. So our second, no less important objective is to achieve a model that is generalizable, or in other words, works well not only on our current dataset but also on possible future datasets. Overfitting happens when the model works well on the dataset it was developed on but is not useful outside of it. So we split the data into train and test components, develop the model on train and then validate it on test to make sure our model is not overfitting the training data.
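A minimal sketch of what overfitting looks like in practice, assuming purely synthetic data: the features are random numbers and the target is random too, so there is nothing real to learn. The unrestricted tree used here is only an illustration, not the model built in this course.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 400 rows, 5 random features, random 0/1 target
rng = np.random.default_rng(42)
features = rng.normal(size=(400, 5))
target = rng.integers(0, 2, size=400)

# Same argument order as in this course: target first, then features
target_train, target_test, features_train, features_test = train_test_split(
    target, features, test_size=0.25, random_state=42
)

# A fully grown tree can memorize the training rows, noise included
model = DecisionTreeClassifier(random_state=42)
model.fit(features_train, target_train)

train_accuracy = model.score(features_train, target_train)
test_accuracy = model.score(features_test, target_test)
print(train_accuracy * 100, test_accuracy * 100)
```

Because the target is pure noise, any accuracy gained on the training set comes from memorization, and the gap between the two scores is the symptom that the train/test split is designed to reveal.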

5. Let's practice!
02:50 - 02:54
We will concentrate more on the concept of overfitting, but until then, let's practice splitting the data.

**Separating Target and Features**

In order to make a prediction (in this case, whether an employee would leave or not), one needs to separate the dataset into two components:

- the dependent variable or target which needs to be predicted
- the independent variables or features that will be used to make a prediction

Your task is to separate the target and features. The target you have here is the employee churn, and features include everything else.

Reminder: the dataset has already been modified by encoding categorical variables and getting dummies.

pandas has been imported for you as pd.

In [None]:
# Set the target and features

# Choose the dependent variable column (churn) and set it as target
target = data.churn

print(target.head())

print(target.value_counts())

# Drop column churn and set everything else as features
features = data.drop("churn",axis=1)

print(features.head())

**Splitting employee data**

Overfitting the dataset is a common problem in analytics. This happens when a model is working well on the dataset it was developed upon, but fails to generalize outside of it.

A train/test split is implemented to ensure model generalization: you develop the model using the training sample and try it out on the test sample later on.

In this exercise, you will split both target and features into train and test sets with 75%/25% ratio, respectively.

Instructions
- Import train_test_split from the sklearn.model_selection module
- Use train_test_split() to split your dataset into training and testing sets
- Assign 25% of your observations to the testing set

In [None]:
# Import the function for splitting dataset into train and test
from sklearn.model_selection import train_test_split 

# Use that function to create the splits both for target and for features
# Set the test sample to be 25% of your observations
target_train, target_test, features_train, features_test = train_test_split(target,features,test_size=0.25,random_state=42)

### 2. Introduction to Decision Tree classification

In this lecture we will start the process of building the predictive algorithm. What we want is an algorithm that will learn from our historical data which variables are important in the decision to leave the company, and use that information to predict turnover. As our target, turnover, takes only 2 values, 1 and 0, this problem is called binary classification.

2. Classification in Python
00:25 - 00:56
There are many different data science and machine learning algorithms that one can use to address a binary classification problem such as predicting employee turnover. Each of them has its own pros and cons and business cases where it is best applied. The algorithm we will use has proved to be quite popular in HR analytics and is called the Decision Tree. It is popular for 2 main reasons: first, it is able to provide accurate predictions, and second, it can be used to understand the factors that drive the decision to leave the company.

3. Decision Tree Classification
00:56 - 02:13
The picture you see now is the visualization of a small sample Decision Tree for employee turnover. The appearance of the algorithm is the reason it is called a Decision Tree. Let's go over the tree step by step to understand the classification process. The tree starts growing from the variable Satisfaction. It checks whether, for a given employee, the satisfaction level was higher than 0.5 or not. If it was, we go to the right branch of the tree; otherwise we move to the left one. If we moved to the right, then the next question we need to ask according to the tree is whether Salary is High or not. As you can see, if Salary is High, then we reach one of the last nodes, or leaves, of the tree, where the output is that the employee will not Churn. Thus we have a decision path: employees with a High Satisfaction level and High Salary do not Churn. Similarly, employees with a low satisfaction level who spent, say, 3 years with the company do Churn, as shown by the last leaf of the leftmost branch of the tree. Therefore, once we have this tree, we can easily predict whether a given employee will churn or not and also understand which variables are important in driving the churn decision.

4. Splitting rule
02:13 - 02:43
Let's now concentrate briefly on the intuition used to split the tree. In general, the Decision Tree algorithm wants to achieve samples in the final leaves that are as pure as possible. Mathematically, 2 different rules are quite popular for this task: Gini and Entropy. The objective is the same in both cases: we aim to minimize Gini or Entropy, and both will result in purer samples in the final nodes. As there is no proven theoretical dominance of one method over the other, we will go on using Gini, as its calculations are faster.
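To make the two rules concrete, here is a small sketch for a node with two classes. The Gini function uses the same 2·p·(1−p) form as the Gini exercise later in this chapter; the entropy function is the standard two-class entropy in bits. The function names are chosen for this illustration only.

```python
import math


def gini(p):
    """Gini impurity of a two-class node, where p is the share of one class."""
    return 2 * p * (1 - p)


def entropy(p):
    """Entropy (in bits) of a two-class node, where p is the share of one class."""
    if p in (0, 1):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Both measures are 0 for a pure node and largest for a 50/50 node
for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(gini(p), 3), round(entropy(p), 3))
```

In sklearn, the rule is chosen through the `criterion` parameter of `DecisionTreeClassifier`, which accepts `"gini"` (the default) and `"entropy"`.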

5. Decision Tree splitting: hypothetical example
02:43 - 03:27
Let's discuss a hypothetical example. Assume we have a dataset of 100 people, 40 of them leavers and 60 stayers. Now let's divide them based on whether the satisfaction level is higher than 0.8 or not. If yes, suppose we end up with 50 people on the left branch, all stayers. The right branch, on the other hand, includes 10 stayers and 40 leavers. As you can see, this hypothetical split results in a noticeably decreased Gini: from 0.48 to 0 and 0.32 in the two branches respectively. As a result, we have purer samples, especially in the left branch, where we have only stayers, which helps us to make more accurate predictions.
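The arithmetic of this example can be checked with the 2·p·(1−p) formula used in the Gini exercise of this chapter. Under that formula, the branch with 10 stayers and 40 leavers evaluates to 0.32. The helper name and the size-weighted average at the end are additions for illustration.

```python
def gini_from_counts(stayed, left):
    """Gini index of a node, given the number of stayers and leavers in it."""
    total = stayed + left
    return 2 * (stayed / total) * (left / total)


# Before the split: 60 stayers and 40 leavers
gini_parent = gini_from_counts(60, 40)

# After the split: 50 stayers on one branch, 10 stayers and 40 leavers on the other
gini_left = gini_from_counts(50, 0)
gini_right = gini_from_counts(10, 40)

# Size-weighted average impurity of the two branches (50 people each)
gini_weighted = (50 * gini_left + 50 * gini_right) / 100

print(round(gini_parent, 2), round(gini_left, 2), round(gini_right, 2))
print(round(gini_weighted, 2))
```

The weighted average is the quantity that is compared with the parent node's Gini when judging whether a split improves purity overall.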

6. Let's practice!
03:27 - 03:31
Good, let's now practice the theory before moving to analytics.

**Computing Gini index**

The decision tree algorithm aims to achieve partitions in the terminal nodes that are as pure as possible. The Gini index is one of the methods used to achieve this. It is calculated based on the proportion of samples in each group.

Given the number of people who stayed and left respectively, calculate the Gini index for that node.

In [None]:
#number of people who stayed/left
stayed = 37
left = 1138

#sum of stayed and left
total = stayed + left

#gini index
gini = 2*(stayed/total)*(left/total)

print(gini)

**Splitting the tree**

Given the Gini index that would result from splitting by either variable A or B, respectively, decide by which variable the tree should split next.

In [None]:
# Gini index in case of splitting by variable A or B
gini_A = 0.65
gini_B = 0.15

# Check which Gini is lower and use it for splitting
if gini_A < gini_B:
    print("split by A!")
else:
    print("split by B!")

### 3. Predicting employee churn using decision trees

By now, we know how a Decision Tree works in theory. Let's apply this knowledge and use Python to predict employee churn.

2. Decision Tree in Python
00:07 - 01:44
To get the tree, we first need to import the necessary class and initialize it. For that, we will again use the already familiar `sklearn` library. Once it is imported, we initialize this long-named class under a friendlier name and also provide a parameter called random_state. This parameter is only the seed for the random choices made while growing the tree: it ensures that if you run the code a second time you will still get the same results. As a consequence, it is not important whether it is 1, 20 or anything else; what is important is to use the same value whenever you need to reproduce the same results. Once the model is set up, we can use the fit() method on it to fit our features to the target. As you remember, we used a train/test split to develop the model on the train component and then validate it on test. This is done to avoid overfitting. For that reason, we use the `features_train` and `target_train` components for fitting. Once we run this piece of code, the tree is calculated and grown. To test how good this tree is at making predictions, we use a method called score(), which calculates the accuracy score of the prediction. Again, because we developed the model on the training component, we calculate the accuracy score on the test component. The score shows how often the prediction is correct. For example, a score of 0.65 shows that our tree correctly predicted whether an employee will leave or stay in 65% of cases. So to get percentages, we just need to multiply the accuracy score by 100.

3. Let's practice!
01:44 - 01:47
OK, now it's your turn to calculate the accuracies.

**Fitting the tree to employee data**

A train/test split provides the opportunity to develop the classifier on the training component and test it on the rest of the dataset. In this exercise, you will start developing an employee turnover prediction model using the decision tree classification algorithm. The algorithm provides a .fit() method, which can be used to fit the features to the model in the training set.

Reminder: both target and features are already split into train and test components (Train: features_train, target_train, Test: features_test, target_test)

Instructions
- Import the classification algorithm called DecisionTreeClassifier.
- Initialize it as model and set the random state to 42.
- Apply the decision tree model by fitting the training set features to the model.

In [None]:
# Import the classification algorithm
from sklearn.tree import DecisionTreeClassifier

# Initialize it and call model by specifying the random_state parameter
model = DecisionTreeClassifier(random_state=42)

# Apply a decision tree model to fit features to the target
model.fit(features_train, target_train)

**Checking the accuracy of prediction**

It’s now time to check how well your trained model can make predictions! Let’s use your testing set to check the accuracy of your Decision Tree model, with the score() method.

Instructions
- Apply the decision tree model to fit the features to the target in the training set.
- Check the accuracy score() of the prediction for the training set.
- Check the accuracy score() of the prediction for the test set.

In [None]:
# Apply a decision tree model to fit features to the target in the training set
print(model.fit(features_train,target_train))

# Check the accuracy score of the prediction for the training set
print(model.score(features_train,target_train)*100)

# Check the accuracy score of the prediction for the test set
print(model.score(features_test,target_test)*100)

### 4. Interpretation of the decision tree

One of the main advantages of the Decision Tree algorithm is that it is interpretable. We can visualize the tree to understand the path that takes us to the final decision.

2. Visualization
00:11 - 00:39
The visualization consists of 3 steps. First, one needs to export the tree. It will be exported into a file called tree.dot, which will reside in the working directory together with the turnover.csv dataset file that we are using. Then you need to open the file and copy its content. The last step is going to the webgraphviz website, pasting the content and visualizing the tree.

3. Interpretation
00:39 - 01:09
After the three steps are completed, you will see a similar tree on the webgraphviz website. As you can see, each node includes information on the sample size, which is the number of employees in that node who satisfied the preceding decision rules. It also provides the number of stayers and leavers in each node and the corresponding Gini index value. You can see that as the tree grows, Gini decreases, which is the objective we wanted to achieve.
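To see where the sample size, class counts and Gini value in each node come from, here is a small sketch that returns the exported text directly instead of writing a file. The six-row toy dataset, the feature names and the class names are hypothetical; passing `out_file=None` makes `export_graphviz()` return the DOT source as a string.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy data: [satisfaction, years spent at the company]
features = [[0.9, 2], [0.8, 3], [0.2, 3], [0.1, 4], [0.7, 2], [0.3, 5]]
target = [0, 0, 1, 1, 0, 1]  # 0 = stayed, 1 = left

# A shallow tree keeps the exported text short
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(features, target)

# out_file=None returns the DOT description as a string
dot_text = export_graphviz(
    model,
    out_file=None,
    feature_names=["satisfaction", "time_spend_company"],
    class_names=["stayed", "left"],
)
print(dot_text)
```

Each node label in the printed text carries the splitting rule, the Gini value, the number of samples and the per-class counts, which is the information shown in the rendered tree.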

4. Let's practice!
01:09 - 01:14
Now let's try some examples.

**Exporting the tree**

In Decision Tree classification tasks, overfitting is usually the result of deeply grown trees. As the comparison of accuracy scores on the train and test sets shows, you have overfitting in your results. This can also be learned from the tree visualization.

In this exercise, you will export the decision tree into a text document, which can then be used for visualization.

Instructions
- Import the export_graphviz() function from the sklearn.tree submodule.
- Fit the model to the training data.
- Export the visualization to the file tree.dot.

In [None]:
# Import the graphical visualization export function
from sklearn.tree import export_graphviz

# Apply Decision Tree model to fit Features to the Target
model.fit(features_train,target_train)

# Export the tree to a dot file
export_graphviz(model,"tree.dot")

Include image 'DesicionTreeInterpretation.png'

## C. Evaluating the turnover prediction model

Here, you will learn how to evaluate a model and understand how "good" it is. You will compare different trees to choose the best among them.

### 1. Tuning employee turnover classifier

In chapter 2 we briefly touched on the concept of overfitting. As mentioned there, a train/test split helps us to learn whether we are overfitting or not. Yet, it does not really provide a solution. In this chapter, we will concentrate on tuning our classifier to get better results, and some of these methods are aimed at fighting overfitting.

2. Overfitting
00:23 - 01:25
For that reason, let's recall what overfitting is about. Once we develop a model on the train component, it may work perfectly on that data but fail outside of it. This is the reason we use the test component to understand whether our model is useful outside of the training data or not. As you can see, the accuracy score is perfect on the training set, but not as high on the testing set. This points to an overfitting problem. The reason we have it is that currently our tree grows as much as it can, and in the end becomes very large and very specific to the training data. To solve this issue, we have two options: either we limit the maximum depth of the tree, say, we do not let the tree grow more than 5 levels, or we limit the sample size in each leaf and, say, do not allow the tree to grow further if only 100 employees are left in the node/leaf. Let's go on and apply both separately.

3. Pruning the tree
01:25 - 02:41
In the upper block we limit the tree depth. As you can see, this can easily be done by setting an additional parameter `max_depth=5` in the DecisionTreeClassifier during initialization. It keeps everything else the same, but limits the tree to at most 5 levels of depth. Thus, let's call this model `model_depth_5`. As you can see, the fitting and scoring processes afterwards are still the same, with only one tiny but important difference: we fit features to the target and calculate the accuracy for `model_depth_5` instead of the general model without any limitation. As a result, the accuracy decreases on both sets, but the difference between them is negligible, which means we reduced overfitting and the current model is more realistic. In the lower block we do exactly the same, apart from the model initialization step: this time we set `min_samples_leaf=100` to limit the sample size inside a leaf. After fitting and scoring this new model we get a test accuracy of 96.13%, which is again lower but also more realistic than the old one.

4. Let's practice!
02:41 - 02:55
We will learn more realistic metrics for evaluating the model. Until then, let's practice.

#### Pruning the tree

Overfitting is a classic problem in analytics, especially for the decision tree algorithm. Once the tree is fully grown, it may provide highly accurate predictions for the training sample, yet fail to be that accurate on the test set. For that reason, the growth of the decision tree is usually controlled by:

- “Pruning” the tree and setting a limit on the maximum depth it can have.
- Limiting the minimum number of observations in one leaf of the tree.

In this exercise, you will:

- prune the tree and limit the growth of the tree to 5 levels of depth
- fit it to the employee data
- test prediction results on both training and testing sets.

The variables features_train, target_train, features_test and target_test are already available in your workspace.

Instructions
- Initialize the DecisionTreeClassifier while limiting the depth of the tree to 5.
- Fit the Decision Tree model using the features and the target in the training set.
- Check the accuracy of the predictions on both the training and test sets.

In [None]:
# Initialize the DecisionTreeClassifier while limiting the depth of the tree to 5
model_depth_5 = DecisionTreeClassifier(max_depth = 5, random_state = 42)

# Fit the model
model_depth_5.fit(features_train,target_train)

# Print the accuracy of the prediction for the training set
print(model_depth_5.score(features_train,target_train) * 100)

# Print the accuracy of the prediction for the test set
print(model_depth_5.score(features_test,target_test) * 100)

#### Limiting the sample size

Another method to prevent overfitting is to specify the minimum number of observations necessary to grow a leaf (or node) in the Decision Tree.

In this exercise, you will:

- set this minimum limit to 100
- fit the new model to the employee data
- examine prediction results on both training and test sets

The variables features_train, target_train, features_test and target_test are already available in your workspace.

Instructions

- Initialize the DecisionTreeClassifier and set the leaf minimum limit to 100 observations.
- Fit the decision tree model to the training data.
- Check the accuracy of the predictions on both the training and test sets.

In [None]:
# Initialize the DecisionTreeClassifier while limiting the sample 
# size in leaves to 100

model_sample_100 = DecisionTreeClassifier(  min_samples_leaf = 100, 
                                            random_state = 42)

# Fit the model
model_sample_100.fit(features_train,target_train)

# Print the accuracy of the prediction (in percentage points) for the training set
print(model_sample_100.score(features_train,target_train) * 100)

# Print the accuracy of the prediction (in percentage points) for the test set
print(model_sample_100.score(features_test,target_test) * 100)

### 2. Evaluating the model
00:00 - 00:17
Until now, we have used only the general accuracy score to evaluate the performance of our model. However, it turns out that accuracy alone is not enough to claim that the model is a good one.

2. Prediction errors
00:17 - 00:55
To understand what the other evaluation metrics are doing, let me introduce you to prediction errors first. We have two possible outcomes in reality and two possible predictions, which means we have 4 possible situations in total, presented in the so-called confusion matrix. When the prediction is 0 we call it negative, and when it is 1 it is widely accepted to call it positive. Similarly, when the prediction is correct we say it is True, otherwise it is False. Thus, if in reality someone left the company but was predicted to be a stayer, then we have a False Negative, as the prediction was both False and Negative. Based on these 4 possibilities, many different metrics have been developed in analytics to measure the performance of a model.
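
As a small illustration of the four situations, the sketch below counts them with sklearn's `confusion_matrix`. The two label lists are made up for this example (1 means the employee left, 0 means the employee stayed).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = left the company, 0 = stayed
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# With labels=[0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Here two leavers were predicted to stay, so there are two False Negatives, and one stayer was predicted to leave, which gives one False Positive.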

3. Evaluation metrics (1)
00:55 - 01:29
If the target of your predictions is mostly to focus on those who are churning, then you probably want to have fewer False Negatives: people who leave in reality but whom your algorithm fails to flag. For that reason, the Recall score can be useful. Higher values of recall correspond to lower values of False Negatives. On the other hand, if you want to keep your attention on those who stay, fewer False Positives will be your target, which can be achieved with a higher Specificity score.
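
A minimal sketch of both scores on made-up labels follows. As far as I know, `sklearn.metrics` has no dedicated specificity function, so the example relies on the fact that specificity is simply recall computed for the negative class (`pos_label=0`).

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = left the company, 0 = stayed
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Recall = TP / (TP + FN): share of actual leavers that were caught
recall = recall_score(actual, predicted)

# Specificity = TN / (TN + FP): recall computed for the stayers (class 0)
specificity = recall_score(actual, predicted, pos_label=0)

print(f"recall: {recall:.3f}, specificity: {specificity:.3f}")
```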

4. Evaluation metrics (2)
01:29 - 02:17
There are some other metrics that can be derived from the same confusion matrix. For example, if one is interested in learning what percentage of people truly left the company among those who were predicted to leave, then the Precision score will be handy to use. The reason those scores are important is that general accuracy does not provide information about the separate classes. For example, in our dataset around 76% are stayers. So if we just say "everybody is staying" we will have a 76% accurate prediction. But in terms of recall we will score zero, as everybody who churned will be wrongly classified.
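
The "everybody is staying" argument can be checked with a few lines. This is only a sketch: the target below is a made-up array with 76 stayers and 24 leavers, not the course dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical target with 76% stayers (0) and 24% leavers (1)
actual = np.array([0] * 76 + [1] * 24)

# Naive "model": predict that everybody stays
predicted = np.zeros_like(actual)

accuracy = accuracy_score(actual, predicted)
recall = recall_score(actual, predicted)

print(f"accuracy: {accuracy:.2f}, recall: {recall:.2f}")
```

The accuracy looks respectable even though not a single leaver is identified.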

5. Let's practice!
02:17 - 02:23
My experience shows that these scores can sound very similar and be difficult to tell apart. If you feel so, do not worry, and take your time to go over the confusion matrix again to understand the intuition behind each of them. For now, let's calculate some measures for our employee dataset.

#### Calculating accuracy metrics: precision

The Precision score is an important metric used to measure the accuracy of a classification algorithm. It is calculated as the fraction of True Positives over the sum of True Positives and False Positives, or

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

- we define True Positives as the number of employees who actually left, and were classified correctly as leaving
- we define False Positives as the number of employees who actually stayed, but were wrongly classified as leaving

If there are no False Positives, the precision score is equal to 1. If there are no True Positives, the precision score is equal to 0.

In this exercise, we will calculate the precision score (using the sklearn function precision_score) for our initial classification model.

The variables features_test and target_test are available in your workspace.

Instructions

- Import the function precision_score from the module sklearn.metrics.
- Use the initial model to predict churn (based on features of the test set).
- Calculate the precision score by comparing target_test with the test set predictions.

In [None]:
# Import the function to calculate precision score
from sklearn.metrics import precision_score

# Predict whether employees will churn using the test set
prediction = model.predict(features_test)

print(prediction.size)
print(prediction.shape)
print(prediction.mean())
print(prediction)

# Calculate precision score by comparing target_test with the prediction
precision_score(target_test, prediction)

#### Calculating accuracy metrics: recall
The Recall score is another important metric used to measure the accuracy of a classification algorithm. It is calculated as the **fraction of True Positives over the sum of True Positives and False Negatives**, or

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

If there are no False Negatives, the recall score is equal to 1. If there are no True Positives, the recall score is equal to 0.

In this exercise, you will calculate the recall score (using the sklearn function recall_score) for your initial classification model.

The variables features_test and target_test are available in your workspace.

Instructions

- Import the function to calculate the recall score.
- Use the initial model to predict churn (based on features of the test set).
- Calculate the recall score by comparing target_test with the predictions.

In [None]:
# Import the function to calculate recall score
from sklearn.metrics import recall_score

# Use the initial model to predict churn
prediction = model.predict(features_test)

# Calculate recall score by comparing 
# target_test with the prediction
recall_score(target_test, prediction)

### 3. Targeting both leavers and stayers

As the objective of this course is to develop a model that will correctly predict churn, the recall score seems to be our target. However, recall alone is not enough: by targeting only one class, we may end up with dramatically low accuracy for the other. Thus, a general rule is to use a measure that is not concentrated on one class alone.

2. AUC score
00:20 - 01:14
If our target is leavers, we would concentrate on recall; if stayers, then on specificity. But if your target is to have good predictions on both, then probably the best choice is the AUC score. AUC stands for Area Under the Curve and is a compound measure that is maximized when both recall and specificity are maximized. To calculate the AUC score, one places Recall on the vertical axis and 1 - Specificity on the horizontal axis and draws the blue curve in the graph, which is called the ROC curve. The area under the ROC curve that we obtained is the AUC score; the green diagonal line shows what a random prediction would obtain, which corresponds to an AUC of 0.5. Using AUC as a target to maximize, the model will try to correctly classify both 1s and 0s, keeping an eye on recall and specificity at the same time.
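
The construction of the curve can be sketched with sklearn's `roc_curve` and `roc_auc_score`. The four labels and the predicted probabilities below are made up for illustration; with a fitted classifier, such probabilities would typically come from `predict_proba(features)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of leaving
actual = [0, 0, 1, 1]
prob_leave = [0.1, 0.4, 0.35, 0.8]

# fpr is 1 - specificity (horizontal axis), tpr is recall (vertical axis)
fpr, tpr, thresholds = roc_curve(actual, prob_leave)
auc_score = roc_auc_score(actual, prob_leave)

for x, y in zip(fpr, tpr):
    print(f"1 - specificity = {x:.2f}, recall = {y:.2f}")
print("AUC:", auc_score)
```

Each point of the curve corresponds to one probability threshold, and the AUC summarizes the whole curve in a single number between 0 and 1.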

3. Let's practice!
01:14 - 01:19
Excellent, let's now put this into practice.

#### Calculating the ROC/AUC score

While the Recall score is an important metric for measuring the accuracy of a classification algorithm, it puts too much weight on the number of False Negatives. On the other hand, Precision is concentrated on the number of False Positives.

The ROC curve takes both kinds of error into account: it plots recall against the false positive rate (1 - specificity), so it reflects False Negatives and False Positives at the same time. The area under the ROC curve is the AUC score.

In this exercise, you will calculate the ROC/AUC score for the initial model using the sklearn roc_auc_score() function.

The variables features_test and target_test are available in your workspace.

Instructions

- Import the function to calculate ROC/AUC score.
- Use the initial model to predict churn (based on the features of the test set).
- Calculate ROC/AUC score by comparing target_test with the prediction.

In [None]:
# Import the function to calculate ROC/AUC score
from sklearn.metrics import roc_auc_score

# Use initial model to predict churn 
# (based on features_test)
prediction = model.predict(features_test)

# Calculate ROC/AUC score by comparing 
# target_test with the prediction
roc_auc_score(target_test, prediction)

### 4. Class imbalance

General accuracy score is a good choice only if classes in the dataset are balanced. However, as discussed in this chapter, class imbalance may lead to higher accuracy score, when in fact our model is failing to correctly predict churn. This was the reason we covered evaluation metrics other than accuracy score. While those other metrics are more robust and informative, they only partially solve the class imbalance problem. To solve it, what we can do is to change prior probabilities.

2. Prior probabilities
00:28 - 01:33
As you remember, the Gini index is the objective that our Decision Tree minimizes, and it is calculated based on the probability of being 1 or 0. As we have no other information about probabilities at the very beginning, when the tree just starts to grow, it takes the proportions of 0s and 1s as the probabilities in the Gini formula. As a result, class 0 (the stayers) becomes more influential, as it makes up 76% of the observations in our dataset. This is the reason our algorithm was able to correctly predict 0s but not 1s. To solve it, we just need to tell Python to balance the class weights, which makes the probabilities of being 0 and 1 both equal to 50%. This will probably reduce the general accuracy as a result of the increased Gini, but AUC and especially Recall should improve, as now both classes are equally important.
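
The effect of balancing can be sketched numerically. The example assumes the standard two-class Gini formula, 1 - (p² + (1 - p)²), and a made-up target with 76 stayers and 24 leavers; the helper `gini` is hypothetical, while `compute_class_weight` is the sklearn utility that produces the "balanced" weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight


def gini(p_leave):
    """Gini index of a node where a share p_leave of observations are 1s."""
    return 1 - (p_leave ** 2 + (1 - p_leave) ** 2)


# Hypothetical target with 76% stayers (0) and 24% leavers (1)
target = np.array([0] * 76 + [1] * 24)

gini_observed = gini(target.mean())  # observed proportions as probabilities
gini_balanced = gini(0.5)            # both classes treated as equally likely

# "balanced" weights: n_samples / (n_classes * number of rows in the class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=target)
weighted_counts = weights * np.bincount(target)

print("Gini with observed proportions:", round(gini_observed, 4))
print("Gini with balanced classes:", gini_balanced)
print("class weights:", weights.round(4))
print("weighted class counts:", weighted_counts)
```

After weighting, both classes carry the same total weight, which is why the starting Gini rises to its maximum of 0.5.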

3. Let's practice!
01:33 - 01:39
Let's now implement this change and see what happens.



#### Balancing classes

Class imbalance can significantly affect prediction results, as shown by the difference between the recall and accuracy scores. To solve the imbalance, equal weights are usually given to each class. Using the class_weight argument in sklearn's DecisionTreeClassifier, one can make the classes "balanced".

Let’s correct our model by solving its imbalance problem:

- first, you’re going to set up a model with balanced classes
- then, you will fit it to the training data
- finally, you will check its accuracy on the test set

The variables features_train, target_train, features_test and target_test are already available in your workspace.

Instructions

- Initialize the Decision Tree Classifier, prune your tree by limiting its maximum depth to 5, and balance the class weights.
- Fit the new model.
- Print the accuracy score of the prediction (in percentage points) for the test set.

In [None]:
# Initialize the DecisionTreeClassifier 
model_depth_5_b = DecisionTreeClassifier( max_depth=5,
                            class_weight="balanced",
                            random_state=42)

# Fit the model
model_depth_5_b.fit(features_train,target_train)

# Print the accuracy of the prediction 
# (in percentage points) for the test set
print(model_depth_5_b.score(features_test,target_test) * 100)

#### Comparison of Employee attrition models

In this exercise, your task is to compare the balanced and imbalanced (default) models using the pruned tree (max_depth=7). The imbalanced model has already been evaluated using recall and ROC/AUC scores. Complete the same steps for the balanced model.

- The variables features_train, target_train, features_test and target_test are already available in your workspace.
- An imbalanced model has already been fit for you, and its predictions are saved as prediction.
- The functions recall_score() and roc_auc_score() have been imported for you.

Instructions

- Initialize the balanced model, setting its maximum depth to 7 and its seed to 42.
- Fit it to the training set.
- Make predictions using the testing set.
- Print the recall score and ROC/AUC score.

In [None]:
# Print the recall score
print(recall_score(target_test,prediction))
# Print the ROC/AUC score
print(roc_auc_score(target_test,prediction))

# Initialize the model
model_depth_7_b = DecisionTreeClassifier(max_depth = 7, 
                    class_weight = 'balanced',
                    random_state = 42)

# Fit it to the training component
model_depth_7_b.fit(features_train, target_train)

# Make prediction using test component
prediction_b = model_depth_7_b.predict(features_test)

# Print the recall score for the balanced model
print(recall_score(target_test, prediction_b))

# Print the ROC/AUC score for the balanced model
print(roc_auc_score(target_test, prediction_b))

## D. Choosing the best turnover prediction model

In this final chapter, you will learn how to use cross-validation to avoid overfitting the training data. You will also learn how to know which features are impactful, and which are negligible. Finally, you will use these newly acquired skills to build a better performing Decision Tree!

### 1. Hyperparameter tuning

Welcome to the final chapter of the course. Congratulations on making it this far. By now, we are familiar with the key approaches to tune and evaluate our model. However, one question that you may have been asking is how we decide whether, for example, the maximum depth of the tree should be set to 5 or 6 or 10 or any other value. The same goes for the other parameters covered until now. The answer is very simple: we just try different values and find the one that provides the best possible predictions.

2. GridSearch
00:48 - 02:03
Maximum depth, minimum sample size and similar parameters that need to be tuned to find the best value are known as hyperparameters. To find the optimal values for those hyperparameters, one creates a grid, a list of applicable values to test, and then searches among those values for the one that achieves the highest accuracy. For example, the maximum depth should not take very high values, as the tree will start to overfit, but low values are not acceptable either, as they may provide biased and less accurate predictions. For that reason, let's try to find the optimal value between 5 and 20. Similarly, for the minimum sample size in the leaf nodes, let's check values between 50 and 450 with a step of 50. Once those values are generated inside a list, the only thing left is to develop a Decision Tree for every possible combination of those values and compare them to find the values that provide the best performance on the test set. This process is known as GridSearch, and while it may sound confusing, the implementation in Python using sklearn is fairly easy.
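
To get a feeling for the size of this grid, the sketch below enumerates the combinations with sklearn's `ParameterGrid`. The variable names `depth`, `samples` and `grid` are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

depth = list(range(5, 21))          # 5, 6, ..., 20
samples = list(range(50, 500, 50))  # 50, 100, ..., 450

# Every combination of one max_depth value with one min_samples_leaf value
grid = ParameterGrid({"max_depth": depth, "min_samples_leaf": samples})

print("depth values:", len(depth))
print("sample size values:", len(samples))
print("combinations to fit:", len(grid))
```

Every additional hyperparameter multiplies the number of trees to fit, so it pays to keep each list of candidate values short.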

3. Cross-Validation
02:03 - 02:36
While the train/test split ensures that the model does not overfit the training component, hyperparameter tuning may result in overfitting the test component. As a solution, one is encouraged to validate the model on different test components, which is achieved using **Cross Validation**. The latter is the general case of the train/test split, as it splits the data into **k** components or folds, where each component has the opportunity of being the **test component**. In this example picture we have 5 folds, and during each iteration one of the components is the test set while the others are used for training. Having different folds ensures that our model does not overfit the test component. This is exactly what GridSearch in sklearn uses to understand which model is better.
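
The rotation of the test component can be made visible with sklearn's `KFold`. This is a sketch on ten made-up row indices; note that when sklearn's cross-validation helpers receive a classifier and an integer number of folds, they use a stratified variant of the same idea by default.

```python
import numpy as np
from sklearn.model_selection import KFold

observations = np.arange(10)  # ten hypothetical rows
kfold = KFold(n_splits=5)     # no shuffling, so the folds are contiguous

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(observations), start=1):
    test_folds.append(test_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every row lands in the test component exactly once and in the training component in the remaining four iterations.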

4. Let's practice!
02:36 - 02:42
Now let's try some examples.

#### Cross-validation using sklearn

As explained in Chapter 2, overfitting the dataset is a common problem in analytics. This happens when a model has learned the data too closely: it performs very well on the dataset it was trained on, but fails to generalize outside of it.

While the train/test split technique you learned in Chapter 2 ensures that the model does not overfit the training set, hyperparameter tuning may result in overfitting the test component, since it consists of tuning the model to get the best prediction results on the test set. Therefore, it is recommended to validate the model on different testing sets. K-fold cross-validation allows us to achieve this:

- it splits the dataset into k folds, using one fold as the testing set and the remaining folds as the training set
- it fits the model, makes predictions and calculates a score (you can specify if you want the accuracy, precision, recall…)
- it repeats the process k times in total, so that each fold serves as the testing set once
- it outputs the k scores (10 in this exercise), which are usually averaged
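
A self-contained sketch of these steps with `cross_val_score` follows. The synthetic data from `make_classification` and the names `X`, `y` and `clf` are assumptions standing in for the course's features, target and model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features and target (assumption, not the course data)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.76, 0.24], random_state=42)
clf = DecisionTreeClassifier(max_depth=5, random_state=42)

# One score per fold; classifiers are scored on accuracy by default
scores = cross_val_score(clf, X, y, cv=10)

# The scoring argument switches to another metric, here recall
recalls = cross_val_score(clf, X, y, cv=10, scoring="recall")

print("accuracy per fold:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
print("mean recall:", round(recalls.mean(), 3))
```

The function returns the individual fold scores, so the averaging is left to you, for instance with `scores.mean()`.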

In this exercise, you will use Cross Validation on our dataset, and evaluate our results with the cross_val_score function.

Instructions

- Import the function for implementing cross-validation, cross_val_score(), from the module sklearn.model_selection.
- Print the cross-validation score for your model, specifying 10 folds with the cv argument.

In [None]:
# Import the function for implementing cross validation
from sklearn.model_selection import cross_val_score

# Use that function to print the cross validation score for 10 folds
print(cross_val_score(model,features,target,cv=10))

#### Setting up GridSearch parameters

A hyperparameter is a parameter that you set before training, as opposed to one that the model learns from the data. For example, max_depth and min_samples_leaf are hyperparameters of DecisionTreeClassifier(). Hyperparameter tuning is the process of testing different values of hyperparameters to find the optimal ones: those that give the best predictions according to your objectives. In sklearn, you can use grid search to test different combinations of hyperparameters. Even better, GridSearchCV() tests the different combinations and runs cross-validation on them in one step!

In this exercise, you are going to prepare the different values you want to test for max_depth and min_samples_leaf. You will then put these in a dictionary, because that’s what is required for GridSearchCV():

- the dictionary keys will be the hyperparameter names
- the dictionary values will be the lists of hyperparameter values you want to test

Instead of writing all the values manually, you will use the range() function, which generates values incrementally. For example, range(1, 10, 2) generates values from 1 (included) to 10 (not included) in increments of 2. Converted to a list, the result is [1, 3, 5, 7, 9].

Instructions

- Following the format in the example above, generate values for the maximum depth ranging from 5 to 20 with increments of 1
- Do the same for the minimum sample size with values from 50 to 450 with increments of 50
- Create the dictionary by mapping max_depth and min_samples_leaf to their respective values to try, using the variables you just created

In [None]:
# Generate values for maximum depth: 5, 6, ..., 20
depth = list(range(5, 21))

# Generate values for minimum sample size: 50, 100, ..., 450
samples = list(range(50, 500, 50))

# Create the dictionary with parameters to be checked
parameters = dict(max_depth=depth, min_samples_leaf=samples)

print(parameters)

#### Implementing GridSearch

You can now use the sklearn GridSearchCV() function to find the best combination of all of the max_depth and min_samples_leaf values you generated in the previous exercise.

Instructions

- Import the GridSearchCV function
- Apply a GridSearchCV() function to your model using the parameters dictionary you defined earlier. Save this as param_search.
- Fit param_search to the training dataset.
- Print the best parameters found using best_params_ attribute.

In [None]:
# import the GridSearchCV function
from sklearn.model_selection import GridSearchCV

# set up parameters: done
parameters = dict(max_depth=depth, min_samples_leaf=samples)

# initialize the param_search function using the GridSearchCV function, initial model and parameters above
param_search = GridSearchCV(model, parameters)

# fit the param_search to the training dataset
param_search.fit(features_train, target_train)

# print the best parameters found
print(param_search.best_params_)

### 2. Important features for predicting attrition
00:00 - 00:14
One of the main reasons we chose the Decision Tree algorithm is that it provides interpretability. Not only can we visualize and explain it, but we can also understand which features are important in driving the decision to leave the company.

2. Feature Importances
00:14 - 00:50
Fortunately, once the Decision Tree is developed, sklearn can easily calculate feature importances. The importance of a feature is equal to the relative decrease in Gini due to that feature. Once the calculation is done for all features, the values are rescaled to sum up to 100%. As a result, a higher percentage means the feature is more important. Usually, results show that not all the features are that important. As a consequence, if you learn that a feature is not important at all, it is suggested to drop it and run the model without that feature.
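
The rescaling can be checked on a small sketch. The data come from `make_classification` and the column names `feature_0` to `feature_5` are hypothetical; only `feature_importances_` is the sklearn attribute discussed here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): with shuffle=False the two informative
# features are the first two columns, the remaining four are noise
X, y = make_classification(n_samples=1000, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# Importances are stored as fractions that sum to 1, i.e. 100%
importances = pd.Series(clf.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print((importances * 100).round(2))
print("total:", round(importances.sum() * 100, 2), "%")
```

Features that the tree never splits on receive an importance of exactly zero, which makes them natural candidates for removal.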

3. Let's practice!
00:50 - 00:53
Let's now find the important features in our dataset.

#### Sorting important features

Among other things, Decision Trees are very popular because of their interpretability. Many models can provide accurate predictions, but Decision Trees can also quantify the effect of the different features on the target. Here, it can tell you which features have the strongest and weakest impacts on the decision to leave the company. In sklearn, you can get this information by using the feature_importances_ attribute.

In this exercise, you're going to get the quantified importance of each feature, save them in a pandas DataFrame (a Pythonic table), and sort them from the most important to the least important. The model_best Decision Tree Classifier used in the previous exercises is available in your workspace, as well as the features_test and features_train variables.

pandas has been imported as pd.

Instructions

- Use the feature_importances_ attribute to calculate relative feature importances
- Create a list of features
- Save the results inside a DataFrame using the DataFrame() function, where the features are rows and their respective values are a column
- Sort the relative_importances DataFrame to get the most important features on top using the sort_values() function and print the result

In [None]:
# Calculate feature importances using the 
# 'feature_importances_' attribute
feature_importances = model_best.feature_importances_

# Create a list of features: done
feature_list = list(features)

# Save the results inside a DataFrame using feature_list 
# as the index, so that the features are the rows
relative_importances = pd.DataFrame(index = feature_list, 
                            data=feature_importances, 
                            columns=["importance"])

# Sort values to learn most important features
relative_importances.sort_values(by="importance", 
                            ascending=False)

#### Selecting important features
In this exercise, your task is to select only the most important features that will be used by the final model. Remember that the relative importances are saved in the column importance of the DataFrame called relative_importances.

Instructions

- Select only the features with an importance value higher than 1%.
- Create a list from those features and print them (this has been done for you).
- Using the index saved in selected_list, transform both features_train and features_test to include only the features with an importance higher than 1%.

In [None]:
# SELECTING IMPORTANT FEATURES
##############################
# select only features with relative importance higher than 1%
selected_features = relative_importances[relative_importances.importance > 0.01]

# create a list from those features: done
selected_list = selected_features.index

# transform both features_train and features_test components to include only selected features
features_train_selected = features_train[selected_list]
features_test_selected = features_test[selected_list]

#### Develop and test the best model
In Chapter 3, you found out that the following parameters allow you to get a better model:

- max_depth = 8,
- min_samples_leaf = 150,
- class_weight = "balanced"

In this chapter, you discovered that some of the features have a negligible impact. You realized that you could get accurate predictions using just a small number of selected, impactful features and you updated your training and testing set accordingly, creating the variables features_train_selected and features_test_selected.

With all this information at your disposal, you're now going to develop the best model for predicting employee turnover and evaluate it using the appropriate metrics.

The features_train_selected and features_test_selected variables are available in your workspace, and the recall_score and roc_auc_score functions have been imported for you.

Instructions

- Initialize the best model using the parameters provided in the description.
- Fit the model using only the selected features from the training set.
- Make a prediction based on the selected features from the test set.
- Print the accuracy, recall and ROC/AUC scores of the model.

In [None]:
# Develop and test the best model
# Initialize the best model using parameters provided 
# in description
model_best = DecisionTreeClassifier(max_depth=8,
                min_samples_leaf = 150,
                class_weight = 'balanced', 
                random_state=42)

# Fit the model using only selected features from training set: done
model_best.fit(features_train_selected, target_train)

# Make prediction based on selected list of features from test set
prediction_best = model_best.predict(features_test_selected)

# Print the general accuracy of the model_best
print(model_best.score(features_test_selected, target_test) * 100)

# Print the recall score of the model predictions
print(recall_score(target_test, prediction_best) * 100)

# Print the ROC/AUC score of the model predictions
print(roc_auc_score(target_test, prediction_best) * 100)


### 3. Final thoughts
00:00 - 00:13
Congratulations! Now you have mastered predicting churn using Decision Trees. One thing that I would like to note here, is that although we concentrated on this model, Decision Trees are not the only choice to predict employee turnover.

2. Alternative methods
00:13 - 01:18
One very popular alternative that is used widely in HR analytics is Logistic Regression. You may still use Python and sklearn to make predictions with Logistic Regression, using the evaluation metrics and approaches discussed in this course. While a single Decision Tree is good, sometimes many are better. Tree-based algorithms like Random Forest or Gradient Boosting usually provide better results than a single Decision Tree. The reason we use a single one is that those complex models are much harder to interpret and cannot be visualized to make decisions using a decision path. Last but not least, Neural Networks are a popular alternative nowadays for many prediction tasks, including turnover prediction. However, they are considered a black box and do not provide the reasoning behind their predictions, which is very important, especially in HR. As you have completed this course, my advice now would be to take some other HR datasets and apply your predictive skills to them, following the same order of tasks that we have completed during this course.
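
As a starting point for such experiments, here is a minimal sketch that evaluates a Decision Tree, a Logistic Regression and a Random Forest with the metrics from this course. The synthetic dataset, the chosen parameter values and the names `candidates` and `results` are assumptions for illustration; real HR data will behave differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an HR dataset (assumption): about 24% leavers
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.76, 0.24], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

candidates = {
    "decision tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=42),
    "logistic regression": LogisticRegression(
        class_weight="balanced", max_iter=1000),
    "random forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    recall = recall_score(y_test, clf.predict(X_test))
    # AUC is computed from the predicted probability of leaving
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    results[name] = (recall, auc)
    print(f"{name:>20}: recall {recall:.3f}  AUC {auc:.3f}")
```

All three estimators share the same fit, predict and predict_proba interface, so the evaluation code from this course carries over unchanged.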