# Fundamentals of Machine Learning - Assignment 2
## Introduction
In this exercise I will predict whether an employee leaves the company, using the IBM HR analytics dataset from Kaggle.
## Data cleaning
The data consists of a number of variables [which are described on Kaggle](https://www.kaggle.com/code/adityawithdoublea/ibm-hr-analytic-for-attrition-using-regression/data). There are 35 columns. I am predicting *Attrition*: Did the employee leave the company (yes) or not (no)?


In [64]:
import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [65]:
# Loading the dataset
dfg = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
dfg.head()

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


The variables I will focus on are:
- Attrition
- Gender
- Hourly rate
- Work-Life-Balance
- Job involvement
- Job satisfaction
- Environment satisfaction

In [66]:
# Subsetting the dataset
df = dfg[["Attrition",  "Gender", "HourlyRate", "WorkLifeBalance", "JobInvolvement", "JobSatisfaction", "EnvironmentSatisfaction"]]
df.head(n=10)

,Attrition,Gender,HourlyRate,WorkLifeBalance,JobInvolvement,JobSatisfaction,EnvironmentSatisfaction
0,Yes,Female,94,1,3,4,2
1,No,Male,61,3,2,2,3
2,Yes,Male,92,3,2,3,4
3,No,Female,56,3,3,3,4
4,No,Male,40,3,3,2,1
5,No,Male,79,2,3,4,4
6,No,Female,81,2,4,1,3
7,No,Male,67,3,3,3,4
8,No,Male,44,3,2,3,4
9,No,Male,94,2,3,3,3


In [67]:
# Looking at the data
df.describe()

,HourlyRate,WorkLifeBalance,JobInvolvement,JobSatisfaction,EnvironmentSatisfaction
count,1470.0,1470.0,1470.0,1470.0,1470.0
mean,65.891156,2.761224,2.729932,2.728571,2.721769
std,20.329428,0.706476,0.711561,1.102846,1.093082
min,30.0,1.0,1.0,1.0,1.0
25%,48.0,2.0,2.0,2.0,2.0
50%,66.0,3.0,3.0,3.0,3.0
75%,83.75,3.0,3.0,4.0,4.0
max,100.0,4.0,4.0,4.0,4.0


In [68]:
# Getting rid of empty cells
df = df.dropna()  # dropna() returns a new DataFrame, so the result has to be assigned
df.head(n=10)

,Attrition,Gender,HourlyRate,WorkLifeBalance,JobInvolvement,JobSatisfaction,EnvironmentSatisfaction
0,Yes,Female,94,1,3,4,2
1,No,Male,61,3,2,2,3
2,Yes,Male,92,3,2,3,4
3,No,Female,56,3,3,3,4
4,No,Male,40,3,3,2,1
5,No,Male,79,2,3,4,4
6,No,Female,81,2,4,1,3
7,No,Male,67,3,3,3,4
8,No,Male,44,3,2,3,4
9,No,Male,94,2,3,3,3


In [69]:
# Checking if there are any null values left in each column
df.isnull().sum()

Attrition                  0
Gender                     0
HourlyRate                 0
WorkLifeBalance            0
JobInvolvement             0
JobSatisfaction            0
EnvironmentSatisfaction    0
dtype: int64

In [70]:
# Checking how many employees left the company
df["Attrition"].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

The dataset contains records for 1470 employees. Most of them are still with the company: only 237 (16.12%) have left. The difficulty will be in detecting attrition, as there are far fewer examples of employees who left the company.
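The class shares can be checked directly from the counts printed above. The snippet below is a minimal sketch that rebuilds the *Attrition* column from those two counts (the name `attrition` is only a stand-in; in the notebook one would call this on `df["Attrition"]`).

```python
import pandas as pd

# Stand-in for df["Attrition"], rebuilt from the counts reported above
# (1233 "No", 237 "Yes"). This is an assumption made to keep the sketch self-contained.
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")

# normalize=True returns the share of each class instead of the raw count
shares = attrition.value_counts(normalize=True)

# Express the shares as percentages, rounded to two decimals
print((shares * 100).round(2))
```

The share of leavers is 237 out of all 1470 employees, not 237 relative to the 1233 who stayed.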

## Splitting the data
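In this step I will be splitting the data into a training and a testing set. I want to predict Attrition (y) based on hourly rate, work-life-balance, job involvement, job satisfaction, and environment satisfaction (X). Gender is part of the subset but is not used as a feature, because the column selection below starts at HourlyRate.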

In [71]:
X = df.loc[:, "HourlyRate":"EnvironmentSatisfaction"]
y = df["Attrition"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
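
With so few leavers in the data, a purely random split can give the training and testing sets slightly different attrition rates. The notebook does not stratify; the following is only a sketch of what the `stratify` argument of `train_test_split` does, run on synthetic stand-in data (`X_demo`, `y_demo` and the random feature values are assumptions, not the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1470

# Synthetic stand-in features; the values are random and carry no meaning
X_demo = pd.DataFrame({
    "HourlyRate": rng.integers(30, 101, size=n),
    "JobSatisfaction": rng.integers(1, 5, size=n),
})
# Labels with the same class counts as the real data (237 "Yes", 1233 "No")
y_demo = pd.Series(["Yes"] * 237 + ["No"] * 1233, name="Attrition")

# stratify=y_demo keeps the Yes/No proportions (almost) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(f"attrition rate, train: {train_rate:.3f}, test: {test_rate:.3f}")
```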

## Training the algorithm

In [72]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_estimators=100) #RF is a random algorithm, so to get the same results we need to use random_state
rf = rf.fit(X_train, y_train)

In [73]:
rf.score(X_test,y_test)

0.7868480725623582

The model shows an accuracy of 78.68%.
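
Accuracy alone is hard to judge on imbalanced data, so it helps to compare it with a model that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` on labels rebuilt from the test-set support reported further down (364 "No", 77 "Yes"). As a simplification, the dummy model is fitted on those same labels; in the notebook it would be fitted on `y_train`, where "No" is also the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test labels rebuilt from the reported support: 364 stayers, 77 leavers
y_test_demo = np.array(["No"] * 364 + ["Yes"] * 77)

# The dummy model ignores the features, so a column of zeros is enough
X_dummy = np.zeros((len(y_test_demo), 1))

# "most_frequent" always predicts the majority class, here "No"
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_test_demo)

baseline_acc = baseline.score(X_dummy, y_test_demo)
print(f"majority-class accuracy: {baseline_acc:.4f}")  # 364 / 441
```

Always predicting "No" would score about 82.5% on this test set, which is higher than the 78.68% of the random forest.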

## Evaluating the model

In [74]:
rf.classes_

array(['No', 'Yes'], dtype=object)

In [75]:
y_pred = rf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_pred) #creates a "confusion matrix"
cm = pd.DataFrame(cm, index=["no attrition (actual)", "attrition (actual)"], columns = ["no attrition (pred)", "attrition (pred)"]) #label and make df
cm

,no attrition (pred),attrition (pred)
no attrition (actual),340,24
attrition (actual),70,7


In [81]:
# Precision for attrition: TP / (TP + FP) = 7 / (7 + 24)
7/31

0.22580645161290322

The algorithm performs rather poorly, with only 22.58% precision on detecting attrition: of the 31 employees it predicted would leave, only 7 actually left.
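
The same number can be read off the confusion matrix programmatically. This is a small sketch using the four cell values from the table above; the variable names are illustrative only.

```python
import numpy as np

# Rows are the actual classes (No, Yes), columns the predicted classes (No, Yes)
cm_values = np.array([[340, 24],
                      [70, 7]])

# ravel() flattens row by row: true neg., false pos., false neg., true pos.
tn, fp, fn, tp = cm_values.ravel()

# Precision = correct predictions of a class / all predictions of that class
precision_yes = tp / (tp + fp)   # 7 / 31
precision_no = tn / (tn + fn)    # 340 / 410

print(f"precision (attrition):    {precision_yes:.4f}")
print(f"precision (no attrition): {precision_no:.4f}")
```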

In [82]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.83      0.93      0.88       364
         Yes       0.23      0.09      0.13        77

    accuracy                           0.79       441
   macro avg       0.53      0.51      0.50       441
weighted avg       0.72      0.79      0.75       441



The precision for no attrition is quite good (83%), but the precision for attrition is only 23%. The recall for attrition is very low (0.09). That means that the algorithm misses about 91% of the employees who leave the company (70 of 77).
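
Recall and the share of missed leavers follow from the same confusion matrix. The sketch below rebuilds label vectors that reproduce its four cells (an assumption made to keep the example self-contained; in the notebook `y_test` and `y_pred` would be passed instead) and uses `recall_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import recall_score

# Label vectors reproducing the confusion matrix:
# 340 true negatives, 24 false positives, 70 false negatives, 7 true positives
y_true = np.array(["No"] * 364 + ["Yes"] * 77)
y_hat = np.array(["No"] * 340 + ["Yes"] * 24 + ["No"] * 70 + ["Yes"] * 7)

# Recall = detected leavers / all actual leavers = 7 / 77
recall_yes = recall_score(y_true, y_hat, pos_label="Yes")

# The miss rate is the complement of recall: 70 / 77
miss_rate = 1 - recall_yes

print(f"recall (attrition): {recall_yes:.4f}")
print(f"missed leavers:     {miss_rate:.2%}")
```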

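As a second attempt, I retrain the random forest with 26 trees and max_features set to 5, so that every split considers all five features. The report below shows that this changes little: precision for attrition rises from 0.23 to 0.26 and recall from 0.09 to 0.10, while accuracy stays at 0.79.

In [107]:
rf_new = RandomForestClassifier(n_estimators=26, max_features=5, random_state=1) # 26 trees; max_features=5 uses all five features at every split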
rf_new = rf_new.fit(X_train, y_train)
y_pred_new = rf_new.predict(X_test) #the predicted values
print(classification_report(y_test, y_pred_new))


              precision    recall  f1-score   support

          No       0.83      0.94      0.88       364
         Yes       0.26      0.10      0.15        77

    accuracy                           0.79       441
   macro avg       0.54      0.52      0.51       441
weighted avg       0.73      0.79      0.75       441
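
The notebook does not say how the values n_estimators = 26 and max_features = 5 were chosen. One common way to pick such settings is a cross-validated grid search; the following is only a sketch of that idea with `GridSearchCV`, run on synthetic imbalanced data from `make_classification`. The grid values, the choice of recall as the score, and the data are all assumptions, not the author's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 5 features, roughly 16% positive class (label 1)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.84, 0.16], random_state=1,
)

# Hypothetical grid; the values are chosen for illustration only
param_grid = {"n_estimators": [26, 100], "max_features": [2, 5]}

# scoring="recall" ranks the settings by recall on the positive class,
# estimated with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(f"best cross-validated recall: {search.best_score_:.3f}")
```

Because the search runs cross-validation on the data it is given, it would be fitted on the training set only, so that the test set stays untouched for the final evaluation.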

