# Predicting Employee Retention

# Data Pre-processing

In this section, we prepare the data in a format the model can use. NumPy provides multi-dimensional arrays and matrices, along with mathematical functions that operate on them. We use the pandas library for its data analysis and manipulation tools.

In [637]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/mwitiderrick/kerasDO/master/HR_comma_sep.csv")

df.head() previews the first five records of the DataFrame.

In [639]:
df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


# Creating Dummy Variables 
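The dummy variable trap arises when one dummy column can be predicted exactly from the others. For a categorical feature with N categories, the N dummy columns always sum to 1, so they are perfectly correlated as a group. This redundancy can hurt model performance, so we drop one dummy column per feature (drop_first=True) and keep N-1 dummy variables.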

In [640]:
feats = ['department','salary']
df_final = pd.get_dummies(df,columns=feats,drop_first=True)
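
To make the N-1 rule concrete, here is a minimal sketch on a hypothetical three-row toy frame (not the HR dataset; only the column name salary is borrowed from it). It shows that the full set of dummy columns always sums to 1 per row, and which column drop_first=True removes.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# All N dummy columns: each row has exactly one "on" column.
full = pd.get_dummies(toy, columns=["salary"])

# N-1 dummy columns: the first category (alphabetically, "high") is dropped
# and becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=["salary"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(full.sum(axis=1).tolist())
```

A row with zeros in every remaining salary column therefore represents the dropped category.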

# Separating Training and Testing Datasets 

We split the dataset so that the model has no access to the testing data during training. The model learns only from the training data, and we then measure its performance on the testing data. With test_size=0.3, 30% of the records are held out for testing.

In [641]:
from sklearn.model_selection import train_test_split

In [642]:
X = df_final.drop(['left'],axis=1).values
y = df_final['left'].values

In [643]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
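
The split above is random, so the numbers later in this notebook change from run to run. The sketch below, on synthetic data, shows two optional arguments that are assumptions on our part and are not used in the cell above: random_state makes the split repeatable, and stratify keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, about 24% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 76 + [1] * 24)

# random_state and stratify are illustrative additions, not part of the original cell.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)      # 70 training rows, 30 testing rows
print(y_tr.mean(), y_te.mean())    # similar share of positives in both parts
```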

# Transforming/Scaling the Data

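Scaling puts all features on a comparable range, which helps training converge efficiently. The code below standardizes each feature to a mean of 0 and a standard deviation of 1. This step is crucial because the features are measured on different scales: satisfaction_level takes values such as 0.38, while average_montly_hours takes values such as 157. The scaler is fitted on the training set only and then applied unchanged to the test set.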

In [644]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
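
As a rough check of what the scaler does, the sketch below standardizes a few values copied from the preview above and compares the result with the formula (x - mean) / std, using the training statistics only. The variable names are ours and purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns taken from the preview rows: satisfaction_level, average_montly_hours.
train = np.array([[0.38, 157.0], [0.80, 262.0], [0.11, 272.0], [0.72, 223.0]])
test = np.array([[0.37, 159.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
test_scaled = scaler.transform(test)         # reuses the training mean and std

# Manual equivalent: population standard deviation (ddof=0), training statistics.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, manual)
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step.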

In [645]:
import sys

!$sys.executable -m pip install tensorflow



# Building the Artificial Neural Network

We now use Keras to build the deep learning model. The model has three layers: input, hidden, and output. The input layer is where we pass the features of the dataset (18 after dummy encoding, hence input_dim=18). The hidden layer, with 9 units and ReLU activation, performs the computations and passes the result to the output layer. The output layer, a single sigmoid unit, delivers the model's result as a probability between 0 and 1.

In [646]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [647]:
classifier = Sequential()

In [648]:
classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))

In [649]:
classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))

In [650]:
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
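
To show what these two Dense layers compute, here is a NumPy-only sketch of the forward pass for an 18-9-1 network. The weights are small random placeholders (an assumption for illustration), not the values Keras would learn, and the helper names are ours.

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and zeroes out negative ones.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Sigmoid squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 18, 9

# Placeholder weights and zero biases; a trained model would have learned values.
W1 = rng.uniform(-0.05, 0.05, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, size=(n_hidden, 1))
b2 = np.zeros(1)

x = rng.normal(size=(5, n_features))   # five hypothetical scaled employees
hidden = relu(x @ W1 + b1)             # hidden layer: 9 activations per row
prob = sigmoid(hidden @ W2 + b2)       # output layer: one probability per row

n_params = W1.size + b1.size + W2.size + b2.size
print(prob.shape, n_params)
```

The parameter count (18 x 9 weights + 9 biases, then 9 weights + 1 bias) should match what classifier.summary() reports for the model above.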

# Fitting the Classifier to the Dataset

In [651]:
classifier.fit(X_train, y_train, batch_size = 10, epochs = 1)



<keras.callbacks.History at 0x20c500a73a0>

# Running Predictions on the Test Set

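We will now use the testing dataset to test our model. The predict method returns a probability for each employee, which we then convert to a True/False label using a 0.5 threshold.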

In [652]:
y_pred = classifier.predict(X_test)



In [653]:
y_pred = (y_pred > 0.5)
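
The thresholding step can be sketched on a few made-up probabilities (hypothetical values, not model output). Note that the comparison is strict, so a probability of exactly 0.5 is labelled as the negative class.

```python
import numpy as np

# Hypothetical sigmoid outputs with the same (n, 1) shape that predict returns.
probs = np.array([[0.91], [0.50], [0.12], [0.67]])

# Boolean labels, then flattened integer labels for convenience.
labels_bool = probs > 0.5
labels = labels_bool.astype(int).ravel()

print(labels)
```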

# Checking the Confusion Matrix

In this step we use a confusion matrix to check the number of correct and incorrect predictions. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the first row reports true negatives and false positives, and the second row reports false negatives and true positives.


In [654]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[3238,  157],
       [ 653,  452]], dtype=int64)
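
The matrix above can be turned into summary metrics. The sketch below reuses the four counts from the output and derives accuracy, precision, and recall for the "left" class; the variable names are ours.

```python
import numpy as np

# Counts copied from the confusion matrix output above.
cm_demo = np.array([[3238, 157],
                    [653, 452]])

# scikit-learn order for binary labels: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)             # of predicted leavers, how many really left
recall = tp / (tp + fn)                # of actual leavers, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.820 precision=0.742 recall=0.409
```

For this run, accuracy looks reasonable, but the recall shows that most employees who actually left were not flagged by the model.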

# Making a Prediction 

In [655]:
new_pred = classifier.predict(sc.transform(np.array([[0,0.2,3., 28., 1., 0.,0.,0.,0., 0.,0.,0.,0.,0.,1.,0., 0.,1.]])))



In [656]:
new_pred = (new_pred > 0.5)
new_pred

array([[False]])

In [657]:
new_pred = classifier.predict(sc.transform(np.array([[0.45,0.48 ,2., 245., 3., 0.,0.,0.,0., 0.,0.,0.,0.,0.,0,0., 0.,1.]])))



In [658]:
new_pred = (new_pred > 0.5)
new_pred

array([[False]])

In [659]:
new_pred = classifier.predict(sc.transform(np.array([[0.9,0.7 ,5., 240., 0. , 0.,0.,0.,0., 0.,0.,0.,0.,0.,1.,0., 0.,1.]])))




In [660]:
new_pred = (new_pred > 0.5)
new_pred

array([[False]])

# Improving Model Accuracy

In [661]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

In [662]:
def make_classifier():
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer = "uniform", activation = "relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer = "uniform", activation = "sigmoid"))
    classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
    return classifier

In [663]:
classifier = KerasClassifier(build_fn = make_classifier, batch_size=10, epochs=1)


In [667]:
accuracies = cross_val_score(estimator = classifier,X = X_train,y = y_train,cv = 10,n_jobs = -1)

In [668]:
mean = accuracies.mean()
mean

0.8141704022884368

In [669]:
variance = accuracies.var()
variance

0.0031464708918996463
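
The cross-validation pattern used above can be tried quickly without Keras. The sketch below swaps in a scikit-learn LogisticRegression on synthetic data as a stand-in estimator (an assumption for illustration; it is not the notebook's network) and reports the mean and spread of the 10 fold accuracies. It also converts the variance printed above into a standard deviation, which is easier to compare with the mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 18 features, mirroring the notebook's input width.
X_demo, y_demo = make_classification(n_samples=500, n_features=18, random_state=0)

# 10-fold cross-validation: train on 9 folds, score on the remaining fold, 10 times.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=10, scoring="accuracy"
)
print(scores.mean(), scores.std())

# Standard deviation implied by the variance reported in the output above.
std_from_variance = np.sqrt(0.0031464708918996463)
print(std_from_variance)
```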

# Adding Dropout Regularization

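Predictive models are prone to a problem known as overfitting, in which the model memorizes the training set and fails to generalize to data it has not seen. We counteract this by adding dropout regularization, which randomly sets a fraction of a layer's outputs to zero during training. With rate = 0.1, roughly 10% of the hidden layer's outputs are dropped at each training step.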

In [670]:
from keras.layers import Dropout

classifier = Sequential()
classifier.add(Dense(9, kernel_initializer = "uniform", activation = "relu", input_dim=18))
classifier.add(Dropout(rate = 0.1))
classifier.add(Dense(1, kernel_initializer = "uniform", activation = "sigmoid"))
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
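
What a dropout layer does during training can be sketched in NumPy. This is a simplified illustration of the common "inverted dropout" scheme, where the surviving activations are scaled up by 1 / (1 - rate) so their expected value is unchanged; the function below is our own helper, not a Keras API.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out roughly `rate` of the activations and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((10000, 9))            # hypothetical hidden-layer outputs
dropped = dropout(hidden, rate=0.1, rng=rng)

zero_fraction = (dropped == 0).mean()   # close to the dropout rate
print(zero_fraction, dropped.mean())
```

At prediction time dropout is switched off, so every unit contributes to the output.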

# Hyperparameter Tuning

Grid search is a technique for experimenting with different model parameters: it trains and cross-validates the model for every combination of the candidate values and keeps the combination that gives the best accuracy.

In [671]:
from sklearn.model_selection import GridSearchCV
def make_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer = "uniform", activation = "relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer = "uniform", activation = "sigmoid"))
    classifier.compile(optimizer= optimizer,loss = "binary_crossentropy",metrics = ["accuracy"])
    return classifier

In [672]:
classifier = KerasClassifier(build_fn = make_classifier)


In [673]:
params = {
    'batch_size':[20,35],
    'epochs':[2,3],
    'optimizer':['adam','rmsprop']
}

In [674]:
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           scoring="accuracy",
                           cv=2)
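
To see how much work this grid implies, the sketch below enumerates the combinations with scikit-learn's ParameterGrid and counts the training runs. It assumes GridSearchCV's default refit=True, which retrains the best combination once on the full training set.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the params dictionary above.
params_demo = {
    "batch_size": [20, 35],
    "epochs": [2, 3],
    "optimizer": ["adam", "rmsprop"],
}

combos = list(ParameterGrid(params_demo))
cv_folds = 2

# One fit per combination per fold, plus one final refit of the best combination.
n_fits = len(combos) * cv_folds + 1

print(len(combos), n_fits)
for combo in combos:
    print(combo)
```

The cost grows multiplicatively with each added parameter or candidate value, which is why the grid here is kept small.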

In [675]:
grid_search = grid_search.fit(X_train,y_train)

Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [676]:
best_param = grid_search.best_params_
best_accuracy = grid_search.best_score_

In [677]:
best_param

{'batch_size': 35, 'epochs': 3, 'optimizer': 'adam'}

In [678]:
best_accuracy

0.8276059657621859