# Employee Performance Analysis (INX Future Inc.)

# Modeling

- Import Libraries
- Import Dataset
- Separation of Predictors and Target
- Split data into Train and Test
- Balance the dataset
- Hyperparameter Tuning
- Models
  - Random Forest
  - XGBoost
  - Decision Tree
  - Evaluation 
- Summary

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report



## Importing Processed Dataset

In [2]:
# Importing processed dataset for modeling
df = pd.read_excel('Employees_ProcessedDataset.xls')

In [3]:
# Check the columns and shape of dataset
df.columns , df.shape

(Index(['EmpDepartment_Sales', 'EmpDepartment_Development',
        'EmpEnvironmentSatisfaction', 'EmpJobLevel', 'EmpLastSalaryHikePercent',
        'EmpWorkLifeBalance', 'ExperienceYearsAtThisCompany',
        'ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion',
        'YearsWithCurrManager', 'PerformanceRating'],
       dtype='object'),
 (1200, 11))

## Separate the Data into Predictors and Target

In [4]:
## Separate the dataset into predictors and target variable.
X = df.drop(['PerformanceRating'], axis=1)
y = df.PerformanceRating

## Splitting Data (X,y) into Train and Test

In [8]:
## Splitting the target variable and predictors into train and test.
X_train , X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=41)

## Balancing Train Data (SMOTE: Synthetic Minority Oversampling Technique)
SMOTE balances a dataset by increasing the number of minority-class cases. Rather than duplicating rows, it generates new synthetic instances from the existing minority cases supplied as input. Here it is applied to the training split only, so the test set keeps its original class distribution.

In [6]:
## Balance the training classes with the SMOTE oversampling method.
from imblearn.over_sampling import SMOTE

smote = SMOTE()
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
X_train, y_train = smote.fit_resample(X_train, y_train)

y_train.value_counts()

4    656
3    656
2    656
Name: PerformanceRating, dtype: int64
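
To make the idea of generating new instances from existing minority cases concrete, here is a minimal sketch of the interpolation step. It is not the imbalanced-learn implementation: the function name `smote_like_sample`, the toy array and the choice of `k` are all assumptions made for illustration, and it assumes the minority rows are distinct.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the base point to every minority point
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 of the sorted order is the base point itself, so skip it
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_idx)]
    gap = rng.random()  # uniform in [0, 1)
    # The synthetic point lies on the segment from base to neighbour
    return base + gap * (neighbour - base)

# Toy minority-class rows (hypothetical, not the project data)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
synthetic = smote_like_sample(minority, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because each synthetic row is a convex combination of two real rows, it always stays inside the range already covered by the minority class.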

## Modelling

## Random Forest
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that works by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. It is also one of the most widely used algorithms because it is simple to apply and works across many kinds of problems.
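
The "class selected by most trees" rule can be sketched in a few lines. This is a simplified illustration with a hypothetical helper `majority_vote` and made-up votes; scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, which usually leads to the same answer.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one employee
votes = [3, 3, 2, 3, 4]
print(majority_vote(votes))  # 3
```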

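### Hyperparameter tuning using GridSearchCV
Hyperparameter optimization, or tuning, is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value controls the learning process and is set before training rather than learned from the data. GridSearchCV evaluates every combination in the supplied grid with cross-validation and reports the best-scoring one.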

In [7]:
## Hyperparameter Tuning with GridSearch CV

params = {'n_estimators':[50,100,150],
    'max_depth': [12,14,15],
    'min_samples_split':[2,3,4,5],
     'min_samples_leaf':[2,3,4,5]}

grid_model = GridSearchCV(RandomForestClassifier(), params)
grid_model.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [12, 14, 15],
                         'min_samples_leaf': [2, 3, 4, 5],
                         'min_samples_split': [2, 3, 4, 5],
                         'n_estimators': [50, 100, 150]})
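
To get a feel for the cost of this search, the grid above can be expanded by hand. The sketch below only counts candidates and does not train anything; the fold count of 5 is an assumption based on scikit-learn's default when `cv` is not passed.

```python
from itertools import product

params = {
    'n_estimators': [50, 100, 150],
    'max_depth': [12, 14, 15],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [2, 3, 4, 5],
}

# Every combination of one value per hyperparameter is one candidate model
candidates = [dict(zip(params, values)) for values in product(*params.values())]
n_folds = 5  # assumed: scikit-learn's default when cv is not passed

print(len(candidates))            # 144
print(len(candidates) * n_folds)  # 720
```

So this grid implies 144 candidate settings and, under the 5-fold assumption, 720 model fits before the final refit on the full training set.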

In [9]:
### Best parameters and score after Hyperparameter Tuning
print(grid_model.best_params_)
print(grid_model.best_score_)

{'max_depth': 14, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 50}
0.9583472184549413


In [26]:
# Building Random Forest Classifier model
# Note: max_depth=5 here, whereas the grid search above reported max_depth=14 as best.
model_ran = RandomForestClassifier(
                      n_estimators=50,
                      max_depth=5,
                      min_samples_leaf=2, min_samples_split= 3)

# fit the model
model_ran.fit(X_train, y_train)

RandomForestClassifier(max_depth=5, min_samples_leaf=2, min_samples_split=3,
                       n_estimators=50)

### Evaluation:

In [27]:
# Prediction on test data:

y_pred_ran = model_ran.predict(X_test)

print('Random Forest Accuracy Score :',"{0:.0%}".format(accuracy_score(y_test,y_pred_ran)), '\n')
print('----------------------------------------')
print('Confusion Matrix:','\n','\n',pd.crosstab(y_test,y_pred_ran))
print('----------------------------------------')
print('Classification Report:','\n','\n',classification_report(y_test,y_pred_ran))

Random Forest Accuracy Score : 96% 

----------------------------------------
Confusion Matrix: 
 
 col_0               2    3   4
PerformanceRating             
2                  47    4   0
3                   2  216   0
4                   1    6  24
----------------------------------------
Classification Report: 
 
               precision    recall  f1-score   support

           2       0.94      0.92      0.93        51
           3       0.96      0.99      0.97       218
           4       1.00      0.77      0.87        31

    accuracy                           0.96       300
   macro avg       0.97      0.90      0.93       300
weighted avg       0.96      0.96      0.96       300
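
As a worked check, the per-class figures in the classification report can be recomputed directly from the confusion matrix printed above. This sketch copies the nine counts by hand into an array (rows are true ratings, columns are predictions); the variable names are illustrative.

```python
import numpy as np

# Rows are true ratings (2, 3, 4); columns are predicted ratings (2, 3, 4)
cm = np.array([[47, 4, 0],
               [2, 216, 0],
               [1, 6, 24]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts
recall = true_positives / cm.sum(axis=1)     # row sums = actual counts (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label, p, r, f in zip([2, 3, 4], precision, recall, f1):
    print(label, round(p, 2), round(r, 2), round(f, 2))
print('accuracy', round(accuracy, 2))
```

For rating 4, for example, precision is 24/24 = 1.00 while recall is 24/31 ≈ 0.77, which is why its F1 score is the lowest of the three classes.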



## XGBoost
XGBoost is a scalable and accurate implementation of gradient boosting machines. It was designed with model performance and computational speed as its main goals, and it builds an ensemble of boosted trees in which each new tree corrects the errors of the trees before it.

In [None]:
## Hyperparameter Tuning with GridSearch CV

params_xg = {'n_estimators':[50,100,150],
          'max_depth': [12,14,15],
          'min_child_weight':[3,4,5],
         'learning_rate':[0.1,0.2,0.5]}

grid_model_xg = GridSearchCV(XGBClassifier(), params_xg)
grid_model_xg.fit(X_train, y_train)

In [78]:
### Best parameters and score after Hyperparameter Tuning
print(grid_model_xg.best_params_)
print(grid_model_xg.best_score_)

{'learning_rate': 0.5, 'max_depth': 12, 'min_child_weight': 3, 'n_estimators': 50}
0.9532710763229615


In [38]:
# Building XGBClassifier model
# Note: these values differ from the grid-search best parameters printed above.

from xgboost import XGBClassifier

model_xg = XGBClassifier(learning_rate=0.6, max_depth=14, min_child_weight=5, n_estimators=10)

## Fit the model
model_xg.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.6, max_delta_step=0, max_depth=14,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=10, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
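
Depending on the installed XGBoost version, `XGBClassifier.fit` may reject class labels that do not start at 0, and the ratings here are 2, 3 and 4. If that happens, one possible workaround is to encode the labels before fitting and decode the predictions afterwards. The sketch below uses toy arrays rather than the project data, and `predicted_encoded` is a hypothetical model output; scikit-learn's `LabelEncoder` offers the same mapping.

```python
import numpy as np

y = np.array([3, 2, 4, 3, 3, 2])  # toy ratings, not the project data

# classes holds the sorted original labels; y_encoded holds their positions
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)    # [2 3 4]
print(y_encoded)  # [1 0 2 1 1 0]

# After predicting encoded labels, index into classes to recover the ratings
predicted_encoded = np.array([0, 2, 1])  # hypothetical model output
print(classes[predicted_encoded])  # [2 4 3]
```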

### Evaluation:

In [58]:
# Prediction on test data:

y_pred_xg = model_xg.predict(X_test)

print('XGBClassifier Accuracy Score :',"{0:.0%}".format(accuracy_score(y_test,y_pred_xg)), '\n')
print('----------------------------------------')
print('Confusion Matrix:','\n','\n',pd.crosstab(y_test,y_pred_xg))
print('----------------------------------------')
print('Classification Report:','\n','\n',classification_report(y_test,y_pred_xg))

XGBClassifier Accuracy Score : 95% 

----------------------------------------
Confusion Matrix: 
 
 col_0               2    3   4
PerformanceRating             
2                  46    5   0
3                   2  213   3
4                   0    6  25
----------------------------------------
Classification Report: 
 
               precision    recall  f1-score   support

           2       0.96      0.90      0.93        51
           3       0.95      0.98      0.96       218
           4       0.89      0.81      0.85        31

    accuracy                           0.95       300
   macro avg       0.93      0.90      0.91       300
weighted avg       0.95      0.95      0.95       300



## Decision Tree
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements. As a classifier, it splits the data on one feature at a time and assigns a class at each leaf.
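
The point that a tree contains only conditional control statements can be shown with a hand-written stand-in. The function name `toy_rating_rule` and both thresholds below are made up for illustration and are not taken from the fitted model; only the idea of nested conditions on features is the same.

```python
def toy_rating_rule(env_satisfaction, salary_hike_percent):
    """Hand-written stand-in for a fitted tree; thresholds are hypothetical."""
    if env_satisfaction <= 2:
        return 2  # low environment satisfaction leads to the lowest rating
    if salary_hike_percent >= 20:
        return 4  # high satisfaction and a large salary hike
    return 3      # everything else falls into the middle rating

print(toy_rating_rule(1, 15))  # 2
print(toy_rating_rule(4, 22))  # 4
print(toy_rating_rule(3, 12))  # 3
```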

In [42]:
## Hyperparameter Tuning with GridSearch CV

params_tr = {
    'max_depth': [12,14,15],
    'min_samples_split':[2,3,4],
    'min_samples_leaf':[2,3,4,5],
    'max_features':[8,9,10]
                      }

grid_model_tr = GridSearchCV(DecisionTreeClassifier(), params_tr)
grid_model_tr.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [12, 14, 15], 'max_features': [8, 9, 10],
                         'min_samples_leaf': [2, 3, 4, 5],
                         'min_samples_split': [2, 3, 4]})

In [43]:
### Best parameters and score after Hyperparameter Tuning
print(grid_model_tr.best_params_)
print(grid_model_tr.best_score_)

{'max_depth': 14, 'max_features': 8, 'min_samples_leaf': 5, 'min_samples_split': 3}
0.9144444444444444


In [56]:
# Building DecisionTreeClassifier model
# Note: these values differ from the grid-search best parameters printed above.

model_tr = DecisionTreeClassifier(max_depth=10,
                                 max_features=9,
                                  min_samples_leaf=2,
                                  min_samples_split=2)

model_tr.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, max_features=9, min_samples_leaf=2)

### Evaluation:

In [59]:
# Prediction on test data:

y_pred_tr = model_tr.predict(X_test)

print('DecisionTreeClassifier Accuracy Score :',"{0:.0%}".format(accuracy_score(y_test,y_pred_tr)), '\n')
print('----------------------------------------')
print('Confusion Matrix:','\n','\n',pd.crosstab(y_test,y_pred_tr))
print('----------------------------------------')
print('Classification Report:','\n','\n',classification_report(y_test,y_pred_tr))

DecisionTreeClassifier Accuracy Score : 94% 

----------------------------------------
Confusion Matrix: 
 
 col_0               2    3   4
PerformanceRating             
2                  47    4   0
3                   5  212   1
4                   1    6  24
----------------------------------------
Classification Report: 
 
               precision    recall  f1-score   support

           2       0.89      0.92      0.90        51
           3       0.95      0.97      0.96       218
           4       0.96      0.77      0.86        31

    accuracy                           0.94       300
   macro avg       0.93      0.89      0.91       300
weighted avg       0.94      0.94      0.94       300



## Summary
- Random Forest Classifier with hyperparameter tuning (GridSearchCV) gives an accuracy of 96%, with F1 scores of 93%, 97% and 87% for ratings 2, 3 and 4 respectively. It is the best-performing of the three models.

- XGBoost Classifier with hyperparameter tuning gives an accuracy of 95%, with F1 scores of 93%, 96% and 85% for ratings 2, 3 and 4 respectively. It performs slightly better than the Decision Tree.

- DecisionTreeClassifier with hyperparameter tuning gives an accuracy of 94%, with F1 scores of 90%, 96% and 86% for ratings 2, 3 and 4 respectively.

- All three models have their lowest recall on rating 4 (0.77 to 0.81), the smallest class in the test set with 31 of 300 samples.