# HR Analytics: Employee Promotion Predictor 

**Our client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager position and below) and preparing them in time. The process they currently follow is:**

1. They first identify a set of employees based on recommendations and past performance.
2. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires.
3. At the end of the program, an employee is promoted based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).

In this process, the final promotions are announced only after the evaluation, which delays the transition to new roles. Hence, the company needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

They have provided multiple attributes covering each employee's past and current performance, along with demographics. **The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.**

In [1]:
## Importing required libraries....
import pandas as pd
import numpy as np
from sklearn import metrics
import random
import seaborn as sns

pd.options.display.max_rows = None
pd.options.display.max_columns = None

## About Data:
- **employee_id**: Unique ID for employee
- **department**: Department of employee
- **region**: Region of employment (unordered)
- **education**: Education Level
- **gender**: Gender of Employee
- **recruitment_channel**: Channel of recruitment for employee
- **no_of_trainings**: Number of other trainings completed in the previous year on soft skills, technical skills, etc.
- **age**: Age of Employee
- **previous_year_rating**: Employee Rating for the previous year
- **length_of_service**: Length of service in years
- **KPIs_met >80%**: 1 if the percentage of KPIs (Key Performance Indicators) met is above 80%, else 0
- **awards_won?**: 1 if awards were won during the previous year, else 0
- **avg_training_score**: Average score in current training evaluations
- **is_promoted**: Recommended for promotion

In [2]:
## Training data...
data_train = pd.read_csv('train_LZdllcl.csv')
data_train.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [3]:
print("Dimensions of train data is :" , data_train.shape)

Dimensions of train data is : (54808, 14)


In [4]:
## Test data...
data_test = pd.read_csv('test_2umaH9m.csv')
data_test.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,8724,Technology,region_26,Bachelor's,m,sourcing,1,24,,1,1,0,77
1,74430,HR,region_4,Bachelor's,f,other,1,31,3.0,5,0,0,51
2,72255,Sales & Marketing,region_13,Bachelor's,m,other,1,31,1.0,4,0,0,47
3,38562,Procurement,region_2,Bachelor's,f,other,3,31,2.0,9,0,0,65
4,64486,Finance,region_29,Bachelor's,m,sourcing,1,30,4.0,7,0,0,61


In [5]:
print("Dimensions of test data is :" , data_test.shape)

Dimensions of test data is : (23490, 13)


#### Null Entries:

In [6]:
## Null entries in Train Data....
data_train.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [7]:
## Null entries in Test/Submission Data....
data_test.isnull().sum()

employee_id                0
department                 0
region                     0
education               1034
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    1812
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
dtype: int64

In [8]:
## Merging train and test/submission data....
## (DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index gives the same result)
df = pd.concat([data_train , data_test] , ignore_index = True)

In [9]:
df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0.0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0.0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0.0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0.0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0.0


In [10]:
## Feature wise null entries in merged data frame....
df.isnull().sum()

employee_id                 0
department                  0
region                      0
education                3443
gender                      0
recruitment_channel         0
no_of_trainings             0
age                         0
previous_year_rating     5936
length_of_service           0
KPIs_met >80%               0
awards_won?                 0
avg_training_score          0
is_promoted             23490
dtype: int64
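
Raw counts are easier to judge as a share of rows. In the merged frame (54808 + 23490 = 78298 rows), `education` is missing in roughly 4.4% of rows and `previous_year_rating` in roughly 7.6%, while the 23490 missing `is_promoted` values are simply the test rows. A minimal sketch of how such percentages could be computed, using a small made-up frame rather than the competition data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are illustrative only).
toy = pd.DataFrame({
    'education': ["Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    'previous_year_rating': [5.0, np.nan, np.nan, 3.0],
    'age': [35, 30, 34, 39],
})

# isnull().mean() gives the fraction of missing rows per column.
null_pct = (toy.isnull().mean() * 100).round(2)

# Show only the columns that actually have gaps.
print(null_pct[null_pct > 0])
```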

## Feature Engineering:

#### Education:

In [11]:
## Finding counts of all unique entries in 'education'....
df['education'].value_counts()

Bachelor's          52247
Master's & above    21429
Below Secondary      1179
Name: education, dtype: int64

In [12]:
## Applying Frequent Entry Imputation for dealing with missing entries in "Education" feature....
## Filling all Null entries with the most frequent value - "Bachelor's"...
most_frequent = df['education'].mode()[0]
df['education'] = df['education'].fillna(most_frequent)
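
Filling every gap with the most frequent value shifts the category shares toward "Bachelor's". The size of that shift can be checked from the counts printed above; the snippet below is only a worked calculation on those reported numbers, not part of the original pipeline.

```python
# Counts reported by value_counts() before imputation, plus the null count.
counts = {"Bachelor's": 52247, "Master's & above": 21429, "Below Secondary": 1179}
n_missing = 3443

known = sum(counts.values())

# Share of "Bachelor's" among known values vs. among all rows after imputation.
share_before = counts["Bachelor's"] / known
share_after = (counts["Bachelor's"] + n_missing) / (known + n_missing)

print(f"before: {share_before:.3f}, after: {share_after:.3f}")
```

The change is a little over one percentage point, so the imputation is unlikely to distort this feature much.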

#### Previous Year Rating:

In [13]:
## All possible unique entries are....
df['previous_year_rating'].dropna().unique()

array([5., 3., 1., 4., 2.])

In [14]:
## Finding counts of all unique entries in 'previous_year_rating'....
df['previous_year_rating'].value_counts()

3.0    26539
5.0    16838
4.0    14126
1.0     8903
2.0     5956
Name: previous_year_rating, dtype: int64

In [15]:
## Applying random value imputation using the top 2 most frequently occurring entries...
## The two most frequent entries are '3.0' and '5.0'; the weights (2.6 , 1.6) roughly follow their counts (26539 and 16838)....
sample_rating = df['previous_year_rating'].value_counts().head(2).index

sample = random.choices(sample_rating , weights = (2.6 , 1.6) , k = df['previous_year_rating'].isnull().sum())

df.loc[df['previous_year_rating'].isnull() , 'previous_year_rating'] = sample
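
Because the imputation draws random values, every run of the notebook fills the gaps differently. One way to make it repeatable is to use a seeded generator and derive the weights from the observed counts. This is a sketch on made-up ratings; the seed value and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed, chosen only for repeatability

# Toy ratings with gaps (illustrative values only).
ratings = pd.Series([3.0, 5.0, np.nan, 3.0, np.nan, 4.0, 3.0, 5.0, np.nan])

# Take the two most frequent observed values and weight them by their counts.
top2 = ratings.value_counts().head(2)
probs = top2.to_numpy() / top2.to_numpy().sum()

# Draw one value per missing entry and write the draws into the gaps.
mask = ratings.isnull()
ratings.loc[mask] = rng.choice(top2.index.to_numpy(), size=int(mask.sum()), p=probs)

print(ratings.tolist())
```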

### Dealing with Categorical Features:

In [16]:
## Creating dummy variables for all Categorical Features.....
cate = pd.get_dummies(df[['department', 'region', 'education', 'gender', 'recruitment_channel']] , drop_first = True)
cate.head()

Unnamed: 0,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Below Secondary,education_Master's & above,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
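
With `drop_first = True`, the first category of each feature (in sorted order) becomes the implicit baseline and gets no column, which is why the output above has `gender_m` but no `gender_f`. A small illustration on made-up rows:

```python
import pandas as pd

# Toy rows using category labels that appear in the data preview.
toy = pd.DataFrame({
    'gender': ['f', 'm', 'm'],
    'recruitment_channel': ['sourcing', 'other', 'referred'],
})

# 'f' and 'other' sort first within their columns, so they are dropped.
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())
```

A row with all zeros in a feature's dummy columns therefore represents the dropped baseline category.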


In [17]:
## Dropping unwanted features....
df1 = df.drop(['employee_id' , 'department', 'region', 'education', 'gender', 'recruitment_channel'] , axis = 1)

## Adding all categorical features....
df1 = pd.concat([df1 , cate] , axis = 1)

In [18]:
df1.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Below Secondary,education_Master's & above,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,1,35,5.0,8,1,0,49,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1
1,1,30,5.0,4,0,0,60,0.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,1,34,3.0,7,0,0,50,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
3,2,39,1.0,10,0,0,50,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,45,3.0,2,0,0,73,0.0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


## Train - Test/Submission Split:

In [19]:
training = df1.iloc[:data_train.shape[0] , :]       ## Training data will be used for building ML Model
submission = df1.iloc[data_train.shape[0]: , :]     ## Processed Submission data that we need to submit on website....

In [20]:
## Scaling the numerical features....
from sklearn.preprocessing import StandardScaler

num_cols = ['no_of_trainings' , 'age' , 'previous_year_rating' , 'length_of_service' , 'avg_training_score']

## Working on an explicit copy, so the assignment below does not write to a slice of df1
## (this avoids pandas' SettingWithCopyWarning)....
training = training.copy()

scaler = StandardScaler()
scaler.fit(training[num_cols])

training[num_cols] = scaler.transform(training[num_cols])
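
With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation, both estimated from the rows passed to `fit`. The sketch below reproduces that arithmetic with NumPy on a handful of ages taken from the data previews, to show that the statistics come from the training rows only and are then reused for new rows.

```python
import numpy as np

train_age = np.array([35.0, 30.0, 34.0, 39.0, 45.0])  # first five training ages from the preview
new_age = np.array([24.0, 31.0])                       # first two test ages from the preview

# Statistics come from the training rows only (population std, ddof=0).
mu = train_age.mean()
sigma = train_age.std(ddof=0)

train_scaled = (train_age - mu) / sigma
new_scaled = (new_age - mu) / sigma  # reuse mu and sigma; do not refit on new rows

print(train_scaled.round(3), new_scaled.round(3))
```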


#### Balancing the data set:

In [21]:
from collections import Counter
print("The Number of '1' and '0' in the training set is :"  , Counter(training['is_promoted']))

The Number of '1' and '0' in the training set is : Counter({0.0: 50140, 1.0: 4668})


##### Applying Over and Under Sampling....

In [22]:
from imblearn.under_sampling import RandomUnderSampler
under = RandomUnderSampler(sampling_strategy = 0.1)
X_under , Y_under = under.fit_resample(training.drop('is_promoted' , axis = 1), training['is_promoted'].astype('int64'))

In [23]:
print("Counting '1' and '0's :" , Counter(Y_under))

Counting '1' and '0's : Counter({0: 46680, 1: 4668})
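
A float `sampling_strategy` is the desired ratio of minority to majority rows after resampling, so 0.1 keeps ten majority rows for every minority row. The counts above follow directly from that definition, as this worked calculation on the reported numbers shows:

```python
n_minority = 4668    # promoted employees in the training data
n_majority = 50140   # non-promoted employees in the training data
sampling_strategy = 0.1

# Share of promoted employees before any resampling.
positive_rate = n_minority / (n_minority + n_majority)
print(f"positive rate: {positive_rate:.4f}")

# Majority rows kept so that minority / majority equals the requested ratio.
n_majority_kept = int(round(n_minority / sampling_strategy))
print("majority rows kept:", n_majority_kept)
print("majority rows dropped:", n_majority - n_majority_kept)
```

Only 3460 majority rows are removed here, so most of the balancing is left to the over-sampling step that follows.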


In [24]:
from imblearn.over_sampling import SMOTE
over_sampler = SMOTE()
X , Y = over_sampler.fit_resample(X_under , Y_under)

In [25]:
print("Counting '1' and '0's :" , Counter(Y))

Counting '1' and '0's : Counter({0: 46680, 1: 46680})


In [26]:
Counter(Y)

Counter({0: 46680, 1: 46680})

#### Train - Test Split of training data:

In [27]:
## Applying train test split for model building....
from sklearn.model_selection import train_test_split

x_train , x_test , y_train , y_test = train_test_split(X , Y , test_size = 0.2 , random_state = 33)
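
Note that in the cells above the data is resampled before it is split, so synthetic minority rows interpolated from a given employee can end up in both `x_train` and `x_test`, and the hold-out scores reported below are likely optimistic. A common alternative is to split first and balance only the training part. The sketch below shows that ordering on synthetic data; it uses plain random over-sampling with NumPy as a stand-in for SMOTE, and the seed, shapes, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed

# Synthetic stand-in data: 1000 rows, roughly 10% positives.
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

# 1) Hold out a test set first, before any resampling.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:200], idx[200:]
X_tr, y_tr = X[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]

# 2) Balance only the training part (random over-sampling, standing in for SMOTE).
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
keep = np.concatenate([neg, pos, extra])
X_bal, y_bal = X_tr[keep], y_tr[keep]

# The test set keeps its original, imbalanced class distribution.
print("train balance:", np.bincount(y_bal), "test balance:", np.bincount(y_te))
```

Evaluated this way, the test set reflects the class mix the model will see on the real submission data.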

In [28]:
x_train.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Below Secondary,education_Master's & above,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
35357,-0.415276,-0.235495,1.316262,0.265996,1,0,-1.075931,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
5187,-0.415276,0.025598,1.316262,0.969387,1,0,-0.477641,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
69870,-0.415276,-0.50909,1.316262,-0.158026,1,0,-0.591406,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
48957,1.226063,-0.757681,-0.289742,-0.906322,1,0,1.541588,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
28955,-0.415276,0.156145,-1.895746,0.734923,0,0,-1.225504,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


## Model:

#### Logistic Regression:

In [29]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(x_train , y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [30]:
print("The train accuracy is:" , log_reg.score(x_train , y_train)*100)
print("The test accuracy is:" , log_reg.score(x_test , y_test)*100)

The train accuracy is: 80.77067266495287
The test accuracy is: 80.30205655526991


In [31]:
pd.DataFrame(metrics.confusion_matrix(y_test , log_reg.predict(x_test)))

Unnamed: 0,0,1
0,7451,1970
1,1708,7543
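
In this matrix the rows are the actual classes and the columns are the predicted classes. The accuracy, precision, recall, and F1 figures for class 1 can be recomputed from the four cells, which is a useful cross-check on the classification report:

```python
# Confusion matrix from the logistic regression cell (rows: actual, columns: predicted).
tn, fp = 7451, 1970
fn, tp = 1708, 7543

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of those predicted promoted, how many were
recall_1 = tp / (tp + fn)      # of those actually promoted, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```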


In [32]:
print(metrics.classification_report(y_test , log_reg.predict(x_test)))

              precision    recall  f1-score   support

           0       0.81      0.79      0.80      9421
           1       0.79      0.82      0.80      9251

    accuracy                           0.80     18672
   macro avg       0.80      0.80      0.80     18672
weighted avg       0.80      0.80      0.80     18672



#### Random Forest Classifier:

In [33]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(x_train , y_train)

RandomForestClassifier()

In [34]:
print("The train accuracy is:" , rf_clf.score(x_train , y_train)*100)
print("The test accuracy is:" , rf_clf.score(x_test , y_test)*100)

The train accuracy is: 99.99330548414738
The test accuracy is: 95.28170522707798


In [35]:
pd.DataFrame(metrics.confusion_matrix(y_test , rf_clf.predict(x_test)))

Unnamed: 0,0,1
0,8914,507
1,374,8877


In [36]:
print(metrics.classification_report(y_test , rf_clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.96      0.95      0.95      9421
           1       0.95      0.96      0.95      9251

    accuracy                           0.95     18672
   macro avg       0.95      0.95      0.95     18672
weighted avg       0.95      0.95      0.95     18672



## Submission Data:

In [37]:
## applying scaling....
df_sub = submission.copy()

df_sub[['no_of_trainings', 'age', 'previous_year_rating', 
          'length_of_service', 'avg_training_score']] = scaler.transform(df_sub[['no_of_trainings', 'age', 
                                                                                   'previous_year_rating', 'length_of_service',
                                                                                   'avg_training_score']])

## Dropping Unwanted features....
sub = df_sub.drop('is_promoted' , axis = 1)

In [38]:
sub.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Below Secondary,education_Master's & above,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
54808,-0.415276,-1.410415,1.316262,-1.140785,1,0,1.018084,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
54809,-0.415276,-0.496588,-0.289742,-0.202931,0,0,-0.926359,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
54810,-0.415276,-0.496588,-1.895746,-0.437395,0,0,-1.225504,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
54811,2.867403,-0.496588,-1.092744,0.734923,0,0,0.120649,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54812,-0.415276,-0.627135,0.51326,0.265996,0,0,-0.178496,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1


In [39]:
## Predicting the results using our "Random Forest Classifier" Model....
pred_sub = rf_clf.predict(sub)

In [40]:
## Loading submission file...
sub_file = pd.read_csv('sample_submission_M0L0uXE.csv')
sub_file.head()

Unnamed: 0,employee_id,is_promoted
0,8724,0
1,74430,0
2,72255,0
3,38562,0
4,64486,0


In [41]:
sub_file['is_promoted'] = pred_sub
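
The assignment above relies on the sample submission file listing employees in the same order as the test file. The previews suggest that it does (both begin with 8724, 74430, 72255), but matching on `employee_id` removes the dependence on row order. A minimal sketch with toy stand-ins; the prediction values and the shuffled sample file are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: ids from the test preview, predictions made up for illustration.
test_ids = pd.Series([8724, 74430, 72255], name='employee_id')
preds = [0, 1, 0]

# A sample file whose rows are deliberately in a different order.
sample = pd.DataFrame({'employee_id': [72255, 8724, 74430], 'is_promoted': 0})

# Key the predictions by id, then look up each id from the sample file.
pred_by_id = pd.Series(preds, index=test_ids.to_numpy())
sample['is_promoted'] = sample['employee_id'].map(pred_by_id)

# Every id in the sample file should have received a prediction.
assert sample['is_promoted'].notnull().all()
print(sample)
```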

In [42]:
sub_file.to_csv('Results_File.csv' , index = False)