# HR Analytics

<img src = 'https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/hr_1920x480_s5WuoZs-thumbnail-1200x1200-90.jpg'>

Practice Problem: https://datahack.analyticsvidhya.com/contest/wns-analytics-hackathon-2018-1/

## HR Analytics

HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, the approach has been constraining HR. Therefore, it is surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted.

## Problem Statement

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client faces is identifying the right people for promotion *(only for manager position and below)* and preparing them in time. The process they currently follow is:

* They first identify a set of employees based on recommendations and past performance.
* Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires.
* At the end of the program, an employee gets promoted based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).

In the process described above, the final promotions are only announced after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

<img src = 'https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/09/wns_hack_im_1.jpg'>

They have provided multiple attributes about each employee's past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

## Evaluation Metric

The evaluation metric for this competition is F1 Score.
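
For reference, F1 is the harmonic mean of precision and recall on the positive class (here, `is_promoted = 1`). The following is a minimal pure-Python sketch of that definition; the function name `f1_score_binary` and the toy labels are illustrative assumptions, not part of the competition material. For binary labels, `sklearn.metrics.f1_score` should give the same value.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary labels (1 = promoted), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only
y_true_demo = [1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1]
score_demo = f1_score_binary(y_true_demo, y_pred_demo)
```

With a promotion rate of roughly 8.5%, a model that predicts "not promoted" for everyone scores 0 on this metric, which is why accuracy alone would be misleading here.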

## Public and Private Split

Test data is further randomly divided into Public (40%) and Private (60%) data.

Your initial responses will be checked and scored on the Public data.
The final rankings would be based on your private score which will be published once the competition is over.

## Environment

In [1]:
import sys
sys.version

In [2]:
!conda info --envs

# conda environments:
#
micromaster              /Users/manuel/.conda/envs/micromaster
                         /Users/manuel/.julia/conda/3
base                     /Users/manuel/opt/anaconda3
belcorp                  /Users/manuel/opt/anaconda3/envs/belcorp
courseragcp              /Users/manuel/opt/anaconda3/envs/courseragcp
iapucp                *  /Users/manuel/opt/anaconda3/envs/iapucp
mitxpro                  /Users/manuel/opt/anaconda3/envs/mitxpro
style-transfer           /Users/manuel/opt/anaconda3/envs/style-transfer
taller-dmc               /Users/manuel/opt/anaconda3/envs/taller-dmc
udacity                  /Users/manuel/opt/anaconda3/envs/udacity



## Packages

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import random
import warnings
import pickle

warnings.filterwarnings('ignore')

seed = 2020
random.seed(seed)

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 400)
sns.set()

DATA = Path('../../data') 
RAW  = DATA/'raw'
PROCESSED = DATA/'processed'
SUBMISSIONS = DATA/'submissions'    

MODEL = Path('../../model') 

In [4]:
pd.__version__

'1.1.3'

In [5]:
np.__version__

'1.19.1'

In [6]:
sklearn.__version__

'0.23.2'

In [7]:
preproc_label = 'preprocess_v1'

## Reading the data

In [8]:
os.listdir(RAW)

['variables.txt', '.DS_Store', 'test_2umaH9m.csv', 'train_LZdllcl.csv']

In [9]:
df_train = pd.read_csv(f'{RAW}/train_LZdllcl.csv')
df_test = pd.read_csv(f'{RAW}/test_2umaH9m.csv')

In [10]:
df_train.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [11]:
id_columns = 'employee_id'
target = 'is_promoted'

# Preprocessing v1

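* train / validation split with 20% held out for validation
* imputation with the training mean for continuous variables and the training mode for categorical variables, plus a missing-value flag column for each imputed variable
* outlier capping at the 99th percentile for variables whose max / 99th-percentile ratio exceeds 1.5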

## Holdout train validation

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X = df_train.drop(target, axis = 1).drop(id_columns, axis = 1)
y = df_train[target]

In [14]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = seed, stratify = y)

In [15]:
X_train.shape, X_val.shape

((43846, 12), (10962, 12))

In [16]:
# Validation-to-train ratio: a 20% / 80% split gives 0.20 / 0.80 = 0.25
X_val.shape[0] / X_train.shape[0]

0.25001140354878437

In [17]:
y_train.mean(), y_val.mean()

(0.08516170232176254, 0.08520343003101624)
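
The near-identical promotion rates in train and validation follow from `stratify = y`. As a small sketch of that effect, the snippet below repeats the split on synthetic labels; the arrays `X_demo` and `y_demo` and the 8.5% positive rate are assumptions chosen to resemble this dataset, not values read from it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary target with roughly 8.5% positives (illustrative only)
rng = np.random.RandomState(2020)
y_demo = (rng.rand(10_000) < 0.085).astype(int)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same call pattern as in the notebook: 20% validation, stratified on the target
_, _, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=2020, stratify=y_demo
)

# With stratification the two positive rates differ only by rounding
rate_gap = abs(y_tr_demo.mean() - y_va_demo.mean())
```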

## Exporting raw train and validation sets

In [18]:
df_train_exp = X_train.copy()
df_train_exp[target] = y_train

df_train_exp.to_csv(f'{RAW}/{preproc_label}_train.csv', index = False, compression= 'zip')

In [19]:
df_val_exp = X_val.copy()
df_val_exp[target] = y_val

df_val_exp.to_csv(f'{RAW}/{preproc_label}_val.csv', index = False, compression = 'zip')

## Data imputation

In [20]:
impute_values = {}

In [21]:
X_val.isnull().sum()

department                0
region                    0
education               514
gender                    0
recruitment_channel       0
no_of_trainings           0
age                       0
previous_year_rating    842
length_of_service         0
KPIs_met >80%             0
awards_won?               0
avg_training_score        0
dtype: int64

Create null-indicator columns before imputing

In [22]:
X_train['na_previous_year_rating'] = X_train['previous_year_rating'].isnull().astype('int')
X_train['na_education'] = X_train['education'].isnull().astype('int')

In [23]:
mean = X_train['previous_year_rating'].mean()
impute_values['previous_year_rating'] = mean

X_train['previous_year_rating'] = X_train['previous_year_rating'].fillna(mean)

In [24]:
mode = X_train['education'].mode()[0]
impute_values['education'] = mode

X_train['education'] = X_train['education'].fillna(mode)

In [25]:
X_train.isnull().sum()

department                 0
region                     0
education                  0
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating       0
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
na_previous_year_rating    0
na_education               0
dtype: int64

In [26]:
impute_values

{'previous_year_rating': 3.331993886204516, 'education': "Bachelor's"}
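
The imputation cells above can be condensed into one helper, sketched below under the assumption that every imputed column gets a `na_<column>` flag. The name `impute_with_flags` and the toy frames are hypothetical. The point it illustrates is that the fill values are computed on the training data only and then reused unchanged on any other split.

```python
import numpy as np
import pandas as pd


def impute_with_flags(df, impute_values):
    """Add a na_<column> indicator, then fill nulls with the stored value."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    for column, value in impute_values.items():
        out[f'na_{column}'] = out[column].isnull().astype('int')
        out[column] = out[column].fillna(value)
    return out


# Toy training frame (illustrative values only)
train_demo = pd.DataFrame({
    'previous_year_rating': [5.0, np.nan, 3.0, 1.0],
    'education': ["Bachelor's", None, "Bachelor's", "Master's & above"],
})

# Statistics come from the training split only
values_demo = {
    'previous_year_rating': train_demo['previous_year_rating'].mean(),
    'education': train_demo['education'].mode()[0],
}

train_imputed = impute_with_flags(train_demo, values_demo)

# The same stored values are applied to a held-out frame
val_demo = pd.DataFrame({'previous_year_rating': [np.nan], 'education': [None]})
val_imputed = impute_with_flags(val_demo, values_demo)
```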

In [27]:
with open(f'{PROCESSED}/{preproc_label}_impute_values.pkl', 'wb') as file:
    pickle.dump(impute_values, file)

## Outlier capping

In [28]:
capping_values = {}

In [29]:
percentiles = list(np.arange(0.1, 0.9, 0.1)) + list(np.arange(0.9, 1.0, 0.01))
descriptives = X_train.describe(percentiles = percentiles)
descriptives

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,na_previous_year_rating,na_education
count,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0
mean,1.25195,34.820873,3.331994,5.873649,0.35205,0.022921,63.383957,0.074853,0.043219
std,0.610654,7.673484,1.212792,4.281273,0.477615,0.149654,13.356721,0.263157,0.203353
min,1.0,20.0,1.0,1.0,0.0,0.0,40.0,0.0,0.0
10%,1.0,27.0,1.0,2.0,0.0,0.0,48.0,0.0,0.0
20%,1.0,28.0,3.0,2.0,0.0,0.0,50.0,0.0,0.0
30%,1.0,30.0,3.0,3.0,0.0,0.0,53.0,0.0,0.0
40%,1.0,32.0,3.0,4.0,0.0,0.0,58.0,0.0,0.0
50%,1.0,33.0,3.0,5.0,0.0,0.0,60.0,0.0,0.0
60%,1.0,35.0,3.331994,6.0,0.0,0.0,64.0,0.0,0.0


In [30]:
(descriptives.loc['max'] / descriptives.loc['99%']) > 1.5

no_of_trainings             True
age                        False
previous_year_rating       False
length_of_service           True
KPIs_met >80%              False
awards_won?                False
avg_training_score         False
na_previous_year_rating    False
na_education               False
dtype: bool
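
The rule applied above (cap a variable at its 99th percentile when max / p99 exceeds 1.5) could be written as a reusable function along these lines. This is a sketch, not the notebook's code: `find_capping_values` is a hypothetical name, it calls `Series.quantile` instead of reading the `'99%'` row of `describe`, and the `upper > 0` guard against division by zero is an added assumption.

```python
import pandas as pd


def find_capping_values(df, q=0.99, ratio=1.5):
    """Return {column: q-th quantile} for numeric columns whose max exceeds ratio * quantile."""
    caps = {}
    for column in df.select_dtypes('number').columns:
        upper = df[column].quantile(q)
        # Guard (assumption): skip columns whose quantile is 0 to avoid dividing by zero
        if upper > 0 and df[column].max() / upper > ratio:
            caps[column] = upper
    return caps


# Toy data: one column with a long right tail, one without (illustrative only)
demo = pd.DataFrame({
    'no_of_trainings': [1] * 98 + [2, 10],
    'age': list(range(20, 70)) * 2,
})

caps_demo = find_capping_values(demo)

# Apply the caps column by column; Series.clip is equivalent to the np.where pattern
capped_demo = demo.copy()
for column, upper in caps_demo.items():
    capped_demo[column] = capped_demo[column].clip(upper=upper)
```

Only `no_of_trainings` is selected in this toy example, because the maximum of `age` equals its 99th percentile.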

In [31]:
p99 = descriptives['no_of_trainings']['99%']
capping_values['no_of_trainings'] = p99

X_train['no_of_trainings'] = np.where(X_train['no_of_trainings'] > p99, p99, X_train['no_of_trainings'])

In [32]:
p99 = descriptives['length_of_service']['99%']
capping_values['length_of_service'] = p99

X_train['length_of_service'] = np.where(X_train['length_of_service'] > p99, p99, X_train['length_of_service'])

In [33]:
X_train.describe(percentiles = percentiles)

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,na_previous_year_rating,na_education
count,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0,43846.0
mean,1.245838,34.820873,3.331994,5.824522,0.35205,0.022921,63.383957,0.074853,0.043219
std,0.569076,7.673484,1.212792,4.072022,0.477615,0.149654,13.356721,0.263157,0.203353
min,1.0,20.0,1.0,1.0,0.0,0.0,40.0,0.0,0.0
10%,1.0,27.0,1.0,2.0,0.0,0.0,48.0,0.0,0.0
20%,1.0,28.0,3.0,2.0,0.0,0.0,50.0,0.0,0.0
30%,1.0,30.0,3.0,3.0,0.0,0.0,53.0,0.0,0.0
40%,1.0,32.0,3.0,4.0,0.0,0.0,58.0,0.0,0.0
50%,1.0,33.0,3.0,5.0,0.0,0.0,60.0,0.0,0.0
60%,1.0,35.0,3.331994,6.0,0.0,0.0,64.0,0.0,0.0


In [34]:
capping_values

{'no_of_trainings': 4.0, 'length_of_service': 20.0}

In [35]:
with open(f'{PROCESSED}/{preproc_label}_capping_values.pkl', 'wb') as file:
    pickle.dump(capping_values, file)

## Dummy generation

In [36]:
from sklearn.preprocessing import OneHotEncoder

In [37]:
ohe = OneHotEncoder(handle_unknown= 'ignore')

In [38]:
X_categorical = X_train.select_dtypes('object')
X_numerical = X_train.select_dtypes('number')

In [39]:
ohe.fit(X_categorical)

OneHotEncoder(handle_unknown='ignore')

In [40]:
ohe_columns = ohe.get_feature_names(X_categorical.columns)

In [41]:
X_categorical_dummies = pd.DataFrame(ohe.transform(X_categorical).toarray(), 
                                     columns = ohe_columns, index = X_numerical.index)

In [42]:
X_train = pd.concat([X_numerical, X_categorical_dummies], axis = 1)
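
A note on portability: this notebook runs scikit-learn 0.23.2, where `OneHotEncoder.get_feature_names` is available. From scikit-learn 1.0 onward the method is `get_feature_names_out`, and the old name was removed in 1.2. The sketch below shows one way to support both, using toy data; all `*_demo` names are illustrative. It also shows what `handle_unknown='ignore'` does with a category that was not seen during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical frame (illustrative only)
cat_demo = pd.DataFrame({
    'gender': ['f', 'm', 'm'],
    'recruitment_channel': ['sourcing', 'other', 'other'],
})
ohe_demo = OneHotEncoder(handle_unknown='ignore').fit(cat_demo)

# Pick whichever method the installed scikit-learn version provides
if hasattr(ohe_demo, 'get_feature_names_out'):
    names_demo = list(ohe_demo.get_feature_names_out(cat_demo.columns))
else:
    names_demo = list(ohe_demo.get_feature_names(cat_demo.columns))

dummies_demo = pd.DataFrame(
    ohe_demo.transform(cat_demo).toarray(),
    columns=names_demo,
    index=cat_demo.index,
)

# An unseen category ('x') is encoded as all zeros for that feature
unseen_demo = pd.DataFrame({'gender': ['x'], 'recruitment_channel': ['other']})
unseen_row = ohe_demo.transform(unseen_demo).toarray()[0]
```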

In [43]:
with open(f'{PROCESSED}/{preproc_label}_ohe.pkl', 'wb') as file:
    pickle.dump(ohe, file)

In [44]:
with open(f'{PROCESSED}/{preproc_label}_ohe_columns.pkl', 'wb') as file:
    pickle.dump(ohe_columns, file)

## Scaler

In [45]:
from sklearn.preprocessing import StandardScaler

In [46]:
scaler = StandardScaler()

In [47]:
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

In [48]:
with open(f'{PROCESSED}/{preproc_label}_scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
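
Each artifact saved so far can be restored in a later session with `pickle.load`. The round trip below is a minimal sketch that writes to a temporary directory instead of `PROCESSED`; the dictionary repeats the capping values shown earlier. Pickled scikit-learn objects such as `ohe` and `scaler` are generally only safe to load with the same scikit-learn version that created them.

```python
import pickle
import tempfile
from pathlib import Path

# Same content as the capping_values dictionary printed above
capping_demo = {'no_of_trainings': 4.0, 'length_of_service': 20.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary location used for illustration; the notebook writes to PROCESSED
    artifact_path = Path(tmp_dir) / 'preprocess_v1_capping_values.pkl'

    with open(artifact_path, 'wb') as file:
        pickle.dump(capping_demo, file)

    # Note the 'rb' mode: pickle files are binary
    with open(artifact_path, 'rb') as file:
        restored_demo = pickle.load(file)
```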

## Replicate preprocess on validation data

Impute nulls

In [49]:
impute_values

{'previous_year_rating': 3.331993886204516, 'education': "Bachelor's"}

In [50]:
X_val['na_previous_year_rating'] = X_val['previous_year_rating'].isnull().astype('int')
X_val['na_education'] = X_val['education'].isnull().astype('int')

In [51]:
X_val['previous_year_rating'] = X_val['previous_year_rating'].fillna(impute_values['previous_year_rating'])
X_val['education'] = X_val['education'].fillna(impute_values['education'])

Capping outliers

In [52]:
capping_values

{'no_of_trainings': 4.0, 'length_of_service': 20.0}

In [53]:
X_val['no_of_trainings'] = np.where(X_val['no_of_trainings'] > capping_values['no_of_trainings'],
                                    capping_values['no_of_trainings'], X_val['no_of_trainings'])

X_val['length_of_service'] = np.where(X_val['length_of_service'] > capping_values['length_of_service'],
                                    capping_values['length_of_service'], X_val['length_of_service'])

Dummy columns

In [54]:
X_val_categorical = X_val.select_dtypes('object')
X_val_numerical = X_val.select_dtypes('number')

In [55]:
X_val_categorical

Unnamed: 0,department,region,education,gender,recruitment_channel
9351,Procurement,region_23,Bachelor's,m,other
7289,Procurement,region_15,Master's & above,m,other
22339,Analytics,region_32,Bachelor's,m,other
23422,Technology,region_1,Bachelor's,f,sourcing
36639,Sales & Marketing,region_7,Bachelor's,f,sourcing
...,...,...,...,...,...
38740,Operations,region_32,Master's & above,f,other
26552,HR,region_22,Master's & above,f,other
19852,Sales & Marketing,region_32,Bachelor's,m,sourcing
42338,Sales & Marketing,region_20,Bachelor's,m,other


In [56]:
X_val_numerical

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,na_previous_year_rating,na_education
9351,1.0,27,4.0,2.0,0,0,69,0,0
7289,1.0,45,3.0,18.0,0,0,67,0,0
22339,2.0,36,2.0,3.0,0,0,82,0,0
23422,1.0,28,5.0,5.0,1,0,75,0,0
36639,1.0,34,2.0,10.0,1,0,52,0,0
...,...,...,...,...,...,...,...,...,...
38740,1.0,50,4.0,20.0,0,0,56,0,0
26552,1.0,38,3.0,8.0,0,0,48,0,0
19852,2.0,31,1.0,7.0,0,0,46,0,0
42338,2.0,30,5.0,4.0,1,0,52,0,0


In [57]:
X_val_categorical_dummies = pd.DataFrame(ohe.transform(X_val_categorical).toarray(), 
                                     columns = ohe_columns, index = X_val_numerical.index)

In [58]:
X_val = pd.concat([X_val_numerical, X_val_categorical_dummies], axis = 1)

Scaling columns

In [59]:
X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns = X_val.columns)
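
Since the same four steps have to be replayed on the test set as well, they could be gathered into a single function. The following is a sketch under stated assumptions, not the notebook's own code: `build_features` and `apply_preprocessing` are hypothetical names, categorical columns are selected with `exclude='number'` rather than `'object'`, the original index is kept on the scaled frame, and the toy data exists only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_features(df, impute_values, capping_values, ohe, ohe_columns):
    """Impute, cap and one-hot encode with already-fitted artifacts (no scaling)."""
    out = df.copy()
    # 1. Missing-value flags, then imputation with the training statistics
    for column, value in impute_values.items():
        out[f'na_{column}'] = out[column].isnull().astype('int')
        out[column] = out[column].fillna(value)
    # 2. Cap at the thresholds found on the training data
    for column, upper in capping_values.items():
        out[column] = np.where(out[column] > upper, upper, out[column])
    # 3. One-hot encode the non-numeric columns with the fitted encoder
    categorical = out.select_dtypes(exclude='number')
    numerical = out.select_dtypes('number')
    dummies = pd.DataFrame(ohe.transform(categorical).toarray(),
                           columns=ohe_columns, index=out.index)
    return pd.concat([numerical, dummies], axis=1)


def apply_preprocessing(df, impute_values, capping_values, ohe, ohe_columns, scaler):
    """Replay all four steps, ending with the fitted scaler."""
    features = build_features(df, impute_values, capping_values, ohe, ohe_columns)
    return pd.DataFrame(scaler.transform(features),
                        columns=features.columns, index=features.index)


# Toy training data and artifacts (illustrative values only)
train_demo = pd.DataFrame({
    'education': ["Bachelor's", "Master's & above", "Bachelor's", "Bachelor's"],
    'previous_year_rating': [3.0, 5.0, 1.0, 3.0],
    'no_of_trainings': [1, 2, 1, 4],
})
impute_demo = {'previous_year_rating': 3.0, 'education': "Bachelor's"}
capping_demo = {'no_of_trainings': 4.0}

train_cat = train_demo.select_dtypes(exclude='number')
ohe_demo = OneHotEncoder(handle_unknown='ignore').fit(train_cat)
ohe_cols_demo = [f'{column}_{category}'
                 for column, categories in zip(train_cat.columns, ohe_demo.categories_)
                 for category in categories]
scaler_demo = StandardScaler().fit(
    build_features(train_demo, impute_demo, capping_demo, ohe_demo, ohe_cols_demo))

# New data with a missing rating, a missing education and an outlier
val_demo = pd.DataFrame({
    'education': [None, "Master's & above"],
    'previous_year_rating': [np.nan, 4.0],
    'no_of_trainings': [9, 1],
})
val_scaled_demo = apply_preprocessing(
    val_demo, impute_demo, capping_demo, ohe_demo, ohe_cols_demo, scaler_demo)
```

Keeping the steps in one function reduces the risk of the validation and test paths drifting apart when a step is changed.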

## Save preprocessed datasets

In [60]:
X_train_scaled[target] = y_train.values
X_train_scaled.to_csv(f'{PROCESSED}/{preproc_label}_train.csv', index = False, compression = 'zip')

In [61]:
X_val_scaled[target] = y_val.values
X_val_scaled.to_csv(f'{PROCESSED}/{preproc_label}_val.csv', index = False, compression = 'zip')
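
One detail to keep in mind when reading these files back: they are zip archives that keep a `.csv` extension, so the default `compression='infer'` in `pd.read_csv` cannot detect the format from the file name. A minimal round-trip sketch, using a temporary directory and a toy frame as stand-ins for `PROCESSED` and the real data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for a preprocessed dataset (illustrative only)
demo = pd.DataFrame({'age': [35, 30], 'is_promoted': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'preprocess_v1_demo.csv'

    # Same export pattern as in the notebook: zip-compressed, .csv file name
    demo.to_csv(path, index=False, compression='zip')

    # The compression has to be stated explicitly when reading
    restored_csv_demo = pd.read_csv(path, compression='zip')
```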