# HR Analytics

<img src = 'https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/hr_1920x480_s5WuoZs-thumbnail-1200x1200-90.jpg'>

Practice Problem: https://datahack.analyticsvidhya.com/contest/wns-analytics-hackathon-2018-1/

## HR Analytics

HR analytics is changing the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has used analytics for years. However, the collection, processing and analysis of data have been largely manual, and given the nature of human resources dynamics and HR KPIs, this approach has constrained HR. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics to identify the employees most likely to get promoted.

## Problem Statement

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client faces is identifying the right people for promotion *(only for manager position and below)* and preparing them in time. The process they currently follow is:

* They first identify a set of employees based on recommendations/past performance
* Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required by each vertical
* At the end of the program, based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered), an employee gets promoted

In the process described above, the final promotions are only announced after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

<img src = 'https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/09/wns_hack_im_1.jpg'>

They have provided multiple attributes on each employee's past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

## Evaluation Metric

The evaluation metric for this competition is F1 Score.
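
As a reminder of how this metric behaves, the following is a minimal sketch that computes F1 for the positive class from scratch. The function name `f1_binary` and the toy labels are illustrative assumptions, not part of the competition material; in practice `sklearn.metrics.f1_score` would normally be used.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and/or recall are zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 3 positives out of 8.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
score = f1_binary(y_true, y_pred)

# Predicting "not promoted" for everyone scores 0, however accurate it looks.
all_negative_score = f1_binary(y_true, [0] * len(y_true))
```

Because F1 ignores true negatives, a model that always predicts the majority class gets no credit, which is why class balancing matters for this problem.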

## Public and Private Split

Test data is further randomly divided into Public (40%) and Private (60%) data.

Your initial responses will be checked and scored on the Public data.
The final rankings will be based on your private score, which will be published once the competition is over.

## Environment

In [None]:
import sys
sys.version

'3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 17:50:39) \n[GCC Clang 10.0.0 ]'

In [None]:
!conda info --envs

# conda environments:
#
micromaster              /Users/manuel/.conda/envs/micromaster
                         /Users/manuel/.julia/conda/3
base                  *  /Users/manuel/opt/anaconda3
belcorp                  /Users/manuel/opt/anaconda3/envs/belcorp
courseragcp              /Users/manuel/opt/anaconda3/envs/courseragcp
iapucp                   /Users/manuel/opt/anaconda3/envs/iapucp
mitxpro                  /Users/manuel/opt/anaconda3/envs/mitxpro
style-transfer           /Users/manuel/opt/anaconda3/envs/style-transfer
taller-dmc               /Users/manuel/opt/anaconda3/envs/taller-dmc
udacity                  /Users/manuel/opt/anaconda3/envs/udacity



## Packages

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import random
import warnings
import pickle
import xgboost

warnings.filterwarnings('ignore')


seed = 2020
random.seed(seed)

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 400)
sns.set()

DATA = Path('../../data') 
RAW  = DATA/'raw'
PROCESSED = DATA/'processed'
SUBMISSIONS = DATA/'submissions'    

MODEL = Path('../../model') 

In [None]:
pd.__version__

'1.1.3'

In [None]:
np.__version__

'1.19.1'

In [None]:
sklearn.__version__

'0.23.2'

## Loading the data

In [None]:
#os.listdir(f'{PROCESSED}')

['preprocess_v1_train.csv', '.ipynb_checkpoints', 'preprocess_v1_val.csv']

In [None]:
#df_train = pd.read_csv(f'{PROCESSED}/preprocess_v1_train.csv')
#df_val = pd.read_csv((f'{PROCESSED}/preprocess_v1_val.csv'))

In [2]:
# Only needed when running in Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Only needed when running in Colab
train_v1 = pd.read_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/preprocess_v1_train.csv', compression='zip')
train_v2 = pd.read_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/preprocess_v2_train.csv', compression='zip')

In [4]:
train_v1.shape, train_v2.shape

((43846, 61), (43846, 61))

In [5]:
id_columns = 'employee_id'
target = 'is_promoted'

## Class balancing

In [6]:
SEED = 2020
train_v1['is_promoted'].value_counts() 
train_v2['is_promoted'].value_counts() 

0    40112
1     3734
Name: is_promoted, dtype: int64
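
To put the counts above in perspective, this small sketch derives the positive rate and the imbalance ratio. The numbers are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 40112, 1: 3734}

total = sum(counts.values())
positive_rate = counts[1] / total          # share of promoted employees
imbalance_ratio = counts[0] / counts[1]    # negatives per positive

print(f"{total} rows, {positive_rate:.2%} promoted, "
      f"about {imbalance_ratio:.1f} negatives per positive")
```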

### Oversampling

In [9]:
muestra_si_v1_over50 = train_v1[train_v1['is_promoted'] == 1].sample(n=(train_v1['is_promoted'].value_counts()[0] - train_v1['is_promoted'].value_counts()[1]), 
                                                                   random_state=SEED, replace=True)

train_v1_over50 = train_v1.append(muestra_si_v1_over50)

# Check the class distribution
train_v1_over50['is_promoted'].value_counts()

1    40112
0    40112
Name: is_promoted, dtype: int64
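
The same oversampling step can be wrapped in a reusable function so that it is applied identically to both dataset versions. This is only a sketch: `random_oversample` and the toy frame are hypothetical names, and `pd.concat` is used in place of `DataFrame.append`, which was removed in pandas 2.0.

```python
import pandas as pd


def random_oversample(df, target, seed=2020):
    """Duplicate minority rows at random until both classes have equal size."""
    counts = df[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_extra = int(counts[majority] - counts[minority])
    extra = df[df[target] == minority].sample(
        n=n_extra, replace=True, random_state=seed
    )
    # ignore_index avoids the duplicated index labels that plain appending leaves.
    return pd.concat([df, extra], ignore_index=True)


# Toy data (made up): 8 negatives, 2 positives.
toy = pd.DataFrame({"x": range(10), "is_promoted": [0] * 8 + [1] * 2})
balanced = random_oversample(toy, "is_promoted")
```

Passing the frame as an argument removes the risk of mixing samples from one dataset version into another.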

In [10]:
train_v1_over50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_over50_train.csv', index=False, compression='zip')

In [11]:
muestra_si_v2_over50 = train_v2[train_v2['is_promoted'] == 1].sample(n=(train_v2['is_promoted'].value_counts()[0] - train_v2['is_promoted'].value_counts()[1]), 
                                                                   random_state=SEED, replace=True)

# Append the sample drawn from train_v2 (not the one drawn from train_v1)
train_v2_over50 = train_v2.append(muestra_si_v2_over50)

# Check the class distribution
train_v2_over50['is_promoted'].value_counts()

1    40112
0    40112
Name: is_promoted, dtype: int64

In [12]:
train_v2_over50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_over50_train.csv', index=False, compression='zip')

### Undersampling

In [13]:
muestra_no_v1_under50 = train_v1[train_v1['is_promoted'] == 0].sample(n=train_v1['is_promoted'].value_counts()[1], 
                                                                    random_state=SEED, replace=False)

train_v1_under50 = train_v1[train_v1['is_promoted'] == 1].append(muestra_no_v1_under50)

# Check the class distribution
train_v1_under50['is_promoted'].value_counts()

1    3734
0    3734
Name: is_promoted, dtype: int64
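
A sketch of the undersampling step as a function follows. It also shuffles the result, since the cell above leaves all promoted rows first and all non-promoted rows last, which can matter for learners or splits that do not shuffle. `random_undersample` and the toy frame are hypothetical names.

```python
import pandas as pd


def random_undersample(df, target, seed=2020):
    """Keep all minority rows plus an equally sized random majority sample."""
    counts = df[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    kept = df[df[target] == majority].sample(
        n=int(counts[minority]), replace=False, random_state=seed
    )
    out = pd.concat([df[df[target] == minority], kept])
    # Shuffle so the two classes are interleaved rather than stacked.
    return out.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (made up): 8 negatives, 2 positives.
toy = pd.DataFrame({"x": range(10), "is_promoted": [0] * 8 + [1] * 2})
balanced = random_undersample(toy, "is_promoted")
```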

In [14]:
train_v1_under50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_under50_train.csv', index=False, compression='zip')

In [15]:
muestra_no_v2_under50 = train_v2[train_v2['is_promoted'] == 0].sample(n=train_v2['is_promoted'].value_counts()[1], 
                                                                    random_state=SEED, replace=False)

train_v2_under50 = train_v2[train_v2['is_promoted'] == 1].append(muestra_no_v2_under50)

# Check the class distribution
train_v2_under50['is_promoted'].value_counts()

1    3734
0    3734
Name: is_promoted, dtype: int64

In [16]:
train_v2_under50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_under50_train.csv', index=False, compression='zip')

### SMOTE -> "not majority"

In [17]:
import xgboost
from imblearn.over_sampling import SMOTE

In [18]:
smote50_v1 = SMOTE(sampling_strategy='not majority', k_neighbors=2, random_state=SEED)    #, kind='svm', out_step=0.2
X_train_v1_smote50, y_train_v1_smote50 = smote50_v1.fit_resample(train_v1.drop(target, axis = 1), train_v1[target])
pd.DataFrame(y_train_v1_smote50).value_counts()

train_v1_smote50 = np.concatenate((X_train_v1_smote50, y_train_v1_smote50.reshape(y_train_v1_smote50.shape[0],1)), axis=1)
train_v1_smote50 = pd.DataFrame(train_v1_smote50)
train_v1_smote50.shape
train_v1_smote50.columns = train_v1.columns

# Check the class distribution
train_v1_smote50['is_promoted'].value_counts()

1.0    40112
0.0    40112
Name: is_promoted, dtype: int64
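
For intuition about where the extra positive rows come from, here is a simplified NumPy sketch of the core SMOTE idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its `k_neighbors` nearest minority neighbours. This is an illustration under simplifying assumptions (brute-force distances, numeric features only), not the imbalanced-learn implementation; `smote_like_sample` is a hypothetical name.

```python
import numpy as np


def smote_like_sample(X_minority, k_neighbors=2, n_new=5, seed=2020):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Brute-force pairwise distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k_neighbors]

    base = rng.integers(0, len(X), size=n_new)                 # starting samples
    pick = neighbors[base, rng.integers(0, k_neighbors, size=n_new)]
    gap = rng.random((n_new, 1))                               # position on the segment
    return X[base] + gap * (X[pick] - X[base])


# Toy minority class (made up): the four corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(corners, k_neighbors=2, n_new=5)
```

Because the new rows are interpolated, binary or one-hot columns can end up with fractional values, which is worth checking before training on the resampled data.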

In [19]:
train_v1_smote50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_smote50_train.csv', index=False, compression='zip')

In [21]:
smote50_v2 = SMOTE(sampling_strategy='not majority', k_neighbors=2, random_state=SEED)    #, kind='svm', out_step=0.2
X_train_v2_smote50, y_train_v2_smote50 = smote50_v2.fit_resample(train_v2.drop(target, axis = 1), train_v2[target])
pd.DataFrame(y_train_v2_smote50).value_counts()

train_v2_smote50 = np.concatenate((X_train_v2_smote50, y_train_v2_smote50.reshape(y_train_v2_smote50.shape[0],1)), axis=1)
train_v2_smote50 = pd.DataFrame(train_v2_smote50)
train_v2_smote50.shape
train_v2_smote50.columns = train_v2.columns

# Check the class distribution
train_v2_smote50['is_promoted'].value_counts()

1.0    40112
0.0    40112
Name: is_promoted, dtype: int64
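
The rebuild step in these cells relies on `fit_resample` returning NumPy arrays and on the target being the last column of the original frame. A more defensive sketch, which accepts either arrays or pandas objects and names the columns explicitly, could look like this. `to_frame` and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd


def to_frame(X_res, y_res, feature_names, target_name):
    """Rebuild a DataFrame from resampled features and labels."""
    out = pd.DataFrame(np.asarray(X_res), columns=list(feature_names))
    # Assign the target separately so it keeps its own dtype and column name.
    out[target_name] = np.asarray(y_res).ravel()
    return out


# Toy inputs (made up), once as arrays and once as pandas objects.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
from_arrays = to_frame(X, y, ["a", "b"], "is_promoted")
from_pandas = to_frame(pd.DataFrame(X, columns=["a", "b"]), pd.Series(y),
                       ["a", "b"], "is_promoted")
```

Keeping the labels out of `np.concatenate` also avoids the upcast to float that produces the `1.0` / `0.0` labels seen in the output above.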

In [22]:
train_v2_smote50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_smote50_train.csv', index=False, compression='zip')

In [38]:
smote20_v1 = SMOTE(sampling_strategy=0.2, k_neighbors=2, random_state=SEED)    
X_train_v1_smote20, y_train_v1_smote20 = smote20_v1.fit_resample(train_v1.drop(target, axis = 1), train_v1[target])
pd.DataFrame(y_train_v1_smote20).value_counts()

train_v1_smote20 = np.concatenate((X_train_v1_smote20, y_train_v1_smote20.reshape(y_train_v1_smote20.shape[0],1)), axis=1)
train_v1_smote20 = pd.DataFrame(train_v1_smote20)
train_v1_smote20.shape
train_v1_smote20.columns = train_v1.columns

# Check the class distribution
train_v1_smote20['is_promoted'].value_counts()

0.0    40112
1.0     8022
Name: is_promoted, dtype: int64
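
The 8022 positives follow from how a float `sampling_strategy` is interpreted: it is the desired ratio of minority to majority samples after resampling. The arithmetic below reproduces the count, assuming the number of synthetic samples is truncated to an integer.

```python
# Counts before resampling, taken from the earlier value_counts() output.
n_majority, n_minority = 40112, 3734
sampling_strategy = 0.2   # target ratio: minority / majority

# Synthetic rows needed to reach the target ratio (truncated, by assumption).
n_synthetic = int(sampling_strategy * n_majority - n_minority)
n_minority_after = n_minority + n_synthetic

print(n_synthetic, n_minority_after)
```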

In [39]:
train_v1_smote20.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_smote20_train.csv', index=False, compression='zip')

In [40]:
smote20_v2 = SMOTE(sampling_strategy=0.2, k_neighbors=2, random_state=SEED)    
X_train_v2_smote20, y_train_v2_smote20 = smote20_v2.fit_resample(train_v2.drop(target, axis = 1), train_v2[target])
pd.DataFrame(y_train_v2_smote20).value_counts()

train_v2_smote20 = np.concatenate((X_train_v2_smote20, y_train_v2_smote20.reshape(y_train_v2_smote20.shape[0],1)), axis=1)
train_v2_smote20 = pd.DataFrame(train_v2_smote20)
train_v2_smote20.shape
train_v2_smote20.columns = train_v2.columns

# Check the class distribution
train_v2_smote20['is_promoted'].value_counts()

0.0    40112
1.0     8022
Name: is_promoted, dtype: int64

In [41]:
train_v2_smote20.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_smote20_train.csv', index=False, compression='zip')

### SMOTE + TOMEK 

In [23]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

In [25]:
smoteTomek50_v1 = SMOTETomek(random_state=SEED)
X_train_v1_smoteTomek50, y_train_v1_smoteTomek50 = smoteTomek50_v1.fit_resample(train_v1.drop(target, axis = 1), train_v1[target])
pd.DataFrame(y_train_v1_smoteTomek50).value_counts()

train_v1_smoteTomek50 = np.concatenate((X_train_v1_smoteTomek50, y_train_v1_smoteTomek50.reshape(y_train_v1_smoteTomek50.shape[0],1)), axis=1)
train_v1_smoteTomek50 = pd.DataFrame(train_v1_smoteTomek50)
train_v1_smoteTomek50.shape
train_v1_smoteTomek50.columns = train_v1.columns

# Check the class distribution
train_v1_smoteTomek50['is_promoted'].value_counts()

1    39819
0    39819
dtype: int64
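
The cleaning half of SMOTETomek removes Tomek links: pairs of samples from opposite classes that are each other's nearest neighbour. The following brute-force sketch shows how such pairs can be found on a small array. It is for illustration only (quadratic memory, toy data) and `tomek_links` is a hypothetical helper, not the imbalanced-learn code.

```python
import numpy as np


def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]


# Toy data (made up): points 0 and 1 are close and of opposite classes,
# points 2 and 3 are close but share a class, point 4 is isolated.
X_toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0], [5.0, 0.0]]
y_toy = [0, 1, 0, 0, 1]
links = tomek_links(X_toy, y_toy)
```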

In [27]:
train_v1_smoteTomek50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_smoteTomek50_train.csv', index=False, compression='zip')

In [28]:
smoteTomek50_v2 = SMOTETomek(random_state=SEED)
X_train_v2_smoteTomek50, y_train_v2_smoteTomek50 = smoteTomek50_v2.fit_resample(train_v2.drop(target, axis = 1), train_v2[target])
pd.DataFrame(y_train_v2_smoteTomek50).value_counts()

train_v2_smoteTomek50 = np.concatenate((X_train_v2_smoteTomek50, y_train_v2_smoteTomek50.reshape(y_train_v2_smoteTomek50.shape[0],1)), axis=1)
train_v2_smoteTomek50 = pd.DataFrame(train_v2_smoteTomek50)
train_v2_smoteTomek50.shape
train_v2_smoteTomek50.columns = train_v2.columns

# Check the class distribution
train_v2_smoteTomek50['is_promoted'].value_counts()

1    39819
0    39819
dtype: int64

In [30]:
train_v2_smoteTomek50.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_smoteTomek50_train.csv', index=False, compression='zip')

In [43]:
smoteTomek20_v1 = SMOTETomek(sampling_strategy=0.2, random_state=SEED)
X_train_v1_smoteTomek20, y_train_v1_smoteTomek20 = smoteTomek20_v1.fit_resample(train_v1.drop(target, axis = 1), train_v1[target])
pd.DataFrame(y_train_v1_smoteTomek20).value_counts()

train_v1_smoteTomek20 = np.concatenate((X_train_v1_smoteTomek20, y_train_v1_smoteTomek20.reshape(y_train_v1_smoteTomek20.shape[0],1)), axis=1)
train_v1_smoteTomek20 = pd.DataFrame(train_v1_smoteTomek20)
train_v1_smoteTomek20.shape
train_v1_smoteTomek20.columns = train_v1.columns

# Check the class distribution
train_v1_smoteTomek20['is_promoted'].value_counts()

0.0    39251
1.0     7161
Name: is_promoted, dtype: int64
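
These counts can be sanity-checked against the plain SMOTE run with the same `sampling_strategy=0.2`. Assuming the SMOTE step inside `SMOTETomek` also yields 40112 negatives and 8022 positives, both classes lose the same number of rows, which is what one would expect if the Tomek step removes both members of each link.

```python
# Counts after SMOTE alone (sampling_strategy=0.2), from the earlier output.
before = {0: 40112, 1: 8022}
# Counts after SMOTETomek on train_v1, from the output above.
after = {0: 39251, 1: 7161}

removed = {label: before[label] - after[label] for label in before}
print(removed)
```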

In [44]:
train_v1_smoteTomek20.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v1_smoteTomek20_train.csv', index=False, compression='zip')

In [45]:
smoteTomek20_v2 = SMOTETomek(sampling_strategy=0.2, random_state=SEED)
X_train_v2_smoteTomek20, y_train_v2_smoteTomek20 = smoteTomek20_v2.fit_resample(train_v2.drop(target, axis = 1), train_v2[target])
pd.DataFrame(y_train_v2_smoteTomek20).value_counts()

train_v2_smoteTomek20 = np.concatenate((X_train_v2_smoteTomek20, y_train_v2_smoteTomek20.reshape(y_train_v2_smoteTomek20.shape[0],1)), axis=1)
train_v2_smoteTomek20 = pd.DataFrame(train_v2_smoteTomek20)
train_v2_smoteTomek20.shape
train_v2_smoteTomek20.columns = train_v2.columns

# Check the class distribution
train_v2_smoteTomek20['is_promoted'].value_counts()

0.0    39240
1.0     7150
Name: is_promoted, dtype: int64

In [46]:
train_v2_smoteTomek20.to_csv('/content/drive/My Drive/Colab Notebooks/DiplomadoIA/M1_AprendizajeDeMaquina/Proyecto/balanceo/preprocess_v2_smoteTomek20_train.csv', index=False, compression='zip')

### SMOTE -> "all"

In [None]:
# not minority (?) same distribution
#from imblearn.over_sampling import SMOTE
#
#sm = SMOTE(sampling_strategy='all', k_neighbors=2, random_state=SEED)
#x_SMOTE_m4, y_SMOTE_m4 = sm.fit_resample(df_train.drop(target, axis = 1), df_train[target])
#pd.DataFrame(y_SMOTE_m4).value_counts()

### Tomek -> "not majority"

In [None]:
#from sklearn.datasets import make_classification
#from imblearn.under_sampling import TomekLinks # doctest: +NORMALIZE_WHITESPACE
#
#smt = TomekLinks(sampling_strategy='not majority',)
#x_SMOTE_m5, y_SMOTE_m5 = smt.fit_resample(df_train.drop(target, axis = 1), df_train[target])
#pd.DataFrame(y_SMOTE_m5).value_counts()

0    40112
1     2410
dtype: int64

### Tomek -> "all"

In [None]:
#from sklearn.datasets import make_classification
#from imblearn.under_sampling import TomekLinks # doctest: +NORMALIZE_WHITESPACE
#
#smt = TomekLinks(sampling_strategy='all')
#x_SMOTE_m6, y_SMOTE_m6 = smt.fit_resample(df_train.drop(target, axis = 1), df_train[target])
#pd.DataFrame(y_SMOTE_m6).value_counts()

0    38788
1     2410
dtype: int64

### Tomek -> "not minority"

In [None]:
#from sklearn.datasets import make_classification
#from imblearn.under_sampling import TomekLinks # doctest: +NORMALIZE_WHITESPACE
#
#smt = TomekLinks(sampling_strategy='not minority')
#x_SMOTE_m7, y_SMOTE_m7 = smt.fit_resample(df_train.drop(target, axis = 1), df_train[target])
#pd.DataFrame(y_SMOTE_m7).value_counts()

0    38788
1     3734
dtype: int64

## Feature importance in XGBoost

In [None]:
#pd.DataFrame(dataunica).columns

In [None]:
#clf.get_booster().feature_names = list(['', '', '']) 
#xx = clf.get_booster().get_score(importance_type= 'gain')  
#xx = {k: round(v,2) for k, v in sorted(xx.items(), key=lambda item: item[1])}
#x = {k: xx[k] for k in list(xx)[-10:]}
#plt.figure(figsize=(5,10))
#plt.barh(range(len(x)), list(x.values()), align='center')
#plt.yticks(range(len(x)), list(x.keys()))
#plt.xlabel('score', fontsize=18)
#plt.ylabel('variables', fontsize=18)
#plt.title('Feature importance', fontsize=20)
#for i, v in enumerate(x.values()):
#    plt.text(v -1, i-0.1, str(v), color='white', fontweight='bold',fontsize=18)
#plt.show()
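
The dictionary handling in the commented cell (sort by score, round, keep the ten largest) can be isolated and tested without a fitted model. In this sketch the scores and feature names are invented for illustration; in the notebook they would come from `clf.get_booster().get_score(importance_type='gain')`.

```python
def top_k_importances(scores, k=10):
    """Return the k highest-scoring features, ordered from lowest to highest."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    return {name: round(value, 2) for name, value in ordered[-k:]}


# Hypothetical gain scores; the feature names are placeholders.
gain = {"feature_a": 1.234, "feature_b": 5.0, "feature_c": 3.301}
top2 = top_k_importances(gain, k=2)
```

Ascending order suits `plt.barh`, which draws the first entry at the bottom, so the most important feature ends up at the top of the chart.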

## Regression-based imputation

In [None]:
## explicitly require this experimental feature
#from sklearn.experimental import enable_iterative_imputer  # noqa
## now you can import normally from sklearn.impute
#from sklearn.impute import IterativeImputer
#
#imp_mean = IterativeImputer(random_state=0)

In [None]:
#imp_mean.fit(data_sinNA)
#dataImp = pd.DataFrame(imp_mean.transform(data))
#dataImp.columns = data.columns
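
To make the idea behind regression imputation concrete, here is a single-column NumPy sketch: fit a least-squares model on the rows where the column is observed, then predict it where it is missing. It assumes the other columns have no missing values and uses ordinary least squares; `IterativeImputer` repeats this kind of step across columns with its own default estimator. `impute_by_regression` is a hypothetical name.

```python
import numpy as np


def impute_by_regression(X, col):
    """Fill NaNs in column `col` with linear-regression predictions from the other columns."""
    X = np.array(X, dtype=float)                 # work on a copy
    missing = np.isnan(X[:, col])
    others = np.delete(X, col, axis=1)
    design = np.column_stack([np.ones(len(X)), others])   # intercept + predictors
    coef, *_ = np.linalg.lstsq(design[~missing], X[~missing, col], rcond=None)
    X[missing, col] = design[missing] @ coef
    return X


# Toy data (made up): the second column equals 2 * first + 1, with one value missing.
data = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, np.nan]]
imputed = impute_by_regression(data, col=1)
```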