# HR Analytics

<img src = 'https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/hr_1920x480_s5WuoZs-thumbnail-1200x1200-90.jpg'>

Practice Problem: https://datahack.analyticsvidhya.com/contest/wns-analytics-hackathon-2018-1/

## HR Analytics

HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, the approach has been constraining HR. Therefore, it is surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted.

## Problem Statement

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client faces is identifying the right people for promotion *(only for manager position and below)* and preparing them in time. The process they currently follow is:

* They first identify a set of employees based on recommendations and past performance
* Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires
* At the end of the program, employees are promoted based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered)

In the process described above, the final promotions are only announced after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

<img src = 'https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/09/wns_hack_im_1.jpg'>

They have provided multiple attributes covering each employee's past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

## Evaluation Metric

The evaluation metric for this competition is F1 Score.
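
For reference, F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch in plain Python; the helper name `f1_from_counts` and the counts are made up for illustration and do not come from the competition data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: precision = 0.6, recall = 0.3
example_f1 = f1_from_counts(tp=30, fp=20, fn=70)
print(f"{example_f1:.4f}")
```

Because the positive class is rare here (about 8.5% of rows in the outputs further down), F1 penalises a model that simply predicts "not promoted" for everyone, which accuracy would not.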

## Public and Private Split

Test data is further randomly divided into Public (40%) and Private (60%) data.

Your initial responses will be checked and scored on the Public data.
The final rankings will be based on your Private score, which will be published once the competition is over.

## Environment

In [1]:
import sys
sys.version

'3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]'

In [2]:
!conda info --envs

# conda environments:
#
base                  *  C:\Users\antho\Anaconda3



## Packages

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import random
import warnings
import pickle

warnings.filterwarnings('ignore')


seed = 2020
random.seed(seed)     # seeds Python's built-in RNG
np.random.seed(seed)  # seeds NumPy's global RNG as well

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 400)
sns.set()

DATA = Path('../../data') 
RAW  = DATA/'raw'
PROCESSED = DATA/'processed'
SUBMISSIONS = DATA/'submissions'    

MODEL = Path('../../model') 

In [4]:
pd.__version__

'1.1.3'

In [5]:
np.__version__

'1.19.2'

In [6]:
sklearn.__version__

'0.23.2'

In [7]:
id_columns = 'employee_id'
target = 'is_promoted'

## Reading the data

In [8]:
os.listdir(f'{PROCESSED}')

['.DS_Store',
 'preprocess_v1_capping_values.pkl',
 'preprocess_v1_impute_values.pkl',
 'preprocess_v1_ohe.pkl',
 'preprocess_v1_ohe_columns.pkl',
 'preprocess_v1_over50_train.csv',
 'preprocess_v1_scaler.pkl',
 'preprocess_v1_smote20_train.csv',
 'preprocess_v1_smote50_train.csv',
 'preprocess_v1_smoteTomek20_train.csv',
 'preprocess_v1_smoteTomek50_train.csv',
 'preprocess_v1_train.csv',
 'preprocess_v1_under50_train.csv',
 'preprocess_v1_val.csv',
 'preprocess_v2_capping_values.pkl',
 'preprocess_v2_knnimputation.pkl',
 'preprocess_v2_ohe.pkl',
 'preprocess_v2_ohe_columns.pkl',
 'preprocess_v2_over50_train.csv',
 'preprocess_v2_scaler.pkl',
 'preprocess_v2_scalerimputation.pkl',
 'preprocess_v2_smote20_train.csv',
 'preprocess_v2_smote50_train.csv',
 'preprocess_v2_smoteTomek20_train.csv',
 'preprocess_v2_smoteTomek50_train.csv',
 'preprocess_v2_test.csv',
 'preprocess_v2_train.csv',
 'preprocess_v2_under50_train.csv',
 'preprocess_v2_val.csv']

## Training V1 without balancing

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score, f1_score

In [10]:
preproc_train = [file for file in os.listdir(f'{PROCESSED}') if file.endswith('train.csv')]
preproc_train

['preprocess_v1_over50_train.csv',
 'preprocess_v1_smote20_train.csv',
 'preprocess_v1_smote50_train.csv',
 'preprocess_v1_smoteTomek20_train.csv',
 'preprocess_v1_smoteTomek50_train.csv',
 'preprocess_v1_train.csv',
 'preprocess_v1_under50_train.csv',
 'preprocess_v2_over50_train.csv',
 'preprocess_v2_smote20_train.csv',
 'preprocess_v2_smote50_train.csv',
 'preprocess_v2_smoteTomek20_train.csv',
 'preprocess_v2_smoteTomek50_train.csv',
 'preprocess_v2_train.csv',
 'preprocess_v2_under50_train.csv']

In [11]:
preproc_val = [file for file in os.listdir(f'{PROCESSED}') if file.endswith('val.csv')]
preproc_val

['preprocess_v1_val.csv', 'preprocess_v2_val.csv']

In [12]:
for train_file in sorted(preproc_train):
    df_train = pd.read_csv(f'{PROCESSED}/{train_file}', compression = 'zip')
    df_val = pd.read_csv(f'{PROCESSED}/{preproc_val[0]}', compression = 'zip')
    
    print(f'label: {train_file:35} \tnrows: {len(df_train)} \t%target train: {df_train[target].mean():.4f} \t%target val: {df_val[target].mean():.4f}')

label: preprocess_v1_over50_train.csv      	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_smote20_train.csv     	nrows: 48134 	%target train: 0.1667 	%target val: 0.0852
label: preprocess_v1_smote50_train.csv     	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_smoteTomek20_train.csv 	nrows: 46412 	%target train: 0.1543 	%target val: 0.0852
label: preprocess_v1_smoteTomek50_train.csv 	nrows: 79638 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_train.csv             	nrows: 43846 	%target train: 0.0852 	%target val: 0.0852
label: preprocess_v1_under50_train.csv     	nrows: 7468 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v2_over50_train.csv      	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v2_smote20_train.csv     	nrows: 48134 	%target train: 0.1667 	%target val: 0.0852
label: preprocess_v2_smote50_train.csv     	nrows: 80224 	%target train: 0.5000 	%target v
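
The row counts above can be cross-checked with a little arithmetic. The sketch below assumes that `over50` and `under50` mean resampling to a 50% positive rate and that the `20` suffix means a minority-to-majority ratio of 0.2. These readings are inferred from the file names and the printed numbers, not stated anywhere in the notebook.

```python
n_train = 43846          # rows in preprocess_v1_train.csv (printed above)
n_pos = 7468 // 2        # under50 has 7468 rows at a 50% rate -> 3734 positives
n_neg = n_train - n_pos  # remaining rows are negatives

base_rate = round(n_pos / n_train, 4)                   # positive rate of the raw train set
over50_rows = 2 * n_neg                                 # oversample positives up to the negative count
smote20_pos = int(0.2 * n_neg)                          # assumed ratio: minority = 0.2 * majority
smote20_rows = n_neg + smote20_pos
smote20_rate = round(smote20_pos / smote20_rows, 4)

print(base_rate, over50_rows, smote20_rows, smote20_rate)
```

Under these assumptions the computed values agree with the `nrows` and `%target train` columns printed for `preprocess_v1_train.csv`, `preprocess_v1_over50_train.csv` and `preprocess_v1_smote20_train.csv`.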

In [13]:
train_file = 'preprocess_v1_train.csv'
val_file = 'preprocess_v1_val.csv'


In [14]:
from sklearn.model_selection import ParameterGrid

n_neighbors = list(range(5,10))
p=[1,2]
#random_state= [seed]
cv_grid = dict(n_neighbors=n_neighbors, p=p)

In [15]:


params_grid = list(ParameterGrid(cv_grid))
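
To make the size of the search concrete, the grid can be expanded by hand. This is a stand-in written with `itertools.product` rather than scikit-learn's `ParameterGrid`; the variable `combos` is introduced here only for illustration.

```python
from itertools import product

cv_grid = {"n_neighbors": list(range(5, 10)), "p": [1, 2]}

# Cartesian product over the values, with keys taken in sorted order
keys = sorted(cv_grid)
combos = [dict(zip(keys, values)) for values in product(*(cv_grid[k] for k in keys))]

print(len(combos))
print(combos[0], combos[1])
```

Five values of `n_neighbors` times two values of `p` give 10 combinations, which matches the `0/10` total shown by the progress bar in the training loop. Each combination is then fitted once per training file, so the full run is 10 x 14 model fits.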

In [16]:
df_results = pd.DataFrame(columns = ['preproc_label', 'model_label', 'método', 'parámetros', 'columnas_out',
                                     'auc_train', 'auc_val', 'threshold','f1_train', 'f1_val'])


for knn_params in tqdm(params_grid):

    for train_file in sorted(preproc_train):

        preproc_label = train_file.split('_train')[0]

        print('----------------------------------------------------------------------')
        print(knn_params)
        print(train_file)
        print('----------------------------------------------------------------------')

        # NOTE: every training set is validated against preproc_val[0]
        # ('preprocess_v1_val.csv'), including the v2 training sets.
        df_train = pd.read_csv(f'{PROCESSED}/{train_file}', compression = 'zip')
        df_val = pd.read_csv(f'{PROCESSED}/{preproc_val[0]}', compression = 'zip')

        X_train, y_train = df_train.drop(target, axis = 1), df_train[target]
        X_val, y_val = df_val.drop(target, axis = 1), df_val[target]

        knn = KNeighborsClassifier(n_neighbors = knn_params["n_neighbors"], p = knn_params["p"])
        knn_fit = knn.fit(X_train, y_train)

        # NOTE: predict() returns hard 0/1 labels, not probabilities, so the AUC
        # below is computed on labels and the selected threshold can only be 0 or 1.
        # knn.predict_proba(X)[:, 1] would give scores instead; the outputs shown
        # below were produced with predict().
        probs_train = knn.predict(X_train)
        probs_val = knn.predict(X_val)

        auc_train = roc_auc_score(y_train, probs_train)
        auc_val = roc_auc_score(y_val, probs_val)

        # best threshold: the one that maximises F1 on the training set
        prec, recall, threshold = precision_recall_curve(y_train, probs_train)
        prec_recall = pd.DataFrame({'prec': prec[:-1], 'recall': recall[:-1], 'threshold': threshold})
        prec_recall['f1'] = 2*prec_recall['prec']*prec_recall['recall'] / (prec_recall['prec'] + prec_recall['recall'])
        prec_recall = prec_recall.sort_values(by = 'f1', ascending = False).head(1)

        # f1 scores: train F1 at the best threshold, validation F1 at that same threshold
        best_threshold = prec_recall['threshold'].values[0]
        f1_train = prec_recall['f1'].values[0]

        labels_val = np.where(probs_val >= best_threshold, 1, 0)
        f1_val = f1_score(y_val, labels_val)

        print(f'auc_train: {auc_train:.6f} \tauc_val: {auc_val:.6f} \tf1_train: {f1_train:.6f} \tf1_val: {f1_val:.6f}')

        # one row per (parameter set, training file) pair
        results = [preproc_label, 'KNeighborsClassifier', 'fit', knn_params, '',
                  auc_train, auc_val, best_threshold, f1_train, f1_val]

        df_results.loc[len(df_results)] = results

  0%|          | 0/10 [00:00<?, ?it/s]

----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 1}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.948033 	auc_val: 0.646267 	f1_train: 0.950598 	f1_val: 0.292578
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 1}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.849801 	auc_val: 0.586403 	f1_train: 0.770056 	f1_val: 0.250441
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 1}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.951050 	auc_val: 0.643785 	f1_train: 0.953186 	f1_val: 0.295737
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 1}
preprocess_v1_smoteTomek20_train.csv
--------------------------------

 10%|█         | 1/10 [1:05:12<9:46:49, 3912.13s/it]

auc_train: 0.802625 	auc_val: 0.671237 	f1_train: 0.801882 	f1_val: 0.261859
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 2}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.946325 	auc_val: 0.651668 	f1_train: 0.949060 	f1_val: 0.296661
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 2}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.853142 	auc_val: 0.611356 	f1_train: 0.759620 	f1_val: 0.287375
----------------------------------------------------------------------
{'n_neighbors': 5, 'p': 2}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.939644 	auc_val: 0.664970 	f1_train: 0.942960 	f1_val: 0.305788
----------------------------------------------------------------------
{'n_neighbors': 5, 

 10%|█         | 1/10 [1:46:37<15:59:35, 6397.26s/it]


KeyboardInterrupt: 

In [17]:
df_results

| | preproc_label | model_label | método | parámetros | columnas_out | auc_train | auc_val | threshold | f1_train | f1_val |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | preprocess_v1_over50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.948033 | 0.646267 | 1 | 0.950598 | 0.292578 |
| 1 | preprocess_v1_smote20 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.849801 | 0.586403 | 1 | 0.770056 | 0.250441 |
| 2 | preprocess_v1_smote50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.95105 | 0.643785 | 1 | 0.953186 | 0.295737 |
| 3 | preprocess_v1_smoteTomek20 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.853254 | 0.589248 | 1 | 0.783278 | 0.261741 |
| 4 | preprocess_v1_smoteTomek50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.953464 | 0.643389 | 1 | 0.955319 | 0.29709 |
| 5 | preprocess_v1 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.583763 | 0.538339 | 1 | 0.281168 | 0.144928 |
| 6 | preprocess_v1_under50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.801419 | 0.668461 | 1 | 0.800645 | 0.25988 |
| 7 | preprocess_v2_over50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.952869 | 0.654187 | 1 | 0.954956 | 0.293253 |
| 8 | preprocess_v2_smote20 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.849527 | 0.585618 | 1 | 0.769804 | 0.24868 |
| 9 | preprocess_v2_smote50 | KNeighborsClassifier | fit | {'n_neighbors': 5, 'p': 1} | | 0.950975 | 0.644593 | 1 | 0.953131 | 0.29588 |
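
The `threshold` column is 1 in every row. That is what one would expect when the threshold search is fed hard 0/1 labels from `knn.predict`, since the only candidate cut-offs are then 0 and 1. The sketch below illustrates the difference with toy data in plain Python; `best_f1_threshold` is a hypothetical helper, and the scores stand in for what `knn.predict_proba(X)[:, 1]` would return with `n_neighbors=5` (multiples of 0.2).

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximising F1 for the rule `score >= threshold`."""
    best = (None, 0.0)
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        if tp == 0:
            continue  # F1 is 0 (or undefined) without true positives
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (t, f1)
    return best


# Toy data, invented for illustration only
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
hard_labels = [0, 0, 0, 1, 0, 0, 1, 1]                  # shape of knn.predict output
vote_share = [0.0, 0.2, 0.2, 0.8, 0.4, 0.4, 1.0, 0.6]   # shape of predict_proba(X)[:, 1]

thr_hard, f1_hard = best_f1_threshold(y_true, hard_labels)
thr_soft, f1_soft = best_f1_threshold(y_true, vote_share)
print(thr_hard, round(f1_hard, 4))
print(thr_soft, round(f1_soft, 4))
```

With hard labels the search can only confirm the default decision, whereas with vote shares it can trade precision against recall. Whether that would improve `f1_val` on this dataset is not something the results above can tell us.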


In [18]:
MODELS = DATA/'models'

In [19]:
df_results.to_csv(f'{MODELS}/KNeighborsClassifier.csv', index = False)

In [20]:
df_results.to_csv('KNeighborsClassifier.csv')

In [21]:
df_results.to_excel(f'{MODELS}/KNeighborsClassifier.xlsx', index = False)