# Introduction to Machine Learning

## 1. Defining the Question


### a) Specifying the Question

Identify the employees eligible for promotion at a particular checkpoint so that
the company can expedite the entire promotion cycle.

### b) Defining the Metric for Success

The project is successful if we can predict whether a potential promotee at a
checkpoint will be promoted or not after the evaluation process.

### c) Understanding the context 

HR analytics is revolutionising the way human resources departments operate, leading
to higher efficiency and better results overall. Human resources departments have been using
analytics for years. However, the collection, processing, and analysis of data have been
largely manual, and given the nature of human resources dynamics and HR KPIs, this
approach has been constraining HR. It is therefore surprising that HR departments
woke up to the utility of machine learning so late in the game.

### d) Recording the Experimental Design

The following steps are used to answer the given question:

1.  Data Exploration
2.  Data Preparation
3.  Data Modelling
4.  Summary and Findings






### e) Data Relevance

How relevant was the provided data?

Very relevant.

## 2. Reading the Data

In [2]:
# Importing our libraries - pandas, numpy and the scikit-learn decision tree classifier
# ---
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Load the data below
# --- 
# Dataset url: https://bit.ly/2ODZvLCHRDataset
# --- 
df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')

In [4]:
# Checking the first 5 rows of data
# ---
#
df.head()

|  | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |


In [5]:
# Checking the last 5 rows of data
# ---
#
df.tail()

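|  | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
| 54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
| 54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
| 54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
| 54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |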


In [6]:
# Sample 10 rows of data
# ---
#
df.sample(10)

|  | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29865 | 25946 | Technology | region_16 | Bachelor's | m | sourcing | 2 | 39 | 3.0 | 10 | 0 | 0 | 81 | 0 |
| 6787 | 66476 | Sales & Marketing | region_12 | Bachelor's | m | sourcing | 1 | 25 | NaN | 1 | 0 | 0 | 43 | 0 |
| 26211 | 10412 | Sales & Marketing | region_13 | Bachelor's | f | sourcing | 1 | 32 | 4.0 | 3 | 0 | 0 | 62 | 1 |
| 13321 | 40446 | Operations | region_15 | Master's & above | m | sourcing | 1 | 35 | 5.0 | 3 | 1 | 0 | 64 | 0 |
| 33866 | 5061 | Procurement | region_20 | Bachelor's | m | other | 2 | 37 | 3.0 | 5 | 0 | 0 | 70 | 0 |
| 13963 | 36548 | Sales & Marketing | region_3 | Master's & above | m | sourcing | 1 | 39 | 4.0 | 5 | 0 | 0 | 50 | 0 |
| 42540 | 25547 | Sales & Marketing | region_8 | Bachelor's | m | other | 1 | 29 | 1.0 | 3 | 0 | 0 | 46 | 0 |
| 30699 | 2067 | Procurement | region_30 | Bachelor's | m | sourcing | 1 | 34 | 4.0 | 7 | 0 | 0 | 74 | 0 |
| 30723 | 56141 | Sales & Marketing | region_2 | Bachelor's | m | other | 1 | 41 | 1.0 | 4 | 0 | 0 | 51 | 0 |
| 47689 | 37040 | Procurement | region_11 | Bachelor's | f | sourcing | 1 | 36 | 3.0 | 9 | 0 | 0 | 68 | 0 |


In [7]:
# Checking number of rows and columns
# ---
#  
df.shape

(54808, 14)

In [8]:
# Checking datatypes
# ---
df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

Record your observations below:

*   Most of the variables are integer types: 8 of the 14 columns are int64, 5 are object and 1 is float64.
*   The data provided has 54,808 rows and 14 features.
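The dtype observation can be backed by a quick count rather than by reading the listing. The following is a minimal sketch on a made-up three-row frame; `toy` and its values are illustrative and not taken from the dataset. On the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, reusing a few of the dataset's column names
toy = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department": ["HR", "Technology", "Operations"],
    "age": [35, 30, 41],
    "previous_year_rating": [5.0, None, 3.0],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Names of the numeric columns and of the remaining (text) columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```

Splitting the column names this way is also handy later, since the text columns are the ones that need encoding before modelling.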



## 3. External Data Source Validation

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client is facing is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is:

1.  They first identify a set of employees based on recommendations and past performance.
2.  Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required in each vertical.
3.  At the end of the program, an employee gets promoted based on various factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).

Under this process, the final promotions are only announced after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that they can expedite the entire promotion cycle.

They have provided multiple attributes around employees' past and current performance, along with demographics. The task is to predict whether a potential promotee at a checkpoint in the test set will be promoted or not after the evaluation process.

## 4. Data Preparation

### Performing Data Cleaning

In [None]:
# Checking missing entries of all the variables. 
# ---
df.isnull().sum()

We observe the following from our dataset:

*   Two columns have missing values: `education` and `previous_year_rating`
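Raw null counts are easier to judge next to the share of rows they represent. This is a minimal sketch on a made-up frame (`toy` and `missing_report` are illustrative names, and the values are not from the dataset); on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; the None entries play the role of missing data
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", None],
    "previous_year_rating": [5.0, 3.0, None, 1.0],
    "age": [35, 30, 41, 28],
})

# Count and percentage of missing entries per column
missing = toy.isnull().sum()
missing_report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})

# Keep only the columns that actually have missing values
missing_report = missing_report[missing_report["missing"] > 0]
print(missing_report)
```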



In [None]:
# Standardizing the dataset, i.e. variable renaming
# We make all column headings lowercase and check the first five rows to confirm the changes
df.columns = df.columns.str.lower()
df.head()

We observe the following from our dataset:

*   We renamed all columns to lowercase and checked the first five rows to confirm that every column name is now lowercase.



In [None]:
# Checking how many duplicate rows are there in the data
# ---
df.duplicated().sum()

We observe the following from our dataset:

*   There are no duplicates in our data



In [None]:
# Checking if any of the columns are all null
# ---
df.isnull().all(axis = 0)

We observe the following from our dataset:

*   None of the columns contains all null values



In [None]:
# Checking if any of the rows are all null
# ---
sum(df.isnull().all(axis = 1))

We observe the following from our dataset:

*   No row contains completely null values



In [None]:
# Checking unique values in department variable
# ---

df.department.unique()

In [None]:
# Checking unique values in region variable
# ---

df.region.unique()

In [None]:
# Checking unique values in education variable
# ---

df.education.unique()

In [None]:
# Checking unique values in recruitment_channel variable
# ---

df.recruitment_channel.unique()

In [None]:
# Checking unique values in gender variable
# ---

df.gender.unique()
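The five checks above follow the same pattern, so they can also be run in one pass over every non-numeric column. A minimal sketch, assuming a small made-up frame (`toy` and `unique_values` are illustrative names); on the real data `toy` would be `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with a few categorical columns and one numeric one
toy = pd.DataFrame({
    "department": ["HR", "Technology", "HR"],
    "gender": ["f", "m", "m"],
    "recruitment_channel": ["sourcing", "other", "other"],
    "age": [35, 30, 41],
})

# Collect the distinct, non-null values of every non-numeric column
unique_values = {
    col: sorted(toy[col].dropna().unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in unique_values.items():
    print(f"{col} ({len(values)}): {values}")
```

Dropping nulls before sorting avoids comparing strings with `NaN`, which matters for a column such as `education` that has missing entries.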

### Overall Data Cleaning Observations


- A large number of values are missing in the `education` field, which is important to our analysis.

- `previous_year_rating` also has missing values and is important to our analysis.



In [9]:
# Let's start by creating a copy of our dataframe: df_clean = df.copy().
# We will use this copy as our cleaning copy.
# ---
#
df_clean = df.copy()
df_clean.head()

|  | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |


In [10]:
# Then extracting the lowest value in the column previous_year_rating

df_clean['previous_year_rating'].min()

1.0

In [11]:
# replace null values with minimum rating of previous year column and check for missing values in the column
# ---
df_clean['previous_year_rating'] = df_clean['previous_year_rating'].fillna(df_clean['previous_year_rating'].min())
df_clean['previous_year_rating'].isnull().any()

False

In [12]:
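# Replace missing education values with the lowest level ("Below Secondary") and check for remaining nulls
# ---
df_clean['education'] = df_clean['education'].fillna("Below Secondary")
df_clean['education'].isnull().any()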

False

In [13]:
df_clean.sample(10)
df_clean.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

Observations

1.   All missing `education` values were filled with the lowest education level (Below Secondary).
2.   Missing `previous_year_rating` values were replaced with the lowest rating (1.0).
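The two fills can be tried on a tiny frame to see exactly what they change. This is a minimal sketch with made-up rows; `toy` is illustrative, and the `rating_was_missing` flag is an optional addition that is not part of the original notebook. It only records which ratings were imputed, so that filled values can later be told apart from genuine low ratings.

```python
import pandas as pd

# Hypothetical stand-in for df_clean before cleaning
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above"],
    "previous_year_rating": [5.0, None, 3.0],
})

# Optional (assumption, not in the original notebook): remember which ratings were missing
toy["rating_was_missing"] = toy["previous_year_rating"].isnull()

# Same fills as in the notebook: lowest observed rating and lowest education level
lowest_rating = toy["previous_year_rating"].min()
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(lowest_rating)
toy["education"] = toy["education"].fillna("Below Secondary")

print(toy)
```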



In [14]:
# Checking the first 5 records of the cleaned dataset
# ---
#
df_clean.head()

|  | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |


In [15]:
df_clean['is_promoted'].value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64

In [59]:
# Share of employees in each class, computed from the data instead of hard-coded counts
promoted_share = df_clean['is_promoted'].value_counts(normalize=True) * 100
print(f'1. The percentage of employees who have not received a promotion is {promoted_share[0]:.2f}%')
print(f'2. The percentage of employees who got promoted is {promoted_share[1]:.2f}%')

1. The percentage of employees who have not received a promotion is 91.48%
2. The percentage of employees who got promoted is 8.52%
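This class imbalance sets a baseline for any model: a classifier that always predicts "not promoted" is right for every non-promoted employee and wrong for every promoted one. A minimal sketch of that arithmetic, rebuilding the labels from the two counts reported by `value_counts()`; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported by value_counts() on is_promoted
n_not_promoted, n_promoted = 50140, 4668

# True labels rebuilt from the counts, and a "model" that always predicts 0
y_true = np.array([0] * n_not_promoted + [1] * n_promoted)
y_always_zero = np.zeros_like(y_true)

# Accuracy equals the share of the majority class
baseline_accuracy = accuracy_score(y_true, y_always_zero)

# Recall for class 1: the share of promoted employees the "model" finds
baseline_recall = recall_score(y_true, y_always_zero)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall for promoted employees: {baseline_recall:.4f}")
```

Any accuracy reported later is therefore only informative to the extent that it exceeds roughly 91.5%, and recall on the promoted class is the more telling number for this question.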


There is a 16% chance of being promoted if you met more than 80% of your KPIs.
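A figure like this can be checked with a group-by: because `is_promoted` is 0 or 1, its mean within each group is that group's promotion rate. The sketch below uses made-up rows, so its numbers are illustrative only; `toy` and `kpi_rates` are hypothetical names, and the column name is taken as printed in the outputs above. On the real data the same line would be applied to `df_clean`.

```python
import pandas as pd

# Hypothetical rows: 4 employees who met the KPI threshold, 6 who did not
toy = pd.DataFrame({
    "KPIs_met >80%": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "is_promoted":   [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 target per group = promotion rate per group, as a percentage
kpi_rates = (toy.groupby("KPIs_met >80%")["is_promoted"].mean() * 100).round(2)
print(kpi_rates)
```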

In [20]:
rating_probs = df_clean.groupby(['previous_year_rating','is_promoted']).agg({
    'department': ['count']
}).reset_index()

rating_probs.columns = ['previous_year_rating', 'is_promoted', "employees"]

rating_probs = pd.pivot_table(rating_probs,
                              index = 'previous_year_rating',
                              columns = 'is_promoted',
                              values = "employees"
                             ).reset_index()

rating_probs.columns = ['previous_year_rating', 'not_promoted', 'promoted']
rating_probs['total_employees'] = rating_probs['not_promoted'] + rating_probs['promoted']
rating_probs['promotion_probs'] = (rating_probs['promoted']/(rating_probs['not_promoted']+rating_probs['promoted']))*100
rating_probs = rating_probs.sort_values('promotion_probs', ascending=False).reset_index(drop=True)
rating_probs

|  | previous_year_rating | not_promoted | promoted | total_employees | promotion_probs |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.0 | 9820 | 1921 | 11741 | 16.361468 |
| 1 | 4.0 | 9093 | 784 | 9877 | 7.937633 |
| 2 | 3.0 | 17263 | 1355 | 18618 | 7.277903 |
| 3 | 2.0 | 4044 | 181 | 4225 | 4.284024 |
| 4 | 1.0 | 9920 | 427 | 10347 | 4.1268 |


There is a 7.9% chance of being promoted with a previous-year rating of 4, compared with 16.4% for a rating of 5.

In [21]:
data_awards = pd.pivot_table(df_clean,
                    index = ['awards_won?'],
                    columns = ['is_promoted'],
                    aggfunc = {'is_promoted' : ['count']}).reset_index()
data_awards.columns = ['awards_won', 'not_promoted', 'promoted']
data_awards['total'] = data_awards['not_promoted']+data_awards['promoted']
data_awards['probability'] = round((data_awards['promoted']/data_awards['total'])*100,2)
data_awards = data_awards.sort_values(['probability'], ascending=False)
data_awards

|  | awards_won | not_promoted | promoted | total | probability |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 711 | 559 | 1270 | 44.02 |
| 0 | 0 | 49429 | 4109 | 53538 | 7.67 |


There is a 44% chance of being promoted if you won an award, compared with 7.7% otherwise.

In [22]:
prob_edu = df_clean.groupby(['education','is_promoted'])['employee_id'].count().reset_index()
edu = prob_edu.pivot_table(index='education', columns='is_promoted', 
                       values='employee_id').reset_index()
edu.columns = ['education', 'not_promoted', 'promoted']
edu['total'] = edu['not_promoted']+edu['promoted']
edu['probability'] = round((edu['promoted']/edu['total'])*100,2)
edu = edu.sort_values(['probability'], ascending=False)
edu

|  | education | not_promoted | promoted | total | probability |
| --- | --- | --- | --- | --- | --- |
| 2 | Master's & above | 13454 | 1471 | 14925 | 9.86 |
| 0 | Bachelor's | 33661 | 3008 | 36669 | 8.2 |
| 1 | Below Secondary | 3025 | 189 | 3214 | 5.88 |
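The rating, awards and education tables are all built the same way: count employees per class, add a total, and divide. One possible way to avoid repeating that pattern is a small helper around `pd.crosstab`. This is a sketch, not the notebook's own method; `promotion_table` and `toy` are hypothetical names, and it assumes both classes (0 and 1) occur in the data.

```python
import pandas as pd

def promotion_table(data, column, target="is_promoted"):
    """Counts per target class plus the promotion rate (%) for each value of `column`."""
    table = pd.crosstab(data[column], data[target])
    table = table.rename(columns={0: "not_promoted", 1: "promoted"})
    table.columns.name = None  # drop the leftover "is_promoted" label on the columns
    table["total"] = table["not_promoted"] + table["promoted"]
    table["probability"] = (table["promoted"] / table["total"] * 100).round(2)
    return table.sort_values("probability", ascending=False).reset_index()

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "awards_won?": [1, 1, 0, 0, 0, 0],
    "is_promoted": [1, 0, 0, 0, 1, 0],
})

result = promotion_table(toy, "awards_won?")
print(result)
```

On the real data, calls such as `promotion_table(df_clean, 'education')` would be expected to produce tables of the same shape as the ones above.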


In [61]:
df_clean.columns

Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'KPIs_met >80%', 'awards_won?',
       'avg_training_score', 'is_promoted'],
      dtype='object')

In [None]:
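import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Separate the features from the target
x = df_clean.drop(['is_promoted'], axis = 1)
y = df_clean.loc[:,"is_promoted"].values

# One-hot encode the text columns, since scikit-learn trees need numeric inputs
x = pd.get_dummies(x)

# Hold out part of the data for testing (20% here), keeping the class balance in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=12345, stratify=y)

model = DecisionTreeClassifier(random_state=12345,max_depth=5)

# Fit on the training rows only
model.fit(x_train, y_train)

train_predictions = model.predict(x_train)
test_predictions = model.predict(x_test)

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))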
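Since only about 8.5% of employees are promoted, a per-class view says more than a single accuracy figure. The following is a minimal sketch of a confusion matrix and classification report for a depth-5 decision tree. It runs on synthetic data from `make_classification` with a similar class ratio, so it stays self-contained; the features, the 20% test size and the resulting scores are assumptions for illustration and say nothing about the HR dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features: roughly 8.5% positives, like is_promoted
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.915, 0.085], random_state=12345)

# Stratified split so both parts keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12345)

model = DecisionTreeClassifier(random_state=12345, max_depth=5)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, test_predictions)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_test, test_predictions,
                            target_names=["not_promoted", "promoted"],
                            zero_division=0))
```

The recall of the `promoted` row is the number to watch here, because it is the share of promotable employees the model would actually surface at the checkpoint.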

**Recommendations**

1.   There is a 44% chance of being promoted if you won an award.
2.   There is a 7.9% chance of being promoted with a previous-year rating of 4.
3.   There is a 16% chance of being promoted if you met more than 80% of your KPIs.




