# Introduction to Machine Learning Project

## Question

# Background
A large multinational has nine broad verticals across the organization. One of its problems is identifying the right people for promotion (only for the manager position and below) and preparing them in time.

The process currently followed is:
● Employees are first identified based on recommendations and past performance.
● Selected employees go through a separate training and evaluation program for each vertical.
● These programs are based on the skills each vertical requires. At the end of the program, an employee gets a promotion based on various factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).

# Task
Predict whether a potential promotee at a checkpoint will be promoted or not after the evaluation process.



# 1. Data Exploration

In [1]:
import pandas as pd
import numpy as np


hr_employee_df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')
glossary_df = pd.read_csv('https://bit.ly/2Wz3sWcGlossary')

hr_employee_df.head()

#important information
#previous_year_rating, KPIs_met >80%, 
#drop all is_promoted == 1


Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [2]:
#get the shape of the df
hr_employee_df.shape

(54808, 14)

In [3]:
#get the column name & data types
hr_employee_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  KPIs_met >80%         54808 non-null  int64  
 11  awards_won?           54808 non-null  int64  
 12  avg_training_score    54808 non-null  int64  
 13  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.9+ MB
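
Before modelling, it can help to look at how the target classes are distributed, since accuracy is hard to interpret when one class dominates. The sketch below uses a small invented frame (`toy_df` and its values are assumptions for illustration, not rows from the HR dataset); the same two calls could be run on `hr_employee_df['is_promoted']`.

```python
import pandas as pd

# Toy stand-in for hr_employee_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "is_promoted": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Share of each class in the target column.
class_share = toy_df["is_promoted"].value_counts(normalize=True)
print(class_share)

# Accuracy of a baseline that always predicts the most frequent class.
majority_baseline = class_share.max()
print("Majority-class baseline accuracy:", majority_baseline)
```

Any model is only informative to the extent that it beats this majority-class baseline.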


# 2. Data Preparation

In [4]:
# Standardize the column names by stripping leading and trailing spaces
hr_employee_df.columns = hr_employee_df.columns.str.strip()

In [5]:
# check for missing data in a dataset.
hr_employee_df.isna().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [6]:
# Replace missing previous_year_rating values with the column mean
mean_value = hr_employee_df['previous_year_rating'].mean()
hr_employee_df['previous_year_rating'] = hr_employee_df['previous_year_rating'].fillna(mean_value)

# Replace missing education values with the label "No Education"
hr_employee_df['education'] = hr_employee_df['education'].fillna("No Education")

# Check for missing values again to confirm the replacement
hr_employee_df.isna().sum()


employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64
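
Mean imputation is one option among several. As a hedged alternative, the sketch below fills missing ratings with the median and keeps a flag recording which rows were originally missing, so a model can still see that information. The frame `toy_df`, its values, and the column name `rating_was_missing` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rating column; the values are invented for illustration.
toy_df = pd.DataFrame({
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0, 3.0, np.nan, 4.0],
})

# Record which ratings were missing before filling them (hypothetical column name).
toy_df["rating_was_missing"] = toy_df["previous_year_rating"].isna().astype(int)

# Fill the gaps with the median; for this toy data the median is 3.0.
# With an even number of known values the median can be a half step (e.g. 3.5),
# in which case the integer cast below would truncate it.
median_rating = toy_df["previous_year_rating"].median()
toy_df["previous_year_rating"] = (
    toy_df["previous_year_rating"].fillna(median_rating).astype(int)
)

print(toy_df)
```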

In [7]:
# Convert previous_year_rating from float to int
# (the cast truncates the mean-imputed values to whole numbers)
hr_employee_df['previous_year_rating'] = hr_employee_df['previous_year_rating'].astype(np.int64)

#confirm if conversion was successful
hr_employee_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   employee_id           54808 non-null  int64 
 1   department            54808 non-null  object
 2   region                54808 non-null  object
 3   education             54808 non-null  object
 4   gender                54808 non-null  object
 5   recruitment_channel   54808 non-null  object
 6   no_of_trainings       54808 non-null  int64 
 7   age                   54808 non-null  int64 
 8   previous_year_rating  54808 non-null  int64 
 9   length_of_service     54808 non-null  int64 
 10  KPIs_met >80%         54808 non-null  int64 
 11  awards_won?           54808 non-null  int64 
 12  avg_training_score    54808 non-null  int64 
 13  is_promoted           54808 non-null  int64 
dtypes: int64(9), object(5)
memory usage: 5.9+ MB


In [8]:
# Count fully duplicated rows; a result of 0 means there is nothing to remove
hr_employee_df.duplicated().sum()

0

# 3. Data Modelling


In [11]:
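# Import the classifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier

# Define features and target; the categorical (object) columns are dropped
# because DecisionTreeClassifier needs numeric input
dropped_columns = ['education', 'recruitment_channel', 'is_promoted', 'department', 'region', 'gender']
features = hr_employee_df.drop(dropped_columns, axis=1)
target = hr_employee_df['is_promoted']

# Fit a decision tree on the full dataset
model = DecisionTreeClassifier()
model.fit(features, target)

# Take the first 20 rows as a small test sample
testing_df = hr_employee_df[:20]
test_features = testing_df.drop(dropped_columns, axis=1)
test_target = testing_df['is_promoted']

# NOTE: this replaces the model fitted above with a new tree trained only on
# the 20 sample rows and then predicts those same rows, so the match below
# shows that the tree memorized them, not how well it generalizes
model = DecisionTreeClassifier()
model.fit(test_features, test_target)
test_predictions = model.predict(test_features)

print('Predictions:', test_predictions)
print('Correct answers:', test_target.values)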


Predictions: [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
Correct answers: [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
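
Predictions on rows a model has already seen say little about unseen employees. A common remedy is to hold out part of the data for validation. The sketch below illustrates this with `train_test_split` on synthetic data: the frame `toy_df`, the promotion rule used to label it, and the `random_state` values are all assumptions for illustration, not properties of the HR dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data; values are randomly generated.
rng = np.random.default_rng(0)
n_rows = 400
toy_df = pd.DataFrame({
    "avg_training_score": rng.integers(40, 100, size=n_rows),
    "previous_year_rating": rng.integers(1, 6, size=n_rows),
})

# Invented labelling rule, used only so the example has a learnable target.
toy_df["is_promoted"] = (
    (toy_df["avg_training_score"] > 80) & (toy_df["previous_year_rating"] >= 4)
).astype(int)

features = toy_df.drop(columns="is_promoted")
target = toy_df["is_promoted"]

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345, stratify=target
)

# Fit on the training part only, then score both parts.
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)

print("Training accuracy:", model.score(features_train, target_train))
print("Validation accuracy:", model.score(features_valid, target_valid))
```

The gap between the two scores gives a rough indication of overfitting; only the validation score speaks to how the model might behave on new employees.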


In [12]:
testing_df.shape

(20, 14)
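
The model above drops every categorical column. If those columns should contribute to the prediction, one possible approach is one-hot encoding with `pd.get_dummies`. This is a minimal sketch on an invented frame (`toy_df` and its rows are assumptions), not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in with two categorical columns; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology", "Operations"],
    "gender": ["f", "m", "m", "f"],
    "avg_training_score": [49, 60, 73, 55],
})

# Turn each category into its own 0/1 column; numeric columns pass through unchanged.
encoded_df = pd.get_dummies(toy_df, columns=["department", "gender"], dtype=int)

print(encoded_df.columns.tolist())
print(encoded_df)
```

A column with many distinct values, such as `region`, would produce many new columns this way, which is worth weighing before encoding it.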

# 4. Conclusion

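After the evaluation process, the decision tree model correctly predicted the promotion outcome for all 20 sampled employees, including the one who was promoted. Because the model was tested on the same rows it was trained on, this result shows that the model can fit the data; it does not yet show how accurately it will predict promotions for unseen employees.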