# Optimizing ML Model for Merit Recommender

Problem Statement:

A leading multinational corporation is struggling to identify suitable candidates for promotion efficiently, specifically for positions at manager level and below across the organization. The existing procedure first pinpoints potential candidates based on recommendations and past performance. The selected individuals then undergo distinct training and evaluation programs tailored to the skill requirements of each department. After the program is completed, promotions are determined by factors such as training performance, key performance indicator (KPI) completion (with a threshold of over 60% completion), and other relevant criteria.

However, promotions are announced only after this comprehensive evaluation is complete, which delays the transition to new roles.

In this project, we’ll build an ML model that identifies the right people for promotion at managerial and lower-level positions within the client’s multinational organization, and then optimize its performance. We’ll use pandas to load and analyze the dataset. We’ll then implement the ML algorithm with the scikit-learn library and tune it with scikit-optimize to find the model that most accurately identifies candidates eligible for promotion.

This optimized ML model will enable quicker identification of selected individuals and a more efficient transition into their new roles across different parts of the organization.

## Task 1: Import Modules

Import the pandas and NumPy libraries for data analysis.

Import feature engineering methods, ML algorithms, and evaluation metrics from the scikit-learn library.

Add the random seed generator using NumPy.

In [1]:
# Import numpy and pandas library
import numpy as np 
import pandas as pd 
# Import feature engineering methods, machine learning algorithms and evaluation metrics from scikit learn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Add seeding using Numpy
np.random.seed(123)


## Task 2: Load the Dataset

In this task, perform the following actions:

Load the dataset using the pandas library.

Show the first 10 rows from the dataset.

Show a list of all columns in the dataset.

Show the dataset’s shape.

Show the DataFrame’s summary.

In [2]:
# Load the dataset using the pandas library
data = pd.read_csv('usercode/hr_analytics_data.csv')

In [3]:
# Print the first 10 rows

data.head(10)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,8455,Operations,region_5,Bachelor's,m,sourcing,1,32,4.0,5,0,0,61,0
1,45188,Technology,region_11,Bachelor's,f,other,1,35,1.0,4,1,0,76,0
2,10881,Operations,region_26,Bachelor's,m,other,1,28,3.0,4,0,0,59,0
3,70189,Technology,region_15,Bachelor's,m,other,2,40,5.0,13,1,0,78,0
4,69740,Sales & Marketing,region_2,Bachelor's,f,sourcing,1,40,3.0,11,0,0,54,0
5,55412,Analytics,region_22,Bachelor's,m,sourcing,2,30,3.0,6,1,0,85,0
6,11507,Operations,region_29,Bachelor's,f,sourcing,1,32,4.0,2,0,0,63,0
7,36461,Technology,region_4,Bachelor's,m,sourcing,2,32,4.0,6,1,0,79,0
8,53185,Procurement,region_17,Bachelor's,m,other,1,35,3.0,7,1,1,65,0
9,27508,Sales & Marketing,region_2,Bachelor's,m,sourcing,3,43,1.0,3,0,0,54,0


In [4]:
data.is_promoted.value_counts()

0    6268
1    4232
Name: is_promoted, dtype: int64

In [5]:
# print list of all columns

list(data.columns)

['employee_id',
 'department',
 'region',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted']

Definition of each variable in the dataset: 

| Variable              | Definition                                                           |
|-----------------------|----------------------------------------------------------------------|
| employee_id           | Unique ID for employee                                               |
| department            | Department of employee                                               |
| region                | Region of employment (unordered)                                     |
| education             | Education Level                                                      |
| gender                | Gender of Employee                                                   |
| recruitment_channel   | Channel of recruitment for employee                                  |
| no_of_trainings       | Number of other trainings completed in the previous year on soft skills, technical skills, etc. |
| age                   | Age of Employee                                                      |
| previous_year_rating  | Employee Rating for the previous year                                 |
| length_of_service     | Length of service in years                                           |
| KPIs_met >80%         | If Percent of KPIs (Key Performance Indicators) >80% then 1 else 0    |
| awards_won?           | If awards won during the previous year then 1 else 0                 |
| avg_training_score    | Average score in current training evaluations                        |
| is_promoted           | (Target) Recommended for promotion                                   |


In [6]:
# Print the shape of the dataset 

data.shape

(10500, 14)

In [7]:
# Print a concise summary of the Data Frame

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10500 entries, 0 to 10499
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           10500 non-null  int64  
 1   department            10500 non-null  object 
 2   region                10500 non-null  object 
 3   education             10500 non-null  object 
 4   gender                10500 non-null  object 
 5   recruitment_channel   10500 non-null  object 
 6   no_of_trainings       10500 non-null  int64  
 7   age                   10500 non-null  int64  
 8   previous_year_rating  10500 non-null  float64
 9   length_of_service     10500 non-null  int64  
 10  KPIs_met >80%         10500 non-null  int64  
 11  awards_won?           10500 non-null  int64  
 12  avg_training_score    10500 non-null  int64  
 13  is_promoted           10500 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 1.1+ MB


## Task 3: Clean the Dataset

In this task, perform the following actions:

Check if the dataset provided has missing values.

Remove the ID column from the dataset.

In [8]:
# Check missing values in data using isnull and sum functions

data.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [9]:
# Drop ID column from the dataset using drop function.

data = data.drop(['employee_id'],axis=1)
data.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Operations,region_5,Bachelor's,m,sourcing,1,32,4.0,5,0,0,61,0
1,Technology,region_11,Bachelor's,f,other,1,35,1.0,4,1,0,76,0
2,Operations,region_26,Bachelor's,m,other,1,28,3.0,4,0,0,59,0
3,Technology,region_15,Bachelor's,m,other,2,40,5.0,13,1,0,78,0
4,Sales & Marketing,region_2,Bachelor's,f,sourcing,1,40,3.0,11,0,0,54,0


## Task 4: Perform Exploratory Data Analysis

In this task, evaluate the distribution of values in each column of the dataset.

In [10]:
# Evaluate target distribution using value_counts function

data.is_promoted.value_counts()

0    6268
1    4232
Name: is_promoted, dtype: int64
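
The raw counts are easier to interpret as proportions. The sketch below rebuilds the two counts printed above as a small Series (so it runs without the CSV file) and converts them to class shares. On the real DataFrame, `data.is_promoted.value_counts(normalize=True)` should give the same figures.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 6268, 1: 4232}, name="is_promoted")

# Convert counts to proportions of the whole dataset.
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 40% of the rows belong to the promoted class, so the target is moderately imbalanced. This is one reason a metric such as the F1 score can be more informative than plain accuracy here.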

In [11]:
# Evaluate department distribution using value_counts function

data.department.value_counts()

Sales & Marketing    2946
Operations           2328
Technology           1525
Procurement          1505
Analytics            1024
Finance               474
HR                    387
Legal                 164
R&D                   147
Name: department, dtype: int64

In [12]:
# Evaluate education distribution using value_counts function

data.education.value_counts()

Bachelor's          7093
Master's & above    3300
Below Secondary      107
Name: education, dtype: int64

In [13]:
# Evaluate gender distribution using value_counts function

data.gender.value_counts()

m    7259
f    3241
Name: gender, dtype: int64

In [14]:
data.columns

Index(['department', 'region', 'education', 'gender', 'recruitment_channel',
       'no_of_trainings', 'age', 'previous_year_rating', 'length_of_service',
       'KPIs_met >80%', 'awards_won?', 'avg_training_score', 'is_promoted'],
      dtype='object')

In [15]:
# Evaluate awards_won? distribution using value_counts function

data['awards_won?'].value_counts()

0    9910
1     590
Name: awards_won?, dtype: int64

In [16]:
# Evaluate KPIs_met >80% distribution using value_counts function

data['KPIs_met >80%'].value_counts()

0    5504
1    4996
Name: KPIs_met >80%, dtype: int64

In [17]:
# Evaluate the recruitment_channel distribution using value_counts function

data.recruitment_channel.value_counts()

other       5813
sourcing    4445
referred     242
Name: recruitment_channel, dtype: int64

In [18]:
# Evaluate the region distribution using value_counts function

data.region.value_counts()

region_2     2252
region_22    1350
region_7     1021
region_15     547
region_13     546
region_4      398
region_26     389
region_31     344
region_27     299
region_16     287
region_28     276
region_23     263
region_11     219
region_17     197
region_25     165
region_20     158
region_19     149
region_29     146
region_1      144
region_32     144
region_14     139
region_30     135
region_5      135
region_8      127
region_10     126
region_6      100
region_12      96
region_24      77
region_3       71
region_21      62
region_33      52
region_9       50
region_34      33
region_18       3
Name: region, dtype: int64

## Task 5: Split Features and Target Variables

In this task, save all feature variables in one DataFrame and the target variable in a NumPy array.

In [19]:
# Split features and target from  data

features = data.drop('is_promoted', axis=1)
target = data.is_promoted.values

## Task 6: Preprocess Numerical Variables

In this task, scale the following numerical columns:

length_of_service

previous_year_rating

no_of_trainings

avg_training_score

age

In [20]:
# Scale the numerical columns (age, length_of_service,previous_year_rating,no_of_trainings & avg_training_score)
# with MinMaxScaler()

features["age"] = MinMaxScaler().fit_transform(features["age"].values.reshape(
    -1, 1))
features["length_of_service"] = MinMaxScaler().fit_transform(
    features["length_of_service"].values.reshape(-1, 1))
features["previous_year_rating"] = MinMaxScaler().fit_transform(
    features["previous_year_rating"].values.reshape(-1, 1))
features["no_of_trainings"] = MinMaxScaler().fit_transform(
    features["no_of_trainings"].values.reshape(-1, 1))
features["avg_training_score"] = MinMaxScaler().fit_transform(
    features["avg_training_score"].values.reshape(-1, 1))

In [21]:
# Show top rows to check numerical transformation

features.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,Operations,region_5,Bachelor's,m,sourcing,0.0,0.3,0.75,0.121212,0,0,0.344828
1,Technology,region_11,Bachelor's,f,other,0.0,0.375,0.0,0.090909,1,0,0.603448
2,Operations,region_26,Bachelor's,m,other,0.0,0.2,0.5,0.090909,0,0,0.310345
3,Technology,region_15,Bachelor's,m,other,0.111111,0.5,1.0,0.363636,1,0,0.637931
4,Sales & Marketing,region_2,Bachelor's,f,sourcing,0.0,0.5,0.5,0.30303,0,0,0.224138
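
With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below is an illustration on made-up values (the toy numbers and the `numeric_cols` name are assumptions, not part of the notebook). It scales several columns with a single scaler call and checks the result against the formula.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy values (hypothetical, not taken from the HR dataset).
toy = pd.DataFrame({
    "age": [20, 30, 60],
    "avg_training_score": [40, 70, 100],
})
numeric_cols = ["age", "avg_training_score"]

# One scaler handles every listed column; each column is scaled independently.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# The same result computed directly from the formula.
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Passing a list of columns avoids repeating the `reshape(-1, 1)` call once per column, and the output should match the column-by-column approach.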


## Task 7: Preprocess Binary Variables

In this task, perform the following actions:

Transform the values of the gender column into numerical values.

Show the output of the transformed binary variable.

In [22]:
# Transform the gender column into a numerical representation using LabelEncoder().
# LabelEncoder expects 1-D input, so the column is passed without reshaping.

features["gender"] = LabelEncoder().fit_transform(features["gender"])


In [23]:
# Show top rows to check binary variables transformation

features.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,Operations,region_5,Bachelor's,1,sourcing,0.0,0.3,0.75,0.121212,0,0,0.344828
1,Technology,region_11,Bachelor's,0,other,0.0,0.375,0.0,0.090909,1,0,0.603448
2,Operations,region_26,Bachelor's,1,other,0.0,0.2,0.5,0.090909,0,0,0.310345
3,Technology,region_15,Bachelor's,1,other,0.111111,0.5,1.0,0.363636,1,0,0.637931
4,Sales & Marketing,region_2,Bachelor's,0,sourcing,0.0,0.5,0.5,0.30303,0,0,0.224138
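
To see which integer each label received, the fitted encoder can be inspected. This is a small sketch on toy values that use the same two labels as the dataset. `LabelEncoder` sorts the labels, so `f` should map to 0 and `m` to 1, which matches the table above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy gender values using the same two labels as the dataset.
gender = ["m", "f", "m", "m", "f"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)  # a 1-D sequence is accepted directly

# classes_ is sorted; the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)
```

Keeping a reference to the fitted encoder (rather than creating it inline) also makes `encoder.inverse_transform` available for turning codes back into labels.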


## Task 8: Preprocess Multiple Categorical Variables

In this task, perform the following actions:

Transform multiple categorical variables into a numerical representation.

Show the result after transformation.

Convert the features DataFrame into a NumPy array.

Show the first 10 rows from the NumPy array.

In [24]:
# Change categorical variables (region, education, department, and recruitment_channel) to numerical with the get_dummies() function 
features = pd.get_dummies(features)

In [25]:
# Show top rows to check categorical variables transformation

features.head()

Unnamed: 0,gender,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Analytics,department_Finance,...,region_region_6,region_region_7,region_region_8,region_region_9,education_Bachelor's,education_Below Secondary,education_Master's & above,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing
0,1,0.0,0.3,0.75,0.121212,0,0,0.344828,0,0,...,0,0,0,0,1,0,0,0,0,1
1,0,0.0,0.375,0.0,0.090909,1,0,0.603448,0,0,...,0,0,0,0,1,0,0,1,0,0
2,1,0.0,0.2,0.5,0.090909,0,0,0.310345,0,0,...,0,0,0,0,1,0,0,1,0,0
3,1,0.111111,0.5,1.0,0.363636,1,0,0.637931,0,0,...,0,0,0,0,1,0,0,1,0,0
4,0,0.0,0.5,0.5,0.30303,0,0,0.224138,0,0,...,0,0,0,0,1,0,0,0,0,1
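
The sketch below shows what `pd.get_dummies` does on a tiny made-up frame (the rows are hypothetical). Text columns are expanded into one indicator column per category, named `<column>_<value>`, while numeric columns pass through unchanged. Note that pandas 2.0 and later return `True`/`False` indicators by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one categorical and one numeric column.
toy = pd.DataFrame({
    "education": ["Bachelor's", "Master's & above", "Bachelor's"],
    "age": [0.3, 0.5, 0.2],
})

# Only text/category columns are expanded; numeric columns pass through.
# dtype=int keeps 0/1 output; pandas 2.0+ defaults to True/False otherwise.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```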


In [26]:
# Convert the DataFrame to a NumPy array. 

features = features.to_numpy()

In [27]:
# Show top 10 rows of the features.

features[:10]

array([[1.        , 0.        , 0.3       , 0.75      , 0.12121212,
        0.        , 0.        , 0.34482759, 0.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 1.        , 0.        , 0.        , 0.        ,
        0.        , 1.        , 0.        , 0.        , 0.        ,
        0.        , 1.        ],
       [0.        , 0.        , 0.375     , 0.        , 0.09090909,
        1.        , 0.        , 0.60344828, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.     

## Task 9: Train a Base Machine Learning Model

In this task, perform the following actions:

Instantiate the ML algorithm for classification tasks. For this task, use the KNN algorithm.

Evaluate the model with cross-validation. Remember to use the F1 score as the evaluation metric.

In [28]:
# Create a classifier

knn_classifier = KNeighborsClassifier()

In [29]:
# Implement and run the cross-val-score method
score = cross_val_score(estimator=knn_classifier,
                        X=features,
                        y=target,
                        scoring='f1',
                        cv=3,
                        n_jobs=-1)
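
In the cells above, the scalers are fit on the full dataset before cross-validation, so the minimum and maximum of each validation fold influence the training folds. A possible alternative, sketched below on synthetic data (an assumption; `make_classification` stands in for the HR dataset), is to place the scaler inside a pipeline so that it is re-fit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption), not the HR dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=123)

# The scaler is re-fit on the training part of every fold.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores = cross_val_score(model, X, y, scoring="f1", cv=3)

print("F1 per fold:", scores.round(4))
print("Mean F1:", round(scores.mean(), 4))
```

For min-max scaling the effect on the score is often small, but the pipeline keeps the cross-validated estimate free of information from the validation folds.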




## Task 10: Show Model Performance

In this task, evaluate the performance of the base ML model so that it can serve as a baseline when optimizing the model. Show the average performance of the trained ML model across the cross-validation folds.

In [30]:
# Show the mean score
print('Model performance: F1-score =', round(score.mean(), 4))



Model performance: F1-score = 0.5762


## Task 11: Import Modules from Scikit-optimize Library

In this task, perform the following actions:

Import modules to define the search space.

Import module to handle the search space.

Import the optimization function to be used.

Restore the `np.int` alias in NumPy for compatibility with scikit-optimize.

In [31]:
# Import scikit-optimize module to define the search space
from skopt.space import Integer, Real, Categorical

# Import decorator to handle search space
from skopt.utils import use_named_args

# import optimization function from scikit-optimize library
from skopt import gp_minimize

# Restore the np.int alias (removed in NumPy 1.24), which some scikit-optimize releases still reference
np.int = int

## Task 12: Define the List of Hyperparameter Values for Optimization

In this task, perform the following actions:

Read and understand the hyperparameters of the KNN algorithm.

Define the hyperparameter values for weights, n_neighbors, leaf_size, p, and algorithm.

In [32]:
# Define the hyperparameter values

search_space = list()
search_space.append(Categorical(['uniform', 'distance'], name='weights'))
search_space.append(Integer(5, 100, name='n_neighbors'))
search_space.append(Integer(5, 100, name='leaf_size'))
search_space.append(Real(1,5,name='p'))
search_space.append(Categorical(['auto', 'ball_tree', 'kd_tree','brute'], name='algorithm'))
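
The `p` hyperparameter is the order of the Minkowski distance that `KNeighborsClassifier` uses by default: `p=1` gives the Manhattan distance and `p=2` the Euclidean distance, and the search space above also allows values in between and up to 5. The helper below is a hand-written illustration (the `minkowski` function is not part of scikit-learn's public API used in this notebook).

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a, b = (0, 0), (3, 4)
d1 = minkowski(a, b, p=1)  # Manhattan distance: |3| + |4|
d2 = minkowski(a, b, p=2)  # Euclidean distance: sqrt(3^2 + 4^2)
print(d1, d2)
```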

## Task 13: Define the Objective Function

In this task, perform the following steps:

Use the imported decorator from scikit-optimize to handle the search space.

Define the objective function to perform the following tasks:

Instantiate the classifier.

Pass the selected hyperparameter values.

Train the classification algorithm using the cross-validation technique.

Use the F1 score to evaluate performance.

Return the mean score rounded to four decimal places with a negative sign, because the optimizer minimizes the objective.

In [33]:
# Define the Objective function used to evaluate a given configuration

@use_named_args(search_space)
def objective_function(**params):
    # Configure the model with specific hyperparameters
    clf = KNeighborsClassifier(**params, n_jobs=-1)
    
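    # Evaluate the model with 3-fold cross-validation and average the F1 scores
    mean_f1 = cross_val_score(clf, features, target, scoring='f1', cv=3, n_jobs=-1).mean()
    mean_f1 = round(mean_f1, 4)
    
    # Negate the score because gp_minimize minimizes the objective
    return -mean_f1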
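
The reason for the negative sign can be seen without scikit-optimize. In the sketch below, a plain loop over a few hypothetical `n_neighbors` values stands in for the optimizer, and synthetic data stands in for the HR dataset (both are assumptions). Picking the smallest negated F1 score is the same as picking the largest F1 score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the HR dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=123)

def toy_objective(n_neighbors):
    """Return the negated mean F1 so that lower means better."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    mean_f1 = cross_val_score(clf, X, y, scoring="f1", cv=3).mean()
    return -round(mean_f1, 4)

# Hypothetical candidate values; a plain loop stands in for the optimizer.
candidates = [3, 5, 11, 21]
losses = {k: toy_objective(k) for k in candidates}
best_k = min(losses, key=losses.get)
print("best n_neighbors:", best_k, "F1:", -losses[best_k])
```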

## Task 14: Run the Optimization Method

In this task, run the sequential model-based optimization (SMBO) method provided by scikit-optimize. Remember to define the required parameters of the optimization method before running the code.

In [34]:
# perform optimization
result = gp_minimize(
    func=objective_function,
    dimensions=search_space,
    n_calls=15,
    random_state=442,
    verbose=True,
    n_jobs=-1,
)


Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 63.3853
Function value obtained: -0.6031
Current minimum: -0.6031
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 60.5145
Function value obtained: -0.6069
Current minimum: -0.6069
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 65.3581
Function value obtained: -0.5776
Current minimum: -0.6069
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 62.3908
Function value obtained: -0.6166
Current minimum: -0.6166
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 62.7225
Function value obtained: -0.6019
Current minimum: -0.6166
Iteration No: 6 started. Evaluating function at random point.
Iteration No: 6 ended. Evaluation done at random point.
Time taken: 59.4900
Function value obtained: -0.6046
Current minimum: -0.6166
Iteration No: 7 started. Evaluating function at random point.
Iteration No: 7 ended. Evaluation done at random point.
Time taken: 64.4284
Function value obtained: -0.5942
Current minimum: -0.6166
Iteration No: 8 started. Evaluating function at random point.
Iteration No: 8 ended. Evaluation done at random point.
Time taken: 63.6336
Function value obtained: -0.5829
Current minimum: -0.6166
Iteration No: 9 started. Evaluating function at random point.
Iteration No: 9 ended. Evaluation done at random point.
Time taken: 62.0861
Function value obtained: -0.6050
Current minimum: -0.6166
Iteration No: 10 started. Evaluating function at random point.
Iteration No: 10 ended. Evaluation done at random point.
Time taken: 61.7713
Function value obtained: -0.5779
Current minimum: -0.6166
Iteration No: 11 started. Searching for the next optimal point.
Iteration No: 11 ended. Search finished for the next optimal point.
Time taken: 8.4437
Function value obtained: -0.6077
Current minimum: -0.6166
Iteration No: 12 started. Searching for the next optimal point.
Iteration No: 12 ended. Search finished for the next optimal point.
Time taken: 65.9258
Function value obtained: -0.6043
Current minimum: -0.6166
Iteration No: 13 started. Searching for the next optimal point.
Iteration No: 13 ended. Search finished for the next optimal point.
Time taken: 4.7655
Function value obtained: -0.6123
Current minimum: -0.6166
Iteration No: 14 started. Searching for the next optimal point.
Iteration No: 14 ended. Search finished for the next optimal point.
Time taken: 4.6413
Function value obtained: -0.5947
Current minimum: -0.6166
Iteration No: 15 started. Se

## Task 15: Print the Best Combination of Hyperparameters and the Best Score

In this task, perform the following actions:

Show the combination of hyperparameters that yields the best performance of the ML model.

Show the best score.

In [35]:
# Print the best combination of hyperparameters using the x attribute.

print("best combination of hyperparameters:", result.x)

best combination of hyperparameters: ['distance', 24, 57, 1.0649879219761123, 'auto']


In [36]:
# Show the best score after performing sequential model-based optimization

print("Best score:", abs(result.fun))

Best score: 0.6166
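
`result.x` is a plain list ordered like the search space, which makes it easy to mix up values. A small sketch of one way to label them follows. The values are copied from the printed output above, and the names are written out by hand in the order the dimensions were appended. On a live run they could presumably be read from the search space itself, for example with `[dim.name for dim in search_space]`.

```python
from sklearn.neighbors import KNeighborsClassifier

# Dimension names in the order they were appended to search_space.
names = ["weights", "n_neighbors", "leaf_size", "p", "algorithm"]
# Values copied from the printed result.x above.
best_values = ["distance", 24, 57, 1.0649879219761123, "auto"]

best_params = dict(zip(names, best_values))
print(best_params)

# The dictionary can be unpacked straight into the estimator.
clf = KNeighborsClassifier(**best_params)
```

Unpacking the dictionary avoids retyping the values by hand and keeps the exact `p` found by the optimizer.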


## Task 16: Retrain the Machine Learning Model with Best Combination of Hyperparameters

In this task, perform the following actions:

Instantiate the defined ML algorithm.

Add the values of the best combination of hyperparameters in the ML algorithm class.

Use the cross-validation technique to retrain the ML model.

Use the F1 score as an evaluation metric in the cross-validation method.

Show the new performance after retraining the ML model.

In [37]:
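# Create a classifier with the best hyperparameters found above
# (p is rounded from about 1.065 to 1.0, which corresponds to the Manhattan distance)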

knn_classifier = KNeighborsClassifier(weights='distance',
                                      n_neighbors=24,
                                      leaf_size=57,
                                      p=1.0,
                                      algorithm='auto')

In [38]:
# Implement and run the cross-val-score method

score = cross_val_score(estimator=knn_classifier,
                        X=features,
                        y=target,
                        scoring='f1',
                        cv=3,
                        n_jobs=-1)

In [39]:
# Show the mean score

print('Model performance: F1-score =', round(score.mean(), 4))

Model performance: F1-score = 0.6171


## Task 17: Save the Trained Machine Learning Model using Joblib Library

In this task, perform the following actions:

Import a Python package to save and load ML models.

Save the retrained machine learning model in the current directory.

In [40]:
# Import joblib

import joblib

In [41]:
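# cross_val_score fits clones of the estimator, so fit the classifier
# on the full dataset before saving it
knn_classifier.fit(features, target)

# Save the trained model
joblib.dump(knn_classifier,'./knn_model.pkl')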

['./knn_model.pkl']
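
A quick round trip is a simple way to confirm that a saved model can be used later. The sketch below assumes a fitted estimator and uses synthetic data and a temporary file (the data and the `knn_demo.pkl` file name are illustrative assumptions) to dump a model, load it back, and compare predictions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the HR dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=123)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_demo.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match:", same)
```

When the saved KNN model is applied to new employee records, those records would need the same scaling and encoding steps used in this notebook, since the saved file contains only the classifier.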