# Project Overview

## Collaborative Initiative with the HR Department of an IT Company

### Objective

In collaboration with the HR department of a leading IT company, our goal is to optimize and streamline the hiring process. The company is actively recruiting for various technical roles, each requiring distinct skill sets.

### Targeted Roles for Recruitment

- **Back-End Developers**
- **Full-Stack Developers**
- **Mobile Developers**
- **Data Scientists & Machine Learning Specialists**
- **Data Engineers**

### Challenge

The Talent Acquisition team collects extensive data from application forms, detailing the skills of candidates applying for different roles. However, candidates may express interest in positions that do not align with their actual skill sets, or they might be better suited for a different role than they initially selected. This misalignment creates inefficiencies in the hiring process.

### Solution

To address this challenge, we are developing a **machine learning model** that will analyze candidate data and accurately predict the most suitable job role for each applicant. This predictive approach will:

- Improve **candidate-role matching accuracy**  
- **Reduce hiring time** by prioritizing best-fit candidates  
- **Optimize resource allocation** during the recruitment process  

By integrating this model into the hiring pipeline, we aim to enhance efficiency, ensure better placements, and support data-driven decision-making in talent acquisition.

## Importing Essential Libraries

In [1]:
!pip install --upgrade pip
!pip install imbalanced-learn
!pip install --upgrade scikit-learn
!pip install xgboost
!pip install kmodes
!pip install joblib
!pip install ipywidgets
!pip install IPython

import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import warnings
import joblib

from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from kmodes.kmodes import KModes
from sklearn.metrics import classification_report, precision_recall_fscore_support, accuracy_score, f1_score, make_scorer
from imblearn.over_sampling import SMOTE

import ipywidgets as widgets
from IPython.display import display
!jupyter nbextension enable --py widgetsnbextension

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 300)

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Importing Our Dataset

In [2]:
data = pd.read_csv('cleaned_data/CleanedData.csv')

## Data Preparation

The majority of the cleaning was completed in Part I.

In [3]:
# Filtering for important columns
df = data[['LanguageHaveWorkedWith', 'DatabaseHaveWorkedWith', 'PlatformHaveWorkedWith', 
           'WebframeHaveWorkedWith', 'MiscTechHaveWorkedWith', 'ToolsTechHaveWorkedWith', 
           'OfficeStackAsyncHaveWorkedWith', 'NEWCollabToolsHaveWorkedWith', 'OfficeStackSyncHaveWorkedWith',
           'DevType']].dropna(subset=['LanguageHaveWorkedWith']).copy()

list_cols = ['LanguageHaveWorkedWith', 'DatabaseHaveWorkedWith', 'PlatformHaveWorkedWith', 
             'WebframeHaveWorkedWith', 'MiscTechHaveWorkedWith', 'ToolsTechHaveWorkedWith', 
             'OfficeStackAsyncHaveWorkedWith', 'NEWCollabToolsHaveWorkedWith', 'OfficeStackSyncHaveWorkedWith']


# Splitting list columns into binary indicator columns using MultiLabelBinarizer
for i in list_cols:
    df[i] = df[i].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    
    mlb = MultiLabelBinarizer()
    transformed = mlb.fit_transform(df[i])
    binarized_df = pd.DataFrame(transformed, columns=mlb.classes_, index=df.index)
    
    df = df.drop(columns=[i])
    df = pd.concat([df, binarized_df], axis=1)
    
    
# Filtering for roles we are looking to hire for
dev_types = [
    'Back-End Developer', 
    'Full-Stack Developer',
    'Mobile Developer',
    'Data Engineer',
    'Data Scientist/ML Specialist'
    ]

df = df[df['DevType'].isin(dev_types)]
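
To make the binarization step above concrete, here is a minimal sketch on a hypothetical three-row sample (the toy values and the names `toy` and `mlb_toy` are illustrative and not part of the dataset). It assumes, as in the cell above, that each cell holds a list stored as a string.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample: each cell is a Python list stored as a string
toy = pd.DataFrame({
    "LanguageHaveWorkedWith": [
        "['Python', 'SQL']",
        "['JavaScript']",
        "['Python', 'JavaScript']",
    ]
})

# Parse the string representation back into a real Python list
parsed = toy["LanguageHaveWorkedWith"].apply(ast.literal_eval)

# One binary column per distinct technology, in sorted order
mlb_toy = MultiLabelBinarizer()
binarized = pd.DataFrame(
    mlb_toy.fit_transform(parsed), columns=mlb_toy.classes_, index=toy.index
)
print(binarized)
```

Each row ends up with a 1 in the column of every technology it listed and a 0 everywhere else, which is the representation the models below are trained on.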

### Performing a Train-Test Split

In [4]:
# Define independent and dependent variables
X = df.drop(columns=["DevType"])
y = df["DevType"]

# Split data before applying SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [5]:
y_train.value_counts()

DevType
Full-Stack Developer            13610
Back-End Developer               7611
Mobile Developer                 1310
Data Scientist/ML Specialist      784
Data Engineer                     720
Name: count, dtype: int64
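
As a quick sanity check on `stratify=y`, the class shares in the training set can be compared with those in the test set. The sketch below hard-codes the training counts printed above and the test-set counts from the `support` column of the classification reports later in this notebook; nothing is recomputed from the data.

```python
# Training counts from y_train.value_counts() above
train_counts = {
    "Full-Stack Developer": 13610,
    "Back-End Developer": 7611,
    "Mobile Developer": 1310,
    "Data Scientist/ML Specialist": 784,
    "Data Engineer": 720,
}

# Test counts: "support" column of the classification reports in this notebook
test_counts = {
    "Full-Stack Developer": 3403,
    "Back-End Developer": 1903,
    "Mobile Developer": 327,
    "Data Scientist/ML Specialist": 196,
    "Data Engineer": 180,
}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Compare each role's share of the training set with its share of the test set
gaps = {}
for role, n_train in train_counts.items():
    train_share = n_train / train_total
    test_share = test_counts[role] / test_total
    gaps[role] = abs(train_share - test_share)
    print(f"{role:<30} train={train_share:.3f}  test={test_share:.3f}")

max_gap = max(gaps.values())
print(f"Largest difference in class share: {max_gap:.5f}")
```

The shares agree to within a fraction of a percentage point, and the same table makes the imbalance visible: Full-Stack Developers are roughly 57% of the rows while Data Engineers are about 3%.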

### Using SMOTE to Oversample Minority Classes

In [6]:
# Apply SMOTE to balance the training data

def apply_smote(X_train, y_train, threshold=3000):
    value_counts = y_train.value_counts()
    categories_to_oversample = value_counts[value_counts < threshold].index
    smote = SMOTE(sampling_strategy={cat: threshold for cat in categories_to_oversample}, random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
    return X_resampled, y_resampled

X_train_resampled, y_train_resampled = apply_smote(X_train, y_train)
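
The effect of `apply_smote` on the class counts can be worked out by hand. The sketch below repeats the dictionary logic from the function on the training counts shown earlier, without calling SMOTE itself; it relies on the documented behaviour that a dict `sampling_strategy` sets the target count for the listed classes and leaves the others unchanged.

```python
# Training counts from y_train.value_counts() above
train_counts = {
    "Full-Stack Developer": 13610,
    "Back-End Developer": 7611,
    "Mobile Developer": 1310,
    "Data Scientist/ML Specialist": 784,
    "Data Engineer": 720,
}
threshold = 3000

# Same rule as apply_smote: only classes below the threshold are oversampled
sampling_strategy = {
    role: threshold for role, n in train_counts.items() if n < threshold
}

# Expected class counts after resampling (untouched classes keep their size)
resampled_counts = {
    role: sampling_strategy.get(role, n) for role, n in train_counts.items()
}
synthetic_rows = sum(resampled_counts.values()) - sum(train_counts.values())

print(resampled_counts)
print(f"Synthetic rows added: {synthetic_rows}")
```

The three minority classes are raised to 3,000 rows each, while the two majority classes stay as they are, so the training set is rebalanced only partially rather than fully equalised.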

### Apply K-Modes Clustering for Feature Engineering

![Elbow Method](images/Elbow.png)

In this case, the elbow appears to be around k = 3 or 4. We use k = 4.
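
For intuition about what K-Modes optimises, the sketch below assigns rows to clusters by counting mismatched attributes (the simple matching dissimilarity) and sums those mismatches into a total cost. The toy vectors and the two modes are hypothetical, and this is an illustration of the distance idea only, not the `kmodes` implementation.

```python
import numpy as np

# Hypothetical binary skill vectors (rows = candidates, columns = technologies)
X_toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Two hypothetical cluster modes (the most frequent value per column in a cluster)
modes = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# mismatches[i, j] = number of attributes where row i differs from mode j
mismatches = (X_toy[:, None, :] != modes[None, :, :]).sum(axis=2)

# Each row joins the cluster whose mode it disagrees with least
labels = mismatches.argmin(axis=1)

# Total cost = sum of each row's mismatches with its own mode
cost = mismatches.min(axis=1).sum()

print("labels:", labels)
print("cost:", cost)
```

An elbow plot of this kind typically charts the total cost against the number of clusters k and looks for the point where adding clusters stops reducing it substantially.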

In [7]:
#km_4 = KModes(n_clusters=4, init='Huang', n_init=10, verbose=0, random_state=42)
km_4 = joblib.load("models/km_4.pkl")
#clusters_train = km_4.fit_predict(X_train_resampled)
clusters_train = km_4.predict(X_train_resampled)
clusters_test = km_4.predict(X_test)
#joblib.dump(km_4, "models/km_4.pkl")
#joblib.dump(km_4,"models/km_4.joblib")

# Add cluster column to datasets
X_train_resampled["Cluster"] = clusters_train
X_test["Cluster"] = clusters_test

### Extracting Important Features using RandomForestClassifier

In [8]:
# Train a Random Forest Classifier to extract feature importance
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_resampled, y_train_resampled)
feature_importances = pd.Series(rfc.feature_importances_, index=X_train_resampled.columns)

# Eliminate low significance features
important_features = feature_importances[feature_importances > 0.0001].index  # Adjust threshold as needed
X_train_filtered = X_train_resampled[important_features]
X_test_filtered = X_test[important_features]
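
Before settling on an importance threshold, it can help to look at how the importances are distributed. The sketch below does this on synthetic data from `make_classification`; the data, the names `X_demo` and `forest`, and the demo threshold of 0.05 are all assumptions chosen for illustration, not values from this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (8 features, 3 informative)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalised, so they sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.round(3))

# Keep only the features above an illustrative threshold
demo_threshold = 0.05
kept = importances[importances > demo_threshold].index
print(f"Kept {len(kept)} of {len(importances)} features")
```

Because the importances sum to 1, a threshold such as 0.0001 only removes features that contribute almost nothing, which is why the comment in the cell above suggests adjusting it as needed.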

### One Hot Encoding Our Cluster Column

In [9]:
# Apply One-Hot Encoding for Cluster
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_clusters_train = ohe.fit_transform(X_train_filtered[["Cluster"]])
encoded_clusters_test = ohe.transform(X_test_filtered[["Cluster"]])

# Convert to DataFrame with correct index
encoded_cluster_columns = ohe.get_feature_names_out(["Cluster"])
encoded_cluster_df_train = pd.DataFrame(encoded_clusters_train, columns=encoded_cluster_columns, index=X_train_filtered.index)
encoded_cluster_df_test = pd.DataFrame(encoded_clusters_test, columns=encoded_cluster_columns, index=X_test_filtered.index)

X_train_linear = X_train_filtered.drop(columns=["Cluster"])
X_test_linear = X_test_filtered.drop(columns=["Cluster"])

X_train_linear = pd.concat([X_train_linear, encoded_cluster_df_train], axis=1)
X_test_linear = pd.concat([X_test_linear, encoded_cluster_df_test], axis=1)
X_test_linear = X_test_linear.reindex(columns=X_train_linear.columns, fill_value=0)
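
The final `reindex` call guarantees that the test matrix has exactly the training columns, in the same order. A minimal sketch with hypothetical column names shows what it does when a column is missing or out of order:

```python
import pandas as pd

# Hypothetical training column layout
train_cols = ["Python", "SQL", "Cluster_0", "Cluster_1", "Cluster_2"]

# Hypothetical test frame: different order, two cluster columns missing
test_toy = pd.DataFrame({
    "Cluster_1": [1.0, 0.0],
    "SQL": [0, 1],
    "Python": [1, 1],
})

# Reorder to the training layout and fill absent columns with 0
aligned = test_toy.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Missing columns are created and filled with 0, and any column not present in `train_cols` would be dropped, so the model always receives the layout it was fitted on.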

### Model Training

In [10]:
# Training Logistic Regression Model
logreg_smoted = LogisticRegression(n_jobs=-1, random_state=42)
logreg_smoted.fit(X_train_linear, y_train_resampled)
y_pred_logreg = logreg_smoted.predict(X_test_linear)

#Saving Our Model For further use
joblib.dump(logreg_smoted, "models/logreg_smoted.joblib")

#Model Evaluation
print(f"Logistic Regression(SMOTE) Accuracy: {accuracy_score(y_test, y_pred_logreg):.4f}")
print(f"Logistic Regression Classification Report:\n{classification_report(y_test, y_pred_logreg)}\n")



# Training Support Vector Classifier Model
svc_smoted = SVC(random_state=42)
svc_smoted.fit(X_train_linear, y_train_resampled)

# Model Evaluation
y_pred_svc = svc_smoted.predict(X_test_linear)
print(f"SVC (SMOTE) Accuracy: {accuracy_score(y_test, y_pred_svc):.4f}")
print(f"SVC Classification Report:\n{classification_report(y_test, y_pred_svc)}\n")

#Saving Our Model For further use
joblib.dump(svc_smoted, "models/svc_smoted.joblib")



# Training Gradient Boosting Classifier
X_train_gb = X_train_filtered.copy()
X_test_gb = X_test_filtered.copy()
gb_smoted = GradientBoostingClassifier(random_state = 42)
gb_smoted.fit(X_train_gb, y_train_resampled)

# Model Evaluation
y_pred_gb = gb_smoted.predict(X_test_gb)
print(f"Gradient Boosting (SMOTE) Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")
print(f"Gradient Boosting Classification Report:\n{classification_report(y_test, y_pred_gb)}\n")

#Saving Our Model For further use
joblib.dump(gb_smoted, "models/gb_smoted.joblib")



#Training Random Forest Classifier Model
X_train_rfc = X_train_filtered.copy()
X_test_rfc = X_test_filtered.copy()
rfc_smoted = RandomForestClassifier(n_jobs = -1, random_state = 42)
rfc_smoted.fit(X_train_rfc, y_train_resampled)

# Model Evaluation
y_pred_rfc = rfc_smoted.predict(X_test_rfc)
print(f"Random Forest (SMOTE) Accuracy: {accuracy_score(y_test, y_pred_rfc):.4f}")
print(f"Random Forest Classification Report:\n{classification_report(y_test, y_pred_rfc)}\n")

#Saving Our Model For further use
joblib.dump(rfc_smoted, "models/rfc_smoted.joblib")



# Training XGB model
X_train_xgb = X_train_filtered.copy()
X_test_xgb = X_test_filtered.copy()

# Applying Label encoding to our target variable
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train_resampled)
y_test_encoded = le.transform(y_test)

# 'mlogloss' is the multi-class log loss; use_label_encoder is deprecated and no longer needed
xgb_smoted = XGBClassifier(eval_metric='mlogloss', n_jobs=-1)
xgb_smoted.fit(X_train_xgb, y_train_encoded)
y_pred_xgb = xgb_smoted.predict(X_test_xgb)

#Model Evaluation
y_pred_xgb_original = le.inverse_transform(y_pred_xgb)
y_test_original = le.inverse_transform(y_test_encoded)
print(f"XGBoost(SMOTE) Accuracy: {accuracy_score(y_test_original, y_pred_xgb_original):.4f}")
print(f"XGBoost Classification Report:\n{classification_report(y_test_original, y_pred_xgb_original)}\n")

#Saving Our Model For further use
joblib.dump(xgb_smoted, "models/xgb_smoted.joblib")

Logistic Regression(SMOTE) Accuracy: 0.7602
Logistic Regression Classification Report:
                              precision    recall  f1-score   support

          Back-End Developer       0.71      0.60      0.65      1903
               Data Engineer       0.43      0.50      0.46       180
Data Scientist/ML Specialist       0.62      0.69      0.65       196
        Full-Stack Developer       0.80      0.86      0.83      3403
            Mobile Developer       0.80      0.82      0.81       327

                    accuracy                           0.76      6009
                   macro avg       0.67      0.69      0.68      6009
                weighted avg       0.76      0.76      0.76      6009


SVC (SMOTE) Accuracy: 0.7665
SVC Classification Report:
                              precision    recall  f1-score   support

          Back-End Developer       0.73      0.58      0.65      1903
               Data Engineer       0.46      0.37      0.41       180
Data Scienti

['xgb_smoted.joblib']

# Model Deployment

In this project I trained several models (Logistic Regression, SVC, Gradient Boosting, XGBoost and Random Forest), using SMOTE, K-Modes clustering and feature selection to improve the results. No single model is best across the board: performance varies significantly from class to class, so the right choice depends on the hiring requirements and on which evaluation metrics (precision, recall and so on) matter most for them.

I'm picking the Logistic Regression model, trained on SMOTE-balanced data with the K-Modes cluster added as an engineered feature. I selected it because of its better recall for minority classes such as Data Scientists and Data Engineers. Applicants for these two roles are scarce to begin with, and we're trying to identify as many of them as possible. According to its recall scores, the Logistic Regression model detects 50% of Data Engineers and 69% of Data Scientists in the test set.
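
To put those recall figures in absolute terms, the sketch below converts them into approximate head counts. The recall and support values are copied from the classification reports in this notebook (the Logistic Regression report above and the tuned Random Forest report at the end); because the reports round recall to two decimals, the counts are estimates.

```python
# Recall values copied from the classification reports in this notebook
reported_recall = {
    "Logistic Regression (SMOTE)": {
        "Data Engineer": 0.50,
        "Data Scientist/ML Specialist": 0.69,
    },
    "Random Forest (tuned)": {
        "Data Engineer": 0.26,
        "Data Scientist/ML Specialist": 0.67,
    },
}

# Number of test-set candidates per role ("support" column)
support = {"Data Engineer": 180, "Data Scientist/ML Specialist": 196}

# Approximate number of candidates of each role that each model identifies
approx_found = {}
for model_name, recalls in reported_recall.items():
    for role, recall in recalls.items():
        approx_found[(model_name, role)] = round(recall * support[role])
        print(f"{model_name:<28} {role:<30} "
              f"~{approx_found[(model_name, role)]} of {support[role]}")
```

On these numbers, Logistic Regression finds roughly 90 of the 180 Data Engineers in the test set, compared with about 47 for the tuned Random Forest, which is the gap behind the choice of model.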

**Note:** If the widgets do not show on first attempt, exit the notebook and reopen it.

In [15]:
# Load trained model
model = joblib.load("models/logreg_smoted.joblib")

# Set column names used in training
important_features = X_train_linear.columns

# Create UI elements
widgets_list = {}
user_input = {}
categories = {}

# Convert string lists to Python lists
for col in list_cols:
    data[col] = data[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    
    transformed = mlb.fit_transform(data[col])
    transformed_df = pd.DataFrame(transformed, columns=mlb.classes_)
    
    categories[col] = list(transformed_df.columns)

def predict_role(change):
    # Convert user selections to DataFrame
    global input_df
    input_data = {col: 0 for col in X_train_resampled}

    for category, checkboxes in widgets_list.items():
        for tech, checkbox in checkboxes.items():
            if checkbox.value:
                if tech in input_data:
                    input_data[tech] = 1

    input_df = pd.DataFrame([input_data])
    input_df = input_df.reindex(columns=X_train_resampled.columns, fill_value=0)
    # Apply clustering and one hot encode the cluster column
    cluster_prediction = km_4.predict(input_df.drop(columns=['Cluster'], errors='ignore'))
    cluster_encoded = ohe.transform([[cluster_prediction[0]]])
    global cluster_encoded_df
    cluster_encoded_df = pd.DataFrame(cluster_encoded, columns=encoded_cluster_columns, index=input_df.index)
    input_df = pd.concat([input_df.drop(columns=['Cluster'], errors='ignore'), cluster_encoded_df], axis=1)
    input_df = input_df[X_train_linear.columns]
    # Make prediction
    prediction = model.predict(input_df)[0]
    print(f"Predicted Developer Role: {prediction}")


# Creating checkboxes for each category
category_boxes = []
for category, options in categories.items():
    checkboxes = {option: widgets.Checkbox(value=False, description=option) for option in options}
    widgets_list[category] = checkboxes
    category_box = widgets.VBox([widgets.Label(category)] + list(checkboxes.values()))
    category_boxes.append(category_box)

# Displaying checkboxes
display(*category_boxes)

# Button for prediction
predict_button = widgets.Button(description="Predict Role")
predict_button.on_click(predict_role)
display(predict_button)

VBox(children=(Label(value='LanguageHaveWorkedWith'), Checkbox(value=False, description='APL'), Checkbox(value…

VBox(children=(Label(value='DatabaseHaveWorkedWith'), Checkbox(value=False, description='BigQuery'), Checkbox(…

VBox(children=(Label(value='PlatformHaveWorkedWith'), Checkbox(value=False, description='Amazon Web Services (…

VBox(children=(Label(value='WebframeHaveWorkedWith'), Checkbox(value=False, description='ASP.NET'), Checkbox(v…

VBox(children=(Label(value='MiscTechHaveWorkedWith'), Checkbox(value=False, description='.NET (5+) '), Checkbo…

VBox(children=(Label(value='ToolsTechHaveWorkedWith'), Checkbox(value=False, description='APT'), Checkbox(valu…

VBox(children=(Label(value='OfficeStackAsyncHaveWorkedWith'), Checkbox(value=False, description='Adobe Workfro…

VBox(children=(Label(value='NEWCollabToolsHaveWorkedWith'), Checkbox(value=False, description='Android Studio'…

VBox(children=(Label(value='OfficeStackSyncHaveWorkedWith'), Checkbox(value=False, description='Cisco Webex Te…

Button(description='Predict Role', style=ButtonStyle())

Predicted Developer Role: Data Scientist/ML Specialist
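
The core of `predict_role` is turning a set of ticked technologies into a single row with the training column layout. A widget-free sketch of that step is shown below; `build_input_row` and the demo column names are hypothetical helpers for illustration and do not exist in the notebook.

```python
import pandas as pd


def build_input_row(selected_techs, feature_columns):
    """Return a one-row DataFrame with 1 for each selected technology, 0 elsewhere."""
    row = {col: 0 for col in feature_columns}
    for tech in selected_techs:
        if tech in row:  # skip technologies the model was never trained on
            row[tech] = 1
    return pd.DataFrame([row], columns=feature_columns)


# Hypothetical feature layout and selection
demo_columns = ["Python", "SQL", "Kotlin", "React"]
demo_row = build_input_row(["Python", "SQL", "COBOL"], demo_columns)
print(demo_row)
```

In the notebook, `feature_columns` would correspond to the training columns, and the resulting row would then go through the same clustering and one-hot encoding steps as in `predict_role` before being passed to the model. A function like this also makes it possible to test predictions without rendering the checkboxes.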


## Hyperparameter Tuning for RFC

In [16]:
X = df.drop(columns=["DevType"])
y = df["DevType"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [17]:
# Define the model
rfc = RandomForestClassifier(n_jobs=-1,random_state=42)

# Define hyperparameter grid
param_grid = {
    "n_estimators": [150],
    "max_depth": [30, 35, None],
    "min_samples_leaf": [2, 5, 10],
    "min_samples_split": [2, 5, 10]
}

# Perform GridSearchCV (3-fold cross-validation)
grid_search = GridSearchCV(rfc, param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train_resampled, y_train_resampled)

# Get best parameters and train the final model
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

#Best Hyperparameters: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 150}

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=2, n_estimators=150; total time=   4.9s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=2, n_estimators=150; total time=   4.2s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=2, n_estimators=150; total time=   4.1s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=5, n_estimators=150; total time=   4.2s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=5, n_estimators=150; total time=   4.2s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=5, n_estimators=150; total time=   4.3s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=10, n_estimators=150; total time=   3.5s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=10, n_estimators=150; total time=   3.5s
[CV] END max_depth=30, min_samples_leaf=2, min_samples_split=10, n_estimators=150; total time=   3.6s
[CV] END max_depth=30, min_

[CV] END max_depth=None, min_samples_leaf=10, min_samples_split=10, n_estimators=150; total time=   2.6s
Best Hyperparameters: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 150}
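
The "27 candidates, totalling 81 fits" line in the log follows directly from the size of the grid. The sketch below enumerates the same parameter combinations with `itertools.product`; the names `param_grid_demo` and `cv_folds` are illustrative copies of the values used in the cell above.

```python
from itertools import product

# Same grid as in the cell above
param_grid_demo = {
    "n_estimators": [150],
    "max_depth": [30, 35, None],
    "min_samples_leaf": [2, 5, 10],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(param_grid_demo, values))
    for values in product(*param_grid_demo.values())
]

# Each candidate is fitted once per cross-validation fold
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Since the number of fits is the product of the list lengths times the number of folds, adding a single extra value to one parameter here would raise the total from 81 to 108.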


In [13]:
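# Train RFC with best params
best_params = {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 150}

best_rfc = RandomForestClassifier(**best_params, random_state=42)
best_rfc.fit(X_train_resampled, y_train_resampled)

# X_train_resampled includes the "Cluster" feature, so make sure the test set
# has it too (the re-split above drops it) and that the column order matches
X_test_eval = X_test.copy()
X_test_eval["Cluster"] = km_4.predict(X_test_eval.drop(columns=["Cluster"], errors="ignore"))
X_test_eval = X_test_eval[X_train_resampled.columns]

# Evaluate on test set
y_pred = best_rfc.predict(X_test_eval)

# Print evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))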

Accuracy: 0.7518721917124314
Classification Report:
                               precision    recall  f1-score   support

          Back-End Developer       0.75      0.52      0.61      1903
               Data Engineer       0.40      0.26      0.31       180
Data Scientist/ML Specialist       0.64      0.67      0.65       196
        Full-Stack Developer       0.76      0.92      0.83      3403
            Mobile Developer       0.84      0.72      0.78       327

                    accuracy                           0.75      6009
                   macro avg       0.68      0.62      0.64      6009
                weighted avg       0.75      0.75      0.74      6009

