# Employee Training Course Recommendation System

**Authors:** Dermot O'Brien
***

## Overview

HR places significant emphasis on identifying suitable training and development initiatives tailored to each individual employee.

The provision of training is essential to help employees fill their skill gaps, not only for their present roles but also to prepare them for future advancements. Investing in training and development greatly contributes to boosting self-assurance and overall job contentment among employees, ultimately leading to a decrease in employee turnover. HR departments must consistently analyze employees' skill gaps and implement ongoing training programs accordingly.

## Business Problem

Employees and managers cannot always identify their own skill gaps or decide which skills to learn first. To address this, we will build a recommendation system that uses an employee's past course ratings, together with the ratings of similar employees, to suggest suitable courses.

## Data Understanding

The dataset comprises employee ratings assigned to courses they completed in past corporate training programs. Each rating entry includes the employee's ID and name, along with the course ID and course name. The ratings are scored on a scale from one to five, with five denoting the highest rating.

Our objective is to develop an algorithm capable of predicting the potential rating an employee might assign to a course they haven't yet undertaken. The underlying assumption is that if the employee finds the course relevant and fitting, they will tend to give a higher rating.
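To make the prediction target concrete, the sketch below arranges a handful of ratings as an employee-by-course matrix. The toy IDs and ratings are hypothetical and are not taken from the dataset; the empty cells are the pairs a rating model would be asked to fill in.

```python
import pandas as pd

# Hypothetical toy ratings; IDs and values are illustrative only.
toy = pd.DataFrame({
    "EmployeeID": [1, 1, 2, 3],
    "CourseID": [10, 11, 10, 12],
    "Rating": [4, 2, 5, 3],
})

# Rows are employees, columns are courses; NaN marks a course not yet rated.
matrix = toy.pivot(index="EmployeeID", columns="CourseID", values="Rating")

# Each NaN cell is an (employee, course) pair whose rating must be predicted.
n_missing = int(matrix.isna().sum().sum())
print(matrix)
print("Pairs to predict:", n_missing)
```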

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import seaborn as sns
import os
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer, StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay, recall_score, precision_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.impute import MissingIndicator
import xgboost as xgb
from xgboost import plot_importance
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot, Dense, Concatenate
from tensorflow.keras.models import Model

%matplotlib inline



In [2]:
# Import the data set
ratings_df = pd.read_csv("./ratings_data.csv")
ratings_df.head()

Unnamed: 0,EmployeeID,EmpName,CourseID,CourseName,Rating
0,1408,Ignace Ormonde,14,Video Production,3
1,1249,Gabriela Balcon,17,Translation,2
2,1158,Enrique Lewer,8,IT Architecture,3
3,1564,Wallie Byrd,18,Natural Language Processing,3
4,1334,Hannah Ganter,6,Java Programming,4


In [3]:
# Check the shape
ratings_df.shape

(1000, 5)

In [4]:
# Check for nulls and dtypes
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   EmployeeID  1000 non-null   int64 
 1   EmpName     1000 non-null   object
 2   CourseID    1000 non-null   int64 
 3   CourseName  1000 non-null   object
 4   Rating      1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


In [5]:
# perform a value counts to see frequencies of courses taken
ratings_df.CourseID.value_counts()

18    50
22    47
12    47
3     46
5     46
2     45
9     45
14    44
25    44
11    44
1     43
16    43
6     42
20    40
15    40
24    39
21    36
13    36
4     36
17    33
19    33
7     32
8     32
10    30
23    27
Name: CourseID, dtype: int64

## Data Preparation

Let's start by creating two data frames: a list of unique employees and a list of unique courses.

In [6]:
# Build list of unique Employees
emp_list = ratings_df.groupby(
            ['EmployeeID', 'EmpName']).size().reset_index()
print("Total Employees: ", len(emp_list))
emp_list.sort_values(by=0, ascending=False).head()

Total Employees:  638


Unnamed: 0,EmployeeID,EmpName,0
392,1620,Antoinette Holleworth,6
34,1055,Teddie Lutwidge,5
627,1983,Yolane Braun,5
19,1029,Kattie Tenbrug,5
601,1948,Bev Vagg,5


In [7]:
# Build list of unique Courses
course_list = ratings_df.groupby(
                ['CourseID', 'CourseName']).size().reset_index()
print("Total Courses: ", len(course_list))
course_list.sort_values(by=0, ascending=False).head()

Total Courses:  25


Unnamed: 0,CourseID,CourseName,0
17,18,Natural Language Processing,50
11,12,People Management,47
21,22,Animation,47
2,3,Data Management,46
4,5,HelpDesk,46
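The counts above give a rough sense of how sparse the rating data is. The following is a small arithmetic sketch using only the totals reported in the outputs; it assumes each employee rated a given course at most once, which the notebook does not verify.

```python
# Totals reported in the cell outputs above.
n_ratings = 1000
n_employees = 638
n_courses = 25

# Every employee could in principle rate every course.
possible_pairs = n_employees * n_courses

# Fraction of employee-course pairs that have an observed rating
# (assumes no duplicate employee-course pairs).
density = n_ratings / possible_pairs
avg_per_employee = n_ratings / n_employees

print(f"Possible pairs: {possible_pairs}")
print(f"Density: {density:.1%}")
print(f"Average ratings per employee: {avg_per_employee:.2f}")
```

Under that assumption, only about 6% of the possible pairs are observed, and a typical employee has rated fewer than two courses.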


### Prepare Embeddings

In [8]:
# Build the employee embedding vector.
# Employee IDs are used directly as indices into the embedding table,
# so no ID-to-index mapping is needed; the table has 2001 rows to cover them.
# Building a vocabulary of IDs is an alternative.

emp_input = Input(shape=[1], name='Emp-Input')
emp_embed = Embedding(2001,
                     5,
                     name="Emp-Embedding")(emp_input)
emp_vec = Flatten(name='Emp-Flatten')(emp_embed)

# build course embedding vector
course_input = Input(shape=[1], name='Course-Input')
course_embed = Embedding(len(course_list) +1,
                        5,
                        name='Course-Embedding')(course_input)
course_vec = Flatten(name='Course-Flatten')(course_embed)

# merge the vectors
merged_vec = Concatenate()([emp_vec, course_vec])
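The comments in the cell above mention a vocabulary as an alternative to indexing the embedding by raw ID. A minimal sketch of that idea with `pd.factorize` follows. The sample IDs and the names `emp_index` and `input_dim` are assumptions for illustration; this is not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical sparse IDs, standing in for ratings_df["EmployeeID"].
emp_ids = pd.Series([1408, 1249, 1158, 1408, 1564])

# factorize returns one integer code per row plus the unique values seen.
codes, vocab = pd.factorize(emp_ids)

# Shift codes up by one so that index 0 stays free for unknown employees.
emp_index = codes + 1

# Number of embedding rows needed: one per known ID plus the reserved row.
input_dim = len(vocab) + 1

print(emp_index.tolist())
print(input_dim)
```

With such a mapping, the embedding table would need one row per distinct employee rather than one row per possible ID value, at the cost of keeping `vocab` around to translate IDs at prediction time.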

## Building Keras Model

The recommendation system works as follows:
* Predict the ratings a given employee may give a course they have not taken
* Predict ratings of all courses for all employees
* Recommend the courses that have the top predicted ratings
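The three steps above can be sketched with plain NumPy once predicted ratings are available. The matrices below are hypothetical stand-ins for model output, and `top_k_courses` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

# Hypothetical predicted ratings: 2 employees (rows) x 4 courses (columns).
predicted = np.array([
    [3.1, 4.5, 2.0, 4.9],
    [4.2, 1.5, 3.8, 2.7],
])

# True where the employee has already completed the course.
completed = np.array([
    [False, False, False, True],
    [True, False, False, False],
])

def top_k_courses(predicted, completed, k):
    """Return, per employee, the column indices of the k best unseen courses."""
    # Completed courses get -inf so they always rank last.
    scores = np.where(completed, -np.inf, predicted)
    # Sorting the negated scores orders columns from best to worst.
    return np.argsort(-scores, axis=1)[:, :k]

top2 = top_k_courses(predicted, completed, k=2)
print(top2)
```

Masking completed courses before ranking keeps the recommendation list limited to courses the employee has not yet taken.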

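The model feeds the concatenated employee and course embeddings through two fully connected layers (128 and 32 units, ReLU activation) and a single linear output unit that predicts the rating. It is trained with the Adam optimizer and a mean squared error loss on 90% of the ratings; the remaining 10% are held out for testing.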

In [10]:
ratings_train, ratings_test = train_test_split(ratings_df, test_size=0.1)

In [11]:
# add fully connected layers
fc_layer1 = Dense(128, activation="relu")(merged_vec)
fc_layer2 = Dense(32, activation="relu")(fc_layer1)
model_output = Dense(1)(fc_layer2)

rating_model = Model([emp_input, course_input], model_output)

rating_model.compile(optimizer="adam", loss="mean_squared_error")

rating_model.summary()

print("Fitting the model:")

#Fit the model
model_fit = rating_model.fit(
    x=[ratings_train.EmployeeID, ratings_train.CourseID],
    y=ratings_train.Rating,
    epochs=25,
    verbose=1,
    validation_split=0.1
    )

print("Evaluating the model:")
rating_model.evaluate(
    x=[ratings_test.EmployeeID, ratings_test.CourseID],
    y=ratings_test.Rating)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 Emp-Input (InputLayer)      [(None, 1)]                  0         []                            
                                                                                                  
 Course-Input (InputLayer)   [(None, 1)]                  0         []                            
                                                                                                  
 Emp-Embedding (Embedding)   (None, 1, 5)                 10005     ['Emp-Input[0][0]']           
                                                                                                  
 Course-Embedding (Embeddin  (None, 1, 5)                 130       ['Course-Input[0][0]']        
 g)                                                                                           



Epoch 2/25 ... Epoch 25/25 (per-epoch loss values not captured)
Evaluating the model:


2.5479085445404053
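Because the model was compiled with `loss="mean_squared_error"` and no extra metrics, the value above is the mean squared error on the test set. Taking its square root expresses the error in rating points, which is easier to read against the one-to-five scale. The variable names below are illustrative.

```python
import math

# Test loss (MSE) returned by rating_model.evaluate in the output above.
test_mse = 2.5479085445404053

# RMSE is in the same units as the ratings themselves.
test_rmse = math.sqrt(test_mse)

print(f"Test RMSE: {test_rmse:.2f} rating points")
```

An error of roughly 1.6 points on a five-point scale suggests the predictions are still coarse, so the resulting recommendations should be read with that in mind.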

## Rating Predictions

In [12]:
# predicting the Rating for a given employee and a course
# for employee 1029 and course 8

rating_model.predict(
    [pd.Series([1029]),
     pd.Series([8])])



array([[3.524244]], dtype=float32)
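The final `Dense(1)` layer has no activation, so nothing constrains a prediction to the one-to-five rating scale. If bounded values are wanted for display, one option is to clip them, as in this sketch. The first value is the prediction shown above; the other two are hypothetical out-of-range outputs.

```python
import numpy as np

# Shape (n, 1), like the array returned by rating_model.predict.
# Only the first value comes from the notebook; the rest are hypothetical.
raw = np.array([[3.524244], [5.7], [0.4]])

# Flatten to 1-D and bound each prediction to the 1-5 rating scale.
bounded = np.clip(raw.ravel(), 1.0, 5.0)

print(bounded)
```

Clipping changes only the displayed values; it does not alter the ranking of courses unless several predictions fall outside the scale on the same side.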

## Course Recommendations

As an example, let's recommend a list of courses for an employee, Harriot Laflin.

In [None]:
emp_to_predict = "Harriot Laflin"

# get employee ID for the employee name
pred_emp_id = emp_list[emp_list['EmpName'] == emp_to_predict]["EmployeeID"].iloc[0]

# Find courses already taken by the employee; we don't want to predict those.
completed_courses = ratings_df[
                    ratings_df["EmployeeID"] == pred_emp_id]["CourseID"].unique()

# courses not taken by employee
new_courses = course_list.query("CourseID not in @completed_courses")["CourseID"]

# Repeat the employee ID once per new course so that the employee and
# course Series have the same length
emp_dummy_list = pd.Series(np.full(len(new_courses), pred_emp_id))

# Predict ratings for the new courses for this employee
projected_ratings = rating_model.predict([emp_dummy_list, new_courses])
flat_ratings = projected_ratings.flatten()

print("Course Ratings: ", flat_ratings)

# Recommend the top 5 courses
print("\nRating  CourseID CourseName\n-----------------------------------")
for idx in (-flat_ratings).argsort()[:5]:
    course_id = new_courses.iloc[idx]
    course_name = course_list.query("CourseID == @course_id")["CourseName"].iloc[0]
    print(" ", round(flat_ratings[idx],1),"    ", course_id, "   ", course_name)

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***