# Job Classifications with the Linear Discriminant Analysis (LDA) model
Dataset info:

Title: Job Classification Dataset

Dataset Source: HR Analytic Repository on Kaggle

Dataset URL: https://www.kaggle.com/datasets/HRAnalyticRepository/job-classification-dataset

Date Accessed: September 28, 2023

In [1]:
#importing the libraries required to build the LDA model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
#loading the dataset into a pandas DataFrame
df = pd.read_csv('jobclassinfo2.csv')

In [3]:
#initial exploratory data analysis (EDA)
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())

   ID  JobFamily    JobFamilyDescription  JobClass JobClassDescription  \
0   1          1  Accounting And Finance         1        Accountant I   
1   2          1  Accounting And Finance         2       Accountant II   
2   3          1  Accounting And Finance         3      Accountant III   
3   4          1  Accounting And Finance         4       Accountant IV   
4   5          2  Administrative Support         5     Admin Support I   

   PayGrade  EducationLevel  Experience  OrgImpact  ProblemSolving  \
0         5               3           1          3               3   
1         6               4           1          5               4   
2         8               4           2          6               5   
3        10               5           5          6               6   
4         1               1           0          1               1   

   Supervision  ContactLevel  FinancialBudget    PG  
0            4             3                5  PG05  
1            5            

In [4]:
#feature selection: choosing the features we want to use for the LDA model.
selected_features = ['EducationLevel', 'Experience', 'OrgImpact', 'ProblemSolving', 'Supervision', 'ContactLevel', 'FinancialBudget']

In [5]:
#splitting the data into training and testing sets.
X = df[selected_features]
y = df['PayGrade']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
#creating an LDA model and fitting it to the training data
lda = LinearDiscriminantAnalysis(n_components=7)  # n_components may be at most min(n_classes - 1, n_features)
X_train_lda = lda.fit_transform(X_train, y_train)
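The limit on `n_components` can be illustrated with a minimal sketch. It runs on synthetic stand-in data, not the Kaggle file, and the names `X_demo`, `y_demo`, and `lda_demo` are hypothetical. With 4 classes and 7 features, LDA can produce at most 3 discriminant axes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data (assumption: NOT the job classification dataset).
rng = np.random.default_rng(0)
n_features, n_classes, n_per_class = 7, 4, 30
y_demo = np.repeat(np.arange(n_classes), n_per_class)
# Shift each class mean by its label so the classes are separable.
X_demo = rng.normal(size=(n_classes * n_per_class, n_features)) + y_demo[:, None]

# LDA yields at most min(n_classes - 1, n_features) discriminant axes.
max_components = min(n_classes - 1, n_features)
lda_demo = LinearDiscriminantAnalysis(n_components=max_components)
X_demo_lda = lda_demo.fit_transform(X_demo, y_demo)

print(X_demo_lda.shape)                     # (n_samples, max_components)
print(lda_demo.explained_variance_ratio_)   # one ratio per retained axis
```

On the real data, the same check would use `y_train.nunique() - 1` in place of `n_classes - 1`, so `n_components=7` only works if the training split contains at least 8 distinct PayGrade values.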


In [7]:
# Evaluate the model (optional)
X_test_lda = lda.transform(X_test)
y_pred = lda.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Absolute Error: {mae:.2f}, meaning the predictions of our LDA classifier are off by about {mae:.2f} PayGrade levels on average')
print(f"Mean Squared Error: {mse:.2f}, meaning the average squared difference between the model's predictions and the true PayGrade values is {mse:.2f}")


Mean Absolute Error: 0.50, meaning the predictions of our LDA classifier are off by about 0.50 PayGrade levels on average
Mean Squared Error: 0.50, meaning the average squared difference between the model's predictions and the true PayGrade values is 0.50
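Because LDA is a classifier, the share of exactly correct predictions is a natural companion to the error metrics above. The following is a minimal sketch on synthetic data, not the notebook's dataset; the names `X_syn`, `grades`, and `clf` are hypothetical, and the stratified split is an assumption that requires every class to have enough rows.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5 hypothetical pay grades, 40 rows each, 7 features.
rng = np.random.default_rng(42)
grades = np.repeat(np.arange(1, 6), 40)
X_syn = rng.normal(scale=0.5, size=(grades.size, 7)) + grades[:, None]

# stratify keeps the grade proportions equal in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, grades, test_size=0.2, random_state=42, stratify=grades
)

clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)           # fraction of exact matches
mae_demo = mean_absolute_error(y_te, pred)  # average distance in grade levels
print(f'Accuracy: {acc:.2f}')
print(f'Mean Absolute Error: {mae_demo:.2f}')
```

Accuracy treats every miss the same, while MAE also reflects how far a wrong grade is from the true one, so the two are worth reporting together when the labels are ordered.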


In [8]:
#Testing out our model with a new job description
new_job = {
    'EducationLevel': 3,
    'Experience': 1,
    'OrgImpact': 3,
    'ProblemSolving': 3,
    'Supervision': 4,
    'ContactLevel': 3,
    'FinancialBudget': 5
}

new_instance_df = pd.DataFrame(data=[new_job], columns=selected_features)

predicted_paygrade = lda.predict(new_instance_df)
print(f'Predicted PayGrade: {predicted_paygrade[0]:.2f}')

Predicted PayGrade: 5.00
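A single predicted label hides how confident the model is. As a hedged sketch, again on synthetic data with hypothetical names (`X_frame`, `model`, `candidate`), `predict_proba` returns one probability per class in the order given by `classes_`:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

features = ['EducationLevel', 'Experience', 'OrgImpact', 'ProblemSolving',
            'Supervision', 'ContactLevel', 'FinancialBudget']

# Synthetic stand-in data (assumption): 4 hypothetical pay grades, 25 rows each.
rng = np.random.default_rng(1)
grades_demo = np.repeat(np.arange(1, 5), 25)
X_frame = pd.DataFrame(
    rng.normal(scale=0.5, size=(grades_demo.size, len(features))) + grades_demo[:, None],
    columns=features,
)

model = LinearDiscriminantAnalysis().fit(X_frame, grades_demo)

# A hypothetical job scored 2.0 on every feature.
candidate = pd.DataFrame([{name: 2.0 for name in features}])
proba = model.predict_proba(candidate)[0]

# classes_ lists the labels in the same order as the probability columns.
for grade, p in zip(model.classes_, proba):
    print(f'PayGrade {grade}: {p:.3f}')
```

The label returned by `predict` is the class with the highest of these probabilities.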
