Group 14: Rylie Ramos-Marquez, Derek Atabayev, Vishnu Garigipati

Are employees likely to resign?

In [1]:
# Load both the available and competition datasets as pd.DataFrame

import pandas as pd
with open("attrition_available_14.pkl", "rb") as f:
    available_data = pd.read_pickle(f)

with open("attrition_compet_14.pkl", "rb") as f:
    competition_data = pd.read_pickle(f)

Part 1: Exploratory Data Analysis (EDA)

Manipulating the dataframe available_data to gain insights

In [2]:
available_data.reset_index(drop=True, inplace=True)

print(available_data.info()) # inspect column types and counts
# we can see that this dataset is about 1426 unique employees (1281 of them have ID) and various fields relating to their employment


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1426 entries, 0 to 1425
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   hrs                      1426 non-null   float64
 1   absences                 1426 non-null   int64  
 2   JobInvolvement           1279 non-null   float64
 3   PerformanceRating        1278 non-null   float64
 4   EnvironmentSatisfaction  1419 non-null   float64
 5   JobSatisfaction          1420 non-null   float64
 6   WorkLifeBalance          1414 non-null   float64
 7   Age                      1426 non-null   int64  
 8   Attrition                1426 non-null   object 
 9   BusinessTravel           1426 non-null   object 
 10  Department               1426 non-null   object 
 11  DistanceFromHome         1426 non-null   int64  
 12  Education                1426 non-null   int64  
 13  EducationField           1426 non-null   object 
 14  EmployeeCount           

In [3]:
# find the number of features and instances

print(available_data.shape)
# shape returns a tuple of (instances, features), which is (1426, 31)
# however, from above we can see there are missing values for EmployeeID in 1426 - 1281 = 145 rows

(1426, 31)


In [4]:
# check which variables are categorical / numerical

print(available_data.dtypes)
# we can see that the majority of the columns are numerical, with the exception of the following:
# Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, Over18


hrs                        float64
absences                     int64
JobInvolvement             float64
PerformanceRating          float64
EnvironmentSatisfaction    float64
JobSatisfaction            float64
WorkLifeBalance            float64
Age                          int64
Attrition                   object
BusinessTravel              object
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeID                 float64
Gender                      object
JobLevel                     int64
JobRole                     object
MaritalStatus               object
MonthlyIncome                int64
NumCompaniesWorked         float64
Over18                      object
PercentSalaryHike            int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears          float64
TrainingTimesLastYear        int64
YearsAtCompany      

In [5]:
# check which of the categorical variables (Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, Over18) have high cardinality

list_of_categorical = ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18']

# print cardinality for each
for col in list_of_categorical:
    if available_data[col].nunique() > 5: # more than 5 distinct values is treated as high cardinality here
        print(col, "is high cardinality")

# observe that EducationField and JobRole are high cardinality

print("Values for EducationField: ", available_data['EducationField'].unique()) # 6 distinct education fields, including a catch-all 'Other'
print("Values for JobRole Field: " , available_data['JobRole'].unique()) # 9 distinct job roles, plus nan for the rows where JobRole is missing

EducationField is high cardinality
JobRole is high cardinality
Values for EducationField:  ['Life Sciences' 'Medical' 'Human Resources' 'Marketing'
 'Technical Degree' 'Other']
Values for JobRole Field:  ['Research Scientist' 'Sales Executive' 'Manager' 'Laboratory Technician'
 'Manufacturing Director' 'Healthcare Representative'
 'Sales Representative' 'Research Director' nan 'Human Resources']
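
The JobRole output above lists nan among the values, while the cardinality check uses nunique(). As a minimal sketch of the difference, using a hypothetical toy column rather than the real data: unique() keeps NaN as its own entry, while nunique() ignores it unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for JobRole (not the real data)
roles_toy = pd.Series(["Manager", "Sales Executive", np.nan, "Manager"])

n_with_nan = len(roles_toy.unique())               # unique() keeps NaN as its own entry
n_without_nan = roles_toy.nunique()                # nunique() ignores NaN by default
n_counting_nan = roles_toy.nunique(dropna=False)   # NaN counted as a distinct value

print(n_with_nan, n_without_nan, n_counting_nan)
```

So the cardinality test with nunique() counts only real categories, and missing values do not push a column over the threshold.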


In [6]:
# Check columns that have missing values

print(available_data.isnull().sum()[available_data.isnull().sum() > 0])
# filter by columns that have missing values

# 9 features have missing values, with the most missing in JobInvolvement, PerformanceRating, EmployeeID, JobRole


JobInvolvement             147
PerformanceRating          148
EnvironmentSatisfaction      7
JobSatisfaction              6
WorkLifeBalance             12
EmployeeID                 145
JobRole                    134
NumCompaniesWorked           5
TotalWorkingYears            4
dtype: int64


In [8]:
# Check whether there are constant columns

print("Constant Columns: ", available_data.columns[available_data.nunique() == 1]) # if the number of unique values in a column is 1, then it is a constant column

# there are three constant columns in the dataset: Over18, StandardHours, EmployeeCount

print(available_data['Over18'].unique()) # only one unique value
print(available_data['StandardHours'].unique())
print(available_data['EmployeeCount'].unique())


Constant Columns:  Index(['EmployeeCount', 'Over18', 'StandardHours'], dtype='object')
['Y']
[8]
[1]


In [9]:
# inspecting the columns shows that every employee is over 18, StandardHours is always 8, and EmployeeCount is always 1
# constant columns carry no information for the models, so delete them from both datasets
available_data.drop(columns=['Over18', 'StandardHours', 'EmployeeCount'], inplace=True)
competition_data.drop(columns=['Over18', 'StandardHours', 'EmployeeCount'], inplace=True)

In [10]:
# Attrition is the target variable in this problem

print(available_data['Attrition'].value_counts())

# so this is a binary classification problem

# balanced dataset, with 715 employees who have not left the company and 711 employees who have left the company
# 715 + 711 = 1426, so every row has a target value and we do not need to drop any rows

Attrition
No     715
Yes    711
Name: count, dtype: int64


Part 2: Basic Methods (Trees, KNN, and Logistic Regression)

General decisions:

1. Split 70% train, 30% test, because the dataset is small, and we need a large enough test size to help detect overfitting

2. We will also return a classification report (precision, recall, f1-score, and support), because this gives a more comprehensive view than accuracy alone
 - Precision is especially valuable because it checks how reliable our predictions are in the positive case
 - Support is the number of true instances of each class in the test set, so it lets us spot any imbalance that the random split could have introduced
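
As a minimal sketch of what these columns of the report measure, the counts can be worked out by hand on a few hypothetical toy labels (1 = left the company, 0 = stayed) and compared with scikit-learn's own functions. The labels are invented for illustration and are not taken from the dataset.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels (1 = left the company, 0 = stayed); not taken from the dataset
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

# Count the outcomes for the positive class by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true_toy, y_pred_toy))

precision_manual = tp / (tp + fp)  # of the predicted leavers, how many really left
recall_manual = tp / (tp + fn)     # of the real leavers, how many were found
support_positive = sum(t == 1 for t in y_true_toy)  # true instances of class 1

print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(support_positive)
```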

In [11]:
from sklearn.model_selection import train_test_split

# Define feature matrix X and target vector y
X = available_data.drop(columns=['Attrition'])
y = available_data['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100545358)
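
A quick way to sanity-check which split a result came from is to look at the test-set size. The sketch below uses a stand-in array of 1426 row indices (an assumption for illustration, not the real feature matrix) to show how many rows land in each part for test_size = 0.3 and 0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1426)  # stand-in for the 1426 rows of available_data

split_sizes = {}
for test_size in (0.3, 0.2):
    train_idx, test_idx = train_test_split(rows, test_size=test_size, random_state=0)
    split_sizes[test_size] = (len(train_idx), len(test_idx))

print(split_sizes)
```

The support column of a classification report should add up to the test size printed here for the split that was actually used.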


Method One (default hyperparameters): Decision Trees

In [38]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time

# Use a pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder

# Encode the target variable (LabelEncoder sorts the classes, so 'No' -> 0 and 'Yes' -> 1)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Define the preprocessor: impute missing values, then scale to [0, 1]
from sklearn.preprocessing import MinMaxScaler

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')), # impute missing values with the mean
            ('scaler', MinMaxScaler()) # scale the numerical features
        ]), X.select_dtypes(include=['int64', 'float64']).columns)
    ])
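# the preprocessor imputes missing values with the mean and min-max scales the numerical features
# note: ColumnTransformer drops unlisted columns by default (remainder='drop'), so the categorical
# features are not used by the models below; EmployeeID is float64, so it is kept as a feature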

# Define the model

model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', DecisionTreeClassifier(random_state=100545358))])


# Fit the model
start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)

# Predict the target values
y_pred = model.predict(X_test)

# Calculate the accuracy

accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy: ", accuracy)

# Also return a classification report 
from sklearn.metrics import classification_report
print(classification_report(y_test_encoded, y_pred))

# this returns a classification report with precision, recall, f1-score, and support for each class



Training time:  0.02600550651550293
Accuracy:  0.8636363636363636
              precision    recall  f1-score   support

           0       0.90      0.82      0.86       142
           1       0.83      0.91      0.87       144

    accuracy                           0.86       286
   macro avg       0.87      0.86      0.86       286
weighted avg       0.87      0.86      0.86       286
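
The preprocessor above passes only the numeric columns to the classifier. If we wanted the categorical features to be used as well, one possible extension is a second branch that imputes the most frequent category and one-hot encodes it. This is a sketch on a hypothetical toy frame, not the configuration used for the results in this notebook; the name preprocessor_all and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with the same kinds of columns as available_data
toy = pd.DataFrame({
    "Age": [30, 45, np.nan, 28],
    "MonthlyIncome": [3000, 8000, 5000, 4000],
    "Department": ["Sales", "R&D", "Sales", np.nan],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

preprocessor_all = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ]), numeric_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # ignore categories unseen in training
    ]), categorical_cols),
])

transformed = preprocessor_all.fit_transform(toy)
print(transformed.shape)  # 2 scaled numeric columns + one column per Department category
```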



Method Two (default hyperparameters): KNN

In [39]:
from sklearn.neighbors import KNeighborsClassifier

# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', KNeighborsClassifier())])

# Fit the model

start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)

# Predict the target values
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy: ", accuracy)

print(classification_report(y_test_encoded, y_pred))

Training time:  0.012004613876342773
Accuracy:  0.7307692307692307
              precision    recall  f1-score   support

           0       0.73      0.73      0.73       142
           1       0.73      0.74      0.73       144

    accuracy                           0.73       286
   macro avg       0.73      0.73      0.73       286
weighted avg       0.73      0.73      0.73       286



Method Three (default hyperparameters): Logistic Regression

In [40]:
from sklearn.linear_model import LogisticRegression

# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(random_state=100545358))])

# Fit the model

start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)

# Predict the target values
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test_encoded, y_pred)

print("Accuracy: ", accuracy)

print(classification_report(y_test_encoded, y_pred))

Training time:  0.021003246307373047
Accuracy:  0.6783216783216783
              precision    recall  f1-score   support

           0       0.69      0.65      0.67       142
           1       0.67      0.71      0.69       144

    accuracy                           0.68       286
   macro avg       0.68      0.68      0.68       286
weighted avg       0.68      0.68      0.68       286



In order of performance:

Decision Tree performs best (accuracy 0.86), followed by KNN (0.73) and Logistic Regression (0.68)

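Part 3: Check whether it is justifiable to switch to an 80/20 split so that KNN and Logistic Regression might perform better

Note: the Part 2 reports already show a support of 286 test rows (142 + 144), which is what test_size = 0.2 gives for 1426 rows (0.3 would give 428). Those outputs therefore appear to have been produced after this cell changed the split, which is why the results below are identical. The notebook should be re-run from top to bottom before the two splits are compared.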

In [41]:
# Change testing split to 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100545358)

# Re-encode the target variable
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print("Logistic Regression with 0.2 testing split")
# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(random_state=100545358))])

# Fit the model
start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)

# Predict the target values
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy: ", accuracy)
print(classification_report(y_test_encoded, y_pred))

print("KNN with 0.2 testing split")
# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', KNeighborsClassifier())])

# Fit the model
start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy: ", accuracy)
print(classification_report(y_test_encoded, y_pred))

print("Decision Tree with 0.2 testing split")

# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', DecisionTreeClassifier(random_state=100545358))])

# Fit the model
start = time.time()
model.fit(X_train, y_train_encoded)
end = time.time()

print("Training time: ", end - start)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy: ", accuracy)
print(classification_report(y_test_encoded, y_pred))



Logistic Regression with 0.2 testing split
Training time:  0.026006698608398438
Accuracy:  0.6783216783216783
              precision    recall  f1-score   support

           0       0.69      0.65      0.67       142
           1       0.67      0.71      0.69       144

    accuracy                           0.68       286
   macro avg       0.68      0.68      0.68       286
weighted avg       0.68      0.68      0.68       286

KNN with 0.2 testing split
Training time:  0.013001680374145508
Accuracy:  0.7307692307692307
              precision    recall  f1-score   support

           0       0.73      0.73      0.73       142
           1       0.73      0.74      0.73       144

    accuracy                           0.73       286
   macro avg       0.73      0.73      0.73       286
weighted avg       0.73      0.73      0.73       286

Decision Tree with 0.2 testing split
Training time:  0.02200484275817871
Accuracy:  0.8636363636363636
              precision    recall  f1-s
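
Comparing two single splits depends heavily on which rows happen to fall in the test set. As a hedged alternative, k-fold cross-validation averages the score over several splits. The sketch below runs on synthetic data from make_classification (an assumption for illustration, not our attrition data), so its scores say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real pipeline and X, y would be used in the notebook
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# 5-fold cross-validation: each fold serves once as the held-out test set
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=5, scoring="accuracy"
)

print(cv_scores)
print("Mean accuracy: ", cv_scores.mean())
```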

Final Chosen Method is: __________?

Part 4: Compare with the Dummy Method 

In [44]:
# using DummyClassifier as a baseline
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent', random_state = 100545358)
dummy.fit(X_train, y_train_encoded)

y_pred = dummy.predict(X_test)

dummy_accuracy = accuracy_score(y_test_encoded, y_pred)
print("Dummy Classifier Accuracy: ", dummy_accuracy)
print(classification_report(y_test_encoded, y_pred))

# the dummy classifier has an accuracy of about 0.50, and precision/recall/f1-score of 0 for class 1
# class 1 is 'Yes' (employees who left), so the dummy classifier predicts that no one has left the company
# all three models above beat this baseline (0.68, 0.73 and 0.86 accuracy)



Dummy Classifier Accuracy:  0.4965034965034965
              precision    recall  f1-score   support

           0       0.50      1.00      0.66       142
           1       0.00      0.00      0.00       144

    accuracy                           0.50       286
   macro avg       0.25      0.50      0.33       286
weighted avg       0.25      0.50      0.33       286
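
The baseline accuracy can be reproduced from class counts alone. The sketch below uses the counts implied by the outputs above (715 and 711 overall, 142 and 144 in the test report, leaving 573 and 567 for training) with placeholder features, since the most_frequent strategy never looks at them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts implied by the outputs above: 0 = 'No', 1 = 'Yes'
y_train_counts = np.array([0] * 573 + [1] * 567)
y_test_counts = np.array([0] * 142 + [1] * 144)

# The features are irrelevant to this strategy, so placeholder zeros are enough
X_train_blank = np.zeros((len(y_train_counts), 1))
X_test_blank = np.zeros((len(y_test_counts), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_blank, y_train_counts)
baseline_pred = baseline.predict(X_test_blank)
baseline_accuracy = accuracy_score(y_test_counts, baseline_pred)

# Always predicting class 0 is right exactly as often as class 0 appears in the test set
majority_share = 142 / 286
print(baseline_accuracy, majority_share)
```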





Visualization of Methods: Confusion Matrix 

We can check how many False Positives and False Negatives are being predicted by our model, to see how it is performing overall

We want to minimize both of these values
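
As a minimal sketch of how these counts are read off, confusion_matrix returns a 2x2 array for a binary problem whose flattened order is TN, FP, FN, TP. The labels below are hypothetical toy values, not predictions from our models.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (1 = left the company, 0 = stayed)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print("False positives: ", fp)  # stayed, but predicted to leave
print("False negatives: ", fn)  # left, but predicted to stay
```

In the notebook, the same call would take y_test_encoded and y_pred, and ConfusionMatrixDisplay can then draw the array as a plot.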

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

