# The Impact of Psychometric Factors on Employee Attrition
## Introduction

The Human Resources function has changed. For much of the previous century, HR operated as an administrative center focused on personnel management: recruiting, onboarding, compensating and terminating employees.

During the 1970s, companies began to face a more competitive climate fostered by increased globalization, technological innovation and business deregulation. In that climate, there was greater awareness that attracting and retaining the right employees substantially affects performance. By the early years of the 21st century, there was a growing recognition that HR had to change substantially: employees should be viewed not just as a major cost (which of course they are) but as generators of revenue in whom greater investment was crucial. When it came to harnessing the potential of human capital, however, HR lagged behind other business areas such as marketing and finance in adopting data analysis and data modeling for prediction and classification. These are areas that I am very much interested in studying, and they are what this project addresses.


Taking one such problem, this project began as an effort to study employee data with an eye to determining whether employee attrition can be predicted from psychometric characteristics. The project was complicated by the fact that employee data is generally confidential. As such, the only way I could proceed was to use a realistic-looking but manufactured data set.

I found such a set on IBM's developer page for data science. IBM made a set of employee data freely available for study purposes on Kaggle.

The data is for study purposes only: the initial data set was fabricated by IBM to provide material for understanding and practicing data science tools and techniques. To this I added actual psychometric data from online personality tests. For the purposes of the project, I imagined that the psychometric exams had been administered to the employees. In such a fabricated scenario it is unrealistic to suggest that any "true" insights may be derived, but the combined data is nonetheless sufficiently complex for a machine learning study.

I was interested in what role psychometrics could play in relation to employees. It is possible to gather psychometric data from potential employees, and I wanted to see what sort of information could be gained by using psychometric data to complement the employee data. So, for the purposes of study, I found an independent set of psychometric data that enabled me to study patterns of behavior by modeling them. While I will not be able to make predictions based on this data, it nonetheless affords an opportunity for study.

## Part One:
### Data Sources

The Employee Data was found on Kaggle: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

There are many kinds of psychometric data. One such test is the Big Five Personality Test. Also called the OCEAN model (after the first letter of each of the psychometric categories; more on that later), the Big Five is one of the best accepted and most commonly used personality models in psychology. The test consists of 50 statements that respondents rate on a five-point scale from Disagree to Agree.

The psychometric data was taken from the Open-Source Psychometrics Project: https://openpsychometrics.org/_rawdata/

### Obtaining & Cleaning Data

Importing Relevant packages:

In [153]:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats
from scipy.spatial.distance import cdist 
from scipy.spatial import distance
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

Reading psychometric data from csv into a dataframe:

In [154]:
df_big_five = pd.read_csv("data/BIG5/Big5_employee_results.csv")
df_big_five.head()

Unnamed: 0,EmployeeNumber,hand,Openness,Conscieniousness,Extroversion,Agreeableness,Neuroticism,E1(1),E2(-1),E3(1),...,O1(1),O2(-1),O3(1),O4(-1),O5(1),O6(-1),O7(1),O8(1),O9(1),O10(1)
0,1,1,,,,,,4,2,5,...,4,1,3,1,5,1,4,2,5,5
1,2,1,,,,,,2,2,3,...,3,3,3,3,2,3,3,1,3,2
2,4,1,,,,,,5,1,1,...,4,5,5,1,5,1,5,5,5,5
3,5,1,,,,,,2,5,2,...,4,3,5,2,4,2,5,2,5,5
4,7,1,,,,,,3,1,3,...,3,1,1,1,3,1,3,1,5,3


Changing Column Names:

In [155]:
# removing the parenthesized '(1)' / '(-1)' suffixes from the item column names
df_big_five = df_big_five.rename(columns=
                                 {'E1(1)' : 'E1',  'E2(-1)': 'E2', 'E3(1)' : 'E3','E4(-1)': 'E4', 'E5(1)': 'E5',
 'E6(-1)' : 'E6', 'E7(1)' : 'E7', 'E8(-1)' : 'E8', 'E9(1)' : 'E9', 'E10(-1)' : 'E10', 'N1(-1)' : 'N1', 'N2(1)' : 'N2',
 'N3(-1)' : 'N3', 'N4(1)' : 'N4','N5(-1)' : 'N5', 'N6(-1)' : 'N6', 'N7(-1)' : 'N7', 'N8(-1)' : 'N8', 'N9(-1)' : 'N9', 
 'N10(-1)' : 'N10', 'A1(-1)' : 'A1', 'A2(1)' : 'A2', 'A3(-1)': 'A3','A4(1)' : 'A4', 'A5(-1)' : 'A5', 'A6(1)' : 'A6',
 'A7(-1)' : 'A7', 'A8(1)' : 'A8', 'A9(1)' : 'A9', 'A10(1)' : 'A10', 'C1(1)' : 'C1', 'C2(-1)' : 'C2', 'C3(1)' : 'C3', 
 'C4(-1)' : 'C4', 'C5(1)' : 'C5', 'C6(-1)' : 'C6', 'C7(1)': 'C7', 'C8(-1)' : 'C8', 'C9(1)' : 'C9', 
 'C10(1)' : 'C10', 'O1(1)' : 'O1', 'O2(-1)' : 'O2', 'O3(1)' : 'O3', 'O4(-1)' : 'O4', 
 'O5(1)' : 'O5', 'O6(-1)' : 'O6', 'O7(1)' : 'O7', 'O8(1)' : 'O8', 'O9(1)' : 'O9', 'O10(1)' : 'O10'})
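
The same renaming can be expressed more compactly with a regular expression instead of an explicit dictionary. The following is only a sketch on a toy, empty DataFrame (`toy` is a hypothetical name), assuming every item column ends in exactly `(1)` or `(-1)`:

```python
import pandas as pd

# Toy frame with the same naming pattern as the raw item columns (hypothetical data).
toy = pd.DataFrame(columns=["EmployeeNumber", "E1(1)", "E2(-1)", "N10(-1)", "O10(1)"])

# Strip a trailing "(1)" or "(-1)" from every column name; other names are untouched.
toy.columns = toy.columns.str.replace(r"\((?:-1|1)\)$", "", regex=True)

print(list(toy.columns))
```

A pattern-based rename avoids typing fifty key/value pairs by hand, at the cost of being a little less explicit about which columns change.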

In [156]:
df_big_five.head()

Unnamed: 0,EmployeeNumber,hand,Openness,Conscieniousness,Extroversion,Agreeableness,Neuroticism,E1,E2,E3,...,O1,O2,O3,O4,O5,O6,O7,O8,O9,O10
0,1,1,,,,,,4,2,5,...,4,1,3,1,5,1,4,2,5,5
1,2,1,,,,,,2,2,3,...,3,3,3,3,2,3,3,1,3,2
2,4,1,,,,,,5,1,1,...,4,5,5,1,5,1,5,5,5,5
3,5,1,,,,,,2,5,2,...,4,3,5,2,4,2,5,2,5,5
4,7,1,,,,,,3,1,3,...,3,1,1,1,3,1,3,1,5,3


Checking for missing values:

In [157]:
df_big_five.isna().sum()

EmployeeNumber         0
hand                   0
Openness            1470
Conscieniousness    1470
Extroversion        1470
Agreeableness       1470
Neuroticism         1470
E1                     0
E2                     0
E3                     0
E4                     0
E5                     0
E6                     0
E7                     0
E8                     0
E9                     0
E10                    0
N1                     0
N2                     0
N3                     0
N4                     0
N5                     0
N6                     0
N7                     0
N8                     0
N9                     0
N10                    0
A1                     0
A2                     0
A3                     0
A4                     0
A5                     0
A6                     0
A7                     0
A8                     0
A9                     0
A10                    0
C1                     0
C2                     0
C3                     0


The only missing values are for the five psychometric categories because the question results have not yet been scored.  
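
The scoring step itself is not shown in this notebook. As a rough illustration of how such items might be scored, the sketch below assumes that the `(1)` / `(-1)` suffix in the raw column names marks the keying direction, and that negatively keyed items are reversed as `6 - response` on the 1 to 5 scale. The helper `score_trait` and the toy data are hypothetical and are not the scoring code actually used for this project.

```python
import re
import pandas as pd

# Toy responses on a 1-5 scale; names follow the raw "item(key)" pattern (hypothetical data).
toy = pd.DataFrame({
    "E1(1)": [4, 2],
    "E2(-1)": [2, 5],
    "E3(1)": [5, 2],
})

def score_trait(df, prefix):
    """Sum the items of one trait, reversing negatively keyed items (assumed rule: 6 - x)."""
    total = pd.Series(0, index=df.index)
    for col in df.columns:
        # Match e.g. "E1(1)" or "E10(-1)" but not unrelated columns such as "EmployeeNumber".
        if not re.match(rf"{prefix}\d+\(", col):
            continue
        reverse = col.endswith("(-1)")
        total = total + ((6 - df[col]) if reverse else df[col])
    return total

extroversion = score_trait(toy, "E")
print(extroversion.tolist())
```

Under this assumed rule, ten items per trait give totals between 10 and 50.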

Converting to CSV: This line is commented out because the csv was already created in another notebook.

In [158]:
# df_big_five.to_csv('big5_revised_1.csv')

Reading in the joined CSV, in which the psychometric data has been added to the employee data:

In [159]:
df_joined = pd.read_csv('data/employee_joined_psychometric.csv')
print(df_joined.shape)
df_joined.head()

(1470, 41)


Unnamed: 0.1,Unnamed: 0,EmployeeNumber,Age,Attrition,Openness,Conscieniousness,Extroversion,Agreeableness,Emotional Balance,BusinessTravel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0,1,41,Yes,43,47,44,46,49,Travel_Rarely,...,1,80,0,8,0,1,6,4,0,5
1,1,2,49,No,26,42,22,35,29,Travel_Frequently,...,4,80,1,10,3,3,10,7,1,7
2,2,4,37,Yes,45,49,35,38,14,Travel_Rarely,...,2,80,0,7,3,3,0,0,0,0
3,3,5,33,No,41,26,22,37,17,Travel_Frequently,...,3,80,0,8,3,3,8,7,3,0
4,4,7,27,No,34,34,34,44,30,Travel_Rarely,...,4,80,1,6,3,3,2,2,2,2
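
The join itself is not shown in this notebook. One way such a join could be done, sketched here on toy frames and assuming both tables share an `EmployeeNumber` key with one row per employee (the frame names `employees` and `traits` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the employee table and the scored psychometric table.
employees = pd.DataFrame({"EmployeeNumber": [1, 2, 4], "Age": [41, 49, 37]})
traits = pd.DataFrame({"EmployeeNumber": [1, 2, 4], "Openness": [43, 26, 45]})

# validate="one_to_one" raises an error if either side has duplicate keys,
# which guards against silently multiplying rows.
joined = employees.merge(traits, on="EmployeeNumber", how="inner", validate="one_to_one")

print(joined.shape)
```

Checking the row count after the merge against the original employee table is a cheap way to confirm that nobody was dropped or duplicated.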


In [160]:
df_joined.columns

Index(['Unnamed: 0', 'EmployeeNumber', 'Age', 'Attrition', 'Openness',
       'Conscieniousness', 'Extroversion', 'Agreeableness',
       'Emotional Balance', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Removing first column "Unnamed: 0":

In [161]:
df_joined = df_joined.drop('Unnamed: 0', axis=1)
df_joined.head()

Unnamed: 0,EmployeeNumber,Age,Attrition,Openness,Conscieniousness,Extroversion,Agreeableness,Emotional Balance,BusinessTravel,DailyRate,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,41,Yes,43,47,44,46,49,Travel_Rarely,1102,...,1,80,0,8,0,1,6,4,0,5
1,2,49,No,26,42,22,35,29,Travel_Frequently,279,...,4,80,1,10,3,3,10,7,1,7
2,4,37,Yes,45,49,35,38,14,Travel_Rarely,1373,...,2,80,0,7,3,3,0,0,0,0
3,5,33,No,41,26,22,37,17,Travel_Frequently,1392,...,3,80,0,8,3,3,8,7,3,0
4,7,27,No,34,34,34,44,30,Travel_Rarely,591,...,4,80,1,6,3,3,2,2,2,2


In [162]:
df_joined = df_joined.rename(columns={'Emotional Balance' : 'Neuroticism' })
df_joined.columns

Index(['EmployeeNumber', 'Age', 'Attrition', 'Openness', 'Conscieniousness',
       'Extroversion', 'Agreeableness', 'Neuroticism', 'BusinessTravel',
       'DailyRate', 'Department', 'DistanceFromHome', 'Education',
       'EducationField', 'EmployeeCount', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

Checking data types:

In [163]:
print(df_joined.shape)
df_joined.info()

(1470, 40)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 40 columns):
EmployeeNumber              1470 non-null int64
Age                         1470 non-null int64
Attrition                   1470 non-null object
Openness                    1470 non-null int64
Conscieniousness            1470 non-null int64
Extroversion                1470 non-null int64
Agreeableness               1470 non-null int64
Neuroticism                 1470 non-null int64
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvemen

Checking again to make sure missing values have all been addressed:

In [164]:
df_joined.isna().sum()

EmployeeNumber              0
Age                         0
Attrition                   0
Openness                    0
Conscieniousness            0
Extroversion                0
Agreeableness               0
Neuroticism                 0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorki

In [167]:
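# Keep a copy of the data before dummy encoding (categorical columns intact) for later group-bys
df_abnormal = df_joined.copy()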

Dummy Variables for Categorical data

In [15]:
df_joined = pd.get_dummies(df_joined)
print(df_joined.shape)
df_joined.info()

(1470, 62)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 62 columns):
EmployeeNumber                       1470 non-null int64
Age                                  1470 non-null int64
Openness                             1470 non-null int64
Conscieniousness                     1470 non-null int64
Extroversion                         1470 non-null int64
Agreeableness                        1470 non-null int64
Emotional Balance                    1470 non-null int64
DailyRate                            1470 non-null int64
DistanceFromHome                     1470 non-null int64
Education                            1470 non-null int64
EmployeeCount                        1470 non-null int64
EnvironmentSatisfaction              1470 non-null int64
HourlyRate                           1470 non-null int64
JobInvolvement                       1470 non-null int64
JobLevel                             1470 non-null int64
JobSatisfaction            

Dummy encoding expands the data from 40 columns to 62: the 9 categorical columns are replaced by 31 indicator columns.
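
To make the column arithmetic concrete, here is a small sketch on toy data (the frame `toy` is hypothetical). Each non-numeric column is replaced by one indicator column per distinct value, so the new width is the old width, minus the categorical columns, plus the number of distinct categories:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Sales"],
})

encoded = pd.get_dummies(toy)

# Expected width: numeric columns stay, each categorical column becomes nunique() columns.
categorical = toy.select_dtypes(exclude="number").columns
expected = toy.shape[1] - len(categorical) + sum(toy[c].nunique() for c in categorical)

print(encoded.shape[1], expected)
```

When dummy columns feed a linear model, `pd.get_dummies(..., drop_first=True)` is a common option for avoiding perfectly collinear indicator pairs such as `Attrition_No` and `Attrition_Yes`.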

## Part Two

### Exploratory Data Analysis (EDA)

Since the features are measured on different scales, they must be brought into alignment by standardizing them (zero mean, unit variance):

In [16]:
# I removed features that were converted into dummies from feature list
dummies = ['Attrition_No',
       'Attrition_Yes', 'BusinessTravel_Non-Travel',
       'BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely',
       'Department_Human Resources', 'Department_Research & Development',
       'Department_Sales', 'EducationField_Human Resources',
       'EducationField_Life Sciences', 'EducationField_Marketing',
       'EducationField_Medical', 'EducationField_Other',
       'EducationField_Technical Degree', 'Gender_Female', 'Gender_Male',
       'JobRole_Healthcare Representative', 'JobRole_Human Resources',
       'JobRole_Laboratory Technician', 'JobRole_Manager',
       'JobRole_Manufacturing Director', 'JobRole_Research Director',
       'JobRole_Research Scientist', 'JobRole_Sales Executive',
       'JobRole_Sales Representative', 'MaritalStatus_Divorced',
       'MaritalStatus_Married', 'MaritalStatus_Single', 'Over18_Y',
       'OverTime_No', 'OverTime_Yes']
features = ['Age', 'Openness', 'Conscieniousness', 'Extroversion',
       'Agreeableness', 'Neuroticism', 'DailyRate', 'DistanceFromHome',
       'Education', 'EmployeeCount', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager']

Preprocessing the data

In [17]:
# Separating out the features
x = df_joined.loc[:, features].values

# Separating out the target
y = df_joined.loc[:,['Attrition_Yes']].values

# Standardizing the features
x = StandardScaler().fit_transform(x)



Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike


invalid value encountered in true_divide


Degrees of freedom <= 0 for slice.



In [25]:
x[0]

array([ 0.4463504 ,  0.62838029,  1.81536823,  1.50749486,  1.10430056,
               nan,  0.74252653, -1.01090934, -0.89168825,  0.        ,
       -0.66053067,  1.38313827,  0.37967213, -0.05778755,  1.15325359,
       -0.10834951,  0.72601994,  2.12513592, -1.1505541 , -0.42623002,
       -1.58417824,  0.        , -0.93201439, -0.42164246, -2.17198183,
       -2.49382042, -0.16461311, -0.0632959 , -0.67914568,  0.24583399])

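The values of x are now standardized. Two things stand out in the first row: the nan in the sixth position corresponds to 'Neuroticism', which was not found in the DataFrame when this cell ran (the info() output above still lists that column as 'Emotional Balance'), hence the warnings; and the exact zeros at the positions of 'EmployeeCount' and 'StandardHours' are what standardization produces for a column with no variance (StandardHours is 80 for every employee).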
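
The exact zeros in the array above are what `StandardScaler` returns for a column that never varies. A minimal sketch on a toy array (the numbers are illustrative, not taken from the data set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features: one that varies (like Age) and one that is constant (like StandardHours).
toy = np.array([[41.0, 80.0],
                [49.0, 80.0],
                [37.0, 80.0]])

scaled = StandardScaler().fit_transform(toy)

# The varying column ends up with mean 0 and standard deviation 1;
# the constant column is mapped to all zeros.
print(scaled)
```

Constant columns carry no information for modeling, so dropping them before scaling is a reasonable alternative.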

Feature Histograms:

In [34]:
df_joined.co

0       41
1       49
2       37
3       33
4       27
        ..
1465    36
1466    39
1467    27
1468    49
1469    34
Name: Age, Length: 1470, dtype: int64

In [91]:
#Employee Data:
attrition = df_joined["Attrition_Yes"]
age = df_joined["Age"]
DistanceFromHome = df_joined["DistanceFromHome"]
Education = df_joined["Education"]
EnvironmentSatisfaction = df_joined["EnvironmentSatisfaction"]
HourlyRate = df_joined["HourlyRate"]
JobSatisfaction = df_joined["JobSatisfaction"]
MonthlyIncome = df_joined["MonthlyIncome"]
PercentSalaryHike = df_joined["PercentSalaryHike"]
PerformanceRating = df_joined["PerformanceRating"]
StockOptionLevel = df_joined["StockOptionLevel"]
TrainingTimesLastYear = df_joined["TrainingTimesLastYear"]
WorkLifeBalance = df_joined["WorkLifeBalance"]
YearsSinceLastPromotion = df_joined["YearsSinceLastPromotion"]



# Psychometric Data:

Openness = df_joined["Openness"]
Conscieniousness = df_joined["Conscieniousness"]
Extroversion = df_joined["Extroversion"]
Agreeableness = df_joined["Agreeableness"]
Neuroticism = df_joined["Neuroticism"]

In [72]:
# Age:
fig = px.histogram(age, x="Age", color = 'Age', title= 'Age')
fig.show()

In [71]:
# Education:
fig = px.histogram(Education, x="Education", color = 'Education', title= 'Education')
fig.show()

Education: 1 'Below College'
2 'College'
3 'Bachelor'
4 'Master'
5 'Doctor'

In [70]:
# Environmental Satisfaction:
fig = px.histogram(EnvironmentSatisfaction, x="EnvironmentSatisfaction", color = 'EnvironmentSatisfaction', title= 'Environment Satisfaction')
fig.show()

In [69]:
# Hourly Rate:
fig = px.histogram(HourlyRate, x="HourlyRate", color = 'HourlyRate', title= 'Hourly Rate')
fig.show()

The psychometric data will be looked at separately later.

In [97]:
# # Openness:

# fig = px.histogram(Openness, x="Openness", color = 'Openness', title= 'Openness')
# fig.show()

In [98]:
# # Conscieniousness:
# fig = px.histogram(Conscieniousness, x="Conscieniousness", color = 'Conscieniousness', title= 'Conscientiousness')
# fig.show()

In [99]:
# # Extroversion:

# fig = px.histogram(Extroversion, x="Extroversion", color = 'Extroversion', title= 'Extroversion')
# fig.show()

In [101]:
# # Agreeableness:

# fig = px.histogram(Agreeableness, x="Agreeableness", color = 'Agreeableness', title= 'Agreeableness')
# fig.show()

In [102]:
# # Neuroticism:

# fig = px.histogram(Neuroticism, x="Neuroticism", color = 'Neuroticism', title= 'Neuroticism')
# fig.show()

#### Distance From Home:

In [208]:
df_joined['Age'].describe()

count    1470.000000
mean       36.923810
std         9.135373
min        18.000000
25%        30.000000
50%        36.000000
75%        43.000000
max        60.000000
Name: Age, dtype: float64

In [63]:
fig = px.bar(df_joined, x= 'DistanceFromHome', y= 'Attrition_Yes', color= 'DistanceFromHome', 
             title="Attrition by Distance from Home")
fig.show()

In [75]:
# Raw count- Distance from Home:
fig = px.histogram(df_joined, x="DistanceFromHome", color = 'DistanceFromHome', title= 'Distance From Home- Count')
fig.show()

The raw counts suggest that most of the employees who leave live close to work. That is because most employees live close to work in the first place, as can be seen by comparing the raw counts of distance from home with the previous graph.

Looking at the mean distance by attrition is more instructive:

In [80]:
distance_by_mean_attrition = pd.DataFrame(df_joined.groupby(['Attrition_Yes'])['DistanceFromHome'].mean())
distance_by_mean_attrition

Attrition_Yes,DistanceFromHome
0,8.915653
1,10.632911


In [81]:
# I had to reset the index because I wanted to plot attrition as a value, not use it as an index.
distance_by_mean_attrition = distance_by_mean_attrition.reset_index()
distance_by_mean_attrition

Unnamed: 0,Attrition_Yes,DistanceFromHome
0,0,8.915653
1,1,10.632911


In [83]:
# Mean Distance from Home by Attrition:
fig = px.bar(distance_by_mean_attrition, x= 'Attrition_Yes', y = 'DistanceFromHome',  color = 'Attrition_Yes', 
             title='Mean Distance from Home by Attrition')
fig.show()
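
Another way to separate "how many people live at this distance" from "how likely they are to leave" is to compute the attrition rate within distance bands. The sketch below uses toy data; the band edges and the frame `toy` are assumptions for illustration only.

```python
import pandas as pd

# Toy data: distance from home and a 0/1 attrition flag (hypothetical values).
toy = pd.DataFrame({
    "DistanceFromHome": [1, 2, 2, 8, 9, 15, 20, 25],
    "Attrition_Yes":    [0, 0, 1, 0, 0, 1, 0, 1],
})

# Bands are right-inclusive: (0, 5], (5, 10], (10, 30].
bands = pd.cut(toy["DistanceFromHome"], bins=[0, 5, 10, 30],
               labels=["0-5", "6-10", "11-30"])

# The mean of a 0/1 flag within a group is the share of that group who left.
rate_by_band = toy.groupby(bands, observed=True)["Attrition_Yes"].mean()

print(rate_by_band)
```

Because each rate is normalized by the size of its own band, it is not inflated simply because more employees live close to work.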

#### Attrition by Performance Review:

In [86]:
distance_by_rating = pd.DataFrame(df_joined.groupby(['Attrition_Yes'])['PerformanceRating'].mean())
distance_by_rating = distance_by_rating.reset_index()
distance_by_rating

Unnamed: 0,Attrition_Yes,PerformanceRating
0,0,3.153285
1,1,3.156118


In [212]:
df_joined['PerformanceRating'].describe()

count    1470.000000
mean        3.153741
std         0.360824
min         3.000000
25%         3.000000
50%         3.000000
75%         3.000000
max         4.000000
Name: PerformanceRating, dtype: float64

In [88]:
fig = px.bar(distance_by_rating, x= 'Attrition_Yes', y = 'PerformanceRating',  
             color = 'Attrition_Yes', title= 'Mean Performance Review Score by Attrition')
fig.show()

Performance reviews don't seem to have any bearing on whether an employee stays with the company: the mean rating is practically identical for both groups. This is concerning, because we would expect employees with higher performance ratings to be more likely to stay. A further look may help to explain this.

In [96]:
df_joined['PerformanceRating'].describe()

count    1470.000000
mean        3.153741
std         0.360824
min         3.000000
25%         3.000000
50%         3.000000
75%         3.000000
max         4.000000
Name: PerformanceRating, dtype: float64

Looking at the minimum and maximum values for Performance Rating, we see that there are no scores of 1 or 2. This means that supervisors are always giving their direct reports scores of 3 or 4, which points to a problem with the performance review process. It could mean that supervisors are not taking the time to conduct an accurate assessment of performance. It could mean that supervisors need more training. It could also mean that the performance review process is too difficult to implement efficiently and that other processes or platforms should be considered.
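
A frequency table makes the gap easier to see than the minimum and maximum alone. This is a small sketch on toy ratings, assuming a 1 to 4 rating scale; the values in `ratings` are hypothetical.

```python
import pandas as pd

# Toy ratings (hypothetical); the real column would be df_joined['PerformanceRating'].
ratings = pd.Series([3, 3, 4, 3, 3, 4, 3], name="PerformanceRating")

# Reindex over the full assumed scale so that unused scores appear with a count of 0.
counts = ratings.value_counts().reindex([1, 2, 3, 4], fill_value=0)

print(counts)
```

Listing the unused scores explicitly shows at a glance which part of the scale supervisors never use.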

#### Job Satisfaction

In [106]:
job_satisfaction_by_age = pd.DataFrame(df_joined.groupby(['Age'])['JobSatisfaction'].mean())
job_satisfaction_by_age = job_satisfaction_by_age.reset_index()
job_satisfaction_by_age.head()

Unnamed: 0,Age,JobSatisfaction
0,18,3.25
1,19,2.555556
2,20,2.636364
3,21,2.692308
4,22,2.9375


In [107]:

fig = px.bar(job_satisfaction_by_age, x="Age", y="JobSatisfaction", color="Age", title="Job Satisfaction by Age")
fig.show()

In [168]:
# To Explore; delete if not useful

In [169]:
print(df_abnormal.shape)
df_abnormal.head()

(1470, 40)


Unnamed: 0,EmployeeNumber,Age,Attrition,Openness,Conscieniousness,Extroversion,Agreeableness,Neuroticism,BusinessTravel,DailyRate,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,41,Yes,43,47,44,46,49,Travel_Rarely,1102,...,1,80,0,8,0,1,6,4,0,5
1,2,49,No,26,42,22,35,29,Travel_Frequently,279,...,4,80,1,10,3,3,10,7,1,7
2,4,37,Yes,45,49,35,38,14,Travel_Rarely,1373,...,2,80,0,7,3,3,0,0,0,0
3,5,33,No,41,26,22,37,17,Travel_Frequently,1392,...,3,80,0,8,3,3,8,7,3,0
4,7,27,No,34,34,34,44,30,Travel_Rarely,591,...,4,80,1,6,3,3,2,2,2,2


In [170]:
df_abnormal.describe()

Unnamed: 0,EmployeeNumber,Age,Openness,Conscieniousness,Extroversion,Agreeableness,Neuroticism,DailyRate,DistanceFromHome,Education,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,1024.865306,36.92381,38.934014,33.640816,29.74966,38.068027,29.021088,802.485714,9.192517,2.912925,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,602.024335,9.135373,6.472784,7.361443,9.456211,7.185247,8.686364,403.5091,8.106864,1.024165,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,1.0,18.0,16.0,11.0,10.0,10.0,10.0,102.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,491.25,30.0,34.0,28.0,23.0,34.0,23.0,465.0,2.0,2.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,1020.5,36.0,39.0,34.0,30.0,39.0,29.0,802.0,7.0,3.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,1555.75,43.0,44.0,39.0,37.0,43.0,35.0,1157.0,14.0,4.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,2068.0,60.0,50.0,50.0,50.0,50.0,50.0,1499.0,29.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [172]:
df_abnormal.columns

Index(['EmployeeNumber', 'Age', 'Attrition', 'Openness', 'Conscieniousness',
       'Extroversion', 'Agreeableness', 'Neuroticism', 'BusinessTravel',
       'DailyRate', 'Department', 'DistanceFromHome', 'Education',
       'EducationField', 'EmployeeCount', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

#### Department vs. Job Satisfaction:

In [200]:
satisfaction_by_department = pd.DataFrame(df_joined.groupby(['Department'])['JobSatisfaction'].mean())
satisfaction_by_department = satisfaction_by_department.reset_index()
satisfaction_by_department

Unnamed: 0,Department,JobSatisfaction
0,Human Resources,2.603175
1,Research & Development,2.726327
2,Sales,2.751121


In [176]:
fig = px.bar(satisfaction_by_department, x="Department", y="JobSatisfaction", 
             color="Department", title="Job Satisfaction by Department")
fig.show()

#### Job Satisfaction by Gender

In [209]:
satisfaction_by_gender = pd.DataFrame(df_joined.groupby(['Gender'])['JobSatisfaction'].mean())
satisfaction_by_gender = satisfaction_by_gender.reset_index()
satisfaction_by_gender

Unnamed: 0,Gender,JobSatisfaction
0,Female,2.683673
1,Male,2.758503


In [210]:

fig = px.bar(satisfaction_by_gender, x="Gender", y="JobSatisfaction", 
             color="Gender", title="Job Satisfaction by Gender")
fig.show()

In [195]:
# Mean WorkLifeBalance by attrition (using the data before dummy encoding)
worklife_balance_by_attrition = pd.DataFrame(df_abnormal.groupby(['Attrition'])['WorkLifeBalance'].mean())
worklife_balance_by_attrition = worklife_balance_by_attrition.reset_index()
worklife_balance_by_attrition

Unnamed: 0,Attrition,WorkLifeBalance
0,No,2.781022
1,Yes,2.658228


In [197]:
# Mean TrainingTimesLastYear by attrition (using the data before dummy encoding)
training_time_by_attrition = pd.DataFrame(df_abnormal.groupby(['Attrition'])['TrainingTimesLastYear'].mean())
training_time_by_attrition = training_time_by_attrition.reset_index()
training_time_by_attrition

Unnamed: 0,Attrition,TrainingTimesLastYear
0,No,2.832928
1,Yes,2.624473


In [198]:
# Bar chart of mean training times last year, by attrition
fig = px.bar(training_time_by_attrition, x="Attrition", y="TrainingTimesLastYear", 
             color="Attrition", title="Training Time and Attrition")
fig.show()

## Part Three:
###  Modeling of Psychometric Data
Because the psychometric data comes from a different source than the employee data, the psychometric scores cannot serve as independent variables from which a dependency with the employee outcomes could be established. An unsupervised approach was therefore better suited to this problem, and I used K-Means clustering to model types of respondents to the personality test.

The data is from online respondents' results on the Big Five Personality Test. The Big Five groups personalities according to five general categories:

##### -Openness vs. Closedness to Experience:
Correlated trait adjectives: Ideas (curious), Fantasy (imaginative), Aesthetics (artistic), Actions (wide interests), Feelings (excitable), Values (unconventional)

##### -Conscientiousness vs. Lack of Direction:
Correlated trait adjectives: Competence (efficient), Order (organized), Dutifulness (not careless), Achievement striving (thorough), Self-discipline (not lazy), Deliberation (not impulsive)

##### -Extroversion vs. Introversion:
Correlated trait adjectives: Gregariousness (sociable), Assertiveness (forceful), Activity (energetic), Excitement-seeking (adventurous), Positive emotions (enthusiastic), Warmth (outgoing)

##### -Agreeableness vs. Antagonism:
Correlated trait adjectives: Trust (forgiving), Straightforwardness (not demanding), Altruism (warm), Compliance (not stubborn), Modesty (not show-off), Tender-mindedness (sympathetic)

##### -Neuroticism (Emotional Balance):
Correlated trait adjectives: Anxiety (tense), Angry hostility (irritable), Depression (not contented), Self-consciousness (shy), Impulsiveness (moody), Vulnerability (not self-confident)

The data set contains results from close to 20 thousand people (19,719 respondents) who have taken the exam. The exam consists of 50 questions, 10 for each psychometric category, plus several demographic questions.

Here I am using the full data set, not just the portion that was joined to the employee data, so I have to read in the relevant CSV again.

In [114]:
# Reading data into notebook:

scored =pd.read_csv('data/big_five_final.csv')
print(scored.shape)
scored.head()

(19719, 12)


Unnamed: 0,race,age,English,gender,hand,source,country,Openness,Conscieniousness,Extroversion,Agreeableness,Emotional Balance
0,Caucasian (European),53,1,1,1,Test Website,US,43,47,44,46,49
1,Mixed Race,46,1,0,1,Test Website,US,26,42,22,35,29
2,Mixed Race,14,0,0,1,Test Website,PK,45,49,35,38,14
3,Caucasian (European),19,0,0,1,Test Website,RO,41,26,22,37,17
4,Mixed Race,25,0,0,1,Google,US,34,34,34,44,30


Emotional Balance should be changed back to Neuroticism as this category is framed negatively:

In [146]:
# Note: second_cluster is created in a later cell (In [132]); run this cell after it.
second_cluster = second_cluster.rename(columns={'Emotional Balance': 'Neuroticism'})
second_cluster.columns

Index(['Openness', 'Conscieniousness', 'Extroversion', 'Agreeableness',
       'Neuroticism', 'Cluster'],
      dtype='object')

In [116]:
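# Select the five trait columns by name, which matches the output shown below.
# (Positions 4:9 of the twelve columns printed above would run from 'hand' to 'Conscieniousness'.)
psychometrics = scored[['Openness', 'Conscieniousness', 'Extroversion',
                        'Agreeableness', 'Emotional Balance']].copy()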

In [118]:
psychometrics = pd.DataFrame(psychometrics)
print(psychometrics.shape)
psychometrics.head()

(19719, 5)


Unnamed: 0,Openness,Conscieniousness,Extroversion,Agreeableness,Emotional Balance
0,43,47,44,46,49
1,26,42,22,35,29
2,45,49,35,38,14
3,41,26,22,37,17
4,34,34,34,44,30


### Scaling
Although the data doesn't strictly need to be scaled (all five traits are scored on the same scale), I standardized it, which also yields a NumPy array:

In [119]:
# preprocess the data 

from sklearn.preprocessing import StandardScaler
# Separating out the features
x = psychometrics.values

# Standardizing the features
x = StandardScaler().fit_transform(x)
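
As a side note on the remark above, a data frame can be turned into a NumPy array without any scaling; what `StandardScaler` adds is centering each column on 0 and rescaling it to unit variance. A minimal sketch, using two columns from the first five rows shown earlier (the `toy` frame is only a stand-in for `psychometrics`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two trait columns from the first five rows shown earlier (stand-in for psychometrics)
toy = pd.DataFrame({
    "Openness": [43, 26, 45, 41, 34],
    "Extroversion": [44, 22, 35, 22, 34],
})

# .to_numpy() already returns a NumPy array, with the scores unchanged
raw = toy.to_numpy()

# StandardScaler also returns an array, but with each column standardized
scaled = StandardScaler().fit_transform(raw)

print(type(raw).__name__, raw.shape)
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column
```

Standardizing does matter for k-means when features sit on different scales; here it mostly evens out the differing spreads of the five traits.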

In [126]:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Distortion here is the mean Euclidean distance from each sample to its closest cluster center
distortions = []
# Inertia is the sum of squared distances of samples to their closest cluster center
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Build and fit the model once for this value of k
    kmeanModel = KMeans(n_clusters=k).fit(x)

    # Mean distance of every sample to its nearest center
    distortion = np.min(cdist(x, kmeanModel.cluster_centers_,
                              'euclidean'), axis=1).sum() / x.shape[0]

    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_
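
The two quantities tracked in the loop are related but not identical: the distortion is a mean of plain distances, while the inertia is a sum of squared distances. A small sketch on synthetic data (the `toy` blobs and the fixed `random_state` are assumptions made for illustration, not part of the notebook) shows how each is obtained from the same nearest-center distances:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, invented for illustration
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Distance from each sample to its nearest cluster center
nearest = cdist(toy, model.cluster_centers_, "euclidean").min(axis=1)

mean_distance = nearest.mean()           # the "distortion" as computed in the loop
sum_of_squares = (nearest ** 2).sum()    # should agree with model.inertia_

print(mean_distance, sum_of_squares, model.inertia_)
```

Either curve can be used for an elbow plot; they usually bend at a similar k, but their values are not on the same scale.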

In [127]:
for key,val in mapping1.items(): 
    print(str(key)+' : '+str(val)) 

1 : 2.112800583026997
2 : 1.8632495619131761
3 : 1.7533784100531145
4 : 1.6635716923505912
5 : 1.597468707941535
6 : 1.5385924232676265
7 : 1.4974007892009744
8 : 1.4611441891816535
9 : 1.4342616978106975
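
One way to read these numbers before plotting is to look at how much the distortion falls with each added cluster. The sketch below copies the values printed above; the variable names are illustrative only:

```python
import numpy as np

# Distortion values copied from the output above, for k = 1..9
distortion_by_k = np.array([
    2.112800583026997, 1.8632495619131761, 1.7533784100531145,
    1.6635716923505912, 1.597468707941535, 1.5385924232676265,
    1.4974007892009744, 1.4611441891816535, 1.4342616978106975,
])

# Improvement gained by moving from k to k + 1 clusters
drops = -np.diff(distortion_by_k)

for k, drop in enumerate(drops, start=1):
    print(f"{k} -> {k + 1}: {drop:.4f}")

# The k after which adding one more cluster helps the most
largest_drop_from_k = int(np.argmax(drops)) + 1
```

The largest improvement comes from going from one cluster to two, and each further cluster adds less than the one before. That is consistent with, though not proof of, the two-cluster choice used in the next step.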


In [128]:
fig = px.line(x= K, y= distortions, title='Elbow Method- Distortion')
fig.show()

In [129]:
# Fit the final model with k = 2
kmeans = KMeans(n_clusters=2).fit(x)

# Cluster centers, expressed in standardized units
first_k_cluster = kmeans.cluster_centers_
first_k_cluster

array([[ 0.31140489,  0.39581936,  0.60717869,  0.51170523,  0.48387915],
       [-0.30456392, -0.38712396, -0.59384013, -0.50046404, -0.47324924]])
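
The centers above are in standardized units, so they read as "above or below average" rather than as test scores. If the fitted scaler is kept, they can be mapped back to the original score scale. The notebook calls `StandardScaler().fit_transform(x)` without keeping the scaler, so the `scaler` object and the synthetic `toy` data below are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic scores for two groups, invented for illustration
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(40, 3, (60, 2)), rng.normal(25, 3, (60, 2))])

# Keep the fitted scaler so the transformation can be undone later
scaler = StandardScaler().fit(toy)
toy_scaled = scaler.transform(toy)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# Map the centers from standardized units back to the original score scale
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(centers_original.round(2))
```

In original units each center is the mean score of its cluster's members, which is the same information the `mean` row of `describe()` reports per cluster.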

In [130]:
clusters = kmeans.predict(x)

In [149]:
psychometrics["Cluster"]= clusters
psychometrics.head()

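|   | Openness | Conscieniousness | Extroversion | Agreeableness | Emotional Balance | Cluster |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 43 | 47 | 44 | 46 | 49 | 0 |
| 1 | 26 | 42 | 22 | 35 | 29 | 1 |
| 2 | 45 | 49 | 35 | 38 | 14 | 0 |
| 3 | 41 | 26 | 22 | 37 | 17 | 1 |
| 4 | 34 | 34 | 34 | 44 | 30 | 0 |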


In [132]:
first_cluster = psychometrics[psychometrics['Cluster'] == 0]   
second_cluster =  psychometrics[psychometrics['Cluster'] == 1] 

In [133]:
first_cluster.describe()

|   | Openness | Conscieniousness | Extroversion | Agreeableness | Emotional Balance | Cluster |
| --- | --- | --- | --- | --- | --- | --- |
| count | 9720.0 | 9720.0 | 9720.0 | 9720.0 | 9720.0 | 9720.0 |
| mean | 41.050412 | 36.377572 | 32.847634 | 42.113272 | 33.198148 | 0.0 |
| std | 5.275488 | 6.717686 | 6.728162 | 5.206709 | 7.734976 | 0.0 |
| min | 16.0 | 11.0 | 9.0 | 14.0 | 10.0 | 0.0 |
| 25% | 37.0 | 32.0 | 28.0 | 39.0 | 28.0 | 0.0 |
| 50% | 41.0 | 37.0 | 33.0 | 43.0 | 33.0 | 0.0 |
| 75% | 45.0 | 41.0 | 38.0 | 46.0 | 39.0 | 0.0 |
| max | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 0.0 |


In [135]:
second_cluster.describe()

|   | Openness | Conscieniousness | Extroversion | Agreeableness | Emotional Balance | Cluster |
| --- | --- | --- | --- | --- | --- | --- |
| count | 9999.0 | 9999.0 | 9999.0 | 9999.0 | 9999.0 | 9999.0 |
| mean | 37.176818 | 30.646965 | 22.644464 | 34.876288 | 24.972997 | 1.0 |
| std | 6.54687 | 6.732878 | 6.850408 | 6.975739 | 7.416068 | 0.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 33.0 | 26.0 | 18.0 | 31.0 | 20.0 | 1.0 |
| 50% | 37.0 | 31.0 | 23.0 | 36.0 | 25.0 | 1.0 |
| 75% | 42.0 | 35.0 | 27.0 | 40.0 | 30.0 | 1.0 |
| max | 50.0 | 50.0 | 45.0 | 50.0 | 50.0 | 1.0 |
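
The two `describe()` outputs can also be compared side by side by grouping on the `Cluster` column. A minimal sketch, using two trait columns from the five clustered rows shown earlier as a stand-in for the full `psychometrics` frame:

```python
import pandas as pd

# Two trait columns from the five clustered rows shown earlier
toy = pd.DataFrame({
    "Openness": [43, 26, 45, 41, 34],
    "Extroversion": [44, 22, 35, 22, 34],
    "Cluster": [0, 1, 0, 1, 0],
})

# One row per cluster, one column per trait
cluster_means = toy.groupby("Cluster").mean()
print(cluster_means)

# Gap between the two clusters on each trait (cluster 0 minus cluster 1)
gap = cluster_means.loc[0] - cluster_means.loc[1]
print(gap)
```

Applied to the full data, the same call would give the `mean` rows of both tables above in a single frame, which makes it easier to see that the first cluster scores higher on every trait.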


In [137]:
# Openness
cluster_1 = first_cluster['Openness']
cluster_2 = second_cluster['Openness'] 

fig = go.Figure()

fig.add_trace(go.Histogram(
    x= cluster_1,
 
    name="Open"       # this sets its legend entry
))


fig.add_trace(go.Histogram(
    x= cluster_2,
 
    name="Closed"       # this sets its legend entry
))


fig.update_layout(
    title="Openness to Experience",
    xaxis_title="Openness Score",
    yaxis_title="Count",
    font=dict(
        family="Garamond, monospace",
        size=18,
        color="#8c564b"
    )
)

fig.show()

In [138]:
cluster_1 = first_cluster['Conscieniousness']
cluster_2 = second_cluster['Conscieniousness'] 

fig = go.Figure()

fig.add_trace(go.Histogram(
    x= cluster_1,
 
    name="Conscientious"       # this sets its legend entry
))


fig.add_trace(go.Histogram(
    x= cluster_2,
 
    name="Unreliable"       # this sets its legend entry
))


fig.update_layout(
    title="Conscientiousness",
    xaxis_title="Conscientiousness Score",
    yaxis_title="Count",
    font=dict(
        family="Garamond, monospace",
        size=18,
        color="#7f7f7f"
    )
)

fig.show()

In [139]:
cluster_1 = first_cluster['Extroversion']
cluster_2 = second_cluster['Extroversion'] 

fig = go.Figure()

fig.add_trace(go.Histogram(
    x= cluster_1,
 
    name="Extroverted"       # this sets its legend entry
))


fig.add_trace(go.Histogram(
    x= cluster_2,
 
    name="Introverted"       # this sets its legend entry
))


fig.update_layout(
    title="Extroversion",
    xaxis_title="Extroversion Score",
    yaxis_title="Count",
    font=dict(
        family="Garamond, monospace",
        size=18,
        color="#7f7f7f"
    )
)

fig.show()

In [192]:
cluster_1 = first_cluster['Agreeableness']
cluster_2 = second_cluster['Agreeableness'] 

fig = go.Figure()

fig.add_trace(go.Histogram(
    x= cluster_1,
 
    name="Agreeable"       # this sets its legend entry
))


fig.add_trace(go.Histogram(
    x= cluster_2,
 
    name="Antagonistic"       # this sets its legend entry
))


fig.update_layout(
    title="Agreeableness",
    xaxis_title="Agreeableness Score",
    yaxis_title="Count",
    font=dict(
        family="Garamond, monospace",
        size=18,
        color="#7f7f7f"
    )
)

fig.show()

Emotional Balance still needs to be renamed to Neuroticism in the cluster data frames, so I do so here for both:

In [150]:
first_cluster = first_cluster.rename(columns={'Emotional Balance': 'Neuroticism'})
first_cluster.columns

Index(['Openness', 'Conscieniousness', 'Extroversion', 'Agreeableness',
       'Neuroticism', 'Cluster'],
      dtype='object')

In [151]:
second_cluster = second_cluster.rename(columns={'Emotional Balance': 'Neuroticism'})
second_cluster.columns

Index(['Openness', 'Conscieniousness', 'Extroversion', 'Agreeableness',
       'Neuroticism', 'Cluster'],
      dtype='object')

In [152]:
cluster_1 = first_cluster['Neuroticism']
cluster_2 = second_cluster['Neuroticism']

fig = go.Figure()

fig.add_trace(go.Histogram(
    x= cluster_1,
 
    name="Neurotic"       # this sets its legend entry
))


fig.add_trace(go.Histogram(
    x= cluster_2,
 
    name="Emotionally Stable"       # this sets its legend entry
))


fig.update_layout(
    title="Neuroticism",
    xaxis_title="Neuroticism Score",
    yaxis_title="Count",
    font=dict(
        family="Garamond, monospace",
        size=18,
        color="#7f7f7f"
    )
)

fig.show()