# Exercise

1. Do some exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether they leave the company or continue to work)
2. Plot bar charts showing the impact of employee salaries on retention
3. Plot bar charts showing the correlation between department and employee retention
4. Now build a logistic regression model using the variables that were narrowed down in step 1
5. Measure the accuracy of the model

# Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
df = pd.read_csv('HR_data.csv')

# EDA & Data Visualization

In [3]:
df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [4]:
df.head().T

Unnamed: 0,0,1,2,3,4
satisfaction_level,0.38,0.8,0.11,0.72,0.37
last_evaluation,0.53,0.86,0.88,0.87,0.52
number_project,2,5,7,5,2
average_montly_hours,157,262,272,223,159
time_spend_company,3,6,4,5,3
Work_accident,0,0,0,0,0
left,1,1,1,1,1
promotion_last_5years,0,0,0,0,0
Department,sales,sales,sales,sales,sales
salary,low,medium,medium,low,low


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [6]:
df.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.238083,0.021268
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.425924,0.144281
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


`We have 8 numeric columns`

In [7]:
df.isnull().sum() / len(df)

satisfaction_level       0.0
last_evaluation          0.0
number_project           0.0
average_montly_hours     0.0
time_spend_company       0.0
Work_accident            0.0
left                     0.0
promotion_last_5years    0.0
Department               0.0
salary                   0.0
dtype: float64

`The data is clean: no column has missing values`

In [8]:
print(f'We have {df.shape[0]} rows and {df.shape[1]} columns in the dataframe')

We have 14999 rows and 10 columns in the dataframe


In [9]:
df['Department'].value_counts()

Department
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: count, dtype: int64

In [10]:
df['number_project'].unique()

array([2, 5, 7, 6, 4, 3])

In [11]:
# Sum of `left` per salary band = number of employees who left in each band
px.histogram(x=df['salary'], y=df['left'], color=df['salary'], title='Left by Salary')

`The bar chart above shows that employees with high salaries are less likely to leave the company`
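
The bars in the chart are raw counts, and the salary bands differ in size, so a per-band leave rate is a useful cross-check. The sketch below uses a small made-up frame (`toy` is hypothetical, not the HR data) to show the idea: the mean of a 0/1 column within each group is the fraction of that group who left.

```python
import pandas as pd

# Toy data standing in for the HR dataset (values are made up for illustration).
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     0,     0,     1,        0,        0,      0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the leave rate per band.
leave_rate = toy.groupby("salary")["left"].mean()
print(leave_rate)
```

On the real data the same call would be `df.groupby("salary")["left"].mean()`; the same approach would apply to `Department`.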

In [12]:
px.histogram(x=df['Department'], y=df['left'], color=df['Department'], title='Left by Department')

`The chart above suggests that the **sales** and **technical** departments have some impact on employee retention, but the effect is not major, so we will leave department out of the analysis`

In [13]:
px.pie(df, names=df['number_project'], title='Number of Projects')

`The chart above shows the distribution of project counts: most employees, across departments, work on 3, 4, or 5 projects`

# Data Preprocessing

In [14]:
left = df[df.left==1]
left.shape

(3571, 10)

In [15]:
df.groupby("left").mean(numeric_only=True)

left,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years
0,0.66681,0.715473,3.786664,199.060203,3.380032,0.175009,0.026251
1,0.440098,0.718113,3.855503,207.41921,3.876505,0.047326,0.005321


`From the data analysis we can use the following columns:`
1. Satisfaction Level
2. Average Monthly Hours
3. Number Project
4. Promotion Last 5 Years
5. Salary
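
The numeric columns above were picked by eye from the group means. One rough way to put a number on "how different are the two groups" is the gap between the group means divided by the column's overall standard deviation. This is only a heuristic sketch on made-up values (`toy` and `gap` are hypothetical names), not the method used in this notebook.

```python
import pandas as pd

# Toy data standing in for the HR dataset (values are made up for illustration).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.66, 0.90, 0.58],
    "last_evaluation":    [0.53, 0.86, 0.88, 0.87, 0.52, 0.70, 0.75, 0.71],
    "left":               [1,    1,    1,    0,    1,    0,    0,    0],
})

# Mean of every numeric column for stayers (0) and leavers (1).
means = toy.groupby("left").mean(numeric_only=True)

# Absolute gap between the two group means, in units of each column's std.
gap = (means.loc[1] - means.loc[0]).abs() / toy.drop(columns="left").std()
print(gap.sort_values(ascending=False))
```

Columns with a larger value separate leavers from stayers more clearly on this scale; it says nothing about non-linear effects.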

In [16]:
df1 = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'number_project', 'salary']]
df1

Unnamed: 0,satisfaction_level,average_montly_hours,promotion_last_5years,number_project,salary
0,0.38,157,0,2,low
1,0.80,262,0,5,medium
2,0.11,272,0,7,medium
3,0.72,223,0,5,low
4,0.37,159,0,2,low
...,...,...,...,...,...
14994,0.40,151,0,2,low
14995,0.37,160,0,2,low
14996,0.37,143,0,2,low
14997,0.11,280,0,6,low


**Encode Salary as Dummy Variables**

`All three salary dummies are kept below; avoiding the dummy variable trap would require dropping one of them`

In [26]:
df_with_dummies = pd.get_dummies(df1, columns=['salary'], dtype=int)
df_with_dummies.head()

Unnamed: 0,satisfaction_level,average_montly_hours,promotion_last_5years,number_project,salary_high,salary_low,salary_medium
0,0.38,157,0,2,0,1,0
1,0.8,262,0,5,0,0,1
2,0.11,272,0,7,0,0,1
3,0.72,223,0,5,0,1,0
4,0.37,159,0,2,0,1,0
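
A minimal sketch of how one salary level could be dropped, using the `drop_first` parameter of `pd.get_dummies` on made-up rows (`toy` and `encoded` are hypothetical names). With one level removed, the remaining dummy columns no longer sum to a constant, and the dropped level becomes the reference category.

```python
import pandas as pd

# Toy data standing in for df1 (values are made up for illustration).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
})

# drop_first=True removes the first level in sorted order ("high" here),
# leaving salary_low and salary_medium.
encoded = pd.get_dummies(toy, columns=["salary"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

scikit-learn's `LogisticRegression` applies L2 regularization by default, so keeping all three columns does not break the fit, but dropping one makes the coefficients easier to interpret.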


In [27]:
X = df_with_dummies
X.head()

Unnamed: 0,satisfaction_level,average_montly_hours,promotion_last_5years,number_project,salary_high,salary_low,salary_medium
0,0.38,157,0,2,0,1,0
1,0.8,262,0,5,0,0,1
2,0.11,272,0,7,0,0,1
3,0.72,223,0,5,0,1,0
4,0.37,159,0,2,0,1,0


In [28]:
y = df.left
y.head()

0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
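
Since only about 24% of the rows have `left == 1`, it can help to keep that ratio identical in the train and test sets. The sketch below shows the `stratify` argument of `train_test_split` on synthetic arrays (`X_toy` and `y_toy` are made up, with a similar class balance); it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 24% positives (similar balance to `left`).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 240 + [0] * 760)

# stratify=y_toy keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```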

In [30]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(X_train, y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
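
The warning above names two remedies: scale the features or raise `max_iter`. A minimal sketch of both, assuming a pipeline of `StandardScaler` followed by `LogisticRegression(max_iter=1000)`, is shown here on synthetic data whose columns mimic the very different ranges of `satisfaction_level` and `average_montly_hours`. The data, the labelling rule, and the `max_iter` value are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (0.1-1.0 vs 96-310).
rng = np.random.default_rng(0)
n = 500
satisfaction = rng.uniform(0.1, 1.0, size=n)
hours = rng.uniform(96, 310, size=n)
X_toy = np.column_stack([satisfaction, hours])

# Made-up rule for the label: low satisfaction means the employee left.
y_toy = (satisfaction < 0.45).astype(int)

# Standardize each column, then fit with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)
print(scaled_model.score(X_toy, y_toy))
```

In the notebook this would correspond to fitting the pipeline on `X_train, y_train` in place of the bare `LogisticRegression()`; the resulting score may differ from the one reported below.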



In [31]:
model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 1])

In [33]:
model.score(X_test, y_test)

0.7863333333333333
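
To put this accuracy in context: `df.describe()` reports a mean of about 0.238 for `left`, so roughly 76% of employees stayed, and a model that always predicts "stayed" would already score about 0.76 on the full dataset. The sketch below reproduces that kind of baseline with `DummyClassifier` on made-up labels (`y_true` and `X_dummy` are hypothetical) and shows why accuracy alone can hide a model that never identifies a leaver.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels with a 76/24 split, similar to the class balance of `left`.
y_true = np.array([0] * 76 + [1] * 24)
X_dummy = np.zeros((100, 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class (0 = stayed).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_acc = baseline.score(X_dummy, y_true)
cm = confusion_matrix(y_true, y_pred)          # rows: true class, cols: predicted
leaver_recall = recall_score(y_true, y_pred)   # share of leavers that were caught
print(baseline_acc, leaver_recall)
print(cm)
```

Running `confusion_matrix(y_test, model.predict(X_test))` on the fitted model would show how many of the actual leavers it catches.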