# Improving Employee Retention By Predicting Employee Attrition Using Machine Learning
Name : Azarya Yehezkiel Pinondang Sipahutar<br><br>
**Project Overview**:<br>
A tech startup is facing a serious problem: many employees have submitted their resignations, but the company has not yet decided how to respond. As a data scientist, I was asked to help the company describe the current condition of its employees, explore the internal problems that cause employees to resign, and outline a strategy that reduces attrition and increases retention.<br><br>

**Project Goals**:<br>
1. To predict employee attrition using machine learning models.
2. To identify the factors that contribute to employee attrition.
3. To provide recommendations to the company to reduce the rate of employee attrition.<br><br>

**Project Objective**:<br>
1. Data Preprocessing
2. Exploratory Data Analysis
3. Data Modelling
4. Model Evaluation
5. Feature Importance
6. Recommendations


## Task 1 - Data Preprocessing
**Task Goals**:<br>
Handle various data problems, such as null data, inappropriate data, and identify the data that is not needed for the analysis.<br><br>
**Task Objectives**:<br>
1. Importing Libraries
2. Loading Data
3. Data Cleaning
4. Data Preprocessing


### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

### Load Data

In [19]:
df = pd.read_csv('./data/Improving Employee Retention by Predicting Employee Attrition Using Machine Learning.xlsx - hr_data.csv')
pd.set_option('display.max_columns', None)

display(df.sample(7))
display(df.info())

#drop PernahBekerja

Unnamed: 0,Username,EnterpriseID,StatusPernikahan,JenisKelamin,StatusKepegawaian,Pekerjaan,JenjangKarir,PerformancePegawai,AsalDaerah,HiringPlatform,SkorSurveyEngagement,SkorKepuasanPegawai,JumlahKeikutsertaanProjek,JumlahKeterlambatanSebulanTerakhir,JumlahKetidakhadiran,NomorHP,Email,TingkatPendidikan,PernahBekerja,IkutProgramLOP,AlasanResign,TanggalLahir,TanggalHiring,TanggalPenilaianKaryawan,TanggalResign
50,giddyCheetah9,100381,Menikah,Wanita,FullTime,Product Design (UI & UX),Freshgraduate_program,Sangat_kurang,Jakarta Pusat,Indeed,3,3.0,0.0,0.0,2.0,+6282295263xxx,giddyCheetah9745@icloud.com,Sarjana,1,,ganti_karir,1958-12-27,2013-8-19,2020-01-10,2013-9-26
195,boastfulSyrup4,100869,Lainnya,Pria,FullTime,Software Engineer (Front End),Freshgraduate_program,Sangat_bagus,Jakarta Timur,Employee_Referral,4,3.0,0.0,0.0,2.0,+6281209655xxx,boastfulSyrup4371@yahoo.com,Sarjana,1,,,1981-07-11,2015-06-02,2020-01-04,-
273,amusedIcecream0,111104,Menikah,Pria,FullTime,Product Design (UI & UX),Freshgraduate_program,Bagus,Jakarta Pusat,Google_Search,3,3.0,0.0,0.0,10.0,+6285622739xxx,amusedIcecream0506@outlook.com,Sarjana,1,,,1982-09-02,2012-5-14,2020-2-22,-
67,mellowCheese1,105779,Bercerai,Pria,Outsource,Software Engineer (Back End),Mid_level,Kurang,Jakarta Barat,Employee_Referral,3,3.0,0.0,0.0,,+6287742497xxx,mellowCheese1411@icloud.com,Doktor,1,,masih_bekerja,1985-09-15,2015-3-30,2020-2-18,-
83,brainyRice8,106245,Bercerai,Wanita,FullTime,Product Design (UI & UX),Freshgraduate_program,Biasa,Jakarta Pusat,CareerBuilder,3,3.0,6.0,0.0,6.0,+6289935357xxx,brainyRice8142@icloud.com,Sarjana,1,,masih_bekerja,1979-04-04,2015-2-16,2020-02-11,-
29,grizzledFlamingo9,106473,Menikah,Wanita,FullTime,Software Engineer (Back End),Mid_level,Biasa,Jakarta Utara,LinkedIn,4,3.0,7.0,0.0,,+6289673952xxx,grizzledFlamingo9139@proton.com,Doktor,1,,masih_bekerja,1973-05-27,2015-01-05,2019-2-13,-
270,truthfulHawk9,106029,Menikah,Wanita,FullTime,Software Engineer (Back End),Senior_level,Biasa,Jakarta Selatan,Indeed,4,3.0,0.0,0.0,10.0,+6285919278xxx,truthfulHawk9561@yahoo.com,Sarjana,1,,,1954-09-21,2012-07-02,2020-1-17,-


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 25 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Username                            287 non-null    object 
 1   EnterpriseID                        287 non-null    int64  
 2   StatusPernikahan                    287 non-null    object 
 3   JenisKelamin                        287 non-null    object 
 4   StatusKepegawaian                   287 non-null    object 
 5   Pekerjaan                           287 non-null    object 
 6   JenjangKarir                        287 non-null    object 
 7   PerformancePegawai                  287 non-null    object 
 8   AsalDaerah                          287 non-null    object 
 9   HiringPlatform                      287 non-null    object 
 10  SkorSurveyEngagement                287 non-null    int64  
 11  SkorKepuasanPegawai                 282 non-n

None

### Data Cleaning and Preprocessing


In [20]:
for i in df.columns:
    print('\nColumn Name:', i)
    print(df[i].value_counts())


Column Name: Username
Username
boredEggs0           2
brainyMagpie7        2
spiritedPorpoise3    1
grudgingMeerkat3     1
boastfulSyrup4       1
                    ..
lazyPorpoise0        1
brainyFish3          1
sincereSeafowl4      1
jumpyTomatoe4        1
puzzledFish5         1
Name: count, Length: 285, dtype: int64

Column Name: EnterpriseID
EnterpriseID
111065    1
106008    1
100869    1
101560    1
100874    1
         ..
105429    1
106638    1
100919    1
101306    1
106214    1
Name: count, Length: 287, dtype: int64

Column Name: StatusPernikahan
StatusPernikahan
Belum_menikah    132
Menikah           57
Lainnya           48
Bercerai          47
-                  3
Name: count, dtype: int64

Column Name: JenisKelamin
JenisKelamin
Wanita    167
Pria      120
Name: count, dtype: int64

Column Name: StatusKepegawaian
StatusKepegawaian
FullTime      217
Outsource      66
Internship      4
Name: count, dtype: int64

Column Name: Pekerjaan
Pekerjaan
Software Engineer (Back End)

In [21]:
misval = df.isna().sum()
misval = misval[misval > 0]
print('Total Missing Values on Columns:')
display(misval)

print('\nTotal Duplicates:')
display(df.duplicated().sum())

Total Missing Values on Columns:


SkorKepuasanPegawai                     5
JumlahKeikutsertaanProjek               3
JumlahKeterlambatanSebulanTerakhir      1
JumlahKetidakhadiran                    6
IkutProgramLOP                        258
AlasanResign                           66
dtype: int64


Total Duplicates:


0
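The raw counts above can be easier to judge as shares, and `df.duplicated()` only flags rows that match in every column. The sketch below is a minimal illustration on toy data (not the HR dataset); the helper names `missing_share` and `repeated_keys` are hypothetical. It shows one way to rank columns by their fraction of missing values and to list rows that share a key such as `Username`, which the value counts above suggest is repeated for a few employees.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    share = frame.isna().mean()
    return share[share > 0].sort_values(ascending=False)


def repeated_keys(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose value in `key` occurs more than once."""
    return frame[frame.duplicated(subset=key, keep=False)]


# Toy data for illustration only (not the HR dataset)
toy = pd.DataFrame({
    "Username": ["a", "b", "b", "c"],
    "EnterpriseID": [1, 2, 3, 4],
    "IkutProgramLOP": [None, 1.0, None, None],
})

print(missing_share(toy))
print(repeated_keys(toy, "Username"))
```

A column that is mostly empty, as `IkutProgramLOP` appears to be here, is usually a candidate for dropping rather than imputing.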

In [22]:
# create a copy of the dataframe to avoid modifying the original
dfp = df.copy()

# fill missing values in 'AlasanResign' column with the mode (most frequent value)
dfp.loc[:, 'AlasanResign'] = dfp['AlasanResign'].fillna(dfp['AlasanResign'].mode()[0])

# columns to fill missing with median
columns2fill = ['JumlahKetidakhadiran', 'SkorKepuasanPegawai', 'JumlahKeikutsertaanProjek', 'JumlahKeterlambatanSebulanTerakhir']

# Iterate over the columns and fill missing values with the median
for col in columns2fill:
    dfp.loc[:, col] = dfp[col].fillna(dfp[col].median())
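The median fill above can also be written without a loop. The following is a minimal sketch on toy data, assuming all listed columns are numeric; the helper name `fill_numeric_with_median` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def fill_numeric_with_median(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where missing values in `columns` are replaced by each column's median."""
    out = frame.copy()
    # DataFrame.fillna accepts a Series keyed by column name
    out[columns] = out[columns].fillna(out[columns].median())
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "JumlahKetidakhadiran": [2.0, None, 10.0, 6.0],
    "SkorKepuasanPegawai": [3.0, 4.0, None, 5.0],
})
filled = fill_numeric_with_median(toy, ["JumlahKetidakhadiran", "SkorKepuasanPegawai"])
print(filled)
```

Returning a copy keeps the input untouched, which makes it easier to compare the data before and after imputation.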

In [23]:
# Replace '-' with 'Belum_menikah' in the 'StatusPernikahan' column
dfp['StatusPernikahan'] = dfp['StatusPernikahan'].replace('-', 'Belum_menikah')

# Replace 'Product Design (UI & UX)' with 'masih_bekerja' in the 'AlasanResign' column
dfp['AlasanResign'] = dfp['AlasanResign'].replace('Product Design (UI & UX)', 'masih_bekerja')

# Convert 'TanggalLahir' column to datetime format
dfp['TanggalLahir'] = pd.to_datetime(dfp['TanggalLahir'])

# Extract the year from 'TanggalLahir' and store it in a new column 'TahunLahir'
dfp['TahunLahir'] = dfp['TanggalLahir'].dt.year

# Calculate the age of the employee and store it in a new column 'Umur'
dfp['Umur'] = 2024 - dfp['TahunLahir']

# Replace '-' with NaN in the 'TanggalResign' column
dfp['TanggalResign'] = dfp['TanggalResign'].replace('-', np.nan)

# Convert 'TanggalResign' and 'TanggalHiring' columns to datetime format
dfp['TanggalResign'] = pd.to_datetime(dfp['TanggalResign'])
dfp['TanggalHiring'] = pd.to_datetime(dfp['TanggalHiring'])

# Define a function to check if an employee has resigned
def check_resign(date):
    if pd.isnull(date):
        return 'No'
    else:
        return 'Yes'

# Apply the function to the 'TanggalResign' column and store the result in a new column 'Resign'
dfp['Resign'] = dfp['TanggalResign'].apply(check_resign)


# Extract TahunHiring From TanggalHiring
dfp['TahunHiring'] = dfp['TanggalHiring'].dt.year

# Extract TahunResign From TanggalResign
dfp['TahunResign'] = dfp['TanggalResign'].dt.year

# Define the columns to be dropped
col2drop = ['NomorHP', 'Email', 'PernahBekerja', 'IkutProgramLOP', 'TanggalResign']

# Drop the defined columns from the DataFrame (columns= already implies axis=1)
dfp.drop(columns=col2drop, inplace=True)
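The `Resign` flag above is built with a row-wise `apply`. As a hedged alternative, the same flag can be derived in one vectorized step with `np.where`. The sketch below uses toy dates rather than the HR dataset and mirrors the notebook's convention that `-` means the employee has not resigned.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; '-' marks employees who have not resigned
toy = pd.DataFrame({"TanggalResign": ["2013-09-26", "-", "2018-06-16"]})

# Turn the '-' placeholder into a missing value, then parse the dates
resign_date = pd.to_datetime(toy["TanggalResign"].replace("-", np.nan))

# Vectorized equivalent of applying check_resign to every row
toy["Resign"] = np.where(resign_date.isna(), "No", "Yes")
toy["TahunResign"] = resign_date.dt.year

print(toy)
```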

HiringPlatform should be binned into 3 categories: 'LinkedIn', 'Indeed', 'Others' <br><br>

Pekerjaan should be binned as follows:
- Software Engineer (IOS) > Software Engineer
- Machine Learning Engineer > Data Scientist
- Product Designer > UI/UX Designer
- Software Architect > Software Engineer
- Scrum Master > Others
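
The binning plan above could be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual implementation: the helper name `bin_categories` is hypothetical, and the keys of `JOB_BINS` are taken from the list above, so they may need adjusting to the exact spellings found in the `Pekerjaan` column.

```python
import pandas as pd

# Platforms that keep their own category; everything else becomes 'Others'
KEEP_PLATFORMS = {"LinkedIn", "Indeed"}

# Mapping taken from the list above (labels may differ from the dataset's spellings)
JOB_BINS = {
    "Software Engineer (IOS)": "Software Engineer",
    "Machine Learning Engineer": "Data Scientist",
    "Product Designer": "UI/UX Designer",
    "Software Architect": "Software Engineer",
    "Scrum Master": "Others",
}


def bin_categories(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with HiringPlatform and Pekerjaan collapsed into fewer categories."""
    out = frame.copy()
    keep = out["HiringPlatform"].isin(KEEP_PLATFORMS)
    out["HiringPlatform"] = out["HiringPlatform"].where(keep, "Others")
    # Values without an entry in JOB_BINS are left unchanged
    out["Pekerjaan"] = out["Pekerjaan"].replace(JOB_BINS)
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "HiringPlatform": ["LinkedIn", "Indeed", "Google_Search", "Website"],
    "Pekerjaan": ["Scrum Master", "Data Analyst", "Software Architect", "Machine Learning Engineer"],
})
binned = bin_categories(toy)
print(binned)
```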

<br><br>



## Task 2 - Exploratory Data Analysis (EDA): Annual Report on Employee Number Changes
**Task Goals**:<br>
Analyze the number of employees who resigned each year and the number of employees who joined each year.<br><br>

**Task Objectives**:<br>
1. Create a visualization of the number of employees who resigned each year.
2. Create a visualization of the number of employees who joined each year.
3. Create a visualization of the number of employees who resigned and joined each year.
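
One possible way to cover the third objective is a grouped bar chart. The sketch below is only an illustration with matplotlib: it uses three yearly rows (2013 to 2015) as sample values, and the column name `Tahun` is an assumption introduced here for the shared year axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Sample yearly counts for illustration ('Tahun' is a hypothetical column name)
yearly = pd.DataFrame({
    "Tahun": [2013, 2014, 2015],
    "JumlahPegawaiHire": [43, 56, 31],
    "JumlahPegawaiResign": [5, 12, 8],
})

# One group of two bars per year: hires next to resignations
fig, ax = plt.subplots(figsize=(6, 3))
yearly.plot(x="Tahun", y=["JumlahPegawaiHire", "JumlahPegawaiResign"], kind="bar", ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Number of employees")
ax.set_title("Employees hired and resigned per year")
fig.tight_layout()
```

Inside a notebook, the figure is displayed automatically at the end of the cell, or explicitly with `plt.show()`.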



In [24]:
# Create a DataFrame for the annual hired employees count
annual_hiring_agg = (
    dfp
    .groupby('TahunHiring')['Username']
    .count()
    .to_frame()
    .reset_index()
    .rename(columns={'Username': 'JumlahPegawaiHire'})
    .sort_values(by='TahunHiring', ascending=True)
)

# Create a DataFrame for the annual resign count
annual_resign_agg = (
    dfp
    .groupby('TahunResign')['Username']
    .count()
    .to_frame()
    .reset_index()
    .rename(columns={'Username': 'JumlahPegawaiResign'})
    .sort_values(by='TahunResign', ascending=True)
)

# join the two DataFrames on the 'TahunHiring' and 'TahunResign' columns
annual_hr_agg = annual_hiring_agg.merge(annual_resign_agg, how='outer', left_on='TahunHiring', right_on='TahunResign')
annual_hr_agg['TahunHiring'] = annual_hr_agg['TahunHiring'].fillna(annual_hr_agg['TahunResign'])
annual_hr_agg['TahunResign'] = annual_hr_agg['TahunResign'].fillna(annual_hr_agg['TahunHiring'])
annual_hr_agg.fillna(0, inplace=True)
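
The outer merge above keeps two year columns that then have to be back-filled from each other. A hedged alternative is to count both events per year and align them on a shared year index with `pd.concat`. The sketch below runs on toy data; the index name `Tahun` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy data for illustration only: one row per employee
toy = pd.DataFrame({
    "TahunHiring": [2011, 2011, 2013, 2014],
    "TahunResign": [None, 2014, 2018, None],
})

# Count events per year; employees without a resign year are dropped first
hires = toy["TahunHiring"].value_counts().rename("JumlahPegawaiHire")
resigns = toy["TahunResign"].dropna().astype(int).value_counts().rename("JumlahPegawaiResign")

# Align both counts on the union of years; years missing on one side become 0
annual = (
    pd.concat([hires, resigns], axis=1)
    .fillna(0)
    .astype(int)
    .rename_axis("Tahun")
    .sort_index()
)
print(annual)
```

With a single year index there is no need to reconcile `TahunHiring` and `TahunResign` after the join.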

In [25]:
annual_hr_agg['PegawaiBertahan'] = annual_hr_agg['JumlahPegawaiHire'] - annual_hr_agg['JumlahPegawaiResign']
annual_hr_agg

|   | TahunHiring | JumlahPegawaiHire | TahunResign | JumlahPegawaiResign | PegawaiBertahan |
|---|---|---|---|---|---|
| 0 | 2006.0 | 1.0 | 2006.0 | 0.0 | 1.0 |
| 1 | 2007.0 | 2.0 | 2007.0 | 0.0 | 2.0 |
| 2 | 2008.0 | 2.0 | 2008.0 | 0.0 | 2.0 |
| 3 | 2009.0 | 7.0 | 2009.0 | 0.0 | 7.0 |
| 4 | 2010.0 | 8.0 | 2010.0 | 0.0 | 8.0 |
| 5 | 2011.0 | 76.0 | 2011.0 | 0.0 | 76.0 |
| 6 | 2012.0 | 41.0 | 2012.0 | 0.0 | 41.0 |
| 7 | 2013.0 | 43.0 | 2013.0 | 5.0 | 38.0 |
| 8 | 2014.0 | 56.0 | 2014.0 | 12.0 | 44.0 |
| 9 | 2015.0 | 31.0 | 2015.0 | 8.0 | 23.0 |
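
`PegawaiBertahan` above is the net change within a single year (hires minus resignations), not the number of employees on the payroll. A running total gives the headcount at the end of each year. The sketch below uses only the ten rows displayed above; the column names `Tahun`, `NetChange` and `Headcount` are hypothetical and introduced for illustration.

```python
import pandas as pd

# Values copied from the ten rows displayed above
annual = pd.DataFrame({
    "Tahun": [2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015],
    "JumlahPegawaiHire": [1, 2, 2, 7, 8, 76, 41, 43, 56, 31],
    "JumlahPegawaiResign": [0, 0, 0, 0, 0, 0, 0, 5, 12, 8],
})

# Net change within each year (same quantity as PegawaiBertahan)
annual["NetChange"] = annual["JumlahPegawaiHire"] - annual["JumlahPegawaiResign"]

# Running total of the net change: employees remaining at the end of each year
annual["Headcount"] = annual["NetChange"].cumsum()

print(annual)
```

The running total is also the natural input for a waterfall chart, where each bar is the yearly net change and the final level is the current headcount.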


In [28]:


# # Assuming annual_hr_agg has the columns 'TahunHiring', 'JumlahPegawaiHire', 'JumlahPegawaiResign'
# dfv = annual_hr_agg.copy()

# # Calculate net change for each year
# dfv['Net'] = dfv['JumlahPegawaiHire'] - dfv['JumlahPegawaiResign']

# # Every bar is a relative step, whether the net change is positive or negative
# measures = ['relative'] * len(dfv)

# fig = go.Figure(go.Waterfall(
#     name = "20", orientation = "v",
#     measure = measures,
#     x = dfv['TahunHiring'].tolist(),
#     textposition = "outside",
#     text = dfv['Net'].tolist(),
#     y = dfv['Net'].tolist(),
#     decreasing = {"marker":{"color":"Red"}},
#     increasing = {"marker":{"color":"Green"}},
# ))

# fig.update_layout(title = "Employee Hire and Resign Waterfall Chart", showlegend = False)

# fig.show()

In [None]:
def add_values(lst,num):
    lst.append(num)
    return lst

add_values([1,2,3,4], 10)

[1, 2, 3, 4, 10]

In [None]:
for num in range(1, 11):
    if num % 3 == 0:
        print("Fizz")
    elif num % 5 == 0:
        print("Buzz")
    elif num % 3 == 0 and num % 5 == 0:
        print('FizzBuzz')
    else:
        print("Num")

Num
Num
Fizz
Num
Buzz
Fizz
Num
Num
Fizz
Buzz


In [6]:
for num in range(1, 11):
    if num % 3 == 0 and num % 5 == 0:
        print('FizzBuzz')
    elif num % 3 == 0:
        print("Fizz")
    elif num % 5 == 0:
        print("Buzz")
    else:
        print("Num")

Num
Num
Fizz
Num
Buzz
Fizz
Num
Num
Fizz
Buzz


In [None]:
def calculate(a,b):
    sum_result = a+b
    multiply_result = a*b
    return sum_result, multiply_result

In [None]:
def calculate(a,b):
    result_sum = a+b
    result_multiply = a*b
    return result_sum, result_multiply

In [11]:
def calculate(a,b):
    return a+b, a*b

(5, 6)

In [1]:
True and False

False

In [12]:
def calculate(a,b):
    jumlah = a+b
    kali = a*b
    return jumlah, kali
calculate(2,3)

(5, 6)

In [40]:
dft = dfp.copy()
from sklearn.preprocessing import MinMaxScaler
def normalize(data):
    scaler=MinMaxScaler()
    data['SkorKepuasanPegawai'] = scaler.fit_transform(data[['SkorKepuasanPegawai']])
    return data



Unnamed: 0,Username,EnterpriseID,StatusPernikahan,JenisKelamin,StatusKepegawaian,Pekerjaan,JenjangKarir,PerformancePegawai,AsalDaerah,HiringPlatform,SkorSurveyEngagement,SkorKepuasanPegawai,JumlahKeikutsertaanProjek,JumlahKeterlambatanSebulanTerakhir,JumlahKetidakhadiran,TingkatPendidikan,AlasanResign,TanggalLahir,TanggalHiring,TanggalPenilaianKaryawan,TahunLahir,Umur,Resign,TahunHiring,TahunResign
0,spiritedPorpoise3,111065,Belum_menikah,Pria,Outsource,Software Engineer (Back End),Freshgraduate_program,Sangat_bagus,Jakarta Timur,Employee_Referral,4,0.75,0.0,0.0,9.0,Magister,masih_bekerja,1972-07-01,2011-01-10,2016-2-15,1972,52,No,2011,
1,jealousGelding2,106080,Belum_menikah,Pria,FullTime,Data Analyst,Freshgraduate_program,Sangat_kurang,Jakarta Utara,Website,4,0.75,4.0,0.0,3.0,Sarjana,toxic_culture,1984-04-26,2014-01-06,2020-1-17,1984,40,Yes,2014,2018.0
2,pluckyMuesli3,106452,Menikah,Pria,FullTime,Software Engineer (Front End),Freshgraduate_program,Bagus,Jakarta Timur,Indeed,4,0.50,0.0,0.0,11.0,Magister,jam_kerja,1974-01-07,2011-01-10,2016-01-10,1974,50,Yes,2011,2014.0
3,stressedTruffle1,106325,Belum_menikah,Pria,Outsource,Software Engineer (Front End),Freshgraduate_program,Bagus,Jakarta Pusat,LinkedIn,3,0.50,0.0,4.0,6.0,Sarjana,masih_bekerja,1979-11-24,2014-02-17,2020-02-04,1979,45,No,2014,
4,shyTermite7,111171,Belum_menikah,Wanita,FullTime,Product Manager,Freshgraduate_program,Bagus,Jakarta Timur,LinkedIn,3,0.50,0.0,0.0,11.0,Sarjana,ganti_karir,1974-11-07,2013-11-11,2020-1-22,1974,50,Yes,2013,2018.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282,dopeySheep0,106034,Belum_menikah,Wanita,FullTime,Data Engineer,Mid_level,Bagus,Jakarta Pusat,Google_Search,2,1.00,0.0,0.0,16.0,Sarjana,masih_bekerja,1973-12-08,2011-09-26,2016-03-01,1973,51,No,2011,
283,yearningPorpoise4,106254,Belum_menikah,Wanita,FullTime,Product Design (UI & UX),Freshgraduate_program,Biasa,Jakarta Timur,LinkedIn,4,1.00,0.0,0.0,11.0,Sarjana,jam_kerja,1974-12-01,2013-05-13,2020-1-28,1974,50,Yes,2013,2017.0
284,murkySausage9,110433,Menikah,Wanita,FullTime,Software Engineer (Front End),Senior_level,Biasa,Jakarta Pusat,Diversity_Job_Fair,2,1.00,0.0,0.0,17.0,Sarjana,ganti_karir,1969-10-30,2013-11-11,2020-1-21,1969,55,Yes,2013,2018.0
285,truthfulMoth4,110744,Belum_menikah,Pria,FullTime,Software Engineer (Android),Mid_level,Bagus,Jakarta Utara,Google_Search,4,1.00,0.0,0.0,20.0,Sarjana,kejelasan_karir,1981-10-01,2011-05-16,2014-04-05,1981,43,Yes,2011,2018.0


In [None]:
from sklearn.preprocessing import StandardScaler
def standarize(data):
    scaler=StandardScaler()
    data[['Height', 'Weight']] = scaler.fit_transform(data[['Height', 'Weight']])
    return data

from sklearn.preprocessing import StandardScaler
def standarize(data):
    scaler=StandardScaler()
    data['Height'] = scaler.fit_transform(data['Height'].values.reshape(-1,1))
    data['Weight'] = scaler.fit_transform(data['Weight'].values.reshape(-1,1))
    return data
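
Both versions above modify the DataFrame that is passed in. As a hedged sketch of what standardization computes, the example below applies the z-score formula directly with pandas and returns a copy. It divides by the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` uses. The toy `Height` and `Weight` values and the helper name `standardize_columns` are for illustration only.

```python
import pandas as pd


def standardize_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where each listed column has mean 0 and population std 1."""
    out = frame.copy()
    cols = out[columns].astype(float)
    # z = (x - mean) / std, with the population standard deviation (ddof=0)
    out[columns] = (cols - cols.mean()) / cols.std(ddof=0)
    return out


# Toy data for illustration only
toy = pd.DataFrame({"Height": [150.0, 160.0, 170.0], "Weight": [50.0, 60.0, 70.0]})
scaled = standardize_columns(toy, ["Height", "Weight"])
print(scaled)
```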

In [38]:
from sklearn.preprocessing import MinMaxScaler
def normalize(data):
    scaler=MinMaxScaler()
    data['SkorKepuasanPegawai'] = scaler.fit_transform(data['SkorKepuasanPegawai'].values.reshape(-1,1))
    # data['SkorSurveyEngagement'] = scaler.fit_transform(data['SkorSurveyEngagement'].values.reshape(-1,1))
    return data

dfa = dfp.copy()
normalize(dfa)
dfa

Unnamed: 0,Username,EnterpriseID,StatusPernikahan,JenisKelamin,StatusKepegawaian,Pekerjaan,JenjangKarir,PerformancePegawai,AsalDaerah,HiringPlatform,SkorSurveyEngagement,SkorKepuasanPegawai,JumlahKeikutsertaanProjek,JumlahKeterlambatanSebulanTerakhir,JumlahKetidakhadiran,TingkatPendidikan,AlasanResign,TanggalLahir,TanggalHiring,TanggalPenilaianKaryawan,TahunLahir,Umur,Resign,TahunHiring,TahunResign
0,spiritedPorpoise3,111065,Belum_menikah,Pria,Outsource,Software Engineer (Back End),Freshgraduate_program,Sangat_bagus,Jakarta Timur,Employee_Referral,4,0.75,0.0,0.0,9.0,Magister,masih_bekerja,1972-07-01,2011-01-10,2016-2-15,1972,52,No,2011,
1,jealousGelding2,106080,Belum_menikah,Pria,FullTime,Data Analyst,Freshgraduate_program,Sangat_kurang,Jakarta Utara,Website,4,0.75,4.0,0.0,3.0,Sarjana,toxic_culture,1984-04-26,2014-01-06,2020-1-17,1984,40,Yes,2014,2018.0
2,pluckyMuesli3,106452,Menikah,Pria,FullTime,Software Engineer (Front End),Freshgraduate_program,Bagus,Jakarta Timur,Indeed,4,0.50,0.0,0.0,11.0,Magister,jam_kerja,1974-01-07,2011-01-10,2016-01-10,1974,50,Yes,2011,2014.0
3,stressedTruffle1,106325,Belum_menikah,Pria,Outsource,Software Engineer (Front End),Freshgraduate_program,Bagus,Jakarta Pusat,LinkedIn,3,0.50,0.0,4.0,6.0,Sarjana,masih_bekerja,1979-11-24,2014-02-17,2020-02-04,1979,45,No,2014,
4,shyTermite7,111171,Belum_menikah,Wanita,FullTime,Product Manager,Freshgraduate_program,Bagus,Jakarta Timur,LinkedIn,3,0.50,0.0,0.0,11.0,Sarjana,ganti_karir,1974-11-07,2013-11-11,2020-1-22,1974,50,Yes,2013,2018.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282,dopeySheep0,106034,Belum_menikah,Wanita,FullTime,Data Engineer,Mid_level,Bagus,Jakarta Pusat,Google_Search,2,1.00,0.0,0.0,16.0,Sarjana,masih_bekerja,1973-12-08,2011-09-26,2016-03-01,1973,51,No,2011,
283,yearningPorpoise4,106254,Belum_menikah,Wanita,FullTime,Product Design (UI & UX),Freshgraduate_program,Biasa,Jakarta Timur,LinkedIn,4,1.00,0.0,0.0,11.0,Sarjana,jam_kerja,1974-12-01,2013-05-13,2020-1-28,1974,50,Yes,2013,2017.0
284,murkySausage9,110433,Menikah,Wanita,FullTime,Software Engineer (Front End),Senior_level,Biasa,Jakarta Pusat,Diversity_Job_Fair,2,1.00,0.0,0.0,17.0,Sarjana,ganti_karir,1969-10-30,2013-11-11,2020-1-21,1969,55,Yes,2013,2018.0
285,truthfulMoth4,110744,Belum_menikah,Pria,FullTime,Software Engineer (Android),Mid_level,Bagus,Jakarta Utara,Google_Search,4,1.00,0.0,0.0,20.0,Sarjana,kejelasan_karir,1981-10-01,2011-05-16,2014-04-05,1981,43,Yes,2011,2018.0
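
The scaled `SkorKepuasanPegawai` values above (0.50, 0.75, 1.00) follow the min-max formula `(x - min) / (max - min)`, which is what `MinMaxScaler` applies with its default range of 0 to 1. The sketch below reproduces that formula with pandas on toy data and returns a copy instead of modifying its input; the helper name `min_max_scale` is hypothetical.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where each listed column is rescaled to the range [0, 1]."""
    out = frame.copy()
    cols = out[columns].astype(float)
    # A constant column has a span of 0 and would produce NaN here
    span = cols.max() - cols.min()
    out[columns] = (cols - cols.min()) / span
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "SkorKepuasanPegawai": [1.0, 3.0, 4.0, 5.0],
    "SkorSurveyEngagement": [1, 2, 3, 5],
})
scaled = min_max_scale(toy, ["SkorKepuasanPegawai", "SkorSurveyEngagement"])
print(scaled)
```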
