# __Final Project__
## **HR Analytics: Job Change Prediction**
__JCDS-1104-JKT
<br>
William Andreas H
<br>
Dataset was taken from [Kaggle](https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists)__

## __Context and Content__

The data comes from an anonymous company active in Big Data and Data Science that wants to hire data scientists from among the people who successfully pass courses conducted by the company. The company wants to know which of these candidates really want to work for it after training and which are looking for new employment elsewhere, because this helps reduce cost and time, improve the quality of the training, and plan the courses and the categorization of candidates.



__Features__

* enrollee_id: Unique ID for candidate
* city: City code
* city_development_index: Development index of the city (scaled)
* gender: Gender of candidate
* relevent_experience: Relevant experience of candidate
* enrolled_university: Type of university course enrolled in, if any
* education_level: Education level of candidate
* major_discipline: Major discipline of the candidate's education
* experience: Candidate's total experience in years
* company_size: Number of employees in current employer's company
* company_type: Type of current employer
* last_new_job: Difference in years between previous job and current job
* training_hours: Training hours completed
* target: 0 – Not looking for job change, 1 – Looking for a job change

__The classification objectives are:__

* To predict whether or not candidates change jobs after they have completed their training.
* To help the company reduce cost and time, and improve the quality of the courses and the categorization of candidates, by identifying candidates that are falsely predicted (predicted not to change jobs while they are actually seeking new employment).
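
The falsely predicted candidates described in the second objective are false negatives: predicted 0 (not looking) while the true label is 1 (looking). A minimal sketch of how such cases could be counted, using made-up labels that are not taken from the dataset:

```python
# Hypothetical labels for illustration only; not taken from the dataset.
# 1 = looking for a job change, 0 = not looking.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# False negative: predicted 0 while the candidate is actually 1.
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
# True positive: predicted 1 and the candidate is actually 1.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

# Recall is the share of actual job seekers that the predictions catch.
recall = true_positives / (true_positives + false_negatives)

print('False negatives:', false_negatives)
print('Recall:', recall)
```

Fewer false negatives means higher recall, so recall on class 1 is one reasonable metric to track for this objective.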



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.options.display.max_columns = None
# pd.options.display.max_rows = None

In [2]:
hr = pd.read_csv('aug_train.csv')
hr.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [3]:
hr.describe()

Unnamed: 0,enrollee_id,city_development_index,training_hours,target
count,19158.0,19158.0,19158.0,19158.0
mean,16875.358179,0.828848,65.366896,0.249348
std,9616.292592,0.123362,60.058462,0.432647
min,1.0,0.448,1.0,0.0
25%,8554.25,0.74,23.0,0.0
50%,16982.5,0.903,47.0,0.0
75%,25169.75,0.92,88.0,0.0
max,33380.0,0.949,336.0,1.0


In [4]:
hr.describe(include = 'object')

Unnamed: 0,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
count,19158,14650,19158,18772,18698,16345,19093,13220,13018,18735
unique,123,3,2,3,5,6,22,8,6,6
top,city_103,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,1
freq,4355,13221,13792,13817,11598,14492,3286,3083,9817,8040


In [5]:
hr.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [6]:
hr.isna().sum()

enrollee_id                  0
city                         0
city_development_index       0
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
last_new_job               423
training_hours               0
target                       0
dtype: int64

In [7]:
hr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [8]:
hr_items = []
for col in hr.columns:
    hr_items.append([col, hr[col].dtype, hr[col].isna().sum(), round((hr[col].isna().sum()/len(hr[col]))*100,2),
                      hr[col].nunique(), list(hr[col].sample(30).drop_duplicates().values)])

hrDesc = pd.DataFrame(columns=['feature', 'type_data', 'null', 'nulPct', 'unique', 'uniqueSample'],data=hr_items)
hrDesc

Unnamed: 0,feature,type_data,null,nulPct,unique,uniqueSample
0,enrollee_id,int64,0,0.0,19158,"[15416, 32640, 13535, 31761, 25154, 29723, 150..."
1,city,object,0,0.0,123,"[city_75, city_159, city_45, city_103, city_57..."
2,city_development_index,float64,0,0.0,93,"[0.804, 0.855, 0.624, 0.91, 0.893, 0.926, 0.92..."
3,gender,object,4508,23.53,3,"[Male, nan, Other]"
4,relevent_experience,object,0,0.0,2,"[Has relevent experience, No relevent experience]"
5,enrolled_university,object,386,2.01,3,"[no_enrollment, Part time course, Full time co..."
6,education_level,object,460,2.4,5,"[Masters, Graduate, Phd, High School]"
7,major_discipline,object,2813,14.68,6,"[STEM, nan, Other, Arts]"
8,experience,object,65,0.34,22,"[9, 3, <1, 16, 4, 5, 8, >20, 12, 15, 7, 2, 13,..."
9,company_size,object,5938,30.99,8,"[nan, 100-500, 500-999, 10000+, 50-99, <10, 10..."
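
The same kind of per-column summary can also be built without an explicit loop. The sketch below is only an alternative under assumptions: `describe_columns` is a hypothetical helper name, the toy frame is made up, and the `uniqueSample` column is left out.

```python
import numpy as np
import pandas as pd


def describe_columns(df):
    """Hypothetical helper: one row of summary statistics per column."""
    return pd.DataFrame({
        'type_data': df.dtypes,                       # dtype of each column
        'null': df.isna().sum(),                      # number of missing values
        'nulPct': (df.isna().mean() * 100).round(2),  # missing values in percent
        'unique': df.nunique(),                       # distinct non-null values
    }).rename_axis('feature').reset_index()


# Made-up data for illustration only.
toy = pd.DataFrame({
    'gender': ['Male', np.nan, 'Female', 'Male'],
    'training_hours': [36, 47, 83, 52],
})

summary = describe_columns(toy)
print(summary)
```

Calling `describe_columns(hr)` should give the same `null`, `nulPct` and `unique` values as `hrDesc`, since both rely on `isna()` and `nunique()`.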


### __Enrollee ID__

In [9]:
print('Duplicated ID: ',hr.enrollee_id.duplicated().sum(),'Duplicates')

Duplicated ID:  0 Duplicates


__The check confirms that no duplicate enrollee IDs are present.__

# __Data Cleaning__

In [10]:
hr.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [11]:
hr_clean = hr.copy()
hr_clean.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


## __Change 'Other' gender to 'Male'__

In [12]:
hr['gender'].value_counts()

Male      13221
Female     1238
Other       191
Name: gender, dtype: int64

In [13]:
hr[hr['gender']=='Other']['city'].value_counts().head(5)

city_103    60
city_16     16
city_114    13
city_67     10
city_21      8
Name: city, dtype: int64

In [14]:
hr[hr['city']=='city_103']['gender'].value_counts().head(5)

Male      3099
Female     420
Other       60
Name: gender, dtype: int64

In [15]:
hr_clean['gender'] = hr_clean['gender'].replace('Other', 'Male')

In [16]:
hr_clean['gender'].value_counts()

Male      13412
Female     1238
Name: gender, dtype: int64

In [17]:
hr['gender'].value_counts()

Male      13221
Female     1238
Other       191
Name: gender, dtype: int64
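
The behaviour of the cells above can be reproduced on a small made-up frame (the values below are illustrative, not from the dataset). The sketch shows that `replace` on the copy changes only the 'Other' label, leaves missing values missing, and does not alter the original frame:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({'gender': ['Male', 'Other', np.nan, 'Female', 'Other']})

# Work on a copy so the original stays available for comparison.
toy_clean = toy.copy()
toy_clean['gender'] = toy_clean['gender'].replace('Other', 'Male')

# dropna=False also shows how many values are still missing.
print(toy_clean['gender'].value_counts(dropna=False))
print(toy['gender'].value_counts(dropna=False))
```

Missing genders are not filled by this step, so the 4508 nulls reported earlier for `gender` remain in `hr_clean`.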

## __Fix the formatting of one company_size value ('10/49' to '10-49')__

In [18]:
hr['company_size'].value_counts() 

50-99        3083
100-500      2571
10000+       2019
10/49        1471
1000-4999    1328
<10          1308
500-999       877
5000-9999     563
Name: company_size, dtype: int64

In [19]:
hr_clean['company_size'] = hr_clean['company_size'].replace('10/49', '10-49')

In [20]:
hr_clean['company_size'].value_counts()

50-99        3083
100-500      2571
10000+       2019
10-49        1471
1000-4999    1328
<10          1308
500-999       877
5000-9999     563
Name: company_size, dtype: int64

In [21]:
hr['company_size'].value_counts()

50-99        3083
100-500      2571
10000+       2019
10/49        1471
1000-4999    1328
<10          1308
500-999       877
5000-9999     563
Name: company_size, dtype: int64

## __Drop enrollee_id, as the column is useless for model prediction__

In [36]:
hr.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [37]:
hr_clean = hr_clean.drop(['enrollee_id'], axis = 1)
hr_clean.head()

Unnamed: 0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [42]:
hr_clean.columns.nunique()

13

In [43]:
hr.columns.nunique()

14

In [46]:
hr_clean['company_type'].unique()

array([nan, 'Pvt Ltd', 'Funded Startup', 'Early Stage Startup', 'Other',
       'Public Sector', 'NGO'], dtype=object)

# __[Export to CSV](https://ilmudatapy.com/import-dan-export-data-di-python/)__

In [44]:
hr_clean.to_csv('hr_clean.csv',index=False)
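
A quick way to check an export like this is to read the file back and compare it with the frame in memory. The sketch below does so on made-up data, writing to a temporary directory; the file name `toy_clean.csv` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'company_size': ['10-49', np.nan, '50-99'],
    'target': [1.0, 0.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')
    # index=False keeps the row index out of the file.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.shape)
print(list(restored.columns))
```

With `index=False`, the restored frame has the same columns as the original. Without it, `read_csv` would pick up the saved index as an extra `Unnamed: 0` column.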