# HR Analytics: Job Change of Data Scientists

**reference** https://www.kaggle.com/sathianpong/hr-analytics-predict-who-is-looking-for-a-new-job

## Data Preprocessing

## Modules

In [1]:
import numpy as np
import pandas as pd

# data visualization
import pylab as pl
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

# use ggplot in python
from plotnine import *

import os  # used to inspect and change the working directory
print(os.getcwd())

C:\Users\15177\Python_Project\HR_Analysis


In [2]:
# Synthetic Minority Oversampling Technique (SMOTE) for the minority class
from imblearn.over_sampling import SMOTE

## 0) Data Loading

In [3]:
data= pd.read_csv('C:/Users/15177/Python_Project/HR_Analysis/Data/aug_train.csv')
data.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


# 1) Checking missing percentage and data type

In [4]:
pds_missing_pct = data.isnull().sum() / len(data) * 100
pds_data_type = data.dtypes

my_dct = {'index':      pds_missing_pct.index,
          'missing_pct':pds_missing_pct.values, 
          'data_type':  pds_data_type.values}

pd.DataFrame(my_dct).sort_values(by= ['missing_pct'],ascending=False)

Unnamed: 0,index,missing_pct,data_type
10,company_type,32.049274,object
9,company_size,30.994885,object
3,gender,23.53064,object
7,major_discipline,14.683161,object
6,education_level,2.401086,object
11,last_new_job,2.207955,object
5,enrolled_university,2.014824,object
8,experience,0.339284,object
0,enrollee_id,0.0,int64
1,city,0.0,object
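
The same summary can be produced a little more compactly, since the mean of a boolean mask is the fraction of missing rows. The sketch below uses a made-up toy frame (the values and the names `toy` and `summary` are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "experience": ["5", "3", np.nan, ">20"],
    "training_hours": [36, 47, 83, 52],
})

# isna().mean() is the fraction of missing rows per column; times 100 gives a percentage
summary = pd.DataFrame({
    "missing_pct": toy.isna().mean() * 100,
    "data_type": toy.dtypes.astype(str),
}).sort_values("missing_pct", ascending=False)

print(summary)
```

Because both columns are built from Series that share the same index, no manual alignment of index and values is needed.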


# 2) Remove unneeded features
I remove the participant ID (enrollee_id), which is not needed as a feature.

In [5]:
data.drop('enrollee_id', axis=1,  inplace= True) # inplace=True modifies data directly and returns None

# check
'enrollee_id' in data

False

# 3) Distinguish categorical and continuous variables.


**Categorical variables** 

    Ordinal data
    1. education_level            
    2. experience             
    3. company_size           
    4. last_new_job                                
  
    Nominal data (categorical and non-ordinal data)
    1. city                   
    2. gender                                        
    3. relevent_experience
    4. enrolled_university    
    5. major_discipline       
    6. company_type           

**Continuous variables**

1. city_development_index
2. training_hours
3. target [This is the outcome y. It is a binary label {1,0} stored as a float, so it appears among the non-object columns.]

In [6]:
# Find categorical data
data.select_dtypes(include='object').columns

Index(['city', 'gender', 'relevent_experience', 'enrolled_university',
       'education_level', 'major_discipline', 'experience', 'company_size',
       'company_type', 'last_new_job'],
      dtype='object')

In [7]:
# Find ordinal data by investigating the classes
def f_unique(col):  # parameter renamed so it no longer shadows the pandas alias pd
    return col.unique()

data.select_dtypes(include='object').apply( f_unique, axis=0 )

city                   [city_103, city_40, city_21, city_115, city_16...
gender                                        [Male, nan, Female, Other]
relevent_experience    [Has relevent experience, No relevent experience]
enrolled_university    [no_enrollment, Full time course, nan, Part ti...
education_level        [Graduate, Masters, High School, nan, Phd, Pri...
major_discipline       [STEM, Business Degree, nan, Arts, Humanities,...
experience             [>20, 15, 5, <1, 11, 13, 7, 17, 2, 16, 1, 4, 1...
company_size           [nan, 50-99, <10, 10000+, 5000-9999, 1000-4999...
company_type           [nan, Pvt Ltd, Funded Startup, Early Stage Sta...
last_new_job                                [1, >4, never, 4, 3, 2, nan]
dtype: object

In [8]:
# Find continuous data
data.select_dtypes(exclude='object').columns

Index(['city_development_index', 'training_hours', 'target'], dtype='object')

**Split into two datasets by data type** (continuous and categorical) because each needs different preprocessing.

In [9]:
df_cont = data.select_dtypes(exclude='object')
df_cate = data.select_dtypes(include='object')

# 4) Impute missing values

**Action Plan**
1. Impute missing values in the categorical data by filling NA with 'No'.
2. In this dataset, the continuous data have no missing values, so no imputation is needed for them.

In [10]:
df_cate = df_cate.fillna('No')

In [11]:
# In company_size (categorical data), "10/49" should be "10-49"
df_cate['company_size'] = df_cate['company_size'].replace('10/49', '10-49')  # replace(A,B) -> Map A to B

In [12]:
# keep copies of the categorical data, in which NaN has already been filled with "No"
category_notNull =df_cate.copy()
df_cate_nonull =df_cate.copy()

# 5) Encode "ordinal variables"
**sklearn.preprocessing.OrdinalEncoder**

Encode categorical features as an integer array. This results in a single column of integers (0 to n_categories - 1) per feature.

reference: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
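
As a minimal sketch of why the category order has to be given explicitly, the toy example below encodes a few education levels. The frame `toy` is invented for illustration. Without the `categories` argument the encoder would sort the labels alphabetically, so 'Graduate' would receive a lower code than 'High School'.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (values invented for illustration)
toy = pd.DataFrame({"education_level": ["Graduate", "No", "Phd", "High School"]})

# Explicit low-to-high order, one list per column being encoded
order = [["No", "Primary School", "High School", "Graduate", "Masters", "Phd"]]

encoder = OrdinalEncoder(categories=order)
codes = encoder.fit_transform(toy)

print(codes.ravel())  # [3. 0. 5. 2.]
```

Each code is the position of the label in the supplied list, so the integers preserve the intended ranking.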

In [13]:
# Select the ordinal categorical variables
df_cate_nonull_ord = df_cate_nonull[['education_level', 'experience', 'company_size', 'last_new_job']]

output = df_cate_nonull_ord.apply(lambda x: np.unique(x) , axis=0)
output

# check
for i,j in zip( range( len(output)) , output) :
    print(output.index[i])
    print(j)
    print('      ')

education_level
['Graduate' 'High School' 'Masters' 'No' 'Phd' 'Primary School']
      
experience
['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '3' '4'
 '5' '6' '7' '8' '9' '<1' '>20' 'No']
      
company_size
['10-49' '100-500' '1000-4999' '10000+' '50-99' '500-999' '5000-9999'
 '<10' 'No']
      
last_new_job
['1' '2' '3' '4' '>4' 'No' 'never']
      


**Determine the order for ordinal variable**

In [14]:
from sklearn.preprocessing import OrdinalEncoder

# set up an encoder with an explicit category order for each ordinal variable
# (categories is passed by keyword, which recent scikit-learn versions require)
Alg_OrdinalEncoder = OrdinalEncoder(categories=[
    # education_level
    ['No', 'Primary School',  'High School', 'Graduate', 'Masters', 'Phd'],
    
    # 'experience'
    'No,<1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,>20'.split(','),
    
    # 'company_size'
    ['No','<10', '10-49', '50-99', '100-500', '500-999' ,'1000-4999', '5000-9999', '10000+'],
    
    # 'last_new_job'
    ['No','1', '2', '3', '4', '>4', 'never']
])




**Apply the order to ordinal variables**

In [15]:
# category_notNull_ordinalEncoded = Alg_OrdinalEncoder.fit_transform(df_cate_nonull_ord)
df_cate_nonull_ord_ordered = Alg_OrdinalEncoder.fit_transform(df_cate_nonull_ord)
# check
df_cate_nonull_ord_ordered

array([[ 3., 22.,  0.,  1.],
       [ 3., 16.,  3.,  5.],
       [ 3.,  6.,  0.,  6.],
       ...,
       [ 3., 22.,  3.,  4.],
       [ 2.,  1.,  5.,  2.],
       [ 1.,  3.,  0.,  1.]])

**Store the feature names:** 

These names will be used later, when we look at feature importance.

In [58]:
ftr_index_ordinal_x = ['education_level', 'experience', 'company_size', 'last_new_job']
ftr_index_ordinal_x

['education_level', 'experience', 'company_size', 'last_new_job']

# 6) One hot encode for "Nominal variable"
reference: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
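
A minimal sketch of what one-hot encoding produces, on an invented toy column (the frame `toy` is an assumption for illustration). The encoder is left at its default sparse output and converted with `.toarray()`, which sidesteps the parameter for dense output that was renamed between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column (values invented for illustration)
toy = pd.DataFrame({"gender": ["Male", "No", "Female", "Male"]})

encoder = OneHotEncoder()                        # default output is a SciPy sparse matrix
dummies = encoder.fit_transform(toy).toarray()   # convert to a dense NumPy array

print(encoder.categories_[0])  # categories are sorted: Female, Male, No
print(dummies)
```

Every row contains exactly one 1, in the column of its category, so no artificial order is imposed on the classes.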

In [16]:
nominal_columns = ['city','gender','relevent_experience','enrolled_university','major_discipline','company_type'] 

In [17]:
my_subset = df_cate_nonull.loc[:, nominal_columns]
    
output = my_subset.apply(lambda x: np.unique(x) , axis=0)
output

# check
for i,j in zip( range( len(output)) , output) :
    print(output.index[i])
    print(j)
    print('      ')    

city
['city_1' 'city_10' 'city_100' 'city_101' 'city_102' 'city_103' 'city_104'
 'city_105' 'city_106' 'city_107' 'city_109' 'city_11' 'city_111'
 'city_114' 'city_115' 'city_116' 'city_117' 'city_118' 'city_12'
 'city_120' 'city_121' 'city_123' 'city_126' 'city_127' 'city_128'
 'city_129' 'city_13' 'city_131' 'city_133' 'city_134' 'city_136'
 'city_138' 'city_139' 'city_14' 'city_140' 'city_141' 'city_142'
 'city_143' 'city_144' 'city_145' 'city_146' 'city_149' 'city_150'
 'city_152' 'city_155' 'city_157' 'city_158' 'city_159' 'city_16'
 'city_160' 'city_162' 'city_165' 'city_166' 'city_167' 'city_171'
 'city_173' 'city_175' 'city_176' 'city_179' 'city_18' 'city_180'
 'city_19' 'city_2' 'city_20' 'city_21' 'city_23' 'city_24' 'city_25'
 'city_26' 'city_27' 'city_28' 'city_30' 'city_31' 'city_33' 'city_36'
 'city_37' 'city_39' 'city_40' 'city_41' 'city_42' 'city_43' 'city_44'
 'city_45' 'city_46' 'city_48' 'city_50' 'city_53' 'city_54' 'city_55'
 'city_57' 'city_59' 'city_61' 'city_62'

**Apply the OneHotEncoder to nominal variables**

In [18]:
from sklearn.preprocessing import OneHotEncoder

# set up an encoder to do "One Hot Encoding"
# (sparse=False returns a dense array; scikit-learn >= 1.2 names this parameter sparse_output)
alg_OneHotEncoder = OneHotEncoder(sparse=False).fit( df_cate_nonull.loc[:, nominal_columns] )

# action
df_cate_nonull_nomial_dummy = alg_OneHotEncoder.transform(df_cate_nonull.loc[:, nominal_columns])

In [19]:
# Now each class is represented by a dummy variable {1,0}
df_cate_nonull_nomial_dummy

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.]])

**Store the feature names:** 

In [47]:
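# get_feature_names() was removed in scikit-learn 1.2; newer versions use
# get_feature_names_out(), which names the columns differently
ftr_index_nomial_x  = list(alg_OneHotEncoder.get_feature_names())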
ftr_index_nomial_x

['x0_city_1',
 'x0_city_10',
 'x0_city_100',
 'x0_city_101',
 'x0_city_102',
 'x0_city_103',
 'x0_city_104',
 'x0_city_105',
 'x0_city_106',
 'x0_city_107',
 'x0_city_109',
 'x0_city_11',
 'x0_city_111',
 'x0_city_114',
 'x0_city_115',
 'x0_city_116',
 'x0_city_117',
 'x0_city_118',
 'x0_city_12',
 'x0_city_120',
 'x0_city_121',
 'x0_city_123',
 'x0_city_126',
 'x0_city_127',
 'x0_city_128',
 'x0_city_129',
 'x0_city_13',
 'x0_city_131',
 'x0_city_133',
 'x0_city_134',
 'x0_city_136',
 'x0_city_138',
 'x0_city_139',
 'x0_city_14',
 'x0_city_140',
 'x0_city_141',
 'x0_city_142',
 'x0_city_143',
 'x0_city_144',
 'x0_city_145',
 'x0_city_146',
 'x0_city_149',
 'x0_city_150',
 'x0_city_152',
 'x0_city_155',
 'x0_city_157',
 'x0_city_158',
 'x0_city_159',
 'x0_city_16',
 'x0_city_160',
 'x0_city_162',
 'x0_city_165',
 'x0_city_166',
 'x0_city_167',
 'x0_city_171',
 'x0_city_173',
 'x0_city_175',
 'x0_city_176',
 'x0_city_179',
 'x0_city_18',
 'x0_city_180',
 'x0_city_19',
 'x0_city_2',
 'x0

# 7) Gather transformed dataset into one

**Action plan**
1. Use X (transformed features) and y (outcome) as the final dataset.
2. y can be selected from df_cont (continuous dataset).
3. X has three parts:
    - numerical features (X_num)
    - categorical, unordered features (X_cat_nominal) 
    - categorical, ordered features (X_cat_ordinal) 

In [21]:
# y 
y =  df_cont['target'].values
y.shape

(19158,)

In [22]:
# X
X_num        = df_cont.drop('target', axis=1).values
X_cat_nominal= df_cate_nonull_nomial_dummy
X_cat_ordinal= df_cate_nonull_ord_ordered

X = np.concatenate( [X_num, X_cat_nominal, X_cat_ordinal], axis=1)


X.shape

(19158, 153)

**X feature index**

In [60]:
ftr_index_num_x = list( df_cont.drop('target', axis=1).columns )
X_index = ftr_index_num_x + ftr_index_nomial_x + ftr_index_ordinal_x
len(X_index) # correct: matches X.shape[1]

153

In [67]:
pd.DataFrame(X_index).to_csv("X_index.csv", header=['feature_name'], index= False)

# 8) Oversample the minority outcome class

Reference: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
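
To make the idea concrete, here is a simplified NumPy sketch of the interpolation step behind SMOTE. It is not the imblearn implementation. The function `smote_sketch` and its parameters are hypothetical names introduced for illustration. Each synthetic row lies on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))             # pick a random minority sample
        j = neighbours[i, rng.integers(k)]       # pick one of its k nearest neighbours
        gap = rng.random()                       # position on the segment, in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Toy minority class: the four corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_sketch(X_min_toy, n_new=5)
print(new_rows.shape)
```

Because the new rows are interpolated, synthetic samples can take fractional values in columns that were originally 0/1 dummies.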

In [23]:
X, y = SMOTE().fit_resample(X, y)

In [24]:
#### check
pd.DataFrame(y).value_counts()

#### Note: outcomes {1,0} have balanced sample sizes.

1.0    14381
0.0    14381
dtype: int64

# 9) Split the dataset into train and test. Then, standardize the numeric column based on train set

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

#### check
for i in [X_train.shape, X_test.shape, y_train.shape, y_test.shape]:
    print(i)

(23009, 153)
(5753, 153)
(23009,)
(5753,)
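
The effect of stratify=y can be checked on a small example. The toy arrays are invented, and random_state is an assumption added here for reproducibility; the notebook itself does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with a balanced binary outcome (invented for illustration)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify keeps the class proportions of y_toy in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Both splits keep the 50/50 class ratio of the full toy outcome.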


**Standardization** 

Action plan: The first 2 columns are numeric data (city_development_index and training_hours), which need to be standardized (mean=0 and standard deviation=1).

In [27]:
from sklearn.preprocessing import StandardScaler

In [28]:
# fit a scaler on the train set only, so no test-set statistics are used
alg_scaler = StandardScaler()
alg_scaler.fit(X_train[:, :2])

StandardScaler()

**Apply standardization algorithm to numerical features (the first 2 columns)**

In [29]:
X_train_scaled = X_train.copy()
X_train_scaled[:, :2] = alg_scaler.transform(X_train[:, :2])

X_test_scaled = X_test.copy()
X_test_scaled[:, :2] = alg_scaler.transform(X_test[:, :2])
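
A small check of the same pattern on invented toy arrays: the scaler learns the mean and standard deviation from the train rows only, so the train columns end up with mean 0 and standard deviation 1, while the test rows are transformed with those same train statistics. The third column stands in for an encoded feature and is left untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two numeric columns followed by one encoded column (values invented)
X_train_toy = np.array([[0.6, 10.0, 1.0], [0.8, 30.0, 0.0], [1.0, 50.0, 1.0]])
X_test_toy = np.array([[0.8, 70.0, 0.0]])

scaler = StandardScaler().fit(X_train_toy[:, :2])   # statistics come from train only

X_train_s = X_train_toy.copy()
X_train_s[:, :2] = scaler.transform(X_train_toy[:, :2])

X_test_s = X_test_toy.copy()
X_test_s[:, :2] = scaler.transform(X_test_toy[:, :2])

print(X_train_s[:, :2].mean(axis=0), X_train_s[:, :2].std(axis=0))
print(X_test_s)
```

The test row is not centred on zero in general, which is expected: it is measured against the train distribution.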

# 10) Output the modified Dataset

In [30]:
os.chdir('C:/Users/15177/Python_Project/HR_Analysis/Data/')

In [31]:
pd.DataFrame(X_train_scaled).to_csv("X_train_scaled.csv", header=False, index= False)
pd.DataFrame(X_test_scaled).to_csv("X_test_scaled.csv", header=False, index= False)
pd.DataFrame(y_train).to_csv("y_train.csv",  header=False, index= False)
pd.DataFrame(y_test).to_csv("y_test.csv",  header=False, index =False)
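
Since the files are written without a header row, they need header=None when read back, otherwise the first data row is consumed as column names. A minimal round-trip sketch, using a temporary directory and an invented toy array rather than the real output files:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy array standing in for X_train_scaled (values invented for illustration)
arr = np.array([[0.5, 1.0], [-0.5, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_train_scaled.csv")
    pd.DataFrame(arr).to_csv(path, header=False, index=False)
    # header=None tells pandas that the first row is data, not column names
    restored = pd.read_csv(path, header=None).values

print(np.allclose(arr, restored))
```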

# END