## HR Analytics: Job Change of Data Scientists
### Predict who will move to a new job
A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from the candidates' signup and enrollment.

This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.


## Modeling Phase

- KNeighborsClassifier
- LogisticRegression
- DecisionTreeClassifier
- XGBClassifier
- SVM

#### Important packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

## Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

## model selection and metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

#### Get Data

In [2]:
df = pd.read_csv('../3.preprocess_data/data_process.csv')
df.head()

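| | city | city_development_index | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target | relevent_experience_Has relevent experience | enrolled_university_Full time course | enrolled_university_no_enrollment | gender_Female | gender_Male | NEW1 | NEW2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103 | 0.92 | 0 | 5 | 22 | 8 | 3 | 1 | 3.583519 | 1.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 40 | 0.776 | 0 | 5 | 15 | 4 | 5 | 5 | 3.850148 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 21 | 0.624 | 0 | 5 | 5 | 8 | 3 | 0 | 4.418841 | 0.0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 162 | 0.767 | 2 | 5 | 22 | 4 | 1 | 4 | 2.079442 | 0.0 | 1 | 0 | 1 | 0 | 1 | 10 | 44 |
| 4 | 176 | 0.764 | 0 | 5 | 11 | 8 | 3 | 1 | 3.178054 | 1.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |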


#### Split data into train and test sets

In [3]:
X = df.drop('target', axis=1).values
y = df['target'].values

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
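
The split above passes `stratify=y`, which keeps the class ratio the same in the train and test partitions. A minimal sketch of that effect, assuming toy data (the arrays `X_demo` and `y_demo` are made up for illustration and only mimic the roughly 3:1 class imbalance visible in the test-set support counts later in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data only, not the real dataset: 750 negatives, 250 positives.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the share of positives should match across partitions.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"train positive rate: {train_pos_rate:.3f}")
print(f"test positive rate:  {test_pos_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which makes test metrics for the minority class less comparable to cross-validation results.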

### KNeighborsClassifier

In [5]:
param_grid = {'n_neighbors': np.arange(1, 12)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid=param_grid, cv=5)
knn_cv.fit(X_train, y_train)

knn_cv.best_score_

0.7775999824764233
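
k-nearest neighbours relies on distances, so columns on very different scales (for example `city` in the hundreds next to `city_development_index` below 1) can dominate the neighbour search. The search above uses the features as they are. One possible variant, sketched here on synthetic data, wraps a `StandardScaler` and the classifier in a `Pipeline` so that scaling is refit inside each cross-validation fold. The names `X_demo`, `y_demo`, and `scaled_knn_cv` are illustrative, and whether this changes the scores on the real data is not something this notebook shows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption, not the HR dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_demo[:, 0] *= 100.0  # inflate one column to mimic a large-valued feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
# on the training part only.
scaled_knn = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
scaled_knn_cv = GridSearchCV(
    scaled_knn,
    param_grid={"knn__n_neighbors": np.arange(1, 12)},
    cv=5,
)
scaled_knn_cv.fit(X_tr, y_tr)

print(scaled_knn_cv.best_params_)
print(scaled_knn_cv.best_score_)
```

Note the `knn__` prefix in the grid: inside a pipeline, parameters are addressed as `<step name>__<parameter>`.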

In [6]:
y_pred = knn_cv.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report = pd.DataFrame(report).transpose()
report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.826162 | 0.87569 | 0.850205 | 2719.0 |
| 1.0 | 0.531207 | 0.433258 | 0.477259 | 884.0 |
| accuracy | 0.767138 | 0.767138 | 0.767138 | 0.767138 |
| macro avg | 0.678685 | 0.654474 | 0.663732 | 3603.0 |
| weighted avg | 0.753795 | 0.767138 | 0.758702 | 3603.0 |
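
The per-class precision and recall in a report like this come straight from the confusion matrix. `confusion_matrix` is imported at the top of the notebook but not used in the cells shown here, so the following is a small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not model output) showing how the numbers for class 1 are derived:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_1 = tp / (tp + fp)  # of the predicted job changers, how many were right
recall_1 = tp / (tp + fn)     # of the actual job changers, how many were found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision (class 1): {precision_1:.3f}")
print(f"recall    (class 1): {recall_1:.3f}")
```

Read this way, a low recall for class 1 means that many candidates who are in fact looking for a new job are predicted as staying.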


In [7]:
knn_pred_prob = knn_cv.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, knn_pred_prob)
auc

0.7469641320754402
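
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A short sketch on made-up scores (the arrays below are illustrative, not model output) that computes this pairwise quantity and compares it with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities, for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.2, 0.7, 0.6])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Compare every positive score against every negative score.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true_demo, scores_demo)
print(pairwise_auc, sklearn_auc)
```

Because it depends only on the ranking of scores, AUC is not affected by the 0.5 decision threshold that `predict` applies, which is why it can tell a different story from the classification report.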

### LogisticRegression


In [8]:
param_grid = {'penalty' : ['l1', 'l2'], 'C' : np.logspace(-4, 4, 20), 'solver' : ['liblinear']}
logreg = LogisticRegression()
log_cv = GridSearchCV(logreg, param_grid=param_grid, cv=5)
log_cv.fit(X_train, y_train)
log_cv.best_score_

0.7751720523983832

In [9]:
y_pred = log_cv.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report = pd.DataFrame(report).transpose()
report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.790081 | 0.931592 | 0.855021 | 2719.0 |
| 1.0 | 0.531486 | 0.238688 | 0.32943 | 884.0 |
| accuracy | 0.761588 | 0.761588 | 0.761588 | 0.761588 |
| macro avg | 0.660784 | 0.58514 | 0.592226 | 3603.0 |
| weighted avg | 0.726635 | 0.761588 | 0.726067 | 3603.0 |


In [10]:
log_pred_prob = log_cv.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, log_pred_prob)
auc

0.7509279845697864
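
Since one goal of this project is to interpret the factors behind a candidate's decision, the fitted logistic regression can be inspected through its coefficients. The sketch below uses synthetic data and hypothetical feature names (`feature_0` … `feature_4`). In this notebook the names would have to come from `df.drop('target', axis=1).columns`, because `.values` discards them. Coefficient magnitudes are only comparable when features are on similar scales, so treat this as a starting point rather than a ranking of importance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical

# Smaller grid than in the notebook, same structure.
demo_grid = {
    "penalty": ["l1", "l2"],
    "C": np.logspace(-2, 2, 5),
    "solver": ["liblinear"],
}
demo_cv = GridSearchCV(LogisticRegression(), param_grid=demo_grid, cv=5)
demo_cv.fit(X_demo, y_demo)

# best_estimator_ is the model refit on all training data with the best
# parameters; coef_ has shape (1, n_features) for a binary problem.
coefs = pd.Series(demo_cv.best_estimator_.coef_[0], index=feature_names)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

A positive coefficient raises the predicted log-odds of class 1 (looking for a new job) as the feature increases, and a negative one lowers them.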

### DecisionTreeClassifier

In [11]:
param_grid = { 'criterion':['gini','entropy'],'max_depth': np.arange(3, 15)}
dtc = DecisionTreeClassifier()
dtc_cv = GridSearchCV(dtc, param_grid=param_grid, cv=5)
dtc_cv.fit(X_train, y_train)
dtc_cv.best_score_

0.7947396292559966

In [12]:
y_pred = dtc_cv.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report = pd.DataFrame(report).transpose()
report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.850377 | 0.871644 | 0.860879 | 2719.0 |
| 1.0 | 0.572304 | 0.528281 | 0.549412 | 884.0 |
| accuracy | 0.787399 | 0.787399 | 0.787399 | 0.787399 |
| macro avg | 0.71134 | 0.699962 | 0.705145 | 3603.0 |
| weighted avg | 0.782151 | 0.787399 | 0.78446 | 3603.0 |


In [13]:
dtc_pred_prob = dtc_cv.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, dtc_pred_prob)
auc

0.7842601252456736
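
Each model so far is evaluated with the same two cells: a classification report followed by ROC AUC. One way to avoid repeating them is a small helper. The function name `evaluate` and the synthetic data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_test, y_test):
    """Return (report DataFrame, ROC AUC) for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    report = pd.DataFrame(
        classification_report(y_test, y_pred, output_dict=True)
    ).transpose()
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return report, auc


# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_grid = {"criterion": ["gini", "entropy"], "max_depth": np.arange(3, 8)}
demo_cv = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid=demo_grid, cv=5
)
demo_cv.fit(X_tr, y_tr)

demo_report, demo_auc = evaluate(demo_cv, X_te, y_te)
print(demo_report)
print(demo_auc)
```

In the notebook itself the calls would look like `evaluate(knn_cv, X_test, y_test)`, `evaluate(log_cv, X_test, y_test)`, and `evaluate(dtc_cv, X_test, y_test)`, since a fitted `GridSearchCV` exposes `predict` and `predict_proba` through its best estimator.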