# Employee Turnover Prediction
This notebook gives an example of analyzing data to predict employee turnover. By preprocessing the data and feeding it into machine learning models, we can predict whether a person will leave the company based on factors such as salary, working hours, and years at the company.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

## Acquire Data
Read the CSV file and store the dataset in a dataframe called "df".

In [None]:
# DataFrame.from_csv was removed from pandas; read_csv is its replacement
df = pd.read_csv('../input/HR_comma_sep.csv')

In [None]:
import tensorflow as tf

In [None]:
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)

## Describe data to Analyze

In [None]:
print(df.columns.values)

Analyze the distribution of numerical feature values

In [None]:
df.describe()

In [None]:
# Preview the data
df.head()

## Check for missing data
Checking for missing data shows that, luckily, there is none. Therefore no action is needed to fill in null values.

In [None]:
df.isnull().any()
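
As a small illustration of what this check reports when values really are missing, the sketch below runs it on a made-up three-row frame. The frame `toy` and its values are hypothetical and not taken from the HR dataset. It also uses `isnull().sum()`, which counts the nulls per column instead of only flagging them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not the HR dataset.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "salary": ["low", "medium", "medium"],
})

# any() reports whether a column has at least one null,
# sum() reports how many nulls each column has.
has_null = toy.isnull().any()
null_counts = toy.isnull().sum()
print(has_null)
print(null_counts)
```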

## Cast categorical data type to numerical
The dataset info shows that "sales" and "salary" are categorical (object) columns, which will need to be converted to numbers later.

In [None]:
df.info()

Now check the distribution of the categorical features.

In [None]:
df.describe(include=['O'])

The "sales" column has 10 distinct values and "salary" has 3. This will be helpful when we convert the data to numerical features.

In [None]:
# Encode salary as an ordinal feature: low < medium < high
df['salary'] = df['salary'].map( {'high':2 ,'medium': 1, 'low': 0} ).astype(int)
df.head()

Now let's look at how we can convert the "sales" variable to numbers. First we need to know which ten values "sales" takes, since they are not all visible in the head of the dataset.

In [None]:
# unique() returns each distinct value once, in order of first appearance
for dept in df['sales'].unique():
    print(dept)

From this loop, we know that "sales" takes ten values: sales, accounting, hr, technical, support, IT, management, product_mng, marketing, and RandD. We can now map each of these categories to a number.

In [None]:
df['sales'] = df['sales'].map( {'sales':9 , 'accounting':8 , 'hr':7, 'technical':6,  'support':5,  'management':4, 'IT':3,  'product_mng':2,  'marketing':1,  'RandD':0} ).astype(float)
df.head()
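
The integer codes above impose an arbitrary order on the categories (for example 'sales' = 9 is "greater" than 'hr' = 7), which distance-based models may treat as meaningful. One common alternative, shown here only as a sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`. The frame `toy` and the `dept` prefix are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself works on the full HR dataset.
toy = pd.DataFrame({
    "sales": ["sales", "hr", "IT", "sales"],
    "left": [1, 0, 0, 1],
})

# One indicator column per category instead of a single ordinal code.
encoded = pd.get_dummies(toy, columns=["sales"], prefix="dept", dtype=int)
print(encoded)
```

The trade-off is more columns: ten categories become ten indicator columns instead of one.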

## Find correlation of data and leaving rate

In [None]:
#Correlation Matrix
corr = df.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

corr
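
Reading a heatmap by eye can be imprecise. To rank features by their correlation with the target, one option is to pull the "left" column out of the correlation matrix and sort it. The sketch below does this on made-up data; `toy` and its values are hypothetical, and only the column names mirror the HR dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the HR dataset.
toy = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "time_spend_company": [2, 3, 5, 6, 2, 4],
    "left": [0, 0, 1, 1, 0, 1],
})

# Take the target's column of the correlation matrix, drop the trivial
# self-correlation, and sort from most negative to most positive.
target_corr = toy.corr()["left"].drop("left").sort_values()
print(target_corr)
```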

## Analyze by binary relationship table
From the matrix above we see that "left" is negatively correlated with "satisfaction_level", "salary" and "Work_accident", and positively correlated with "time_spend_company".
We can examine these relationships more precisely below.

In [None]:
df[['left', 'satisfaction_level']].groupby(['left'], as_index=False).mean().sort_values(by='satisfaction_level', ascending=False)

Employees who had a work accident appear more likely to stay with the company (the difference in the table below is about 0.1), which is counterintuitive. This is where machine learning comes into play: it can uncover underlying relationships that are hard to explain with ordinary analysis.

In [None]:
df[['left', 'Work_accident']].groupby(['left'], as_index=False).mean().sort_values(by='Work_accident', ascending=False)

When grouped by "left", the time_spend_company data is not very informative, so we group by time_spend_company instead, as below. We see that those who have spent 3-6 years at the company are the most likely to leave.

In [None]:
df[['left', 'time_spend_company']].groupby(['time_spend_company'], as_index=False).mean().sort_values(by='left', ascending=False)

The salary factor shows that those who earn the least are the most likely to leave.

In [None]:
df[['left', 'salary']].groupby(['salary'], as_index=False).mean().sort_values(by='left', ascending=False)

## Visualize and convert continuous data to discrete
Beyond the correlation with the binary "left" factor, exploring correlations between other factors can also give us useful information.

In [None]:
g = sns.FacetGrid(df, col='number_project')
g.map(plt.hist, 'average_montly_hours', bins=10)

We see a positive relationship between the number of projects and average monthly hours, so the two columns can be combined, either by dropping one of them or by multiplying them into a single feature, which may strengthen the correlation.

In [None]:
g = sns.FacetGrid(df, col='left')
g.map(plt.hist, 'average_montly_hours', bins=20)

We can see that those who work fewer than about 160 hours or more than about 270 hours are more likely to leave. However, converting the average hours into a discrete representation significantly lowered the accuracy, for reasons we have not identified, so we keep it as is for now.

In [None]:
g = sns.FacetGrid(df, col='left')
g.map(plt.hist, 'satisfaction_level', bins=20)

The relationship between satisfaction_level and "left" is also bimodal. Since the correlation between "satisfaction_level" and "left" is about 0.39 in magnitude, which is fairly high, it is worth modifying the data to better reflect how each satisfaction_level range influences "left".

In [None]:
#From the above graph, we see a high left rate between 0.25-0.5 and also >0.75
df.loc[ df['satisfaction_level'] <= 0.25, 'satisfaction_level'] = 0
df.loc[(df['satisfaction_level'] > 0.25) & (df['satisfaction_level'] <= 0.5), 'satisfaction_level'] = 1
df.loc[(df['satisfaction_level'] > 0.5) & (df['satisfaction_level'] <= 0.75), 'satisfaction_level']   = 0
df.loc[ df['satisfaction_level'] > 0.75, 'satisfaction_level'] = 1
df['satisfaction_level'] = df['satisfaction_level'].astype(int)
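
The same bucketing can also be written with `pd.cut`, which assigns every value to its bin in a single pass over the original values, so a value that has already been rewritten cannot be matched again by a later condition. This is a sketch on made-up numbers. The series `sat` is hypothetical; the bin edges and the 0/1 labels follow the cell above.

```python
import pandas as pd

# Hypothetical satisfaction values, including the bin edges themselves.
sat = pd.Series([0.10, 0.25, 0.40, 0.50, 0.60, 0.75, 0.90])

# Right-inclusive bins: [0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1.0].
# labels=False returns the bin index (0..3) for each value.
bin_index = pd.cut(sat, bins=[0, 0.25, 0.5, 0.75, 1.0],
                   labels=False, include_lowest=True)

# Map bin indices to the same 0/1 flag used in the cell above.
flag = bin_index.map({0: 0, 1: 1, 2: 0, 3: 1}).astype(int)
print(flag.tolist())  # [0, 0, 1, 1, 0, 0, 1]
```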

In [None]:
g = sns.FacetGrid(df, col='left')
g.map(plt.hist, 'last_evaluation', bins=20)

last_evaluation also shows a clear pattern for those who choose to leave the company. We see from above that employees whose last_evaluation is below 0.56 or above 0.8 have a higher tendency to leave.

In [None]:
# Compute the flag from the original values in one step. Assigning bin by bin
# would let the freshly written 1s match the later "> 0.80" condition and be
# reset to 0, leaving the whole column at 0.
mid_eval = (df['last_evaluation'] > 0.56) & (df['last_evaluation'] <= 0.80)
df['last_evaluation'] = mid_eval.astype(int)


## Create new column from existing data
We see from the previous correlation matrix analysis that 'number_project' and 'average_montly_hours' are strongly correlated, so we can combine these existing columns to create a new feature.

In [None]:
df["proj*hour"] = df.number_project * df.average_montly_hours
df.loc[:, ['proj*hour','number_project','average_montly_hours']].head(10)

## Remove redundant columns
Since we have 'proj*hour' now, we can drop 'number_project' and 'average_montly_hours' to remove redundant data from the model.

In [None]:
df = df.drop(['number_project','average_montly_hours'], axis=1)

## Split train and test datasets
Now we can split the preprocessed data into a training set and a test set, made up of the first 85% and the last 15% of the rows respectively.

In [None]:
nHead = int(len(df)*0.85)
nTail = len(df) - nHead  # the remaining rows, so that no row is lost to rounding
X_train = df.drop("left", axis=1).head(nHead)
X_test  = df.drop("left", axis=1).tail(nTail)
Y_train = df["left"].head(nHead)
Y_test = df["left"].tail(nTail)
X_train.shape, X_test.shape
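
Taking the head and tail keeps the file's row order. If the rows happen to be sorted in a way that relates to "left", the two sets can end up with different class mixes. A possible alternative, shown only as a sketch and not as what this notebook does, is scikit-learn's `train_test_split` with `stratify`, which shuffles the rows and keeps the share of leavers equal in both sets. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% leavers, sorted so leavers come first.
toy = pd.DataFrame({
    "feature": np.arange(100),
    "left": [1] * 20 + [0] * 80,
})

X = toy.drop("left", axis=1)
y = toy["left"]

# Shuffle, hold out 15% for testing, and preserve the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```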

## Modeling with dataset
There are several modeling strategies. Here we list three distinct ones and test out which one works best.

### Support Vector Machines
Support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other (Wikipedia).

In [None]:
svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_test, Y_test) * 100, 2)
acc_svc
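
`SVC` with its default RBF kernel is sensitive to feature scale, and the features here range from 0/1 flags to much larger 'proj*hour' products. A common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data (all of it hypothetical) and makes no claim about how much scaling would change the score on the HR dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical synthetic data: one small-scale informative feature
# and one large-scale noise feature.
n = 200
informative = rng.normal(size=n)
large_scale = rng.normal(scale=1000.0, size=n)
X = np.column_stack([informative, large_scale])
y = (informative > 0).astype(int)

# The scaler is fitted on the training rows only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X[:150], y[:150])
acc_scaled = round(scaled_svc.score(X[150:], y[150:]) * 100, 2)
print(acc_scaled)
```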

### k-nearest neighbors
The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space (Wikipedia).

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_test, Y_test) * 100, 2)
acc_knn

### Decision tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements (Wikipedia).

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_test, Y_test) * 100, 2)
acc_decision_tree
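
To see which of the three models works best, the scores can be gathered into one table and sorted. The sketch below is self-contained and uses synthetic data from `make_classification` as a hypothetical stand-in for the preprocessed HR features, so its numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic stand-in for the preprocessed HR features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "Support Vector Machines": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its test accuracy as a percentage.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"Model": name, "Score": round(model.score(X_te, y_te) * 100, 2)})

results = pd.DataFrame(rows).sort_values(by="Score", ascending=False)
print(results)
```

In the notebook itself, the same table could be built directly from `acc_svc`, `acc_knn` and `acc_decision_tree`.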