# Introduction to the dataset

*The dataset comes from a company that is struggling with employee attrition. It contains records for employees who have left the company as well as those who are still with it. The company wants to use your expertise to identify the major contributors to employees leaving.*

1. Satisfaction Level - ranges between 0 and 1 - the employee's satisfaction level
2. last_evaluation - ranges between 0 and 1 - the employee's normalised rating in the last appraisal
3. number_project - numeric - number of projects the employee has worked on so far
4. average_monthly_hours - numeric - average number of hours the employee spends in the office per month
5. time_spend_company - numeric - time the employee has spent in the company (in months)
6. Work_accident - categorical - whether the employee has had an accident in the work environment
7. Department - categorical - department in which the employee works or has worked
8. Salary - categorical - one of low, medium or high

### Read the dataset

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
hr = pd.read_csv("hr-analytics.csv")
hr.head()

### Observe the shape and the type of dataset

In [None]:
hr.shape

In [None]:
type(hr)

### Use summary statistics to check whether missing-value or outlier treatment is necessary

In [None]:
hr.describe()

### Data Preprocessing - Missing Values Treatment
        

In [None]:
sns.pairplot(hr2, diag_kind='kde')
# categorical variables show similar spread of satisfaction levels across categories
# low correlations between independent variables


### Standardization of Data


In [None]:
from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()

In [None]:
target = hr2['left']
features = hr2.drop(['left'], axis=1)

In [None]:
# sklearn.cross_validation was removed; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.3, random_state = 5)

In [None]:
# a pandas Series has no reshape(); go through the underlying NumPy array
y_train = y_train.values.reshape(len(y_train), 1)
y_test = y_test.values.reshape(len(y_test), 1)

In [None]:
X_train_scaled = std_scale.fit_transform(X_train)
# reuse the statistics learned on the training set; do not re-fit on the test set
X_test_scaled = std_scale.transform(X_test)
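
A common convention is to learn the scaling statistics from the training split only and reuse them on the test split, so that no information from the test rows influences preprocessing. The sketch below illustrates this on synthetic data; `X_demo` and its shape are made up for illustration and are not taken from the HR dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (hypothetical data)
rng = np.random.default_rng(5)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=5)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std on the training rows
X_te_scaled = scaler.transform(X_te)      # reuse the same mean/std on the test rows

# Training columns are centred exactly; test columns only approximately
print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```

The test columns will not have a mean of exactly zero, which is expected: they are scaled with the training statistics, just as unseen data would be.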

### Use Naive Bayes Modelling and find out the accuracy of the model

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train_scaled, y_train)
print(model)

In [None]:
from sklearn import metrics
# make predictions

predicted = model.predict(X_test_scaled)
predicted_tr = model.predict(X_train_scaled)

print("Classification report for test:\n", metrics.classification_report(y_test, predicted),
      "Classification report for training:\n", metrics.classification_report(y_train, predicted_tr))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, predicted))
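
The classification report lists precision, recall and F1 per class. When a single accuracy figure is wanted, `accuracy_score` can be used. The following is a minimal sketch on synthetic data generated with `make_classification`; the names `X_demo`, `y_demo` and `nb` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=5)

nb = GaussianNB().fit(X_tr, y_tr)

# Fraction of correctly classified rows on each split
test_acc = accuracy_score(y_te, nb.predict(X_te))
train_acc = accuracy_score(y_tr, nb.predict(X_tr))
print("Test accuracy:", test_acc)
print("Training accuracy:", train_acc)
```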

### Use SVM and find out the accuracy & print the confusion matrix

In [None]:
from sklearn.svm import SVC
# Building a Support Vector Machine on train data
svc = SVC(C=6, kernel='linear')
svc.fit(X_train_scaled, y_train)

In [None]:

prediction = svc.predict(X_test_scaled)
print('Accuracy for test: ', svc.score(X_test_scaled, y_test), "\n",
      'Accuracy for training: ',svc.score(X_train_scaled, y_train))

# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, prediction))
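
scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows correspond to actual classes and columns to predicted classes. A small worked example with made-up labels (1 = left, 0 = stayed) shows how to read the four cells:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for seven employees
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order for the binary case

print(cm)                    # [[2 1]
                             #  [2 2]]
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 2 1 2 2
```

Swapping the two arguments transposes the matrix, which silently exchanges false positives and false negatives.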

### Find out 10-fold cross-validation scores for both models

In [None]:
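# sklearn.cross_validation was removed; use model_selection instead
from sklearn.model_selection import cross_val_score, cross_val_predict
# flatten the target to 1-D, as scikit-learn estimators expect
y_test = np.ravel(y_test)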
scores = cross_val_score(svc, X_test_scaled, y_test, cv=10)
scores

### Find the optimal parameters of the SVM model by tuning hyperparameters: use C values (6, 7) and kernels (rbf, linear)

In [None]:
def svcm(c, kernel):
    """Fit an SVC with the given C and kernel, then print train and test accuracy."""
    svc = SVC(C=c, kernel=kernel)
    svc.fit(X_train_scaled, y_train)
    print("For C =", c, "and kernel =", kernel, "\n",
          "Accuracy on training set:", svc.score(X_train_scaled, y_train), "\n",
          "Accuracy on test set:", svc.score(X_test_scaled, y_test))
svcm(6,'rbf')
svcm(7,'rbf')
svcm(6,'linear')
svcm(7,'linear')
    
# best results are obtained with kernel='rbf' and C=6 or 7
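
The same four combinations can also be searched with `GridSearchCV`, which scores each candidate by cross-validation instead of a single train/test split. This is a sketch of that alternative on synthetic data, not the approach used in the cell above; `X_demo`, `y_demo` and the choice of `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=5)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [6, 7], "svc__kernel": ["rbf", "linear"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```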

In [None]:
svc = SVC(C=6, kernel='rbf')
svc.fit(X_train_scaled, y_train)

### Considering the best hyperparameters and performing cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svc, X_test_scaled, y_test, cv=10)
scores
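
Ten fold scores are easier to compare when summarised by their mean and spread. The sketch below also wraps the scaler and the SVM in a pipeline so that scaling is re-fitted inside every fold. It is an alternative setup on synthetic data, not the procedure used in the cell above, and `X_demo`, `y_demo` and `svm_pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=5)

# Scaling happens inside each fold, using only that fold's training rows
svm_pipe = make_pipeline(StandardScaler(), SVC(C=6, kernel="rbf"))
cv_scores = cross_val_score(svm_pipe, X_demo, y_demo, cv=10)

print(cv_scores.round(3))
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```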


### Use Decision Tree and find out the feature importance scores and accuracy of the model

In [None]:
hr.isnull().sum()

### Dealing with Outliers - Find the IQR and remove any rows that contain outliers
       

### Use Histogram to Check distribution of dependent variable 

In [None]:

plt.hist(hr2['left'], ec='black')

### Use correlation & scatter matrix to observe the dependency between variables (drop an independent variable if its abs(correlation) with the dependent variable is < 0.01)

In [None]:
# check correlation between continuous variables
hr2.corr()
# no continuous variable has an absolute correlation with the dependent variable below 0.01

In [None]:
# subset data for continuous variables
hr_ct = hr[['satisfaction_level','last_evaluation','number_project','average_montly_hours','time_spend_company']]
q1 = hr_ct.quantile(0.25)
q3 = hr_ct.quantile(0.75)
iqr = q3-q1

In [None]:
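# remove rows where any continuous variable falls outside 1.5 * IQR of its quartiles
# .copy() avoids SettingWithCopyWarning when hr2 columns are reassigned later
hr2 = hr[~((hr_ct<(q1-1.5*iqr))|(hr_ct>(q3+1.5*iqr))).any(axis=1)].copy()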
hr2.describe()
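
To make the 1.5 * IQR rule concrete, here is a small worked example on a toy column. The values are invented for illustration; pandas uses linear interpolation for quantiles by default.

```python
import pandas as pd

# Hypothetical monthly hours; 300 is an obvious outlier
hours = pd.Series([150, 160, 170, 180, 190, 200, 300], name="hours")

q1_demo = hours.quantile(0.25)   # 165.0
q3_demo = hours.quantile(0.75)   # 195.0
iqr_demo = q3_demo - q1_demo     # 30.0

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
lower = q1_demo - 1.5 * iqr_demo  # 120.0
upper = q3_demo + 1.5 * iqr_demo  # 240.0

kept = hours[(hours >= lower) & (hours <= upper)]
print(kept.tolist())  # [150, 160, 170, 180, 190, 200]
```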

### Dealing with Categorical Values - Use LabelEncoder from sklearn to encode categorical variables

In [None]:
#Encode department and salary
labelencoder = LabelEncoder()
hr2['Department'] = labelencoder.fit_transform(hr2.Department)
hr2['salary'] = labelencoder.fit_transform(hr2.salary)
hr2.head()
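
`LabelEncoder` assigns integer codes in alphabetical order of the class names, so a column with the values low, medium and high is encoded as high=0, low=1, medium=2. If the natural ordering matters for the model, an explicit mapping is one option. The sketch below uses a made-up `salary` Series, and `salary_order` is a hypothetical mapping chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical salary column
salary = pd.Series(["low", "medium", "high", "low"])

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(salary)
print(alphabetical.tolist())  # [1, 2, 0, 1]

# Explicit mapping that preserves the low < medium < high ordering
salary_order = {"low": 0, "medium": 1, "high": 2}
ordinal = salary.map(salary_order)
print(ordinal.tolist())       # [0, 1, 2, 0]
```

For tree-based models the difference is usually minor, but for distance-based or linear models the ordering of the codes can affect the result.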

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion='entropy')
dt_model.fit(X_train, y_train)
dt_model.score(X_test , y_test)

In [None]:
print(pd.DataFrame(dt_model.feature_importances_, columns=["Imp"], index=X_train.columns))

### Find out cross-validation scores with 10-fold for the above model
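
One way to approach this step is to pass a decision tree to `cross_val_score` with `cv=10`. The following is a minimal sketch on synthetic data, since the notebook's own features are not reproduced here; `X_demo`, `y_demo`, the column names and `random_state=5` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (hypothetical data)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=5
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# Same criterion as the tree above; random_state fixed for repeatable splits
tree = DecisionTreeClassifier(criterion="entropy", random_state=5)
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=10)

print(tree_scores.round(3))
print("Mean accuracy:", tree_scores.mean())
```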