## Problem Statement

Within the context of human resources (HR), attrition is a reduction in the workforce caused by retirement or resignation. It is a serious problem for organizations around the world: attrition is economically damaging because replacement employees must be hired and then trained, both at a cost. High rates of attrition also damage a company's brand value.
 
The dataset belongs to a very fast-growing company that has seen several employees leave over the last 3 years. The company's HR team has always been reactive to attrition, but it now wants to be proactive and predict employee attrition using the data it has in hand.
 
The goal here is to predict whether an employee will leave the company, based on the variables given in the dataset.

### Working with Data
Data has been split into two groups and provided in the module:

1. training set
2. test set

The training set is used to build your machine learning model. For the training set, we provide the attrition details of each employee.

The test set should be used to see how well your model performs on unseen data. For the test set, it is your job to predict the attrition value of each employee.
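
Because the test set comes without attrition labels, one common way to estimate performance before submitting is to hold back part of the training set for validation. The following is a minimal sketch, assuming scikit-learn is installed; the toy rows are invented for illustration, and names such as `X_valid` and `y_valid` are not part of the provided module.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy rows standing in for the labelled training set.
toy = pd.DataFrame({
    "Age": [25, 31, 38, 42, 29, 35, 47, 52, 33, 40],
    "Attrition": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})

X = toy.drop(columns="Attrition")  # features
y = toy["Attrition"]               # target

# Hold back 30% of the labelled rows; `stratify=y` keeps the share of
# leavers roughly equal in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(len(X_train), len(X_valid))
```

A model fitted on `X_train` can then be scored on `X_valid`, which gives an estimate of how it might behave on the unlabelled test set.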

### Metric to measure

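Accuracy is the metric used to measure performance in this hackathon.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Here TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.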
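
As a worked example of the formula, the sketch below counts the four outcomes for a pair of label lists and plugs them into the accuracy expression. The labels are made up for illustration, and the helper names `confusion_counts` and `accuracy` are hypothetical; it assumes 1 marks an employee who left and 0 one who stayed.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels, not taken from the dataset.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

print(confusion_counts(y_true, y_pred))  # (2, 4, 1, 1)
print(accuracy(y_true, y_pred))          # 0.75
```

The same value should come out of `sklearn.metrics.accuracy_score(y_true, y_pred)`, which is the more usual choice once scikit-learn is already imported.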

### Submission File Format:

You are to submit a CSV file with exactly 2630 entries plus a header row. The file should have exactly two columns:

1. EmployeeID (sorted in any order)
2. Attrition
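
The format rules above can be checked in code before a file is uploaded. This is a minimal sketch rather than an official validator: `make_submission` is a hypothetical helper, the IDs and predictions are invented, and encoding Attrition as integer 0/1 is an assumption, since the required encoding is not stated here.

```python
import pandas as pd

EXPECTED_ROWS = 2630  # number of entries required by the submission rules


def make_submission(employee_ids, predictions, expected_rows=EXPECTED_ROWS):
    """Build a two-column submission frame and check the basic format rules."""
    sub = pd.DataFrame({"EmployeeID": employee_ids, "Attrition": predictions})
    if len(sub) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(sub)}")
    if sub["EmployeeID"].duplicated().any():
        raise ValueError("EmployeeID values must be unique")
    if sub["Attrition"].isna().any():
        raise ValueError("every employee needs a prediction")
    return sub


# Toy run with three invented employees; a real file would use the
# default expected_rows of 2630.
demo = make_submission([101, 102, 103], [0, 1, 0], expected_rows=3)
csv_text = demo.to_csv(index=False)  # index=False keeps the file to two columns
print(csv_text)
```

Passing a path instead, for example `demo.to_csv("submission.csv", index=False)`, writes the same content to disk; the file name is only a placeholder.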

| Variable | Description |
| --- | --- |
| EmployeeID | Unique employee code |
| Attrition | Attrition flag |
| Age | Age of the employee |
| TravelProfile | Travel status in the job profile |
| Department | Department of the employee |
| HomeToWork | Distance between home and work |
| EducationField | Employee's field of education |
| Gender | Gender of the employee |
| HourlnWeek | Hours worked by the employee in a week |
| Involvement | Employee's involvement in engagement activities organised by the HR team (5 = highest, 1 = lowest) |
| WorkLifeBalance | Work-life balance of the employee (5 = highest, 1 = lowest) |
| Designation | Employee designation |
| JobSatisfaction | Score from the employee opinion survey (5 = highest, 1 = lowest) |
| ESOPs | Whether the employee owns the company's ESOPs (1 = yes, 0 = no) |
| NumCompaniesWorked | Total number of companies the employee has worked for in the past |
| OverTime | Whether the employee is eligible to be paid for overtime |
| SalaryHikelastYear | Increment percentage in the last cycle |
| WorkExperience | Total years of work experience |
| LastPromotion | Years since the last promotion |
| CurrentProfile | Years in the current profile |
| MaritalStatus | Marital status of the employee |
| MonthlyIncome | Gross monthly income of the employee |


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics


In [2]:
df = pd.read_csv('Train_Dataset.csv')
df.head()

Unnamed: 0,EmployeeID,Attrition,Age,TravelProfile,Department,HomeToWork,EducationField,Gender,HourlnWeek,Involvement,...,JobSatisfaction,ESOPs,NumCompaniesWorked,OverTime,SalaryHikelastYear,WorkExperience,LastPromotion,CurrentProfile,MaritalStatus,MonthlyIncome
0,5110001.0,0.0,35.0,Rarely,Analytics,5.0,CA,Male,69.0,1.0,...,1.0,1.0,1.0,1.0,20.0,7.0,2.0,,M,18932.0
1,5110002.0,1.0,32.0,Yes,Sales,5.0,Statistics,Female,62.0,4.0,...,2.0,0.0,8.0,0.0,20.0,4.0,1.0,,Single,18785.0
2,5110003.0,0.0,31.0,Rarely,Analytics,5.0,Statistics,F,45.0,5.0,...,2.0,1.0,3.0,0.0,26.0,12.0,1.0,3.0,Single,22091.0
3,5110004.0,0.0,34.0,Yes,Sales,10.0,Statistics,Female,32.0,3.0,...,4.0,1.0,1.0,0.0,23.0,5.0,1.0,3.0,Divorsed,20302.0
4,5110005.0,0.0,37.0,No,Analytics,27.0,Statistics,Female,49.0,3.0,...,4.0,1.0,8.0,0.0,21.0,12.0,1.0,9.0,Divorsed,21674.0
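
The preview above shows the same category written in more than one way, for instance `F` next to `Female` in Gender. One way to inspect and tidy such labels is sketched below on an invented series. The alias table is an assumption: treating `F` as `Female` looks reasonable, but mappings for other columns would need to be confirmed against the full data.

```python
import pandas as pd

# Assumed alias table: "F" is taken to mean "Female".
GENDER_ALIASES = {"F": "Female"}


def normalise_labels(series, aliases):
    """Strip surrounding whitespace and map known aliases to one spelling."""
    return series.str.strip().replace(aliases)


# Invented values mimicking the mixed spellings seen in the preview.
toy_gender = pd.Series(["Male", "Female", "F", "Female ", None])

# value_counts(dropna=False) lists every distinct spelling, including missing.
print(toy_gender.value_counts(dropna=False))

cleaned = normalise_labels(toy_gender, GENDER_ALIASES)
print(cleaned.nunique())  # 2
```

Missing values are left untouched by this helper, so imputation remains a separate decision.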


In [3]:
df.shape

(7810, 22)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7810 entries, 0 to 7809
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   EmployeeID          5180 non-null   float64
 1   Attrition           5180 non-null   float64
 2   Age                 4864 non-null   float64
 3   TravelProfile       5180 non-null   object 
 4   Department          5056 non-null   object 
 5   HomeToWork          4925 non-null   float64
 6   EducationField      5180 non-null   object 
 7   Gender              5134 non-null   object 
 8   HourlnWeek          4893 non-null   float64
 9   Involvement         5180 non-null   float64
 10  WorkLifeBalance     5180 non-null   float64
 11  Designation         5142 non-null   object 
 12  JobSatisfaction     5180 non-null   float64
 13  ESOPs               5180 non-null   float64
 14  NumCompaniesWorked  5180 non-null   float64
 15  OverTime            5180 non-null   float64
 16  Salary
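
The summary above reports 7810 rows but only 5180 non-null `EmployeeID` values, which suggests (without proving it) that a block of rows may be entirely empty. A minimal sketch for dropping such rows follows; it runs on an invented toy frame, and `drop_empty_rows` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def drop_empty_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows in which every column is missing, then renumber the index."""
    return frame.dropna(how="all").reset_index(drop=True)


# Invented toy frame: rows 1 and 3 are completely empty, row 2 is only partly so.
toy = pd.DataFrame({
    "EmployeeID": [1.0, np.nan, 2.0, np.nan],
    "Age": [35.0, np.nan, np.nan, np.nan],
})

cleaned = drop_empty_rows(toy)
print(toy.shape, "->", cleaned.shape)  # (4, 2) -> (2, 2)
```

Using `how="all"` keeps partially filled rows (such as an employee with a missing `Age`), so those can still be handled by imputation later.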

In [6]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
EmployeeID,5180.0,,,,5112590.5,1495.481528,5110001.0,5111295.75,5112590.5,5113885.25,5115180.0
Attrition,5180.0,,,,0.278958,0.44853,0.0,0.0,0.0,1.0,1.0
Age,4864.0,,,,37.108553,9.248647,18.0,30.0,36.0,43.0,61.0
TravelProfile,5180.0,3.0,Rarely,3637.0,,,,,,,
Department,5056.0,3.0,Analytics,3219.0,,,,,,,
HomeToWork,4925.0,,,,11.107411,8.455577,1.0,5.0,9.0,16.0,121.0
EducationField,5180.0,6.0,Statistics,2129.0,,,,,,,
Gender,5134.0,3.0,Male,3094.0,,,,,,,
HourlnWeek,4893.0,,,,57.979767,12.996674,10.0,49.0,59.0,67.0,99.0
Involvement,5180.0,,,,3.226641,0.872431,1.0,3.0,3.0,4.0,5.0


In [12]:
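# Treat the target and the rating/flag columns as categorical rather than numeric,
# so that describe() reports unique/top/freq for them instead of mean/std.
dtype_col = ['Attrition', 'Involvement', 'WorkLifeBalance', 'JobSatisfaction', 'ESOPs', 'OverTime']
for col in dtype_col:
    df[col] = df[col].astype('object')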

In [15]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
EmployeeID,5180.0,,,,5112590.5,1495.481528,5110001.0,5111295.75,5112590.5,5113885.25,5115180.0
Attrition,5180.0,2.0,0.0,3735.0,,,,,,,
Age,4864.0,,,,37.108553,9.248647,18.0,30.0,36.0,43.0,61.0
TravelProfile,5180.0,3.0,Rarely,3637.0,,,,,,,
Department,5056.0,3.0,Analytics,3219.0,,,,,,,
HomeToWork,4925.0,,,,11.107411,8.455577,1.0,5.0,9.0,16.0,121.0
EducationField,5180.0,6.0,Statistics,2129.0,,,,,,,
Gender,5134.0,3.0,Male,3094.0,,,,,,,
HourlnWeek,4893.0,,,,57.979767,12.996674,10.0,49.0,59.0,67.0,99.0
Involvement,5180.0,5.0,3.0,3030.0,,,,,,,
