## Predict the employee attrition rate in organizations

### ABOUT CHALLENGE

### Problem statement

Employees are the most important asset of an organization. They work behind the scenes to keep the business running like a well-oiled machine. Hiring the best fit for an organization and ensuring that they stay with the company are two sides of the same coin. Employee attrition does more than cause a minor hitch in the system; it is also a major cost to the organization.

The Human Resources department of your organization is determined to predict the employee attrition rate in advance and put a corrective plan of action in place. As a Machine Learning Specialist, you have been asked by the HR team to build a model that predicts the organization’s attrition rate.

### Dataset

The dataset consists of various details such as gender and age of the employee, education and relationship status, pay scale, and other factors that may influence the attrition rate.

The benefits of practicing this problem by using Machine Learning techniques are as follows:

- This challenge will encourage you to apply your Machine Learning skills to build models that can predict employee attrition rates.
- This challenge will help you enhance your knowledge of regression. Regression is one of the basic building blocks of Machine Learning.

We challenge you to build a model that computes the attrition rate for an employee working in an organization.

The data is provided in the following files:

- Train.csv
- Test.csv
- sample_submission.csv

### Data Description

- __Employee_ID__: Unique ID of each employee
- __Age__: Age of each employee
- __Unit__: Department in which the employee works
- __Education__: Rating of the employee's qualification (1-5)
- __Gender__: Male-0 or Female-1
- __Decision_skill_possess__: Decision skill that an employee possesses
- __Post_Level__: Level of the post in an organization (1-5)
- __Relationship_Status__: Categorical, Married or Single
- __Pay_Scale__: Rating between 1 and 10
- __Time_of_service__: Years in the organization
- __growth_rate__: Growth rate of an employee, in percent
- __Time_since_promotion__: Time in years since the last promotion
- __Work_Life_balance__: Rating for work-life balance given by an employee
- __Travel_Rate__: Rating based on travel history (1-3)
- __Hometown__: Name of the city
- __Compensation_and_Benefits__: Categorical variable
- __VAR1 - VAR5__: Anonymised variables
- __Attrition_rate (TARGET VARIABLE)__: Attrition rate of each employee

### Submission format
You are required to write your predictions in a .csv file that contains the following columns:

- Employee_ID
- Attrition_rate

### Evaluation criteria
The evaluation metric that is used for this problem is the root mean squared error. The formula is as follows:
 
 score = 100 * max(0,1 - root_mean_squared_error(actual_values,predicted_values))
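
The formula above can be sketched in Python as follows. This is a minimal illustration that uses only NumPy; the function name `attrition_score` and the sample values are hypothetical and are not part of the challenge materials.

```python
import numpy as np

def attrition_score(actual_values, predicted_values):
    """Return 100 * max(0, 1 - RMSE), following the formula above."""
    actual = np.asarray(actual_values, dtype=float)
    predicted = np.asarray(predicted_values, dtype=float)
    # Root mean squared error: square root of the mean squared difference
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return 100 * max(0.0, 1.0 - rmse)

# Made-up values purely for illustration
actual = [0.10, 0.20, 0.30]
predicted = [0.10, 0.20, 0.30]
print(attrition_score(actual, predicted))  # perfect predictions give 100.0
```

Because the RMSE is subtracted from 1 and the result is floored at 0, a lower error gives a higher score, and any RMSE of 1 or more scores 0.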

*Note: This notebook builds a simple baseline model (a decision tree regressor) with mean imputation of missing values and no EDA.*

In [1]:
#Importing basic libraries
import pandas as pd
import numpy as np
#Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#Read the training datasets
train = pd.read_csv('./Dataset/Train.csv')
pd.set_option('display.max_columns', None)
train.head()

Unnamed: 0,Employee_ID,Gender,Age,Education_Level,Relationship_Status,Hometown,Unit,Decision_skill_possess,Time_of_service,Time_since_promotion,growth_rate,Travel_Rate,Post_Level,Pay_Scale,Compensation_and_Benefits,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,Attrition_rate
0,EID_23371,F,42.0,4,Married,Franklin,IT,Conceptual,4.0,4,33,1,1,7.0,type2,3.0,4,0.7516,1.8688,2.0,4,5,3,0.1841
1,EID_18000,M,24.0,3,Single,Springfield,Logistics,Analytical,5.0,4,36,0,3,6.0,type2,4.0,3,-0.9612,-0.4537,2.0,3,5,3,0.067
2,EID_3891,F,58.0,3,Married,Clinton,Quality,Conceptual,27.0,3,51,0,2,8.0,type2,1.0,4,-0.9612,-0.4537,3.0,3,8,3,0.0851
3,EID_17492,F,26.0,3,Single,Lebanon,Human Resource Management,Behavioral,4.0,3,56,1,3,8.0,type2,1.0,3,-1.8176,-0.4537,,3,7,3,0.0668
4,EID_22534,F,31.0,1,Married,Springfield,Logistics,Conceptual,5.0,4,62,1,3,2.0,type3,3.0,1,0.7516,-0.4537,2.0,2,8,2,0.1827


In [3]:
#Making a copy of the train data (.copy() creates an independent DataFrame; plain assignment would only alias it)
train_data = train.copy()
train_data.head(5)

Unnamed: 0,Employee_ID,Gender,Age,Education_Level,Relationship_Status,Hometown,Unit,Decision_skill_possess,Time_of_service,Time_since_promotion,growth_rate,Travel_Rate,Post_Level,Pay_Scale,Compensation_and_Benefits,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,Attrition_rate
0,EID_23371,F,42.0,4,Married,Franklin,IT,Conceptual,4.0,4,33,1,1,7.0,type2,3.0,4,0.7516,1.8688,2.0,4,5,3,0.1841
1,EID_18000,M,24.0,3,Single,Springfield,Logistics,Analytical,5.0,4,36,0,3,6.0,type2,4.0,3,-0.9612,-0.4537,2.0,3,5,3,0.067
2,EID_3891,F,58.0,3,Married,Clinton,Quality,Conceptual,27.0,3,51,0,2,8.0,type2,1.0,4,-0.9612,-0.4537,3.0,3,8,3,0.0851
3,EID_17492,F,26.0,3,Single,Lebanon,Human Resource Management,Behavioral,4.0,3,56,1,3,8.0,type2,1.0,3,-1.8176,-0.4537,,3,7,3,0.0668
4,EID_22534,F,31.0,1,Married,Springfield,Logistics,Conceptual,5.0,4,62,1,3,2.0,type3,3.0,1,0.7516,-0.4537,2.0,2,8,2,0.1827


In [4]:
#Read the testing datasets
test = pd.read_csv('./Dataset/Test.csv')
pd.set_option('display.max_columns', None)
test.head()

Unnamed: 0,Employee_ID,Gender,Age,Education_Level,Relationship_Status,Hometown,Unit,Decision_skill_possess,Time_of_service,Time_since_promotion,growth_rate,Travel_Rate,Post_Level,Pay_Scale,Compensation_and_Benefits,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7
0,EID_22713,F,32.0,5,Single,Springfield,R&D,Conceptual,7.0,4,30,1,5,4.0,type2,1.0,3,-0.9612,-0.4537,2.0,1,8,4
1,EID_9658,M,65.0,2,Single,Lebanon,IT,Directive,41.0,2,72,1,1,1.0,type2,1.0,4,-0.9612,0.7075,1.0,2,8,2
2,EID_22203,M,52.0,3,Married,Springfield,Sales,Directive,21.0,3,25,0,1,8.0,type3,1.0,4,-0.1048,0.7075,2.0,1,9,3
3,EID_7652,M,50.0,5,Single,Washington,Marketing,Analytical,11.0,4,28,1,1,2.0,type0,4.0,3,-0.1048,0.7075,2.0,2,8,3
4,EID_6516,F,44.0,3,Married,Franklin,R&D,Conceptual,12.0,4,47,1,3,2.0,type2,4.0,4,1.6081,0.7075,2.0,2,7,4


In [5]:
#Working reference to the test data (an alias, not a copy: later in-place fills also modify `test`)
test_data = test

## Baseline Model

In [6]:
#Replacing missing values in the numeric columns of train with the column mean
train_data.fillna(train_data.mean(numeric_only=True), inplace=True)

In [7]:
train_data.isnull().sum()

Employee_ID                  0
Gender                       0
Age                          0
Education_Level              0
Relationship_Status          0
Hometown                     0
Unit                         0
Decision_skill_possess       0
Time_of_service              0
Time_since_promotion         0
growth_rate                  0
Travel_Rate                  0
Post_Level                   0
Pay_Scale                    0
Compensation_and_Benefits    0
Work_Life_balance            0
VAR1                         0
VAR2                         0
VAR3                         0
VAR4                         0
VAR5                         0
VAR6                         0
VAR7                         0
Attrition_rate               0
dtype: int64

In [8]:
X= train_data.drop(['Employee_ID','Attrition_rate'],axis =1)
y = train_data.Attrition_rate

In [9]:
X = pd.get_dummies(X)

In [10]:
X.isnull().sum()

Age                                  0
Education_Level                      0
Time_of_service                      0
Time_since_promotion                 0
growth_rate                          0
Travel_Rate                          0
Post_Level                           0
Pay_Scale                            0
Work_Life_balance                    0
VAR1                                 0
VAR2                                 0
VAR3                                 0
VAR4                                 0
VAR5                                 0
VAR6                                 0
VAR7                                 0
Gender_F                             0
Gender_M                             0
Relationship_Status_Married          0
Relationship_Status_Single           0
Hometown_Clinton                     0
Hometown_Franklin                    0
Hometown_Lebanon                     0
Hometown_Springfield                 0
Hometown_Washington                  0
Unit_Accounting and Finan

In [11]:
X.head(5)

Unnamed: 0,Age,Education_Level,Time_of_service,Time_since_promotion,growth_rate,Travel_Rate,Post_Level,Pay_Scale,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,Gender_F,Gender_M,Relationship_Status_Married,Relationship_Status_Single,Hometown_Clinton,Hometown_Franklin,Hometown_Lebanon,Hometown_Springfield,Hometown_Washington,Unit_Accounting and Finance,Unit_Human Resource Management,Unit_IT,Unit_Logistics,Unit_Marketing,Unit_Operarions,Unit_Production,Unit_Purchasing,Unit_Quality,Unit_R&D,Unit_Sales,Unit_Security,Decision_skill_possess_Analytical,Decision_skill_possess_Behavioral,Decision_skill_possess_Conceptual,Decision_skill_possess_Directive,Compensation_and_Benefits_type0,Compensation_and_Benefits_type1,Compensation_and_Benefits_type2,Compensation_and_Benefits_type3,Compensation_and_Benefits_type4
0,42.0,4,4.0,4,33,1,1,7.0,3.0,4,0.7516,1.8688,2.0,4,5,3,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
1,24.0,3,5.0,4,36,0,3,6.0,4.0,3,-0.9612,-0.4537,2.0,3,5,3,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0
2,58.0,3,27.0,3,51,0,2,8.0,1.0,4,-0.9612,-0.4537,3.0,3,8,3,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0
3,26.0,3,4.0,3,56,1,3,8.0,1.0,3,-1.8176,-0.4537,1.891078,3,7,3,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
4,31.0,1,5.0,4,62,1,3,2.0,3.0,1,0.7516,-0.4537,2.0,2,8,2,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0


In [12]:
#Replacing missing values in the numeric columns of test with the column mean
test_data.fillna(test_data.mean(numeric_only=True), inplace=True)
test_data.isnull().sum()

Employee_ID                  0
Gender                       0
Age                          0
Education_Level              0
Relationship_Status          0
Hometown                     0
Unit                         0
Decision_skill_possess       0
Time_of_service              0
Time_since_promotion         0
growth_rate                  0
Travel_Rate                  0
Post_Level                   0
Pay_Scale                    0
Compensation_and_Benefits    0
Work_Life_balance            0
VAR1                         0
VAR2                         0
VAR3                         0
VAR4                         0
VAR5                         0
VAR6                         0
VAR7                         0
dtype: int64
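
The cell above fills missing test values with the test set's own column means. A common alternative is to reuse the means computed on the training data, so that both sets are imputed with the same values. The sketch below illustrates that idea on tiny made-up frames; `toy_train` and `toy_test` are hypothetical stand-ins, not the challenge data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the train and test frames
toy_train = pd.DataFrame({"Age": [30.0, np.nan, 50.0], "Unit": ["IT", "Sales", "IT"]})
toy_test = pd.DataFrame({"Age": [np.nan, 44.0], "Unit": ["Sales", "IT"]})

# Means of the numeric columns only, computed on the training data
train_means = toy_train.mean(numeric_only=True)

# Apply the same training means to both frames
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Here the missing `Age` in `toy_test` is filled with 40.0, the mean of the two known training ages.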

In [13]:
test_data = test_data.drop('Employee_ID', axis=1)

In [14]:
test_data = pd.get_dummies(test_data)
test.isnull().sum()

Employee_ID                  0
Gender                       0
Age                          0
Education_Level              0
Relationship_Status          0
Hometown                     0
Unit                         0
Decision_skill_possess       0
Time_of_service              0
Time_since_promotion         0
growth_rate                  0
Travel_Rate                  0
Post_Level                   0
Pay_Scale                    0
Compensation_and_Benefits    0
Work_Life_balance            0
VAR1                         0
VAR2                         0
VAR3                         0
VAR4                         0
VAR5                         0
VAR6                         0
VAR7                         0
dtype: int64
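
`pd.get_dummies` is applied to the training and test frames separately, so the encoded columns match only if both files contain the same category values. One way to guard against a mismatch, sketched here on hypothetical toy frames, is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Hypothetical frames: "Sales" appears only in test, "Logistics" and "Quality" only in train
toy_train = pd.DataFrame({"Age": [42, 24, 58], "Unit": ["IT", "Logistics", "Quality"]})
toy_test = pd.DataFrame({"Age": [32, 65], "Unit": ["IT", "Sales"]})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Keep exactly the training columns, in the same order. Categories missing
# from the test data become all-zero columns; unseen categories are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

With aligned columns, a model fitted on the encoded training frame receives the features in the order it expects at prediction time.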

In [15]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size =0.3, random_state = 24)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(4900, 46)
(4900,)
(2100, 46)
(2100,)


In [16]:
from sklearn.tree import DecisionTreeRegressor
#Fit a depth-limited decision tree on the training split
dt = DecisionTreeRegressor(max_depth = 10,min_samples_leaf= 4)
dt.fit(X_train,Y_train)
#Predict on the hold-out split
predict = dt.predict(X_test)

In [17]:
from sklearn.metrics import mean_squared_error


In [18]:
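def score(y_test,y_pred):
    # NOTE: mean_squared_error returns the MSE, not the RMSE named in the evaluation
    # criteria; wrap it in np.sqrt(...) to match the challenge formula.
    return 100*max(0,1-mean_squared_error(y_test,y_pred))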

In [19]:
score(Y_test,predict)

95.83012797932878
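
A single score is hard to judge in isolation. One sanity check is to compare the model with a trivial predictor that always outputs the training mean. The sketch below does this on synthetic data, so the printed numbers say nothing about the challenge data. The arrays `X_toy` and `y_toy` and the helper `rmse_score` are hypothetical, and `random_state` is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is unrelated to the features by construction
rng = np.random.default_rng(24)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.uniform(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=24)

def rmse_score(y_true, y_pred):
    """Challenge-style score: 100 * max(0, 1 - RMSE)."""
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return 100 * max(0.0, 1.0 - rmse)

# Trivial baseline: always predict the mean of the training target
mean_model = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
# Same tree settings as the notebook, plus a fixed seed
tree_model = DecisionTreeRegressor(max_depth=10, min_samples_leaf=4, random_state=24).fit(X_tr, y_tr)

print("mean baseline:", rmse_score(y_te, mean_model.predict(X_te)))
print("decision tree:", rmse_score(y_te, tree_model.predict(X_te)))
```

If a model does not beat the mean baseline on held-out data, its score reflects the narrow range of the target rather than anything the model has learned.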

In [20]:
test_predict = dt.predict(test_data)

In [21]:
test_predict.shape

(3000,)

In [22]:
submission=pd.DataFrame(test_predict, columns=['Attrition_rate']) 

In [23]:
submission['Employee_ID']= test['Employee_ID']
submission.head(2)

Unnamed: 0,Attrition_rate,Employee_ID
0,0.316448,EID_22713
1,0.343814,EID_9658


In [24]:
pd.DataFrame(submission,columns = ['Employee_ID','Attrition_rate']).to_csv('Baseline.csv',index = False)
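
Before uploading, it can help to confirm that the file has the two required columns in the expected order. The sketch below checks the header of a hypothetical two-row submission written to an in-memory buffer instead of to disk; `toy_submission` and its values are made up.

```python
import io
import pandas as pd

# Hypothetical submission frame with the columns in the "wrong" order
toy_submission = pd.DataFrame(
    {"Attrition_rate": [0.31, 0.34], "Employee_ID": ["EID_1", "EID_2"]}
)

# Select the columns in the required order and write without the index
buffer = io.StringIO()
toy_submission[["Employee_ID", "Attrition_rate"]].to_csv(buffer, index=False)

header = buffer.getvalue().splitlines()[0]
print(header)  # Employee_ID,Attrition_rate
```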