# TASK 3:
## Train a model that predicts employee performance from the input factors. The model is intended to support hiring decisions.
1. Import Libraries
2. Encoding
3. Split the data
4. Train Model
>  The notebook below uses the chi-squared (chi²) statistical test, which requires non-negative features, to select the best features from the dataset.

In [1]:
#Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import warnings  # Suppress library warnings to keep the notebook output readable
warnings.simplefilter('ignore')

In [2]:
#Load the dataset using Pickle
data=pd.read_pickle('data_eda')
#data=pd.read_excel('inx_emp.xls')
data.shape

(1009, 28)

## Encoding
*  LabelEncoder and OneHotEncoder apply only to categorical features.
*  We first need to extract the categorical features using a boolean mask.
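The boolean-mask idea can be sketched on a small toy frame. The values below are hypothetical, only the column names are borrowed from the dataset, and "categorical" is assumed here to mean "non-numeric". The `select_dtypes` call is an alternative spelling of the same filter, not what the notebook itself uses.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with made-up values (assumption: not taken from the real dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [32, 47, 40],
    "OverTime": ["No", "Yes", "No"],
})

# Boolean mask: True where a column is non-numeric (treated here as categorical)
mask = [not is_numeric_dtype(dtype) for dtype in toy.dtypes]
categorical_cols = toy.columns[mask].tolist()

# select_dtypes expresses the same filter without an explicit mask
same_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(same_cols)
```

Both lists name the same columns; the explicit mask is useful when the selection rule needs to be more specific than a dtype family.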

In [3]:
# Categorical boolean mask
categorical = data.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = data.columns[categorical].tolist()
categorical_cols

['EmpNumber',
 'Gender',
 'EducationBackground',
 'MaritalStatus',
 'EmpDepartment',
 'EmpJobRole',
 'BusinessTravelFrequency',
 'OverTime',
 'Attrition']

* LabelEncoder converts each class of a given feature to an integer code.
* DataFrame.apply applies a function along an axis of the DataFrame; here it is called once per categorical column.
* A lambda is a function without a name. It can take any number of arguments but can contain only one expression.
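As a rough illustration of label encoding, the sketch below encodes two toy columns and keeps one fitted encoder per column in a dictionary named `encoders` (a hypothetical helper, not part of the notebook). Keeping the encoders is one way to map the integer codes back to the original labels later; the toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up values (assumption: not taken from the real dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
})

# Fit a separate encoder per column so every mapping stays available
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)
# Classes are stored in sorted order; the position of a class is its integer code
print(list(encoders["Gender"].classes_))
# inverse_transform maps codes back to the original labels
print(list(encoders["Gender"].inverse_transform([0, 1])))
```

LabelEncoder assigns codes in sorted order of the class labels, so the codes carry no meaning beyond identity.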

In [4]:
# Instantiate a LabelEncoder; it is refit on every column, so only the last column's mapping is retained
enc = LabelEncoder()
data[categorical_cols] = data[categorical_cols].apply(lambda col: enc.fit_transform(col))
data[categorical_cols].head()

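|  | EmpNumber | Gender | EducationBackground | MaritalStatus | EmpDepartment | EmpJobRole | BusinessTravelFrequency | OverTime | Attrition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | 2 | 5 | 13 | 2 | 0 | 0 |
| 1 | 1 | 1 | 2 | 2 | 5 | 13 | 2 | 0 | 0 |
| 2 | 2 | 1 | 1 | 1 | 5 | 13 | 1 | 1 | 0 |
| 4 | 3 | 1 | 2 | 2 | 5 | 13 | 2 | 0 | 0 |
| 5 | 4 | 1 | 1 | 0 | 1 | 3 | 1 | 0 | 0 |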


In [5]:
data.head()

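|  | EmpNumber | Age | Gender | EducationBackground | MaritalStatus | EmpDepartment | EmpJobRole | BusinessTravelFrequency | DistanceFromHome | EmpEducationLevel | ... | EmpRelationshipSatisfaction | TotalWorkExperienceInYears | TrainingTimesLastYear | EmpWorkLifeBalance | ExperienceYearsAtThisCompany | ExperienceYearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition | PerformanceRating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 32 | 1 | 2 | 2 | 5 | 13 | 2 | 10 | 3 | ... | 4 | 10 | 2 | 2 | 10 | 7 | 0 | 8 | 0 | 3 |
| 1 | 1 | 47 | 1 | 2 | 2 | 5 | 13 | 2 | 14 | 4 | ... | 4 | 20 | 2 | 3 | 7 | 7 | 1 | 7 | 0 | 3 |
| 2 | 2 | 40 | 1 | 1 | 1 | 5 | 13 | 1 | 5 | 4 | ... | 3 | 20 | 2 | 3 | 18 | 13 | 1 | 12 | 0 | 4 |
| 4 | 3 | 60 | 1 | 2 | 2 | 5 | 13 | 2 | 16 | 4 | ... | 4 | 10 | 1 | 3 | 2 | 2 | 2 | 2 | 0 | 3 |
| 5 | 4 | 27 | 1 | 1 | 0 | 1 | 3 | 1 | 10 | 2 | ... | 3 | 9 | 4 | 2 | 9 | 7 | 1 | 7 | 0 | 4 |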


## Feature Selection
* Statistical tests can be used to select those features that have the strongest relationship with the output variable.
* The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.
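A minimal sketch of SelectKBest with the chi² score function, on made-up non-negative data, may help make this concrete. The column names `signal` and `noise` are hypothetical: `signal` is constructed to track the target and `noise` is not. The scores are labelled with the columns of the feature matrix itself, and `get_support` reports which columns were kept.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative features (assumption: values invented for illustration)
X_toy = pd.DataFrame({
    "signal": [0, 0, 0, 9, 9, 9],  # follows the target closely
    "noise": [1, 2, 1, 2, 1, 2],   # roughly independent of the target
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

# Keep the single highest-scoring feature
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Label each score with its feature name
scores = pd.Series(selector.scores_, index=X_toy.columns)
# get_support returns a boolean mask over the input columns
selected = X_toy.columns[selector.get_support()].tolist()

print(scores.sort_values(ascending=False))
print(selected)
```

A higher chi² score indicates a stronger dependence between a feature and the target, which is why `signal` ranks first here.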


In [6]:
X = data.iloc[:, :-1]  # independent columns (everything except the last column)
y = data.PerformanceRating  # target column
# Apply SelectKBest with the chi2 score function to extract the 10 best features
# Note: EmpNumber is an encoded identifier, so its score is unlikely to reflect real predictive value
best_features = SelectKBest(score_func=chi2, k=10)
fit = best_features.fit(X, y)
scores = pd.DataFrame(fit.scores_)
columns = pd.DataFrame(X.columns)
# Concatenate the two dataframes for better visualization
featureScores = pd.concat([columns, scores], axis=1)
featureScores.columns = ['best_feature', 'score']  # name the dataframe columns
print(featureScores.nlargest(10, 'score'))  # print the 10 best features

                    best_feature       score
17      EmpLastSalaryHikePercent  255.527147
24       YearsSinceLastPromotion  165.385986
0                      EmpNumber  139.442668
10    EmpEnvironmentSatisfaction   78.662379
23  ExperienceYearsInCurrentRole   46.990806
5                  EmpDepartment   42.328356
25          YearsWithCurrManager   42.232830
6                     EmpJobRole   41.143656
22  ExperienceYearsAtThisCompany   26.202212
8               DistanceFromHome   13.816698


In [7]:
# View the best features and their scores as a dataframe
featureScores.sort_values(by='score',ascending=False).head(8)

|  | best_feature | score |
| --- | --- | --- |
| 17 | EmpLastSalaryHikePercent | 255.527147 |
| 24 | YearsSinceLastPromotion | 165.385986 |
| 0 | EmpNumber | 139.442668 |
| 10 | EmpEnvironmentSatisfaction | 78.662379 |
| 23 | ExperienceYearsInCurrentRole | 46.990806 |
| 5 | EmpDepartment | 42.328356 |
| 25 | YearsWithCurrManager | 42.23283 |
| 6 | EmpJobRole | 41.143656 |


* Store the names of the 10 best features in the 'x1' variable as a list.

In [8]:
x1=(featureScores.nlargest(10,'score')).iloc[:,0].tolist()
x1

['EmpLastSalaryHikePercent',
 'YearsSinceLastPromotion',
 'EmpNumber',
 'EmpEnvironmentSatisfaction',
 'ExperienceYearsInCurrentRole',
 'EmpDepartment',
 'YearsWithCurrManager',
 'EmpJobRole',
 'ExperienceYearsAtThisCompany',
 'DistanceFromHome']

### Dividing the dataset into training and test sets
* Having analyzed the dataset, we divide it into training and test sets in a 70:30 ratio using train_test_split, which shuffles the rows randomly before splitting.
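The effect of `test_size` and `random_state` can be checked on a toy array. The sketch below uses ten made-up rows, confirms the 70:30 split sizes, and shows that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two features each (assumption: values invented for illustration)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([3, 3, 3, 3, 3, 4, 4, 4, 4, 4])

# 70:30 split; rows are shuffled before splitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)

# Same random_state, same shuffle, same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the reported accuracy comparable between runs, since the model is always evaluated on the same held-out rows.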


In [9]:
X=data.loc[:,x1] #Update input variable with Best Feature
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3 ,random_state=10)

### Train Model
* Using RandomForestClassifier

In [10]:
from sklearn.ensemble import RandomForestClassifier
# Reuses X_train, X_test, y_train, y_test from the split in the previous cell
model = RandomForestClassifier(max_depth=5, random_state=1, n_estimators=25)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print(accuracy_score(y_test, y_predict))

0.9537953795379538


### Save the trained model

In [11]:

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
joblib.dump(model, 'emp_rating_model')

['emp_rating_model']
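To check that a dumped model can be restored, a round trip along the following lines could be used. This is a sketch on a tiny made-up dataset with a temporary directory, not the notebook's own model or file; the `.joblib` file name is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny toy dataset (assumption: values invented for illustration)
X_toy = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]])
y_toy = np.array([3, 4, 3, 4, 3, 4])

toy_model = RandomForestClassifier(
    n_estimators=5, max_depth=2, random_state=1
).fit(X_toy, y_toy)

# Dump to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original
same = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same)
```

A file written by joblib is generally only safe to load with a compatible scikit-learn version, so recording the version alongside the file is worth considering.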

### Store the test data using to_pickle

In [12]:
X_test.to_pickle('x_test')
y_test.to_pickle('y_test')
