## Problem 4

In recruitment, HR often has to judge whether a candidate is overstating their current salary. For example, a candidate claims 5 years of experience and a monthly income of 70,000 as a regional manager, and expects an offer above that previous CTC. We need a way to verify the claim: is 70,000 a month plausible for a regional manager with 5 years of experience, or does the candidate likely earn less? Build a Decision Tree and a Random Forest model with monthly income as the target variable.

In [13]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('HR_DT.csv')
data.isnull().sum()

Position of the employee                 0
no of Years of Experience of employee    0
 monthly income of employee              0
dtype: int64

In [14]:
data.head()

Unnamed: 0,Position of the employee,no of Years of Experience of employee,monthly income of employee
0,Business Analyst,1.1,39343
1,Junior Consultant,1.3,46205
2,Senior Consultant,1.5,37731
3,Manager,2.0,43525
4,Country Manager,2.2,39891


In [16]:
data[data[' monthly income of employee']>70000]

Unnamed: 0,Position of the employee,no of Years of Experience of employee,monthly income of employee
17,Manager,5.3,83088
18,Country Manager,5.9,81363
19,Region Manager,6.0,93940
20,Partner,6.8,91738
21,Senior Partner,7.1,98273
...,...,...,...
175,C-level,9.0,105582
176,Senior Consultant,9.5,116969
177,Manager,9.6,112635
178,Country Manager,10.3,122391


In [31]:
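# Inspect a single row (this cell ran after the binning cell In [28],
# so the income column already holds a class label)
data.loc[13]
#data.loc[20]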

Position of the employee                 Region Manager
no of Years of Experience of employee               4.1
 monthly income of employee                        less
Name: 13, dtype: object

In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 3 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Position of the employee               196 non-null    object 
 1   no of Years of Experience of employee  196 non-null    float64
 2    monthly income of employee            196 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 4.7+ KB


In [27]:
data.shape

(196, 3)

In [28]:
# Make two classes: 'less' (income <= 70000) and 'more' (income > 70000)
data[' monthly income of employee'] = pd.cut(
    data[' monthly income of employee'],
    bins=[data[' monthly income of employee'].min() - 1, 70000,
          data[' monthly income of employee'].max()],
    labels=['less', 'more'])
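
The binning step relies on `pd.cut` producing right-closed intervals by default, so an income of exactly 70000 falls into the lower class. A minimal sketch of that behaviour, using made-up incomes rather than rows from `HR_DT.csv` (the names `income`, `bins` and `income_class` are illustrative only):

```python
import pandas as pd

# Hypothetical incomes chosen to sit on and around the 70000 edge
income = pd.Series([39343, 70000, 70001, 122391])

# Same construction as the notebook cell: (min - 1, 70000] and (70000, max]
bins = [income.min() - 1, 70000, income.max()]
income_class = pd.cut(income, bins=bins, labels=['less', 'more'])

print(income_class.tolist())
```

Subtracting 1 from the minimum keeps the smallest income inside the first interval, because the left edge of each interval is open.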

In [29]:
data.head()

Unnamed: 0,Position of the employee,no of Years of Experience of employee,monthly income of employee
0,Business Analyst,1.1,less
1,Junior Consultant,1.3,less
2,Senior Consultant,1.5,less
3,Manager,2.0,less
4,Country Manager,2.2,less


In [32]:
data[' monthly income of employee'].value_counts()

less    118
more     78
Name:  monthly income of employee, dtype: int64

In [33]:
data["Position of the employee"].value_counts()

Partner              28
Senior Partner       25
C-level              24
Region Manager       23
CEO                  23
Country Manager      18
Manager              17
Senior Consultant    16
Junior Consultant    14
Business Analyst      8
Name: Position of the employee, dtype: int64

In [34]:
# Encode the categorical 'Position of the employee' column as integers
lb = LabelEncoder()
data["Position of the employee"] = lb.fit_transform(data["Position of the employee"])

In [35]:
data["Position of the employee"].value_counts()

6    28
9    25
1    24
7    23
2    23
3    18
5    17
8    16
4    14
0     8
Name: Position of the employee, dtype: int64
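
To read the integer codes back as job titles, the fitted encoder's `classes_` attribute can be inspected. The sketch below fits a separate `LabelEncoder` on the ten position names listed in the earlier `value_counts()` output; the variable names (`positions`, `encoder`, `mapping`) are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The ten distinct positions shown in the value_counts() output
positions = ['Partner', 'Senior Partner', 'C-level', 'Region Manager', 'CEO',
             'Country Manager', 'Manager', 'Senior Consultant',
             'Junior Consultant', 'Business Analyst']

encoder = LabelEncoder()
encoder.fit(positions)

# classes_ stores the labels in sorted order; each label's index is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Convert in both directions without hard-coding the integers
print(encoder.transform(['Region Manager']))
print(encoder.inverse_transform([7]))
```

Note that these codes are arbitrary alphabetical ranks, not a seniority ordering; tree-based models can still split on them, but the numeric order carries no meaning.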

In [36]:
data.sample(10)

Unnamed: 0,Position of the employee,no of Years of Experience of employee,monthly income of employee
35,1,2.9,less
17,5,5.3,more
102,7,4.0,less
96,2,3.0,less
178,3,10.3,more
44,8,4.5,less
132,2,4.0,less
110,0,6.8,more
105,1,4.9,less
67,1,3.2,less


In [37]:
data.dtypes

Position of the employee                    int32
no of Years of Experience of employee     float64
 monthly income of employee              category
dtype: object

In [38]:
# Check that no numeric predictor has zero variance
data.var(numeric_only=True)

Position of the employee                 7.874071
no of Years of Experience of employee    7.750619
dtype: float64

In [39]:
colnames = list(data.columns)
predictors = colnames[:2]
target = colnames[2]

In [40]:
predictors

['Position of the employee', 'no of Years of Experience of employee']

In [41]:
target

' monthly income of employee'

In [42]:
# Splitting data into training and testing data set
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size = 0.3)
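
The split above uses no fixed seed, so the train and test rows change on every run. A hedged sketch of a reproducible, class-balanced split follows; the `toy` frame is synthetic, and the choices of `random_state=42` and `stratify` are assumptions rather than settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded HR data (all values are made up)
toy = pd.DataFrame({
    'position': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 2,
    'experience': [1.0 + 0.5 * i for i in range(20)],
    'income_class': ['less'] * 12 + ['more'] * 8,
})

# random_state fixes the shuffle; stratify keeps the class ratio similar
# in both parts
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy['income_class'])

print(len(train_toy), len(test_toy))
print(test_toy['income_class'].value_counts())
```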

In [43]:
from sklearn.tree import DecisionTreeClassifier as DT
model = DT(criterion = 'gini')
model.fit(train[predictors], train[target])

DecisionTreeClassifier()

In [44]:
# Prediction on Test Data
preds = model.predict(test[predictors])
pd.crosstab(test[target], preds, rownames=['Actual'], colnames=['Predictions'])


Predictions  less  more
Actual
less           33     0
more            0    26


In [45]:
np.mean(preds == test[target]) # Test Data Accuracy 

1.0

In [46]:
# Print the confusion matrix (alternate way)
from sklearn import metrics
metrics.confusion_matrix(test[target], preds)

array([[33,  0],
       [ 0, 26]], dtype=int64)

In [47]:
# Print the precision and recall for both classes
print(metrics.classification_report(test[target], preds, digits=2))

              precision    recall  f1-score   support

        less       1.00      1.00      1.00        33
        more       1.00      1.00      1.00        26

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [48]:
# Prediction on Train Data
preds = model.predict(train[predictors])
pd.crosstab(train[target], preds, rownames = ['Actual'], colnames = ['Predictions'])

Predictions  less  more
Actual
less           85     0
more            0    52


In [49]:
np.mean(preds == train[target]) # Train Data Accuracy

1.0

In [50]:
# Print the confusion matrix (alternate way); metrics was imported above
metrics.confusion_matrix(train[target], preds)

array([[85,  0],
       [ 0, 52]], dtype=int64)

In [51]:
# Print the precision and recall for both classes
print(metrics.classification_report(train[target], preds, digits=2))

              precision    recall  f1-score   support

        less       1.00      1.00      1.00        85
        more       1.00      1.00      1.00        52

    accuracy                           1.00       137
   macro avg       1.00      1.00      1.00       137
weighted avg       1.00      1.00      1.00       137
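
Accuracy from a single 70/30 split is one estimate from one random partition. As a complementary check, k-fold cross-validation averages the score over several partitions. The sketch below runs on synthetic data with a made-up labelling rule, so its numbers say nothing about `HR_DT.csv`; `X`, `y` and `scores` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic predictors: an encoded position and years of experience
rng = np.random.default_rng(0)
n = 200
position = rng.integers(0, 10, size=n)
experience = rng.uniform(1.0, 10.5, size=n)
X = np.column_stack([position, experience])

# Made-up rule for illustration: the class depends only on experience
y = np.where(experience > 5.0, 'more', 'less')

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```

On the real data, the same call would take `data[predictors]` and `data[target]` in place of `X` and `y`.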



In [52]:
# Build a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion = 'entropy', random_state = 22)
rfc.fit(train[predictors], train[target])

RandomForestClassifier(criterion='entropy', random_state=22)

In [53]:
# Evaluating on Training set
rfc_pred_train = rfc.predict(train[predictors])
metrics.confusion_matrix(train[target], rfc_pred_train)

array([[85,  0],
       [ 0, 52]], dtype=int64)

In [54]:
# Print the precision and recall and f1 score, for 2 classes
print(metrics.classification_report(train[target], rfc_pred_train, digits=2))

              precision    recall  f1-score   support

        less       1.00      1.00      1.00        85
        more       1.00      1.00      1.00        52

    accuracy                           1.00       137
   macro avg       1.00      1.00      1.00       137
weighted avg       1.00      1.00      1.00       137



In [55]:
# Evaluating on Test set
rfc_pred_test = rfc.predict(test[predictors])
metrics.confusion_matrix(test[target], rfc_pred_test)

array([[33,  0],
       [ 0, 26]], dtype=int64)

In [56]:
# Print the precision and recall and f1 score, for 2 classes
print(metrics.classification_report(test[target], rfc_pred_test, digits=2))

              precision    recall  f1-score   support

        less       1.00      1.00      1.00        33
        more       1.00      1.00      1.00        26

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [60]:
# 7 is the label-encoded value for Region Manager; 5 is the years of experience.
rfc.predict([[7,5]])



array(['less'], dtype=object)

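The random forest predicts the 'less' class, so a Region Manager with 5 years of experience is expected to earn at most 70,000 per month; the claimed 70,000 sits at the very top of that range.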
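
Passing a bare list such as `[[7, 5]]` depends on remembering both the column order and the integer code for each position. A minimal sketch of a more explicit query is shown below: it builds a one-row DataFrame with the training column names and lets the encoder translate the job title. The training rows are synthetic and the names `toy`, `forest` and `candidate` are hypothetical, so the printed class is not a statement about the real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

features = ['Position of the employee',
            'no of Years of Experience of employee']

# Synthetic training rows (made up for illustration)
toy = pd.DataFrame({
    features[0]: ['Manager', 'Region Manager', 'Partner',
                  'Manager', 'Region Manager', 'Partner'],
    features[1]: [2.0, 4.1, 6.8, 9.6, 6.0, 3.0],
    'income_class': ['less', 'less', 'more', 'more', 'more', 'less'],
})

encoder = LabelEncoder()
toy[features[0]] = encoder.fit_transform(toy[features[0]])

forest = RandomForestClassifier(n_estimators=50, random_state=22)
forest.fit(toy[features], toy['income_class'])

# Query by title and experience; column names match the training frame
candidate = pd.DataFrame({
    features[0]: encoder.transform(['Region Manager']),
    features[1]: [5.0],
})
print(forest.predict(candidate))
print(forest.predict_proba(candidate))  # columns follow forest.classes_
```

`predict_proba` is worth printing alongside the label, since a candidate near the class boundary may receive a split vote across the trees.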