## **Table of Contents:**
* Introduction
* Import Libraries
* Getting the Data
* Data Exploration/Analysis
* Data Preprocessing
    - Check Missing Data and Visualize 
    - Check continuous and categorical data 
    - Feature Selection 
* Logistic Regression  
    - Run LR with different degrees
    - Visualize training and testing error. Check for overfitting and underfitting
    - Hyperparameter Tuning   
* Further Evaluation 
    - Confusion Matrix
    - Precision and Recall 
    - F-Score
    - Precision Recall Curve
    - ROC AUC Curve
    - ROC AUC Score
* Submission
* Summary

# **Introduction**

Develop a logistic regression model that predicts which candidates are likely to join after accepting an offer. Which variables have a statistically significant effect on reneging?

How should sensitivity, specificity, and model accuracy be interpreted? Calculate the AUC and use it to comment on the logistic regression model.

How will outliers be handled?

What will the model deployment strategy be?
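
As a rough illustration of the evaluation questions above, the sketch below computes sensitivity, specificity, accuracy, and AUC on a handful of invented labels and probabilities. The arrays `y_true` and `y_prob` are made up for the example and do not come from the dataset; treating class 1 as "did not join" is also an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented labels for illustration: 1 = did not join (assumed), 0 = joined
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
# Hypothetical predicted probabilities of class 1
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.8, 0.7, 0.3, 0.2, 0.9, 0.1])
# Turn probabilities into hard predictions at a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                 # share of actual class-1 cases detected
specificity = tn / (tn + fp)                 # share of actual class-0 cases detected
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
auc = roc_auc_score(y_true, y_prob)          # threshold-free ranking quality

print(sensitivity, specificity, accuracy, auc)
```

Sensitivity, specificity, and accuracy all depend on the chosen threshold, whereas AUC summarizes how well the probabilities rank class-1 cases above class-0 cases across every threshold.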



# **Import Libraries**

In [0]:
import numpy as np 
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.feature_selection import RFE

In [5]:
file_path = '/content/sample_data/IMB533_HR Analytics without Missing Values.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,SLNO,Candidate.Ref,DOJ.Extended,Duration.to.accept.offer,Notice.period,Offered.band,Pecent.hike.expected.in.CTC,Percent.hike.offered.in.CTC,Percent.difference.CTC,Joining.Bonus,Candidate.relocate.actual,Gender,Candidate.Source,Rex.in.Yrs,LOB,Location,Age,Status
0,1,2110407,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
1,2,2112635,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
2,3,2112838,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
3,4,2115021,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
4,5,2115125,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined


In [6]:
# Drop unwanted identifier columns
df = df.drop(['SLNO', 'Candidate.Ref'], axis=1)
df.head()

Unnamed: 0,DOJ.Extended,Duration.to.accept.offer,Notice.period,Offered.band,Pecent.hike.expected.in.CTC,Percent.hike.offered.in.CTC,Percent.difference.CTC,Joining.Bonus,Candidate.relocate.actual,Gender,Candidate.Source,Rex.in.Yrs,LOB,Location,Age,Status
0,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
1,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
2,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
3,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
4,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined


In [7]:
# Convert absolute age to an age group, making it a categorical feature

df['Age'] = df['Age'].astype(int)
df.loc[ df['Age'] <= 20, 'Age'] = 0
df.loc[(df['Age'] > 20) & (df['Age'] <= 30), 'Age'] = 1
df.loc[(df['Age'] > 30) & (df['Age'] <= 40), 'Age'] = 2
df.loc[(df['Age'] > 40) & (df['Age'] <= 50), 'Age'] = 3
df.loc[(df['Age'] > 50) & (df['Age'] <= 60), 'Age'] = 4
df.loc[ df['Age'] > 60, 'Age'] = 5

df.Age.unique()

array([2, 1, 3, 0, 5, 4])
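
The same binning can probably be expressed more compactly with `pd.cut`. This is a minimal sketch on invented ages, not the notebook's own code; the `bins` edges simply mirror the thresholds used in the cell above.

```python
import numpy as np
import pandas as pd

# Invented ages for illustration only
ages = pd.Series([19, 20, 21, 27, 30, 34, 45, 58, 63])

# Edges mirror the cell above: <=20, (20, 30], (30, 40], (40, 50], (50, 60], >60
bins = [-np.inf, 20, 30, 40, 50, 60, np.inf]

# labels=False returns the integer bin index; intervals are right-inclusive by default
age_group = pd.cut(ages, bins=bins, labels=False)

print(age_group.tolist())
```

Because `pd.cut` assigns every bin in one pass, it avoids any dependence on the order in which the `df.loc` assignments are executed.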

In [8]:
df.head()

Unnamed: 0,DOJ.Extended,Duration.to.accept.offer,Notice.period,Offered.band,Pecent.hike.expected.in.CTC,Percent.hike.offered.in.CTC,Percent.difference.CTC,Joining.Bonus,Candidate.relocate.actual,Gender,Candidate.Source,Rex.in.Yrs,LOB,Location,Age,Status
0,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,2,Joined
1,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,2,Joined
2,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,1,Joined
3,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,2,Joined
4,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,2,Joined


In [0]:
categorical_cols = ['DOJ.Extended', 'Age', 'Offered.band', 'Joining.Bonus', 'Candidate.Source', 'Candidate.relocate.actual', 'Gender', 'LOB', 'Location', 'Status']
continuous_cols = list(set(df.columns) - set(categorical_cols))


In [14]:
# Convert categorical strings into integer codes (the encoder is refit on each column)
le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df.head()


Unnamed: 0,DOJ.Extended,Duration.to.accept.offer,Notice.period,Offered.band,Pecent.hike.expected.in.CTC,Percent.hike.offered.in.CTC,Percent.difference.CTC,Joining.Bonus,Candidate.relocate.actual,Gender,Candidate.Source,Rex.in.Yrs,LOB,Location,Age,Status
0,1,14,30,2,-20.79,13.16,42.86,0,0,0,0,7,4,8,2,0
1,0,18,30,2,50.0,320.0,180.0,0,0,1,2,8,7,2,2,0
2,0,3,45,2,42.84,42.84,0.0,0,0,1,0,4,7,8,1,0
3,0,26,30,2,42.84,42.84,0.0,0,0,1,2,4,7,8,2,0
4,1,1,120,2,42.59,42.59,0.0,0,1,1,2,6,7,8,2,0
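
Reusing a single `LabelEncoder` means only the last column's mapping survives, so the integer codes above cannot be decoded afterwards. One possible alternative, sketched here on an invented frame, keeps one encoder per column. The `toy` data and the `"Direct"` category are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows; "Direct" is an assumed category used only for this example
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Candidate.Source": ["Agency", "Employee Referral", "Direct"],
})

encoders = {}          # one fitted encoder per column, kept for later decoding
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted order of the class labels
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# Codes can be turned back into the original strings
print(encoders["Gender"].inverse_transform([0, 1]))
```

Note that integer codes impose an arbitrary order on nominal features such as `Location`; one-hot encoding is a common alternative when feeding such features to a linear model.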


In [15]:
# Min-max normalize the continuous columns to the [0, 1] range


df[continuous_cols] = df[continuous_cols].apply(lambda x:(x-x.min()) / (x.max()-x.min()))
df[continuous_cols].head()

Unnamed: 0,Percent.difference.CTC,Duration.to.accept.offer,Pecent.hike.expected.in.CTC,Notice.period,Rex.in.Yrs,Percent.hike.offered.in.CTC
0,0.299861,0.535398,0.112086,0.25,0.291667,0.138525
1,0.673265,0.544248,0.277252,0.25,0.333333,0.715336
2,0.183162,0.511062,0.260546,0.375,0.166667,0.194319
3,0.183162,0.561947,0.260546,0.25,0.166667,0.194319
4,0.183162,0.506637,0.259963,1.0,0.25,0.193849
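
The manual formula above should match scikit-learn's `MinMaxScaler`, which was imported earlier. The sketch below checks that on an invented array and then shows one way to fit the scaler on the training split only, so that test rows do not influence the scaling. `X_toy` and the split parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Invented values standing in for two continuous columns
X_toy = np.array([[30.0, 4.0], [45.0, 7.0], [120.0, 6.0],
                  [0.0, 1.0], [60.0, 8.0], [90.0, 3.0]])

# Manual min-max scaling, column by column, as in the cell above
manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# The library equivalent gives the same result on the same data
scaled = MinMaxScaler().fit_transform(X_toy)
print(np.allclose(manual, scaled))

# Fit on the training rows only, then apply the same scaling to the test rows
X_train, X_test = train_test_split(X_toy, test_size=0.33, random_state=0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # may fall outside [0, 1]
```

Scaled test values can land slightly outside [0, 1] when a test row exceeds the training minimum or maximum, which is expected behavior rather than an error.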


# **Feature Selection**

In [19]:
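# The first 15 columns are features; the last column, Status, is the target
features = df.iloc[:, 0:15]
classes = df['Status']
X = np.array(features)
y = np.array(classes)  # already 1-D, so no transpose is needed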
print('Feature set shape:', X.shape)
print('Response class shape:', y.shape)

Feature set shape: (9011, 15)
Response class shape: (9011,)


Let's use recursive feature elimination (RFE) with logistic regression for feature **selection**



In [22]:
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)
# support_ is a boolean mask marking the features that RFE kept
selected_features = [val for idx, val in enumerate(features.columns) if fit.support_[idx]]
print("Num Features: " + str(fit.n_features_))
print("Selected Features: " + str(selected_features))
print("All Features: " + str(list(features.columns)))
print("Feature Ranking: " + str(fit.ranking_))



Num Features: 3
Selected Features: ['Notice.period', 'Percent.difference.CTC', 'Candidate.relocate.actual']
All Features: ['DOJ.Extended', 'Duration.to.accept.offer', 'Notice.period', 'Offered.band', 'Pecent.hike.expected.in.CTC', 'Percent.hike.offered.in.CTC', 'Percent.difference.CTC', 'Joining.Bonus', 'Candidate.relocate.actual', 'Gender', 'Candidate.Source', 'Rex.in.Yrs', 'LOB', 'Location', 'Age']
Feature Ranking: [ 9 11  1  5  6  2  1  7  1 10  3  4 12 13  8]
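
The ranking array is easier to read when paired with feature names: rank 1 marks a selected feature, and higher ranks were eliminated earlier. The sketch below shows this on synthetic data from `make_classification`; the feature names `f0` to `f5` are placeholders, not columns from the dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration; names f0..f5 are placeholders
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(6)]

# Repeatedly drop the weakest feature until three remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_toy, y_toy)

# Label each rank with its feature name and sort so the kept features come first
ranking = pd.Series(rfe.ranking_, index=names).sort_values()
selected = ranking[ranking == 1].index.tolist()

print(ranking)
print(selected)
```

Applied to the output above, the three entries with rank 1 correspond to `Notice.period`, `Percent.difference.CTC`, and `Candidate.relocate.actual`.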




Let's use a random forest's impurity-based feature importances for feature **selection**

In [33]:
from sklearn.ensemble import RandomForestClassifier

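# No random_state is set, so the importance scores vary slightly between runs
rfc = RandomForestClassifier()
rfc.fit(X, y)

# Pair each feature name with its importance score and list them in descending order
importance_scores = rfc.feature_importances_
feature_importances = list(zip(features.columns, importance_scores))
sorted(feature_importances, key=lambda x: -x[1])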



[('Duration.to.accept.offer', 0.16769265517608564),
 ('Percent.hike.offered.in.CTC', 0.14913707202123092),
 ('Pecent.hike.expected.in.CTC', 0.13664230634870175),
 ('Percent.difference.CTC', 0.1141833320884707),
 ('Rex.in.Yrs', 0.07601830102113195),
 ('Notice.period', 0.0670517864489991),
 ('LOB', 0.06013468811760439),
 ('Location', 0.05464746916643272),
 ('Candidate.relocate.actual', 0.03672242190705427),
 ('Candidate.Source', 0.034091067886480815),
 ('DOJ.Extended', 0.026735716871381918),
 ('Age', 0.023839008086709752),
 ('Offered.band', 0.023254699410263636),
 ('Gender', 0.02082353880911064),
 ('Joining.Bonus', 0.009025936640342026)]

In [34]:
# Keep the features whose importance score is at least 0.05
selected_features = [(name, score) for name, score in feature_importances if score >= 0.05]
selected_features

[('Duration.to.accept.offer', 0.16769265517608564),
 ('Notice.period', 0.0670517864489991),
 ('Pecent.hike.expected.in.CTC', 0.13664230634870175),
 ('Percent.hike.offered.in.CTC', 0.14913707202123092),
 ('Percent.difference.CTC', 0.1141833320884707),
 ('Rex.in.Yrs', 0.07601830102113195),
 ('LOB', 0.06013468811760439),
 ('Location', 0.05464746916643272)]

In [41]:
# Keep only the feature names, dropping the importance scores
selected_features = [name for name, _ in selected_features]

df[selected_features].head()

Unnamed: 0,Duration.to.accept.offer,Notice.period,Pecent.hike.expected.in.CTC,Percent.hike.offered.in.CTC,Percent.difference.CTC,Rex.in.Yrs,LOB,Location
0,0.535398,0.25,0.112086,0.138525,0.299861,0.291667,4,8
1,0.544248,0.25,0.277252,0.715336,0.673265,0.333333,7,2
2,0.511062,0.375,0.260546,0.194319,0.183162,0.166667,7,8
3,0.561947,0.25,0.260546,0.194319,0.183162,0.166667,7,8
4,0.506637,1.0,0.259963,0.193849,0.183162,0.25,7,8
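
The importance-based selection above can also be written with a pandas `Series`, which keeps names and scores together. This is a sketch on synthetic data rather than the notebook's own run: the frame `toy`, its column names, and `random_state=0` are assumptions, and the 0.05 cutoff is reused from the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration; column names f0..f7 are placeholders
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
toy = pd.DataFrame(X_toy, columns=[f"f{i}" for i in range(8)])

# Fixing random_state makes the importance scores reproducible between runs
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(toy, y_toy)

# Index the scores by feature name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)

# Apply the same 0.05 cutoff and keep only those columns
kept = importances[importances >= 0.05].index.tolist()
toy_selected = toy[kept]

print(importances)
print(toy_selected.shape)
```

Impurity-based importances sum to 1 across all features, so a fixed cutoff such as 0.05 becomes stricter as the number of features grows.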


# **Build Model** 