# Predicting Interview No-shows

**Problem statement:**

Your company, Acme Co., sources candidates for companies hiring new employees. Recently, a number of our clients have complained that candidates have not been showing up to interviews. Your boss has provided you with the attached data set in hopes that you can find some way of identifying candidates at risk of not attending scheduled interviews.

Candidate models: logistic regression or a tree-based method. There are quite a few categorical features, so I may be leaning towards trees.

cols:
- Date of Interview
- Client name
- Industry
- Location
- Position to be closed
- Nature of Skillset 
- Interview Type
- Name(Cand ID) 
- Gender 
- Candidate Current Location
- Candidate Job Location
- Interview Venue
- Candidate Native location
- Have you obtained the necessary permission to start at the required time
- Hope there will be no unscheduled meetings
- Can I Call you three hours before the interview and follow up on your attendance for the interview
- Can I have an alternative number/ desk number. I assure you that I will not trouble you too much
- Have you taken a printout of your updated resume. Have you read the JD and understood the same
- Are you clear with the venue details and the landmark.
- Has the call letter been shared
- Observed Attendance
- Marital Status



**Import Libraries:**

In [10]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report

Read in the data:

In [11]:
data = pd.read_csv("Interview_Input.csv")

### EDA

We now need to look for missing values, data types, correlations, etc. First let's deal with missing values.

Drop the empty columns that were read in:

In [12]:
data.drop(['Unnamed: 22','Unnamed: 23','Unnamed: 24','Unnamed: 25','Unnamed: 26'],axis=1,inplace=True)
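
A more general alternative is to drop every column whose name starts with `Unnamed` instead of listing them by hand. Below is a minimal sketch on a toy frame; the names `raw` and `cleaned` and the sample values are made up for illustration, and the real file has many more columns.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical values).
raw = pd.DataFrame({
    "Date of Interview": ["13.02.2015", "30.1.16"],
    "Client name": ["Hospira", "Pfizer"],
    "Unnamed: 22": [None, None],
    "Unnamed: 23": [None, None],
})

# Boolean mask marking the columns whose name starts with "Unnamed".
unnamed_mask = raw.columns.str.startswith("Unnamed")

# Keep only the columns that are NOT flagged by the mask.
cleaned = raw.loc[:, ~unnamed_mask]

print(list(cleaned.columns))
```

This avoids hard-coding column positions, which may shift if the CSV is re-exported.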

In [14]:
data.isnull().sum()

Date of Interview                                                                                       1
Client name                                                                                             0
Industry                                                                                                1
Location                                                                                                1
Position to be closed                                                                                   1
Nature of Skillset                                                                                      1
Interview Type                                                                                          1
Name(Cand ID)                                                                                           1
Gender                                                                                                  1
Candidate Current Location                    

For the row with the missing date, I was inclined to just drop it. Checking it (shown below) confirms that the entire row is empty, so this row will be dropped:

In [20]:
data[data['Date of Interview'].isnull()]

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Candidate Native location,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Observed Attendance,Marital Status
1233,,﻿﻿,,,,,,,,,...,,,,,,,,,,


In [25]:
data.dropna(subset=["Date of Interview"],inplace=True)

OK, let's look at the values in these binary columns. There seem to be some issues with the input values:
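
One way to see such issues is to tally the distinct values in each column, including missing ones. The sketch below uses a toy frame with made-up values (the frame `toy` and the dict `tallies` are illustrative names, not part of the notebook's data).

```python
import pandas as pd

# Toy yes/no columns with hypothetical, deliberately messy entries.
toy = pd.DataFrame({
    "Observed Attendance": ["No", "Yes", "NO", "yes ", "no", None],
    "Has the call letter been shared": ["Yes", "Na", None, "yes", "No", "Yes"],
})

# dropna=False keeps missing values in the tally instead of hiding them.
tallies = {col: toy[col].value_counts(dropna=False) for col in toy.columns}

for col, counts in tallies.items():
    print(f"--- {col} ---")
    print(counts)
```

Variants that differ only in case or trailing whitespace show up as separate rows in the tally, which makes them easy to spot.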

To deal with the remaining missing values, I'm first going to one-hot encode the yes/no columns, which contain all of them.

In [28]:
# pd.get_dummies(data,columns=["Have you obtained the necessary permission to start at the required time",
#                              "Hope there will be no unscheduled meetings",
#                              "Can I Call you three hours before the interview and follow up on your attendance for the interview",
#                              "Can I have an alternative number/ desk number. I assure you that I will not trouble you too much",
#                              "Have you taken a printout of your updated resume. Have you read the JD and understood the same",
#                              "Are you clear with the venue details and the landmark.",
#                              "Has the call letter been shared",
#                              "Observed Attendance"])

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Has the call letter been shared_no,Has the call letter been shared_yes,Observed Attendance_NO,Observed Attendance_No,Observed Attendance_No.1,Observed Attendance_Yes,Observed Attendance_no,Observed Attendance_no.1,Observed Attendance_yes,Observed Attendance_yes.1
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228,06.02.2016,Standard Chartered Bank,BFSI,Chennai,Routine,JAVA/J2EE/Struts/Hibernate,Scheduled Walk In,Candidate 1171,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
1229,30.1.16,Standard Chartered Bank,BFSI,Chennai,Routine,Java,Scheduled Walk In,Candidate 1189,Female,Chennai,...,0,0,0,0,0,0,0,0,0,0
1230,30.01.2016,Standard Chartered Bank,BFSI,Chennai,Routine,Java,Scheduled Walk In,Candidate 1207,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
1231,07.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,EMEA,Scheduled,Candidate 1222,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
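
The dummy column names above (`Observed Attendance_NO`, `Observed Attendance_No`, `Observed Attendance_no`, and so on) suggest that the same answer appears under several spellings. A possible fix, assuming the variants differ only in case or surrounding whitespace, is to normalize the strings before encoding. This is a sketch on made-up values, not the notebook's actual cleaning step.

```python
import pandas as pd

# Hypothetical spelling variants of the same two answers.
toy = pd.DataFrame({
    "Observed Attendance": ["No", "NO", "no ", "Yes", "yes", "yes "],
})

# Strip surrounding whitespace and lowercase so the variants collapse.
normalized = toy["Observed Attendance"].str.strip().str.lower()

# One-hot encode the normalized values: one column per distinct answer.
dummies = pd.get_dummies(normalized, prefix="Observed Attendance")

print(list(dummies.columns))
```

After normalization only two dummy columns remain, one per distinct answer.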


In [None]:
# Yes/no columns to clean up and encode (binary_cols is a placeholder name)
binary_cols = [
    "Have you obtained the necessary permission to start at the required time",
    "Hope there will be no unscheduled meetings",
    "Can I Call you three hours before the interview and follow up on your attendance for the interview",
    "Can I have an alternative number/ desk number. I assure you that I will not trouble you too much",
    "Have you taken a printout of your updated resume. Have you read the JD and understood the same",
    "Are you clear with the venue details and the landmark.",
    "Has the call letter been shared",
    "Observed Attendance",
]

In [15]:
data.head()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Candidate Native location,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Observed Attendance,Marital Status
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,Hosur,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Single
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,Trichy,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Single
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,Chennai,,Na,,,,,,No,Single
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,Chennai,Yes,Yes,No,Yes,No,Yes,Yes,No,Single
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,Chennai,Yes,Yes,Yes,No,Yes,Yes,Yes,No,Married


In [16]:
data.columns

Index(['Date of Interview', 'Client name', 'Industry', 'Location',
       'Position to be closed', 'Nature of Skillset', 'Interview Type',
       'Name(Cand ID)', 'Gender', 'Candidate Current Location',
       'Candidate Job Location', 'Interview Venue',
       'Candidate Native location',
       'Have you obtained the necessary permission to start at the required time',
       'Hope there will be no unscheduled meetings',
       'Can I Call you three hours before the interview and follow up on your attendance for the interview',
       'Can I have an alternative number/ desk number. I assure you that I will not trouble you too much',
       'Have you taken a printout of your updated resume. Have you read the JD and understood the same',
       'Are you clear with the venue details and the landmark.',
       'Has the call letter been shared', 'Observed Attendance',
       'Marital Status'],
      dtype='object')

In [None]:
# NOTE: X (features) and y (target) are not defined yet; build them before running this cell.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
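
For reference, here is a rough end-to-end sketch of how a split like this could feed a logistic regression. Everything in it is an assumption for illustration: the data is random, the column names (`call_letter_shared`, `venue_clear`, `attended`) are hypothetical stand-ins for cleaned yes/no columns, and `stratify=y` is an addition that the cell above does not use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Random toy data; it carries no real signal and is for illustration only.
rng = np.random.default_rng(16)
n = 200
toy = pd.DataFrame({
    "call_letter_shared": rng.choice(["yes", "no"], size=n),
    "venue_clear": rng.choice(["yes", "no"], size=n),
    "attended": rng.choice(["yes", "no"], size=n),
})

# One-hot encode the features; drop_first avoids a redundant column per feature.
X = pd.get_dummies(toy[["call_letter_shared", "venue_clear"]], drop_first=True)
# Encode the target as 1 (attended) / 0 (no-show).
y = (toy["attended"] == "yes").astype(int)

# Same split parameters as the cell above, plus stratification on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

Because the toy labels are random, the reported scores are meaningless; the point is only the shape of the pipeline.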

Feature ideas:
- current location same as job location
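
The location idea could be sketched as below, assuming the two location columns are compared after trimming whitespace and lowercasing. The toy values, the helper `clean`, and the feature name `same_location` are all made up for illustration.

```python
import pandas as pd

# Toy rows with hypothetical location values, including a messy variant.
toy = pd.DataFrame({
    "Candidate Current Location": ["Chennai", "Bangalore", "chennai "],
    "Candidate Job Location": ["Chennai", "Chennai", "Chennai"],
})

def clean(series: pd.Series) -> pd.Series:
    """Trim whitespace and lowercase so spelling variants compare equal."""
    return series.str.strip().str.lower()

# 1 if the candidate already lives where the job is, else 0.
toy["same_location"] = (
    clean(toy["Candidate Current Location"]) == clean(toy["Candidate Job Location"])
).astype(int)

print(toy["same_location"].tolist())  # [1, 0, 1]
```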
    