# Predicting Interview No-shows

**Problem statement:**

Your company, Acme Co., sources candidates for companies hiring new employees. Recently, a number of our clients have complained that candidates have not been showing up to interviews. Your boss has provided you with the attached data set in hopes that you can find some way of identifying candidates at risk of not attending scheduled interviews.

Logistic regression or a tree-based method seems appropriate. There are a good number of categorical features, so I'm leaning towards trees.

cols:
- Date of Interview
- Client name
- Industry
- Location
- Position to be closed
- Nature of Skillset 
- Interview Type
- Name(Cand ID) 
- Gender 
- Candidate Current Location
- Candidate Job Location
- Interview Venue
- Candidate Native location
- Have you obtained the necessary permission to start at the required time
- Hope there will be no unscheduled meetings
- Can I Call you three hours before the interview and follow up on your attendance for the interview
- Can I have an alternative number/ desk number. I assure you that I will not trouble you too much
- Have you taken a printout of your updated resume. Have you read the JD and understood the same
- Are you clear with the venue details and the landmark.
- Has the call letter been shared
- Observed Attendance
- Marital Status



**Import Libraries:**

In [190]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
import re
import pickle
import googlemaps 
from datetime import datetime

**Functions:**

In [85]:
def clean_special_characters(string):
    l = re.sub('[^A-Za-z0-9]+', ' ', string)
    return l.strip()

def get_city_diff(city_1,city_2):
    # Distance in metres between two cities; relies on the module-level `gmaps` client
    element = gmaps.distance_matrix(city_1,city_2)['rows'][0]['elements'][0]
    return element['distance']['value']

def replace_value(val,replacable_words_li,replace_word):
    if val in replacable_words_li:
        return replace_word
    else: 
        return val
    
def get_date_string(date):
    # Extract the leading d<sep>m<sep>y part and normalise it to "dd.mm.yyyy"
    val = re.search(r"^(\d)+[\.\-\/](\d)+[\.\-\/](\d)*",date).group()
    val = val.replace("/",'.').replace("-",".")
    d,m,yr = val.split('.')[:3]
    d,m = d.zfill(2),m.zfill(2)
    if len(yr) != 4:
        # Two-digit year: prefix the century ("16" -> "2016")
        yr = "20" + yr
    return ".".join([d,m,yr])
    
def clean_date_formating(row):
    # Parse a raw date string into a datetime, assuming day-first ordering
    try:
        month_name = re.search("[A-Za-z]+",row)
        if month_name and month_name.group() == 'apr':
            # Spelled-out month: keep the leading day and assume April 2016
            date = '.'.join([row[:2].strip(),'04.2016'])
        else:
            date = get_date_string(row)
        return datetime.strptime(date, '%d.%m.%Y')
    except (AttributeError, TypeError, ValueError):
        # Report values that could not be parsed; the function then returns None
        print(row)
        
    
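def format_date(row):
    # Classify a raw date by which of its first two fields could be a month.
    # get_date_string returns a "dd.mm.yyyy" string; clean_date_formating
    # returns a datetime, which cannot be split.
    first,second = (int(part) for part in get_date_string(row).split('.')[:2])

    if first <= 12 and second <= 12:
        return "both_under"
    elif first > 12:
        return "d-m-y"
    elif second > 12:
        return "m-d-y"
    else:
        return "other"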

Read in the data:

In [11]:
data = pd.read_csv("Interview_Input.csv")

### EDA:

We now need to look for missing values, data types, correlations, etc. First let's deal with missing values.

Drop the empty columns that were read in:

In [12]:
data.drop(['Unnamed: 22','Unnamed: 23','Unnamed: 24','Unnamed: 25','Unnamed: 26'],axis=1,inplace=True)
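
The column names above are hard-coded. As a minimal sketch of a more general approach, assuming the empty columns all carry the `Unnamed:` prefix and contain no data, the same step could look like this on a made-up frame (`toy` and its values are illustrative, not the interview data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Unnamed: 22": [np.nan, np.nan],
    "Unnamed: 23": [np.nan, np.nan],
})

# Columns whose name starts with "Unnamed:" and that hold no values at all
empty_cols = [c for c in toy.columns
              if c.startswith("Unnamed:") and toy[c].isnull().all()]
toy = toy.drop(columns=empty_cols)
print(toy.columns.to_list())
```

Checking `isnull().all()` as well as the prefix avoids dropping an unnamed column that happens to contain data.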

In [14]:
data.isnull().sum()

Date of Interview                                                                                       1
Client name                                                                                             0
Industry                                                                                                1
Location                                                                                                1
Position to be closed                                                                                   1
Nature of Skillset                                                                                      1
Interview Type                                                                                          1
Name(Cand ID)                                                                                           1
Gender                                                                                                  1
Candidate Current Location                    

For the row with the missing date, I checked it (shown below) and found that the entire row is empty, so this one will be dropped:

In [20]:
data[data['Date of Interview'].isnull()]

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Candidate Native location,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Observed Attendance,Marital Status
1233,,﻿﻿,,,,,,,,,...,,,,,,,,,,


In [25]:
data.dropna(subset=["Date of Interview"],inplace=True)

OK, let's look at the values in these binary columns. There seem to be some issues with the input values, so they need to be cleaned. Let's look into the column values to clean them properly. First, let's lowercase all of the values and strip surrounding whitespace.

In [38]:
for col in data.columns:
    data[col] = data[col].str.lower()
    data[col] = data[col].str.strip()
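
The loop above relies on every column holding strings. A hedged sketch of the same normalisation that leaves non-string values untouched, shown on a made-up frame (`toy`, `normalise` and the `score` column are hypothetical):

```python
import pandas as pd

# Toy frame with one text column and one numeric column (hypothetical values)
toy = pd.DataFrame({"Gender": [" Male", "FEMALE "], "score": [1, 2]})

def normalise(value):
    # Lowercase and strip strings; pass anything else (numbers, NaN) through unchanged
    return value.lower().strip() if isinstance(value, str) else value

for col in toy.columns:
    toy[col] = toy[col].map(normalise)

print(toy)
```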

I'm making a dictionary of all column values, except for dates and names:

In [49]:
col_vals = {}

for col in data.columns:
    if col in ["Date of Interview","Name(Cand ID)"]:
        pass
    else:
        col_vals[col] = data[col].value_counts().keys().to_list()

Update values for the question-based answers:

In [91]:
cols_for_maybe = ['Has the call letter been shared','Hope there will be no unscheduled meetings']
    
cols_for_YN = ['Are you clear with the venue details and the landmark.',
               'Have you taken a printout of your updated resume. Have you read the JD and understood the same',
               'Can I have an alternative number/ desk number. I assure you that I will not trouble you too much',
               'Can I Call you three hours before the interview and follow up on your attendance for the interview',
               'Have you obtained the necessary permission to start at the required time']

vals_for_maybe_dict = ['need to check','not yet','havent checked','yet to check','cant say']

vals_for_YN = ['no- i need to check','not yet','no- will take it soon','no',
               'no i have only thi number','no','no dont','not yet','yet to confirm']

for col in cols_for_maybe:
    data[col] = data[col].apply(lambda val: replace_value(val,vals_for_maybe_dict,"not sure"))
    
for col in cols_for_YN:
    data[col] = data[col].apply(lambda val: replace_value(val,vals_for_YN,"no"))
    
#replace_value(val,replacable_words_li,replace_word
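
As an alternative sketch, assuming the goal is only to collapse a list of variants into one label, pandas' `Series.replace` can do the same job without a helper function. The series below is made up for illustration:

```python
import pandas as pd

# Hypothetical answers to one of the yes/no questions
answers = pd.Series(["yes", "not yet", "no dont", "yes"])

# Variants that should all be read as "no"
negatives = ["not yet", "no dont", "yet to confirm"]

# Replace every listed variant with the single label "no"
cleaned = answers.replace(negatives, "no")
print(cleaned.to_list())
```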

One string value is "na". To understand whether it is meant to be read as "N/A" or is a typo for "no", I've pulled the rows where this value shows up.

For most of the columns, it shows up in the same 20 rows (found using `data[data['Hope there will be no unscheduled meetings']=='na']`). I looked at what these rows have in common to determine whether there is any commonality that would make N/A sensible, or whether the value should be "no". Because it appears across the board in the same rows, I'm leaning towards treating it as NaN. I will most likely assign it as such, and deal with null values after one-hot encoding.

In [111]:
#get all other columns to see commonalities between these 20
corr_cols = data.columns.to_list()[:-9]
corr_cols.extend(data.columns.to_list()[-2:])
check_1 = data[data['Hope there will be no unscheduled meetings']=='na']
check_1[corr_cols]
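
If "na" is treated as missing, the assignment could be sketched as follows. This is only an illustration on a made-up frame; `toy` and `question_cols` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy frame with two of the question columns (hypothetical values)
toy = pd.DataFrame({
    "Hope there will be no unscheduled meetings": ["yes", "na", "not sure"],
    "Has the call letter been shared": ["na", "yes", "no"],
})

question_cols = toy.columns.to_list()

# Turn the literal string "na" into a real missing value
toy[question_cols] = toy[question_cols].replace("na", np.nan)
print(toy.isnull().sum())
```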

I now need to address the formatting issues with the date column. Some values are separated differently. A few list date and time, while others start with months as the first value. Below we will figure out how we can manage these and fix them.

In [185]:
data["date_formats"] = data["Date of Interview"].apply(lambda row: format_date(row))

In [188]:
data["date_formats"].value_counts()

d-m-y         658
both_under    575
Name: date_formats, dtype: int64

Based on this, I am assuming a format of "dd/mm/yyyy": no value has a second field greater than 12, so nothing forces a month-first reading.

In [217]:
data["Date of Interview"] = data["Date of Interview"].apply(lambda row: clean_date_formating(row))

Because this model is forward-looking, I'm **not going to use the year** in training our model. We will have separate month and day columns, but we will create them after cleaning the rest of the data.
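
A minimal sketch of that later step, assuming the date column already holds parsed datetimes. The column names `interview_month` and `interview_day` are hypothetical:

```python
from datetime import datetime

import pandas as pd

# Toy frame with already-parsed dates (hypothetical values)
toy = pd.DataFrame({"Date of Interview": [datetime(2015, 2, 13), datetime(2016, 5, 7)]})

dates = pd.to_datetime(toy["Date of Interview"])

# Keep month and day as separate features; the year is deliberately left out
toy["interview_month"] = dates.dt.month
toy["interview_day"] = dates.dt.day
print(toy)
```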

In [None]:
# Working notes: values per column that still need consolidating
# (`values_to_clean` is a placeholder name)
values_to_clean = {
    "Industry":['it products and services','it services','it'],
    "Location":["- cochin-"],
    'Candidate Current Location':["- cochin-"],
    'Candidate Job Location':["- cochin-"],
    'Interview Venue':["- cochin-"],
    'Candidate Native location':["- cochin-"],
    "Position to be closed":['production- sterile'],
    "Nature of Skillset":[],
    'Interview Type': ['scheduled walk in','scheduled walkin','sceduled walkin'],
}


Remaining cleanup notes:
- 'Nature of Skillset': get rid of time-of-day entries of the form dd.dd am/pm.
- 'Interview Venue', 'Candidate Job Location' and 'Candidate Native location' need the same cleanup as 'Candidate Current Location'.
- Could we get distances between these locations using Mapbox or Google Maps?
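
For the time-of-day entries, a regular expression is one option. The sketch below assumes the values are already lowercased and look like `dd.dd am/pm`. The pattern, the helper name `strip_time_of_day` and the sample strings are all illustrative:

```python
import re

# One or two digits, a dot or colon, two digits, optional space, then am/pm
TIME_OF_DAY = re.compile(r"\b\d{1,2}[.:]\d{2}\s*(?:am|pm)\b")

def strip_time_of_day(value):
    # Remove any time-of-day fragment and tidy the remaining whitespace
    return TIME_OF_DAY.sub("", value).strip()

print(repr(strip_time_of_day("10.00 am")))
print(repr(strip_time_of_day("java 9.30 pm")))
```

A value that consists only of a time becomes an empty string, which could then be treated as missing.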


In [219]:
data['Has the call letter been shared'].value_counts().keys().to_list()

['yes', 'na', 'no', 'not sure']

To deal with these remaining values, I'm first going to one-hot encode the yes/no columns that contain all of the missing values.

In [28]:
# pd.get_dummies(data,columns=["Have you obtained the necessary permission to start at the required time",
#                              "Hope there will be no unscheduled meetings",
#                              "Can I Call you three hours before the interview and follow up on your attendance for the interview",
#                              "Can I have an alternative number/ desk number. I assure you that I will not trouble you too much",
#                              "Have you taken a printout of your updated resume. Have you read the JD and understood the same",
#                              "Are you clear with the venue details and the landmark.",
#                              "Has the call letter been shared",
#                              "Observed Attendance"])

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Has the call letter been shared_no,Has the call letter been shared_yes,Observed Attendance_NO,Observed Attendance_No,Observed Attendance_No.1,Observed Attendance_Yes,Observed Attendance_no,Observed Attendance_no.1,Observed Attendance_yes,Observed Attendance_yes.1
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228,06.02.2016,Standard Chartered Bank,BFSI,Chennai,Routine,JAVA/J2EE/Struts/Hibernate,Scheduled Walk In,Candidate 1171,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
1229,30.1.16,Standard Chartered Bank,BFSI,Chennai,Routine,Java,Scheduled Walk In,Candidate 1189,Female,Chennai,...,0,0,0,0,0,0,0,0,0,0
1230,30.01.2016,Standard Chartered Bank,BFSI,Chennai,Routine,Java,Scheduled Walk In,Candidate 1207,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
1231,07.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,EMEA,Scheduled,Candidate 1222,Male,Chennai,...,0,0,0,0,0,0,0,0,0,0
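
The output above contains several variants of the same target column (`Observed Attendance_NO`, `Observed Attendance_No`, `Observed Attendance_no`, and so on), which suggests the values should be normalised before encoding. A small sketch on made-up values:

```python
import pandas as pd

# Toy target column with inconsistent case and whitespace (hypothetical values)
toy = pd.DataFrame({"Observed Attendance": ["NO", "No ", "yes", "Yes", " no"]})

# Normalise first so that each answer maps to a single category
toy["Observed Attendance"] = toy["Observed Attendance"].str.strip().str.lower()

encoded = pd.get_dummies(toy, columns=["Observed Attendance"])
print(encoded.columns.to_list())
```

After normalisation, only two dummy columns remain instead of one per spelling.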


In [None]:
"Have you obtained the necessary permission to start at the required time",
"Hope there will be no unscheduled meetings",
"Can I Call you three hours before the interview and follow up on your attendance for the interview",
"Can I have an alternative number/ desk number. I assure you that I will not trouble you too much",
"Have you taken a printout of your updated resume. Have you read the JD and understood the same",
"Are you clear with the venue details and the landmark.",
"Has the call letter been shared",
"Observed Attendance"

In [15]:
data.head()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Candidate Native location,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Observed Attendance,Marital Status
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,Hosur,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Single
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,Trichy,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Single
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,Chennai,,Na,,,,,,No,Single
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,Chennai,Yes,Yes,No,Yes,No,Yes,Yes,No,Single
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,Chennai,Yes,Yes,Yes,No,Yes,Yes,Yes,No,Married


In [16]:
data.columns

Index(['Date of Interview', 'Client name', 'Industry', 'Location',
       'Position to be closed', 'Nature of Skillset', 'Interview Type',
       'Name(Cand ID)', 'Gender', 'Candidate Current Location',
       'Candidate Job Location', 'Interview Venue',
       'Candidate Native location',
       'Have you obtained the necessary permission to start at the required time',
       'Hope there will be no unscheduled meetings',
       'Can I Call you three hours before the interview and follow up on your attendance for the interview',
       'Can I have an alternative number/ desk number. I assure you that I will not trouble you too much',
       'Have you taken a printout of your updated resume. Have you read the JD and understood the same',
       'Are you clear with the venue details and the landmark.',
       'Has the call letter been shared', 'Observed Attendance',
       'Marital Status'],
      dtype='object')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
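
`X` and `y` are not defined yet at this point in the notebook. As a rough sketch of how they might be built once cleaning is done, assuming one-hot encoded features and a binary target, the flow could look like this on made-up rows (not the real data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the cleaned data
toy = pd.DataFrame({
    "Interview Type": ["scheduled", "walkin"] * 20,
    "Has the call letter been shared": (["yes"] * 3 + ["no"]) * 10,
    "Observed Attendance": ["yes", "yes", "no", "yes", "no"] * 8,
})

# One-hot encode the categorical features; target is 1 (attended) or 0 (no-show)
X = pd.get_dummies(toy.drop(columns=["Observed Attendance"]))
y = (toy["Observed Attendance"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

A tree-based model such as `DecisionTreeClassifier` from `sklearn.tree` could be swapped in for `LogisticRegression` with the same `fit` and `predict` calls.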

feature ideas: 
- current location same as job location
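
The first idea can be sketched as a simple equality flag, assuming the location columns have already been cleaned to consistent spellings. The column name `current_same_as_job` and the rows are hypothetical:

```python
import pandas as pd

# Toy frame with cleaned location values (hypothetical rows)
toy = pd.DataFrame({
    "Candidate Current Location": ["chennai", "bangalore", "chennai"],
    "Candidate Job Location": ["chennai", "chennai", "hosur"],
})

# 1 when the candidate already lives where the job is, otherwise 0
toy["current_same_as_job"] = (
    toy["Candidate Current Location"] == toy["Candidate Job Location"]
).astype(int)
print(toy)
```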
    

In [None]:
with open('../api_key.pickle', 'rb') as handle:
    api_info = pickle.load(handle)

gmaps = googlemaps.Client(key=api_info["api_key"]) 