In [213]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

## Instructions
As usual, it starts with picking a data set. It can be the same data set as a previous challenge or capstone, but make sure you're using a data set that makes sense for this task, so read all of the instructions first.

First, dive in and explore the data set. Include your code and visuals from this process in your final write up. While doing this, look for something that provokes a question; specifically one that can be answered with an experiment.

The main component of this capstone is an experimentation RFC. Using the data set you selected, propose and outline an experiment plan. The plan should consist of three key components:

- Analysis that highlights your experimental hypothesis.
- A rollout plan showing how you would implement and roll out the experiment.
- An evaluation plan showing what constitutes success in this experiment.

Your experiment should be as real as possible. Though you obviously will not have access to the full production environment to deploy your experiment, it should be feasible and of interest to the parties involved with your actual data source.

The target size of your RFC should be 3-5 pages.

## Experiment
The company 'WORKWORKSWHENYOUWORKDATWORK ltd.' helps candidates in their search for a new <s>work</s> job. To help them in that pursuit, 'WORKWORKSWHENYOUWORKDATWORK ltd.' provides full support during every application. After many years of hard work in the field, 'WORKWORKSWHENYOUWORKDATWORK ltd.' has realized that attending job interviews actually plays a key role in getting the role the candidate is working toward. So 'WORKWORKSWHENYOUWORKDATWORK' wants to work on increasing the percentage of interviews actually attended by the candidates who hire its services.

With that goal in mind, the brightest minds working at 'WORKWORKSWHENYOUWORKDATWORK ltd.' have worked very hard and come up with a plan: if a candidate is married, they're more likely to attend an interview. If that's the case, then 'WORKWORKSWHENYOUWORKDATWORK ltd.' will encourage its clients to get married and get that job. "Sure that'll work!" 'WORKWORKSWHENYOUWORKDATWORK' thought.

To run the test, the company selects more than a thousand candidates, all single, and then marries some of them to see if that improves attendance at interviews (after the mandatory honeymoon). Below is the data that has been collected.*

__Hypothesis__: Being married improves the likelihood of attending an interview
__Key metric__: Percentage of scheduled interviews that the candidate attends
__Secondary metric__: Expected attendance

* Source of the data: [Kaggle - The Interview Attendance Problem](https://www.kaggle.com/vishnusraghavan/the-interview-attendance-problem/data)

## Results
First we need to load the data and take a look

In [214]:
path = '/Users/Stephanie/desktop/thinkful/projects/3_bootcamp/U1-capstone/Interview.csv'

df = pd.read_csv(path)

In [215]:
df.head(3)

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,,,Uncertain,No,Single,,,,,


We're going to drop the last columns that contain no information and check if we still have any NaN values

In [216]:
df.dropna(axis=1, how='all', inplace=True)

In [217]:
df.isna().sum()

Date of Interview                                                                                       1
Client name                                                                                             0
Industry                                                                                                1
Location                                                                                                1
Position to be closed                                                                                   1
Nature of Skillset                                                                                      1
Interview Type                                                                                          1
Name(Cand ID)                                                                                           1
Gender                                                                                                  1
Candidate Current Location                    

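We have a number of columns where approximately 20% of the data is missing. Since we're not going to use these columns, we can just drop them. Before doing so, we drop the row at position 1233, which appears to be the single row that is empty in almost every column.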

In [218]:
df.drop(df.index[1233], inplace=True)

In [219]:
# Keep only the columns that have at least 1220 non-missing values
df.dropna(axis=1, thresh=1220, inplace=True)
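
The `thresh` argument counts non-missing values, so `thresh=1220` keeps only the columns with at least 1220 non-null entries. As a minimal sketch on a small hypothetical frame (the column names and the 80% cutoff are illustrative assumptions, not taken from the data set), the same rule can be derived from a fraction rather than a hard-coded count:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy frame: 'mostly_missing' is 60% NaN, the others are complete.
toy = pd.DataFrame({
    'complete': [1, 2, 3, 4, 5],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'also_complete': list('abcde'),
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Keep columns with at least 80% non-missing values (assumed cutoff).
min_non_null = math.ceil(0.8 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)

print(missing_share)
print(list(kept.columns))
```

Expressing the cutoff as a share of the rows keeps the rule valid if the number of rows changes after earlier cleaning steps.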

In [220]:
df.columns

Index(['Date of Interview', 'Client name', 'Industry', 'Location',
       'Position to be closed', 'Nature of Skillset', 'Interview Type',
       'Name(Cand ID)', 'Gender', 'Candidate Current Location',
       'Candidate Job Location', 'Interview Venue',
       'Candidate Native location', 'Expected Attendance',
       'Observed Attendance', 'Marital Status'],
      dtype='object')

A final check tells us that one column, Expected Attendance, is still missing values in 5 rows. Since it's a small number, we can just drop these five rows.

In [221]:
df.isna().sum()

Date of Interview             0
Client name                   0
Industry                      0
Location                      0
Position to be closed         0
Nature of Skillset            0
Interview Type                0
Name(Cand ID)                 0
Gender                        0
Candidate Current Location    0
Candidate Job Location        0
Interview Venue               0
Candidate Native location     0
Expected Attendance           5
Observed Attendance           0
Marital Status                0
dtype: int64

In [222]:
df.dropna(inplace=True)

We're also going to rename some columns, dropping the 'Candidate ' prefix, and create a new column that checks whether the new job would require relocating from the candidate's current location.

In [223]:
# Strip the 'Candidate ' prefix; names without it are left unchanged
df.columns = [col.replace('Candidate ', '') for col in df.columns]
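
The relocation flag mentioned above is not built in the cells shown here. A minimal sketch of one way to derive it, assuming that "relocation" means the job location differs from the current location, and using a hypothetical column name `Relocation` and hypothetical rows:

```python
import pandas as pd

# Hypothetical rows that mirror the renamed columns of the data set.
toy = pd.DataFrame({
    'Current Location': ['Chennai', 'Chennai', 'Bangalore '],
    'Job Location': ['Hosur', 'chennai', 'Bangalore'],
})


def normalize(city: pd.Series) -> pd.Series:
    """Strip whitespace and lower-case city names before comparing them."""
    return city.str.strip().str.lower()


# 'Relocation' is an assumed column name: 1 if the job is in another city.
toy['Relocation'] = (
    normalize(toy['Current Location']) != normalize(toy['Job Location'])
).astype(int)

print(toy)
```

Normalizing case and whitespace first avoids flagging a relocation when two entries differ only in how the same city was typed.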

In [224]:
df.head()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Current Location,Job Location,Interview Venue,Native location,Expected Attendance,Observed Attendance,Marital Status
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,Hosur,Hosur,Hosur,Yes,No,Single
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,Bangalore,Hosur,Trichy,Yes,No,Single
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,Chennai,Hosur,Chennai,Uncertain,No,Single
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,Chennai,Hosur,Chennai,Uncertain,No,Single
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,Bangalore,Hosur,Chennai,Uncertain,No,Married


We'll make Marital Status and Observed Attendance binary as well.

In [225]:
# 1 = married / attended, 0 = anything else
df['Marital Status'] = (df['Marital Status'] == 'Married').astype(int)
df['Observed Attendance'] = (df['Observed Attendance'] == 'Yes').astype(int)
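
The encoding above sends every value other than the exact strings 'Married' and 'Yes' to 0, so a variant spelling such as 'yes' or 'No ' would be silently counted as a no-show. A minimal sketch, on hypothetical values rather than the real column, of an encoding that normalizes the labels first and fails loudly on anything unexpected:

```python
import pandas as pd

# Hypothetical raw values, including variants in case and whitespace.
raw = pd.Series(['Yes', 'No', 'yes ', 'NO', 'Yes'])

# Normalize first, then map through an explicit dictionary.
cleaned = raw.str.strip().str.lower()
encoded = cleaned.map({'yes': 1, 'no': 0})

# Any label outside the dictionary becomes NaN, so it can be caught here.
unexpected = cleaned[encoded.isna()].unique()
assert len(unexpected) == 0, f'Unmapped labels: {unexpected}'

encoded = encoded.astype(int)
print(encoded.tolist())
```

Whether the real columns contain such variants can be checked with `value_counts()` before encoding.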

In [226]:
# Compare attendance between married (test) and single (control) candidates
for group, value in zip(('test', 'control'), (1, 0)):
    sample = df[df['Marital Status'] == value]
    # Attendance is coded 0/1, so the sum is the number of attended interviews
    sample_attended = sample['Observed Attendance'].sum()
    sample_total = len(sample)
    print(group + ' sample size:', sample_total)
    print(group + ' observance rate:', round(sample_attended / sample_total, 2))

# Share of candidates in the test (married) group
print('test proportion:', round(df['Marital Status'].mean(), 2))
print()

# Two-sample t-test on the 0/1 attendance indicator
stats.ttest_ind(df[df['Marital Status'] == 1]['Observed Attendance'],
                df[df['Marital Status'] == 0]['Observed Attendance'])

test sample size: 465
test observance rate: 0.56
control sample size: 763
control observance rate: 0.57
test proportion: 0.38



Ttest_indResult(statistic=-0.39260702115116153, pvalue=0.6946780002196427)
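
With a p-value of about 0.69, the t-test shows no statistically significant difference in attendance between the two groups. Because attendance is a 0/1 outcome, the comparison is really one of two proportions, and a two-proportion z-test is a common alternative in that setting. The sketch below uses only the standard library. The function name is an assumption, and the counts passed to it are hypothetical values chosen to be roughly consistent with the rounded rates printed above; they are not read from the data.

```python
import math


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: attended / group size for test and control.
z, p_value = two_proportion_ztest(260, 465, 435, 763)
print(f'z = {z:.3f}, p = {p_value:.3f}')
```

With the real data, the two counts would come from summing `Observed Attendance` within each `Marital Status` group.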