# EECS 491: Probabilistic Graphical Models Assignment 3
**David Fan**

3/28/18

## Problem Description
In this notebook I will build a larger graphical model (expanding on techniques used in the previous assignments) for a specific domain. Using a dataset found on Kaggle, we will model the recruitment industry in India with a graphical model. In particular, we will explore the effect of different factors on interview attendance.

## Dataset
For this problem, we will explore a dataset found [here](https://www.kaggle.com/vishnusraghavan/the-interview-attendance-problem/data). The author of the dataset describes the context of the dataset as follows:

>The data pertains to the recruitment industry in India for the years 2014-2016 and deals with candidate interview attendance for various clients ...

>The data have been collected by me and my fellow researchers over a period of over 2 years between September 2014 and January 2017.

>There are a set of questions that are asked by a recruiter while scheduling the candidate. The answers to these determine whether expected attendance is yes, no or uncertain.



In [170]:
# Imported Packages
import pandas as pd
import numpy as np
from dateutil import parser

In [171]:
# Load dataset as a dataframe using pandas
data = pd.read_csv('Interview.csv')

Let's take a quick look around the dataset so we can see what we're working with:

In [172]:
data.head()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,,,Uncertain,No,Single,,,,,
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,Yes,Yes,Uncertain,No,Single,,,,,
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,Yes,Yes,Uncertain,No,Married,,,,,


Here we can see the different variables contained within the dataset. Of particular interest are:
* **Date of interview:** This will be broken down into a month variable and a day-of-the-week variable to explore whether the time of year or the day of the week has any effect on interview attendance.
* **Industry:** To see whether particular industries are more attractive than others, resulting in a higher interview attendance rate.
* **Location:** This appears to be the candidate's location; its values are equivalent to those of the variable 'Candidate Current Location'. Candidate location might have some effect on a candidate's ability to show up to an interview.
* **Position to be closed:** The type of job the candidate is interviewing for.
* **Nature of skillset:** The skills the candidate has (or claims to have).
* **Interview Type:** Walk-in, scheduled, or scheduled walk-in.
* **Gender**
* **Candidate Job Location:** The location for the interview. 
* **Expected Attendance**
* **Observed Attendance**
* **Marital Status**

### Formatting
Let's trim down the dataset to just what we need.

In [173]:
data = data.dropna(axis=0, thresh=2)

In [174]:
# Standardize the date-string format for the parser
for i in range(data["Date of Interview"].shape[0]):
    # Normalize " -" and en dashes to plain hyphens
    data.iloc[i,0] = data.iloc[i,0].replace(" -","-")
    data.iloc[i,0] = data.iloc[i,0].replace("–","-")
    # When two dates are joined by '&', keep only the first one
    if data.iloc[i,0].find('&') != -1:
        data.iloc[i,0] = data.iloc[i,0][:data.iloc[i,0].find('&')]

In [175]:
data.loc[:,"Date of Interview"] = data.loc[:, "Date of Interview"].apply(parser.parse)
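The sample rows shown earlier (`13.02.2015`) suggest the dates are written day-first. By default, `dateutil.parser.parse` reads an ambiguous date such as `05.02.2015` month-first, which would shift some interviews into the wrong month. A minimal sketch, assuming every date in the column is day-first (the helper name `parse_day_first` is hypothetical):

```python
from dateutil import parser


def parse_day_first(date_string):
    """Parse a date string, treating the first number as the day."""
    return parser.parse(date_string, dayfirst=True)


# Unambiguous: 13 cannot be a month, so the day-first reading is forced.
unambiguous = parse_day_first("13.02.2015")
# Ambiguous: day-first reads this as 5 February rather than 2 May.
ambiguous = parse_day_first("05.02.2015")
print(unambiguous.date(), ambiguous.date())
```

If the day-first assumption holds for the whole dataset, `data["Date of Interview"].apply(parse_day_first)` could stand in for the plain `parser.parse` call.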

In [176]:
to_trimmed = {
    "Month": data.loc[:,"Date of Interview"].apply(lambda x: x.month),
    "Day of the Week": data.loc[:,"Date of Interview"].apply(lambda x: x.weekday()),
    "Industry": data.loc[:, "Industry"],
    "Location": data.loc[:, "Location"],
    "Position to be closed": data.loc[:, "Position to be closed"],
    "Nature of Skillset": data.loc[:, "Nature of Skillset"],
    "Interview Type": data.loc[:, "Interview Type"],
    "Gender": data.loc[:, "Gender"],
    "Candidate Job Location": data.loc[:, "Candidate Job Location"],
    "Expected Attendance": data.loc[:, "Expected Attendance"],
    "Observed Attendance": data.loc[:, "Observed Attendance"],
    "Marital Status": data.loc[:, "Marital Status"]
}
trimmed = pd.DataFrame(to_trimmed)
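For reading the day-of-week values later on: Python's `datetime.weekday()` numbers days from 0 (Monday) to 6 (Sunday). A small sketch that maps the code back to a name, using the first interview date from the sample rows (the helper `weekday_name` and the `DAY_NAMES` list are hypothetical):

```python
from datetime import datetime

# weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_name(date):
    """Return the English day name for a date."""
    return DAY_NAMES[date.weekday()]


first_interview = datetime(2015, 2, 13)  # 13.02.2015 from the sample rows
print(first_interview.weekday(), weekday_name(first_interview))  # 4 Friday
```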

In [177]:
trimmed = trimmed.dropna(axis=0, how="any")

In [178]:
trimmed.head()

Unnamed: 0,Candidate Job Location,Day of the Week,Expected Attendance,Gender,Industry,Interview Type,Location,Marital Status,Month,Nature of Skillset,Observed Attendance,Position to be closed
0,Hosur,4,Yes,Male,Pharmaceuticals,Scheduled Walkin,Chennai,Single,2,Routine,No,Production- Sterile
1,Bangalore,4,Yes,Male,Pharmaceuticals,Scheduled Walkin,Chennai,Single,2,Routine,No,Production- Sterile
2,Chennai,4,Uncertain,Male,Pharmaceuticals,Scheduled Walkin,Chennai,Single,2,Routine,No,Production- Sterile
3,Chennai,4,Uncertain,Male,Pharmaceuticals,Scheduled Walkin,Chennai,Single,2,Routine,No,Production- Sterile
4,Bangalore,4,Uncertain,Male,Pharmaceuticals,Scheduled Walkin,Chennai,Married,2,Routine,No,Production- Sterile


We now have a trimmed down version of our dataset with only the variables of interest. Let's now define our model.

## Model Definition
Using some domain knowledge and hypotheses, we model these variables as a Bayesian network. Our model is as follows:

<img src="bayesnet.png">

We will use belief propagation and Monte Carlo sampling to answer queries against this model. Suppose we want to know the probability that a married individual shows up to an interview in June. We then want to infer the query:

$$
P(ObservedAttendance \; | \; MaritalStatus=Married,Month=6)
$$
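As a rough sanity check on any inferred answer, the same conditional can be estimated directly from the data by keeping only the rows that match the evidence and taking relative frequencies of the target. This ignores the network structure entirely, so it is only a baseline. The sketch below uses a tiny made-up dataframe; the function name `empirical_conditional` and the toy rows are illustrative and not taken from the dataset:

```python
import pandas as pd


def empirical_conditional(df, target, evidence):
    """Relative frequencies of `target` among rows matching `evidence`."""
    mask = pd.Series(True, index=df.index)
    for column, value in evidence.items():
        mask &= df[column] == value
    return df.loc[mask, target].value_counts(normalize=True).to_dict()


# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Marital Status": ["Married", "Married", "Married", "Single"],
    "Month": [6, 6, 6, 6],
    "Observed Attendance": ["Yes", "Yes", "No", "Yes"],
})
estimate = empirical_conditional(
    toy, "Observed Attendance", {"Marital Status": "Married", "Month": 6}
)
print(estimate)
```

On the notebook's data the first argument would be `trimmed`; when few rows match the evidence, the estimate will be noisy.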

### Belief Propagation
We will use the `pgmpy` package to create our model and perform belief propagation. pgmpy works well with pandas DataFrames, so we can pass it our data and use the built-in maximum likelihood estimator (MLE) to learn the conditional probability distributions (CPDs).

In [179]:
# Fallback helper, not used below: kept in case the MLE fit fails,
# so that the CPDs in the model can be constructed manually.
def calc_probability(variable):
    values = dict()
    for value in variable:
        if value not in values:
            values[value] = 1
        else:
            values[value] += 1
    total = variable.shape[0]
    for value in values:
        values[value] /= total
    
    return values
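The helper above computes relative frequencies, which is the maximum likelihood estimate of the distribution for a variable with no parents. An equivalent, more compact sketch using the standard library (the name `relative_frequencies` is hypothetical; on a pandas Series, `value_counts(normalize=True)` gives the same numbers):

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the total count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


print(relative_frequencies(["Yes", "No", "Yes", "Yes"]))
```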

Now we can build our pgmpy model:

In [180]:
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import BeliefPropagation

In [181]:
trimmed.columns

Index(['Candidate Job Location', 'Day of the Week', 'Expected Attendance',
       'Gender', 'Industry', 'Interview Type', 'Location', 'Marital Status',
       'Month', 'Nature of Skillset', 'Observed Attendance',
       'Position to be closed'],
      dtype='object')

In [182]:
model = BayesianModel([
    ("Industry", "Interview Type"), 
    ('Candidate Job Location', "Interview Type"), 
    ("Industry", "Location"),
    ("Candidate Job Location", "Location"),
    ("Day of the Week","Expected Attendance"),
    ("Day of the Week", "Observed Attendance"),
    ("Month", "Expected Attendance"),
    ("Month", "Observed Attendance"),
    ("Interview Type", "Expected Attendance"),
    ("Interview Type", "Observed Attendance"),
    ("Location", "Expected Attendance"),
    ("Location", "Observed Attendance"),
    ("Nature of Skillset", "Position to be closed"),
    ("Position to be closed", "Observed Attendance"),
    ("Position to be closed", "Expected Attendance"),
    ("Gender", "Marital Status"),
    ("Marital Status", "Observed Attendance"),
    ("Marital Status", "Expected Attendance")
])

Now we can estimate our CPDs. **WARNING: THIS WILL TAKE A LONG TIME TO RUN**

In [183]:
model.fit(trimmed, estimator=MaximumLikelihoodEstimator)
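Part of the cost comes from table size: a node's CPD has one column per joint configuration of its parents, so the number of columns is the product of the parents' state counts. A sketch that computes this from an edge list, using state counts that can be read off the marginals printed in this notebook (7 industries, 7 job locations, 2 genders); the function name `parent_configurations` is hypothetical:

```python
from math import prod


def parent_configurations(edges, cardinality, node):
    """Number of joint parent states (CPD columns) for `node`."""
    parents = [parent for parent, child in edges if child == node]
    return prod(cardinality[parent] for parent in parents)


# A subset of the model's edges and the state counts seen in the data.
edges = [
    ("Industry", "Interview Type"),
    ("Candidate Job Location", "Interview Type"),
    ("Gender", "Marital Status"),
]
cardinality = {"Industry": 7, "Candidate Job Location": 7, "Gender": 2}

print(parent_configurations(edges, cardinality, "Interview Type"))  # 49
print(parent_configurations(edges, cardinality, "Marital Status"))  # 2
```

`Observed Attendance` has six parents in this model, so its table multiplies six such counts together, which is consistent with the slow fit and the oversized printouts.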

In [189]:
print(model.get_cpds()[0])

╒═══════════════════════════════════════╤═════════════╕
│ Candidate Job Location(- Cochin- )    │ 0.00732899  │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Bangalore)     │ 0.20684     │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Chennai)       │ 0.727199    │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Gurgaon)       │ 0.0285016   │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Hosur)         │ 0.000814332 │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Noida)         │ 0.012215    │
├───────────────────────────────────────┼─────────────┤
│ Candidate Job Location(Visakapatinam) │ 0.017101    │
╘═══════════════════════════════════════╧═════════════╛


In [190]:
print(model.get_cpds()[1])

╒════════════════════╤═══════════╕
│ Day of the Week(0) │ 0.0350163 │
├────────────────────┼───────────┤
│ Day of the Week(1) │ 0.192182  │
├────────────────────┼───────────┤
│ Day of the Week(2) │ 0.185668  │
├────────────────────┼───────────┤
│ Day of the Week(3) │ 0.324919  │
├────────────────────┼───────────┤
│ Day of the Week(4) │ 0.108306  │
├────────────────────┼───────────┤
│ Day of the Week(5) │ 0.12215   │
├────────────────────┼───────────┤
│ Day of the Week(6) │ 0.031759  │
╘════════════════════╧═══════════╛


In [191]:
print(model.get_cpds()[2])

[Output omitted: the printed CPD exceeded the notebook's IOPub data rate limit.]



In [192]:
print(model.get_cpds()[3])

╒════════════════╤══════════╕
│ Gender(Female) │ 0.218241 │
├────────────────┼──────────┤
│ Gender(Male)   │ 0.781759 │
╘════════════════╧══════════╛


In [193]:
print(model.get_cpds()[4])

╒════════════════════════════════════╤════════════╕
│ Industry(BFSI)                     │ 0.76873    │
├────────────────────────────────────┼────────────┤
│ Industry(Electronics)              │ 0.0187296  │
├────────────────────────────────────┼────────────┤
│ Industry(IT)                       │ 0.00895765 │
├────────────────────────────────────┼────────────┤
│ Industry(IT Products and Services) │ 0.036645   │
├────────────────────────────────────┼────────────┤
│ Industry(IT Services)              │ 0.0187296  │
├────────────────────────────────────┼────────────┤
│ Industry(Pharmaceuticals)          │ 0.134365   │
├────────────────────────────────────┼────────────┤
│ Industry(Telecom)                  │ 0.0138436  │
╘════════════════════════════════════╧════════════╛


In [196]:
print(model.get_cpds()[5])

[Output truncated: wide CPD table not shown.]

In [195]:
print(model.get_cpds()[6])

[Output truncated: wide CPD table not shown.]

In [197]:
print(model.get_cpds()[7])

╒═════════════════════════╤═══════════════════╤══════════════╕
│ Gender                  │ Gender(Female)    │ Gender(Male) │
├─────────────────────────┼───────────────────┼──────────────┤
│ Marital Status(Married) │ 0.582089552238806 │ 0.321875     │
├─────────────────────────┼───────────────────┼──────────────┤
│ Marital Status(Single)  │ 0.417910447761194 │ 0.678125     │
╘═════════════════════════╧═══════════════════╧══════════════╛


In [198]:
print(model.get_cpds()[8])

╒═══════════╤════════════╕
│ Month(1)  │ 0.0781759  │
├───────────┼────────────┤
│ Month(2)  │ 0.118893   │
├───────────┼────────────┤
│ Month(3)  │ 0.0871336  │
├───────────┼────────────┤
│ Month(4)  │ 0.236156   │
├───────────┼────────────┤
│ Month(5)  │ 0.0936482  │
├───────────┼────────────┤
│ Month(6)  │ 0.258958   │
├───────────┼────────────┤
│ Month(7)  │ 0.0374593  │
├───────────┼────────────┤
│ Month(8)  │ 0.0390879  │
├───────────┼────────────┤
│ Month(9)  │ 0.0211726  │
├───────────┼────────────┤
│ Month(10) │ 0.00732899 │
├───────────┼────────────┤
│ Month(11) │ 0.0154723  │
├───────────┼────────────┤
│ Month(12) │ 0.00651466 │
╘═══════════╧════════════╛


In [199]:
print(model.get_cpds()[9])

╒════════════════════════════════════════════════════════════╤═════════════╕
│ Nature of Skillset(- SAPBO, Informatica)                   │ 0.00325733  │
├────────────────────────────────────────────────────────────┼─────────────┤
│ Nature of Skillset(10.00 AM)                               │ 0.000814332 │
├────────────────────────────────────────────────────────────┼─────────────┤
│ Nature of Skillset(11.30 AM)                               │ 0.00162866  │
├────────────────────────────────────────────────────────────┼─────────────┤
│ Nature of Skillset(11.30 Am)                               │ 0.000814332 │
├────────────────────────────────────────────────────────────┼─────────────┤
│ Nature of Skillset(12.30 Pm)                               │ 0.000814332 │
├────────────────────────────────────────────────────────────┼─────────────┤
│ Nature of Skillset(9.00 Am)                                │ 0.000814332 │
├────────────────────────────────────────────────────────────┼─────────────┤

In [200]:
print(model.get_cpds()[10])

[Output omitted: the printed CPD exceeded the notebook's IOPub data rate limit.]



In [201]:
print(model.get_cpds()[11])

[Output truncated: wide CPD table not shown.]

Now we can use pgmpy to perform belief propagation:

In [210]:
belief_propagation = BeliefPropagation(model)
belief_propagation.query(variables=['Observed Attendance'], evidence={'Marital Status': 'Married', 'Month':6})

TypeError: values: must contain tuples or array-like elements of the form (hashable object, type int)

There appears to be a bug in how pgmpy represents state names internally. Essentially, it doesn't appear to handle non-binary data all that well...
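One possible reading of this error is that the installed pgmpy version expects each evidence value to be an integer state index rather than a state name. That is an assumption, not something confirmed here. Under that assumption, and the further assumption that states are ordered by sorted label (as the printed CPDs suggest), a label could be translated to an index like this (the helper `state_index` is hypothetical):

```python
def state_index(column_values, state):
    """Position of `state` among the sorted distinct values of a column."""
    states = sorted(set(column_values))
    return states.index(state)


# Toy columns for illustration; in the notebook these would be
# columns of `trimmed`.
marital = ["Single", "Married", "Single"]
months = list(range(1, 13))

print(state_index(marital, "Married"))  # 0
print(state_index(months, 6))           # 5
```

The resulting integers would then be passed as the `evidence` values. Whether that resolves the `TypeError` depends on the pgmpy version and has not been verified.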