# Estimating Missing Political Data - machine learning techniques for pre/post processing

This lab will focus on handling missing data in a new way: leveraging some of the models you've learned to use.

In general, this topic sits more on the "art" side of the science/art spectrum, but there are some well-established ways to deal with and impute missing data, depending on what you want to accomplish in the end (increase statistical power, remove NaNs, fill in a numeric value or label so your ML algorithms don't raise errors, etc.).
	
Our overall goal is to see that there can be a "functional relationship" between the "missingness" of the data, and features found in our data. By doing this, we can categorize the kind of "missingness" we are dealing with for a particular dataset.

# Types of "Missingness" 

| Type | Description |
|---|---|
| Missing Completely at Random | The best-case scenario: the NaNs, NAs, or blanks are distributed totally at random, so the affected rows can be safely omitted. |
| Missing at Random | A weaker condition: the missingness is "random" given the data you have observed. This is what we're aiming at for our analysis; functionally, we want to show that our missing data isn't dependent on data we haven't observed or accounted for in our dataset. |
| Missing not at Random | "There is a data generating process that yields missing values." Basically, there is some "pattern" to the "missingness". |
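
To make the three rows concrete, here is a small simulation sketch. Everything in it is invented for illustration (the `age` and `income` columns, the probabilities, the cutoffs); it is not taken from the polling data. It knocks out values of `income` under each mechanism and then compares the missing rate across an observed feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10000

# Hypothetical data: `age` is always observed, `income` is the column that loses values.
toy = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50, 15, size=n),
})

# MCAR: every row has the same 20% chance of being missing.
mcar = rng.uniform(size=n) < 0.20

# MAR: the chance of being missing depends only on the observed column `age`.
mar = rng.uniform(size=n) < np.where(toy['age'] > 50, 0.35, 0.05)

# MNAR: the chance of being missing depends on the value that goes missing.
mnar = rng.uniform(size=n) < np.where(toy['income'] > 60, 0.50, 0.05)

# Missing rate for older vs. younger respondents under each mechanism.
older = (toy['age'] > 50).to_numpy()
rates = pd.DataFrame(
    {name: [mask[older].mean(), mask[~older].mean()]
     for name, mask in [('mcar', mcar), ('mar', mar), ('mnar', mnar)]},
    index=['age > 50', 'age <= 50'],
)
print(rates)
```

In this toy setup only the MAR column shows a gap between the two age groups. The MNAR column looks just as flat as the MCAR one, because the thing driving it (`income` itself) is exactly what we can't see once it is missing.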

# Introducing the Inclusion Indicator 

As stated, the type of "missingness" we are most concerned about is the last row, "Missing not at Random". If there is a data generating process, we can model the "missingness" in our data set. If we can convincingly show that this model accounts for "most" of the observable variation (we're not being stringent statisticians, so that word is left for you to define), we can be relatively at ease that our "missingness" isn't functionally related to features we haven't controlled for, accounted for, or recorded in our data.

Before we move forward, we have to define the "inclusion indicator". We say $I$ is an inclusion indicator if:

$$
I = \begin{cases}
1 & \text{if } x \text{ is missing} \\
0 & \text{if } x \text{ is not missing}
\end{cases}
$$

# Loading up data with missing values

We are going to load up polling data. However we will take the analysis much broader this time, and we will be using a version of the data set where we have not removed missing values... because after all, that's the point of this entire lab! 

So load up the data and the libraries we will need to get started!


#### Loading the data 

In [1]:
from __future__ import division
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd




In [2]:
pre_poll = pd.read_csv('./assets/datasets/polls_new.csv')
del pre_poll['Unnamed: 0']
pre_poll.head()

| | bush | state | edu | age |
|---|---|---|---|---|
| 0 | 1.0 | 7 | 2 | 2 |
| 1 | 1.0 | 33 | 4 | 3 |
| 2 | 0.0 | 20 | 2 | 1 |
| 3 | 1.0 | 31 | 3 | 2 |
| 4 | 1.0 | 18 | 3 | 1 |


#### Problem 1 - Construct the Inclusion indicator and append the column to the table 

Build an 'inclusion' indicator column that will be 1 when bush is missing a value, and 0 otherwise.

In [3]:
# 1 where `bush` is missing, 0 where it was observed
pre_poll['inclusion'] = np.where(pd.isnull(pre_poll['bush']), 1, 0)

In [4]:
pre_poll.head()

| | bush | state | edu | age | inclusion |
|---|---|---|---|---|---|
| 0 | 1.0 | 7 | 2 | 2 | 0 |
| 1 | 1.0 | 33 | 4 | 3 | 0 |
| 2 | 0.0 | 20 | 2 | 1 | 0 |
| 3 | 1.0 | 31 | 3 | 2 | 0 |
| 4 | 1.0 | 18 | 3 | 1 | 0 |
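
Once the indicator exists, a quick first look at whether "missingness" tracks a feature is to average it within groups. The sketch below uses a tiny invented frame as a stand-in for `pre_poll` (the values are made up); on the real data you would group the actual `inclusion` column by `edu`, `age`, or `state` in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pre_poll; the values are invented for illustration.
toy = pd.DataFrame({
    'bush': [1.0, np.nan, 0.0, np.nan, 1.0, 0.0, np.nan, 1.0],
    'edu':  [1,   1,      2,   1,      3,   4,   2,      4],
})

# Same definition as in Problem 1: 1 where `bush` is missing, 0 otherwise.
toy['inclusion'] = toy['bush'].isnull().astype(int)

# The mean of a 0/1 column is the share of missing answers in each group.
missing_rate_by_edu = toy.groupby('edu')['inclusion'].mean()
print(missing_rate_by_edu)
```

If the rates differ a lot between groups, that is a first hint that the "missingness" is related to that feature.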


#### Problem 2 - Prepare your data by converting it into numpy arrays

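Some of our ML work will be better suited if the input data is held in numeric, numpy-friendly form. Here we label-encode the categorical columns so that each one takes consecutive integer codes starting at 0.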

In [4]:
from sklearn import preprocessing

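# The same encoder object is re-fit on each column in turn,
# mapping that column's categories to integer codes 0, 1, 2, ...
encode = preprocessing.LabelEncoder()
pre_poll['age'] = encode.fit_transform(pre_poll.age)
pre_poll['state'] = encode.fit_transform(pre_poll.state)
pre_poll['edu'] = encode.fit_transform(pre_poll.edu)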


pre_poll.head()

| | bush | state | edu | age | inclusion |
|---|---|---|---|---|---|
| 0 | 1.0 | 5 | 1 | 1 | 0 |
| 1 | 1.0 | 30 | 3 | 2 | 0 |
| 2 | 0.0 | 17 | 1 | 0 | 0 |
| 3 | 1.0 | 28 | 2 | 1 | 0 |
| 4 | 1.0 | 15 | 2 | 0 | 0 |
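
To see what the encoder is doing, here is a small sketch on an invented series of state codes (not the real column). It also shows one-hot encoding via `pd.get_dummies` as an alternative you may want to think about: integer codes imply an ordering, which is sensible for something like education level but arbitrary for states.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical state codes with gaps, similar in spirit to the `state` column.
states = pd.Series([7, 33, 20, 31, 18, 7])

encoder = LabelEncoder()
codes = encoder.fit_transform(states)
print(codes)             # integer codes, assigned in sorted order of the originals
print(encoder.classes_)  # original values, indexed by their new code

# Alternative for nominal columns: one indicator column per category.
one_hot = pd.get_dummies(states, prefix='state')
print(one_hot.columns.tolist())
```

The lab itself sticks with integer codes; the one-hot version is only here as a point of comparison.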


#### Problem 3 - Split the data  70/30 train/test

Split the data in the ordinary way, making sure you have a 70/30 split.

In [5]:
# Draw a uniform random number per row; roughly 70% of rows land in the training set
pre_poll['train'] = np.random.uniform(0, 1, len(pre_poll)) <= .70
pre_poll_train = pre_poll[pre_poll['train']]
test = pre_poll[~pre_poll['train']]

In [7]:
pre_poll.train.value_counts()

True     9511
False    4033
Name: train, dtype: int64
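
The random-mask approach gives a split that is only approximately 70/30 and changes on every run. If you want an exact, repeatable split, scikit-learn's `train_test_split` is one option. This is a sketch on an invented frame (the `toy` data and the 15% missing rate are assumptions), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for pre_poll.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'edu': rng.integers(0, 4, size=1000),
    'inclusion': (rng.uniform(size=1000) < 0.15).astype(int),
})

# test_size=0.30 fixes the 70/30 row split; stratify keeps the share of
# missing rows (nearly) equal in both parts; random_state makes it repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.30, stratify=toy['inclusion'], random_state=0
)
print(len(toy_train), len(toy_test))
print(toy_train['inclusion'].mean(), toy_test['inclusion'].mean())
```

Stratifying on the indicator matters here because the "missing" class is the minority, and an unlucky split could leave the test set with very few of those rows.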

# Wait... how can we tell if something is "Missing not at random"? 

That's a good question. One way is to understand "how much" of the variation in your data your model is accounting for. We'll do some preliminary work on that front here, but I'm going to ask you to ask yourself:

1. How can I apply what I've learned in regressions to this problem? 
2. What are other metrics I could use to account for variation in data outside of regressions? 

One approach we've strongly pointed towards is to construct regression models with the inclusion indicator as the target, and see what sort of performance you can get out of that family of techniques.
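
As one possible answer to the second question, a chi-square test of independence between the indicator and a categorical feature doesn't need a regression at all. The sketch below is run on simulated data where the mechanism is invented (lower `edu` codes are more likely to be missing); it relies on `scipy.stats.chi2_contingency`, which the lab doesn't otherwise use.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 5000
edu = rng.integers(0, 4, size=n)

# Invented mechanism: lower education codes are more likely to be missing.
p_missing = np.array([0.30, 0.20, 0.10, 0.05])[edu]
inclusion = (rng.uniform(size=n) < p_missing).astype(int)

# Contingency table of education level (rows) by missing / not missing (columns).
table = pd.crosstab(edu, inclusion)
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print("chi2 = %.1f, dof = %d, p-value = %.3g" % (chi2, dof, p_value))
```

A small p-value says the indicator and the feature are not independent, though it doesn't tell you how much of the "missingness" that feature explains.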

In [11]:
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Using Logistic Regression to model the "missingness"

#### Problem 4 - Build a classical logistic regression to model the inclusion indicator as a target

In [14]:
# This is my favorite logistic implementation: very simple, and close to R's glm.
import statsmodels.api as sm
#from sklearn.linear_model import LogisticRegression


rhs_columns = ['edu', 'state', 'age']
# Note: sm.Logit does not add an intercept on its own; only the three columns are fitted.
inclusion_lm = sm.Logit(pre_poll_train['inclusion'], pre_poll_train[rhs_columns]).fit()

inclusion_lm.summary()

Optimization terminated successfully.
         Current function value: 0.410249
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | inclusion | No. Observations: | 9511 |
| Model: | Logit | Df Residuals: | 9508 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 25 Jul 2016 | Pseudo R-squ.: | 0.01768 |
| Time: | 15:31:07 | Log-Likelihood: | -3901.9 |
| converged: | True | LL-Null: | -3972.1 |
| | | LLR p-value: | 3.155e-31 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| edu | -0.4870 | 0.022 | -22.456 | 0.000 | -0.529 -0.444 |
| state | -0.0071 | 0.002 | -3.964 | 0.000 | -0.011 -0.004 |
| age | -0.1433 | 0.022 | -6.534 | 0.000 | -0.186 -0.100 |
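
Two numbers in this summary are worth unpacking by hand. The sketch below copies the coefficients and log-likelihoods from the output above, converts the coefficients to odds ratios, and recomputes the pseudo R-squared as McFadden's 1 - LL / LL-Null. Treat the odds-ratio reading as rough: the fitted model has no intercept, and the columns are integer codes rather than true measurements.

```python
import numpy as np

# Numbers copied from the summary output above.
coefs = {'edu': -0.4870, 'state': -0.0071, 'age': -0.1433}
log_likelihood = -3901.9
ll_null = -3972.1

# exp(coef) is the multiplicative change in the odds of "missing"
# for a one-unit increase in that column.
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print("%-5s odds ratio: %.3f" % (name, ratio))

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - log_likelihood / ll_null
print("pseudo R-squared: %.4f" % pseudo_r2)
```

A pseudo R-squared this close to zero means the three columns account for very little of the variation in the indicator, even though each coefficient is individually significant.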


#### Problem 5 - Build a vector of predictions from the trained model

In [15]:
# Predicted probability that `bush` is missing, for each row of the test set
y_pred = inclusion_lm.predict(test[rhs_columns])
y_pred

array([ 0.06844564,  0.1227757 ,  0.08395198, ...,  0.14395247,
        0.19386447,  0.16253144])

# Using K-Nearest Neighbor for imputing missing data

#### Problem 6 - Build a K-NN model (k = 5), to model the inclusion indicator 

The point of this model isn't really to shed more light on the "missingness", but rather to actually impute values into the column of data that contains missing values. Still, it's a good exercise to go through. After you've done the imputation, take a random subset of the imputed values and think about the results: is this a good way to fill in values? Would it be easier to do something simpler, e.g. take the average for numerical data, or just select some label as a fill-in for categorical data?

In [16]:
from sklearn.neighbors import KNeighborsClassifier

knn_impute = KNeighborsClassifier(n_neighbors = 5)
knn_impute.fit(pre_poll_train[rhs_columns], pre_poll_train['inclusion'])

knn_pred = knn_impute.predict(test[rhs_columns])
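
The cell above models the inclusion indicator. For the imputation step itself, one way to go about it is to fit the classifier on rows where `bush` is observed and predict it for the rows where it is missing. The sketch below does that on an invented frame (random values, 15% knocked out), so the imputed answers are meaningless; only the mechanics carry over.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for pre_poll with random values.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'edu': rng.integers(0, 4, size=n),
    'state': rng.integers(0, 49, size=n),
    'age': rng.integers(0, 4, size=n),
    'bush': rng.integers(0, 2, size=n).astype(float),
})
toy.loc[rng.uniform(size=n) < 0.15, 'bush'] = np.nan

features = ['edu', 'state', 'age']
observed = toy['bush'].notnull()

# Fit on rows where the answer is known, then predict it where it is not.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(toy.loc[observed, features], toy.loc[observed, 'bush'])

imputed = toy.copy()
imputed.loc[~observed, 'bush'] = knn.predict(toy.loc[~observed, features])

# Eyeball a random subset of the filled-in rows.
print(imputed.loc[~observed].sample(5, random_state=0))
```

A simple baseline to compare against is filling every gap with the most common observed label, e.g. `toy['bush'].fillna(toy['bush'].mode()[0])`.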



# Imputing with Random Forest

#### Problem 7 - Build a Random forest to model the inclusion indicator 

Similar to the KNN, this is more about actually doing the imputation. It's still a good review exercise, so compare your results with the KNN. How can we objectively measure relative performance?

In [17]:
from sklearn.ensemble import RandomForestClassifier


random_for = RandomForestClassifier(n_estimators=1000)
random_for.fit(pre_poll_train[rhs_columns], pre_poll_train['inclusion'])


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
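
A fitted forest also exposes `feature_importances_`, which gives a rough view of which columns the trees leaned on when predicting the indicator. The sketch below uses simulated data with an invented mechanism (missingness depends on `edu` only) and smaller, shallower trees than the cell above to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for pre_poll_train.
rng = np.random.default_rng(0)
n = 3000
toy = pd.DataFrame({
    'edu': rng.integers(0, 4, size=n),
    'state': rng.integers(0, 49, size=n),
    'age': rng.integers(0, 4, size=n),
})

# Invented mechanism: the chance of being missing depends on `edu` only.
p_missing = np.array([0.40, 0.20, 0.10, 0.05])[toy['edu'].to_numpy()]
toy['inclusion'] = (rng.uniform(size=n) < p_missing).astype(int)

features = ['edu', 'state', 'age']
forest = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
forest.fit(toy[features], toy['inclusion'])

# Impurity-based importances; they sum to 1 across the features.
importances = pd.Series(forest.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

Keep in mind that impurity-based importances can favor columns with many distinct values (like `state`), so read them as a hint rather than a verdict.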

# Comparing results and predictive performance

We need to compare our results: construct ROC curves and AUC scores for each of the 3 methods.

Print the AUC for your non-Bayesian logistic regression

In [18]:
fpr, tpr, thresholds = roc_curve(test['inclusion'], y_pred)
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)


Area under the ROC curve : 0.611303


Print the AUC for Random Forest

In [19]:
random_pred = random_for.predict_proba(test[rhs_columns]); random_pred

fpr, tpr, _ = roc_curve(test['inclusion'], random_pred[:, 1])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc) 

Area under the ROC curve : 0.670530


Print the AUC for KNN Impute

In [20]:
# knn_pred holds hard 0/1 labels, so this ROC curve has a single operating point
fpr, tpr, thresholds = roc_curve(test['inclusion'], knn_pred)
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

Area under the ROC curve : 0.523246
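
One caveat on the KNN number: it was computed from the hard 0/1 labels returned by `predict`, while the other two models were scored on predicted probabilities. For a like-for-like comparison you could score KNN on `predict_proba` instead. The sketch below shows both side by side on simulated data (the mechanism and the 3000/1000 split are invented), using `roc_auc_score` as a shortcut for `roc_curve` followed by `auc`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
X = rng.integers(0, 4, size=(n, 3))

# Invented mechanism: the first feature drives the chance of being missing.
p_missing = np.array([0.40, 0.20, 0.10, 0.05])[X[:, 0]]
y = (rng.uniform(size=n) < p_missing).astype(int)

X_train, X_test = X[:3000], X[3000:]
y_train, y_test = y[:3000], y[3000:]

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# AUC from hard 0/1 labels vs. AUC from class-1 probabilities.
auc_labels = roc_auc_score(y_test, knn.predict(X_test))
auc_scores = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1])
print("AUC from labels:        %.3f" % auc_labels)
print("AUC from probabilities: %.3f" % auc_scores)
```

With k = 5 the probabilities can only take six distinct values, so even the probability-based curve is coarse; a larger k gives a finer ranking.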


**Open-Ended Questions** Can we be fairly confident that there is some kind of functional relationship between the indicator variable and the few columns we studied in our data set? Is it obvious that there are probably other factors impacting "missingness" in this data? Which type of "missingness" are we probably dealing with, and what does that say about the state of our missing data and how we should approach modeling on this data set in the future? What further actions can we take to augment this analysis?