# Unsupervised Learning with Non-Ignorable Missing Data

Christine Hwang

In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import Ridge as Ridge_Reg
from sklearn.linear_model import Lasso as Lasso_Reg
from statsmodels.regression.linear_model import OLS
import sklearn.preprocessing as Preprocessing
from sklearn.preprocessing import StandardScaler as Standardize
import itertools as it
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
import scipy as sp
from itertools import combinations
%matplotlib inline

## Summary of Paper

### Overview
Missing data is non-ignorable if it is not missing at random. With non-ignorable missing data, inference based only on the observed data leads to biased parameter estimates and can affect the significance of your results. This paper attempts to use probabilistic models to fill in the missing data and reduce this bias.

### Example
Consider Netflix movie ratings: you are more likely to rate movies that you really enjoy or really dislike, and you rarely rate movies that you are neutral about. Most of the missing data therefore comes from these middle ratings, and the distribution of the observed data is shifted to the right of the true distribution of the completely filled data. In this scenario, the probability of observing a particular response depends on the value itself, so ignoring the missing data can lead to biased parameter estimates.
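
A small simulation can make this bias concrete. This is only a sketch: the true rating distribution `true_probs` and the per-rating observation probabilities `p_observe` are made-up numbers chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

ratings = np.arange(1, 6)                               # star values 1..5
true_probs = np.array([0.10, 0.25, 0.30, 0.25, 0.10])   # hypothetical true distribution
p_observe = np.array([0.60, 0.10, 0.05, 0.20, 0.80])    # hypothetical P(observed | rating)

# Draw the complete data, then hide each entry with a probability
# that depends on the rating itself (non-ignorable missingness).
y_true = rng.choice(ratings, size=100_000, p=true_probs)
observed = rng.random(y_true.size) < p_observe[y_true - 1]
y_obs = y_true[observed]

print("mean of all ratings:     ", y_true.mean())
print("mean of observed ratings:", y_obs.mean())
```

With these assumed probabilities the observed mean comes out higher than the mean of the complete data, even though nothing about the underlying ratings has changed.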

### Standard Mixture Model

<img src="GraphicalModel.png">

This standard mixture model places a latent variable z behind each observation Y. In this example of a multinomial mixture model, $z_n$ is the latent variable behind each entry and is drawn from a prior $\theta$, where $\theta_k$ is the probability that $z_n = k$. This means that the latent variable can take one of K values, each with its own multinomial distribution. Given the latent variable, $\beta_{vmz}$ is the probability that observation m takes the specific value v.
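
The generative story can be sketched in a few lines. The sizes `K`, `M`, `V`, `N` and the randomly drawn `theta` and `beta` below are assumptions for illustration; only the indexing convention `beta[v, m, z]` follows the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, V, N = 3, 6, 5, 4     # latent classes, movies, rating values, users (assumed sizes)

theta = rng.dirichlet(np.ones(K))               # theta[z] = P(z)
beta = rng.dirichlet(np.ones(V), size=(M, K))   # shape (M, K, V), sums to 1 over v
beta = np.moveaxis(beta, -1, 0)                 # reorder to beta[v, m, z]

z = rng.choice(K, size=N, p=theta)              # one latent class per user
y = np.empty((N, M), dtype=int)
for n in range(N):
    for m in range(M):
        # rating values are 1..V, so add 1 to the sampled index
        y[n, m] = rng.choice(V, p=beta[:, m, z[n]]) + 1

print(y)
```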

## Question ????

Just want to make sure I have the right interpretation of the distribution.

Let us represent the Netflix movie ratings with a multinomial distribution. In the diagram above, $Y_{1n'}$ is the star rating for movie 1 from person n', and we assume there are n people. If $Y_{im}$ = v, that means that person m gave movie i a rating of v stars. To represent each $Y_i$ as a multinomial distribution, we define it as Mult($\theta_z$, M), where $\theta_z$ holds the probabilities of selecting each value v under latent variable z and M is the total number of ratings.

$z_n$ ~ Mult($\pi$,1) 


$\pi$ = [.3, .2, .1, .4]

This means that the probability that the latent variable takes its first value is .3, its second value is .2, etc.


If we represent each $Y_{in}|z_i$ ~ Multi($\theta_z$,M), we view $\theta_z$ to be the probability distribution of ratings for a single movie. For example, $\theta = [.1 .2 .7]$ means that the probability you give the movie 1 star is .1, probability you give it 2 stars is .2, and probability you give it three stars is .7, and you can set your M = 1 so that it follows a Multinomial($\theta$, 1)? 
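
To check this reading numerically, here is a minimal sketch of a Multinomial($\theta$, 1) draw with the example $\theta$ above. Converting the one-hot result to a star count with `argmax` is my own convention, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.1, 0.2, 0.7])    # P(1 star), P(2 stars), P(3 stars), from the example

draw = rng.multinomial(1, theta)     # one-hot vector: a single 1 among three entries
rating = int(np.argmax(draw)) + 1    # position of the 1, shifted to a 1-based star count

# Over many draws the empirical frequencies approach theta.
counts = rng.multinomial(1, theta, size=10_000).sum(axis=0)
freqs = counts / counts.sum()
print(rating, freqs)
```

With M = 1 each draw selects exactly one category, so a single rating can be read off as the index of the success.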

Then $\beta$ estimates $\theta$ under this same view, in which $\theta$ is the probability distribution of ratings for a single movie and M = 1. It would then make sense that $Y_{in}$ ~ Multi($\beta$, 1), which represents the rating from person n for the $i^{th}$ movie.

Just want to make sure I am interpreting this correctly. My alternative explanation was that $Y_n$ ~ Mult($\theta$, M), where M is the sum of ratings for all movies rated by user n, but then I didn't know what to cap M at; it seemed more like a Poisson distribution and didn't make as much intuitive sense.

### CPT-v Model

<img src="Model2.png">

We extend the previous model to represent the missing data. $\mu_v$ = $P(R_m = 1 | Y_m = v)$, which is the probability that the value is missing given the true value of the data. In our movie rating example, this probability is higher if $Y_m$ is 2 or 3, because we are less likely to see a rating for a mediocre movie. I need to clarify the previous questions before analyzing this model further.

## Medical Application

I hope to apply this model of missing data to medical data. For example, in data on blood pressure or cholesterol, the probability of a missing value may be higher for younger, healthier people than for older people, because they are less likely to have health complications that require these measurements. Doctors may decide that these measurements are not necessary for the checkup, so the data will not be missing at random. To use this probabilistic model to fill in missing values, I need to understand how these categorical variables (if you just convert them to integers) can represent a multinomial distribution. If a person's blood pressure can take the values [10, 50, 100], and our $\theta = [.2 .2 .6]$, does this mean that the person's blood pressure ~ Multinomial($\theta$, 1)?
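
One way to think about the conversion is to treat each distinct measurement value as a category index and estimate $\theta$ from the category counts. The `bp` readings below are hypothetical, chosen so that the counts reproduce the example $\theta$; this is a sketch of the encoding step only, not of the full model.

```python
import numpy as np

# Hypothetical blood pressure readings restricted to the three example values.
bp = np.array([100, 10, 100, 50, 100, 10, 100, 50, 100, 100])

# levels holds the sorted distinct values; codes maps each reading to its category index.
levels, codes = np.unique(bp, return_inverse=True)

counts = np.bincount(codes, minlength=levels.size)
theta_hat = counts / counts.sum()    # empirical category probabilities

print(levels, codes, theta_hat)
```

The multinomial then works on the indices 0, 1, 2, and `levels` maps a sampled index back to the original measurement value.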

## Baseline Implementation

<img src="EM.png">

$\phi_{zn}$ = posterior distribution of the latent variable z 

$\theta_z$ = probability that the latent variable Z = z

$\beta_{vmz}$ = probability that $Y_m$ = v given the latent variable z

$\mu_v$ = probability that $Y_i$ is missing given it has value v.

$\gamma$ and $\lambda$ are intermediate variables without interpretable value.

### QUESTION!
I still need to find out what this $\delta$ function is.
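
If the $\delta$ in the update equations turns out to be the Kronecker delta (an assumption on my part, not yet checked against the paper), then $\delta(y, v)$ is 1 when the rating y equals v and 0 otherwise. A minimal sketch of how that would look as an array, using the same ratings as the `y` matrix in the code section:

```python
import numpy as np

# Ratings with shape (n, m): one row per user, one column per movie.
y = np.array([[5, 3, 4, 2, 3, 1],
              [1, 4, 3, 5, 1, 2],
              [5, 2, 3, 1, 4, 3],
              [4, 3, 5, 5, 4, 2]])
v = 5

values = np.arange(1, v + 1)
# delta[k, n_, m_] is 1 where y[n_, m_] equals rating value k + 1, else 0
delta = (y[None, :, :] == values[:, None, None]).astype(int)

print(delta.shape)
```

Under this reading, `delta` is just a one-hot encoding of the rating matrix along a new leading axis of length `v`.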

## Code

In [None]:
y = [[5, 3, 4, 2, 3, 1], [1,4,3,5,1,2],[5,2,3,1,4,3], [4,3,5,5,4,2]]
###########
# y = [5 3 4 2 3 1]
#     [1 4 3 5 1 2]
#     [5 2 3 1 4 3]
#     [4 3 5 5 4 2]

In [13]:
r = [[1,1,0,0,1,1],[1,1,0,1,0,1],[1,0,0,0,1,0],[1,1,1,1,0,0]]
########
# r = [1 1 0 0 1 1]
#     [1 1 0 1 0 1]
#     [1 0 0 0 1 0]
#     [1 1 1 1 0 0]

In [14]:
#initialize dimensions
z = 3 
n = 4
m = 6
v = 5

In [29]:
theta = np.zeros((1,z))
phi = np.ones((z,n))
gamma = np.zeros((m,z,n))
lambda_ = np.ones((v,m,z,n))
beta = np.zeros((v,m,z))
mu = np.ones(v)

In [36]:
gamma = lambda_.sum(axis = 0)
theta = phi.sum(axis=1)/phi.sum()
#### beta
for v_ in range(v):
    for m_ in range(m):
        for z_ in range(z):
            sum_ = 0      # numerator: sum over users n
            sum_phi = 0   # denominator: reset for every (v, m, z) entry
            for n_ in range(n):
                sum_ += phi[z_][n_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
                sum_phi += phi[z_][n_]
            beta[v_][m_][z_] = sum_/sum_phi
######mu
for v_ in range(v):
    num = 0
    denom = 0
    for n_ in range(n):
        for z_ in range(z):
            for m_ in range(m):
                num += phi[z_][n_]*r[n_][m_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
                denom += phi[z_][n_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
    mu[v_] = num/denom
    
    
###### lambda
for v_ in range(v):
    for m_ in range(m):
        for z_ in range(z):
            for n_ in range(n):
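                # parenthesize mu*beta so the exponent r applies to the whole product;
                # the delta term from the update equations is not included yet
                lambda_[v_][m_][z_][n_] = (mu[v_]*beta[v_][m_][z_])**r[n_][m_]*((1-mu[v_])*beta[v_][m_][z_])**(1-r[n_][m_])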

# recompute gamma from the updated lambda_, then normalize phi over z for each user
gamma = lambda_.sum(axis = 0)
phi_unnorm = theta[:, None]*gamma.prod(axis = 0)
phi = phi_unnorm/phi_unnorm.sum(axis=0)
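
The loop-based updates for $\beta$ and $\mu$ can also be written without explicit loops. The sketch below uses `np.einsum` on random stand-in arrays with the same shapes as `phi`, `lambda_`, `r` and `gamma`; the random inputs are for illustration only, and this is one possible vectorization rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
v, m, z, n = 5, 6, 3, 4

# Stand-in inputs with the same shapes as the arrays in the cells above.
phi = rng.random((z, n))
lambda_ = rng.random((v, m, z, n))
r = rng.integers(0, 2, size=(n, m))
gamma = lambda_.sum(axis=0)            # shape (m, z, n)

ratio = lambda_ / gamma                # lambda / gamma, broadcast over the v axis

# beta[v, m, z] = sum_n phi[z, n] * ratio[v, m, z, n] / sum_n phi[z, n]
beta = np.einsum('zn,vmzn->vmz', phi, ratio) / phi.sum(axis=1)

# mu[v] = sum(phi * r * ratio) / sum(phi * ratio), summed over n, z and m
mu = np.einsum('zn,nm,vmzn->v', phi, r, ratio) / np.einsum('zn,vmzn->v', phi, ratio)
```

Each subscript string names the axes of its operand, and any index missing from the output side is summed over, which mirrors the nested loops one-to-one.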

### Future Steps

I had to hard-code `for` loops for three of the parameters. I am working on how to use matrix multiplication instead but can't see a clear pattern yet. I am also not sure how to use $\beta$ to actually predict the values after the fitting is done. Do I call np.random.multinomial(1, $\beta$), then take the index of the array entry that has the 1 success and fill in that missing value?
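
A rough sketch of what such a fill-in step could look like is below. It assumes fitted `beta[v, m, z]` and `phi[z, n]` (random stand-ins here), ignores the missingness parameters $\mu$ entirely, and the helper names `sample_rating` and `most_likely_rating` are hypothetical. I have not confirmed that this is how the paper makes predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
v, m, z, n = 5, 6, 3, 4

# Stand-in "fitted" parameters (random, for illustration only).
beta = np.moveaxis(rng.dirichlet(np.ones(v), size=(m, z)), -1, 0)   # beta[v, m, z]
phi = rng.dirichlet(np.ones(z), size=n).T                            # phi[z, n]

def sample_rating(n_, m_):
    """Draw one rating for user n_ and movie m_ from the mixture."""
    z_ = rng.choice(z, p=phi[:, n_])                 # pick a latent class for the user
    one_hot = rng.multinomial(1, beta[:, m_, z_])    # argument order is (n, pvals)
    return int(np.argmax(one_hot)) + 1               # index of the success as a 1-based rating

def most_likely_rating(n_, m_):
    """Return the rating with the highest mixture probability instead of sampling."""
    probs = beta[:, m_, :] @ phi[:, n_]              # P(y = v) = sum_z beta[v, m, z] * phi[z, n]
    return int(np.argmax(probs)) + 1

print(sample_rating(0, 0), most_likely_rating(0, 0))
```

Sampling gives a random fill-in that reflects the uncertainty, while the `argmax` version always returns the single most probable rating.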

In [62]:
np.tile(theta, (z,1)).T

array([[  2.,   2.,   2.,   2.,   2.],
       [  4.,   4.,   4.,   4.,   4.],
       [  6.,   6.,   6.,   6.,   6.],
       [  8.,   8.,   8.,   8.,   8.],
       [ 10.,  10.,  10.,  10.,  10.]])

## Links

Description of Multinomial Mixture Models
http://web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring14-lecture3_scribed.pdf
https://www.cs.princeton.edu/courses/archive/spring12/cos424/pdf/em-mixtures.pdf

My specific Paper
http://www.cs.ubc.ca/~bmarlin/research/presentations/lnimd_group_talk.pdf
https://people.cs.umass.edu/~marlin/research/papers/aistat-lnimd.pdf

Application of my paper
http://ijcai.org/Proceedings/11/Papers/447.pdf
http://www.cs.toronto.edu/~zemel/documents/cfmar-uai2007.pdf
https://pdfs.semanticscholar.org/2845/eda7ce8de14e351d41182f92b73ece8873ef.pdf
https://people.cs.umass.edu/~marlin/research/thesis/cfmlp.pdf