# Unsupervised Learning with Non-Ignorable Missing Data

Christine Hwang

In [436]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import Ridge as Ridge_Reg
from sklearn.linear_model import Lasso as Lasso_Reg
from statsmodels.regression.linear_model import OLS
import sklearn.preprocessing as Preprocessing
from sklearn.preprocessing import StandardScaler as Standardize
import itertools as it
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
import scipy as sp
from itertools import combinations
%matplotlib inline

## Summary of Paper

### Overview
Missing data is non-ignorable if it is not missing at random. With non-ignorable missing data, inference based only on the observed data leads to biased parameter estimates and can affect the significance of your results. This paper uses probabilistic models of the missing-data mechanism to fill in missing values and reduce this bias.

### Example
In the scenario where you have Netflix movie ratings, you are more likely to rate movies that you really enjoy or really dislike, but rarely rate movies that you are neutral about. Therefore, most of the missing data tends to come from these middle ratings, and the distribution of the observed data is pushed toward the extremes relative to the true distribution of the completely filled data. In this scenario, the probability of observing a particular response depends on the value itself, so ignoring the missing data can lead to biased parameter estimates.
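The effect described in this example can be checked with a small simulation. This is only a sketch: the true rating distribution `true_probs` and the observation probabilities `p_observe` are made-up values chosen for illustration, not numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true rating distribution over 1..5 stars (assumption for illustration).
true_probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
ratings = rng.choice(np.arange(1, 6), size=100_000, p=true_probs)

# Hypothetical probability of observing each rating value:
# extreme ratings are submitted far more often than neutral ones.
p_observe = np.array([0.8, 0.4, 0.1, 0.4, 0.8])
observed_mask = rng.random(ratings.size) < p_observe[ratings - 1]
observed = ratings[observed_mask]

# Share of 3-star ratings in the full data versus the observed subset.
share_3_full = np.mean(ratings == 3)
share_3_observed = np.mean(observed == 3)
print("share of 3-star ratings, full data:    ", round(share_3_full, 3))
print("share of 3-star ratings, observed only:", round(share_3_observed, 3))
```

Because the chance of being observed depends on the rating itself, the observed share of neutral ratings understates the true share, which is the bias the model is meant to correct.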

### Standard Mixture Model

<img src="GraphicalModel.png">

This standard mixture model shows a latent variable $z$ behind each observation $Y$. In this example of a multinomial mixture model, $z_n$ represents the latent variable behind each entry, drawn from a categorical distribution with parameter $\theta$. This means that there are $K$ possible values of the latent variable (mixture components) that an observation can come from. For each observed value given its latent variable, $\beta_{vmz}$ represents the probability that observation $m$ takes the specific value $v$.

## Interpretation of the Model

Just want to make sure I have the right interpretation of the distribution.

Let us consider the scenario where we represent the Netflix movie ratings with a multinomial distribution. Then, from the diagram above, $Y_{1n'}$ is the star rating for movie 1 from person $n'$, and we assume there are $N$ people. If $Y_{mn} = v$, the star rating for movie $m$ from person $n$ is $v$. To represent each $Y_m$ as a multinomial distribution, we define it as Mult($\theta_z$, M), where $\theta_z$ holds P(selecting value $v$ given latent variable $z$) and M is the total number of ratings.

$z_n$ ~ Mult($\pi$,1) 


$\pi$ = [.3, .2, .1, .4]

This means that $P(z_n = 1) = .3$, $P(z_n = 2) = .2$, etc.


If we represent each $Y_{m}|z$ ~ Multi($\theta_z$, N), we view $\theta_z$ as the probability distribution of ratings for a single movie. For example, $\theta = [.1, .2, .7]$ means that the probability you give the movie 1 star is .1, the probability you give it 2 stars is .2, and the probability you give it 3 stars is .7. Setting M = N then gives a Multinomial($\theta$, N): N people each give movie $m$ a rating, so the counts of the ratings those N people give movie $m$ are Multinomial given $\theta$.

Then $\beta$ plays the role of $\theta$ in this view: it is the probability distribution of ratings for a single movie. It then makes sense that $Y_{m}$ ~ Multi($\beta$, N), which represents the ratings from N people for the $m^{th}$ movie.
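To make this interpretation concrete, here is a minimal sketch of the generative process as I read it: each person draws a latent class from $\pi$, then each of that person's ratings is drawn from the $\beta$ column for that movie and class. The sizes and the randomly drawn `beta` are assumptions for illustration; only `pi` comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)

n_people, n_movies, n_values = 6, 3, 3       # hypothetical sizes
pi = np.array([0.3, 0.2, 0.1, 0.4])          # mixing proportions from the text
n_classes = pi.size

# beta[v, m, z] = P(Y_m = v + 1 | z). Drawn at random here purely for illustration.
# rng.dirichlet returns shape (classes, movies, values), so reorder the axes.
beta = rng.dirichlet(np.ones(n_values), size=(n_classes, n_movies)).transpose(2, 1, 0)

# One latent class per person, then one rating per (person, movie).
z = rng.choice(n_classes, size=n_people, p=pi)
y = np.empty((n_people, n_movies), dtype=int)
for person in range(n_people):
    for movie in range(n_movies):
        y[person, movie] = rng.choice(n_values, p=beta[:, movie, z[person]]) + 1

print("latent classes:", z)
print("ratings:\n", y)
```

Under this reading, a single rating is one categorical draw, and the multinomial with N trials describes the counts of each star value across N people who share the same class.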

### CPT-v Model

<img src="Model2.png">

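We extend the previous model to include the missing-data mechanism. $\mu_v$ = $P(R_m = 1 | Y_m = v)$, where $R_m$ is the response indicator. With $R_m = 1$ marking an observed entry (the convention used for the matrix `r` in the code below), $\mu_v$ is the probability that the value is observed given the true value of the data, so the probability that it is missing is $1 - \mu_v$. In our movie rating example, the probability of a missing value is higher when $Y_m$ is 2 or 3 because we are less likely to see a rating for a mediocre movie. Need to clarify previous questions before analyzing this model further.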

## Medical Application

I hope to apply this model of missing data to medical data. For example, in data on blood pressure or cholesterol, we may see that the probability of missing values is higher for younger, healthier people than for older people because they are less likely to have health complications that require these measurements. Doctors may assume that these measurements are not necessary for the checkup, and therefore the data will not be missing at random. In order to use this probabilistic model to fill in missing values, I need to understand how these categorical variables (if you just convert them to integers) can be represented by a multinomial distribution. If a person's blood pressure can take the values [10, 50, 100], and our $\theta = [.2, .2, .6]$, does this mean that the person's blood pressure = Multinomial($\theta$, 1)?
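One way to read the question above: a Multinomial($\theta$, 1) draw returns a one-hot vector over the categories, and the numeric levels act only as labels attached to those categories. A small sketch, assuming the three levels and $\theta$ given in the text:

```python
import numpy as np

rng = np.random.default_rng(2)

levels = np.array([10, 50, 100])     # category labels from the text
theta = np.array([0.2, 0.2, 0.6])    # category probabilities from the text

# A single trial yields a one-hot vector: exactly one category is selected.
one_hot = rng.multinomial(1, theta)
reading = levels[np.argmax(one_hot)]
print(one_hot, reading)

# With many trials the category frequencies approach theta.
counts = rng.multinomial(10_000, theta)
print(counts / counts.sum())
```

So the blood pressure itself is not the multinomial draw; it is the level whose indicator is 1 in that draw.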

## Baseline Implementation

<img src="EM.png">

$\phi_{zn}$ = posterior distribution of the latent variable z 

$\theta_z$ = probability that the latent variable Z = z

$\beta_{vmz}$ = probability that $Y_m$ = v given latent variable z

$\mu_v$ = probability that $Y_i$ is observed ($R_i = 1$) given that it has value v, so $1 - \mu_v$ is the probability that it is missing.

$\gamma$ and $\lambda$ are intermediate variables without interpretable value.

$\delta = I(y_{mn} = v)$
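For reference, these are the updates as I have transcribed them from the code cell further down (not from the figure), with the posterior $\phi$ written in its standard normalized form. Indices: $v$ rating value, $m$ movie, $z$ latent class, $n$ person, and $r_{mn} = 1$ when the rating is observed.

E step:

$$\lambda_{vmzn} = \left(\delta\,\mu_v\,\beta_{vmz}\right)^{r_{mn}} \left((1-\mu_v)\,\beta_{vmz}\right)^{1-r_{mn}}$$

$$\gamma_{mzn} = \sum_{v} \lambda_{vmzn}$$

$$\phi_{zn} = \frac{\theta_z \prod_{m} \gamma_{mzn}}{\sum_{z'} \theta_{z'} \prod_{m} \gamma_{mz'n}}$$

M step:

$$\theta_z = \frac{\sum_{n} \phi_{zn}}{\sum_{z'} \sum_{n} \phi_{z'n}}$$

$$\beta_{vmz} = \frac{\sum_{n} \phi_{zn}\,\lambda_{vmzn}/\gamma_{mzn}}{\sum_{n} \phi_{zn}}$$

$$\mu_v = \frac{\sum_{n}\sum_{z}\sum_{m} \phi_{zn}\, r_{mn}\, \lambda_{vmzn}/\gamma_{mzn}}{\sum_{n}\sum_{z}\sum_{m} \phi_{zn}\, \lambda_{vmzn}/\gamma_{mzn}}$$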


## Priors

$\mu_v$ comes from a Beta distribution because the Beta is the conjugate prior for a Bernoulli parameter

$\theta$ comes from a Dirichlet distribution

$\beta$ comes from a Dirichlet distribution
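A minimal sketch of drawing one set of parameters from these priors. The flat hyperparameters (all ones) and the sizes are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n_values, n_movies, n_classes = 5, 3, 2   # hypothetical sizes

# One Bernoulli parameter per rating value, each drawn from Beta(1, 1).
mu = rng.beta(1.0, 1.0, size=n_values)

# Mixing proportions over latent classes, drawn from a flat Dirichlet.
theta = rng.dirichlet(np.ones(n_classes))

# One rating distribution per (class, movie); result has shape (classes, movies, values).
beta = rng.dirichlet(np.ones(n_values), size=(n_classes, n_movies))

print(mu)
print(theta, theta.sum())
print(beta.shape, beta.sum(axis=-1))
```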

## Code

In [437]:
y = [[5, 3, 4, 2, 3, 1], [1,4,3,5,1,2],[5,2,3,1,4,3], [4,3,5,5,4,2]]
###########
# y = [5 3 4 2 3 1]
#     [1 4 3 5 1 2]
#     [5 2 3 1 4 3]
#     [4 3 5 5 4 2]

In [438]:
r = [[1,1,0,0,1,1],[1,1,0,1,0,1],[1,0,0,0,1,0],[1,1,1,1,0,0]]
########
# r = [1 1 0 0 1 1]
#     [1 1 0 1 0 1]
#     [1 0 0 0 1 0]
#     [1 1 1 1 0 0]

### Generate Full Data
I created the probability distribution of ratings for three different movies. Movie 1 is a bad movie that is likely to get a lot of 1s. Movie 2 is just an average movie whose ratings will resemble a normal distribution centered at 3. Movie 3 is a great movie that is likely to get very high reviews. Concatenate these ratings to get the full data set of 100 ratings for each of these 3 movies.

In [601]:
#initialize dimensions
z = 1
n = 100
m = 3
v = 5
#bad movie. high probability of getting 1s and low prob of getting 5
beta_1 = [.5,.2,.1,.1,.1]
#average movie. 
beta_2 = [.1,.2,.5,.2,.1]
#great movie
beta_3 = [.05,.1,.1,.35,.4]

In [602]:
y_1_freq = np.random.multinomial(n, beta_1)
y_2_freq = np.random.multinomial(n,beta_2)
y_3_freq = np.random.multinomial(n,beta_3)

In [603]:
y_1 = []
y_2 = []
y_3 = []
for i in range(v):
    y_1 += [i+1]*y_1_freq[i]
    y_2 += [i+1]*y_2_freq[i]
    y_3 += [i+1]*y_3_freq[i]
complete_data = np.vstack((y_1,y_2,y_3)).T

### Mu Initialization

In the paper in study, we let $\mu_v(s)$ = $s$($v$ − 3) + 0.5, where $s$ is the parameter that controls the strength of the effect. I will set $s$ = .075. 

In [604]:
mu = np.ones(v)
for i in range(v):
    # index i holds star value i+1, so (v - 3) in the formula is (i + 1 - 3)
    mu[i] = .075*((i+1)-3)+.5
mu

array([ 0.35 ,  0.425,  0.5  ,  0.575,  0.65 ])

In terms of the movie rating example, it makes more sense for a review to be missing when the movie is neutral than when it is bad. Therefore, I changed the probabilities so that they resemble a bell-shaped curve in which the probability of the rating being missing peaks at 3 stars.

In [605]:
for i in range(v):
    mu[i] = -.075*abs(i-2)+.5
mu

array([ 0.35 ,  0.425,  0.5  ,  0.425,  0.35 ])

### Pattern for Missing Data (R)

Given this, we want to create a matrix that indicates 1 if the data is observed and 0 otherwise.

In [606]:
r = np.ones((n,m))
r.shape

(100, 3)

In [607]:
for n_ in range(n):
    for m_ in range(m):
        val = complete_data[n_][m_]
        # in this cell mu[val-1] is used as the probability that the rating is missing
        if np.random.rand() < mu[val-1]:
            r[n_][m_] = 0
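The same missingness pattern can be drawn without explicit loops by indexing the probability vector with the ratings. This is a sketch with stand-in data so that it runs on its own; `p_missing` is a hypothetical name for the per-value missing probabilities used in the loop above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-ins for the notebook's variables (hypothetical values for a self-contained run).
n, m = 100, 3
complete_data = rng.integers(1, 6, size=(n, m))          # ratings in 1..5
p_missing = np.array([0.35, 0.425, 0.5, 0.425, 0.35])    # P(missing | rating value)

# Look up the missing probability of every entry, then draw all indicators at once.
p_entry = p_missing[complete_data - 1]                   # shape (n, m)
r = (rng.random((n, m)) >= p_entry).astype(float)        # 1 = observed, 0 = missing

print("fraction observed:", r.mean())
```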

### Pull $\theta, \beta$ from Dirichlet

Because Dirichlet distributions are good priors for categorical and multinomial distributions, they are appropriate priors for $\theta$ and $\beta$, both of which represent the probability of taking a specific categorical value.

In [608]:
# theta = np.zeros((1,z))
# beta = np.zeros((v,m,z))
# theta = np.random.dirichlet((1,1),1).reshape((2,))
theta = [1]
beta = np.random.dirichlet((1,1,1,1,1),(z,m)).transpose()

In [609]:
phi = np.ones((z,n))
gamma = np.ones((m,z,n))
lambda_ = np.ones((v,m,z,n))

In [610]:
#### not sure when to make this converge because it seems to just get smaller and smaller and smaller
for i in range(10000):   
    
    ### E step
    
    ###### lambda
    for v_ in range(v):
        for m_ in range(m):
            for z_ in range(z):
                for n_ in range(n):
                    lambda_[v_][m_][z_][n_] = ((complete_data[n_][m_]==(v_+1))*mu[v_]*beta[v_][m_][z_])**r[n_][m_]*((1-mu[v_])*beta[v_][m_][z_])**(1-r[n_][m_])


    ###### gamma
    gamma = lambda_.sum(axis = 0)

    ##### brute force phi (normalized in log space)
    for n_ in range(n):
        # log of the unnormalized posterior for each latent class
        log_post = np.zeros(z)
        for z_ in range(z):
            log_post[z_] = np.log(theta[z_])
            for m_ in range(m):
                log_post[z_] += np.log(gamma[m_][z_][n_])

        # denominator: log of the sum over classes (log-sum-exp),
        # not the sum of the per-class logs
        phi_denom = np.logaddexp.reduce(log_post)
        for z_ in range(z):
            phi[z_][n_] = np.exp(log_post[z_]-phi_denom)
            
    
    
    
    ###### log phi
    #phi_num = np.log(np.tile(theta, (n,1))).T+np.log(gamma).sum(axis = 0)
    #phi_denom = np.log(np.tile(theta, (n,1)).T)*gamma.prod(axis = 0)).sum(axis=0)

    
    ##### vectorized phi
    #phi = np.tile(theta, (n,1)).T*gamma.prod(axis = 0)/(np.tile(theta, (n,1)).T*gamma.prod(axis = 0)).sum(axis=0)



    ### M step

    theta = phi.sum(axis=1)/phi.sum()

    ####beta
    for v_ in range(v):
        for m_ in range(m):
            for z_ in range(z):
                sum_ = 0
                sum_phi = 0 
                for n_ in range(n):
                    sum_ += phi[z_][n_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
                    sum_phi += phi[z_][n_]
                beta[v_][m_][z_] = sum_/sum_phi
    ######mu
    for v_ in range(v):
        num = 0
        denom = 0
        for n_ in range(n):
            for z_ in range(z):
                for m_ in range(m):
                    num += phi[z_][n_]*r[n_][m_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
                    denom += phi[z_][n_]*lambda_[v_][m_][z_][n_]/gamma[m_][z_][n_]
        mu[v_] = num/denom
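On the question of when to stop the loop above: one common choice is to track the observed-data log-likelihood, $\sum_n \log \sum_z \theta_z \prod_m \gamma_{mzn}$, and stop when it changes by less than a tolerance. The sketch below assumes `gamma` has shape (m, z, n) as in the cell above; the function names and the tolerance are hypothetical.

```python
import numpy as np

def observed_log_likelihood(theta, gamma):
    """Sum over people of log sum_z theta_z * prod_m gamma[m, z, n]."""
    log_joint = np.log(theta)[:, None] + np.log(gamma).sum(axis=0)   # shape (z, n)
    return np.logaddexp.reduce(log_joint, axis=0).sum()

def has_converged(prev_ll, curr_ll, tol=1e-6):
    """Relative-change stopping rule; tol is an arbitrary illustrative choice."""
    return abs(curr_ll - prev_ll) <= tol * (1.0 + abs(prev_ll))

# Tiny usage example with made-up gamma values (m=2 movies, z=1 class, n=3 people).
theta_demo = np.array([1.0])
gamma_demo = np.array([[[0.2, 0.5, 0.1]],
                       [[0.4, 0.3, 0.6]]])
ll = observed_log_likelihood(theta_demo, gamma_demo)
print(ll)
```

Inside the EM loop this would be called once per iteration after `gamma` is computed, with a `break` when `has_converged` returns True.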
        

In [611]:
beta

array([[[ 0.45475136],
        [ 0.04012512],
        [ 0.04012512]],

       [[ 0.2176609 ],
        [ 0.12696886],
        [ 0.16324568]],

       [[ 0.14034355],
        [ 0.70171775],
        [ 0.02339059]],

       [[ 0.13118827],
        [ 0.13118827],
        [ 0.30610595]],

       [[ 0.05605592],
        [ 0.        ],
        [ 0.46713266]]])

## Predictions

The idea behind prediction is that, now that we have a converged beta, we draw the rating value $v$ for each missing entry from a multinomial distribution. How do I deal with this when there are latent variables? How do you know which latent variable each person comes from? I can take the posterior distribution phi and take its argmax to identify which latent variable he or she comes from.

In [612]:
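# keep observed ratings; missing entries become 0 and are filled in below
fill_in_data = r*complete_data
for n_ in range(n):
    # most probable latent class for person n_
    latent = np.argmax(phi[:,n_])
    for m_ in range(m):
        if fill_in_data[n_][m_] == 0:
            # draw one rating from beta for this movie and class (one-hot -> star value)
            fill_in_data[n_][m_] = np.argmax(np.random.multinomial(1, beta[:,m_,latent]))+1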
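An alternative to taking the argmax of phi is to average over latent classes and to use the fact that the entry is missing. Under the model, for a missing entry $P(y = v \mid z) \propto (1-\mu_v)\beta_{vmz}$, and the classes can be mixed with weights $\phi_{zn}$. This is a sketch of that idea, not what the cell above does; it assumes `mu` is the probability that a rating is observed, as in the E-step code, and all input values are made up.

```python
import numpy as np

def missing_value_distribution(phi_n, beta_m, mu):
    """P(y = v | entry missing) for one person and one movie.

    phi_n:  (z,)   posterior over latent classes for this person
    beta_m: (v, z) rating probabilities for this movie under each class
    mu:     (v,)   probability that a rating with value v is observed
    """
    per_class = (1.0 - mu)[:, None] * beta_m          # unnormalized, shape (v, z)
    per_class = per_class / per_class.sum(axis=0)     # normalize over v within each class
    return per_class @ phi_n                          # mix classes by posterior weight

# Made-up inputs for illustration only.
phi_n = np.array([0.7, 0.3])
beta_m = np.array([[0.5, 0.1],
                   [0.3, 0.2],
                   [0.2, 0.7]])
mu = np.array([0.6, 0.4, 0.6])

probs = missing_value_distribution(phi_n, beta_m, mu)
draw = np.random.default_rng(5).choice(len(probs), p=probs) + 1
print(probs, draw)
```

Taking `np.argmax(probs) + 1` instead of sampling gives a point prediction, which may be preferable when the goal is a low mean absolute deviation.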

In [613]:
r[11]

array([ 0.,  0.,  0.])

In [614]:
fill_in_data[11]

array([ 1.,  3.,  4.])

In [615]:
complete_data[11]

array([1, 2, 2])

In [616]:
def fill_in_accuracy(complete,filled,rmse = True):
    # complete, filled: arrays that are zero wherever the rating was observed
    # NOTE: despite its name, rmse=True reports the mean absolute deviation
    missing = (complete + filled) != 0
    if not rmse:
        total = missing.sum()
        correct = (complete[missing] == filled[missing]).sum()
        print("The total number of missing values is", total)
        print("The number of accurate filled values is", correct)
    else:
        mad = np.abs(complete[missing] - filled[missing]).mean()
        print("the mean absolute deviation is", mad)

In [617]:
fill_in_accuracy((1-r)*complete_data,(1-r)*fill_in_data, rmse=False)
fill_in_accuracy((1-r)*complete_data,(1-r)*fill_in_data, rmse=True)

The total number of missing values is 128
The number of accurate filled values is 50
the mean absolute deviation is 0.9375


### Future Steps

I had to hard-code for loops over the three predictors. I am working on how to use matrix multiplication instead but can't see a clear pattern yet. I am also not sure how to use $\beta$ to actually predict the values after the fitting is done. Do I call np.random.multinomial(1, $\beta$), then take the index of the entry that has the 1 success and fill in the missing value with it?
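On replacing the loops: the E step can be written with NumPy broadcasting rather than matrix multiplication, by lining up the axes (v, m, z, n) and letting singleton axes expand. The sketch below is one possible layout under the same array shapes as the loop version; the function name and the demo data are hypothetical.

```python
import numpy as np

def e_step_vectorized(y, r, mu, beta, theta):
    """Vectorized lambda, gamma and phi.

    y: (n, m) ratings in 1..V      r: (n, m) with 1 = observed, 0 = missing
    mu: (V,)   beta: (V, m, z)     theta: (z,)
    """
    n_values = beta.shape[0]
    values = np.arange(1, n_values + 1)[:, None, None]
    delta = (y.T[None, :, :] == values).astype(float)             # (V, m, n)
    r_t = r.T[None, :, None, :]                                   # (1, m, 1, n)
    mu_b = mu[:, None, None, None]                                # (V, 1, 1, 1)
    beta_b = beta[:, :, :, None]                                  # (V, m, z, 1)
    obs = delta[:, :, None, :] * mu_b * beta_b                    # term used when r = 1
    mis = (1.0 - mu_b) * beta_b                                   # term used when r = 0
    lam = np.where(r_t == 1, obs, mis)                            # (V, m, z, n)
    gamma = lam.sum(axis=0)                                       # (m, z, n)
    log_post = np.log(theta)[:, None] + np.log(gamma).sum(axis=0) # (z, n)
    phi = np.exp(log_post - np.logaddexp.reduce(log_post, axis=0))
    return lam, gamma, phi

# Small random problem to exercise the function (made-up data).
rng = np.random.default_rng(6)
n, m, V, z = 7, 3, 5, 2
y = rng.integers(1, V + 1, size=(n, m))
r = rng.integers(0, 2, size=(n, m)).astype(float)
mu = rng.uniform(0.2, 0.8, size=V)
beta = rng.dirichlet(np.ones(V), size=(z, m)).transpose(2, 1, 0)
theta = np.array([0.4, 0.6])

lam, gamma, phi = e_step_vectorized(y, r, mu, beta, theta)
print(lam.shape, gamma.shape, phi.shape)
```

The M-step sums can be handled the same way, for example by forming `lam / gamma` once and reducing over the appropriate axes.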

## Links

Description of Multinomial Mixture Models
http://web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring14-lecture3_scribed.pdf
https://www.cs.princeton.edu/courses/archive/spring12/cos424/pdf/em-mixtures.pdf

My specific Paper
http://www.cs.ubc.ca/~bmarlin/research/presentations/lnimd_group_talk.pdf
https://people.cs.umass.edu/~marlin/research/papers/aistat-lnimd.pdf

Application of my paper
http://ijcai.org/Proceedings/11/Papers/447.pdf
http://www.cs.toronto.edu/~zemel/documents/cfmar-uai2007.pdf
https://pdfs.semanticscholar.org/2845/eda7ce8de14e351d41182f92b73ece8873ef.pdf
https://people.cs.umass.edu/~marlin/research/thesis/cfmlp.pdf

In [515]:
beta

array([[[ 0.12978091],
        [ 0.24438813],
        [ 0.04934738]],

       [[ 0.34157998],
        [ 0.09005284],
        [ 0.05399028]],

       [[ 0.13551703],
        [ 0.11465503],
        [ 0.04878373]],

       [[ 0.14558859],
        [ 0.23233754],
        [ 0.79694954]],

       [[ 0.24753349],
        [ 0.31856647],
        [ 0.05092908]]])