# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 5)}}$

## $\color{purple}{\text{Advanced Imputation Techniques: Multivariate Imputation}}$

### $\color{purple}{\text{Colab Environment Setup}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [6]:
import pandas as pd
import numpy as np
from helpers import stat_comparison
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

from autoimpute.imputations import SingleImputer
from autoimpute.imputations import MultipleImputer
from autoimpute.imputations import MiceImputer

## $\color{purple}{\text{Regression Imputation}}$

General Technique:
Use Regression/Classification Models to Impute Numeric/Categorical Missing Values
* Linear Regression
* Stochastic Linear Regression
* Logistic Regression
* Other Possibilities, Generally Unexplored
  * Random Forest
  * Decision Trees
  * KNN

In [3]:
df = pd.read_csv('data/full_set.csv')
mcar_df = pd.read_csv('data/mcar_set.csv')
mcar_df
mar_df = pd.read_csv('data/mar_set.csv')
mar_df


Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348000,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.440590,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
...,...,...,...,...,...
19995,2.090202,4.966018,1.973792,1.470606,0.666985
19996,,4.593494,3.159423,1.212630,0.867025
19997,3.704028,4.852749,3.738618,1.153456,0.492664
19998,,5.034845,4.243867,1.640312,0.269926


Regression Imputation

In [10]:
linear_regressor = LinearRegression()

In [18]:
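# Predictors: every column except the one being imputed
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
# Fit on complete rows only, then predict 'feature a' for every row
full_data = mar_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
impute = linear_regressor.predict(mar_df[rest])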


In [19]:
# Keep observed values; use the regression prediction only where 'feature a' is missing
imputed = mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), impute)})

In [20]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.367217,2.367493,0.000276,0.011674
median,2.380412,2.384461,0.004049,0.170097
stdev,1.280482,1.278774,0.001708,0.133419
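
The `stat_comparison` helper comes from a local `helpers` module that is not shown in this lesson. Judging only from the table above, it appears to compare the mean, median, and standard deviation of a column before and after imputation, with the last column being the absolute difference as a percentage of the original. The sketch below is a hypothetical stand-in under that assumption; the name `stat_comparison_sketch`, the column labels, and the use of the sample standard deviation are illustrative choices, not the helper's actual code.

```python
import pandas as pd

def stat_comparison_sketch(original, imputed, column):
    """Hypothetical stand-in for helpers.stat_comparison (assumed behavior)."""
    stats = ['mean', 'median', 'std']
    orig = original[column].agg(stats)
    imp = imputed[column].agg(stats)
    diff = (orig - imp).abs()
    return pd.DataFrame({
        'Original': orig,
        'Imputed': imp,
        'difference': diff,
        # Difference expressed as a percentage of the original statistic
        'percentage': 100 * diff / orig.abs(),
    })

# Toy data, only to show the shape of the report
original = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
imputed = pd.DataFrame({'x': [1.0, 2.0, 3.0, 5.0]})
report = stat_comparison_sketch(original, imputed, 'x')
print(report)
```

Small values in the `difference` and `percentage` columns suggest the imputation preserved the column's overall distribution, although they say nothing about how accurate individual imputed values are.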


In [26]:
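# Residuals exist only where 'feature a' is observed (NaN elsewhere)
residual = mar_df['feature a'] - impute
residual.mean()
residual.std()  # only this last expression is displayed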

0.15480550226374812
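
The residual standard deviation printed above is the ingredient that stochastic linear regression adds to plain regression imputation: instead of placing every imputed value exactly on the regression line, it adds random noise with that spread, which helps keep the variance of the imputed column from shrinking. The sketch below illustrates the idea on synthetic data with NumPy only; the data, the seed, and the variable names are assumptions for illustration and do not reproduce the `mar_df` results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: y depends linearly on two predictors plus noise.
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Remove roughly 20% of y to play the role of the missing target values.
missing = rng.random(n) < 0.2
y_obs = np.where(missing, np.nan, y)

# Fit ordinary least squares on the complete rows only.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A[~missing], y_obs[~missing], rcond=None)
pred = A @ coef

# Residual spread on the observed rows (degrees of freedom: n_obs - n_params).
resid_std = np.std(y_obs[~missing] - pred[~missing], ddof=A.shape[1])

# Deterministic regression imputation: fill with the prediction itself.
deterministic = np.where(missing, pred, y_obs)

# Stochastic regression imputation: prediction plus noise with the residual spread.
noise = rng.normal(scale=resid_std, size=n)
stochastic = np.where(missing, pred + noise, y_obs)
```

Observed values are left untouched in both variants; only the filled-in entries differ.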

## $\color{purple}{\text{Hot Deck Imputation}}$

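The general idea is to fill each missing value with an observed value borrowed from a "donor" row, sampled at random from the remaining complete rows, usually the ones most similar to the row being imputed.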
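
In its simplest form, hot deck imputation ignores similarity altogether and draws each replacement at random from the observed values of the same column. The following is a minimal sketch of that baseline; the function name `random_hot_deck` and the toy series are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np
import pandas as pd

def random_hot_deck(series, rng):
    """Fill each missing entry with a randomly drawn observed value (a donor)."""
    filled = series.copy()
    mask = filled.isna()
    donors = filled[~mask].to_numpy()
    # Sample with replacement so the same donor can be reused
    filled[mask] = rng.choice(donors, size=int(mask.sum()), replace=True)
    return filled

rng = np.random.default_rng(0)
s = pd.Series([1.5, np.nan, 4.0, 0.1, np.nan, 2.6])
filled = random_hot_deck(s, rng)
print(filled)
```

The distance-based versions that follow refine this by restricting the donor pool to rows that resemble the row being imputed.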

In [57]:
demo_df = mar_df[0:10].copy()

In [58]:
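def distance(x):
    # Euclidean distance to row 7 (the row to impute); its NaN in 'feature a' is dropped from the difference
    return np.linalg.norm((x - demo_df.iloc[7]).dropna())

demo_df['distance'] = demo_df.apply(distance, axis=1)
demo_df = demo_df.dropna()  # drop row 7 itself, leaving only candidate donors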

In [101]:

imputer=SingleImputer('least squares')
imputations = imputer.fit_transform(mar_df)

In [105]:
from autoimpute.imputations import SingleImputer
imputer=SingleImputer('stochastic')
imputations = imputer.fit_transform(mar_df)

Hot Deck Imputation Version #1
Pick every point within a distance threshold

In [60]:
demo_df

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


In [62]:
threshold = 2
donors = demo_df[demo_df.distance < threshold]['feature a']
donors

1    2.536323
2    4.043034
5    2.560068
6    4.027199
8    2.743726
Name: feature a, dtype: float64

Hot Deck Imputation Version #2
Take closest point

In [68]:
donor = demo_df.sort_values('distance').iloc[0]['feature a']
donor

2.7437264614106294

Hot Deck Imputation Version #3
Pick N Closest points

In [71]:
N=3
donors = demo_df.sort_values('distance').iloc[0:N]['feature a']
donors

8    2.743726
5    2.560068
6    4.027199
Name: feature a, dtype: float64

In [77]:
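import random
# Pick one donor with probability inversely proportional to its distance.
# Convert to plain lists: random.choices indexes by position, but the Series
# index has a gap (row 7 was dropped), so passing the Series itself can
# raise a KeyError or return the wrong row.
random.choices(demo_df['feature a'].tolist(), k=1, weights=(1 / demo_df['distance']).tolist())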

[2.560068182587529]
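
The three versions above all impute a single row (row 7) by hand. A sketch of how the same idea might be applied to every missing value in a column is shown below. It combines version #3 (the N closest points) with the distance-weighted pick; the function name `knn_hot_deck`, its parameters, and the toy frame are assumptions for illustration, and it assumes the predictor columns themselves have no missing values.

```python
import numpy as np
import pandas as pd

def knn_hot_deck(df, target, n_donors=3, rng=None):
    """Impute `target` from one of the n_donors nearest complete rows."""
    if rng is None:
        rng = np.random.default_rng()
    predictors = [c for c in df.columns if c != target]
    complete = df.dropna(subset=[target])
    donor_X = complete[predictors].to_numpy(dtype=float)
    donor_y = complete[target].to_numpy(dtype=float)
    out = df.copy()
    for idx in df.index[df[target].isna()]:
        row = df.loc[idx, predictors].to_numpy(dtype=float)
        # Euclidean distance on the predictor columns only
        dist = np.linalg.norm(donor_X - row, axis=1)
        nearest = np.argsort(dist)[:n_donors]
        # Closer donors are more likely to be picked
        weights = 1.0 / (dist[nearest] + 1e-12)
        pick = rng.choice(nearest, p=weights / weights.sum())
        out.loc[idx, target] = donor_y[pick]
    return out

toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 10.0, 11.0],
    'b': [1.0, 2.0, 1.5, 10.0, 11.0],
})
result = knn_hot_deck(toy, 'a', n_donors=2, rng=np.random.default_rng(0))
print(result)
```

Because every imputed value is copied from a real row, the result always stays within the range of observed data, which is one of the practical attractions of hot deck methods.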

## $\color{purple}{\text{Predictive Mean Matching}}$
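Uses the prediction of a linear regression as the distance metric: donors are the observed rows whose predicted values are closest to the predicted value for the row being imputed, and the imputed value is taken from one of those donors' observed values.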


In [81]:
from sklearn.linear_model import LinearRegression
linear_regressor=LinearRegression()

In [85]:
demo_df = mar_df[0:10].copy()

In [89]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = demo_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
demo_df['regression'] = linear_regressor.predict(demo_df[rest])

In [94]:
demo_df['distance']=np.abs(demo_df.regression-demo_df.regression.iloc[7])

In [97]:
N=3
donors = demo_df.dropna().sort_values('distance').iloc[0:N]['feature a']
donors

8    2.743726
1    2.536323
5    2.560068
Name: feature a, dtype: float64
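
The cells above find donors for row 7 only. A sketch of the same matching step applied to every missing value is given below, using NumPy only and synthetic data. It is a simplified illustration: the function name `pmm_impute` and its parameters are assumptions, and fuller versions of predictive mean matching typically also add randomness to the regression coefficients, which this sketch omits.

```python
import numpy as np

def pmm_impute(X, y, n_donors=3, rng=None):
    """Simplified predictive mean matching for a single numeric target."""
    if rng is None:
        rng = np.random.default_rng()
    is_missing = np.isnan(y)
    # Linear regression fitted on the complete rows
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A[~is_missing], y[~is_missing], rcond=None)
    pred = A @ coef
    obs_pred = pred[~is_missing]
    obs_y = y[~is_missing]
    filled = y.copy()
    for i in np.flatnonzero(is_missing):
        # Donors: observed rows whose predictions are closest to this row's prediction
        nearest = np.argsort(np.abs(obs_pred - pred[i]))[:n_donors]
        # Impute with the observed (not predicted) value of a random donor
        filled[i] = obs_y[rng.choice(nearest)]
    return filled

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)
missing = rng.random(200) < 0.2
y_missing = np.where(missing, np.nan, y)
filled = pmm_impute(X, y_missing, n_donors=3, rng=rng)
```

The regression is used only to decide which rows are similar; the values written into the gaps are always real observed values.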

In [110]:
from autoimpute.imputations import SingleImputer
demo_df = mar_df[0:100].copy()
imputer=SingleImputer('pmm')
imputations = imputer.fit_transform(demo_df)


  return wrapped_(*args_, **kwargs_)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [σ, beta, alpha]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 10 seconds.
There were 98 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 41 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.8789954411303335, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 10% for some parameters.


In [112]:
imputations.head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.754742,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471



## $\color{purple}{\text{Categorical Variables}}$

Logistic Regression

## $\color{purple}{\text{Hot Deck Imputation}}$

## $\color{purple}{\text{Predictive Mean Matching}}$

## $\color{purple}{\text{Advanced Imputation Techniques: Multivariate Imputation by Chained Equations (MICE)}}$


## $\color{purple}{\text{Advanced Imputation Techniques: Multiple Imputation}}$

Hot Deck Imputation

Regression Imputation

Multiple Imputation

MICE

In [132]:
from autoimpute.imputations import MultipleImputer
imputer=MultipleImputer(strategy='stochastic')
imputations = imputer.fit_transform(mar_df)

In [133]:
lists=list(imputations)

In [137]:
lists[0][1].head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.953229,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471


In [138]:
lists[1][1].head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.97643,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471


In [141]:
[each[1].iloc[7]['feature a'] for each in lists]

[2.9532294619521022,
 2.9764296840936324,
 2.852605077672565,
 2.772903467827853,
 2.8978580843592163]

In [142]:
[each[1].iloc[6]['feature a'] for each in lists]

[4.027199122849932,
 4.027199122849932,
 4.027199122849932,
 4.027199122849932,
 4.027199122849932]
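
The two outputs above show the point of multiple imputation: the observed value in row 6 is identical in all five datasets, while the imputed value in row 7 varies, and that variation reflects the uncertainty of the imputation. A common way to use this is to run the same analysis on each completed dataset and pool the results with Rubin's rules. The sketch below shows the pooling step only; the function name `pool_rubin` and the toy numbers are illustrative assumptions, not results from `mar_df`.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one estimate across m imputed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    pooled = q.mean()                 # pooled point estimate
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Toy inputs: an estimate and its squared standard error from three imputations.
# With the notebook's `lists`, one might instead collect, for each (_, d) pair,
# d['feature a'].mean() and d['feature a'].var() / len(d).
pooled, total = pool_rubin([2.0, 2.2, 2.4], [0.1, 0.1, 0.1])
print(pooled, total)
```

The between-imputation term is what a single imputation cannot provide: it widens the pooled variance to account for not knowing the missing values.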