# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 5)}}$

## $\color{purple}{\text{Advanced Imputation Techniques}}$

### $\color{purple}{\text{Colab Environmental Setup}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [46]:
import pandas as pd
import numpy as np
from helpers import stat_comparison
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

from autoimpute.imputations import SingleImputer
from autoimpute.imputations import MultipleImputer
from autoimpute.imputations import MiceImputer

## $\color{purple}{\text{Multivariate Imputation}}$

## $\color{purple}{\text{Regression Imputation}}$

General Technique:
Use Regression/Classification Models to Imputer Numeric/Categorical Missing Values
* Linear Regression
* Stocastic Linear Regression
* Logistic Regression
* Other Possibilities Generally unexplored
  * Random Forest
  * Decision Trees
  * KNN

In [47]:
df = pd.read_csv('data/full_set.csv')
mcar_df = pd.read_csv('data/mcar_set.csv')
mcar_df
mar_df = pd.read_csv('data/mar_set.csv')
mar_df


Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348000,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.440590,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
...,...,...,...,...,...
19995,2.090202,4.966018,1.973792,1.470606,0.666985
19996,,4.593494,3.159423,1.212630,0.867025
19997,3.704028,4.852749,3.738618,1.153456,0.492664
19998,,5.034845,4.243867,1.640312,0.269926


Regression Imputation

### $\color{purple}{\text{Libear Regression}}$

In [48]:
linear_regressor = LinearRegression()

In [49]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
impute = linear_regressor.predict(mar_df[rest])


In [50]:
imputed=mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), impute)})

In [51]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.367217,2.367493,0.000276,0.011674
median,2.380412,2.384461,0.004049,0.170097
stdev,1.280482,1.278774,0.001708,0.133419


### $\color{purple}{\text{Stochastic Regression}}$

In [52]:
residual = mar_df['feature a'] - impute
residual.mean()
residual.std()

0.15480550226374812

In [54]:
residual_noise=np.random.normal(residual.mean(), residual.std(), 20000)
impute+=residual_noise

In [55]:
imputed=mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), impute)})

In [56]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.367217,2.366905,0.000312,0.013167
median,2.380412,2.380144,0.000268,0.011246
stdev,1.280482,1.280378,0.000104,0.008149


In [10]:
imputer=SingleImputer('least squares')
imputations = imputer.fit_transform(mar_df)

In [11]:
from autoimpute.imputations import SingleImputer
imputer=SingleImputer('stochastic')
imputations = imputer.fit_transform(mar_df)

## $\color{purple}{\text{Hot Deck Imputation}}$

General Idea is to randomly sample imputed values from remaining good values.

In [8]:
demo_df = mar_df[0:10].copy()

In [9]:
def distance(x):
    return np.linalg.norm((x-demo_df.iloc[7]).dropna())
    
demo_df['distance'] = demo_df.apply(distance, axis=1)
demo_df=demo_df.dropna() # Drop Our Row

Hot Deck Imputation Version #1
Pick closest point threshold

In [12]:
demo_df

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


In [13]:
threshold = 2
donors = demo_df[demo_df.distance<threshold]['feature a']

Hot Deck Imputation Version #2
Take closest point

In [15]:
donor = demo_df.sort_values('distance').iloc[0]['feature a']

Hot Deck Imputation Version #3
Pick N Closest points

In [None]:
N=3
donors = demo_df.sort_values('distance').iloc[0:N]['feature a']
donors

In [None]:
import random
#Pick with probability
random.choices(demo_df['feature a'], k=1, weights=1/demo_df['distance'])

## $\color{purple}{\text{Predictive Mean Matching}}$
Uses linear interpolation as part of the metric.


In [17]:
from sklearn.linear_model import LinearRegression
linear_regressor=LinearRegression()

In [18]:
demo_df = mar_df[0:10].copy()

In [19]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = demo_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
demo_df['regression'] = linear_regressor.predict(demo_df[rest])

In [20]:
demo_df['distance']=np.abs(demo_df.regression-demo_df.regression.iloc[7])

In [21]:
N=3
donors = demo_df.dropna().sort_values('distance').iloc[0:N]['feature a']
donors

8    2.743726
1    2.536323
5    2.560068
Name: feature a, dtype: float64

In [22]:
from autoimpute.imputations import SingleImputer
demo_df = mar_df[0:100].copy()
imputer=SingleImputer('pmm')
imputations = imputer.fit_transform(demo_df)


  return wrapped_(*args_, **kwargs_)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [σ, beta, alpha]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.
There were 53 divergences after tuning. Increase `target_accept` or reparameterize.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 178 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6683772367952452, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.8860422688146068, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.


In [23]:
imputations.head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.499411,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471



## $\color{purple}{\text{Categorical Variables}}$

Logistic Regression

In [24]:
cat_mar_df = pd.read_csv('data/categorical_mar.csv')

In [25]:
from sklearn.linear_model import LogisticRegression
rest = ['feature a', 'feature b', 'feature c']
from sklearn.preprocessing import LabelEncoder
df = cat_mar_df.dropna()[0:100]
lr = LogisticRegression(random_state=0, max_iter=1000).fit(df[rest], df['cat feature'])

In [26]:
impute = lr.predict(cat_mar_df[rest])

In [27]:
imputed=cat_mar_df.assign(**{'cat feature': cat_mar_df['cat feature'].where(~cat_mar_df['cat feature'].isnull(), impute)})

## $\color{purple}{\text{Hot Deck Imputation}}$

## $\color{purple}{\text{Predictive Mean Matching}}$

## $\color{purple}{\text{Advanced Imputation Techniques: multivariate imputation by chained equations (MICE)}}$

In [29]:
dmar_df = pd.read_csv('data/double_mar_set.csv')
missing_df=pd.DataFrame({'feature a': dmar_df['feature a'].isnull(),
                         'feature b': dmar_df['feature b'].isnull()})

In [30]:
Step 0: Univariate Imputation

SyntaxError: invalid syntax (4160535758.py, line 1)

In [31]:
step0_df=dmar_df.fillna({'feature a': dmar_df['feature a'].mean(), 'feature b': dmar_df['feature b'].median()})
step0_df

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,-2.827918,-3.926024,-0.171649,1.599414,0.808661
1,0.827494,-2.485444,2.251578,3.886670,0.194692
2,0.194008,-2.577198,2.882425,3.887535,0.886731
3,-1.907789,-3.792876,0.256099,2.430528,0.493951
4,-0.937901,-3.825973,1.862588,1.099501,0.394353
...,...,...,...,...,...
19995,-1.316783,-1.853379,0.632680,4.479027,0.334140
19996,1.547273,-2.070106,4.070505,3.239217,0.125296
19997,-0.552074,-2.135666,2.101401,3.162508,0.090778
19998,-0.089824,-2.312533,2.102443,3.203033,0.139082


In [63]:
imputer=SingleImputer('least squares')

In [64]:
step0_df['feature a'] = np.where(missing_df['feature a'], np.nan, step0_df['feature a'])

In [65]:
step0_df['feature a']

0       -2.827918
1        0.827494
2             NaN
3       -1.907789
4       -0.937901
           ...   
19995   -1.316783
19996    1.547273
19997   -0.552074
19998   -0.089824
19999    1.240722
Name: feature a, Length: 20000, dtype: float64

In [69]:
imputer=SingleImputer('least squares')
step1a_df=imputer.fit_transform(step0_df.assign(**{'feature a': step0_df['feature a'].where(~step0_df['feature a'].isnull(), np.nan)}))

In [70]:
imputer=SingleImputer('least squares')
step1_df=imputer.fit_transform(step1a_df.assign(**{'feature a': step1a_df['feature a'].where(~step1a_df['feature a'].isnull(), np.nan)}))

In [74]:
stat_comparison(df, step1_df, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,5.867453,0.430405,5.437048,92.664539
median,5.360496,0.440276,4.92022,91.786653
stdev,5.139958,1.281785,3.858173,75.062348


In [33]:
imputer=MiceImputer(strategy='least squares')

In [36]:
[each for each in imputer.fit_transform(dmar_df)]

[(1,
         feature a  feature b  feature c  feature d  uncorrelated
  0      -2.827918  -3.926024  -0.171649   1.599414      0.808661
  1       0.827494  -2.804147   2.251578   3.886670      0.194692
  2       1.132261  -2.577198   2.882425   3.887535      0.886731
  3      -1.907789  -3.792876   0.256099   2.430528      0.493951
  4      -0.937901  -3.825973   1.862588   1.099501      0.394353
  ...          ...        ...        ...        ...           ...
  19995  -1.316783  -1.853379   0.632680   4.479027      0.334140
  19996   1.547273  -2.070106   4.070505   3.239217      0.125296
  19997  -0.552074  -2.135666   2.101401   3.162508      0.090778
  19998  -0.089824  -2.312533   2.102443   3.203033      0.139082
  19999   1.240722  -2.072572   3.435945   3.754910      0.778034
  
  [20000 rows x 5 columns]),
 (2,
         feature a  feature b  feature c  feature d  uncorrelated
  0      -2.827918  -3.926024  -0.171649   1.599414      0.808661
  1       0.827494  -2.804147   2.

## $\color{purple}{\text{Advanced Imputation Techniques: Multiple Imputation}}$

Hot Desk Imputation

Regression Imputation

Multiple Imputation

MICE

In [132]:
from autoimpute.imputations import MultipleImputer
imputer=MultipleImputer(strategy='stochastic')
imputations = imputer.fit_transform(mar_df)

In [133]:
lists=list(imputations)

In [137]:
lists[0][1].head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.953229,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471


In [138]:
lists[1][1].head(10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,2.97643,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471


In [141]:
[each[1].iloc[7]['feature a'] for each in lists]

[2.9532294619521022,
 2.9764296840936324,
 2.852605077672565,
 2.772903467827853,
 2.8978580843592163]

In [142]:
[each[1].iloc[6]['feature a'] for each in lists]

[4.027199122849932,
 4.027199122849932,
 4.027199122849932,
 4.027199122849932,
 4.027199122849932]