# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 5)}}$

## $\color{purple}{\text{Advanced Imputation Techniques}}$

### $\color{purple}{\text{Colab Environmental Setup}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [1]:
import pandas as pd
import numpy as np
from helpers import stat_comparison, spotlight_donors, ImputationDisplayer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from autoimpute.imputations import SingleImputer
from autoimpute.imputations import MultipleImputer
from autoimpute.imputations import MiceImputer

In [2]:
df = pd.read_csv('data/full_set.csv')
mar_df = pd.read_csv('data/mar_set.csv')
ImputationDisplayer(mar_df)

<helpers.ImputationDisplayer at 0x7f9014071d60>

## $\color{purple}{\text{Multivariate Imputation}}$
Conventional multivariate imputation falls into two categories:
* Regression Imputation
* Hot Deck Imputation

Another cutting-edge method worth mentioning:
* Neural Network Autoencoder

## $\color{purple}{\text{Regression Imputation}}$

General Technique:
Use Regression/Classification Models to impute Numeric/Categorical Missing Values
* Linear Regression
* Stochastic Linear Regression
* Logistic Regression
* Other Possibilities (generally unexplored)
  * Random Forest
  * Decision Trees
  * KNN

### $\color{purple}{\text{Linear Regression}}$

* Works with MAR
* Can impute illegal (out of bounds) values
* Can underestimate variance/covariance
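
The out-of-bounds point can be illustrated with a minimal sketch. The numbers below are made up for illustration (they are not from the lesson's datasets), and the target is assumed to be a quantity that can never be negative. Clipping is shown only as one crude remedy, not as a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical quantity that can never be negative (e.g. a length or a count)
x_obs = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_obs = np.array([0.5, 2.4, 4.6, 6.5, 8.6])

model = LinearRegression().fit(x_obs, y_obs)

# A row whose target is missing and whose predictor lies below the observed range
x_missing = np.array([[0.0]])
raw_imputation = model.predict(x_missing)[0]
print(raw_imputation < 0)  # True: the fitted line extrapolates below zero

# One possible (crude) remedy: clip to the legal range after imputing
clipped_imputation = np.clip(raw_imputation, 0.0, None)
```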

In [3]:
linear_regressor = LinearRegression()

#### Perform the linear regression

We base the prediction of `feature a` on the remaining features in `rest`. We only run the regression on data with full rows, `full_data`.

In [4]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
predicted = linear_regressor.predict(mar_df[rest])

#### A note about a code pattern

I will be repeating the following code pattern, or a variation of it:

```.assign(**{'feature a': df['feature a'].where(~df['feature a'].isnull(), predicted)})```

It keeps the existing value wherever one is present and substitutes the predicted value only where the value is missing. Depending on the use case, I'll either fill in a value where one is missing or put a NaN back where a value was originally missing (see the section on MICE below).

This is the same pattern as

```df['feature a'] = df['feature a'].where(~df['feature a'].isnull(), predicted)```

but it returns a new dataframe instead of modifying `df` in place, so the result can be passed directly to a function or used in a method chain.
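
To see the pattern in isolation, here is a small sketch on a toy dataframe. The frame `toy` and the array `toy_predicted` are hypothetical stand-ins for `mar_df` and a model's predictions.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in 'feature a' (made-up data)
toy = pd.DataFrame({'feature a': [1.0, np.nan, 3.0],
                    'feature b': [10.0, 20.0, 30.0]})

# Stand-in predictions, one per row, as a model's predict() would return
toy_predicted = np.array([100.0, 200.0, 300.0])

# Keep observed values; take the prediction only where the value is missing
filled = toy.assign(**{'feature a': toy['feature a'].where(~toy['feature a'].isnull(),
                                                          toy_predicted)})

print(filled['feature a'].tolist())     # [1.0, 200.0, 3.0]
print(toy['feature a'].isnull().sum())  # 1: the original frame is untouched
```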

In [5]:
imputed=mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), 
                        predicted)}
                     )

### $\color{purple}{\text{Analyze the Results}}$

In [6]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.411082,2.410061,0.001021,0.04236
median,2.420924,2.421312,0.000388,0.016017
stdev,1.279456,1.276724,0.002731,0.213476


In [7]:
mar_df.displayer(imputed, 15)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.955997,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.484183,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.118221,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


### $\color{purple}{\text{Stochastic Regression}}$
* Extends Linear Regression by adding noise modelled on the residuals
* Better simulates variance
* Can also produce out of bounds values

We rely on the linear regression prediction above and calculate the mean and standard deviation of its residuals.

In [8]:
residual = mar_df['feature a'] - predicted
residual.mean()
residual.std()

0.1564667846459442

For the prediction we model the residual noise as a normal distribution and adjust predictions accordingly.

In [9]:
# Draw one noise value per row instead of hard-coding the number of rows
residual_noise=np.random.normal(residual.mean(), residual.std(), len(predicted))
predicted+=residual_noise

In [10]:
imputed=mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), 
                        predicted)}
                     )

### $\color{purple}{\text{Analyze the Results}}$

In [11]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.411082,2.410422,0.00066,0.027379
median,2.420924,2.41992,0.001005,0.041504
stdev,1.279456,1.27825,0.001206,0.094247
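
The difference between the two regression variants is small on this dataset. It is easier to see on noisier data, so here is a sketch on synthetic data. The data-generating process, the 50% missing rate, and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=(n, 1))
y_full = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=n)  # noisy linear relation

missing = rng.random(n) < 0.5                 # hypothetical: half the targets are removed
model = LinearRegression().fit(x[~missing], y_full[~missing])
pred = model.predict(x)

# Residual statistics come from the observed rows only
resid = y_full[~missing] - pred[~missing]

# Deterministic regression imputation: every imputed value sits on the fitted line
y_det = np.where(missing, pred, y_full)

# Stochastic regression imputation: add normally distributed residual noise
noise = rng.normal(resid.mean(), resid.std(), size=n)
y_sto = np.where(missing, pred + noise, y_full)

std_full, std_det, std_sto = y_full.std(), y_det.std(), y_sto.std()
print(f"full: {std_full:.3f}  deterministic: {std_det:.3f}  stochastic: {std_sto:.3f}")
```

In this setup the deterministic version shrinks the standard deviation, while the stochastic version stays close to that of the complete data.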


In [12]:
mar_df.displayer(imputed, 15)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.470138,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.484183,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.381843,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


### $\color{purple}{\text{Built into}}$ `autoimpute`

In [13]:
imputer=SingleImputer('least squares')
ls_imputations = imputer.fit_transform(mar_df)

In [14]:
from autoimpute.imputations import SingleImputer
imputer=SingleImputer('stochastic')
st_imputations = imputer.fit_transform(mar_df)

#### $\color{purple}{\text{Analyze Results}}$

### $\color{purple}{\text{Just For Fun}}$
Let's use a Random Forest Regression instead

In [15]:
rf_regressor = RandomForestRegressor()
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
rf_regressor.fit(full_data[rest], full_data['feature a'])
predicted = rf_regressor.predict(mar_df[rest])

In [16]:
imputed=mar_df.assign(**{'feature a': mar_df['feature a'].where(~mar_df['feature a'].isnull(), 
                        predicted)}
                     )

#### $\color{purple}{\text{Analyze Results}}$

In [17]:
mar_df.displayer(imputed, 15)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.780304,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.484183,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.201221,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


In [18]:
stat_comparison(df, imputed, 'feature a')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,2.411082,2.410268,0.000815,0.033782
median,2.420924,2.424967,0.004043,0.166989
stdev,1.279456,1.275119,0.004337,0.338986



## $\color{purple}{\text{Categorical Variables}}$

Imputation of categorical variables employs classification in place of regression. The most common choice is multinomial logistic regression.

In [19]:
cat_mar_df = pd.read_csv('data/categorical_mar.csv')
ImputationDisplayer(cat_mar_df)

<helpers.ImputationDisplayer at 0x7f902c518df0>

In [20]:
# A little EDA
cat_mar_df.isnull().sum()

feature a         0
feature b         0
feature c         0
cat feature    4108
dtype: int64

In [21]:
from sklearn.linear_model import LogisticRegression

rest = ['feature a', 'feature b', 'feature c']
# Fit only on rows where the category is known
cleaned_df = cat_mar_df.dropna()
lr = LogisticRegression(random_state=0, max_iter=1000).fit(cleaned_df[rest], cleaned_df['cat feature'])

In [22]:
impute = lr.predict(cat_mar_df[rest])

In [23]:
imputed=cat_mar_df.assign(**{'cat feature': cat_mar_df['cat feature'].where(~cat_mar_df['cat feature'].isnull(), impute)})

In [24]:
cat_mar_df.displayer(imputed, 10)

Unnamed: 0,feature a,feature b,feature c,cat feature
0,-1.40778,2.469328,-0.530309,Cat A
1,15.752722,9.793495,9.382818,Cat B
2,-1.056187,1.978559,1.036313,Cat A
3,13.942551,9.976267,8.633593,Cat B
4,11.956025,7.87126,6.549692,Cat B
5,14.425958,12.004184,8.69437,Cat B
6,3.808749,3.162104,9.573234,Cat D
7,6.414938,8.919951,5.28372,Cat C
8,10.735025,11.764753,10.339851,Cat C
9,9.032399,7.574074,11.742634,Cat E


## $\color{purple}{\text{Hot Deck Imputation}}$
* General idea is to randomly sample imputed values from remaining good values.
* Doesn't impute out of bounds values

**How it works:**

* For each missing value, a set of donors is selected from good values
* A value is randomly selected from the set of donors
* Donors are selected based on some metric-based algorithm
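
The steps above can be sketched as a small function. This is a minimal illustration, not a library API. The function name `hot_deck_impute`, its parameters, and the toy data are hypothetical. It assumes Euclidean distance, a fixed number of nearest donors, and no missing values in the other columns.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(frame, target, n_donors=3, rng=None):
    """Fill missing `target` values by sampling from the nearest complete rows."""
    rng = np.random.default_rng() if rng is None else rng
    features = [c for c in frame.columns if c != target]
    complete = frame.dropna(subset=[target])            # rows that can donate
    donor_features = complete[features].to_numpy(dtype=float)
    donor_values = complete[target].to_numpy()
    result = frame.copy()
    for idx in frame.index[frame[target].isnull()]:
        row = frame.loc[idx, features].to_numpy(dtype=float)
        # Euclidean distance from the incomplete row to every complete row
        dist = np.linalg.norm(donor_features - row, axis=1)
        donor_pos = np.argsort(dist)[:n_donors]          # the closest n_donors rows
        result.loc[idx, target] = rng.choice(donor_values[donor_pos])
    return result

# Toy data (made up): row 2 is missing 'feature a'
toy = pd.DataFrame({'feature a': [1.0, 2.0, np.nan, 10.0],
                    'feature b': [1.0, 2.0, 2.1, 10.0],
                    'feature c': [0.0, 1.0, 1.0, 9.0]})
filled = hot_deck_impute(toy, 'feature a', n_donors=2, rng=np.random.default_rng(0))
print(filled)
```

Because the imputed value is always copied from an observed row, it cannot fall outside the range of the observed data.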

The `demo_mar.csv` dataset is the first 10 entries from one of my earlier runs. It has one missing value in `feature a`.

In [25]:
demo_df = pd.read_csv('data/demo_mar.csv')
demo_df

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,1.517509,4.229258,2.052726,0.153278,0.014975
1,2.536323,4.295391,2.104137,1.348,0.998701
2,4.043034,5.872276,3.559629,3.274061,0.403823
3,0.082752,3.761743,-0.44059,1.031832,0.281023
4,0.196684,3.793343,1.016462,-0.667764,0.165431
5,2.560068,4.446726,2.420763,0.973363,0.166179
6,4.027199,5.079975,4.582185,0.876607,0.420479
7,,5.339294,3.138633,1.611132,0.229141
8,2.743726,4.50633,2.62024,1.362915,0.011719
9,-0.180238,3.148906,0.280848,-0.741796,0.104471


We use Euclidean distance to demonstrate how Hot Deck Imputation works, but in practice the metric is usually more statistically based and complex. For simplicity we add a `distance` feature.

In [26]:
def distance(x):
    return np.linalg.norm((x-demo_df.iloc[7]).dropna())
    
demo_df['distance'] = demo_df.apply(distance, axis=1)

In [27]:
demo_df

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
7,,5.339294,3.138633,1.611132,0.229141,0.0
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


#### Donor Selection

[Van Buuren](https://stefvanbuuren.name/fimd/) identifies 4 methods of selecting donors

#### Method 1: (Single Donor)

Pick the sample closest to the missing value

In [28]:
donor = demo_df.dropna().nsmallest(1, 'distance')
spotlight_donors(demo_df,donor)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
7,,5.339294,3.138633,1.611132,0.229141,0.0
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


#### Method 2:

Donors selected from all points under a fixed threshold

In [29]:
threshold = 2
donors = demo_df.dropna()[demo_df.dropna().distance<threshold]['feature a']
spotlight_donors(demo_df, donors)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
7,,5.339294,3.138633,1.611132,0.229141,0.0
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


#### Method 3:

Closest N points selected as the set of donors

In [30]:
N=3
donors = demo_df.nsmallest(N+1, 'distance').tail(N)['feature a']
spotlight_donors(demo_df, donors)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,2.140696
1,2.536323,4.295391,2.104137,1.348,0.998701,1.679696
2,4.043034,5.872276,3.559629,3.274061,0.403823,1.804759
3,0.082752,3.761743,-0.44059,1.031832,0.281023,3.954464
4,0.196684,3.793343,1.016462,-0.667764,0.165431,3.477212
5,2.560068,4.446726,2.420763,0.973363,0.166179,1.312528
6,4.027199,5.079975,4.582185,0.876607,0.420479,1.651431
7,,5.339294,3.138633,1.611132,0.229141,0.0
8,2.743726,4.50633,2.62024,1.362915,0.011719,1.035106
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,4.303085


#### Method 4:

All points are donors, but the donor is selected randomly with a probability based on distance: the closest points have the highest probability.

In [31]:
import random
# Pick with probability inversely proportional to distance
weights = 1/demo_df.dropna()['distance']
random.choices(demo_df.dropna()['feature a'].to_list(), k=1, weights=weights.to_list())

[2.560068182587529]

## $\color{purple}{\text{Predictive Mean Matching}}$
Uses linear regression as part of the metric.

Donors are selected from those observations whose linear-regression predictions most closely match the prediction for the row with the missing value.


In [32]:
from sklearn.linear_model import LinearRegression
linear_regressor=LinearRegression()

In [33]:
demo_df = pd.read_csv('data/demo_mar.csv')

In [34]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = demo_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
demo_df['regression'] = linear_regressor.predict(demo_df[rest])

In [35]:
demo_df['distance']=np.abs(demo_df.regression-demo_df.regression.iloc[7])

In [36]:
N=3
donors = demo_df.dropna().sort_values('distance').iloc[0:N]['feature a']
spotlight_donors(demo_df, donors, 7)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated,regression,distance
0,1.517509,4.229258,2.052726,0.153278,0.014975,1.520445,1.320369
1,2.536323,4.295391,2.104137,1.348,0.998701,2.542722,0.298092
2,4.043034,5.872276,3.559629,3.274061,0.403823,4.117501,1.276687
3,0.082752,3.761743,-0.44059,1.031832,0.281023,0.030163,2.810651
4,0.196684,3.793343,1.016462,-0.667764,0.165431,0.202129,2.638685
5,2.560068,4.446726,2.420763,0.973363,0.166179,2.363152,0.477662
6,4.027199,5.079975,4.582185,0.876607,0.420479,4.023186,1.182373
7,,5.339294,3.138633,1.611132,0.229141,2.840814,0.0
8,2.743726,4.50633,2.62024,1.362915,0.011719,2.80172,0.039094
9,-0.180238,3.148906,0.280848,-0.741796,0.104471,-0.07396,2.914774
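
The last step, drawing the imputed value from the chosen donors, is not shown above. Below is a simplified end-to-end sketch. The function `pmm_impute`, its parameters, and the toy data are hypothetical. It uses a plain least-squares fit, whereas fuller formulations of predictive mean matching also account for uncertainty in the regression coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def pmm_impute(frame, target, n_donors=3, rng=None):
    """Simplified predictive mean matching for a single target column."""
    rng = np.random.default_rng() if rng is None else rng
    predictors = [c for c in frame.columns if c != target]
    observed = frame[target].notnull()
    model = LinearRegression().fit(frame.loc[observed, predictors],
                                   frame.loc[observed, target])
    # Regression prediction for every row, observed or not
    fitted = pd.Series(model.predict(frame[predictors]), index=frame.index)
    result = frame.copy()
    for idx in frame.index[~observed]:
        # Donors: observed rows whose prediction is closest to this row's prediction
        gap = (fitted[observed] - fitted.loc[idx]).abs()
        donors = gap.nsmallest(n_donors).index
        # Impute an actually observed value, drawn at random from the donors
        result.loc[idx, target] = rng.choice(frame.loc[donors, target].to_numpy())
    return result

# Toy data (made up): y is roughly 2*x, and row 4 is missing y
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': np.arange(10, dtype=float)})
toy['y'] = 2.0 * toy['x'] + rng.normal(scale=0.1, size=10)
toy.loc[4, 'y'] = np.nan
filled = pmm_impute(toy, 'y', rng=rng)
print(filled.loc[4, 'y'])
```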


Predictive Mean Matching is often a preferred imputation method, but it can be computationally expensive, so for this demo the dataset is truncated to 100 rows.

In [37]:
from autoimpute.imputations import SingleImputer
demo_df = mar_df[0:100].copy()
imputer=SingleImputer('pmm')
imputations = imputer.fit_transform(demo_df)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [σ, beta, alpha]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 13 seconds.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.


In [38]:
mar_df.displayer(imputations, 10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.36994,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.484183,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,2.488941,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


## $\color{purple}{\text{Advanced Imputation Techniques: Multivariate Imputation by Chained Equations (MICE)}}$
* Often considered the gold standard of imputation
* Is actually more of an imputation blueprint
* Applicable with missingness in multiple columns

In [39]:
dmcar_df = pd.read_csv('data/double_mcar_set.csv')
missing_df=pd.DataFrame({'feature a': dmcar_df['feature a'].isnull(),
                         'feature b': dmcar_df['feature b'].isnull()})
ImputationDisplayer(dmcar_df)

<helpers.ImputationDisplayer at 0x7f8fcf7511c0>

#### First step: Impute each missing value with some form of univariate imputation (usually mean or median)

In [41]:
step1_df=dmcar_df.fillna({'feature a': dmcar_df['feature a'].mean(), 
                         'feature b': dmcar_df['feature b'].median()
                        })
dmcar_df.displayer(step1_df, 20)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.801993,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.413902,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.443095,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


#### Second Step: For each column impute using a regression or hot deck technique
Start with `feature a` then `feature b`

##### Clear the missing values for the imputer then impute feature a

In [43]:
imputer=SingleImputer('least squares')
step2a_df=imputer.fit_transform(step1_df.assign(**{'feature a': step1_df['feature a'].where(~missing_df['feature a'], 
                                                  np.nan)
                                                  }
                                               )
                               )
dmcar_df.displayer(step2a_df, 10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.801993,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.648806,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.443095,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


In [46]:
imputer=SingleImputer('least squares')
step2_df=imputer.fit_transform(step2a_df.assign(**{'feature b': step2a_df['feature b'].where(~missing_df['feature b'], 
                                                  np.nan)
                                                  }
                                               )
                               )
dmcar_df.displayer(step2_df, 20)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.801993,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.648806,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.443095,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


In [None]:
stat_comparison(df, step2_df, 'feature b')

#### Repeat Step 2 until results converge sufficiently

In [47]:
imputer=SingleImputer('least squares')
step3a_df=imputer.fit_transform(step2_df.assign(**{'feature a': step2_df['feature a'].where(~missing_df['feature a'], np.nan)}))
step3_df=imputer.fit_transform(step3a_df.assign(**{'feature b': step3a_df['feature b'].where(~missing_df['feature b'], np.nan)}))
dmcar_df.displayer(step3_df, 20)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.801993,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.679381,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.443095,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


In [None]:
stat_comparison(df, step3_df, 'feature a')

stat_comparison(df, step3_df, 'feature b')
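
The repeat-until-convergence idea can also be written as a loop. The sketch below is a minimal illustration with plain `scikit-learn` regressions on synthetic data. The function `chained_imputation`, the tolerance, and the data are assumptions, not part of `autoimpute`. It uses deterministic regression at every step, whereas MICE implementations usually add a stochastic component.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def chained_imputation(frame, max_rounds=20, tol=1e-4):
    """Cycle regression imputation over all incomplete columns until values settle."""
    missing = frame.isnull()
    cols = [c for c in frame.columns if missing[c].any()]
    current = frame.fillna(frame.mean())          # step 1: univariate starting values
    for round_no in range(1, max_rounds + 1):
        previous = current.copy()
        for col in cols:                          # step 2: one regression per column
            others = [c for c in frame.columns if c != col]
            obs = ~missing[col]
            model = LinearRegression().fit(current.loc[obs, others],
                                           current.loc[obs, col])
            current.loc[missing[col], col] = model.predict(current.loc[missing[col], others])
        # step 3: stop once no imputed value moved by more than tol
        if (current - previous).abs().to_numpy().max() < tol:
            break
    return current, round_no

# Synthetic data (made up): 'a' and 'b' both depend on 'c' and both have holes
rng = np.random.default_rng(0)
n = 500
c = rng.normal(size=n)
truth = pd.DataFrame({'a': c + rng.normal(scale=0.1, size=n),
                      'b': -c + rng.normal(scale=0.1, size=n),
                      'c': c})
holes = truth.copy()
holes.loc[rng.random(n) < 0.2, 'a'] = np.nan
holes.loc[rng.random(n) < 0.2, 'b'] = np.nan

filled, rounds = chained_imputation(holes)
print(rounds, filled.isnull().sum().sum())
```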

### $\color{purple}{\text{MICE imputer in}}$ `autoimpute`

In [None]:
imputer=MiceImputer(n=1,k=5,strategy='least squares')

In [None]:
# MiceImputer returns multiple imputations (see next section); we unpack the first one by referencing [0][1]
imputed=list(imputer.fit_transform(dmcar_df))[0][1]
dmcar_df.displayer(imputed, 20)

## $\color{purple}{\text{Advanced Imputation Techniques: Multiple Imputation}}$

Many imputation techniques are stochastic in nature, meaning that if you run the imputation a second time, you will get slightly different imputed values for the missing values.

**Multiple Imputation** is a method that imputes the missing values repeatedly. The result is a collection of possible imputed values for each missing value.

With a collection of imputed values for each missing value, you can compute statistics on them and carry error margins and confidence intervals through your models.
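
One common way to carry that uncertainty through an analysis is to compute the same estimate on every imputed dataset and pool the results with Rubin's rules. This technique is not covered elsewhere in this lesson. The sketch below uses made-up estimates and variances, and the helper name `pool_rubin` is hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()                  # pooled point estimate
    within = variances.mean()                  # average within-imputation variance
    between = estimates.var(ddof=1)            # variance between the imputations
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, total

# Hypothetical example: the same statistic estimated on three imputed datasets
pooled, total = pool_rubin([2.0, 2.2, 1.8], [0.04, 0.04, 0.04])
print(pooled, np.sqrt(total))  # pooled estimate and its standard error
```

The between-imputation term is what a single imputation cannot provide: it measures how much the estimate changes from one plausible set of imputed values to the next.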



We will use `autoimpute`'s multiple imputer to demonstrate. By default it produces 5 imputations and returns them as a generator, which we unpack using `list`.

In [48]:
imputer=MultipleImputer(strategy='least squares')
imputations = imputer.fit_transform(mar_df)
lists=list(imputations) # Materialize the generator into a list

The result is a list of tuples. Each tuple is a pair of the imputation index (ordinal count) and the imputed dataframe.

In [None]:
lists

In [51]:
mar_df.displayer(lists[2][1], 10)

Unnamed: 0,feature a,feature b,feature c,feature d,uncorrelated
0,2.777245,2.234252,-1.552282,8.772158,0.360789
1,3.955997,2.223169,-0.645673,8.815675,0.393466
2,1.936893,2.182897,-2.474526,8.549983,0.891191
3,2.484183,1.618856,-1.404809,7.360904,0.156937
4,2.089727,2.698967,-1.464007,7.7861,0.046178
5,3.36994,2.842243,-0.716772,8.767852,0.34283
6,0.764542,1.532282,-3.850084,7.992571,0.678203
7,3.539507,2.120999,0.181066,7.158813,0.942345
8,3.118221,1.816557,-1.325628,8.144401,0.460626
9,0.445575,0.827475,-4.705004,8.161819,0.014623


$\color{red}{\Large{\text{ ⚠}}}$ The `least squares` option is deterministic. You will notice that all the imputations are the same.

In [52]:
[each[1].iloc[1]['feature a'] for each in lists]

[3.9559972510977346,
 3.9559972510977346,
 3.9559972510977346,
 3.9559972510977346,
 3.9559972510977346]

If we use the `stochastic` strategy, each missing value will have multiple, different imputed values.

In [57]:
imputer=MultipleImputer(strategy='stochastic')
imputations = imputer.fit_transform(mar_df)
lists=list(imputations) # Materialize the generator into a list

In [58]:
[each[1].iloc[1]['feature a'] for each in lists]

[4.092361508315227,
 3.646324504587021,
 3.660059879725157,
 3.942308790112922,
 3.9425143162704175]
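
With several draws per missing value, a simple per-cell summary is the mean and the spread of the draws. A short sketch using the five values printed above (copied by hand rather than read from `lists`, so that the block runs on its own):

```python
import numpy as np

# The five stochastic imputations of row 1, 'feature a', copied from the output above
draws = np.array([4.092361508315227, 3.646324504587021, 3.660059879725157,
                  3.942308790112922, 3.9425143162704175])

point_estimate = draws.mean()   # a single best guess for the missing cell
spread = draws.std(ddof=1)      # sample standard deviation across the imputations
print(f"{point_estimate:.3f} +/- {spread:.3f}")
```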