# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 7)}}$

## $\color{purple}{\text{Imputation Application to Anonymization}}$

### $\color{purple}{\text{Colab Environment Setup}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [5]:
from helpers import clobber, stat_comparison
import pandas as pd
from autoimpute.imputations import SingleImputer

### $\color{purple}{\text{Anonymizing Data}}$


![x](images/clobber.svg)

One application of imputation is anonymizing data. The basic idea is to delete a small amount of data at random and then impute the deleted values. The process is repeated until enough identifiable data has been removed and replaced to render the data set anonymized.

We've already established that MCAR (missing completely at random) missingness is the easiest to deal with, so we clobber the data using an MCAR mechanism. The advantage here is that we control how the missingness occurs.
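
The `clobber` helper comes from `helpers` and its source is not shown in this lesson. As a rough idea of what an MCAR clobbering step could look like, here is a minimal sketch. The function name `clobber_mcar`, its `seed` parameter, and the demo frame are all hypothetical, and the real helper may work differently.

```python
import numpy as np
import pandas as pd


def clobber_mcar(df, column, fraction, seed=None):
    """Return a copy of df with about `fraction` of `column` set to NaN.

    Hypothetical stand-in for helpers.clobber; the real helper may differ.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Every row has the same chance of being picked, regardless of any
    # observed or unobserved value. That is what makes the mechanism MCAR.
    n_missing = int(round(fraction * len(out)))
    rows = rng.choice(len(out), size=n_missing, replace=False)
    out[column] = out[column].astype(float)  # make sure NaN can be stored
    out.iloc[rows, out.columns.get_loc(column)] = np.nan
    return out


demo = pd.DataFrame({"a": np.arange(100.0), "b": np.arange(100.0) * 2})
clobbered_demo = clobber_mcar(demo, "a", 0.1, seed=0)
print(clobbered_demo["a"].isna().sum())  # 10 of the 100 rows are now missing
```

Because the rows are drawn uniformly at random, the missingness carries no information about the values themselves, which is the property the imputer benefits from.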

We'll merge the training and test files of the Wine Quality Data Set from the previous lesson and, to avoid confusion, drop the categorical variable `type`.

In [21]:
wine_quality = pd.concat([pd.read_csv('data/original_wine_training.csv'), pd.read_csv('data/original_wine_test.csv')]).drop(columns='type').reset_index()

For speed and simplicity, we'll demonstrate with the `stochastic` imputer.

In [29]:
imputer = SingleImputer('stochastic')

The basic step is to clobber a single column using an MCAR mechanism. Here we remove 10% of the `fixed acidity` values.

In [22]:
clobbered = clobber(wine_quality, 'fixed acidity', 0.1)
clobbered

Unnamed: 0,index,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,8.0,0.50,0.39,2.60,0.082,12.0,46.0,0.99850,3.43,0.62,10.7,6
1,1,6.6,0.28,0.28,8.50,0.052,55.0,211.0,0.99620,3.09,0.55,8.9,6
2,2,,0.19,0.23,5.70,0.123,27.0,104.0,0.99540,3.04,0.54,9.4,6
3,3,,0.20,0.37,16.95,0.048,43.0,190.0,0.99950,3.03,0.42,9.2,6
4,4,7.8,0.28,0.34,1.60,0.028,32.0,118.0,0.99010,3.00,0.38,12.1,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,995,6.3,0.17,0.32,4.20,0.040,37.0,117.0,0.99182,3.24,0.43,11.3,6
5996,996,7.7,0.30,0.42,14.30,0.045,45.0,213.0,0.99910,3.18,0.63,9.2,5
5997,997,6.2,0.20,0.33,5.40,0.028,21.0,75.0,0.99012,3.36,0.41,13.5,7
5998,998,5.4,0.42,0.27,2.00,0.092,23.0,55.0,0.99471,3.78,0.64,12.3,7


Then we impute the clobbered data set.

In [31]:
imputer.fit_transform(clobbered)

Unnamed: 0,index,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,8.000000,0.50,0.39,2.60,0.082,12.0,46.0,0.99850,3.43,0.62,10.7,6
1,1,6.600000,0.28,0.28,8.50,0.052,55.0,211.0,0.99620,3.09,0.55,8.9,6
2,2,7.712452,0.19,0.23,5.70,0.123,27.0,104.0,0.99540,3.04,0.54,9.4,6
3,3,7.216238,0.20,0.37,16.95,0.048,43.0,190.0,0.99950,3.03,0.42,9.2,6
4,4,7.800000,0.28,0.34,1.60,0.028,32.0,118.0,0.99010,3.00,0.38,12.1,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,995,6.300000,0.17,0.32,4.20,0.040,37.0,117.0,0.99182,3.24,0.43,11.3,6
5996,996,7.700000,0.30,0.42,14.30,0.045,45.0,213.0,0.99910,3.18,0.63,9.2,5
5997,997,6.200000,0.20,0.33,5.40,0.028,21.0,75.0,0.99012,3.36,0.41,13.5,7
5998,998,5.400000,0.42,0.27,2.00,0.092,23.0,55.0,0.99471,3.78,0.64,12.3,7


We'll iterate over all the feature columns, clobbering each one at a 10% missing rate and imputing it, and then repeat the whole pass 10 times.

In [32]:
anonymized = wine_quality
for _ in range(10):
    # Skip the leftover `index` column (first) and the `quality` target (last).
    for column in wine_quality.columns[1:-1]:
        anonymized = imputer.fit_transform(clobber(anonymized, column, 0.1))
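
It helps to estimate how much of a column has been replaced after the loop finishes. Assuming each pass picks 10% of the rows uniformly and independently of earlier passes (an assumption about `clobber`, not something verified here), a given cell escapes one pass with probability 0.9 and escapes all ten with probability 0.9^10. The sketch below computes that closed form and checks it with a small simulation on hypothetical row positions; it does not touch the wine data.

```python
import random

fraction, passes, n_rows = 0.1, 10, 10_000

# Closed form: a cell is untouched only if every pass skips it.
expected_replaced = 1 - (1 - fraction) ** passes

# Simulation: record which row positions were ever selected for clobbering.
rng = random.Random(0)
touched = set()
for _ in range(passes):
    touched.update(rng.sample(range(n_rows), round(fraction * n_rows)))
simulated_replaced = len(touched) / n_rows

print(f"expected:  {expected_replaced:.3f}")  # expected:  0.651
print(f"simulated: {simulated_replaced:.3f}")
```

Under these assumptions, roughly two thirds of the values in each column have been imputed at least once after ten passes. More passes or a higher missing rate push that share closer to 100%.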

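To check how much the anonymization distorts the data, we compare summary statistics of `residual sugar` before and after.

In [28]: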
stat_comparison(wine_quality, anonymized, 'residual sugar')

Unnamed: 0,Original,With Missing Data,difference,percentage
mean,5.435792,5.402992,0.0328,0.603399
median,3.0,4.2,1.2,40.0
stdev,4.753589,4.831698,0.078109,1.643155
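
The source of `stat_comparison` is not shown either. The sketch below is one way such a comparison could be computed, assuming `difference` is the absolute gap between the two statistics and `percentage` is that gap relative to the original value. This matches the median row above (1.2 / 3.0 = 40%). The function name `compare_stats`, the `Altered` column label, and the toy series are hypothetical.

```python
import pandas as pd


def compare_stats(original, altered):
    """Compare mean, median, and standard deviation of two numeric Series.

    Hypothetical stand-in for helpers.stat_comparison; the real helper may differ.
    """
    table = pd.DataFrame(
        {
            "Original": [original.mean(), original.median(), original.std()],
            "Altered": [altered.mean(), altered.median(), altered.std()],
        },
        index=["mean", "median", "stdev"],
    )
    # Absolute gap, then that gap as a percentage of the original statistic.
    table["difference"] = (table["Original"] - table["Altered"]).abs()
    table["percentage"] = 100 * table["difference"] / table["Original"].abs()
    return table


before = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
after = pd.Series([1.0, 2.5, 3.0, 4.5, 10.0])
result = compare_stats(before, after)
print(result)
```

A small `percentage` for all three rows suggests the anonymized column keeps roughly the same center and spread as the original, while a large value flags a statistic that the clobber-and-impute cycle has shifted.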
