# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 7)}}$

## $\color{purple}{\text{Imputation Application to Anonymization}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
from helpers import clobber, stat_comparison
import pandas as pd
from autoimpute.imputations import SingleImputer

### $\color{purple}{\text{Anonymizing Data}}$


![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/clobber.svg)

One application of imputation is data anonymization. The basic idea is to mark a small amount of data as missing and then impute it. The process is repeated until enough identifiable data has been removed and replaced to render the data set anonymized.

Since we've already established that MCAR missingness is the easiest to deal with, we clobber the data using an MCAR mechanism. The advantage here is that we control the way missingness occurs.
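
The `clobber` helper is not shown in this lesson, so here is a minimal sketch of what an MCAR clobbering step could look like. The function name `mcar_clobber`, its `seed` parameter, and the toy data frame are all assumptions made for illustration; the actual helper may differ.

```python
import numpy as np
import pandas as pd


def mcar_clobber(df, column, rate, seed=None):
    """Return a copy of df with a `rate` fraction of `column` set to NaN.

    Hypothetical stand-in for the tutorial's `clobber` helper.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_missing = int(round(rate * len(out)))
    # Pick row positions uniformly at random, ignoring every value in the
    # data set. That independence is what makes the mechanism MCAR.
    positions = rng.choice(len(out), size=n_missing, replace=False)
    # Cast to float so the column can hold NaN.
    out[column] = out[column].astype(float)
    out.iloc[positions, out.columns.get_loc(column)] = np.nan
    return out


# Toy data, not the wine data set.
toy = pd.DataFrame({
    "fixed acidity": np.linspace(4.0, 13.9, 100),
    "quality": np.arange(100) % 7 + 3,
})
toy_clobbered = mcar_clobber(toy, "fixed acidity", 0.1, seed=0)
missing_count = int(toy_clobbered["fixed acidity"].isna().sum())
print(missing_count)
```

Because the rows are chosen without looking at any column, the probability of a value going missing is the same for every row.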

We'll merge the training and test data from the Wine Quality Data Set used in the previous lesson and, to avoid confusion, drop the categorical variable.

In [None]:
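wine_files = ['data/original_wine_training.csv', 'data/original_wine_test.csv']
# drop=True discards the old row labels instead of adding them as an 'index' column
wine_quality = pd.concat([pd.read_csv(f) for f in wine_files]).drop(columns='type').reset_index(drop=True)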

For speed and simplicity, we'll demonstrate with the `stochastic` imputer.

In [None]:
imputer = SingleImputer('stochastic')
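
In its textbook form, stochastic regression imputation fills each missing value with a linear-regression prediction plus a random draw from the residual distribution. The sketch below illustrates that idea from scratch with NumPy. It is not autoimpute's implementation, and whether the library's `stochastic` strategy matches it in detail is an assumption. The function name `stochastic_impute` and the demo data are hypothetical, and the sketch assumes the predictor columns are complete.

```python
import numpy as np
import pandas as pd


def stochastic_impute(df, target, seed=None):
    """Fill NaNs in `target` with a linear-regression prediction plus noise.

    Illustrative only; assumes all other columns are numeric and complete.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    missing = out[target].isna().to_numpy()
    predictors = out.drop(columns=target).to_numpy(dtype=float)
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(out)), predictors])
    y = out[target].to_numpy(dtype=float)
    # Fit ordinary least squares on the observed rows only.
    coef, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)
    # Spread of the residuals sets the scale of the added noise.
    residual_sd = np.std(y[~missing] - X[~missing] @ coef)
    noise = rng.normal(0.0, residual_sd, size=int(missing.sum()))
    out.loc[missing, target] = X[missing] @ coef + noise
    return out


# Demo on synthetic data: y depends linearly on x, and 20 values of y are removed.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
demo = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.5, size=200)})
demo.loc[demo.index[:20], "y"] = np.nan
filled = stochastic_impute(demo, "y", seed=2)
```

The added noise is the point of the method: a plain regression prediction would place every imputed value exactly on the fitted line and understate the variance of the column.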

The basic step is to clobber a single column using an MCAR mechanism.

In [None]:
clobbered = clobber(wine_quality, 'fixed acidity', 0.1)
clobbered

Then we impute the clobbered dataset.

In [None]:
imputer.fit_transform(clobbered)

We'll iterate over all the feature columns, clobbering each at a 10% missing rate and imputing it, then repeat the whole pass 10 times.

In [None]:
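anonymized = wine_quality
for _ in range(10):  # repeat the full pass 10 times
    # every column except the last one is treated as a feature
    for column in wine_quality.columns[:-1]:
        # clobber 10% of this column, then impute the gaps right away
        anonymized = imputer.fit_transform(clobber(anonymized, column, 0.1))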