# Imputation

## Table of Contents

1. Imputation of missing values
2. Imputing within a pipeline

### 1. Imputation of missing values

Imputation makes an educated guess about the missing values from the known part of the data, for example, using the mean of the non-missing entries.
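As a minimal sketch of the idea, mean imputation can be done by hand with NumPy alone. The array `A` matches the sample data used later in this section; the names `col_means` and `A_filled` are illustrative only.

```python
import numpy as np

# Sample data with one missing entry in the first column.
A = np.array([[1, 2], [np.nan, 3], [7, 6]], dtype=float)

# Column means computed from the non-missing entries only.
col_means = np.nanmean(A, axis=0)

# Replace each NaN with the mean of its column (col_means broadcasts over rows).
A_filled = np.where(np.isnan(A), col_means, A)
print(A_filled)
```

The missing entry in the first column is filled with 4.0, the mean of 1 and 7.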

Load libraries.

In [6]:
from sklearn.preprocessing import Imputer
import numpy as np

Create an instance `imp` of the `Imputer` class. The keyword argument `missing_values='NaN'` specifies that missing values are represented by `NaN`; `strategy='mean'` together with `axis=0` specifies that each missing value is replaced by the mean of its column.

In [8]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)                       

Fit this imputer to the sample data `A` using the `fit` method. The mean of each column will be stored in the object `imp`.

In [10]:
A = [[1, 2], [np.nan, 3], [7, 6]]
imp.fit(A) 
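The `Imputer` class was deprecated in scikit-learn 0.20 and removed in 0.22. On newer releases, a roughly equivalent sketch uses `sklearn.impute.SimpleImputer`, which takes `np.nan` rather than the string `'NaN'` and has no `axis` argument (it always works per column). The name `imp_new` is chosen here for illustration; the fitted column means are exposed through the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

A = [[1, 2], [np.nan, 3], [7, 6]]

# Column-wise mean imputation; NaN marks the missing entries.
imp_new = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_new.fit(A)

# One stored statistic per column, learned from the non-missing values of A.
print(imp_new.statistics_)
```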

Create a similar dataset `B` with missing values. Transform the dataset using the fitted imputer and return a completed version of `B`.

In [12]:
B = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(B))   

Compare the values that replaced the `NaN` entries in the output above with the column means of `A`, computed while ignoring `NaN`.

In [14]:
np.nanmean(A, axis=0)

The `strategy` argument selects the imputation strategy:
- If `"mean"`, replace missing values with the mean along the axis.
- If `"median"`, replace missing values with the median along the axis.
- If `"most_frequent"`, replace missing values with the most frequent value along the axis.

The `axis` argument selects the axis along which to impute:
- If `axis=0`, impute along columns.
- If `axis=1`, impute along rows.
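The three strategies can be compared on a single column whose mean, median, and mode all differ. This is a sketch under the assumption that a recent scikit-learn is installed, so it uses `SimpleImputer` (which imputes per column only); the data and the `filled` dictionary are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column: known values 1, 1, 2, 6 and one missing entry at the end.
X = np.array([[1.0], [1.0], [2.0], [6.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that replaced the NaN in the last row.
    filled[strategy] = float(imp.fit_transform(X)[-1, 0])
    print(strategy, filled[strategy])
```

For this column the mean is 2.5, the median is 1.5, and the most frequent value is 1.0.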

### 2. Imputing within a pipeline

This example shows imputing the missing values in the pima-indians-diabetes dataset and using a k-Nearest-Neighbors estimator to predict if a person has diabetes (or not).

The pima-indians-diabetes dataset:
- https://www.kaggle.com/uciml/pima-indians-diabetes-database

Load libraries.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error

Load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

In [22]:
dataset = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/pima-indians-diabetes.csv', header=None, skiprows=9)
print(dataset.describe())

We can see that there are columns that have a minimum value of zero (`0`). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Mark zero values as `NaN` by calling the Pandas `replace()` method on the subset of columns where a zero is not a valid measurement.

In [25]:
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

As a cross-check, use the `isnull()` method to flag every `NaN` value in the dataset as `True`, then sum the flags to count the missing values in each column.

In [27]:
print(dataset.isnull().sum())
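The zero-to-`NaN` step and the cross-check can be tried on a small hypothetical frame when the CSV file is not at hand. The values and the choice of columns below are invented for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: zeros in columns 1 and 2 stand for missing measurements,
# while the zero in column 0 is a legitimate value.
df = pd.DataFrame({0: [1, 0, 3], 1: [0, 85, 0], 2: [72, 0, 64]})

cols = [1, 2]
df[cols] = df[cols].replace(0, np.nan)

# Count of NaN entries per column; column 0 is left untouched.
missing = df.isnull().sum()
print(missing)
```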

This dataset has 9 columns. The 9th column holds the labels: 1 for diabetes, 0 for no diabetes. Split the dataset into features `X` and labels `y`, using the `squeeze` method to flatten the single-column label frame into the vector `y`.

In [29]:
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1:].squeeze()
print(X.shape, y.shape)
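To see why `squeeze` is needed, note that slicing with `-1:` keeps a two-dimensional frame. The tiny frame below is hypothetical and only illustrates the shapes involved.

```python
import pandas as pd

# Hypothetical frame: two feature columns and one label column.
df = pd.DataFrame([[1, 2, 0], [3, 4, 1], [5, 6, 0]])

y_frame = df.iloc[:, -1:]      # slice -> DataFrame with a single column
y = y_frame.squeeze()          # single column collapses to a Series
y_direct = df.iloc[:, -1]      # integer position -> Series directly

print(y_frame.shape, y.shape, y_direct.shape)
```

Selecting the last column by integer position, as in `y_direct`, yields the same Series without the extra call.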

Split the `X` and `y` into train and test using the `train_test_split` function.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

Use a k-nearest neighbors classifier as part of a pipeline that includes imputing.

In [33]:
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
         ('knn', KNeighborsClassifier(n_neighbors=3))]

pipeline = Pipeline(steps)

Fit the pipeline using `X_train` as training data and `y_train` as target values. The `fit` method returns the fitted pipeline itself, which is assigned to the name `knn_imputing`.

In [35]:
knn_imputing = pipeline.fit(X_train, y_train)
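A self-contained sketch of the same pattern on synthetic data follows. It assumes a recent scikit-learn, so `SimpleImputer` stands in for the removed `Imputer`; the random data, the manual split, and names such as `pipe` and `train_stats` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data: the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan   # knock out roughly 10% of the entries

X_train, X_test = X[:45], X[45:]
y_train, y_test = y[:45], y[45:]

pipe = Pipeline([
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# fit() learns the fill values from the training rows only and returns the pipeline.
fitted = pipe.fit(X_train, y_train)

# predict() reuses those training-set fill values on the test rows.
pred = pipe.predict(X_test)
train_stats = pipe.named_steps["imputation"].statistics_
print(fitted is pipe, pred.shape, train_stats)
```

Keeping the imputer inside the pipeline means the fill values are never computed from the test rows.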

Compute and print metrics.

In [37]:
print('Prediction Error with Imputing: {}'.format(mean_squared_error(y_test, knn_imputing.predict(X_test))))

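Because the labels are 0 or 1, the mean squared error printed above equals the fraction of misclassified test samples. The output therefore shows that the kNN-based model predicts whether a person has diabetes with ~70% accuracy, given the information available in the Pima Indians Diabetes dataset.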
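The link between mean squared error on 0/1 labels and accuracy can be checked on a few made-up predictions. The label vectors below are hypothetical; `accuracy_score` is the scikit-learn metric that reports accuracy directly.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical true labels and predictions: 2 of 5 are wrong.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]

mse = mean_squared_error(y_true, y_pred)   # each error contributes (1 - 0)^2 = 1
acc = accuracy_score(y_true, y_pred)

print(mse, acc, 1 - acc)
```

For a classifier, reporting `accuracy_score` directly is arguably clearer than deriving accuracy from the error.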