# Data Preprocessing with scikit-learn - Missing Values

In this tutorial I illustrate how to preprocess data using [scikit-learn](https://scikit-learn.org/stable/), a Python library for machine learning.

Data preprocessing transforms data into a format which is more suitable for estimators. Data preprocessing involves the following operations:
* dealing with missing values
* normalization
* standardization
* formatting
* binning

In my previous articles I illustrated how to deal with [missing values](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-1-missing-data-45e76b781993), [normalization](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-3-normalisation-5b5392d27673), [standardization](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-4-standardization-ccd5b1608f1c), [formatting](https://towardsdatascience.com/data-processing-with-python-pandas-part-2-data-formatting-710c2eafa426) and [binning](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-5-binning-c5bd5fd1b950) with Python `pandas`. In this tutorial I show you how to deal with missing values with `scikit-learn`. I will cover the other preprocessing techniques in `scikit-learn` in separate posts.

All the `scikit-learn` operations described in this tutorial follow the same steps:
* select a preprocessing methodology
* fit it through the `fit()` function
* apply it to data through the `transform()` function.
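
The three steps above can be sketched on a tiny hand-made array. This is only an illustration: the array `X_toy` and its values are assumptions made for the example and are not part of the dataset used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one feature, four samples, one missing value.
X_toy = np.array([[1.0], [np.nan], [3.0], [5.0]])

# 1. select a preprocessing methodology
imputer = SimpleImputer(strategy="mean")

# 2. fit it: the imputer learns the column mean from the observed values
imputer.fit(X_toy)

# 3. apply it: missing entries are replaced by the learned mean
X_filled = imputer.transform(X_toy)
print(X_filled.ravel())  # [1. 3. 3. 5.]
```

When the same data is used for both steps, `fit_transform()` performs the fit and the transformation in a single call.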

The `scikit-learn` classes used here operate on array-like input, so in this tutorial each dataframe column is converted to an array before every operation. This can be achieved through the `numpy.array()` function, which receives the dataframe column as input. In addition, the `fit()` function expects a two-dimensional array, i.e. an array of arrays in which each inner array represents one sample of the dataset. Thus the `reshape()` function can be used to convert a one-dimensional array into an array of arrays.
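
As a minimal sketch of this conversion, the snippet below turns a dataframe column into a one-dimensional array and then into an array of arrays. The dataframe `toy_df` and its column name are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe standing in for a real dataset.
toy_df = pd.DataFrame({"value": [5.0, np.nan, 4.0]})

column = np.array(toy_df["value"])  # one-dimensional array
X_2d = column.reshape(-1, 1)        # array of arrays: one row per sample

print(column.shape)  # (3,)
print(X_2d.shape)    # (3, 1)
```

The `-1` in `reshape(-1, 1)` lets NumPy infer the number of rows, so the same call works for a column of any length.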


## Data Import
In this tutorial we use the `cupcake.csv` dataset, which contains the search trend of the word `cupcake` on Google Trends. Data are extracted from [this link](https://trends.google.com/trends/explore?q=%2Fm%2F03p1r4&date=all). The values for `2004-02` and `2006-03` have been removed from the original dataset in order to demonstrate how to deal with missing values. The original values were 5 and 10 respectively. We use the `pandas` library to import the dataset and load it into a dataframe through the `read_csv()` function.

In [1]:
import pandas as pd
df = pd.read_csv('cupcake.csv')
df.head(5)

Unnamed: 0,Mese,Cupcake
0,2004-01,5.0
1,2004-02,
2,2004-03,4.0
3,2004-04,6.0
4,2004-05,5.0


Missing values are values not available in the dataset. One way to deal with them is to remove the affected rows from the dataset. However, this leads to data loss. The `scikit-learn` library provides three [mechanisms to deal with missing values](https://scikit-learn.org/stable/modules/impute.html):
* Univariate Feature Imputation
* Multivariate Feature Imputation
* Nearest Neighbors Imputation
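
Before choosing one of these mechanisms, it can help to see what plain removal costs. The sketch below counts the missing values and drops the affected rows with `pandas`. The dataframe `toy_df` is a hypothetical stand-in that only mimics the first rows of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data mimicking the first rows of the cupcake dataset.
toy_df = pd.DataFrame({
    "Mese": ["2004-01", "2004-02", "2004-03", "2004-04"],
    "Cupcake": [5.0, np.nan, 4.0, 6.0],
})

n_missing = toy_df["Cupcake"].isna().sum()   # count the missing values
dropped = toy_df.dropna(subset=["Cupcake"])  # remove rows with a missing value

print(n_missing)                  # 1
print(len(toy_df), len(dropped))  # 4 3
```

The row for `2004-02` disappears entirely, including its date, which is the data loss that imputation avoids.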

## Univariate Feature Imputation
Univariate feature imputation replaces missing values with a constant value or with a statistic (such as the mean) computed from the same feature. The `SimpleImputer` class can be used to perform univariate feature imputation. We specify which value marks a missing entry through the `missing_values` parameter and the replacement strategy through the `strategy` parameter. For example, we can replace all the `NaN` values (identified by the `numpy.nan` variable) with the average value of the column.

In [2]:
import numpy as np
from sklearn.impute import SimpleImputer

preprocessor = SimpleImputer(missing_values=np.nan, strategy='mean')

Now we can fit the preprocessor with the `Cupcake` column of the dataframe. 

In [3]:
X = np.array(df['Cupcake']).reshape(-1,1)
preprocessor.fit(X)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

We apply the `transform()` function to our data.

In [4]:
X_prep = preprocessor.transform(X)

We convert the result back to a one-dimensional array by applying the `reshape()` function again and we store it in a new column of the dataframe `df`.

In [5]:
df['Cupcake_univariate'] = X_prep.reshape(1,-1)[0]
df.head()

Unnamed: 0,Mese,Cupcake,Cupcake_univariate
0,2004-01,5.0,5.0
1,2004-02,,50.079208
2,2004-03,4.0,4.0
3,2004-04,6.0,6.0
4,2004-05,5.0,5.0
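
The mean is only one possible choice for the `strategy` parameter. As a hedged sketch, the snippet below compares the replacement produced by several strategies on a hypothetical toy column (`X_toy` is an assumption made for the example, not the cupcake data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value at row 2.
X_toy = np.array([[2.0], [2.0], [np.nan], [8.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = float(imputer.fit_transform(X_toy)[2, 0])

# With strategy="constant" the replacement comes from fill_value.
constant = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = float(constant.fit_transform(X_toy)[2, 0])

print(filled)
# {'mean': 4.0, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

Whatever the strategy, every missing entry of a column receives the same replacement, which is why both gaps in the `Cupcake` column get the same imputed value.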


## Multivariate Feature Imputation
In multivariate feature imputation, each feature with missing values is modeled as a function of the other features. The imputation is iterative, so the maximum number of iterations can be set through the `max_iter` parameter. We can use the `IterativeImputer` class. We consider two features: the column `Cupcake` and the index of the dataframe. Since the `IterativeImputer` is still at the experimental stage, we must enable it explicitly.

In [6]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
preprocessor = IterativeImputer(max_iter=10, random_state=0)

We must convert the two features into arrays and transform them in the form `[[f11,f21], [f12,f22] ...]`. This can be done by applying the `reshape()` function to each feature and then the `hstack()` function as follows:

In [7]:
X1 = np.array(df['Cupcake']).reshape(-1,1)
X2 = np.array(df.index).reshape(-1,1)
X = np.hstack((X1,X2))

We fit the preprocessor with the obtained features.

In [8]:
preprocessor.fit(X)

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

We then apply the preprocessor to the same features `X`. To retrieve the result of the operation, we apply the `hsplit()` function, which splits the array horizontally into its columns, and then we flatten the first column with `reshape()`.

In [9]:
X_prep = preprocessor.transform(X)
df['Cupcake_multivariate'] = np.hsplit(X_prep, 2)[0].reshape(1,-1)[0]
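
As a side note, the same column can also be extracted with plain NumPy slicing. The sketch below checks that both approaches give the same result on a hypothetical array `X_prep_toy`, which stands in for the output of `transform()`.

```python
import numpy as np

# Hypothetical imputed output with two columns: [value, index].
X_prep_toy = np.array([[5.0, 0.0], [4.5, 1.0], [4.0, 2.0]])

via_hsplit = np.hsplit(X_prep_toy, 2)[0].reshape(1, -1)[0]
via_slice = X_prep_toy[:, 0]  # first column, already one-dimensional

print(np.array_equal(via_hsplit, via_slice))  # True
```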

We can check the results. Missing values are located at positions 1 and 26. Note that the two types of imputation produce different values.

In [10]:
df.iloc[26]

Mese                    2006-03
Cupcake                     NaN
Cupcake_univariate      50.0792
Cupcake_multivariate    28.6311
Name: 26, dtype: object

In [11]:
df.iloc[1]

Mese                    2004-02
Cupcake                     NaN
Cupcake_univariate      50.0792
Cupcake_multivariate    21.6101
Name: 1, dtype: object

## Nearest Neighbors Imputation
This category of imputation fills missing values using the k-Nearest Neighbors approach. Each missing value is calculated from the values of the `n_neighbors` nearest samples that have a value for that feature. We can use the `KNNImputer` class of the `scikit-learn` library. Since the neighbors are found by comparing the other features, we need at least two features for the imputation to be meaningful. Thus we reuse the `X` variable defined previously.
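
To make the neighbor averaging concrete, here is a minimal sketch on hypothetical data (`X_toy` is an assumption made for the example). It uses `weights="uniform"` rather than distance weighting, so that the result is simply the average of the neighbors and can be checked by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: first column is the value, second column is the row index.
X_toy = np.array([
    [10.0, 0],
    [12.0, 1],
    [np.nan, 2],
    [16.0, 3],
    [18.0, 4],
    [100.0, 5],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X_toy)

# The two rows closest in index to row 2 are rows 1 and 3,
# so the missing value becomes the average of 12 and 16.
print(X_filled[2, 0])  # 14.0
```

A mean imputation on the same column would give 31.2, because the outlier 100 pulls the mean up. The neighbor-based value ignores it because row 5 is far away in index.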

In [12]:
from sklearn.impute import KNNImputer

preprocessor = KNNImputer(n_neighbors=5, weights="distance")
preprocessor.fit(X)
X_prep = preprocessor.transform(X)
df['Cupcake_knn'] = np.hsplit(X_prep, 2)[0].reshape(1,-1)[0]

In [13]:
df.iloc[26]

Mese                    2006-03
Cupcake                     NaN
Cupcake_univariate      50.0792
Cupcake_multivariate    28.6311
Cupcake_knn                10.7
Name: 26, dtype: object

In [14]:
df.iloc[1]

Mese                    2004-02
Cupcake                     NaN
Cupcake_univariate      50.0792
Cupcake_multivariate    21.6101
Cupcake_knn             4.91892
Name: 1, dtype: object

In this example, the `KNNImputer` produces the values closest to those in the original dataset (10 for position 26 and 5 for position 1).
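
The comparison can be quantified with a short sketch that computes the mean absolute error of each method over the two missing positions. The imputed values are copied by hand from the outputs shown in this tutorial and the originals from the dataset description. With only two data points this is an illustration, not a general ranking of the methods.

```python
# Original values that were removed from the dataset (position -> value).
original = {1: 5.0, 26: 10.0}

# Imputed values copied from the outputs shown in this tutorial.
imputed = {
    "univariate": {1: 50.0792, 26: 50.0792},
    "multivariate": {1: 21.6101, 26: 28.6311},
    "knn": {1: 4.91892, 26: 10.7},
}

# Mean absolute error of each method over the two missing positions.
mae = {
    name: sum(abs(values[pos] - original[pos]) for pos in original) / len(original)
    for name, values in imputed.items()
}

for name, error in sorted(mae.items(), key=lambda item: item[1]):
    print(f"{name}: {error:.2f}")
# knn: 0.39
# multivariate: 17.62
# univariate: 42.58
```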