# Setup

We are going to use the [fancyimpute](https://pypi.org/project/fancyimpute/) package, which needs to be installed first.

In [2]:
!pip install fancyimpute


In [3]:
# experimental parameters
random_state = 42
missing_fraction = 0.15

# Dataset: UCI Air Quality

The dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with reference gas concentrations from a certified analyzer. [Source](https://archive.ics.uci.edu/dataset/360/air+quality)

In [4]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [5]:
# shape of the data
print(f"Data shape: {X.shape}")

# check that nothing is missing
print(f"Total missing data points: {X.isna().sum().sum()}")

Data shape: (9357, 15)
Total missing data points: 0
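
A zero NaN count does not necessarily mean the data are complete. The UCI description of this dataset states that missing values are tagged with the value -200, so, assuming that sentinel survives loading through `ucimlrepo`, such cells would go unnoticed by `isna()`. A minimal sketch of how one might count them, shown on a small made-up frame (the helper name `count_sentinel` and the toy values are illustrative, not part of the original notebook):

```python
import pandas as pd

# Assumption: -200 is the missing-value tag named in the UCI dataset description
SENTINEL = -200

def count_sentinel(df, sentinel=SENTINEL):
    """Count cells equal to the sentinel, looking at numeric columns only."""
    numeric = df.select_dtypes(include="number")
    return int((numeric == sentinel).sum().sum())

# Toy frame standing in for X (values are made up)
toy = pd.DataFrame({
    "Date": ["3/10/2004", "3/10/2004", "3/10/2004"],
    "CO(GT)": [2.6, -200.0, 2.2],
    "NMHC(GT)": [150, -200, -200],
})

print(f"NaN cells: {toy.isna().sum().sum()}")
print(f"Sentinel cells: {count_sentinel(toy)}")
```

Calling `count_sentinel(X)` on the real data would show whether any such cells are present before the artificial missing values are introduced.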


In [6]:
# If you want, you can uncomment these lines and take a look at the data description.
# It's a bit long and it messes with the notebook formatting, so it's
# commented out by default

#import pprint
#pprint.pp(air_quality.metadata)
#print(air_quality.variables)

# Introducing controlled missing data points

We introduce artificial missing values so that we can test and compare various imputation algorithms. Keep in mind that the first two columns (Date and Time) are not involved in the test: having measurements without knowing when they were recorded is an unrealistic scenario.

In [7]:
import numpy as np
import pandas as pd

def introduce_random_missingness(df, missing_fraction, random_state=None):
    """
    Randomly set a fraction of the entries in df to NaN.

    Args:
        df (pd.DataFrame): original data
        missing_fraction (float): fraction of total entries to set as NaN (0 < missing_fraction < 1)
        random_state (int, optional): random seed for reproducibility

    Returns:
        df_missing (pd.DataFrame): copy of df with missing values
    """
    np.random.seed(random_state)

    # A copy of the original data frame, where we'll have the nan
    df_missing = df.copy()

    # How many cells should we remove?
    n_total = df.size
    n_missing = int(np.floor(missing_fraction * n_total))

    # Pick flat (row-major) cell positions without replacement
    missing_indices = np.random.choice(n_total, n_missing, replace=False)

    # Convert each flat position back to (row, col) and put a NaN there
    n_cols = df.shape[1]
    for i in missing_indices:
        row, col = divmod(i, n_cols)
        df_missing.iat[row, col] = np.nan

    return df_missing
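
The same idea can be written without the Python loop. The following is a sketch, not a drop-in replacement: the name `introduce_random_missingness_vectorized` is hypothetical, it uses NumPy's `default_rng` generator instead of the global seed (so it selects different cells than the function above for the same `random_state`), and it assumes that all columns are numeric.

```python
import numpy as np
import pandas as pd

def introduce_random_missingness_vectorized(df, missing_fraction, random_state=None):
    """Return a float copy of df with a fraction of its cells set to NaN."""
    rng = np.random.default_rng(random_state)

    n_total = df.size
    n_missing = int(np.floor(missing_fraction * n_total))

    # Build a flat boolean mask with exactly n_missing True entries
    flat_mask = np.zeros(n_total, dtype=bool)
    flat_mask[rng.choice(n_total, size=n_missing, replace=False)] = True

    # Reshape to the frame's layout and blank out the selected cells
    mask = flat_mask.reshape(df.shape)
    return df.astype(float).mask(mask)

# Toy frame with 100 cells (values are made up)
toy = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list("abcde"))
toy_missing = introduce_random_missingness_vectorized(toy, 0.25, random_state=42)
print(toy_missing.isna().sum().sum())  # 25 cells, i.e. floor(0.25 * 100)
```

Keeping the boolean mask around can also be handy later, since it records exactly which cells were removed.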

In [8]:
# Splitting the dataframe in time coordinates and data columns
X_coord = X[['Date', 'Time']]
X_data = X.drop(columns=['Date', 'Time'])

# Introduce missing values
X_data_missing = introduce_random_missingness(X_data, missing_fraction, random_state)

# Report how many cells were removed
missing_cnt = X_data_missing.isna().sum().sum()
size_cnt = X_data_missing.size
print(f"Total missing data points: {missing_cnt} / {size_cnt} ({100.0 * missing_cnt / size_cnt:.2f}%)")

Total missing data points: 18246 / 121641 (15.00%)


# KNN imputation

In [9]:
from fancyimpute import KNN, NuclearNormMinimization, SoftImpute, BiScaler

# Use 3 nearest rows which have a feature to fill in each row's missing features
X_imputed_KNNI = KNN(k=3).fit_transform(X_data_missing)

# Since we are using pandas dataframes, let's convert the ndarray returned
# by the KNN imputer to a dataframe
X_imputed_KNNI = pd.DataFrame(X_imputed_KNNI, columns=X_data_missing.columns)

print("\n\nMissing data points: " + str(X_imputed_KNNI.isna().sum().sum()))



Imputing row 1/9357 with 1 missing, elapsed time: 22.860
Imputing row 101/9357 with 0 missing, elapsed time: 22.874
Imputing row 201/9357 with 1 missing, elapsed time: 22.895
Imputing row 301/9357 with 3 missing, elapsed time: 22.911
Imputing row 401/9357 with 2 missing, elapsed time: 22.922
Imputing row 501/9357 with 0 missing, elapsed time: 22.934
Imputing row 601/9357 with 0 missing, elapsed time: 22.946
Imputing row 701/9357 with 2 missing, elapsed time: 22.958
Imputing row 801/9357 with 5 missing, elapsed time: 22.970
Imputing row 901/9357 with 1 missing, elapsed time: 22.982
Imputing row 1001/9357 with 2 missing, elapsed time: 22.994
Imputing row 1101/9357 with 2 missing, elapsed time: 23.006
Imputing row 1201/9357 with 1 missing, elapsed time: 23.018
Imputing row 1301/9357 with 0 missing, elapsed time: 23.031
Imputing row 1401/9357 with 0 missing, elapsed time: 23.043
Imputing row 1501/9357 with 1 missing, elapsed time: 23.055
Imputing row 1601/9357 with 3 missing, elapsed time:
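
To make the idea behind KNN imputation concrete, here is a minimal NumPy sketch. It is not fancyimpute's implementation: the function name `knn_impute_sketch` is hypothetical, distances are taken as the mean squared difference over the features observed in both rows, and the neighbours are averaged without weights.

```python
import numpy as np

def knn_impute_sketch(X, k=3):
    """Fill each NaN with the mean of the k nearest rows that observe that feature."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)

    for i in range(X.shape[0]):
        missing_cols = np.where(~observed[i])[0]
        if missing_cols.size == 0:
            continue

        # Mean squared difference over the features observed in both rows
        shared = observed & observed[i]
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(X.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / n_shared[ok]
        dist[i] = np.inf  # a row is not its own neighbour

        for j in missing_cols:
            # Candidates: rows that observe feature j and are comparable to row i
            candidates = np.where(observed[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue  # nothing to borrow from, leave the NaN in place
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()

    return filled

# Toy data (made up): row 1 is missing its second feature
X_toy = np.array([[1.0, 10.0],
                  [1.1, np.nan],
                  [5.0, 50.0],
                  [1.2, 12.0]])
filled = knn_impute_sketch(X_toy, k=2)
print(filled[1, 1])  # mean of 10.0 and 12.0, the values of the two closest rows
```

Because the distance is computed on raw feature values, columns with a large numeric range weigh more in the choice of neighbours.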

# Comparing performances

In [10]:
from sklearn.metrics import mean_squared_error

def evaluate_imputation(df_original, df_imputed, truth_mask):
    # Boolean array marking the cells that were artificially removed
    mask = truth_mask.to_numpy()

    # Extract truth and imputed values at exactly the same positions,
    # so that the two 1D arrays stay aligned
    truth = df_original.to_numpy(dtype=float)[mask]
    imputed = df_imputed.to_numpy(dtype=float)[mask]

    # Remove every pair in which either value is NaN
    valid = ~(np.isnan(truth) | np.isnan(imputed))
    truth, imputed = truth[valid], imputed[valid]

    # Compute RMSE
    rmse = np.sqrt(mean_squared_error(truth, imputed))

    return rmse

In [11]:
KNNI_RMSE = evaluate_imputation(df_original = X_data, df_imputed = X_imputed_KNNI, truth_mask = X_data_missing.isna())
print(f'KNNI RMSE: {KNNI_RMSE:.4f}')

KNNI RMSE: 128.3049
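
For reference, the score above is the root mean squared error over the artificially removed cells. Using notation introduced here only for illustration ($M$ for the set of removed cells, $x_{ij}$ for the original value and $\hat{x}_{ij}$ for the imputed one):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|M|} \sum_{(i,j) \in M} \left( x_{ij} - \hat{x}_{ij} \right)^2}
$$

All columns are pooled together in their original units, so the value is a single number for the whole table rather than one score per variable.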


# Exercises

* normalize the input data (each column should have zero mean and unit standard deviation) before imputation
  * can you tell why this is important?
* study the effect of different values of k for KNN imputation
* modify the `evaluate_imputation()` function so that it returns more than one performance metric. For inspiration take a look at the [sklearn.metrics](https://scikit-learn.org/stable/api/sklearn.metrics.html) package, from which `mean_squared_error` is already imported
* implement other imputation algorithms, either by yourself or using [fancyimpute](https://pypi.org/project/fancyimpute/) (or something else, if you find anything interesting). Compare the performances
* write alternatives to `introduce_random_missingness()` to test other types of missing data injection, and then compare the performances. For example, write a function that picks a number of variables at random (e.g. 3 variables) and for each one:
  * picks a percentage X between 0% and 20%
  * deletes either the first X% or the last X% of the data for that variable