# Simple Imputation of missing values

## Introduction
This notebook walks through some simple methods for imputing missing data and explores how each one affects the distributions of the datasets. The imputation methods covered will include:

- Aggregate statistics (mean, median)
- Interpolation (time series)
- Multiple imputation

# Imports

In [None]:
%reload_ext autoreload
%autoreload 2

import data_chaser as dc
import numpy as np
import os
import pandas as pd
import plotly.graph_objects as go
from sklearn.impute import SimpleImputer

## Loading in data

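We will again assume that all data is saved in `lost-data-chaser/data/`. There are much nicer ways to do this, but for now let us load these files 'notebook style'. Note that the cells below pick each file by its position in the alphabetically sorted list of `.csv` files, so they depend on the exact contents of the data directory.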

In [None]:
datadir = os.path.join(os.path.dirname(os.getcwd()), 'data')
fnames = sorted([os.path.join(datadir, fname) for fname in os.listdir(datadir) if fname.endswith('.csv')])

In [None]:
fire_df = pd.read_csv(fnames[0])
landslide_df = pd.read_csv(fnames[1])
meteor_df = pd.read_csv(fnames[2])
comet_df = pd.read_csv(fnames[3])

In [None]:
snow_df = pd.read_csv(fnames[4], skiprows=39)
snow_df['#datetime_MST'] = pd.to_datetime(snow_df['#datetime_MST'])
snow_df = snow_df.set_index('#datetime_MST', drop=True)
snow_df.index = snow_df.index.set_names('timestamp')
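
As one possible version of the 'nicer ways' mentioned above, the sketch below reads every `.csv` file into a dictionary keyed by file name, so that nothing depends on the position of a file in a sorted list. The helper `load_csvs` and its `read_kwargs` parameter are hypothetical names introduced here for illustration; they are not part of `data_chaser`.

```python
from pathlib import Path

import pandas as pd


def load_csvs(datadir, read_kwargs=None):
    """Read every .csv directly inside datadir into a dict keyed by file stem.

    read_kwargs (hypothetical parameter) optionally maps a file stem to extra
    keyword arguments for pd.read_csv, e.g. {'some_file': {'skiprows': 39}}.
    """
    read_kwargs = read_kwargs or {}
    frames = {}
    for path in sorted(Path(datadir).glob('*.csv')):
        # Fall back to default read_csv arguments when no override is given
        frames[path.stem] = pd.read_csv(path, **read_kwargs.get(path.stem, {}))
    return frames
```

A call such as `frames = load_csvs(datadir, read_kwargs={'snow': {'skiprows': 39}})` would then cover a file that needs header rows skipped, where `'snow'` stands in for whatever the actual file stem is.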

# Simple Imputation

## Imputation via aggregate statistics

First, we can try imputation using the mean and median of columns. Let us use the fires dataset, imputing the `Velocity Components (km/s): vx` attribute.

In [None]:
# Preview the first few rows of the fires dataset
fire_df.head(3)

In [None]:
feature = 'Velocity Components (km/s): vx'
# Count missing and present entries for the chosen feature
n_missing = fire_df[feature].isna().sum()
n_present = fire_df[feature].notna().sum()
fig = go.Figure([go.Bar(x=['Missing', 'Present'], y=[n_missing, n_present])])
fig.update_layout(autosize=True, title_text=f"Missing vs. present counts for variable '{feature}'")
fig.show()

We will use the `SimpleImputer` from `sklearn` to simplify the process.

In [None]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')

In [None]:
# SimpleImputer expects 2D input and returns a 2D array, so select the column
# as a one-column DataFrame and flatten the result before assigning it back.
fire_df[feature + '_mean_impute'] = mean_imputer.fit_transform(fire_df[[feature]]).ravel()
fire_df[feature + '_median_impute'] = median_imputer.fit_transform(fire_df[[feature]]).ravel()
print(fire_df[feature + '_mean_impute'])
print(fire_df[feature + '_median_impute'])

We wouldn't expect the choice between mean and median imputation to make much difference when the imputed column is used to bring more data into a machine learning model, but we will test this theory in the next notebook.
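
To get a feel for how the two strategies affect a distribution, here is a minimal sketch on synthetic data. It does not use the fires dataset: the skewed sample, the 30% missing rate, and the variable names are all assumptions made for illustration, and it uses pandas `fillna` rather than `SimpleImputer`, which gives the same fill values for a single column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a numeric column (not the fires data)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Knock out roughly 30% of the entries completely at random
missing_mask = rng.random(values.size) < 0.3
observed = values.mask(missing_mask)

# Fill the gaps with the mean and the median of the observed entries
mean_filled = observed.fillna(observed.mean())
median_filled = observed.fillna(observed.median())

# Compare a few summary statistics side by side
summary = pd.DataFrame({
    'observed': observed.describe(),
    'mean_impute': mean_filled.describe(),
    'median_impute': median_filled.describe(),
}).loc[['count', 'mean', '50%', 'std']]
print(summary)
```

Each filled column keeps the statistic it was filled with (the mean or the median), but its standard deviation is smaller than that of the observed data, because every imputed point sits at a single value.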

## Simple time-series imputation

TBC.
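
As a rough sketch of what interpolation on a datetime-indexed frame such as `snow_df` might look like, the example below uses pandas `Series.interpolate` on a tiny made-up series. The timestamps and values are invented for illustration and are not taken from the snow data.

```python
import numpy as np
import pandas as pd

# Made-up, irregularly spaced readings with one gap (not the snow data)
timestamps = pd.to_datetime(['2021-01-01 00:00', '2021-01-01 01:00', '2021-01-01 04:00'])
series = pd.Series([10.0, np.nan, 40.0], index=timestamps)

# 'linear' ignores the index and treats the points as equally spaced
linear = series.interpolate(method='linear')

# 'time' weights the fill by the elapsed time between neighbouring readings
by_time = series.interpolate(method='time')

print(pd.DataFrame({'raw': series, 'linear': linear, 'time': by_time}))
```

Here the gap is one hour after the first reading and three hours before the last, so `method='time'` fills it with 17.5, while `method='linear'` fills it with the midpoint 25.0. For unevenly sampled data the time-aware variant is usually the more natural choice.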

## Multiple variable imputation

TBC.