In [None]:
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as pp
import pandas as pd
import statsmodels.api as sm
import seaborn as sb

## Data loading

Let's load our dataset.
This is the same diabetes data we have worked with before, but I have added some missing values.

In [None]:
file_name = "../data/diabetes_missing.csv"
df = pd.read_csv(file_name)
df.head()

Let's take a quick look at the size of the dataset with `shape`.

In [None]:
df.shape

## Exploratory data analysis

We have 442 observations (rows) and 13 variables (columns).
A first way to check for missing values is to use the `describe` method.

In [None]:
df.describe()

We can see from the count row that "BMI" and "Fasting Glucose" have fewer entries than the other variables, and fewer than 442, our number of observations.

We can also directly compute the number of missing values with the `isna` method and the `sum` method.
The `isna` method will return a matrix of True/False values indicating whether a value is missing (na).
Calling `sum` on that matrix will add the entries in each column together, treating False (not missing) as zero and True (missing) as one.

In [None]:
df.isna().sum()
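
As a minimal sketch of how `isna` and `sum` combine, here is the same idea on a small toy data frame. The frame `toy` and its values are made up for illustration and are not the diabetes data. Taking `mean` instead of `sum` gives the fraction of missing values per column, since True counts as one and False as zero.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the diabetes data).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```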

## Removing missing values

The most conservative approach for dealing with missing values is to remove every row that contains one.
We can do this using the `dropna` method.
With `axis=0`, a row is removed if a value is missing in any of its columns.

In [None]:
df_remove = df.dropna(axis=0)
df_remove.shape

We can look at how much of the data remains by taking the ratio of rows in the new data frame to the original.

In [None]:
df_remove.shape[0] / df.shape[0]
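
The sketch below repeats the row-removal step on a small toy frame so the effect is easy to check by hand. The frame `toy` and its column names are hypothetical. It also shows the `subset` argument of `dropna`, which restricts the check to chosen columns and may keep more rows when only some columns matter for an analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the diabetes data).
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, 2.0, 3.0, 4.0],
    "z": [1, 2, 3, 4],
})

# Drop every row that has a missing value in any column.
complete = toy.dropna(axis=0)

# Fraction of rows that survive the removal.
kept_fraction = complete.shape[0] / toy.shape[0]

# Drop rows only when "x" is missing; a missing "y" is tolerated.
only_x = toy.dropna(subset=["x"])

print(complete.shape, kept_fraction, only_x.shape)
```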

## Imputing with the mean

A less conservative approach is to replace missing values with the mean of the column.
By construction, this approach does not change the mean of that column, but it shrinks the variance, because every filled value sits exactly at the mean.
We can do this using pandas indexing fairly easily.
Below I am making a copy of the original data so we can compare the results.

#### Manual filling

In [None]:
df_mean = df.copy()
df_mean.loc[df_mean["BMI"].isnull(), "BMI"] = df_mean["BMI"].mean()
df_mean.loc[df_mean["Fasting Glucose"].isnull(), "Fasting Glucose"] = df_mean["Fasting Glucose"].mean()
df_mean.describe()

#### Using pandas

The same result can also be achieved more compactly with the [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method.

In [None]:
df_mean = df.copy()
df_mean["BMI"] = df["BMI"].fillna(df["BMI"].mean())
df_mean["Fasting Glucose"] = df["Fasting Glucose"].fillna(df["Fasting Glucose"].mean())
df_mean.describe()
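
When several columns need the same treatment, `fillna` also accepts a Series of fill values indexed by column name. The sketch below uses a hypothetical toy frame, not the diabetes data, and assumes that every numeric column should be filled with its own mean.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the diabetes data).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 6.0],
    "label": ["x", "y", "z"],
})

# Column means of the numeric columns only; "label" is skipped.
column_means = toy.mean(numeric_only=True)

# Each missing value is replaced by the mean of its own column.
filled = toy.fillna(column_means)

print(filled)
```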

We can plot the results.
Note that I have to create a figure and a single axes that is shared between the two plots.
I did this so we could see both results together.
I also made some modifications to the colors and alpha scaling of the plots so we could see them despite the overlap.

In [None]:
fig = pp.figure()
ax = fig.add_subplot(1, 1, 1)
# Original data in blue, mean-imputed data in red, drawn on the same axes.
df["BMI"].hist(ax=ax, bins=20, alpha=0.5, color="b")
df_mean["BMI"].hist(ax=ax, bins=20, alpha=0.5, color="r")

Now let's compare the descriptive statistics.

In [None]:
df["BMI"].describe(), df_mean["BMI"].describe()
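
To see why the mean stays put while the spread changes, here is a small sketch on a toy series with hypothetical values. The filled entries add nothing to the sum of squared deviations but increase the count, so the standard deviation goes down.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values (not the BMI column).
s = pd.Series([20.0, 25.0, np.nan, 30.0, np.nan, 35.0])

# Replace the missing entries with the mean of the observed entries.
filled = s.fillna(s.mean())

# The mean is unchanged; the standard deviation is smaller after filling.
print(s.mean(), filled.mean())
print(s.std(), filled.std())
```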

#### Using scikit-learn

All this can be simplified further if we use the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) class from the scikit-learn (`sklearn`) package.
The `sklearn` package can do more sophisticated things if, for example, columns are categorical, but handling this is beyond the scope of the course.

Note I am passing the `strategy` argument explicitly.
By default it is mean, so this is unnecessary, but it is helpful for code readability.
Other strategies such as median exist for continuous values as well.

In [None]:
df_mean[["BMI", "Fasting Glucose"]] = SimpleImputer(strategy="mean").fit_transform(df[["BMI", "Fasting Glucose"]])
df_mean["BMI"].describe()
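
As a sketch of one of the other strategies, the example below uses `strategy="median"` on a toy frame with hypothetical values. The median is less sensitive to extreme values than the mean, which matters for the column `u` here. After fitting, the fill value chosen for each column is available in the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values (not the diabetes data).
toy = pd.DataFrame({
    "u": [1.0, 2.0, np.nan, 100.0],
    "v": [np.nan, 10.0, 20.0, 30.0],
})

# Fill each column's missing values with that column's median.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(toy),  # returns a NumPy array
    columns=toy.columns,
    index=toy.index,
)

# One fill value per column, in column order.
print(imputer.statistics_)
print(filled)
```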