 Preprocessing: Unavailable numerical values (*missing values*)

This notebook is an adaptation of the [original by *Aurélien Géron*](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb), from his book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/).

# Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv") 

# Generation of training and test sets through stratified sampling by median income
train_set, test_set = train_test_split(housing, test_size=0.2,
    stratify=pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5]),
    random_state=42
    )

**Preprocessing** data is one of the most important tasks in Machine Learning. If the data is not well prepared, *Machine Learning* algorithms won't work correctly. First we'll separate the **predictors** from the **target variable** (the **labels**), since we won't necessarily apply the same **transformations** to both.

In [None]:
housing = train_set.drop("median_house_value", axis=1) # Remove the dependent variable column
housing_labels = train_set["median_house_value"].copy() # Save the dependent variable (labels)

housing.head().T

# Identification of unavailable values

As we saw at the beginning, the `total_bedrooms` column has unavailable values. We will generally treat unavailable values, ***missing values***, *null* and ***na* (not available)** as synonyms. Still, it pays to check how those values were collected: if a column mixes two kinds of empty value (null and the empty *string*, for example), the difference between them may carry implicit information.
<!-- TODO: Review distinction between na, null, missing, impure data.... -->

In [None]:
housing.isna().sum() # the isnull() method is an alias for isna()

In [None]:
null_rows_idx = housing.isnull().any(axis=1) # boolean mask: True for rows with at least one null value
housing.loc[null_rows_idx].head()
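
The caveat about mixed empty markers can be illustrated on a toy column. This is a minimal sketch with made-up data (the `df` frame below is hypothetical and unrelated to the housing set): `isna()` flags `None` and `NaN`, but an empty string counts as a regular value and has to be checked for explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing three different "empty" markers
df = pd.DataFrame({"ocean_proximity": ["INLAND", None, "", np.nan, "NEAR BAY"]})

# isna() flags None and NaN, but not the empty string
na_mask = df["ocean_proximity"].isna()
print(na_mask.tolist())

# Empty strings have to be detected explicitly
empty_mask = df["ocean_proximity"].eq("")
print(empty_mask.tolist())

# Total number of "unavailable" entries once both kinds are counted
n_unavailable = int((na_mask | empty_mask).sum())
print(n_unavailable)
```

Whether the two masks should be merged or kept apart depends on what the empty string meant when the data was collected.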

# Deletion of rows with null values (***Listwise deletion***)

We can simply delete those incomplete instances, although this is problematic because we're eliminating information. The loss is especially costly when there are many predictors: to get rid of a few nulls in one column, we discard the valid values those rows hold in all the other columns.

In [None]:
housing_option1 = housing.dropna(subset=["total_bedrooms"]) 
housing_option1.loc[null_rows_idx].head() # verify that rows with null values have been removed

We could also directly delete any row that has a null value in any column:

In [None]:
housing_option1b = housing.dropna(axis=0) # remove rows with null values
housing_option1b.loc[null_rows_idx].head() # verify that rows with null values have been removed
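
To get a feel for how much information listwise deletion can discard, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its columns are hypothetical). Only 3 of its 15 cells are missing, yet each one sits in a different row, so dropping incomplete rows also throws away the valid values around them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three rows are each missing exactly one value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [1.0, 2.0, np.nan, 4.0, 5.0],
    "c": [1.0, 2.0, 3.0, np.nan, 5.0],
})

missing_cells = int(toy.isna().sum().sum())   # cells that are actually missing
complete = toy.dropna()                       # listwise deletion
rows_kept = len(complete)
# Valid (non-null) cells that were discarded along with the incomplete rows
valid_cells_lost = int(toy.notna().sum().sum()) - complete.size

print(f"missing cells: {missing_cells}/{toy.size}")
print(f"rows kept: {rows_kept}/{len(toy)}, valid cells discarded: {valid_cells_lost}")
```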

# Deletion of the entire column

Deleting the entire column is an option when the variable is not important. Here it does seem important: although `total_bedrooms` is not the *feature* that correlates most directly with the target variable, it is one of the two used to calculate `bedrooms_ratio`, which is the second most correlated.

In [None]:
housing_option2 = housing.drop(columns="total_bedrooms")
housing_option2.loc[null_rows_idx].head()

The rows are still there in this case because `null_rows_idx` was computed on the original `housing` DataFrame, before the column was dropped. If we search for nulls in `housing_option2` now, we won't find any.

In [None]:
housing_option2.isnull().any(axis=None) # verify that there are no null values in the dataset

We could also directly delete all columns with nulls:

In [41]:
housing.dropna(axis=1).isnull().any(axis=None)

False

# Imputation of some value (the median in this case)

**Imputation** of a fixed value (such as zero, the mean or the median) in those unavailable fields is an option if we believe that the values are not missing for any specific reason and that their absence does not bias the variable's distribution<!-- TODO: Review Different types of missing data (MCAR, MAR, MNAR) -->.

Imputing the **mean** is more sensitive to **outliers**, since a single extreme value can shift the mean considerably. The **median** is more robust to extreme values. The **mode** is the most frequent value; it is useful for categorical variables, but less so for continuous ones.



[<img src="img/mean_outliers.jpg" width="300">](https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer)
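
The difference in robustness is easy to check numerically. The sketch below uses a made-up sample (not taken from the housing data) and adds a single extreme value to it:

```python
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 5.0])   # made-up sample
with_outlier = np.append(values, 500.0)        # same sample plus one extreme value

mean_before, median_before = np.mean(values), np.median(values)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

The mean jumps from 3.4 to roughly 86, while the median only moves from 3.0 to 3.5.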

In [None]:
median = housing["total_bedrooms"].median()
housing_option3 = housing["total_bedrooms"].fillna(median)
housing_option3.loc[null_rows_idx].head()

Now all of these rows hold the median of `total_bedrooms` in place of the null value.

The `SimpleImputer` class from scikit-learn allows us to do this more easily. We create an instance of `SimpleImputer` indicating that we want to impute null values with the median, and then use the `fit()` method to calculate the median of each column and the `transform()` method to apply the imputation to all columns.

Let's see how this method would be applied to all numerical fields in the dataframe (remember that `ocean_proximity` is categorical, with text values, and we can't calculate the median of text).

In [43]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
housing_num = housing.select_dtypes(include=[np.number]) # select numerical columns

In [None]:
imputer.fit(housing_num) # calculate the median of each numerical column
imputer.statistics_ # median of each numerical column

We can verify that the values are the same as those calculated by the dataframe's `median()` method.

In [46]:
housing_num.median().values

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

In [None]:
housing_num_array_tr = imputer.transform(housing_num) # replace null values with the median
housing_num_array_tr

`transform()` returns a NumPy array, but we could convert it back to a Pandas DataFrame.

In [None]:
housing_tr = pd.DataFrame(housing_num_array_tr, columns=housing_num.columns, index=housing_num.index)
housing_tr.loc[null_rows_idx].head()

We could also directly use the `fit_transform()` method of `SimpleImputer` to calculate the value to impute (with `fit()`) and apply it (with `transform()`) in a single step.

And we could also use the `.set_output(transform="pandas")` method of the imputer so that the result is a Pandas DataFrame.

Therefore, the entire process detailed above can be summarized in a single line of code:

In [None]:
housing_tr = SimpleImputer(strategy="median").set_output(transform="pandas").fit_transform(housing_num)
housing_tr.loc[null_rows_idx].head()

# Predictive models to impute values

There are more advanced methods, such as using **prediction models** (treating the column with null values as the target variable and the rest of the columns as *features*). For example, the **K-Nearest Neighbors (KNN)** algorithm could be used to estimate the null values of `total_bedrooms` from the most similar records in which that value is present. Scikit-Learn has a `KNNImputer` class that does this.
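
As a rough illustration of how `KNNImputer` behaves, the sketch below uses a made-up frame (the `toy` DataFrame and the choice `n_neighbors=2` are assumptions for the example, not values tuned for the housing data). The missing `total_bedrooms` value is filled with the average of that column over the two rows whose `total_rooms` is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the last row is missing total_bedrooms
toy = pd.DataFrame({
    "total_rooms":    [1000.0, 1100.0, 5000.0, 5200.0, 1050.0],
    "total_bedrooms": [200.0,  220.0,  900.0,  950.0,  np.nan],
})

knn_imputer = KNNImputer(n_neighbors=2)
# fit_transform() returns a NumPy array, so we rebuild the DataFrame
toy_tr = pd.DataFrame(knn_imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# The two nearest rows by total_rooms are the first two (1000 and 1100),
# so the imputed value is the mean of their total_bedrooms
print(toy_tr.loc[4, "total_bedrooms"])
```

Because the neighbors are found by distance, the result depends on the scale of the features; with columns of very different magnitudes, scaling them first is usually advisable.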