# Preprocessing: Unavailable numerical values (*missing values*)

## Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv") 
train_set, test_set = train_test_split(housing, test_size=0.2,
    stratify=pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5]),
    random_state=42
    )

**Preprocessing** data is one of the most important tasks in Machine Learning. If the data is not well prepared, *Machine Learning* algorithms won't work correctly. First we'll separate the **predictors** from the **target variable** (the **labels**), since we won't necessarily apply the same **transformations** to both.

In [None]:
X_train = train_set.drop("median_house_value", axis=1) # Remove the dependent variable column
y_train = train_set["median_house_value"].copy() # Save the dependent variable (labels)

X_train.head().T

## Identification of unavailable values

As we saw at the beginning, the `total_bedrooms` column has unavailable values. We'll generally treat the terms unavailable value, ***missing value***, *null* and ***NA* (not available)** as synonyms. Still, it pays to check how those values were collected: if the data encodes absence in more than one way (null and empty *string*, for example), the difference may carry implicit information.

> **Terminology note:** In data science, several terms are used somewhat interchangeably:
> - **Missing value**: The general concept—a value that should exist but doesn't.
> - **NA (Not Available)**: A common label for missing values, used in R and pandas (`pd.NA`).
> - **NaN (Not a Number)**: Originally from IEEE 754 floating-point standard, used in NumPy/pandas for missing floats.
> - **Null**: Common in databases (SQL NULL); in pandas, `None` for object columns.
> 
> In pandas, `isna()` and `isnull()` are aliases—both detect NA and NaN values. The important distinction is between data that is *structurally* missing (not recorded) vs. *semantically* missing (e.g., "not applicable")—these may require different handling strategies.

In [None]:
X_train.isna().sum() # the isnull() method is an alias for isna()
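
As a small illustration of why the encoding of absent values matters, the sketch below uses a made-up column (not taken from the housing data) that mixes several markers. The placeholder strings `""` and `"N/A"` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing several ways a value can be "absent"
markers = pd.Series([1.0, np.nan, None, pd.NA, "", "N/A"], dtype="object")

# isna() flags NaN, None and pd.NA, but not empty strings or placeholder text
flags = markers.isna()
print(flags.tolist())

# Placeholder strings have to be turned into NaN explicitly;
# mask() sets the entries where the condition is True to NaN
cleaned = markers.mask(markers.isin(["", "N/A"]))
n_missing = int(cleaned.isna().sum())
print(n_missing)
```

Before discarding such placeholders, it is worth checking whether they mean something different from a plain null.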

In [None]:
null_rows_idx = X_train.isnull().any(axis=1) # boolean mask: True for rows with at least one null value
X_train.loc[null_rows_idx].head()

## Deletion of rows with null values (***Listwise deletion***)

We can simply delete the incomplete instances, but this discards information. The cost grows with the number of predictors: to get rid of a null in one column, we also throw away the valid values in all the other columns of that row.

In [None]:
X_train_ld_tb = X_train.dropna(subset=["total_bedrooms"]) 
X_train_ld_tb.loc[null_rows_idx].head() # verify that rows with null values have been removed

We could also directly delete any row that has a null value in any column:

In [None]:
X_train_ld_any = X_train.dropna(axis=0) # remove rows with null values
X_train_ld_any.loc[null_rows_idx].head() # verify that rows with null values have been removed

## Deletion of the entire column

Deleting the entire column is an option when the variable is not important. Here it does seem important: although `total_bedrooms` is not the *feature* that correlates most directly with the target variable, it's one of the two used to calculate `bedrooms_ratio`, which is the second most correlated.

In [None]:
X_train_drop_tb = X_train.drop(columns="total_bedrooms")
X_train_drop_tb.loc[null_rows_idx].head()

The rows are still returned here because we dropped a column, not rows, and the null mask was computed before the drop. If we search for nulls now in `X_train_drop_tb`, we won't find any.

In [None]:
X_train_drop_tb.isnull().any(axis=None) # verify that there are no null values in the dataset

We could also directly delete all columns with nulls:

In [None]:
X_train.dropna(axis=1).isnull().any(axis=None)

## Imputation of some value (the median in this case)

**Imputation** replaces the unavailable fields with a chosen value (such as zero, the mean or the median). It's a reasonable option if we believe the values aren't missing for any specific reason, so that filling them in doesn't bias the variable's distribution.
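
To make the concern about bias concrete, the following sketch simulates two situations on synthetic data (the distribution, seed and missingness rates are arbitrary assumptions). In the first, values go missing purely at random; in the second, only large values can go missing. It then compares the median we would impute in each case with the true median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed variable (assumption made for the illustration)
values = rng.lognormal(mean=6.0, sigma=0.5, size=10_000)

# Case 1: every value has the same 20% chance of being missing
random_mask = rng.random(values.size) < 0.2

# Case 2: only values above the 80th percentile can go missing (50% chance each)
high = values > np.quantile(values, 0.8)
dependent_mask = high & (rng.random(values.size) < 0.5)

true_median = np.median(values)
median_random = np.median(values[~random_mask])        # estimated from observed values
median_dependent = np.median(values[~dependent_mask])  # estimated from observed values

print(f"true median:                  {true_median:.1f}")
print(f"observed (random missing):    {median_random:.1f}")
print(f"observed (dependent missing): {median_dependent:.1f}")
```

When missingness depends on the value itself, the observed median underestimates the true one, so imputing it pulls the filled-in values in the wrong direction.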

> **Types of missing data:** The appropriate imputation strategy depends on *why* data is missing:
> - **MCAR (Missing Completely At Random)**: The probability of missing is unrelated to any variable. Example: a sensor randomly fails. Simple imputation (mean/median) works well.
> - **MAR (Missing At Random)**: Missingness depends on *observed* variables but not the missing value itself. Example: older survey respondents skip income questions regardless of their actual income. Can use information from other variables to impute.
> - **MNAR (Missing Not At Random)**: Missingness depends on the *unobserved* value itself. Example: high earners deliberately skip income questions. Simple imputation will bias results; requires specialized methods or domain knowledge.
>
> In our dataset, `total_bedrooms` has only 207 missing values (~1%) out of 20,640 records—a small fraction. Without additional information about *why* these values are missing, median imputation is a reasonable choice, as it's robust to potential outliers.

Imputing the **mean** is more sensitive to **outliers**, since a single extreme value can shift the mean considerably (see [Outliers and capped values](e2e020_eda.ipynb#Outliers-and-capped-values) for outlier handling techniques). The **median** is more robust to extreme values. The **mode** is the most frequent value; it's useful for categorical variables, but less so for continuous ones.

[<img src="./img/mean_outliers.jpg" width="300">](https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer)
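
The effect of a single extreme value on the mean and the median can be checked with a few made-up numbers (the values below are purely illustrative):

```python
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 500.0)  # add one extreme value

mean_before, median_before = values.mean(), np.median(values)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

The mean jumps from 3.40 to about 86.17, while the median only moves from 3.00 to 3.50.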

In [None]:
median = X_train["total_bedrooms"].median()
housing_option3 = X_train["total_bedrooms"].fillna(median)
housing_option3.loc[null_rows_idx].head()

Now, in all these rows, `total_bedrooms` holds the median of the column.

The `SimpleImputer` class from scikit-learn allows us to do this more easily. We create an instance of `SimpleImputer` indicating that we want to impute null values with the median, and then use the `fit()` method to calculate the median of each column and the `transform()` method to apply the imputation to all columns.

Let's see how this method would be applied to all numerical fields in the dataframe (remember that `ocean_proximity` is categorical, with text values, and we can't calculate the median of text).

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
housing_num = X_train.select_dtypes(include=[np.number]) # select numerical columns

In [None]:
imputer.fit(housing_num) # calculate the median of each numerical column
imputer.statistics_ # median of each numerical column

We can verify that the values are the same as those calculated by the dataframe's `median()` method.

In [None]:
housing_num.median().values

In [None]:
housing_num_array_tr = imputer.transform(housing_num) # replace null values with the median
housing_num_array_tr

`transform()` returns a NumPy array, but we could convert it back to a Pandas DataFrame.

In [None]:
housing_tr = pd.DataFrame(housing_num_array_tr, columns=housing_num.columns, index=housing_num.index)
housing_tr.loc[null_rows_idx].head()

We could also directly use the `fit_transform()` method of `SimpleImputer` to calculate the value to impute (with `fit()`) and apply it (with `transform()`) in a single step.

And we could also use the `.set_output(transform="pandas")` method of the imputer so that the result is a Pandas DataFrame.

Therefore, the entire process detailed above can be summarized in a single line of code:

In [None]:
housing_tr = SimpleImputer(strategy="median").set_output(transform="pandas").fit_transform(housing_num)
housing_tr.loc[null_rows_idx].head()

## Predictive models to impute values

There are more advanced methods, such as using **prediction models** (treating the column with null values as the target variable and the rest of the columns as *features*). For example, the **K-Nearest Neighbors (KNN)** algorithm could be used to estimate the null values of `total_bedrooms` from the records where that value is known. scikit-learn has a `KNNImputer` class that does this.
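
As a rough sketch of how this could look, the example below applies `KNNImputer` to a small made-up frame (the values and the choice of `n_neighbors=2` are assumptions for illustration, not tuned settings). The missing `total_bedrooms` is filled with the average of the two rows whose `total_rooms` is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two groups of districts with very different sizes
toy = pd.DataFrame({
    "total_rooms":    [100.0, 110.0, 120.0, 900.0, 950.0],
    "total_bedrooms": [20.0, 22.0, np.nan, 180.0, 190.0],
})

knn_imputer = KNNImputer(n_neighbors=2)
# fit_transform() returns a NumPy array, so we rebuild the DataFrame
toy_knn = pd.DataFrame(knn_imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

median_fill = toy["total_bedrooms"].median()  # what median imputation would use
print(toy_knn.loc[2, "total_bedrooms"], median_fill)
```

Here the KNN estimate (21.0, the mean of the two nearest rows) is far closer to similar districts than the column median (101.0). Since the method relies on distances, features on very different scales are typically scaled first.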