In [5]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request
import numpy as np

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")  # Path to the compressed dataset

    if not tarball_path.is_file():  # If the file doesn't exist locally
        Path("datasets").mkdir(parents=True, exist_ok=True)  # Create the 'datasets' directory if needed

        url = "https://github.com/ageron/data/raw/main/housing.tgz"  # URL to download the dataset
        urllib.request.urlretrieve(url, tarball_path)  # Download the .tgz file from the URL and save it locally

        with tarfile.open(tarball_path) as housing_tarball:  # Open the .tgz file as a tar archive
            housing_tarball.extractall(path="datasets")  # Extract all contents into the 'datasets' directory

    return pd.read_csv(Path("datasets/housing/housing.csv"))  # Load the CSV data into a DataFrame and return it

housing = load_housing_data()

In [6]:
from sklearn.model_selection import train_test_split

# Bucket median income into five categories to stratify on
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5]
)

# Stratified 80/20 split so both sets keep the income category proportions
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42
)

housing = strat_train_set.copy()

In [7]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")  # Replaces missing values with each column's median
housing_num = housing.select_dtypes(include=[np.number])  # Keep only the numeric columns

### Feature Scaling and Transformation

* One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, machine learning algorithms don't perform well when the input numerical attributes have very different scales.

* This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Without any scaling, most models will be biased toward ignoring the median income and focusing more on the number of rooms.

* There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

**WARNING**

As with all estimators, it is important to fit the scalers to the training data only: never use `fit()` or `fit_transform()` on anything other than the training set. Once you have a trained scaler, you can then use it to `transform()` any other set, including the validation set, the test set, and new data. Note that while the training set values will always be scaled to the specified range, new data containing outliers may end up scaled outside that range. To avoid this with `MinMaxScaler`, set its `clip` hyperparameter to `True`.
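The warning above can be illustrated with a minimal sketch. The arrays `X_train` and `X_new` are made-up toy values (not taken from the housing data), chosen so that the new data falls outside the range seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not from the housing dataset)
X_train = np.array([[1.0], [5.0], [10.0]])
X_new = np.array([[0.0], [20.0]])  # both values lie outside the training range

# Fit on the training set only, then reuse the fitted scaler on new data
scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaler.fit(X_train)
X_new_scaled = scaler.transform(X_new)  # may fall below 0 or above 1

# Same scaler with clip=True: transformed values are clipped to the feature range
clipping_scaler = MinMaxScaler(clip=True)
clipping_scaler.fit(X_train)
X_new_clipped = clipping_scaler.transform(X_new)

print(X_new_scaled.ravel())
print(X_new_clipped.ravel())
```

Without clipping, the scaled values land outside 0 to 1 because the minimum and maximum were learned from `X_train` alone. With `clip=True`, they are capped at the range boundaries.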

---
### Min-max Scaling (Normalization)

* Min-max scaling (many people call this normalization) is the simplest: for each attribute, the values are shifted and rescaled so that they end up ranging from 0 to 1. This is performed by subtracting the min value and dividing by the difference between the min and the max. 

* Scikit-Learn provides a transformer called `MinMaxScaler` for this. It has a `feature_range` hyperparameter that lets you change the range if, for some reason, you don't want 0-1 (e.g., neural networks work best with zero-mean inputs, so a range of -1 to 1 is preferable).
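To make the arithmetic concrete, here is a rough sketch that computes min-max scaling by hand with NumPy and compares it to `MinMaxScaler`. The array `X` holds made-up values, and the names `lo`, `hi`, `X_unit`, and `X_manual` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with two columns on very different scales (hypothetical values)
X = np.array([[6.0, 0.5],
              [1000.0, 3.0],
              [39320.0, 15.0]])

lo, hi = -1.0, 1.0  # target range, as with feature_range=(-1, 1)

# Subtract the column minimum, then divide by (max - min) to reach 0-1
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_unit = (X - X_min) / (X_max - X_min)

# Stretch and shift the 0-1 values to the target range
X_manual = X_unit * (hi - lo) + lo

# The manual result should agree with Scikit-Learn's transformer
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```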

In [8]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
housing_num_min_max_scaled

array([[-0.60851927,  0.11702128,  1.        , ..., -0.61433638,
        -0.7794789 ,  0.82803782],
       [ 0.21095335, -0.66170213,  0.52941176, ..., -0.86708979,
        -0.22929339,  0.93319203],
       [-0.51926978,  0.23617021,  0.25490196, ..., -0.92458466,
        -0.73336919, -0.64247158],
       ...,
       [ 0.47870183, -0.99148936, -0.52941176, ..., -0.71663244,
        -0.50873781, -0.44824557],
       [ 0.20689655, -0.6787234 ,  0.41176471, ..., -0.68751167,
        -0.49716556,  1.        ],
       [-0.60649087,  0.08723404,  0.68627451, ..., -0.92122457,
        -0.61608805, -0.0997934 ]], shape=(16512, 9))

---
### Standardization

* Standardization is different: first it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1). 

* Unlike min-max scaling, standardization does not restrict values to a specific range. However, standardization is much less affected by outliers. For example, suppose a district has a median income equal to 100 (by mistake), instead of the usual 0-15. Min-max scaling to the 0-1 range would map this outlier to 1 and crush all the other values into the 0-0.15 range, whereas standardization would be much less affected.

* Scikit-Learn provides a transformer called `StandardScaler` for standardization.
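The outlier scenario described above can be sketched with synthetic numbers. The `typical` values below are invented for illustration (evenly spaced between 0.5 and 15), with a single erroneous value of 100 appended. The sketch also checks `StandardScaler` against the subtract-the-mean, divide-by-the-standard-deviation formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical median incomes: 1,000 typical values plus one erroneous outlier
typical = np.linspace(0.5, 15.0, 1000)
incomes = np.append(typical, 100.0).reshape(-1, 1)  # last row is the outlier

min_max_scaled = MinMaxScaler().fit_transform(incomes)
std_scaled = StandardScaler().fit_transform(incomes)

# Standardization by hand: subtract the mean, divide by the standard deviation
manual_std = (incomes - incomes.mean()) / incomes.std()

# Spread of the typical values (every row except the outlier) after scaling
min_max_spread = min_max_scaled[:-1].max() - min_max_scaled[:-1].min()
std_spread = std_scaled[:-1].max() - std_scaled[:-1].min()

print("min-max spread of typical values:", min_max_spread)
print("standardized spread of typical values:", std_spread)
```

In this toy setup, min-max scaling squeezes the typical values into a narrow band near 0, while the standardized values still span a couple of standard deviations. The outlier does inflate the standard deviation somewhat, so standardization is less affected rather than unaffected.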

**TIP**

If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a `StandardScaler` with its `with_mean` hyperparameter set to `False`: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity).
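As a small sketch of the tip above, the following scales a SciPy CSR matrix without densifying it. The matrix contents are arbitrary toy values, and SciPy is assumed to be installed alongside Scikit-Learn.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Small sparse matrix with mostly zero entries (hypothetical values)
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 6.0, 0.0],
]))

# with_mean=False skips centering, so zero entries stay zero
sparse_scaler = StandardScaler(with_mean=False)
X_scaled = sparse_scaler.fit_transform(X_sparse)

print(sparse.issparse(X_scaled))      # the result is still a sparse matrix
print(X_scaled.nnz == X_sparse.nnz)   # no new nonzero entries were created
```

Each column ends up with unit standard deviation, but its mean is not shifted to zero, which is the trade-off for preserving sparsity.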

In [9]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
housing_num_std_scaled

array([[-1.42303652,  1.0136059 ,  1.86111875, ...,  1.39481249,
        -0.93649149,  2.18511202],
       [ 0.59639445, -0.702103  ,  0.90762971, ..., -0.37348471,
         1.17194198,  2.40625396],
       [-1.2030985 ,  1.27611874,  0.35142777, ..., -0.77572662,
        -0.75978881, -0.90740625],
       ...,
       [ 1.25620853, -1.42870103, -1.23772062, ...,  0.67913534,
         0.1010487 , -0.49894408],
       [ 0.58639727, -0.73960483,  0.66925745, ...,  0.88286825,
         0.14539615,  2.54675281],
       [-1.41803793,  0.94797769,  1.22545939, ..., -0.75221898,
        -0.31034135,  0.23385961]], shape=(16512, 9))