Code for Machine Learning and Data Science II
=============================================



These are the code snippets used in the End-to-End ML Project
part of Machine Learning and Data Science II.



### Introduction



#### Preamble



In [1]:
import matplotlib.pyplot as plt
import ChalcedonPy as cp

# Initialise ChalcedonPy
cp.init(save_path="End-to-End-ML-Project",
        display_mode="slide")


#### Download Initial Data



First, let's load the modules needed to download the data so we
can work on it.



In [6]:
from pathlib import Path
import pandas as pd # for dataframes
import tarfile # read/write tar files
# to access and download data from web
import urllib.request
from tabulate import tabulate # for table printing

Define a function called load_housing_data() to download and extract the
data, and finally return it as a pandas DataFrame.



In [7]:
def load_housing_data():
    # path to save the file
    tarball_path = Path("datasets/housing.tgz")
    # download and extract the archive only if it is not already present
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/dTmC0945/L-MCI-BSc-Data-Science-II/raw/main/data/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

Now read the data and assign it to the variable housing.



In [8]:
housing = load_housing_data()

Let's have a look at the data



In [9]:
print(housing.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


Each row represents one district.

There are 10 attributes:

-   longitude,
-   latitude,
-   housing_median_age,
-   total_rooms,
-   total_bedrooms,
-   population,
-   households,
-   median_income,
-   median_house_value,
-   and ocean_proximity.

To get more information about the data, use the info() method.



In [10]:
housing.info()  # info() prints its report itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


All columns are numerical except ocean_proximity, whose values repeat
across districts, which suggests it is a categorical attribute.
Let's take a closer look at it.

To see which categories exist and how many districts belong to each,
use the value_counts() method.



In [11]:
print(housing["ocean_proximity"].value_counts())

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64


Time to look at the other fields using the describe() method, which
shows a summary of the numerical attributes.



In [12]:
print(housing.describe())  # call the method; without () only the bound method is printed

The tabulate module imported earlier allows pretty-printing of such results.
Next, plot a histogram of each numerical attribute:



In [None]:
import matplotlib.pyplot as plt

housing.hist(bins=50,  figsize=(12, 8))

plt.show()

#### Create a Set for Testing



In [None]:
import numpy as np

# Define a function to shuffle and split data
def shuffle_and_split_data(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

To split the data, call the function as follows:



In [None]:
train_set, test_set = shuffle_and_split_data(housing, 0.2)

Let's see the sizes of the two sets:



In [None]:
print("The size of the training set is:", len(train_set))
print("The size of the test data is:", len(test_set))

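This works, but every run shuffles the rows differently, so over many
runs you (and your algorithm) would eventually see the whole dataset,
which defeats the purpose of holding out a test set.

To avoid this, set the RNG seed before shuffling so that the shuffled
indices are the same on every run.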



In [None]:
np.random.seed(42)
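
As a quick check of the seeding idea, here is a minimal sketch on toy values. It uses NumPy's Generator API instead of the global seed, and shuffled_indices() is a helper name introduced only for this illustration:

```python
import numpy as np

def shuffled_indices(n, seed):
    # A fresh generator created from the same seed yields the same permutation.
    rng = np.random.default_rng(seed)
    return rng.permutation(n)

first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

print(np.array_equal(first, second))              # same seed, same order
print(sorted(first.tolist()) == list(range(10)))  # still a permutation of 0..9
```

Passing an explicit generator around keeps the reproducibility local to the split, whereas np.random.seed() changes global state for every later call.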

Here is another method in which we can keep the split constant even if
the dataset is refreshed.



In [None]:
from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(
        lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
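
To illustrate why hashing the identifier keeps the split stable, the sketch below applies the same idea to a toy table and then to that table with extra rows appended. The names in_test_set, old and new are assumptions made for this example only:

```python
from zlib import crc32

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio):
    # Hash the 8-byte integer id; ids whose hash falls in the lowest
    # `test_ratio` fraction of the 32-bit range go to the test set.
    return crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32

old = pd.DataFrame({"id": np.arange(1000)})
new = pd.DataFrame({"id": np.arange(1200)})  # same rows plus 200 appended ones

old_test = set(old.loc[old["id"].apply(in_test_set, test_ratio=0.2), "id"])
new_test = set(new.loc[new["id"].apply(in_test_set, test_ratio=0.2), "id"])

# Each row keeps its assignment: the old test ids are exactly the new
# test ids that already existed before the refresh.
print(old_test == {i for i in new_test if i < 1000})
```

Because the decision depends only on the row's own identifier, adding rows never moves an existing row from one set to the other.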

Unfortunately, the housing dataset does not have an identifier column.
The simplest solution is to use the row index as the ID:



In [None]:
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")

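If you use the row index as a unique identifier, you need to make
sure that new data gets appended to the end of the dataset and that no row ever gets deleted.
If that is not possible, build the identifier from the most stable features instead,
for example a district's longitude and latitude: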



In [None]:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways.
The simplest is train_test_split():



In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

To find the probability that a random sample of 1,000 people contains fewer than 48.5% females or
more than 53.5% females when the population's female ratio is 51.1%, we use the binomial distribution.

The cdf() method of the binomial distribution gives the probability that the
number of females is less than or equal to the given value.



In [None]:
from scipy.stats import binom

sample_size = 1000
ratio_female = 0.511
proba_too_small = binom(sample_size, ratio_female).cdf(485 - 1)
proba_too_large = 1 - binom(sample_size, ratio_female).cdf(535)
print(proba_too_small + proba_too_large)

For those who prefer a simulation over the exact solution,
the method below gives a similar result.



In [None]:
np.random.seed(42)

samples = (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)
((samples < 485) | (samples > 535)).mean()

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
plt.xlabel("Income category")
plt.ylabel("Number of districts")
housing["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)

plt.show()

Now you are ready to do stratified sampling based on the income category.
For this you can use Scikit-Learn’s StratifiedShuffleSplit class:



In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

In [None]:
strat_train_set, strat_test_set = strat_splits[0]

In [None]:
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42)
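
As a small illustration of what stratification guarantees, the following sketch splits a toy table (invented values, not the housing data) whose category column has an 80/20 mix, and checks that the test set keeps that mix:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows in category "a" and 20 rows in category "b".
toy = pd.DataFrame({"value": np.arange(100),
                    "cat": ["a"] * 80 + ["b"] * 20})

train, test = train_test_split(toy, test_size=0.2, stratify=toy["cat"],
                               random_state=42)

# The 20-row test set keeps the 80/20 mix: 16 "a" rows and 4 "b" rows.
print(test["cat"].value_counts().to_dict())
```

A purely random split of the same table could easily end up with 2 or 7 rows of category "b" in the test set, which is the sampling bias stratification avoids.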

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall %": income_cat_proportions(housing),
    "Stratified %": income_cat_proportions(strat_test_set),
    "Random %": income_cat_proportions(test_set),
}).sort_index()
compare_props.index.name = "Income Category"
compare_props["Strat. Error %"] = (compare_props["Stratified %"] /
                                   compare_props["Overall %"] - 1)
compare_props["Rand. Error %"] = (compare_props["Random %"] /
                                  compare_props["Overall %"] - 1)
(compare_props * 100).round(2)

Time to drop the income_cat attribute to return the data to its
original state.



In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

### Discover and Visualize the Data to Gain Insights



Before we start playing with the data it is a good habit to
create a copy so as to not tamper with the training set.



In [None]:
housing = strat_train_set.copy()

#### Visualising Geographical Data



Since the data includes geographical coordinates (latitude and longitude),
it is beneficial to plot it as a scatter plot.



In [None]:
housing.plot(kind = "scatter",
             x = "longitude",
             y = "latitude",
             title = "Housing Market in California")

plt.show()

This might remind you of a certain state, but it is hard to see any
particular pattern yet.

Let's set alpha to 0.2 so that high-density areas stand out.



In [None]:
housing.plot(kind = "scatter",
             x = "longitude",
             y = "latitude",
             title = r"Housing Market in California ($\alpha = 0.2$)",
             alpha = 0.2)

plt.show()

Let's make it more informative: the radius of each circle represents the district's
population, and the colour represents the median house value, ranging from blue
(low values) to red (high values).



In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude",
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))

plt.show()

Now we can see that housing prices are strongly related to location,
in this case proximity to the ocean. As a final touch, let's superimpose
the plot on a map of the state.



In [None]:
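# NOTE: IMAGES_PATH, plot_settings() and store_fig() are not defined in these
# snippets; they are assumed to be provided by the preamble setup.
filename = "california.png"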

if not (IMAGES_PATH / filename).is_file():
    homl3_root = "https://github.com/ageron/handson-ml3/raw/main/"
    url = homl3_root + "images/end_to_end_project/" + filename
    print("Downloading", filename)
    urllib.request.urlretrieve(url, IMAGES_PATH / filename)

housing_renamed = housing.rename(columns={
    "latitude": "Latitude", "longitude": "Longitude",
    "population": "Population",
    "median_house_value": "Median house value (ᴜsᴅ)"})

plot_settings(style = "slide")
housing_renamed.plot(
             kind="scatter", x="Longitude", y="Latitude",
             s=housing_renamed["Population"] / 100, label="Population",
             c="Median house value (ᴜsᴅ)", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))

california_img = plt.imread(IMAGES_PATH / filename)
axis = -124.55, -113.95, 32.45, 42.05
plt.axis(axis)
plt.imshow(california_img, extent=axis)

store_fig("california-housing-prices-plot",
          style = "slide",
          close = True)

#### Looking for Correlations



Time to see if there is any correlation between the value of the houses
and the other attributes, using Pearson's correlation coefficient:



In [None]:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
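
One caveat worth keeping in mind is that Pearson's coefficient only measures linear relationships. The sketch below uses made-up columns (x, linear, squared) to show a perfectly linear dependence scoring 1 while a purely quadratic one scores about 0:

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({"x": x,
                    "linear": 3 * x + 1,  # perfectly linear in x
                    "squared": x ** 2})   # determined by x, but not linearly

corr = toy.corr()
print(corr["x"].round(3))
```

A coefficient near zero therefore does not mean two attributes are unrelated, which is one reason to look at scatter plots as well.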

Another way to check for correlation between attributes is to
use the pandas scatter_matrix() function:



In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

The most promising attribute to predict the median house value
is the median income, so let's zoom in on that scatter plot:



In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1, grid=True)
plt.show()

#### Experimenting with Attribute Combinations



In [None]:
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

In [None]:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

### Prepare the Data for Machine Learning Algorithms



It’s time to prepare the data for your Machine Learning algorithms.
Instead of doing this manually, you should write functions for this
purpose, so that the same transformations can be reproduced on any dataset.

First revert to a clean training set and separate predictors and labels.



In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

#### Data Cleaning



Most ML algorithms cannot work with missing features, and not all data arrives
clean: some entries may contain NaN or other values that you don't want the
algorithm to process. First, find the rows with missing values:



In [None]:
null_rows_idx = housing.isnull().any(axis=1)
housing.loc[null_rows_idx].head()

Let's look at three (3) options:

1 - Get rid of the corresponding districts



In [None]:
housing_option1 = housing.copy()
housing_option1.dropna(subset=["total_bedrooms"], inplace=True)  # option 1
housing_option1.loc[null_rows_idx].head()

2 - Get rid of the whole attribute



In [None]:
housing_option2 = housing.copy()
housing_option2.drop("total_bedrooms", axis=1, inplace=True)  # option 2
housing_option2.loc[null_rows_idx].head()

3 - Set the values to some value (zero, the mean, the median, etc.).



In [None]:
housing_option3 = housing.copy()
median = housing["total_bedrooms"].median()
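# assign the result back: chained fillna(..., inplace=True) is deprecated in pandas 2.x
housing_option3["total_bedrooms"] = housing_option3["total_bedrooms"].fillna(median)  # option 3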

housing_option3.loc[null_rows_idx].head()

Scikit-Learn provides a handy class to take care of missing values: SimpleImputer.

Create a SimpleImputer instance that uses the "median" strategy:



In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

The median can only be computed on numerical attributes, so create a copy of the data without the text attribute ocean_proximity:



In [None]:
housing_num = housing.select_dtypes(include=[np.number])

Now fit the imputer instance to the training data using the fit() method:



In [None]:
imputer.fit(housing_num)

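The imputer computes the median of each attribute and stores the result
in its statistics_ attribute. Only total_bedrooms has missing values here,
but we cannot be sure that there won’t be any missing values in new
data after the system goes live, so it is safer to apply the
imputer to all the numerical attributes: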



In [None]:
print(imputer.statistics_)

In [None]:
print(housing_num.median().values)

Use this “trained” imputer to transform the training set by
replacing missing values with the learned medians:



In [None]:
X = imputer.transform(housing_num)
# to see the name of the columns
print(imputer.feature_names_in_)
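
To make the fit/transform behaviour concrete, here is a minimal sketch on a toy frame. The column names and values are invented for illustration and are not part of the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"rooms": [2.0, 4.0, np.nan, 10.0],
                    "age": [1.0, np.nan, 3.0, 5.0]})

toy_imputer = SimpleImputer(strategy="median")
filled = toy_imputer.fit_transform(toy)

print(toy_imputer.statistics_)  # per-column medians, ignoring NaN: [4. 3.]
print(filled)                   # NaN entries replaced by those medians
```

The medians are learned once in fit() and then reused by transform(), so new data is filled with the training medians rather than its own.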

The transform() method returns a NumPy array. To wrap it back into a pandas DataFrame:



In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

In [None]:
housing_tr.loc[null_rows_idx].head()

In [None]:
imputer.strategy

Time to look for outliers. The fit_predict() method of IsolationForest
labels each instance as 1 (inlier) or -1 (outlier):



In [None]:
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(random_state=42)
outlier_pred = isolation_forest.fit_predict(X)

print(outlier_pred)

#### Handling Text and Categorical Attributes



Time to process text instead of numbers.

In this dataset, there is just one text attribute: ocean_proximity.
Let’s look at its values for the first few instances:



In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(8)

Most ML algorithms prefer to work with numbers, so convert these categories
from text to numbers using Scikit-Learn's OrdinalEncoder class:



In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

Let's see the results



In [None]:
print(housing_cat_encoded[:8])

And if we were to see the categories:



In [None]:
print(ordinal_encoder.categories_)

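An ordinal encoding makes ML algorithms assume that two nearby values are more
similar than two distant ones, which is not the case here. To avoid this,
create one binary attribute per category with the OneHotEncoder class: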



In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
type(housing_cat_1hot)

The output is a SciPy sparse matrix instead of a NumPy array.

Convert it to a dense array if needed by calling the toarray() method:



In [None]:
print(housing_cat_1hot.toarray())
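
The difference between the two encoders is easiest to see side by side. This is a sketch on a made-up column named size, not on the housing data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

ordinal = OrdinalEncoder().fit_transform(toy)
one_hot = OneHotEncoder().fit_transform(toy).toarray()

# Categories are sorted alphabetically: large -> 0, medium -> 1, small -> 2.
print(ordinal.ravel())  # [2. 0. 1. 2.]
print(one_hot)          # one column per category, a single 1 in each row
```

Note that the ordinal codes follow alphabetical order, so "large" gets a smaller code than "small". That arbitrary ordering is what one-hot encoding avoids.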

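Alternatively, set sparse_output=False when creating the OneHotEncoder so that
it returns a dense NumPy array directly. As with OrdinalEncoder, the categories
can be listed through the categories_ instance variable: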



In [None]:
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(housing_cat_1hot)

In [None]:
print(cat_encoder.categories_)

In [None]:
df_test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
pd.get_dummies(df_test)

In [None]:
print(cat_encoder.transform(df_test))

In [None]:
df_test_unknown = pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "ISLAND"]})
pd.get_dummies(df_test_unknown)

In [None]:
cat_encoder.handle_unknown = "ignore"
print(cat_encoder.transform(df_test_unknown))

In [None]:
print(cat_encoder.feature_names_in_)

In [None]:
print(cat_encoder.get_feature_names_out())

In [None]:
df_output = pd.DataFrame(cat_encoder.transform(df_test_unknown),
                         columns=cat_encoder.get_feature_names_out(),
                         index=df_test_unknown.index)

df_output

#### Feature Scaling



In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
housing["population"].hist(ax=axs[0], bins=50)
housing["population"].apply(np.log).hist(ax=axs[1], bins=50)
axs[0].set_xlabel("Population")
axs[1].set_xlabel("Log of population")
axs[0].set_ylabel("Number of districts")
plt.show()

In [None]:
percentiles = [np.percentile(housing["median_income"], p)
               for p in range(1, 100)]
flattened_median_income = pd.cut(housing["median_income"],
                                 bins=[-np.inf] + percentiles + [np.inf],
                                 labels=range(1, 100 + 1))
flattened_median_income.hist(bins=50)
plt.xlabel("Median income percentile")
plt.ylabel("Number of districts")
plt.show()

In [None]:
from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)

In [None]:
ages = np.linspace(housing["housing_median_age"].min(),
                   housing["housing_median_age"].max(),
                   500).reshape(-1, 1)
gamma1 = 0.1
gamma2 = 0.03
rbf1 = rbf_kernel(ages, [[35]], gamma=gamma1)
rbf2 = rbf_kernel(ages, [[35]], gamma=gamma2)

fig, ax1 = plt.subplots()

ax1.set_xlabel("Housing median age")
ax1.set_ylabel("Number of districts")
ax1.hist(housing["housing_median_age"], bins=50)

ax2 = ax1.twinx()  # create a twin axis that shares the same x-axis
color = "#984ea3"
ax2.plot(ages, rbf1, color=color, label="gamma = 0.10")
ax2.plot(ages, rbf2, color=color, label="gamma = 0.03", linestyle="--")
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylabel("Age similarity", color=color)

plt.legend(loc="upper left")
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

In [None]:
print(predictions)

In [None]:
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(),
                                   transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)

In [None]:
print(predictions)

#### Custom Transformers



In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

In [None]:
rbf_transformer = FunctionTransformer(rbf_kernel,
                                      kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])

In [None]:
print(age_simil_35)

In [None]:
sf_coords = 37.7749, -122.41
sf_transformer = FunctionTransformer(rbf_kernel,
                                     kw_args=dict(Y=[sf_coords], gamma=0.1))
sf_simil = sf_transformer.transform(housing[["latitude", "longitude"]])

In [None]:
print(sf_simil)

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
print(ratio_transformer.transform(np.array([[1., 2.], [3., 4.]])))

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):  # no *args or **kwargs!
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()
        return self  # always return self!

    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (with trailing _)
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

In [None]:
from sklearn.cluster import KMeans

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
    
    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

In [None]:
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

In [None]:
similarities[:3].round(2)

In [None]:
housing_renamed = housing.rename(columns={
    "latitude": "Latitude", "longitude": "Longitude",
    "population": "Population",
    "median_house_value": "Median house value (ᴜsᴅ)"})
housing_renamed["Max cluster similarity"] = similarities.max(axis=1)

housing_renamed.plot(kind="scatter", x="Longitude", y="Latitude", grid=True,
                     s=housing_renamed["Population"] / 100, label="Population",
                     c="Max cluster similarity",
                     cmap="jet", colorbar=True,
                     legend=True, sharex=False, figsize=(10, 7))
plt.plot(cluster_simil.kmeans_.cluster_centers_[:, 1],
         cluster_simil.kmeans_.cluster_centers_[:, 0],
         linestyle="", color="black", marker="X", markersize=20,
         label="Cluster centers")
plt.legend(loc="upper right")
plt.show()

#### Transformation Pipelines



Now let's build a pipeline to preprocess the numerical attributes:



In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

In [None]:
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

In [None]:
from sklearn import set_config

set_config(display='diagram')

num_pipeline

In [None]:
housing_num_prepared = num_pipeline.fit_transform(housing_num)
print(housing_num_prepared[:2].round(2))
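
To see what the two steps do in sequence, here is a minimal sketch on a single toy column (values invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [np.nan], [3.0], [2.0]])

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_out = toy_pipeline.fit_transform(X_toy)

# Step 1 fills the NaN with the median (2.0); step 2 centres and scales,
# so the filled entry lands exactly on the mean and becomes 0.
print(X_out.ravel())
```

Each step receives the output of the previous one, which is why the scaler never sees the missing value.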

In [None]:
def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
       correctly in Scikit-Learn 1.0.*."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if callable(self.feature_names_out):
                return self.feature_names_out(self, names)
            assert self.feature_names_out == "one-to-one"
            return default_get_feature_names_out(self, names)

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()

In [None]:
df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared, columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index)

In [None]:
df_housing_num_prepared.head(2)

In [None]:
print(num_pipeline.steps)

In [None]:
num_pipeline[1]

In [None]:
num_pipeline[:-1]

In [None]:
num_pipeline.named_steps["simpleimputer"]

In [None]:
num_pipeline.set_params(simpleimputer__strategy="median")

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

In [None]:
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)

In [None]:
housing_prepared = preprocessing.fit_transform(housing)

In [None]:
housing_prepared_fr = pd.DataFrame(
    housing_prepared,
    columns=preprocessing.get_feature_names_out(),
    index=housing.index)
housing_prepared_fr.head(2)

In [None]:
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())
preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
    remainder=default_num_pipeline)  # one column remaining: housing_median_age

In [None]:
housing_prepared = preprocessing.fit_transform(housing)
print(housing_prepared.shape)

In [None]:
print(preprocessing.get_feature_names_out())

### Select and Train a Model



#### Training and Evaluating on the Training Set



In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
print(lin_reg.fit(housing, housing_labels))

In [None]:
housing_predictions = lin_reg.predict(housing)
print(housing_predictions[:5].round(-2))  # -2 = rounded to the nearest hundred

In [None]:
print(housing_labels.iloc[:5].values)

In [None]:
error_ratios = housing_predictions[:5].round(-2) / housing_labels.iloc[:5].values - 1
print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))

In [None]:
from sklearn.metrics import mean_squared_error

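# RMSE as the square root of the MSE; the squared=False shortcut is
# deprecated in recent scikit-learn releases, so this form avoids it
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
lin_rmse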

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

In [None]:
housing_predictions = tree_reg.predict(housing)
# RMSE as the square root of the MSE (squared=False is deprecated in
# recent scikit-learn releases)
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
tree_rmse

#### Better Evaluation using Cross-Validation



In [None]:
from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

In [None]:
pd.Series(tree_rmses).describe()

In [None]:
lin_rmses = -cross_val_score(lin_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
pd.Series(lin_rmses).describe()

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels,
                                scoring="neg_root_mean_squared_error", cv=10)

In [None]:
pd.Series(forest_rmses).describe()

In [None]:
forest_reg.fit(housing, housing_labels)
housing_predictions = forest_reg.predict(housing)
# RMSE as the square root of the MSE (squared=False is deprecated in
# recent scikit-learn releases)
forest_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
forest_rmse

### Fine-Tuning the Model



#### Using Grid Search



In [None]:
from sklearn.model_selection import GridSearchCV

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)

In [None]:
print(str(full_pipeline.get_params().keys())[:1000] + "...")

In [None]:
print(grid_search.best_params_)

In [None]:
grid_search.best_estimator_

In [None]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# extra code – these few lines of code just make the DataFrame look nicer
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
score_cols = ["split0", "split1", "split2", "mean_test_rmse"]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)

cv_res.head()
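
The cost of a grid search grows with the number of parameter combinations. As a rough sketch (not part of the original notebook), scikit-learn's `ParameterGrid` can count the candidates in the grid defined above; the names `n_candidates`, `n_folds` and `n_fits` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid-search cell above
param_grid = [
    {"preprocessing__geo__n_clusters": [5, 8, 10],
     "random_forest__max_features": [4, 6, 8]},
    {"preprocessing__geo__n_clusters": [10, 15],
     "random_forest__max_features": [6, 8, 10]},
]

n_candidates = len(ParameterGrid(param_grid))  # 3*3 + 2*3 combinations
n_folds = 3                                    # cv=3 in GridSearchCV
n_fits = n_candidates * n_folds                # excludes the final refit
print(n_candidates, n_fits)
```

The combinations with 10 clusters and 6 or 8 features appear in both dictionaries, so they are counted (and evaluated) twice.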

#### Randomised Search



In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),
                  'random_forest__max_features': randint(low=2, high=20)}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

rnd_search.fit(housing, housing_labels)

In [None]:
cv_res = pd.DataFrame(rnd_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

In [None]:
from scipy.stats import randint, uniform, geom, expon

xs1 = np.arange(0, 7 + 1)
randint_distrib = randint(0, 7 + 1).pmf(xs1)

xs2 = np.linspace(0, 7, 500)
uniform_distrib = uniform(0, 7).pdf(xs2)

xs3 = np.arange(0, 7 + 1)
geom_distrib = geom(0.5).pmf(xs3)

xs4 = np.linspace(0, 7, 500)
expon_distrib = expon(scale=1).pdf(xs4)

plt.figure(figsize=(12, 7))

plt.subplot(2, 2, 1)
plt.bar(xs1, randint_distrib, label="scipy.randint(0, 7 + 1)")
plt.ylabel("Probability")
plt.legend()
plt.axis([-1, 8, 0, 0.2])

plt.subplot(2, 2, 2)
plt.fill_between(xs2, uniform_distrib, label="scipy.uniform(0, 7)")
plt.ylabel("PDF")
plt.legend()
plt.axis([-1, 8, 0, 0.2])

plt.subplot(2, 2, 3)
plt.bar(xs3, geom_distrib, label="scipy.geom(0.5)")
plt.xlabel("Hyperparameter value")
plt.ylabel("Probability")
plt.legend()
plt.axis([0, 7, 0, 1])

plt.subplot(2, 2, 4)
plt.fill_between(xs4, expon_distrib, label="scipy.expon(scale=1)")
plt.xlabel("Hyperparameter value")
plt.ylabel("PDF")
plt.legend()
plt.axis([0, 7, 0, 1])

plt.show()

In [None]:
from scipy.stats import loguniform

xs1 = np.linspace(0, 7, 500)
expon_distrib = expon(scale=1).pdf(xs1)

log_xs2 = np.linspace(-5, 3, 500)
log_expon_distrib = np.exp(log_xs2 - np.exp(log_xs2))

xs3 = np.linspace(0.001, 1000, 500)
loguniform_distrib = loguniform(0.001, 1000).pdf(xs3)

log_xs4 = np.linspace(np.log(0.001), np.log(1000), 500)
# scipy's uniform(loc, scale) covers [loc, loc + scale], so the scale must be
# the width of the log interval, not its upper bound
log_loguniform_distrib = uniform(np.log(0.001),
                                 np.log(1000) - np.log(0.001)).pdf(log_xs4)

plt.figure(figsize=(12, 7))

plt.subplot(2, 2, 1)
plt.fill_between(xs1, expon_distrib,
                 label="scipy.expon(scale=1)")
plt.ylabel("PDF")
plt.legend()
plt.axis([0, 7, 0, 1])

plt.subplot(2, 2, 2)
plt.fill_between(log_xs2, log_expon_distrib,
                 label="log(X) with X ~ expon")
plt.legend()
plt.axis([-5, 3, 0, 1])

plt.subplot(2, 2, 3)
plt.fill_between(xs3, loguniform_distrib,
                 label="scipy.loguniform(0.001, 1000)")
plt.xlabel("Hyperparameter value")
plt.ylabel("PDF")
plt.legend()
plt.axis([0.001, 1000, 0, 0.005])

plt.subplot(2, 2, 4)
plt.fill_between(log_xs4, log_loguniform_distrib,
                 label="log(X) with X ~ loguniform")
plt.xlabel("Log of hyperparameter value")
plt.legend()
plt.axis([-8, 8, 0, 0.2])  # log(0.001) is about -6.9, log(1000) about 6.9

plt.show()
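
The last panel relies on the fact that if X follows a log-uniform distribution on [a, b], then log(X) is uniform on [log(a), log(b)]. The following is a small numerical check of that property, offered as a sketch rather than part of the original notebook; the sample size and the number of bins are arbitrary choices.

```python
import numpy as np
from scipy.stats import loguniform

low, high = 0.001, 1000
samples = loguniform(low, high).rvs(size=10_000, random_state=42)
log_samples = np.log(samples)

# If log(X) is uniform, equal-width bins over [log(low), log(high)]
# should each receive about the same share of the samples
counts, _ = np.histogram(log_samples, bins=4,
                         range=(np.log(low), np.log(high)))
fractions = counts / len(samples)
print(fractions)  # each entry should be close to 0.25
```

This is why a log-uniform distribution is a common choice when the right scale of a hyperparameter is unknown: every order of magnitude is sampled about equally often.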

#### Analyse the Best Models and Errors



In [None]:
final_model = rnd_search.best_estimator_  # includes preprocessing
feature_importances = final_model["random_forest"].feature_importances_
print(feature_importances.round(2))

In [None]:
sorted(zip(feature_importances,
           final_model["preprocessing"].get_feature_names_out()),
           reverse=True)


| 0.19087378222226137 | log__median_income |
| 0.07625632853052883 | cat__ocean_proximity_INLAND |
| 0.06365028932207333 | bedrooms__ratio |
| 0.057834740538722625 | rooms_per_house__ratio |
| 0.04907003277818634 | people_per_house__ratio |
| 0.038165489600129165 | geo__Cluster 3 similarity |
| 0.025700861301416925 | geo__Cluster 22 similarity |
| 0.02186407550147744 | geo__Cluster 17 similarity |
| 0.021818299311019237 | geo__Cluster 6 similarity |
| 0.018249904787654904 | geo__Cluster 2 similarity |
| 0.017263517651784216 | geo__Cluster 32 similarity |
| 0.015649725317935348 | geo__Cluster 18 similarity |
| 0.015236556682888558 | geo__Cluster 40 similarity |
| 0.014160249342841777 | geo__Cluster 43 similarity |
| 0.014113856232349186 | geo__Cluster 7 similarity |
| 0.013968406769681294 | geo__Cluster 21 similarity |
| 0.013781633271007265 | geo__Cluster 38 similarity |
| 0.013515022744382842 | geo__Cluster 34 similarity |
| 0.013508738042902313 | geo__Cluster 41 similarity |
| 0.012844820424121687 | geo__Cluster 24 similarity |
| 0.01236427981858226 | geo__Cluster 10 similarity |
| 0.01176408158393247 | remainder__housing_median_age |
| 0.011436849025886087 | geo__Cluster 31 similarity |
| 0.011430032718708965 | geo__Cluster 30 similarity |
| 0.011262888671999243 | geo__Cluster 42 similarity |
| 0.011082126500672662 | geo__Cluster 16 similarity |
| 0.01087991522984511 | geo__Cluster 1 similarity |
| 0.0106352262787596 | geo__Cluster 25 similarity |
| 0.010629636976156976 | geo__Cluster 26 similarity |
| 0.010325438106958176 | geo__Cluster 20 similarity |
| 0.009978597341631139 | geo__Cluster 35 similarity |
| 0.009811902116084414 | geo__Cluster 14 similarity |
| 0.00926835411026417 | geo__Cluster 39 similarity |
| 0.009210910491673824 | geo__Cluster 37 similarity |
| 0.008838219938405523 | geo__Cluster 0 similarity |
| 0.00883623406533351 | geo__Cluster 9 similarity |
| 0.008743931217845727 | geo__Cluster 8 similarity |
| 0.008563362393325231 | geo__Cluster 36 similarity |
| 0.008465719960196051 | geo__Cluster 28 similarity |
| 0.008001576292023282 | geo__Cluster 44 similarity |
| 0.007942690751495287 | geo__Cluster 4 similarity |
| 0.00792848112158647 | geo__Cluster 11 similarity |
| 0.007724755678419276 | log__total_rooms |
| 0.0071161520040372486 | log__population |
| 0.006792638958967365 | log__total_bedrooms |
| 0.006504955481581277 | log__households |
| 0.006215189140929538 | geo__Cluster 23 similarity |
| 0.0056299034599644185 | geo__Cluster 19 similarity |
| 0.005544728303210014 | geo__Cluster 27 similarity |
| 0.00526201040506395 | geo__Cluster 33 similarity |
| 0.004834036365364593 | geo__Cluster 15 similarity |
| 0.004177160622478634 | geo__Cluster 12 similarity |
| 0.0040378619656822245 | geo__Cluster 13 similarity |
| 0.0036513301086410445 | geo__Cluster 29 similarity |
| 0.0033595680753400604 | cat__ocean_proximity_<1H OCEAN |
| 0.001969988014822343 | geo__Cluster 5 similarity |
| 0.001955486511135513 | cat__ocean_proximity_NEAR OCEAN |
| 0.0002362031046240973 | cat__ocean_proximity_NEAR BAY |
| 6.124671500751693e-05 | cat__ocean_proximity_ISLAND |



#### Evaluate using the Test Set



Time to evaluate the model with the test set:

1.  get the predictors and the labels from the test set,
2.  run final_model to transform the data,
3.  make predictions,
4.  evaluate the predictions.



In [None]:
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)

# RMSE as the square root of the MSE (squared=False is deprecated in
# recent scikit-learn releases)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
final_rmse

A single RMSE value does not say how precise the estimate is. To find out,
compute the 95% [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) for the generalisation error using
scipy.stats.t.interval().



In [None]:
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
result = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                  loc=squared_errors.mean(),
                                  scale=stats.sem(squared_errors)))
print(result)

The test RMSE computed above (final_rmse) lies roughly in the middle of this interval.

The same interval can be calculated in other ways.

The cell below writes out the t-based calculation by hand.



In [None]:
m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
print(np.sqrt(mean - tmargin), np.sqrt(mean + tmargin))

Alternatively, because the test set is relatively large, we can use the z-score (i.e., the [standard score](https://en.wikipedia.org/wiki/Standard_score)) in place of
the t-score.



In [None]:
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
print(np.sqrt(mean - zmargin), np.sqrt(mean + zmargin))
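
To see how the three calculations relate without needing the housing data, here is a self-contained sketch on synthetic squared errors. The data are hypothetical (normally distributed errors with an arbitrary scale) and stand in for `(final_predictions - y_test) ** 2`; only the comparison between the methods matters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical squared errors, not taken from the housing model
squared_errors = rng.normal(loc=0.0, scale=40_000.0, size=4_000) ** 2

confidence = 0.95
m = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)  # same as std(ddof=1) / sqrt(m)

# Library call, manual t-score and manual z-score
t_low, t_high = stats.t.interval(confidence, m - 1, loc=mean, scale=sem)
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
zscore = stats.norm.ppf((1 + confidence) / 2)

print(np.sqrt([t_low, t_high]))
print(np.sqrt([mean - tscore * sem, mean + tscore * sem]))
print(np.sqrt([mean - zscore * sem, mean + zscore * sem]))
```

The first two lines should match exactly, since the manual t-based formula is what the library call computes. The z-based interval is only slightly narrower, because the t-distribution approaches the normal distribution as the sample size grows. Note that the interval is built around the mean squared error and the square root is taken afterwards, so it is not exactly symmetric around the RMSE.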

NOTE: After a lot of hyperparameter tuning, the model ends up closely fitted to the data it was tuned on,
so it will usually perform somewhat worse on the test set than the scores seen during tuning suggested.



#### Saving the Model



Time to get the model into production. The easiest first step is to save the best model.

To save it, you can use the joblib module ([more info](https://joblib.readthedocs.io/en/stable/)), which also provides lightweight pipelining.



In [None]:
import joblib

joblib.dump(final_model, "my_california_housing_model.pkl")

Once saved, the model can be loaded and used, provided the custom functions and classes it relies on (here column_ratio and ClusterSimilarity) are defined or imported first.



In [2]:
import joblib

# extra code – excluded for conciseness
from sklearn.cluster import KMeans
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics.pairwise import rbf_kernel

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

#class ClusterSimilarity(BaseEstimator, TransformerMixin):
#    [...]

final_model_reloaded = joblib.load("my_california_housing_model.pkl")

new_data = housing.iloc[:5]  # pretend these are new districts
predictions = final_model_reloaded.predict(new_data)
print(predictions)

FileNotFoundError: [Errno 2] No such file or directory: 'my_california_housing_model.pkl'
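
The save-and-reload cycle can be tried in isolation. The sketch below is not the housing model: it assumes a toy pipeline on random data and a hypothetical file name (`toy_model.pkl`) inside a temporary directory, and simply checks that the reloaded model gives the same predictions as the original.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the housing features and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.pkl"  # hypothetical file name
    joblib.dump(model, model_path)       # write the fitted pipeline to disk
    reloaded = joblib.load(model_path)   # read it back before cleanup

same = np.allclose(model.predict(X[:5]), reloaded.predict(X[:5]))
print(same)
```

Because this toy pipeline contains only built-in scikit-learn components, nothing extra has to be defined before loading. A pipeline with custom transformers or functions needs those definitions available in the loading session.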