### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

### Custom CSS style

In [None]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
}
.dashed-box tr {
  background-color: white !important;  
}
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint 5</b></span>

<img src="./imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to keep this notebook a literal _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Evaluate Polynomial Regression Models
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

To use **Polynomial Regression**, we need to decide _which features_ will be used/considered first.

### 5.1. Load the cleaned training set

Let's consider the training and testing sets already cleaned (sprint #2):
- Drop duplicated instances (none found)
- Drop instances with `housing_median_age` capped at 52
- Drop instances with `median_house_value` capped at 500001.0

In [None]:
# load the cleaned training set
housing_train = pd.read_csv('./datasets/housing_train_sprint-2.csv')

In [None]:
housing_train.head()

In [None]:
housing_train.shape

### 5.2. Quick EDA to get insights about the features

#### **Generate aggregate features**
Let's also analyse the new features created in the previous sprints.

In [None]:
housing_train_eda = housing_train.copy()

housing_train_eda["rooms_per_household"] = housing_train_eda["total_rooms"] / housing_train_eda["households"]
housing_train_eda["bedrooms_per_room"] = housing_train_eda["total_bedrooms"] / housing_train_eda["total_rooms"]
housing_train_eda["population_per_household"] = housing_train_eda["population"] / housing_train_eda["households"]

In [None]:
housing_train_eda

In [None]:
housing_train_eda.columns

In [None]:
sns.pairplot(housing_train_eda, x_vars=['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'],  y_vars=['median_house_value'], height=5)

In [None]:
sns.pairplot(housing_train_eda, x_vars=['rooms_per_household', 'bedrooms_per_room', 'population_per_household'],  y_vars=['median_house_value'], height=5)

<br/>

By looking at the scatter plots, we cannot identify a specific relationship (linear, quadratic, cubic, ...) between the _features_ and the _outcome_ (`median_house_value`). <br/>
As observed in previous sprints, the `median_income` seems to have a _'linear'_ relationship with the `median_house_value`.

The `population_per_household` has significant _outliers_ that impair its visualization. Let's **remove** these outliers to try to improve the visualization.

##### **Removing outliers for `population_per_household`**
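One common way to do this is the IQR (interquartile range) rule: keep only the values inside the fences `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. The sketch below is a minimal illustration on toy data. The helper name `remove_outliers_iqr`, the factor `1.5`, and the toy values are assumptions, not necessarily what was used in this notebook.

```python
import pandas as pd


def remove_outliers_iqr(df, column, factor=1.5):
    """Keep only the rows whose `column` value lies within the IQR fences.

    Hypothetical helper: name and default factor are assumptions.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    inside_fences = df[column].between(lower_fence, upper_fence)
    return df[inside_fences]


# toy stand-in for `housing_train_eda` (made-up values)
toy_eda = pd.DataFrame({
    "population_per_household": [2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 1243.3],
    "median_house_value": [150000.0, 180000.0, 210000.0, 250000.0,
                           120000.0, 300000.0, 137500.0],
})

toy_eda_without_outliers = remove_outliers_iqr(toy_eda, "population_per_household")
print(toy_eda_without_outliers.shape)
```

Applied to the real data, the same call on `housing_train_eda` would produce a dataframe such as `housing_train_eda_without_outliers`, which the next plot expects.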

In [None]:
# IQR outlier detection


In [None]:
sns.pairplot(housing_train_eda_without_outliers, x_vars=['rooms_per_household', 'bedrooms_per_room', 'population_per_household'],  y_vars=['median_house_value'], height=5)

Now, we can see the true dispersion of the `population_per_household` and the `median_house_value`. However, there is no clear relationship between them.

Let's then consider two scenarios for **Polynomial Regression**
1. Use _only_ the `median_income`.
2. Use _all features_ except those used to generate the aggregate features (`total_rooms`, `total_bedrooms`, `population`, `households`).

### 5.3. Separate the _features_ and the _target outcome_

In [None]:
# store the target outcome into a numpy array
y_train = housing_train['median_house_value'].values

# overwrite the dataframe with only the features  
housing_train = housing_train.drop(columns=['median_house_value'])

### 5.4. Preprocessing Pipelines

#### **Scenario 1**
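A possible preprocessing for this scenario, sketched on toy data: a numerical pipeline (imputation followed by robust scaling) applied _only_ to `median_income` through a `ColumnTransformer`. The `median` imputation strategy, the step names, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# numerical pipeline: fill missing values, then scale with median/IQR
# (the 'median' strategy is an assumption)
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("robust_scaler", RobustScaler()),
])

# scenario 1 keeps only `median_income`; all other columns are dropped
preprocessing_scenario_1 = ColumnTransformer(
    [("num", num_pipeline, ["median_income"])],
    remainder="drop",
)

# toy stand-in for `housing_train` (made-up values)
toy_housing = pd.DataFrame({
    "median_income": [1.5, 3.0, np.nan, 4.5, 8.0],
    "households": [300.0, 410.0, 120.0, 520.0, 260.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"],
})

X_toy_scenario_1 = preprocessing_scenario_1.fit_transform(toy_housing)
print(X_toy_scenario_1.shape)
```

`RobustScaler` centers on the median and scales by the IQR, so it is less sensitive to outliers than standard scaling.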

In [None]:
# pipeline for numerical
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler



#### **Scenario 2**

In [None]:
num_attributes = housing_train.columns.drop('ocean_proximity')
num_attributes

In [None]:
# get the integer index of each attribute/column for the filtered dataframe by the numeric attributes:
for index, column_name in enumerate(housing_train[num_attributes]):
    print(f'{index} = {column_name}')

In [None]:
# feature engineering transformer from Sprint #4
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# our 3 new features are derived from total_rooms, total_bedrooms, population, and households;
# these are their column indices in the numeric feature matrix
rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx = 3, 4, 5, 6

class HousingFeatEngineering(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        n_rows = X.shape[0]
        
        # creating the new features
        rooms_per_household = X[:, rooms_col_idx] / X[:, households_col_idx]
        population_per_household = X[:, population_col_idx] / X[:, households_col_idx]
        bedrooms_per_room = X[:, bedrooms_col_idx] / X[:, rooms_col_idx]
                
        # to concatenate the new array as columns in our feature matrix, we need to reshape them first
        rooms_per_household = rooms_per_household.reshape((n_rows, 1))
        population_per_household = population_per_household.reshape((n_rows, 1))
        bedrooms_per_room = bedrooms_per_room.reshape((n_rows, 1))
        
        # concatenating the new features into the feature matrix
        X_out = np.hstack((X, rooms_per_household, population_per_household, bedrooms_per_room))
        
        return X_out

The columns of the **output numpy array** correspond to: <br/>
0 = longitude <br/>
1 = latitude <br/>
2 = housing_median_age <br/>
3 = total_rooms <br/>
4 = total_bedrooms <br/>
5 = population <br/>
6 = households <br/>
7 = median_income <br/>
8 = rooms_per_household <br/>
9 = population_per_household <br/>
10 = bedrooms_per_room <br/>

To satisfy scenario 2, we need to **remove/drop** the features `total_rooms`, `total_bedrooms`, `population`, and `households`. <br/>
To do this automatically, we can create another **transformer** that, after `HousingFeatEngineering`, removes the corresponding numpy array columns by their column indices. <br/>
Coincidentally, these column indices are _the same_ as those used in `HousingFeatEngineering`, but this will not hold in general, so **always double-check them.**
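A minimal sketch of such a transformer, using `np.delete` to drop columns by index. The class name `DropColumnsByIndex` and its `col_indices` parameter are hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnsByIndex(BaseEstimator, TransformerMixin):
    """Remove the columns at `col_indices` from a 2D numpy array.

    Hypothetical transformer: name and parameter are assumptions.
    """

    def __init__(self, col_indices=(3, 4, 5, 6)):
        self.col_indices = col_indices

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # np.delete returns a copy of X without the selected columns
        return np.delete(X, list(self.col_indices), axis=1)


# toy matrix with 11 columns, like the output of the feature engineering step
X_toy = np.arange(22, dtype=float).reshape(2, 11)
X_toy_dropped = DropColumnsByIndex().fit_transform(X_toy)

print(X_toy_dropped.shape)
print(X_toy_dropped[0])
```

Passing the indices as a constructor parameter keeps the transformer reusable if the column order changes.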

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column indices of the features to drop
# (the same ones used by HousingFeatEngineering)
rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx = 3, 4, 5, 6

class DropFeatures(BaseEstimator, TransformerMixin):
    pass

In [None]:
# pipeline for numerical
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler



## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the independent (features) and dependent variables (outcome)

In [None]:
# scenario 1


In [None]:
print(X_train_scenario_1.shape)
print(X_train_scenario_1[:5])

In [None]:
# scenario 2


In [None]:
# we already have `y_train`
y_train.shape

### 6.2. Training the Models

For this initial evaluation, let's consider the **default parameters** of `PolynomialFeatures`, except for `include_bias`.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [None]:
# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

#### **→ Scenario 1**

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# default degree = 2


In [None]:
print(X_train_scenario_1.shape)
print(X_train_scenario_1[:5])

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression



In [None]:
display_scores(lin_rmse_scores_scenario_1)

Although the errors are relatively _stable_ across the folds (look at the _standard deviation_), the **cross-validation RMSE** (\\$71,453.19 ± \$1,359.74) is _considerably higher_ than that of `Linear Regression` (\\$58,371.04 ± \$1,757.91).
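For reference, a minimal sketch of how cross-validated RMSE scores like these can be computed with `cross_val_score`. The toy data, the number of folds (`cv=10`), and the variable names are assumptions. The actual values reported here come from the housing data, not from this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# toy data with a quadratic relationship (made-up values)
rng = np.random.default_rng(42)
X_toy = rng.uniform(0.0, 10.0, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] ** 2 + rng.normal(0.0, 5.0, size=200)

X_toy_poly = PolynomialFeatures(include_bias=False).fit_transform(X_toy)

# scikit-learn maximizes scores, so the MSE is returned negated
neg_mse_scores = cross_val_score(LinearRegression(), X_toy_poly, y_toy,
                                 scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse_scores)

print(f"RMSE: {rmse_scores.mean():.2f} ± {rmse_scores.std():.2f}")
```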

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>Although we have created a <code>Pipeline</code> only for <b>preprocessing</b>, we could incorporate <b>all steps/models/transformers</b> (including <code>polynomial transformation</code> and <code>linear regression</code>) into a <b>single <code>Pipeline</code></b>.</td>
</tr>
<tr>
    <td></td>
    <td>We will see that in the next sprints.</td>
</tr>
</table><br/><br/>
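As a preview, a single `Pipeline` of this kind could look like the sketch below. The step names, the `median` imputation strategy, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# preprocessing, polynomial expansion and the model chained in one estimator
full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("robust_scaler", RobustScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lin_reg", LinearRegression()),
])

# toy data with a quadratic relationship (made-up values)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 10.0, size=(100, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0.0, 1.0, size=100)

full_pipeline.fit(X_toy, y_toy)
toy_predictions = full_pipeline.predict(X_toy[:3])
print(toy_predictions)
```

One advantage of this design is that, during cross-validation, the imputer and the scaler are re-fitted on each training fold only, which avoids leaking information from the validation fold.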

#### **→ Scenario 2**

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# default degree = 2


In [None]:
print(X_train_scenario_2.shape)
print(X_train_scenario_2[:5])

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression



In [None]:
display_scores(lin_rmse_scores_scenario_2)

This **polynomial regression model** is very unstable (look at the _standard deviation_): it presents lower errors in some folds and extremely high ones in other folds. <br/>
It seems that:
- this combination of features is not good; and/or
- the outliers (regardless of their nature) are impacting the results; and/or
- the considered degree is not adequate; and/or
- this model is not adequate for this problem.

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>Before ignoring <b>polynomial regression</b> for our problem, we could try to <b><i>fine-tune</i></b> its <i>hyperparameters</i>, especially the <code>degree</code> since it highly impacts the final results, and/or remove outliers.</td>
</tr>
</table><br/><br/>
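A minimal sketch of such a fine-tuning with `GridSearchCV`, searching over the `degree` of the polynomial. The candidate degrees, `cv=5`, the step names, and the toy data are assumptions. On the real data, the grid and the preprocessing steps would need to be adapted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_reg = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lin_reg", LinearRegression()),
])

# `poly__degree` targets the `degree` parameter of the "poly" step
param_grid = {"poly__degree": [1, 2, 3]}
grid_search = GridSearchCV(poly_reg, param_grid,
                           scoring="neg_root_mean_squared_error", cv=5)

# toy data with a quadratic relationship (made-up values)
rng = np.random.default_rng(7)
X_toy = rng.uniform(-3.0, 3.0, size=(150, 1))
y_toy = 2.0 * X_toy[:, 0] ** 2 + rng.normal(0.0, 0.5, size=150)

grid_search.fit(X_toy, y_toy)
best_degree = grid_search.best_params_["poly__degree"]
best_rmse = -grid_search.best_score_  # the score is the negated RMSE
print(best_degree, round(best_rmse, 3))
```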