### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

### Custom CSS style

In [1]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
}
.dashed-box tr {
  background-color: white !important;  
}
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint 6</b></span>

<img src="./imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to make this notebook a literal _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Fine-tune the _hyperparameters_ of the Polynomial Regression models from Sprint #5.
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

We will consider the same two scenarios for **Polynomial Regression** from Sprint #5 in this sprint:
1. Use _only_ the `median_income`.
2. Use _all features_ except those used to build the aggregate features (`total_rooms`, `total_bedrooms`, `population`, `households`).

### 5.1. Load the cleaned training set

Let's use the training set already cleaned in Sprint #2:
- Drop duplicated instances (none found)
- Drop instances with `housing_median_age` capped at 52
- Drop instances with `median_house_value` capped at 500001.0
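
For reference, the cleaning steps listed above could look roughly like the sketch below. It runs on a made-up toy DataFrame (not the real dataset) and assumes that "capped" means "equal to the cap value"; the actual Sprint #2 code may differ.

```python
import pandas as pd

# toy data standing in for the raw housing set (all values are made up)
toy = pd.DataFrame({
    'housing_median_age': [10.0, 52.0, 30.0, 30.0],
    'median_house_value': [150000.0, 200000.0, 500001.0, 180000.0],
})
# append a copy of the first row so there is one duplicate to drop
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)

# 1. drop duplicated instances
cleaned = toy.drop_duplicates()
# 2. drop instances with `housing_median_age` capped at 52
cleaned = cleaned[cleaned['housing_median_age'] < 52]
# 3. drop instances with `median_house_value` capped at 500001.0
cleaned = cleaned[cleaned['median_house_value'] < 500001.0]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```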

In [None]:
# load the cleaned training set
housing_train = pd.read_csv('./datasets/housing_train_sprint-2.csv')

In [None]:
housing_train.head()

In [None]:
housing_train.shape

### 5.2. Separate the _features_ and the _target outcome_

In [None]:
# store the target outcome into a numpy array
y_train = housing_train['median_house_value'].values

# overwrite the dataframe with only the features  
housing_train = housing_train.drop(columns=['median_house_value'])

### 5.3. Pipelines

For the sake of simplicity, let's include the **Polynomial Regression** objects (`PolynomialFeatures()` + `LinearRegression()`) in our **pipeline**, so it is no longer dedicated _just_ to preprocessing.
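
To see why chaining `PolynomialFeatures()` and `LinearRegression()` gives a polynomial regression, here is a minimal sketch on synthetic data. The data, the names `X_toy`, `y_toy`, and `poly_reg`, and the true coefficients are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# synthetic quadratic data: y = 0.5 * x^2 + x + 2 + small noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = 0.5 * X_toy[:, 0] ** 2 + X_toy[:, 0] + 2 + rng.normal(scale=0.1, size=200)

# PolynomialFeatures expands x into [x, x^2];
# LinearRegression then fits a linear model on the expanded features
poly_reg = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lin_regression', LinearRegression()),
])
poly_reg.fit(X_toy, y_toy)

coefs = poly_reg.named_steps['lin_regression'].coef_
intercept = poly_reg.named_steps['lin_regression'].intercept_
print(coefs, intercept)  # should be close to [1.0, 0.5] and 2.0
```

Because the expansion lives inside the pipeline, the polynomial `degree` becomes a hyperparameter of the whole pipeline and can be tuned like any other.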

#### **Scenario 1**

In [None]:
# pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

attributes_scenario_1 = ['median_income']

pipeline_scenario_1 = Pipeline([
    ('imputer', SimpleImputer()),   # let's evaluate the mean and median imputation
    ('scaler', RobustScaler()),
    ('poly', PolynomialFeatures())
])

# we use the ColumnTransformer only because it automatically selects the required columns before running the pipeline.
# (name, transformer, columns)
preprocessed_pipeline_scenario_1 = ColumnTransformer([
    ("numerical", pipeline_scenario_1, attributes_scenario_1)
])


# full pipeline: preprocessing + model training/prediction
full_pipeline_scenario_1 = Pipeline([
        ('preprocessing', preprocessed_pipeline_scenario_1),
        ('lin_regression', LinearRegression())
])

#### **Scenario 2**

In [None]:
# feature engineering transformer from Sprint #4
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# our 3 new features are built from these columns: total_rooms, total_bedrooms, population, households
# column indices in the feature matrix
rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx = 3, 4, 5, 6

class HousingFeatEngineering(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        n_rows = X.shape[0]
        
        # creating the new features
        rooms_per_household = X[:, rooms_col_idx] / X[:, households_col_idx]
        population_per_household = X[:, population_col_idx] / X[:, households_col_idx]
        bedrooms_per_room = X[:, bedrooms_col_idx] / X[:, rooms_col_idx]
                
        # to concatenate the new array as columns in our feature matrix, we need to reshape them first
        rooms_per_household = rooms_per_household.reshape((n_rows, 1))
        population_per_household = population_per_household.reshape((n_rows, 1))
        bedrooms_per_room = bedrooms_per_room.reshape((n_rows, 1))
        
        # concatenating the new features into the feature matrix
        X_out = np.hstack((X, rooms_per_household, population_per_household, bedrooms_per_room))
        
        return X_out

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# indices of the columns used to build the aggregate features, which are dropped here
# (same column indices as in HousingFeatEngineering)
rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx = 3, 4, 5, 6

class DropFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, verbose=False):
        self.verbose = verbose
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        X_out = np.delete(X, [rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx], axis=1)
        
        # for debugging
        if self.verbose:
            np.set_printoptions(suppress=True)
            print('X[:5]')
            print(X[:5])
            print('\nX_out[:5]')
            print(X_out[:5])
        
        return X_out

In [None]:
# pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

attributes_scenario_2 = housing_train.columns.drop('ocean_proximity')

pipeline_scenario_2 = Pipeline([
    ('imputer', SimpleImputer()),  # let's evaluate the mean and median imputation
    ('feat_engineering', HousingFeatEngineering()),
    ('drop_features', DropFeatures(verbose=False)),
    ('scaler', RobustScaler()),
    ('poly', PolynomialFeatures())
])

# we use the ColumnTransformer only because it automatically selects the required columns before running the pipeline.
# (name, transformer, columns)
preprocessed_pipeline_scenario_2 = ColumnTransformer([
    ("numerical", pipeline_scenario_2, attributes_scenario_2)
])

# full pipeline: preprocessing + model training/prediction
full_pipeline_scenario_2 = Pipeline([
        ('preprocessing', preprocessed_pipeline_scenario_2),
        ('lin_regression', LinearRegression())
])

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Hyperparameter Optimization (fine-tuning)

<table align="left" class="dashed-box">
<tr>
    <td><span class='alt-tab'>alt tab</span></td>
    <td><b>Slides:</b> Hyperparameter Optimization (fine-tuning).</td>
</tr>
</table><br/><br/>

### **Scenario 1**

#### **Finding out the hyperparameter key names**

In [None]:
full_pipeline_scenario_1.get_params()
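
💡 In nested scikit-learn estimators, the key of a hyperparameter is built by joining the step names with a double underscore (`__`). The sketch below rebuilds a reduced pipeline with the same step names used above (`toy_pipeline` and `inner` are illustrative names, not part of the notebook) and filters the keys we are likely to tune.

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# reduced version of the scenario 1 pipeline, keeping the same step names
inner = Pipeline([
    ('imputer', SimpleImputer()),
    ('poly', PolynomialFeatures()),
])
pre = ColumnTransformer([('numerical', inner, ['median_income'])])
toy_pipeline = Pipeline([
    ('preprocessing', pre),
    ('lin_regression', LinearRegression()),
])

# keys follow the pattern <step>__<sub-step>__...__<parameter>
keys = sorted(k for k in toy_pipeline.get_params()
              if k.endswith(('imputer__strategy', 'poly__degree')))
print(keys)

# the same keys can be used to change a hyperparameter in place
toy_pipeline.set_params(preprocessing__numerical__poly__degree=3)
degree_after = toy_pipeline.get_params()['preprocessing__numerical__poly__degree']
```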

#### **Grid-search**
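
Before filling in the cells below, here is a minimal, self-contained sketch of a grid search over the imputation strategy and the polynomial degree. It uses synthetic data in place of the housing set, and the grid values, the number of folds, and the RMSE scoring are assumptions; adapt them to `full_pipeline_scenario_1` and the real training data.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the training set (values are made up)
rng = np.random.default_rng(0)
income = rng.uniform(0.5, 10, size=300)
price = 40000 * income + 2000 * income ** 2 + rng.normal(scale=5000, size=300)
X_toy = pd.DataFrame({'median_income': income})
X_toy.iloc[::25, 0] = np.nan  # a few missing values so that the imputer matters

# same structure and step names as the scenario 1 pipeline
inner = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', RobustScaler()),
    ('poly', PolynomialFeatures()),
])
pre = ColumnTransformer([('numerical', inner, ['median_income'])])
toy_pipeline = Pipeline([
    ('preprocessing', pre),
    ('lin_regression', LinearRegression()),
])

# every combination of these values is evaluated with 5-fold cross-validation
param_grid = {
    'preprocessing__numerical__imputer__strategy': ['mean', 'median'],
    'preprocessing__numerical__poly__degree': [1, 2, 3],
}
grid_search = GridSearchCV(toy_pipeline, param_grid, cv=5,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(X_toy, price)

best_params = grid_search.best_params_   # best hyperparameters
best_rmse = -grid_search.best_score_     # best score (sign flipped back to RMSE)
cv_results = pd.DataFrame(grid_search.cv_results_)[
    ['params', 'mean_test_score', 'rank_test_score']
]
print(best_params, best_rmse)
```

⚠️ scikit-learn maximizes the score, so error metrics are negated (`neg_root_mean_squared_error`); flip the sign to read the value as an RMSE.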

In [None]:
# grid search


In [None]:
# best hyperparameters


In [None]:
# best score


In [None]:
# cross-validation results


### **Scenario 2**

In [None]:
full_pipeline_scenario_2.get_params()
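
⚠️ When choosing the degrees to search in this scenario, keep in mind that the number of polynomial features grows quickly with the number of input features. The sketch below assumes 7 features reach `PolynomialFeatures()` (8 numerical columns + 3 engineered features - 4 dropped columns); check this count against the real data.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 7  # assumption: 8 numerical columns + 3 engineered - 4 dropped
X_dummy = np.zeros((1, n_features))  # only the shape matters here

n_outputs = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree).fit(X_dummy)
    n_outputs[degree] = poly.n_output_features_

# with the bias column included, the count equals C(n_features + degree, degree)
expected = {d: comb(n_features + d, d) for d in n_outputs}
print(n_outputs)
```

Higher degrees therefore make each grid-search fit more expensive and increase the risk of overfitting.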

In [None]:
# grid search


In [None]:
# best hyperparameters


In [None]:
# best score


In [None]:
# cross-validation results


### Discussion