### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

### Custom CSS style

In [1]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
}
.dashed-box tr {
  background-color: white !important;  
}
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint 4</b></span>

<img src="../imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims at solving the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to keep this notebook literally a _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Refactor our code using scikit-learn `Pipelines`
---

### 0. Imports and default settings for plotting

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

#### **Preprocessing tasks**
- Fill in missing values (imputation)
- Add new features
- Feature Scaling
- One-Hot Encoding

<table align="left" class="dashed-box">
<tr>
    <td><span class='alt-tab'>alt tab</span></td>
    <td><b>Slides:</b> Scikit-Learn Design Principles - Hyperparameters vs Parameters<br/>
        <b>Slides:</b> Scikit-Learn Design Principles - Main APIs</td>
</tr>
</table><br/><br/>

### 5.1. Load the cleaned training set

Let's consider the training and testing sets already cleaned (sprint #2):
- Drop duplicated instances (none found)
- Drop instances with `housing_median_age` capped at 52
- Drop instances with `median_house_value` capped at 500001.0

In [4]:
# load the cleaned training set
housing_train = pd.read_csv('../datasets/housing_train_sprint-2.csv')

In [5]:
housing_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-121.37,37.06,25.0,474.0,92.0,300.0,104.0,3.8062,340900.0,INLAND
1,-118.39,34.14,19.0,5076.0,1034.0,2021.0,960.0,5.5683,309200.0,<1H OCEAN
2,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,322300.0,NEAR BAY
3,-121.92,36.57,42.0,3944.0,738.0,1374.0,598.0,4.174,394400.0,NEAR OCEAN
4,-118.36,33.82,36.0,1083.0,187.0,522.0,187.0,5.7765,339500.0,<1H OCEAN


In [6]:
housing_train.shape

(14857, 10)

### 5.2. Separate the _features_ and the _target outcome_

In [7]:
housing_train.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [11]:
# store the target outcome into a numpy array
y_train = housing_train['median_house_value'].values

In [12]:
y_train

array([340900., 309200., 322300., ..., 112500.,  88100.,  89000.])

In [13]:
y_train.shape

(14857,)

In [15]:
# overwrite the dataframe with only the features  
housing_train = housing_train.drop(columns=['median_house_value'])

In [16]:
housing_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-121.37,37.06,25.0,474.0,92.0,300.0,104.0,3.8062,INLAND
1,-118.39,34.14,19.0,5076.0,1034.0,2021.0,960.0,5.5683,<1H OCEAN
2,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,NEAR BAY
3,-121.92,36.57,42.0,3944.0,738.0,1374.0,598.0,4.174,NEAR OCEAN
4,-118.36,33.82,36.0,1083.0,187.0,522.0,187.0,5.7765,<1H OCEAN


In [17]:
housing_train.shape

(14857, 9)

### 5.3. Separate the _numerical_ and _categorical_ features
Since we perform different preprocessing tasks (transformations) to _numerical_ features and _categorical_ ones, let's split them into two different dataframes.

In [18]:
housing_train.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')

In [22]:
# numerical attributes
num_attributes = housing_train.columns.drop('ocean_proximity')
num_attributes

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

In [23]:
# categorical attributes
cat_attributes = ['ocean_proximity']
cat_attributes

['ocean_proximity']
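As a side note, the two attribute lists can also be derived from the column dtypes instead of being written by hand. The snippet below is only a minimal sketch on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration), assuming that every non-numeric column should be treated as categorical:

```python
import pandas as pd

# hypothetical toy dataframe with two numerical columns and one categorical column
toy_df = pd.DataFrame({
    'total_rooms': [474.0, 5076.0],
    'median_income': [3.8062, 5.5683],
    'ocean_proximity': ['INLAND', '<1H OCEAN'],
})

# numerical columns: every column with a numeric dtype
toy_num_attributes = toy_df.select_dtypes(include='number').columns.tolist()

# categorical columns: everything else (assumption: non-numeric means categorical)
toy_cat_attributes = toy_df.select_dtypes(exclude='number').columns.tolist()

print(toy_num_attributes)
print(toy_cat_attributes)
```

This only works when numerical features are actually stored with a numeric dtype, so it is worth checking `DataFrame.dtypes` first.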

In [26]:
# separating the features
housing_train_num = housing_train[num_attributes]
housing_train_cat = housing_train[cat_attributes]

In [27]:
housing_train_num.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
0,-121.37,37.06,25.0,474.0,92.0,300.0,104.0,3.8062
1,-118.39,34.14,19.0,5076.0,1034.0,2021.0,960.0,5.5683
2,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657
3,-121.92,36.57,42.0,3944.0,738.0,1374.0,598.0,4.174
4,-118.36,33.82,36.0,1083.0,187.0,522.0,187.0,5.7765


In [28]:
housing_train_cat.head()

Unnamed: 0,ocean_proximity
0,INLAND
1,<1H OCEAN
2,NEAR BAY
3,NEAR OCEAN
4,<1H OCEAN


### 5.4. Filling in missing values

`sklearn.impute.SimpleImputer` <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [36]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(housing_train_num)

SimpleImputer(strategy='median')

In [37]:
imputer.statistics_  # computed medians

array([-118.45  ,   34.24  ,   27.    , 2142.    ,  441.    , 1207.    ,
        416.    ,    3.4559])

In [40]:
housing_train_num.median()

longitude             -118.4500
latitude                34.2400
housing_median_age      27.0000
total_rooms           2142.0000
total_bedrooms         441.0000
population            1207.0000
households             416.0000
median_income            3.4559
dtype: float64

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>The <code>SimpleImputer</code> computes the <i>imputation statistic</i> (here, the median) <b>for ALL features</b> passed to <code>fit</code>, not only for those with missing values.</td>
</tr>
<tr>
    <td></td>
    <td>We can save this fitted <i>transformer</i> to disk and reuse it in future transformations.</td>
</tr>
</table><br/><br/>
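To make the tip above concrete, here is a minimal sketch on a hypothetical toy array (the `toy_` names and values are made up for illustration). It shows that the medians are learned once, from the training data, and then reused to fill missing values in unseen data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy training data: two features, one missing value in each column
X_toy_train = np.array([[1.0, 10.0],
                        [2.0, np.nan],
                        [np.nan, 30.0],
                        [4.0, 40.0]])

toy_imputer = SimpleImputer(strategy='median')
toy_imputer.fit(X_toy_train)  # learns one median per column, ignoring NaNs

# unseen data is filled with the medians learned from the TRAINING data
X_toy_new = np.array([[np.nan, np.nan]])
X_toy_new_imputed = toy_imputer.transform(X_toy_new)

print(toy_imputer.statistics_)
print(X_toy_new_imputed)
```

Calling only `transform` on new data is what prevents information from a test set from leaking into the imputation statistics.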

In [42]:
# filling in the missing values FOR ALL attributes
# it generates a numpy array
housing_train_num_imputed = imputer.transform(housing_train_num)
housing_train_num_imputed

array([[-1.2137e+02,  3.7060e+01,  2.5000e+01, ...,  3.0000e+02,
         1.0400e+02,  3.8062e+00],
       [-1.1839e+02,  3.4140e+01,  1.9000e+01, ...,  2.0210e+03,
         9.6000e+02,  5.5683e+00],
       [-1.2207e+02,  3.7410e+01,  2.6000e+01, ...,  8.1500e+02,
         2.1800e+02,  5.7657e+00],
       ...,
       [-1.2186e+02,  3.7310e+01,  2.4000e+01, ...,  1.8080e+03,
         6.2500e+02,  2.2259e+00],
       [-1.2132e+02,  3.7960e+01,  4.6000e+01, ...,  9.7500e+02,
         3.7300e+02,  2.0398e+00],
       [-1.1730e+02,  3.4140e+01,  3.9000e+01, ...,  8.4100e+02,
         3.2000e+02,  1.9432e+00]])

### 5.5. Adding new features
To _automate data preprocessing_ via sklearn, we will need _to create_ our **own transformer** to add the new features considered.

In [44]:
# template for creating a custom transformer
from sklearn.base import BaseEstimator, TransformerMixin


class NameOfYourTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        return None  # return the transformed data instead of None


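Since our custom transformer may run after other transformers in a pipeline (e.g., the imputer, which outputs a NumPy array), we will assume that its input is a **numpy 2D array**, not a _dataframe_. <br/>

This transformer will create 3 new features (`rooms_per_household`, `bedrooms_per_room`, and `population_per_household`) from the following 4 existing ones: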
- `total_rooms`
- `total_bedrooms`
- `population`
- `households`


Thus, we need to find their column indices first because our input will be a **numpy 2D array**.

In [43]:
# get the integer index of each attribute/column:
for index, column_name in enumerate(housing_train_num.columns):
    print(f'{index} = {column_name}')

0 = longitude
1 = latitude
2 = housing_median_age
3 = total_rooms
4 = total_bedrooms
5 = population
6 = households
7 = median_income
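
Instead of reading the indices off the printed list, they could also be looked up by name. The following is a minimal sketch, assuming the column order shown above (`toy_columns` and `col_idx` are hypothetical names used only for illustration):

```python
import pandas as pd

# same column order as the numerical dataframe printed above
toy_columns = pd.Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                        'total_bedrooms', 'population', 'households', 'median_income'])

# Index.get_loc returns the integer position of a (unique) column label
col_idx = {name: toy_columns.get_loc(name)
           for name in ['total_rooms', 'total_bedrooms', 'population', 'households']}

print(col_idx)
```

Looking the positions up by name keeps the indices correct if the column order of the dataframe ever changes.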


In [48]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

total_rooms_col_idx = 3
households_col_idx = 6
total_bedrooms_col_idx = 4
population_col_idx = 5

class HousingFeatEngineering(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        n_rows = X.shape[0]
        
        # create the new features
        rooms_per_household = X[:, total_rooms_col_idx] / X[:, households_col_idx]
        bedrooms_per_room = X[:, total_bedrooms_col_idx] / X[:, total_rooms_col_idx]
        population_per_household = X[:, population_col_idx] / X[:, households_col_idx]
        
        # to concatenate the new array as columns in our feature matrix, we need to reshape first
        rooms_per_household = rooms_per_household.reshape((n_rows, 1))
        bedrooms_per_room = bedrooms_per_room.reshape((n_rows, 1))
        population_per_household = population_per_household.reshape((n_rows, 1))
        
        # concatenate the new features into the feature matrix X
        X_out = np.hstack((X, rooms_per_household, bedrooms_per_room, population_per_household))
        
        return X_out

In [49]:
feat_engineer = HousingFeatEngineering()

housing_train_num_new_feats = feat_engineer.transform(housing_train_num.values)  # we need to convert it to numpy first
housing_train_num_new_feats

array([[-121.37      ,   37.06      ,   25.        , ...,    4.55769231,
           0.19409283,    2.88461538],
       [-118.39      ,   34.14      ,   19.        , ...,    5.2875    ,
           0.2037037 ,    2.10520833],
       [-122.07      ,   37.41      ,   26.        , ...,    5.43119266,
           0.19003378,    3.73853211],
       ...,
       [-121.86      ,   37.31      ,   24.        , ...,    3.1024    ,
           0.3362558 ,    2.8928    ],
       [-121.32      ,   37.96      ,   46.        , ...,    4.91152815,
           0.19923581,    2.61394102],
       [-117.3       ,   34.14      ,   39.        , ...,    5.565625  ,
           0.18809657,    2.628125  ]])

In [50]:
housing_train_num_new_feats.shape

(14857, 11)

In [51]:
# show the new feats
housing_train_num_new_feats[:, -3:]

array([[4.55769231, 0.19409283, 2.88461538],
       [5.2875    , 0.2037037 , 2.10520833],
       [5.43119266, 0.19003378, 3.73853211],
       ...,
       [3.1024    , 0.3362558 , 2.8928    ],
       [4.91152815, 0.19923581, 2.61394102],
       [5.565625  , 0.18809657, 2.628125  ]])
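
The two base classes are what make a custom class behave like any other sklearn transformer: `TransformerMixin` provides `fit_transform`, and `BaseEstimator` provides `get_params`/`set_params` for the arguments declared in `__init__`. The sketch below illustrates this with a hypothetical `RatioAdder` transformer (not part of this project), whose column indices are hyperparameters instead of global variables:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends X[:, numerator_idx] / X[:, denominator_idx]."""

    def __init__(self, numerator_idx=0, denominator_idx=1):
        # hyperparameters must be stored under the same names as the arguments
        self.numerator_idx = numerator_idx
        self.denominator_idx = denominator_idx

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        ratio = X[:, self.numerator_idx] / X[:, self.denominator_idx]
        return np.column_stack((X, ratio))


# toy input: total_rooms and households of the first two training instances
X_toy = np.array([[474.0, 104.0],
                  [5076.0, 960.0]])

adder = RatioAdder(numerator_idx=0, denominator_idx=1)
X_toy_out = adder.fit_transform(X_toy)  # inherited from TransformerMixin
params = adder.get_params()             # inherited from BaseEstimator

print(X_toy_out)
print(params)
```

The last column reproduces the `rooms_per_household` values of the first two rows shown above (474 / 104 and 5076 / 960).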

### 5.6. Feature Scaling
Exactly as performed in the previous sprint: **RobustScaler**. <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

In [52]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaler.fit(housing_train_num)

RobustScaler()

In [53]:
housing_train_num_scaled = scaler.transform(housing_train_num)
housing_train_num_scaled

array([[-0.81337047,  0.752     , -0.11111111, ..., -0.93989637,
        -0.94545455,  0.1689414 ],
       [ 0.01671309, -0.02666667, -0.44444444, ...,  0.84352332,
         1.64848485,  1.01876055],
       [-1.00835655,  0.84533333, -0.05555556, ..., -0.40621762,
        -0.6       ,  1.1139619 ],
       ...,
       [-0.94986072,  0.81866667, -0.16666667, ...,  0.62279793,
         0.63333333, -0.5931999 ],
       [-0.7994429 ,  0.992     ,  1.05555556, ..., -0.24041451,
        -0.13030303, -0.68295153],
       [ 0.32033426, -0.02666667,  0.66666667, ..., -0.37927461,
        -0.29090909, -0.72953943]])
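
To make explicit what `RobustScaler` computes with its default settings, the sketch below reproduces it by hand on a hypothetical toy feature: each value has the median subtracted and is then divided by the interquartile range (IQR, the 75th minus the 25th percentile).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy feature with one outlier (100.0)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_scaler = RobustScaler()  # defaults: centering by the median, scaling by the IQR
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# manual computation of the same transformation
median = np.median(X_toy, axis=0)
q25, q75 = np.percentile(X_toy, [25, 75], axis=0)
X_toy_manual = (X_toy - median) / (q75 - q25)

print(X_toy_scaled.ravel())
print(X_toy_manual.ravel())
```

Because the median and the IQR barely move when the outlier changes, the scaling of the remaining values is not distorted by it.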

### 5.7. Categorical Variable Encoding
Instead of using the function `pd.get_dummies()` from _pandas_, let's use a transformer from _sklearn_.

`sklearn.preprocessing.OneHotEncoder` <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [61]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
housing_train_cat_1hot = encoder.fit_transform(housing_train_cat)

In [62]:
housing_train_cat

Unnamed: 0,ocean_proximity
0,INLAND
1,<1H OCEAN
2,NEAR BAY
3,NEAR OCEAN
4,<1H OCEAN
...,...
14852,NEAR OCEAN
14853,<1H OCEAN
14854,<1H OCEAN
14855,INLAND


In [63]:
housing_train_cat_1hot

<14857x5 sparse matrix of type '<class 'numpy.float64'>'
	with 14857 stored elements in Compressed Sparse Row format>

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>Notice that the output is a <i>SciPy sparse matrix</i>, instead of a <i>NumPy array</i>. This is very useful when you have categorical attributes with <b>thousands of categories</b>.</td>
</tr>
<tr>
    <td></td>
    <td>After one-hot encoding, we get a matrix with thousands of columns, and the matrix is <i>full of 0s</i> except for <i>a single <b>1</b> per row</i>.</td>
</tr>
<tr>
    <td></td>
    <td>Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.</td>
</tr>
</table><br/><br/>
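The sketch below illustrates the sparse output on a hypothetical toy column (the `toy_` names are made up for illustration). It also shows the effect of `handle_unknown='ignore'`: a category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND']})

toy_encoder = OneHotEncoder(handle_unknown='ignore')
toy_1hot = toy_encoder.fit_transform(toy_cat)  # SciPy sparse matrix

n_stored = toy_1hot.nnz                               # values actually stored
n_cells = toy_1hot.shape[0] * toy_1hot.shape[1]       # cells of the dense version
print(f'stored elements: {n_stored} of {n_cells} cells')

# a category unseen during fit becomes an all-zero row
unseen = pd.DataFrame({'ocean_proximity': ['NEAR OCEAN']})
unseen_1hot = toy_encoder.transform(unseen)
print(unseen_1hot.toarray())
```

With only a few categories the saving is negligible, but the number of stored elements grows with the number of rows, not with the number of categories.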


In [65]:
# converting to NumPy array
housing_train_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [66]:
# getting the list of categories
encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

### 5.8. Creating Preprocessing `Pipelines`
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

<table align="left" class="dashed-box">
<tr>
    <td><span class='alt-tab'>alt tab</span></td>
    <td><b>Slides:</b> Scikit-Learn Design Principles - Pipelines<br/></td>
</tr>
</table><br/><br/>

Let's create a **Preprocessing `Pipeline`**.

In [67]:
from sklearn.pipeline import Pipeline

#### Pipeline for numerical data

In [68]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('feat_engineering', HousingFeatEngineering()),
    ('robust_scaler', RobustScaler())
])

In [73]:
housing_train_num_preprocessed = num_pipeline.fit_transform(housing_train_num)

In [74]:
housing_train_num_preprocessed

array([[-0.81337047,  0.752     , -0.11111111, ..., -0.41720345,
        -0.15307229,  0.03593627],
       [ 0.01671309, -0.02666667, -0.44444444, ...,  0.05044793,
        -0.00235622, -0.88393775],
       [-1.00835655,  0.84533333, -0.05555556, ...,  0.14252434,
        -0.21672549,  1.04374832],
       ...,
       [-0.94986072,  0.81866667, -0.16666667, ..., -1.34973604,
         2.07630238,  0.04559594],
       [-0.7994429 ,  0.992     ,  1.05555556, ..., -0.19046999,
        -0.07242097, -0.2835198 ],
       [ 0.32033426, -0.02666667,  0.66666667, ...,  0.22866686,
        -0.24710446, -0.26677954]])

In [75]:
housing_train_num_preprocessed.shape

(14857, 11)
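
To see what a `Pipeline` does under the hood, the sketch below compares it with applying the same steps by hand on a hypothetical toy array: `fit_transform` fits each step on the output of the previous one, and the fitted steps remain accessible by name through `named_steps`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# toy data with one missing value per column
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [np.nan, 50.0]])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('robust_scaler', RobustScaler()),
])
X_toy_pipe = toy_pipeline.fit_transform(X_toy)

# the same two steps, chained manually
X_toy_step1 = SimpleImputer(strategy='median').fit_transform(X_toy)
X_toy_manual = RobustScaler().fit_transform(X_toy_step1)

print(np.allclose(X_toy_pipe, X_toy_manual))

# fitted steps can be inspected by the names given in the pipeline
fitted_scaler = toy_pipeline.named_steps['robust_scaler']
print(fitted_scaler.center_)
```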

#### Pipeline for categorical data

In [76]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
])

In [77]:
housing_train_cat_preprocessed = cat_pipeline.fit_transform(housing_train_cat)

In [78]:
housing_train_cat_preprocessed.toarray()

array([[0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [79]:
# compare both results as dense arrays
np.all(housing_train_cat_preprocessed.toarray() == housing_train_cat_1hot.toarray())

True

### 5.9. Putting it all together with `ColumnTransformer`
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

Applies _transformers_ to **columns** of an array or pandas DataFrame. <br/>
This **estimator** allows _different columns_ or _column subsets_ of the input to be **transformed *separately*** and the _features generated_ by each transformer will be _concatenated_ to form a **single feature space**. <br/>

This is useful for _heterogeneous or columnar data_, to combine several feature extraction mechanisms or transformations into a single transformer.
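
As a minimal sketch of this behavior, the toy example below (a hypothetical dataframe with one numerical and one categorical column) sends each column to a different transformer and gets back a single matrix with the outputs concatenated in the order the transformers were declared. Passing `sparse_threshold=0` is a choice made here only to guarantee a dense output for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

toy_df = pd.DataFrame({
    'median_income': [1.0, 2.0, 3.0, 4.0],
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND'],
})

# (name, transformer, columns)
toy_preprocessor = ColumnTransformer([
    ('numerical', RobustScaler(), ['median_income']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['ocean_proximity']),
], sparse_threshold=0)  # always return a dense NumPy array

X_toy = toy_preprocessor.fit_transform(toy_df)

# 1 scaled numerical column + 3 one-hot columns
print(X_toy.shape)
print(X_toy)
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`).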

In [80]:
num_attributes

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

In [81]:
cat_attributes

['ocean_proximity']

In [82]:
housing_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-121.37,37.06,25.0,474.0,92.0,300.0,104.0,3.8062,INLAND
1,-118.39,34.14,19.0,5076.0,1034.0,2021.0,960.0,5.5683,<1H OCEAN
2,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,NEAR BAY
3,-121.92,36.57,42.0,3944.0,738.0,1374.0,598.0,4.174,NEAR OCEAN
4,-118.36,33.82,36.0,1083.0,187.0,522.0,187.0,5.7765,<1H OCEAN


In [95]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('feat_engineering', HousingFeatEngineering()),
    ('robust_scaler', RobustScaler())
])

cat_pipeline = Pipeline([
    ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
])

# (name, transformer, columns)
preprocessed_pipeline = ColumnTransformer([
    ('numerical', num_pipeline, num_attributes),
    ('categorical', cat_pipeline, cat_attributes)
])

In [96]:
housing_train_pre_npy = preprocessed_pipeline.fit_transform(housing_train)

In [97]:
housing_train_pre_npy

array([[-0.81337047,  0.752     , -0.11111111, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.01671309, -0.02666667, -0.44444444, ...,  0.        ,
         0.        ,  0.        ],
       [-1.00835655,  0.84533333, -0.05555556, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.94986072,  0.81866667, -0.16666667, ...,  0.        ,
         0.        ,  0.        ],
       [-0.7994429 ,  0.992     ,  1.05555556, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.32033426, -0.02666667,  0.66666667, ...,  0.        ,
         0.        ,  0.        ]])

In [100]:
housing_train_pre_npy.shape

(14857, 16)

In [101]:
preprocessed_pipeline.named_transformers_

{'numerical': Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                 ('feat_engineering', HousingFeatEngineering()),
                 ('robust_scaler', RobustScaler())]),
 'categorical': Pipeline(steps=[('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))])}

In [104]:
preprocessed_pipeline.transformers_

[('numerical',
  Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                  ('feat_engineering', HousingFeatEngineering()),
                  ('robust_scaler', RobustScaler())]),
  Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
         'total_bedrooms', 'population', 'households', 'median_income'],
        dtype='object')),
 ('categorical',
  Pipeline(steps=[('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))]),
  ['ocean_proximity'])]

### 5.10. Saving the Preprocessed Pipeline

In [107]:
import joblib

joblib.dump(preprocessed_pipeline, '../models/preprocessed_pipeline.pkl')

['../models/preprocessed_pipeline.pkl']

In [108]:
# to load the pipeline
loaded_preprocessed_pipeline = joblib.load('../models/preprocessed_pipeline.pkl')

In [110]:
# the loaded pipeline is already fitted, so we only need to transform the data
housing_train_pre_npy_2 = loaded_preprocessed_pipeline.transform(housing_train)
housing_train_pre_npy_2

array([[-0.81337047,  0.752     , -0.11111111, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.01671309, -0.02666667, -0.44444444, ...,  0.        ,
         0.        ,  0.        ],
       [-1.00835655,  0.84533333, -0.05555556, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.94986072,  0.81866667, -0.16666667, ...,  0.        ,
         0.        ,  0.        ],
       [-0.7994429 ,  0.992     ,  1.05555556, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.32033426, -0.02666667,  0.66666667, ...,  0.        ,
         0.        ,  0.        ]])

In [111]:
np.all(housing_train_pre_npy == housing_train_pre_npy_2)

True
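
The main reason to persist a fitted transformer is to apply it later to data it has never seen, without fitting it again. Here is a minimal sketch of that round trip, using a hypothetical toy scaler and a temporary directory (the file name `toy_scaler.pkl` is made up for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

X_toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_toy_new = np.array([[10.0]])  # unseen data, e.g., from a test set

toy_scaler = RobustScaler().fit(X_toy_train)

# save the fitted scaler to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_scaler.pkl')
    joblib.dump(toy_scaler, toy_path)
    loaded_toy_scaler = joblib.load(toy_path)

# only transform: the loaded scaler keeps the statistics learned from the training data
X_new_original = toy_scaler.transform(X_toy_new)
X_new_loaded = loaded_toy_scaler.transform(X_toy_new)

print(X_new_original, X_new_loaded)
```

Files written by `joblib.dump` should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.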

### 5.11. Saving the Preprocessed Training Set

In [112]:
np.save('../datasets/housing_train_pre_numpy_sprint-4.npy', housing_train_pre_npy)

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the independent (features) and dependent variables (outcome)

In [113]:
X_train = housing_train_pre_npy
# we already have y_train

In [114]:
X_train.shape

(14857, 16)

In [115]:
y_train.shape

(14857,)

### 6.2. Training the Models

<h3 style="color: #ff5757 !important"><b>Cross-validation</b></h3>

#### **→ Linear Regression**

In [116]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()  # default parameters
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)

In [117]:
# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [118]:
display_scores(lin_rmse_scores)

Scores: [58883.49218279 55293.5735956  55181.52250607 57775.16025404
 60155.15922657 59588.9688012  57781.68607191 59995.21362185
 59923.51235138 59132.12479715]
Mean: 58371.04134085788
Standard deviation: 1757.914437100754


<br/>

We get exactly the same results as in Sprint #3.
- **Linear Regression:** \$58,371 ± \$1,758
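
A note on the scoring used above: sklearn scorers follow a "greater is better" convention, so errors are returned negated, which is why the scores are negated before taking the square root. The sketch below, on synthetic data from `make_regression` (not the housing data), shows that the `neg_root_mean_squared_error` scorer gives the same per-fold RMSE directly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression problem, used only for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

toy_reg = LinearRegression()

# option 1: negated MSE per fold, converted to RMSE by hand
neg_mse = cross_val_score(toy_reg, X_toy, y_toy, scoring='neg_mean_squared_error', cv=5)
rmse_from_mse = np.sqrt(-neg_mse)

# option 2: negated RMSE per fold, only the sign needs to be flipped
neg_rmse = cross_val_score(toy_reg, X_toy, y_toy, scoring='neg_root_mean_squared_error', cv=5)
rmse_direct = -neg_rmse

print(rmse_from_mse)
print(rmse_direct)
```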