### <i>Hands on Machine Learning with Scikit-Learn and Tensorflow</i>


# Chapter 2 - California House Price Prediction 

(Part 2 of 3)

The objective of this model is to predict house prices given a California census data set.

In [1]:
# housekeeping stuff to set the page width as wide as possible in this notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import pandas as pd


## Prepare the Data for Machine Learning Algorithms

Always write functions for preparing the data:

- It will allow you to reproduce these transformations easily on any dataset
- You will gradually build up a library of transformation functions 
- You can use these functions in your live system to transform the data before feeding it into your algorithm
- It will allow you to easily try different transformations and see what works best

First, let's revert to a clean dataset and separate the predictors from the labels, since we don't necessarily want to apply the same transformations to the predictors and the target values.

Note: <code>drop()</code> creates a copy of the data and does not affect <code>strat_train_set</code>

In [2]:
# import the data from part 1
import os
import pickle
import numpy as np

def load_data():
    if not os.path.isdir('./data') or not os.path.exists('./data/prt1.pkle'):
        # raising stops execution here, so no explicit exit call is needed
        raise FileNotFoundError('No data from part 1 found. Please run part 1 first.')

    with open('./data/prt1.pkle', 'rb') as f:
        strat_train_set, _ = pickle.load(f)
        
    housing = strat_train_set.drop('median_house_value', axis=1)
    housing_labels = strat_train_set['median_house_value'].copy()
    
    return housing, housing_labels

housing, housing_labels = load_data()

Now we have two objects: <code>housing</code> is a DataFrame containing everything we can use for predictions, and <code>housing_labels</code> is a Series containing the values we are trying to predict.
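The note about <code>drop()</code> can be checked on a toy frame. This is a minimal sketch; the column <code>feature</code> and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for strat_train_set (hypothetical values)
df = pd.DataFrame({"feature": [1, 2], "median_house_value": [100, 200]})

predictors = df.drop("median_house_value", axis=1)  # returns a new DataFrame
labels = df["median_house_value"].copy()

print(list(df.columns))          # the original still has both columns
print(list(predictors.columns))  # only the predictor column remains
```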

## Data Cleaning

Most machine learning algorithms can't work with missing features, so let's create a few functions to take care of them. <code>total_bedrooms</code> from earlier has some missing values and there are 3 options to fix this:

- Get rid of the corresponding districts
- Get rid of the whole attribute
- Set the values to some value (zero, the mean, the median, etc)

You can do these easily with <code>DataFrame</code>'s <code>dropna()</code>, <code>drop()</code>, and <code>fillna()</code> methods:

In [3]:
housing.dropna(subset=['total_bedrooms'])       # option 1
housing.drop('total_bedrooms', axis=1)          # option 2
median = housing['total_bedrooms'].median()
_ = housing['total_bedrooms'].fillna(median)    # option 3 

If you choose option 3, compute the median on the training set and use it to fill in the missing values in the training set, but don't forget to save the median value you have computed. You will need it later to replace missing values in the test set when you evaluate your system, and again once the system goes live to replace missing values in new data.
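As a rough illustration of option 3, the sketch below learns the median from a toy training set and reuses it on a toy test set. The two small frames and their values are assumptions made up for this example, not the real housing data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical values)
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Learn the fill value from the training set only, and keep it around
train_median = train["total_bedrooms"].median()

# fillna() returns a new Series, so assign the result back
train["total_bedrooms"] = train["total_bedrooms"].fillna(train_median)
test["total_bedrooms"] = test["total_bedrooms"].fillna(train_median)

print(train_median)
print(test["total_bedrooms"].tolist())
```

The test set is filled with the training median, never with a median computed from the test set itself.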

## Scikit-Learn Estimators and Transformers

Scikit-Learn has a handy class to take care of missing values: <code>Imputer</code>. First you need to create an <code>Imputer</code> instance specifying that you want to replace each attribute's missing values with the median of that attribute:

In [4]:
# Get the data again to be clean as we will use sklearn rather than the 3 options above
housing, housing_labels = load_data()

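# Note: Imputer belongs to the scikit-learn version used in this notebook; it was removed
# in scikit-learn 0.22, where sklearn.impute.SimpleImputer takes its place
from sklearn.preprocessing import Imputer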

imputer = Imputer(strategy='median')

Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute <code>ocean_proximity</code>:

In [5]:
housing_num = housing.drop('ocean_proximity', axis=1)

Now you can fit the imputer to the training data using the <code>fit()</code> method:

In [6]:
_ = imputer.fit(housing_num)

The imputer has simply computed the median of each attribute and stored the result in its <code>statistics_</code> instance variable. Only the <code>total_bedrooms</code> attribute had missing values, but we cannot be sure there won't be any missing values in new data after the system goes live, so it's safer to apply the imputer to all the numerical attributes:

In [7]:
imputer.statistics_

array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,
        1164.    ,   408.    ,     3.5409])

In [8]:
housing_num.median().values

array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,
        1164.    ,   408.    ,     3.5409])

Now you can use this "trained" imputer to transform the training set by replacing missing values by the learned medians:

In [9]:
X = imputer.transform(housing_num)

The result is a plain NumPy array containing the transformed features. If you want to put it back in a Pandas DataFrame, it's simple:

In [10]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)

For details of sklearn's API design principles see http://goo.gl/wL10sI

Short version:
- <b>Estimators:</b> Any object that can estimate some parameters from a dataset, for example an <code>Imputer</code>. The estimation is performed by the <code>fit()</code> method.
- <b>Transformers:</b> Estimators that can also transform a dataset through the <code>transform()</code> method, which returns the transformed dataset. The transformation generally relies on the parameters learned by <code>fit()</code>.
- For convenience, a transformer generally also has a <code>fit_transform()</code> method, which is equivalent to calling <code>fit()</code> and then <code>transform()</code> but can sometimes run much faster.
- <b>Predictors:</b> Some estimators are capable of making predictions given a dataset, for example a <code>LinearRegression</code> model. A predictor has a <code>predict()</code> method and a <code>score()</code> method that measures the quality of the predictions given a test set.
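The three roles can be seen side by side in a small sketch. It assumes a recent scikit-learn, where <code>SimpleImputer</code> from <code>sklearn.impute</code> stands in for the <code>Imputer</code> class used in this notebook; the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data with one missing value (hypothetical)
X = np.array([[1.0], [2.0], [np.nan], [3.0]])
y = np.array([2.0, 4.0, 4.0, 6.0])

# Estimator: fit() learns the median and stores it in statistics_
imputer = SimpleImputer(strategy="median")
imputer.fit(X)

# Transformer: transform() applies what fit() learned
X_filled = imputer.transform(X)

# fit_transform() does both steps in one call
X_same = SimpleImputer(strategy="median").fit_transform(X)

# Predictor: predict() makes predictions, score() rates them (R^2 for regressors)
model = LinearRegression()
model.fit(X_filled, y)
prediction = model.predict(np.array([[5.0]]))
r2 = model.score(X_filled, y)
```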


## Handling Text and Categorical Attributes

We removed the categorical attribute <code>ocean_proximity</code> earlier because it is a text attribute, so we cannot compute its median. Most Machine Learning algos prefer to work with numbers anyway, so let's convert these text labels to numbers.

Scikit-Learn provides a transformer for this task called <code>LabelEncoder</code>:

In [11]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

array([0, 0, 4, ..., 1, 0, 3])

Now we can use this numerical data in any ML algo. You can look at the mapping that this encoder has learned using the <code>classes_</code> attribute (the index into the list gives the mapping):

In [12]:
print(encoder.classes_)

['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']


An issue with this representation is that many ML algos will assume that two nearby values are more similar than two distant values, which is not the case. Use <i>one-hot encoding</i> to fix this. That creates one binary attribute per category with a 1 when the category is matched for that data row. 
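To see what one-hot encoding produces, here is a minimal NumPy sketch that builds the vectors by hand. The integer codes are a made-up sample; this only illustrates the idea and is not how Scikit-Learn implements it.

```python
import numpy as np

# Integer category codes for five rows (hypothetical sample)
codes = np.array([0, 0, 4, 1, 3])
n_categories = 5

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the identity matrix with the codes encodes every row at once
one_hot = np.eye(n_categories, dtype=int)[codes]
print(one_hot)
```

Each row contains a single 1, in the column that matches its category, so no category ends up numerically "closer" to another.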

Scikit-Learn provides a <code>OneHotEncoder</code> encoder to convert integer categorical values into one-hot encoded vectors. Let's do that. Note that <code>fit_transform()</code> expects a 2D array, but <code>housing_cat_encoded</code> is a 1D array, so we need to reshape it with NumPy's <code>reshape()</code>. The output is a SciPy sparse matrix rather than a NumPy array.

In [13]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [14]:
housing_cat_1hot.toarray()

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       ..., 
       [ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])

We can apply both steps (from text to integer categories and then to one-hot encodings) using the <code>LabelBinarizer</code> class:

In [15]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot

array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ..., 
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Note that this returns a dense NumPy array by default, but you can get a sparse matrix by passing <code>sparse_output=True</code> to the <code>LabelBinarizer</code> constructor.
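A small sketch of the dense and sparse variants, using a made-up list of labels with three distinct categories:

```python
from sklearn.preprocessing import LabelBinarizer

# Toy labels (hypothetical sample of ocean_proximity values)
labels = ["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND"]

# Default: dense NumPy array of 0s and 1s
dense_encoder = LabelBinarizer()
dense = dense_encoder.fit_transform(labels)

# sparse_output=True: SciPy sparse matrix with the same content
sparse_encoder = LabelBinarizer(sparse_output=True)
sparse = sparse_encoder.fit_transform(labels)

print(dense_encoder.classes_)
print(dense)
print(sparse.toarray())
```

The sparse form only stores the position of each 1, which saves memory when there are many categories.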

## Custom Transformers

You will need to write your own transformers for tasks such as custom cleanup operations or combining specific attributes. You will want these to be compliant with Scikit-Learn's API so they work seamlessly with things like sklearn pipelines. Since sklearn relies on duck typing, you only need to create a class that implements three methods: <code>fit()</code> (returning <code>self</code>), <code>transform()</code> and <code>fit_transform()</code>. You can get the last one for free by adding <code>TransformerMixin</code> as a base class. Also, if you add <code>BaseEstimator</code> as a base class (and avoid <code>*args</code> and <code>**kwargs</code> in your constructor) you will get two extra methods, <code>get_params()</code> and <code>set_params()</code>, that will be useful for automatic hyperparameter tuning. Here is an example of a small transformer class that adds the combined attributes we discussed earlier:

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):        # avoid *args and **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self      # nothing else to do
    
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

In this example the transformer has one hyperparameter, <code>add_bedrooms_per_room</code>, which is set to True by default. This hyperparameter will allow you to easily test whether adding this attribute makes a difference to the algorithm. More generally, you can use hyperparameters like this to gate any data preparation step that you are not sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it more likely you will find a great combination and saving you a lot of time.
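The sketch below shows how <code>get_params()</code> and <code>set_params()</code> let you flip such a gate without rebuilding the object. <code>RatioAdder</code> is a hypothetical, stripped-down transformer written only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends column 0 / column 1."""

    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, 0] / X[:, 1]]

X = np.array([[4.0, 2.0], [9.0, 3.0]])

adder = RatioAdder()
print(adder.get_params())            # {'add_ratio': True}
with_ratio = adder.fit_transform(X)  # fit_transform() comes from TransformerMixin

adder.set_params(add_ratio=False)    # switch the gate off
without_ratio = adder.fit_transform(X)
```

This is the same mechanism that hyperparameter search tools rely on to try each setting automatically.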

## Feature Scaling

One of the most important transformations you need to apply is <i>feature scaling</i>. With few exceptions ML algos don't perform well when the input numerical attributes have very different scales. For example the total number of rooms ranges from 6 to 39,320 while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required. 

There are two common ways to get all attributes to have the same scale: <b><i>min-max scaling</i></b> and <b><i>standardization</i></b>.

Min-max scaling works by squashing all values into a range between 0 and 1. We do this by subtracting the minimum and then dividing by the difference between the max and the min. Scikit-Learn provides a transformer called <code>MinMaxScaler</code> for this. It has a <code>feature_range</code> hyperparameter that lets you change the range if you don't want 0-1 for any reason. 

Standardization first subtracts the mean (so standardized values always have a zero mean) and then divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, this does not bound the values to a specific range, which may be a problem for some algorithms (e.g. neural networks often expect an input value to range from 0 to 1). But standardization is much less affected by outliers. For example, suppose a district had a median income of 100 (by mistake). Then min-max scaling would squash everything in the range 0 to 15 down to 0-0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called <code>StandardScaler</code> for standardization.
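To make the two formulas concrete, here is a minimal NumPy sketch. The income values are made up, and <code>min_max</code> and <code>standardize</code> are hypothetical helper names, not Scikit-Learn functions. Standardization here divides by the standard deviation, which is what gives the result unit variance.

```python
import numpy as np

def min_max(x):
    # subtract the minimum, divide by (max - min): result lies in [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # subtract the mean, divide by the standard deviation: mean 0, variance 1
    return (x - x.mean()) / x.std()

incomes = np.array([0.0, 3.0, 5.0, 15.0])   # hypothetical median incomes
with_outlier = np.append(incomes, 100.0)    # one erroneous value of 100

print(min_max(incomes))            # spans the full 0-1 range
print(min_max(with_outlier)[:4])   # original values now squeezed into 0-0.15
print(standardize(with_outlier))   # not bounded to a fixed range
```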

<font color="#850000">
As with all the transformations, only fit the scalers to the training data, not the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
</font><br>
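A minimal sketch of that rule with <code>StandardScaler</code>, using two tiny made-up arrays in place of the real training and test sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the training and test sets (hypothetical values)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from training data only
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on the test set

print(scaler.mean_)      # mean of the training column
print(X_test_scaled)     # test value expressed in training-set units
```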

## Transformation Pipelines

There are many data transformation steps that need to be executed in the right order. Scikit-Learn provides the <code>Pipeline</code> class to help with such sequences:

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

All but the last name/estimator pair in the Pipeline constructor must be transformers (i.e. they must have a <code>fit_transform()</code> method).

When you call the pipeline's <code>fit()</code> method, it calls <code>fit_transform()</code> sequentially on all transformers, passing the output of each one as the input to the next. When the final estimator is reached, it only calls its <code>fit()</code> method.

The pipeline exposes the same methods as the final estimator. 
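The chaining can be checked by running the same steps by hand. This sketch assumes a recent scikit-learn, where <code>SimpleImputer</code> replaces the <code>Imputer</code> class used in this notebook; the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value (hypothetical)
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
via_pipeline = pipe.fit_transform(X)

# The same two steps chained manually: the output of one feeds the next
step1 = SimpleImputer(strategy="median").fit_transform(X)
by_hand = StandardScaler().fit_transform(step1)

print(np.allclose(via_pipeline, by_hand))
```

Because the last step here is a transformer, the pipeline itself has <code>transform()</code> and <code>fit_transform()</code>; the fitted steps remain reachable through <code>pipe.named_steps</code>.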

You have a pipeline for the numerical values and you still need to apply the <code>LabelBinarizer</code> on the categorical values. Scikit-Learn provides a <code>FeatureUnion</code> class to join these transformations into a single pipeline. You give it a list of transformers (which can be entire pipelines), and when its <code>transform()</code> method is called it runs each transformer's <code>transform()</code> method in parallel, waits for their output, then concatenates them and returns the result. Here's a full pipeline handling both numerical and categorical attributes:

In [18]:
from sklearn.pipeline import FeatureUnion

# Refresh to get clean data again
housing, housing_labels = load_data()
housing_num = housing.drop('ocean_proximity', axis=1)

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ..., 
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

In [19]:
housing_prepared.shape

(16512, 16)

The Scikit-Learn version used here has nothing built in to handle Pandas DataFrames, so the <code>DataFrameSelector</code> class fills that gap and also selects the relevant columns for each pipeline (but see Pull Request #3886 in sklearn).
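For readers on newer releases: scikit-learn 0.20 and later include <code>ColumnTransformer</code>, which selects DataFrame columns by name, and a <code>OneHotEncoder</code> that accepts strings directly. The sketch below is an alternative to the approach in this notebook, not the method used above; the toy DataFrame and its values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing DataFrame (hypothetical values)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns); outputs are concatenated in order
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

prepared = preprocess.fit_transform(df)
if hasattr(prepared, "toarray"):   # the result may be sparse, depending on density
    prepared = prepared.toarray()

print(prepared.shape)  # (4, 5): 2 numerical columns + 3 one-hot columns
```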

_______________________________________________________________________________________________________________________________________________________________

<b>That's it for this part. Now pickle the data and make it ready for import in part 3.</b>


In [20]:
import pickle, os

if not os.path.isdir('./data'):
    os.makedirs('./data')

with open('./data/prt2.pkle', 'wb') as f:
    pickle.dump({'housing': housing, 
                 'housing_prepared': housing_prepared, 
                 'housing_labels': housing_labels,
                 'full_pipeline': full_pipeline,
                 'encoder': encoder,
                 'num_attribs': num_attribs,
                }, f, pickle.HIGHEST_PROTOCOL)
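As a rough sketch of the round trip, the block below dumps a dictionary and loads it back. It writes to a temporary directory and uses a made-up payload so that it runs on its own; in part 3 the path would be <code>./data/prt2.pkle</code>.

```python
import os
import pickle
import tempfile

# Toy payload standing in for the dictionary saved above (hypothetical content)
payload = {"num_attribs": ["longitude", "latitude"], "note": "toy stand-in"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prt2.pkle")

    with open(path, "wb") as f:
        pickle.dump(payload, f, pickle.HIGHEST_PROTOCOL)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["num_attribs"])
```

One caveat: pickle stores custom classes by reference, so loading an object such as <code>full_pipeline</code> requires <code>CombinedAttributesAdder</code> and <code>DataFrameSelector</code> to be defined (or importable) in the notebook that loads it.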