## Chapter 2: An End-to-End Machine Learning Project

In this notebook, we will go through an end-to-end example project, pretending to be a recently hired data scientist in a real estate company. Here are the main steps we will go through:

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune the model.
7. Present the solution.
8. Launch, monitor, and maintain the system.

### 1. Look at the big picture

For this project, we are asked to build a model of housing prices in California using the California census data. The data include metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people); we will call them districts for short. The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics. Its output will be used as one factor in deciding whether it is worth investing in a given area.

**Frame the problem**

1. Define the project objective.
2. How will the solution be used?
3. What are the current solutions?
4. What type of Machine Learning is needed?
5. How should performance be measured?
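
For the last question, a common choice for regression problems is the root mean squared error (RMSE), with the mean absolute error (MAE) as an alternative that is less sensitive to outliers. The sketch below computes both on made-up prices; the helper names `rmse` and `mae` and the numbers are illustrative assumptions, not values from the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# made-up median house values and predictions, in dollars
y_true = [200_000, 150_000, 310_000, 90_000]
y_pred = [210_000, 140_000, 250_000, 95_000]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

The single large miss (60,000 dollars on the third district) pulls the RMSE well above the MAE, which is why the two metrics can rank models differently when the data contain outliers.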


### 2. Get the data

In typical environments your data would be available in a relational database and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations. 

In this project, the data can be simply downloaded from 

"https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.tgz"

Extract *housing.csv* from the tgz file, then load the file as a pandas DataFrame.
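
The download and extraction can also be scripted. The following is a minimal sketch using only the standard library; the function name `fetch_housing_data` and the target directory are assumptions chosen to match the path used when loading the CSV in this notebook.

```python
import tarfile
import urllib.request
from pathlib import Path

HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml/"
               "master/datasets/housing/housing.tgz")
HOUSING_DIR = Path("Data") / "CaliforniaHousing"  # assumed local layout

def fetch_housing_data(url=HOUSING_URL, target_dir=HOUSING_DIR):
    """Download the archive at `url` and extract it into `target_dir`.

    Returns the expected path of the extracted housing.csv.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive_path = target_dir / "housing.tgz"
    urllib.request.urlretrieve(url, str(archive_path))
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target_dir)
    return target_dir / "housing.csv"
```

Calling `fetch_housing_data()` once (it needs network access) should leave *housing.csv* in `Data/CaliforniaHousing`, ready for `pd.read_csv`.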

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# the path is relative to the notebook's working directory
housing = pd.read_csv(os.path.join('Data', 'CaliforniaHousing', 'housing.csv'))

### 3. Explore the data

1. Display the top five rows using the head() method. Learn the attributes in the data.
2. Get a quick description of the data using the info() method. Learn the total number of rows, each attribute's type, and number of non-null values.
3. Display the frequencies of categorical attributes using the value_counts() method.
4. Display a summary of numerical attributes using the describe() method.
5. Plot histograms for each numerical attribute to get a feel for its distribution.

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
housing.ocean_proximity.value_counts()
# equivalent: housing['ocean_proximity'].value_counts()

In [None]:
housing.describe()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

Have you noticed these things?
1. There are 207 missing values for attribute total_bedrooms.
2. The median income attribute does not look like it is expressed in US dollars.
3. The housing median age and the median house value are capped.
4. These attributes have very different scales.
5. Most attributes are right-skewed.
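
Two of these observations (missing values and capped attributes) can be checked numerically rather than by eye. The sketch below uses a small made-up DataFrame; the helper `missing_and_capped` is a hypothetical name, and a high share of rows sitting exactly at the column maximum is only a hint that a value was capped, not proof.

```python
import numpy as np
import pandas as pd

def missing_and_capped(df):
    """Per numeric column: count of missing values and the share of
    rows whose value equals the column maximum."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "missing": numeric.isna().sum(),
        "share_at_max": (numeric == numeric.max()).mean(),
    })

# toy data standing in for the housing DataFrame
toy = pd.DataFrame({
    "total_bedrooms": [120.0, np.nan, 300.0, 80.0, np.nan, 210.0],
    "median_house_value": [500001.0, 180000.0, 500001.0,
                           95000.0, 500001.0, 260000.0],
})

report = missing_and_capped(toy)
print(report)
```

Running `missing_and_capped(housing)` on the real data should reproduce the missing-value count reported by `info()` and make the capped attributes stand out.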

### Visualizing Geographical Data


In [None]:
# 1. scatter plot of geographical data
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

In [None]:
# 2. scatter plot with colors indicating house values
housing.plot(kind="scatter",
             x="longitude",
             y="latitude",
             alpha=0.4,
             s=housing["population"]/100,
             label="population",
             figsize=(10,7),
             c="median_house_value",
             cmap=plt.get_cmap("jet"),
             colorbar=True,
             sharex=False)
plt.legend()

In [None]:
# This cell requires california.png file downloaded
# from textbook GitHub repository

# scatter plot on California map
import matplotlib.image as mpimg
import numpy as np
california_img=mpimg.imread('Data/CaliforniaHousing/california.png')
ax = housing.plot(kind="scatter",
                  x="longitude",
                  y="latitude",
                  figsize=(10,7),
                  s=housing['population']/100,
                  label="Population",
                  c="median_house_value",
                  cmap=plt.get_cmap("jet"),
                  colorbar=False,
                  alpha=0.4)
plt.imshow(california_img,
           extent=[-124.55, -113.80, 32.45, 42.05],
           alpha=0.5)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

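prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# attach the colorbar to the scatter points (not the background image)
# and place the ticks explicitly so the labels line up with them
cbar = plt.colorbar(ax.collections[0], ax=ax, ticks=tick_values)
cbar.ax.set_yticklabels(["$%dk" % round(v / 1000) for v in tick_values],
                        fontsize=14)
cbar.set_label('Median House Value', fontsize=16)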

plt.legend(fontsize=16)
plt.show()

### Correlation between attributes

1. Use .corr() to display the standard correlation coefficient between median house value and each input feature.
2. Use pandas.plotting.scatter_matrix() to visualize the correlation.

- Correlation coefficient ranges from -1 to 1.
- When it is close to 1, it means that there is a strong positive correlation.
- When it is close to -1, it means that there is a strong negative correlation.
- When it is close to zero, it means that there is no *linear* correlation (other, nonlinear relationships may still exist).
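
The last point is easy to overlook, so here is a small illustration on synthetic data (the column names are made up for this example): a perfectly deterministic quadratic relationship still has a correlation coefficient of about zero.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)  # symmetric around zero
df = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,     # perfect positive linear relationship
    "inverse": -3 * x,       # perfect negative linear relationship
    "quadratic": x ** 2,     # fully determined by x, but not linearly
})

corr_with_x = df.corr()["x"]
print(corr_with_x.round(3))
```

The `quadratic` column depends entirely on `x`, yet its coefficient is essentially zero because the increase on the right half cancels the decrease on the left half.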

In [None]:
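# restrict to numeric columns: recent pandas versions raise an error
# when corr() meets the text attribute ocean_proximity
housing.select_dtypes(include='number').corr()['median_house_value'].sort_values(ascending=False)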

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

### Experimenting with combinations of attributes

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()
print(housing.select_dtypes(include='number').corr()['median_house_value'].sort_values(ascending=False))

### 4. Prepare the data for Machine Learning algorithms


1. Impute missing values with the median (use sklearn.impute.SimpleImputer, which replaced sklearn.preprocessing.Imputer; the old class was removed in scikit-learn 0.22).
2. Convert categorical attributes to numerical attributes (use sklearn.preprocessing.OneHotEncoder).
3. Add extra useful attributes (custom transformer).
4. Feature scaling: use sklearn.preprocessing.StandardScaler to scale attributes to zero mean and unit variance.
5. Split the data into training set and test set. (Should we use purely randomized splitting?)

**Question**
- Should we use simple random sampling to obtain test data?
- Should we use the entire dataset to build feature scaler?
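
One possible answer to both questions, sketched on synthetic data: stratify the split on a binned version of an important attribute so that the test set mirrors the overall distribution, and fit the scaler on the training set only so that no test-set statistics leak into training. The toy columns, the bin edges, and the `income_cat` name are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000),
    "total_rooms": rng.integers(100, 5000, size=1000).astype(float),
})

# bin the income into 5 categories (integer codes 0..4) for stratification
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                           labels=False)

train, test = train_test_split(toy, test_size=0.2,
                               stratify=toy["income_cat"],
                               random_state=42)

# fit the scaler on the training features only, then reuse it on the test set
feature_cols = ["median_income", "total_rooms"]
scaler = StandardScaler().fit(train[feature_cols])
train_scaled = scaler.transform(train[feature_cols])
test_scaled = scaler.transform(test[feature_cols])

print(toy["income_cat"].value_counts(normalize=True).sort_index())
print(test["income_cat"].value_counts(normalize=True).sort_index())
```

The two printed distributions should be nearly identical, which a purely random split does not guarantee on a small test set.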

In [None]:
# Impute total_bedrooms with median value
# median = housing['total_bedrooms'].median()
# housing['total_bedrooms'].fillna(median, inplace=True)

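# SimpleImputer replaces sklearn.preprocessing.Imputer,
# which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
# drop the text attribute and the exploratory ratio columns added above;
# the custom transformer in a later cell recreates the ratios
housing_num = housing.drop(['ocean_proximity',
                            'rooms_per_household',
                            'bedrooms_per_room',
                            'population_per_household'],
                           axis=1, errors='ignore')
housing_num_columns = housing_num.columns
housing_cat = housing['ocean_proximity']
imputer = SimpleImputer(strategy='median')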
imputer.fit(housing_num)
print('median values:', housing_num.median().values)
print('imputer statistics:', imputer.statistics_)
housing_num = imputer.transform(housing_num)
housing_num = pd.DataFrame(housing_num,
                           columns=housing_num_columns)
housing_num.info()

In [None]:
# Preprocess categorical feature 'ocean_proximity'
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_encoded, housing_categories = housing_cat.factorize()
print('housing_categories:', housing_categories)
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
print(housing_cat_1hot.toarray()[:5])

housing_cat = pd.DataFrame(housing_cat_1hot.toarray(),
                           columns=housing_categories)
housing_cat.head()
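
As an alternative sketch, recent scikit-learn versions (0.20 and later) let `OneHotEncoder` work on string categories directly, so the `factorize()` step is optional. The toy column below is made up; note that the encoder orders categories alphabetically rather than by first appearance.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN",
                        "INLAND", "NEAR OCEAN"],
})

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(toy_cat)  # sparse matrix by default

one_hot_df = pd.DataFrame(one_hot.toarray(),
                          columns=encoder.categories_[0],
                          index=toy_cat.index)
print(one_hot_df)
```

Each row contains exactly one 1, in the column of its category.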

In [None]:
# create a custom transformer to add extra attributes: 
# 1. rooms_per_household
# 2. population_per_household
# 3. (optional) bedrooms_per_room
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X,
                         rooms_per_household,
                         population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X,
                         rooms_per_household,
                         population_per_household]

# apply the above class to add extra attributes
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing_num.values)

# convert it to a dataframe
extra_columns = ['rooms_per_household', 'population_per_household']
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing_num.columns) + extra_columns)
housing_extra_attribs.head()

In [None]:
# Feature Scaling
housing_prepared = pd.concat([housing_extra_attribs, housing_cat], axis=1)
housing_prepared_columns = housing_prepared.columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
housing_prepared = scaler.fit_transform(housing_prepared)
housing_prepared = pd.DataFrame(housing_prepared,
                            columns=housing_prepared_columns)
print('shape:', housing_prepared.shape)
housing_prepared.head()
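
The preparation steps above can also be chained so that they are fitted and applied together. The following is a minimal sketch with `Pipeline` and `ColumnTransformer` on a made-up three-column frame; the step names and the toy values are assumptions, and the custom attribute adder is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# numeric columns: median imputation followed by standardization
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
], sparse_threshold=0)

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

Because the whole object exposes `fit` and `transform`, it can be fitted on the training set and then applied unchanged to the test set.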

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
housing_train, housing_test = train_test_split(housing_prepared,
                                               test_size=0.2,
                                               random_state=1)
print('training set:', housing_train.shape)
print('test set:', housing_test.shape)
housing_train_labels = housing_train.pop('median_house_value')
housing_test_labels = housing_test.pop('median_house_value')

### 5. Select and Train a Model

1. Apply a Machine Learning model to the training set.
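2. Measure the performance of the model, first on the training set and then with cross-validation; the test set is kept aside for the final evaluation in step 6.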

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_train, housing_train_labels)

In [None]:
from sklearn.metrics import mean_squared_error

housing_train_predictions = lin_reg.predict(housing_train)
lin_mse = mean_squared_error(housing_train_labels,
                             housing_train_predictions)
print('MSE on training set', lin_mse)

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_train_labels,
                              housing_train_predictions)
print('MAE on training set', lin_mae)

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_train, housing_train_labels)

In [None]:
housing_train_predictions = tree_reg.predict(housing_train)
tree_mse = mean_squared_error(housing_train_labels,
                              housing_train_predictions)
print('MSE on training set', tree_mse)

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_train, housing_train_labels)

In [None]:
housing_train_predictions = forest_reg.predict(housing_train)
forest_mse = mean_squared_error(housing_train_labels,
                              housing_train_predictions)
print('MSE on training set', forest_mse)

In [None]:
# Better evaluation using cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg,
                         housing_train,
                         housing_train_labels,
                         scoring="neg_mean_squared_error",
                         cv=10)
print('Decision Tree:', scores)

In [None]:
scores = cross_val_score(lin_reg,
                         housing_train,
                         housing_train_labels,
                         scoring="neg_mean_squared_error",
                         cv=10)
print('Linear Regression:', scores)

In [None]:
scores = cross_val_score(forest_reg,
                         housing_train,
                         housing_train_labels,
                         scoring="neg_mean_squared_error",
                         cv=10)
print('Random Forest:', scores)
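
Note that `scoring="neg_mean_squared_error"` returns the negative of the MSE, because scikit-learn's cross-validation expects a score where greater is better. A short sketch of converting such scores to RMSE, using synthetic regression data rather than the housing set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression problem standing in for the housing data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=42)

neg_mse_scores = cross_val_score(LinearRegression(), X, y,
                                 scoring="neg_mean_squared_error", cv=5)

# flip the sign, then take the square root to get RMSE per fold
rmse_scores = np.sqrt(-neg_mse_scores)
print("RMSE per fold:", rmse_scores.round(2))
print("mean:", rmse_scores.mean().round(2),
      "std:", rmse_scores.std().round(2))
```

Comparing the mean and standard deviation of the per-fold RMSE across models is usually easier to read than the raw negative MSE arrays.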

### 6. Fine-Tune the Model
Search for a good combination of hyperparameter values for the random forest model.
- Grid search: give a few possible values for each hyperparameter, then try all combinations.
- Random search: select values randomly (efficient when there are a large number of hyperparameters)
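
To make the difference concrete, the sketch below counts the combinations a grid search would try and draws a few candidates the way a random search would. It reuses the grid and distributions from the search cells in this section; `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers for enumerating and sampling such spaces, and the variable names are illustrative.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10],
     'max_features': [2, 3, 4]},
]

# grid search: every combination is tried once per cross-validation fold
n_combinations = len(ParameterGrid(param_grid))
print("combinations:", n_combinations)
print("model fits with cv=5:", n_combinations * 5)

# random search: a fixed number of candidates drawn from distributions
param_distributions = {'bootstrap': [True, False],
                       'n_estimators': randint(2, 30),
                       'max_features': randint(2, 10)}
samples = list(ParameterSampler(param_distributions, n_iter=10,
                                random_state=42))
for candidate in samples[:3]:
    print(candidate)
```

With a random search, the budget (`n_iter`) is fixed up front and does not grow as more hyperparameters are added, which is the main practical advantage over a grid.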

**Analyze the best model and its error**
- Does this model make sense?
- Should less important features be dropped?
- Does the model make any typical errors?

**Evaluate the model on the test set**
- (Transform the test data, if this has not been done already.)
- Analyze the performance of the model on the test set.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30],
     'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap
    # set as False
    {'bootstrap': [False],
     'n_estimators': [3, 10],
     'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of
# (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg,
                           param_grid,
                           cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_train,
                housing_train_labels)

In [None]:
# The best hyperparameter combination found:
print('best parameters:', grid_search.best_params_)

# The best model with above parameters
best_model = grid_search.best_estimator_
housing_train_pred = best_model.predict(housing_train)
print('MSE:', mean_squared_error(housing_train_pred,
                                 housing_train_labels))
scores = cross_val_score(best_model,
                          housing_train,
                          housing_train_labels,
                          cv=10,
                          scoring="neg_mean_squared_error")
print(scores)

In [None]:
# Randomized search
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

params = {'bootstrap': [True, False],
          'n_estimators': randint(2, 30),
          'max_features': randint(2, 10),}

forest_reg = RandomForestRegressor(random_state=42)

random_search = RandomizedSearchCV(forest_reg,
                                   params,
                                   cv=5,
                                   scoring='neg_mean_squared_error',
                                   return_train_score=True)
random_search.fit(housing_train,
                  housing_train_labels)

In [None]:
# The best hyperparameter combination found:
print('best parameters:', random_search.best_params_)

# The best model with above parameters
best_model = random_search.best_estimator_
housing_train_pred = best_model.predict(housing_train)
print('MSE:', mean_squared_error(housing_train_pred,
                                 housing_train_labels))
scores = cross_val_score(best_model,
                          housing_train,
                          housing_train_labels,
                          cv=10,
                          scoring="neg_mean_squared_error")
print(scores)

In [None]:
# Analyze the best model
feature_importance = best_model.feature_importances_
attributes = housing_train.columns
sorted(zip(feature_importance, attributes),
       reverse=True)


In [None]:
# Evaluate the model on the test set
housing_test_pred = best_model.predict(housing_test)
best_mse = mean_squared_error(housing_test_labels,
                              housing_test_pred)
print('MSE on test set:', best_mse)

### 8. Launch, Monitor, and Maintain the System
- Monitor the live performance of the system
- Retrain the model with new data
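
One simple way to monitor live performance, sketched under the assumption that true prices eventually become available for recent predictions: compare the RMSE on fresh data against the RMSE measured at launch and flag the model when it degrades past a tolerance. The function name `needs_retraining`, the tolerance of 1.2, and all numbers below are hypothetical.

```python
import numpy as np

def needs_retraining(y_true, y_pred, baseline_rmse, tolerance=1.2):
    """Return (flag, live_rmse); flag is True when the RMSE on recent
    data exceeds tolerance * baseline_rmse."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    live_rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return live_rmse > tolerance * baseline_rmse, live_rmse

# made-up recent observations and the predictions made for them
recent_true = [200_000, 300_000, 150_000]
recent_pred = [260_000, 210_000, 230_000]

flag, live_rmse = needs_retraining(recent_true, recent_pred,
                                   baseline_rmse=50_000)
print("live RMSE:", round(live_rmse), "retrain:", flag)
```

In practice such a check would run on a schedule, and a raised flag would trigger retraining on the data collected since the last training run.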