# Chapter 2. End-to-End Machine Learning Project

In this chapter you will work through an example project end to end,
pretending to be a recently hired data scientist at a real estate company. Here
are the main steps you will go through:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.


## Working with Real Data

When you are learning about Machine Learning, it is best to experiment with
real-world data, not artificial datasets. Fortunately, there are thousands of
open datasets to choose from, ranging across all sorts of domains. Here are a
few places you can look to get data:


- Popular open data repositories
    - [UC Irvine ML repo](https://archive.ics.uci.edu/ml/index.php)
    - [Kaggle Datasets](https://www.kaggle.com/datasets)
    - [Amazon AWS Datasets](https://registry.opendata.aws/)
- Meta portals (they list open data repositories)
    - [Data Portals](http://dataportals.org/)
    - [OpenDataMonitor](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)
    - [Quandl](https://www.quandl.com/)
- Other pages listing many open data repositories
    - [Wikipedia](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)
    - [Quora](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
    - [The Datasets Subreddit](https://www.reddit.com/r/datasets)

In this chapter we’ll use the California Housing Prices dataset from the
StatLib repository. This dataset is based on data from the
1990 California census. It is not exactly recent (a nice house in the Bay Area
was still affordable at the time), but it has many qualities for learning, so we
will pretend it is recent data. For teaching purposes I’ve added a categorical
attribute and removed a few features.

## 1. Look at the Big Picture

Our first task is to use California census data to build a model of housing prices in the state. For each block group in California, this data includes metrics such as:
- Population
- Median income
- Median housing price

A block group is the smallest geographical unit for which census data is published. A block group typically has a population of 600 to 3,000 people. We will call them "districts" for short.

Our model should be able to predict the median housing price for any district, given the other features.

### Framing the Problem:
- The task is to build a model of housing prices in California using census data.
- The model should predict the median housing price in any district based on various metrics.
- The predicted prices will be used in a downstream Machine Learning system for real estate investments.

Pipelines:
- Data processing components in a data pipeline are used to manipulate and transform the data.
- Components run asynchronously, processing data and passing the results to the next component.
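- The architecture is robust: if one component breaks, downstream components can often keep running on its last output. Without proper monitoring, however, a broken component can go unnoticed, the data goes stale, and overall performance drops.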
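
To make the description above concrete, here is a minimal sketch of components chained by queues, each running in its own thread. The names `component`, `run_pipeline`, and `_DONE`, as well as the toy transforms, are assumptions made up for illustration; a production pipeline would more likely use separate processes or services with a persistent data store between them.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream (hypothetical convention)


def component(transform, inbox, outbox):
    """Pull items from inbox, transform them, and push the results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # tell the next stage that the stream has ended
            break
        outbox.put(transform(item))


def run_pipeline(items, transforms):
    """Run each transform in its own thread, chained together by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=component, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for thread in threads:
        thread.start()

    # Feed the first stage, then signal the end of the input
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    # Collect whatever the last stage produces
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for thread in threads:
        thread.join()
    return results


# Toy stages: strip whitespace, parse to int, convert thousands to units
cleaned = run_pipeline([" 3 ", "4", " 5"], [str.strip, int, lambda x: x * 1000])
print(cleaned)
```

This sketch has no monitoring at all: if a transform raised an exception, its thread would die and every downstream stage would wait forever, which is the kind of silent failure the last point warns about.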

Current Solution:
- The current solution involves manual estimation of district housing prices by experts.
- Estimates are often inaccurate, with deviations of more than 20% from the actual prices.
- Building a model to predict median housing prices using census data is seen as a cost-effective and accurate alternative.

Designing the System:
- The problem is framed as a supervised multiple regression task.
- The goal is to predict the median housing price using multiple features of the district.
- Batch learning is suitable as the data is small enough to fit in memory.
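
As a rough illustration of what "multiple regression" and "batch learning" mean here, the sketch below fits a linear model to several features at once, using the whole in-memory training set in a single step. The data is synthetic and the feature meanings are made up; it is not the housing dataset, and plain least squares via NumPy is only one assumed choice of model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for two district features (NOT the real dataset)
X = rng.uniform(0, 10, size=(50, 2))
# Noiseless linear target so the true coefficients are known: bias 5, weights 2 and 3
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 5.0

# Add a column of ones so the model can learn the bias term
X_b = np.column_stack([np.ones(len(X)), X])

# Batch learning: all training instances are used at once to solve for theta
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Predict a single value per instance from multiple features
predictions = X_b @ theta
print(theta)
```

Because the target was generated without noise, the fitted parameters should match the ones used to build it, up to floating-point error.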

### Select a Performance Measure:
- The Root Mean Square Error (RMSE) is a typical performance measure for regression tasks.
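- RMSE gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors.
- Mean Absolute Error (MAE) may be preferable when there are many outlier districts, since it is less sensitive to outliers than RMSE.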
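
The difference between the two measures is easiest to see on a small example. The sketch below computes both with NumPy; the helper names `rmse` and `mae` and the price values are hypothetical, chosen so that one prediction is far off.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))


# Hypothetical prices (in thousands of dollars); the last prediction is far off
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 300]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE: {rmse_value:.1f}, MAE: {mae_value:.1f}")
```

Three of the errors are 10 and one is 100, so the MAE is 32.5 while the RMSE is about 50.7: squaring the errors lets the single large miss dominate the RMSE.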

### Check the Assumptions:
- Assumptions need to be verified to ensure the system aligns with downstream requirements.
- In this case, it is confirmed that the downstream system needs actual prices, not categories.
- It is crucial to verify assumptions early on to avoid wasting time on the wrong approach.

Overall, the task involves building a model to predict housing prices in California using census data. The model will be integrated into a data pipeline for real estate investments. The current manual estimation process is costly and inaccurate, leading to the need for a more efficient and accurate solution. The system will be designed as a supervised multiple regression task using batch learning techniques. Performance will be evaluated using the RMSE measure. Assumptions about the downstream system's requirements have been verified to ensure the system's alignment with expectations.


## 2. Get the Data

### Download the Data

In [1]:
# Setup
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

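# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the numeric (major, minor) parts; comparing version strings is lexicographic
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)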

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [2]:
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [4]:
import os
import tarfile
import urllib.request

In [5]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [6]:
# Function to download and extract the housing data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then extract it into the same directory
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [7]:
fetch_housing_data()
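
Calling `fetch_housing_data()` downloads the archive again on every run. One possible variant, sketched below, skips the download when the archive is already on disk. The function name `fetch_archive_once`, its parameters, and its return value are assumptions for illustration and are not part of the original notebook.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_archive_once(url, dest_dir, archive_name="housing.tgz"):
    """Download and extract a tarball unless the archive is already on disk.

    Returns True if a download happened, False if the existing archive was reused.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name

    if archive.is_file():
        # Assume a previous run already downloaded and extracted the data
        return False

    urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive) as tgz:
        tgz.extractall(path=dest)
    return True


# Example call (performs a network request, so it is left commented out):
# fetch_archive_once(HOUSING_URL, HOUSING_PATH)
```

The sketch treats the presence of the archive as proof that extraction succeeded; if a previous run was interrupted midway, deleting the archive forces a fresh download.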
