# End to End Project
This notebook walks through an end-to-end machine learning project, covering the different stages of the process. As the running example, we predict house prices from `real estate data`.<br>

## Step 1: Framing the problem
Be clear about the end goal. Know what the output of the model will be used for: will it serve directly as the end result, or will it be fed to another machine learning system? Knowing this helps us decide how accurate the result needs to be, which algorithms to use, which performance measure to select, and how much time and how many resources to spend on building the model.<br>

Machine learning systems are usually built from `pipelines`. A sequence of data processing components is called a data pipeline. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and writes the result to another data store, from which the next component pulls its input. Each component is self-contained and interacts with the others only through data stores. This design lets different teams work on different parts of the project, and it makes the system more robust: if one component breaks down, the others can keep working for some time on the last data it produced.<br>
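The idea of components that communicate only through a data store can be sketched in a few lines. This is a minimal illustration, not part of the project code: the function names `ingest` and `add_features`, the JSON file names, and the use of a temporary directory as the "data store" are all assumptions made for the example, and the two components run one after the other here rather than asynchronously.

```python
import json
import tempfile
from pathlib import Path


def ingest(store: Path) -> None:
    """Component 1 (hypothetical): write raw records to the data store."""
    raw = [
        {"total_rooms": 880.0, "households": 126.0},
        {"total_rooms": 7099.0, "households": 1138.0},
    ]
    (store / "raw.json").write_text(json.dumps(raw))


def add_features(store: Path) -> None:
    """Component 2 (hypothetical): read raw records, derive a feature, store the result."""
    records = json.loads((store / "raw.json").read_text())
    for record in records:
        record["rooms_per_household"] = record["total_rooms"] / record["households"]
    (store / "features.json").write_text(json.dumps(records))


# The components never call each other; they only share the data store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    ingest(store)
    add_features(store)
    features = json.loads((store / "features.json").read_text())

print(features)
```

Because `add_features` only depends on the contents of `raw.json`, it could keep running on the last stored file even if `ingest` stopped producing new data.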

It is also worth checking whether solutions to the problem already exist, since they can provide useful insights. With the above in mind, we frame the problem: is it a supervised, unsupervised, or reinforcement learning task? Is it a classification task, a regression task, or something else? Should we use batch learning or online learning? For our project, it is clearly a `supervised learning` task, since we are given labeled training examples (each instance comes with its expected output). It is a `regression` task, since we are asked to predict a value; more specifically, it is a `multiple regression` task because the prediction uses multiple features, and a univariate one because we predict a single value per instance. Finally, since we are working with census data, there is no continuous flow of incoming data, so `batch learning` should do fine.<br>

Techniques such as `MapReduce` are also used to split batch learning work across multiple servers.
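As a rough illustration of the MapReduce idea, the sketch below computes a mean by splitting the data into chunks, summarising each chunk independently (the "map" step), and then combining the partial results (the "reduce" step). It runs in a single Python process; the chunking, the helper names `map_chunk` and `reduce_pair`, and the toy data are assumptions for the example, and a real deployment would distribute the chunks across servers with a dedicated framework.

```python
from functools import reduce

import numpy as np

# Toy data standing in for one numeric column of a large data set.
values = np.arange(1.0, 11.0)

# Split the data into chunks, as if each chunk lived on a different server.
chunks = np.array_split(values, 3)


def map_chunk(chunk):
    """Map step: summarise one chunk as (sum, count)."""
    return (chunk.sum(), chunk.size)


def reduce_pair(a, b):
    """Reduce step: merge two (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])


partials = map(map_chunk, chunks)
total, count = reduce(reduce_pair, partials)
mean = total / count

print(mean)
```

Each chunk is summarised without looking at the others, which is what makes the map step easy to run in parallel.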

## Step 2: Select a performance measure
Typical performance measures for a regression task include:<br>

**Root Mean Square Error (RMSE):** $\text{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}$ where<br>
$m = $ the number of instances.<br> $\mathbf{x}^{(i)} = $ a vector containing all the feature values of the $i$th instance of the data set.<br> $y^{(i)} = $ the label corresponding to the $i$th instance.<br> $\mathbf{X} = $ the matrix whose $i$th row is ${(\mathbf{x}^{(i)})}^T$, i.e. the feature values of the $i$th instance.<br> $h = $ the prediction function of our system.<br> $\text{RMSE}(\mathbf{X}, h) = $ the cost function.<br>

**Mean Absolute Error (MAE):** $\text{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum\limits_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$<br>

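Although **RMSE** is usually the first choice for regression tasks, it is less suitable when the data contains many outliers, because squaring the errors gives large errors a disproportionate weight. In that case **MAE** may be preferable.<br>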
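To make the effect of outliers concrete, here is a small NumPy sketch of both measures. The helper names `rmse` and `mae` and the toy label and prediction values are made up for illustration; they are not part of the housing data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between labels and predictions."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def mae(y_true, y_pred):
    """Mean absolute error between labels and predictions."""
    return np.mean(np.abs(y_pred - y_true))


y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Every prediction is off by exactly 10.
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Same predictions, except the last one is off by 100 (an outlier).
y_pred_outlier = np.array([110.0, 190.0, 310.0, 500.0])

rmse_clean, mae_clean = rmse(y_true, y_pred), mae(y_true, y_pred)
rmse_out, mae_out = rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)

print("without outlier:", rmse_clean, mae_clean)  # both are 10.0
print("with outlier:   ", rmse_out, mae_out)      # MAE is 32.5, RMSE is about 50.7
```

With uniform errors the two measures agree, but a single large error pulls the RMSE up much more than the MAE.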

Both **RMSE** and **MAE** measure the distance between two vectors: the `prediction vector` and the `target vector`. Such distance measures are called `norms`. **RMSE** corresponds to the `Euclidean norm` or $l_2$ norm, and **MAE** corresponds to the `Manhattan norm` or $l_1$ norm. More generally, the $l_k$ norm of a vector $\mathbf{v}$ containing $n$ elements is defined as: $\left\| \mathbf{v} \right\|_k = \left(\left| v_1 \right|^k + \left| v_2 \right|^k + \dots + \left| v_n \right|^k\right)^{\frac{1}{k}}$. $l_0$ gives the number of non-zero elements in the vector, whereas $l_\infty$ gives the maximum absolute value in the vector. The higher the norm index, the more the norm focuses on large values and neglects small ones.
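The general norm can be checked numerically. The sketch below implements the formula directly in a helper called `lk_norm` (a name chosen for this example) and compares the results with `numpy.linalg.norm`, which accepts the norm index through its `ord` parameter.

```python
import numpy as np


def lk_norm(v, k):
    """l_k norm of a vector for a finite k >= 1, straight from the definition."""
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(v) ** k) ** (1.0 / k)


v = np.array([3.0, -4.0, 0.0])

l1 = lk_norm(v, 1)          # |3| + |-4| + |0| = 7
l2 = lk_norm(v, 2)          # sqrt(9 + 16 + 0) = 5
l_inf = np.max(np.abs(v))   # largest absolute value = 4
l0 = np.count_nonzero(v)    # number of non-zero elements = 2

# NumPy's built-in implementation gives the same values.
print(l1, np.linalg.norm(v, ord=1))
print(l2, np.linalg.norm(v, ord=2))
print(l_inf, np.linalg.norm(v, ord=np.inf))
print(l0, np.linalg.norm(v, ord=0))
```

Note that $l_0$ and $l_\infty$ are computed separately here because the general formula cannot be evaluated directly at $k = 0$ or $k = \infty$.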

## Step 3: Check the Assumptions
It is always good practice to cross-check our assumptions against the goal, so that months of work are not wasted in case we end up using the wrong approach.

## Step 4: Downloading Data and Utility Tasks
### 4.1 Imports
These modules are used throughout the project, so we import them first. Other imports are added as we go.

In [1]:
# Utility imports to handle os specific tasks.
import os
import tarfile

# Import for downloading the data sets.
import urllib.request

# For visualizing data.
import matplotlib
import matplotlib.pyplot as plt

# Inline plots as opposed to plots in new windows.
%matplotlib inline

# Pretty plots.
import seaborn as sns
sns.set()

# Imports for crucial data structures and pseudo-random generation.
import numpy as np
import pandas as pd

# To make the output stable across runs.
np.random.seed(42)

# This is done for cross platform compatibility.
root_directory = "."
chapter = "end_to_end_project"

# os.path.join(path, *paths) is used to concatenate passed strings to form a path.
images_path = os.path.join(root_directory, "images", chapter)

# A Utility function to save the generated figures as png files.
# +------------------------------------------------------------------------+
# | parameter     | type    | Comment                                      |
# +------------------------------------------------------------------------+
# | fig_id        | string  | Takes the figure name.                       |
# | tight_layout  | boolean | Decide whether to use tight_layout() or not. |
# | fig_extension | string  | Sets image extension type.                   |
# | resolution    | integer | Stores the dpi.                              |
# +------------------------------------------------------------------------+
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    
    # Construct the image path, creating the target directory if it is missing.
    os.makedirs(images_path, exist_ok=True)
    path = os.path.join(images_path, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    
    # Allows subplots to fit in the figure area.
    if tight_layout:
        plt.tight_layout()
    
    # matplotlib.pyplot.savefig(*args, **kwargs)
    # Some of the parameters include:
    # 1. fname: string or file-like object, in our case it is the path of the figure.
    # 2. dpi: integer value, dots per inch (resolution).
    # 3. format: string, contains the file extension.
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

### 4.2 Data Acquisition
We shall now download the data sets required for our project from the `handson-ml` repository.

In [2]:
# URL to fetch our data set.
repo_url = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
data_url = repo_url + "datasets/housing/housing.tgz"

# Path to our data on the system.
data_path = os.path.join("datasets", "housing")

# A utility function to download and extract the data set from the repository.
def fetch_housing_data(housing_url=data_url, housing_path=data_path):
    
    # Create the directory if it does not exist.
    os.makedirs(housing_path, exist_ok=True)
    
    # Full path to our downloaded archive.
    tgz_path = os.path.join(housing_path, "housing.tgz")
    
    # Download request for our data set.
    urllib.request.urlretrieve(housing_url, tgz_path)
    
    # Open the archive (read mode by default) and extract its contents into housing_path.
    # The with-statement closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [3]:
fetch_housing_data()

In [4]:
# Loads the data set into our program.
def load_housing_data(housing_path=data_path):
    csv_path = os.path.join(housing_path, "housing.csv")
    
    # Returns a pandas DataFrame object.
    return pd.read_csv(csv_path)

### 4.3 A Look at the structure of Data
We start with a quick look at the first few entries using `pandas.DataFrame.head()`.

In [5]:
housing = load_housing_data()

# DataFrame.head(n=5), n:integer, number of entries to be shown.
housing.head(10)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |


Above are the first few rows of our real estate data. As we can see, each row has `ten attributes`. We can generate information about this `DataFrame` using the `pandas.DataFrame.info()` method.
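As a small, self-contained illustration of what `info()` reports, the sketch below rebuilds the first two rows shown above by hand and calls the method on them. The variable name `sample` is chosen for this example; on the full data set the call would simply be made on `housing`.

```python
import pandas as pd

# The first two rows of the table above, typed in by hand.
sample = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "housing_median_age": [41.0, 21.0],
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
    "median_income": [8.3252, 8.3014],
    "median_house_value": [452600.0, 358500.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

# Prints the index range, each column's name, non-null count and dtype,
# and the approximate memory usage.
sample.info()

n_rows, n_cols = sample.shape
print(n_rows, "rows,", n_cols, "columns")
```

The per-column non-null counts are the part to watch on the real data, since a count lower than the number of rows signals missing values.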