# Chapter 2: End-to-End Machine Learning Project

In [21]:
import numpy as np
import pandas as pd

**Main Steps:**
1. Big picture
2. Get the data
3. Discover and visualise
4. Prepare data
5. Select and train model
6. Fine-tune model
7. Present solution
8. Launch, monitor, and maintain

## Look at the Big Picture

**Goal:** use census data to predict the median housing price per district.

### Frame the Problem

**Questions:**
1. What is the end business objective?
2. What is the current solution?
    - Gives a reference for performance and insights on possible solutions
3. Frame the problem
    - Supervised, unsupervised, reinforcement etc.
    - Problem type (regression, classification etc.)
    - Batch learning or online learning?
    
### Select a Performance Measure

Common performance measures for regression problems are:
- *Root Mean Square Error (RMSE):*
\begin{equation}
    \text{RMSE}(\mathbf{X}, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^m \left( h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 }
\end{equation}

- *Mean Absolute Error (MAE):*
\begin{equation}
    \text{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^m \Big\lvert h(\mathbf{x}^{(i)}) - y^{(i)}\Big\rvert
\end{equation}

- Other $\ell_k$ norms (RMSE corresponds to the $\ell_2$ norm, MAE to the $\ell_1$ norm)

The higher $k$, the more the norm is dominated by large values, so RMSE is more sensitive to outliers than MAE. RMSE is preferable when outliers are exponentially rare (as in a bell-shaped distribution); otherwise MAE may be the better choice.
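
To make the outlier sensitivity concrete, here is a minimal sketch that computes both measures on made-up numbers. The helper names `rmse` and `mae` and the data are illustrative assumptions, not part of the textbook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up targets; every prediction is off by exactly 1
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = y_true + 1.0
rmse_clean, mae_clean = rmse(y_true, y_pred), mae(y_true, y_pred)
# clean case: RMSE = 1.0, MAE = 1.0

# Same predictions, but the last one is now off by 21 (a single outlier)
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] += 20.0
rmse_out, mae_out = rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)
# with outlier: MAE = 6.0, RMSE = sqrt(111) ≈ 10.54
```

With identical errors the two measures agree. A single large error pulls RMSE up much further than MAE.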

### Check the Assumptions

List and verify the assumptions made so far, e.g. does the downstream system need exact values (a regression problem), or would categories be enough (a classification problem)?

## Get the Data

### Create the Workspace

Blah Blah

### Download the Data

- Good to have a function that downloads the data
- Write a script that uses the function to fetch latest data
- *Optional:* schedule a job to fetch latest data automatically at regular intervals
- Also should write function to load data
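
A minimal sketch of such a pair of functions, assuming the data is a single CSV file reachable by URL. The names `fetch_data` and `load_data`, the folder layout, and the example URL are all hypothetical.

```python
import urllib.request
from pathlib import Path

import pandas as pd

def fetch_data(url, dest_dir, filename):
    """Download `url` to `dest_dir/filename` and return the local path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    dest_path = dest_dir / filename
    urllib.request.urlretrieve(url, dest_path)
    return dest_path

def load_data(csv_path):
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(csv_path)

# Example usage (placeholder URL, not a real dataset location):
# path = fetch_data("https://example.com/data.csv", "datasets", "data.csv")
# df = load_data(path)
```

Keeping download and load separate means a scheduled job can call `fetch_data` on its own, while notebooks only need `load_data`.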

### Take a Quick Look at the Data Structure

- `df.head()`
- `df.info()`
    - Note missing values
    - List data types: categorical (ordinal/nominal) or numerical (discrete/continuous/interval)
- `df.describe()`
- List the different values of a discrete column and their frequencies using `df['col'].value_counts()`
- Histograms of numerical data using `df.hist(bins=50, figsize=(20, 15))`
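
The non-plotting checks above can be sketched on a tiny made-up DataFrame. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data: a numerical column with one gap and a categorical column
df = pd.DataFrame({
    "rooms": [3.0, 5.0, np.nan, 4.0, 6.0],
    "area": ["coast", "inland", "coast", "coast", "inland"],
})

# Missing values per column (df.info() reports the same thing as non-null counts)
missing_per_column = df.isna().sum()

# Frequency of each category in a single column
area_counts = df["area"].value_counts()

# Summary statistics; by default only numerical columns are included
summary = df.describe()
```

Here `rooms` has one missing value, and `describe()` skips the `area` column because it is not numerical.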

### Create a Test Set

- **Data snooping bias:** looking at the *test* set (even briefly) can steer you towards a model that fits its quirks, so the estimated generalisation error ends up too optimistic
- `train_test_split` from Scikit-Learn splits data into training and test set

In [47]:
from sklearn.model_selection import train_test_split

# Uniformly sampled data from 0 to 100
data = pd.DataFrame(100 * np.random.random(50), columns=['cont'])

# Split into 80/20 proportions
train, test = train_test_split(data, test_size=0.2)

**Issue:**
- Isn't reproducible: running it again results in different split
- One solution: do this once and save
- Another: set `random_state=42` to control shuffling and ensure reproducible output
- But: neither of these works if you update the dataset. The textbook suggests a potential solution: decide whether each instance goes in the test set from a hash of its (unique, stable) identifier
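
A minimal sketch of the hashed-identifier idea. It assumes each row has a stable `id` column and hashes the identifier's string form with `crc32`. The names `is_in_test_set` and `split_by_id` are hypothetical, and the textbook's exact implementation may differ.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_in_test_set(identifier, test_ratio):
    """True if the identifier's hash falls in the lowest `test_ratio` of the 32-bit hash range."""
    hashed = crc32(str(identifier).encode("utf-8"))  # unsigned 32-bit integer
    return hashed < test_ratio * 2**32

def split_by_id(df, test_ratio, id_column):
    """Split `df` into (train, test) based only on each row's identifier."""
    in_test = df[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return df.loc[~in_test], df.loc[in_test]

# Made-up data with an explicit identifier column
data = pd.DataFrame({"id": range(1000), "cont": np.linspace(0, 100, 1000)})
train, test = split_by_id(data, 0.2, "id")
```

Because membership depends only on the identifier, an instance that lands in the test set stays there when new rows are added later. The split is only approximately 80/20.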

**Stratified Sampling:**
- If dataset isn't large enough, random selection of test set can introduce sampling bias
- Stratified sampling guarantees test set is representative of population by controlling for specific factors
- Population is divided into *strata* and sample has same proportions in each stratum
- Ex: controlling for gender
- Use `StratifiedShuffleSplit` from Scikit-Learn

In [58]:
# Uniformly sampled data from 0 to 100
data = pd.DataFrame(100 * np.random.random(1000), columns=['cont'])

# pd.cut to bin data - useful for turning continuous data into categorical (e.g. strata)
data['cat'] = pd.cut(data['cont'], 
                     bins=[0, 33, 66, 100], 
                     labels=['low', 'medium', 'high'], 
                     include_lowest=True)

data.head(10)

| | cont | cat |
|---|---|---|
| 0 | 14.412158 | low |
| 1 | 77.628944 | high |
| 2 | 76.401231 | high |
| 3 | 6.0732 | low |
| 4 | 33.731694 | medium |
| 5 | 88.475088 | high |
| 6 | 80.144287 | high |
| 7 | 12.53228 | low |
| 8 | 7.919757 | low |
| 9 | 54.044066 | medium |


In [59]:
from sklearn.model_selection import StratifiedShuffleSplit

# Initialise split object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# The loop runs exactly once (n_splits=1); split() is a generator, hence the loop syntax
# 2nd arg gives the labels to stratify on; 1st arg only supplies n_samples, so np.zeros(n_samples) would also work
for train_index, test_index in split.split(data, data['cat']):
    # split() yields positional indices, so use iloc rather than loc
    train = data.iloc[train_index]
    test = data.iloc[test_index]

# Function to compare proportions in the different sets
def cat_proportions(df):
    return df['cat'].value_counts() / len(df)

# DataFrame to store results
compare_props = pd.DataFrame({
    'Overall': cat_proportions(data),
    'Train': cat_proportions(train),
    'Test': cat_proportions(test),
}).sort_index()

compare_props

| | Overall | Train | Test |
|---|---|---|---|
| low | 0.319 | 0.31875 | 0.32 |
| medium | 0.335 | 0.335 | 0.335 |
| high | 0.346 | 0.34625 | 0.345 |


## Other Terminology

**Data Pipeline**: a sequence of data processing components
- Components run asynchronously and outputs are stored in data stores between components
- Components are self-contained so if one component fails, downstream components can continue using last output
- Monitoring is important so failing components can be caught and fixed

**MapReduce:** a programming paradigm that uses parallel, distributed algorithms to process data
- Lets a group of computers (each with its own memory) process a dataset that is too large for a single machine
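
A toy, single-process illustration of the two steps using a word count. The chunks stand in for data held on separate machines. This is only a sketch of the idea, not a distributed implementation, and the function names are made up.

```python
from collections import Counter
from functools import reduce

# Pretend each chunk lives on a different machine
chunks = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]

def map_chunk(text):
    """Map step: count words within one chunk, independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# Each map call needs only its own chunk, so the calls could run in parallel
partial_counts = map(map_chunk, chunks)

# The reduce step folds the partial results into the final answer
total_counts = reduce(merge_counts, partial_counts, Counter())
```

The map step never needs more than one chunk in memory, which is what allows the work to be spread across machines.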

## Code Samples

### OS

In [1]:
import os  # Operating-system-dependent functionality

In [14]:
PATH = '/Users/christopherleonard'

# To combine path names into one complete path
# Note that it adds the necessary slash
ML_PATH = os.path.join(PATH, 'P/hands-on-machine-learning')
print(ML_PATH)

/Users/christopherleonard/P/hands-on-machine-learning


In [15]:
# Check if specified path is an existing directory
os.path.isdir(ML_PATH)

True

In [18]:
# Create specified directory
# Raises FileExistsError if it already exists (and FileNotFoundError if a parent directory is missing)
TEST_PATH = os.path.join(ML_PATH, 'chapter-2/test')
os.mkdir(TEST_PATH)

# Also works with a relative path
os.mkdir('test')
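
Since `os.mkdir` fails on an existing directory, `os.makedirs` with `exist_ok=True` is often more convenient. The sketch below runs in a temporary directory, so the paths are throwaway examples rather than the ones used above.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, 'chapter-2', 'test')

    # makedirs also creates missing parent directories; exist_ok=True
    # suppresses the error when the directory already exists
    os.makedirs(nested, exist_ok=True)
    os.makedirs(nested, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(nested)

    # Plain mkdir on an existing directory raises FileExistsError
    try:
        os.mkdir(nested)
        mkdir_failed = False
    except FileExistsError:
        mkdir_failed = True
```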

### pd.cut

Useful for binning data, for example turning continuous data into categorical

In [46]:
# Uniformly sampled data from 0 to 100
data = pd.DataFrame(100 * np.random.random(50), columns=['cont'])

# pd.cut to bin data - useful for turning continuous data into categorical (e.g. strata)
data['cat'] = pd.cut(data['cont'], 
                     bins=[0, 33, 66, 100], 
                     labels=['low', 'medium', 'high'], 
                     include_lowest=True)

data.head(10)

| | cont | cat |
|---|---|---|
| 0 | 58.210354 | medium |
| 1 | 28.998124 | low |
| 2 | 66.689417 | high |
| 3 | 64.775803 | medium |
| 4 | 43.950615 | medium |
| 5 | 67.655351 | high |
| 6 | 92.013886 | high |
| 7 | 95.371403 | high |
| 8 | 87.702804 | high |
| 9 | 61.873568 | medium |
