# Chapter 2: End-to-End Machine Learning Project

In [1]:
import numpy as np
import pandas as pd

**Main Steps:**
1. Big picture
2. Get the data
3. Discover and visualise
4. Prepare data
5. Select and train model
6. Fine-tune model
7. Present solution
8. Launch, monitor, and maintain

## Look at the Big Picture <a name="bigpicture"></a>

**Goal:** use census data to predict median housing price per district.

### Frame the Problem <a name="frameproblem"></a>

**Questions:**
1. What is the end business objective?
2. What is the current solution?
    - Gives a reference for performance and insights on possible solutions
3. Frame the problem
    - Supervised, unsupervised, reinforcement etc.
    - Problem type (regression, classification etc.)
    - Batch learning or online learning?
    
### Select a Performance Measure <a name="performancemeasures"></a>

Common performance measures for regression problems are:
- *Root Mean Square Error (RMSE):*
\begin{equation}
    \text{RMSE}(\mathbf{X}, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^m \left( h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 }
\end{equation}

- *Mean Absolute Error (MAE):*
\begin{equation}
    \text{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^m \Big\lvert h(\mathbf{x}^{(i)}) - y^{(i)}\Big\rvert
\end{equation}

- Other $l_k$ norms

The higher $k$, the greater the impact of large errors, so RMSE is more sensitive to outliers than MAE. RMSE is preferable when outliers are exponentially rare (as in a bell-shaped distribution); otherwise MAE may be better.
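
A minimal sketch of the two measures, using made-up targets and predictions (the arrays and the helper names `rmse` and `mae` are illustrative, not from the housing data), to show how a single large error moves RMSE more than MAE:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    return np.mean(np.abs(y_pred - y_true))


y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred_clean = np.array([11.0, 11.0, 15.0, 15.0])    # every error has size 1
y_pred_outlier = np.array([11.0, 11.0, 15.0, 26.0])  # last error has size 10

rmse_clean, mae_clean = rmse(y_true, y_pred_clean), mae(y_true, y_pred_clean)
rmse_outlier, mae_outlier = rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)

print(f"clean:   RMSE={rmse_clean:.2f}  MAE={mae_clean:.2f}")
print(f"outlier: RMSE={rmse_outlier:.2f}  MAE={mae_outlier:.2f}")
```

With identical error sizes the two measures agree; once one error is much larger than the rest, RMSE grows faster than MAE.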

### Check the Assumptions <a name="checkassumptions"></a>

Assumptions in the problem - e.g. are exact values necessary in a regression problem, or just categories?

## Get the Data <a name="getdata"></a>

### Create the Workspace <a name="createworkspace"></a>

Blah Blah

### Download the Data <a name="downloaddata"></a>

- Good to have a function that downloads the data
- Write a script that uses the function to fetch latest data
- *Optional:* schedule a job to fetch latest data automatically at regular intervals
- Also should write function to load data
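
One way the fetch and load functions might look, as a sketch only: the names `fetch_data` and `load_data`, the CSV format, and the URL in the usage comment are all assumptions for illustration, not taken from the textbook.

```python
import urllib.request
from pathlib import Path

import pandas as pd


def fetch_data(url, data_dir, filename):
    """Download `url` to `data_dir/filename` and return the local path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    local_path = data_dir / filename
    urllib.request.urlretrieve(url, str(local_path))
    return local_path


def load_data(path):
    """Load a previously downloaded CSV file into a DataFrame."""
    return pd.read_csv(path)


# Usage (the URL is a placeholder, not a real dataset location):
# path = fetch_data('https://example.com/datasets/housing.csv', 'datasets', 'housing.csv')
# housing = load_data(path)
```

Keeping the download and the load separate means a scheduled job can call only `fetch_data`, while notebooks call only `load_data`.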

### Take a Quick Look at the Data Structure <a name="quicklook"></a>

- `df.head()`
- `df.info()`
    - Note missing values
    - List data types: categorical (nominal/ordinal) or numerical (discrete/continuous)
- `df.describe()`
- List the different values of a discrete column using `df['col'].value_counts()`
- Histograms of numerical data using `df.hist(bins=50, figsize=(20, 15))`
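
The calls above can be tried on a toy DataFrame; the column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one categorical column
df = pd.DataFrame({
    'rooms': [3, 4, np.nan, 5, 3],
    'price': [210.0, 250.0, 180.0, 320.0, 205.0],
    'area': ['inland', 'coast', 'inland', 'coast', 'inland'],
})

print(df.head())      # first rows
df.info()             # dtypes and non-null counts (reveals missing values)
print(df.describe())  # summary statistics of the numerical columns

area_counts = df['area'].value_counts()  # frequency of each category
n_missing = df.isna().sum()              # missing values per column
print(area_counts)
print(n_missing)
```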

### Create a Test Set <a name="testset"></a>

- **Data snooping bias:** Overfitting to the *test* set by looking at test set (even briefly)
- `train_test_split` from Scikit-Learn splits data into training and test set

In [3]:
from sklearn.model_selection import train_test_split

# Uniformly sampled data from 0 to 100
data = pd.DataFrame(100 * np.random.random(50), columns=['cont'])

# Split into 80/20 proportions
train, test = train_test_split(data, test_size=0.2)

**Issue:**
- Isn't reproducible: running it again results in different split
- One solution: do this once and save
- Another: set `random_state=42` to control shuffling and ensure reproducible output
- But: neither of these works if you update the dataset. The textbook has a potential solution: split by hashed identifiers
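
A sketch of the hashed-identifier idea, under assumptions: it hashes the string form of an `id` column with `zlib.crc32`, and the helper names `in_test_set` and `split_by_id` are hypothetical. The textbook's own implementation may differ in detail.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def in_test_set(identifier, test_ratio):
    """Assign an instance to the test set based on a hash of its identifier."""
    hashed = crc32(str(identifier).encode('utf-8')) & 0xffffffff
    return hashed < test_ratio * 2**32


def split_by_id(df, test_ratio, id_column):
    """Split so that an instance's set depends only on its own identifier."""
    in_test = df[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return df.loc[~in_test], df.loc[in_test]


data = pd.DataFrame({'id': range(1000), 'cont': np.arange(1000) / 10})
train, test = split_by_id(data, 0.2, 'id')

# Simulate an updated dataset: 200 new rows appended
bigger = pd.DataFrame({'id': range(1200), 'cont': np.arange(1200) / 10})
train2, test2 = split_by_id(bigger, 0.2, 'id')

print(len(train), len(test), len(train2), len(test2))
```

Because the assignment depends only on the identifier, rows that were in the training set stay there after the update, provided identifiers are stable and unique.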

**Stratified Sampling:**
- If dataset isn't large enough, random selection of test set can introduce sampling bias
- Stratified sampling guarantees test set is representative of population by controlling for specific factors
- Population is divided into *strata* and sample has same proportions in each stratum
- Ex: controlling for gender
- Use `StratifiedShuffleSplit` from Scikit-Learn
- **Q:** How to decide what to control for? How many factors can you control for? 

In [5]:
# Uniformly sampled data from 0 to 100
data = pd.DataFrame(100 * np.random.random(1000), columns=['cont'])

# pd.cut to bin data - useful for turning continuous data into categorical (e.g. strata)
# (Scikit-Learn also has Discretization functionality)
data['cat'] = pd.cut(
    data['cont'], 
    bins=[0, 33, 66, 100], 
    labels=['low', 'medium', 'high'], 
    include_lowest=True
)

data.head(10)

| | cont | cat |
|---|---|---|
| 0 | 63.447623 | medium |
| 1 | 64.887668 | medium |
| 2 | 86.919338 | high |
| 3 | 32.330967 | low |
| 4 | 4.656797 | low |
| 5 | 78.043343 | high |
| 6 | 19.769441 | low |
| 7 | 36.791181 | medium |
| 8 | 41.079619 | medium |
| 9 | 82.44654 | high |


In [59]:
from sklearn.model_selection import StratifiedShuffleSplit

# Initialise split object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# The loop runs only once (n_splits=1), but split() returns a generator so we still iterate
# 2nd arg gives the labels to stratify on; 1st arg is only used for n_samples, so np.zeros(n_samples) would also work
for train_index, test_index in split.split(data, data['cat']):
    # split() yields positional indices, so use iloc (loc only works here because of the default RangeIndex)
    train = data.iloc[train_index]
    test = data.iloc[test_index]

# Function to compare proportions in the different sets
def cat_proportions(df):
    return df['cat'].value_counts() / len(df)

# DataFrame to store results
compare_props = pd.DataFrame({
    'Overall': cat_proportions(data),
    'Train' : cat_proportions(train),
    'Test': cat_proportions(test),
}).sort_index()

compare_props

| | Overall | Train | Test |
|---|---|---|---|
| low | 0.319 | 0.31875 | 0.32 |
| medium | 0.335 | 0.335 | 0.335 |
| high | 0.346 | 0.34625 | 0.345 |


## Discover and Visualise the Data to Gain Insights <a name="discoverandvisualise"></a>

- If training set is large, sample exploration set for speed
- Copy training set to avoid 'damage'

### Visualising Geographical Data <a name="geographicaldata"></a>

- Use `df.plot(kind='scatter', x='x_val', y='y_val')`
- May need to play around to pick out patterns
- Useful parameters to capture other dimensions:
    - `alpha=0.1`: opacity of circles to help see high density
    - `s=df['col']`: control radius of circles by column
    - `c=df['col'], cmap=plt.get_cmap('jet')`: colour scale based on column
- Use this to generate ideas for features etc.
- Repository has an example of plotting over a map

### Looking for Correlations <a name="correlations"></a>

**Pearson's Correlation Coefficient:** Covariance scaled by product of standard deviations:

\begin{equation}
    \rho_{X,Y} 
        = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}
        = \frac{\text{E}\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y}
\end{equation}

**Correlation Matrix:** 
- Matrix with $(X, Y)$ entry $\rho_{X,Y}$
- `df.corr()`
- Note that this only measures linear correlations
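
A small illustration of the last point, on synthetic data: a perfectly deterministic but non-linear relationship can have a Pearson coefficient of (numerically) zero.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)  # symmetric around 0
df = pd.DataFrame({
    'x': x,
    'linear': 3 * x + 1,   # exact linear function of x
    'quadratic': x ** 2,   # exact but non-linear function of x
})

corr = df.corr()
print(corr.round(3))
```

Here `x` and `linear` are perfectly correlated, while `x` and `quadratic` show essentially no linear correlation even though one fully determines the other.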

**Scatter Matrix:**
- `from pandas.plotting import scatter_matrix`
- Scatter plots of all combinations of attributes and histograms on diagonal
- May be better to do subset of features

### Experimenting with Attribute Combinations <a name="attributecombinations"></a>

- Try out different attribute combinations (e.g. ratios) and look back at the correlations.
- At first this can be quick - you can go back and look for better ones

## Prepare the Data for Machine Learning Algorithms <a name="prepareformachinelearning"></a>

You should write functions to prepare the data:
- This makes it reproducible (you can even use it in a live system)
- This allows you to build a library of functions for the future
- Makes it easier to try different combinations of functions

Separate predictors and targets - don't always want to apply same transformations to both

### Data Cleaning

Generally need to resolve missing values either by:
- Removing samples
- Removing feature
- 'Imputing' missing values (e.g. using the median)

*Note:* Whatever you do will need to be replicated on the test set and in production

**SimpleImputer**: Scikit-Learn class to fill missing values
- `from sklearn.impute import SimpleImputer`
- `strategy`: options are mean, median, most_frequent, or constant
- For mean and median, data must be numeric
- Good idea to impute for all features that may be missing values in production, even if not missing in training set
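
A minimal sketch with invented data: the imputer learns one median per feature from the training set, so it can later fill a feature (`age` here) that had no missing values during training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    'rooms': [3.0, 4.0, np.nan, 5.0],  # one missing value
    'age': [10.0, 20.0, 30.0, 40.0],   # complete in the training set
})

imputer = SimpleImputer(strategy='median')
imputer.fit(train)
print(imputer.statistics_)  # learned median of each column

# New data where both features are missing
new = pd.DataFrame({'rooms': [np.nan], 'age': [np.nan]})
filled = imputer.transform(new)
print(filled)
```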

Scikit-Learn also has (experimental) *multivariate* feature imputation, which fills values using other features, and nearest neighbours imputation.

**Multiple Imputation:** In statistics it's common to try multiple imputations and use cross-validation at the end of the pipeline to understand the consequences of the different strategies.

### Handling Text and Categorical Attributes <a name="handlingcategorical"></a>

ML algorithms generally need numbers so we need to convert categories to numerical form via *encoding*.

**Ordinal Encoding:** Enumerate different categories
- `from sklearn.preprocessing import OrdinalEncoder`
- This implicitly gives an ordering and measure of similarity to the categories

**One-hot Encoding:** Assigns each category a new binary feature
- Or *dummy* encoding or *one-of-K* encoding
- `from sklearn.preprocessing import OneHotEncoder`
- Output is a SciPy *sparse matrix* which just holds position of '1' in each row - use `toarray()` to convert
- Treats missing values as a separate category (Q: could you use mean imputer after encoding?)
- To avoid collinearity, use `drop='first'` to encode in `n_categories - 1` columns. Use `drop='if_binary'` to encode e.g. male/female as a single binary feature

One-hot encoding may be unwise for large number of categories (for performance reasons). Possible solution is to use proxy features (e.g. population and GDP in place of country code). Another is to use a low-dimensional embedding learned during training (see *representation learning*).
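
A sketch comparing the two encoders on a toy column (the data is invented). The explicit category order passed to `OrdinalEncoder` is an assumption about what ordering is meaningful here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cats = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})

# Ordinal: one integer code per category, in the order we specify
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordinal_codes = ordinal.fit_transform(cats)

# One-hot: one binary column per category (sparse output by default)
onehot = OneHotEncoder()
onehot_dense = onehot.fit_transform(cats).toarray()

# Dropping the first category leaves n_categories - 1 columns
onehot_drop = OneHotEncoder(drop='first')
dropped = onehot_drop.fit_transform(cats).toarray()

print(ordinal_codes.ravel())
print(onehot.categories_)  # categories are sorted alphabetically by default
print(onehot_dense)
print(dropped)
```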

### Custom Transformers  <a name="customtransformers"></a>

- You should turn custom data cleaning into transformers so they'll integrate with Scikit-Learn
- Need a class with 3 methods: `fit()` (returning `self`), `transform()`, and `fit_transform()`
- Set `TransformerMixin` as a base class to get `fit_transform()` for free
- Set `BaseEstimator` as a base class to get `get_params()` and `set_params()` methods (can't have `*args` and `**kwargs` in the `__init__` constructor)
- *Advice:* Add hyperparameters to 'gate' data preparation steps (e.g. adding a new feature) so you can turn them on and off
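
A sketch of such a transformer. `RatioAdder` and its `add_ratio` hyperparameter are hypothetical names chosen for illustration; the transformer optionally appends the ratio of the first two columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: optionally append column 0 / column 1."""

    def __init__(self, add_ratio=True):  # explicit keyword, no *args/**kwargs
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if not self.add_ratio:
            return X  # the hyperparameter gates the extra feature
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


X = np.array([[6.0, 2.0], [9.0, 3.0]])
with_ratio = RatioAdder().fit_transform(X)                      # from TransformerMixin
without_ratio = RatioAdder(add_ratio=False).fit_transform(X)
params = RatioAdder().get_params()                              # from BaseEstimator

print(with_ratio)
print(params)
```

Because `add_ratio` is an ordinary constructor argument, a hyperparameter search can switch the feature on and off like any other parameter.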

### Feature Scaling <a name="featurescaling"></a>

- ML algorithms generally don't perform well when inputs have very different scales
- Scaling output values is generally not required

**Min-max Scaling:** Linear scaling to [0, 1] range
- Also called *normalisation*
- `from sklearn.preprocessing import MinMaxScaler`
- Very sensitive to outliers
- *Q:* Is it a problem that unseen data may not fit in the [0, 1] range?

**Standardisation:** Linear scaling to 0 mean and unit variance
- `from sklearn.preprocessing import StandardScaler`
- Doesn't work for some ML algorithms which need inputs to be in range [0, 1]

- Centring sparse data (e.g. dummy binary variables from categorical data) is inadvisable because this destroys the sparsity, but you can still do scaling (consider using `MaxAbsScaler`)
- Scaling data with many outliers may not work very well - consider using `RobustScaler` instead
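
A short sketch on invented one-column data. It also shows that, with the default settings, `MinMaxScaler` maps unseen values outside the training range to values outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_new = np.array([[0.0], [7.0]])  # values outside the training range

minmax = MinMaxScaler().fit(X_train)
train_minmax = minmax.transform(X_train)  # training data lands in [0, 1]
new_minmax = minmax.transform(X_new)      # unseen data may not

standard = StandardScaler().fit(X_train)
train_std = standard.transform(X_train)   # zero mean, unit variance

print(train_minmax.ravel())
print(new_minmax.ravel())
print(train_std.ravel())
```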

### Transformation Pipelines <a name="pipelines"></a>

**Pipelines:** chain estimators into a composite estimator
- `from sklearn.pipeline import Pipeline`
- `make_pipeline` is a shorthand for constructing pipelines
- Pipelines only transform observed data X; use `TransformedTargetRegressor` to transform the target y
- All but the last estimator must be transformers (they must have a `.fit_transform()` method, or `fit()` and `transform()`)
- Calling `.fit()` on the pipeline calls `fit_transform()` on every step but the last, on which it calls `fit()`

**ColumnTransformer:** allows you to apply different transformers to different columns
- `from sklearn.compose import ColumnTransformer`
- `make_column_transformer` is a shorthand
- What happens to unlisted columns is determined by arg `remainder: {'drop', 'passthrough'}` or an estimator to be applied to them
- The estimators that go in can themselves be pipelines
- When outputs are a mix of dense and sparse matrices, `ColumnTransformer` decides the density of the output based on the ratio of dense/sparse in the inputs
- Use `set_config(display='diagram')` (after `from sklearn import set_config`) to show diagrams of composite estimators
- `make_column_selector` is useful to select columns to go into `ColumnTransformer`
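
A sketch combining the two ideas on invented data: a numerical sub-pipeline and a one-hot encoder inside a `ColumnTransformer`, followed by a regressor. The column names, step names, and choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    'rooms': [3.0, 4.0, np.nan, 5.0, 3.0, 6.0],
    'income': [2.5, 3.1, 1.9, 4.8, 2.2, 5.5],
    'area': ['inland', 'coast', 'inland', 'coast', 'inland', 'coast'],
})
y = np.array([210.0, 250.0, 180.0, 320.0, 205.0, 360.0])

# Numerical columns: impute then scale
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Different transformers for different columns
preprocess = ColumnTransformer([
    ('num', num_pipeline, ['rooms', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['area']),
])

prepared = preprocess.fit_transform(X)  # 2 numerical + 2 one-hot columns

# Full pipeline: all steps but the last are transformers
model = Pipeline([
    ('preprocess', preprocess),
    ('regress', LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)

print(prepared.shape)
print(predictions.round(1))
```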

## Select and Train a Model <a name="selectandtrain"></a>

### Training and Evaluating on the Training Set <a name="evaluatingontrainingset"></a>

- A reasonable standard error metric is RMSE
- I like to compare with the base error rate (for regression this is error if predicting using the mean)
- Training error is a bad estimate of generalisation error because of overfitting

### Better Evaluation Using Cross-Validation <a name="evaluatingusingcrossvalidation"></a>

**K-Fold Cross-Validation:** Partitions training set into $K$ *folds* and trains and calculates validation error leaving one fold out each time
- Can use `cross_val_score` to just get one score, or `cross_validate` to use multiple metrics and return more info
- *Note:* Scikit-learn cross-validation expects a *utility* function rather than a *cost* function, so error metrics (e.g. RMSE) need to be negated (e.g. `neg_root_mean_squared_error`)
- Cross-validation can be very slow because it needs to train the model many times
- **Q:** How accurate are the mean and standard deviation of error scores as estimates of generalisation error/stdev? What is the relationship between this and number of folds?
- There are other options than just splitting randomly into folds, e.g. `StratifiedKFold` - cf. `StratifiedShuffleSplit`
- *Leave-One-Out Cross-Validation:* Take n_folds = n_samples. From Scikit-learn: "As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred to LOO."
- It may be necessary to shuffle data before cross-validating if order isn't already random
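
A minimal sketch on synthetic data (the data-generating process and the choice of `LinearRegression` are assumptions), showing the negated scoring and shuffled folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)  # linear signal plus unit noise

# Shuffle before splitting in case the row order is not random
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    LinearRegression(), X, y,
    scoring='neg_root_mean_squared_error',  # utility function: higher is better
    cv=cv,
)
rmse_scores = -scores  # flip the sign to get RMSE back

print(rmse_scores)
print(rmse_scores.mean(), rmse_scores.std())
```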

**Note:** Try out a few different models from different categories before spending too much time tweaking hyperparameters. *The goal is to shortlist a few (2-5) promising models.*

**joblib:** Python library to avoid having to compute the same thing twice
- Use `joblib.dump(my_model, 'my_model.pkl')` to save models, parameters, hyperparameters, cv-scores, and predictions so you don't have to do them twice
- Then use `my_model_loaded = joblib.load('my_model.pkl')` to get it back instantly
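
A round-trip sketch with a toy model, written to a temporary directory so that nothing is left on disk (the model and data are invented for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # y = 2x + 1
my_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'my_model.pkl')
    joblib.dump(my_model, model_path)          # save the fitted model
    my_model_loaded = joblib.load(model_path)  # load it back

same_predictions = np.allclose(my_model.predict(X), my_model_loaded.predict(X))
print(same_predictions)
```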

## Fine-Tune Your Model <a name="finetune"></a>

### Grid Search <a name="gridsearch"></a>

**GridSearchCV:** Perform (cross-)validation across a specified grid of hyperparameters
- `param_grid` can take multiple dictionaries of parameters. Within each dictionary it does every combination of parameters
- This can be very slow, particularly combined with cross-validation
- A good choice of candidate values is to space them by a factor of roughly 3 (i.e. approximately half-powers of 10): 3, 10, 30, 100,...

**RandomizedSearchCV:** Cross-validation with a fixed number of samples of hyperparameters from specified distributions.
- `param_distributions` specifies distributions of each parameter using `scipy.stats`
- Advantages:
    - Full control over number of iterations
    - Given 1,000 iterations, each parameter takes 1,000 values, not just the few you specified in grid search (so including irrelevant hyperparameters doesn't reduce efficacy of search)
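
A sketch of both searches on synthetic data. The `Ridge` model, the candidate values, and the log-uniform distribution are assumptions chosen for illustration; parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])

# Grid search: every listed value is tried
grid = GridSearchCV(
    pipe,
    param_grid=[{'ridge__alpha': [0.03, 0.1, 0.3, 1, 3, 10]}],
    scoring='neg_root_mean_squared_error',
    cv=5,
)
grid.fit(X, y)

# Randomised search: a fixed number of draws from a distribution
random_search = RandomizedSearchCV(
    pipe,
    param_distributions={'ridge__alpha': loguniform(1e-3, 1e2)},
    n_iter=20,
    scoring='neg_root_mean_squared_error',
    cv=5,
    random_state=0,
)
random_search.fit(X, y)

print(grid.best_params_, -grid.best_score_)
print(random_search.best_params_, -random_search.best_score_)
```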

**Other options:**
- Some models can fit data with a variety of hyperparameter values at once (in particular regularisation paths for linear models). Scikit-Learn has estimators that incorporate this (e.g. `RidgeCV`)
- Some models have closed-form formulae for the optimal regularisation parameter based on information criteria (AIC or BIC), e.g. `LassoLarsIC`

**Additional guidance:**
- Specify multiple metrics for evaluation (*why?*)
- Search over parameters of composite estimators (e.g. pipelines)

### Ensemble Methods <a name="ensemblemethods"></a>

Combine models that perform best.

### Analyse the Best Models and Their Errors <a name="analysebestmodels"></a>

For example, use insights to improve feature set (some models have dedicated methods for this), or look at specific errors.

### Evaluate Your System on the Test Set <a name="evaluateontestset"></a>

- This is easier if you have a transformer that incorporates the data processing and the estimator
- If you did a lot of hyperparameter tuning then test performance may be worse than CV error

The text has an example of constructing a confidence interval around the test error. I am sceptical about this. It seems to assume that the squared error of the estimator is distributed normally (is this reasonable?). Then the difference between the sample mean (of the squared errors) and the true mean, divided by the standard error is distributed according to Student's t-distribution. This then yields a confidence interval for the true mean. I think I need to read ESL to understand this properly.
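
A sketch of that kind of interval on synthetic predictions (the data is invented, and the construction rests on the same normality assumption questioned above, so treat the result with matching caution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_true = rng.normal(200, 50, size=500)
y_pred = y_true + rng.normal(0, 20, size=500)  # synthetic prediction errors

squared_errors = (y_pred - y_true) ** 2
confidence = 0.95

# t-interval for the mean squared error, then take square roots for RMSE
interval_mse = stats.t.interval(
    confidence,
    len(squared_errors) - 1,           # degrees of freedom
    loc=squared_errors.mean(),         # sample mean
    scale=stats.sem(squared_errors),   # standard error of the mean
)
rmse_low, rmse_high = np.sqrt(interval_mse)
rmse = np.sqrt(squared_errors.mean())

print(rmse_low, rmse, rmse_high)
```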

## Other Terminology

**Data Pipeline**: a sequence of data processing components
- Components run asynchronously and outputs are stored in data stores between components
- Components are self-contained so if one component fails, downstream components can continue using last output
- Monitoring is important so failing components can be caught and fixed

**Map Reduce:** programming paradigm of using parallel, distributed algorithms to process data
- To allow a group of (memory independent) computers to process data that is too much for a single processor

**Duck Typing:** Programming paradigm - "if it walks like a duck and it quacks like a duck, then it must be a duck"

**Ensemble Learning:** Building a model on top of many other models.

## Code Samples

### OS

In [1]:
import os  # Operating system dependent functionality

In [14]:
PATH = '/Users/christopherleonard'

# To combine path names into one complete path
# Note that it adds the necessary slash
ML_PATH = os.path.join(PATH, 'P/hands-on-machine-learning')
print(ML_PATH)

/Users/christopherleonard/P/hands-on-machine-learning


In [15]:
# Check if specified path is an existing directory
os.path.isdir(ML_PATH)

True

In [18]:
# Create specified directory (the parent directory must already exist)
# Raises FileExistsError if it already exists
TEST_PATH = os.path.join(ML_PATH, 'chapter-2/test')
os.mkdir(TEST_PATH)

# Also works with relative directory
os.mkdir('test')
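
If the error on an existing directory is unwanted, `os.makedirs` with `exist_ok=True` is one alternative; it also creates missing parent directories. The sketch below uses a temporary directory rather than the paths above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    nested = os.path.join(base_dir, 'chapter-2', 'test')

    os.makedirs(nested, exist_ok=True)  # creates 'chapter-2' and 'test'
    os.makedirs(nested, exist_ok=True)  # second call does nothing
    created = os.path.isdir(nested)

    # os.mkdir on an existing directory raises FileExistsError
    try:
        os.mkdir(nested)
        mkdir_raised = False
    except FileExistsError:
        mkdir_raised = True

print(created, mkdir_raised)
```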