# Chapter 2 - End-to-End Machine Learning Project

## 2.1 Working with Real Data

- Various places to find real-world data:
    - Open Data Repositories
        - [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml)
        - [Kaggle Datasets](https://www.kaggle.com/datasets)
        - [Amazon's AWS Datasets](https://registry.opendata.aws)
    - Meta Portals
        - [Data Portals](http://www.dataportals.org)
        - [OpenDataMonitor](http://opendatamonitor.eu)
        - [Quandl](http://quandl.com)
    - Other Pages
        - [Wikipedia's list of ML Datasets](https://homl.info/9)
        - [Quora.com](https://homl.info/10)
        - [Reddit Datasets Subreddit](https://www.reddit.com/r/datasets)

## 2.2 Look at the Big Picture

- The task in this chapter is to use California census data to build a model of housing prices in the state
- The model should learn from the data and be able to predict the median housing price in any district, given the other metrics

### 2.2.1 Frame the Problem

- Frame the Problem/Look at Big Picture Checklist:
    - Define the objective in business terms
    - How will your solution be used?
    - What are the current solutions/workarounds (if any)?
    - How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
    - How should performance be measured?
    - Is the performance measure aligned with the business objective?
    - What would be the minimum performance needed to reach the business objective?
    - What are comparable problems? Can you reuse experience or tools?
    - Is human expertise available?
    - How would you solve the problem manually?
    - List the assumptions you (or others) have made so far.
    - Verify the assumptions if possible.
- Data Pipelines
    - A sequence of data processing components
    - Components typically run asynchronously
        - Each component pulls in a large amount of data, processes it, and spits out the result in another data store
        - If a component breaks down, the downstream components can often continue to run normally (at least for a while)
        by just using the last output from the broken component
        - A broken component can go unnoticed for some time if proper monitoring is not implemented

### 2.2.2 Select a Performance Measure

- A typical performance measure for regression problems is the Root Mean Square Error (RMSE)
$$RMSE(\textbf{X},h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h(\textbf{x}^{(i)}) - y^{(i)})^2 }$$
    - $m$ is the number of instances in the dataset you are measuring the RMSE on
    - $\textbf{x}^{(i)}$ is a vector of all the feature values (excluding the label) of the $i^{th}$ instance in the dataset, and $y^{(i)}$ 
    is its label (the desired output value for that instance)
    - $\textbf{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset.
    There is one row per instance, and the $i^{th}$ row is equal to the transpose of $\textbf{x}^{(i)}$, noted $(\textbf{x}^{(i)})^\intercal$
    - $h$ is your system's prediction function, also called a *hypothesis*. When your system is given an instance's feature
    vector $\textbf{x}^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(\textbf{x}^{(i)})$
- Mean Absolute Error (MAE) or Average Absolute Deviation
    - Good for when data contains a lot of outliers
$$MAE(\textbf{X},h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\textbf{x}^{(i)}) - y^{(i)}\right|$$

- Both RMSE and MAE are ways to measure distance between two vectors
    - Computing the RMSE corresponds to the *Euclidean norm*
        - Also called the $l_2$ norm, noted $||x||_2$
    - Computing the Mean Absolute Error (MAE) corresponds to the *Manhattan Norm*
        - Also called the $l_1$ norm, noted $||x||_1$
        - Measures the distance between two points in a city if you can only travel along orthogonal city blocks (right angles)
    - Generally speaking, the $l_k$ norm of a vector $\textbf{v}$ containing *n* elements is defined as
    $||\textbf{v}||_k = \left(|v_1|^k + |v_2|^k + \dots + |v_n|^k\right)^{\frac{1}{k}}$
        - $l_0$ gives the number of nonzero elements in the vector
        - $l_{\infty}$ gives the maximum absolute value in the vector
    - The higher the norm index, the more it focuses on large values and neglects small ones
        - This makes RMSE more sensitive to outliers than MAE, but when outliers are exponentially rare (as in a
        bell-shaped distribution), RMSE performs very well and is generally the preferred performance measure
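
To make the two formulas and their sensitivity to outliers concrete, here is a minimal NumPy sketch. The helper names `rmse` and `mae` and all the numbers are made up for illustration; they are not part of the chapter's dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: the l2 norm of the errors divided by sqrt(m)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: the l1 norm of the errors divided by m."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Made-up labels and two sets of made-up predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_clean = np.array([110.0, 190.0, 310.0, 390.0])    # every error has magnitude 10
y_outlier = np.array([110.0, 190.0, 310.0, 800.0])  # one error has magnitude 400

print(rmse(y_true, y_clean), mae(y_true, y_clean))
print(rmse(y_true, y_outlier), mae(y_true, y_outlier))

# The same quantities expressed through vector norms.
errors = y_outlier - y_true
print(np.linalg.norm(errors, ord=2) / np.sqrt(len(errors)))  # equals rmse(...)
print(np.linalg.norm(errors, ord=1) / len(errors))           # equals mae(...)
```

With identical small errors the two measures agree; a single large error pulls the RMSE up far more than the MAE.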

### 2.2.3 Check the Assumptions

- It is good practice to list and review the assumptions to catch serious issues early on

## 2.3 Get the Data

- Get the Data Checklist:
    - List the data you need and how much you need
    - Find and document where you can get that data
    - Check how much space it will take
    - Check legal obligations, and get authorization if necessary
    - Get access authorizations
    - Create a workspace (with enough storage space)
    - Get the data
    - Convert the data to a format you can easily manipulate (without changing the data itself)
    - Ensure sensitive information is deleted or protected (e.g. anonymized)
    - Check the size and type of data (time series, sample, geographical, etc.)
    - Sample a test set, put it aside, and never look at it (no data snooping)
    
### 2.3.1 Create the Workspace

- Workspace should be a virtualenv or conda env with the following packages:
    - Jupyter
    - NumPy
    - pandas
    - Matplotlib
    - Scikit-Learn

### 2.3.2 Download the Data

- In typical environments, your data would be available in a relational database (or some other data store) and spread
across multiple tables/documents/files

### 2.3.4 Take a Quick Look at the Data Structure

- The `.head()` method in pandas allows you to look at the top five rows of the data
- The `.info()` method in pandas is useful to get a quick description of the data
    - Provides total number of rows
    - Each attribute's type
    - Number of non-null values per attribute
        - Missing values need to be taken care of in the data cleaning step
- For categorical columns, it is useful to use the `.value_counts()` method of a pandas series
- The `.describe()` method shows a summary of the numerical attributes
- The `.hist()` method plots a histogram of all numerical attributes in the dataset
    - Heavy-tailed (skewed) distributions can make it harder for some ML algorithms to detect patterns, so they are
    often transformed to be more bell-shaped
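
The inspection methods above can be tried on any DataFrame. The sketch below uses a tiny made-up table; the column names mimic a housing dataset but the values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the housing data (values are illustrative only).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.7],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 213.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND", "NEAR OCEAN"],
})

print(housing.head())                             # first five rows
housing.info()                                    # row count, dtypes, non-null counts
print(housing["ocean_proximity"].value_counts())  # frequency of each category
print(housing.describe())                         # numerical summary; missing values are ignored
```

Calling `housing.hist(bins=50)` would additionally draw one histogram per numerical column, provided Matplotlib is installed.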

### 2.3.5 Create a Test Set

- A *test set* is created by randomly sampling a portion of the dataset (typically 20%) and setting it aside
    - When the dataset is large enough, random sampling methods work well enough but if it is not, there is a great risk
    of significant sampling bias
    - *Stratified sampling* divides the population into homogeneous subgroups called *strata*, and the right number of
    instances is sampled from each stratum to guarantee that the test set is representative of the overall population
        - Sklearn provides the `StratifiedShuffleSplit` class to do stratified sampling
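
A minimal sketch of stratified sampling with `StratifiedShuffleSplit` follows. The `income_cat` column, its category sizes, and the random data are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up data: 100 districts with a hypothetical income category used as the stratum.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=100),
    "income_cat": np.repeat([1, 2, 3, 4, 5], [10, 30, 35, 15, 10]),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(housing, housing["income_cat"]))
strat_train_set = housing.iloc[train_idx]
strat_test_set = housing.iloc[test_idx]

# The category proportions in the test set should mirror those of the full dataset.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified_test": in_test}))
```

Comparing the two columns of the printed table is a quick way to check that no stratum is over- or under-represented.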


## 2.4 Discover and Visualize the Data to Gain Insights

- After taking a quick glance at the data and creating a test set, the next step is to gain greater understanding through
visualization
- Exploring a copy of the training set (made with the `.copy()` method of pandas DataFrames) helps preserve the
integrity of the original training set

### 2.4.1 Visualizing Geographical Data

- The housing dataset contains geographical data (latitude and longitude), which is best visualized with a scatterplot
    - Setting the parameter `alpha=.1` makes it easier to see where the data points are densest
    - To further aid the analysis, each circle's radius can represent the district's population and its color can
    represent the price

### 2.4.2 Looking for Correlations

- In datasets that aren't large, the *standard correlation coefficient* (Pearson's r) can be calculated between every pair
of attributes using the `.corr()` method
    - The value ranges from -1 to 1
    - Correlations close to 1 signify strong positive correlations
    - Correlations close to -1 signify strong negative correlations
    - Correlations close to 0 signify very weak or no correlation at all
    - Important to note that the correlation coefficient only measures linear relationships and has nothing to do with
    the slope of the relationship
- Another way of checking the correlations between attributes is to use the `scatter_matrix()` function from `pandas.plotting`
    - Note: produces a figure of $n^2$ plots, $n$ being the number of numerical attributes in the dataset
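
The following sketch computes Pearson's r with `.corr()` on made-up columns. The column `y_neg` is deliberately given a very shallow slope to illustrate that the coefficient reflects how tightly points follow a line, not how steep the line is.

```python
import numpy as np
import pandas as pd

# Made-up data: y_pos rises with x, y_neg falls gently with x, noise is unrelated to x.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y_pos": 2.0 * x + rng.normal(scale=0.5, size=200),
    "y_neg": -0.1 * x + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
})

corr_matrix = df.corr()  # Pearson's r between every pair of columns
print(corr_matrix["x"].sort_values(ascending=False))
```

For a visual check, `from pandas.plotting import scatter_matrix` followed by `scatter_matrix(df)` plots every column against every other one (Matplotlib required).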

### 2.4.3 Experimenting with Attribute Combinations

- Creating new attribute combinations is the process of using existing features, and calculating new ones through arithmetic
or some other means
- Often, new attributes or combinations of attributes will be more correlated with the target value than the original ones
- The first round of exploration does not have to be thorough
    - The idea is to get a prototype up and running and then iterate on the creative process
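
A small sketch of deriving ratio attributes with pandas arithmetic is shown below. The column names and values are made up in the style of a housing dataset and are assumptions for illustration.

```python
import pandas as pd

# Made-up district totals (illustrative only).
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

# Per-household and per-room ratios are often more informative than raw totals.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

print(housing[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```

After adding such columns, recomputing the correlation matrix shows whether the new attributes relate more strongly to the target than the raw totals did.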

## 2.5 Prepare the Data for Machine Learning Algorithms

- After the exploration phase, it's time to prepare the data for the algorithm
    - It is best practice to create a series of functions for the purposes of this step because:
        - Reproducibility
        - Building a library of functions for re-usability
        - Can be transferred over to production

### 2.5.1 Data Cleaning

- Most ML algorithms cannot work with missing features
- There are a few options when dealing with missing features
    - Getting rid of the row of data
    - Getting rid of the column (attribute)
    - Imputing the values (zero, mean, median, etc)
        - When using this option, it is important to calculate only for the training set and carry that value over to the
        test set
        - Sklearn has a `SimpleImputer` class for simple imputation; its mean and median strategies can only be
        computed on numerical attributes
        
- Scikit-Learn Design
    - All objects share a consistent and simple interface
    - *Estimators*
        - Any object that can estimate some parameters based on a dataset is called an *estimator*
        - Estimation is performed by the `.fit()` method
        - `.fit()` generally takes only the dataset as a parameter, along with the labels for supervised learning algorithms
        - Any other parameter needed to guide the estimation is a hyperparameter and must be set as an instance
        variable (generally via a constructor parameter)
    - *Transformers*
        - Estimators that can also transform the dataset (imputer is an example)
        - Transformations are performed via the `.transform()` method
        - All transformers also have a `.fit_transform()` method that does both `.fit()` and `.transform()` and is
        sometimes more optimized than doing the two operations separately
    - *Predictors*
        - Estimators that are capable of making predictions (via a `.predict()` method) are called *predictors*
    - *Inspection*
        - All the estimator's hyperparameters are accessible directly via public instance variables
        - All the estimator's learned parameters are accessible via public instance variables with an underscore suffix
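
The sketch below ties the imputation advice to the estimator/transformer/predictor interface. The arrays are made up, and `LinearRegression` is used only as a convenient example of a predictor.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Made-up training and test features with missing values (np.nan).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 50.0]])
X_test = np.array([[np.nan, 20.0], [5.0, np.nan]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

imputer = SimpleImputer(strategy="median")       # hyperparameter set in the constructor
X_train_filled = imputer.fit_transform(X_train)  # medians are learned from the training set only
X_test_filled = imputer.transform(X_test)        # the training medians are reused on the test set

print(imputer.strategy)     # hyperparameter: plain public attribute
print(imputer.statistics_)  # learned parameter: underscore suffix

model = LinearRegression()          # a predictor
model.fit(X_train_filled, y_train)  # supervised estimators take the data plus the labels
print(model.predict(X_test_filled))
```

Note that the test set is only ever passed to `.transform()`, never to `.fit()`, so no information from it leaks into the learned medians.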

### 2.5.2 Handling Text and Categorical Attributes

- For ordinal columns, categorical columns that have a sense of order to them, Sklearn provides the `OrdinalEncoder` class
- For categorical columns that are not ordinal, Sklearn provides the `OneHotEncoder` class
    - The result of the `OneHotEncoder` class is a sparse matrix to preserve memory
- If a categorical attribute has a large number of possible categories, then one-hot encoding will result in a large number
of input features
    - This can be handled by:
        - Replacing the categorical input with useful numerical features
        - Replacing each category with a learnable, low-dimensional vector called an *embedding*
            - Each category's representation is learned during training, which is an example of *representation learning*
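
Here is a minimal sketch of both encoders on made-up columns. The category names and the explicit ordering passed to `OrdinalEncoder` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up ordinal column: the category order is supplied explicitly.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal_encoder.fit_transform(sizes)
print(sizes_encoded.ravel())

# Made-up nominal column: no meaningful order, so it is one-hot encoded.
proximity = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["NEAR OCEAN"]])
onehot_encoder = OneHotEncoder()
proximity_1hot = onehot_encoder.fit_transform(proximity)  # SciPy sparse matrix by default
print(onehot_encoder.categories_)
print(proximity_1hot.toarray())  # dense view, for display only
```

Without the `categories` argument, `OrdinalEncoder` sorts the categories alphabetically, which would not match the intended small/medium/large order here.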

### 2.5.3 Custom Transformers

- Sometimes you will need to write your own transformers; they need to implement the `.fit()`, `.transform()`, and
`.fit_transform()` methods so that they integrate seamlessly with the rest of the Sklearn API
    - Inheriting from `TransformerMixin` provides `.fit_transform()` automatically
- Exposing data preparation choices as hyperparameters of a custom transformer lets you automatically try many
combinations of preparation steps in minimal time
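
A minimal sketch of a custom transformer follows. The class name `RatioAdder` and its two index hyperparameters are hypothetical, invented for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends the ratio of two columns."""

    def __init__(self, numerator_ix=0, denominator_ix=1):
        # Hyperparameters are stored unchanged so get_params()/set_params() work.
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]  # original columns plus the new ratio column

X = np.array([[880.0, 126.0], [7099.0, 1138.0]])  # made-up values
adder = RatioAdder(numerator_ix=0, denominator_ix=1)
X_extra = adder.fit_transform(X)  # fit_transform() comes from TransformerMixin
print(X_extra)
print(adder.get_params())         # get_params() comes from BaseEstimator
```

Because the column indices are constructor parameters, a hyperparameter search can vary them like any other hyperparameter.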

### 2.5.4 Feature Scaling

- *Feature scaling* is one of the most important transformations for ML
    - ML algorithms typically do not perform well when numerical attributes have different scales
    - Scaling is generally not required on target values
- There are two common ways to get all attributes to have the same scale:
    - *Min-max scaling* (Normalization): values are shifted and rescaled so that they end up ranging from 0 to 1
        - $X_{normalized} = \frac{X - X_{min}}{X_{max}-X_{min}}$
        - Sklearn provides a `MinMaxScaler` transformer
        - Heavily influenced by outliers
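    - *Standardization*: the mean is subtracted from the values (so standardized values have a zero mean), and the
    result is divided by the standard deviation so that the resulting distribution has unit variance
        - $X_{standardized} = \frac{X - X_{mean}}{X_{std}}$
        - Sklearn provides a `StandardScaler` transformer for this
        - Standardization is much less affected by outliers than min-max scaling, but it does not bound values to a
        specific range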
- Important to note that scalers such as `MinMaxScaler` and `StandardScaler` should only be fit to the training set,
then used to transform the training set, the test set, and new data
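
The two scalers can be compared on a single made-up feature that contains one outlier. The numbers below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature; the last training value is an outlier.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[2.5]])

min_max = MinMaxScaler().fit(X_train)     # fit on the training set only
standard = StandardScaler().fit(X_train)

X_train_minmax = min_max.transform(X_train)
X_train_std = standard.transform(X_train)

print(X_train_minmax.ravel())  # the outlier squeezes the other values close to 0
print(X_train_std.ravel())     # zero mean and unit variance, but no fixed range
print(min_max.transform(X_test), standard.transform(X_test))  # reuse the fitted scalers
```

Because both scalers were fit on `X_train`, the test value is scaled with the training set's minimum, maximum, mean, and standard deviation.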

### 2.5.5 Transformation Pipelines

- Sklearn provides the `Pipeline` class to make sequences of transformations easily repeatable
    - The `Pipeline` constructor takes a list of name/estimator pairs
    - All but the last estimator must be transformers (they must have the `.fit_transform()` method)
- Since version 0.20, Sklearn provides the `ColumnTransformer` class, which allows users to specify different
transformations for different sets of columns
    - A set of transformations can be defined for numerical columns
    - A set of transformations can be defined for categorical columns
    - When `.fit_transform()` is called, each set of transformations is applied to its defined columns and the outputs
    are concatenated
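
The sketch below chains an imputer and a scaler in a `Pipeline` and combines it with a one-hot encoder through `ColumnTransformer`. The DataFrame, its column names, and the step names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing numerical value (illustrative only).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})
num_attribs = ["median_income", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Route each group of columns to its own transformation.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 scaled numerical columns + 3 one-hot columns
```

The same fitted `full_pipeline` can later be applied to new data with `.transform()`.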

## 2.6 Select and Train a Model

- This step follows framing the problem, getting the data, exploring it, splitting it into training and test sets, and
writing transformation pipelines to clean up and prepare the data for the ML algorithms

### 2.6.1 Training and Evaluating on the Training Set

- *Underfitting* happens when the model is too simple to capture the structure of the data, or when the features do not
provide enough predictive information
    - Possible solutions are:
        - Use a more powerful model
        - Create better features
        - Reduce the constraints on the model

### 2.6.2 Better Evaluation Using Cross-Validation

- A better way to evaluate models than scoring them on the data they were trained on is to split the training set into
a smaller training set and a validation set
    - Train the models on the smaller training set
    - Evaluate against the validation set
- Sklearn provides a *K-fold cross-validation feature* that randomly splits the training set into $k$ subsets called *folds*,
and then trains and evaluates the model $k$ times, picking a different fold for evaluation every time and training on the
other $k-1$ folds
    - The result is an array with $k$ evaluation scores
- Cross-validation gives you not only an estimate of the model's performance but also a measure of how precise that
estimate is (the standard deviation of the scores)
- Building a model on top of many other models is called *ensemble learning*
    - An example is a Random Forest
- A sign of overfitting is when the error on the training set is much lower than the error on the validation sets
    - Possible solutions are:
        - Simplify or constrain (regularize) the model
        - Get more data
- Models you experiment with should be saved so that you can come back to them easily
    - Both the hyperparameters and the trained parameters should be saved
    - Cross-validation scores should be saved
    - Predictions could also be saved
    - Sklearn models can easily be saved via Python's `pickle` module or using the `joblib` library
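
K-fold cross-validation and the overfitting check can be sketched with `cross_val_score`. The regression data is randomly generated, and the unconstrained `DecisionTreeRegressor` is picked here only because it overfits readily.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy linear target (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=200)

tree_reg = DecisionTreeRegressor(random_state=42)

# Scikit-Learn scoring expects "higher is better", hence the negated MSE.
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Validation RMSE per fold:", rmse_scores)
print("Mean:", rmse_scores.mean(), "Standard deviation:", rmse_scores.std())

# Training RMSE for comparison: an unconstrained tree memorizes the training set.
tree_reg.fit(X, y)
train_rmse = np.sqrt(np.mean((tree_reg.predict(X) - y) ** 2))
print("Training RMSE:", train_rmse)
```

A training error far below the cross-validated error, as in this sketch, is the overfitting signal described above.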

## 2.7 Fine-Tune Your Model

- Once you have a shortlist of promising models, the models then need to be fine-tuned

### 2.7.1 Grid Search

- One option for fine-tuning a model is to experiment with different hyperparameter values
- Sklearn provides the `GridSearchCV` class to perform the search for you
    - Tell Sklearn what hyperparameters you want it to experiment with
    - It will use cross-validation to evaluate all the possible combinations of hyperparameter values
    - When you don't know what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10

### 2.7.2 Randomized Search

- Randomized search is an alternative to Grid Search and is preferred when the hyperparameter space is large
    - Sklearn provides the `RandomizedSearchCV` class
    - This will evaluate a given number of random combinations by selecting a random value for each hyperparameter at every
    iteration
    - The two main benefits of Randomized Search are:
        - Running it for $x$ iterations explores up to $x$ different values for every hyperparameter, as opposed to
        the few values per hyperparameter listed in a Grid Search
        - The number of iterations can be set so the computing budget can be controlled
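
The sketch below shows `RandomizedSearchCV` sampling a hyperparameter from a distribution. As assumptions for the example, it uses randomly generated data, a `Ridge` model, and a log-uniform distribution over `alpha`.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Made-up regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + rng.normal(scale=0.5, size=100)

# A distribution instead of a fixed list: each iteration draws a fresh value.
param_distributions = {"alpha": loguniform(1e-3, 1e3)}

random_search = RandomizedSearchCV(
    Ridge(), param_distributions,
    n_iter=20,  # computing budget: 20 sampled combinations
    cv=5, scoring="neg_mean_squared_error", random_state=42,
)
random_search.fit(X, y)
print(random_search.best_params_)
```

Raising or lowering `n_iter` trades search thoroughness against computing time without changing anything else in the setup.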

### 2.7.3 Ensemble Methods

- Another way to fine-tune the system is to combine the models
- The group (*ensemble*) of the "best" models will often perform better than the best individual model, especially if
the individual models make very different types of errors

### 2.7.4 Analyze the Best Models and Their Errors

- You will often gain good insights on the problem by inspecting the best models
- With this information, you may want to try dropping some of the less useful features
- Looking at the specific errors the system makes and understanding why it makes them can point to ways of fixing the problem
    - Possible fixes include:
        - Adding extra features
        - Getting rid of uninformative ones
        - Cleaning up outliers

### 2.7.5 Evaluate Your System on the Test Set

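- After fine-tuning the models, it's time to evaluate the final model on the *test set*
- To do so, you call `.transform()` (not `.fit_transform()`, since the pipeline must not be fit to the test set) using
the pipeline you created prior, then evaluate the final model's predictions on the transformed data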
- You might want to have an idea of how precise this estimate is and you can use the `scipy.stats.t.interval()` function
to compute a *95% confidence interval* for the generalization error
- A lot of hyperparameter tuning will usually lead to a *test set* evaluation that performs slightly worse than
cross-validation, because the system ends up fine-tuned to perform well on the validation data
- Prior to launching, you need to present your solution, highlighting:
    - Lessons learned
    - What worked and what did not
    - What assumptions were made
    - What the system's limitations are
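
The 95% confidence interval mentioned above can be sketched as follows. The labels and predictions are randomly generated stand-ins, and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Made-up test-set labels and predictions (illustrative only).
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=1000)
final_predictions = y_test + rng.normal(scale=50_000, size=1000)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2

# t-interval around the mean squared error, then a square root to get RMSE bounds.
ci_mse = stats.t.interval(confidence, len(squared_errors) - 1,
                          loc=squared_errors.mean(),
                          scale=stats.sem(squared_errors))
ci_rmse = np.sqrt(ci_mse)
print(ci_rmse)
```

The interval is built on the squared errors because the RMSE is the square root of their mean; the bounds are converted back to the RMSE scale at the end.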

## 2.8 Launch, Monitor, and Maintain Your System

- After getting approval to launch the model into production, it needs to be deployed
    - One way to do this is to save the trained Sklearn model (e.g. using `joblib`), including the full preprocessing and
    prediction pipeline, then load this trained model within your production environment and use it to make predictions by
    calling the `.predict()` method
- Once the model is deployed, code needs to be written to monitor the system's live performance at regular intervals and
trigger alerts when it drops
- In some cases, the model's performance can be inferred from downstream metrics
- A monitoring system needs to be put in place as well as all the relevant processes to define what to do in case of failures
and how to prepare for them
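
Saving and reloading a full pipeline with `joblib` might look like the sketch below. The data, the pipeline steps, and the file name `housing_model.pkl` are assumptions made up for the example, and a temporary directory stands in for real storage.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data with an exactly linear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

# Preprocessing and model live in one pipeline, so both are saved together.
full_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])
full_pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "housing_model.pkl")  # hypothetical file name
    joblib.dump(full_pipeline, model_path)      # done once, after training
    loaded_pipeline = joblib.load(model_path)   # done in the production environment

X_new = np.array([[0.1, -0.2]])
print(loaded_pipeline.predict(X_new))
```

Because the scaler is part of the saved object, production code only needs to call `.predict()` on raw feature values.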