# Chapter 1: The Machine Learning Landscape

In [1]:
# Import packages
import numpy as np

## Table of Contents

1. [Types of Machine Learning System](#types)
    1. [Supervision](#supervision)
    2. [Batch / Online Learning](#batch)
    3. [Instance / Model-Based Learning](#instance)
2. [Main Challenges of Machine Learning](#challenges)
3. [Testing and Validation](#testing)
    1. [Hyperparameter Tuning and Model Selection](#modelselection)
    2. [Data Mismatch](#datamismatch)
4. [No Free Lunch](#nofreelunch)
5. [Appendix](#appendix)
    1. [Code Samples](#codesamples)
    2. [List of Machine Learning Algorithms](#mlalgorithms)


## Types of Machine Learning System <a name="types"></a>

- *Accuracy:* proportion of samples that are classified correctly (correct predictions / total predictions)
- *Data mining:* applying ML techniques to dig into large amounts of data to discover patterns
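
As a minimal sketch of the accuracy definition, the snippet below counts matching labels and divides by the total. The label arrays `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and predictions (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
n_correct = int(np.sum(y_true == y_pred))
accuracy = n_correct / len(y_true)
print(n_correct, accuracy)
```

Here 8 of the 10 predictions match, so the accuracy is 0.8 (whereas the ratio of correct to incorrect would be 8 / 2 = 4).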

### Supervision <a name="supervision"></a>

**Supervised learning:** training data is labelled
- Regression
- Classification

**Unsupervised learning:** training data is unlabelled
- Clustering
- Visualisation: output 2D or 3D representation of data
- Dimensionality reduction: simplify data without losing too much information (feature extraction)
- Anomaly detection
- Association rule learning: discover relations between attributes

**Semi-supervised learning:** training data is partially labelled
- Labelling friends in Facebook photos
- Often a blend of supervised and unsupervised learning algorithms

**Reinforcement learning:** agent pursues policy and learns according to rewards or penalties
- DeepMind's AlphaGo

### Batch / Online Learning <a name="batch"></a>

**Batch (offline) learning:** system is incapable of learning incrementally, so it is trained (offline) using all available data
- Ineffective if you need to adapt to rapidly changing data
- Takes a lot of computing power

**Online learning:** train system incrementally by feeding it data sequentially (individually or in mini-batches)
- Can discard new data instances after learning from them (saves space)
- *Out-of-core* learning: use online learning to train systems on huge datasets that cannot fit in memory (usually done offline)
- *Learning rate:* how quickly system adapts to new data
- If bad data is fed to system, performance will gradually decline
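
One way to picture online learning is mini-batch gradient descent on a linear model, sketched below with NumPy only. The data stream, the model `y = w * x + b`, and the values of `learning_rate` and `batch_size` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "stream": y = 3x + 2 plus a little noise (made-up data)
X = rng.uniform(-1, 1, size=4000)
y = 3 * X + 2 + rng.normal(0, 0.1, size=4000)

w, b = 0.0, 0.0        # model parameters, updated incrementally
learning_rate = 0.1    # how quickly the model adapts to new data
batch_size = 20

for start in range(0, len(X), batch_size):
    # Take the next mini-batch from the stream
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]

    # Gradient of the mean squared error on this mini-batch only
    error = w * xb + b - yb
    w -= learning_rate * 2 * np.mean(error * xb)
    b -= learning_rate * 2 * np.mean(error)
    # xb and yb are no longer needed after this step and could be discarded

print(w, b)
```

A higher `learning_rate` adapts faster but also forgets old data faster and reacts more strongly to bad data. A lower one is more stable but slower to adapt.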

### Instance / Model-Based Learning  <a name="instance"></a>

**Instance-based learning:**
- Uses training data when evaluating new data point
- Learns the examples 'by heart' and generalises to new data by using a similarity measure to compare with the learned examples

**Model-based learning:**
- Training data is used to create a model, which is then used to evaluate new data; the training data is not used directly
- *Utility function*: measures how good your model is
- *Cost function*: measures how bad your model is
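
The contrast between the two approaches can be sketched on a toy 1D regression problem. The data, the choice of k-nearest neighbours with `k = 3`, and the straight-line model are assumptions made for illustration.

```python
import numpy as np

# Made-up 1D training data, roughly y = 2x
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
x_new = 3.4

# Instance-based: keep the training data and average the k most similar examples
k = 3
distances = np.abs(X_train - x_new)        # similarity measure: absolute distance
nearest = np.argsort(distances)[:k]        # indices of the k closest examples
knn_prediction = y_train[nearest].mean()

# Model-based: fit slope and intercept once, then predict from the parameters alone
slope, intercept = np.polyfit(X_train, y_train, deg=1)
model_prediction = slope * x_new + intercept

# Cost function: mean squared error of the fitted line on the training data
cost = np.mean((slope * X_train + intercept - y_train) ** 2)

print(knn_prediction, model_prediction, cost)
```

The k-NN prediction needs `X_train` and `y_train` at prediction time, while the linear model only needs `slope` and `intercept`.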

## Main Challenges of Machine Learning <a name="challenges"></a>

**Insufficient training data:**
- Reasonable algorithms may perform almost identically if there is enough data, but insufficient data is still common

**Nonrepresentative training data:**
- *Sampling noise*: Nonrepresentative data as a result of random chance (e.g. if sample is too small)
- *Sampling bias*: Nonrepresentative data as a result of flawed sampling
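
Sampling noise can be illustrated by drawing many random samples of different sizes from the same population and looking at how much the sample mean varies. The population and the helper `spread_of_sample_means` below are hypothetical. The sketch covers sampling noise only, not sampling bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population with mean 50 and standard deviation 10
population = rng.normal(loc=50, scale=10, size=100_000)

def spread_of_sample_means(sample_size, n_repeats=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [rng.choice(population, size=sample_size).mean()
             for _ in range(n_repeats)]
    return float(np.std(means))

small = spread_of_sample_means(10)      # small samples: noisy estimates
large = spread_of_sample_means(1000)    # large samples: much more stable
print(small, large)
```

The spread for the small samples should come out noticeably larger, which is the sense in which a small sample can be nonrepresentative purely by chance.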

**Poor-quality data:**
- E.g. missing features - possible responses include omitting this feature, omitting these instances, and approximating missing values
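
The three responses to missing values can be sketched with NumPy as follows. The feature matrix is made up, `np.nan` is assumed to mark missing entries, and the column median is just one possible approximation.

```python
import numpy as np

# Made-up feature matrix; np.nan marks a missing value
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 50.0],
])

# Option 1: omit the feature (column) that contains missing values
cols_with_nan = np.isnan(X).any(axis=0)
X_drop_feature = X[:, ~cols_with_nan]

# Option 2: omit the instances (rows) that contain missing values
rows_with_nan = np.isnan(X).any(axis=1)
X_drop_rows = X[~rows_with_nan]

# Option 3: approximate missing values, here with the column median
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

print(X_drop_feature.shape, X_drop_rows.shape)
print(X_imputed)
```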

**Irrelevant features:**
- *Feature selection*: selecting best features to train on
- *Feature extraction*: combining existing features to produce better ones

**Overfitting training data:**
- *Overfitting*: the model performs well on the training data but doesn't generalise well
- Possible solutions:
    - Simplify the model by reducing number of parameters, number of attributes in training data, or by constraining the model
    - Gather more training data
    - Reduce noise in training data
- *Regularisation*: constraining a model to make it simpler and reduce the risk of overfitting
- *Hyperparameter*: a parameter of the learning algorithm, not the model. It is set prior to training and remains constant during training. The amount of regularisation is commonly controlled by hyperparameters.
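
As one concrete example of regularisation controlled by a hyperparameter, the sketch below uses ridge regression (an L2 penalty on the weights), which is a technique chosen here for illustration rather than taken from the notes above. The data, the helper `fit_ridge`, and the `alpha` values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 20 samples, 5 features, noisy linear target
X = rng.normal(size=(20, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 0.5]) + rng.normal(0, 0.5, size=20)

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression without intercept: solve (X^T X + alpha I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# alpha is a hyperparameter: it is fixed before training, not learned from the data
alphas = [0.0, 1.0, 100.0]
norms = [float(np.linalg.norm(fit_ridge(X, y, alpha))) for alpha in alphas]
print(norms)
```

Larger `alpha` values constrain the model more strongly, so the norm of the weight vector shrinks as `alpha` grows.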

**Underfitting training data:**
- *Underfitting*: when the model can't capture the complexity of the data
- Possible solutions:
    - Select a more complex/powerful model
    - Improve the features (e.g. through engineering)
    - Reduce constraints (e.g. by reducing regularisation hyperparameters)

## Testing and Validation <a name="testing"></a>

- Recommended to split data into *training set* and *test set*
- *Generalisation/out-of-sample error*: error rate on unseen data, estimated using the test set
- If training error is low but generalisation error is high then model is overfitting training data
- Common to use 80/20 train/test split but this depends on the size of dataset (you need enough in test set to get a good estimate of generalisation error)
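
A minimal sketch of an 80/20 split using a random permutation of the indices is shown below. The dataset and the `test_ratio` value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up dataset: 100 instances, 3 features, binary labels
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

test_ratio = 0.2
shuffled = rng.permutation(len(X))            # shuffle indices before splitting
n_test = int(len(X) * test_ratio)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))
```

Fixing the seed keeps the split reproducible, so the test set stays the same between runs.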

### Hyperparameter Tuning and Model Selection <a name="modelselection"></a>

- If you choose between models or select hyperparameters based on test set error then you have used the test set for training and test set error will not be a good estimate for generalisation error

**Holdout Validation:**
- Split data into training, validation (or development), and test sets
- Train multiple models (with different hyperparameters) on the training set, select the model that performs best on the validation set, retrain it on the training + validation sets, then estimate generalisation error using the test set
- This is effective if there is *lots* of data, but if the validation set is too small it won't give an accurate estimate of generalisation error in different models and if the reduced training set is too small then the models trained on it will perform significantly worse than those trained on the full training set.
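
The holdout procedure can be sketched as follows, assuming a toy problem where the hyperparameter is the degree of a polynomial fit. The data, the 60/20/20 split, the candidate degrees, and the helper `mse` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: quadratic signal plus noise
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x**2 - x + 1 + rng.normal(0, 0.5, size=200)

# 60/20/20 split (the data is already in random order)
x_train, y_train = x[:120], y[:120]
x_val, y_val = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Train one candidate per hyperparameter value and compare on the validation set
val_errors = {}
for degree in [1, 2, 6]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    val_errors[degree] = mse(coeffs, x_val, y_val)
best_degree = min(val_errors, key=val_errors.get)

# Retrain the winner on train + validation, then touch the test set once
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train, y_val])
final_coeffs = np.polyfit(x_full, y_full, deg=best_degree)
test_error = mse(final_coeffs, x_test, y_test)

print(val_errors, best_degree, test_error)
```

The test set plays no part in choosing `best_degree`, which is what keeps `test_error` usable as an estimate of generalisation error.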

**Cross-validation:**
- Perform validation many times with different small validation sets and average validation errors
- Works better with less data, but you need to train models for each round of validation
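
A rough sketch of k-fold cross-validation, written with NumPy only, is given below. It reuses the idea of choosing a polynomial degree. The data, the helper `cross_val_mse`, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: quadratic signal plus noise
x = rng.uniform(-3, 3, size=60)
y = 0.5 * x**2 - x + 1 + rng.normal(0, 0.5, size=60)

def cross_val_mse(x, y, degree, n_folds=5):
    """Average validation MSE of a polynomial fit across n_folds folds."""
    folds = np.array_split(np.arange(len(x)), n_folds)
    errors = []
    for val_idx in folds:
        # Train on everything except the current validation fold
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False
        coeffs = np.polyfit(x[train_mask], y[train_mask], deg=degree)
        predictions = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((predictions - y[val_idx]) ** 2))
    return float(np.mean(errors))

scores = {degree: cross_val_mse(x, y, degree) for degree in [1, 2, 6]}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Each candidate is trained `n_folds` times, which is the extra training cost mentioned above. The folds are taken in order here because the made-up data is already random; real data would usually be shuffled first.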

### Data Mismatch <a name="datamismatch"></a>

- You may have a large amount of data, but only a small part is representative of data that will be used in production
- Priority: the data in validation and test sets should be as representative as possible of production data 
- Then if models perform poorly on the validation set, it may be because of a mismatch between training and validation sets
- Solution:
    - Hold some of the (unrepresentative) training data in a *train-dev set*
    - Evaluate on train-dev set before validation set to determine if error is due to data mismatch or overfitting
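
The train-dev check can be written down as a small decision rule. The function `diagnose`, its error inputs, and the `gap` threshold are hypothetical and only meant to make the reasoning explicit.

```python
def diagnose(train_error, train_dev_error, val_error, gap=0.05):
    """Rough heuristic; `gap` is an arbitrary illustrative threshold."""
    if train_dev_error - train_error > gap:
        # Poor on unseen data drawn from the same distribution as the training set
        return "overfitting"
    if val_error - train_dev_error > gap:
        # Fine on train-dev, poor on data that is representative of production
        return "data mismatch"
    return "no obvious problem"

# Made-up error rates for illustration
print(diagnose(0.02, 0.15, 0.16))
print(diagnose(0.02, 0.03, 0.20))
print(diagnose(0.02, 0.03, 0.04))
```

The train-dev set comes from the same distribution as the training set but is not trained on, so a large error there points to overfitting rather than to a mismatch with the validation data.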

## No Free Lunch <a name="nofreelunch"></a>
- The choice of machine learning model is based on assumptions
- *No Free Lunch Theorem:* Machine learning algorithms all perform the same when averaged over all possible problems
- There is no such thing as a good machine learning algorithm, only an appropriate algorithm

## Appendix <a name="appendix"></a>

### Code Samples <a name="codesamples"></a>

#### NumPy r_ and c_ Objects <a name="r_"></a>

Short-hand for building up arrays quickly

In [2]:
# Create 1D array with range specified by slice
np.r_[0:5:1]

array([0, 1, 2, 3, 4])

In [3]:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])

# Concatenate a and b along 0th axis
np.r_[a, b]

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [4]:
# Concatenate along 1st axis
np.r_['1', a, b]

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [5]:
# 2nd integer forces arrays to be 2-dimensional, then concatenate along 0th axis
np.r_['0,2', [1,2,3], [4,5,6]]

array([[1, 2, 3],
       [4, 5, 6]])

In [16]:
# 3rd integer specifies how to upgrade to 2d
# It gives the axis which will contain the start of the existing arrays
# Default is -1
print(np.r_['0,2,0', [1,2,3]].shape)
print(np.r_['0,2,1', [1,2,3]].shape)

(3, 1)
(1, 3)


In [17]:
np.r_['0,2,0', [1,2,3], [4,5,6]]

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

In [18]:
# c_[inputs] is short-hand for r_['-1,2,0', inputs]
# Useful for upgrading 1d to 2d as column vectors and concatenating horizontally
np.c_[[1, 2, 3], [4, 5, 6]]

array([[1, 4],
       [2, 5],
       [3, 6]])

### List of Machine Learning Algorithms <a name="mlalgorithms"></a>

- Linear Regression
- Logistic Regression
- k-Nearest Neighbours
- Neural network
    - Convolutional neural network (CNN)
    - Recurrent neural network (RNN)
    - Perceptron
- Transformer
- Support vector machine (SVM)
    - Regression SVM
    - One-class SVM (anomaly detection)
- Random forest
    - Regression random forest
    - Decision tree
    - Isolation forest (anomaly detection)
- Deep belief network (DBN)    
    - Restricted Boltzmann machine (RBM)
- K-means clustering
- DBSCAN
- Hierarchical Cluster Analysis (HCA)
- Principal components analysis (PCA)
    - Kernel PCA
- Locally linear embedding (LLE)
- t-distributed stochastic neighbour embedding (t-SNE)
- Association rule learning
    - Apriori
    - Eclat
- Naive Bayes classifier
- Winnow


