# 4 Fundamentals of Machine Learning

## 4.1 Four branches of ML

### 1) Supervised learning: learn how to map input data to known targets (*annotations*), given a set of examples

### 2) Unsupervised learning: find transformations of the input data without the help of any targets, for the purpose of data visualization, compression, denoising or better understanding of the correlations (e.g. *Dimensionality reduction* and *Clustering*)

### 3) Self-supervised learning: supervised learning without human-annotated labels; the labels are generated from the input data itself, typically using a heuristic algorithm

### 4) Reinforcement learning: an *agent* receives information about its environment and learns to choose actions that will maximize the reward

## 4.2 Evaluating machine-learning models

### In ML, the goal is to achieve models that *generalize* --- perform well on never-before-seen data --- and *overfitting* is the central obstacle

### The following sections look at strategies for mitigating overfitting and maximizing generalization

## 4.2.1 Training, validation and test sets

###  Evaluating a model always boils down to splitting the available data into 3 sets: training, validation and test

###  Why not just two sets, training and test? Because developing a model always involves *tuning its configuration* (e.g. choosing the number of layers or the size of the layers, the *hyper-parameters*)
###  You do this tuning by using the performance on the validation data as a feedback signal (in essence, tuning is a form of *learning*: a search for a good configuration in some parameter space). As a result, tuning based on the validation set can quickly cause *overfitting to the validation set*, even though the model is never directly trained on it

###  Central to this phenomenon is the notion of *information leaks*: each time you tune a hyper-parameter of the model based on its performance on the validation set, some information about the validation set *leaks* into the model, and the validation set becomes less reliable for evaluating the model.

### Thus the model should not have had access to *any* information about the test set, even indirectly. That's why we set aside a separate validation set for tuning rather than simply using two sets

### Beyond this basic split, there are three classic evaluation recipes, the more advanced of which come in handy when little data is available: 1) simple hold-out validation; 2) K-fold validation; 3) iterated K-fold validation with shuffling

## Simple hold-out validation

### Set apart some fraction of the data as the test set. Train on the remaining data, and evaluate on the test data. (Of course, we should also reserve a validation set to avoid *information leaks*)

In [None]:
# hold-out validation
import numpy as np

# Assumes `data` is a NumPy array of samples, `test_data` was set apart beforehand,
# and `get_model()` returns an object with `train` and `evaluate` methods.
num_validation_samples = 10000

np.random.shuffle(data)  # shuffling the data is usually appropriate

# split off the validation set
validation_data = data[:num_validation_samples]

# the remaining samples form the training set
training_data = data[num_validation_samples:]

# train a model on the training data and evaluate it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model, retrain it, evaluate it, tune it again, etc.
# to find the best hyper-parameters

# once you've tuned your hyper-parameters, it's common to train your final model from scratch
# on all non-test data available
model = get_model()
model.train(np.concatenate([training_data, validation_data]))

test_score = model.evaluate(test_data)

### Hold-out validation suffers from one flaw: if little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand
### This problem is easy to recognize: if different random shuffles of the data before splitting yield very different measures of model performance, then you are facing this problem

### K-fold validation and iterated K-fold validation are two ways to address this problem, as discussed next:

## K-fold validation

### Def: split data into K partitions of equal size. For each partition i, train a model on the remaining K-1 partitions and evaluate it on partition i. The final score is then the average of the K scores obtained.

### This method is helpful when the performance of your model shows significant variance depending on how the data was split

In [None]:
import numpy as np

# Assumes `data` is a NumPy array of samples and `test_data` was set apart beforehand.
k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for i in range(k):
    # partition i is the validation fold
    validation_data = data[i * num_validation_samples : (i+1) * num_validation_samples]
    # the other K-1 partitions form the training data;
    # np.concatenate joins the two slices ('+' on NumPy arrays would add them element-wise)
    training_data = np.concatenate([data[:num_validation_samples * i],
                                    data[num_validation_samples * (i+1):]])

    model = get_model()  # a fresh, untrained model for every fold
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

average_validation_score = np.average(validation_scores)

# tune the model, then train the final model on all non-test data
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)

## Iterated K-fold validation with shuffling 

### This approach is for situations in which you have relatively little data available and you need to evaluate the model as precisely as possible
### It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation
### Note that you end up training and evaluating P &times; K models (P is the number of iterations you use), which can be very expensive
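### A minimal sketch of iterated K-fold validation, assuming NumPy arrays. The names `iterated_k_fold`, `fit_and_score` and `mean_baseline` are hypothetical; `mean_baseline` only stands in for "train a model and return its validation score"

```python
import numpy as np

def iterated_k_fold(data, targets, fit_and_score, k=4, iterations=3, seed=0):
    """Average validation score over `iterations` shuffled K-fold runs."""
    rng = np.random.default_rng(seed)
    fold_size = len(data) // k
    scores = []
    for _ in range(iterations):
        order = rng.permutation(len(data))  # reshuffle before every K-fold run
        for i in range(k):
            val_idx = order[i * fold_size:(i + 1) * fold_size]
            train_idx = np.concatenate([order[:i * fold_size],
                                        order[(i + 1) * fold_size:]])
            scores.append(fit_and_score(data[train_idx], targets[train_idx],
                                        data[val_idx], targets[val_idx]))
    return float(np.mean(scores)), len(scores)


# Hypothetical stand-in for a real model: predict the training-set mean
# and report the mean absolute error on the validation fold.
def mean_baseline(x_train, y_train, x_val, y_val):
    return float(np.mean(np.abs(y_val - y_train.mean())))


x = np.arange(40, dtype=float).reshape(20, 2)  # toy data, for illustration only
y = x.sum(axis=1)
avg_score, num_models = iterated_k_fold(x, y, mean_baseline, k=4, iterations=3)
print(num_models)  # P * K = 3 * 4 = 12 models were trained
```

### The second return value makes the cost visible: every extra iteration adds K more training runs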

## 4.2.2 Things to keep in mind

### Keep an eye out for the following when choosing an evaluation protocol:

### 1) Data representativeness: 

### We want both the training set and the test set to be representative of the data at hand. For example, when classifying images of digits, if the samples are ordered by class (0~9) and you directly take the first 80% as the training set and the remaining 20% as the test set, you make a ridiculous mistake: the training set only contains classes 0~7, while the test set only contains classes 8~9. Thus, we usually should *randomly shuffle* the data before splitting
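### The ordered-split mistake can be reproduced on toy labels. This is only an illustrative sketch; the label array and the 80/20 split point are made up

```python
import numpy as np

# Toy labels sorted by class: 10 samples for each digit 0..9.
labels = np.repeat(np.arange(10), 10)
split = int(0.8 * len(labels))

# Naive split on the ordered data: the test part only contains classes 8 and 9.
naive_test_classes = np.unique(labels[split:])

# Shuffle first, then split.
rng = np.random.default_rng(0)
shuffled = rng.permutation(labels)
shuffled_test_classes = np.unique(shuffled[split:])

print(naive_test_classes)     # [8 9]
print(shuffled_test_classes)  # a much broader mix of classes
```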

### 2) The arrow of time:  
### If we try to predict the future given the past, we should *not* randomly shuffle the data before splitting, because doing so creates a *temporal leak*: the model is effectively trained on data from the *future*. Make sure all data in the test set is *posterior* to the data in the training set
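### A minimal sketch of a time-respecting split, assuming each sample carries a timestamp. The arrays `timestamps` and `values` are hypothetical toy data

```python
import numpy as np

# Toy time series: one unique timestamp and one value per sample, in arbitrary order.
rng = np.random.default_rng(0)
timestamps = rng.permutation(100)
values = rng.normal(size=100)

# Sort by time instead of shuffling.
order = np.argsort(timestamps)
timestamps, values = timestamps[order], values[order]

split = int(0.8 * len(values))
train_t, test_t = timestamps[:split], timestamps[split:]
train_v, test_v = values[:split], values[split:]

# Every test sample is later than every training sample.
print(train_t.max() < test_t.min())  # True
```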

### 3) Redundancy in data: 
### If some data points appear twice or more, then shuffling the data and splitting it into a training set and a validation set will result in redundancy between the two: in effect, we are testing on part of the training data. Thus we should make sure the training set and validation set are *disjoint*
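### One possible way to enforce disjoint sets is to drop exact duplicates before splitting. The sketch below assumes duplicates are identical rows of a NumPy array; the data and the 75/25 split are made up

```python
import numpy as np

# Toy dataset in which some rows appear more than once.
data = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [3, 4], [7, 8]])

# Drop exact duplicate rows before shuffling and splitting.
unique_data = np.unique(data, axis=0)

rng = np.random.default_rng(0)
unique_data = rng.permutation(unique_data)  # shuffles the rows
split = int(0.75 * len(unique_data))
train, test = unique_data[:split], unique_data[split:]

# Check that no test row also occurs in the training set.
overlap = {tuple(row) for row in train} & {tuple(row) for row in test}
print(len(unique_data), len(overlap))  # 4 0
```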

## 4.3 Data preprocessing, feature engineering and feature learning  

### Though many data-preprocessing and feature-engineering techniques are *domain specific*, we first review the basics that are common to all data domains

## 4.3.1 Data preprocessing for neural networks

### Data preprocessing aims at making the raw data at hand more amenable to neural networks.
### This includes 1) vectorization; 2) normalization; 3) handling missing values; 4) feature extraction

## 1) Vectorization

### All inputs and targets in a neural network must be *tensors of floating-point data* (or, in specific cases, tensors of integers). Whatever data you need to process, you must first turn it into tensors, a step called *data vectorization*
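### As an illustration, lists of integer indices (e.g. word indices) can be turned into a float tensor by multi-hot encoding. This is one possible vectorization scheme, not the only one; `vectorize_sequences` is a name chosen for this sketch

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Turn lists of integer indices into a float32 multi-hot matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # mark the indices that occur in sample i
    return results

# Toy input: two samples given as lists of indices below 5.
x = vectorize_sequences([[0, 2], [1, 2, 4]], dimension=5)
print(x.shape, x.dtype)  # (2, 5) float32
```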

## 2) Value normalization

### In general, it is not safe to feed a neural network data that 1) takes relatively large values or 2) is heterogeneous (e.g. one feature in the range 0~1 and another in the range 100~200). Doing so can trigger large gradient updates that will prevent the network from converging.

### To make learning easier for the network, the data should have the following characteristics:
### 1) Take small values: typically most values should be in the 0~1 range;
### 2) Be homogeneous --- all features should take values in roughly the same range;
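### A minimal sketch of min-max scaling, one common way to bring every feature into the 0~1 range. The array `x` is toy data, and the sketch assumes no feature is constant (otherwise the division would be by zero)

```python
import numpy as np

# Toy data: two features on very different scales.
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])

x_min = x.min(axis=0)  # per-feature minimum
x_max = x.max(axis=0)  # per-feature maximum
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled.min(axis=0), x_scaled.max(axis=0))  # [0. 0.] [1. 1.]
```

### In practice `x_min` and `x_max` would be computed on the training data only and then reused for the validation and test data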

### Additionally, we might have stricter normalization rules, although it is not always necessary:
### ---> Normalize each feature independently to have a mean of 0 and standard deviation of 1

In [None]:
# easy to do with NumPy arrays, assuming x is a 2D float array of shape (samples, features)

x -= x.mean(axis = 0)
x /= x.std(axis = 0)

## 3) Handling missing values

### In general, with neural networks, it's safe to input missing values as 0, with the condition that 0 is not already a meaningful value.
### The network will learn from exposure to the data that the value 0 means *missing data* and will start to *learn* to ignore the value.

### Note that if you expect missing values in the test data, but the network was trained on data without any missing values, the network won't have learned to ignore missing values!

### In this situation, we should artificially generate training samples with missing values: e.g. copy some training samples several times and drop some of the features that you expect are likely to be missing in the test data
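### A sketch of how such samples could be generated, assuming missing values are encoded as 0. The choice of features in `likely_missing` and the drop probability of 0.5 are hypothetical

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data without any missing values (all entries are nonzero).
x_train = rng.uniform(1.0, 2.0, size=(6, 4))

# Hypothetical assumption: features 1 and 3 are likely to be missing at test time.
likely_missing = [1, 3]

# Copy the samples and zero out each candidate feature with probability 0.5.
x_copy = x_train.copy()
mask = rng.random((len(x_copy), len(likely_missing))) < 0.5
x_copy[:, likely_missing] = np.where(mask, 0.0, x_copy[:, likely_missing])

# Train on the original samples plus the copies with dropped features.
x_augmented = np.concatenate([x_train, x_copy])
print(x_augmented.shape)  # (12, 4)
```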

## 4.3.2 Feature engineering 

### *Feature engineering* is the process of using your own knowledge about the data and the ML algorithm at hand to make the algorithm work better by applying hardcoded (non-learned) transformations to the data before it goes into the model

### In many cases, it isn't reasonable to expect an ML model to be able to learn from completely arbitrary data. The data needs to be presented in a way that will make the model's job easier

### The essence of feature engineering: making a problem easier by expressing it in a simpler way. It usually requires understanding the problem in depth.
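### A hypothetical illustration: to read the time from a clock, the position (x, y) of the tip of a hand, measured from the clock center, can be re-expressed as a single angle, a far simpler input than raw coordinates or pixels. The function `hand_angle` is a name chosen for this sketch

```python
import numpy as np

def hand_angle(x, y):
    """Clockwise angle in degrees, measured from 12 o'clock, of a hand tip at (x, y)."""
    return float(np.degrees(np.arctan2(x, y)) % 360.0)

print(hand_angle(0.0, 1.0))   # hand pointing at 12
print(hand_angle(1.0, 0.0))   # hand pointing at 3
print(hand_angle(-1.0, 0.0))  # hand pointing at 9
```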

### Fortunately, modern deep learning removes the need for most feature engineering, because neural networks are capable of automatically extracting useful features from raw data

### But good feature engineering still helps, for 2 reasons:
### 1) it allows us to solve problems more elegantly while using fewer resources;
### 2) it lets us solve a problem with far less data: if we only have a few samples, then the information value in their features becomes critical

## 4.4 Overfitting and underfitting

### The fundamental issue in ML is the tension between *optimization* & *generalization*:

### *Optimization* refers to the process of adjusting a model to get the best performance possible on training data (*learning* in *machine learning*);
### *Generalization* refers to how well the trained model performs on data it has never seen before;
### The goal is to get good generalization, but we cannot control generalization directly; we can only adjust the model based on the training data.

### At the beginning of training, optimization and generalization are correlated: the lower the loss on training data, the lower the loss on test data. While this happens, the model is said to be *underfit*: there is still progress to be made; the network has not yet modeled all relevant patterns in the training data.

### But after a certain number of iterations on training data, generalization stops improving and validation metrics stall and begin to degrade  ----------The model is starting to *overfit*
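### The turning point can be read off the validation curve. The per-epoch losses below are made-up numbers for illustration, not measurements

```python
import numpy as np

# Hypothetical per-epoch losses.
train_loss = np.array([0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13])
val_loss = np.array([0.95, 0.70, 0.58, 0.52, 0.50, 0.53, 0.57, 0.62])

best_epoch = int(np.argmin(val_loss)) + 1  # epochs are counted from 1
print(best_epoch)  # 5

# After the best epoch, training loss keeps falling while validation loss rises.
train_still_improving = bool((np.diff(train_loss)[best_epoch - 1:] < 0).all())
val_degrading = bool((np.diff(val_loss)[best_epoch - 1:] > 0).all())
print(train_still_improving and val_degrading)  # True
```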

### To prevent overfitting (learning misleading and irrelevant patterns in the training data):
### 1) Best solution is to *get more training data*;
### 2) When it's not possible, the next-best solution is to modulate the quantity of information that your model is allowed to store or to add constraints on what information it's allowed to store

### The process of fighting overfitting this way is called *regularization*. Let's look at some of the most common regularization techniques:

## 4.4.1 Reducing the network's size

### Reduce the size of the model, i.e. the number of learnable parameters in the model (determined by the number of layers and the number of units per layer), which is often referred to as the model's *capacity*.

### The model with more parameters has more *memorization capacity* and therefore can easily learn a perfect dictionary-like mapping (without any generalization power), but such a model is useless for predicting new things

### On the other hand, if the network has limited memorization resources, it won't be able to learn this mapping as easily; thus, in order to minimize the loss, it will have to resort to learning compressed representations that have predictive power regarding the targets, precisely the type of representations we're interested in (BUT you need to make sure your model doesn't *underfit*!)

### Unfortunately, there is no magical formula to determine the right number of layers or the right size for each layer; we must evaluate an array of different architectures (on the validation set, of course, not on the test set)
### The general workflow is to start with relatively few layers and parameters, and increase the size of layers or add new layers until you see diminishing returns with regard to validation loss
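### To compare candidate architectures by capacity, the number of learnable parameters of a stack of fully connected layers can be counted by hand. A small sketch; `dense_param_count` is a helper written for this note, and the layer sizes are only examples

```python
def dense_param_count(input_dim, layer_units):
    """Number of weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += (input_dim + 1) * units  # weight matrix plus one bias per unit
        input_dim = units                 # this layer's output feeds the next layer
    return total

# 10,000 input features followed by layers of different widths.
print(dense_param_count(10000, [4, 4, 1]))      # 40029
print(dense_param_count(10000, [16, 16, 1]))    # 160305
print(dense_param_count(10000, [512, 512, 1]))  # 5383681
```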

## 4.4.2 Adding weight regularization

### Occam's razor principle: given two explanations for something, the explanation most likely to be correct is the simplest one, the one that makes fewer assumptions. Applied to ML: simple models are less likely to overfit than complex ones

###  A *simple* model here is a model where the distribution of parameter values has *less entropy* (or a model with fewer parameters).

### Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more *regular*. This is called *weight regularization*

### *weight regularization* is done by adding to the loss function of the network a *cost* associated with having large weights. Generally the cost comes in 2 flavors:

### 1) *L1 regularization*: the cost is proportional to the *absolute value of the weight coefficients* (L1 norm);
### 2) *L2 regularization*: the cost is proportional to the *square of the value of the weight coefficients* (squared L2 norm)
### Note: L2 regularization is also called *weight decay* in the context of neural networks
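### Both penalties are simple to compute by hand. The sketch below uses a made-up weight matrix `w`, a made-up data loss, and a regularization factor of 0.001; `l1_penalty` and `l2_penalty` are names chosen for this note

```python
import numpy as np

def l1_penalty(weights, factor):
    """Cost proportional to the sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    """Cost proportional to the sum of squared weight values."""
    return factor * np.sum(np.square(weights))

# Toy weight matrix: sum of |w| is 3.5, sum of w^2 is 5.25.
w = np.array([[0.5, -2.0],
              [1.0, 0.0]])
data_loss = 0.30  # hypothetical loss before regularization

total_loss_l1 = data_loss + l1_penalty(w, 0.001)
total_loss_l2 = data_loss + l2_penalty(w, 0.001)
print(total_loss_l1, total_loss_l2)
```

### Large weights are punished more heavily by the L2 penalty (here the weight -2.0 contributes 4.0 to the sum of squares but only 2.0 to the sum of absolute values)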

### In Keras, weight regularization is added by passing *weight regularizer instances* to layers as keyword arguments. Let's add L2 weight regularization to the movie-review classification network:

In [2]:
# add L2 regularization to the model

from keras import models
from keras import layers
from keras import regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer = regularizers.l2(0.001),
                       activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(16, kernel_regularizer = regularizers.l2(0.001),
                       activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

### l2(0.001) means every coefficient in the weight matrix of the layer will add 0.001 &times; weight_coefficient_value ** 2 to the total loss of the network. Note that because this penalty is *only added at training time*, the loss for the network will be much higher at training time than at test time

### We may have other choices for weight regularizers in Keras

In [None]:
from keras import regularizers

regularizers.l1(0.001)  # L1 regularization

regularizers.l1_l2(l1 = 0.001, l2 = 0.001)  # simultaneous L1 and L2 regularization

## 4.4.3 Adding dropout

### *Dropout*, applied to a layer, consists of randomly *dropping out* (setting to 0) a number of output features of the layer during training.

### e.g. if [0.2, 0.5, 1.3, 0.8, 1.1] is the output of a given layer, then after dropout is applied this vector will have a few zero entries distributed randomly: e.g. [0, 0.5, 1.3, 0, 1.1]

### *dropout rate*: the fraction of the features that are zeroed out (usually between 0.2~0.5)
### At test time, no units are dropped out; instead the layer's output values are scaled down by a factor equal to the fraction of units that were kept during training (1 minus the dropout rate), to balance for the fact that more units are active than at training time

In [4]:
import numpy as np

test = np.array([3,  1, 10, 15])

x = np.random.randint(0, high=2, size = test.shape)
print(x)

[1 0 0 0]


### Consider a NumPy matrix containing the output of a layer, *layer_output*, of shape (batch_size, features). At training time, we zero out at random a fraction of the values in the matrix (here about 50%)

In [None]:
layer_output *= np.random.randint(0, high = 2, size = layer_output.shape)

### At test time, we scale the output down by the fraction of units that were kept during training (1 minus the dropout rate). Here we scale by 0.5 because half of the units were dropped

In [None]:
layer_output *= 0.5

### Note that this process can also be implemented by doing both operations at training time and leaving the output unchanged at test time, which is often how it is implemented in practice

In [None]:
layer_output *= np.random.randint(0, high = 2, size = layer_output.shape)
layer_output /= 0.5
# Note that here we scale up (divide by the keep probability, 0.5) rather than scale down

### The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren't significant, which the network would start memorizing if no noise were present
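### The training-time variant with rescaling can be generalized to an arbitrary dropout rate. A minimal NumPy sketch, not how Keras implements it internally; `inverted_dropout` is a name chosen for this note and the activations are toy data

```python
import numpy as np

def inverted_dropout(layer_output, rate, rng):
    """Zero each entry with probability `rate` and rescale the rest by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    mask = rng.random(layer_output.shape) < keep_prob  # True where a unit is kept
    return layer_output * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))  # toy activations of shape (batch_size, features)
dropped = inverted_dropout(layer_output, rate=0.2, rng=rng)

zero_fraction = float((dropped == 0).mean())  # close to the dropout rate, 0.2
mean_after = float(dropped.mean())            # close to the original mean, 1.0
print(zero_fraction, mean_after)
```

### Because the kept values are scaled up during training, the expected output stays the same and nothing needs to change at test time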

### In Keras, we introduce dropout in a network via the *Dropout* layer

In [None]:
# IMDB example: we add 2 Dropout layers 

model = models.Sequential()

model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dropout(0.5))

model.add(layers.Dense(16, activation = 'relu',))
model.add(layers.Dropout(0.5))

model.add(layers.Dense(1, activation = 'sigmoid'))

## To recap, the most common ways to prevent overfitting:
## 1) Get more training data;
## 2) Reduce the capacity of the network;
## 3) Add weight regularization;
## 4) Add dropout.