# 1. Introduction

This chapter covers:
- Forms of machine learning beyond classification and regression
- Formal evaluation procedures for machine learning models
- Preparing data for deep learning
- Feature engineering
- Tackling overfitting
- The universal workflow for approaching machine learning problems

# 2. Four branches of machine learning

- **Supervised learning**
    - This is by far the most common case. It consists of learning to map input data to known targets (also called annotations), given a set of examples (often annotated by humans). **Supervised learning is by far the dominant form of deep learning today, with a wide range of industry applications.**
- **Unsupervised learning**
    - This branch of machine learning consists of finding interesting transformations of the input data without the help of any targets, for the purposes of data visualization, data compression, or data denoising, or to better understand the correlations present in the data at hand. Unsupervised learning is the bread and butter of data analytics, and it’s often a necessary step in better understanding a dataset before attempting to solve a supervised-learning problem. **Dimensionality reduction** and **clustering** are well-known categories of unsupervised learning.
- **Self-supervised learning**
    - This is a form of supervised learning without human-annotated labels: the labels are generated from the input data, typically using a heuristic algorithm.
- **Reinforcement learning**
    - In reinforcement learning, an agent receives information about its environment and learns to choose actions that will maximize some reward. Currently, reinforcement learning is mostly a research area and hasn’t yet had significant practical successes beyond games. In time, however, we expect to see reinforcement learning take over an increasingly large range of real-world applications: self-driving cars, robotics, resource management, education, and so on. It’s an idea whose time has come, or will come soon.

# 3. Evaluating machine-learning models

In machine learning, the goal is to achieve models that **generalize**, that is, models that perform well on never-before-seen data, and **overfitting** is the central obstacle. In this section, we’ll focus on **how to measure generalization**: how to evaluate machine-learning models.

## 3.1 Training, validation and test sets

Evaluating a model always boils down to splitting the available data into three sets: 

- **training**
- **validation**
- **test**

You train on the training data and evaluate your model on the validation data. Once your model is ready for prime time, you test it one final time on the test data.

Splitting your data into training, validation, and test sets may seem straightforward, but there are a few advanced ways to do it that can come in handy when little data is available.

- **Hold-out validation**
- **K-Fold**
- **Iterated K-Fold validation with shuffling**

<img width="800" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1zP9sqXEYvfNLyII_avHuF9JYk2-gBF0O">

>```python
# Hold-out validation (pseudo-code). `data`, `test_data`, and `get_model()`
# are placeholders assumed to be defined elsewhere.
import numpy as np

num_validation_samples = 10000
# Shuffling the data is usually appropriate.
np.random.shuffle(data)
# Defines the validation set
validation_data = data[:num_validation_samples]
# Defines the training set (everything after the validation slice)
training_data = data[num_validation_samples:]
# Trains a model on the training data, and evaluates it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)
# At this point you can tune your model, retrain it, evaluate it, tune it again...
# Once you’ve tuned your hyperparameters, it’s common to train your final model
# from scratch on all non-test data available.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
```

>```python
# K-fold validation (pseudo-code). `data`, `test_data`, and `get_model()`
# are placeholders assumed to be defined elsewhere.
import numpy as np

k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)
validation_scores = []
for fold in range(k):
    start = num_validation_samples * fold
    stop = num_validation_samples * (fold + 1)
    # Selects the validation data partition
    validation_data = data[start:stop]
    # Uses the remainder of the data as training data. np.concatenate is used
    # because `+` on NumPy arrays is element-wise addition, not concatenation.
    training_data = np.concatenate([data[:start], data[stop:]])
    # Creates a brand-new instance of the model (untrained)
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)
# Validation score: average of the validation scores of the k folds
validation_score = np.average(validation_scores)
# Trains the final model on all non-test data available
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)
```
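
The third protocol listed above, iterated K-fold validation with shuffling, has no snippet of its own. Here is a minimal sketch, assuming it means running K-fold several times with a fresh shuffle before each pass and averaging every fold score. The `train_and_score` callable is a hypothetical stand-in for "build a fresh model, train it, evaluate it"; the toy scorer in the demo simply returns the mean of the validation samples.

```python
import numpy as np


def iterated_k_fold(data, train_and_score, k=4, iterations=3, seed=0):
    """Average validation score over `iterations` shuffled K-fold passes.

    `train_and_score(training_data, validation_data)` is a hypothetical
    callable that trains a fresh model and returns its validation score.
    """
    rng = np.random.default_rng(seed)
    fold_size = len(data) // k
    scores = []
    for _ in range(iterations):
        # Reshuffle before every pass so each pass uses different folds.
        shuffled = data[rng.permutation(len(data))]
        for fold in range(k):
            start, stop = fold_size * fold, fold_size * (fold + 1)
            validation_data = shuffled[start:stop]
            # Join the two remaining slices into the training set.
            training_data = np.concatenate([shuffled[:start], shuffled[stop:]])
            scores.append(train_and_score(training_data, validation_data))
    return float(np.mean(scores)), scores


# Toy demo: the "score" is just the mean of the validation samples.
toy_data = np.arange(20, dtype=float)
mean_score, all_scores = iterated_k_fold(
    toy_data, lambda train, val: float(val.mean()), k=4, iterations=3
)
```

Note that this trains and evaluates `iterations * k` models, so it can be expensive.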


## 3.2 Data preprocessing and feature engineering

In addition to model evaluation, an important question we must tackle before we dive deeper into model development is the following: how do you prepare the input data and targets before feeding them into a neural network?

### 3.2.1 Data preprocessing

Data preprocessing aims at making the raw data at hand more amenable to neural networks. This includes:
- **vectorization**
- **normalization**
- **handling missing values**
- **feature extraction**
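
Two of these steps, normalization and handling missing values, can be sketched with NumPy. The toy matrix below is made up, and encoding missing entries as `np.nan` and then replacing them with 0 is an assumption for illustration, reasonable only when 0 is not already a meaningful value for that feature.

```python
import numpy as np

# Toy training matrix: 4 samples, 2 features on very different scales.
# np.nan marks a missing value (an assumption about how gaps are encoded).
x_train = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 600.0],
])

# Per-feature statistics, ignoring missing entries.
mean = np.nanmean(x_train, axis=0)
std = np.nanstd(x_train, axis=0)

# Normalization: the observed values of each feature end up with
# mean 0 and standard deviation 1.
x_norm = (x_train - mean) / std

# Handling missing values: replace NaN with 0, which after centering
# corresponds to the feature's mean.
x_clean = np.nan_to_num(x_norm, nan=0.0)
```

The `mean` and `std` computed on the training data would then be reused to normalize validation and test data.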

### 3.2.2 Feature engineering

The essence of **feature engineering** is making a problem easier by **expressing it in a simpler way**. It usually requires understanding the problem in depth.

<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=14dhxjfqcJlg4SCQHURYhLHPuDKQgoWfT">

Modern deep learning removes the need for most feature engineering, because neural networks are capable of automatically extracting useful features from raw data. Does this mean you don’t have to worry about feature engineering as long as you’re using deep neural networks? No, for two reasons:

- **Good features** still allow you to solve problems more elegantly while **using fewer resources**. For instance, it would be ridiculous to solve the problem of reading a clock face using a convolutional neural network.
- **Good features** let you solve a problem with far **less data**. The ability of deep learning models to learn features on their own relies on having lots of training data available; if you have only a few samples, then the information value in their features becomes critical.
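
To make the clock example concrete, here is a sketch of what a well-engineered feature buys you. It assumes that an earlier, hypothetical step has already located the tip of each clock hand as (x, y) coordinates relative to the clock center, with y pointing up. Once the coordinates are converted to angles, reading the time is plain arithmetic. The function names are illustrative.

```python
import math


def hand_angle(x, y):
    """Clockwise angle in degrees from the 12 o'clock position."""
    # atan2(x, y) measures the angle from the +y axis, growing toward +x.
    return math.degrees(math.atan2(x, y)) % 360


def read_clock(hour_tip, minute_tip):
    """Turn two hand-tip coordinates into an (hour, minute) reading."""
    minute = round(hand_angle(*minute_tip) / 6) % 60  # 360 degrees / 60 minutes
    # Remove the hour hand's drift caused by the elapsed minutes.
    hour_angle = hand_angle(*hour_tip) - minute * 0.5
    hour = round(hour_angle / 30) % 12  # 360 degrees / 12 hours
    return (12 if hour == 0 else hour), minute


three_oclock = read_clock(hour_tip=(1.0, 0.0), minute_tip=(0.0, 1.0))
```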

## 3.3 Overfitting and underfitting

At the **beginning of training**, optimization and generalization are correlated: **the lower the loss on training data, the lower the loss on test data**. While this is happening, your model is said to be **underfit**: there is still progress to be made; the network hasn’t yet modeled all relevant patterns in the training data. 

But after a certain number of iterations on the training data, **generalization stops improving, and validation metrics stall and then begin to degrade**: the model is starting to **overfit**. That is, it’s beginning to learn patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data.
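
The turning point described above can be located from the recorded validation losses. A small sketch, assuming one loss value per epoch; the numbers are invented for illustration and do not come from a real training run.

```python
# Hypothetical per-epoch losses (illustrative values, not real results).
train_loss = [0.60, 0.42, 0.31, 0.24, 0.18, 0.13, 0.09]
val_loss = [0.58, 0.45, 0.38, 0.36, 0.37, 0.40, 0.45]

# Training loss keeps falling, but validation loss bottoms out and then rises.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1  # 1-based

# Up to best_epoch the model is still underfit; afterwards it overfits.
overfit_epochs = list(range(best_epoch + 1, len(val_loss) + 1))
```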

To prevent a model from learning misleading or irrelevant patterns found in the training data, the best solution is to get more training data. 
- **A model trained on more data will naturally generalize better**. 

When that isn’t possible, the next-best solution is to do one of the following:
- modulate the quantity of information that your model is allowed to store
- add constraints on what information it’s allowed to store

If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the **most prominent patterns**, which have a better chance of generalizing well. The process of fighting overfitting this way is called **regularization**.

### 3.3.1 Regularization

These are the most common ways to prevent overfitting in neural networks:
- Get more training data.
- Reduce the capacity of the network.
- Add weight regularization.
- Add dropout.

#### 3.3.1.1 Reducing the network’s size

The general workflow to find an appropriate model size is:

1. Start with relatively few layers and parameters
2. Increase the size of the layers or add new layers until you see diminishing returns with regard to validation loss

Let’s try this on the movie-review classification network:

- Original model
>```python
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

- Version of the model with lower capacity
>```python
model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

- Version of the model with higher capacity
>```python
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

<table>
<tr>
    <td> <img src="https://drive.google.com/uc?export=view&id=1VMkzNR_R8ODITTtwCAmLZzjIZfJTN11x" width="400"> </td>
    <td> <img src="https://drive.google.com/uc?export=view&id=1-C0mb-NPmlyCFv-HMkSbATOVywmC9LVZ" width="400"> </td>
</tr>
</table>

As you can see, **the smaller network starts overfitting later** than the reference network (after six epochs rather than four), and its performance degrades more slowly once it starts overfitting.

**The bigger network starts overfitting almost immediately**, after just one epoch, and it overfits much more severely. Its validation loss is also noisier.
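
One rough way to quantify the capacity of these three networks is to count their trainable parameters. The sketch below assumes each `Dense` layer has a kernel of shape (inputs, units) plus one bias per unit; `dense_param_count` is an illustrative helper, not a Keras function.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # kernel matrix + bias vector
        input_dim = units
    return total


original = dense_param_count(10000, [16, 16, 1])
smaller = dense_param_count(10000, [4, 4, 1])
bigger = dense_param_count(10000, [512, 512, 1])
```

Under these assumptions the original model has 160,305 parameters, the smaller one 40,029, and the bigger one 5,383,681.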


#### 3.3.1.2 Adding weight regularization

**A common way to mitigate overfitting** is to put constraints on the complexity of a network by **forcing its weights to take only small values**, which makes the distribution of weight values more regular. This is called **weight regularization**, and it’s done by adding to the loss function of the network a cost associated with having large weights.

- **L1 regularization**
    - The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
- **L2 regularization**
    - The cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights).
    
>```python
# Adding L2 weight regularization to the model
from keras import layers, models, regularizers

model = models.Sequential()
# l2(0.001): every coefficient in the layer's weight matrix adds
# 0.001 * weight_coefficient_value ** 2 to the total loss of the network.
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

<table>
<tr>
    <td> <img src="https://drive.google.com/uc?export=view&id=1WwK9JJhPyf_WcYlB1LabnRvWuTLjVQNy" width="400"> </td>
</tr>
</table>

As you can see, the model with L2 regularization (dots) has become much more resistant to overfitting than the reference model (crosses), even though both models have the same number of parameters.
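
To see what the added cost looks like numerically, here is a NumPy sketch of the two penalties computed by hand. The kernel values and the `base_loss` figure are invented for illustration, and the helper names are not part of any library.

```python
import numpy as np


def l1_penalty(weights, factor):
    """factor * sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))


def l2_penalty(weights, factor):
    """factor * sum of squared weight values."""
    return factor * np.sum(np.square(weights))


# Toy kernel matrix standing in for one layer's weights.
kernel = np.array([[0.5, -2.0], [1.0, 0.0]])
base_loss = 0.30  # hypothetical data loss before regularization

loss_with_l1 = base_loss + l1_penalty(kernel, 0.001)
loss_with_l2 = base_loss + l2_penalty(kernel, 0.001)
```

In Keras, `regularizers.l1(0.001)` and `regularizers.l1_l2(l1=0.001, l2=0.001)` can be passed as `kernel_regularizer` in the same way as `regularizers.l2`. Because the penalty is only added during training, the training loss of a regularized network will be higher than its loss at test time.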

#### 3.3.1.3 Adding dropout

**Dropout** is one of the most effective and most commonly used regularization techniques for neural networks, developed by [Geoff Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton) and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training.

<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1RXqlzlb7oN7Q3e5ojctwmEXBUCk1ode2">

This technique may seem strange and arbitrary. Why would this help reduce overfitting? Hinton says he was inspired by, among other things, a fraud-prevention mechanism used by banks. In his own words, “I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly **removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting**.”

- The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren’t significant (what Hinton refers to as conspiracies), which the network will start memorizing if no noise is present.
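
The mechanics can be sketched in a few lines of NumPy. This is an illustration on a toy activation matrix, not a description of how any particular framework implements it; it assumes the variant that rescales the surviving activations at training time so that nothing needs to change at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5  # fraction of output features to zero out

# Toy layer output: a batch of 4 samples with 6 features each.
layer_output = np.ones((4, 6))

# Training time: zero out a random subset of features, then scale the
# survivors by 1 / (1 - rate) so the expected activation stays the same.
keep_mask = rng.random(layer_output.shape) >= rate
train_output = layer_output * keep_mask / (1.0 - rate)

# Test time: nothing is dropped and no scaling is needed.
test_output = layer_output
```

A different random mask is drawn for every batch, which is what prevents the network from relying on any fixed combination of features.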

>```python
# Adding dropout to the IMDB network
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
```

<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1PjdL0qXfg2AqwiY5Yp7H_lx8pnw-ETZD">

# 4. The universal workflow of machine learning

1. **Defining the problem and assembling a dataset**
    - you must define the problem at hand
    - machine learning can only be used to memorize patterns that are present in your training data. You can only recognize what you’ve seen before.
2. **Choosing a measure of success**
    - your metric for success will guide the choice of a loss function: what your model will optimize.
3. **Deciding on an evaluation protocol**
    - hold-out validation set
    - K-fold cross-validation
    - iterated K-fold validation
    - note: in most cases, the first will work well enough
4. **Preparing your data**
    - your data should be formatted as **tensors**.
    - the values taken by these tensors should usually be scaled to small values, typically in the [-1, 1] or [0, 1] range.
    - if different features take values in different ranges, then the data should be **normalized**.
    - you may want to do some **feature engineering**, especially for **small-data problems**.
5. **Developing a model that does better than a baseline**
    - you need to make three key choices to build your first working model
    
    <img width="600" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1Td3XWIeYStS600Swrp0G-Brne9P-BrAw">

6. **Scaling up: developing a model that overfits**
    - add layers.
    - make the layers bigger.
    - train for more epochs.
    - note: when you see that the model’s performance on the validation data begins to degrade, you’ve achieved overfitting
    
7. **Regularizing your model and tuning your hyperparameters**
    - Add dropout.
    - Try different architectures: add or remove layers.
    - Add L1 and/or L2 regularization.
    - Try different hyperparameters (such as the number of units per layer or the learning rate of the optimizer) to find the optimal configuration.
    - Optionally, iterate on feature engineering: add new features, or remove features that don’t seem to be informative.
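
For step 5, it helps to know the number a first model has to beat. One simple choice, assumed here for a classification problem scored by accuracy, is the accuracy of always predicting the most frequent class. The labels below are hypothetical and `majority_class_accuracy` is an illustrative helper.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical binary labels: 7 negatives and 3 positives.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline = majority_class_accuracy(labels)
```

A model whose validation accuracy does not exceed this value has not yet learned anything useful from the inputs.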