# 4.2 Evaluating machine-learning models

In the three examples presented in chapter 3, we split the data into a **training set**, a **validation set**, and a **test set**. The reason not to evaluate the models on the same data they were trained on quickly became evident: after just a few epochs, all three models began to **overfit**. That is, their performance on never-before-seen data started stalling (or worsening) compared to their performance on the training data - which always improves as training progresses. <br>
In machine learning, the goal is to achieve models that **generalize** - that perform well on never-before-seen data - and overfitting is the central obstacle. You can only control that which you can observe, so it's crucial to be able to reliably measure the generalization power of your model. The following sections look at strategies for mitigating overfitting and maximizing generalization. <span class="mark">In this section, we'll focus on how to measure generalization: how to evaluate machine-learning models.</span>

## 4.2.1 Training, validation, and test sets

Evaluating a model always boils down to splitting the available data into three sets:
- training data
- validation data
- test data

You train on the training data and evaluate your model on the validation data. Once your model is ready for prime time, you test it one final time on the test data. <br>
You may ask, why not have two sets: a training set and a test set? You'd train on the training data and evaluate on the test data. Much simpler!
The reason is that developing a model always involves **tuning its configuration**: for example, choosing the number of layers or the size of the layers (called the **hyperparameters** of the model, to distinguish them from the **parameters**, which are the network's weights). You do this tuning by using the performance of the model on the validation data as a feedback signal. In essence, this tuning is a form of **learning**: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in **overfitting to the validation set**, even though your model is never directly trained on it.
Central to this phenomenon is the notion of **information leaks**. Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable for evaluating the model. But if you repeat this many times - running one experiment, evaluating on the validation set, and modifying your model as a result - then you'll leak an increasingly significant amount of information about the validation set into the model.<br>
At the end of the day, you'll end up with a model that performs artificially well on the validation data, because that's what you optimized it for. **You care about performance on completely new data**, not the validation data, <span class="mark">so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset.</span> Your model shouldn't have had access to any information about the test set, even indirectly.
If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed. <br>
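
The leak described above can be reproduced with a small simulation. The sketch below is not part of the original chapter and rests on deliberately artificial assumptions: the labels are coin flips, so no model can genuinely beat 50% accuracy, and "tuning" is reduced to picking the best of many random guessers on the validation set. The names `n_candidates`, `val_labels`, and `test_labels` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_val, n_test = 100, 100
n_candidates = 1000  # hypothetical number of configurations tried

# Labels are coin flips: no predictor can genuinely do better than 50%.
val_labels = rng.integers(0, 2, size=n_val)
test_labels = rng.integers(0, 2, size=n_test)

# Each "configuration" is just a random guesser with fixed predictions.
val_preds = rng.integers(0, 2, size=(n_candidates, n_val))
test_preds = rng.integers(0, 2, size=(n_candidates, n_test))

# Accuracy of every configuration on each set.
val_acc = (val_preds == val_labels).mean(axis=1)
test_acc = (test_preds == test_labels).mean(axis=1)

# "Tuning": keep whichever configuration scores best on the validation set.
best = int(np.argmax(val_acc))
best_val_acc = float(val_acc[best])
best_test_acc = float(test_acc[best])

print(f"validation accuracy of selected configuration: {best_val_acc:.2f}")
print(f"test accuracy of the same configuration:       {best_test_acc:.2f}")
```

Under these assumptions the selected configuration should score clearly above 50% on the validation set purely because it was selected there, while its test accuracy is expected to stay close to chance. That gap is the information leak at work.
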
**Splitting your data into training, validation, and test sets** may seem straightforward, but there are a few advanced ways to do it that can come in handy when little data is available. Let's review three classic evaluation recipes: simple hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.

### SIMPLE HOLD-OUT VALIDATION
Set apart some fraction of your data as your test set. Train on the remaining data, and evaluate on the test set. As you saw in the previous sections, in order to prevent information leaks, you shouldn't tune your model based on the test set, and therefore you should also reserve a validation set.<br>
Schematically, hold-out validation looks like figure 4.1. The following listing shows a simple implementation.
![4.1](nb_images/Figure 4.1.JPEG)


In [2]:
import numpy as np
import tensorflow as tf
import keras

### Listing 4.1 Hold-out validation

In [None]:
# Schematic listing: data, test_data, and get_model() are placeholders,
# and train()/evaluate() stand in for a real training API.
num_validation_samples = 10000
# Shuffling the data is usually appropriate.
np.random.shuffle(data)
# Defines the validation set
validation_data = data[:num_validation_samples]
data = data[num_validation_samples:]

# Defines the training set
training_data = data[:]

# Trains a model on the training data, and evaluates it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model,
# retrain it, evaluate it, tune it again...

# Once you've tuned your hyperparameters,
# it's common to train your final model from scratch on all non-test data available.
# (test_data is assumed to have been set aside before this split.)
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
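
Because Listing 4.1 is schematic, a concrete variant may help. The following is a minimal sketch, not the chapter's own code: it assumes synthetic regression data and swaps the neural network for a closed-form ridge regression so that it runs with NumPy alone. The helpers `fit_ridge` and `mse`, the `candidate_alphas` grid, and the split sizes are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (an assumption made for illustration only).
num_samples, num_features = 1000, 5
true_weights = rng.normal(size=num_features)
inputs = rng.normal(size=(num_samples, num_features))
targets = inputs @ true_weights + 0.1 * rng.normal(size=num_samples)

# Shuffle inputs and targets together, then carve out test and validation sets.
order = rng.permutation(num_samples)
inputs, targets = inputs[order], targets[order]
num_test, num_val = 200, 200
x_test, y_test = inputs[:num_test], targets[:num_test]
x_val = inputs[num_test:num_test + num_val]
y_val = targets[num_test:num_test + num_val]
x_train, y_train = inputs[num_test + num_val:], targets[num_test + num_val:]


def fit_ridge(x, y, alpha):
    """Closed-form ridge regression; alpha plays the role of a hyperparameter."""
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def mse(weights, x, y):
    """Mean squared error of a linear model on (x, y)."""
    return float(np.mean((x @ weights - y) ** 2))


# Tune the hyperparameter using the validation set only.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = {
    alpha: mse(fit_ridge(x_train, y_train, alpha), x_val, y_val)
    for alpha in candidate_alphas
}
best_alpha = min(val_scores, key=val_scores.get)

# Retrain from scratch on all non-test data, then use the test set exactly once.
final_weights = fit_ridge(
    np.concatenate([x_train, x_val]),
    np.concatenate([y_train, y_val]),
    best_alpha,
)
test_score = mse(final_weights, x_test, y_test)
print(f"best alpha: {best_alpha}, test MSE: {test_score:.4f}")
```

The test set is split off first and read only in the last step, so nothing about it can influence the choice of `best_alpha`.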

- Shuffling the data is usually appropriate.

In [7]:
data = np.arange(0, 80000)
num_validation_samples = 10000
np.random.shuffle(data)
print(data) 

[78271 63528 57890 ... 37304 46322 75090]


This is the simplest evaluation protocol, and it suffers from one flaw: **if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand**. This is easy to recognize: if different random shuffling rounds of the data before splitting end up yielding very different measures of model performance, then you're having this issue. K-fold validation and iterated K-fold validation are two ways to address this, as discussed next.
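
The diagnostic described above can be illustrated with a simulation. This sketch is an addition with simplifying assumptions: the model is held fixed rather than retrained on each split, and `correct` is a hypothetical array recording whether that model classifies each sample correctly, with an assumed accuracy of about 80%. Only the validation-set size changes between the two runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample outcomes: 1.0 if a fixed model classifies the sample
# correctly, 0.0 otherwise. The 80% accuracy is an assumption for illustration.
correct = (rng.random(10_000) < 0.8).astype(float)


def holdout_scores(num_validation_samples, num_rounds=200):
    """Validation accuracy over repeated shuffle-and-split rounds."""
    scores = []
    for _ in range(num_rounds):
        shuffled = rng.permutation(correct)  # shuffled copy; original untouched
        scores.append(shuffled[:num_validation_samples].mean())
    return np.array(scores)


small = holdout_scores(num_validation_samples=20)
large = holdout_scores(num_validation_samples=2000)

print(f"20 validation samples:   mean={small.mean():.3f}, std={small.std():.3f}")
print(f"2000 validation samples: mean={large.mean():.3f}, std={large.std():.3f}")
```

A much larger standard deviation for the 20-sample split than for the 2000-sample split is the symptom to look for: the score then depends more on which samples landed in the validation set than on the model itself.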

### K-FOLD VALIDATION