### Selecting the best model

If our model is underperforming, how should we move forward? There are several possible answers:

- Use a more complicated/more flexible model
- Use a less complicated/less flexible model
- Gather more training samples
- Gather more data to add features to each sample

The answer to this question is often counter-intuitive.

In particular, sometimes using a more complicated model will give worse results, and adding more training samples may not improve your results! 

The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.

### The bias-variance trade-off

Fundamentally, the question of "the best model" is about finding a sweet spot in the trade-off between bias and variance.

Consider the following figure, which presents two regression fits to the same dataset:

![image.png](attachment:image.png)

The model on the left attempts to find a straight-line fit through the data. Because the data are intrinsically more complicated than a straight line, the straight-line model will never be able to describe this dataset well. Such a model is said to `underfit` the data: that is, the model has `high bias`.

The model on the right attempts to fit a high-order polynomial through the data. Such a model is said to `overfit` the data: it has so much flexibility that it ends up accounting for random errors as well as the underlying data distribution. In other words, the model has `high variance`.

To look at this in another light, consider what happens if we use these two models to predict the y-value for some new data. In the following diagrams, the red/lighter points indicate data that is omitted from the training set:

![image.png](attachment:image.png)



From the scores associated with these two models, we can make an observation that holds more generally:

- For `high-bias` models, the performance of the model on the validation set is similar to the performance on the training set.
- For `high-variance` models, the performance of the model on the validation set is far worse than the performance on the training set.
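
As a rough illustration of these two observations, the sketch below fits a straight line and a high-order polynomial to the same data and scores each fit on the training points and on held-out points. Everything specific here is an assumption made for the example: the synthetic noisy sine data, the degrees 1 and 12, the even train/validation split, and the use of $R^2$ (via the hypothetical helper `r2_score`) as the score. It is not the data or code behind the figures.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 (1.0 is a perfect fit)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)

# Assumed synthetic data: a smooth nonlinear signal plus random noise.
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Hold out half of the points as a validation set.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

# Degree 1 stands in for the high-bias model, degree 12 for the high-variance one.
scores = {}
for degree in (1, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_score = r2_score(y_train, model(x_train))
    val_score = r2_score(y_val, model(x_val))
    scores[degree] = (train_score, val_score)
    print(f"degree {degree:2d}: train R^2 = {train_score:.2f}, "
          f"validation R^2 = {val_score:.2f}")
```

On data like this, one would expect the straight line to score similarly (and poorly) on both sets, and the degree-12 polynomial to score well on the training set and much worse on the validation set. The exact numbers depend on the random seed.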

If we imagine that we have some ability to tune the model complexity, we would expect the training score and validation score to behave as illustrated in the following figure:

![image.png](attachment:image.png)

The diagram shown here is often called a `validation curve`, and we see the following essential features:

- The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.
- For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.
- For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.
- For some intermediate value, the validation curve has a maximum. This level of complexity indicates a suitable trade-off between bias and variance.
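
A validation curve of this kind can be computed by hand, as in the following sketch. It sweeps the polynomial degree (used here as the measure of model complexity), records the training and validation scores at each degree, and picks the degree with the highest validation score. The synthetic data, the degree range 0 to 12, the single train/validation split, and the helper `r2_score` are assumptions for illustration only.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 (1.0 is a perfect fit)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(1)

# Assumed synthetic data: noisy samples of a sine wave.
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_train, y_train = x[:100], y[:100]
x_val, y_val = x[100:], y[100:]

# Sweep model complexity and record both scores at each setting.
degrees = list(range(0, 13))
train_scores, val_scores = [], []
for degree in degrees:
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_scores.append(r2_score(y_train, model(x_train)))
    val_scores.append(r2_score(y_val, model(x_val)))

# The peak of the validation curve marks the bias-variance sweet spot.
best_degree = degrees[int(np.argmax(val_scores))]
print("degree with the highest validation score:", best_degree)
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training score can only rise with degree, while the validation score rises and then levels off or falls. In practice a single split is noisy; scikit-learn's `validation_curve` in `sklearn.model_selection` performs the same sweep with cross-validation.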

### Learning curve

A plot of the training/validation score with respect to the size of the training set is known as a `learning curve`.

The general behavior we would expect from a learning curve is this:

- A model of a given complexity will overfit a small dataset: this means the training score will be relatively high, while the validation score will be relatively low.
- A model of a given complexity will underfit a large dataset: as more samples are added, the training score will decrease, but the validation score will increase.
- A model will never, except by chance, give a better score to the validation set than the training set: this means the curves should keep getting closer together but never cross.

With these features in mind, we would expect a learning curve to look qualitatively like that shown in the following figure:



![image.png](attachment:image.png)

The notable feature of the learning curve is the `convergence` to a particular score as the number of training samples grows. 

In particular, once you have enough points that a particular model has converged, adding more training data will not help you!

The only way to increase model performance in this case is to use another (often more complex) model.
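
A learning curve can be sketched in the same spirit: hold the model complexity fixed and refit on progressively larger training sets. The example below is a minimal version under stated assumptions: synthetic noisy sine data, a fixed polynomial degree of 5, an arbitrary list of training sizes, a separate fixed validation set, and the hypothetical helper `r2_score`.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 (1.0 is a perfect fit)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(2)


def make_data(n):
    """Draw n noisy samples of a sine wave (assumed toy data)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y


x_pool, y_pool = make_data(500)  # pool that training subsets are drawn from
x_val, y_val = make_data(500)    # fixed validation set

degree = 5  # model complexity is held fixed
train_sizes = [10, 20, 50, 100, 200, 500]
train_scores, val_scores = [], []
for n in train_sizes:
    # Refit the same model on the first n points of the pool.
    model = Polynomial.fit(x_pool[:n], y_pool[:n], deg=degree)
    train_scores.append(r2_score(y_pool[:n], model(x_pool[:n])))
    val_scores.append(r2_score(y_val, model(x_val)))

for n, tr, va in zip(train_sizes, train_scores, val_scores):
    print(f"n = {n:3d}: train R^2 = {tr:.2f}, validation R^2 = {va:.2f}")
```

If the two columns of scores have stopped changing by the largest training sizes, the model has converged in the sense described above, and further data is unlikely to help. scikit-learn's `learning_curve` in `sklearn.model_selection` computes the same quantities with cross-validation.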



