# Model Selection
## 1. Evaluation
Consider a model that is implemented correctly and trained on well-preprocessed data, yet produces unacceptably large errors on unseen data. The options available to the ML engineer include:
* getting more training data
* considering only a subset of the features
* adding more features: the present features might not be informative enough
* adding more complex features: polynomial ones (in the case of a linear model)
* using a larger / smaller ***regularization*** parameter

### 1.1 train test splitting
A basic method to evaluate a model is to first divide the data set provided into two subsets:
* training set: generally $70\%$ of the entire dataset
* testing set: generally $30\%$ of the entire dataset

We fit the model with the training set. Then, we evaluate its performance on the test set by computing the cost function.
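The split described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the synthetic data, the helper names `train_test_split` and `mse_cost`, and the choice of a squared-error cost are all assumptions made for the example.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle the rows, then split them into a training and a test part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def mse_cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data (an assumption for the example): y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
y = 2 + 3 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])  # prepend the bias column

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit on the training set only, then evaluate on both sets.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
j_train = mse_cost(theta, X_train, y_train)
j_test = mse_cost(theta, X_test, y_test)
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

Shuffling before splitting matters whenever the rows are stored in a meaningful order (for instance sorted by target value).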

### 1.2 train - cross validation - test splitting
Consider the scenario where several candidate models are possible, for example polynomial regressions of different degrees. In this case, the degree of the polynomial $d$ can be seen as an additional parameter (a hyperparameter). If $d$ is chosen based on the test set, the test error is no longer a fair estimate of the performance on unseen data, since $d$ was itself fit to that set. Therefore, a three-way split is introduced:
* training set: used to fit each of the polynomial models, generally $60\%$ of the entire dataset
* cross-validation set: used to choose the best degree, generally $20\%$
* test set: used to evaluate the performance of the final model (the one selected by the cross-validation results), generally $20\%$

Consequently, we can define three quantities: train error, cross-validation (CV) error and test error.
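A minimal sketch of this three-way split, assuming a one-dimensional synthetic dataset, a squared-error cost and an unregularized least-squares fit (the helper names `poly_features` and `cost` are hypothetical):

```python
import numpy as np

def poly_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for a 1-D input array x."""
    return np.vander(x, degree + 1, increasing=True)

def cost(theta, X, y):
    """Squared-error cost 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic quadratic data (an assumption for the example)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1 - 2 * x + 3 * x**2 + rng.normal(scale=0.1, size=200)

# 60 / 20 / 20 split on shuffled indices
idx = rng.permutation(len(x))
train, cv, test = idx[:120], idx[120:160], idx[160:]

thetas, cv_errors = {}, {}
for d in range(1, 7):
    # Fit each degree on the training set only
    theta, *_ = np.linalg.lstsq(poly_features(x[train], d), y[train], rcond=None)
    thetas[d] = theta
    # Compare the degrees on the cross-validation set
    cv_errors[d] = cost(theta, poly_features(x[cv], d), y[cv])

best_d = min(cv_errors, key=cv_errors.get)
# The test set is touched once, for the selected model only
j_test = cost(thetas[best_d], poly_features(x[test], best_d), y[test])
print("selected degree:", best_d, "J_test:", round(j_test, 4))
```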

### 1.3 Bias or Variance
Generally, poor performance can be traced back to either ***high bias*** or ***high variance***.
* high variance: the model overfits the data. The CV error is much higher than the train error, since the model fits the training set very closely but does not generalize.
* high bias: the model underfits the data. Both the CV and train errors are high, since the model cannot capture the underlying pattern and is expected to make poor predictions on unseen data.
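To make the two regimes concrete, the sketch below fits a degree-1 and a degree-10 polynomial to a small noisy sample of a sine curve and reports the train and CV errors of each. The data, the degrees and the helper `fit_and_score` are assumptions chosen for illustration only.

```python
import numpy as np

def fit_and_score(x_tr, y_tr, x_cv, y_cv, degree):
    """Least-squares polynomial fit; returns (J_train, J_cv)."""
    def design(x):
        return np.vander(x, degree + 1, increasing=True)

    def cost(theta, x, y):
        r = design(x) @ theta - y
        return float(r @ r) / (2 * len(y))

    theta, *_ = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)
    return cost(theta, x_tr, y_tr), cost(theta, x_cv, y_cv)

rng = np.random.default_rng(1)

def sample(m):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, assumed data)."""
    x = rng.uniform(0, 1, size=m)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=m)

x_tr, y_tr = sample(15)     # deliberately small training set
x_cv, y_cv = sample(200)

underfit = fit_and_score(x_tr, y_tr, x_cv, y_cv, degree=1)    # too simple
overfit = fit_and_score(x_tr, y_tr, x_cv, y_cv, degree=10)    # too flexible
print("degree 1  (J_train, J_cv):", underfit)
print("degree 10 (J_train, J_cv):", overfit)
```

With this setup, the straight line should show two comparably high errors, while the degree-10 fit should show a low train error and a noticeably higher CV error.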

## 2. Regularization
### 2.1 Cost functions
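Assume the model is associated with the cost function $J(\theta)$ and with a regularized cost function $J_{reg}(\theta) = J(\theta) + \lambda \sum_{j=1}^{n} \theta_j ^ 2$, where $\lambda$ is the regularization parameter and $n$ is the number of features (the bias term $\theta_0$ is not penalized). We introduce the following: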
* train cost function = $J_{train} (\theta)$
* CV cost function = $J_{cv}(\theta)$
* test cost function = $J_{test} (\theta)$

where each of these cost functions has the same formula as the general cost function $J(\theta)$, only iterating over the corresponding data set.
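As a small sketch, assuming a squared-error cost and a penalty weighted by a regularization parameter $\lambda$ (called `lam` below), the same `cost` function yields the train and CV costs simply by passing a different data set. The tiny arrays are made up for the example.

```python
import numpy as np

def cost(theta, X, y):
    """Unregularized squared-error cost over whichever set (X, y) is passed."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

def regularized_cost(theta, X, y, lam):
    """J(theta) + lam * sum_j theta_j^2, skipping the bias term theta_0."""
    return cost(theta, X, y) + lam * float(theta[1:] @ theta[1:])

theta = np.array([1.0, 2.0, 3.0])

# Made-up data: first column is the bias feature
X_train = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
y_train = np.array([0.0, 6.0])
X_cv = np.array([[1.0, 2.0, 0.0]])
y_cv = np.array([3.0])

j_train = cost(theta, X_train, y_train)   # same formula, training rows
j_cv = cost(theta, X_cv, y_cv)            # same formula, CV rows
j_reg = regularized_cost(theta, X_train, y_train, lam=0.5)
print(j_train, j_cv, j_reg)
```

In this sketch the penalty only appears in the cost being minimized during training; the reported train, CV and test costs leave it out so that they stay comparable across values of $\lambda$.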

### 2.2 General selection procedure
To choose the best ***model variant*** and ***regularization parameter***, the following procedure is useful: 
1. create a list of possible $\lambda$'s 
2. create the models with the different variants (in the previous example, polynomial with different degrees)
3. iterate through the possible values of $\lambda$ and, for each value, fit every model variant, obtaining (number of $\lambda$ values) $\times$ (number of variants) candidate models
4. for each candidate, compute the cross-validation error $J_{cv}(\theta)$
5. select the combination $(\lambda, \text{model variant})$ that minimizes the cross-validation error
6. test the final model by computing $J_{test} (\theta)$
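The six steps above can be sketched as a small grid search. The example assumes a squared-error cost $\frac{1}{2m}\lVert X\theta - y\rVert^2$ with penalty $\lambda \sum_{j \geq 1} \theta_j^2$, for which the minimizer has a closed form; the data, the candidate lists and the helper names (`design`, `cost`, `fit_ridge`) are illustrative assumptions.

```python
import numpy as np

def design(x, degree):
    """Polynomial features [1, x, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def cost(theta, X, y):
    """Unregularized squared-error cost 1/(2m) * ||X theta - y||^2."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

def fit_ridge(X, y, lam):
    """Minimize 1/(2m)*||X theta - y||^2 + lam * sum_{j>=1} theta_j^2.

    Setting the gradient to zero gives (X^T X + 2*m*lam*D) theta = X^T y,
    where D is the identity with a zero in the bias position.
    """
    m, n = X.shape
    D = np.eye(n)
    D[0, 0] = 0.0  # do not penalize the bias term
    return np.linalg.solve(X.T @ X + 2 * m * lam * D, X.T @ y)

# Synthetic data (assumed): noisy sin(3x), split 60 / 20 / 20
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)
idx = rng.permutation(200)
tr, cv, te = idx[:120], idx[120:160], idx[160:]

lambdas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]   # step 1
degrees = [1, 2, 3, 4, 5, 6]                    # step 2

results = {}
for lam in lambdas:                             # step 3
    for d in degrees:
        theta = fit_ridge(design(x[tr], d), y[tr], lam)
        j_cv = cost(theta, design(x[cv], d), y[cv])   # step 4
        results[(lam, d)] = (j_cv, theta)

best_lam, best_d = min(results, key=lambda k: results[k][0])   # step 5
best_theta = results[(best_lam, best_d)][1]
j_test = cost(best_theta, design(x[te], best_d), y[te])        # step 6
print("best lambda:", best_lam, "best degree:", best_d, "J_test:", round(j_test, 4))
```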

### 2.3 Learning Curves
#### 2.3.1 High bias
A model that underfits the training data does not capture all the relevant aspects of the problem. Consequently, additional training samples will not improve performance, simply because the model cannot take full advantage of the additional data.
With this in mind, one practical indicator of ***high bias*** is a pair of learning curves like the following:

![High bias Learning curve](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/generalities/learning_curve_high_bias.png?raw=true) 

A simple linear model is quite likely to underfit a complex problem with a significant number of features. When we plot $J_{cv}(\theta)$ and $J_{train} (\theta)$ as functions of the training set size $m$, we can see that beyond a certain size $J_{cv}(\theta)$ ceases to decrease and $J_{train} (\theta)$ ceases to increase, with both settling at a high error. Such behavior is a strong indicator of underfitting.
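The values behind such a learning curve can be computed with a short loop over training set sizes. The sketch below assumes a quadratic target fitted with linear features only, so the model underfits by construction; the data and the helper `learning_curve` are assumptions made for the illustration.

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost 1/(2m) * sum((X @ theta - y)^2)."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

def learning_curve(X_tr, y_tr, X_cv, y_cv, sizes):
    """For each m, fit on the first m training rows and record (J_train, J_cv)."""
    j_train, j_cv = [], []
    for m in sizes:
        theta, *_ = np.linalg.lstsq(X_tr[:m], y_tr[:m], rcond=None)
        j_train.append(cost(theta, X_tr[:m], y_tr[:m]))  # error on the m rows used
        j_cv.append(cost(theta, X_cv, y_cv))             # error on the full CV set
    return np.array(j_train), np.array(j_cv)

rng = np.random.default_rng(0)

def sample(m):
    """Quadratic target, but only a bias column and x as features."""
    x = rng.uniform(-1, 1, size=m)
    y = x**2 + rng.normal(scale=0.05, size=m)
    return np.column_stack([np.ones(m), x]), y

X_tr, y_tr = sample(400)
X_cv, y_cv = sample(400)
sizes = [5, 10, 20, 50, 100, 200, 400]
j_train, j_cv = learning_curve(X_tr, y_tr, X_cv, y_cv, sizes)

for m, a, b in zip(sizes, j_train, j_cv):
    print(f"m={m:4d}  J_train={a:.4f}  J_cv={b:.4f}")
```

For large $m$ the two printed columns should end up close to each other and well above the noise level, which is the plateau described above.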

#### 2.3.2 High Variance
The larger the dataset, the more complex a model must be to overfit it. Therefore, if the initial model has high variance, it may help to keep adding training samples. The following learning curves are generally obtained:

![High variance learning curve](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/generalities/learning_curve_high_variance.png?raw=true) 
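A rough numerical counterpart of these curves, assuming a deliberately flexible degree-9 polynomial and noisy synthetic data (both are assumptions for the illustration), is to track the gap $J_{cv}(\theta) - J_{train}(\theta)$ as the training set grows:

```python
import numpy as np

DEGREE = 9  # deliberately flexible model (assumed for the illustration)

def design(x):
    """Polynomial features [1, x, ..., x^DEGREE]."""
    return np.vander(x, DEGREE + 1, increasing=True)

def cost(theta, x, y):
    """Squared-error cost 1/(2m) * sum((design(x) @ theta - y)^2)."""
    r = design(x) @ theta - y
    return float(r @ r) / (2 * len(y))

rng = np.random.default_rng(0)

def sample(m):
    """Noisy samples of sin(3x) on [-1, 1] (synthetic data)."""
    x = rng.uniform(-1, 1, size=m)
    return x, np.sin(3 * x) + rng.normal(scale=0.3, size=m)

x_tr, y_tr = sample(1000)
x_cv, y_cv = sample(1000)

gaps = {}
for m in [15, 30, 100, 300, 1000]:
    # Fit on the first m training samples only
    theta, *_ = np.linalg.lstsq(design(x_tr[:m]), y_tr[:m], rcond=None)
    j_train = cost(theta, x_tr[:m], y_tr[:m])
    j_cv = cost(theta, x_cv, y_cv)
    gaps[m] = j_cv - j_train
    print(f"m={m:5d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}  gap={gaps[m]:.4f}")
```

With few samples the gap should be large; as $m$ grows it should shrink, which is why adding data is a reasonable remedy for high variance.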

In light of these results, we can group the possible remedies as follows:
1. fixing high bias (underfit)
    * add more complex features
    * decrease the regularization parameter $\lambda$
    * add more features
2. fixing high variance (overfit)
    * add more training samples
    * increase the regularization parameter $\lambda$
    * consider a subset of the features