# **16 Cross Validation and Regularization**

The balance of complexity: 
* A model that is too complex can lead to overfitting 
* A model that is too simple can lead to underfitting 

How do we control model complexity to avoid under- and overfitting?

We can make use of **cross-validation** in order to assess *when* our model begins to overfit 

We can also apply **regularization** in order to adjust the complexity of our models ourselves

### **16.1 Cross Validation and Regularization**

#### **16.1.1 Training, Test, and Validation Sets**

* We know that *increasing* model complexity *decreases* our model's training error, but *increases* its variance 

##### **16.1.1.1 Test Sets**

* In order to assess our model on "unseen" data, we can make use of a **test set**
* The datapoints in this set will *not* be used to fit the model 

* We then use the remaining portion of our data - which we call the **training set** - to fit the model with OLS, gradient descent, or some other method 
* After we construct our model on the training data, we assess its performance on the test set (this is indicative of how well it can make predictions on *unseen* data)

* We can only use the test set **once**: to compute the performance of the model after all fine-tuning has been completed 

This process of dividing our data is known as a **train-test split**
* Usually we set aside $10\%$ or $20\%$ of the data as our test set 


<img src="https://ds100.org/course-notes/cv_regularization/images/train-test-split.png" alt="Image Alt Text" width="500" height="170">
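
The split described above can be sketched in a few lines of NumPy. This is only an illustration: the helper `train_test_split_indices`, the toy data, and the $20\%$ test fraction are assumptions made for the example, not part of the course code.

```python
import numpy as np

def train_test_split_indices(n, test_frac=0.2, seed=0):
    """Shuffle the row indices 0..n-1 and split them into (train, test)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)
    n_test = int(round(n * test_frac))
    # The first n_test shuffled indices become the test set
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: 100 rows, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

train_idx, test_idx = train_test_split_indices(len(X), test_frac=0.2)
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]  # set aside until the very end
```

Shuffling before splitting matters when the rows are stored in some meaningful order; otherwise the test set may not resemble the training set.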

##### **16.1.1.2 Validation Sets**

What if we were dissatisfied with our test set's performance?
* We can't go back and adjust our model, as that would no longer be a true representation of the model's performance on *unseen* data 

The solution? 
* Introduce a **validation set**
* A validation set is a random portion of the *training set* that is set aside for assessing model performance while the model is *still being developed* 

1) Perform a train-test split 
2) Set the test aside; we will not touch it until the end of the model design process 
3) Set aside a portion of the training to be used for validation 
4) Fit model parameters to the datapoints in the remaining training set 
5) Assess model performance on the validation set, and adjust the model as needed. Re-fit the model to the remaining portion of the training set, then reevaluate on the validation set until you are satisfied 
6) After *all* the model development is complete, assess the model's performance on the test set 

The process of creating a validation set is called a **validation split**

<img src="https://ds100.org/course-notes/cv_regularization/images/validation-split.png" alt="Image Alt Text" width="600" height="170">
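
The six steps above might look like the following in NumPy. This is a sketch under assumptions: the toy data, the split sizes, and the candidate models (using only the first $k$ features) are hypothetical choices made for illustration.

```python
import numpy as np

def mse(Y, Y_hat):
    """Mean squared error between observations and predictions."""
    return float(np.mean((Y - Y_hat) ** 2))

def fit_ols(X, Y):
    """Least-squares estimate of theta."""
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta

# Hypothetical toy data: 200 rows, 4 features
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Steps 1-2: train-test split; the test rows are not touched until step 6
idx = rng.permutation(n)
test_idx, train_idx = idx[:40], idx[40:]

# Step 3: carve a validation set out of the training set
val_idx, fit_idx = train_idx[:40], train_idx[40:]

# Steps 4-5: fit each candidate model and compare validation errors.
# Candidate k uses only the first k features (an assumption for this demo).
val_errors = {}
for k in range(1, 5):
    theta = fit_ols(X[fit_idx, :k], Y[fit_idx])
    val_errors[k] = mse(Y[val_idx], X[val_idx, :k] @ theta)
best_k = min(val_errors, key=val_errors.get)

# Step 6: only now evaluate the chosen model on the test set, once
theta = fit_ols(X[fit_idx, :best_k], Y[fit_idx])
test_error = mse(Y[test_idx], X[test_idx, :best_k] @ theta)
```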

As we increase model complexity, validation error typically decreases at first, *then increases* once the model begins to overfit the training data

<img src="https://ds100.org/course-notes/cv_regularization/images/training_validation_curve.png" alt="Image Alt Text" width="600" height="460">
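
One way to see this pattern is to fit polynomials of increasing degree and track both errors. The sketch below uses hypothetical noisy sine data and treats polynomial degree as the measure of complexity; both are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)

# x was drawn at random, so taking the first 40 rows is a random split
x_fit, y_fit = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

degrees = range(1, 10)
train_errors, val_errors = [], []
for d in degrees:
    # Fit a degree-d polynomial to the mini training set only
    coeffs = np.polyfit(x_fit, y_fit, deg=d)
    train_errors.append(float(np.mean((np.polyval(coeffs, x_fit) - y_fit) ** 2)))
    val_errors.append(float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)))

# Pick the complexity with the lowest validation error
best_degree = degrees[int(np.argmin(val_errors))]
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Validation error has no such guarantee, which is why it is the one we use to choose the degree.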

### **Cross Validation**
* Cross validation subdivides the training set into two groups: one group used to fit the model (a mini training set), and the other used to validate it (a validation set)

* $k$-fold cross-validation means that we perform this splitting step $k$ times. Each validation set contains $\frac{1}{k}$ of the total training data 

<img src="https://ds100.org/course-notes/cv_regularization/images/model_selection.png" alt="Image Alt Text" width="600" height="240">
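
A minimal sketch of $k$-fold cross-validation for a linear model follows. The function name `k_fold_cv_mse`, the toy data, and the choice to average the $k$ validation errors into one score are assumptions for this example.

```python
import numpy as np

def k_fold_cv_mse(X, Y, k=5, seed=0):
    """Average validation MSE of OLS over k folds; also returns the folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into k roughly equal folds
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set this round
        fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(X[fit_idx], Y[fit_idx], rcond=None)
        errors.append(np.mean((Y[val_idx] - X[val_idx] @ theta) ** 2))
    return float(np.mean(errors)), folds

# Hypothetical training data: 100 rows, 3 features
rng = np.random.default_rng(7)
X_train = rng.normal(size=(100, 3))
Y_train = X_train @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=100)

cv_error, folds = k_fold_cv_mse(X_train, Y_train, k=5)
```

Every training row lands in the validation set exactly once, so the resulting score depends less on one lucky or unlucky split than a single validation split does.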

### **Regularization**

* One way of limiting complexity is to require that our model parameters not be too large 
* In regularization, we penalize our model for picking large theta values by modifying our loss function 

$$\frac{1}{n} \sum_{i = 1}^{n} \text{Loss}(y_i, \hat{y}_i) + \lambda R(\theta)$$

* There is now an extra "cost" to choosing large values of theta. We won't choose the best possible theta vector (higher model bias), but we'll avoid overfitting (lower model variance)

* What does $R(\theta)$ look like?
* In Data 100, we'll mostly consider two types of regularized regression 

**LASSO (L1) Regression:**

$$ \frac{1}{n} \lVert Y - X\theta \rVert_{2}^{2} + \lambda \cdot \sum_{j = 1}^{d} |\theta_j|$$

* Encourages sparsity (drives some values of $\theta$ to 0)
* No closed-form solution for the optimal $\theta$
* The penalty term added to the loss function is the sum of the absolute values of the coefficients
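
Because there is no closed form, the LASSO objective has to be minimized numerically. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one of several possible solvers and not necessarily the one used in the course. The function names, toy data, and $\lambda = 0.5$ are assumptions for illustration.

```python
import numpy as np

def lasso_objective(theta, X, Y, lam):
    """(1/n) * ||Y - X theta||_2^2 + lam * sum_j |theta_j|"""
    n = len(Y)
    return float(np.sum((Y - X @ theta) ** 2) / n + lam * np.sum(np.abs(theta)))

def soft_threshold(z, a):
    """Shrink each entry of z toward 0 by a; entries within a of 0 become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize the LASSO objective with proximal gradient descent."""
    n, d = X.shape
    # Step size 1/L, where L = (2/n) * sigma_max(X)^2 bounds the curvature
    step = n / (2 * np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = (2 / n) * X.T @ (X @ theta - Y)  # gradient of the squared-error term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Hypothetical toy data: only features 0 and 3 actually matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

theta_hat = lasso_ista(X, Y, lam=0.5)
```

The soft-thresholding step is what produces exact zeros: any coefficient whose update lands close enough to 0 is set to 0 rather than to a small nonzero value.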

**Ridge (L2) Regression:**

$$ \frac{1}{n} \lVert Y - X\theta \rVert_{2}^{2} + \lambda \cdot \sum_{j = 1}^{d} \theta_j^2$$

* "Robust" - tends to spread theta weights over many features 
* The optimal $\theta$ has a closed-form solution: 
    * $\hat{\theta} = (X^T X + n \lambda I)^{-1} X^TY$
* The penalty term added to the loss function is the sum of the squares of the coefficients
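
The closed-form ridge solution translates directly into NumPy. In this sketch, the function name `ridge_closed_form`, the toy data, and $\lambda = 1$ are assumptions for illustration; `np.linalg.solve` is used instead of forming the inverse explicitly, which gives the same answer with better numerical behavior.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + n * lam * I) theta = X^T Y for theta."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Hypothetical toy data: 150 rows, 4 features
rng = np.random.default_rng(5)
n, d = 150, 4
X = rng.normal(size=(n, d))
Y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=n)

lam = 1.0
theta_ols = ridge_closed_form(X, Y, lam=0.0)   # lam = 0 recovers OLS
theta_ridge = ridge_closed_form(X, Y, lam=lam)
```

Setting the gradient of the ridge objective, $-\frac{2}{n} X^T (Y - X\theta) + 2\lambda\theta$, to zero gives exactly this linear system, which is where the $n\lambda$ factor comes from. Comparing `theta_ridge` with `theta_ols` shows the shrinkage: the ridge estimate has the smaller norm.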





