## Content:

- **Regularization**
- **Regularization Code**

- **Hyperparameters**
  - Parameter vs Hyperparameter

  - Steps to choose hyperparameter
  - Plot between lambda($λ$) and adj.R-squared($R^2$)


- **Cross Validation:**
  - Definition and implementation

- **K-fold CV:**
   - Problems with cross-val
   - Definition and implementation




## **Regularization**

#### Which features are useful to have a perfectly fit model?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/112/original/Screenshot_2023-08-16_at_6.54.12_PM.png?1692192581_' width=800></center>




#### How to make $w_1, w_2 \neq 0 $ and $w_3, w_4 = 0$ ?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/113/original/Screenshot_2023-08-16_at_6.54.20_PM.png?1692192624' width=800></center>






Here  $d$ is the number of features

<br>

**Note:** This term $\sum_{j=1}^{d} w_j^2$ is called **regularization** and its used:
- So that Gradient Descent works in minimizing values of $w_j $ by making them $≈ 0$


#### Understanding the new Loss function



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/114/original/Screenshot_2023-08-16_at_6.54.27_PM.png?1692192643' width=800></center>





#### How to get that sweet spot between loss function and Regularization ?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/115/original/Screenshot_2023-08-16_at_6.54.34_PM.png?1692192678' width=800></center>






#### How does $\lambda$ creates that sweet spot between MSE and Regularization term ?

Ans: With a **right $\lambda$ value** :

1. There is **enough freedom to MSE** so that:
  - The **weights are  optimized** to reach the **lowest possible MSE value**
  - Which **does not lead to overfitting**

2. It also provide **enough freedom to Regularization term** so that:
 - The regularization term can make the weights of the model close to 0
 - Which **does not lead to underfitting**

<br>

**Note:** The term $w_j^2$ is called as L2/Ridge Regularization



## **Points to Remember**



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/117/original/Screenshot_2023-08-16_at_6.54.48_PM.png?1692192728' width=800></center>




## **L-1 Regularization**

#### What do you think, will $\sum_{j=1}^{d} |w_j|$ work ?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/120/original/Screenshot_2023-08-16_at_6.55.09_PM.png?1692192872' width=800></center>





<center><img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/123/original/Screenshot_2023-08-16_at_7.33.45_PM.png?1692194777 width=800></center>

## **Interesting property of L1 and L2 Reg**

#### When to use L1, L2 Regularization ?

<center><img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/124/original/Screenshot_2023-08-16_at_7.34.47_PM.png?1692194929 width=800></center>

#### Why does L1 create a sparse W and L2 does not ?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/121/original/Screenshot_2023-08-16_at_6.55.16_PM.png?1692192926' width=800></center>




Ans: To figure out why L1 creates a sparse matrix and L2 does not, lets look into weight updation:

- $w_j^{new} =w_j^{old} - \eta \frac{\partial L}{\partial W_j} $

<br>

As **MSE remains same, so the only change** is caused due to **derivative of $|w_j|$ and $w_j^2$**

<br>

#### What do you think, will be the derivative of  $|w_j|$ and $w_j^2$ ?

Ans: for $\frac{d|w_j|}{dw_j} =  [1,0,-1] $ while for $\frac{d|w_j^2|}{dw_j} = 2 \times w_j$

<br>

**observe**
$\frac{d|w_j|}{dw_j}$ is **independent of $w_j$**
- hence quickly reaches 0

<br>

while $\frac{dw_j^2}{dw_j}$ is **dependent on $w_j$**
- hence **when  $w_j$ large**, it **reaches close to 0 very fast**

- But as **$w_j$ approaches zero, the value becomes very small** --> causing **$w_j$ to remain close to 0 only**



#### If you are not sure which regularization to use, is there a way to combine both L1 and L2 ?



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/122/original/Screenshot_2023-08-16_at_6.59.22_PM.png?1692192957' width=800></center>




## **L1, L2 Regularization Code**

#### Using Sklearn diabetes data - https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset


<img src= https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/043/110/original/Screenshot_2023-08-16_at_6.03.47_PM.png?1692189250 width=800>

In [1]:
from sklearn import datasets

data = datasets.load_diabetes()
data

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

In [None]:
x = data["data"]
y = data["target"]

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# print(x_train.shape)
# print(x_test.shape)
# print(y_train.shape)
# print(y_test.shape)

In [None]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [None]:
deg = 15
poly = PolynomialFeatures(degree=deg)
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.fit_transform(x_test)

In [None]:
scaler = StandardScaler()
x_train_poly_scaled = scaler.fit_transform(x_train_poly)
x_test_poly_scaled = scaler.fit_transform(x_test_poly)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model.fit(x_train_poly_scaled, y_train)
output = model.predict(x_test_poly_scaled)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
train_output = model.predict(x_train_poly_scaled)
print("Training Error =", mean_squared_error(y_train, train_output))

test_output = model.predict(x_test_poly_scaled)
print("Testing Error =", mean_squared_error(y_test, test_output))

Training Error = 8.382508169707599e-24
Testing Error = 305864.7051985991


# Using L1 & L2

In [None]:
from sklearn.linear_model import Lasso, Ridge

In [None]:
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(x_train_poly_scaled, y_train)

  model = cd_fast.enet_coordinate_descent(


In [None]:
lasso_train_output = lasso_model.predict(x_train_poly_scaled)
lasso_test_output = lasso_model.predict(x_test_poly_scaled)
print("Training Error =", mean_squared_error(y_train, lasso_train_output))
print("Testing Error =", mean_squared_error(y_test, lasso_test_output))

Training Error = 16.871373882405926
Testing Error = 75564.49600989118


In [None]:
ridge_model = Ridge(alpha=1)
ridge_model.fit(x_train_poly_scaled, y_train)

In [None]:
ridge_train_output = ridge_model.predict(x_train_poly_scaled)
ridge_test_output = ridge_model.predict(x_test_poly_scaled)
print("Training Error =", mean_squared_error(y_train, ridge_train_output))
print("Testing Error =", mean_squared_error(y_test, ridge_test_output))

Training Error = 7.727441417848905
Testing Error = 234788.3445404214
