# Cross-validation for R-squared


Cross-validation is a vital approach to evaluating a model. It makes the most of the available data: across the folds, every observation is used for both training and testing.

---
## Cross-validation motivation
- Model performance depends on the way we split up the data
- A score from a single split may not be representative of the model's ability to generalize to unseen data
- Solution: Cross-validation!

---
## Cross-validation and model performance
- 5 folds = 5-fold CV
- 10 folds = 10-fold CV
- k folds = k-fold CV
- More folds = More computationally expensive
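
To make the fold idea concrete, here is a minimal sketch on a toy array of 10 samples. The data and variable names are made up for illustration. With 5 folds, each sample should land in the held-out fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    # Each fold trains on 8 samples and tests on the remaining 2
    print(f"Fold {fold}: train size={len(train_idx)}, test indices={test_idx}")
    held_out.extend(test_idx)
```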

---
## Cross-validation in scikit-learn
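```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# 6 folds; shuffle the rows before splitting, with a fixed seed for reproducibility
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()

# X and y are the feature and target arrays; for a regressor the default score is R-squared
cv_results = cross_val_score(reg, X, y, cv=kf)
```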
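
For a version that runs on its own, the sketch below swaps in synthetic data from `make_regression` as a stand-in for `X` and `y`. The dataset and the `_demo` names are assumptions for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumed stand-in for the real X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

kf_demo = KFold(n_splits=6, shuffle=True, random_state=42)

# One R-squared value per fold, each computed on that fold's held-out rows
scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo)

print(scores_demo)
print("Mean R-squared:", np.mean(scores_demo))
```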
---
## Evaluating cross-validation performance
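```python
import numpy as np

print(cv_results)                                # R-squared for each fold
print(np.mean(cv_results), np.std(cv_results))   # mean and standard deviation
print(np.quantile(cv_results, [0.025, 0.975]))   # 2.5% and 97.5% quantiles
```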

## 🔍 What is Penalization in Linear Regression?

**Penalization** refers to adding a **penalty term** to the loss function in linear regression to discourage overly complex models (e.g., large coefficients). This helps prevent **overfitting**.

---

### 🎯 Why Penalize?

In Ordinary Least Squares (OLS), the goal is to minimize the residual sum of squares (**RSS**):

$$
\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

But without constraints, models can overfit, especially in high-dimensional or noisy data. Penalization adds a cost to large coefficients, helping improve generalization.

---

### 📚 Types of Penalized Regression

#### 1. **Ridge Regression (L2 Penalty)**

Adds the squared magnitude of coefficients:

$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

- Shrinks coefficients smoothly
- Keeps all features, but reduces their influence
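
The shrinkage effect can be checked numerically. The sketch below assumes no intercept and uses the closed-form ridge solution $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$ on synthetic data. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_r = rng.normal(size=(50, 3))
true_beta = np.array([3.0, -2.0, 0.5])
y_r = X_r @ true_beta + rng.normal(scale=0.5, size=50)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_closed_form(X_r, y_r, lam)
    norms.append(np.linalg.norm(beta))
    # The coefficient vector gets shorter as lam grows, but no entry hits exactly zero
    print(f"lambda={lam:>6}: beta={np.round(beta, 3)}, L2 norm={norms[-1]:.3f}")
```

With `lam=0.0` the function reduces to ordinary least squares, which gives a baseline for comparing the shrunken coefficients.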

#### 2. **Lasso Regression (L1 Penalty)**

Adds the absolute value of coefficients:

$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

- Encourages **sparsity** (some coefficients become exactly zero)
- Useful for **feature selection**
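
Sparsity is easy to see on synthetic data where only some features matter. In this sketch only the first two of six features drive the target, by construction. The data, the `alpha` value, and the variable names are illustrative assumptions. Note that scikit-learn calls the penalty strength `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_l = rng.normal(size=(100, 6))
# Only features 0 and 1 influence the target (by construction)
y_l = 4.0 * X_l[:, 0] - 3.0 * X_l[:, 1] + rng.normal(scale=0.5, size=100)

lasso_demo = Lasso(alpha=0.5).fit(X_l, y_l)

# Indices of features whose coefficients were not shrunk to exactly zero
selected = np.flatnonzero(lasso_demo.coef_)

print(np.round(lasso_demo.coef_, 3))
print("Features kept:", selected)
```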

#### 3. **Elastic Net**

Combines L1 and L2 penalties:

$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
$$

- Balances sparsity and shrinkage
- Good when features are correlated
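
In scikit-learn, `ElasticNet` expresses the two penalties through `alpha` (overall strength) and `l1_ratio` (the share given to the L1 term) rather than separate $\lambda_1$ and $\lambda_2$ values. A minimal sketch on synthetic data with two nearly duplicate features follows. The data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
base = rng.normal(size=200)
# Two strongly correlated features plus one unrelated feature
X_e = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=200),
    rng.normal(size=200),
])
y_e = 2.0 * base + rng.normal(scale=0.3, size=200)

# l1_ratio=1.0 would be pure lasso; l1_ratio=0.5 weights the L1 and L2 terms equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_e, y_e)
print(np.round(enet.coef_, 3))
```

Comparing the printed coefficients with those from `l1_ratio=1.0` is a quick way to see how the L2 term changes the treatment of the correlated pair.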

---

### ⚙️ Intuition

Penalization changes the optimization objective:

- You minimize both prediction error **and** model complexity.
- Larger values of **λ** (lambda) lead to **simpler models**.

---



## Regularized Regression

## Why Regularize?
- Recall: Linear regression minimizes a loss function
- It chooses a coefficient, a, for each feature variable, plus an intercept, b
- Large coefficients can lead to overfitting
- Regularization: Penalize large coefficients

---

### Ridge Regression
- Loss function = OLS loss function +
   $$
            \alpha \cdot \sum_{i=1}^{n} a_i^2
   $$
- Ridge penalizes large positive or negative coefficients
- $\alpha$: parameter we need to choose
- Picking $\alpha$ is similar to picking k in KNN
- Hyperparameter: a value we set before fitting that controls how the model parameters are learned
- $\alpha$ controls model complexity
   - $\alpha$ = 0: same as OLS (can lead to overfitting)
   - Very high $\alpha$: can lead to underfitting

## Ridge regression in scikit-learn
```python
from sklearn.linear_model import Ridge

scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    # score() returns R-squared on the held-out test set
    scores.append(ridge.score(X_test, y_test))

print(scores)
```

## Lasso regression (L1 Penalty)
- Loss function = OLS loss function +
   $$
            \alpha \cdot \sum_{i=1}^{n} |a_i|
   $$
---
### Lasso regression for feature selection
- Lasso can select important features of a dataset
- Shrinks the coefficients of less important features to zero
- Features not shrunk to zero are selected by lasso

---

### Lasso regression in scikit-learn

```python
from sklearn.linear_model import Lasso

scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    lasso_pred = lasso.predict(X_test)
    # score() returns R-squared on the held-out test set
    scores.append(lasso.score(X_test, y_test))

print(scores)
```
---

### Lasso feature selection
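```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# diabetes_df is assumed to be a pandas DataFrame that has already been loaded
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# Bars at zero mark features that lasso dropped
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()
```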

## Hyperparameter tuning
+ Ridge/lasso regression: Choosing alpha
+ KNN: Choosing n_neighbors
+ Hyperparameters: Parameters we specify before fitting the model
    - Like alpha and n_neighbors
 
## Choosing the correct hyperparameters
1. Try lots of different hyperparameter values
2. Fit all of them separately
3. See how well they perform
4. Choose the best-performing values. 
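
The four steps can be sketched as a manual loop, scoring each candidate with cross-validation so the test set stays untouched. The synthetic data, candidate list, and variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X_h, y_h = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)
kf_h = KFold(n_splits=5, shuffle=True, random_state=42)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]   # step 1: values to try
mean_scores = {}
for alpha in candidate_alphas:
    # steps 2 and 3: fit each candidate on every fold and record the mean R-squared
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_h, y_h, cv=kf_h)
    mean_scores[alpha] = fold_scores.mean()

best_alpha = max(mean_scores, key=mean_scores.get)   # step 4: keep the best
print(mean_scores)
print("Best alpha:", best_alpha)
```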


## GridSearchCV in scikit-learn

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# 10 evenly spaced alpha values between 0.0001 and 1, crossed with 2 solvers
param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
```
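
By default (`refit=True`), `GridSearchCV` refits the best setting on the whole training set, so the fitted search object can be scored directly on held-out test data. A self-contained sketch follows, using synthetic data as a stand-in for the real train and test splits. The `_g`, `_tr`, and `_te` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X_g, y_g = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_g, y_g, test_size=0.3, random_state=42)

grid = GridSearchCV(
    Ridge(),
    {"alpha": np.linspace(0.0001, 1, 10)},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_tr, y_tr)

# score() uses the refitted best estimator; for Ridge this is R-squared
test_r2 = grid.score(X_te, y_te)
print(grid.best_params_, grid.best_score_, test_r2)
```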

## Limitations and an alternative approach
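+ 3-fold cross-validation, 1 hyperparameter, 10 total values = 30 fits
+ 10-fold cross-validation, 3 hyperparameters, 30 total values = 900 fits
+ The number of fits multiplies with every added fold, hyperparameter, and value, so an exhaustive grid search quickly becomes expensive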
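
The cost of a grid search can be counted before running it: the number of cross-validation fits is the number of folds times the number of hyperparameter combinations. The sketch below uses `ParameterGrid` to count combinations for an illustrative grid of 10 alpha values and 2 solvers. The grid itself is an assumption chosen for the example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_folds = 5
grid_spec = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}

n_candidates = len(ParameterGrid(grid_spec))   # 10 alphas x 2 solvers = 20
n_fits = n_folds * n_candidates                # 5 folds x 20 candidates = 100
print(n_candidates, n_fits)                    # prints: 20 100
```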

## RandomizedSearchCV
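```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# 10 evenly spaced alpha values between 0.0001 and 1, crossed with 2 solvers
param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}
ridge = Ridge()
# n_iter=2 samples only 2 of the 20 possible combinations
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
```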
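
`RandomizedSearchCV` also accepts distributions in place of fixed lists, which lets it sample continuous ranges. The sketch below draws `alpha` from SciPy's `loguniform`. The synthetic data, the range `1e-4` to `1e2`, and `n_iter=10` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

X_s, y_s = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# alpha is drawn from a log-uniform distribution instead of a fixed list
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

search = RandomizedSearchCV(
    Ridge(),
    param_distributions,
    n_iter=10,            # number of sampled settings
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,      # makes the sampled alphas reproducible
)
search.fit(X_s, y_s)
print(search.best_params_, search.best_score_)
```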

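```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Assumption: logreg is not defined in these notes. The "liblinear" solver is
# used here because it supports both the "l1" and "l2" penalties.
logreg = LogisticRegression(solver="liblinear")

# Create the parameter space
params = {"penalty": ["l1", "l2"],
          "tol": np.linspace(0.0001, 1.0, 50),
          "C": np.linspace(0.1, 1.0, 50),
          "class_weight": ["balanced", {0: 0.8, 1: 0.2}]}

# Instantiate the RandomizedSearchCV object (kf is the KFold object defined earlier)
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))
```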

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold