### Regularization

#### In general we can think of regularization as a way to reduce overfitting and varience,
 #### 1. This required some additonal bias
 #### 2. Requires a search for optimal penalty hyperparameter

#### There are 3 main types of regularization.
##### 1. L1 reg. or LASSO regression
##### 2. L2 reg. or Ridge regression
##### 3. Combinging L1 and L2 or Elastic net

#### Good example for ChatGPT

Alright, here’s **regularization** explained super simply:

---

### Imagine you're fitting a line to some data points (that’s a linear model).

You want the line to predict well **not just for the training data**, but also for **new, unseen data**.

Now sometimes, the model becomes too smart for its own good…

It **bends itself too much** to match the training data perfectly — that’s called **overfitting**.

---

### Regularization is like telling your model:

> “Yo, don’t go crazy trying to fit everything perfectly. Keep it simple and chill.”

---

### How does it do that?

It **penalizes** the model when it uses **very large weights** (the numbers that control the slope of the line).
Why? Because large weights usually mean the model is trying too hard to fit the training data.

Regularization **adds a small punishment** to the model’s error (loss), based on how large the weights are.

---

### There are two common flavors:

#### 1. **L1 Regularization (Lasso)**

* Adds penalty = sum of absolute values of weights
* Can make some weights go all the way to zero (i.e., removes unnecessary features)

#### 2. **L2 Regularization (Ridge)**

* Adds penalty = sum of squares of weights
* Makes weights smaller, but rarely exactly zero

---

### Analogy time 🧠:

Imagine you’re trying to draw a straight line through a cloud of dots on a whiteboard.

* **Without regularization:** You let your hand be as shaky or twisty as it wants — might fit all dots perfectly, but looks messy.
* **With regularization:** It’s like putting a little resistance band on your hand — keeps your movement smooth and simple.

---

So TL;DR:
**Regularization = a way to keep the model from overfitting by discouraging it from relying too much on any one input.**
It's like adding some “common sense” to your model.


Oh hell yes — this is where the fun starts. Let’s break this down in the simplest, most visual way possible — with a bit of **geometry and intuition**.

---

### 🧠 First, understand what optimization is doing:

In linear regression (without regularization), you're just minimizing the **loss** (say, Mean Squared Error).
With regularization, you’re minimizing:

```
Loss = MSE + Regularization Term
```

So it becomes a **tug-of-war** between:

* Fitting the data well (**MSE**)
* Keeping the model simple (**regularization term**)

---

## 🎯 Let's now focus on why L1 can zero out weights but L2 usually doesn’t.

---

## 🟨 Think in 2D: Imagine a graph with just **2 coefficients: w1 and w2**

### 1. **L1 Regularization (Lasso)**:

It adds this penalty:

```
|w1| + |w2|
```

And this forms a **diamond-shaped constraint region** like this:

```
     |
  \  |  /
   \ | /       <- diamond (sharp corners)
---- + ----
   / | \
  /  |  \
     |
```

These sharp corners **lie exactly on the axes** — where **w1 or w2 = 0**.

So when the optimization tries to find the minimum error **inside this diamond**, it's super likely to land on one of those corners — which means **setting w1 or w2 exactly to 0**.

👉 That’s why **L1 regularization leads to sparsity** — it kills off unnecessary weights.

---

### 2. **L2 Regularization (Ridge)**:

It adds this penalty:

```
w1² + w2²
```

And that forms a **circle (or ellipse) constraint region**:

```
     •
  •     •
 •       •
  •     •
     •
```

No sharp corners here, just smooth curves.

When you minimize the loss inside this region, you’re **less likely to hit exact zeros**.
You’ll get **small weights**, but not **exactly zero**.

👉 So **L2 shrinks all weights**, but doesn’t usually kill them.

---

### 🔥 Bottom Line:

| Regularization | Penalty Shape | What It Does      | Zeroes Out Coeffs? |
| -------------- | ------------- | ----------------- | ------------------ |
| L1 (Lasso)     | Diamond       | Feature selection | ✅ Yes              |
| L2 (Ridge)     | Circle        | Shrinkage         | ❌ Rarely           |

---

### Bonus Analogy 🎯

* **L1 is like firing people** from a company: "You’re not useful? You're out!"
* **L2 is like reducing salaries**: "You’re kinda useful? I’ll just pay you less."


#### Feature scaling

Hell yes. Let’s do **feature scaling** the **street-smart way** — no books, just intuition.

---

## 🧠 What is Feature Scaling?

Imagine you’re training a model, and your input has:

* `age`: 20 to 80
* `income`: 20,000 to 2,00,000
* `height`: 1.5 to 2 meters

Now you throw these into a model like a salad.

**Problem:**
The model sees `income` is way bigger than `height`, so it starts thinking **income is more important**, even when it’s not.

> Feature scaling is just making sure **everything is on the same playing field**.

So `age`, `income`, `height` — all become like:
`0.1`, `0.5`, `0.9` — same scale, fair fight.

---

## 🎯 Why do we care?

Some models (like Linear Regression, KNN, SVM, Gradient Descent-based models) **get confused** if features have wildly different scales.
They think big numbers = big importance.

So we scale them to stop that nonsense.

---

## ⚙️ Types of Feature Scaling (with brain-dead analogies):

---

### 1. **Min-Max Scaling (Normalization)**

```python
x_scaled = (x - min) / (max - min)
```

**What it does:**
Squishes values between **0 and 1**.

**Analogy:**
Imagine shrinking every player to fit between the heights of the shortest and tallest — now everyone's between 0 (shortest) and 1 (tallest).

**Used when:**
You know your data limits and want things in a specific range (e.g., neural networks love this).

---

### 2. **Standardization (Z-Score Scaling)**

```python
x_scaled = (x - mean) / std
```

**What it does:**
Transforms data to have:

* Mean = 0
* Standard Deviation = 1

**Analogy:**
Imagine you center everyone around "average", and measure how weird or far off they are from it.

**Used when:**
You don’t know the exact bounds of the data or want to keep outliers around. SVM, Logistic Regression, PCA — all love this.

---

### 3. **MaxAbs Scaling**

```python
x_scaled = x / max(abs(x))
```

**What it does:**
Keeps **signs (positive/negative)** intact, scales values to -1 to +1.

**Analogy:**
Like min-max, but doesn’t mess with 0. Good for **sparse data** (lots of 0s).

---

### 4. **Robust Scaling**

```python
x_scaled = (x - median) / IQR
```

Where IQR = interquartile range (75th percentile - 25th percentile)

**What it does:**
Ignores outliers, focuses on the middle chunk of the data.

**Analogy:**
You’re trying to compare people’s height, but you ignore the giants and dwarfs. You care about the **normal crowd**.

**Used when:**
Your data has outliers that you don't want to distort the scaling.

---

## 💡 TL;DR Quick Chart

| Type         | Range     | Sensitive to Outliers? | When to Use                             |
| ------------ | --------- | ---------------------- | --------------------------------------- |
| Min-Max      | 0 to 1    | ✅ Yes                  | Neural nets, bounded features           |
| Standard (Z) | \~-3 to 3 | ✅ Yes                  | Most ML algorithms (SVM, LR, PCA, etc.) |
| MaxAbs       | -1 to 1   | ✅ Yes                  | Sparse data                             |
| Robust       | Depends   | ❌ No                   | Data with outliers                      |



In [1]:
## Data setup for regularization

In [2]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
df = pd.read_csv('../../pierian-data-excercises/08-Linear-Regression-Models/Advertising.csv')

In [4]:
X = df.drop('sales', axis=1)
y = df['sales']

In [5]:
from sklearn.preprocessing import PolynomialFeatures

In [6]:
poly_converter = PolynomialFeatures(degree=3, include_bias=False)

In [7]:
poly_features = poly_converter.fit_transform(X)

In [8]:
X.shape

(200, 3)

In [9]:
poly_features.shape

(200, 19)

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=42)

In [12]:
# We will scale the features now
# We dont do it before splitting so that the test data doesn't leaks into training

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
scaler = StandardScaler()

In [15]:
X_train = scaler.fit_transform(X_train) # Scaling the X_train

In [16]:
X_train[0] # See the values are much smaller now

array([ 1.53003874, -0.8862929 , -1.16598019,  1.93705409, -0.22476356,
       -0.62651363, -0.90793157, -0.850692  , -0.81317305,  2.23096225,
        0.10749852, -0.35876444, -0.61820397, -0.60530328, -0.5751799 ,
       -0.81381745, -0.72447935, -0.64260728, -0.56741817])

In [17]:
X_test = scaler.fit_transform(X_test)

In [18]:
X_test[0]

array([ 0.44073112,  0.58434711,  0.9171906 ,  0.14346778,  0.68658331,
        0.91421828,  0.32793801,  0.79576027,  0.54490668, -0.0904355 ,
        0.32767892,  0.46219831,  0.45921271,  0.95195313,  0.70519137,
        0.09113576,  0.49371634,  0.45876497,  0.2081822 ])

#### L2 Regularization impl.

In [19]:
from sklearn.linear_model import Ridge

In [20]:
ridge_model = Ridge(alpha=10)

In [21]:
ridge_model.fit(X_train, y_train)

In [22]:
test_predictions = ridge_model.predict(X_test)

In [23]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

In [24]:
mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = root_mean_squared_error(y_test, test_predictions)

In [25]:
mae

1.5821428571428588

In [26]:
mse

3.329029354046658

In [27]:
rmse

1.824562784353188

In [28]:
from sklearn.linear_model import RidgeCV # To get the best value of alpha, we need to use cross-validation

In [29]:
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring= 'neg_mean_squared_error')

In [30]:
ridge_cv_model.fit(X_train, y_train)

In [31]:
ridge_cv_model.alpha_ # Gives the best alpha

np.float64(0.1)

In [32]:
from sklearn.metrics._scorer import _SCORERS

In [33]:
_SCORERS.keys() # These are all the diff. types of scoring metrics we can use. (They are tuned to be: "The higher the better")

dict_keys(['explained_variance', 'r2', 'neg_max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_root_mean_squared_log_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'd2_absolute_error_score', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'reca

In [34]:
y_test_predictions_cv = ridge_cv_model.predict(X_test)

In [35]:
mae = mean_absolute_error(y_test, y_test_predictions_cv)
mse = mean_squared_error(y_test, y_test_predictions_cv)
rmse = root_mean_squared_error(y_test, y_test_predictions_cv)

In [36]:
mae

1.582142857142856

In [37]:
mse

2.868203954822992

In [38]:
rmse # This is a better alpha value than 10.

1.6935772656784787

### LASSO Regression - (Least absolute shrinkage and selection operator)

In [41]:
from sklearn.linear_model import LassoCV # We'll do cross validation to know the correct alpha value

In [73]:
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv = 5) # cv is k-fold: No. of folds we want

In [74]:
lasso_cv_model.fit(X_train, y_train)

lasso_cv_model.alpha_ # This is the best value of alpha we got

In [76]:
test_predictions = lasso_cv_model.predict(X_test)

In [77]:
mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = root_mean_squared_error(y_test, test_predictions)

In [78]:
mae

1.5994504489541947

In [79]:
mse

3.6113829965844397

In [80]:
rmse

1.9003639116191509

In [81]:
# It didnt perform as good as L2 model (Ridge)

In [82]:
lasso_cv_model.coef_ # It is only considering 2 features, So with only two input features the performance is impressive

array([0.97675148, 0.        , 0.        , 0.        , 3.8148913 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

### ElasticNet : Combines L1 and L2