<div align="center">

# <span style="color:#ffc509;"> **LASSO L1 Regression** (Least Absolute Shrinkage and Selection Operator) </span>

</div>

Lasso regression is a statistical technique used in machine learning to improve the accuracy of predictive models, especially when working with large datasets. Its main goal is to prevent overfitting, a problem that occurs when a model fits the training data too closely and does not perform well on new data. Lasso achieves this by applying a penalty that reduces the magnitude of the coefficients of the predictor variables, even forcing some to zero. This simplifies the model, automatically selects the most important variables, and improves its generalization ability.

<i>
Clarification:
Although Lasso regression and L1 regularization are often mentioned as if they were the same thing, they are not. Lasso is a linear regression model that uses an L1 penalty, but L1 regularization is also used in other models, such as logistic regression, support vector machines (SVM), and neural networks. The confusion arises because Lasso is the most common example of L1 regularization, which leads many sources to treat the two terms as synonyms. </i>

### <span style="color:#ffc509"> What is **regularization?** </span>

In machine learning, regularization is like putting a brake on the model so it doesn't become too complicated. The problem is that sometimes the model memorizes the training data, like a student who memorizes exact answers for an exam. This is called "overfitting." When this happens, the model performs poorly on new data. Regularization adds a "penalty" to the formula that the model uses to know when it's wrong, so that it learns more generally, and not so specifically, and in this way, it doesn't memorize the training data.

### <span style="color:#ffc509"> **What** is Lasso regression? </span>

Lasso regression is a type of linear regression that applies L1 regularization: it adds a penalty based on the absolute value of the coefficients of the predictor variables. This penalty can force some coefficients to be exactly zero, which means those variables are eliminated from the model.

### <span style="color:#ffc509"> **Why** is Lasso regression used? </span>

- Prevention of overfitting: Reduces the complexity of the model, preventing it from fitting the training data too closely.
- Automatic variable selection: Identifies and retains only the most relevant variables for prediction, simplifying the model and improving its interpretability.
- Handling multicollinearity: Although it does not eliminate it completely, Lasso can mitigate the effects of multicollinearity (high correlation between predictor variables) by selecting one variable and eliminating the correlated ones.


### <span style="color:#ffc509">  **How** does LASSO regression work? </span>

1.  L1 Penalty: Lasso adds a penalty term to the cost function of the linear regression model. This term is proportional to the sum of the absolute values of the coefficients of the predictor variables.
2. Lambda Parameter (λ): This parameter controls the strength of the penalty.
    - A high λ increases the penalty, which reduces more coefficients to zero, simplifying the model.
    - A low λ decreases the penalty, retaining more variables in the model.
3. Minimization of the penalized error: Lasso seeks the coefficient values that minimize the sum of two terms: the prediction error, such as the Mean Squared Error (MSE, the average squared difference between the predicted and actual values), and the L1 penalty.
4. Variable Selection: By forcing some coefficients to zero, Lasso performs automatic variable selection, keeping only the most important ones.


The cost function that LASSO regression tries to minimize is:

$$ J(\beta) = \text{Error} + \lambda \sum_{i=1}^{p} |\beta_i| $$

Where:

* $J(\beta)$ is the cost function.
* $\text{Error}$ is a measure of the error between the predicted values and the actual values (for example, the sum of squared errors in standard linear regression).
* $\lambda$ (lambda) is the **regularization parameter** (a hyperparameter that is tuned). It controls the strength of the penalty. The higher the value of $\lambda$, the greater the penalty.
* $\sum_{i=1}^{p} |\beta_i|$ is the **L1 norm** of the coefficients, which is the sum of the absolute values of all the coefficients ($\beta_i$).
* $p$ is the number of predictor variables.
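
The cost function above can be sketched in a few lines of plain Python. This is only an illustration: it assumes that $\text{Error}$ is the sum of squared errors, and the function name `lasso_cost` and the tiny dataset are made up for the example.

```python
def lasso_cost(X, y, beta, intercept, lam):
    """J(beta) = Error + lam * sum(|beta_i|), with Error taken as the sum of squared errors."""
    sse = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(x * b for x, b in zip(row, beta))
        sse += (target - prediction) ** 2
    l1_norm = sum(abs(b) for b in beta)  # the L1 norm of the coefficients
    return sse + lam * l1_norm


# Made-up dataset: 3 observations, 2 predictor variables.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [3.0, 1.0, 4.0]

# These coefficients fit the data perfectly, so the error term is 0
# and the whole cost comes from the penalty: lam * (|3| + |1|).
cost_no_penalty = lasso_cost(X, y, beta=[3.0, 1.0], intercept=0.0, lam=0.0)
cost_with_penalty = lasso_cost(X, y, beta=[3.0, 1.0], intercept=0.0, lam=0.5)
print(cost_no_penalty, cost_with_penalty)  # 0.0 2.0
```

Even a perfect fit pays a price when $\lambda > 0$, which is what pushes the optimizer toward smaller coefficients.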

### <span style="color:#ffc509">  **When** is Lasso regression used? </span>

- High-dimensional datasets: When there are many more predictor variables than observations.
- Automatic variable selection: When it is necessary to identify the most relevant variables for prediction.
- Predictive problems: Where the main objective is prediction accuracy.

### <span style="color:#ffc509">  **The key to LASSO regression: Feature selection** </span>

The main distinguishing feature of L1 regularization (and therefore LASSO regression) is its ability to **force some of the coefficients of the predictor variables to be exactly zero**.

* When $\lambda$ is sufficiently large, the L1 penalty can cause the coefficients of the less important variables to be reduced to zero, effectively eliminating those variables from the model.
* This makes LASSO regression a useful method for **feature selection**, as it automatically identifies the most relevant variables for prediction.
* The resulting model is more **parsimonious** (has fewer variables) and, therefore, often easier to interpret.
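
To see coefficients become exactly zero, here is a minimal sketch of a Lasso fit by cyclic coordinate descent in plain Python. It is one common way to solve the problem, not necessarily what any particular library does. It assumes the error term is scaled as $\frac{1}{2n}$ times the sum of squared errors; the names `soft_threshold` and `fit_lasso` and the small dataset are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, lam, n_iters=100):
    """Minimize (1/(2n)) * SSE + lam * sum(|beta_j|) by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    # Center y and every column so the unpenalized intercept can be recovered at the end.
    y_mean = sum(y) / n
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0   # predictor j times the residual that ignores predictor j
            norm = 0.0  # sum of squares of predictor j
            for i in range(n):
                partial = yc[i] - sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
                norm += Xc[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    intercept = y_mean - sum(m * b for m, b in zip(x_means, beta))
    return intercept, beta


# Made-up data: the first column drives y (about +5), the second is almost pure noise.
X = [[0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
y = [10.0, 10.4, 9.8, 10.2, 15.1, 15.3, 14.9, 15.1]

intercept, beta = fit_lasso(X, y, lam=0.1)
print(intercept, beta)  # the noise coefficient comes out as exactly 0.0
```

With `lam=0.0` the same function returns the ordinary least-squares coefficients for this data, and with a large value such as `lam=2.0` both coefficients are set to zero.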

### <span style="color:#ffc509">  **Key Differences between LASSO Regression (L1) and Ridge Regression (L2)** </span>

| Characteristic          | LASSO Regression (L1)                                      | Ridge Regression (L2)                                           |
|-------------------------|-----------------------------------------------------------|----------------------------------------------------------------|
| **Type of Regularization** | L1 norm: sum of the absolute values of the coefficients ($\sum_i \lvert\beta_i\rvert$) | Squared L2 norm: sum of the squared coefficients ($\sum_i \beta_i^2$) |
| **Effect on Coefficients** | Can reduce some coefficients to **exactly zero**.  | Reduces the magnitude of the coefficients, but **rarely to zero**. |
| **Feature Selection** | **Performs feature selection** by eliminating variables. | **Does not perform explicit feature selection**.              |
| **Model Parsimony** | Tends to generate **more parsimonious** models (fewer variables). | Tends to keep all variables in the model (although with small weights). |
| **Interpretability** | Can improve **interpretability** by simplifying the model. | Interpretability may be lower due to the presence of all variables. |
| **Handling Multicollinearity** | Tends to select one variable from a correlated group. | Distributes the weight among correlated variables.          |
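
The difference in the "Effect on Coefficients" row can be illustrated with a simplified case. The sketch below assumes a single coefficient whose unpenalized estimate is `b` (as happens with orthonormal predictors) and uses the conventions written in the docstrings; the function names and the example values are hypothetical.

```python
def lasso_shrink(b, lam):
    """Minimizer of 0.5 * (b - beta)**2 + lam * abs(beta): subtract lam, or snap to zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Minimizer of 0.5 * (b - beta)**2 + 0.5 * lam * beta**2: divide by (1 + lam)."""
    return b / (1.0 + lam)


ols_coefficients = [4.0, -2.0, 0.5, -0.25]  # hypothetical unpenalized estimates
lam = 1.0

lasso_coefficients = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, lam) for b in ols_coefficients]

print("lasso:", lasso_coefficients)  # [3.0, -1.0, 0.0, 0.0]
print("ridge:", ridge_coefficients)  # [2.0, -1.0, 0.25, -0.125]
```

In this setting Lasso subtracts a fixed amount and sets small coefficients to exactly zero, while Ridge scales every coefficient by the same factor, so a nonzero coefficient stays nonzero.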

### <span style="color:#ffc509">  **Advantages** of LASSO regression: </span>

* Automatic feature selection: Identifies and eliminates irrelevant variables, simplifying the model and making it easier to interpret. For example, when predicting the price of a house, LASSO selects the most relevant features such as size and location.
* Helps prevent overfitting: Reduces the complexity of the model by penalizing large coefficients, which improves accuracy on new data.
* Useful in high-dimensional datasets: Works well when there are many predictor variables (high dimension), some of which may be irrelevant.
* Improves model interpretability: By having fewer variables, the model is easier to understand and explain.
* Can handle some multicollinearity: Tends to select one variable from a group of highly correlated variables and set the coefficients of the others to zero. For example, if you have variables like "square meters" and "number of rooms," LASSO selects the most important one.

### <span style="color:#ffc509">  **Disadvantages** of LASSO regression: </span>

* May discard relevant variables: If $\lambda$ is too large, it could eliminate variables that actually have an impact on the prediction, which would reduce the accuracy of the model.
* Arbitrary selection in high multicollinearity: If there are groups of highly correlated variables, LASSO may keep one of them more or less arbitrarily. This can be unstable, as small variations in the data could change which variable is selected.
* May not perform as well as Ridge if all variables are relevant: If most variables have some impact on the prediction, Ridge regression might give better results in terms of predictive accuracy.
* The choice of the parameter $\lambda$ is crucial: An incorrect value of $\lambda$ can lead to an underfitted model (if $\lambda$ is too large) or an overfitted model (if $\lambda$ is too small). The optimal selection of $\lambda$ is often done using cross-validation techniques.
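
As a rough sketch of how cross-validation can pick $\lambda$, the code below tries a grid of candidate values on a one-predictor Lasso and keeps the one with the lowest average validation error. Everything here is an assumption made for illustration: the helper names, the made-up data, the grid, the 3-fold split, and the $\frac{1}{2n}$ scaling of the error term.

```python
def fit_lasso_1d(xs, ys, lam):
    """One-predictor Lasso for (1/(2n)) * SSE + lam * |beta|; returns (intercept, beta)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    var = sum((x - x_mean) ** 2 for x in xs) / n
    # Soft-thresholding: shrink the covariance by lam, or set it to exactly zero.
    sign = 1.0 if cov >= 0 else -1.0
    shrunk = sign * max(abs(cov) - lam, 0.0)
    beta = shrunk / var if var > 0 else 0.0
    return y_mean - beta * x_mean, beta


def cv_error(xs, ys, lam, k=3):
    """Average validation MSE over k folds (observation i goes to fold i % k)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        valid = [i for i in range(n) if i % k == fold]
        intercept, beta = fit_lasso_1d(
            [xs[i] for i in train], [ys[i] for i in train], lam
        )
        mse = sum((ys[i] - intercept - beta * xs[i]) ** 2 for i in valid) / len(valid)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0]

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```

On this data the largest candidate removes the only predictor, which gives an underfitted model with a much higher validation error than the unpenalized fit.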


### <span style="color:#ffc509"> Common Use Cases </span>

- Genomics: Identification of genes relevant to certain diseases.
- Marketing: Selection of the most important demographic and behavioral variables to predict the response to an advertising campaign.
- Finance: Building credit risk models, selecting the most influential factors.

### <span style="color:#ffc509"> **Dictionary** of Key Terms </span>

- **Overfitting**: A model that fits the training data too closely and does not generalize well to new data.
- **Regularization**: Techniques to prevent overfitting, adding additional information to avoid excessive model complexity.
- **L1 Penalty**: The penalty used in Lasso, based on the absolute value of the coefficients.
- **Lambda** (**λ**): The parameter that controls the strength of the penalty in Lasso.
- **Alpha** (**α**): The name that some libraries, such as scikit-learn, give to the penalty strength called λ in this document.
- **Multicollinearity**: High correlation between predictor variables.
- **Mean Squared Error** (**MSE**): The average of the squared differences between the predicted and actual values.
- **Bias**: The difference between the average predictions of a model and the actual values.
- **Variance**: The variability of a model's predictions for different datasets.
- **Cost Function**: A function that measures how well a model fits the data; the goal is to minimize this function.
- **Beta Coefficients** (**β**): The values that multiply the predictor variables in a regression model.
- **$R^2$**: A measure of what proportion of the variance in the dependent variable is explained by the independent variables.
- **Dependent Variable**: The variable being predicted.
- **Independent Variables**: The variables used to make predictions.
- **Elastic Net**: A regularization technique that combines L1 (Lasso) and L2 (Ridge) penalties.
- **Ridge**: Linear regression with an L2 penalty, based on the sum of the squared coefficients.

___
___
___

<div align="center">

# <span style="color:#ffc509">  **Example**: ☕ </span>

</div>

### We have a coffee shop and we want to know what things make us sell more coffee per day.

We have been recording data for several weeks, and we have variables such as:

- ✅ It's cold
- ✅ It's the weekend
- ✅ There are promotions
- ✅ There is live music
- ✅ Decorative lights were turned on
- ✅ Change in the menu
- ✅ There are new chairs
- ✅ There was a full moon 🌕
- ✅ The cat climbed on the counter 😺

---

### 🎯 Our goal:
**Predict how many coffees will be sold**, depending on our factors.

But there's a problem:

👉 Not all of these factors actually influence sales.
Some are important (like promotions), others are noise (like the full moon or the cat 😺).

---

### 📉 This is where Lasso (L1) regression comes in

Lasso is a type of **linear regression with regularization**.
It's like a filter that helps you:

- 🧠 Select only the variables that really help predict
- 🧹 Eliminate or ignore the ones that don't work
- 🧾 Make the model simpler and more generalizable

---

### 📊 How does it work in the coffee shop?

Let's say we feed the model with records from many days:

- Day 1: It's cold + weekend + promotion + music = 280 coffees
- Day 2: Hot + Monday + no promo + cat on the counter = 90 coffees
- Day 3: Cold + menu change + lights + music = 200 coffees
- Day 4: Full moon + new chairs + promo = 100 coffees

The model **learns what things really impact sales**.

---

### 📈 Classic vs. Lasso Model

### 🧮 Linear regression model (without penalty):
$$ \hat{y} = \beta_0 + \beta_1 \cdot \text{Cold} + \beta_2 \cdot \text{Weekend} + \beta_3 \cdot \text{Promotion} + \beta_4 \cdot \text{Music} + \beta_5 \cdot \text{Lights} + \beta_6 \cdot \text{Menu} + \beta_7 \cdot \text{Chairs} + \beta_8 \cdot \text{FullMoon} + \beta_9 \cdot \text{Cat} $$

$\hat{y}$ corresponds to the predicted sales.

This model could give small but not exactly zero weights ($\beta_i$) to irrelevant variables like the full moon or the cat.

### 🪄 Lasso Regression (with **L1 penalty**):

$$ \min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} $$



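Rescaling the error term by $\frac{1}{2n}$ (a common convention that only changes the scale of $\lambda$) and writing it in coffee-shop terms, this would be: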

$$ \text{Cost} = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} (\text{Sales}_i - \text{PredictedSales}_i)^2}_{\text{How far our predictions are from the actual sales}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{Penalty for giving importance to too many factors}} $$


$\min_{\beta}$ = Find the coefficients $\beta$ that make the expression in braces as small as possible


$\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j)^2$ = **SSE** (sum of squared errors). It sums the squared error of each observation $i$, from the first ($i = 1$) to the last ($i = n$).

- $y_i$ = The actual number of coffees sold on day $i$
- $\beta_0$ = The intercept: the baseline number of coffees the model predicts when every predictor variable is zero

- $\sum_{j=1}^{p} x_{ij} \beta_j$ = The contribution of all the variables: for a specific observation $i$, the value of each predictor variable $x_{ij}$ is multiplied by its corresponding coefficient $\beta_j$, and these products are summed over all $p$ predictor variables.

    - $\sum$ = summation
    - $j = 1$ and $p$ = These are the limits of the summation
        - $j$ = index of the summation
        - $j = 1$ = indicates that the summation starts with the value of $j$ equal to 1 (first predictor variable in the dataset)
        - $p$ = indicates that the summation ends when the value of $j$ reaches $p$, where $p$ represents the total number of predictor variables in the model
    - $x_{ij}$ = represents the value of the $j$-th predictor variable for the $i$-th observation
    - $\beta_j$ = represents the coefficient associated with the $j$-th predictor variable
    - $(\cdot)^2$ = the error is squared to: 1.- avoid the cancellation of positive and negative errors. 2.- penalize large errors more. 3.- make the calculation easier

$\frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ = **MSE** (mean squared error) multiplied by $\frac{1}{2}$. It is the SSE divided by the number of observations $n$, so it measures, on average, how far our predictions are from the actual values. The extra factor $\frac{1}{2}$ is a common convention that only rescales the cost.

- $y_i$ = Actual value of the dependent variable (number of coffees sold on day $i$)
- $\hat{y}_i$ = Value of the dependent variable predicted by the regression model for the $i$-th observation (number of coffees the model predicts will be sold on day $i$)
- $(\cdot)^2$ = The error is squared to: 1.- avoid the cancellation of positive and negative errors. 2.- penalize large errors more. 3.- make the calculation easier

$\lambda \sum_{j=1}^{p} |\beta_j|$ = the **L1 penalty**. This penalty can force the coefficients of the less important variables to be exactly **zero**.
- $\lambda$ = Regularization parameter or penalty constant
    - If it is close to 0, the penalty is weak (without much tendency to reduce coefficients to zero)
    - If it is large, the penalty is strong (model incentivized to make many of the predictor variables' coefficients zero)
    - Optimal value: Generally found using cross-validation techniques.
- $\sum$ = summation
    - $j = 1$ and $p$ = These are the limits of the summation
        - $j$ = index of the summation
        - $j = 1$ = indicates that the summation starts with the value of $j$ equal to 1 (first predictor variable in the dataset)
        - $p$ = indicates that the summation ends when the value of $j$ reaches $p$, where $p$ represents the total number of predictor variables in the model
- $\beta_j$ = Coefficient (the weight) associated with the $j$-th predictor variable of the model. It indicates the strength and direction of the relationship between the predictor and the dependent variable (for example, a considerable impact on coffee sales, or none).
- $|\cdot|$ = Absolute value: the magnitude of a number regardless of its sign ($|-3| = 3$, $|3| = 3$)
---

### ✨ The magic result:

After training the Lasso model with the data, we could obtain something like:

$$ \text{Sales} \approx 50 + 80 \cdot \text{Cold} + 120 \cdot \text{Weekend} + 150 \cdot \text{Promotion} + 60 \cdot \text{Music} + 0 \cdot \text{Lights} + 30 \cdot \text{Menu} + 0 \cdot \text{Chairs} + 0 \cdot \text{FullMoon} + 0 \cdot \text{Cat} $$

And this clearly tells us that:

- 🌡️ **Cold weather**, **weekends**, and **promotions** are key factors that increase our sales.
- 🎶 **Live music** and a **change in the menu** have a smaller but positive impact.
- 💡 **Decorative lights**, **new chairs**, the **full moon**, and the **cat's** antics seem to have no influence on sales in this model.
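
As a small illustration, the fitted equation above can be turned into a prediction function. The coefficients are copied from that equation; the function name `predict_sales` and the example day are hypothetical.

```python
# Coefficients copied from the illustrative fitted equation above.
intercept = 50
coefficients = {
    "Cold": 80, "Weekend": 120, "Promotion": 150, "Music": 60, "Lights": 0,
    "Menu": 30, "Chairs": 0, "FullMoon": 0, "Cat": 0,
}


def predict_sales(day):
    """Predicted coffees for a day given as 0/1 indicators (missing factors count as 0)."""
    return intercept + sum(coef * day.get(name, 0) for name, coef in coefficients.items())


# Variables Lasso kept (nonzero coefficient) and variables it eliminated (zero coefficient).
selected = [name for name, coef in coefficients.items() if coef != 0]
dropped = [name for name, coef in coefficients.items() if coef == 0]

# Hypothetical day: cold, with a promotion, and the cat on the counter.
day = {"Cold": 1, "Promotion": 1, "Cat": 1}
print(predict_sales(day))  # 50 + 80 + 150 + 0 = 280
print(selected)            # ['Cold', 'Weekend', 'Promotion', 'Music', 'Menu']
print(dropped)             # ['Lights', 'Chairs', 'FullMoon', 'Cat']
```

Because the cat's coefficient is zero, adding or removing it from the day does not change the prediction.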

---

### 🚀 In summary, Lasso helps us to:

- **Identify the real sales factors.**
- **Simplify our prediction model.**
- **Make smarter decisions for our coffee shop.**

With this, we can now focus on the factors that really influence our business, making it grow! ☕

___
## <span style="color:#ffc509">  **Useful Links** </span>
https://www.ibm.com/es-es/think/topics/lasso-regression