# Regularization

## What is regularization? Why do we need it? 
Regularization reduces overfitting in machine learning models by penalizing model complexity, for example large coefficients. It helps models generalize to unseen data and makes them less sensitive to noise in the training data.

## What happens to our linear regression model if we have three columns in our data: x, y, z  —  and z is a sum of x and y?
Ordinary least squares will not have a unique solution, because z is linearly dependent on x and y. The normal equation is $\beta = (X^{T}X)^{-1}X^{T}y$.  
With linearly dependent columns, $X^{T}X$ is not invertible.
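
A minimal NumPy sketch of the problem, using made-up values for the three columns (the names `x_col`, `y_col`, `z_col` and the numbers are illustrative assumptions). It checks that $X^{T}X$ has a non-trivial null vector and that adding a ridge term $\lambda I$ restores full rank:

```python
import numpy as np

# Hypothetical toy columns: z_col is exactly x_col + y_col.
x_col = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_col = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
z_col = x_col + y_col

X = np.column_stack([x_col, y_col, z_col])  # 5 samples, 3 features
gram = X.T @ X                              # matrix the normal equation inverts

# x + y - z = 0, so (1, 1, -1) is mapped to zero: the matrix is singular.
null_vec = np.array([1.0, 1.0, -1.0])
print("X^T X @ null_vec:", gram @ null_vec)

# X^T X has the same rank as X: 2 instead of 3.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank)

# Adding lam * I (the ridge penalty) makes the matrix invertible again.
lam = 1.0
ridge_rank = np.linalg.matrix_rank(gram + lam * np.eye(3))
print("rank of X^T X + lam * I:", ridge_rank)
```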

## ‍Which regularization techniques do you know? ‍
There are three main types of regularization:
1. Lasso regularization (L1) -- It penalizes the **absolute values of the regression coefficients**. It can shrink coefficients to exactly zero, which allows feature selection. Useful when we expect that some of the variables carry no useful information.
2. Ridge regularization (L2) -- It penalizes the **squares of the regression coefficients**. Useful when we have a lot of features that might correlate with each other (**multicollinearity**).
3. Elastic Net -- It combines the L1 and L2 penalties, with separate strengths $\lambda_1$ and $\lambda_2$, which gives the flexibility of both.

**$\lambda$ determines the amount of regularization.**
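
Written out, the three penalties can be sketched as follows. This assumes a squared-error loss with design matrix $X$, target $y$ and coefficients $\beta$; libraries often rescale the loss or the penalty by constants, so the exact form may differ:

$$
\begin{aligned}
\text{Lasso (L1):}\quad & \min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j| \\
\text{Ridge (L2):}\quad & \min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \sum_j \beta_j^2 \\
\text{Elastic Net:}\quad & \min_{\beta}\; \|y - X\beta\|_2^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2
\end{aligned}
$$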

## When do we need to perform feature normalization for linear models? When is it okay not to do it?

**Feature normalization is necessary for L1 and L2 regularization.** The idea of both methods is to penalize all the features relatively equally. This can't be done effectively if every feature is on a different scale.

Linear regression without regularization can be used without feature normalization. Regularization also helps to make the analytical solution more stable: ridge adds the regularization matrix $\lambda I$ to $X^{T}X$ before inverting it.
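
The sketch below illustrates both points on made-up data (the arrays and the helper names `ridge_closed_form` and `standardize` are assumptions for illustration). Two features are equally informative, but one is recorded in units that make its values 1000 times smaller; without scaling, the same $\lambda$ almost wipes out that feature, while after standardization both are shrunk equally:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical data: y depends equally on a and b, but b is stored
# in units that make its values 1000 times smaller.
a = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
b = np.array([2.0, -1.0, -2.0, -1.0, 2.0])
y = a + b
X_raw = np.column_stack([a, b / 1000.0])

lam = 1.0

# Raw features: compare ridge weights with the unpenalized (lam = 0) weights.
ols_raw = ridge_closed_form(X_raw, y, 0.0)
ridge_raw = ridge_closed_form(X_raw, y, lam)
kept_raw = ridge_raw / ols_raw
print("fraction of weight kept (raw):", kept_raw)

# Standardized features: the penalty now treats both features alike.
X_std = standardize(X_raw)
ols_std = ridge_closed_form(X_std, y, 0.0)
ridge_std = ridge_closed_form(X_std, y, lam)
kept_std = ridge_std / ols_std
print("fraction of weight kept (standardized):", kept_std)
```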

## What kind of regularization techniques are applicable to linear models? ‍

**Ridge regression, Lasso, Elastic Net**, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD, Dantzig Selector, SLOPE, AIC/BIC

## What does L2 regularization look like in a linear model?
[Visualization for L1,L2](https://www.youtube.com/watch?v=Xm2C_gTAl8c)  
L2 regularization adds a penalty term: $\lambda$ times the sum of the squared regression coefficients. In a linear model, this makes the predictions less **sensitive** to changes in the inputs: the optimal slope shrinks, so the fitted line is less steep.
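
For a single feature with no intercept, minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ gives the slope $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. A small sketch on made-up data (the helper `ridge_slope` and the values are illustrative assumptions) shows the line flattening as $\lambda$ grows:

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for one feature."""
    return np.sum(x * y) / (np.sum(x * x) + lam)

# Hypothetical noise-free data with true slope 3.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3.0 * x

lams = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, lam) for lam in lams]
for lam, slope in zip(lams, slopes):
    print(f"lambda = {lam:6.1f} -> slope = {slope:.4f}")
```

The slope starts at the unpenalized value for $\lambda = 0$ and gets smaller for every larger $\lambda$, but it never reaches zero.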


## What does L1 regularization look like in a linear model?
L1 regularization adds a penalty term: $\lambda$ times the sum of the absolute values of the regression coefficients. Like L2, it shrinks the optimal slope, but in a different way: **L1 pulls each coefficient toward zero by a roughly fixed amount and can set it to exactly zero, while L2 shrinks coefficients in proportion to their size and leaves them non-zero.** Which of the two slopes ends up larger depends on $\lambda$ and the data.

![image.png](attachment:image.png)
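
To make the L1 versus L2 comparison concrete, here is a one-feature sketch without an intercept. It assumes the objectives $\sum_i (y_i - \beta x_i)^2 + \lambda|\beta|$ for L1 and $\sum_i (y_i - \beta x_i)^2 + \lambda\beta^2$ for L2; the data and helper names are made up for illustration. The L1 solution is a soft-thresholded version of the unpenalized one:

```python
import numpy as np

def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return 0 if it is within threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * |b| for one feature."""
    return soft_threshold(np.sum(x * y), lam / 2.0) / np.sum(x * x)

def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for one feature."""
    return np.sum(x * y) / (np.sum(x * x) + lam)

# Hypothetical noise-free data with true slope 3.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3.0 * x

for lam in [0.0, 10.0, 60.0]:
    print(f"lambda = {lam:5.1f}: "
          f"L1 slope = {lasso_slope(x, y, lam):.4f}, "
          f"L2 slope = {ridge_slope(x, y, lam):.4f}")
```

On this toy data the L1 slope is the larger one at $\lambda = 10$, but at $\lambda = 60$ it is exactly zero while the L2 slope is still positive.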

## How do we select the right regularization parameters? 
1. Use cross-validation to estimate how well each candidate value of the regularization parameter generalizes.
2. Search over candidate values with a grid search or a random search. For example, the scikit-learn linear models (https://scikit-learn.org/stable/modules/linear_model.html) expose the regularization strength as `alpha`; try a set of values and select the `alpha` **that gives the lowest cross-validation or validation error.**
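
The two steps can be sketched by hand with NumPy. This is a simplified illustration on synthetic data, not the scikit-learn implementation: the helpers `ridge_fit` and `cv_error`, the candidate grid and the data are all assumptions, and ridge is used because it has a closed-form fit:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, alpha, n_folds=5):
    """Mean squared validation error of ridge, averaged over K folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False          # hold this fold out
        beta = ridge_fit(X[train_mask], y[train_mask], alpha)
        residual = y[val_idx] - X[val_idx] @ beta
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

# Synthetic data: only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=1.0, size=60)

# Grid search: score every candidate and keep the one with the lowest CV error.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_error(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha, score in scores.items():
    print(f"alpha = {alpha:7.2f} -> CV MSE = {score:.4f}")
print("selected alpha:", best_alpha)
```

In practice, scikit-learn's `GridSearchCV` and `RandomizedSearchCV` automate the same loop.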

## What’s the effect of L2 regularization on the weights of a linear model? ‍
L2 regularization penalizes **larger weights** more severely (due to the squared penalty term), which encourages weight values to decay toward zero.
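
One way to see the decay, assuming the model is trained by gradient descent with learning rate $\eta$ on a loss $L(w)$ plus the penalty $\lambda \|w\|_2^2$ (both symbols are introduced here for illustration):

$$
w \leftarrow w - \eta\left(\nabla_w L(w) + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\,\nabla_w L(w)
$$

Each step multiplies the weights by a factor $(1 - 2\eta\lambda)$ smaller than one before applying the usual gradient update. The pull toward zero is proportional to the weight itself, so larger weights are shrunk more.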

## What’s the difference between L2 and L1 regularization? ‍
1. Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.  
<br>
2. Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not.  
<br>

3. Computational efficiency: L2 has an analytical solution, while L1 does not. **(The ridge estimate always exists for $\lambda > 0$.)**
<br>

4. Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.

---
Supporting material about 3, 4:  
[Addressing Multicollinearity](https://www.youtube.com/watch?v=5NVcmLZCGOg&ab_channel=ChrisMack)  
[Lasso & Ridge further explanation](https://towardsdatascience.com/different-forms-of-regularization-and-their-effects-6a714f156521)  
[Why L2 useful for collinearity?](https://stats.stackexchange.com/questions/395145/why-is-l2-regression-good-for-handling-multicollinearity)  

----

## Can we have both L1 and L2 regularization components in a linear model? ‍

Yes, elastic net regularization combines L1 and L2 regularization.

## What’s the interpretation of the bias term in linear models? ‍

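**The bias term (intercept) is the value the model predicts when all features are zero.** It shifts the fitted line or hyperplane up or down, so the model does not have to pass through the origin; when the features are centered, it equals the mean of the target. Don't confuse it with statistical bias, which is the difference between the average prediction and the true value, i.e. true value minus mean(predictions), or with accuracy, which is a metric for classification.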

## If a weight for one variable is higher than for another  —  can we say that this variable is more important?

Yes - if your predictor variables are normalized.

Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation's GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others.
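
A small sketch of this effect on made-up data (the arrays and the helper `fit` are illustrative assumptions): expressing one predictor in units that are 1000 times larger divides its weight by 1000, although the model's predictions are unchanged.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients without an intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical predictors and a target that is exactly 2*x1 + 3*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2.0 * x1 + 3.0 * x2

X_original = np.column_stack([x1, x2])
X_rescaled = np.column_stack([x1 * 1000.0, x2])  # same x1, different units

coef_original = fit(X_original, y)
coef_rescaled = fit(X_rescaled, y)

print("weights, original units:", coef_original)
print("weights, x1 rescaled:   ", coef_rescaled)

# The predictions are identical, so x1 is exactly as useful as before.
same_predictions = np.allclose(X_original @ coef_original,
                               X_rescaled @ coef_rescaled)
print("same predictions:", same_predictions)
```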


# Feature selection

## What is feature selection? Why do we need it? 

Feature selection is the process of choosing the relevant features for the model to train on. We need it to remove irrelevant features, which can cause the model to underperform.

## Which feature selection techniques do you know? ‍

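Here are some feature selection techniques:

* Principal Component Analysis (PCA), which strictly speaking is feature extraction: it builds new features rather than selecting a subset of the original ones
* Neighborhood Component Analysis
* ReliefF Algorithm

(The last two still need to be looked into.)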

## Can we use L1 regularization for feature selection?
Yes. L1 regularization produces sparse solutions, where some coefficients are exactly zero. Feature selection can then be done by keeping only the features with non-zero coefficients.
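
A minimal sketch of this idea, using a hand-written cyclic coordinate descent for the objective $\|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|$. The data, the value of $\lambda$ and the helper names are assumptions for illustration, and a fixed number of sweeps stands in for a proper convergence check:

```python
import numpy as np

def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return 0 if it is within threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize ||y - X b||^2 + lam * sum(|b_j|) by cyclic coordinate descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

# Hypothetical data: one strong feature and two weak ones.
X = np.array([[1.0,  1.0,  1.0],
              [-1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0],
              [-1.0, -1.0, 1.0]])
y = X @ np.array([5.0, 0.5, -0.2])

beta = lasso_coordinate_descent(X, y, lam=6.0)
selected = np.flatnonzero(beta)  # indices of features with non-zero weight

print("lasso coefficients:", beta)
print("selected features:", selected)
```

With this $\lambda$, only the strong feature keeps a non-zero coefficient; the two weak ones are set to exactly zero and would be dropped.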


## Can we use L2 regularization for feature selection?
Not really. L2 penalizes the squared values of the regression coefficients, so the coefficients shrink toward zero but generally do not become exactly zero, and no feature is dropped.
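
For a single feature without an intercept, and assuming the objectives $\sum_i (y_i - \beta x_i)^2 + \lambda\beta^2$ (ridge) and $\sum_i (y_i - \beta x_i)^2 + \lambda|\beta|$ (lasso), the closed-form solutions make the difference visible:

$$
\hat\beta_{\text{ridge}} = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \lambda},
\qquad
\hat\beta_{\text{lasso}} = \frac{\operatorname{sign}\left(\sum_i x_i y_i\right)\,\max\left(\left|\sum_i x_i y_i\right| - \lambda/2,\; 0\right)}{\sum_i x_i^2}
$$

The ridge estimate is non-zero for every finite $\lambda$ as long as $\sum_i x_i y_i \neq 0$, whereas the lasso estimate is exactly zero once $\lambda \ge 2\left|\sum_i x_i y_i\right|$.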
