# Regularization

Regularization is a technique that alleviates overfitting by adding a penalty term to the loss function.

### 1. Ridge regression ($L_2$ regularization)

Perhaps the most common form of regularization is known as *ridge regression* or $L_2$ *regularization*, sometimes also called *Tikhonov regularization*. This proceeds by penalizing the sum of squares (the squared 2-norm) of the model coefficients; in this case, the penalized cost function is

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n \theta_j^2$$

where $\lambda$ is a free parameter that controls the strength of the penalty (Scikit-Learn calls it ``alpha``).
This type of penalized model is built into Scikit-Learn with the ``Ridge`` estimator:

In [None]:
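import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

params_Ridge = {'polynomialfeatures__degree': np.arange(1, 10),
                'ridge__alpha': np.logspace(-1, -4, 10)}

# Scale the features, since x has a wide range and the solver otherwise warns.
# StandardScaler replaces Ridge(normalize=True), removed in scikit-learn 1.2;
# the scaling is similar but not identical.
model = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())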
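
To make the role of the penalty strength concrete, here is a minimal sketch that fits ``Ridge`` with increasing ``alpha`` and prints the size of the coefficient vector. The data, `true_coef`, and the chosen `alphas` are hypothetical and only serve the illustration; they are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 50 samples, 5 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=50)

# Fit ridge models with increasing penalty strength and record the
# Euclidean norm of each fitted coefficient vector.
alphas = [0.01, 1.0, 100.0]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>6}: ||theta||_2 = {norm:.3f}")
```

The norm should shrink as ``alpha`` grows, which is the sense in which ridge pulls all coefficients toward zero without removing any of them.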

### 2. Lasso regression ($L_1$ regularization)

Another very common type of regularization is known as lasso, and involves penalizing the sum of absolute values (1-norms) of regression coefficients:

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2+ \lambda\sum_{j=1}^n |\theta_j|$$

Though this is conceptually very similar to ridge regression, the results can differ surprisingly. For geometric reasons, lasso regression tends to favor *sparse models* where possible; that is, it preferentially sets model coefficients to exactly zero.

We can set this up by duplicating the ridge regression pipeline, but using $L_1$-penalized coefficients:

In [None]:
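from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

params_Lasso = {'polynomialfeatures__degree': np.arange(1, 10),
                'lasso__alpha': np.logspace(-1, -4, 10)}

# Loosen tol (default 1e-4) so coordinate descent stops sooner.
# StandardScaler replaces Lasso(normalize=True), removed in scikit-learn 1.2;
# the scaling is similar but not identical.
model = make_pipeline(PolynomialFeatures(), StandardScaler(),
                      Lasso(tol=0.01))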
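
The sparsity claim can be checked directly. The sketch below fits ``Lasso`` and ``Ridge`` on hypothetical synthetic data in which only 3 of 20 features matter, then counts coefficients that are exactly zero. The data and the value ``alpha=0.5`` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical synthetic data: only the first 3 of 20 features affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [4.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("coefficients exactly zero: lasso =", n_zero_lasso, ", ridge =", n_zero_ridge)
```

Lasso should zero out most of the irrelevant features, while ridge only shrinks them to small nonzero values.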

### 3. Elastic net 

Elastic net is linear regression with a combined $L_1$ and $L_2$ regularizer:

$$
J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \alpha \sum_{j=1}^n |\theta_j| + (1 - \alpha) \sum_{j=1}^n \theta_j^2
$$

where $\alpha \in [0, 1]$ sets the balance between the two penalties.
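
For comparison, the Scikit-Learn documentation states the ``ElasticNet`` objective in a slightly different parameterization. Writing $w$ for the coefficients, $\alpha$ for the ``alpha`` argument and $\rho$ for ``l1_ratio`` (the symbol $\rho$ is introduced here only for brevity), it reads:

$$
\min_w \; \frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Note that in this form ``alpha`` is the overall penalty strength, while ``l1_ratio`` plays the mixing role that $\alpha$ plays in the formula above.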



In [None]:
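from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# StandardScaler replaces ElasticNet(normalize=True), removed in scikit-learn 1.2.
# tol is left at its default here; loosening it would speed up the search.
model = make_pipeline(PolynomialFeatures(), StandardScaler(),
                      ElasticNet())

# Scikit-Learn uses two parameters: alpha (overall penalty strength) and
# l1_ratio (share of the penalty given to the L1 term); see the documentation
# for the complete objective.
params_Elasticnet = {'polynomialfeatures__degree': np.arange(1, 10),
                     'elasticnet__alpha': np.logspace(-1, -4, 10),
                     'elasticnet__l1_ratio': np.linspace(0, 1, 5)}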

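The parameter dictionaries in this section use the `step__parameter` naming that ``make_pipeline`` generates, which is the format a grid search expects. As a minimal sketch, assuming they are meant for ``GridSearchCV``, the example below runs a search on hypothetical synthetic data. The data, the smaller `param_grid`, and `max_iter=10_000` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical 1-D data following a noisy quadratic.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=80)

model = make_pipeline(PolynomialFeatures(), StandardScaler(),
                      ElasticNet(max_iter=10_000))

# Keys follow make_pipeline's naming: lowercased class name, two
# underscores, then the parameter name. The grid is kept small on purpose.
param_grid = {'polynomialfeatures__degree': np.arange(1, 4),
              'elasticnet__alpha': np.logspace(-1, -3, 3),
              'elasticnet__l1_ratio': [0.2, 0.5, 0.8]}

# 5-fold cross-validated search over every combination in the grid.
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Each extra hyperparameter multiplies the number of fits, which is why the elastic net search is the slowest of the three.
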
### Ridge, Lasso, or Elastic net?

Regularization should almost always be used, since these techniques reduce overfitting.

Choosing among them is a little harder. It helps to understand the assumption behind each.
1. Ridge assumes that the coefficients are normally distributed. **Thus, if you don't want any feature to dominate too much, use Ridge.**
2. Lasso assumes that the coefficients are Laplace distributed (in layman's terms, some predictors are very useful while others are completely irrelevant). Lasso can shrink coefficients to exactly zero and so eliminates predictors that do not help the output, which amounts to automatic feature selection. **In simple words, if only a few predictors have a medium or large effect, use Lasso.**
3. Elastic net is a compromise between the two. It has an extra mixing parameter to tune, so it takes more computation time to reach that compromise. **If you have the resources to spare, you can use Elastic net.**


### 4. ElasticNet + Stochastic Gradient Descent

Scikit-Learn also provides linear regression with an $L_1$, $L_2$, or elastic net penalty trained by stochastic gradient descent, in an estimator called <code>SGDRegressor()</code>.

In [None]:
from sklearn.linear_model import SGDRegressor

model = make_pipeline(PolynomialFeatures(),
                      SGDRegressor())

# l1_ratio only has an effect when penalty is 'elasticnet'
params_SGD = {'polynomialfeatures__degree': np.arange(1, 10),
              'sgdregressor__alpha': np.logspace(-1, -4, 10),
              'sgdregressor__penalty': ['l2', 'l1', 'elasticnet'],
              'sgdregressor__l1_ratio': np.linspace(0, 1, 5),
              'sgdregressor__learning_rate': ['constant', 'optimal',
                                              'invscaling', 'adaptive']}
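
Stochastic gradient descent is sensitive to feature scale, so in practice a scaler is usually placed before ``SGDRegressor``. The sketch below shows one such fit with an elastic net penalty. The synthetic data and every hyperparameter value are hypothetical choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=200)

# Standardize first, then fit an elastic-net-penalized linear model by SGD.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty='elasticnet', alpha=1e-4, l1_ratio=0.5,
                 max_iter=2000, tol=1e-4, random_state=0),
)
sgd.fit(X, y)

r2 = sgd.score(X, y)
print(f"training R^2: {r2:.3f}")
```

The same `sgdregressor__...` keys used in `params_SGD` would still address the final step if a scaler is added to the pipeline.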

### Many more....

There are too many other linear models to mention. A good place to start reading is https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. The Scikit-Learn documentation usually gives good guidance on when to use which algorithm.