# Regularization
Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. You penalize your loss function by adding a multiple of the L1 norm (LASSO[2]) or the squared L2 norm (Ridge[3]) of your weight vector w (the vector of learned parameters in your linear regression). Regularization artificially discourages complex or extreme explanations of the world even if they fit what has been observed better. The idea is that such explanations are unlikely to generalize well to the future; they may happen to explain a few data points from the past well, but this may just be because of accidents of the sample.
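
As a rough illustration of the penalized loss described above, the following sketch computes the residual sum of squares plus a multiple of the penalty. It assumes NumPy is available; the function name `penalized_loss` and the tiny data set are made up for this example and are not part of any library.

```python
import numpy as np

def penalized_loss(X, y, w, lam, penalty="l2"):
    """Residual sum of squares plus lam times an L1 or squared-L2 penalty on w."""
    residual = y - X @ w
    rss = float(residual @ residual)
    if penalty == "l1":                      # LASSO: sum of absolute weights
        return rss + lam * float(np.sum(np.abs(w)))
    if penalty == "l2":                      # Ridge: sum of squared weights
        return rss + lam * float(np.sum(w ** 2))
    raise ValueError("penalty must be 'l1' or 'l2'")

# Toy data (illustrative only): w reproduces y exactly, so the RSS is 0
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])

loss_l1 = penalized_loss(X, y, w, lam=0.5, penalty="l1")  # 0 + 0.5 * (1 + 2)
loss_l2 = penalized_loss(X, y, w, lam=0.5, penalty="l2")  # 0 + 0.5 * (1 + 4)
```

Even though `w` fits the data perfectly, both losses are positive: the penalty makes large weights costly regardless of fit.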


### Models Pros & Cons
----------------------------------------------------------------------------------------------------------------------
#### Ridge Regressions (L2)
- Big O Notation (Cost Function):

Pros: Works best when the least-squares estimate has high variance; handles high-dimensional data, but shrinks coefficients; a good choice if many predictors contribute to Y; handles multicollinearity; good for grouping (if some features are identical, their coefficients are the same).

Cons: No feature selection is performed, so every feature stays in the model.

#### LASSO Regressions (L1)
- Big O Notation (Cost Function):

Pros: Works best when the least-squares estimate has high variance; handles high-dimensional data extremely well, but shrinks coefficients; a good choice if only a few features are actually useful; handles multicollinearity; performs feature selection, so it is good at eliminating trivial features.

Cons: If a group of features is highly correlated, it picks one and shrinks the others to 0.

#### Elastic Net
- Big O Notation (Cost Function):

Pros: Combines the L1 and L2 penalties, so it is often preferred over using either L1 or L2 alone.

Cons:

#### Least-Angle Regression (LARS)
- Big O Notation (Cost Function):

Pros: 1. It is computationally just as fast as forward selection; 2. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model; 3. If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable; 4. It is easily modified to produce solutions for other estimators, like the lasso; 5. It is effective in contexts where p >> n (i.e., when the number of dimensions is significantly greater than the number of points).


Cons: 1. Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise; 2. Since almost all high-dimensional data in the real world will just by chance exhibit some fair degree of collinearity across at least some variables, the problem that LARS has with correlated variables may limit its application to high-dimensional data.






----------------------------------------------------------------------------------------------------------------------



## --------------------- Ridge Regressions (L2)

#### Wiki Definition: 
L2 regularization: the penalty term is $\lambda \sum_j \beta_j^2$.
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 
lambda = shrinkage parameter [infinite: null model <--> 0: least squares]
#### Cost Function: 
Add a penalty term to the RSS minimization: $\mathrm{RSS} + \lambda \sum_j \beta_j^2$
#### Process Flow: 
Standardize features ->
Shrink coefficients towards 0, but never exactly to 0 (shrink by a percentage)
#### Evaluation Methods: 

#### Tips: 


In [None]:
# ---------------------- R
library(glmnet)

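# Create x matrix and y vector (`data` is a data frame with a response column Y)
x <- model.matrix(Y ~ . , data)[, -1] # drop the intercept column
y <- data$Y

# Ridge Regression
grid <- 10^seq(10, -2, length = 100)
table.ridge <- glmnet(x, y, alpha = 0, lambda = grid) # alpha = 0 ridge, 1 lasso

# Inspect one lambda value in the grid (index 1-100)
table.ridge$lambda[3]
coef(table.ridge)[, 3]
sqrt(sum(coef(table.ridge)[-1, 3]^2)) # L2 norm of the coefficients, intercept excluded

# Get coefficients for a new lambda value
new.lambda <- 50 # placeholder: substitute the lambda value of interest
predict(table.ridge, s = new.lambda, type = "coefficients")[1:20, ] # first 20 coefficients

# Cross-validate to find the best lambda
set.seed(10)
train <- sample(1:nrow(x), nrow(x) / 2) # illustrative 50/50 split (assumption)
trainX <- x[train, ]
trainY <- y[train]
table.cv <- cv.glmnet(trainX, trainY, alpha = 0)
best.lam <- table.cv$lambda.min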

In [None]:
# ---------------------- Python
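
A rough Python counterpart to the R cell, written as a sketch under several assumptions: it uses only NumPy, solves ridge in closed form for the objective $\|y - X\beta\|^2 + \lambda\|\beta\|^2$ without an intercept, and runs on synthetic data. The helper names `standardize`, `ridge_fit`, and `cv_best_lambda` are invented for this example.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (no intercept; y is centered)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_best_lambda(X, y, grid, k=5, seed=10):
    """Return the lambda in `grid` with the lowest k-fold mean squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for lam in grid:
        fold_mse = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
            b = ridge_fit(X[train_idx], y[train_idx], lam)
            fold_mse.append(np.mean((y[test_idx] - X[test_idx] @ b) ** 2))
        errors.append(np.mean(fold_mse))
    return grid[int(np.argmin(errors))]

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(100, 5)))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)
y = y - y.mean()

grid = 10.0 ** np.linspace(10, -2, 100)   # same lambda grid as the R cell
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in grid]
best_lam = cv_best_lambda(X, y, grid)
```

Because the grid runs from large to small lambda, `norms` starts near 0 (the null model) and grows towards the least-squares norm, while individual coefficients stay nonzero throughout.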


## --------------------- LASSO Regressions (L1)

#### Wiki Definition: 
L1 regularization: the penalty term is $\lambda \sum_j |\beta_j|$.
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 
lambda = shrinkage parameter [infinite: null model <--> 0: least squares]
#### Cost Function: 
Add a penalty term to the RSS minimization: $\mathrm{RSS} + \lambda \sum_j |\beta_j|$
#### Process Flow: 
Standardize features ->
Shrink coefficients towards 0; some coefficients will be exactly 0 (shrink by a fixed amount)
#### Evaluation Methods: 

#### Tips: 


In [None]:
# ---------------------- R

library(glmnet)
# Reuses grid, trainX and trainY from the ridge cell
table.lasso <- glmnet(trainX, trainY, alpha = 1, lambda = grid) # alpha = 1 lasso

# Cross-validate to find the best lambda
set.seed(10)
table.cv <- cv.glmnet(trainX, trainY, alpha = 1)
best.lam <- table.cv$lambda.min


In [None]:
# ---------------------- Python
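
To make the "shrink by a percentage" versus "shrink by a fixed amount" contrast concrete, here is a small sketch for the special case of an orthonormal design (columns of the design matrix satisfy $Q^\top Q = I$), where both estimators have simple closed forms. It assumes NumPy, the objectives $\|y - Q\beta\|^2 + \lambda\|\beta\|^2$ and $\|y - Q\beta\|^2 + \lambda\|\beta\|_1$, and noise-free synthetic data; `soft_threshold` is a helper defined here, not a library call.

```python
import numpy as np

def soft_threshold(z, t):
    """Move each entry of z towards 0 by t, stopping at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Orthonormal design: Q.T @ Q is the identity matrix
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))
beta_true = np.array([4.0, -2.0, 0.5])
y = Q @ beta_true                       # noise-free, so OLS recovers beta_true

beta_ols = Q.T @ y
lam = 2.0
beta_ridge = beta_ols / (1.0 + lam)               # every coefficient scaled by 1/(1 + lam)
beta_lasso = soft_threshold(beta_ols, lam / 2.0)  # every coefficient moved by lam/2, clipped at 0
```

With these inputs, ridge divides every coefficient by 3 and leaves all of them nonzero, whereas the lasso subtracts 1 in magnitude from each and sets the smallest one (0.5) exactly to 0.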



## --------------------- Elastic Net

#### Wiki Definition: 
In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method. The L1 penalty used by the LASSO has several limitations.[1] For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also, if there is a group of highly correlated variables, then the LASSO tends to select one variable from the group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty ($\|\beta\|^2$), which when used alone is ridge regression (known also as Tikhonov regularization). 
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 
https://en.wikipedia.org/wiki/Elastic_net_regularization

The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case where $\lambda_1 = \lambda, \lambda_2 = 0$ or $\lambda_1 = 0, \lambda_2 = \lambda$. Meanwhile, the naive version of the elastic net method finds an estimator in a two-stage procedure: first, for each fixed $\lambda_2$, it finds the ridge regression coefficients, and then it does a LASSO-type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, the authors rescale the coefficients of the naive version of the elastic net by multiplying the estimated coefficients by $(1 + \lambda_2)$.
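
The special cases and the $(1 + \lambda_2)$ rescaling can be sketched with a plain coordinate-descent solver. This is an illustrative implementation, not the one used by glmnet or scikit-learn: it assumes NumPy, the objective $\|y - X\beta\|^2 + \lambda_2\|\beta\|^2 + \lambda_1\|\beta\|_1$, centered predictors with unit-norm columns, a fixed iteration count instead of a convergence check, and synthetic data. The function names are invented for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Move the scalar z towards 0 by t, stopping at 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=500):
    """Coordinate descent for ||y - X b||^2 + lam2 * ||b||^2 + lam1 * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j
            b[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return b

def elastic_net(X, y, lam1, lam2):
    """Rescale the naive estimate by (1 + lam2) to undo the double shrinkage."""
    return (1.0 + lam2) * naive_elastic_net(X, y, lam1, lam2)

# Synthetic data with centered, unit-norm columns (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=60)
y = y - y.mean()

b_ridge_case = naive_elastic_net(X, y, lam1=0.0, lam2=0.5)   # pure ridge
b_lasso_case = naive_elastic_net(X, y, lam1=2.0, lam2=0.0)   # pure lasso
b_enet = elastic_net(X, y, lam1=1.0, lam2=0.5)               # rescaled mixture
```

Setting `lam1=0` should reproduce the closed-form ridge solution, and a sufficiently large `lam1` drives every coefficient to exactly 0, which is one way to sanity-check a solver like this.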

#### Evaluation Methods: 

#### Tips: 



In [None]:
# --------------------- R
# glmnet ("Lasso and elastic-net regularized generalized linear models") is software implemented as an R source
# package.[8][9] It includes fast algorithms for estimation of generalized linear models with ℓ1 (the lasso),
# ℓ2 (ridge regression) and mixtures of the two penalties (the elastic net) using cyclical coordinate descent,
# computed along a regularization path.

In [None]:
# --------------------- Python
"""
scikit-learn includes linear regression, logistic regression and linear support vector machines with 
elastic net regularization.
"""

## --------------------- Least-Angle Regression (LARS)

#### Wiki Definition: 
In statistics, least-angle regression (LARS) is an algorithm for fitting linear regression models to high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. 
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.

The basic steps of the least-angle regression algorithm are:

1. Start with all coefficients $\beta_j$ equal to zero.

2. Find the predictor $x_j$ most correlated with $y$.

3. Increase the coefficient $\beta_j$ in the direction of the sign of its correlation with $y$. Take residuals $r = y - \hat{y}$ along the way. Stop when some other predictor $x_k$ has as much correlation with $r$ as $x_j$ has.

4. Increase ($\beta_j$, $\beta_k$) in their joint least squares direction, until some other predictor $x_m$ has as much correlation with the residual $r$.

5. Continue until all predictors are in the model.
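
The steps above can be sketched in NumPy as follows. This is a bare-bones version of plain LARS (no lasso modification, no handling of exact ties or degenerate cases), written under the assumptions that the columns of `X` are centered with unit norm, `y` is centered, and there are more observations than predictors. The function name `lars_path` is chosen for this example and is not a call into the lars package or scikit-learn.

```python
import numpy as np

def lars_path(X, y):
    """Coefficient vectors at each LARS breakpoint, starting from all zeros."""
    n, p = X.shape
    beta = np.zeros(p)
    fitted = np.zeros(n)
    active = []
    path = [beta.copy()]
    for _ in range(p):
        c = X.T @ (y - fitted)                    # correlations with the residual
        inactive = [j for j in range(p) if j not in active]
        j_new = max(inactive, key=lambda j: abs(c[j]))
        active.append(j_new)                      # most correlated predictor enters
        inactive.remove(j_new)
        C = np.max(np.abs(c))
        s = np.sign(c[active])
        XA = X[:, active] * s                     # flip signs so correlations are positive
        g = np.linalg.solve(XA.T @ XA, np.ones(len(active)))
        A = 1.0 / np.sqrt(g.sum())
        w = A * g
        u = XA @ w                                # equiangular (joint least squares) direction
        a = X.T @ u
        if inactive:
            # Smallest positive step at which an inactive predictor ties with the active set
            steps = [(C - c[j]) / (A - a[j]) for j in inactive]
            steps += [(C + c[j]) / (A + a[j]) for j in inactive]
            gamma = min(t for t in steps if t > 1e-12)
        else:
            gamma = C / A                         # last step: go all the way to least squares
        fitted = fitted + gamma * u
        beta[active] += gamma * s * w
        path.append(beta.copy())
    return np.array(path)

# Synthetic data with centered, unit-norm columns (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([5.0, -3.0, 0.0, 1.0]) + 0.1 * rng.normal(size=50)
y = y - y.mean()

path = lars_path(X, y)   # one row per breakpoint, one column per coefficient
```

Each row of `path` is one breakpoint of the piecewise linear solution path: the first row is all zeros, the second has a single nonzero coefficient, and the last row coincides with the ordinary least-squares fit once every predictor has entered.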

#### Evaluation Methods: 

#### Tips: 



In [None]:
# ---------------------- R
# Least-angle regression is implemented in R via the lars package.

In [None]:
# ---------------------- Python
"""
http://scikit-learn.org/stable/modules/linear_model.html#least-angle-regression
"""

----------------------------------------------------------------------------------------------------------------------

# Evaluation Methods