# Regularized Linear Models

As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the
model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be
for it to overfit the data. For example, a simple way to regularize a polynomial model
is to reduce its polynomial degree.

For a linear model, regularization is typically achieved by constraining the weights of
the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net,
which implement three different ways to constrain the weights.

# Ridge Regression or Tikhonov regularization

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear
Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function.
This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible.
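
The shrinking effect is easiest to see with a single feature, where the regularized problem can be solved by hand. The sketch below is a minimal illustration in plain Python, not Scikit-Learn's implementation. It assumes the cost is the sum of squared errors plus α times the squared weight, leaves the intercept unpenalized, and uses made-up toy data; `ridge_fit_1d` is a hypothetical helper name.

```python
def ridge_fit_1d(xs, ys, alpha):
    """Fit y ~ b + w*x by minimizing sum((y - b - w*x)**2) + alpha * w**2.

    The intercept b is not penalized, so centering the data gives a
    one-line closed form for the slope w.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + alpha)  # alpha = 0 recovers ordinary least squares
    b = y_mean - w * x_mean
    return b, w


# Hypothetical toy data, roughly y = 1 + 2x plus a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

for alpha in (0.0, 1.0, 10.0, 100.0):
    b, w = ridge_fit_1d(xs, ys, alpha)
    print(f"alpha={alpha:>5}: intercept={b:.3f}, slope={w:.3f}")
```

With `alpha=0` the slope is the ordinary least-squares slope. As `alpha` grows, the slope shrinks toward zero and the intercept moves toward the mean of `ys`, but the slope never becomes exactly zero.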

Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution
(a variant of Equation 4-9 using a matrix factorization technique by André-Louis
Cholesky):


In [2]:
#from sklearn.linear_model import Ridge
#ridge_reg = Ridge(alpha=1, solver="cholesky")
#ridge_reg.fit(X, y)
#ridge_reg.predict([[1.5]])
#      array([[1.55071465]])

And using Stochastic Gradient Descent:

In [3]:
#from sklearn.linear_model import SGDRegressor
#sgd_reg = SGDRegressor(penalty="l2")
#sgd_reg.fit(X, y.ravel())
#sgd_reg.predict([[1.5]])
#     array([1.47012588])

The penalty hyperparameter sets the type of regularization term to use. Specifying
"l2" indicates that you want SGD to add a regularization term to the cost function
equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge
Regression.
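
What the "l2" penalty does during training can be sketched as a single update step. This is an illustrative sketch, not SGDRegressor's actual code. It assumes a loss of half the squared residual on one instance, an unregularized bias, and hypothetical names: `l2_penalty`, `sgd_step`, `alpha` (penalty strength), and `eta` (learning rate).

```python
def l2_penalty(weights, alpha):
    """Half the squared l2 norm of the weight vector, scaled by alpha."""
    return 0.5 * alpha * sum(w * w for w in weights)


def sgd_step(weights, bias, x, y, alpha, eta):
    """One SGD update on a single instance (x, y) with an l2 penalty."""
    error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
    # Gradient of 0.5 * error**2 w.r.t. w_i is error * x_i;
    # gradient of the penalty w.r.t. w_i is alpha * w_i.
    new_weights = [w - eta * (error * xi + alpha * w)
                   for w, xi in zip(weights, x)]
    new_bias = bias - eta * error  # the bias term is not regularized here
    return new_weights, new_bias


weights, bias = [2.0, -3.0], 0.5
# An instance the model already predicts perfectly (error is 0),
# so the only thing acting on the weights is the penalty.
new_weights, new_bias = sgd_step(weights, bias, x=[0.0, 0.0], y=0.5,
                                 alpha=0.5, eta=0.1)
print(l2_penalty(weights, alpha=0.5))
print(new_weights, new_bias)
```

Even with zero prediction error, each weight is multiplied by `1 - eta * alpha` on every step, which is why this kind of penalty keeps pulling the weights toward zero.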

# Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso
Regression) is another regularized version of Linear Regression: just like Ridge
Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm
of the weight vector instead of half the square of the ℓ2 norm.

An important characteristic of Lasso Regression is that it tends to completely eliminate
the weights of the least important features (i.e., set them to zero).

In other words, Lasso Regression automatically performs feature selection and outputs a
sparse model (i.e., with few nonzero feature weights).
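
The reason weights land on exactly zero can be seen in the one-feature case, where the ℓ1-penalized problem is solved by a soft-thresholding rule. The sketch below is a hedged illustration, not Scikit-Learn's solver. It assumes centered data, no intercept, and the cost `(1 / (2n)) * sum((y - w*x)**2) + alpha * abs(w)`; the function names and toy data are hypothetical. A ridge-style weight under the matching ℓ2 penalty is shown for comparison.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything in [-alpha, alpha] becomes 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_weight(xs, ys, alpha):
    """Minimizer of (1/(2n)) * sum((y - w*x)**2) + alpha * |w|."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    z = sum(x * x for x in xs) / n
    return soft_threshold(rho, alpha) / z


def ridge_weight(xs, ys, alpha):
    """Minimizer of (1/(2n)) * sum((y - w*x)**2) + (alpha/2) * w**2."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    z = sum(x * x for x in xs) / n
    return rho / (z + alpha)


# Hypothetical centered toy data: one target depends strongly on x, one barely.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys_strong = [-4.1, -1.9, 0.1, 2.0, 3.9]
ys_weak = [0.1, -0.2, 0.0, 0.1, 0.0]

for name, ys in (("strong", ys_strong), ("weak", ys_weak)):
    print(name,
          "lasso:", lasso_weight(xs, ys, alpha=0.1),
          "ridge:", ridge_weight(xs, ys, alpha=0.1))
```

For the weakly related target, the Lasso weight is exactly `0.0`, while the ridge-style weight is small but nonzero.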

Here is a small Scikit-Learn example using the Lasso class. Note that you could
instead use an SGDRegressor(penalty="l1").

In [5]:
#from sklearn.linear_model import Lasso
#lasso_reg = Lasso(alpha=0.1)
#lasso_reg.fit(X, y)
#lasso_reg.predict([[1.5]])
#       array([1.53788174])

# Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The
regularization term is a simple mix of both Ridge and Lasso’s regularization terms,
and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge
Regression, and when r = 1, it is equivalent to Lasso Regression.
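
The mix can be written out as a small function. This is a sketch of the penalty term only, assuming the Ridge part is half the squared ℓ2 norm (as in the Ridge discussion above) and the Lasso part is the ℓ1 norm, both scaled by α; `elastic_net_penalty` is a hypothetical helper name.

```python
def elastic_net_penalty(weights, alpha, r):
    """Mix of l1 and l2 penalties controlled by the ratio r in [0, 1]."""
    l1 = sum(abs(w) for w in weights)             # Lasso part
    half_l2 = 0.5 * sum(w * w for w in weights)   # Ridge part
    return r * alpha * l1 + (1.0 - r) * alpha * half_l2


weights = [0.5, -2.0, 0.0]
for r in (0.0, 0.5, 1.0):
    print(f"r={r}: penalty={elastic_net_penalty(weights, alpha=0.1, r=r):.5f}")
```

At `r=0` only the Ridge part remains, at `r=1` only the Lasso part, and values in between blend the two linearly.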

So when should you use plain Linear Regression (i.e., without any regularization),
Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of
regularization, so generally you should avoid plain Linear Regression. Ridge is a good
default, but if you suspect that only a few features are actually useful, you should prefer
Lasso or Elastic Net since they tend to reduce the useless features’ weights down to
zero as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso
may behave erratically when the number of features is greater than the number of
training instances or when several features are strongly correlated.

Here is a short example using Scikit-Learn’s ElasticNet (l1_ratio corresponds to
the mix ratio r):

In [6]:
#from sklearn.linear_model import ElasticNet
#elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
#elastic_net.fit(X, y)
#elastic_net.predict([[1.5]])
#      array([1.54333232])

# Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient
Descent is to stop training as soon as the validation error reaches a minimum. This is
called early stopping. Figure 4-20 shows a complex model (in this case a high-degree
Polynomial Regression model) being trained using Batch Gradient Descent. As the
epochs go by, the algorithm learns and its prediction error (RMSE) on the training set
naturally goes down, and so does its prediction error on the validation set. However,
after a while the validation error stops decreasing and actually starts to go back up.
This indicates that the model has started to overfit the training data. With early stopping
you just stop training as soon as the validation error reaches the minimum. It is
such a simple and efficient regularization technique that Geoffrey Hinton called it a
“beautiful free lunch.”
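
The training loop behind early stopping can be sketched generically. The version below is a minimal sketch under stated assumptions: `train_step` and `val_error` are hypothetical callables standing in for one epoch of training and for the validation RMSE, and the `patience` parameter (how many non-improving epochs to tolerate before stopping) is an addition for curves that are not perfectly smooth. The toy callables at the bottom only imitate a validation curve that falls and then rises; they do not train a real model.

```python
import copy


def train_with_early_stopping(params, train_step, val_error,
                              max_epochs=1000, patience=5):
    """Train until the validation error stops improving; return the best model."""
    best_error = float("inf")
    best_params = copy.deepcopy(params)
    best_epoch = -1
    for epoch in range(max_epochs):
        params = train_step(params)   # one epoch of training
        error = val_error(params)     # error on the validation set
        if error < best_error:
            # New minimum: remember this model so we can roll back to it.
            best_error = error
            best_params = copy.deepcopy(params)
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break                     # no improvement for `patience` epochs
    return best_params, best_epoch, best_error


# Toy stand-ins: the "model" is a counter, and the validation error
# is a bowl-shaped curve with its minimum at 30.
def fake_train_step(t):
    return t + 1


def fake_val_error(t):
    return (t - 30) ** 2 / 100 + 1.0


best_params, best_epoch, best_error = train_with_early_stopping(
    0, fake_train_step, fake_val_error)
print(best_params, best_epoch, best_error)
```

The function returns the parameters saved at the validation minimum rather than the ones in use when the loop exits, which is the rollback that makes early stopping act as a regularizer.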

# Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers (as discussed in
Chapter 3). This is called Softmax Regression, or Multinomial Logistic Regression.
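
The name comes from the softmax function, which turns one score per class into a set of probabilities that sum to 1. The sketch below is a minimal illustration assuming the per-class scores have already been computed by the model; `softmax` and `predict_class` are hypothetical helper names, and the scores are made up.

```python
import math


def softmax(scores):
    """Convert a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


scores = [1.0, 3.0, 0.5]  # hypothetical scores for three classes
print(softmax(scores))
print(predict_class(scores))
```

With two classes and one score fixed at zero, the softmax probability of the other class reduces to the logistic function, which is the sense in which this generalizes Logistic Regression.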