# Regularization

* Penalizes complex models to prevent overfitting

https://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/

In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process used to solve ill-posed problems or to prevent overfitting. Regularization can be applied to objective functions in ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.

https://en.wikipedia.org/wiki/Regularization_(mathematics)
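
As a rough sketch, the penalized objective described above can be written as a loss term plus a weighted penalty. The symbols $J$ and $R$ are assumed notation for this note (not from the sources above), the loss shown is the mean squared error, and the exact scaling of the loss term varies by library:

$$
J(w) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha\, R(w),
\qquad
R(w) = \sum_i |w_i| \ \text{(l1)} \quad \text{or} \quad R(w) = \sum_i w_i^2 \ \text{(l2)}
$$

Here $\alpha \ge 0$ controls the strength of the penalty: $\alpha = 0$ recovers the unpenalized loss, and larger values push the weights $w_i$ harder toward zero.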

## Cost or Loss Function

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.

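In linear regression we often use the mean squared error (MSE) as the cost function: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, where $y_i$ is an observed value, $\hat{y}_i$ is the model's prediction, and $n$ is the number of observations.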

https://en.wikipedia.org/wiki/Loss_function
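
To make the MSE formula concrete, here is a minimal sketch that computes it by hand and compares it with scikit-learn's `mean_squared_error`. The arrays `y_obs` and `y_hat` are made-up values for illustration only; they are not taken from the Auto data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed values and predictions, made up for illustration.
y_obs = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE by hand: the average of the squared residuals.
mse_manual = np.sum((y_obs - y_hat) ** 2) / len(y_obs)

# The same quantity computed by scikit-learn.
mse_sklearn = mean_squared_error(y_obs, y_hat)

print(mse_manual, mse_sklearn)  # both are 0.875
```

The squared residuals are 0.25, 0, 2.25, and 1, so the sum is 3.5 and the mean over four observations is 0.875.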

In [None]:
# get data and train test split
import numpy as np
import pandas as pd

auto = pd.read_csv('https://raw.githubusercontent.com/gitmystuff/Datasets/main/Auto.csv', usecols=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year'])
auto = auto[(auto != '?').all(axis=1)]
auto['horsepower'] = auto['horsepower'].astype(np.int64)

# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    auto.drop('mpg', axis=1),
    auto['mpg'],
    test_size=0.25,
    random_state=42)

In [None]:
# remind us of the OLS coefficients
import statsmodels.api as sm

# add the constant
# X_train = sm.add_constant(X_train)
X_train.insert(0, 'const', 1)
model = sm.OLS(y_train, X_train[['const', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']]).fit()
model.params[1:]

cylinders      -0.160143
displacement    0.000373
horsepower     -0.001899
weight         -0.006457
acceleration    0.057588
year            0.762270
dtype: float64

In [None]:
# lasso example
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

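# standardize so the l1 penalty treats every feature on the same scale
scaler = StandardScaler()
X = scaler.fit_transform(X_train)
y = y_train
# X_train already holds the 'const' column added above, so take names from it
# (auto.columns would mislabel the first coefficient as 'mpg')
names = X_train.columns

lasso = Lasso(alpha=5)
lasso.fit(X, y)

d = {'Feature': names, 'Coeff': lasso.coef_}
lasso_df = pd.DataFrame(d)
print(lasso_df[1:])  # skip row 0, the constant column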

        Feature     Coeff
1     cylinders -0.000000
2  displacement -0.000000
3    horsepower -0.000000
4        weight -1.670802
5  acceleration  0.000000
6          year  0.000000


## Lasso / l1 Regularization

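* Adds the penalty $\alpha \sum|w_i|$ to the loss, where $\alpha$ sets the regularization strength
* Forces weak features to have coefficients of exactly zero
* Performs feature selection as a result
* Models can be unstable (with correlated features, coefficients can fluctuate significantly when the data changes)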
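
The feature-selection behavior listed above can be sketched by counting how many coefficients survive as `alpha` grows. This example uses synthetic data from `make_regression` rather than the Auto data, and the names `X_std`, `y_syn`, and `nonzero_counts` as well as the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration):
# 10 features, of which only 3 actually influence the target.
X_raw, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=10.0, random_state=0)
X_std = StandardScaler().fit_transform(X_raw)

# Count the nonzero coefficients at increasing penalty strengths.
nonzero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    model = Lasso(alpha=alpha).fit(X_std, y_syn)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:>7}: {nonzero_counts[alpha]} nonzero coefficients")
```

With a very small `alpha` the fit stays close to ordinary least squares, while a sufficiently large `alpha` drives every coefficient to exactly zero. This is consistent with the `alpha=5` run above, where only `weight` kept a nonzero coefficient.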

In [None]:
# ridge example
from sklearn.linear_model import Ridge

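# note: unlike the lasso example, the features are not standardized here,
# so these coefficients stay on the original (OLS) scale
X = X_train
y = y_train

ridge = Ridge(alpha=10)
ridge.fit(X, y)

d = {'Feature': X.columns, 'Coeff': ridge.coef_}
ridge_df = pd.DataFrame(d)
print(ridge_df[1:])  # skip row 0, the constant column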

        Feature     Coeff
1     cylinders -0.141583
2  displacement  0.000050
3    horsepower -0.002074
4        weight -0.006451
5  acceleration  0.056597
6          year  0.759790


## Ridge / l2 Regularization

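* Adds the penalty $\alpha \sum w_i^2$ to the loss, where $\alpha$ sets the regularization strength
* Shrinks coefficients toward zero without setting them exactly to zero, spreading weight more evenly across features
* Correlated features tend to receive similar coefficients, which helps expose them
* Models are more stable (with correlated features, coefficients don't fluctuate as much when the data changes)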
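
The point about correlated features can be illustrated with a small sketch on synthetic data. Two nearly identical features are built by hand (`x1`, `x2`), each with a true coefficient of 1. All names, the noise levels, and `alpha=10` are assumptions for illustration and are not part of the Auto example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost identical features (assumed synthetic data):
# x2 is x1 plus a tiny amount of noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X_corr = np.column_stack([x1, x2])

# The true relationship gives each feature a coefficient of 1.
y_corr = x1 + x2 + rng.normal(size=n)

ols = LinearRegression().fit(X_corr, y_corr)
ridge_demo = Ridge(alpha=10).fit(X_corr, y_corr)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

Because the two columns carry almost the same information, unpenalized least squares has little to pin down how the weight is split between them, so its two coefficients can land far from 1. The l2 penalty favors splitting the weight evenly, so the ridge coefficients should come out close to each other.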

## ElasticNet