# Lab 11 - Regularization

The basic idea of regularization is to make the learner more "strict": it produces a more complex model only if doing so improves performance in a meaningful way. Both complexity and performance are taken into account in order to address the over-fitting problem. In the regularization framework, performance on the training set is not the only concern (optimizing it alone might lead to a model that generalizes poorly to real data outside the training set); complexity matters as well, and $\lambda$ is the 'knob' that controls this trade-off.

In general:

$h_{S}=\underset{h\in H}{argmin}\; L_S(h) + \lambda R(h)$

where $L_S(h)$ is the loss of the hypothesis $h$ on the training sample $S$ and $R(h)$ represents the complexity of the hypothesis.
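As a minimal sketch of this objective for a linear hypothesis, the snippet below evaluates $L_S(h) + \lambda R(h)$ with a squared loss and a squared $\ell_2$ penalty. The function name `regularized_objective`, the synthetic data, and the choice of loss and penalty are assumptions made for illustration; they are not part of the lab code.

```python
import numpy as np

def regularized_objective(w, X, y, lam, penalty):
    """Empirical squared loss L_S(w) plus lam * R(w) (illustrative helper)."""
    empirical_loss = np.mean((X @ w - y) ** 2)
    return empirical_loss + lam * penalty(w)

def l2_penalty(w):
    """Squared l2 norm, one possible choice of R."""
    return np.sum(w ** 2)

# Synthetic, noise-free data (an assumption made for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_exact = np.array([2.0, 0.0, -1.0])   # fits the data perfectly, but has norm > 0
w_zero = np.zeros(3)                   # trivial hypothesis, norm 0
y = X @ w_exact

for lam in (0.0, 0.1, 10.0):
    exact_obj = regularized_objective(w_exact, X, y, lam, l2_penalty)
    zero_obj = regularized_objective(w_zero, X, y, lam, l2_penalty)
    print(f"lambda={lam:5.1f}  exact fit: {exact_obj:.3f}  trivial: {zero_obj:.3f}")
```

With $\lambda = 0$ only the fit matters, so the exact hypothesis wins. As $\lambda$ grows, its objective grows linearly with its norm, while the trivial hypothesis pays no penalty at all.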

In [28]:
import sys
sys.path.append("../")
from utils import *

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge

In [30]:
X = pd.read_csv('../data/mtcars.csv').drop(columns=['model']).dropna()
# X = X.drop(columns=['cyl', 'disp', 'qsec', 'drat', 'gear', 'hp', 'am']) 

# px.imshow(X.corr()).show()

np.random.seed(0)
target = 'mpg'
train, test = train_test_split(X, test_size=0.4)
X_train, y_train, X_test, y_test = train.loc[:, train.columns != target], train[target], test.loc[:, test.columns != target], test[target]
print("After splitting, we got %d train samples and %d test samples" % (len(train), len(test)))

After splitting, we got 19 train samples and 13 test samples


In [31]:
lambdas = 10**np.linspace(-3, 2, 100)

lasso_coef_df = pd.DataFrame([], columns=list(X_train.columns), index=lambdas)
ridge_coef_df = pd.DataFrame([], columns=list(X_train.columns), index=lambdas)
losses = pd.DataFrame([], columns=['lasso_mse', 'lasso_reg', 'lasso_loss', 
                                 'ridge_mse', 'ridge_reg', 'ridge_loss'], index=lambdas)

for lam in lambdas:
    lasso_lam = Lasso(alpha=lam, normalize=True, max_iter=10000, tol=1e-4).fit(X_train, y_train)
    ridge_lam = Ridge(alpha=lam, normalize=True).fit(X_train, y_train)
    
    lasso_coef = lasso_lam.coef_
    ridge_coef = ridge_lam.coef_

    lasso_coef_df.loc[lam, :] = lasso_coef
    ridge_coef_df.loc[lam, :] = ridge_coef
    
    losses.loc[lam, 'lasso_mse'] = mean_squared_error(y_test, lasso_lam.predict(X_test))
    losses.loc[lam, 'lasso_reg'] = lam*np.linalg.norm(lasso_coef, ord=1)
    losses.loc[lam, 'lasso_loss'] = losses.loc[lam, 'lasso_mse'] + losses.loc[lam, 'lasso_reg']
    
    losses.loc[lam, 'ridge_mse'] = mean_squared_error(y_test, ridge_lam.predict(X_test))
    # Ridge penalizes the squared l2 norm of its own coefficients
    losses.loc[lam, 'ridge_reg'] = lam*np.linalg.norm(ridge_coef, ord=2)**2
    losses.loc[lam, 'ridge_loss'] = losses.loc[lam, 'ridge_mse'] + losses.loc[lam, 'ridge_reg']
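The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell fails on recent versions. One possible replacement, sketched here on synthetic stand-in data, is to scale the features explicitly in a pipeline. The names `fit_scaled`, `X_demo` and `y_demo` are assumptions made for this sketch. `StandardScaler` does not scale the features exactly as `normalize=True` did, so the `alpha` values are not directly comparable between the two versions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the lab itself uses mtcars).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 1.0 + X_demo @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

def fit_scaled(estimator, X, y):
    """Standardize the features, then fit the given linear model."""
    return make_pipeline(StandardScaler(), estimator).fit(X, y)

lam = 0.1
lasso_demo = fit_scaled(Lasso(alpha=lam, max_iter=10000, tol=1e-4), X_demo, y_demo)
ridge_demo = fit_scaled(Ridge(alpha=lam), X_demo, y_demo)

# The fitted estimator is the last step of the pipeline.
lasso_demo_coef = lasso_demo[-1].coef_
ridge_demo_coef = ridge_demo[-1].coef_
print("lasso:", np.round(lasso_demo_coef, 3))
print("ridge:", np.round(ridge_demo_coef, 3))
```

The coefficients returned this way are expressed in units of the standardized features, which is worth keeping in mind when comparing them with an unscaled OLS fit.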
    


# Linear regression regularization

A good way to measure the complexity of a linear model is the norm of its coefficient vector.
Some intuition for why looking at the norm of the coefficients makes sense:

Note that the trivial hypothesis, the simplest one, which maps every input to 0 ($w = (0, 0, \ldots, 0)$), has norm 0. This means that the penalty of the simplest model is indeed 0.
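The two norms used later in this lab can be computed directly with `np.linalg.norm`. The snippet below is only a sketch; the three coefficient vectors are made-up examples chosen to show that the trivial hypothesis has zero penalty and that the $\ell_1$ and squared $\ell_2$ norms can rank two models differently.

```python
import numpy as np

# Hypothetical coefficient vectors, chosen for illustration only.
candidates = {
    "trivial": np.zeros(4),
    "sparse": np.array([3.0, 0.0, 0.0, 0.0]),
    "dense": np.array([1.5, -1.5, 1.5, -1.5]),
}

norms = {}
for name, w in candidates.items():
    l1 = np.linalg.norm(w, ord=1)          # sum of absolute values (Lasso penalty)
    l2_sq = np.linalg.norm(w, ord=2) ** 2  # sum of squares (Ridge penalty)
    norms[name] = (l1, l2_sq)
    print(f"{name:8s} l1={l1:.2f}  squared l2={l2_sq:.2f}")
```

Here the sparse and dense vectors have the same squared $\ell_2$ norm, but the dense one has twice the $\ell_1$ norm, so an $\ell_1$ penalty prefers the vector that uses fewer variables.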


Needed:
** Some explanation of why a higher norm means higher complexity.
Is a more complex model one that "looks" at more variables from the input?


## Ridge

Ridge regularization penalizes the squared $\ell_2$ norm of the coefficients:

$$
\hat w^{ridge}_{\lambda} = \underset{w_0\in\mathbb{R}, w\in\mathbb{R}^d}{argmin} \Vert w_0 + Xw - y \Vert^2_2 +
    \lambda \Vert w \Vert^2_2
$$

Note that as we make the penalty higher, the coefficients shrink *toward* 0: a larger $\lambda$ makes the $\lambda \Vert w \Vert^2_2$ term more dominant, so the minimizer accepts a larger $\Vert w_0 + Xw - y \Vert^2_2$ term in exchange for a smaller norm.
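The shrinkage can be seen directly from the closed-form solution of the Ridge problem. The sketch below assumes the intercept $w_0$ is not penalized, which matches the objective above and lets us solve for $w$ on centered data. The function name `ridge_closed_form` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min ||w0 + Xw - y||^2 + lam * ||w||^2 with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centering removes the intercept
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    w0 = y_mean - x_mean @ w                  # recover the intercept
    return w0, w

# Synthetic data (an assumption made for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.0 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

ridge_norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w0, w = ridge_closed_form(X, y, lam)
    ridge_norms.append(np.linalg.norm(w))
    print(f"lambda={lam:6.1f}  ||w||_2={ridge_norms[-1]:.3f}  w={np.round(w, 3)}")
```

The norm of $w$ decreases as $\lambda$ grows, but the individual coefficients typically stay non-zero.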

In [38]:
arg_ridge= np.argmin(losses.loc[:, 'ridge_mse'])
min_ridge_y = losses.loc[:, 'ridge_mse'].values[arg_ridge]
min_ridge_x = np.log(lambdas)[arg_ridge]

fig = make_subplots(rows=6, cols=1, specs=[[{"rowspan": 4}], [None], [None], [None], [{"rowspan": 2}], [None]], 
                   subplot_titles=['Regularization Paths', 'Loss'])
for i, col in enumerate(X_train.columns):
    fig.add_trace(go.Scatter(x=np.log(lambdas), y=ridge_coef_df.loc[:, col], mode='lines', marker_color = colors[i], name=col + ' ridge'))

fig.add_trace(go.Scatter(x=np.log(lambdas), y=losses.loc[:, 'ridge_mse'], mode='lines + markers', name='MSE of Ridge'), 5, 1)

fig.add_trace(go.Scatter(x=[min_ridge_x], y=[min_ridge_y], mode='lines + markers', name='min Ridge'), 5, 1)

ridge_coef = ridge_coef_df.iloc[arg_ridge]

fig.show()

## Lasso
The Lasso regularized method is attempting to solve the following optimization problem:

$$
\hat w^{lasso}_{\lambda} = \underset{w_0\in\mathbb{R}, w\in\mathbb{R}^d}{argmin} \Vert w_0 + Xw - y \Vert^2_2\ +
    \lambda \Vert w \Vert_1
$$

Note that as we make the penalty higher, more and more coefficients shrink *exactly* to 0.
That is why Lasso tends to create sparser solutions.
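One way to see where the exact zeros come from is the one-dimensional case, sketched below under the simplifying assumption of a single coefficient $w$ and a single target value $a$. Minimizing $(w - a)^2 + \lambda \vert w \vert$ gives soft-thresholding, $w = \mathrm{sign}(a)\max(\vert a \vert - \lambda/2, 0)$, while minimizing $(w - a)^2 + \lambda w^2$ gives $w = a / (1 + \lambda)$. The function names `lasso_1d` and `ridge_1d` are made up for this illustration.

```python
import numpy as np

def lasso_1d(a, lam):
    """Minimizer of (w - a)^2 + lam * |w|: soft-thresholding at lam / 2."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

def ridge_1d(a, lam):
    """Minimizer of (w - a)^2 + lam * w^2: proportional shrinkage."""
    return a / (1.0 + lam)

# Hypothetical unregularized solutions, one per independent coefficient.
a = np.array([3.0, -1.0, 0.4])
for lam in (0.5, 1.0, 4.0):
    print(f"lambda={lam}: lasso={lasso_1d(a, lam)}  ridge={ridge_1d(a, lam)}")
```

The Lasso solution subtracts a fixed amount from every coefficient and clips at 0, so small coefficients vanish once $\lambda/2$ exceeds their magnitude. The Ridge solution only rescales, so a non-zero coefficient never reaches 0 for any finite $\lambda$.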




Needed:
** An explanation of why l2 and l1 behave differently.
I personally never got this phrase:
"this intersection typically happens at one of the corners of the norm ball." (from the book's explanation, the geometric intuition). I found a good explanation here https://www.quora.com/Whats-a-good-way-to-provide-intuition-as-to-why-the-lasso-L1-regularization-results-in-sparse-weight-vectors that approaches it through the algebra. I am not sure how deep to get into it, but I do think that it's important to see why it's true and not just memorise the behaviors.

In [39]:
arg_lasso = np.argmin(losses.loc[:, 'lasso_mse'])
min_lasso_y = losses.loc[:, 'lasso_mse'].values[arg_lasso]
min_lasso_x = np.log(lambdas)[arg_lasso]


fig = make_subplots(rows=6, cols=1, specs=[[{"rowspan": 4}], [None], [None], [None], [{"rowspan": 2}], [None]], 
                   subplot_titles=['Regularization Paths', 'Loss'])
for i, col in enumerate(X_train.columns):
    fig.add_trace(go.Scatter(x=np.log(lambdas), y=lasso_coef_df.loc[:, col], mode='lines', marker_color = colors[i], name=col + ' lasso'))

fig.add_trace(go.Scatter(x=np.log(lambdas), y=losses.loc[:, 'lasso_mse'], mode='lines + markers', name='MSE of Lasso regressor'), 5, 1)

fig.add_trace(go.Scatter(x=[min_lasso_x], y=[min_lasso_y], mode='lines + markers', name='min Lasso'), 5, 1)

lasso_coef = lasso_coef_df.iloc[arg_lasso]

fig.show()

## Regularization paths difference


Note the following graph: the middle points are the coefficients of the unique OLS solution for this regression problem.
Each point on the sides corresponds to a coefficient chosen by a regularized regression, using the $\lambda$ with the lowest test MSE.
Note the different effects of Lasso and Ridge:
Both seem to shrink the coefficients, so the coefficients are closer to 0 with regularization than they are in the OLS solution.
Note that Ridge shifts the coefficients closer to 0, whereas Lasso shifts some coefficients to 0 itself.

In [33]:
ols = LinearRegression().fit(X_train, y_train)

paths = []
for i, col in enumerate(X_train.columns):
    paths.append([ridge_coef_df.iloc[arg_ridge].values[i], ols.coef_[i], lasso_coef_df.iloc[arg_lasso].values[i]])

In [34]:
fig = go.Figure()
for i, col in enumerate(X_train.columns):
    fig.add_trace(go.Scatter(x=[-1], y=[paths[i][0]], mode='markers', marker_color = colors[i], showlegend=False))
    fig.add_trace(go.Scatter(x=[0], y=[paths[i][1]], mode='markers', marker_color = colors[i], showlegend=False))
    fig.add_trace(go.Scatter(x=[1], y=[paths[i][2]], mode='markers', marker_color = colors[i], showlegend=False))

    fig.add_trace(go.Scatter(x=[-1, 0, 1], y=paths[i], mode='lines', line={'dash':'dash'}, marker_color = colors[i], name=col))


fig.add_annotation(x=1, y=6,
        text="Lasso Coeff",
        showarrow=False)

fig.add_annotation(x=0, y=6,
        text="OLS Coeff",
        showarrow=False)

fig.add_annotation(x=-1, y=6,
        text="Ridge Coeff",
        showarrow=False)

fig.update_layout(xaxis = {
    'range': [-1.1, 1.1],
    
    'showticklabels': False,
    'zeroline': False
},)
    
fig.show()



We can also see the above phenomenon by inspecting the regularization paths.
Note the differences in the coefficients' behavior as we increase $\lambda$ (making the coefficient penalty more dominant in the loss term we optimize).

Lasso regularization tends to eliminate some variables completely (it zeroes their coefficients), which results in sparser models.

On the other hand, Ridge tends to shrink the coefficients: it "limits" the effect of some variables on the target but does not ignore them completely.


In [36]:
colors = px.colors.sequential.Turbo

fig = go.Figure()
for i, col in enumerate(X_train.columns):
    fig.add_trace(go.Scatter(x=np.log(lambdas), y=lasso_coef_df.loc[:, col], mode='lines', marker_color = colors[i], name=col + ' lasso'))
    fig.add_trace(go.Scatter(x=np.log(lambdas), y=ridge_coef_df.loc[:, col], mode='lines', marker_color = colors[i], line={'dash': 'dash'}, name=col + ' ridge'))
fig.show()