# Introduction to regularized regression

Although ordinary least squares (OLS) is the best linear unbiased estimator (BLUE) under the Gauss-Markov assumptions, biased (regularized) estimators can achieve lower mean-squared error, especially on unseen or held-out data. This is because regularization accepts a small amount of bias in exchange for a larger reduction in variance, which tends to improve predictions on data the model was not fit to. Regularization is also particularly useful when there is multicollinearity between your regressors, a regime in which OLS suffers from large standard errors on the $\beta$ estimates.
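To make the bias-variance argument concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the sample sizes, the noise level, the way the correlated regressors are generated, and the penalty `alpha = 1.0` (which is not tuned). It fits OLS and ridge on small training sets with strongly correlated regressors and compares their mean-squared error on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 20, 1000, 10
noise_sd, alpha = 1.0, 1.0  # alpha is an arbitrary illustrative penalty, not a tuned value


def make_data(n_samples, beta):
    """Draw strongly correlated regressors: one shared signal plus small independent noise."""
    shared = rng.normal(size=(n_samples, 1))
    X = shared + 0.1 * rng.normal(size=(n_samples, n_features))
    y = X @ beta + noise_sd * rng.normal(size=n_samples)
    return X, y


ols_mse, ridge_mse = [], []
for _ in range(200):
    beta_true = rng.normal(size=n_features)
    X_train, y_train = make_data(n_train, beta_true)
    X_test, y_test = make_data(n_test, beta_true)

    # OLS via least squares; ridge via its closed form (no intercept, for simplicity)
    beta_ols = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    gram = X_train.T @ X_train
    beta_ridge = np.linalg.solve(gram + alpha * np.eye(n_features), X_train.T @ y_train)

    # Evaluate both estimators on data that was not used for fitting
    ols_mse.append(np.mean((y_test - X_test @ beta_ols) ** 2))
    ridge_mse.append(np.mean((y_test - X_test @ beta_ridge) ** 2))

mean_ols_mse = float(np.mean(ols_mse))
mean_ridge_mse = float(np.mean(ridge_mse))
print(f"held-out MSE  OLS: {mean_ols_mse:.2f}  ridge: {mean_ridge_mse:.2f}")
```

In this setting the ridge fit should show the lower held-out error on average, because the penalty damps the poorly determined directions of the correlated design. With many more samples than regressors, or with weakly correlated regressors, the gap shrinks.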

When building encoding models for neural data, regressors are often collinear. Let's take a basic example of a mouse that must press a lever to receive a food reward. If a food reward is delivered every time the lever is pressed, it will be quite difficult to separate the neural encoding of action (lever pressing) from the encoding of reward (receiving food), because these variables always occur together (and close together in time). When designing behavioral tasks, think about how you might avoid such collinearities.
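The lever-press example can be sketched with two toy design matrices. The trial sequence and the "reward on every second press" schedule below are hypothetical, invented only to contrast a fully collinear task with a partially decoupled one.

```python
import numpy as np

# Hypothetical session of 200 trials: 1 = lever pressed on that trial, 0 = no press
press = np.tile([1, 0, 1, 1, 0], 40)

# Task A: reward follows every press, so the two regressors are identical
reward_always = press.copy()

# Task B (hypothetical alternative): reward follows only every second press
press_count = np.cumsum(press)
reward_partial = press * (press_count % 2 == 0)

X_always = np.column_stack([press, reward_always]).astype(float)
X_partial = np.column_stack([press, reward_partial]).astype(float)

# A rank below the number of columns means the regressors cannot be separated
rank_always = np.linalg.matrix_rank(X_always)
rank_partial = np.linalg.matrix_rank(X_partial)

# Correlation between the press and reward regressors in each task
corr_always = np.corrcoef(press, reward_always)[0, 1]
corr_partial = np.corrcoef(press, reward_partial)[0, 1]

print(f"Task A: rank {rank_always}, correlation {corr_always:.2f}")
print(f"Task B: rank {rank_partial}, correlation {corr_partial:.2f}")
```

In Task A the design matrix has rank 1, so $X^TX$ is singular and OLS has no unique solution for the two weights. Omitting the reward on some presses restores full rank, which is one way a task design can reduce collinearity.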


### Ridge, lasso and ordinary least squares (OLS)
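OLS finds the weights that minimize the sum of squared errors (SSE); ridge and lasso minimize the SSE plus a penalty on the weights. The objective functions for OLS, ridge, and lasso regression are shown below, where $\hat{y}_i$ is the model prediction for sample $i$ and $\beta_j$ is the weight on regressor $j$.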
$$
SSE_{OLS} = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2}
$$
$$
SSE_{Ridge} = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} + \alpha\sum_{j=1}^{P}{\beta_j^2}
$$
$$
SSE_{Lasso} = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} + \lambda\sum_{j=1}^{P}{|\beta_j|}
$$
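As a rough sketch, the three objectives above can be written directly in NumPy. The function names are made up for this example, and the code assumes a linear model without an intercept, so that $\hat{y} = X\beta$.

```python
import numpy as np


def sse_ols(X, y, beta):
    """Sum of squared errors for the linear prediction X @ beta."""
    residuals = y - X @ beta
    return float(np.sum(residuals ** 2))


def sse_ridge(X, y, beta, alpha):
    """SSE plus alpha times the sum of squared weights."""
    return sse_ols(X, y, beta) + alpha * float(np.sum(beta ** 2))


def sse_lasso(X, y, beta, lam):
    """SSE plus lam times the sum of absolute weights."""
    return sse_ols(X, y, beta) + lam * float(np.sum(np.abs(beta)))


# Tiny hand-checkable example
X_demo = np.eye(2)
y_demo = np.array([1.0, 2.0])
beta_demo = np.array([1.0, -2.0])

print(sse_ols(X_demo, y_demo, beta_demo))         # 16.0
print(sse_ridge(X_demo, y_demo, beta_demo, 0.5))  # 16.0 + 0.5 * 5 = 18.5
print(sse_lasso(X_demo, y_demo, beta_demo, 0.5))  # 16.0 + 0.5 * 3 = 17.5
```

For the same weights and the same penalty strength, the ridge penalty grows faster than the lasso penalty once a weight exceeds 1 in magnitude, and more slowly below that.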
You can see that the ridge and lasso equations include an additional term that penalizes large $\beta$ values, scaled by a regularization strength ($\alpha$ for ridge, $\lambda$ for lasso). The only difference between the two penalties is that ridge sums the squared weights, whereas lasso sums their absolute values. By minimizing these functions, we can compute the OLS and ridge estimators. We will skip the derivation of the closed-form solutions, but they come from minimizing the equations above.

$$
\hat{\beta}_{OLS} = (X^TX)^{-1}X^Ty
$$
$$
\hat{\beta}_{Ridge} = (X^TX + \alpha I)^{-1}X^Ty
$$
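The two closed-form estimators translate almost line for line into NumPy. This is a minimal sketch under a few assumptions: there is no intercept term, the simulated data and the value `alpha = 10.0` are arbitrary, and `np.linalg.solve` is used in place of an explicit matrix inverse because it is numerically more stable.

```python
import numpy as np


def ols_estimator(X, y):
    """Solve (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ridge_estimator(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Simulated data with known weights (illustrative values only)
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(100, 3))
beta_sim = np.array([2.0, -1.0, 0.5])
y_sim = X_sim @ beta_sim + 0.5 * rng.normal(size=100)

alpha_sim = 10.0
beta_hat_ols = ols_estimator(X_sim, y_sim)
beta_hat_ridge = ridge_estimator(X_sim, y_sim, alpha_sim)

# Gradient of the ridge objective; it vanishes at the minimizer
ridge_gradient = 2 * X_sim.T @ (X_sim @ beta_hat_ridge - y_sim) + 2 * alpha_sim * beta_hat_ridge

print("OLS:  ", beta_hat_ols)
print("ridge:", beta_hat_ridge)
```

Setting `alpha` to zero recovers the OLS estimate, and any positive `alpha` shrinks the weight vector toward zero. Adding $\alpha I$ also makes the matrix invertible even when $X^TX$ is singular, which is why ridge still has a unique solution for perfectly collinear regressors.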
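Lasso regression has no closed-form solution in general, because the absolute-value penalty is not differentiable at zero. It must be solved with iterative techniques such as coordinate descent or proximal gradient descent, so we will omit it for now.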
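For the curious, here is a small sketch of one such iterative technique, cyclic coordinate descent with soft-thresholding. It minimizes the lasso objective exactly as written above (SSE plus $\lambda\sum_j|\beta_j|$, no intercept). The function names and the fixed iteration count are choices made for this example; library implementations typically scale the objective differently and add convergence checks.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Shrink z toward zero by threshold, returning zero if |z| <= threshold."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum((y - X @ beta)**2) + lam * sum(|beta|) one coordinate at a time."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    column_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(n_features):
            if column_norms[j] == 0:
                continue  # an all-zero regressor keeps a zero weight
            # Residual with regressor j's current contribution added back in
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            # Exact minimizer over beta[j] with the other weights held fixed
            beta[j] = soft_threshold(rho, lam / 2) / column_norms[j]
    return beta


# With an identity design, the solution is y soft-thresholded at lam / 2
beta_identity = lasso_coordinate_descent(np.eye(3), np.array([3.0, -0.2, 1.0]), lam=1.0)
print(beta_identity)  # weights 2.5, 0 and 0.5
```

The second weight is set exactly to zero rather than merely shrunk, which is the property that makes lasso useful for selecting regressors.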


In [1]:
# First import the standard toolboxes
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from scipy.stats import zscore
from tqdm import tqdm

# Import the specific functions for this class
import encoding_tools

# Import some sklearn functions (sklearn is provided by the scikit-learn package)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.pipeline import Pipeline
