## Bias-Variance Trade-off of Ridge Regression

We know from class that, for an orthonormal design ($X^TX = I$),
$$ \beta_{ridge} = \frac{1}{1+\lambda} \beta_{OLS} $$

Given this relationship, we can express the bias and variance terms of the ridge estimator in terms of the OLS estimator.

**Bias**:

Using the given relationship, the bias for the ridge estimator becomes:

$ E[\beta_{ridge}] - \beta = E\left[\frac{1}{1+\lambda} \beta_{OLS}\right] - \beta $
$ = \frac{1}{1+\lambda} E[\beta_{OLS}] - \beta $

If $E[\beta_{OLS}] = \beta$, then the bias is:

$ Bias = \frac{\beta}{1+\lambda} - \beta = \frac{-\lambda \beta}{1+\lambda} $

**Variance**:

The variance of the ridge estimator is:

$ \text{var}[\beta_{ridge}] = \text{var}\left[\frac{1}{1+\lambda} \beta_{OLS}\right] $
$ = \left(\frac{1}{1+\lambda}\right)^2 \text{var}[\beta_{OLS}] $

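Thus, summing over the $p$ coefficients, the bias-variance decomposition of the mean squared error becomes:

$ E[\|\beta_{ridge} - \beta\|^2] = \left(\frac{1}{1+\lambda}\right)^2 \text{tr}\left(\text{var}[\beta_{OLS}]\right) + \left(\frac{\lambda}{1+\lambda}\right)^2 \sum_{j=1}^{p} \beta_j^2 = \left(\frac{1}{1+\lambda}\right)^2 \sigma^2 \, \text{tr}\left((X^TX)^{-1}\right) + \left(\frac{\lambda}{1+\lambda}\right)^2 \sum_{j=1}^{p} \beta_j^2 $

When $X^TX = I$, $\text{tr}\left((X^TX)^{-1}\right) = p$, so the variance term equals $\frac{p\sigma^2}{(1+\lambda)^2}$.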


## Beyond Quadratic Loss

The expression is essentially a function of $ \lambda $. The terms represent the variance and squared bias contributions, respectively.

**Variance term**: As $ \lambda $ increases, the factor $ \left(\frac{1}{1+\lambda}\right)^2 $ decreases, implying the variance of the ridge estimator decreases with increasing $ \lambda $.

**Bias term**: As $ \lambda $ increases, the bias term $ \left(\frac{-\lambda \beta}{1+\lambda}\right)^2 $ increases, indicating that the bias of the estimator increases with increasing $ \lambda $.

To find the optimal $ \lambda $ that minimizes the mean squared error (i.e., the sum of variance and squared bias), we differentiate the above expression with respect to $ \lambda $ and set the derivative to zero. From this first-order condition (FOC), the optimal shrinkage is


$$ \lambda^* = \frac{p \sigma^2}{\sum_{j=1}^{p} \beta_j^2} $$
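The first-order condition can be worked through explicitly. The sketch below assumes an orthonormal design ($X^TX = I$), so that the MSE summed over the $p$ coefficients has the scalar form shown in the first line; the shorthand $B = \sum_{j=1}^{p} \beta_j^2$ is introduced here only for brevity.

$$ \text{MSE}(\lambda) = \frac{p\sigma^2 + \lambda^2 B}{(1+\lambda)^2} $$

$$ \frac{d\,\text{MSE}}{d\lambda} = \frac{2\lambda B (1+\lambda)^2 - 2(1+\lambda)\left(p\sigma^2 + \lambda^2 B\right)}{(1+\lambda)^4} = \frac{2\left(\lambda B - p\sigma^2\right)}{(1+\lambda)^3} $$

The derivative is negative for $\lambda < p\sigma^2 / B$ and positive for $\lambda > p\sigma^2 / B$, so the stationary point $\lambda^* = p\sigma^2 / B$ is a minimum.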

We can therefore obtain an estimator with a lower MSE than OLS if $\lambda$ is tuned appropriately.
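As a quick numerical check of the closed form, the sketch below evaluates the MSE on a grid of $\lambda$ values and compares the grid minimizer with $p\sigma^2 / \sum_j \beta_j^2$. It assumes an orthonormal design, and the values of `beta` and `sigma2` are hypothetical, chosen only for illustration.

```python
import numpy as np

def ridge_mse(lam, beta, sigma2):
    """Total MSE of beta_OLS / (1 + lam), assuming an orthonormal design."""
    p = beta.size
    variance = p * sigma2 / (1.0 + lam) ** 2
    bias_sq = (lam / (1.0 + lam)) ** 2 * np.sum(beta ** 2)
    return variance + bias_sq

# Hypothetical true coefficients and noise variance (illustration only)
beta = np.array([0.5, -1.0, 2.0])
sigma2 = 1.5

lam_star = beta.size * sigma2 / np.sum(beta ** 2)  # closed-form optimum
lam_grid = np.linspace(0.0, 5.0, 50001)
lam_numeric = lam_grid[np.argmin(ridge_mse(lam_grid, beta, sigma2))]

print("closed-form lambda*:", lam_star)
print("grid-search lambda :", lam_numeric)
print("MSE at lambda = 0 (OLS):", ridge_mse(0.0, beta, sigma2))
print("MSE at lambda*         :", ridge_mse(lam_star, beta, sigma2))
```

The grid minimizer should agree with the closed form up to the grid spacing, and the MSE at $\lambda^*$ should be below the OLS value at $\lambda = 0$.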

### (a) Minimizer of the $L_1$ Loss

Given the $L_1$ loss $L_1(y, \hat{y}) = E[|y-\hat{y}| \mid X]$, we want to find the $ \alpha $ that minimizes $ E[|y-\alpha| \mid X] $.

To do this, we can rewrite the expectation in terms of the conditional density:

$E[|y-\alpha| | X] = \int_{-\infty}^{\infty} |y-\alpha| p_{y|x}(y)dy$

To find the minimizer, we differentiate the expectation with respect to $ \alpha $ and set the derivative to 0. The function $ |y-\alpha| $ is not differentiable at $ y=\alpha $, but that single point carries zero probability under a continuous density, so we can split the integral at $ \alpha $:

$E[|y-\alpha| | X] = \int_{-\infty}^{\alpha} (\alpha - y) p_{y|x}(y)dy + \int_{\alpha}^{\infty} (y - \alpha) p_{y|x}(y)dy$

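Differentiating with respect to $ \alpha $ (the boundary terms vanish because the integrand is zero at $ y = \alpha $) gives $ \int_{-\infty}^{\alpha} p_{y|x}(y)dy - \int_{\alpha}^{\infty} p_{y|x}(y)dy $, that is, the conditional cumulative distribution function at $ \alpha $ minus its complement. Setting this to zero, the solution is the $ \alpha $ such that: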

$P(Y \leq \alpha | X = x) = P(Y > \alpha | X = x) = 0.5$

This means that $ \alpha $ is the median of the conditional distribution of $ Y $ given $ X = x $. Thus, the minimizer of the $ L_1 $ loss is the conditional median.
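The result can be checked numerically. The sketch below draws a skewed sample (a hypothetical lognormal stand-in for the conditional distribution of $Y$ given $X = x$), evaluates the empirical $L_1$ loss on a grid of candidate $\alpha$ values, and compares the grid minimizer with the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed sample standing in for draws from p(y | x)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1001)

# Empirical L1 loss for each candidate alpha on a grid
alphas = np.linspace(0.0, 5.0, 2001)
l1_loss = np.abs(y[:, None] - alphas[None, :]).mean(axis=0)
alpha_best = alphas[np.argmin(l1_loss)]

def mean_abs_dev(a):
    """Empirical L1 loss of the constant prediction a."""
    return np.mean(np.abs(y - a))

print("grid minimizer of the L1 loss:", alpha_best)
print("sample median                :", np.median(y))
print("L1 loss at the median:", mean_abs_dev(np.median(y)))
print("L1 loss at the mean  :", mean_abs_dev(np.mean(y)))
```

The grid minimizer should sit within one grid step of the sample median, and the loss at the median should be below the loss at the mean.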

### (b) Financial Interpretation

Let's say $ x = S_t $ is the stock price at time $ t $ and $ y = S_T $ is the stock price at time $ T $. The conditional median found in part (a) represents the price at which there is a 50% chance the stock will be below and a 50% chance it will be above at time $ T $ given the stock price at time $ t $. This could be interpreted as a measure of central tendency for the future stock price, not skewed by extreme price movements (unlike the mean, which could be affected by outliers). 
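To illustrate the gap between the two measures, the sketch below simulates $S_T$ under a lognormal (geometric Brownian motion) model. That model and all parameter values (`S_t`, `mu`, `vol`, `tau`) are assumptions made here for illustration; they are not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters: current price, annual drift, annual volatility, horizon in years
S_t, mu, vol, tau = 100.0, 0.08, 0.40, 1.0

# Simulate terminal prices under the assumed lognormal model
z = rng.standard_normal(200_000)
S_T = S_t * np.exp((mu - 0.5 * vol ** 2) * tau + vol * np.sqrt(tau) * z)

sim_mean = S_T.mean()
sim_median = np.median(S_T)
share_below = np.mean(S_T < sim_median)

print("simulated mean  :", sim_mean)
print("simulated median:", sim_median)
print("share of paths below the median:", share_below)
```

Under this assumption the distribution of $S_T$ is right-skewed, so a few large upward moves pull the mean above the median while half of the simulated paths still end below the median.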

### (c)

If $ x $ is a multi-dimensional random vector, the analysis in part (a) still holds, but now our condition is on a vector value rather than a scalar value. The minimizer of the $ L_1 $ loss in this context would be the conditional median of $ Y $ given the multi-dimensional vector $ X = x $. In other words, for each possible vector $ x $, we would compute a median value of $ Y $ based on the joint distribution of $ X $ and $ Y $. The intuition remains the same: the conditional median gives us a central value of $ Y $ given $ X = x $ that is not influenced by potential outliers in the $ Y $ distribution.
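A minimal sketch of the vector case, assuming a hypothetical two-dimensional binary feature vector so that each value of $x$ defines a cell with many observations. The data-generating process (linear signal plus exponential noise) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical two-dimensional binary feature vector x = (x1, x2)
X = rng.integers(0, 2, size=(n, 2))
# Skewed (exponential) noise, so the conditional mean and median differ
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.exponential(scale=1.0, size=n)

# Compute the median of y separately for every value of the vector x
cond_median = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        mask = (X[:, 0] == x1) & (X[:, 1] == x2)
        cond_median[(x1, x2)] = np.median(y[mask])
        print((x1, x2), "median:", cond_median[(x1, x2)], "mean:", y[mask].mean())
```

In this setup the true conditional median is $x_1 + 2x_2 + \ln 2$, since the median of a unit exponential is $\ln 2$, while the conditional mean is $x_1 + 2x_2 + 1$.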

## Feature Engineering of the Hedge Fund Dataset

### OLS

In [1]:
import pandas as pd
import statsmodels.api as sm

# Load the Fama-French 5-factor file (skipping the first 3 rows) and the monthly QMNIX prices
ff = pd.read_csv('/Users/Eric/opt/anaconda3/envs/dsm/F-F_Research_Data_5_Factors_2x3.csv', skiprows=3)
yahoo = pd.read_csv('/Users/Eric/opt/anaconda3/envs/dsm/QMNIX_month.csv')
yahoo['Return'] = yahoo['Adj Close'].pct_change()  # Compute monthly returns
ff = ff.iloc[:721].copy()  # Keep only the monthly rows; copy so the assignment below does not write to a slice
ff['Date'] = pd.to_datetime(ff['Unnamed: 0'].astype(str), format='%Y%m')
yahoo['Date'] = pd.to_datetime(yahoo['Date'])
yahoo = yahoo[["Date", "Return"]]
df = pd.merge(ff, yahoo, on='Date', how='inner')  # Join the two tables on date
df.drop('Unnamed: 0', axis=1, inplace=True)
df.dropna(inplace=True)  # Drop rows with missing values

# Convert relevant columns to numeric types, if they're not already
for column in ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 'Return']:
    df[column] = pd.to_numeric(df[column], errors='coerce')

# Define independent variables (X) with the Fama-French factors and an intercept
X = df[['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']]
X = sm.add_constant(X)  # Adds a constant (intercept) to the model

# Define dependent variable (Y)
Y = df['Return']

# Drop rows with missing values for regression
X_clean = X.dropna()
Y_clean = Y.loc[X_clean.index]

# Fit the OLS model
model = sm.OLS(Y_clean, X_clean).fit()

# Display model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Return   R-squared:                       0.188
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     4.528
Date:                Fri, 27 Oct 2023   Prob (F-statistic):           0.000933
Time:                        19:30:34   Log-Likelihood:                 196.23
No. Observations:                 104   AIC:                            -380.5
Df Residuals:                      98   BIC:                            -364.6
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0047      0.004      1.220      0.2

### Elastic Net Regularization

In [2]:
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

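# Scale the data for regularization methods
# (X_clean still contains the 'const' column, which StandardScaler maps to all zeros;
# its coefficient is skipped when the coefficients are printed below)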
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)

# Fit the Elastic Net model
enet_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=10, n_jobs=-1)
enet_model = enet_cv.fit(X_scaled, Y_clean)

# Predict using both models
ols_preds = model.predict(X_clean)
enet_preds = enet_model.predict(X_scaled)

# Calculate the in-sample MSE for both models
ols_mse = mean_squared_error(Y_clean, ols_preds)
enet_mse = mean_squared_error(Y_clean, enet_preds)

print("OLS MSE:", ols_mse)
print("Elastic Net MSE:", enet_mse)

# Coefficients from the Elastic Net model
print("\nElastic Net coefficients:")
print("Intercept:", enet_model.intercept_)
for feature, coef in zip(['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA'], enet_model.coef_[1:]):
    print(feature, ":", coef)

OLS MSE: 0.001344743300453897
Elastic Net MSE: 0.001345242133105692

Elastic Net coefficients:
Intercept: 0.0037321169366177505
Mkt-RF : -0.003981420639365932
SMB : -0.002387897959977573
HML : 0.012119388223212733
RMW : 0.002008799478489891
CMA : 0.00516879831084039


## Add more non-linear regressors 

In [16]:
import numpy as np
from sklearn.linear_model import ElasticNet

# 1. Add squared values
df['Mkt-RF^2'] = df['Mkt-RF'] ** 2
df['SMB^2'] = df['SMB'] ** 2
df['HML^2'] = df['HML'] ** 2
df['RMW^2'] = df['RMW'] ** 2
df['CMA^2'] = df['CMA'] ** 2

# 2. Add interaction terms
df['Mkt-RF*SMB'] = df['Mkt-RF'] * df['SMB']
df['Mkt-RF*HML'] = df['Mkt-RF'] * df['HML']
df['Mkt-RF*RMW'] = df['Mkt-RF'] * df['RMW']
df['Mkt-RF*CMA'] = df['Mkt-RF'] * df['CMA']

# 3. Add log transformations of (1 + x); this is only defined for x > -1,
#    so any factor value <= -1 yields NaN and that row is removed by dropna() below
df['log_Mkt-RF'] = np.log(df['Mkt-RF'] + 1)
df['log_SMB'] = np.log(df['SMB'] + 1)
df['log_HML'] = np.log(df['HML'] + 1)

# Update the independent variables (X) to include the new features
X_extended = df[['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 
                 'Mkt-RF^2', 'SMB^2', 'HML^2', 'RMW^2', 'CMA^2', 
                 'Mkt-RF*SMB', 'Mkt-RF*HML', 'Mkt-RF*RMW', 'Mkt-RF*CMA',
                 'log_Mkt-RF', 'log_SMB', 'log_HML']]
X_extended = sm.add_constant(X_extended)

# Drop rows with missing values for regression
X_ext_clean = X_extended.dropna()
Y_ext_clean = Y.loc[X_ext_clean.index]

# Fit the OLS model with extended features
model_extended = sm.OLS(Y_ext_clean, X_ext_clean).fit()

# Scaling extended features for Elastic Net
X_ext_scaled = scaler.fit_transform(X_ext_clean)

# Separate ElasticNet from ElasticNetCV for more granular control
# l1_ratio from enet_cv represents the best found value from cross-validation
# Note: alpha is not passed, so ElasticNet uses its default alpha=1.0 rather than enet_cv.alpha_
enet = ElasticNet(l1_ratio=enet_cv.l1_ratio_, max_iter=10000, tol=1e-6)

# Refit the model with the extended features
enet_model_refit = enet.fit(X_ext_scaled, Y_ext_clean)
enet_refit_preds = enet_model_refit.predict(X_ext_scaled)

# Calculate MSE for extended features
enet_refit_mse = mean_squared_error(Y_ext_clean, enet_refit_preds)





In [6]:
print("Elastic Net (Refitted) Extended MSE:", enet_refit_mse)

Elastic Net (Refitted) Extended MSE: 0.0023546816924517387


### Observation:
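The extended features do not reduce the MSE. In fact, the in-sample MSE rises from about 0.00134 to about 0.00235. Note that the refit does not pass `alpha`, so it uses scikit-learn's default `alpha=1.0` rather than the cross-validated value, which may contribute to the higher error.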
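One possible contributor is the penalty strength: the refit above constructs `ElasticNet` without passing `alpha`, so the default `alpha=1.0` applies. The sketch below uses hypothetical synthetic data, with a target on roughly the scale of monthly returns, to show how such a large penalty can shrink every coefficient to zero, leaving only the intercept. The data and the comparison value `1e-4` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, p = 100, 5

# Hypothetical standardized features and a small-scale target
X = rng.standard_normal((n, p))
y = 0.01 * X[:, 0] + 0.03 * rng.standard_normal(n)

results = {}
for alpha in (1.0, 1e-4):
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000).fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    results[alpha] = (np.count_nonzero(model.coef_), mse)
    print(f"alpha={alpha}: nonzero coefficients={results[alpha][0]}, MSE={mse:.6f}")

print("variance of y:", np.var(y))
```

When all coefficients are zero, the model predicts the sample mean of `y`, so its in-sample MSE equals the variance of `y`. Passing the cross-validated penalty (for example `alpha=enet_cv.alpha_`) would be one way to avoid this in the notebook.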

## 10-fold cross validation

In [4]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression

# Define a function to calculate MSE
def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)

# Create a scorer for MSE
mse_scorer = make_scorer(mse, greater_is_better=False)

# Set up 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Initialize a Linear Regression model (OLS; it fits its own intercept, so the 'const' column in X is redundant)
lr = LinearRegression()

# Cross-validation for the original features
mse_original = cross_val_score(lr, X_clean, Y_clean, cv=kf, scoring=mse_scorer)

# Cross-validation for the extended features
mse_extended = cross_val_score(lr, X_ext_clean, Y_ext_clean, cv=kf, scoring=mse_scorer)

# Since we defined MSE with "greater_is_better=False", the returned scores will be negative
# Let's multiply them by -1 to get positive MSE values
mse_original = -1 * mse_original
mse_extended = -1 * mse_extended

# Display the results
print("MSE for original features:", mse_original)
print("Average MSE for original features:", np.mean(mse_original))
print("\n")
print("MSE for extended features:", mse_extended)
print("Average MSE for extended features:", np.mean(mse_extended))


MSE for original features: [0.00062233 0.00038469 0.00344339 0.00059173 0.0054903  0.0003724
 0.00153287 0.00194544 0.00178615 0.00088   ]
Average MSE for original features: 0.0017049294275977987


MSE for extended features: [0.01081526 0.00294804 0.01132812 0.00090612 0.00562376 0.00177416
 0.02145382 0.00277859 0.00119402 0.01137531]
Average MSE for extended features: 0.0070197197383696595


### Explanation: 

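The higher cross-validated MSE of the extended model is likely due to overfitting: it has many more regressors, several of them highly correlated (multicollinearity), for roughly 100 monthly observations, so the coefficient estimates are noisy out of sample.

According to the bias-variance trade-off, for a fixed MSE it is not possible to improve both the bias and the variance: the extra features reduce bias but increase variance.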
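The overfitting effect can be reproduced on synthetic data. The sketch below is a plain NumPy 10-fold cross-validation of least squares, comparing five hypothetical linear factors with an extended set that adds squares and pairwise interactions carrying no signal. The helper `kfold_mse`, the sample size, and the coefficients are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def kfold_mse(X, y, k=10, seed=42):
    """Average out-of-fold MSE of least squares with an intercept."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.standard_normal((n, p))  # hypothetical factors
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.standard_normal(n)

# Squares and pairwise interactions add parameters but no signal in this setup
squares = X ** 2
inter = np.column_stack([X[:, i] * X[:, j] for i in range(p) for j in range(i + 1, p)])
X_ext = np.column_stack([X, squares, inter])

mse_orig = kfold_mse(X, y)
mse_ext = kfold_mse(X_ext, y)
print("CV MSE, original features:", mse_orig)
print("CV MSE, extended features:", mse_ext)
```

With 20 regressors and only about 27 training rows per fold, the extended fit has little data per parameter, which is the same mechanism that can inflate the cross-validated MSE in the notebook.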

### The optimal Elastic Net model using a grid search of the parameter space

In [13]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X_ext_clean, Y_ext_clean, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the hyperparameter grid
param_grid = {
    'alpha': np.logspace(-4, 4, 20),
    'l1_ratio': np.linspace(0, 1, 30)
}

# Initialize ElasticNet
elastic_net = ElasticNet(max_iter=100000, tol=0.1)

# Initialize GridSearchCV with the new elastic_net object and parameter grid
grid_search = GridSearchCV(elastic_net, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train_scaled, Y_train)

# Find the best model and its MSE on the test set
best_en = grid_search.best_estimator_
mse_test = mean_squared_error(Y_test, best_en.predict(X_test_scaled))

print("Best Elastic Net Parameters:", grid_search.best_params_)
print("MSE on Test Set:", mse_test)


Fitting 5 folds for each of 600 candidates, totalling 3000 fits



Best Elastic Net Parameters: {'alpha': 0.03359818286283781, 'l1_ratio': 0.7931034482758621}
MSE on Test Set: 0.0012503274216438026




### Result 

Best Elastic Net Parameters: {'alpha': 0.03359818286283781, 'l1_ratio': 0.7931034482758621}

MSE on Test Set: 0.0012503274216438026

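The test-set MSE (about 0.00125) is close to the earlier in-sample OLS MSE (about 0.00134), so the tuned model offers at most a modest improvement; the two numbers are computed on different samples and are not strictly comparable. Possible reasons why the gain is limited: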

1. Overfitting during Hyperparameter Tuning: Despite using cross-validation, there might be some overfitting to the training set, leading to suboptimal performance on the test set.

2. Bias-Variance Tradeoff: The model with additional features might have lower bias but higher variance, leading to less consistent performance across different datasets.
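One way to limit the first issue is to put the scaler inside a `Pipeline`, so that each cross-validation fold is standardized using only its own training part. The sketch below shows this pattern on hypothetical synthetic data with a deliberately small grid; the step names `scale` and `enet` and the grid values are choices made here, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in data; in the notebook these would be the extended features and returns
X = rng.standard_normal((120, 8))
y = 0.02 * X[:, 0] - 0.01 * X[:, 1] + 0.03 * rng.standard_normal(120)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The scaler is refit on the training part of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=100000)),
])
param_grid = {
    "enet__alpha": np.logspace(-4, 0, 5),
    "enet__l1_ratio": [0.1, 0.5, 0.9],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

test_mse = np.mean((y_test - search.predict(X_test)) ** 2)
print("best params:", search.best_params_)
print("test MSE:", test_mse)
```

Excluding `l1_ratio=0` from the grid also sidesteps the pure-ridge case, for which scikit-learn's coordinate descent solver is known to converge slowly.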