# First Assignment - FINTECH 540 - Machine Learning for FinTech

In this assignment, you will gain hands-on experience applying linear models to financial market data. Specifically, you will work with time series prices of the 30 constituents of the *Dow Jones Industrial Average (DJIA)* Index. The dataset covers the period from June $2^{nd}$, 2017, through June $2^{nd}$, 2023. The price series of the ETF associated with the DJIA index is also provided, whose symbol is *DIA*. The dataset is uploaded on Sakai in the same place where you found this notebook.

You will deal with three consecutive tasks, so in general, you can only perform a task if you have solved the previous one. You can obtain at most 100 points for this home assignment. The tasks are briefly summarized below, and you can find the corresponding prompt in each subsection of this notebook:
- Build descriptive linear models (CAPM) for all the index constituents (*20 points*).
- Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*).
- Repeat the linear modeling exercise using bootstrapped returns (*40 points*).

## About this notebook

You only need to write the final code between the `### START CODE HERE ###` and `### END CODE HERE ###` comments. You can create more cells to experiment with and prepare your final code at your convenience. Remember to put the final version of the code where it is asked. Before submitting, run your notebook from start to end to ensure that there are no runtime errors. Failing to follow these guidelines will result in a point deduction.

## Task 1 - Build descriptive linear models (CAPM) for all the index constituents (*20 points*)

The Capital Asset Pricing Model (CAPM) is represented as:

$$R_i - R_f =   \beta_i (R_m - R_f) + e_i$$

Where:
- $R_i$ is the return of the asset or security $i$.
- $R_f$ is the risk-free rate, representing the return on a risk-free investment.
- $\beta_i$ is the beta of the asset $i$, which measures its sensitivity to market movements.
- $R_m$ is the market portfolio's return (the index).
- $e_i$ is the error term or residual representing unexplained variation in the asset's return.

The CAPM equation helps estimate the return of an asset based on its risk relative to the market and the risk-free rate. You can calculate the daily risk-free rate by using the following formula.

$$ r_{\text{daily}} = \left(1 + r_{\text{annual}}\right)^{\frac{1}{365}} - 1 $$

Where:
- $r_{\text{daily}}$ is the daily yield. It represents the expected daily return on investment.
- $ r_{\text{annual}} $ is the annual yield. It represents the expected annual return on investment.
- The formula assumes daily compounding over a 365-day year. Converting the annual yield this way allows the modeling to be done on daily returns.

For this task, you can use an annual yield of *5.482%* per the annualized U.S. 3-month Treasury Bill yield.
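As a quick check of the conversion above, the following minimal sketch applies the formula to the 5.482% annual yield. The variable names are illustrative and not prescribed by the assignment.

```python
# Convert an annual yield to a daily yield, assuming daily compounding
# over a 365-day year as in the formula above.
annual_yield = 0.05482  # 5.482% annualized 3-month T-Bill yield from the prompt

daily_yield = (1 + annual_yield) ** (1 / 365) - 1

# Compounding the daily yield for 365 days should recover the annual yield.
recovered_annual = (1 + daily_yield) ** 365 - 1

print(f"daily yield: {daily_yield:.8f}")
print(f"recovered annual yield: {recovered_annual:.5f}")
```

The round trip back to the annual yield is a cheap sanity check that the exponent was applied in the right direction.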

To solve this part of the homework, you have to:
- Compute the daily yield from the annualized yield provided in the prompt.
- Prepare the data to fit the CAPM for each company in the DJIA index described above.
- Fit the CAPM for each company and check the estimated sensitivity to market movements.
- Select a subset of stocks sensitive to market movements between 0.85 and 1.15. Before including a symbol, ensure the estimated sensitivity is statistically significant. Store the symbols in a Python list before moving to the next task.
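The fitting and selection steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the returns are simulated, the 5% significance level is an assumed choice (the prompt does not fix one), and the large-sample cutoff $|t| > 1.96$ stands in for an exact p-value. It uses plain NumPy so that it is self-contained; a regression library would report the same slope.

```python
import numpy as np

# Synthetic daily excess returns (illustrative only, not the DJIA data).
rng = np.random.default_rng(0)
n_days = 1000
true_beta = 1.05
market_excess = rng.normal(0.0, 0.01, n_days)
asset_excess = true_beta * market_excess + rng.normal(0.0, 0.005, n_days)

# OLS slope of asset excess returns on market excess returns (with intercept).
x_centered = market_excess - market_excess.mean()
y_centered = asset_excess - asset_excess.mean()
beta_hat = (x_centered @ y_centered) / (x_centered @ x_centered)
alpha_hat = asset_excess.mean() - beta_hat * market_excess.mean()

# Standard error and t-statistic of the slope.
residuals = asset_excess - (alpha_hat + beta_hat * market_excess)
residual_var = (residuals @ residuals) / (n_days - 2)
beta_se = np.sqrt(residual_var / (x_centered @ x_centered))
t_stat = beta_hat / beta_se

# Selection rule from the task: beta in [0.85, 1.15] and significant.
# |t| > 1.96 approximates a two-sided 5% test for a large sample (assumption).
is_selected = (0.85 <= beta_hat <= 1.15) and (abs(t_stat) > 1.96)
print(f"beta={beta_hat:.3f}, t={t_stat:.1f}, selected={is_selected}")
```

Repeating this per constituent and collecting the symbols for which `is_selected` is true gives the list required by the last step.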

Before performing the CAPM modeling, remember to split the dataset into a training set and a test set and use only the training set to perform Task 1. Use *2022-01-01* as a cutoff date. Ensure the cutoff date is included in the test set and not in the train set.
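A minimal sketch of the date-based split, assuming the returns live in a pandas DataFrame indexed by date. The toy values and the variable names are illustrative only.

```python
import pandas as pd

# Toy daily return series around the cutoff (illustrative values only).
# 2022-01-01 is included here purely to show which side of the split it lands on.
dates = pd.to_datetime(["2021-12-30", "2021-12-31", "2022-01-01", "2022-01-03"])
returns = pd.DataFrame({"DIA": [0.001, -0.002, 0.003, 0.004]}, index=dates)

cutoff = pd.Timestamp("2022-01-01")

# Strict inequality keeps the cutoff date out of the training set.
train = returns[returns.index < cutoff]
test = returns[returns.index >= cutoff]

print(train.index.max(), test.index.min())
```

Using `<` for the training set and `>=` for the test set guarantees that the two sets do not overlap and that the cutoff date belongs to the test set.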

**Motivation behind the task**

Fitting individual CAPM models allows for a detailed assessment of each stock's risk profile. CAPM provides a systematic way to quantify the sensitivity of each stock's returns to market movements, as measured by the beta coefficient. This individual assessment is valuable because different stocks may exhibit varying levels of market sensitivity.

Selecting stocks based on their beta values is usually a risk-based approach to portfolio construction. By choosing stocks with higher (lower) beta values, you are essentially selecting those that tend to exhibit greater (lower) price volatility in response to market fluctuations. This can be seen as a deliberate strategy to include riskier (safer) assets in the portfolio.

This task will set the basis for selecting a subset of index constituents to be used for a predictive model. 

**Grading Criteria**

- **Data Preparation (10 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **CAPM Model Fitting (10 points)**: Points will be awarded based on the correctness and completeness of the CAPM models, including accurate significance evaluation and the subset of stock selection based on the beta estimations.

In [None]:
### START CODE HERE ###

In [41]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from datetime import datetime
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [42]:
file_path = '/Users/apple/Desktop/F540 HW1/dows_daily.csv'
data = pd.read_csv(file_path)

# df
data['Date'] = pd.to_datetime(data['Date'])

print(data.head())

        Date     DIA    GS.N  NKE.N  CSCO.OQ  JPM.N   DIS.N  INTC.OQ  \
0 2017-06-02  211.91  213.31  52.98    31.98  82.64  107.18    36.32   
1 2017-06-05  211.86  213.99  53.01    31.76  82.79  106.52    36.34   
2 2017-06-06  211.37  214.53  52.48    31.56  82.96  105.50    36.13   
3 2017-06-07  211.72  215.78  53.23    31.61  83.91  105.92    36.26   
4 2017-06-08  211.86  218.76  53.20    31.61  84.95  104.32    36.48   

       MRK.N   CVX.N  ...   PG.N       IBM.N   MMM.N  AAPL.OQ  WMT.N   CAT.N  \
0  62.427347  103.11  ...  88.59  145.232686  206.70  38.8625  79.62  105.95   
1  62.045937  103.19  ...  88.74  145.576545  206.22  38.4825  80.26  105.20   
2  61.664526  104.17  ...  88.80  145.538339  205.41  38.6125  78.93  104.55   
3  61.082876  103.77  ...  88.77  144.210661  205.01  38.8425  79.15  103.51   
4  60.262843  104.00  ...  87.85  145.280444  205.94  38.7475  78.93  105.01   

   AMGN.OQ    V.N   TRV.N    BA.N  
0   159.15  96.15  125.15  190.23  
1   160.22  96

In [35]:
# Ri
returns = data.set_index('Date').pct_change().dropna()

# Rm
market_returns = returns['DIA']

# Rf
annual_risk_free_rate = 0.05482  
daily_risk_free_rate = (1 + annual_risk_free_rate) ** (1 / 365) - 1

# train & test dataset
cutoff_date = datetime(2022, 1, 1)
train_returns = returns[returns.index < cutoff_date]  
test_returns = returns[returns.index >= cutoff_date] 
train_market_returns = market_returns[market_returns.index < cutoff_date] 


In [37]:
# Build the CAPM models
results = []

for stock in train_returns.columns:
    if stock == 'DIA':  # skip the market portfolio itself
        continue

    # Asset excess return
    y = train_returns[stock] - daily_risk_free_rate
    # Market excess return
    x = train_market_returns - daily_risk_free_rate
    x = sm.add_constant(x)  # add the intercept term

    # Fit the CAPM model
    model = sm.OLS(y, x).fit()

    # Extract beta and the intercept
    intercept = model.params['const']
    beta = model.params['DIA']

    # Check the significance of beta
    p_value = model.pvalues['DIA']
    is_selected = 0.85 <= beta <= 1.15 and p_value < 0.05

    # Store the results
    results.append({
        "Stock": stock,
        "Beta": beta,
        "Intercept": intercept,
        "P-Value": p_value,
        "Selected": "Yes" if is_selected else "No"
    })

results_df = pd.DataFrame(results)

print(results_df)

      Stock      Beta  Intercept        P-Value Selected
0      GS.N  1.253873   0.000055  5.263803e-268       No
1     NKE.N  0.943454   0.000639  7.964746e-143      Yes
2   CSCO.OQ  0.987662   0.000198  1.652478e-199      Yes
3     JPM.N  1.205581   0.000115  9.404262e-272       No
4     DIS.N  0.986073  -0.000051  9.599901e-159      Yes
5   INTC.OQ  1.117561  -0.000041  3.742641e-138      Yes
6     MRK.N  0.614409  -0.000112   1.395134e-92       No
7     CVX.N  1.223480  -0.000297  1.908209e-203       No
8     AXP.N  1.380397   0.000176  8.425121e-264       No
9      VZ.N  0.477046  -0.000170   3.647262e-79       No
10     HD.N  0.986357   0.000448  8.666819e-226      Yes
11   WBA.OQ  0.831931  -0.000676   1.018870e-84       No
12    MCD.N  0.797540   0.000124  2.879080e-167       No
13    UNH.N  1.019555   0.000502  9.856521e-187      Yes
14     KO.N  0.677765  -0.000114  5.770224e-154       No
15    JNJ.N  0.630239  -0.000081  5.761217e-131       No
16  MSFT.OQ  1.035874   0.00093

In [None]:
### END CODE HERE ###

## Task 2 - Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*)

In this task, you will apply linear predictive modeling techniques to forecast the value of the DIA ETF on the DJIA index using the subset of its constituents you selected in the previous task. The goal is to build a predictive linear model that accurately estimates the future index return based on the historical data of selected constituent stocks. Note that to perform this predictive task, you have to prepare the data accordingly. Don't use the excess returns with respect to a daily risk-free rate for this task, but use the plain returns instead.

The predictive linear regression equation to estimate the dependent variable $Y$ at time $t+1$ is represented as:

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

In this equation:

- $Y_{t+1}$ represents the dependent variable at time $t+1$ that we want to predict. Note that the dependent variable is real-valued.
- $\beta_0$ is the intercept or constant term.
- $\beta_1, \beta_2, \ldots, \beta_k$ are the $k$ coefficients for the independent variables $X_{1,t}, X_{2,t}, \ldots, X_{k,t}$ at time $t$. You can assume $k$ to be the number of selected stocks from the previous task. Note that the regressors are real-valued.
- $\varepsilon_{t}$ represents the error term at time $t$, capturing unexplained variation or noise in the dependent variable at that specific time.
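The time alignment in the equation above, with predictors at time $t$ and the target at time $t+1$, can be sketched as follows. The tickers `STOCK_A` and `STOCK_B`, the values, and the column name `DIA_next` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy returns for two hypothetical constituents and the index (illustrative).
dates = pd.date_range("2021-01-04", periods=5, freq="B")
returns = pd.DataFrame(
    {
        "STOCK_A": [0.010, -0.020, 0.005, 0.000, 0.015],
        "STOCK_B": [0.000, 0.010, -0.010, 0.020, -0.005],
        "DIA": [0.004, -0.006, 0.001, 0.009, 0.003],
    },
    index=dates,
)

# Target: the index return one step ahead, stored on the row of day t.
target = returns["DIA"].shift(-1).rename("DIA_next")

# Predictors at time t; the last row has no next-day target and is dropped.
aligned = returns[["STOCK_A", "STOCK_B"]].join(target).dropna()

X = aligned[["STOCK_A", "STOCK_B"]]
y = aligned["DIA_next"]
print(aligned)
```

After the shift, each row pairs the constituents' returns on day $t$ with the index return on day $t+1$, so the final day of the sample cannot be used as a training example.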

Before performing the linear regression modeling, remember to split the dataset into a training set and a test set. Use *2022-01-01* as a cutoff date, the same way you did in the previous task. Make sure the cutoff date will be included in the test set and not in the train set.

Assess the performance of your predictive model using an appropriate evaluation metric for a regression problem like this one. Evaluate the model on the test set to ensure its predictive accuracy out-of-sample.
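Common regression metrics for this kind of problem can be computed directly from the realized and predicted values. The sketch below uses toy numbers; the choice of MSE, MAE, and $R^2$ is one reasonable option, not a requirement stated by the prompt.

```python
import numpy as np

# Toy realized and predicted next-day returns (illustrative values only).
y_true = np.array([0.004, -0.006, 0.001, 0.009, -0.003])
y_pred = np.array([0.002, -0.001, 0.000, 0.004, -0.002])

errors = y_true - y_pred
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))

# R^2 relative to predicting the sample mean of y_true; it can be negative
# when the model does worse than that baseline.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.3e}, MAE={mae:.4f}, R2={r2:.3f}")
```

An out-of-sample $R^2$ near zero or below zero is common for daily return forecasts, so it helps to read it alongside the error magnitudes rather than in isolation.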

**Grading Criteria**

- **Data Preparation (15 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (20 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' returns and the index return.

- **Model Evaluation (5 points)**: Points will be awarded based on the proper choice of evaluation metric.

In [None]:
### START CODE HERE ###

In [44]:
# Select the stocks chosen in Task 1
selected_stocks = results_df[results_df['Selected'] == 'Yes']['Stock'].tolist()

# Compute daily returns
returns = data.set_index('Date').pct_change().dropna()

# Add the next-day DIA return as the target variable
returns['DIA_next'] = returns['DIA'].shift(-1)

# Drop rows with missing values
returns = returns.dropna()

# Step 2: split into training and test sets
cutoff_date = datetime(2022, 1, 1)
train_data = returns[returns.index < cutoff_date]
test_data = returns[returns.index >= cutoff_date]

# Predictors (returns of the selected stocks) and target (next-day DIA return)
X_train = train_data[selected_stocks]
y_train = train_data['DIA_next']
X_test = test_data[selected_stocks]
y_test = test_data['DIA_next']

# Step 3: add the intercept term and build the regression model
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the OLS regression model
model = sm.OLS(y_train, X_train).fit()

# Step 4: predict and evaluate the model
y_pred = model.predict(X_test)

# Measure model performance with regression metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Regression Summary:\n", model.summary())
print("\nModel Performance:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R²): {r2}")

Regression Summary:
                             OLS Regression Results                            
Dep. Variable:               DIA_next   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     5.981
Date:                Sun, 26 Jan 2025   Prob (F-statistic):           3.84e-12
Time:                        17:59:54   Log-Likelihood:                 3407.0
No. Observations:                1154   AIC:                            -6782.
Df Residuals:                    1138   BIC:                            -6701.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0007      0.00

In [None]:
### END CODE HERE ###

## Task 3 - Augment the Dataset with Bootstrapped Alphas and Refit the Linear Predictive Models (*40 points*)

In this task, we explore the concept of bootstrapped alphas and their role in predictive modeling. Bootstrapped alphas are used as proxy trading signals for real alphas that can be practically obtained. These signals are correlated with future returns and can play the role of good predictors in the predictive modeling process. For this task, do not use excess returns with respect to a daily risk-free rate; use the plain returns instead when calculating the bootstrapped alphas.

We define bootstrapped alphas $\alpha_{i,t}$ as per the formula below:

$$\alpha_{i,t} := \rho_{\text{boot}} r_{i,t+1} + \sqrt{1 - \rho_{\text{boot}}^{2}} z_{i,t}$$

where:
- $r_{i,t+1}$ represents the next period return of the traded security $i$, which is given to you.
- $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ is a randomly drawn scalar associated with each company $i$, which is not given and which you have to sample. When sampling, ensure that the sampled vectors are independent of each other, since you have to draw samples for each company you will use as a regressor. The set of companies is the same one you used in the previous task, selected by fitting the CAPM models in Task 1.
- $\sigma^{2}_{i}$ is an estimate of the true conditional variance of the security $i$, which you have to calculate based on the given returns. Note that you have to calculate those variances on the train set only. Use the same cutoff applied in the previous task to define what the training set is.
- $\rho_{\text{boot}} \in [-1,1]$ is a correlation coefficient, which you have to set equal to 0.25.

In this setting, the parameter $\rho_{\text{boot}}$ artificially regulates the strength of the trading signal you create. We remark that regressing the bootstrapped alpha $\alpha_{i,t}$ on the future returns $r_{i,t+1}$ results in an $R^2$ equal to $\rho_{\text{boot}}^2$.
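To see why the $R^2$ takes this value, here is a short derivation. It assumes that $z_{i,t}$ is independent of $r_{i,t+1}$ and that both have the same variance $\sigma^{2}_{i}$:

$$
\begin{aligned}
\operatorname{Var}(\alpha_{i,t}) &= \rho_{\text{boot}}^{2}\,\sigma^{2}_{i} + \left(1 - \rho_{\text{boot}}^{2}\right)\sigma^{2}_{i} = \sigma^{2}_{i}, \\
\operatorname{Cov}(\alpha_{i,t}, r_{i,t+1}) &= \rho_{\text{boot}}\,\sigma^{2}_{i}, \\
\operatorname{Corr}(\alpha_{i,t}, r_{i,t+1}) &= \frac{\rho_{\text{boot}}\,\sigma^{2}_{i}}{\sigma_{i}\,\sigma_{i}} = \rho_{\text{boot}}.
\end{aligned}
$$

In a univariate regression with an intercept, the $R^2$ equals the squared correlation between the two variables, so with a correlation coefficient of 0.25 the expected $R^2$ is $0.25^2 = 0.0625$.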

The equation above formalizes the calculation of the bootstrapped alpha for a single security, while you will have more than one security. Try to make your calculations as efficient as possible by computing them simultaneously, which is possible using operations between pandas DataFrames. Remember that $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ can be calculated as $z_{i,t} = \sqrt{\sigma^{2}_{i}}u_{i,t}$ where $u_{i,t} \sim \mathcal{N}(0,1)$. 
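One way to compute all alphas at once is sketched below on synthetic returns. The tickers `AAA`, `BBB`, `CCC`, the sample size, and the volatilities are hypothetical. For brevity, the standard deviations are estimated on the whole synthetic sample, whereas the assignment requires the training rows only.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed required by the assignment
rho_boot = 0.25

# Synthetic returns for three hypothetical tickers (illustrative only).
n_days = 20000
tickers = ["AAA", "BBB", "CCC"]
returns = pd.DataFrame(
    np.random.standard_normal((n_days, len(tickers))) * [0.010, 0.015, 0.020],
    columns=tickers,
)

# Per-stock standard deviation; in the assignment this would come from
# the training rows only.
sigma = returns.std()

# u ~ N(0, 1) drawn in one call, then scaled column-wise so z ~ N(0, sigma_i^2).
u = pd.DataFrame(
    np.random.standard_normal(returns.shape),
    index=returns.index,
    columns=returns.columns,
)
z = u * sigma

# alpha_{i,t} = rho * r_{i,t+1} + sqrt(1 - rho^2) * z_{i,t}, all columns at once.
next_returns = returns.shift(-1)
alphas = (rho_boot * next_returns + np.sqrt(1 - rho_boot ** 2) * z).dropna()

# Empirical correlation between each alpha and the next-period return.
corr = alphas.corrwith(next_returns.loc[alphas.index])
print(corr.round(3))
```

Multiplying the DataFrame `u` by the Series `sigma` aligns on column labels, so each column is scaled by its own standard deviation without an explicit loop. The printed correlations should land near `rho_boot` for a sample this large.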

Once you calculate the bootstrapped alphas, repeat the linear predictive forecasting exercise as in the previous task. This time you will use the bootstrapped alphas as predictors, while you will keep the same target as before, the index returns. In other words, the target stays the same as in the previous task (future returns for DIA), as the equation below shows. The predictors, however, change from the current returns of the constituents to the bootstrapped alphas you have calculated.

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

To ensure reproducibility, please set the random seed to 42. Don't use another seed, and remember to set it. Failing to follow these guidelines will result in point deductions.

**Motivation behind the task**

In the dynamic and complex world of financial markets, predictive modeling is a potent tool to decipher underlying patterns and trends that govern security prices. Coming up with good predictors for a certain set of assets is a complicated task that is not necessarily the purpose of this assignment. The concept of bootstrapped alphas, as delineated in this exercise, emerges as a sophisticated method to engineer artificial trading signals that can potentially enhance the predictive power of financial models. It is equivalent to assuming that we have a way to predict the future returns of the index constituents. Look at the bootstrapped alpha equation, and in particular at the time index of the return term, to understand why we are talking about future returns.

The utilization of bootstrapped alphas is grounded in the mathematical formulation provided, where the alpha ($\alpha_{i,t}$) for a security $i$ at time $t$ is constructed using a combination of the next period return of the security ($r_{i,t+1}$) and a stochastic component ($z_{i,t}$) drawn from a normal distribution. This formulation allows for the incorporation of both deterministic and random elements, thereby mimicking the inherent uncertainty and volatility observed in financial markets.

By setting the correlation coefficient ($\rho_{\text{boot}}$) to 0.25, we are essentially moderating the influence of the artificial trading signal, ensuring that it does not overwhelmingly dictate the behavior of the bootstrapped alphas. This parameter, therefore, serves as a tuning knob, allowing us to control the strength of the trading signal and, consequently, its predictive power. However, you have to keep this parameter fixed for this exercise, as indicated by the prompt.

The subsequent step of employing these bootstrapped alphas as predictors in a linear predictive forecasting model is an exercise to highlight how well one can expect to forecast index returns, given a good way to predict future returns for the constituents. By replacing the current returns of the constituents with the calculated bootstrapped alphas, we are essentially enhancing the model with artificially generated yet statistically grounded signals that can potentially unveil deeper insights into the market dynamics.

**Grading Criteria**

- **Data Preparation (30 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (5 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' bootstrapped alpha and the index return.
- **Model Evaluation (5 points)**: Points will be awarded based on the proper choice of evaluation metric.

In [None]:
### START CODE HERE ###

In [45]:
selected_stocks = results_df[results_df['Selected'] == 'Yes']['Stock'].tolist()

# Step 2: compute daily returns
returns = data.set_index('Date').pct_change().dropna()

# Add the next-day DIA return as the target variable
returns['DIA_next'] = returns['DIA'].shift(-1)
returns = returns.dropna()

# Split into training and test sets
cutoff_date = datetime(2022, 1, 1)
train_data = returns[returns.index < cutoff_date]
test_data = returns[returns.index >= cutoff_date]

In [46]:
# Step 3: compute the bootstrapped alphas
# Set the parameters
rho_boot = 0.25  # correlation coefficient
np.random.seed(42)  # fix the random seed

# Per-stock variance estimated on the training set
variances = train_data[selected_stocks].var()  # one variance per stock
std_devs = np.sqrt(variances)  # standard deviations

# Draw z_{i,t} (normal noise) for the training set
random_z = pd.DataFrame(
    {stock: np.random.normal(0, std_devs[stock], len(train_data)) for stock in selected_stocks},
    index=train_data.index
)

# Compute the bootstrapped alphas
bootstrapped_alphas = pd.DataFrame(index=train_data.index)
for stock in selected_stocks:
    bootstrapped_alphas[stock] = (
        rho_boot * train_data[stock].shift(-1) +  # r_{i,t+1}
        np.sqrt(1 - rho_boot**2) * random_z[stock]  # z_{i,t}
    )

# Drop missing values (shift(-1) leaves NaN in the last row)
bootstrapped_alphas = bootstrapped_alphas.dropna()

In [47]:
# Step 4: build the new linear predictive model
# Target variable (next-day DIA return)
y_train = train_data['DIA_next'].loc[bootstrapped_alphas.index]

# Add the intercept term
X_train = sm.add_constant(bootstrapped_alphas)

# Fit the linear model on the bootstrapped alphas
model = sm.OLS(y_train, X_train).fit()

# Recompute the bootstrapped alphas on the test set
test_random_z = pd.DataFrame(
    {stock: np.random.normal(0, std_devs[stock], len(test_data)) for stock in selected_stocks},
    index=test_data.index
)

test_bootstrapped_alphas = pd.DataFrame(index=test_data.index)
for stock in selected_stocks:
    test_bootstrapped_alphas[stock] = (
        rho_boot * test_data[stock].shift(-1) +
        np.sqrt(1 - rho_boot**2) * test_random_z[stock]
    )

test_bootstrapped_alphas = test_bootstrapped_alphas.dropna()

# Matching target variable for the test set
y_test = test_data['DIA_next'].loc[test_bootstrapped_alphas.index]
X_test = sm.add_constant(test_bootstrapped_alphas)

In [49]:
# Step 5: predict and evaluate the model
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Regression Summary:\n", model.summary())
print("\nModel Performance:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R²): {r2}")

Regression Summary:
                             OLS Regression Results                            
Dep. Variable:               DIA_next   R-squared:                       0.308
Model:                            OLS   Adj. R-squared:                  0.299
Method:                 Least Squares   F-statistic:                     33.80
Date:                Sun, 26 Jan 2025   Prob (F-statistic):           2.04e-80
Time:                        18:18:38   Log-Likelihood:                 3572.4
No. Observations:                1153   AIC:                            -7113.
Df Residuals:                    1137   BIC:                            -7032.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0002      0.00

In [None]:
### END CODE HERE ###