
# **<center>Machine Learning and Finance </center>**


## <center> CourseWork 2024 - StatArb </center>



In this coursework, you will delve into and replicate selected elements of the research detailed in the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**. **However, we will not reproduce the entire study.**

## Overview

This study redefines Statistical Arbitrage (StatArb) by combining Autoencoder architectures and policy learning to generate trading strategies. Traditionally, StatArb involves finding the mean of a synthetic asset through classical or PCA-based methods before developing a mean reversion strategy. However, this paper proposes a data-driven approach using an Autoencoder trained on US stock returns, integrated into a neural network representing portfolio trading policies to output portfolio allocations directly.


## Coursework Goal

This coursework will replicate these results, providing hands-on experience in implementing and evaluating this innovative end-to-end policy learning Autoencoder within financial trading strategies.

## Outline

- [Data Preparation and Exploration](#Data-Preparation-and-Exploration)
- [Fama French Analysis](#Fama-French-Analysis)
- [PCA Analysis](#PCA-Analysis)
- [Ornstein Uhlenbeck](#Ornstein-Uhlenbeck)
- [Autoencoder Analysis](#Autoencoder-Analysis)



**Description:**
The coursework is graded on a 100-point scale and is divided into five parts. Below is the mark distribution for each question:

| **Problem**  | **Question**          | **Number of Marks** |
|--------------|-----------------------|---------------------|
| **Part A**   | Question 1            | 4                   |
|              | Question 2            | 1                   |
|              | Question 3            | 3                   |
|              | Question 4            | 3                   |
|              | Question 5            | 1                   |
|              | Question 6            | 3                   |
|**Part  B**    | Question 7           | 1                   |
|              | Question 8            | 5                   |
|              | Question 9            | 4                   |
|              | Question 10           | 5                   |
|              | Question 11           | 2                   |
|              | Question 12           | 3                   |
|**Part  C**    | Question 13          | 3                   |
|              | Question 14           | 1                   |
|              | Question 15           | 3                   |
|              | Question 16           | 2                   |
|              | Question 17           | 7                   |
|              | Question 18           | 6                   |
|              | Question 19           | 3                   |
|  **Part  D** | Question 20           | 3                   |
|              | Question 21           | 5                   |
|              | Question 22           | 2                   |
|  **Part  E** | Question 23           | 2                   |
|              | Question 24           | 1                   |
|              | Question 25           | 3                   |
|              | Question 26           | 10                  |
|              | Question 27           | 1                   |
|              | Question 28           | 3                   |
|              | Question 29           | 3                   |
|              | Question 30           | 7                   |




Please read the questions carefully and do your best. Good luck!

## Objectives



## 1. Data Preparation and Exploration
Collect, clean, and prepare US stock return data for analysis.

## 2. Fama French Analysis
Utilize Fama French Factors to isolate the idiosyncratic components of stock returns, differentiating them from market-wide effects. This analysis helps in understanding the unique characteristics of individual stocks relative to broader market trends.

## 3. PCA Analysis
Employ Principal Component Analysis (PCA) to identify hidden structures and reduce dimensionality in the data. This method helps in extracting significant patterns that might be obscured in high-dimensional datasets.

## 4. Ornstein-Uhlenbeck Process
Analyze mean-reverting behavior in stock prices using the Ornstein-Uhlenbeck process. This stochastic process is useful for modeling and forecasting based on the assumption that prices will revert to a long-term mean.

## 5. Building a Basic Autoencoder Model
Construct and train a standard Autoencoder to extract residual idiosyncratic risk.








# Data Preparation and Exploration


---
<font color=green>Q1: (4 Marks)</font>
<br><font color='green'>
Write a Python function that accepts a URL parameter and retrieves the NASDAQ-100 companies and their ticker symbols by scraping the relevant Wikipedia page using **[Requests](https://pypi.org/project/requests/)** and **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**. Your function should return the data as a list of tuples, with each tuple containing the company name and its ticker symbol. Then, call your function with the appropriate Wikipedia page URL and print the data in a 'Company: Ticker' format.

</font>

---


In [None]:
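import requests
from bs4 import BeautifulSoup

def extract_tickers(link, table_id, num_cols):
  """Return the first `num_cols` cells of each row of the table with id `table_id`."""
  page = requests.get(link)  # the page is public, so no authentication is passed
  soup = BeautifulSoup(page.text, 'html.parser')
  table = soup.find(id=table_id)
  data = []
  for row in table.find_all('tr')[1:]:  # skip the header row
    cols = row.find_all('td')[:num_cols]
    if not cols:
        cols = row.find_all('th')[:num_cols]
    # One tuple per row, e.g. (company, ticker)
    data.append(tuple(ele.text.strip() for ele in cols))
  return data


# Assumption: the first two columns of the table are the company name and the ticker, in that order
data = extract_tickers(link='https://en.wikipedia.org/wiki/Nasdaq-100', table_id="constituents", num_cols=2)

for company, ticker in data:
  print(f"{company}: {ticker}")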


---
<font color=green>Q2: (1 Mark)</font>
<br><font color='green'>
Given a list of tuples representing NASDAQ-100 companies (where each tuple contains a company name and its ticker symbol), write a Python script to extract all ticker symbols into a separate list called `tickers_list`.
</font>

---


In [None]:
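# Collect the ticker symbol (second element) of every (company, ticker) pair
tickers_list = [item[1] for item in data]
ticker_list = tickers_list  # alias: the following cells refer to `ticker_list`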

---
<font color=green>Q3: (3 Marks)</font>
<br><font color='green'>
Using the **[yfinance](https://pypi.org/project/yfinance/)** library, write a Python script that accepts a list of stock ticker symbols. For each symbol, download the adjusted closing price data, store it in a dictionary with the ticker symbol as the key, and then convert the final dictionary into a Pandas DataFrame. Handle any errors encountered during data retrieval by printing a message indicating which symbol failed.
</font>

---

In [None]:
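import pandas as pd
import yfinance as yf

stock_data = {}
for ticker in ticker_list:
    try:
        # With auto_adjust=True the 'Close' column already holds adjusted prices
        hist = yf.Ticker(ticker).history(period="20y", auto_adjust=True)
    except Exception as exc:
        print(f"Download failed for {ticker}: {exc}")
        continue
    if len(hist) == 0:
        print(ticker, " ticker not found")
    elif 'Close' in hist.columns:
        stock_data[ticker] = hist['Close']
    else:
        print("Adj Close price data not available for", ticker)

# Convert the dictionary into a DataFrame (one column per ticker)
stock_df = pd.DataFrame(stock_data)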






---
<font color=green>Q4: (3 Marks)</font>
<br><font color='green'>
Write a Python script to analyze stock data stored in a dictionary `stock_data` (where each key is a stock ticker symbol, and each value is a Pandas Series of adjusted closing prices). The script should:
1. Convert the dictionary into a DataFrame.
2. Calculate the daily returns for each stock.
3. Identify columns (ticker symbols) with at least 2000 non-NaN values in their daily returns.
4. Create a new DataFrame that only includes these filtered ticker symbols.
5. Remove any remaining rows with NaN values in this new DataFrame.
</font>

---

In [None]:
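import pandas as pd

min_valid_returns = 2000  # minimum number of non-NaN daily returns required per ticker

# 1. Convert the dictionary into a DataFrame (one column per ticker)
stock_df = pd.DataFrame(stock_data)
# 2. Daily returns for each stock
daily_returns = stock_df.pct_change()
# 3. Tickers with at least `min_valid_returns` non-NaN daily returns
valid_tickers = daily_returns.columns[daily_returns.notna().sum() >= min_valid_returns]
# 4. and 5. Keep only those tickers, then drop every row that still contains a NaN
new_df = daily_returns[valid_tickers].dropna()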





---
<font color=green>Q5: (1 Mark)</font>
<br><font color='green'>
Download the dataset named `df_filtered_nasdaq_100` from the GitHub repository of the course.
</font>

---

In [None]:
url = 'https://raw.githubusercontent.com/Jandsy/ml_finance_imperial/main/Coursework/df_filtered_nasdaq_100.csv'

df_filtered = pd.read_csv(url, index_col =0)

df_filtered.head()

| Date | ADBE | ADP | GOOGL | GOOG | AMZN | AMD | AEP | AMGN | ADI | ANSS | ... | TTWO | TMUS | TSLA | TXN | VRSK | VRTX | WBA | WBD | WDAY | XEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-12-10 | -0.006699 | 0.006004 | -0.003292 | -0.002861 | -0.003715 | 0.042553 | -0.024356 | 0.011274 | 0.008826 | 0.004968 | ... | -0.004777 | 0.007485 | 0.011358 | 0.002996 | -0.002616 | 0.010279 | 0.00096 | -0.002109 | 0.005037 | -0.013889 |
| 2015-12-11 | 0.027653 | -0.024924 | -0.012657 | -0.01413 | -0.033473 | -0.036735 | -0.005831 | -0.028308 | -0.00385 | -0.013951 | ... | -0.011575 | -0.009356 | -0.044259 | -0.012648 | -0.015996 | -0.034709 | -0.020499 | -0.035928 | -0.05763 | 0.004311 |
| 2015-12-14 | 0.020127 | 0.015241 | 0.016151 | 0.012045 | 0.027744 | -0.008475 | -0.000366 | 0.019078 | -0.001932 | 0.003899 | ... | 0.005141 | 0.014444 | 0.007188 | 0.000178 | 0.01439 | -0.016491 | 0.010402 | -0.033613 | 0.000253 | 0.010017 |
| 2015-12-15 | 0.008149 | 0.010283 | -0.003213 | -0.005844 | 0.00111 | 0.008547 | 0.027503 | 0.028525 | -0.004752 | 0.009544 | ... | 0.023018 | 0.044907 | 0.011483 | 0.023657 | 0.013923 | 0.013656 | -0.00545 | 0.002646 | -0.00038 | 0.007934 |
| 2015-12-16 | 0.01638 | 0.008541 | 0.021708 | 0.019761 | 0.026008 | 0.076271 | 0.017309 | 0.012053 | 0.01733 | 0.008354 | ... | 0.006389 | 0.025419 | 0.060699 | 0.009036 | 0.021117 | 0.010658 | 0.031664 | 0.024887 | 0.029378 | 0.023897 |


---
<font color=green>Q6: (3 Marks) </font>
<br><font color='green'>
Conduct an in-depth analysis of the `df_filtered_nasdaq_100` dataset from GitHub. Answer the following questions:
- Which stock had the best performance over the entire period?
- What is the average daily return of 'AAPL'?
- What is the worst daily return? Provide the stock name and the date it occurred.
</font>

---

In [None]:

# Which stock had the best performance over the entire period?
best_performance = -float('inf')
best_performance_ticker = None

for ticker in df_filtered.columns:
    column = df_filtered[ticker]
    # Calculate cumulative return
    cumulative_return = (1 + column).prod() - 1  # Equivalent to multiplying all (1 + daily returns) and then subtracting 1
    # Convert to percentage
    cumulative_return_percent = cumulative_return * 100
    # Update the best performance and ticker if current performance is better
    if cumulative_return_percent > best_performance:
        best_performance = cumulative_return_percent
        best_performance_ticker = ticker

print("The best performance over the entire period is", best_performance_ticker," with return of ", best_performance )


# What is the average daily return of 'AAPL'?
print("The Average Daily Return of 'AAPL' is ", df_filtered["AAPL"].mean())



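# What is the worst daily return? Provide the stock name and the date it occurred.
worst_ticker = df_filtered.min().idxmin()        # column that holds the lowest single-day return
worst_date = df_filtered[worst_ticker].idxmin()  # date on which that return occurred
print("The worst daily return is", df_filtered.loc[worst_date, worst_ticker], "for", worst_ticker, "on", worst_date)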

The best performance over the entire period is NVDA  with return of  11158.842785050534
The Average Daily Return of 'AAPL' is  0.0010849409515136207
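
The compounding used for the best-performance question can be checked on a toy example. The sketch below uses made-up returns for two hypothetical tickers (`AAA`, `BBB`), not the course dataset, and computes the cumulative return of every column at once instead of looping.

```python
import pandas as pd

# Made-up daily returns for two hypothetical tickers (illustration only)
toy = pd.DataFrame(
    {"AAA": [0.10, -0.05, 0.02], "BBB": [0.01, 0.03, -0.20]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# Compounded return per column: multiply all (1 + r_t), then subtract 1
cumulative = (1 + toy).prod() - 1
best_ticker = cumulative.idxmax()

# For AAA: 1.10 * 0.95 * 1.02 - 1 = 0.0659
print(best_ticker, round(cumulative[best_ticker], 4))
```

Daily returns are compounded rather than summed, which is why the product of `(1 + r)` is used here.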


# Fama French Analysis

The Fama-French five-factor model is an extension of the classic three-factor model used in finance to describe stock returns. It is designed to better capture the risk associated with stocks and explain differences in returns. This model includes the following factors:

1. **Market Risk (MKT)**: The excess return of the market over the risk-free rate. It captures the overall market's premium.
2. **Size (SMB, "Small Minus Big")**: The performance of small-cap stocks relative to large-cap stocks.
3. **Value (HML, "High Minus Low")**: The performance of stocks with high book-to-market values relative to those with low book-to-market values.
4. **Profitability (RMW, "Robust Minus Weak")**: The difference in returns between companies with robust (high) and weak (low) profitability.
5. **Investment (CMA, "Conservative Minus Aggressive")**: The difference in returns between companies that invest conservatively and those that invest aggressively.

## Additional Factor

6. **Momentum (MOM)**: This factor represents the tendency of stocks that have performed well in the past to continue performing well, and the reverse for stocks that have performed poorly.

### Mathematical Representation

The return of a stock $R_i^t$ at time $t$ can be modeled as follows:

$$
R_i^t - R_f^t = \alpha_i^t + \beta_{i,MKT}^t(R_M^t - R_f^t) + \beta_{i,SMB}^t \cdot SMB^t + \beta_{i,HML}^t \cdot HML^t + \beta_{i,RMW}^t \cdot RMW^t + \beta_{i,CMA}^t \cdot CMA^t + \beta_{i,MOM}^t \cdot MOM^t + \epsilon_i^t
$$

Where:
- $R_i^t$ is the return of stock $i$ at time $t$
- $R_f^t$ is the risk-free rate at time $t$
- $R_M^t$ is the market return at time $t$
- $\alpha_i^t$ is the abnormal return or alpha of stock $i$ at time $t$
- $\beta^t$ coefficients represent the sensitivity of the stock returns to each factor at time $t$
- $\epsilon_i^t$ is the error term or idiosyncratic risk unique to stock $i$ at time $t$

This model is particularly useful for identifying which factors significantly impact stock returns and for constructing a diversified portfolio that is optimized for given risk preferences.
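
As a minimal sketch of how the equation above can be estimated, the example below simulates six factor series and a stock with known betas, then recovers them by ordinary least squares with NumPy. All numbers (`true_alpha`, `true_betas`, the noise level) are made up for illustration; the coursework itself uses `statsmodels` on real data.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 500
factor_names = ["Mkt-RF", "SMB", "HML", "RMW", "CMA", "Mom"]

# Synthetic daily factor returns and a stock with known betas (illustrative only)
factors = rng.normal(0.0, 0.01, size=(n_days, len(factor_names)))
true_alpha = 0.0002
true_betas = np.array([1.1, -0.2, 0.3, 0.1, -0.1, 0.05])
noise = rng.normal(0.0, 0.002, size=n_days)
excess_ret = true_alpha + factors @ true_betas + noise  # plays the role of R_i - R_f

# Ordinary least squares: a column of ones provides the intercept (alpha)
X = np.column_stack([np.ones(n_days), factors])
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

# The residuals estimate the idiosyncratic term epsilon
residuals = excess_ret - X @ coef

for name, beta in zip(factor_names, betas_hat):
    print(f"{name}: {beta:.3f}")
```

The residual series is the quantity of interest for statistical arbitrage: it is the part of the return that the factors do not explain.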




---
<font color=green>Q7: (1 Mark) </font>
<br><font color='green'>
Download the `fama_french_dataset` from the course's GitHub account.
</font>

---

In [None]:
url = 'https://raw.githubusercontent.com/Jandsy/ml_finance_imperial/main/Coursework/fama_french_dataset.csv'

fama_french_dataset = pd.read_csv(url, index_col =0)

fama_french_dataset.head()

| | Mkt-RF | SMB | HML | RMW | CMA | RF | Mom |
|---|---|---|---|---|---|---|---|
| 1963-07-01 | -0.67 | 0.02 | -0.35 | 0.03 | 0.13 | 0.012 | -0.21 |
| 1963-07-02 | 0.79 | -0.28 | 0.28 | -0.08 | -0.21 | 0.012 | 0.42 |
| 1963-07-03 | 0.63 | -0.18 | -0.1 | 0.13 | -0.25 | 0.012 | 0.41 |
| 1963-07-05 | 0.4 | 0.09 | -0.28 | 0.07 | -0.3 | 0.012 | 0.07 |
| 1963-07-08 | -0.63 | 0.07 | -0.2 | -0.27 | 0.06 | 0.012 | -0.45 |


---
<font color=green>Q8: (5 Marks)</font>
<br><font color='green'>

Write a Python function called `get_sub_df_ticker(ticker, date, df_filtered, length_history)` that extracts a historical sub-dataframe for a given `ticker` from `df_filtered`. The function should use `length_history` to determine the number of trading days to include, ending at the specified `date`. Return the sub-dataframe for the specified `ticker`.
</font>

---


In [None]:
from datetime import datetime, timedelta


def get_sub_df_ticker(ticker, date, df_filtered, length_history):
  # Work on a copy of the column so the index conversion stays local
  sub_data = df_filtered[ticker].copy()
  sub_data.index = pd.to_datetime(sub_data.index, errors='coerce')
  end_date = datetime.strptime(date, '%Y-%m-%d')
  # NOTE: timedelta counts calendar days, so the window contains fewer than
  # `length_history` trading days (weekends and holidays are not rows)
  start_date = end_date - timedelta(days=length_history)
  final = sub_data.loc[start_date:end_date]
  return final.to_frame(name=ticker)



get_sub_df_ticker("AAPL", '2024-03-28', df_filtered, 252)

| Date | AAPL |
|---|---|
| 2023-07-20 | -0.010097 |
| 2023-07-21 | -0.006162 |
| 2023-07-24 | 0.004220 |
| 2023-07-25 | 0.004514 |
| 2023-07-26 | 0.004545 |
| ... | ... |
| 2024-03-22 | 0.005310 |
| 2024-03-25 | -0.008300 |
| 2024-03-26 | -0.006673 |
| 2024-03-27 | 0.021213 |
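
The window above starts on 2023-07-20 because `timedelta(days=252)` counts calendar days, whereas the question asks for trading days. A possible variant that keeps the last `length_history` rows up to `date` is sketched below. The function name `get_sub_df_trading_days` and the synthetic ticker `AAA` are hypothetical, and the sketch assumes each row of the input is one trading day.

```python
import numpy as np
import pandas as pd


def get_sub_df_trading_days(ticker, date, returns, length_history):
    """Hypothetical variant: keep the last `length_history` rows ending at `date`."""
    series = returns[ticker].copy()
    series.index = pd.to_datetime(series.index)
    # Slice up to the end date, then keep the most recent rows (one row = one trading day)
    window = series.sort_index().loc[:pd.Timestamp(date)].tail(length_history)
    return window.to_frame(name=ticker)


# Synthetic business-day returns, for illustration only
idx = pd.bdate_range("2022-01-03", "2024-03-28")
rng = np.random.default_rng(0)
demo = pd.DataFrame({"AAA": rng.normal(0.0, 0.01, len(idx))}, index=idx)

sub = get_sub_df_trading_days("AAA", "2024-03-28", demo, 252)
print(len(sub), sub.index.min().date(), sub.index.max().date())
```

Switching to this variant would change the regression sample used in the later questions, so the outputs shown in this notebook correspond to the calendar-day version.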


---
<font color=green>Q9: (4 Marks)</font>
<br><font color='green'>
Create a Python function named `df_ticker_with_fama_french(ticker, date, df_filtered, length_history, fama_french_data)` that uses `get_sub_df_ticker` to extract historical data for a specific `ticker`. Incorporate the Fama-French factors from `fama_french_data` into the extracted sub-dataframe. Adjust the ticker's returns by subtracting the risk-free rate ('RF') and add other relevant Fama-French factors ('Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', and 'Mom'). Return the resulting sub-dataframe.
</font>

---

In [None]:
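def df_ticker_with_fama_french(ticker, date, df_filtered, length_history, fama_french_data):
  sub_df = get_sub_df_ticker(ticker, date, df_filtered, length_history)
  # Parse the factor dates so that both indexes are datetimes before merging
  fama_french_data.index = pd.to_datetime(fama_french_data.index, errors='coerce')
  # A right join keeps exactly the dates present in the stock window
  result = pd.merge(fama_french_data, sub_df, left_index=True, right_index=True, how='right')
  # Excess return over the risk-free rate (both columns are used as stored, with no unit conversion)
  result["excess_return"] = result[ticker] - result["RF"]
  return result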


df_ticker_with_fama_french("AAPL", '2024-03-28', df_filtered, 252, fama_french_dataset)

| Date | Mkt-RF | SMB | HML | RMW | CMA | RF | Mom | AAPL | excess_return |
|---|---|---|---|---|---|---|---|---|---|
| 2023-07-20 | -0.81 | -0.35 | 0.65 | 0.45 | 1.48 | 0.022 | -0.09 | -0.010097 | -0.032097 |
| 2023-07-21 | -0.05 | -0.37 | -0.37 | -0.16 | 0.11 | 0.022 | -0.81 | -0.006162 | -0.028162 |
| 2023-07-24 | 0.31 | -0.03 | 0.78 | 0.96 | -0.22 | 0.022 | -0.48 | 0.004220 | -0.017780 |
| 2023-07-25 | 0.25 | -0.23 | -0.79 | 0.47 | -0.40 | 0.022 | 0.91 | 0.004514 | -0.017486 |
| 2023-07-26 | 0.02 | 0.87 | 1.02 | -0.35 | 0.64 | 0.022 | -1.62 | 0.004545 | -0.017455 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-03-22 | -0.23 | -0.98 | -0.53 | 0.29 | -0.37 | 0.021 | 0.43 | 0.005310 | -0.015690 |
| 2024-03-25 | -0.26 | -0.10 | 0.88 | -0.22 | -0.17 | 0.021 | -0.34 | -0.008300 | -0.029300 |
| 2024-03-26 | -0.26 | 0.10 | -0.13 | -0.50 | 0.23 | 0.021 | 0.09 | -0.006673 | -0.027673 |
| 2024-03-27 | 0.88 | 1.29 | 0.91 | -0.14 | 0.58 | 0.021 | -1.34 | 0.021213 | 0.000213 |


---
<font color=green>Q10: (5 Marks) </font>
<br><font color='green'>
Write a Python function named `extract_beta_fama_french` to perform a rolling regression analysis for a given stock at a specific time point using the Fama-French model. The function should accept the following parameters:

- `ticker`: A string indicating the stock symbol.
- `date`: A string specifying the date for the analysis.
- `length_history`: An integer representing the number of days of historical data to include.
- `df_filtered`: A pandas DataFrame (assumed to be derived from question 5) containing filtered stock data.
- `fama_french_data`: A pandas DataFrame (assumed to be from question 7) that includes Fama-French factors.

Utilize the `statsmodels.api` library to conduct the regression.
</font>

---

In [None]:
import statsmodels.api as sm

def extract_beta_fama_french(ticker, date, df_filtered, length_history, fama_french_data):
  result_df = df_ticker_with_fama_french(ticker, date, df_filtered, length_history, fama_french_data)
  # Regressors: only the three classic factors are used here, not all six listed in the model above
  X = result_df[['Mkt-RF', 'SMB', 'HML']]
  X = sm.add_constant(X)  # Add constant for the intercept (alpha)
  y = result_df['excess_return']  # Stock return minus the risk-free rate
  model = sm.OLS(y, X).fit()
  return model


---
<font color=green>Q11: (2 Marks) </font>
<br><font color='green'>
Apply the `extract_beta_fama_french` function to the stock symbol 'AAPL' for the date '2024-03-28', using a historical data length of 252 days. Ensure that the `df_filtered` and `fama_french_data` DataFrames are correctly prepared and available in your environment before executing this function. The parameters for the function call are set as follows:

- **Ticker**: 'AAPL'
- **Date**: '2024-03-28'
- **Length of History**: 252 days
</font>

---



In [None]:
model = extract_beta_fama_french ("AAPL", '2024-03-28', df_filtered, 252, fama_french_dataset)

print(model)

<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x791403dcdb40>


---
<font color=green>Q12: (3 Marks)</font>
<br><font color='green'>
Once the `extract_beta_fama_french` function has been applied to 'AAPL' with the specified parameters, the next step is to analyze the regression summary to identify which Fama-French factor explains the most variance in 'AAPL' returns during the specified period.

Follow these steps to perform the analysis:

1. **Review the Summary**: Examine the regression output, focusing on the coefficients and their statistical significance (p-values).
2. **Identify Key Factor**: Determine which factor has the highest absolute coefficient value and is statistically significant (typically p < 0.05). This factor can be considered as having the strongest influence on 'AAPL' returns for the period.

</font>

---

**Write your answers here:**

In [None]:
summary = model.summary()
print(summary)

p_values = model.pvalues
coefficients = model.params

                            OLS Regression Results                            
Dep. Variable:          excess_return   R-squared:                       0.389
Model:                            OLS   Adj. R-squared:                  0.379
Method:                 Least Squares   F-statistic:                     36.36
Date:                Fri, 17 May 2024   Prob (F-statistic):           3.16e-18
Time:                        10:12:32   Log-Likelihood:                 561.23
No. Observations:                 175   AIC:                            -1114.
Df Residuals:                     171   BIC:                            -1102.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0221      0.001    -29.046      0.0
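
The selection rule described in the two steps above can be written as a small helper. The sketch below is illustrative only: the function name `strongest_significant_factor` is hypothetical, and the `params` and `pvalues` Series contain made-up numbers shaped like `model.params` and `model.pvalues`, not the actual 'AAPL' results.

```python
import pandas as pd

# Made-up coefficients and p-values, shaped like model.params / model.pvalues
params = pd.Series({"const": -0.02, "Mkt-RF": 0.011, "SMB": -0.002, "HML": -0.004})
pvalues = pd.Series({"const": 0.000, "Mkt-RF": 0.000, "SMB": 0.210, "HML": 0.003})


def strongest_significant_factor(params, pvalues, alpha=0.05):
    """Return the significant factor with the largest absolute coefficient, or None."""
    factors = params.drop("const", errors="ignore")  # the intercept is not a factor
    significant = factors[pvalues.reindex(factors.index) < alpha]
    if significant.empty:
        return None
    return significant.abs().idxmax()


print(strongest_significant_factor(params, pvalues))
```

With a fitted model, the call would be `strongest_significant_factor(model.params, model.pvalues)`. Coefficient size depends on how each factor is scaled, so the ranking is only a rough guide to explanatory power.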

# PCA Analysis


In the literature, another method exists for extracting residuals for each stock: it uses PCA to identify hidden factors in the data. Let's describe this method.

The return of a stock $R_i^t$ at time $t$ can be modeled as follows:

$$
R_i^t  = \sum_{j=1}^m\beta_{i,j}^t F_j^t  + \epsilon_i^t
$$

Where:
- $R_i^t$ is the return of stock $i$ at time $t$
- $m$ is the number of factors selected from PCA
- $F_j^t$ is the $j$-th hidden factor constructed from PCA at time $t$
- $\beta_{i,j}^t$ are the coefficients representing the sensitivity of the stock returns to each hidden factor
- $\epsilon_i^t$ is the residual term for stock $i$ at time $t$, representing the portion of the return not explained by the PCA factors

### Representation of Stock Return Data

Consider the return data for $N$ stocks over $T$ periods, represented by the matrix $R$ of size $T \times N$:

$$
R = \left[
\begin{array}{cccc}
R_1^T & R_2^T & \cdots & R_N^T \\
R_1^{T-1} & R_2^{T-1} & \cdots & R_N^{T-1} \\
\vdots & \vdots & \ddots & \vdots \\
R_1^1 & R_2^1 & \cdots & R_N^1 \\
\end{array}
\right]
$$

Each element $R_i^k$ of the matrix represents the return of stock $i$ at time $k$ and is defined as:

$$
R_i^k = \frac{S_{i,k} - S_{i, k-1}}{S_{i, k-1}}, \quad k=1,\cdots, T, \quad i=1,\cdots,N
$$

where $S_{i,k}$ denotes the adjusted close price of stock $i$ at time $k$.

### Standardization of Returns

To adjust for varying volatilities across stocks, we standardize the returns as follows:

$$
Z_i^t = \frac{R_i^t - \mu_i}{\sigma_i}
$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of returns for stock $i$ over the period $[t-T, t]$, respectively.

### Empirical Correlation Matrix

The empirical correlation matrix $C$ is computed from the standardized returns:

$$
C = \frac{1}{T-1} Z^T Z
$$

where $Z^T$ is the transpose of matrix $Z$.
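
As a quick check of the two formulas above, the sketch below standardizes synthetic returns (four hypothetical tickers `A` to `D`) and verifies that $\frac{1}{T-1} Z^T Z$ matches the correlation matrix computed by pandas. The equality relies on standardizing with the sample standard deviation (`ddof=1`), which is the pandas default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
T, N = 250, 4
# Synthetic daily returns for four hypothetical tickers (illustration only)
returns = pd.DataFrame(rng.normal(0.0005, 0.01, size=(T, N)), columns=list("ABCD"))

# Standardize each column with its mean and sample standard deviation
Z = (returns - returns.mean()) / returns.std()

# Empirical correlation matrix C = Z^T Z / (T - 1)
C = (Z.T @ Z) / (T - 1)

print(np.allclose(C.to_numpy(), returns.corr().to_numpy()))
```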

### Singular Value Decomposition (SVD)

We apply Singular Value Decomposition to the correlation matrix $C$:

$$
C = U \Sigma V^T
$$

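Here, $U$ and $V$ are orthogonal matrices containing the left and right singular vectors, respectively, and $\Sigma$ is a diagonal matrix containing the singular values. Because $C$ is symmetric and positive semi-definite, its singular values coincide with its eigenvalues, and its singular vectors are its eigenvectors. Equivalently, the eigenvalues of $C$ are the squared singular values of $Z / \sqrt{T-1}$.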
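
The relationship between the SVD and the eigendecomposition of $C$ can be verified numerically. The sketch below uses synthetic standardized data (not the course dataset) and compares the singular values of $C$ with its eigenvalues, and with the squared singular values of $Z / \sqrt{T-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(250, 5))                     # synthetic data, illustration only
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)  # standardize each column
T = Z.shape[0]
C = Z.T @ Z / (T - 1)

# SVD of the correlation matrix; singular values come back in descending order
U, S, Vt = np.linalg.svd(C)

# Eigenvalues of the symmetric matrix C (eigvalsh returns them in ascending order)
eigvals = np.linalg.eigvalsh(C)[::-1]

# Singular values of Z / sqrt(T - 1); their squares are the eigenvalues of C
s_Z = np.linalg.svd(Z / np.sqrt(T - 1), compute_uv=False)

print(np.allclose(S, eigvals), np.allclose(s_Z**2, S))
```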

### Construction of Hidden Factors

For each of the top $m$ components, we construct the selected hidden factors as follows:

$$
F_j^t = \sum_{i=1}^N \frac{\lambda_{i,j}}{\sigma_i} R_i^t
$$

where $\lambda_{i,j}$ is the $i$-th component of the $j$-th eigenvector (ranked by eigenvalue magnitude).
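
The construction above can be sketched end to end with NumPy. The example below uses synthetic returns and an arbitrary choice of `m = 2` factors, builds $F_j^t$ from the top eigenvectors, and then regresses the returns on the factors without an intercept, as in the PCA model equation, to obtain the betas and residuals. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, m = 250, 6, 2
R = rng.normal(0.0, 0.01, size=(T, N))   # synthetic returns, illustration only

sigma = R.std(axis=0, ddof=1)
Z = (R - R.mean(axis=0)) / sigma
C = Z.T @ Z / (T - 1)

# Eigenvectors of C; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
top_vecs = eigvecs[:, order]             # entry (i, j) is lambda_{i,j}

# F_j^t = sum_i (lambda_{i,j} / sigma_i) * R_i^t
F = R @ (top_vecs / sigma[:, None])      # shape (T, m)

# Least-squares betas of every stock on the m factors, and the residuals epsilon
betas, *_ = np.linalg.lstsq(F, R, rcond=None)
residuals = R - F @ betas

print(F.shape, betas.shape, residuals.shape)
```

The residual matrix has one column per stock and is the PCA counterpart of the Fama-French residuals discussed earlier.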


---
<font color=green>Q13: (3 Marks)</font>
<br><font color='green'>
For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), generate the matrix $Z$ by standardizing the stock returns using the DataFrame `df_filtered_new`.
</font>

---


In [None]:
from datetime import datetime

# Parse the index once so that date-based slicing works
df_filtered.index = pd.to_datetime(df_filtered.index, errors='coerce')
start_date = datetime.strptime('2023-03-29', '%Y-%m-%d')
end_date = datetime.strptime('2024-03-28', '%Y-%m-%d')
df_filtered_new = df_filtered.loc[start_date:end_date]

# Standardize each column: subtract its mean and divide by its sample standard deviation
df_filtered_new = (df_filtered_new - df_filtered_new.mean()) / df_filtered_new.std()

df_filtered_new

| Date | ADBE | ADP | GOOGL | GOOG | AMZN | AMD | AEP | AMGN | ADI | ANSS | ... | TTWO | TMUS | TSLA | TXN | VRSK | VRTX | WBA | WBD | WDAY | XEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-03-29 | 0.658925 | 2.189521 | 0.106285 | 0.207821 | 1.503570 | 0.444433 | 1.020467 | 0.727398 | 1.860839 | 0.356414 | ... | 0.517876 | 0.615242 | 0.814986 | 1.344650 | 1.105776 | 0.127319 | 0.497115 | 0.429736 | 2.277783 | 1.269711 |
| 2023-03-30 | 0.273051 | -0.221306 | -0.389329 | -0.434766 | 0.787034 | 0.526996 | 0.277291 | 0.076716 | 1.630406 | 0.936497 | ... | -0.109884 | 0.439972 | 0.233493 | 1.187056 | 0.239468 | -0.525980 | 0.691243 | 0.446100 | 0.389915 | 0.430715 |
| 2023-03-31 | 0.360570 | 1.136308 | 1.540732 | 1.439604 | 0.531735 | -0.056439 | 0.475365 | 0.008639 | 0.935425 | 1.044069 | ... | 1.337544 | 0.117646 | 2.058867 | 0.640237 | 0.325258 | 0.538943 | -0.008816 | 0.565842 | 1.611597 | 0.597545 |
| 2023-04-03 | -0.713053 | -2.259591 | 0.252735 | 0.407399 | -0.591892 | -0.600163 | -0.086824 | 0.759731 | -0.328378 | -0.555499 | ... | -0.377687 | 1.191774 | -2.030038 | -0.684867 | -0.214460 | 0.183344 | 1.205798 | -0.547975 | -0.634285 | 0.121519 |
| 2023-04-04 | 0.560730 | -1.137430 | 0.099652 | 0.013877 | 0.658632 | -0.342217 | 0.223111 | 0.872403 | -0.399698 | -0.127907 | ... | 1.434972 | -0.347689 | -0.377655 | -1.394525 | -0.538698 | -0.487155 | 0.553153 | 0.755053 | -0.537767 | 1.011197 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-03-22 | -1.146803 | -0.516681 | 1.151431 | 1.085070 | 0.074915 | 0.081847 | -0.150701 | -0.278013 | -0.555224 | 0.126867 | ... | 0.046870 | -0.246035 | -0.386620 | -0.054031 | -0.476030 | -0.091837 | -0.421327 | -0.946797 | 0.106149 | -0.002851 |
| 2024-03-25 | 0.659349 | -1.221008 | -0.372489 | -0.341074 | 0.109668 | -0.292706 | -0.084892 | 1.184710 | -0.959285 | -0.282029 | ... | -2.575803 | 0.240999 | 0.343241 | -0.651297 | -1.151639 | -0.024341 | 0.166127 | 0.118796 | -0.427841 | 0.321648 |
| 2024-03-26 | -0.032708 | 0.234343 | 0.131651 | 0.109342 | -0.556130 | -0.244717 | -0.377797 | 0.183364 | -0.577464 | 0.317533 | ... | 0.150958 | -0.070195 | 0.960796 | -1.177043 | -0.384593 | 0.306386 | -0.206326 | -0.246683 | 0.237594 | -0.878121 |
| 2024-03-27 | -0.363721 | 1.052056 | -0.024166 | -0.010591 | 0.315892 | 0.224881 | 2.192442 | 1.128111 | 1.411127 | -0.309662 | ... | 0.034693 | 0.474269 | 0.396878 | 1.991081 | 0.941800 | -0.265792 | 1.179504 | 1.004419 | -0.793726 | 2.193537 |


---
<font color=green>Q14: (1 Mark) </font>
<br><font color='green'>
Download the `Z_matrix` matrix from the course's GitHub account.
</font>

---

In [None]:
url = 'https://raw.githubusercontent.com/Jandsy/ml_finance_imperial/main/Coursework/Z_matrix.csv'

Z_matrix = pd.read_csv(url, index_col =0)

Z_matrix

| Date | ADBE | ADP | GOOGL | GOOG | AMZN | AMD | AEP | AMGN | ADI | ANSS | ... | TTWO | TMUS | TSLA | TXN | VRSK | VRTX | WBA | WBD | WDAY | XEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-03-29 | 0.658925 | 2.189521 | 0.106285 | 0.207821 | 1.503570 | 0.444433 | 1.020467 | 0.727398 | 1.860839 | 0.356414 | ... | 0.517876 | 0.615242 | 0.814986 | 1.344650 | 1.105776 | 0.127319 | 0.497115 | 0.429736 | 2.277783 | 1.269711 |
| 2023-03-30 | 0.273051 | -0.221306 | -0.389329 | -0.434766 | 0.787034 | 0.526996 | 0.277291 | 0.076716 | 1.630406 | 0.936497 | ... | -0.109884 | 0.439972 | 0.233493 | 1.187056 | 0.239468 | -0.525980 | 0.691243 | 0.446100 | 0.389915 | 0.430715 |
| 2023-03-31 | 0.360570 | 1.136308 | 1.540732 | 1.439604 | 0.531735 | -0.056439 | 0.475365 | 0.008639 | 0.935425 | 1.044069 | ... | 1.337544 | 0.117646 | 2.058867 | 0.640237 | 0.325258 | 0.538943 | -0.008816 | 0.565842 | 1.611597 | 0.597545 |
| 2023-04-03 | -0.713053 | -2.259591 | 0.252735 | 0.407399 | -0.591892 | -0.600163 | -0.086824 | 0.759731 | -0.328378 | -0.555499 | ... | -0.377687 | 1.191774 | -2.030038 | -0.684867 | -0.214460 | 0.183344 | 1.205798 | -0.547975 | -0.634285 | 0.121519 |
| 2023-04-04 | 0.560730 | -1.137430 | 0.099652 | 0.013877 | 0.658632 | -0.342217 | 0.223111 | 0.872403 | -0.399698 | -0.127907 | ... | 1.434972 | -0.347689 | -0.377655 | -1.394525 | -0.538698 | -0.487155 | 0.553153 | 0.755053 | -0.537767 | 1.011197 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-03-22 | -1.146803 | -0.516681 | 1.151431 | 1.085070 | 0.074915 | 0.081847 | -0.150701 | -0.278013 | -0.555224 | 0.126867 | ... | 0.046870 | -0.246035 | -0.386620 | -0.054031 | -0.476030 | -0.091837 | -0.421327 | -0.946797 | 0.106149 | -0.002851 |
| 2024-03-25 | 0.659349 | -1.221008 | -0.372489 | -0.341074 | 0.109668 | -0.292706 | -0.084892 | 1.184710 | -0.959285 | -0.282029 | ... | -2.575803 | 0.240999 | 0.343241 | -0.651297 | -1.151639 | -0.024341 | 0.166127 | 0.118796 | -0.427841 | 0.321648 |
| 2024-03-26 | -0.032708 | 0.234343 | 0.131651 | 0.109342 | -0.556130 | -0.244717 | -0.377797 | 0.183364 | -0.577464 | 0.317533 | ... | 0.150958 | -0.070195 | 0.960796 | -1.177043 | -0.384593 | 0.306386 | -0.206326 | -0.246683 | 0.237594 | -0.878121 |
| 2024-03-27 | -0.363721 | 1.052056 | -0.024166 | -0.010591 | 0.315892 | 0.224881 | 2.192442 | 1.128111 | 1.411127 | -0.309662 | ... | 0.034693 | 0.474269 | 0.396878 | 1.991081 | 0.941800 | -0.265792 | 1.179504 | 1.004419 | -0.793726 | 2.193537 |


---
<font color=green>Q15: (3 Marks) </font>
<br><font color='green'>
For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), compute the correlation matrix
$C$ using the matrix `Z_matrix`.
</font>

---

In [None]:
T = Z_matrix.shape[0]

C = (1 / (T - 1)) * (Z_matrix.T @ Z_matrix)

C

| | ADBE | ADP | GOOGL | GOOG | AMZN | AMD | AEP | AMGN | ADI | ANSS | ... | TTWO | TMUS | TSLA | TXN | VRSK | VRTX | WBA | WBD | WDAY | XEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADBE | 1.000000 | 0.218513 | 0.397890 | 0.400601 | 0.463488 | 0.444032 | -0.035967 | 0.198781 | 0.321991 | 0.387483 | ... | 0.257931 | 0.102167 | 0.268863 | 0.326597 | 0.171580 | 0.164760 | 0.033955 | 0.099841 | 0.418110 | 0.019105 |
| ADP | 0.218513 | 1.000000 | 0.294213 | 0.298841 | 0.168206 | 0.045884 | 0.228457 | 0.214813 | 0.279607 | 0.238355 | ... | 0.290311 | 0.113985 | 0.178128 | 0.297954 | 0.325258 | 0.176771 | 0.142369 | 0.243986 | 0.320836 | 0.164682 |
| GOOGL | 0.397890 | 0.294213 | 1.000000 | 0.997415 | 0.521199 | 0.371105 | -0.006803 | 0.118938 | 0.222252 | 0.292286 | ... | 0.238219 | 0.086673 | 0.267941 | 0.192188 | 0.178622 | 0.142447 | 0.052710 | 0.042072 | 0.289137 | 0.025701 |
| GOOG | 0.400601 | 0.298841 | 0.997415 | 1.000000 | 0.525626 | 0.371568 | -0.004037 | 0.118296 | 0.223710 | 0.294542 | ... | 0.242111 | 0.091456 | 0.268114 | 0.198044 | 0.180110 | 0.146190 | 0.060822 | 0.045516 | 0.293575 | 0.026392 |
| AMZN | 0.463488 | 0.168206 | 0.521199 | 0.525626 | 1.000000 | 0.463049 | -0.010849 | 0.123745 | 0.290872 | 0.342042 | ... | 0.222346 | 0.120301 | 0.303368 | 0.299500 | 0.144325 | 0.104962 | 0.017926 | 0.162937 | 0.403757 | -0.058870 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| VRTX | 0.164760 | 0.176771 | 0.142447 | 0.146190 | 0.104962 | 0.039540 | 0.239861 | 0.281759 | 0.110189 | 0.142121 | ... | 0.180810 | 0.139184 | 0.144443 | 0.198258 | 0.251863 | 1.000000 | 0.159124 | 0.062726 | 0.101851 | 0.184369 |
| WBA | 0.033955 | 0.142369 | 0.052710 | 0.060822 | 0.017926 | 0.002629 | 0.309717 | 0.214701 | 0.208907 | 0.096813 | ... | 0.115189 | 0.063538 | 0.168495 | 0.199627 | 0.038371 | 0.159124 | 1.000000 | 0.361533 | 0.010855 | 0.194839 |
| WBD | 0.099841 | 0.243986 | 0.042072 | 0.045516 | 0.162937 | 0.092733 | 0.325463 | 0.220342 | 0.310681 | 0.095919 | ... | 0.129023 | 0.084315 | 0.282265 | 0.355450 | 0.002990 | 0.062726 | 0.361533 | 1.000000 | 0.160860 | 0.183837 |
| WDAY | 0.418110 | 0.320836 | 0.289137 | 0.293575 | 0.403757 | 0.334587 | 0.017659 | 0.068097 | 0.315426 | 0.382360 | ... | 0.293949 | 0.142648 | 0.277744 | 0.346161 | 0.195588 | 0.101851 | 0.010855 | 0.160860 | 1.000000 | -0.019310 |


---
<font color=green>Q16: (2 Marks) </font>
<br><font color='green'>
Recompute the correlation matrix for the period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28') using the pandas correlation matrix method.
</font>

---

In [None]:
Z_matrix.corr()

Unnamed: 0,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,ANSS,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
ADBE,1.000000,0.218513,0.397890,0.400601,0.463488,0.444032,-0.035967,0.198781,0.321991,0.387483,...,0.257931,0.102167,0.268863,0.326597,0.171580,0.164760,0.033955,0.099841,0.418110,0.019105
ADP,0.218513,1.000000,0.294213,0.298841,0.168206,0.045884,0.228457,0.214813,0.279607,0.238355,...,0.290311,0.113985,0.178128,0.297954,0.325258,0.176771,0.142369,0.243986,0.320836,0.164682
GOOGL,0.397890,0.294213,1.000000,0.997415,0.521199,0.371105,-0.006803,0.118938,0.222252,0.292286,...,0.238219,0.086673,0.267941,0.192188,0.178622,0.142447,0.052710,0.042072,0.289137,0.025701
GOOG,0.400601,0.298841,0.997415,1.000000,0.525626,0.371568,-0.004037,0.118296,0.223710,0.294542,...,0.242111,0.091456,0.268114,0.198044,0.180110,0.146190,0.060822,0.045516,0.293575,0.026392
AMZN,0.463488,0.168206,0.521199,0.525626,1.000000,0.463049,-0.010849,0.123745,0.290872,0.342042,...,0.222346,0.120301,0.303368,0.299500,0.144325,0.104962,0.017926,0.162937,0.403757,-0.058870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VRTX,0.164760,0.176771,0.142447,0.146190,0.104962,0.039540,0.239861,0.281759,0.110189,0.142121,...,0.180810,0.139184,0.144443,0.198258,0.251863,1.000000,0.159124,0.062726,0.101851,0.184369
WBA,0.033955,0.142369,0.052710,0.060822,0.017926,0.002629,0.309717,0.214701,0.208907,0.096813,...,0.115189,0.063538,0.168495,0.199627,0.038371,0.159124,1.000000,0.361533,0.010855,0.194839
WBD,0.099841,0.243986,0.042072,0.045516,0.162937,0.092733,0.325463,0.220342,0.310681,0.095919,...,0.129023,0.084315,0.282265,0.355450,0.002990,0.062726,0.361533,1.000000,0.160860,0.183837
WDAY,0.418110,0.320836,0.289137,0.293575,0.403757,0.334587,0.017659,0.068097,0.315426,0.382360,...,0.293949,0.142648,0.277744,0.346161,0.195588,0.101851,0.010855,0.160860,1.000000,-0.019310
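
The agreement between the manual formula of Q15 and a library estimate can be checked on synthetic data. The sketch below assumes the returns are standardized with the sample standard deviation (`ddof=1`); `returns` is random data made up for illustration, not the coursework dataset, and NumPy's `corrcoef` stands in for the pandas method (both compute Pearson correlations).

```python
import numpy as np

# Hypothetical stand-in for the returns window: 252 days x 5 stocks.
rng = np.random.default_rng(0)
returns = rng.normal(size=(252, 5))

# Standardize each column with the sample standard deviation (ddof=1).
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)

T = Z.shape[0]
C_manual = (Z.T @ Z) / (T - 1)                 # formula used in Q15
C_numpy = np.corrcoef(returns, rowvar=False)   # library estimate, columns are variables

max_gap = np.abs(C_manual - C_numpy).max()
print(f"largest absolute difference: {max_gap:.2e}")
```

If the standardization used the population standard deviation (`ddof=0`) instead, the diagonal of the manual matrix would equal $T/(T-1)$ rather than 1.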


---
<font color=green>Q17: (7 Marks) </font>
<br><font color='green'>
Conduct Singular Value Decomposition on the correlation matrix $C$. Follow these steps:


1. **Perform SVD**: Decompose the matrix $C$ into its singular values and vectors.
2. **Rank Eigenvalues**: Sort the resulting singular values in descending order (for a symmetric positive semi-definite matrix such as $C$, they coincide with the eigenvalues).
3. **Select Components**: Extract the first 20 components based on the largest singular values.
4. **Variance Explained**: Print the variance explained by the first 20 components and the dimensions of the different matrices that you created.

</font>

---

In [None]:
## Insert your code here
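
The four steps above can be sketched as follows. This is a minimal illustration on a synthetic correlation matrix (`C_demo`, built from random data for 30 hypothetical stocks), not the coursework solution; the variable names are assumptions.

```python
import numpy as np

# Hypothetical stand-in for the correlation matrix: 252 days x 30 stocks of random data.
rng = np.random.default_rng(1)
C_demo = np.corrcoef(rng.normal(size=(252, 30)), rowvar=False)

# 1. SVD: C = U diag(s) Vt. NumPy already returns s in descending order.
U, s, Vt = np.linalg.svd(C_demo)

# 2. Explicit sort, kept so that the ranking step is visible.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# 3. Keep the k leading components.
k = 20
U_k, s_k = U[:, :k], s[:k]

# 4. Share of total variance carried by the first k components
#    (the trace of a correlation matrix equals the number of stocks).
explained = s_k.sum() / s.sum()
print(f"variance explained by {k} components: {explained:.2%}")
print("U_k:", U_k.shape, "s_k:", s_k.shape)
```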

---
<font color=green>Q18: (6 Marks) </font>
<br><font color='green'>
Extract the 20 hidden factors into a matrix $F$. Check that the shape of $F$ is $(252,20)$.
</font>


---

In [None]:
## Insert your code here
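
One common construction, assumed here rather than prescribed by the coursework, projects the standardized returns onto the leading eigenvectors of the correlation matrix. The data and the names `Z_demo` and `F_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_stocks, k = 252, 30, 20

# Hypothetical standardized returns (the coursework would use Z_matrix).
raw = rng.normal(size=(n_days, n_stocks))
Z_demo = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

C_demo = (Z_demo.T @ Z_demo) / (n_days - 1)
U, s, _ = np.linalg.svd(C_demo)

# Project the standardized returns onto the k leading eigenvectors.
F_demo = Z_demo @ U[:, :k]
print(F_demo.shape)  # (252, 20)

# The factor columns are uncorrelated, with sample variances
# equal to the leading singular values of C_demo.
factor_cov = (F_demo.T @ F_demo) / (n_days - 1)
```

Under this construction the factors are mutually uncorrelated by design, which keeps the regression in the next question well conditioned.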

---
<font color=green>Q19: (3 Marks) </font>
<br><font color='green'>
Perform the regression analysis of 'AAPL' for the date '2024-03-28', using a historical window of 252 days and the matrix $F$ from the previous question. Compare the R-squared with the ones obtained in Q11.
</font>


---

In [None]:
## Insert your code here
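
A regression of one stock's returns on the factor matrix, together with its R-squared, might look like the sketch below. It uses ordinary least squares with an intercept via NumPy; the factor matrix and return series are synthetic placeholders, not the 'AAPL' data.

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, k = 252, 20

# Hypothetical data: factor matrix F and one stock's return series y.
F_demo = rng.normal(size=(n_days, k))
true_beta = rng.normal(size=k)
y = 0.01 + F_demo @ true_beta + rng.normal(scale=0.5, size=n_days)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n_days), F_demo])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variance of y explained by the fitted values.
residuals = y - X @ coef
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")
```

The `residuals` series is the idiosyncratic part of the return, which is the natural input to the mean-reversion modeling in the next section.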

# Ornstein Uhlenbeck

The Ornstein-Uhlenbeck process is defined by the following stochastic differential equation (SDE):

$$ dX_t = \theta (\mu - X_t) dt + \sigma dW_t $$

where:

- **$ X_t $**: The value of the process at time $ t $.
- **$ \mu $**: The long-term mean (equilibrium level) to which the process reverts.
- **$ \theta $**: The speed of reversion or the rate at which the process returns to the mean.
- **$ \sigma $**: The volatility (standard deviation), representing the magnitude of random fluctuations.
- **$ W_t $**: A Wiener process or Brownian motion that adds stochastic (random) noise.

This equation describes a process where the variable $ X_t $ moves towards the mean $ \mu $ at a rate determined by $ \theta $, with random noise added by $ \sigma dW_t $.
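
To make the dynamics concrete, the sketch below simulates one path using the exact Gaussian transition of the process over a step `dt`. The function name `simulate_ou` and all parameter values are illustrative assumptions, not part of the coursework.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, dt, n_steps, seed=0):
    """Simulate one OU path with the exact Gaussian transition over a step dt."""
    rng = np.random.default_rng(seed)
    decay = np.exp(-theta * dt)
    # Standard deviation of the noise accumulated over one step.
    step_std = sigma * np.sqrt((1.0 - decay ** 2) / (2.0 * theta))
    path = np.empty(n_steps + 1)
    path[0] = x0
    for t in range(n_steps):
        path[t + 1] = mu + (path[t] - mu) * decay + step_std * rng.normal()
    return path

# Illustrative parameter values, not taken from the coursework.
path = simulate_ou(theta=5.0, mu=0.0, sigma=1.0, x0=2.0, dt=1 / 252, n_steps=50_000)
print(f"sample mean: {path.mean():.3f}, sample variance: {path.var():.3f}")
print(f"long-run variance sigma^2 / (2 theta): {1.0 ** 2 / (2 * 5.0):.3f}")
```

Starting away from $\mu$ (here `x0=2.0`) shows the pull toward the mean; over a long path the sample mean and variance should settle near $\mu$ and $\sigma^2 / (2\theta)$.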

---
<font color=green>Q20: (3 Marks) </font>
<br><font color='green'>
In the context of mean reversion, which quantity should be modeled using an Ornstein-Uhlenbeck process?
</font>

---

**Write your answers here:**

---
<font color=green>Q21: (5 Marks) </font>
<br><font color='green'>
Explain how the parameters $ \theta $ and $ \sigma $ can be determined using the following equations. Also, detail the underlying assumptions:
$$ E[X] = \mu $$
$$ \text{Var}[X] = \frac{\sigma^2}{2\theta} $$
</font>

---

**Write your answers here:**
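
As a starting point, and assuming the process is stationary and observed at a fixed interval $\Delta$, the two moments above pin down $\mu$ and the ratio $\sigma^2 / (2\theta)$ only. A third relation is needed to separate $\theta$ from $\sigma$. A standard choice is the exact AR(1) form of the sampled process, where $a$, $b$ and $\varepsilon$ are notation introduced here for illustration:

$$ X_{t+\Delta} = a + b X_t + \varepsilon_{t+\Delta}, \qquad b = e^{-\theta \Delta}, \qquad a = \mu (1 - b), \qquad \text{Var}[\varepsilon] = \frac{\sigma^2}{2\theta} (1 - b^2) $$

Regressing $X_{t+\Delta}$ on $X_t$ gives estimates of $a$, $b$ and $\text{Var}[\varepsilon]$, from which, provided $0 < b < 1$:

$$ \theta = -\frac{\ln b}{\Delta}, \qquad \mu = \frac{a}{1 - b}, \qquad \sigma = \sqrt{\frac{2 \theta \, \text{Var}[\varepsilon]}{1 - b^2}} $$

These expressions agree with the stationary variance above, since $\text{Var}[X] = \text{Var}[\varepsilon] / (1 - b^2) = \sigma^2 / (2\theta)$.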

---
<font color=green>Q22: (2 Marks) </font>
<br><font color='green'>
Create a function named `extract_s_scores` that computes the s-score of the last element in a list of floating-point numbers. The function takes `list_xi`, a list containing the sequence $(X_0, \cdots, X_T)$, and applies the formula $ \text{s-score} = \frac{X_T - \mu}{\sigma} $.

</font>

---

In [None]:
## Insert your code here
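
A minimal sketch of such a function is given below. The question does not say how $\mu$ and $\sigma$ are obtained, so this version assumes they are the sample mean and sample standard deviation of the whole list; estimates from an Ornstein-Uhlenbeck fit could be substituted.

```python
import statistics

def extract_s_scores(list_xi):
    """Return (X_T - mu) / sigma for the last element of list_xi.

    Assumption: mu and sigma are the sample mean and sample standard
    deviation of the whole list; an OU-based estimate could be swapped in.
    """
    if len(list_xi) < 2:
        raise ValueError("need at least two observations")
    mu = statistics.fmean(list_xi)
    sigma = statistics.stdev(list_xi)  # sample standard deviation (n - 1)
    if sigma == 0:
        raise ValueError("series is constant, s-score undefined")
    return (list_xi[-1] - mu) / sigma

print(extract_s_scores([1.0, 2.0, 3.0, 4.0, 5.0]))
```

A positive score means the last observation sits above the mean, and a negative score means it sits below.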

# Autoencoder Analysis

Autoencoders are neural networks used for unsupervised learning, particularly for dimensionality reduction and feature extraction. Training an autoencoder on the $Z_i$ matrix aims to identify hidden factors capturing the intrinsic structures in financial data.

### Architecture
- **Encoder**: Compresses input data into a smaller latent space representation.
  - *Input Layer*: Matches the number of features in the $Z_i$ matrix.
  - *Hidden Layers*: Compress data through progressively smaller layers.
  - *Latent Space*: Encodes the data into hidden factors.
- **Decoder**: Reconstructs input data from the latent space.
  - *Hidden Layers*: Gradually expand to the original dimension.
  - *Output Layer*: Matches the input layer to recreate the original matrix.

### Training
The autoencoder is trained by minimizing reconstruction loss, usually mean squared error (MSE), between the input $Z_i$ matrix and the decoder's output.

### Hidden Factors Extraction
After training, the encoder's latent-space output for each input gives the hidden factors: a compressed representation of the dominant patterns in the stock returns.
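
The encoder, decoder and MSE training loop described above can be sketched in plain NumPy. This is a toy illustration only: it uses a single tanh latent layer, a linear decoder, full-batch gradient descent and synthetic data, all of which are assumptions and none of which reproduce the architecture or optimizer of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_features, n_latent = 200, 10, 3

# Hypothetical data with low-dimensional structure plus noise.
latent_true = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = latent_true @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Encoder (tanh) and decoder (linear) parameters.
W_enc = 0.1 * rng.normal(size=(n_features, n_latent))
b_enc = np.zeros(n_latent)
W_dec = 0.1 * rng.normal(size=(n_latent, n_features))
b_dec = np.zeros(n_features)

def forward(inputs):
    hidden = np.tanh(inputs @ W_enc + b_enc)   # latent representation (hidden factors)
    output = hidden @ W_dec + b_dec            # reconstruction of the inputs
    return hidden, output

learning_rate, losses = 0.05, []
for epoch in range(500):
    H, X_hat = forward(X)
    error = X_hat - X
    losses.append(np.mean(error ** 2))         # reconstruction MSE
    # Backpropagation of the MSE loss through decoder and encoder.
    grad_out = 2.0 * error / error.size
    grad_W_dec = H.T @ grad_out
    grad_b_dec = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W_dec.T) * (1.0 - H ** 2)   # tanh derivative
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)
    # In-place gradient descent updates.
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc

factors, reconstruction = forward(X)
print(f"MSE: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.4f}")
print("hidden factors:", factors.shape)
```

In practice a deep-learning library would handle the gradients and optimizer; writing them out by hand here only serves to show what "minimizing reconstruction loss" means.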

---
<font color=green>Q23: (2 Marks) </font>
<br><font color='green'>
Modify the standardized returns matrix `Z_matrix` to reduce the influence of extreme outliers on model training by ensuring that no value in `Z_matrix` lies more than 3 standard deviations from the mean. Specifically, clip the values to the interval $[-3, 3]$. Store the adjusted values in a new matrix, `Z_hat`.
</font>

---

In [None]:
## Insert your code here
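
Clipping can be illustrated on a small made-up array; `Z_demo` and `Z_hat_demo` are hypothetical names used only for this sketch.

```python
import numpy as np

# Hypothetical standardized returns containing a few extreme values.
Z_demo = np.array([[0.5, -4.2, 1.1],
                   [3.7, 0.0, -2.9],
                   [-3.0, 2.5, 6.3]])

# Clip every entry to [-3, 3]; values already inside the interval are unchanged.
Z_hat_demo = np.clip(Z_demo, -3.0, 3.0)
print(Z_hat_demo)
```

For a pandas DataFrame, the equivalent call would be `Z_matrix.clip(lower=-3, upper=3)`, which keeps the index and column labels.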

---
<font color=green>Q24: (1 Mark) </font>
<br><font color='green'>
Fetch the `Z_hat` data from GitHub; the remaining questions use this version.
</font>



In [None]:
## Insert your code here

---
<font color=green>Q25: (3 Marks) </font>
<br><font color='green'>
Segment the standardized and capped returns matrix $\hat{Z}$ into two subsets for model training and testing. Precisely, allocate 70% of the data in $\hat{Z}$ to the training set $\hat{Z}_{train}$ and the remaining 30% to the testing set $\hat{Z}_{test}$. Treat each stock within $\hat{Z}$ as an individual sample, by flattening temporal dependencies.
</font>



In [None]:
## Insert your code here
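
One way to read "each stock as an individual sample" is to transpose the matrix so that every row is one stock's full return history, then split the rows 70/30. The sketch below follows that reading on synthetic data; the random shuffle and the names used are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_days, n_stocks = 252, 100

# Hypothetical capped returns, days in rows and stocks in columns.
Z_hat_demo = np.clip(rng.normal(size=(n_days, n_stocks)), -3.0, 3.0)

# One row per stock: each sample is that stock's 252-day return vector.
samples = Z_hat_demo.T

# Shuffle the stock indices, then cut at 70% (integer arithmetic avoids rounding issues).
indices = rng.permutation(n_stocks)
n_train = (7 * n_stocks) // 10
train_idx, test_idx = indices[:n_train], indices[n_train:]

Z_train, Z_test = samples[train_idx], samples[test_idx]
print(Z_train.shape, Z_test.shape)  # (70, 252) (30, 252)
```

Fixing the random seed keeps the split reproducible, which matters later when residuals for a specific stock have to be located in the test set.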

---
<font color=green>Q26: (10 Marks) </font>
<br><font color='green'>
Please create an autoencoder following the instructions provided in **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**. Use the 'Variant 2' model from Table 1.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q27: (1 Mark) </font>
<br><font color='green'>
Display all the parameters of the deep neural network.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q28: (3 Marks) </font>
<br><font color='green'>
Train your model using the Adam optimizer for 20 epochs, with a batch size of 8 and a validation split of 20%. Specify the loss function you have chosen.
</font>


In [None]:
## Insert your code here

---
<font color=green>Q29: (3 Marks) </font>
<br><font color='green'>
Predict using the testing set and extract the residuals for the 'NVDA' stock, following the methodology described in **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**.
</font>

---

In [None]:
## Insert your code here

<font color=green>Q30: (7 Marks) </font>
<br><font color='green'>
After carefully reading the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**, answer the following questions:
1. **Summarize the Key Actions**: Highlight the main experiments and methodologies employed by the authors in Section 5.
2. **Reproduction Steps**: Detail the necessary steps required to replicate the authors' approach based on the descriptions provided in the paper.
3. **Proposed Improvement**: Suggest one enhancement to the methodology that could increase the effectiveness or efficiency of the model.
</font>



**Write your answers here:**








