Unsupervised learning in trading involves using ML techniques to analyse data and discover patterns, relationships, and structures within the data without any pre-defined label or target variable. 

Unsupervised learning has multiple applications to trading
- clustering 
- dimensionality reduction 
- anomaly detection
- market regime detection 
- portfolio optimization 

Our plan for this project:
 - Download SP500 stocks prices data
 - Calculate different technical indicators and features for each stock 
 - Aggregate monthly and filter only top 150 most liquid stocks for each month
 - Calculate monthly returns for different time horizons 
 - Use Fama-French Factors to calculate rolling betas for each stock 
 - Train a K-Means model for each month to cluster similar stocks together
 - For each month, select assets based on a cluster and form a portfolio using Efficient Frontier max sharpe ratio portfolio optimisation 
 - Visualise portfolio returns and compare with SP500 

=> LIMITATION: We are using only the most recent SP500 stocks list and therefore suffer from the surviviorship bias. We should use surviviorship free data. 

### 1. Downloading Data

In [28]:
from statsmodels.regression.rolling import RollingOLS
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np
import datetime as dt
import yfinance as yf
import pandas_ta
import warnings
warnings.filterwarnings("ignore")

sp500 = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]

sp500["Symbol"] = sp500["Symbol"].replace(".", "-")
symbols_list = sp500["Symbol"].unique().tolist()

end_date = "2024-07-09"
start_date = pd.to_datetime(end_date)-pd.DateOffset(365*8)

df = yf.download(tickers = symbols_list, start=start_date, end=end_date).stack()

df.index.names = ['date','ticker']
df.columns = df.columns.str.lower()

df


[**                     5%%                      ]  27 of 502 completedFailed to get ticker 'BRK.B' reason: HTTPSConnectionPool(host='guce.yahoo.com', port=443): Read timed out. (read timeout=30)
[******                12%%                      ]  59 of 502 completed

$BF.B: possibly delisted; No price data found  (1d 2016-07-11 00:00:00 -> 2024-07-09)


[*********************100%%**********************]  501 of 502 completed

27 Failed downloads:
['ALB', 'IRM', 'SRE', 'NEM', 'BRO', 'ISRG', 'CTLT', 'VTR', 'JCI', 'CRWD', 'HPE']: ConnectionError(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))
[*********************100%%**********************]  501 of 502 completed['CSGP', 'GL', 'DECK', 'CLX', 'VRSN', 'LNT', 'CDW', 'PFE', 'UNP', 'PPL', 'V', 'SCHW', 'CI', 'REG']: ConnectionError(ReadTimeoutError("HTTPSConnectionPool(host='query2.finance.yahoo.com', port=443): Read timed out."))
['BRK.B']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')
['BF.B']: YFPricesMissingError('$%ticker%: possibly delisted; No price data found  (1d 2016-07-11 00:00:00 -> 2024-07-09)')


Unnamed: 0_level_0,Price,adj close,close,high,low,open,volume
date,ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2016-07-11,A,42.635681,45.400002,45.770000,45.349998,45.610001,1094700.0
2016-07-11,AAL,29.945486,31.160000,31.440001,30.219999,30.230000,12374400.0
2016-07-11,AAPL,22.268711,24.245001,24.412500,24.182501,24.187500,95179600.0
2016-07-11,ABBV,45.603844,64.349998,64.989998,64.070000,64.250000,9641500.0
2016-07-11,ABT,36.353951,42.119999,42.310001,42.000000,42.029999,9052300.0
...,...,...,...,...,...,...,...
2024-07-08,XYL,134.059998,134.059998,135.979996,133.860001,134.809998,746800.0
2024-07-08,YUM,127.940002,127.940002,130.440002,127.610001,129.869995,1846100.0
2024-07-08,ZBH,106.389999,106.389999,108.389999,106.139999,107.839996,1651100.0
2024-07-08,ZBRA,314.480011,314.480011,314.959991,310.570007,311.989990,209100.0


### 2. Calculating features and techical indicators for each stock

For each stock, we will compute the following metrics:
 - Garman-Klass volatility
 - RSI
 - Bollinger Bands
 - ATR
 - MACD
 - Dollar volume

In [31]:
df["garman-klass volatility"] = ((np.log(df["high"]) - np.log(df["low"]))**2)/2 - (2*np.log(2)-1)*((np.log(df["adj close"])-np.log(df["open"]))**2)
df["rsi"] = df.groupby(level=1)["adj close"].transform(lambda x: pandas_ta.rsi(close=x,length=20))

# We want to normalise the indicators from this point.
df["bb_low"] = df.groupby(level=1)["adj close"].transform(lambda x: pandas_ta.bbands(close= np.log1p(x), length = 20).iloc[:,0])
df["bb_mid"] = df.groupby(level=1)["adj close"].transform(lambda x: pandas_ta.bbands(close= np.log1p(x), length = 20).iloc[:,1])
df["bb_high"] = df.groupby(level=1)["adj close"].transform(lambda x: pandas_ta.bbands(close= np.log1p(x), length = 20).iloc[:,2])
