# Unsupervised Learning Trading Strategy

* Download/load S&P 500 stock price data.
* Calculate features and technical indicators for each stock.
* Aggregate to monthly frequency and keep the 150 most liquid stocks.
* Calculate monthly returns over different time horizons.
* Download Fama-French factors and calculate rolling factor betas.
* For each month, fit a K-Means clustering algorithm to group similar assets based on their features.
* For each month, select assets based on their cluster and form a portfolio using Efficient Frontier max-Sharpe-ratio optimization.
* Visualize portfolio returns and compare them to S&P 500 returns.

## 1. Download/load S&P 500 stock price data.

In [1]:
from statsmodels.regression.rolling import RollingOLS
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np
import datetime as dt
import yfinance as yf
import pandas_ta
import warnings
warnings.filterwarnings('ignore')

sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]

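# Yahoo Finance uses '-' where Wikipedia uses '.' (e.g. BRK.B -> BRK-B).
# regex=False treats '.' as a literal dot on every pandas version.
sp500['Symbol'] = sp500['Symbol'].str.replace('.', '-', regex=False)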

symbols_list = sp500['Symbol'].unique().tolist()

end_date = '2024-08-27'

# Roughly eight years of history (8 * 365 days) before end_date.
start_date = pd.to_datetime(end_date)-pd.DateOffset(days=365*8)

df = yf.download(tickers=symbols_list,
                 start=start_date,
                 end=end_date).stack()

df.index.names = ['date', 'ticker']

df.columns = df.columns.str.lower()

df

[*********************100%***********************]  503 of 503 completed


| date | ticker | adj close | close | high | low | open | volume |
|---|---|---|---|---|---|---|---|
| 2016-08-29 00:00:00+00:00 | A | 44.720493 | 47.619999 | 47.889999 | 47.310001 | 47.450001 | 1333000.0 |
| 2016-08-29 00:00:00+00:00 | AAL | 34.864262 | 36.169998 | 36.410000 | 36.049999 | 36.130001 | 4760700.0 |
| 2016-08-29 00:00:00+00:00 | AAPL | 24.632544 | 26.705000 | 26.860001 | 26.572500 | 26.655001 | 99881200.0 |
| 2016-08-29 00:00:00+00:00 | ABBV | 45.698769 | 64.510002 | 65.089996 | 64.190002 | 64.839996 | 5099200.0 |
| 2016-08-29 00:00:00+00:00 | ABT | 37.360985 | 43.250000 | 43.570000 | 42.950001 | 43.000000 | 9223200.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-26 00:00:00+00:00 | XYL | 135.888855 | 136.250000 | 138.470001 | 135.929993 | 137.250000 | 808500.0 |
| 2024-08-26 00:00:00+00:00 | YUM | 134.279999 | 134.949997 | 136.300003 | 134.759995 | 135.929993 | 1727500.0 |
| 2024-08-26 00:00:00+00:00 | ZBH | 114.629997 | 114.629997 | 116.370003 | 114.230003 | 115.320000 | 913500.0 |
| 2024-08-26 00:00:00+00:00 | ZBRA | 347.690002 | 347.690002 | 352.970001 | 346.500000 | 352.970001 | 248200.0 |
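
The `.stack()` call in the cell above is what turns the wide frame returned by `yf.download` (one column per field and ticker) into the long `(date, ticker)` layout shown in the output. A minimal sketch of that reshaping on synthetic data follows; the tickers `AAA` and `BBB` and all values are made up, and no network access is needed.

```python
import pandas as pd

# Synthetic stand-in for a wide price frame: columns are (field, ticker) pairs.
# Tickers AAA/BBB and all numbers are illustrative, not real market data.
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
columns = pd.MultiIndex.from_product([['close', 'volume'], ['AAA', 'BBB']])
wide = pd.DataFrame([[10.0, 20.0, 100.0, 200.0],
                     [11.0, 21.0, 110.0, 210.0]],
                    index=dates, columns=columns)

# stack() moves the innermost column level (the ticker) into the row index.
long_df = wide.stack()
long_df.index.names = ['date', 'ticker']

# One ticker's history can then be pulled out with a cross-section.
aaa_close = long_df.xs('AAA', level='ticker')['close']
print(long_df)
```

With this layout, every later `groupby(level=1)` call operates on one ticker at a time.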


## 2. Calculate features and technical indicators for each stock.

* Garman-Klass Volatility
* RSI
* Bollinger Bands
* ATR
* MACD
* Dollar Volume

\begin{equation}
\text{Garman-Klass Volatility} = \frac{(\ln(\text{High}) - \ln(\text{Low}))^2}{2} - (2\ln(2) - 1)(\ln(\text{Adj Close}) - \ln(\text{Open}))^2
\end{equation}
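
To make the formula concrete, the sketch below evaluates it by hand for two rows of the price table in section 1. The helper name `garman_klass` is introduced here for illustration only. Because the formula compares the adjusted close with the unadjusted open, the second term can dominate for older dates where the two differ a lot, which appears to be why some early rows in the feature table are negative.

```python
import math

def garman_klass(high, low, adj_close, open_):
    """Per-bar Garman-Klass estimate, term by term as in the equation above."""
    range_term = (math.log(high) - math.log(low)) ** 2 / 2
    drift_term = (2 * math.log(2) - 1) * (math.log(adj_close) - math.log(open_)) ** 2
    return range_term - drift_term

# XYL on 2024-08-26: adjusted close and open are close together.
xyl = garman_klass(high=138.470001, low=135.929993,
                   adj_close=135.888855, open_=137.250000)

# ABBV on 2016-08-29: adjusted close sits far below the unadjusted open,
# so the second term outweighs the high-low range term.
abbv = garman_klass(high=65.089996, low=64.190002,
                    adj_close=45.698769, open_=64.839996)

print(f"XYL:  {xyl:.6f}")
print(f"ABBV: {abbv:.6f}")
```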

In [2]:
# Garman-Klass volatility, computed row by row from the formula above.
df['garman_klass_vol'] = ((np.log(df['high'])-np.log(df['low']))**2)/2-(2*np.log(2)-1)*((np.log(df['adj close'])-np.log(df['open']))**2)

# 20-day RSI, computed separately for each ticker (index level 1).
df['rsi'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.rsi(close=x, length=20))

# 20-day Bollinger Bands on log1p(adj close); the first three columns
# returned by bbands are the lower, middle and upper band.
df['bb_low'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,0])

df['bb_mid'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,1])

df['bb_high'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,2])

# 14-day ATR from unadjusted prices, z-scored with each ticker's own mean and std.
def compute_atr(stock_data):
    atr = pandas_ta.atr(high=stock_data['high'],
                        low=stock_data['low'],
                        close=stock_data['close'],
                        length=14)
    return atr.sub(atr.mean()).div(atr.std())

df['atr'] = df.groupby(level=1, group_keys=False).apply(compute_atr)

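def compute_macd(close):
    # pandas_ta.macd is parameterised by fast/slow/signal, not length;
    # the defaults (fast=12, slow=26, signal=9) are used here.
    macd = pandas_ta.macd(close=close).iloc[:,0]
    # z-score the MACD line with the ticker's own mean and std.
    return macd.sub(macd.mean()).div(macd.std())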

df['macd'] = df.groupby(level=1, group_keys=False)['adj close'].apply(compute_macd)

# Daily dollar volume, expressed in millions.
df['dollar_volume'] = (df['adj close']*df['volume'])/1e6

df

| date | ticker | adj close | close | high | low | open | volume | garman_klass_vol | rsi | bb_low | bb_mid | bb_high | atr | macd | dollar_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-08-29 00:00:00+00:00 | A | 44.720493 | 47.619999 | 47.889999 | 47.310001 | 47.450001 | 1333000.0 | -0.001282 | NaN | NaN | NaN | NaN | NaN | NaN | 59.612418 |
| 2016-08-29 00:00:00+00:00 | AAL | 34.864262 | 36.169998 | 36.410000 | 36.049999 | 36.130001 | 4760700.0 | -0.000442 | NaN | NaN | NaN | NaN | NaN | NaN | 165.978290 |
| 2016-08-29 00:00:00+00:00 | AAPL | 24.632544 | 26.705000 | 26.860001 | 26.572500 | 26.655001 | 99881200.0 | -0.002347 | NaN | NaN | NaN | NaN | NaN | NaN | 2460.328010 |
| 2016-08-29 00:00:00+00:00 | ABBV | 45.698769 | 64.510002 | 65.089996 | 64.190002 | 64.839996 | 5099200.0 | -0.047184 | NaN | NaN | NaN | NaN | NaN | NaN | 233.027161 |
| 2016-08-29 00:00:00+00:00 | ABT | 37.360985 | 43.250000 | 43.570000 | 42.950001 | 43.000000 | 9223200.0 | -0.007531 | NaN | NaN | NaN | NaN | NaN | NaN | 344.587835 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-26 00:00:00+00:00 | XYL | 135.888855 | 136.250000 | 138.470001 | 135.929993 | 137.250000 | 808500.0 | 0.000133 | 53.380037 | 4.836505 | 4.886408 | 4.936311 | 1.013690 | -0.071432 | 109.866139 |
| 2024-08-26 00:00:00+00:00 | YUM | 134.279999 | 134.949997 | 136.300003 | 134.759995 | 135.929993 | 1727500.0 | 0.000007 | 51.342837 | 4.890780 | 4.914775 | 4.938770 | 0.527320 | 0.636457 | 231.968698 |
| 2024-08-26 00:00:00+00:00 | ZBH | 114.629997 | 114.629997 | 116.370003 | 114.230003 | 115.320000 | 913500.0 | 0.000158 | 59.349814 | 4.669664 | 4.712685 | 4.755706 | -0.641330 | 0.483530 | 104.714502 |
| 2024-08-26 00:00:00+00:00 | ZBRA | 347.690002 | 347.690002 | 352.970001 | 346.500000 | 352.970001 | 248200.0 | 0.000083 | 59.968911 | 5.740242 | 5.817311 | 5.894380 | 0.317951 | 0.735118 | 86.296659 |
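
Two patterns in the feature cell explain the shape of this output: windowed indicators are empty until enough history has accumulated, and ATR and MACD are z-scored per ticker. The sketch below reproduces both on a synthetic panel using plain pandas. It mirrors the idea rather than pandas_ta's implementation; the tickers, prices and the helper names `rolling_mid` and `zscore` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic (date, ticker) panel; tickers and prices are made up.
dates = pd.bdate_range('2024-01-01', periods=30)
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'ticker'])
rng = np.random.default_rng(0)
panel = pd.DataFrame({'adj close': 100 + rng.normal(0, 1, len(idx)).cumsum()}, index=idx)

def rolling_mid(prices, length=20):
    """Rolling mean of log1p(price); NaN until `length` observations exist."""
    return np.log1p(prices).rolling(length).mean()

def zscore(series):
    """Centre and scale a series with its own full-sample mean and std."""
    return series.sub(series.mean()).div(series.std())

# Grouping by ticker keeps one stock's history from leaking into another's window.
by_ticker = panel.groupby(level='ticker')['adj close']
panel['mid'] = by_ticker.transform(rolling_mid)
panel['z'] = by_ticker.transform(zscore)

# Number of warm-up rows without a rolling value, per ticker.
print(panel['mid'].isna().groupby(level='ticker').sum())
```

Note that `zscore` uses each ticker's whole history, so every value depends on the full sample rather than only on past data.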


## 3. Aggregate to monthly level and filter top 150 most liquid stocks for each month.

* To reduce training time and experiment with features and strategies, we convert the business-daily data to month-end frequency.
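
As a rough illustration of this step, the sketch below collapses a synthetic daily panel to one row per month and ticker, then keeps the most liquid names in each month. It is a simplified sketch under stated assumptions, not the notebook's own aggregation: the tickers and values are made up, liquidity is ranked by that month's average dollar volume, and `top_n` is set to 2 instead of 150 to fit the toy data.

```python
import numpy as np
import pandas as pd

# Synthetic daily panel for three made-up tickers over two months.
dates = pd.bdate_range('2024-01-01', '2024-02-29')
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB', 'CCC']], names=['date', 'ticker'])
rng = np.random.default_rng(1)
daily = pd.DataFrame({
    'dollar_volume': rng.uniform(1, 100, len(idx)),
    'rsi': rng.uniform(20, 80, len(idx)),
}, index=idx).reset_index()

# Label each row with its calendar month, then aggregate per (month, ticker):
# liquidity is averaged over the month, indicator values take the last observation.
daily['month'] = daily['date'].dt.to_period('M')
monthly = (daily.groupby(['month', 'ticker'])
                .agg(dollar_volume=('dollar_volume', 'mean'),
                     rsi=('rsi', 'last'))
                .reset_index())

# Keep the top_n most liquid tickers within each month (assumed ranking rule).
top_n = 2
rank = monthly.groupby('month')['dollar_volume'].rank(ascending=False, method='first')
liquid = monthly[rank <= top_n]
print(liquid)
```

Averaging dollar volume while taking the last value of each indicator keeps the liquidity measure stable and the features as of month end.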