<a href="https://colab.research.google.com/github/dkalenov/ML-Trading/blob/1_unsupervised-learning/K_Means_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Case Study

Traders who are particularly keen on pairs trading need a way to find pairs that:
    
    a) Are similar in risk and behaviour
    b) Are cointegrated

K-Means Clustering is a clustering technique that compares items across many features and groups them into clusters of similar items. Its wider applications are described in the theory sections of the course.
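To make the grouping idea concrete, here is a minimal sketch of the assign-and-update loop behind K-Means (Lloyd's algorithm), written in plain NumPy. The helper `kmeans_sketch` and the toy data are hypothetical and for illustration only; it is not scikit-learn's `KMeans`, which is imported further down in this notebook.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Bare-bones Lloyd's algorithm; hypothetical helper, not scikit-learn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Toy data: two well-separated groups of two-feature points
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[-2.0, -2.0], scale=0.2, size=(20, 2))
group_b = rng.normal(loc=[2.0, 2.0], scale=0.2, size=(20, 2))
X = np.vstack([group_a, group_b])

labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label values.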

Once stocks, cryptocurrencies, or FOREX pairs are grouped, cointegration calculations can be run on the grouped assets to identify candidate pairs statistically. Although cointegration is a statistical method rather than Machine Learning, the code has been included here for convenience.

### Imports

In [None]:
# Remove unwanted warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Data extraction and management
import pandas as pd
import numpy as np
from pandas_datareader.data import DataReader
from pandas_datareader.nasdaq_trader import get_nasdaq_symbols

# Feature Engineering
from sklearn.preprocessing import StandardScaler

# Machine Learning
from sklearn.cluster import KMeans
from sklearn import metrics

# Cointegration and Statistics
from statsmodels.tsa.stattools import coint
import statsmodels.api as sm

# Reporting visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

### Data Extraction

In [None]:
# import yfinance as yf

# data = yf.download("BTC", "2017-01-01", "2024-03-28")
# data = data[["Open", "High", "Low", "Adj Close", "Volume"]]
# data.tail()

In [None]:
# Set Data Extraction parameters
start_date = "2017-01-01"
end_date = "2024-03-27"
file_name = "data/raw_data_etf.csv"
file_name_coint = "data/raw_data_coint_pairs.csv"
load_existing = False
load_coint_pairs = False

In [None]:
# Get New or Load Existing Data
# Allow 15 mins for new data
import os
import yfinance as yf

# Create the output folder if it does not exist yet
os.makedirs("data", exist_ok=True)

if not load_existing:
    # Keep only ETFs listed on the NASDAQ Global Market (Market Category "G")
    symbols = get_nasdaq_symbols()
    symbols = symbols[(symbols["ETF"] == True) & (symbols["Market Category"] == "G")]
    symbols = list(symbols.index.values)
    # Download adjusted close prices for every symbol and cache them locally
    data = yf.download(symbols, "2017-01-01", "2024-03-28")["Adj Close"]
    data.to_csv(file_name)

[*********************100%%**********************]  616 of 616 completed
ERROR:yfinance:
1 Failed download:
ERROR:yfinance:['INRO']: Exception("%ticker%: Period 'max' is invalid, must be one of ['1d', '5d']")


In [39]:
# Load (or re-load for consistency) Data and remove features with NaNs
url = 'https://raw.githubusercontent.com/dkalenov/ML-Trading/1_unsupervised-learning/raw_data_etf.csv'
data = pd.read_csv(url, delimiter=',', encoding='utf-8', encoding_errors='ignore')
# data = pd.read_csv(file_name)  # alternative: read the locally saved copy

data.dropna(axis=1, inplace=True)
data = data.set_index("Date")
print("Shape: ", data.shape)
print("Null Values: ", data.isnull().values.any())
data.head()

Shape:  (1820, 258)
Null Values:  False


Date,AADR,AAXJ,ACWI,ACWX,AGNG,AGZD,AIA,AIRR,ALTY,ANGL,...,VTHR,VTIP,VTWG,VTWO,VTWV,VWOB,VXUS,VYMI,WOOD,XT
2017-01-03,38.256813,48.610233,52.050144,33.607403,14.513459,18.454012,40.685699,23.448168,8.507782,20.096876,...,92.241486,39.782997,106.508606,49.461548,90.864632,55.448395,37.350456,42.318398,47.666119,25.109818
2017-01-04,38.915295,49.092216,52.495319,33.980911,14.513459,18.49613,40.903088,23.793995,8.627369,20.159475,...,93.301636,39.799175,108.744781,50.203377,92.23938,55.6693,37.762672,42.747505,48.091232,25.436157
2017-01-05,39.332634,49.784531,52.696091,34.271423,14.513459,18.49231,41.407421,23.390034,8.691715,20.208172,...,93.132347,39.847736,107.907433,49.702747,90.995972,56.025684,38.04557,43.161812,48.055809,25.520075
2017-01-06,39.202793,49.539143,52.704815,34.130314,14.513459,18.476995,41.233509,23.390034,8.775432,20.229038,...,93.372902,39.782997,107.355507,49.493393,90.50563,55.8689,37.964741,42.932453,48.055809,25.566696
2017-01-09,39.341915,49.582966,52.582603,34.080517,14.513459,18.431034,41.34655,22.897402,8.719054,20.242949,...,93.052177,39.807278,107.288902,49.16116,89.411079,56.011406,37.892002,42.814091,47.745827,25.629171
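The `dropna(axis=1)` call in the cell above removes whole columns rather than rows: any ticker with even one missing price is discarded. A small sketch with made-up tickers (`OLD` and `NEW` are hypothetical) shows the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical price table: "NEW" only started trading on the second day
toy_prices = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "OLD": [10.0, 10.5, 10.2],
    "NEW": [np.nan, 20.0, 20.4],
})

# Drop every column that contains at least one NaN, then index by date
cleaned = toy_prices.dropna(axis=1).set_index("Date")
print(cleaned.columns.tolist())
```

An ETF that launched after the first date in the table has NaNs in its early rows, so its whole column is dropped. This keeps every remaining series the same length, at the cost of excluding newer instruments.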


### Feature Engineering

In [None]:
# Create DataFrame with annualised Returns and Volatility (255 trading days per year)
daily_returns = data.pct_change()
df_returns = pd.DataFrame(daily_returns.mean() * 255, columns=["Returns"])
df_returns["Volatility"] = daily_returns.std() * np.sqrt(255)
df_returns.head()

,Returns,Volatility
AADR,0.093457,0.223828
AAXJ,0.067449,0.206261
ACWI,0.121401,0.17992
ACWX,0.081153,0.179137
AGNG,0.118713,0.179965
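The two features can be checked by hand on a tiny example. The sketch below uses made-up prices for two hypothetical tickers (`AAA`, `BBB`) and the same annualisation factor of 255 used in the cell above: the mean daily return is multiplied by 255, and the daily standard deviation by the square root of 255.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 255  # same annualisation factor as the cell above

# Hypothetical prices for two made-up tickers
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 121.0], "BBB": [100.0, 102.0, 100.98]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

daily = prices.pct_change()  # first row is NaN and is skipped by mean() and std()
features = pd.DataFrame({
    "Returns": daily.mean() * TRADING_DAYS,             # mean daily return, annualised
    "Volatility": daily.std() * np.sqrt(TRADING_DAYS),  # sample std (ddof=1), annualised
})
print(features)
```

`AAA` gains 10% on both days, so its volatility is effectively zero, while `BBB` moves +2% and then -1%, which gives a lower return and a non-zero volatility.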


In [None]:
# Scale Features
scaler = StandardScaler()
scaled_values = scaler.fit_transform(df_returns)  # NumPy array of standardised values
scaled_data = pd.DataFrame(scaled_values, columns=df_returns.columns, index=df_returns.index)
df_scaled = scaled_data
df_scaled.head()

,Returns,Volatility
AADR,0.01997,0.222306
AAXJ,-0.310375,0.036983
ACWI,0.374891,-0.240908
ACWX,-0.136314,-0.249167
AGNG,0.34075,-0.240433
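Standard scaling subtracts each column's mean and divides by its standard deviation, so that returns and volatility contribute on a comparable scale when distances are computed. A minimal pandas-only sketch of the same transformation, on a hypothetical three-row table, looks like this. Note that `StandardScaler` uses the population standard deviation, hence `ddof=0`.

```python
import pandas as pd

# Hypothetical feature table with the same two columns as df_returns
toy = pd.DataFrame(
    {"Returns": [0.05, 0.10, 0.15], "Volatility": [0.10, 0.20, 0.60]},
    index=["AAA", "BBB", "CCC"],
)

# z-score per column; ddof=0 matches the population std that StandardScaler uses
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(toy_scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, which is why the values in the output above are centred around zero.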
