<a href="https://colab.research.google.com/github/dkalenov/ML-Trading/blob/1_unsupervised-learning/Principle_Component_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Case Study

The VIX measures the implied volatility of the S&P 500. The higher the VIX "Close", the greater the uncertainty in the market. Fortunately, there are ways of capitalizing on the predictability of the VIX. Being able to predict the direction of implied volatility with accuracy better than 50/50 can give an options trader or an ETF trader (VIX-tracking ETFs) an edge.

Many features could influence whether a price goes up or down. Predicting direction for stocks, the VIX, commodities, FX, etc. offers a large potential gain for every small improvement in forecasting ability.

However, having so many features leaves room for noise and can degrade supervised learning. Therefore, we will explore deploying PCA (Principal Component Analysis) as a tool for extracting useful information from a large set of indicators and features, to help with our supervised learning later on.

PCA can also be very useful for understanding correlations, and further study of PCA (based on the articles and papers below) is encouraged.
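
To make the idea concrete, here is a minimal sketch of what PCA computes, written with plain NumPy on synthetic data. The data, the variable names and the choice of two components are assumptions for illustration only; the notebook itself relies on scikit-learn's `PCA` class, which is imported in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples of 6 correlated features,
# generated from 2 hidden factors plus a little noise
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.1 * rng.normal(size=(500, 6))

# Standardize each feature so that no single scale dominates
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components are the eigenvectors of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance captured by each component
explained_ratio = eigvals / eigvals.sum()

# Project the data onto the first two components
n_components = 2
X_reduced = X_std @ eigvecs[:, :n_components]

print("Explained variance ratio:", explained_ratio.round(3))
print("Reduced shape:", X_reduced.shape)
```

The projected columns are uncorrelated with each other, which is why PCA is a natural fit for a large set of overlapping technical indicators.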

### Imports

In [2]:
! pip install ta

Collecting ta
  Downloading ta-0.11.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ta
  Building wheel for ta (setup.py) ... done
  Created wheel for ta: filename=ta-0.11.0-py3-none-any.whl size=29411 sha256=b4ea6950d9f4fdd2c97e7f5e2927054c3e65eb4ae7e597def1d9b47b8d759a42
  Stored in directory: /root/.cache/pip/wheels/5f/67/4f/8a9f252836e053e532c6587a3230bc72a4deb16b03a829610b
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.11.0


In [38]:
# Remove unwanted warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

# Data Management
import pandas as pd
import numpy as np
from pandas_datareader.data import DataReader
from ta import add_all_ta_features

# Statistics
from statsmodels.tsa.stattools import adfuller

# Unsupervised Machine Learning
from sklearn.decomposition import PCA

# Supervised Machine Learning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

# Reporting
import matplotlib.pyplot as plt

### Initial Data Extraction

In [40]:
# Data Extraction
import yfinance as yf

df = yf.download("^VIX", "2017-01-01", "2024-03-28")
df.tail()

[*********************100%%**********************]  1 of 1 completed


Date,Open,High,Low,Close,Adj Close,Volume
2024-03-21,12.98,13.08,12.4,12.92,12.92,0
2024-03-22,12.92,13.15,12.58,13.06,13.06,0
2024-03-25,13.67,13.67,13.11,13.19,13.19,0
2024-03-26,13.12,13.43,12.84,13.24,13.24,0
2024-03-27,13.13,13.34,12.66,12.78,12.78,0


In [41]:
# Add TA
df = add_all_ta_features(df, open="Open", high="High", low="Low", close="Adj Close", volume="Volume", fillna=True)
df.head()

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
2017-01-03,14.07,14.07,12.85,12.85,12.85,0,-0.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,12.85,0.0,0.0,0.0
2017-01-04,12.78,12.8,11.63,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.624394,-0.124879,-0.499515,0.0,0.0,0.0,10.584451,-7.782101,-8.101594,-7.782101
2017-01-05,11.96,12.09,11.4,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-1.226732,-0.345249,-0.881483,0.0,0.0,0.0,12.55084,-1.51899,-1.530645,-9.182881
2017-01-06,11.7,11.74,10.98,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-1.916831,-0.659566,-1.257265,0.0,0.0,0.0,11.306731,-2.999146,-3.045041,-11.90662
2017-01-09,11.71,12.08,11.46,11.56,11.56,0,-0.0,0,0.0,-0.0,...,-2.289756,-0.985604,-1.304152,0.0,0.0,0.0,11.383026,2.120148,2.097985,-10.03891
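
As a sanity check on what the library adds, the last three columns (`others_dr`, `others_dlr`, `others_cr`) appear to be the daily return, daily log return and cumulative return, all in percent. The sketch below recomputes them by hand from the first five closes in the table above; the column names in the sketch are our own, and filling the first row with zero is an assumption that mirrors `fillna=True`.

```python
import numpy as np
import pandas as pd

# First five closes, copied from the table above
close = pd.Series(
    [12.85, 11.85, 11.67, 11.32, 11.56],
    index=pd.to_datetime(
        ["2017-01-03", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-09"]
    ),
    name="Adj Close",
)

returns = pd.DataFrame(
    {
        # Percentage change versus the previous close
        "daily_return_pct": (close / close.shift(1) - 1) * 100,
        # Log of the ratio to the previous close, in percent
        "daily_log_return_pct": np.log(close / close.shift(1)) * 100,
        # Percentage change versus the first close in the sample
        "cumulative_return_pct": (close / close.iloc[0] - 1) * 100,
    }
).fillna(0.0)  # the first row has no previous close

print(returns.round(6))
```

The values agree with the table to within rounding, for example about -7.78 for the daily return on 2017-01-04.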


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1820 entries, 2017-01-03 to 2024-03-27
Data columns (total 92 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Open                       1820 non-null   float64
 1   High                       1820 non-null   float64
 2   Low                        1820 non-null   float64
 3   Close                      1820 non-null   float64
 4   Adj Close                  1820 non-null   float64
 5   Volume                     1820 non-null   int64  
 6   volume_adi                 1820 non-null   float64
 7   volume_obv                 1820 non-null   int64  
 8   volume_cmf                 1820 non-null   float64
 9   volume_fi                  1820 non-null   float64
 10  volume_em                  1820 non-null   float64
 11  volume_sma_em              1820 non-null   float64
 12  volume_vpt                 1820 non-null   float64
 13  volume_vwap                182

In [48]:
df.describe()

Statistic,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
count,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,...,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0,1820.0
mean,19.228209,20.350879,18.154319,19.046159,19.046159,0.0,0.0,0.0,0.0,0.0,...,-0.285067,-0.283662,-0.001405,0.0,0.0,0.0,19.288455,0.318528,-0.0003,48.219134
std,8.143595,8.924793,7.336414,8.058332,8.058332,0.0,0.0,0.0,0.0,0.0,...,5.740389,5.177555,2.217185,0.0,0.0,0.0,7.398471,8.441505,7.812252,62.710757
min,9.01,9.31,8.56,9.14,9.14,0.0,0.0,0.0,0.0,-0.0,...,-11.667174,-9.837047,-7.939708,0.0,0.0,0.0,9.827117,-25.905673,-29.983121,-28.871595
25%,13.4775,14.09,12.9375,13.325,13.325,0.0,0.0,0.0,0.0,-0.0,...,-3.948572,-3.487951,-1.217569,0.0,0.0,0.0,13.870389,-4.297046,-4.392103,3.696495
50%,17.43,18.355,16.57,17.224999,17.224999,0.0,0.0,0.0,0.0,-0.0,...,-1.198398,-0.871604,-0.102886,0.0,0.0,0.0,17.880522,-0.711446,-0.713989,34.046684
75%,22.82,24.0725,21.615,22.610001,22.610001,0.0,0.0,0.0,0.0,0.0,...,2.324325,1.9878,1.048587,0.0,0.0,0.0,23.186884,3.406701,3.349958,75.953307
max,82.690002,85.470001,70.370003,82.690002,82.690002,0.0,0.0,0.0,0.0,-0.0,...,33.144188,29.926287,13.120804,0.0,0.0,0.0,56.547472,115.597925,76.824503,543.501945
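
The summary shows that `Volume` and the volume-based indicators have zero standard deviation, so they carry no information for this index. A minimal sketch of how such constant columns could be detected and dropped is shown here on a small hypothetical frame (`df_demo` is not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature of the data: one informative column, two constant ones
df_demo = pd.DataFrame(
    {
        "Close": [12.85, 11.85, 11.67, 11.32],
        "Volume": [0, 0, 0, 0],
        "volume_obv": [0, 0, 0, 0],
    }
)

# A column with a single unique value has zero variance
constant_cols = [col for col in df_demo.columns if df_demo[col].nunique() == 1]

# Drop the constant columns, keeping everything else
df_reduced = df_demo.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", list(df_reduced.columns))
```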


### Data Preprocessing - Stationarity

In [61]:
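# Identify non-stationary columns with the Augmented Dickey-Fuller (ADF) test
non_stationaries = []

for col in df.columns:
    if len(df[col].unique()) == 1: # Skip constant columns (a single unique value)
        continue
    dftest = adfuller(df[col].values)
    p_value = dftest[1] # p-value of the test
    t_test = dftest[0] < dftest[4]["1%"] # True if the statistic is below the 1% critical value
    # Flag the column unless it passes both the p-value and the critical-value checks
    if p_value > 0.05 or not t_test:
        non_stationaries.append(col)
print(f"Non-Stationary Features Found: {len(non_stationaries)}")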

Non-Stationary Features Found: 8
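
To build intuition for the test statistic used above, the sketch below implements a simplified Dickey-Fuller regression in NumPy. It is not the statsmodels implementation: it omits the lagged differences that `adfuller` adds, and it compares against an approximate asymptotic 1% critical value of -3.43 for the constant-only case. The function name and the synthetic series are assumptions for illustration.

```python
import numpy as np

def dickey_fuller_stat(series):
    """t-statistic on rho in: diff(y)[t] = alpha + rho * y[t-1] + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    # Regressors: a constant and the lagged level
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov_beta[1, 1])

rng = np.random.default_rng(42)
white_noise = rng.normal(size=1000)              # stationary by construction
random_walk = np.cumsum(rng.normal(size=1000))   # has a unit root

CRIT_1PCT = -3.43  # approximate asymptotic 1% critical value (assumption)

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "stationary" if stat < CRIT_1PCT else "non-stationary"
    print(f"{name}: statistic = {stat:.2f} -> {verdict}")
```

A strongly negative statistic means the series is pulled back toward its mean, which is the evidence of stationarity the cell above looks for.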


In [62]:
# Convert non-stationaries to stationary
df_stationary = df.copy()
df_stationary[non_stationaries] = df_stationary[non_stationaries].pct_change()
df_stationary = df_stationary.iloc[1:]
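
Percentage changes behave badly around zeros, which is why NaN and infinite values need handling afterwards. A small sketch on a hypothetical series shows both cases; `fill_method=None` is passed only to avoid version-dependent default filling in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical series containing zeros
s = pd.Series([10.0, 11.0, 0.0, 0.0, 5.0])

changes = s.pct_change(fill_method=None)
# Row 0: NaN (no previous value)
# Row 3: NaN (0 / 0)
# Row 4: inf (5 / 0)
print(changes)

# Replace infinite values with 0, as the notebook does for its features
cleaned = changes.replace([np.inf, -np.inf], 0)
print(cleaned)
```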

In [63]:
# Drop columns that contain NaN values
na_list = df_stationary.columns[df_stationary.isna().any().tolist()]
df_stationary.drop(columns=na_list, inplace=True)

In [64]:
# Handle inf values
df_stationary.replace([np.inf, -np.inf], 0, inplace=True)
df_stationary.head()

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
2017-01-04,12.78,12.8,11.63,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.624394,-0.124879,-0.499515,0.0,0.0,0.0,10.584451,-7.782101,-8.101594,-7.782101
2017-01-05,11.96,12.09,11.4,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-1.226732,-0.345249,-0.881483,0.0,0.0,0.0,12.55084,-1.51899,-1.530645,-9.182881
2017-01-06,11.7,11.74,10.98,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-1.916831,-0.659566,-1.257265,0.0,0.0,0.0,11.306731,-2.999146,-3.045041,-11.90662
2017-01-09,11.71,12.08,11.46,11.56,11.56,0,-0.0,0,0.0,-0.0,...,-2.289756,-0.985604,-1.304152,0.0,0.0,0.0,11.383026,2.120148,2.097985,-10.03891
2017-01-10,11.59,11.79,11.31,11.49,11.49,0,-0.0,0,0.0,-0.0,...,-2.607109,-1.309905,-1.297204,0.0,0.0,0.0,11.411436,-0.605542,-0.607383,-10.583662
