<a href="https://colab.research.google.com/github/dkalenov/ML-Trading/blob/1_unsupervised-learning/Principle_Component_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Case Study

The VIX is a measure of the implied volatility of the S&P 500. The higher the "Close" of the VIX, the higher the uncertainty in the market. Fortunately, there are ways of capitalizing on the predictability of the VIX. Being able to predict implied volatility for the market with any accuracy better than 50/50 can give an options trader or an ETF trader (VIX-tracking ETFs) an edge.

There could be a great many features that affect whether a price goes up or down. When predicting direction for stocks, the VIX, commodities, FX, etc., every small percentage improvement in forecasting ability represents a large potential gain.

However, having so many features leaves room for noise and can adversely affect supervised learning. Therefore, we will explore deploying PCA (Principal Component Analysis) as a tool to find useful information within a vast array of indicators and features to help with our supervised learning later on.

PCA can also be extremely good at helping to understand correlations, and further study of PCA (based on the articles and papers below) is encouraged.
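To make the idea concrete before touching market data, here is a minimal sketch of PCA compressing a set of correlated features. The data is synthetic (an assumption for illustration, not VIX data): ten features are built from two hidden drivers plus a little noise, so most of the variance should sit in a small number of components. The 95% variance threshold is also an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide indicator matrix (assumption: not real market data).
rng = np.random.default_rng(0)
n_rows = 500
latent = rng.normal(size=(n_rows, 2))      # two hidden drivers
mixing = rng.normal(size=(2, 10))          # each feature blends the two drivers
features = latent @ mixing + 0.1 * rng.normal(size=(n_rows, 10))

# PCA is sensitive to scale, so standardize the features first.
scaled = StandardScaler().fit_transform(features)

# A float n_components keeps enough components to explain that share of variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The `explained_variance_ratio_` attribute is a quick way to see how much of the original information each component carries.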

### Imports

In [1]:
! pip install ta

Collecting ta
  Downloading ta-0.11.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ta
  Building wheel for ta (setup.py) ... done
  Created wheel for ta: filename=ta-0.11.0-py3-none-any.whl size=29411 sha256=fecad8f42cca5a01c68979fff2876e1876a0f069affb67a0cbe3384dc2713550
  Stored in directory: /root/.cache/pip/wheels/5f/67/4f/8a9f252836e053e532c6587a3230bc72a4deb16b03a829610b
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.11.0


In [2]:
# Remove unwanted warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

# Data Management
import pandas as pd
import numpy as np
from pandas_datareader.data import DataReader
from ta import add_all_ta_features

# Statistics
from statsmodels.tsa.stattools import adfuller

# Unsupervised Machine Learning
from sklearn.decomposition import PCA

# Supervised Machine Learning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

# Reporting
import matplotlib.pyplot as plt

### Initial Data Extraction

In [68]:
# Data Extraction
import yfinance as yf

df = yf.download("^VIX", start="2017-01-01", end="2022-01-01")
df.head()



Date,Open,High,Low,Close,Adj Close,Volume
2017-01-03,14.07,14.07,12.85,12.85,12.85,0
2017-01-04,12.78,12.8,11.63,11.85,11.85,0
2017-01-05,11.96,12.09,11.4,11.67,11.67,0
2017-01-06,11.7,11.74,10.98,11.32,11.32,0
2017-01-09,11.71,12.08,11.46,11.56,11.56,0


In [69]:
# Add TA
df = add_all_ta_features(df, open="Open", high="High", low="Low", close="Adj Close", volume="Volume", fillna=True)

In [None]:
df.info()

In [16]:
df.describe()

statistic,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
count,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,...,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0,1821.0
mean,19.22475,20.346897,18.1514,19.042845,19.042845,0.0,0.0,0.0,0.0,0.0,...,-0.285955,-0.283996,-0.001959,0.0,0.0,0.0,19.28664,0.319342,0.00068,48.193339
std,8.142696,8.923959,7.335456,8.05736,8.05736,0.0,0.0,0.0,0.0,0.0,...,5.738937,5.176152,2.216702,0.0,0.0,0.0,7.396205,8.439257,7.810217,62.703189
min,9.01,9.31,8.56,9.14,9.14,0.0,0.0,0.0,0.0,-0.0,...,-11.667174,-9.837047,-7.939708,0.0,0.0,0.0,9.827139,-25.905673,-29.983121,-28.871595
25%,13.47,14.09,12.93,13.31,13.31,0.0,0.0,0.0,0.0,0.0,...,-3.946344,-3.487659,-1.217367,0.0,0.0,0.0,13.870061,-4.292394,-4.387241,3.579767
50%,17.43,18.34,16.559999,17.219999,17.219999,0.0,0.0,0.0,0.0,0.0,...,-1.199566,-0.873431,-0.109091,0.0,0.0,0.0,17.880101,-0.70797,-0.710488,34.007773
75%,22.82,24.07,21.610001,22.610001,22.610001,0.0,0.0,0.0,0.0,0.0,...,2.323653,1.985471,1.048119,0.0,0.0,0.0,23.186717,3.405903,3.349187,75.953307
max,82.690002,85.470001,70.370003,82.690002,82.690002,0.0,0.0,0.0,0.0,-0.0,...,33.144188,29.926287,13.120804,0.0,0.0,0.0,56.547472,115.597925,76.824503,543.501945
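The summary above shows several columns (for example `Volume` and the volume-based indicators) with a standard deviation of zero. Such constant columns carry no information. The sketch below shows one way to spot them with `nunique()`; the toy frame and its values are assumptions standing in for the real indicator table.

```python
import pandas as pd

# Toy frame (assumption: stands in for the indicator table above).
toy = pd.DataFrame({
    "Close": [12.85, 11.85, 11.67, 11.32],
    "Volume": [0, 0, 0, 0],        # constant, like an index with no traded volume
    "volume_obv": [0, 0, 0, 0],    # derived from Volume, so also constant
})

# A column with a single distinct value has zero variance.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
informative = toy.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", informative.columns.tolist())
```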


### Data Preprocessing - Stationarity

In [70]:
# Identify non-stationary columns
non_stationaries = []

for col in df.columns:
    if df[col].nunique() > 1:  # Skip constant columns (only one distinct value)
        dftest = adfuller(df[col].values)
        p_value = dftest[1]
        t_test = dftest[0] < dftest[4]["1%"]
        if p_value > 0.05 or not t_test:
            non_stationaries.append(col)
print(f"Non-Stationary Features Found: {len(non_stationaries)}")

Non-Stationary Features Found: 15


In [71]:
non_stationaries

['Low',
 'volatility_bbh',
 'volatility_bbl',
 'volatility_kcl',
 'volatility_dcl',
 'volatility_dcm',
 'trend_sma_fast',
 'trend_sma_slow',
 'trend_ema_slow',
 'trend_ichimoku_base',
 'trend_ichimoku_b',
 'trend_visual_ichimoku_a',
 'trend_visual_ichimoku_b',
 'trend_psar_up',
 'momentum_kama']

In [90]:
# Convert non-stationaries to stationary
df_stationary = df.copy()
df_stationary[non_stationaries] = df_stationary[non_stationaries].pct_change()
df_stationary = df_stationary.iloc[1:]
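A small sketch of what the conversion does, using made-up levels for one column (the values are assumptions): `pct_change` turns levels into period-over-period relative changes, and the first row is dropped because it has no predecessor.

```python
import pandas as pd

# Illustrative levels for a single non-stationary column (values are made up).
levels = pd.Series([10.0, 11.0, 9.9, 9.9], name="trend_sma_fast")

# pct_change computes (x_t / x_{t-1}) - 1, so the first value is NaN.
changes = levels.pct_change()
print(changes)

# Dropping the first row removes the NaN introduced by the transformation.
changes = changes.iloc[1:]
print(changes)
```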

In [91]:
df_stationary.head()

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
2017-01-04,12.78,12.8,-0.094942,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.624394,-0.124879,-0.499515,0.0,0.0,0.0,-0.131561,-7.782101,-8.101594,-7.782101
2017-01-05,11.96,12.09,-0.019776,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-1.226732,-0.345249,-0.881483,0.0,0.0,0.0,0.053153,-1.51899,-1.530645,-9.182881
2017-01-06,11.7,11.74,-0.036842,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-1.916831,-0.659566,-1.257265,0.0,0.0,0.0,-0.024185,-2.999146,-3.045041,-11.90662
2017-01-09,11.71,12.08,0.043716,11.56,11.56,0,-0.0,0,0.0,-0.0,...,-2.289756,-0.985604,-1.304152,0.0,0.0,0.0,0.003828,2.120148,2.097985,-10.03891
2017-01-10,11.59,11.79,-0.013089,11.49,11.49,0,-0.0,0,0.0,-0.0,...,-2.607109,-1.309905,-1.297204,0.0,0.0,0.0,-0.000855,-0.605542,-0.607383,-10.583662


In [85]:
# Find columns that contain NaN values and drop them
na_list = df_stationary.columns[df_stationary.isna().any().tolist()]
df_stationary.drop(columns=na_list, inplace=True)

In [94]:
na_list

Index([], dtype='object')

In [95]:
df_stationary.head()

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
2017-01-04,12.78,12.8,-0.094942,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.624394,-0.124879,-0.499515,0.0,0.0,0.0,-0.131561,-7.782101,-8.101594,-7.782101
2017-01-05,11.96,12.09,-0.019776,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-1.226732,-0.345249,-0.881483,0.0,0.0,0.0,0.053153,-1.51899,-1.530645,-9.182881
2017-01-06,11.7,11.74,-0.036842,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-1.916831,-0.659566,-1.257265,0.0,0.0,0.0,-0.024185,-2.999146,-3.045041,-11.90662
2017-01-09,11.71,12.08,0.043716,11.56,11.56,0,-0.0,0,0.0,-0.0,...,-2.289756,-0.985604,-1.304152,0.0,0.0,0.0,0.003828,2.120148,2.097985,-10.03891
2017-01-10,11.59,11.79,-0.013089,11.49,11.49,0,-0.0,0,0.0,-0.0,...,-2.607109,-1.309905,-1.297204,0.0,0.0,0.0,-0.000855,-0.605542,-0.607383,-10.583662


In [96]:
# Handle inf values
df_stationary.replace([np.inf, -np.inf], 0, inplace=True)
df_stationary.head()

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
2017-01-04,12.78,12.8,-0.094942,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.624394,-0.124879,-0.499515,0.0,0.0,0.0,-0.131561,-7.782101,-8.101594,-7.782101
2017-01-05,11.96,12.09,-0.019776,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-1.226732,-0.345249,-0.881483,0.0,0.0,0.0,0.053153,-1.51899,-1.530645,-9.182881
2017-01-06,11.7,11.74,-0.036842,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-1.916831,-0.659566,-1.257265,0.0,0.0,0.0,-0.024185,-2.999146,-3.045041,-11.90662
2017-01-09,11.71,12.08,0.043716,11.56,11.56,0,-0.0,0,0.0,-0.0,...,-2.289756,-0.985604,-1.304152,0.0,0.0,0.0,0.003828,2.120148,2.097985,-10.03891
2017-01-10,11.59,11.79,-0.013089,11.49,11.49,0,-0.0,0,0.0,-0.0,...,-2.607109,-1.309905,-1.297204,0.0,0.0,0.0,-0.000855,-0.605542,-0.607383,-10.583662
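Infinite values can appear after `pct_change` whenever the previous value of a column is exactly zero. The sketch below reproduces that on a made-up series (an assumption for illustration) and applies the same replacement used in the cell above.

```python
import numpy as np
import pandas as pd

# A zero followed by a non-zero value makes the percentage change infinite.
series = pd.Series([0.0, 2.0, 1.0])
changes = series.pct_change()
print(changes)

# Same replacement as in the notebook cell: map +inf and -inf to 0.
cleaned = changes.replace([np.inf, -np.inf], 0)
print(cleaned)
```

Replacing with 0 treats an undefined change as "no change"; whether that is appropriate depends on the indicator, so it is worth checking which columns are affected.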


### Data Preprocessing - Scaling and Target Setting

In [101]:
# Set Target (for Supervised ML later on): 1 if the next day's Adj Close is higher, otherwise -1
df_stationary["TARGET"] = -1
df_stationary.loc[df_stationary["Adj Close"].shift(-1) > df_stationary["Adj Close"], "TARGET"] = 1
df_stationary.head(3)

Date,Open,High,Low,Close,Adj Close,Volume,volume_adi,volume_obv,volume_cmf,volume_fi,...,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr,TARGET
2017-01-04,12.78,12.8,-0.094942,11.85,11.85,0,-0.0,0,0.0,-0.0,...,-0.124879,-0.499515,0.0,0.0,0.0,-0.131561,-7.782101,-8.101594,-7.782101,-1
2017-01-05,11.96,12.09,-0.019776,11.67,11.67,0,-0.0,0,0.0,-0.0,...,-0.345249,-0.881483,0.0,0.0,0.0,0.053153,-1.51899,-1.530645,-9.182881,-1
2017-01-06,11.7,11.74,-0.036842,11.32,11.32,0,-0.0,0,0.0,-0.0,...,-0.659566,-1.257265,0.0,0.0,0.0,-0.024185,-2.999146,-3.045041,-11.90662,1
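The labelling logic can be checked on a tiny made-up price series (values are assumptions). One detail worth noticing: the last row has no next-day value, so the comparison is `False` and it keeps the default label of -1. Since that label is not `NaN`, a later `dropna` will not remove it; the sketch shows one possible way to exclude it explicitly.

```python
import pandas as pd

# Made-up closing prices for illustration.
prices = pd.DataFrame({"Adj Close": [10.0, 12.0, 11.0, 11.0, 13.0]})

# Default to -1, then mark rows whose next close is strictly higher.
prices["TARGET"] = -1
next_is_higher = prices["Adj Close"].shift(-1) > prices["Adj Close"]
prices.loc[next_is_higher, "TARGET"] = 1
print(prices)

# The final row's label is a default rather than an observation,
# so one option is to leave it out of the labelled set.
labelled = prices.iloc[:-1]
print(labelled)
```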


In [102]:
df_stationary.dropna(inplace=True)

In [103]:
# Split Target from Featureset
X = df_stationary.iloc[:, :-1] # all rows and all columns are selected except the last column ["TARGET"].
y = df_stationary.iloc[:, -1] # select the last column ["TARGET"]

In [104]:
# Feature Scaling
df_sc = df_stationary.copy()
X_fs = StandardScaler().fit_transform(X)

In [107]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X_fs, y, test_size=0.7, random_state=42)
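For time-ordered data, a common alternative to the steps above is to split chronologically and fit the scaler on the training rows only, so that no statistics from the test period leak into training. The following is a sketch of that variant, not the notebook's method; the synthetic data, the `_demo` names and the 30% test size are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels standing in for X and y (assumption).
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.choice([-1, 1], size=200)

# shuffle=False keeps row order: earlier rows train, later rows test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_sc = scaler.transform(X_tr)
X_te_sc = scaler.transform(X_te)

print(X_tr_sc.shape, X_te_sc.shape)
```

Note that `test_size` is the fraction held out for testing, so a value of 0.7 leaves only 30% of the rows for training.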