# Stock Market Analysis using PCA

## Problem Statement

As trading becomes increasingly automated, traders try to use as much data as they can in their analyses. But adding more variables adds complexity, which can make it harder to reach solid conclusions. Consider that more than 3,000 companies are listed on the New York Stock Exchange: even a simple exercise to find pairs among them is computationally intensive. Wouldn't it be wonderful if we could use many variables and still keep the analysis simple?
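To put a number on the pair-search problem, the short sketch below counts the unordered pairs among roughly 3,000 stocks. The round figure of 3,000 is taken from the paragraph above and is only approximate.

```python
import math

n_stocks = 3000  # approximate number of listed companies quoted above

# Number of unordered pairs: n * (n - 1) / 2
n_pairs = math.comb(n_stocks, 2)
print(n_pairs)  # 4498500
```

Each of those roughly 4.5 million pairs would need its own correlation or spread analysis, which is why reducing the dimensionality first is attractive.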

## Business Objective

Understand how Principal Component Analysis (PCA) can be used to identify highly correlated stocks and build a pair trading strategy.

## Steps of Analysis
- Generate Data
- Prepare Data for PCA
- Perform Assumption check for PCA
- Perform PCA
- Identify Pairs from PCA result
- Create a Trading Strategy

### Import Libraries

In [5]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # plotting, used throughout the notebook
import seaborn as sns            # heatmaps

import warnings
warnings.filterwarnings('ignore')

### Generate Data
We fetch data from Yahoo Finance for the top 20 companies on the NSE, ranked by market capitalization.

In [3]:
stock_df = pd.read_csv('NSE.csv')
df = pd.DataFrame()

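# Download each ticker's adjusted close. auto_adjust=False asks yfinance to keep
# the 'Adj Close' column, which some yfinance versions may omit by default.
for ticker in stock_df.Ticker[:20]:
    df[ticker] = yf.download(ticker, start='2018-01-01', end='2022-03-15', auto_adjust=False)['Adj Close']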

df.head()


### Let's Normalize the data for Daily returns instead of Closing Price

In [None]:
data_daily_returns = df.pct_change()
data_daily_returns.dropna(inplace=True)
data_daily_returns.head()

In [None]:
data_daily_returns.shape

We have 1,000+ days of daily returns on which to run PCA.
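As a quick illustration of what `pct_change` computes, here is a minimal sketch on made-up prices (the `TOY` series is hypothetical, not NSE data). Each daily return is the current price divided by the previous price, minus one.

```python
import pandas as pd

# Made-up closing prices for a hypothetical ticker
prices = pd.Series([100.0, 102.0, 99.96], name="TOY")

# pandas shortcut used in the notebook
returns = prices.pct_change().dropna()

# Same calculation written out: p_t / p_(t-1) - 1
manual = (prices / prices.shift(1) - 1).dropna()

print(returns.round(4).tolist())  # [0.02, -0.02]
```

The first row is dropped because it has no previous day to compare against.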

### Let's Check for Correlation in the Daily returns of the Stocks

In [None]:
plt.figure(figsize=(18,8))
sns.heatmap(data_daily_returns.corr(), annot=True)

### Let's validate the Assumptions

#### Bartlett test of Sphericity
Bartlett's test of sphericity tests the hypothesis that the variables in the data are uncorrelated.
* H0: All variables in the data are uncorrelated.
* Ha: At least one pair of variables in the data is correlated.

If the null hypothesis cannot be rejected, PCA is not advisable.

If the p-value is small, we can reject the null hypothesis and conclude that at least one pair of variables in the data is correlated, so PCA is recommended.
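To show what the test measures, here is a rough sketch of the standard Bartlett statistic computed directly with NumPy and SciPy on synthetic data. It is not the `factor_analyzer` implementation; the function name `bartlett_sphericity` and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Return (chi-square statistic, degrees of freedom, p-value)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # log|R| is 0 for an identity correlation matrix and negative otherwise
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    p_value = chi2.sf(statistic, dof)
    return statistic, dof, p_value

# Synthetic "returns": three series sharing one common factor
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
toy_returns = common + 0.5 * rng.normal(size=(500, 3))

stat, dof, p_value = bartlett_sphericity(toy_returns)
print(dof, p_value < 0.05)
```

Because the three toy series share a common factor, the determinant of their correlation matrix is far from 1 and the test rejects the null hypothesis.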


In [None]:
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_sq_value, p_value = calculate_bartlett_sphericity(data_daily_returns)
p_value

The derived p-value is 0.0, which is less than alpha (0.05), so we reject the null hypothesis and conclude that at least one pair of variables in the data is correlated and can be grouped.

#### KMO Test
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is.  
<br>Generally, if MSA is less than 0.5, *PCA is not recommended*, since no reduction is expected. On the other hand, an MSA greater than 0.7 is expected to provide a considerable reduction in dimension and the extraction of meaningful components.
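For intuition, the overall KMO index compares squared correlations with squared partial correlations. The sketch below computes it from scratch on synthetic data; the helper name `kmo_overall` and the toy data are assumptions, and this is not the `factor_analyzer` code itself.

```python
import numpy as np

def kmo_overall(data):
    """Overall KMO: sum(r^2) / (sum(r^2) + sum(partial r^2)), off-diagonal only."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations follow from the inverse correlation matrix
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    a2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + a2)

# Synthetic "returns": four series driven by one common factor
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
toy_returns = common + 0.5 * rng.normal(size=(500, 4))

kmo = kmo_overall(toy_returns)
print(round(kmo, 2))
```

When the variables share a lot of common variance, partial correlations are small relative to plain correlations and the index moves toward 1.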

In [None]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(data_daily_returns)
print(kmo_model)

Since the MSA value is greater than 0.5, the test suggests the data is adequate for PCA.

### Let's Perform PCA

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 5)
pca.fit_transform(data_daily_returns)

### Let's Check the Cumulative Variance to identify the number of Components

In [None]:
np.cumsum(pca.explained_variance_ratio_)

We will use the first 5 components from this PCA model
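To see where the explained-variance ratios come from, the sketch below reproduces the idea with a plain eigendecomposition of the covariance matrix instead of scikit-learn. The synthetic data and the 90% cut-off are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic "returns": four series driven mostly by one common factor
rng = np.random.default_rng(1)
factor = rng.normal(size=(300, 1))
toy = factor + 0.3 * rng.normal(size=(300, 4))

# Eigenvalues of the covariance matrix are the component variances
cov = np.cov(toy, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # sort descending

ratio = eigvals / eigvals.sum()          # share of variance per component
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_keep)
```

A threshold like this is one common way to choose the number of components; the right cut-off depends on how much information loss is acceptable.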

### Let's validate the Factor Loadings to identify the groups

In [None]:
df_comp = pd.DataFrame(pca.components_,columns=list(data_daily_returns))

from matplotlib.patches import Rectangle
fig, ax = plt.subplots(figsize=(15,4), facecolor='w', edgecolor='k')
ax = sns.heatmap(df_comp, annot=True, vmax=1, vmin=0, cmap='Blues', cbar=False, fmt='.2g', ax=ax,
                yticklabels=['PC0','PC1','PC2','PC3','PC4'])

column_max = df_comp.abs().idxmax(axis=0)

for col, variable in enumerate(df_comp.columns):
    position = df_comp.index.get_loc(column_max[variable])
    ax.add_patch(Rectangle((col, position), 1, 1, fill=False, edgecolor='red', lw=1))
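The red boxes mark, for each ticker, the component with the largest absolute loading. A minimal sketch of turning that into explicit groups is shown below; the loadings and ticker names are made up for illustration.

```python
import pandas as pd

# Made-up loadings: rows are components, columns are hypothetical tickers
loadings = pd.DataFrame(
    [[0.60, 0.55, 0.10, -0.05],
     [0.05, -0.10, 0.70, -0.65]],
    index=["PC0", "PC1"],
    columns=["BANK_A", "BANK_B", "TECH_A", "TECH_B"],
)

# Component with the largest absolute loading for each ticker
dominant = loadings.abs().idxmax(axis=0)

# Collect tickers that share the same dominant component
groups = {}
for ticker, component in dominant.items():
    groups.setdefault(component, []).append(ticker)

print(groups)  # {'PC0': ['BANK_A', 'BANK_B'], 'PC1': ['TECH_A', 'TECH_B']}
```

Tickers in the same group are candidates for pairing, although the sign of the loading still matters: `TECH_B` loads negatively here, so it tends to move against `TECH_A` along that component.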

### Let's Randomly pick 2 tickers from the first component as that's explaining the most variance

In [None]:
df1 = df[['HDFCBANK.NS', 'KOTAKBANK.NS']].copy()  # copy so new columns can be added without a pandas warning
df1.head()

### Let's Explore these two tickers

In [None]:
df1.plot(figsize=(15,6))

The plot above shows that the two stocks moved in broadly the same direction over the past 4 years, which makes them a promising pair to trade.

In [None]:
df1.corr()['HDFCBANK.NS']

The correlation between the two companies' share prices is very high.

In [None]:
returns = df1.pct_change()
returns.dropna(inplace=True)
returns.plot(figsize=(15,6))

The plot above shows that the daily percent change in price is almost the same for both stocks, which also suggests that the two will make a good pair.

In [None]:
df1['Ratio'] = df1['HDFCBANK.NS'] / df1['KOTAKBANK.NS']
df1['Ratio'].plot(figsize=(15,6))

The ratio of the two prices suggests that the maximum difference between them is 35%, but we need to normalize the ratio to a z-score and check whether it ranges between -1 and +1 SD. The reason is that when we buy one stock and sell the other, we want the ratio to revert toward its mean; that is when we stand a chance to make a profit.

In [None]:
def zscore(series):
    return (series - series.mean()) / np.std(series)


zscore(df1['Ratio']).plot(figsize=(15,6))
plt.axhline(zscore(df1['Ratio']).mean())
plt.axhline(1.0, color='red')
plt.axhline(-1.0, color='green')

plt.show()

The plot above suggests that the ratio mostly stays between -1 and +1 SD, so these two stocks can be considered for a strategy.
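"Mostly" can be quantified by counting the share of observations inside the band. The sketch below does this on a synthetic ratio series (an assumption, not the HDFCBANK/KOTAKBANK data); for roughly normal data about 68% of points fall within one standard deviation.

```python
import numpy as np

# Synthetic price ratio fluctuating around 1.0
rng = np.random.default_rng(2)
toy_ratio = 1.0 + 0.05 * rng.normal(size=1000)

# Same z-score definition as above (population standard deviation)
z = (toy_ratio - toy_ratio.mean()) / toy_ratio.std()

# Fraction of days on which the z-score stays within +/- 1 SD
share_inside = np.mean(np.abs(z) <= 1.0)
print(round(float(share_inside), 3))
```

The same two lines applied to `df1['Ratio']` would give the corresponding share for the actual pair.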

### Let's create a Trading Strategy

In [None]:
ratios = df1['Ratio']

train = ratios[:726]
test = ratios[726:]

ratios_mavg5 = train.rolling(window=5, center=False).mean()
ratios_mavg60 = train.rolling(window=60, center=False).mean()
std_60 = train.rolling(window=60, center=False).std()
zscore_60_5 = (ratios_mavg5 - ratios_mavg60)/std_60

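buy = train.copy()
sell = train.copy()
# Keep the ratio only where the z-score is beyond the threshold. Writing the
# masks with ~ also zeroes the warm-up rows where the rolling z-score is NaN.
buy[~(zscore_60_5 <= -1)] = 0
sell[~(zscore_60_5 >= 1)] = 0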
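The signal logic above can also be wrapped in a small reusable function. The sketch below is one possible way to do that, run on a synthetic ratio; the function name `rolling_zscore_signals`, its parameters, and the +1/-1/0 encoding are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rolling_zscore_signals(ratio, short=5, long=60, threshold=1.0):
    """Return (z-score, signal): +1 buy the ratio, -1 sell the ratio, 0 no trade."""
    short_avg = ratio.rolling(window=short).mean()
    long_avg = ratio.rolling(window=long).mean()
    long_std = ratio.rolling(window=long).std()
    z = (short_avg - long_avg) / long_std

    signal = pd.Series(0, index=ratio.index, dtype=int)
    signal[z < -threshold] = 1   # ratio unusually low: buy S1, sell S2
    signal[z > threshold] = -1   # ratio unusually high: sell S1, buy S2
    return z, signal

# Synthetic ratio series standing in for train
rng = np.random.default_rng(3)
toy_ratio = pd.Series(1.0 + 0.02 * rng.normal(size=300))

z, signal = rolling_zscore_signals(toy_ratio)
print(signal.value_counts().to_dict())
```

Rows inside the 60-day warm-up window have a NaN z-score and therefore stay at 0, so no signal is emitted before enough history is available.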

In [None]:
plt.figure(figsize=(12,7))
S1 = df1['HDFCBANK.NS'].iloc[:726]
S2 = df1['KOTAKBANK.NS'].iloc[:726]


S1[60:].plot(color='b')
S2[60:].plot(color='c')
buyR = 0*S1.copy()
sellR = 0*S1.copy()

# When you buy the ratio, you buy stock S1 and sell S2
buyR[buy!=0] = S1[buy!=0]
sellR[buy!=0] = S2[buy!=0]

# When you sell the ratio, you sell stock S1 and buy S2
buyR[sell!=0] = S2[sell!=0]
sellR[sell!=0] = S1[sell!=0]

buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='^')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, min(S1.min(), S2.min()), max(S1.max(), S2.max())))

plt.legend(['HDFCBANK', 'KOTAKBANK', 'Buy Signal', 'Sell Signal'])
plt.show()

The plot above shows, on historical data, when the strategy would have bought and sold these stocks.

### Areas of Improvement and Further Steps
This is by no means a perfect strategy, and our implementation leaves room for improvement in several areas.

This case study is for educational purposes only. Trade at your own risk. Great Learning or the Mentor will not be liable for any market risks.