# 0. PCA ANALYSIS OF DOW JONES STOCKS

This notebook is based on Nathan Thomas's notebook, published at:
https://towardsdatascience.com/applying-pca-to-the-yield-curve-4d2023e555b3
which we have commented and extended.

## 1. Import and clean data

First we import the daily stock prices and convert them to daily percentage returns.

In [1]:
!pip install openpyxl



In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Import daily prices from CSV
df = pd.read_csv("indu_dly.csv", index_col="Date")
# Convert prices to daily returns and drop rows with missing values
df = df.pct_change(1).dropna(how="any")
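
To make the cleaning step concrete, here is a minimal sketch of what `pct_change(1)` followed by `dropna` does. The tickers `AAA` and `BBB` and their prices are made up for illustration and are not taken from `indu_dly.csv`.

```python
import pandas as pd

# Hypothetical toy prices for two made-up tickers (not the notebook's data).
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 99.0], "BBB": [50.0, 50.0, 55.0]},
    index=pd.Index(["2020-01-02", "2020-01-03", "2020-01-06"], name="Date"),
)

# Simple daily return: price_t / price_{t-1} - 1.
returns = prices.pct_change(1)

# The first row has no previous price, so it is NaN and dropna removes it.
returns = returns.dropna(how="any")
print(returns)
```

Three price rows become two return rows, because the first day has no previous price to compare against.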

## 2. Compute the eigenvalues & eigenvectors

In [5]:
from sklearn.decomposition import PCA

In [8]:
pca = PCA()
pca.fit(df)  # Do the fitting
eigenValues = pca.explained_variance_  # Eigenvalues, sorted from largest to smallest
eigenVectors = pca.components_         # Eigenvectors stored as rows, ordered from top to bottom (row 0 is pc1)
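
As a sanity check on what these two attributes hold, the sketch below compares them with a direct eigendecomposition of the sample covariance matrix. It uses synthetic random data as a stand-in for the returns, and the names `X`, `pca_demo`, `cov`, `eigvals` and `V` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the returns matrix (rows = days, columns = stocks).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)

# Eigenvalues of the sample covariance matrix (ddof=1), largest first.
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# explained_variance_ should match those eigenvalues.
print(np.allclose(pca_demo.explained_variance_, eigvals))

# Each row of components_ is a unit-length eigenvector, and the rows are orthogonal.
V = pca_demo.components_
print(np.allclose(V @ V.T, np.eye(4)))

# Check the eigenvector equation cov @ v = lambda * v for the first component.
print(np.allclose(cov @ V[0], pca_demo.explained_variance_[0] * V[0]))
```

All three checks should print `True`.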

## 3. PCA projections

We now calculate the PCA projections (what we have been calling the transformed "Z" features in 4.PCAInMoreDepth.pptx, slides 25 to 32).
These are "latent" or hidden features (as per slide 37) that
drive the movement of the stock returns as a whole.
pc1 is the most important latent feature, the one that captures the most variance.

In [9]:
principal_component_projections = pca.transform(df)  # project the centered returns onto the eigenvectors
pc1_proj = principal_component_projections[:, 0]     # first column is the pc1 projection
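
The sketch below shows what `pca.transform` computes under the default `PCA()` settings (no whitening): it centers the data and takes the dot product with each eigenvector. The data is synthetic and the names `X`, `pca_demo`, `Z_sklearn` and `Z_manual` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))  # synthetic stand-in for the returns matrix

pca_demo = PCA().fit(X)
Z_sklearn = pca_demo.transform(X)

# Manual projection: center, then take dot products with each eigenvector (row of components_).
Z_manual = (X - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(Z_sklearn, Z_manual))

# The projections are uncorrelated: their covariance matrix is diagonal,
# with the eigenvalues on the diagonal.
cov_Z = np.cov(Z_sklearn, rowvar=False)
print(np.allclose(cov_Z, np.diag(pca_demo.explained_variance_)))
```

Both checks should print `True`. The second check shows why each projection can be treated as a separate latent feature.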

## 4. Comparison with Dow Jones Index

In [10]:
df_indu_index = pd.read_csv("indu_index_dly.csv", index_col="Date")
df_indu_index_ret = df_indu_index.pct_change(1).dropna(how="any")
indu_index = df_indu_index_ret.squeeze()

The correlation between the daily up and down movements of the pc1 projection and those of the Dow Jones Index is very high:

In [12]:
np.corrcoef(pc1_proj, indu_index)

array([[1.        , 0.98062392],
       [0.98062392, 1.        ]])

## 5. Variance

The percent of variance explained by the first principal component (first eigenvalue):

In [13]:
explained_variance_pc1 = pca.explained_variance_ratio_[0] * 100

In [15]:
# Percent of total variance explained by the first principal component (first eigenvalue)
explained_variance_pc1

53.3612817168958

The percent of variance explained by the first principal component projection:

In [17]:
projections_covariance = np.cov(principal_component_projections.T)

variance_pc1_projection = projections_covariance[0, 0]
total_variance = np.sum(np.var(df.values, axis=0, ddof=1))

variance_pc1_percent = (variance_pc1_projection / total_variance) * 100

In [18]:
# Percent of total variance carried by the first principal component projection
variance_pc1_percent

53.36128171689584

THEY ARE THE SAME (up to floating-point rounding in the last digits)
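
A short derivation of why the two percentages agree. The notation is introduced here and is not from the original notebook: $X_c$ is the centered $n \times p$ returns matrix with columns $x_j$, $S$ is its sample covariance matrix, $(\lambda_k, v_k)$ are the eigenvalue and unit eigenvector pairs of $S$, and $z_1$ is the pc1 projection.

```latex
\begin{aligned}
S &= \tfrac{1}{n-1} X_c^\top X_c, \qquad S v_k = \lambda_k v_k, \qquad v_k^\top v_k = 1, \\
z_1 &= X_c v_1, \\
\operatorname{Var}(z_1) &= \tfrac{1}{n-1} z_1^\top z_1 = v_1^\top S v_1 = \lambda_1, \\
\sum_{j=1}^{p} \operatorname{Var}(x_j) &= \operatorname{tr}(S) = \sum_{k=1}^{p} \lambda_k, \\
\frac{\operatorname{Var}(z_1)}{\sum_{j} \operatorname{Var}(x_j)} &= \frac{\lambda_1}{\sum_{k} \lambda_k}.
\end{aligned}
```

The right-hand side of the last line is what `explained_variance_ratio_[0]` reports, and the left-hand side is the ratio computed by hand from the projection.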

## 6. Betas

Calculate the betas by regression:

In [19]:
from sklearn.linear_model import LinearRegression
betas_by_regression = []
for column in df.columns:
    # Regress each stock's returns on the pc1 projection; the slope is that stock's beta
    reg = LinearRegression().fit(pc1_proj.reshape(-1, 1), df[column])
    #reg = LinearRegression().fit(df_indu_index_ret.iloc[:,0].values.reshape(-1,1), df[column])
    betas_by_regression.append(reg.coef_[0])

In [None]:
betas_by_regression = pd.DataFrame(betas_by_regression, columns=["Betas"], index=df.columns)
betas_by_regression.head(50)

Ticker,Betas
CSCO,0.203719
DIS,0.206642
XOM,0.173782
BA,0.195282
UNH,0.195802
MMM,0.163117
HD,0.1865
VZ,0.132984
TRV,0.205211
JNJ,0.106099


Calculate the betas by eigenvector pc1:

In [22]:
betas_by_pc1_eigenvector = eigenVectors[0, :]  # row 0 holds the pc1 eigenvector; its entries are the betas
betas_by_pc1_eigenvector = pd.DataFrame(betas_by_pc1_eigenvector, columns=["Betas"], index=df.columns)
betas_by_pc1_eigenvector.head(50)

Ticker,Betas
CSCO,0.203719
DIS,0.206642
XOM,0.173782
BA,0.195282
UNH,0.195802
MMM,0.163117
HD,0.1865
VZ,0.132984
TRV,0.205211
JNJ,0.106099


THEY ARE THE SAME
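
A short derivation of why the regression slope on the pc1 projection equals the corresponding entry of the pc1 eigenvector. The notation is introduced here and is not from the original notebook: $X_c$ is the centered $n \times p$ returns matrix, $x_j = X_c e_j$ is the column for stock $j$, $S = \tfrac{1}{n-1} X_c^\top X_c$ is the sample covariance matrix, $v_1$ is its first unit eigenvector with eigenvalue $\lambda_1$, and $v_{1j}$ is the $j$-th entry of $v_1$.

```latex
\begin{aligned}
\beta_j &= \frac{\operatorname{Cov}(x_j, z_1)}{\operatorname{Var}(z_1)}, \qquad z_1 = X_c v_1, \\
\operatorname{Cov}(x_j, z_1) &= \tfrac{1}{n-1} (X_c e_j)^\top X_c v_1 = e_j^\top S v_1 = \lambda_1 v_{1j}, \\
\operatorname{Var}(z_1) &= v_1^\top S v_1 = \lambda_1, \\
\beta_j &= \frac{\lambda_1 v_{1j}}{\lambda_1} = v_{1j}.
\end{aligned}
```

The first line is the usual ordinary least squares slope for a regression with a single regressor and an intercept.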

## 7. Using np.linalg.eig

In [23]:
# with np.linalg.eig
df_mean = df.mean()
df_ctr = df - df_mean  # center each column
cov_matrix_array = np.cov(df_ctr, rowvar=False)
eigenValues, eigenVectors = np.linalg.eig(cov_matrix_array)
idx = eigenValues.argsort()[::-1]  # indices that sort the eigenvalues from largest to smallest
eigenValues_ordered = eigenValues[idx]
eigenVectors_ordered = eigenVectors[:, idx]  # vertical eigenvectors (columns) ordered from left to right
principal_component_projections = np.matmul(df_ctr.values, eigenVectors_ordered)  # project centered returns
pc1 = principal_component_projections[:, 0]
np.corrcoef(pc1, df_indu_index_ret.iloc[:, 0].values)

array([[1.        , 0.98062392],
       [0.98062392, 1.        ]])
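
Two details of the manual route are worth a hedged sketch. `np.linalg.eig` does not guarantee any ordering of the eigenvalues, which is why the explicit sort is needed, and an eigenvector is only defined up to sign, so the pc1 projection (and its correlation with the index) can come out with the opposite sign. A covariance matrix is symmetric, so `np.linalg.eigh` is an alternative that returns real eigenvalues in ascending order. The example uses synthetic data, and the sign convention at the end is an assumption of this sketch, not something the notebook or sklearn prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # synthetic stand-in for returns
X_ctr = X - X.mean(axis=0)

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_ctr, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending

# Illustrative sign convention (an assumption): make the pc1 loadings sum to a non-negative number.
v1 = eigvecs[:, 0]
if v1.sum() < 0:
    v1 = -v1

pc1_manual = X_ctr @ v1
pc1_sklearn = PCA().fit_transform(X)[:, 0]

# The two projections agree up to an overall sign, so compare the absolute correlation.
corr = np.corrcoef(pc1_manual, pc1_sklearn)[0, 1]
print(np.isclose(abs(corr), 1.0))
```

The check should print `True`. The sign of `corr` itself depends on the conventions used by each routine.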