# Self-study try-it activity 23.1: Data processing for PCA

PCA is a dimensionality reduction technique that captures as much variance as possible in a given number of dimensions. Equivalently, it finds the projection that minimises the sum of the squared distances between the original and the projected data points.
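The link between the variance and the squared-distance descriptions can be sketched for a single component. The notation here is an assumption introduced for illustration: $x_i$ are the centred data points and $w$ is a unit vector giving the projection direction. By Pythagoras' theorem, each point splits into its projection onto $w$ and a residual:

$$
\sum_{i=1}^{n} \lVert x_i \rVert^2 \;=\; \sum_{i=1}^{n} \left(w^\top x_i\right)^2 \;+\; \sum_{i=1}^{n} \left\lVert x_i - \left(w^\top x_i\right) w \right\rVert^2
$$

The left-hand side does not depend on $w$, so choosing $w$ to maximise the first term on the right (the variance kept, up to a factor of $n$) is the same as choosing it to minimise the second term (the sum of squared distances to the projected points).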

In [None]:
#Import the necessary libraries
import numpy.random as rand
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


Below is a simulated data set (shown in blue) that has been projected onto four different components (shown in orange).

Which of these projections maximises the variance of the data that is kept and is likely to be the principal component?

This activity helps you visually identify the direction that captures the most variance – an essential concept in understanding how PCA selects principal components.

In [None]:
rand.seed(1234)

x1 = 3 * np.outer(rand.normal(0,1,50), np.array([1,-1]))
x2 = rand.multivariate_normal([0,0],np.diag([0.1,0.1]), size = 50)

X = x1 + x2

projection = np.array((1,0)).reshape((2,1))

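# Rotation by 45 degrees (anticlockwise), applied once per subplot.
# A plain array is used because np.matrix is no longer recommended by NumPy.
R = np.array([[1/np.sqrt(2), -1/np.sqrt(2)], [1/np.sqrt(2), 1/np.sqrt(2)]])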

fig, axs = plt.subplots(2,2, figsize = (25,25))
for i in range(4):
    axs[i//2, i % 2 ].set_xlim([-8,8])
    axs[i//2, i % 2 ].set_ylim([-8,8])
    axs[i//2, i % 2 ].scatter(X[:,0], X[:,1], alpha = 0.7)
    axs[i//2, i % 2 ].title.set_text("Projection %s" % (i+1))
    axs[i//2, i % 2 ].title.set_size(20)
    scores = X @ projection
    Xi = np.outer(scores, projection)
    axs[i//2, i % 2 ].scatter(Xi[:,0],Xi[:,1])
    projection = R @ projection

plt.show()


### To-do 1:

Which of these projections maximises the variance of the data that is kept and is likely to be the principal component?

* Projection 1
* Projection 2
* Projection 3
* Projection 4

### Answer:
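The projection that retains the largest spread of data along its axis is the one that maximises the variance of the data, and that is likely to be the principal axis. In this case, it is Projection 4: the simulated points are spread along the diagonal running from top left to bottom right, and Projection 4 is the only direction aligned with that diagonal.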
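The visual answer can also be checked numerically. The sketch below rebuilds a data set with the same construction as the plotting cell and compares the variance of the scores along the four candidate directions. It uses a fresh random generator rather than the notebook's seed, so the exact numbers are illustrative only, and the names `points`, `directions` and `best` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: a fresh seed, not rand.seed(1234)

# Same construction as the plotting cell: points along (1, -1) plus small noise
signal = 3 * np.outer(rng.normal(0, 1, 50), np.array([1, -1]))
noise = rng.multivariate_normal([0, 0], np.diag([0.1, 0.1]), size=50)
points = signal + noise

# The four candidate directions: (1, 0) rotated by 0, 45, 90 and 135 degrees
angles = np.deg2rad([0, 45, 90, 135])
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (4, 2)

# Scores along each direction, and the variance each direction keeps
scores = points @ directions.T
variances = scores.var(axis=0)
for i, v in enumerate(variances, start=1):
    print(f"Projection {i}: variance of scores = {v:.3f}")

best = int(np.argmax(variances)) + 1
print("Largest variance: Projection", best)
```

Projection 2 is perpendicular to the diagonal, so it keeps little more than the noise, while Projections 1 and 3 each keep roughly half of what Projection 4 keeps.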

## Dimensionality reduction with PCA

PCA is commonly used to reduce the dimensionality of high-dimensional data sets by selecting a subset of components that retain most of the original variance. This technique is particularly valuable when preparing data for machine learning models, as it can improve training efficiency and model generalisation.

In this notebook, you will work with a data set comprising S&P 500 stock prices over the past five years. Given the high degree of correlation among stocks in the market, retaining all individual stock features may introduce redundancy and unnecessarily increase model complexity and training time.

Instead, PCA allows you to project the data onto a smaller set of orthogonal components that capture the majority of the variance. This approach simplifies the feature space while preserving the underlying structure, making it a more effective input for downstream modelling tasks.

In [None]:

df = pd.read_csv("data/all_stocks_5yr.csv")

names = df["Name"].unique()
# Number of observations for the first stock; every retained stock must match this length
l = len(df[ df["Name"] == names[0]]["close"])
data = pd.DataFrame(index = range(l), columns= names)

for name in names:
    x = df[ df["Name"] == name]["close"]
    if x.isnull().any() or len(x) != l:
        data = data.drop(columns= name)
    else:
        data[name] = np.array(x)

data.head()





The above code cell loads five years of S&P 500 stock data and constructs a clean data frame of closing prices, one column per stock, excluding stocks with missing values or inconsistent time series lengths. The result is a well-aligned data set suitable for analysis or modelling.
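A similar wide table can often be built with a pivot instead of a loop. The sketch below is a minimal illustration on a toy table, not a drop-in replacement: the `date` column name is an assumption (only `Name` and `close` appear in the cell above), and dropping columns with any missing value is close to, but not identical to, the length check used in the loop.

```python
import numpy as np
import pandas as pd

# Toy long-format table; "date" is an assumed column name,
# "Name" and "close" match the columns used in the notebook.
toy = pd.DataFrame({
    "date":  ["d1", "d2", "d3", "d1", "d2", "d3", "d1", "d2"],
    "Name":  ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "CCC", "CCC"],
    "close": [10.0, 11.0, 12.0, 50.0, np.nan, 52.0, 5.0, 6.0],
})

# One row per date, one column per stock
wide = toy.pivot(index="date", columns="Name", values="close")

# Drop every stock with a missing price: BBB has a NaN, CCC has no row for d3
wide_clean = wide.dropna(axis=1)
print(wide_clean)
```

Aligning on the date has the advantage that two stocks with the same number of rows but different trading days are not silently treated as aligned.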

## Interpreting the scree plot

The scree plot below displays the proportion of variance explained by each of the first ten principal components derived from the PCA of the stock data set. This visualisation is a key tool for determining the optimal number of components to retain.

Your task is to analyse the scree plot and decide how many principal components you would keep to effectively represent the data set while minimising dimensionality. Consider both the individual and cumulative variance explained and identify the 'elbow point' – the point beyond which additional components contribute marginally to the total variance.

### To-do 2:

Based on the scree plot of the first ten principal components from the PCA and any additional analysis you consider appropriate, how many components would you retain to effectively represent this data set, and what is your rationale for that choice?

In [None]:
pca = PCA(10).fit(data)
var = pca.explained_variance_ratio_
# Cumulative fraction of variance explained by the first i components
var_explained = np.cumsum(var)

for i in range(10):
    print("\nComponent", i+1 , "\nFraction of total variance explained by this component:", var[i],
            "\n Total fraction of variance explained by the first %s component(s):" % (i +1), var_explained[i] )

plt.figure(figsize = (10,10))
plt.plot(range(1, len(var_explained)+1), var, label = "fraction of variance explained by component i", marker = "o")
plt.plot(range(1, len(var_explained)+1), var_explained, label = "fraction of variance explained by the first i components", marker = "o")
plt.xlabel("component i")
plt.ylabel("fraction of variance explained")
plt.title("Scree plot for S&P 500 stocks")
plt.legend()
plt.show()




### Answer:

Based on the scree plot and the variance output from PCA, retaining the first three principal components is a reasonable choice for representing this data set. This is because:

- Component 1 alone explains approximately 78.4 per cent of the total variance, indicating a dominant underlying structure in the data.

- Component 2 adds another 9.2 per cent, bringing the cumulative variance to 87.5 per cent.

- Component 3 contributes 5.4 per cent, raising the total to 92.96 per cent.

Together, these three components capture nearly 93 per cent of the total variance, which is a strong threshold for dimensionality reduction. Beyond this point, each additional component contributes marginally (less than 3 per cent combined for components 4–10), suggesting diminishing returns.

Retaining three components strikes a balance between model simplicity and information preservation, reducing computational overhead while maintaining the integrity of the data set’s structure. This choice is especially appropriate for highly correlated financial data, where a few latent factors often drive most of the variation.
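The cumulative-variance reasoning can also be written as a threshold rule. The sketch below uses a small synthetic data set, and the 90 per cent target is an assumed value for illustration rather than a recommendation. It counts the components needed in two ways: manually from the cumulative sum, and by passing a fraction as `n_components`, which scikit-learn supports with the full SVD solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 12 features driven by 3 latent factors plus a little noise
factors = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 12))
toy = factors @ loadings + 0.1 * rng.normal(size=(200, 12))

threshold = 0.90  # assumed target for cumulative explained variance

# Option 1: fit all components and find the first one that reaches the threshold
cumulative = np.cumsum(PCA(svd_solver="full").fit(toy).explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, threshold)) + 1

# Option 2: let scikit-learn pick the count by passing a fraction as n_components
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(toy)

print("components needed (manual):", k_manual)
print("components chosen by PCA:  ", pca_auto.n_components_)
```

A fixed threshold is only one heuristic; the elbow in the scree plot and the needs of the downstream model are equally valid considerations.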

## Pre-processing the data

When analysing which stocks contribute most to the principal components, it's important to consider how the data was pre-processed. PCA is sensitive to scale, so if you apply it directly to the raw closing prices, stocks with higher prices tend to dominate the principal components simply because their prices move by larger absolute amounts. This can lead to misleading interpretations, as variance may reflect price magnitude rather than meaningful movement.

In most financial analyses, you are more interested in **relative variability**, which means how much a stock's price fluctuates in proportion to its own value rather than its absolute magnitude. A high variance driven by a large price scale does not necessarily indicate meaningful movement or volatility.

To ensure a fair and interpretable PCA, you standardise the data so that each stock has a **mean of zero** and a **variance of one**. This transformation allows PCA to focus on patterns of variation that are comparable across stocks, regardless of their original price levels, leading to more meaningful insights.
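The effect of scale can be seen on a toy example. The sketch below builds two hypothetical stocks that fluctuate by a similar percentage but trade at very different price levels, then compares the loadings of the first principal component before and after standardisation. All names and numbers are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
shared = rng.normal(size=500)  # common movement (hypothetical)

# Two hypothetical stocks: similar relative fluctuations, very different price levels
cheap = 10 * (1 + 0.05 * (shared + 0.5 * rng.normal(size=500)))
expensive = 1000 * (1 + 0.05 * (shared + 0.5 * rng.normal(size=500)))
prices = np.column_stack([cheap, expensive])

# First principal component of the raw prices
raw_pc1 = PCA(1, svd_solver="full").fit(prices).components_[0]

# First principal component after each column has mean 0 and variance 1
prices_scaled = StandardScaler().fit_transform(prices)
scaled_pc1 = PCA(1, svd_solver="full").fit(prices_scaled).components_[0]

print("PC1 loadings, raw prices:   ", np.round(raw_pc1, 3))
print("PC1 loadings, standardised: ", np.round(scaled_pc1, 3))
```

On the raw prices the first component is almost entirely the expensive stock, whereas after standardisation both stocks receive loadings of equal magnitude.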


## Re-evaluating component selection

### To-do 3:

Examine the scree plot provided below. Based on its structure and the variance explained by each principal component, can you reasonably assume that the same number of components should be retained as in your previous PCA? Consider whether the distribution of variance has changed and justify your conclusion with reference to the plot.


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(data)


data_scaled = scaler.transform(data)

# Guard: PCA cannot return more components than there are samples or features
n_comp = min(10, *data_scaled.shape)
pca_scaled = PCA(n_comp).fit(data_scaled)
var_scaled = pca_scaled.explained_variance_ratio_
var_scaled_explained = np.cumsum(var_scaled)

# Print the explained variance for the standardised data
for i in range(n_comp):
    print(
        f"\nComponent {i+1}"
        f"\nFraction of total variance explained by this component: {var_scaled[i]:.4f}"
        f"\nTotal fraction of variance explained by the first {i+1} component(s): {var_scaled_explained[i]:.4f}"
    )

# Plot the per-component and cumulative explained variance
plt.figure(figsize=(10, 10))
plt.plot(range(1, n_comp+1), var_scaled, marker="o", label="Explained variance ratio (component i)")
plt.plot(range(1, n_comp+1), var_scaled_explained, marker="o", label="Cumulative explained variance")
plt.xlabel("component i")
plt.ylabel("variance ratio")
plt.title("Scree plot (standardised data, S&P 500 stocks)")
plt.legend()
plt.show()

### Answer:

No, it is not necessarily safe to assume that the same number of components should be retained as in the previous PCA analysis.

This is because, although the variance distribution across components appears similar to the earlier output, this result is based on standardised data, where each feature has been scaled to have a mean of zero and a variance of one. This pre-processing step ensures that the PCA reflects relative variability rather than being dominated by features with larger absolute values.

In this standardised context:

- The first component still explains a substantial 78.4 per cent of the variance.

- The first three components together explain approximately 93 per cent, which is a strong threshold for dimensionality reduction.

- Additional components beyond the third contribute marginally to the total variance (less than 2.5 per cent combined for components 4–10).

Therefore, while the number of components to retain may coincide with the previous analysis (i.e. three components), this decision is now better justified because the PCA is based on scaled data, making the interpretation more meaningful and fair across all features.

## Comparing PCA results: raw vs pre-processed data

Applying PCA to raw versus pre-processed data can yield significantly different principal components. This distinction is illustrated below through two key observations:

1. **Top contributing stocks**:  
   When you examine the five stocks that contribute most to the first principal component, you find that the results differ between the raw and standardised data sets. This indicates that the PCA model identifies different sources of variance depending on whether the data has been scaled.

2. **Score distribution across components**:  
   The plots of scores along the first two principal components may appear visually similar, but the underlying scales differ substantially. In the standardised data, features are adjusted to have equal variance, which can amplify or compress certain patterns. This transformation leads to a more balanced and interpretable representation of the data structure.

These differences highlight the importance of pre-processing when applying PCA, especially in contexts where feature magnitudes vary widely.


### To-do 4:

Imagine you are an investor planning to allocate capital across S&P 500 stocks based on their weights in the first principal component derived from historical data. However, recognising the potential risk due to high variance in returns, you decide to mitigate this by splitting your investment: half based on the first principal component and half based on the second, which is uncorrelated with the first and exhibits lower variance.

Given this strategy, would it be more appropriate to derive the principal components from the **raw data** or from **pre-processed (standardised) data**? Justify your answer with reference to interpretability, risk, and the nature of PCA.


In [None]:
# Rank stocks by the absolute size of their loading, because the sign of a principal component is arbitrary
stocks = data.columns[np.argsort(np.abs(pca.components_[0]))[::-1][:5]]

print("Largest 5 contributors to the first principal component of raw data:", stocks)

stocks_scaled = data.columns[np.argsort(np.abs(pca_scaled.components_[0]))[::-1][:5]]

print("Largest 5 contributors to the first principal component of pre-processed data:", stocks_scaled)

# Compute the scores once rather than calling transform() for every plot call
scores_raw = pca.transform(data)
scores_scaled = pca_scaled.transform(data_scaled)
time_colour = np.arange(l) / l  # colour encodes the time order of the observations

fig, axs = plt.subplots(1,2, figsize = (20,10))
axs[0].plot(scores_raw[:,0], scores_raw[:,1])
axs[0].scatter(scores_raw[:,0], scores_raw[:,1], c = time_colour)
axs[1].plot(scores_scaled[:,0], scores_scaled[:,1])
axs[1].scatter(scores_scaled[:,0], scores_scaled[:,1], c = time_colour)
axs[0].set_title("Raw Data")
axs[1].set_title("Pre-processed Data")
axs[0].set_xlabel("Scores along the first principal component")
axs[0].set_ylabel("Scores along the second principal component")
axs[1].set_xlabel("Scores along the first principal component")
axs[1].set_ylabel("Scores along the second principal component")
plt.show()
# Mean closing price per stock, to see which stocks trade at the highest price levels
print(data.mean())

### Answer:

When you apply PCA to stock data without pre-processing, the results are influenced mainly by stocks with high price values and not necessarily those that behave interestingly or vary substantially. For example, expensive stocks such as AMZN or GOOGL might dominate the analysis because their prices are large, not because they're more volatile or informative.

However, in finance, you usually care more about how much a stock moves relative to its value and not just its raw price. That's why you standardise the data before applying PCA. This means adjusting all stocks so they have the same average and spread, allowing PCA to focus on actual movement patterns, not just price scale.

So if you're using PCA to guide investment decisions, such as choosing stocks based on how they contribute to principal components, then pre-processing is essential. It gives you a fair comparison across all stocks and helps you make decisions based on meaningful variation, not just size.
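The strategy in To-do 4 relies on the first two components being uncorrelated, with the second carrying less variance than the first. That property can be checked directly on the scores. The sketch below does so on a small synthetic data set; the data and the names `toy`, `toy_scores` and `corr` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical data: 300 observations of 6 features sharing one common factor
market = rng.normal(size=(300, 1))
toy = market @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(300, 6))

# Standardise, then project onto the first two principal components
toy_scaled = StandardScaler().fit_transform(toy)
toy_scores = PCA(2).fit_transform(toy_scaled)

# Scores along different components are uncorrelated, and PC1 has the larger variance
corr = np.corrcoef(toy_scores[:, 0], toy_scores[:, 1])[0, 1]
var_pc1, var_pc2 = toy_scores.var(axis=0)
print(f"correlation between PC1 and PC2 scores: {corr:.2e}")
print(f"variance along PC1: {var_pc1:.3f}, along PC2: {var_pc2:.3f}")
```

Note that this holds on the data used to fit the PCA; scores computed on new, unseen data are not guaranteed to remain uncorrelated.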