Preprocessing Instructions

The following article gives a good overview of the preprocessing techniques to apply before PCA:
https://towardsdatascience.com/pca-a-practical-journey-preprocessing-encoding-and-inspiring-applications-64371cb134a
1. Load the Dataset

	•	Use the pandas library to load the CSV file and set the date column as the index.
	•	Parse the dates to ensure time-series operations can be performed seamlessly.
	•	Mathematical Reasoning:
	•	Represent the dataset as a matrix $X$ with dimensions $T \times N$, where $T$ is the number of time periods (rows) and $N$ is the number of stocks (columns).
	•	Proper indexing ensures that each row corresponds to a time step, allowing for time-series analysis.
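As a minimal sketch of this step, the snippet below loads a tiny in-memory CSV instead of the real file. The name `csv_text` and the three-row sample are assumptions made for illustration; only the `read_csv` arguments mirror the loading call used in this notebook.

```python
import io

import pandas as pd

# Hypothetical three-day, two-stock sample standing in for the real CSV file.
csv_text = """date,10026,10032
2000-01-03,0.012195,-0.017045
2000-01-04,-0.084337,-0.026734
2000-01-05,0.059211,-0.000742
"""

# index_col=0 uses the first column (the dates) as the index;
# parse_dates=True converts that index to datetimes.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

T, N = sample.shape  # T time periods (rows), N stocks (columns)
print("T =", T, "N =", N)
print("Index type:", type(sample.index).__name__)
```

With a datetime index, each row of the $T \times N$ matrix is tied to a time step, so date-based slicing and resampling work directly.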

2. Handle Missing Data

	•	Replace any missing values in the dataset with the column mean. Unhandled missing values can distort PCA or cause it to fail outright.
	•	For a column $j$ with missing values, compute its mean over the observed entries only, where $T_j$ is the number of non-missing values in column $j$:
$$
\bar{X}_j = \frac{1}{T_j} \sum_{i \,:\, X_{i,j} \text{ observed}} X_{i,j}
$$
	•	Then replace each missing entry in column $j$ with that mean:
$$
X_{i,j} = \bar{X}_j \quad \text{if } X_{i,j} \text{ is missing.}
$$
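The imputation rule can be checked on a toy example. This is only a sketch: the DataFrame `toy` and its values are made up for illustration, while `mean` and `fillna` are the same pandas calls used in the code cell of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value in each column.
toy = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [2.0, 4.0, np.nan]})

# DataFrame.mean skips NaN by default, so each mean uses observed values only.
col_means = toy.mean()

# fillna with a Series fills each column with its own mean.
filled = toy.fillna(col_means)
print(filled)
```

Because every missing entry is replaced by the column mean, the column means themselves are unchanged by the imputation; the column variances, however, shrink.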

3. Filter Zero-Variance Columns

	•	Identify and remove any columns (stocks) with zero variance. If $\text{Var}(X_j) = 0$, the column is constant across time: it contributes nothing to the total variance that PCA decomposes, and it cannot be standardized in step 4 because its standard deviation is zero.
	•	Calculate the variance of a column $j$:
$$
\text{Var}(X_j) = \frac{1}{T} \sum_{i=1}^{T} \big(X_{i,j} - \bar{X}_j\big)^2
$$
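A small sketch of the filter on made-up data follows. The DataFrame `toy` is hypothetical, and `ddof=0` is passed so that the result matches the $\frac{1}{T}$ formula above (the pandas default is `ddof=1`, which gives the same answer for the zero-variance test).

```python
import pandas as pd

# Toy data (hypothetical): columns B and C are constant over time.
toy = pd.DataFrame({
    "A": [0.01, -0.02, 0.03],
    "B": [0.0, 0.0, 0.0],
    "C": [0.25, 0.25, 0.25],
})

# Population variance (ddof=0) matches the 1/T definition.
variances = toy.var(ddof=0)

# Select the constant columns and drop them.
constant_cols = variances.index[variances == 0]
kept = toy.drop(columns=constant_cols)
print("Dropped:", list(constant_cols))
print("Kept:", list(kept.columns))
```

The exact comparison `== 0` mirrors the check used in this notebook. With floating-point data, a constant column can produce a tiny non-zero variance from rounding, so a small tolerance may be a safer choice in practice.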

4. Normalize the Data

	•	Standardize the dataset to have zero mean and unit variance. This ensures all columns are on the same scale, which is critical for PCA, as it is sensitive to differences in scale.
	•	Mathematical Reasoning:
	•	Normalize each column $j$ as follows:
$$
Z_{i,j} = \frac{X_{i,j} - \bar{X}_j}{\sigma_j}
$$
where:
	•	$\bar{X}_j$ is the mean of column $j$
	•	$\sigma_j$ is the standard deviation of column $j$
	•	This ensures that:
$$
\text{Mean}(Z_j) = 0 \quad \text{and} \quad \text{Var}(Z_j) = 1
$$
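The standardization formula can be verified by hand on a toy example. This sketch uses plain pandas rather than `StandardScaler`, and the DataFrame `toy` is made up for illustration. The population standard deviation (`ddof=0`) is used because that is the convention `StandardScaler` follows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two columns on very different scales.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 60.0]})

mu = toy.mean()          # column means, \bar{X}_j
sigma = toy.std(ddof=0)  # population standard deviations, \sigma_j

# Z_{i,j} = (X_{i,j} - \bar{X}_j) / \sigma_j, applied column by column.
z = (toy - mu) / sigma

print(z.mean().round(12))       # approximately 0 for every column
print(z.var(ddof=0).round(12))  # approximately 1 for every column
```

Note that calling `z.var()` with the pandas default `ddof=1` returns $T/(T-1)$ rather than exactly 1; the difference is negligible for a long time series.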

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler

# Load the dataset
file_path = "daily_ret_clean.csv"
data = pd.read_csv(file_path, index_col=0, parse_dates=True)

# Display basic information
print("Dataset Shape:", data.shape)
print("Preview:")
print(data.head())

# Step 2: Handle missing data
# Replace missing values with the column mean (pandas skips NaN when averaging)
data = data.fillna(data.mean())

# Step 3: Filter columns with zero variance
zero_variance_columns = data.columns[data.var() == 0]
print(f"Removing {len(zero_variance_columns)} columns with zero variance.")
data = data.drop(columns=zero_variance_columns)

# Step 4: Normalize the data (zero mean, unit variance per column)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data)

# Convert normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, index=data.index, columns=data.columns)
print("Preprocessing complete.")
print("Normalized Data (Preview):")
print(normalized_df.head())

# Save the preprocessed data (optional)
normalized_df.to_csv("preprocessed_returns.csv")

Dataset Shape: (6037, 1201)
Preview:
               10026     10032     10044     10065     10104     10107  \
date                                                                     
2000-01-03  0.012195 -0.017045  0.029762 -0.003724  0.054099 -0.001606   
2000-01-04 -0.084337 -0.026734  0.017341 -0.013084 -0.088360 -0.033780   
2000-01-05  0.059211 -0.000742  0.000000 -0.007576 -0.052815  0.010544   
2000-01-06  0.003106  0.001486 -0.090909  0.000000 -0.058824 -0.033498   
2000-01-07  0.003096  0.007418  0.125000  0.003817  0.076823  0.013068   

               10138     10145     10200     10207  ...     88664     88912  \
date                                                ...                       
2000-01-03 -0.049069 -0.017335 -0.020000  0.046358  ...  0.001164 -0.045983   
2000-01-04 -0.030249 -0.017641  0.000000  0.000000  ... -0.081395 -0.004525   
2000-01-05 -0.001835 -0.013468 -0.020408  0.012658  ...  0.020253  0.022727   
2000-01-06  0.029412  0.019340  0.020833  0.02500