# Final Report: Stock Relationship Analysis via MST and Clustering

Author: Tianyu (Cheryl) Wu

Date: December 11, 2025

# 1. Project Overview and Background

This project explores how stocks are related to each other based on historical price movements, and whether those relationships form natural groups. I measure similarity between stocks using log-return correlations, then construct a Minimum Spanning Tree (MST) to keep only the strongest connections. By cutting weak links in the MST, I form clusters of stocks that move closely together. Finally, I visualize both the network structure and the time-series behavior of each cluster.

**Finance Background**:
Stocks often move together because they are influenced by similar economic forces, such as sector exposure or shared risk factors. Correlation measures how closely two stocks move together over time. An MST simplifies a dense correlation network by keeping only the most important relationships, making the overall structure easier to interpret. Cutting weak connections in the MST helps reveal groups of stocks with stronger internal relationships.

# 2. Program Workflow and Features

## 2.1 MVP Definition

When the program is run, the workflow proceeds as follows:

**1. Data Loading and Preprocessing**

The user inputs a list of stock tickers. Historical adjusted close prices are downloaded and converted into log returns.

**2. Relationship Measurement**

Correlation and covariance matrices are computed from log returns to quantify pairwise stock relationships.

**3. MST Construction**

A Minimum Spanning Tree is built using distance = 1 − correlation, retaining only the closest connections between stocks.

**4. MST-based Clustering**

Clusters are formed by cutting MST edges whose correlation falls below a chosen threshold (fixed or dynamically selected).

**5. Visualization**

- MST graph visualization

- Time-series plots for stocks within each cluster

These steps constitute the core MVP and are all completed.

## 2.2 Additional Features

In addition to the MVP, the following extended features were implemented:

- Cluster summary statistics (average, minimum, and maximum internal correlation)

- Correlation heatmaps and clustered heatmaps

- Rolling-window MST analysis to study how structure and clustering change over time

These features go beyond the basic requirements and provide additional interpretability and stability analysis.

# 3. Implementation Details

This section provides implementation notes for each feature, covering the MVP features first and the additional features second.

## 3.1 MVP Features

(1) Measuring Stock Relationships

- Description: Compute log returns and correlation / covariance matrices.

- Implementation: Log returns are computed via first differences of log prices; correlations are computed using Pandas.

- Code Location: compute_log_returns(), compute_correlation_matrix()

- Notes: Straightforward implementation; Pandas handles most of the heavy lifting.
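
The step above can be sketched in a few lines. The function names match the ones listed under Code Location, but the bodies are an assumption about how they might look, not the project's actual code, and the tickers and prices are made up for illustration.

```python
import numpy as np
import pandas as pd


def compute_log_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Log returns as first differences of log prices (first row dropped)."""
    return np.log(prices).diff().dropna()


def compute_correlation_matrix(log_returns: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation between ticker columns."""
    return log_returns.corr()


# Made-up prices for three hypothetical tickers (illustration only).
prices = pd.DataFrame(
    {
        "AAA": [100.0, 101.0, 103.0, 102.0, 105.0],
        "BBB": [50.0, 50.6, 51.4, 51.1, 52.3],
        "CCC": [20.0, 19.8, 19.5, 19.9, 19.4],
    }
)
log_returns = compute_log_returns(prices)
corr = compute_correlation_matrix(log_returns)
cov = log_returns.cov()  # covariance matrix from the same returns
```

Here `AAA` and `BBB` rise and fall on the same days, so their correlation is strongly positive, while `CCC` moves the opposite way and correlates negatively with both.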

(2) Minimum Spanning Tree Construction

- Description: Construct an MST using correlation-based distances.

- Implementation: Kruskal’s algorithm with a manually implemented Union-Find data structure.

- Code Location: build_mst_from_corr(), UnionFind class

- Notes: This is the most algorithm-heavy part of the project but produces clean and stable MST edges.
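
As a rough illustration of this step, the sketch below runs Kruskal's algorithm with a small Union-Find on distance = 1 − correlation. The names `UnionFind` and `build_mst_from_corr` come from the Code Location line, but the input format (a nested dict `corr[a][b]`), the edge tuple layout, and the toy correlations are assumptions made for this example.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.size = {x: 1 for x in items}

    def find(self, x):
        # Walk to the root, pointing each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets containing a and b; return False if already joined."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def build_mst_from_corr(tickers, corr):
    """Kruskal's algorithm; returns (ticker_a, ticker_b, correlation) edges."""
    # Sort candidate edges from shortest distance (highest correlation) up.
    candidates = sorted(
        combinations(tickers, 2),
        key=lambda pair: 1.0 - corr[pair[0]][pair[1]],
    )
    uf = UnionFind(tickers)
    mst = []
    for a, b in candidates:
        if uf.union(a, b):  # keep the edge only if it joins two components
            mst.append((a, b, corr[a][b]))
            if len(mst) == len(tickers) - 1:
                break
    return mst


# Toy correlations for four hypothetical tickers.
tickers = ["AAA", "BBB", "CCC", "DDD"]
corr = {
    "AAA": {"AAA": 1.0, "BBB": 0.9, "CCC": 0.2, "DDD": 0.1},
    "BBB": {"AAA": 0.9, "BBB": 1.0, "CCC": 0.3, "DDD": 0.2},
    "CCC": {"AAA": 0.2, "BBB": 0.3, "CCC": 1.0, "DDD": 0.8},
    "DDD": {"AAA": 0.1, "BBB": 0.2, "CCC": 0.8, "DDD": 1.0},
}
mst_edges = build_mst_from_corr(tickers, corr)
```

With these numbers the tree keeps the two strong pairs (`AAA`-`BBB` and `CCC`-`DDD`) and joins them through the best remaining link, `BBB`-`CCC`.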

(3) MST-based Clustering

- Description: Form clusters by cutting MST edges below a correlation threshold.

- Implementation: Supports both fixed thresholds and dynamic thresholds based on MST correlation quantiles.

- Code Location: clusters_from_mst()

- Notes: Clustering logic works as intended; threshold tuning is still being evaluated.
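
The fixed-threshold case can be sketched as follows: drop every MST edge whose correlation is below the threshold, then read off the connected components. The signature of `clusters_from_mst` and the edge tuples `(ticker_a, ticker_b, correlation)` are assumptions for this example and may differ from the project's code.

```python
def clusters_from_mst(tickers, mst_edges, threshold):
    """Cut MST edges with correlation below `threshold` and return the
    connected components that remain, each as a sorted list of tickers."""
    neighbors = {t: set() for t in tickers}
    for a, b, rho in mst_edges:
        if rho >= threshold:  # the edge survives the cut
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters, seen = [], set()
    for start in tickers:
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Hypothetical MST over four tickers.
tickers = ["AAA", "BBB", "CCC", "DDD"]
mst_edges = [("AAA", "BBB", 0.9), ("CCC", "DDD", 0.8), ("BBB", "CCC", 0.3)]
clusters = clusters_from_mst(tickers, mst_edges, threshold=0.5)
```

Because the input is a tree, cutting k edges always leaves exactly k + 1 clusters; here one edge is cut, giving two clusters.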

(4) Visualization

- Description: Visualize MST structure and cluster time series.

- Implementation: Custom Matplotlib-based visualizations without external graph libraries.

- Code Location: plot_mst_graph(), plot_cluster_time_series()

- Notes: Circle layout keeps MST readable for small to medium numbers of stocks.
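
The circle layout mentioned in the notes can be illustrated without any graph library. The sketch below covers only the geometry; the helper names `circle_layout` and `edge_segments` are hypothetical and are not taken from the project.

```python
import math


def circle_layout(tickers, radius=1.0):
    """Place tickers evenly on a circle, counterclockwise from angle 0."""
    n = len(tickers)
    return {
        t: (
            radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n),
        )
        for i, t in enumerate(tickers)
    }


def edge_segments(positions, mst_edges):
    """One ((x0, y0), (x1, y1)) line segment per MST edge."""
    return [(positions[a], positions[b]) for a, b, _ in mst_edges]


# Hypothetical MST over four tickers.
tickers = ["AAA", "BBB", "CCC", "DDD"]
mst_edges = [("AAA", "BBB", 0.9), ("CCC", "DDD", 0.8), ("BBB", "CCC", 0.3)]
positions = circle_layout(tickers)
segments = edge_segments(positions, mst_edges)
```

Each segment could then be drawn with Matplotlib's `Axes.plot([x0, x1], [y0, y1])` and each ticker labeled with `Axes.text(x, y, ticker)`. Since every node sits on the circle, labels do not overlap until the number of tickers grows large.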

## 3.2 Additional Features

(1) MST-level Structural Metrics

- Description: Compute summary metrics that describe the overall structure of the MST.

- Implementation: Metrics such as total tree length, average edge correlation, and node degrees are computed directly from MST edges.

- Code Location: compute_mst_metrics()

- Notes: Provides a compact numerical summary of how tightly connected the network is.
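
A minimal sketch of these metrics, assuming MST edges are stored as `(ticker_a, ticker_b, correlation)` tuples and tree length is measured with distance = 1 − correlation. The dictionary keys in the return value are illustrative names, not necessarily the ones the project uses.

```python
from collections import Counter


def compute_mst_metrics(mst_edges):
    """Total tree length, average edge correlation, and node degrees."""
    degrees = Counter()
    for a, b, _ in mst_edges:
        degrees[a] += 1
        degrees[b] += 1
    correlations = [rho for _, _, rho in mst_edges]
    n_edges = len(correlations)
    return {
        # Sum of edge distances, with distance = 1 - correlation.
        "total_length": sum(1.0 - rho for rho in correlations),
        "avg_edge_correlation": sum(correlations) / n_edges if n_edges else float("nan"),
        "degrees": dict(degrees),
    }


# Hypothetical MST over four tickers.
mst_edges = [("AAA", "BBB", 0.9), ("CCC", "DDD", 0.8), ("BBB", "CCC", 0.3)]
metrics = compute_mst_metrics(mst_edges)
```

A shorter total length or a higher average edge correlation indicates a more tightly connected set of stocks, and nodes with high degree act as hubs in the tree.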

(2) Bootstrap-based MST Stability Analysis

- Description: Evaluate how stable MST edges are under resampling of the time dimension.

- Implementation: The price data is bootstrapped along the time axis. For each bootstrap sample, correlations and the MST are recomputed, and edge frequencies are recorded.

- Code Location: bootstrap_mst_stability()

- Notes: Helps distinguish robust relationships from noise-driven edges.
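
The resampling loop might look roughly like the sketch below. It is a simplification of the description above: it resamples rows of a return matrix (a NumPy array) rather than raw prices, uses a compact index-based Kruskal helper named `mst_index_edges` (a hypothetical name), and runs on synthetic data. The `n_boot` and `seed` parameters are also assumptions.

```python
import numpy as np


def mst_index_edges(corr):
    """Kruskal's algorithm on distance = 1 - correlation.
    Returns the MST as a set of column-index pairs (i, j) with i < j."""
    n = corr.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: 1.0 - corr[p[0], p[1]],
    )
    edges = set()
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate components
            parent[ri] = rj
            edges.add((i, j))
    return edges


def bootstrap_mst_stability(returns, n_boot=200, seed=0):
    """Fraction of bootstrap samples in which each edge appears in the MST."""
    rng = np.random.default_rng(seed)
    n_obs = returns.shape[0]
    counts = {}
    for _ in range(n_boot):
        rows = rng.integers(0, n_obs, size=n_obs)  # dates drawn with replacement
        corr = np.corrcoef(returns[rows], rowvar=False)
        for edge in mst_index_edges(corr):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}


# Synthetic returns: columns 0 and 1 share a factor, column 2 is pure noise.
rng = np.random.default_rng(42)
factor = rng.normal(size=500)
returns = np.column_stack([
    factor + 0.1 * rng.normal(size=500),
    factor + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])
stability = bootstrap_mst_stability(returns, n_boot=100)
```

In this synthetic setup the edge between columns 0 and 1 appears in every resample, while the link to the noise column is split between the other two candidate edges. That split is the pattern a noise-driven edge would be expected to show.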

(3) Dynamic Threshold MST-based Clustering

- Description: Form clusters by cutting MST edges using a dynamically selected correlation threshold.

- Implementation: The threshold is chosen as a quantile of MST edge correlations, and connected components after cutting define clusters.

- Code Location: clusters_from_mst()

- Notes: Makes clustering adaptive to different sets of user-selected stocks.
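
To illustrate the quantile-based threshold, the sketch below picks the cut level from the MST edge correlations themselves. The helper name `dynamic_threshold` and the default quantile level `q=0.25` are assumptions for this example; the report does not state which quantile the project uses.

```python
import numpy as np


def dynamic_threshold(mst_edges, q=0.25):
    """Threshold at the q-th quantile of the MST edge correlations."""
    correlations = [rho for _, _, rho in mst_edges]
    return float(np.quantile(correlations, q))


# Hypothetical MST over five tickers.
mst_edges = [
    ("AAA", "BBB", 0.9),
    ("CCC", "DDD", 0.8),
    ("BBB", "CCC", 0.3),
    ("DDD", "EEE", 0.6),
]
threshold = dynamic_threshold(mst_edges, q=0.25)
kept = [edge for edge in mst_edges if edge[2] >= threshold]
n_clusters = len(mst_edges) - len(kept) + 1  # each cut edge adds one cluster
```

Because the threshold is defined relative to the tree's own edges, a basket of weakly related stocks and a basket of tightly related stocks both get cut at a comparable relative position, which a single fixed threshold would not guarantee.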


(4) Cluster Summary Statistics

- Description: Compute summary statistics for each cluster, including average, minimum, and maximum internal correlation.

- Implementation: Statistics are computed using correlations of MST edges within clusters.

- Code Location: compute_cluster_statistics()

- Notes: Useful for comparing how “tight” or “loose” different clusters are.
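
A minimal sketch of the per-cluster summary, following the description above in using only MST edges whose two endpoints fall in the same cluster. The function name comes from the Code Location line; the signature, the returned dictionary keys, and the handling of single-stock clusters are assumptions.

```python
def compute_cluster_statistics(clusters, mst_edges):
    """Average, minimum, and maximum correlation of the MST edges inside
    each cluster. Single-stock clusters have no internal edges, so their
    statistics are reported as None."""
    stats = []
    for members in clusters:
        member_set = set(members)
        internal = [
            rho for a, b, rho in mst_edges
            if a in member_set and b in member_set
        ]
        if internal:
            summary = {
                "avg": sum(internal) / len(internal),
                "min": min(internal),
                "max": max(internal),
            }
        else:
            summary = {"avg": None, "min": None, "max": None}
        stats.append({"members": sorted(members), "n_edges": len(internal), **summary})
    return stats


# Hypothetical clusters and MST edges.
clusters = [["AAA", "BBB", "CCC"], ["DDD"]]
mst_edges = [("AAA", "BBB", 0.9), ("BBB", "CCC", 0.7), ("CCC", "DDD", 0.2)]
cluster_stats = compute_cluster_statistics(clusters, mst_edges)
```

Note that this summarizes only the edges the MST retained, so it may overstate tightness compared with averaging every pairwise correlation inside the cluster.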

# 4. Changes Since Presentation

- Added a user interaction section that lets users choose their own stock tickers and generates the plots and results based on those tickers.

- Built an interactive app with Streamlit.

- Polished the code for readability and cited all AI-generated code.

- Adjusted dynamic threshold selection for MST-based clustering.

# 5. Missing Features / Areas for Improvement

- Implemented rolling-window MST analysis to study temporal stability.

- Added cluster summary statistics and clustered heatmaps for improved interpretability.

- Sector-level comparison is not yet implemented and may be added if time permits.