# DATA 620 - Final Project Proposal

Derek G. Nokes

## Introduction

Graph theory is used to model pairwise relations between objects and is widely employed in the analysis of complex systems. Many physical, biological, and social systems – including financial markets – are aptly described by networks (i.e., mathematical structures composed of nodes connected by edges). Common applications of graph theory include methods to extract statistically reliable information from correlation-based systems. More specifically, graph-based clustering techniques are used to reveal communities (clusters) of similar elements in a network. Hierarchical clustering procedures, in which communities are overlapping and organized in a nested structure, can be used to identify the fundamental frame of interactions in a system.

If we consider a correlation-based system of $n$ elements, where all elements are connected (i.e., they form a ‘complete’ graph), the pairwise correlation coefficient between each pair of elements can be interpreted as the strength of the link (i.e., edge weight) connecting that pair. Very little information can be gleaned from the topology of such a complete graph. Instead, we focus on extracting a subgraph, commonly referred to as the Minimal Spanning Tree (MST). Constructed based on the so-called 'nearest neighbor single linkage nodes algorithm', an MST is a subset of the edges of a connected, edge-weighted undirected graph that connects all of the nodes (vertices) together without any cycles, in such a way as to minimize total edge weight. Because the tree minimizes total weight, correlations are typically first converted to distances, so that strongly correlated pairs are joined by short edges. The topological properties of the MST provide an effective means for summarizing the most essential features of a correlation-based system.
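As a minimal sketch of the construction described above, the following builds a spanning tree from a small correlation matrix. It rests on two assumptions that the text does not specify: correlations are mapped to distances with the commonly used transformation $d_{ij} = \sqrt{2(1 - \rho_{ij})}$, and the tree is built with Kruskal's algorithm. The function names and the toy matrix are hypothetical and serve only as illustration.

```python
import math


def correlation_to_distance(rho):
    """Map a correlation in [-1, 1] to a distance in [0, 2] (assumed transform)."""
    return math.sqrt(2.0 * (1.0 - rho))


def minimum_spanning_tree(corr):
    """Return the MST edges (i, j, distance) of a complete correlation graph.

    `corr` is a symmetric n x n list of lists of correlation coefficients.
    Kruskal's algorithm: scan edges from shortest to longest and keep an
    edge only if it joins two previously unconnected components.
    """
    n = len(corr)
    edges = sorted(
        (correlation_to_distance(corr[i][j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))  # union-find forest, one component per node

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding this edge cannot create a cycle
            parent[root_i] = root_j
            tree.append((i, j, dist))
            if len(tree) == n - 1:
                break
    return tree


# Hypothetical 4-stock correlation matrix, for illustration only.
toy_corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
mst_edges = minimum_spanning_tree(toy_corr)
```

In this toy example the tree keeps the two strongly correlated pairs (stocks 0 and 1, stocks 2 and 3) and joins the two groups through the strongest remaining link, which is the kind of cluster structure the MST is meant to expose.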

## Objectives

The ultimate dual objectives of this project are to 1) develop robust metrics that can characterize the time-varying level of diversity in a universe of single stocks, and 2) propose a feedback or feedforward control based on these diversity metrics that can be used to enhance the performance of a simple systematic trading strategy.

## Motivation

Although the public equity markets are highly accessible for nearly all classes of global investors, these markets pose some significant challenges. In particular, there is a significant degree of co-movement across single stocks, making the construction of a well-diversified portfolio difficult. The high degree of co-movement makes an investor vulnerable to broad-based declines in equity markets.

One of the simplest and most effective strategies employed by active investors to control the risk associated with broad-based declines and enhance performance when markets are rising involves exploiting a well-known stylized fact of equity markets, namely that stocks that are moving strongly in a particular direction tend to continue to move in the same direction (i.e., they possess ‘momentum’). Momentum investing systems focus on identifying stocks that are moving persistently in a particular direction and taking a position to benefit from that directional movement. The long-only version of such strategies buys stocks that are rising most persistently and exits long positions when markets reverse.

To reduce the volatility of such a strategy, stocks that move together are typically grouped and bet as though they represent a single ‘factor’. Indeed, the most challenging aspect of developing a so-called momentum system is not the identification of momentum stocks, but rather the selection of diverse groups of stocks that – when held together – provide portfolio return smoothing and accelerate the speed of compounding. 

For the class of market participants employing fully systematic approaches to manage their investments, it is possible to determine the exact responses of their strategies to any conceivable set of market conditions. As a result, they can conduct sensitivity analysis to systematically uncover undesirable strategy behavior and enhance strategy robustness.
Systematic traders generally use sensitivity analysis to identify the set of conditions under which the system will operate within acceptable bounds. In this project, we refer to this set of conditions as the *operational domain* of the strategy (for a specific set of trading model parameters). The broader the spectrum of market conditions over which a trading system can perform within acceptable performance bounds (i.e., the broader the operational domain of the strategy), the more robust the system.

In general, the *operational domain* of a trading strategy can be broadened through the introduction of feedback and feedforward risk controls. Feedback risk controls operate to reduce the impact of unpredictable phenomena or events on strategy performance, while feedforward controls exploit regularities in market structure to make local predictions that aid in the enhancement of strategy performance. We use feedback controls when poor trading performance is not driven by something we can predict. We use feedforward controls when we understand the drivers of poor performance and there is enough persistence in the market conditions for us to effectively anticipate future poor performance.

In this project, we seek to first develop metrics that can quantify the evolution of the state of diversity in a particular universe of single stocks over time, then propose an associated feedback or feedforward control that can be used to enhance the performance of a simple systematic trading strategy.

## Data Sources

To perform the proposed analysis briefly outlined in this proposal, we require a list of the current constituents of the S&P500 index, along with corresponding sector, sub-industry, and price data.

The first part of the data set - the instrument master for our universe under study - is to be scraped from Wikipedia (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies). The second part of our data set - corresponding prices for each instrument - is collected from Yahoo Finance using the 'pandas_datareader' package.

A third data set composed of company descriptions and sector/industry/sub-industry classifications was gathered via web-scraping. This data set included 9000+ company descriptions scraped from the public Bloomberg website. Bloomberg has indicated that the collection of this data is a violation of the terms of service agreement and thus work on all pieces of the project that required this third data set unfortunately had to be halted.

## Work Plan

The rough work plan for this project is as follows:

1) Exploit the findings of Random Matrix Theory (RMT) to reduce the statistical uncertainty associated with large, noisy correlation matrices due to the finite length of time series and facilitate the extraction of statistically reliable information about the cross-sectional relationships between single stocks from the return-based correlation matrix;

2) Construct correlation-based networks for each day over the period of study and apply both spectral- and graph theory-based clustering techniques to uncover non-random structure in the returns of the single stocks in the chosen instrument universe;

3) Develop metrics to characterize the time-varying level of diversity in the chosen instrument universe based on the evolution of the topological properties of the Minimal Spanning Tree (MST) over time;

4) Explore the robustness / stability of the different metrics developed before and after RMT-based filtering of the input correlation matrices;

5) Propose a feedback or feedforward control based on the diversity metrics developed that could be used to enhance the performance of a simple systematic trading strategy.
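
For step 1, one common RMT-based criterion compares the eigenvalues of the empirical correlation matrix with the Marchenko-Pastur bounds, which give the eigenvalue range expected when returns are purely random and uncorrelated. The sketch below assumes this particular criterion (the work plan does not name one), unit-variance returns, and fewer stocks than observations. The function names, universe size, and window length are hypothetical.

```python
import math


def marchenko_pastur_bounds(n_assets, n_observations, variance=1.0):
    """Eigenvalue range expected for a correlation matrix of pure noise.

    Assumes i.i.d. returns with the given variance and
    n_assets < n_observations.
    """
    if not 0 < n_assets < n_observations:
        raise ValueError("need 0 < n_assets < n_observations")
    q = n_assets / n_observations
    lower = variance * (1.0 - math.sqrt(q)) ** 2
    upper = variance * (1.0 + math.sqrt(q)) ** 2
    return lower, upper


def count_signal_eigenvalues(eigenvalues, n_assets, n_observations):
    """Count eigenvalues above the noise band; these may carry real structure."""
    _, upper = marchenko_pastur_bounds(n_assets, n_observations)
    return sum(1 for value in eigenvalues if value > upper)


# Hypothetical setting: 500 stocks and 2000 daily returns, so q = 0.25.
lower_bound, upper_bound = marchenko_pastur_bounds(500, 2000)
```

Under these assumptions, eigenvalues that fall inside the band are indistinguishable from noise, while those above the upper bound are candidates for genuine cross-sectional structure. Shorter estimation windows widen the band, which is one way to see why finite time series length limits how much information the correlation matrix can carry.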

## Concerns

Although there is a deep literature on the application of Minimal Spanning Trees to financial markets, there are numerous well-known issues to which no solutions are yet available.

Initial exploratory work has shown that even with the application of RMT-based filtering of the correlation matrices used to construct the Minimal Spanning Trees, the most important single stocks - i.e., those at the center of clusters - appear to change over time. It is not yet clear whether this reflects fragility in the Minimal Spanning Tree (MST) methodology or part of the underlying dynamics of the markets. Small perturbations of the input data can cause big differences in the resulting clusters. RMT-based filtering clearly improves the stability of results, but a bootstrapping-based approach might have yielded a more robust, intuitive, and interpretable result. However, if the observed instability partly results from our use of the Pearson linear correlation - which is known to be sensitive to outliers and is not ideal for non-Gaussian distributions - bootstrapping would not necessarily improve our results.

The use of only publicly available data limits the practical usability of the work in this project. In particular, it impairs our ability to address survivorship bias. Commercial data that includes all companies that have moved in and out of the S&P500 over the period under study cannot be made available to unlicensed parties, and thus its use would unfortunately obstruct reproducibility.

