# Abstract

# Table of Contents

# Introduction

## Background

The U.S. Securities and Exchange Commission (SEC) is an independent U.S. government agency that oversees the securities markets. Its primary goals are to protect investors, prevent fraud, and ensure fair stock market trades (SEC, n.d.). All publicly traded companies are required by federal law to file quarterly (10-Q) and annual (10-K) reports with the SEC (Form and Content of Financial Statements, 2022).

The ability to identify anomalies can help individual investors avoid investing in companies with higher financial risk. It can help broker-dealers invest client assets in better suited securities that carry less risk. Lastly, anomaly detection can help the SEC identify companies that might require an additional audit of their financial statements.

## Problem Statement

Manually retrieving and reviewing every public company's financial statements to find anomalies is a very laborious process that also requires expertise in finance and accounting. 

**Individual Investors**  
For individual investors without any financial acumen or data-wrangling skills, reviewing SEC-filed financial statements for anomalous companies is a very challenging task. The ability to easily identify anomalous companies would save them time and reduce their need for financial expertise.

**Corporate Investors**  
Prospective customers expect their broker or financial advisor to instantaneously identify anomalous companies and answer questions about them. Having a tool at their disposal that clusters anomalous companies would provide value to both the advisor and consumer.

**The SEC**  
A tool for clustering companies on the basis of their submitted financials could allow agency analysts to quickly identify suspicious companies that may require closer review.

## Objectives

1. Cluster public companies based on commonly disclosed financial metrics. 
2. Detect anomalous public companies given their commonly disclosed financial metrics. 

# Literature Review

Existing research on clustering financial statements for auditing and investing purposes focuses on distance, partition, hierarchical, and neural-network-based methods. One of the main challenges of this research is balancing model flexibility and interpretability. Flexibility is required to accurately capture patterns in complex multidimensional data, and to lessen the burden of prior knowledge placed on the end user. Interpretability, on the other hand, is necessary for the end user, whether an auditor or a prospective investor, to trust the results enough to actually use them. The present study attempts to address this challenge and fill a methodological gap in the existing literature by exploring density-based clustering methods, which have been shown to perform well on other financial anomaly detection tasks (Ahmed et al., 2016). Specifically, this work uses the Ordering Points to Identify the Clustering Structure (OPTICS) algorithm to cluster U.S. public company financial statements and detect anomalies. OPTICS is flexible in that it can fit clusters of arbitrary shape and represent clustering information for a wide range of hyperparameter values. Furthermore, its resulting reachability plot provides intuitive cluster interpretation, especially when accompanied by a corresponding attribute plot (Ankerst et al., 1999). 

# Methodology

This project uses the density-based OPTICS algorithm to cluster and detect anomalous U.S. public company financial statements. The raw data were retrieved from the SEC's APIs and consist of the most commonly disclosed financial metrics of 2020. We explored the data for patterns and issues, which informed our preprocessing decisions. The data were normalized, imputed, and decomposed into their principal components prior to modeling. We used the following Python libraries to implement our methodology: requests, numpy, pandas, matplotlib, seaborn, pca, scikit-learn, and streamlit. 

## Data Acquisition and Aggregation

Data were acquired via the SEC's eXtensible Business Reporting Language (XBRL) APIs. These APIs contain data on thousands of distinct facts, but we were only interested in the most commonly reported financial ones. To identify the most prevalent metrics, we randomly sampled 100 companies from the Company Facts API (SEC, 2022a). The sampled companies collectively disclosed 5,305 unique facts, but we only retained the 46 facts reported by the majority of companies in the sample. Data on those metrics for the 2020 calendar year were then retrieved from the Frames API, though three of them had no available data (SEC, 2022b). This resulted in a dataset of 7,683 companies that disclosed at least one of the 43 remaining facts for the 2020 calendar year. Figure 1 illustrates the data acquisition process. Once acquired, the data were randomly split into a training (80%) and test (20%) set. Although this work involves unsupervised learning of unlabeled data, the test set serves to ensure the methodology does not overfit to spurious clusters in the training set. 

In [34]:
from IPython.display import Image

print('Figure 1 \nData Acquisition Workflow')
Image(url='../figures/1-data-acquisition-workflow.png', width=750, height=750)

Figure 1 
Data Acquisition Workflow
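
The fact-selection step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `frames_url` and `majority_facts` and the toy sample are hypothetical, and the `CY2020Q4I` period code is an assumption about how a point-in-time 2020 value would be requested. The URL pattern follows the SEC's public Frames API documentation. No request is sent here; a real call would also need a descriptive User-Agent header.

```python
from collections import Counter

# URL pattern for the SEC Frames API (one fact, one unit, one period).
FRAMES_URL = "https://data.sec.gov/api/xbrl/frames/{taxonomy}/{tag}/{unit}/{period}.json"


def frames_url(tag, unit="USD", period="CY2020Q4I", taxonomy="us-gaap"):
    """Build the Frames API URL for one fact (hypothetical helper)."""
    return FRAMES_URL.format(taxonomy=taxonomy, tag=tag, unit=unit, period=period)


def majority_facts(facts_by_company):
    """Return the facts disclosed by more than half of the sampled companies.

    facts_by_company maps a company identifier to the set of fact names
    it has disclosed (for example, parsed from the Company Facts API).
    """
    counts = Counter(fact for facts in facts_by_company.values() for fact in set(facts))
    threshold = len(facts_by_company) / 2
    return sorted(fact for fact, n in counts.items() if n > threshold)


# Toy sample standing in for the 100 randomly sampled companies.
sample = {
    "A": {"Assets", "Liabilities", "Goodwill"},
    "B": {"Assets", "Liabilities"},
    "C": {"Assets"},
}
selected = majority_facts(sample)
urls = [frames_url(tag) for tag in selected]
```

In this toy sample only the facts reported by at least two of the three companies survive, mirroring how the 46 majority-reported facts were kept out of 5,305.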


## Data Quality  

The primary data quality issue was the prevalence of missing values (See Figure 2). We only retained variables missing fewer than 30% of their values, which included the following 11 variables: 
1. Current Assets 
2. Total Assets 
3. Cash and Cash Equivalents at Carrying Value 
4. Authorized Shares of Common Stock 
5. Issued Shares of Common Stock 
6. Common Stock Value
7. Liabilities and Stockholders Equity 
8. Current Liabilities 
9. Total Liabilities 
10. Retained Earnings (Accumulated Deficit)
11. Stockholders Equity 
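
The 30% missingness filter can be sketched with pandas as shown below. The function name `drop_sparse_columns` and the toy data are hypothetical and only illustrate the rule stated above.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.30):
    """Keep only columns whose fraction of missing values is below max_missing."""
    missing_fraction = df.isna().mean()
    return df.loc[:, missing_fraction < max_missing]


# Toy frame: 'Assets' is 20% missing, 'Goodwill' is 60% missing.
toy = pd.DataFrame({
    "Assets": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Goodwill": [np.nan, np.nan, np.nan, 1.0, 2.0],
})
kept = drop_sparse_columns(toy)
```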

## Exploratory Data Analysis

We explored the univariate and bivariate distributions of the 11 variables and found them all to be extremely right-skewed. Figure 2 shows the distribution of total assets. Furthermore, many of them were strongly positively correlated (See Figure 3). For example, Figure 4 illustrates the strong positive correlation between Retained Earnings (Accumulated Deficit) and Stockholders Equity. 

In [35]:
print('Figure 2 \nDistribution of Total Assets')
Image(url='../figures/2-distribution-of-total-assets.png', width=750, height=750)

Figure 2 
Distribution of Total Assets


In [36]:
print('Figure 3 \nCorrelation Matrix Heatmap')
Image(url='../figures/3-correlation-matrix-heatmap.png', width=750, height=750)

Figure 3 
Correlation Matrix Heatmap


In [33]:
print('Figure 4 \nScatterplot of Retained Earnings (Accumulated Deficit) and Stockholders Equity')
Image(url='../figures/4-scatterplot-of-retained-earnings-accumulated-deficit-and-stockholders-equity.png', width=750, height=750)

Figure 4 
Scatterplot of Retained Earnings (Accumulated Deficit) and Stockholders Equity


## Feature Engineering

We applied preprocessing techniques to de-skew, center, and scale the features, impute missing values, and remove multicollinearity. These transformations were applied to the training and test datasets independently to avoid information leakage. 

### Normalization

We applied the Yeo-Johnson power transformation to each variable because, unlike the Box-Cox transformation, it can handle negative values (Yeo & Johnson, 2000). This made the features more normally distributed, with a mean of 0 and a standard deviation of 1. Figure 5 shows the transformed distribution of total assets, which can be compared to its original distribution shown in Figure 2. 

In [37]:
print('Figure 5 \nTransformed Distribution of Total Assets')
Image(url='../figures/5-transformed-distribution-of-total-assets.png', width=750, height=750)

Figure 5 
Transformed Distribution of Total Assets
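
As a rough sketch of this normalization step, scikit-learn's `PowerTransformer` applies the Yeo-Johnson transformation and, with `standardize=True`, also centers and scales the result. The synthetic feature below is an assumption made for illustration and is not the project's data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed feature with some negative values, loosely
# mimicking a metric such as retained earnings (accumulated deficit).
skewed = rng.lognormal(mean=0.0, sigma=1.5, size=(1000, 1)) - 1.0

# standardize=True (the default) centers and scales the transformed output.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = transformer.fit_transform(skewed)


def skewness(x):
    """Sample skewness: the mean cubed z-score."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())
```

Comparing `skewness(skewed)` with `skewness(transformed)` gives a quick numeric check of the de-skewing that Figures 2 and 5 show visually.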


### Imputation

We leveraged the strong mutual information among the features to impute their missing values. Specifically, each missing value was estimated from the Euclidean distance-weighted average of its five nearest neighbors, which has been shown to be a robust method (Troyanskaya et al., 2001). 
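
A minimal sketch of distance-weighted nearest-neighbor imputation, assuming scikit-learn's `KNNImputer` is a reasonable stand-in for the approach described above. The toy matrix is hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing entry; rows are companies, columns are features.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, 10.0],
    [6.0, np.nan],
])

# Five nearest neighbors, each weighted by the inverse of its distance.
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_imputed = imputer.fit_transform(X)
```

Because closer rows carry more weight, the imputed value for the last row is pulled toward its nearest neighbors rather than toward the plain column mean.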

### Principal Component Analysis

We applied principal component analysis to remove multicollinearity among the features. The first six principal components explained 94.5% of the variance, and the remaining five were discarded. 

In [38]:
print('Figure 6 \nExplained Variance by Principal Component')
Image(url='../figures/6-explained-variance-by-principal-component.png', width=750, height=750)

Figure 6 
Explained Variance by Principal Component
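
The dimensionality reduction can be sketched as follows. Note that this example uses scikit-learn's `PCA` class rather than the separate `pca` package listed in the methodology, and the correlated synthetic features are an assumption standing in for the 11 preprocessed variables.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the 11 normalized, imputed features: a few
# latent factors plus noise, so the columns are strongly correlated.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 11))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 11))

# Keep the first six components and discard the remaining five.
pca = PCA(n_components=6)
components = pca.fit_transform(X)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

The resulting component scores are mutually uncorrelated, which is the property that removes multicollinearity before clustering.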


Analysis of each component's loadings found them to be quite interpretable. For example, PC1 was positively correlated with all of the features, and therefore appeared to represent overall company size. PC2 was most strongly correlated with stockholders' equity and retained earnings, so it could be understood as overall company value. Figure 7 shows a scatter plot of the first two principal components and their strongest loadings: total assets and stockholders' equity, respectively. We can see that most companies are densely distributed within this two-dimensional projection. 

In [39]:
print('Figure 7 \nScatterplot of First Two Principal Components')
Image(url='../figures/7-scatterplot-of-first-two-principal-components.png', width=750, height=750)

Figure 7 
Scatterplot of First Two Principal Components


## Modeling

We used the density-based algorithm OPTICS to cluster the companies and detect anomalies. The training and test datasets were used for hyperparameter tuning and validation, respectively. The final model was retrained on the entire dataset, interpreted, and deployed to an interactive web application. 

### Selection of Modeling Techniques 
We decided to use the density-based OPTICS clustering algorithm for its flexibility and interpretability. It can identify clusters of arbitrary shape and has built-in anomaly detection. Furthermore, its reachability scores provide an intuitive representation of the clustering structure and anomaly strength (Ankerst et al., 1999). 

### Test Design
Because ground truth cluster and outlier labels were unavailable for this task, we relied on internal cluster validation methods. Specifically, we used the Silhouette coefficient as our measure of cluster validity (Rousseeuw, 1987). We used 5-fold cross-validation on the training data to find the hyperparameter values that yielded the highest Silhouette coefficient. The hyperparameters we tuned were ɛ, which is the neighborhood radius in Euclidean distance, and *MinPts*, which is the number of neighbors required for a point to be considered "core" (Ankerst et al., 1999). 

Instead of using OPTICS' reachability slope criteria to extract clusters, we opted for its DBSCAN-like method because it allows the user to explicitly set and visualize the noise threshold on the y-axis of the reachability plot, which supports our goal of intuitive anomaly detection. 
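
A minimal sketch of this setup with scikit-learn is shown below. It fits OPTICS with DBSCAN-like extraction and scores the labeling with the Silhouette coefficient. The synthetic data and the specific values passed to `eps` and `min_samples` are illustrative assumptions, the cross-validation loop is omitted, and treating the noise label as its own group when computing the Silhouette coefficient is an assumption of this sketch rather than a confirmed detail of the study.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the principal components: one dense blob
# plus a handful of far-away points that should be flagged as noise.
dense = rng.normal(scale=1.0, size=(300, 2))
outliers = np.array([
    [15.0, 15.0], [-15.0, 15.0], [15.0, -15.0], [-15.0, -15.0], [20.0, 0.0],
])
X = np.vstack([dense, outliers])

# min_samples given as a fraction of the dataset (MinPts = 1%);
# eps is the noise threshold used by the DBSCAN-like extraction.
model = OPTICS(min_samples=0.01, cluster_method="dbscan", eps=3.5)
labels = model.fit_predict(X)

# Noise points carry the label -1 and are scored as a separate group here.
score = silhouette_score(X, labels)
```

A grid search would repeat this fit for each candidate pair of ɛ and *MinPts* and keep the pair with the highest score.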

# Results and Findings

The hyperparameter values that yielded the highest 5-fold cross-validated Silhouette coefficient of 0.77 were ɛ = 3.5 and *MinPts* = 1% (See Figure 8). The model was then fit to the entire training dataset, placing 99.0% of the companies into a single normal cluster and classifying the remaining 1.0% as noise. The model was then fit to the test dataset, which yielded a Silhouette coefficient of 0.70 and labeled 99.2% and 0.8% of the companies as normal and noise, respectively. 

In [40]:
print('Figure 8 \nHyperparameter Tuning Results')
Image(url='../figures/8-hyperparameter-tuning-results.png', width=750, height=750)

Figure 8 
Hyperparameter Tuning Results


## Evaluation of Results

The model's consistency across the training and test sets suggests it did not overfit, and its anomaly detection rate of roughly 1% seemed reasonable for our application. For those reasons, we decided to retrain the model on the entire dataset and interpret the results. One of OPTICS' outputs is each point's reachability distance, which is the smallest distance at which the point becomes density-reachable from a core point processed before it (Ankerst et al., 1999). The reachability plot in Figure 9 shows that the vast majority of companies fell below ɛ and were fairly consistently reachable, while the small fraction that exceeded it were much more isolated. 

In [41]:
print('Figure 9 \nReachability Plot')
Image(url='../figures/9-reachability-plot.png', width=750, height=750)

Figure 9 
Reachability Plot
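
The values behind a reachability plot, and a ranking of noise points by reachability, can be sketched as follows. The data are synthetic and the hyperparameter values are illustrative assumptions. The attributes `reachability_`, `ordering_`, and `labels_` belong to scikit-learn's fitted `OPTICS` estimator.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
# One dense blob plus two isolated points at different distances.
dense = rng.normal(size=(200, 2))
isolated = np.array([[12.0, 0.0], [0.0, 30.0]])
X = np.vstack([dense, isolated])

model = OPTICS(min_samples=0.01, cluster_method="dbscan", eps=3.5).fit(X)

# Reachability in cluster order is what the reachability plot displays.
reachability_in_order = model.reachability_[model.ordering_]

# Rank noise points from most to least isolated.
noise_idx = np.flatnonzero(model.labels_ == -1)
ranked = noise_idx[np.argsort(-model.reachability_[noise_idx])]
```

Here the point at (0, 30) should rank ahead of the point at (12, 0) because it lies much farther from the dense region.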


Figure 10 provides further intuition by visualizing the clustering results in the first two principal components. Within this two-dimensional space, the majority of the normal companies are densely clustered near the origin, while most of the anomalies are spread farther out. However, some anomalies lie close to the origin, which suggests they are outliers in one or more of the principal components not visualized here. 

In [42]:
print('Figure 10 \nScatterplot of Clustering Output in First Two Principal Components')
Image(url='../figures/10-scatterplot-of-clustering-output-in-first-two-principal-components.png', width=750, height=750)

Figure 10 
Scatterplot of Clustering Output in First Two Principal Components


After identifying the anomalies, we sorted them by their reachability scores to provide a prioritized list for manual investigation. The most anomalous company was the Federal National Mortgage Association, also known as Fannie Mae. Upon inspecting its features, we found that it was an extreme outlier in the fourth principal component, which was most strongly correlated with accumulated deficit (see Figure 11). According to our raw dataset, Fannie Mae reported an accumulated deficit of $108 billion at the end of 2020, so it is unsurprising that it is anomalous within that dimension. Furthermore, the Federal Housing Finance Agency (FHFA) placed Fannie Mae into conservatorship in 2008 after concluding it would become insolvent (FHFA, 2022). This validates our previous interpretation of the fourth principal component as rep

# Discussion

## Conclusion

## Future Studies