
# Report

## Introduction

This project replicates and extends the findings of the “Lazy Prices” paper, which examines whether the repetitiveness of language in firms’ annual 10-K filings can predict future stock returns. The core idea is that investors may underreact to boilerplate disclosures, allowing firms that repeat language across years to earn abnormal returns.

To test this, we constructed a full pipeline that transforms raw SEC filing data into word frequency vectors, computes cosine similarity scores between consecutive filings for each firm, and merges these signals with CRSP stock return data. The similarity score serves as a proxy for how much a firm’s disclosure has changed year-over-year.

Using this signal, we built long-short portfolios: each month, we went long on firms with the highest similarity (most repetitive disclosures) and short on those with the lowest similarity (most novel changes). Returns were computed and tracked over time.

While we initially assessed our long-short portfolio’s performance on its own, we later compared it to other long-short portfolios from the Open Asset Pricing dataset. Since our strategy is market-neutral and involves zero net investment, these portfolios provided a more appropriate benchmark. We tested the robustness of our similarity-based strategy by implementing a winsorized version that reduced the impact of extreme returns. In addition, we generated multiple randomized long-short portfolios for comparison. T...

---

## Data

### 1. SEC Filing Data – Word Frequency Dictionary

We began by transforming raw textual data from the Loughran-McDonald 10-K dictionaries into a structured CSV format using `create_csv_dictionary.py`. Each line of the raw file includes a metadata header and a compressed string of word frequency counts. We parsed these into two parts: header fields (CIK, filing date, form type, etc.) and a vector of word frequencies.

To manage scale, we processed the file in batches of 10,000 filings and wrote the parsed data incrementally to disk as `full_docdict.csv`. This file contains one row per filing, with columns for metadata and each indexed word frequency (e.g., `word_1`, `word_2`, ...). This document-feature matrix formed the basis of our textual analysis.
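
The batch-and-append pattern described above can be sketched as follows. This is a minimal illustration, not the code in `create_csv_dictionary.py`: the raw line layout is not specified in this report, so `toy_parse` and its `meta|counts` layout are invented stand-ins, and `batched` and `write_in_batches` are hypothetical helper names.

```python
import csv
import io
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def write_in_batches(lines, parse, out, header, batch_size=10_000):
    """Parse `lines` one batch at a time and append the rows to the CSV stream `out`."""
    writer = csv.writer(out)
    writer.writerow(header)
    n_rows = 0
    for batch in batched(lines, batch_size):
        # Only one batch of parsed rows is held in memory at a time.
        writer.writerows(parse(line) for line in batch)
        n_rows += len(batch)
    return n_rows


def toy_parse(line):
    """Invented layout for illustration: 'cik,date|count count ...'."""
    meta, counts = line.strip().split("|")
    return meta.split(",") + counts.split()


raw = ["1,2010-03-01|2 0 1\n", "1,2011-03-01|2 1 1\n", "2,2011-02-15|0 0 4\n"]
buf = io.StringIO()  # stands in for an open file handle
header = ["cik", "filing_date", "word_1", "word_2", "word_3"]
n = write_in_batches(raw, toy_parse, buf, header, batch_size=2)
```

With a real file, `out` would be a handle opened once in write mode, so each batch is flushed to disk before the next one is parsed.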

#### Text Preprocessing and Vectorization

The Loughran-McDonald (LM) dataset represents each preprocessed 10-K filing as a set of word frequency counts. Our custom script extracted and standardized these counts into a sparse matrix format, with one row per filing and one column per word index.

Each document was thus represented as a high-dimensional frequency vector, with stopwords, punctuation, and boilerplate formatting already removed by the LM preprocessing. These vectors served as the foundation for all textual analysis.


### 2. CRSP Monthly Stock Data from WRDS

We obtained monthly stock return data from the CRSP database via WRDS. The dataset was pre-filtered to include only standard U.S. common stocks traded on major exchanges. Specifically, we retained securities with `SHRCD` codes of 10, 11, or 12 (indicating common shares) and `EXCHCD` codes of 1, 2, or 3 (representing NYSE, AMEX, and NASDAQ, respectively).

This filtering excluded:
- Non-common equity securities (e.g., preferred shares, warrants, ADRs)
- Securities traded on OTC markets, Pink Sheets, and other non-major exchanges

As a result, our stock universe was relatively high quality, focused on actively traded common stocks from major U.S. markets. However, the filtered universe still included:
- Very small or illiquid firms
- Companies with highly volatile returns
- Firms that were technically listed but no longer actively traded

These firms contributed to the overall volatility of our portfolio-level returns. This may also **explain why the average returns associated with the similarity score signal remained modest**, especially when evaluated across the full universe. Still, the filtering ensured that our return analysis was conducted on a market-relevant and broadly investable set of firms.


### 3. Open Asset Pricing Data

To benchmark the performance of our similarity-based trading strategy, we compared it against signals from the Open Asset Pricing (OAP) dataset. These signals are well-established in the asset pricing literature and widely used for evaluating return predictability. The OAP dataset provides monthly long-short portfolio returns for various signals based on firm characteristics.

Given that our strategy is built on cosine similarity between firm filings—capturing how repetitive or novel a firm’s disclosure is from year to year—it is conceptually related to themes of information diffusion, firm inertia, and underreaction. Thus, we selected two benchmark signals that reflect these economic channels:

- **BM (Book-to-Market)**: A classic value signal. Firms with high textual similarity (i.e., those repeating old language) may be underfollowed or undervalued, which could cause their returns to correlate with BM-based value strategies.
- **Mom12m (12-Month Momentum)**: A trend-following signal. If disclosure similarity reflects inertia or slow reaction, firms with high similarity might also exhibit momentum in returns.

We used these two OAP signals as benchmarks in our return analysis (see `returns-analysis.ipynb`). They help assess whether our cosine similarity signal contains unique predictive power or simply overlaps with existing, well-known return drivers.

---

## Methods

### 1. Similarity Score Computation

We constructed a textual similarity signal by comparing each firm's consecutive 10-K filings using **cosine similarity**. Each filing, represented as a word frequency vector from `full_docdict.csv`, includes a CIK, filing date, and thousands of word count features. These vectors quantify how often each word appeared in the filing, forming the basis for our similarity measure.

![Cosine Similarity](figs/cosine_similarity.png)

To avoid memory overload, we processed the file in chunks of 10,000 filings. Within each chunk, we kept only the necessary columns, converted filing dates to datetime format, and treated missing word frequencies as zeros.

We then grouped filings by firm (CIK), sorted them chronologically, and calculated the cosine similarity between each filing and its immediate predecessor. A score close to 1 indicates highly repetitive language; lower values suggest substantial textual changes. The first filing for each firm received a NaN score since no prior document existed. The results were saved incrementally to `full_similarity_scores.csv`, forming a panel dataset of firm-date similarity signals.
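
The per-firm comparison can be sketched in plain Python as below. This is an illustrative sketch rather than the project's `compute_similarity.py`: it assumes each filing is available as a sparse `{word_index: count}` dictionary, and the function names and toy CIK are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors stored as {word_index: count}."""
    dot = sum(count * b.get(idx, 0) for idx, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return float("nan")  # undefined when a filing has no counted words
    return dot / (norm_a * norm_b)


def consecutive_similarity(filings):
    """filings: iterable of (cik, filing_date, vector) tuples.

    Returns (cik, filing_date, score) rows. The first filing of each firm
    gets NaN because it has no predecessor.
    """
    by_firm = defaultdict(list)
    for cik, date, vec in filings:
        by_firm[cik].append((date, vec))

    rows = []
    for cik, docs in by_firm.items():
        docs.sort(key=lambda d: d[0])  # chronological order within the firm
        prev = None
        for date, vec in docs:
            score = cosine_similarity(prev, vec) if prev is not None else float("nan")
            rows.append((cik, date, score))
            prev = vec
    return rows


# Toy data; ISO date strings sort chronologically.
filings = [
    ("0000000001", "2009-03-01", {1: 2, 2: 1}),
    ("0000000001", "2010-03-01", {1: 2, 2: 1}),  # identical wording
    ("0000000001", "2011-03-01", {3: 5}),        # entirely new wording
]
scores = consecutive_similarity(filings)
```

In the toy data, the second filing repeats the first and scores approximately 1, while the third shares no words with its predecessor and scores 0.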

These similarity scores became the **core explanatory variable** in our portfolio analysis, capturing the degree of language repetition in firm disclosures over time.

#### Limitations
Due to chunked processing, some firms were unintentionally split across chunk boundaries. In such cases, consecutive filings could not be matched, and similarity scores were missed. These edge cases were rare and randomly distributed, so they are unlikely to have biased our overall results. We accepted this tradeoff in favor of computational efficiency and scalability.
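
One way to avoid losing matches at chunk boundaries is to carry each firm's most recent vector from one chunk to the next. The sketch below shows that idea; it is not what the project implemented. It assumes filings arrive in chronological order within each firm and are stored as sparse `{word_index: count}` dictionaries, and `score_chunks` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {word_index: count} vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else float("nan")


def score_chunks(chunks):
    """chunks: iterable of lists of (cik, filing_date, vector).

    `last_seen` persists across chunks, so a firm whose filings straddle a
    chunk boundary is still compared with its previous filing.
    """
    last_seen = {}  # cik -> vector of that firm's most recent filing
    for chunk in chunks:
        for cik, date, vec in chunk:
            prev = last_seen.get(cik)
            score = cosine(prev, vec) if prev is not None else float("nan")
            yield cik, date, score
            last_seen[cik] = vec


# Firm "A" is split across the two chunks.
chunks = [
    [("A", "2009-03-01", {1: 1}), ("A", "2010-03-01", {1: 1})],
    [("A", "2011-03-01", {1: 1, 2: 1}), ("B", "2011-02-15", {5: 1})],
]
out = list(score_chunks(chunks))
```

The memory cost is one retained vector per firm, which is far smaller than holding every filing at once.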

#### Similarity Score Distribution
To validate and understand the behavior of our similarity variable, we used `score_distribution.ipynb` for exploratory analysis. We plotted the overall distribution of cosine similarity scores and confirmed that most scores clustered near 1.
We also examined Baxter International as a case study, revealing a sharp drop in similarity around 2010. This helped us interpret the score and confirm that our pipeline successfully captures meaningful year-to-year changes in disclosure language.


### 2. Custom Firm Identifier Mapping (CIK ↔ PERMNO)

To merge SEC 10-K filing data (indexed by CIK) with CRSP stock return data (indexed by PERMNO), we built a custom mapping that required extensive cleaning and integration. This step was critical for aligning textual similarity signals with financial returns and underpinned the accuracy of all downstream portfolio analyses.

We used two primary datasets:
- `df_class`: A Compustat/CRSP file containing PERMNO and ticker (TIC)
- `df_ticker_cik`: An SEC file mapping tickers to CIKs


#### Data Cleaning & Preparation

Both datasets were standardized through several steps:

- Tickers were lowercased and stripped of suffixes (e.g., ".A", ".B")
- CIKs were padded to 10-digit strings
- PERMNOs were converted from float to integer
- Duplicate ticker entries in `df_class` were removed to retain only unique `(permno, tic)` pairs
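
The cleaning steps listed above can be sketched as small helper functions. The names are hypothetical, and the suffix rule (a dot followed by a single letter) is an assumption based on the ".A" and ".B" examples; the project's actual rule may differ.

```python
import re


def clean_ticker(ticker):
    """Lowercase and drop a trailing class suffix such as '.A' or '.B'."""
    return re.sub(r"\.[a-z]$", "", ticker.strip().lower())


def clean_cik(cik):
    """Zero-pad a numeric CIK to a 10-character string."""
    return str(int(cik)).zfill(10)


def clean_permno(permno):
    """Convert a float-typed PERMNO such as 10107.0 to an integer."""
    return int(float(permno))


def unique_pairs(rows):
    """Keep each cleaned (permno, tic) pair once, preserving first-seen order."""
    return list(dict.fromkeys((clean_permno(p), clean_ticker(t)) for p, t in rows))


pairs = unique_pairs([(10107.0, "MSFT"), (10107.0, "msft"), (14593.0, "AAPL")])
```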


#### Merging Logic

We performed an inner join on the cleaned ticker field to construct a one-to-one CIK–PERMNO mapping. The resulting file, `permno_cik_map.csv`, contains approximately 4,900 matched firms. The matched set excludes internal placeholders, delisted or foreign firms, and mismatched tickers, leaving only legitimate, actively traded companies.

This mapping was intentionally precise. Although the raw datasets were much larger, many entries in CRSP referred to non-trading entities (e.g., ADRs, pink sheet stocks, or Compustat placeholders). These had no valid match in the SEC data and were therefore excluded.
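
A ticker-based inner join can be sketched as follows. This report does not state how ambiguous tickers were resolved, so dropping any ticker that appears more than once on either side is an assumption made here for illustration, and `build_cik_permno_map` is a hypothetical name.

```python
from collections import Counter


def build_cik_permno_map(permno_tic, tic_cik):
    """Inner-join (permno, tic) pairs with (tic, cik) pairs on the ticker.

    Tickers that occur more than once on either side are dropped
    (an assumption of this sketch), as are tickers without a match.
    """
    left_counts = Counter(tic for _, tic in permno_tic)
    right_counts = Counter(tic for tic, _ in tic_cik)
    cik_by_tic = {tic: cik for tic, cik in tic_cik if right_counts[tic] == 1}
    return {
        cik_by_tic[tic]: permno
        for permno, tic in permno_tic
        if left_counts[tic] == 1 and tic in cik_by_tic
    }


# Toy inputs: "dup" is ambiguous, "zzz" and "nomatch" have no partner.
permno_tic = [(10107, "msft"), (14593, "aapl"), (99999, "dup"), (88888, "dup"), (77777, "zzz")]
tic_cik = [("msft", "0000789019"), ("aapl", "0000320193"), ("dup", "0000000001"), ("nomatch", "0000000002")]
mapping = build_cik_permno_map(permno_tic, tic_cik)
```

Only the two unambiguous tickers survive, which mirrors the choice to favor reliable joins over raw coverage.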

#### Project Implications

This identifier mapping was essential for accurate merging of similarity scores and return data. Any mismatch would have compromised the portfolio construction and biased our signal evaluation. We treated this step as a core part of the project’s data integrity pipeline, prioritizing reliable joins over raw coverage.

Despite its complexity, this mapping ensured firm-level tracking consistency and enabled us to evaluate whether changes in disclosure language were associated with return patterns in the same firm over time.


### 3. Stock Return Data → Portfolio Construction

We implemented two strategies to construct long-short portfolios from similarity scores: one that uses only firm-months with a newly computed score and winsorizes returns (`winsorized`), and one that carries the most recent score forward (`propagate`). Each method merges the similarity scores with return data differently and led to distinct results.

#### a. Using unique similarity scores

In this approach, we merged similarity scores with stock returns using a left join with the similarity scores as the left table, keeping only the firm-month observations where a valid similarity score existed. This resulted in a smaller dataset that excluded firm-months without newly computed scores. To reduce the influence of extreme return values, we applied winsorization, capping the top 0.5% and flooring the bottom 25% of returns.
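
Winsorization clips values beyond chosen quantiles instead of discarding them. The sketch below leaves both cutoffs as parameters and uses linear interpolation for the quantiles, which is an assumption; the notebook may compute them differently. The function names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an ascending list, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def winsorize(returns, lower_q, upper_q):
    """Clip returns to the [lower_q, upper_q] quantile range, keeping their order."""
    ordered = sorted(returns)
    floor = quantile(ordered, lower_q)
    cap = quantile(ordered, upper_q)
    return [min(max(r, floor), cap) for r in returns]


# Toy example with wide cutoffs so the effect is visible on five values.
raw_returns = [-0.9, -0.02, 0.0, 0.03, 2.5]
clipped = winsorize(raw_returns, lower_q=0.25, upper_q=0.75)
```

The two extreme returns are pulled in to the nearest cutoff while the middle observations are unchanged.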

Each month, firms were sorted into quintiles based on their similarity scores. We formed a zero-investment, market-neutral portfolio by going long on firms in the top quintile (most repetitive) and short on firms in the bottom quintile (most novel). This strategy produced a positive average return, which aligns with the Lazy Prices hypothesis and suggests that repetitive disclosures may be underreacted to by investors.
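
For a single month, the quintile sort and long-short return can be sketched as below. Equal weighting within each leg and the floor-division bin size are assumptions of this sketch, and `long_short_return` is a hypothetical name.

```python
from statistics import mean


def long_short_return(firms, n_bins=5):
    """firms: list of (similarity_score, return) pairs for one month.

    Returns the equal-weighted return of the top bin (most repetitive)
    minus that of the bottom bin (most novel).
    """
    ranked = sorted(firms, key=lambda f: f[0])
    size = len(ranked) // n_bins
    if size == 0:
        raise ValueError("need at least n_bins firms to form the portfolios")
    bottom, top = ranked[:size], ranked[-size:]
    return mean(r for _, r in top) - mean(r for _, r in bottom)


# Toy cross-section of ten firms.
month = [
    (0.10, -0.01), (0.20, 0.00), (0.30, 0.01), (0.40, 0.00), (0.50, 0.02),
    (0.60, 0.01), (0.70, 0.00), (0.80, 0.01), (0.90, 0.02), (1.00, 0.04),
]
ls = long_short_return(month)
```

With ten firms, each quintile holds two names, so the result here is the average return of the two highest-similarity firms minus that of the two lowest.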

This method offers a more conservative and cleaner evaluation by avoiding outdated or missing signals. The tradeoff is reduced coverage, but the gain is higher signal integrity per observation.

This method is applied in the `portfolio_construction_winsorized.ipynb` file.

#### b. Using propagated similarity scores

This version followed the alternative approach seen in the original Lazy Prices paper. Here, we performed a right join, keeping the full stock return panel and forward-filling the most recent similarity score until a new filing was available. The underlying logic assumes that firms trade on their latest known disclosure information until an updated report is filed.

However, in our implementation, this method produced negative average returns, which is inconsistent with prior findings. Upon review, we identified a limitation in our similarity score generation. Because we processed the dataset in chunks to manage computational load, many firms were split across chunk boundaries, causing similarity scores to be missing for several periods.

When we forward-filled these missing values, we inadvertently propagated outdated similarity signals, assigning firms to incorrect quintiles for several months. This likely led to trading on stale or invalid signals, undermining the logic of the strategy and reversing its performance.
While the propagation method is theoretically sound and empirically validated in the literature, its effectiveness depends heavily on the completeness of the underlying similarity data. In our case, the missing scores introduced enough distortion to invalidate the approach.
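
The forward-fill logic, and the way a missing score turns into a stale signal, can be sketched as a lookup of the latest score on or before a given month. Representing months as integer indices (`year * 12 + month`) is an assumption, and the optional `max_age` staleness limit is a hypothetical safeguard that was not part of our implementation.

```python
from bisect import bisect_right


def latest_score(score_months, scores, month, max_age=None):
    """Return the most recent score dated on or before `month`, else None.

    score_months: ascending integer month indices (year * 12 + month).
    max_age: optional limit, in months, on how old a carried score may be.
    """
    i = bisect_right(score_months, month) - 1
    if i < 0:
        return None  # no filing yet
    if max_age is not None and month - score_months[i] > max_age:
        return None  # score too old to trade on
    return scores[i]


# A firm with scores in March 2010 and March 2012; the 2011 score is missing.
months = [2010 * 12 + 3, 2012 * 12 + 3]
values = [0.95, 0.60]

fresh = latest_score(months, values, 2010 * 12 + 6)
stale = latest_score(months, values, 2011 * 12 + 9)
guarded = latest_score(months, values, 2011 * 12 + 9, max_age=12)
```

Without a limit, the September 2011 lookup still returns the 2010 score, which is the kind of stale assignment described above; with `max_age=12` the firm is left unassigned instead.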

This method is applied in the `portfolio_construction_propagate.ipynb` file.

---

## Results

Our project tested whether firm-level textual similarity in 10-K filings could be used to construct a profitable long-short trading strategy, as proposed in the _Lazy Prices_ paper. To evaluate the performance of our signal, we implemented two trading strategies — one using carefully filtered and winsorized returns, and another that propagates similarity scores forward over time. Below, we present and interpret our two key return visualizations: log cumulative returns and rebased cumulative returns compared to OAP benchmarks.

The code that builds the analysis can be found in the `returns-analysis.ipynb` file.

### Log Cumulative Returns (Primary Output)

This plot represents the most important output of our project, capturing the cumulative log returns for:

- **wins_return (orange)**: our similarity-based long-short portfolio using only valid signal months and winsorized returns
- **prop_return (blue)**: a version that propagates similarity scores forward across months with no new filings
- **BM (green) and Mom12m (red)**: standard long-short benchmark strategies from the Open Asset Pricing (OAP) dataset

The orange line shows a clear and consistent positive trend. This is consistent with the Lazy Prices hypothesis: firms with more repetitive disclosures tend to outperform those with more novel language, likely due to investor inattention.

The blue line, by contrast, remains flat or declines over time. This underperformance is not due to the signal itself but to implementation issues. Our cosine similarity scores were computed in large chunks due to memory constraints (`compute_similarity.py`), and many firms lacked scores in certain years. The propagation strategy — implemented in `portfolio_construction_propagate.ipynb` — forward-filled stale signals into these missing months, resulting in incorrect portfolio assignments. This caused the strategy to trade the wrong bins and underperform.

Meanwhile, the benchmark strategies (BM and Mom12m) perform as expected. This suggests that our return data and sample period are sound and that the performance differences between our two strategies are driven by signal quality, not external noise.

![Log Cumulative Returns of Portfolio](figs/log_cumulative_ret.png)
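
For reference, the relationship between monthly returns, log cumulative returns, and a series rebased to 1 can be sketched as follows. This is a generic illustration with hypothetical function names, not the code in `returns-analysis.ipynb`.

```python
import math
from itertools import accumulate


def log_cumulative(returns):
    """Running sum of log(1 + r) for simple monthly returns r."""
    return list(accumulate(math.log1p(r) for r in returns))


def rebased_cumulative(returns, base=1.0):
    """Value of `base` invested at the start and compounded each month."""
    return [base * math.exp(x) for x in log_cumulative(returns)]


monthly = [0.10, -0.10, 0.05]
log_cum = log_cumulative(monthly)
wealth = rebased_cumulative(monthly)
```

Log cumulative returns add across months, so a constant growth rate plots as a straight line; exponentiating recovers the compounded value of one unit invested at the start.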

### Monthly Returns

This plot displays monthly returns for the four portfolios of interest.

![Monthly Returns](figs/monthly_returns.png)

Our similarity-based long-short portfolios display far lower volatility than the Open Asset Pricing benchmarks.


### 12-Month Moving Averages of Returns


This chart shows 12-month moving averages of portfolio returns for two similarity-based strategies (prop_return and wins_return) and two benchmark factors (BM and Mom12m) from the Open Asset Pricing dataset. 

The winsorized similarity portfolio (orange) shows mildly positive average returns with some volatility, while the propagated similarity strategy (blue) remains flat, suggesting that stale or improperly filled signals likely diluted its effectiveness. The momentum strategy (Mom12m, red) displays the highest variability and strongest positive trends, especially in recent years, reflecting well-known momentum effects. The BM factor (green) exhibits more modest fluctuations but still demonstrates episodes of strong relative performance. 

![12-Month Moving Averages of Returns](figs/moving_avg_ret.png)
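
A trailing 12-month moving average can be sketched as below. Leaving the first eleven months undefined is an assumption of this sketch; the notebook may treat the start of the sample differently.

```python
from collections import deque


def moving_average(values, window=12):
    """Trailing mean over `window` observations; None until a full window exists."""
    buf = deque()
    total = 0.0
    out = []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the observation that left the window
        out.append(total / window if len(buf) == window else None)
    return out


# Short window so the toy output is easy to check by hand.
smoothed = moving_average([1, 2, 3, 4], window=3)
```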

### Cumulative Returns – Lazy Prices vs OAP Signals

This plot shows the standard cumulative returns of the same strategies, rebased to 1 for easier visual comparison. Again, we see that:

- The winsorized similarity strategy (orange) steadily grows in value
- The propagated strategy (blue) delivers no meaningful return
- The BM (green) and momentum (red) benchmarks substantially outperform both similarity strategies

The steady growth of the winsorized version (orange) is consistent with the Lazy Prices hypothesis: firms with repetitive disclosures earn excess returns, likely due to investor inattention.

In contrast, the propagated version (blue) underperforms, not because the theory is wrong, but because our similarity data was incomplete. Forward-filling stale signals led to firms being assigned to the wrong portfolio bins, hurting performance.

The strong returns from the OAP benchmarks suggest that our testing period and return data are valid.

![Cumulative Returns of Portfolio](figs/cumulative-returns.png)

In `portfolio_construction_winsorized.ipynb`, we constructed our core trading strategy by filtering the dataset to include only months where firms had valid similarity scores. This decision was critical to avoid propagating outdated signals, which we had identified as a major issue in earlier attempts. We winsorized stock returns (capping outliers) to reduce noise and improve robustness. Firms were sorted into quintiles based on their similarity scores, and we built a zero-investment long-short portfolio by going long on the most repetitive firms and short on the most novel.

We also generated random portfolios for comparison and found that our similarity-based strategy consistently outperformed them. This supports our hypothesis that disclosure similarity contains meaningful return information when implemented with clean and timely data.
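
A random-portfolio comparison of this kind can be sketched as below for a single month. The leg size (20% of firms each), the number of draws, equal weighting, and the function name are all assumptions of this sketch rather than the notebook's settings.

```python
import random
from statistics import mean


def random_long_short(returns, n_portfolios=100, frac=0.2, seed=0):
    """Draw random, disjoint long and short legs and return their return spreads.

    returns: one month's firm returns. Each leg holds `frac` of the firms,
    equal-weighted, chosen without regard to any signal.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    k = int(len(returns) * frac)
    if k < 1:
        raise ValueError("not enough firms for the requested leg size")
    spreads = []
    for _ in range(n_portfolios):
        picks = rng.sample(range(len(returns)), 2 * k)
        long_leg, short_leg = picks[:k], picks[k:]
        spreads.append(
            mean(returns[i] for i in long_leg) - mean(returns[i] for i in short_leg)
        )
    return spreads


firm_returns = [-0.01, 0.00, 0.01, 0.00, 0.02, 0.01, 0.00, 0.01, 0.02, 0.04]
placebo = random_long_short(firm_returns, n_portfolios=50, seed=1)
```

Comparing the similarity-sorted spread against the distribution of these placebo spreads gives a rough sense of whether the signal does better than chance.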

---

## Conclusion

This project tested the _Lazy Prices_ hypothesis by building a similarity-based long-short trading strategy using firm 10-K filings. We confirmed that firms with repetitive disclosures tend to outperform, but only when similarity signals are clean and timely. The winsorized strategy produced consistent positive returns, while the propagated version failed due to stale and missing data.
Our findings highlight that textual signals can predict returns, but implementation quality is critical. Even well-supported theories can fail if execution introduces noise. Overall, we show that disclosure-based signals are valuable — when paired with rigorous data handling.


---

## References

**SEC EDGAR Filings (10-K Texts)**
 U.S. Securities and Exchange Commission
 https://www.sec.gov/edgar.shtml


**CRSP Monthly Stock Return Data**
 Center for Research in Security Prices (CRSP), via WRDS
 https://wrds-www.wharton.upenn.edu/pages/access/crsp/


**Compustat/CRSP Merged Fundamentals**
 Wharton Research Data Services (WRDS)
 https://wrds-www.wharton.upenn.edu/pages/access/compustat/


**Open Asset Pricing Benchmarks (OAP Factors)**
 http://global-q.org/open_source.html


**CIK–Ticker Mapping Dataset**
 Derived from SEC’s Company Ticker and CIK file
 https://www.sec.gov/files/company_tickers.json


**PERMNO–Ticker Mapping File**
 Compustat/CRSP class file from WRDS (used for firm linking)


**FRED (Federal Reserve Economic Data)**
 https://fred.stlouisfed.org/

