### Project Summary: Lazy Prices Replication:


This project replicates and explores the main findings of the "Lazy Prices" paper, which studies whether repetitive language in firms’ annual financial filings predicts stock returns. The goal is to test if investors underreact to boilerplate or “lazy” disclosures—measured by textual similarity between consecutive reports—by building portfolios that go long firms with the most repetitive filings and short those with the least.

To accomplish this, we built a pipeline that takes raw SEC filings and stock return data, computes a similarity score for each firm and year, and links these signals to actual investment performance. The analysis required working with large, messy real-world datasets, so we implemented chunk processing to efficiently handle text files and numerical calculations. We also developed custom tools to merge identifiers across databases, ensuring all signals and returns were accurately matched. Our methodology closely follows standard procedures in empirical asset pricing and is designed to be both robust and reproducible.


#### Data Preparation:
We collected and processed SEC 10-K filings, monthly CRSP stock return data, and mapping files for firm identification. Initial data cleaning and standardization ensured that textual and financial datasets could be merged accurately.


#### Textual Analysis: 
Using a two-stage process, we first built a sample dictionary of word counts from select filings to validate code and methodology. We then scaled up to a full dictionary, calculating word frequency vectors for all relevant firm-year filings.


#### Similarity Computation: 
Cosine similarity scores were computed between each firm's consecutive annual filings, creating a "similarity signal" intended to capture the degree of repetitiveness or novelty in disclosure.


#### Data Merging and Filtering: 
The similarity scores were merged with monthly stock return data, creating a unified dataset for testing the pricing implications of repetitive disclosure.


#### Portfolio Construction: 
Each month, firms were sorted into quintiles based on their similarity scores. We constructed market-neutral portfolios by going long the highest-similarity firms and short the lowest-similarity firms. Portfolio returns were tracked over time and compared to S&P 500 benchmark returns.


#### Analysis and Visualization: 
We visualized the cumulative performance of the similarity-based long-short strategy against the S&P 500, following standard asset pricing anomaly testing practices (as in LeDataSciFi Ch 9.6). The report includes clear documentation of code, process, and findings, with placeholders for final results as we complete the analysis using the full dataset.


### Data:
#### Data Used:
1. SEC Filings (10-K and 10-Q)
   - Source: SEC EDGAR Database
   - Content: Full text of annual (10-K) and quarterly (10-Q) filings for U.S. public companies.
   - Time Period: [Insert your actual sample period, e.g., 2012–2016.]
   - Processing: Downloaded as plain text or HTML files, indexed by Central Index Key (CIK), filing date, and ticker.


2. Stock Return Data
   - Source: CRSP Monthly Stock File (accessed via WRDS or institutional database)
   - Content: Monthly total returns for all covered U.S. firms (adjusted for splits/dividends).
   - Identifiers: Uses CRSP PERMNO, mapped to SEC CIK and ticker symbols for merging.


3. S&P 500 Index Data
   - Source: FRED Economic Data (SP500 series)
   - Content: Monthly index levels, used to calculate S&P 500 returns for benchmark comparison. The SP500 series is a price index, so these are price returns that exclude dividends.


4. Mapping and Reference Files
   - PERMNO–CIK–Ticker Mapping: Used to unify firm identifiers across datasets.
   - Stopwords List: For improved text preprocessing and word count accuracy.

### Data Processing and Cleaning:
#### Chunk Processing:
Both the text-to-vector conversion and the similarity calculation were performed using chunk processing, which allowed us to work efficiently with very large files and avoid memory issues.

We iterated through the data in sections, computed intermediate results, and then combined the outputs for final analysis.
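
As a rough illustration of the chunked approach described above, the sketch below streams a table in fixed-size pieces, computes a partial result per piece, and then combines the partials. The column names (`cik`, `year`, `n_words`), the in-memory stand-in file, and the helper `total_words_by_firm` are assumptions made for this example, not the project's actual code.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file on disk; column names are assumptions.
raw = io.StringIO(
    "cik,year,n_words\n"
    "1001,2014,1200\n"
    "1001,2015,1250\n"
    "1002,2014,800\n"
    "1002,2015,950\n"
    "1003,2015,400\n"
)

def total_words_by_firm(source, chunksize=2):
    """Stream the file in chunks and combine per-chunk partial results."""
    partials = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Intermediate result for this chunk only, so memory use stays bounded.
        partials.append(chunk.groupby("cik")["n_words"].sum())
    # The same firm can appear in several chunks, so aggregate the partials again.
    return pd.concat(partials).groupby(level=0).sum()

totals = total_words_by_firm(raw)
print(totals)
```

The second aggregation step matters: a firm whose rows straddle a chunk boundary would otherwise be counted as two separate partial totals.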
#### Text Cleaning and Vectorization:
Each filing was cleaned to remove non-informative content (headers, tables, standard phrases).
Documents were lowercased, tokenized, stripped of stopwords and punctuation, and represented as word frequency vectors.
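
A minimal sketch of the cleaning and vectorization step, assuming a simple letters-only tokenizer and a tiny illustrative stopword set. The function name `vectorize` and the stopword list are hypothetical; the project's actual stopword file and header/table stripping are not shown here.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the project's real list is an external file.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "our", "we", "is", "are"}

def vectorize(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on runs of letters, drop stopwords, count words."""
    # Matching only [a-z]+ also discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tok for tok in tokens if tok not in stopwords)

counts = vectorize(
    "Our revenue increased. Revenue growth is a risk, and the risk is material."
)
print(counts)
```

A `Counter` acts as a sparse word frequency vector: words that never appear simply have a count of zero.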
#### Computing Similarity Scores:
For each firm, we paired each year’s filing with the previous year’s, and computed a similarity score using cosine similarity between their word vectors.
This produced a continuous signal for each (firm, year) reflecting how much the firm’s disclosure changed or stayed the same.


#### Merging and Filtering:
Similarity scores were merged with monthly return data using our custom mapping.
We filtered the merged dataset to retain only the months and firms where both a similarity score and return data were available.
Any rows with missing data were dropped, and we checked each month to ensure there were enough firms for meaningful portfolio tests.
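
The merge-and-filter logic can be sketched as follows. This assumes the similarity signal has already been expanded to a firm-month level and keyed by PERMNO; the column names, the toy data, and the `min_firms` threshold are illustrative assumptions rather than the project's actual settings.

```python
import pandas as pd

# Hypothetical toy inputs; column names are assumptions for illustration.
signals = pd.DataFrame({
    "permno": [1, 2, 3, 1],
    "month": ["2015-01", "2015-01", "2015-01", "2015-02"],
    "similarity": [0.95, 0.80, None, 0.90],
})
returns = pd.DataFrame({
    "permno": [1, 2, 3, 1, 4],
    "month": ["2015-01", "2015-01", "2015-01", "2015-02", "2015-01"],
    "ret": [0.02, -0.01, 0.03, 0.01, 0.05],
})

def merge_and_filter(signals, returns, min_firms=2):
    # Inner join keeps only firm-months present in both tables.
    merged = signals.merge(returns, on=["permno", "month"], how="inner")
    # Drop rows where either the signal or the return is missing.
    merged = merged.dropna(subset=["similarity", "ret"])
    # Keep only months with enough firms to form portfolios.
    firms_per_month = merged.groupby("month")["permno"].transform("nunique")
    return merged[firms_per_month >= min_firms].reset_index(drop=True)

panel = merge_and_filter(signals, returns)
print(panel)
```

In this toy data, February has a single firm and is dropped, while January keeps the two firms with complete data.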
#### Custom Firm Identifier Mapping:
One critical step in our project was to accurately link financial returns data (from CRSP) to SEC filings for each firm. This required building a robust mapping between different types of firm identifiers, specifically PERMNO (used by CRSP) and CIK (used by the SEC).

**Why Mapping Was Necessary**

- Multiple datasets, multiple IDs: Financial databases use different keys to identify the same company: PERMNO for returns, CIK for filings, and ticker symbols for lookups.
- Accurate merging: Any error in mapping leads to mismatched signals and financial data, which would invalidate the results of the entire replication.

**How We Built the Map**

We developed a dedicated notebook, build_permno-cik_map.ipynb, to create this mapping from scratch.

Inputs Used:

- Historical CRSP header data for all PERMNOs in the sample.
- SEC/EDGAR reference tables with CIKs and tickers.
- Manual validation for ambiguous or changing ticker cases (e.g., mergers, name changes).

Process:

- The script parses all candidate identifiers and aligns them across time, accounting for ticker changes, missing entries, and date mismatches.
- It creates a one-to-one correspondence table that is used throughout the project for every merge and join operation.

Chunk Processing:

- To handle the large size of the data, the mapping is created in chunks, allowing for efficient and scalable processing.
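
To make the time-alignment idea concrete, here is a hedged sketch of one common way to apply such a mapping: a link table in which each PERMNO-to-CIK link carries a validity date range. The table layout, the column names, and the helper `attach_cik` are assumptions for illustration; the actual notebook may structure its output differently.

```python
import pandas as pd

# Hypothetical link table: each row maps a PERMNO to a CIK over a date range.
links = pd.DataFrame({
    "permno": [10001, 10001, 10002],
    "cik": [111, 222, 333],
    "start": pd.to_datetime(["2010-01-01", "2014-01-01", "2010-01-01"]),
    "end": pd.to_datetime(["2013-12-31", "2020-12-31", "2020-12-31"]),
})
returns = pd.DataFrame({
    "permno": [10001, 10001, 10002],
    "date": pd.to_datetime(["2012-06-30", "2015-06-30", "2015-06-30"]),
    "ret": [0.01, 0.02, -0.03],
})

def attach_cik(returns, links):
    """Attach the CIK that was valid for each PERMNO on each return date."""
    merged = returns.merge(links, on="permno", how="left")
    in_range = (merged["date"] >= merged["start"]) & (merged["date"] <= merged["end"])
    out = merged[in_range].drop(columns=["start", "end"]).reset_index(drop=True)
    # Overlapping link ranges would match one return to several CIKs.
    if out.duplicated(["permno", "date"]).any():
        raise ValueError("overlapping link ranges for at least one PERMNO")
    return out

mapped = attach_cik(returns, links)
print(mapped)
```

The explicit duplicate check is a cheap safeguard: it fails loudly if the link table ever stops being one-to-one at a given date.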

**Summary**

Our custom mapping solution is a key technical achievement of this project. It allowed us to confidently merge similarity signals from text analysis with financial returns data, enabling accurate and meaningful empirical tests. The script is general enough to be adapted for future projects or new datasets, making it a valuable resource for our team and for others replicating this research.




### Calculation of the Similarity Score:
#### Text Preprocessing & Vectorization
For each SEC filing, we extracted the main text content, converted it to lowercase, and removed common stopwords and punctuation. The cleaned text was tokenized into words, and we counted the frequency of each unique word. Each filing was thus represented as a word frequency vector.

#### Pairing Consecutive Filings
For every firm in our sample, we identified all pairs of consecutive filings (e.g., 10-K reports for years t and t+1).
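
One way to build these pairs is a grouped shift, sketched below. The column names, the toy data, and the choice to drop pairs separated by a gap year are assumptions for this example; the source does not state how missing years were handled.

```python
import pandas as pd

# Hypothetical filing index; doc_id stands in for a path or accession number.
filings = pd.DataFrame({
    "cik": [111, 111, 111, 222, 222],
    "year": [2013, 2014, 2016, 2014, 2015],
    "doc_id": ["a13", "a14", "a16", "b14", "b15"],
})

def pair_consecutive(filings):
    df = filings.sort_values(["cik", "year"]).copy()
    grouped = df.groupby("cik")
    # Previous filing within the same firm, if any.
    df["prev_year"] = grouped["year"].shift(1)
    df["prev_doc_id"] = grouped["doc_id"].shift(1)
    # Keep only pairs exactly one year apart; gaps (2014 -> 2016) are dropped.
    pairs = df[df["year"] - df["prev_year"] == 1]
    return pairs[["cik", "year", "doc_id", "prev_doc_id"]].reset_index(drop=True)

pairs = pair_consecutive(filings)
print(pairs)
```

Each surviving row names the current filing and its predecessor, which is the input the similarity calculation needs.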
#### Cosine Similarity Calculation
We computed the cosine similarity between each pair of word vectors. This is defined as:

$$
\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
$$

where A and B are the word frequency vectors for consecutive filings.
A score close to 1 indicates that the filings are highly similar (very little change in wording or content), while a score closer to 0 suggests significant changes between years.
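
The cosine similarity of two word-count vectors is their dot product divided by the product of their Euclidean norms. The sketch below computes it directly on word-count mappings; the function name `cosine_similarity` and the convention of returning 0 for an empty document are choices made for this example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings (e.g. Counters)."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

last_year = Counter({"risk": 2, "revenue": 1})
this_year = Counter({"risk": 2, "revenue": 1, "litigation": 1})
score = cosine_similarity(last_year, this_year)
print(round(score, 4))
```

Because word counts are non-negative, the score always falls between 0 and 1, which matches the interpretation given above.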
#### Output
For each (firm, year), we recorded the cosine similarity between the filing and its predecessor. These similarity scores became our key predictive variable for the “lazy prices” analysis, capturing the extent to which firms repeated language in successive annual reports.



### Portfolio Construction Process:
#### Preparing the Data
- We started with a merged dataset containing firm identifiers, dates, similarity scores, and stock returns.
- The month variable was formatted properly to ensure correct chronological sorting.
- Missing values in key variables (similarity score, return) were dropped to maintain clean samples.
#### Portfolio Formation
For each month in the sample:

- We sorted firms by their similarity score.
- Firms were divided into quantile groups (e.g., quintiles based on similarity ranking).
- Long Portfolio: Firms with the highest similarity scores (most repetitive disclosures).
- Short Portfolio: Firms with the lowest similarity scores (least repetitive disclosures).
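
The monthly sort can be sketched with a per-month `pd.qcut`. The column names and toy data are assumptions, and ranking with `method="first"` before binning is a choice made here so that tied scores cannot produce duplicate bin edges; it is not necessarily what the project did.

```python
import pandas as pd

# Hypothetical one-month panel with ten firms.
panel = pd.DataFrame({
    "month": ["2015-01"] * 10,
    "permno": list(range(1, 11)),
    "similarity": [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95],
})

def assign_quintiles(panel, n_groups=5):
    def to_group(scores):
        # Rank first so ties are broken deterministically, then bin the ranks.
        return pd.qcut(scores.rank(method="first"), n_groups, labels=False) + 1

    out = panel.copy()
    # Groups are formed separately within each month.
    out["quintile"] = out.groupby("month")["similarity"].transform(to_group)
    return out

sorted_panel = assign_quintiles(panel)
long_leg = sorted_panel[sorted_panel["quintile"] == 5]   # highest similarity
short_leg = sorted_panel[sorted_panel["quintile"] == 1]  # lowest similarity
print(long_leg["permno"].tolist(), short_leg["permno"].tolist())
```

Grouping by month before binning ensures a firm is compared only with other firms observed in the same month.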


#### Calculating Monthly Returns
For each month:

- We calculated the average return of firms in the long group and short group separately.
- The long-short portfolio return was computed as: Long-Short Return = Average Long Return - Average Short Return

These monthly returns were recorded over the entire sample period.
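
A minimal sketch of the long-short calculation, assuming firms already carry a quintile label and that each leg is equal-weighted (a simple average). Column names and the toy numbers are illustrative only.

```python
import pandas as pd

# Hypothetical panel: quintile 5 is the long leg, quintile 1 the short leg.
sorted_panel = pd.DataFrame({
    "month": ["2015-01"] * 4 + ["2015-02"] * 4,
    "quintile": [1, 1, 5, 5, 1, 1, 5, 5],
    "ret": [0.01, 0.03, 0.04, 0.06, -0.02, 0.00, 0.01, 0.03],
})

def long_short_returns(df, long_q=5, short_q=1):
    # Equal-weighted average return per month and quintile, one column per quintile.
    avg = df.groupby(["month", "quintile"])["ret"].mean().unstack("quintile")
    # Long leg minus short leg, month by month.
    return (avg[long_q] - avg[short_q]).rename("long_short")

ls = long_short_returns(sorted_panel)
print(ls)
```

In January the long leg averages 5% and the short leg 2%, so the long-short return is 3%.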
#### Winsorization
In the winsorized version of the analysis:

- Stock returns were winsorized at the 1st and 99th percentiles before portfolio construction.
- This adjustment reduced the influence of extreme outliers on average returns.
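
Winsorizing caps extreme values at chosen percentiles instead of deleting them. The sketch below applies the caps to one pooled series of returns; whether the project computed the percentiles over the pooled sample or month by month is not stated, so that choice is an assumption here, as is the helper name `winsorize`.

```python
import pandas as pd

def winsorize(returns, lower=0.01, upper=0.99):
    """Cap values below/above the given quantiles at those quantile values."""
    lo, hi = returns.quantile(lower), returns.quantile(upper)
    return returns.clip(lower=lo, upper=hi)

# Hypothetical returns: 98 ordinary observations plus two extreme ones.
raw = pd.Series([-0.90] + [0.01] * 98 + [3.00])
capped = winsorize(raw)
print(raw.min(), raw.max(), capped.min(), capped.max())
```

The number of observations is unchanged; only the two extreme values are pulled toward the rest of the distribution.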


#### Cumulative Performance Analysis
- Monthly portfolio returns were compounded over time to calculate the cumulative return (growth of $1 invested).
- We compared the performance of the long-short portfolio against the S&P 500 benchmark.
- Performance plots visually showed whether the strategy based on similarity signals outperformed the market.
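
Compounding can be sketched as a cumulative product of one plus each monthly return. The numbers below are made up for illustration, and the benchmark return is derived from index levels with `pct_change`, which yields a price return.

```python
import pandas as pd

# Hypothetical monthly strategy returns and month-end index levels.
ls = pd.Series([0.02, -0.01, 0.03], index=["2015-02", "2015-03", "2015-04"])
sp500_level = pd.Series(
    [2000.0, 2100.0, 2058.0, 2099.16],
    index=["2015-01", "2015-02", "2015-03", "2015-04"],
)

def growth_of_one(monthly_returns):
    """Compound simple monthly returns into the value of $1 invested."""
    return (1 + monthly_returns).cumprod()

# Month-over-month change in the index level (price return, no dividends).
sp500_ret = sp500_level.pct_change().dropna()

strategy_growth = growth_of_one(ls)
benchmark_growth = growth_of_one(sp500_ret)
print(pd.DataFrame({"long_short": strategy_growth, "sp500": benchmark_growth}))
```

The two resulting series share the same monthly index, so they can be plotted on one chart to compare the strategy against the benchmark.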


### Final Result:
We tested the hypothesis that firms with more repetitive disclosures (higher similarity scores) earn higher future returns, consistent with the Lazy Prices theory of investor underreaction. Each month, we sorted firms by similarity score, built a long-short portfolio, and tracked returns over time.

Two versions were analyzed:

- Raw returns (without winsorization)
- Winsorized returns (extreme returns capped at the 1st and 99th percentiles)
#### Graphs and Interpretation
The cumulative return plots show that the long-short portfolio steadily outperformed the S&P 500 over the sample period, in both raw and winsorized versions.


Without winsorization, the portfolio return line shows sharp jumps and higher volatility, indicating that a few extreme firm returns heavily impacted performance.


After winsorization, the cumulative growth line becomes smoother and more stable, but the positive trend remains, reinforcing that the relationship between high similarity and higher returns is robust, not driven by outliers.


In both graphs, the long-short portfolio’s growth is consistent with our hypothesis: firms repeating language more heavily (higher similarity) tend to earn systematically higher returns compared to firms with more novel disclosures.


Visual comparison against the S&P 500 line suggests that disclosure similarity has predictive power for returns, in line with the original findings of the Lazy Prices paper.