# Analyzing Alpha in 10-Ks and 10-Qs (Alphalens Study)

## THESIS:
Major text changes in 10-K and 10-Q filings over time indicate significant decreases in future returns. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.

Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the [Securities and Exchange Commission](https://www.sec.gov/) (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.

When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about ["significant pending lawsuits or other legal proceedings"](https://www.sec.gov/fast-answers/answersreada10khtm.html). As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.

These insights, however, can be difficult to access. The average 10-K was [42,000 words long](https://www.wsj.com/articles/the-109-894-word-annual-report-1433203762) in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.

The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their [recent paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1658471) that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. For an overview of this paper from one of the authors, see the [Lazy Prices interview](https://www.youtube.com/watch?v=g96gROyc3wE) from QuantCon 2018.
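
The factor columns loaded later in this notebook are Jaccard and cosine similarity scores. As a rough illustration of how a "textual change" can be turned into a number, here is a minimal sketch of both measures on word tokens. The `tokenize` helper and the two sample sentences are hypothetical, and the preprocessing in the Data Processing notebook may well differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard_similarity(old_text, new_text):
    """Shared vocabulary size divided by combined vocabulary size."""
    old_words, new_words = set(tokenize(old_text)), set(tokenize(new_text))
    union = old_words | new_words
    if not union:
        return 1.0  # assumption: two empty documents count as identical
    return len(old_words & new_words) / float(len(union))


def cosine_similarity(old_text, new_text):
    """Cosine of the angle between the two documents' word-count vectors."""
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    dot = sum(count * new_counts[word] for word, count in old_counts.items())
    old_norm = math.sqrt(sum(c * c for c in old_counts.values()))
    new_norm = math.sqrt(sum(c * c for c in new_counts.values()))
    if old_norm == 0 or new_norm == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (old_norm * new_norm)


# Made-up excerpts standing in for consecutive filings.
last_year = "we face risks from competition and regulation"
this_year = "we face risks from competition regulation and pending litigation"

print(jaccard_similarity(last_year, this_year))
print(cosine_similarity(last_year, this_year))
```

Both scores fall as more text is added, removed, or replaced, so under the thesis above a low score marks a company whose filing changed a lot.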

*To understand how the dataset used in this post was created, be sure to see the [Data Processing notebook](https://www.quantopian.com/posts/scraping-10-ks-and-10-qs-for-alpha).*

In [None]:
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
import alphalens

## 1. Loading Data from Self-Serve Data

In this step, we import the data from a local .csv file via the [Self-Serve Data](https://www.quantopian.com/posts/upload-your-custom-datasets-and-signals-with-self-serve-data) feature. 

To do this, we begin with the local .csv file generated by the [Data Processing notebook](https://www.quantopian.com/posts/scraping-10-ks-and-10-qs-for-alpha). We then upload it under the Self-Serve Data tab on the [Account > Data page](https://www.quantopian.com/account#data); this makes it available for import into a research notebook or pipeline.

For more on importing data using Self-Serve, check out the examples in [this forum post](https://www.quantopian.com/posts/upload-your-custom-datasets-and-signals-with-self-serve-data).

In [None]:
from quantopian.pipeline.data.user_5b102ae91141120040958556 import lazyprices3_90d

## 2. Formatting Factor Values

The data we uploaded is in a tabular form, with one row per asset per day. However, Alphalens requires that we provide data in a specific format with specific labels. Fortunately, [Pipeline](https://www.quantopian.com/tutorials/pipeline) will do all the "dirty work" for us.

In this step, we'll use Pipeline to put our data in a form that can be ingested by Alphalens.

In [None]:
def make_pipeline():
    
    jaccard_score = lazyprices3_90d.jaccard_score.latest
    cosine_score = lazyprices3_90d.cosine_score.latest
    
    screen = (QTradableStocksUS() & jaccard_score.notnull() & cosine_score.notnull())
    
    return Pipeline(columns={'jaccard_score': jaccard_score, 'cosine_score': cosine_score}, screen=screen)

In [None]:
data = run_pipeline(make_pipeline(), '2013-01-01', '2018-05-01')

## 3. Get Pricing Data

Since an alpha factor is supposed to *predict the returns* of an asset, we'll need to get records of the actual price of the asset in order to examine the performance of our alpha factor. In this step, we get pricing data for the assets in our dataset.
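
To make "forward returns" concrete, here is a minimal sketch in plain Python. The `forward_returns` helper and the price list are hypothetical and are not part of Alphalens; the point is only that the last `period` observations have no forward return, which is why the pricing window needs to extend past the end of the factor data.

```python
def forward_returns(prices, period):
    """Simple return from each position to the one `period` steps ahead.

    Positions without `period` future observations get None.
    """
    returns = []
    for i, price in enumerate(prices):
        j = i + period
        if j < len(prices):
            returns.append(prices[j] / price - 1.0)
        else:
            returns.append(None)  # not enough future prices
    return returns


# Made-up daily open prices for a single asset.
opens = [100.0, 102.0, 101.0, 105.0, 110.0]

one_day = forward_returns(opens, 1)
three_day = forward_returns(opens, 3)
print(one_day)
print(three_day)
```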

In [None]:
# Get list of relevant assets
assets = data.index.levels[1]

In [None]:
# Get pricing data for those assets
pricing_end_date = '2018-08-01' # Pricing end date should be later so we can get forward returns
prices = get_pricing(assets,
                     start_date='2013-01-01',
                     end_date=pricing_end_date,
                     fields='open_price')

## 4. Run Alphalens

Now that we have both our alpha factor and pricing datasets, we're ready to run our Alphalens study.

Since we have both Jaccard and cosine similarity scores, we'll run two separate Alphalens tearsheets.

### 4a. Jaccard Similarity Factor

Before creating a tearsheet, we'll use `get_clean_factor_and_forward_returns` to get our data in the correct format to be ingested by Alphalens.

**Note on parameters:** 

The `periods` parameter in `get_clean_factor_and_forward_returns` allows us to set the periods over which we assess the performance of our alpha factor (in days). We'll look at a range of horizons, from a few days to several months, since the paper argues that prices are slow to reflect changes in filing text.

The `quantiles` parameter allows us to set the number of bins into which we divide our assets based on their factor values. Since the original paper uses 5 quantiles to estimate portfolio performance, we'll also use 5 quantiles.
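
As a rough picture of what binning into 5 quantiles means, the sketch below sorts one day's factor values and splits them into equal-sized groups. The `assign_quantiles` helper, the tickers, and the scores are all made up for illustration; Alphalens does its own binning internally and handles ties and missing values differently.

```python
def assign_quantiles(factor_values, n_quantiles=5):
    """Map each asset to a bin from 1 (lowest values) to n_quantiles (highest).

    `factor_values` is a dict of {asset: factor value} for a single date.
    Ties are broken by asset name, which is a simplification.
    """
    ordered = sorted(factor_values, key=lambda asset: (factor_values[asset], asset))
    n = len(ordered)
    return {asset: (rank * n_quantiles) // n + 1
            for rank, asset in enumerate(ordered)}


# Hypothetical similarity scores for ten made-up tickers on one date.
scores = {
    "AAA": 0.95, "BBB": 0.40, "CCC": 0.88, "DDD": 0.15, "EEE": 0.72,
    "FFF": 0.99, "GGG": 0.55, "HHH": 0.63, "III": 0.81, "JJJ": 0.30,
}

bins = assign_quantiles(scores)
for asset in sorted(bins, key=bins.get):
    print(asset, bins[asset])
```

With similarity scores as the factor, quantile 1 holds the filings that changed the most and quantile 5 the ones that changed the least.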

In [None]:
jaccard_factor = data[['jaccard_score']]

### Shorter periods (1, 5, 10 days)

In [None]:
factor_data_j1 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_j1, by_group=False);

### Midrange periods (1, 2, 3 months)

In [None]:
factor_data_j2 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_j2, by_group=False);

### Longest periods (1.5, 3, 4.5 months)

In [None]:
factor_data_j3 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_j3, by_group=False);

### 4b. Cosine Similarity Factor

We'll put our cosine score factor through the same process.

In [None]:
cosine_factor = data[['cosine_score']]

### Shorter periods (1, 5, 10 days)

In [None]:
factor_data_c1 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_c1, by_group=False);

### Midrange periods (1, 2, 3 months)

In [None]:
factor_data_c2 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_c2, by_group=False);

### Longest periods (1.5, 3, 4.5 months)

In [None]:
factor_data_c3 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)

In [None]:
alphalens.tears.create_full_tear_sheet(factor_data_c3, by_group=False);

A few notes about this tearsheet:  

- In the "Cumulative Return by Quantile" plots, we want to see the top and bottom quantile "fingers" move across the plot without crossing. It looks like they're significantly different over all periods, indicating that our factor is doing a good job of separating high- and low-returning stocks.
- In the "IC Normal Dist Q-Q" plots, we want to see an S-shaped curve that indicates a Normal distribution with fat tails (since high/low factor values are the stocks that we want to long/short). We do see reasonably S-shaped curves in the plots over all periods.
- For our top and bottom quantiles, the mean turnover looks reasonable: roughly 30-40%, which is well within the contest guideline of 5-65%.
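
Two of the quantities behind these plots can be sketched in a few lines of plain Python. The sketch assumes the IC is a Spearman rank correlation between factor values and forward returns, and that turnover is the share of names in a quantile that were not there in the previous period. The helper names and sample numbers are hypothetical, and this is an illustration rather than the Alphalens implementation.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def information_coefficient(factor_values, forward_returns):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(factor_values), average_ranks(forward_returns))


def quantile_turnover(previous_names, current_names):
    """Fraction of the current quantile's names that are new this period."""
    current = set(current_names)
    if not current:
        return 0.0
    return len(current - set(previous_names)) / float(len(current))


# Made-up factor values and forward returns for four assets on one date.
print(information_coefficient([0.2, 0.5, 0.9, 0.7], [-0.01, 0.00, 0.03, 0.02]))

# Made-up top-quantile membership in two consecutive periods.
previous_top = ["AAA", "BBB", "CCC", "DDD", "EEE"]
current_top = ["AAA", "BBB", "CCC", "FFF", "GGG"]
print(quantile_turnover(previous_top, current_top))
```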

How does this compare to the paper's findings? The original paper found a spread of 31 bps in excess return between the 1st and 5th quantiles for the cosine similarity score over a three-month holding period, and a spread of 53 bps for the Jaccard similarity score.

Keep in mind that the mean period-wise return calculated by Alphalens is the *rate of return*. As such, it's difficult to compare the Alphalens result exactly with the original result. However, we do see a spread somewhere around 20-50 bps between quantiles (depending on the factor and period), so it seems like our results are generally in line with the paper's findings.
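
To see why a rate of return is hard to compare with a holding-period return, here is a minimal sketch of the conversion, assuming returns compound evenly over the period. The `rate_of_return` and `to_bps` helpers below are written for illustration, and the 1.2% figure is made up rather than taken from the tearsheets or the paper.

```python
def rate_of_return(period_return, period_length, base_length=1):
    """Convert a return earned over `period_length` days into the
    equivalent compounded return over `base_length` days."""
    exponent = float(base_length) / period_length
    return (1.0 + period_return) ** exponent - 1.0


def to_bps(value):
    """Express a decimal return in basis points."""
    return value * 1e4


# Hypothetical example: a 1.2% return earned over a 60-day holding period.
sixty_day_return = 0.012
daily_rate = rate_of_return(sixty_day_return, 60)
twenty_day_rate = rate_of_return(sixty_day_return, 60, base_length=20)

print(to_bps(sixty_day_return))  # the holding-period return, in bps
print(to_bps(daily_rate))        # the same return as a daily rate, in bps
print(to_bps(twenty_day_rate))   # the same return as a 20-day rate, in bps
```

The same performance looks very different depending on the base period, so any comparison with the paper should first put both numbers on the same horizon.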

The next step? Put it in an algorithm and see how it performs in real-world conditions.