# 10-K/Q Text Change Detection
### [The Code](https://github.com/calcbench/notebooks/blob/master/risk_factor_similarity_and_diffing-tf-idf.ipynb)

## Goal
Reduce the time analysts spend reading 10-K/Qs by highlighting the sections that change the most between periods.

## Hypothesis
The [cosine distance](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html) between [Term Frequency - Inverse Document Frequency (TF-IDF)](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) vectors of a 10-K section from two consecutive filings is a useful proxy for how much the meaning of that section changed between the filings.


## Procedure
1. Use the [Calcbench Python API Client](https://github.com/calcbench/python_api_client) to download the [Risk Factors](https://www.sec.gov/fast-answers/answersreada10khtm.html) section of each 10-K.
2. Tokenize the sections.
3. Build TF-IDF matrices.
4. Compute the cosine distance between each section and the same section from the previous filing/period.
5. Render the matrix of distances with the largest distances highlighted.
6. Review large changes by "diffing" documents whose distance exceeds a chosen threshold.
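
Steps 2 through 4 can be sketched without any third-party packages. The block below is a hand-rolled illustration, not the notebook's implementation: the helpers `tokenize`, `tfidf_vectors`, and `cosine_distance` are hypothetical names, the sentences are invented toy text rather than real filings, and the weighting uses the smoothed IDF form that scikit-learn documents for its defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokens of two or more characters (an assumption for this sketch).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count.
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine_distance(u, v):
    # Both vectors have unit length, so the distance is 1 minus their dot product.
    return 1.0 - sum(w * v.get(term, 0.0) for term, w in u.items())


# Invented toy text standing in for two years of a Risk Factors section.
last_year = "Our business depends on global supply chains and currency exchange rates."
this_year_same = last_year
this_year_new = "Cybersecurity incidents and data privacy regulation could harm our business."

unchanged, changed = (
    cosine_distance(*tfidf_vectors([last_year, text]))
    for text in (this_year_same, this_year_new)
)
print(f"unchanged: {unchanged:.3f}, changed: {changed:.3f}")
```

An identical pair yields a distance of essentially zero, while the rewritten pair yields a distance strictly between 0 and 1, because the two texts share only a few terms.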

## Credits
[http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/](http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/)


In [1]:
import calcbench as cb
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from scipy.spatial.distance import cosine
from IPython.display import display, HTML
import sklearn
import itertools
from tqdm import tqdm_notebook
from matplotlib import colors
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
class NumberNormalizingVectorizer(TfidfVectorizer):
    """TF-IDF vectorizer that maps every numeric token to one placeholder."""

    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        # number_normalizer is defined in the next cell; it is looked up when the tokenizer runs.
        return lambda doc: list(number_normalizer(tokenize(doc)))

In [3]:
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """

    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)
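
As a quick illustration of what the normalizer does, the sketch below repeats the function and applies it to a hand-written token list. The tokens are an assumption made for the example and are not the output of the vectorizer's own tokenizer.

```python
def number_normalizer(tokens):
    # Replace any token that starts with a digit with a shared placeholder.
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


# Hand-written tokens, for illustration only.
tokens = ["revenue", "of", "4.2", "billion", "in", "2016", "10-K"]
normalized = list(number_normalizer(tokens))
print(normalized)
```

Note that any token beginning with a digit is collapsed, so "10-K" is replaced along with the plain numbers.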

In [4]:
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)
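
A short usage sketch shows how `pairwise` lines up each filing with its predecessor. The list of fiscal years is an assumed example input. On Python 3.10 and later, `itertools.pairwise` provides the same behavior.

```python
import itertools


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


# Assumed example input: fiscal years already sorted in ascending order.
fiscal_years = [2014, 2015, 2016, 2017]
pairs = list(pairwise(fiscal_years))
print(pairs)
```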

In [5]:
document_section = "Risk Factors"
tickers = cb.tickers(index="DJIA")
first_year = 2008
end_year = 2018
diffs = pd.DataFrame(index=tickers, columns=range(end_year, first_year, -1))
for ticker in tqdm_notebook(tickers):
    ten_K_sections = (
        d
        for d in cb.document_search(
            company_identifiers=[ticker],
            document_name=document_section,
            all_history=True,
        )
        if d["fiscal_period"] == "Y"
    )
    sorted_disclosures = sorted(ten_K_sections, key=lambda d: d["fiscal_year"])
    year_pairs = pairwise(sorted_disclosures)
    for last_year, this_year in year_pairs:
        text_last_year = BeautifulSoup(last_year.get_contents(), "html.parser").text
        text_this_year = BeautifulSoup(this_year.get_contents(), "html.parser").text
        vectorizer = NumberNormalizingVectorizer(stop_words="english")
        if text_this_year and text_last_year:
            X = vectorizer.fit_transform([text_this_year, text_last_year])
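            # Flatten each sparse row to a 1-D array, which scipy's cosine expects.
            distance = cosine(X[0].toarray().ravel(), X[1].toarray().ravel())
            # Assign with .loc; chained indexing may write to a temporary copy.
            diffs.loc[ticker, this_year["fiscal_year"]] = distance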

In [10]:
# Vocabulary size of the most recently fitted vectorizer.
len(vectorizer.get_feature_names_out())

1091

In [14]:
def background_gradient(s, m, M, cmap="PuBu", low=0, high=0):
    # from https://stackoverflow.com/questions/38931566/pandas-style-background-gradient-both-rows-and-columns
    rng = M - m
    norm = colors.Normalize(m - (rng * low), M + (rng * high))
    normed = norm(s.values)
    c = [colors.rgb2hex(x) for x in plt.get_cmap(cmap)(normed)]
    return ["background-color: %s" % color for color in c]


def highlight_largest_diffs(diffs):
    filled_df = (
        diffs.loc[diffs.sum(axis=1).sort_values(ascending=False).index]
        .fillna(0)
        .round(3)
    )
    return filled_df.style.apply(
        background_gradient,
        cmap="Reds",
        m=filled_df.min().min(),
        M=filled_df.max().max(),
        low=0,
        high=2.5,
    )
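
The `low` and `high` arguments pad the color scale beyond the observed minimum and maximum. A minimal sketch of that arithmetic follows, assuming the linear mapping `(value - vmin) / (vmax - vmin)` that `colors.Normalize` applies. The helper name `padded_normalize` is hypothetical, and 0.758 is the largest distance in the table shown in this notebook.

```python
def padded_normalize(value, m, M, low=0.0, high=0.0):
    # Widen the [m, M] range by fractions of its width, then scale linearly.
    rng = M - m
    vmin = m - rng * low
    vmax = M + rng * high
    return (value - vmin) / (vmax - vmin)


m, M = 0.0, 0.758  # smallest and largest distances in the displayed table
unpadded_top = padded_normalize(M, m, M)
padded_top = padded_normalize(M, m, M, low=0, high=2.5)
print(unpadded_top, round(padded_top, 3))
```

With `high=2.5` the largest distance maps to 1/3.5 of the colormap range rather than to its far end, so even the biggest changes are drawn in a lighter shade.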

## Highlight Risk Factors with Greatest Change
### The most intensely colored cells mark the documents that changed the most relative to the previous period.

In [17]:
highlight_largest_diffs(diffs)

| Ticker | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 |
|---|---|---|---|---|---|---|---|---|---|---|
| JNJ | 0.021 | 0.01 | 0.607 | 0.02 | 0.57 | 0.0 | 0.042 | 0.0 | 0.026 | 0 |
| JPM | 0.006 | 0.55 | 0.008 | 0.008 | 0.013 | 0.023 | 0.013 | 0.028 | 0.427 | 0 |
| INTC | 0.758 | 0.0 | 0.011 | 0.002 | 0.068 | 0.018 | 0.032 | 0.094 | 0.017 | 0 |
| DWDP | 0.13 | 0.24 | 0.135 | 0.045 | 0.033 | 0.02 | 0.033 | 0.01 | 0.068 | 0 |
| WBA | 0.007 | 0.014 | 0.017 | 0.187 | 0.0 | 0.063 | 0.244 | 0.099 | 0.0 | 0 |
| V | 0.017 | 0.034 | 0.153 | 0.066 | 0.007 | 0.027 | 0.021 | 0.016 | 0.097 | 0 |
| MCD | 0.018 | 0.014 | 0.019 | 0.015 | 0.178 | 0.045 | 0.038 | 0.04 | 0.051 | 0 |
| VZ | 0.012 | 0.012 | 0.061 | 0.053 | 0.049 | 0.097 | 0.035 | 0.029 | 0.063 | 0 |
| PFE | 0.043 | 0.015 | 0.074 | 0.063 | 0.034 | 0.02 | 0.039 | 0.047 | 0.044 | 0 |
| WMT | 0.061 | 0.056 | 0.025 | 0.084 | 0.03 | 0.076 | 0.02 | 0.024 | 0.0 | 0 |
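
Choosing which documents to diff can be sketched as a simple filter over the distances. The values below are copied from a few cells of the table above. The 0.5 cutoff and the names `distances`, `THRESHOLD`, and `flagged` are assumptions made for this example, not part of the notebook.

```python
# A few distances copied from the table above: {ticker: {fiscal_year: distance}}.
distances = {
    "JNJ": {2016: 0.607, 2015: 0.02, 2014: 0.57},
    "JPM": {2017: 0.55, 2010: 0.427},
    "INTC": {2018: 0.758, 2017: 0.0},
}

THRESHOLD = 0.5  # hypothetical cutoff chosen for this example

# Keep (ticker, year, distance) triples above the cutoff, largest change first.
flagged = sorted(
    (
        (ticker, year, distance)
        for ticker, by_year in distances.items()
        for year, distance in by_year.items()
        if distance > THRESHOLD
    ),
    key=lambda row: row[2],
    reverse=True,
)
for ticker, year, distance in flagged:
    print(f"{ticker} {year}: {distance:.3f}")
```

Each flagged ticker and year is a candidate for the document diff in the next cell.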


In [18]:
document_section = "Risk Factors"
ticker = "JNJ"
year = 2016
previous_year = 2015
doc = next(
    cb.document_search(
        company_identifiers=[ticker], document_name=document_section, year=year
    )
).get_contents()
previous_doc = next(
    cb.document_search(
        company_identifiers=[ticker], document_name=document_section, year=previous_year
    )
).get_contents()
# Pass the earlier filing first, then the later one.
display(HTML(cb.html_diff(previous_doc, doc)))
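
For readers without Calcbench access, the idea of the diff step can be approximated on plain text with the standard library's `difflib`. This is a stand-in and not how `cb.html_diff` works. The two texts are invented toy sentences, and the `sentences` helper is a hypothetical, deliberately naive splitter.

```python
import difflib

# Invented toy text standing in for two years of a Risk Factors section.
previous_text = (
    "We face intense competition. "
    "Changes in tax law may affect our results. "
    "We rely on third-party suppliers."
)
current_text = (
    "We face intense competition. "
    "Cybersecurity breaches could disrupt our operations. "
    "We rely on third-party suppliers."
)


def sentences(text):
    # Naive split on ". "; adequate for a sketch only.
    return [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]


diff_lines = list(
    difflib.unified_diff(
        sentences(previous_text),
        sentences(current_text),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Lines prefixed with `-` appear only in the earlier text and lines prefixed with `+` appear only in the later one, which is the same information an HTML diff conveys with highlighting.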

## Review Changes
#### The 0.607 distance between JNJ's 2015 and 2016 risk factors indicates a substantial change. We verify the change on Calcbench's [disclosure page](https://www.calcbench.com/query/footnotes?pg_classificationMethod=tickers&pg_tickers=JNJ&doc_searchingBy=footnoteType&doc_footnoteType=1110&doc_selectedDisclosure=b-648365_section111&pc_year=2016&pc_periodType=Annual&pc_useFiscalPeriod=false&pc_rangeOption=Single%20Period&pc_dateRange=%5Bobject%20Object%5D).
![Diff](https://dl.dropboxusercontent.com/s/vjd382gr4vvhvuh/diff.png?raw=1)