# 10-K/Q Text Change Detection
### [The Code](https://github.com/calcbench/notebooks/blob/master/risk_factor_similarity_and_diffing-tf-idf.ipynb)

## Goal
Reduce the amount of time analysts spend reading 10-K/Qs by highlighting the sections which change the most between periods.

## Hypothesis
The [cosine distance](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html) between [Term Frequency - Inverse Document Frequency (TF-IDF)](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) vectors of 10-K sections is a useful proxy for semantic change in those sections over time.
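
For reference, the two quantities named in the hypothesis can be written out as below. This assumes scikit-learn's default smoothed IDF (`smooth_idf=True`) and SciPy's definition of cosine distance. The symbols are introduced here for illustration: $n$ is the number of documents the vectorizer is fitted on, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of fitted documents that contain $t$.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\Bigl(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\Bigr),
\qquad
\mathrm{distance}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert_2\,\lVert v \rVert_2}
$$

Because TF-IDF weights are non-negative, the distance falls between 0 (the two sections use terms in the same proportions) and 1 (the two sections share no terms).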


## Procedure
1. Use the [Calcbench Python API Client](https://github.com/calcbench/python_api_client) to download document section contents from Calcbench
2. Tokenize the sections
3. Build TF-IDF matrices
4. Compute the cosine distance between each section and the same section from the previous filing/period
5. Render the matrix of distances with largest distances highlighted.
6. Review large changes by "diffing" documents with distance above a certain threshold.
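
Steps 2 through 4 can be sketched with the standard library alone, which may help make the arithmetic concrete. This is a simplified stand-in, not the notebook's implementation: the notebook uses scikit-learn's `TfidfVectorizer` with English stop words removed, while the helpers below (`tokenize`, `tfidf_vectors`, `cosine_distance`) and the sample sentences are hypothetical and skip stop-word removal.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens of two or more characters; numeric tokens become a placeholder."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return ["#NUMBER" if token[0].isdigit() else token for token in tokens]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for count in counts for term in count)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in count.items()} for count in counts]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up sample text, for illustration only.
last_year = "Competition may reduce our revenue and margins."
lightly_edited = "Competition and new tariffs may reduce our revenue and margins."
rewritten = "Cybersecurity incidents could disrupt operations."

# As in the notebook, the vectors are fitted on one pair of documents at a time.
d_same = cosine_distance(*tfidf_vectors([last_year, last_year]))
d_edited = cosine_distance(*tfidf_vectors([last_year, lightly_edited]))
d_rewritten = cosine_distance(*tfidf_vectors([last_year, rewritten]))
print(d_same, d_edited, d_rewritten)
```

Identical text yields a distance of approximately 0, a light edit yields a small positive distance, and text with no terms in common yields 1.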


In [1]:
import calcbench as cb
import pandas as pd  # needed below to build the `diffs` DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from scipy.spatial.distance import cosine
from IPython.core.display import display, HTML
import sklearn
import itertools
from tqdm import tqdm_notebook
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors

In [5]:
class NumberNormalizingVectorizer(sklearn.feature_extraction.text.TfidfVectorizer):
    """TF-IDF vectorizer that maps every numeric token to a single placeholder."""

    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))

In [6]:
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """

    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)
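
A short usage sketch may clarify what the normalizer buys us. The token lists below are made up for illustration; the function is repeated so the snippet runs on its own. Two sentences that differ only in their figures produce identical token sequences, so routine number updates should not register as textual change.

```python
def number_normalizer(tokens):
    """Same generator as in the cell above, repeated so this snippet is self-contained."""
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


# Hypothetical token lists, as a tokenizer might produce them.
tokens_2017 = ["revenue", "declined", "12", "percent", "in", "2017"]
tokens_2018 = ["revenue", "declined", "15", "percent", "in", "2018"]

normalized_2017 = list(number_normalizer(tokens_2017))
normalized_2018 = list(number_normalizer(tokens_2018))

print(normalized_2017)  # ['revenue', 'declined', '#NUMBER', 'percent', 'in', '#NUMBER']
print(normalized_2017 == normalized_2018)  # True
```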

In [7]:
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)
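
As a quick check of the helper above, here is how it pairs each fiscal year with the next one. The list of years is illustrative. On Python 3.10 and later, `itertools.pairwise` from the standard library does the same thing.

```python
import itertools


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


fiscal_years = [2015, 2016, 2017, 2018]
year_pairs = list(pairwise(fiscal_years))
print(year_pairs)  # [(2015, 2016), (2016, 2017), (2017, 2018)]
```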

In [8]:
document_section = 'Risk Factors'
tickers = cb.tickers(index='DJIA')
first_year = 2008
end_year = 2018
diffs = pd.DataFrame(index=tickers, columns=range(end_year, first_year, -1))
for ticker in tqdm_notebook(tickers):
    ten_K_sections = (d for d in cb.document_search(company_identifiers=[ticker], 
                                                document_name=document_section, 
                                                all_history=True) if d['fiscal_period'] == 'Y')
    sorted_disclosures = sorted(ten_K_sections, key=lambda d: d['fiscal_year'])
    year_pairs = pairwise(sorted_disclosures)
    for last_year, this_year in year_pairs:
        text_last_year = BeautifulSoup(last_year.get_contents(), 'html.parser').text
        text_this_year = BeautifulSoup(this_year.get_contents(), 'html.parser').text
        vectorizer = NumberNormalizingVectorizer(stop_words='english')
        X = vectorizer.fit_transform([text_this_year, text_last_year])
        # scipy's cosine() expects 1-D arrays, so flatten each sparse row first
        distance = cosine(X[0].toarray().ravel(), X[1].toarray().ravel())
        diffs.loc[ticker, this_year['fiscal_year']] = distance

In [41]:
def background_gradient(s, m, M, cmap='PuBu', low=0, high=0):
    # from https://stackoverflow.com/questions/38931566/pandas-style-background-gradient-both-rows-and-columns
    rng = M - m
    norm = colors.Normalize(m - (rng * low),
                            M + (rng * high))
    normed = norm(s.values)
    c = [colors.rgb2hex(x) for x in plt.get_cmap(cmap)(normed)]
    return ['background-color: %s' % color for color in c]

def highlight_largest_diffs(diffs):
    filled_df = diffs.fillna(0)
    return filled_df.style.apply(background_gradient, cmap='Reds', m=filled_df.min().min(), M=filled_df.max().max(), low=0, high=2.5)
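
The effect of the `low` and `high` arguments is easier to see with the arithmetic spelled out. The sketch below reproduces the linear scaling that `colors.Normalize` applies, using a hypothetical `normalize` helper and illustrative bounds rather than matplotlib itself.

```python
def normalize(value, m, M, low=0.0, high=0.0):
    """Linear scaling onto [0, 1] after widening the range [m, M] by `low` and `high`."""
    rng = M - m
    vmin = m - rng * low
    vmax = M + rng * high
    return (value - vmin) / (vmax - vmin)


m, M = 0.0, 0.24  # illustrative smallest and largest distances

top_unpadded = normalize(M, m, M)                  # 1.0: the full colormap is used
top_padded = normalize(M, m, M, low=0, high=2.5)   # 1 / 3.5, roughly 0.286
print(top_unpadded, round(top_padded, 3))
```

With `high=2.5` the largest distance maps to roughly 29% of the way through the colormap, so even the most-changed cells stay in the lighter part of `Reds`.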

## Highlight Risk Factors with Greatest Change
### The darkest red cells mark the documents that changed the most relative to the previous period.

In [40]:
highlight_largest_diffs(diffs)

| Ticker | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 |
|---|---|---|---|---|---|---|---|---|---|---|
| MMM | 0.0 | 0.00779628 | 0.00762529 | 0.0258541 | 0.00727321 | 0.008662 | 0.00506823 | 0.0435289 | 0.0260112 | 0 |
| AXP | 0.0 | 0.00929214 | 0.01471 | 0.0321971 | 0.016149 | 0.0313577 | 0.0245556 | 0.0221019 | 0.059023 | 0 |
| AAPL | 0.0 | 0.00395994 | 0.00278187 | 0.000551845 | 0.00426382 | 0.00349175 | 0.027321 | 0.0172534 | 0.00437213 | 0 |
| BA | 0.0 | 0.00484915 | 0.00481619 | 0.00161179 | 0.00373701 | 0.00870041 | 0.0315051 | 0.0358225 | 0.0456911 | 0 |
| CAT | 0.0 | 0.00214401 | 0.00873649 | 0.00352635 | 0.00924687 | 0.0109385 | 0.0179613 | 0.012855 | 0.040419 | 0 |
| CVX | 0.0 | 0.0125657 | 0.0279247 | 0.0451385 | 0.0198717 | 0.000979089 | 0.00496076 | 0.0748053 | 0.0569165 | 0 |
| CSCO | 0.00443587 | 0.00206469 | 0.0038352 | 0.00174538 | 0.00588518 | 0.0119791 | 0.00310984 | 0.010243 | 0.0 | 0 |
| KO | 0.0 | 0.0129111 | 0.00665081 | 0.0138153 | 0.0120799 | 0.0107884 | 0.036683 | 0.0350291 | 0.0650325 | 0 |
| DWDP | 0.0 | 0.240127 | 0.134602 | 0.0446338 | 0.0326869 | 0.0198903 | 0.033084 | 0.00999427 | 0.0681977 | 0 |
| XOM | 0.0 | 0.0305509 | 0.00840683 | 0.0179265 | 0.00275424 | 0.0069008 | 0.0115724 | 0.0150352 | 0.0 | 0 |


## Review Changes
#### The 0.607 distance between JNJ's 2015 and 2016 risk factors indicates a substantial change. We verify the change on Calcbench's [disclosure page](https://www.calcbench.com/query/footnotes?pg_classificationMethod=tickers&pg_tickers=JNJ&doc_searchingBy=footnoteType&doc_footnoteType=1110&doc_selectedDisclosure=b-648365_section111&pc_year=2016&pc_periodType=Annual&pc_useFiscalPeriod=false&pc_rangeOption=Single%20Period&pc_dateRange=%5Bobject%20Object%5D).
![Diff](https://dl.dropboxusercontent.com/s/vjd382gr4vvhvuh/diff.png?raw=1)
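
The same review can be approximated locally. The sketch below is one possible approach, not how Calcbench's disclosure page works: it picks the (ticker, year) cells above a threshold and diffs the two texts sentence by sentence with the standard library's `difflib`. The threshold of 0.1, the `sentence_diff` helper, and the sample texts are assumptions for illustration; the two distances are taken from the table above.

```python
import difflib


def sentence_diff(text_last_year, text_this_year):
    """Unified diff of two texts, treating each sentence as one line."""
    old = [s.strip() for s in text_last_year.split(".") if s.strip()]
    new = [s.strip() for s in text_this_year.split(".") if s.strip()]
    return list(difflib.unified_diff(old, new, fromfile="last_year",
                                     tofile="this_year", lineterm="", n=0))


# Distances from the table above; the threshold is an arbitrary choice.
distances = {("DWDP", 2017): 0.240127, ("MMM", 2017): 0.00779628}
threshold = 0.1
to_review = [key for key, distance in distances.items() if distance > threshold]
print(to_review)  # [('DWDP', 2017)]

# Made-up text standing in for two years of a risk factor section.
text_last_year = "Competition may reduce our revenue. Our operations depend on key suppliers."
text_this_year = ("Competition may reduce our revenue. "
                  "The merger may not deliver the expected synergies. "
                  "Our operations depend on key suppliers.")

diff_lines = sentence_diff(text_last_year, text_this_year)
added = [line[1:] for line in diff_lines
         if line.startswith("+") and not line.startswith("+++")]
print(added)  # ['The merger may not deliver the expected synergies']
```

Splitting on periods is a crude sentence boundary; real filings contain abbreviations and decimal numbers that would call for a proper sentence tokenizer.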