# Mini Intro to Textual Analysis: Wordlist
by Dr Liang Jin

Part of Mini Python Sessions: [github.com/drliangjin/minipy](https://github.com/drliangjin/minipy)

Bodnaruk, Loughran, and McDonald (2015)

Note: Predefined SEC strings
- 10K filing codes: '10-K', '10-K405', '10KSB', '10-KSB', '10KSB40'
- 10Q filing codes: '10-Q', '10QSB', '10-QSB'

Excellent (but dated) resources by Bill McDonald are available at the [Software Repository for Accounting and Finance](https://sraf.nd.edu/). The code there is relevant, but it is risky to use it straight away without checking it first.

In [17]:
# NLTK stands for Natural Language Toolkit
# It is one of the most popular text analysis packages in Python
# It has tons of features and also comes with a large collection of corpora (collections of text) to play with
# These resources need to be downloaded first by running the following:
# import nltk
# nltk.download()  # downloading everything is huge and takes quite some time
# nltk.download('stopwords')  # alternatively, fetch only the stop-word list used later in this notebook

## Getting Text Data

In [18]:
import requests
from bs4 import BeautifulSoup

In [19]:
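# Apple's 10-K filing for fiscal year 2017
# IBM's 10-K filing on 20120228
# NOTE: the order must match the `firms` list defined later (Apple first, then IBM);
# CIK 320193 is Apple and CIK 51143 is IBM
urls = ['https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/a10-k20179302017.htm',
        'https://www.sec.gov/Archives/edgar/data/51143/000104746912001742/a2206744z10-k.htm']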

### Soup, HTML and Text
- Remove tables: all characters appearing between `<TABLE>` and `</TABLE>` tags are removed
- NOTE: BLM (2015) keep a table when numeric characters / (alphabetic + numeric characters) <= 15%; the function below drops every table regardless. **Can you implement this rule?**

In [20]:
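# Define a function to create our Soup object and then extract the text
# The key here is: while we still have the HTML structure, we remove tables; doing it later on plain text is tricky
# NOTE: SEC EDGAR rejects requests without a descriptive User-Agent;
# the value below is a placeholder, replace it with your own details
HEADERS = {'User-Agent': 'Your Name your.email@example.com'}

def url_to_text(url):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(resp.text, 'lxml')
    for table in soup.find_all('table'):
        table.decompose()
    text = soup.get_text()
    return text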
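
The function above removes every table. As a starting point for the 15% rule, here is a minimal sketch that works on the raw HTML string before it is handed to BeautifulSoup. The names `numeric_share` and `drop_numeric_tables` and the regex-based matching are assumptions made for illustration, not BLM's own code. Nested tables are not handled, and HTML entities left inside a table (such as `&#160;`) are counted as characters.

```python
import re

# Non-greedy match of everything between <table ...> and </table>, case-insensitive
TABLE_RE = re.compile(r'<table\b.*?</table>', flags=re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')

def numeric_share(s):
    """Share of digits among all alphabetic + numeric characters in s."""
    digits = sum(ch.isdigit() for ch in s)
    letters = sum(ch.isalpha() for ch in s)
    total = digits + letters
    return digits / total if total else 0.0

def drop_numeric_tables(html, threshold=0.15):
    """Remove a table only when its numeric share exceeds the threshold."""
    def _keep_or_drop(match):
        block = match.group(0)
        visible = TAG_RE.sub(' ', block)  # ignore tag names and attributes when counting
        return block if numeric_share(visible) <= threshold else ' '
    return TABLE_RE.sub(_keep_or_drop, html)

# Hypothetical toy input: one numeric table and one text table
sample_html = ('<p>Intro</p>'
               '<table><tr><td>2017</td><td>229,234</td></tr></table>'
               '<table><tr><td>Name of exchange</td></tr></table>')
cleaned = drop_numeric_tables(sample_html)
print(cleaned)
```

The result could then be passed to `BeautifulSoup(cleaned, 'lxml').get_text()` in place of the unconditional `decompose()` loop.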

In [21]:
# Fetch each filing from EDGAR with requests, parse it with BS4, then keep only the plain text
texts = [url_to_text(url) for url in urls]

### Store our data within Python

In [23]:
# hold files locally
import pickle # nice module name, isn't it?

firms = ['Apple', 'IBM']

# access index and value for a list
for idx, val in enumerate(firms):
    with open(val + ".pkl", "wb") as f:
        pickle.dump(texts[idx], f)

### Load our pickled data into a dictionary

In [24]:
# Load pickled files into a dictionary
# Key: company name
# value: parsed 10K text
data = {}

for val in firms:
    with open(val + ".pkl", "rb") as f:
        data[val] = pickle.load(f)

In [25]:
# Check company names to make sure our data has been loaded properly
data.keys()

dict_keys(['Apple', 'IBM'])

In [39]:
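# Check texts
# NOTE: the output recorded below is Apple's filing (a10-k20179302017.htm, 'Apple Inc.'),
# which means the order of `urls` did not match the order of `firms` when this cell was run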
data['IBM'][:2000]

'\n10-K\n1\na10-k20179302017.htm\n10-K\n\n\n\n\nDocument\nUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the fiscal year ended September\xa030, 2017or☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the transition period from\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 to \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Commission File Number: 001-36743Apple Inc.(Exact name of Registrant as specified in its charter)(408) 996-1010(Registrant’s telephone number, including area code)Securities registered pursuant to Section\xa012(b)\xa0of the Act:Securities registered pursuant to Section\xa012(g)\xa0of the Act:  NoneIndicate by check mark if the Registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.Yes\xa0\xa0☒\xa0\xa0\xa0\xa0No\xa0\xa0☐Indicate by check mark if the Registra

## (very) Basic Features Extraction

In [50]:
# Let's first define some handy funcs
# Count total words (splits on single spaces only, so line feeds do not separate words)
def count_words(text):
    return len(str(text).split(" "))

# Count total characters
def count_chars(text):
    return len(str(text))

# Count purely numeric tokens
def count_digit(text):
    return len([word for word in str(text).split(" ") if word.isdigit()]) # isdigit() is a string method

In [56]:
from nltk.corpus import stopwords

In [57]:
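# Count stopwords and their share of all words
# A frozenset makes the membership test fast; matching stays case-sensitive, as before
def count_stopwords(text, stop=frozenset(stopwords.words('english'))):
    words = str(text).split(' ')
    num = sum(1 for word in words if word in stop)
    perc = num / len(words)
    return num, perc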

In [58]:
# stopwords list from LM's website

lm_list = ['ME', 'MY', 'MYSELF', 'WE', 'OUR', 'OURS', 'OURSELVES', 'YOU', 'YOUR', 'YOURS',
                       'YOURSELF', 'YOURSELVES', 'HE', 'HIM', 'HIS', 'HIMSELF', 'SHE', 'HER', 'HERS', 'HERSELF',
                       'IT', 'ITS', 'ITSELF', 'THEY', 'THEM', 'THEIR', 'THEIRS', 'THEMSELVES', 'WHAT', 'WHICH',
                       'WHO', 'WHOM', 'THIS', 'THAT', 'THESE', 'THOSE', 'AM', 'IS', 'ARE', 'WAS', 'WERE', 'BE',
                       'BEEN', 'BEING', 'HAVE', 'HAS', 'HAD', 'HAVING', 'DO', 'DOES', 'DID', 'DOING', 'AN',
                       'THE', 'AND', 'BUT', 'IF', 'OR', 'BECAUSE', 'AS', 'UNTIL', 'WHILE', 'OF', 'AT', 'BY',
                       'FOR', 'WITH', 'ABOUT', 'BETWEEN', 'INTO', 'THROUGH', 'DURING', 'BEFORE',
                       'AFTER', 'ABOVE', 'BELOW', 'TO', 'FROM', 'UP', 'DOWN', 'IN', 'OUT', 'ON', 'OFF', 'OVER',
                       'UNDER', 'AGAIN', 'FURTHER', 'THEN', 'ONCE', 'HERE', 'THERE', 'WHEN', 'WHERE', 'WHY',
                       'HOW', 'ALL', 'ANY', 'BOTH', 'EACH', 'FEW', 'MORE', 'MOST', 'OTHER', 'SOME', 'SUCH',
                       'NO', 'NOR', 'NOT', 'ONLY', 'OWN', 'SAME', 'SO', 'THAN', 'TOO', 'VERY', 'CAN',
                       'JUST', 'SHOULD', 'NOW']

lm_stopwords = [word.lower() for word in lm_list]

In [61]:
# A very large proportion of the whole text (about 35% here) consists of stopwords!
count_words(data['IBM']), count_stopwords(data['IBM'])

(36277, (12611, 0.3476307302147366))
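
The counts above split on single spaces and match stop words case-sensitively. A minimal sketch of a variant that tokenizes with a regular expression and accepts any stop-word list (for example the upper-case LM list) might look like the following. The function name `stopword_share`, the token pattern, and the toy sentence are assumptions for illustration only.

```python
import re

def stopword_share(text, stop_words):
    """Return (number of stop words, share of all word tokens)."""
    stop = {w.lower() for w in stop_words}           # case-insensitive lookup
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0, 0.0
    num = sum(1 for tok in tokens if tok in stop)
    return num, num / len(tokens)

# A few entries in the style of the LM list, plus a made-up sentence
sample_stop = ['THE', 'OF', 'AND', 'TO', 'IN', 'WE', 'OUR']
sample = "The Company's sales depend on the timing of new products.\nWe rely on our suppliers."
num, share = stopword_share(sample, sample_stop)
print(num, round(share, 3))
```

Because the tokenizer also splits on line feeds and punctuation, its totals will differ from those of `count_words` and `count_stopwords`.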

### Load our dictionary into Pandas DataFrame

In [27]:
# We can either keep it in dictionary format or put it into a pandas dataframe
# put our corpus into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth', 150)

df = pd.DataFrame.from_dict(data, orient='index', columns = ['text'])

In [28]:
df

Unnamed: 0,text
Apple,\n10-K\n1\na2206744z10-k.htm\n10-K\n\n\nQuickLinks\n -- Click here to rapidly navigate through this document\n\n\n\n\n\n\n\n\n \n\n \nUNITED STATE...
IBM,"\n10-K\n1\na10-k20179302017.htm\n10-K\n\n\n\n\nDocument\nUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☒..."


## Cleaning the Text Data (Text Pre-Processing)

Data analysis is, in a sense, more art than science. When we clean our data, for example, we need manual input and judgement. No data set is perfect, and for text data in particular the cleaning or pre-processing can go on forever. We are only going to carry out the most common, simple cleaning steps; you can keep working to improve your results, e.g., by replicating the work of BLM (2015).

### Common data cleaning steps:
- Make text all lower case
- Remove markup and other expressions that are not natural language
- Remove punctuation
- Remove numerical values
- Remove stop words
- Tokenize text
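
The steps listed above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function name `basic_clean`, the whitespace tokenizer, and the toy sentence are not taken from BLM (2015).

```python
import re
import string

# Map every punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def basic_clean(text, stop_words=()):
    text = text.lower()                      # lower case
    text = re.sub(r'<[^>]+>', ' ', text)     # leftover markup
    text = text.translate(PUNCT_TABLE)       # punctuation
    text = re.sub(r'\d+', ' ', text)         # numerical values
    stop = set(stop_words)
    return [tok for tok in text.split() if tok not in stop]  # tokenize, drop stop words

tokens = basic_clean("The Company's <b>net sales</b> rose 6% in 2017.",
                     stop_words={'the', 'in'})
print(tokens)
```

Note that stripping punctuation turns "company's" into the two tokens `company` and `s`; whether that is acceptable depends on the word list you count against.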

### More advanced cleaning steps after tokenization:
- Stemming / lemmatization
- Tagging
- N-grams
- And more...
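
Of the steps above, n-grams are the easiest to try without extra libraries. A minimal sketch, assuming the text has already been tokenized into a list of words (the helper name `ngrams` and the toy tokens are made up for illustration):

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = ['subject', 'to', 'debt', 'covenants']
bigrams = ngrams(toy_tokens, 2)
trigrams = ngrams(toy_tokens, 3)
bigram_counts = Counter(bigrams)   # frequency of each bigram
print(bigrams)
```

Stemming, lemmatization, and tagging need a trained resource, which is where NLTK comes in.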

### Replicate BLM(2015) on constraining words
- Read the paper carefully, focusing on sections such as `II. Data`
- Go through `Appendix B. Parsing the 10-K Filings` (discard the first 4 steps for now as they are for txt files)

In [29]:
# OK, basic text cleaning
import re

In [30]:
# deal with reserved special html characters such as non-breaking space (`&nbsp`)
html_chars = {'&lt': 'lt', '&#60': 'lt', 
              '&gt': 'gt', '&#62': 'gt',
              '&nbsp': '', '&#160': '', 
              '&quot': '"', '&#34': '"', 
              '&apos': '\'', '&#39': '\'',
              '&amp': '&', '&#38': '&'}
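
The mapping above replaces entities by hand. As an alternative sketch, the standard library's `html.unescape` decodes named and numeric entities in one call. Two differences from the mapping are worth noting: it yields the characters `<` and `>` rather than the letters `lt` and `gt`, and it turns `&nbsp;` into `\xa0` rather than an empty string. The sample string is made up.

```python
import html

raw = 'AT&amp;T &lt; 5&nbsp;% &quot;approx&quot;'
decoded = html.unescape(raw)            # named and numeric entities become characters
decoded = decoded.replace('\xa0', ' ')  # non-breaking space -> plain space
print(decoded)
```

When the text comes from `soup.get_text()`, BeautifulSoup has already decoded the entities while parsing, so this step mainly matters if you work on the raw filing text.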

In [13]:
def clean_text_round1(text):
    pass

In [40]:
def clean_text_round2(text):
    # convert to lower case
    # NOTE: because this runs first, the two case-sensitive rules near the end (hyphens before a
    # capitalized letter, and March/May/August) can no longer match; move this line to the end to activate them
    text = text.lower()
    # remove tabs and vertical tabs
    text = re.sub(r'(\t|\v)', '', text)
    # replace \xa0 (the non-breaking space from ISO 8859-1) with a plain space;
    # how would you delete all remaining ISO 8859-1 symbols & chars?
    text = re.sub(r'\xa0', ' ', text)
    # remove runs of two or more line feeds (\n) following hyphens
    text = re.sub(r'(-+)\n{2,}', r'\1', text)
    # remove hyphens preceded and followed by a blank space (one space is kept so the neighbouring words do not merge)
    text = re.sub(r'\s-\s', ' ', text)
    # replace 'and/or' with 'and or'
    text = re.sub(r'and/or', r'and or', text)
    # two or more hyphens, periods, or equal signs, possibly followed by spaces, are removed
    text = re.sub(r'[-.=]{2,}\s*', r'', text)
    # all underscores are removed
    text = re.sub(r'_', '', text)
    # 3 or more whitespace characters (\s also matches line feeds) are replaced by a single space
    text = re.sub(r'\s{3,}', ' ', text)
    # three or more line feeds, possibly separated by spaces, are replaced by two line feeds
    text = re.sub(r'(\n\s*){3,}', '\n\n', text)
    # remove hyphens before a line feed
    text = re.sub(r'-+\n', '\n', text)
    # replace hyphens preceding a capitalized letter with a space
    text = re.sub(r'-+([A-Z].*)', r' \1', text)
    # remove capitalized or all-capital March, May and August
    text = re.sub(r'(March|MARCH|May|MAY|August|AUGUST)', '', text)
    # remove punctuation (needs `import string`)
    # text = re.sub('[{}]'.format(re.escape(string.punctuation)), '', text)
    # remove numbers?
    # replace each line feed \n with a single space
    # text = re.sub(r'\n', ' ', text)
    return text

In [41]:
df

Unnamed: 0,text
Apple,\n10-K\n1\na2206744z10-k.htm\n10-K\n\n\nQuickLinks\n -- Click here to rapidly navigate through this document\n\n\n\n\n\n\n\n\n \n\n \nUNITED STATE...
IBM,"\n10-K\n1\na10-k20179302017.htm\n10-K\n\n\n\n\nDocument\nUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☒..."


In [42]:
df1 = pd.DataFrame(df['text'].apply(clean_text_round2))

In [43]:
df1

Unnamed: 0,text
Apple,\n10-k\n1\na2206744z10-k.htm\n10-k quicklinks\n click here to rapidly navigate through this document united states\nsecurities and exchange commis...
IBM,"\n10-k\n1\na10-k20179302017.htm\n10-k document\nunited statessecurities and exchange commissionwashington, d.c. 20549form 10-k(mark one)☒ annual r..."


## Advanced Text Processing

In [None]:
# lm_constrwords is built in the Bag of Words section further down; run that cell first
lm_constrwords

### Document-Term Matrix

To continue our textual analysis of 10-K filings (which can be as simple as word counts or as fancy as machine-learning-based techniques), the text must be tokenized, meaning broken down into smaller pieces. NLTK provides methods to do so, such as breaking text into sentences and words. We can also do this using scikit-learn's CountVectorizer. The output is a matrix with one row per document (such as a 10-K filing) and one column (lots of columns) per distinct word.
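
To make the row and column layout concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The two toy documents, their labels, and the simple tokenizer are made up for illustration; CountVectorizer does the same job at scale and returns a sparse matrix.

```python
import re
from collections import Counter

# Hypothetical toy corpus: document label -> text
docs = {'doc_a': 'we depend on suppliers and we comply with covenants',
        'doc_b': 'covenants restrict debt and we comply'}

def tokenize(text):
    return re.findall(r'[a-z]+', text.lower())

# One Counter of word frequencies per document
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Columns: every distinct word across all documents, sorted alphabetically
vocab = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with vocab
dtm_rows = {name: [c[word] for word in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm_rows.items():
    print(name, row)
```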

In [None]:
# We are going to create a document-term matrix using CountVectorizer
# NOTE: we can remove stop words, which are common words that add little meaning to the text, such as 'a', 'the', etc.
# NOTE: later we can try the LM stopwords for 10-K filings, or even create our own stopword list
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
cv_data = cv.fit_transform(df.text)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm = pd.DataFrame(cv_data.toarray(), columns=cv.get_feature_names_out())
dtm.index = df.index
dtm

In [None]:
# Let's pickle our dataframes
dtm.to_pickle('dtm.pkl')

# and our CountVectorizer object
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)

## Exploratory Data Analysis

### Top words

In [None]:
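# Transpose so that rows are words and columns are firms
# (a new name is used so the `data` dictionary of raw texts is not overwritten)
tdm = dtm.transpose()
tdm.head(50)

In [None]:
# Find the most common words in each 10-K filing
top_words = {}

for firm in tdm.columns:
    top = tdm[firm].sort_values(ascending=False).head(30)
    top_words[firm] = list(zip(top.index, top.values))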

In [None]:
top_words

### Bag of Words (BoW)

In [62]:
# Create a bag of constraining words from BLM (2015)
lm_constrwords = []

with open('words_from_pdf.txt', 'r') as rf:
    lines = rf.read().splitlines() # unlike readlines(), splitlines() drops the trailing "\n" of each line
    for line in lines:
        # split() without arguments also skips empty strings caused by repeated spaces
        lm_constrwords.extend(line.split())

# sort words alphabetically
lm_constrwords.sort()

# You can write the list to a local file, of course
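
Once the word list is loaded, one natural measure is the share of tokens in a filing that appear in it. The sketch below is an illustration under simple assumptions (regex tokenizer, exact word match, percentage of all tokens); the function name `constraining_share` and the toy sentence are made up, and the short word list is only a sample of the entries shown in the output of the next cell.

```python
import re

def constraining_share(text, constraining_words):
    """Return (number of constraining words, percent of all word tokens)."""
    vocab = {w.lower() for w in constraining_words}
    tokens = re.findall(r'[a-z]+', text.lower())
    if not tokens:
        return 0, 0.0
    hits = sum(1 for tok in tokens if tok in vocab)
    return hits, 100 * hits / len(tokens)

sample_words = ['commit', 'commitments', 'comply', 'covenants', 'depend', 'forbid']
sample_text = 'We must comply with debt covenants, and our results depend on lease commitments.'
hits, pct = constraining_share(sample_text, sample_words)
print(hits, round(pct, 2))
```

With the real data this would be called as `constraining_share(data['Apple'], lm_constrwords)`.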

In [63]:
lm_constrwords

['abide',
 'abiding',
 'bound',
 'bounded',
 'commit',
 'commitment',
 'commitments',
 'commits',
 'committed',
 'committing',
 'compel',
 'compelled',
 'compelling',
 'compels',
 'comply',
 'compulsion',
 'compulsory',
 'confine',
 'confined',
 'confinement',
 'confines',
 'confining',
 'constrain',
 'constrained',
 'constraining',
 'constrains',
 'constraint',
 'constraints',
 'covenant',
 'covenanted',
 'covenanting',
 'covenants',
 'depend',
 'dependance',
 'dependances',
 'dependant',
 'dependencies',
 'dependent',
 'depending',
 'depends',
 'dictate',
 'dictated',
 'dictates',
 'dictating',
 'directive',
 'directives',
 'earmark',
 'earmarked',
 'earmarking',
 'earmarks',
 'encumber',
 'encumbered',
 'encumbering',
 'encumbers',
 'encumbrance',
 'encumbrances',
 'entail',
 'entailed',
 'entailing',
 'entails',
 'entrench',
 'entrenched',
 'escrow',
 'escrowed',
 'escrows',
 'forbade',
 'forbid',
 'forbidden',
 'forbidding',
 'forbids',
 'impair',
 'impaired',
 'impairing',
 'impa