# <span style="color:navy"> Introduction


In this notebook we download a 10-K filing from the SEC, extract Item 1 from it, and use a PCA-learned model to assign three different scores to the company.

# <span style="color:navy"> STEP 1 : Get Apple's [AAPL] 2020 10-K 

Although we use Apple's (AAPL) 10-K as the example here, the pipeline is generic and can be applied to other companies' 10-K filings.
 
[SEC Website URL for 10-K (TEXT version)](https://www.sec.gov/Archives/edgar/data/320193/000032019320000096/0000320193-20-000096.txt)

[SEC Website URL for 10-K (HTML version)](https://www.sec.gov/Archives/edgar/data/320193/000032019320000096/aapl-20200926.htm)

All the documents can easily be searched by CIK or company details via [SEC's search tool](https://www.sec.gov/cgi-bin/browse-edgar?CIK=0000320193&owner=exclude&action=getcompany&Find=Search)
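
Both archive URLs above appear to share one layout: the CIK without leading zeros, then the accession number without dashes, then the file name. As a rough sketch, the text-version URL for another filing could be assembled as follows. The helper name `edgar_txt_url` is hypothetical, and the layout is inferred only from the Apple link above, not from official documentation.

```python
def edgar_txt_url(cik, accession_number):
    """Build the full-text filing URL, assuming the layout seen in the AAPL link."""
    cik_part = str(int(cik))                    # '0000320193' -> '320193'
    folder = accession_number.replace('-', '')  # '0000320193-20-000096' -> '000032019320000096'
    return ('https://www.sec.gov/Archives/edgar/data/'
            f'{cik_part}/{folder}/{accession_number}.txt')

aapl_url = edgar_txt_url('0000320193', '0000320193-20-000096')
print(aapl_url)
```

For the Apple filing this reproduces the TEXT link given above.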

In [2]:
# Import requests to retrieve web resources (HTML, TXT, ...)
import requests

# Get the full-text submission of Apple's 2020 10-K
r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019320000096/0000320193-20-000096.txt')
raw_10k = r.text

In [3]:
print(raw_10k[0:1500])

<SEC-DOCUMENT>0000320193-20-000096.txt : 20201030
<SEC-HEADER>0000320193-20-000096.hdr.sgml : 20201030
<ACCEPTANCE-DATETIME>20201029180625
ACCESSION NUMBER:		0000320193-20-000096
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		99
CONFORMED PERIOD OF REPORT:	20200926
FILED AS OF DATE:		20201030
DATE AS OF CHANGE:		20201029

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Apple Inc.
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0926

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		201273977

	BUSINESS ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	APPLE INC
		DATE OF NAME CHANG

# <span style="color:navy"> STEP 2 : Get document from the downloaded 10-K file
    
For our purposes, we are only interested in the part of the file that contains the 10-K itself. Every document in the submission, including the 10-K, is enclosed in `<DOCUMENT>` and `</DOCUMENT>` tags, and each one is clearly marked by a `<TYPE>` tag followed by the document type.


In [4]:
import re

# Regex to find <DOCUMENT> tags
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
# Regex to find a <TYPE> tag followed by any characters up to the end of the line
type_pattern = re.compile(r'<TYPE>[^\n]+')


# Create 3 lists with the span indices for each regex

### There are many <DOCUMENT> blocks in this text file, one per document or exhibit (10-K, EX-10.17, etc.)
### For each block we record where the opening tag ends and where the closing tag starts,
### so that we can later grab the content in between
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]

### [^\n]+ matches one or more characters that are not a newline, so each match is
### '<TYPE>' followed by the section name, e.g. '<TYPE>10-K'.
### findall returns these matches as a list of strings; slicing off the leading '<TYPE>'
### leaves just the section names
doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]

document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start:doc_end]
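
The `zip` pairing above relies on the three lists lining up one-to-one. A slightly more defensive variant, sketched here on a made-up toy string, reads the `<TYPE>` line inside each `<DOCUMENT>` block instead. The function name `extract_documents` and the sample text are illustrative assumptions, not part of the original pipeline.

```python
import re

def extract_documents(raw_filing):
    """Return {type: body} by reading <TYPE> inside each <DOCUMENT> block."""
    docs = {}
    for block in re.findall(r'<DOCUMENT>(.*?)</DOCUMENT>', raw_filing, flags=re.DOTALL):
        type_match = re.search(r'<TYPE>([^\n]+)', block)
        if type_match:
            # Keep the first document seen for each type
            docs.setdefault(type_match.group(1).strip(), block)
    return docs

# Toy submission with two documents (invented for illustration)
sample = (
    '<SEC-DOCUMENT>toy\n'
    '<DOCUMENT>\n<TYPE>10-K\n<TEXT>\nmain report\n</TEXT>\n</DOCUMENT>\n'
    '<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>\nsubsidiaries\n</TEXT>\n</DOCUMENT>\n'
)
docs = extract_documents(sample)
print(list(docs))
```

On the real filing, `extract_documents(raw_10k)['10-K']` should play the same role as `document['10-K']`.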

In [5]:
# Display an excerpt of the document
document['10-K'][0:500]

'\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>aapl-20200926.htm\n<DESCRIPTION>10-K\n<TEXT>\n<XBRL>\n<?xml version="1.0" ?><!--XBRL Document Created with Wdesk from Workiva--><!--Copyright 2020 Workiva--><!--r:5595bda7-992d-4241-bfcc-976a7edbf862,g:a71cbd11-d0c6-4466-a60e-9f48342467d1,d:ef781ab58e4f4fcaa872ddbd30da40e1--><html xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns="http://www.w3.org/1999/xhtml" xmlns:srt="http://fasb.org/srt/2020-01-31" xmlns:ixt-sec="h'

# <span style="color:navy"> STEP 3 : Use a Python regular expression to automatically extract Item 1 from the 10-K

Item headings in the 10-K appear in one of the following patterns:

1. `>Item 1.`

2. `>Item&#160;1.` 

3. `>Item&nbsp;1.`

4. `ITEM 1.` 

In the code below we write a single regular expression that matches all four patterns for Items 1 and 2, and then use the `.finditer()` method to apply it to `document['10-K']`. Item 2 is matched as well because its heading marks where Item 1 ends.
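
Before running it on the full filing, a pattern of this shape can be sanity-checked on a few short strings. The sketch below uses the same expression as the next cell; the sample strings are made up for illustration. Note that the expression also requires a trailing whitespace character, entity, or `<` after the dot.

```python
import re

# Same expression as in the next cell
item_regex = re.compile(r'((>)*I(tem|TEM)(\s|&#160;|&nbsp;|&#xa0;)(1|2)\.(\s|&#160;|&nbsp;|<))')

samples = ['>Item 1.<', '>Item&#160;1.&#160;', '>Item&nbsp;1. ', 'ITEM 1. ',
           '>Item 1A.<', '>Item 10.<']
results = {s: bool(item_regex.search(s)) for s in samples}
for s, hit in results.items():
    print(f'{s!r:25} {hit}')
```

The first four samples match; `Item 1A.` and `Item 10.` do not, because the digit must be followed directly by a dot.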


In [6]:
import re

# Create the regular expression for the above cases
regex = re.compile(r'((>)*I(tem|TEM)(\s|&#160;|&nbsp;|&#xa0;)(1|2)\.(\s|&#160;|&nbsp;|<))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)

<re.Match object; span=(241604, 241613), match='>Item 1.<'>
<re.Match object; span=(245710, 245719), match='>Item 2.<'>
<re.Match object; span=(276050, 276064), match='>Item 1.&#160;'>
<re.Match object; span=(393776, 393790), match='>Item 2.&#160;'>


In the code below we create a pandas dataframe with the column names `'item'`, `'start'` and `'end'`. The `item` column holds `match.group()` in lower case, the `start` column holds `match.start()`, and the `end` column holds `match.end()`.

In [7]:
import pandas as pd

# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])

test_df.columns = ['item', 'start', 'end']
test_df['item'] = test_df.item.str.lower()

# Display the dataframe
test_df.head()

| | item | start | end |
|---|---|---|---|
| 0 | `>item 1.<` | 241604 | 241613 |
| 1 | `>item 2.<` | 245710 | 245719 |
| 2 | `>item 1.&#160;` | 276050 | 276064 |
| 3 | `>item 2.&#160;` | 393776 | 393790 |


In [8]:
# Get rid of unnecessary characters in the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# Display the dataframe
test_df.head()

| | item | start | end |
|---|---|---|---|
| 0 | `item1<` | 241604 | 241613 |
| 1 | `item2<` | 245710 | 245719 |
| 2 | `item1` | 276050 | 276064 |
| 3 | `item2` | 393776 | 393790 |
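
The first two labels above still carry a trailing `<`, because the replacements strip `>` but not `<`. The later cells still work, since the plain `item1` and `item2` labels exist from the second pair of matches. If fully normalised labels are preferred, a helper along these lines could be used instead. This is only a sketch: `normalize_item` is a hypothetical name, and with it the duplicate-dropping step further down would keep just two rows.

```python
import re

def normalize_item(label):
    """Reduce a matched heading such as '>Item 1.&#160;' to 'item1'."""
    label = label.lower()
    label = re.sub(r'&#160;|&nbsp;|&#xa0;', '', label)  # HTML non-breaking spaces
    return re.sub(r'[\s.<>]', '', label)                 # whitespace, dots, angle brackets

labels = ['>Item 1.<', '>Item 2.<', '>Item 1.&#160;', '>Item 2.&#160;']
normalized = [normalize_item(x) for x in labels]
print(normalized)
```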


In [9]:
# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')

# Display the dataframe
pos_dat

| | item | start | end |
|---|---|---|---|
| 0 | `item1<` | 241604 | 241613 |
| 1 | `item2<` | 245710 | 245719 |
| 2 | `item1` | 276050 | 276064 |
| 3 | `item2` | 393776 | 393790 |


In [10]:
# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat

| item | start | end |
|---|---|---|
| `item1<` | 241604 | 241613 |
| `item2<` | 245710 | 245719 |
| `item1` | 276050 | 276064 |
| `item2` | 393776 | 393790 |


<b> Get The Text Of Item 1 </b>

The dataframe above contains the start and end index of each match for Items 1 and 2. In the code below, we save all the text from the starting index of `item1` up to the starting index of `item2` into a variable called `item_1_raw`, by slicing `document['10-K']` accordingly. Because the regex only recognises the Item 1 and Item 2 headings, this slice also covers any sub-items in between, such as Item 1A.

In [11]:
# Get Item 1
item_1_raw = document['10-K'][pos_dat['start'].loc['item1']:pos_dat['start'].loc['item2']]
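
To see why keeping the later of two matches matters, note that the first `item1`/`item2` pair above lies only about 4,000 characters apart, while the second pair spans more than 100,000. Presumably the first pair comes from the table of contents, though that is an assumption. The toy sketch below reproduces the idea in plain Python; `slice_item` and the sample string are hypothetical.

```python
import re

def slice_item(text, matches, start_item, end_item):
    """Slice text from the last start_item heading to the last end_item heading."""
    last_start = {}
    for item, start in matches:  # later matches overwrite earlier ones
        last_start[item] = start
    return text[last_start[start_item]:last_start[end_item]]

# Toy text: headings appear once in a contents line and once in the body
toy = 'TOC: item1 item2 | item1 Business text. item2 Properties text.'
toy_matches = [(m.group(), m.start()) for m in re.finditer(r'item[12]', toy)]
section = slice_item(toy, toy_matches, 'item1', 'item2')
print(repr(section))
```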

Let's look at the raw (HTML) Item 1 extracted from the 10-K

In [12]:
item_1_raw[0:1000]

'>Item 1.&#160;&#160;&#160;&#160;Business</span></div><div style="margin-top:9pt;text-align:justify"><span style="color:#000000;font-family:\'Helvetica\',sans-serif;font-size:9pt;font-weight:700;line-height:120%">Company Background</span></div><div style="margin-top:6pt;text-align:justify"><span style="color:#000000;font-family:\'Helvetica\',sans-serif;font-size:9pt;font-weight:400;line-height:120%">The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company&#8217;s fiscal year is the 52- or 53-week period that ends on the last Saturday of September. The Company is a California corporation established in 1977.</span></div><div style="margin-top:16pt;text-align:justify"><span style="color:#000000;font-family:\'Helvetica\',sans-serif;font-size:9pt;font-weight:700;line-height:120%">Products</span></div><div style="margin-top:9pt;text-align:justify"><span style="color:#000000;font-fa

We can see that the extracted Item 1 looks pretty messy: it contains HTML tags, Unicode characters, etc.
Before we can process this text properly we need to clean it up, which means removing all HTML tags, HTML entities, etc. In Python, we can use the **BeautifulSoup** package to do the cleaning for us.

# <span style="color:navy"> STEP 4 : Extract the clean Item 1 text from HTML using the Python BeautifulSoup library

In [13]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

### First convert the raw text we have extracted to a BeautifulSoup object
item_1_content = BeautifulSoup(item_1_raw, 'lxml')

In [23]:
### get_text() drops the HTML tags and returns only the text; the "\n\n" argument is the
### separator BeautifulSoup inserts between consecutive text fragments
item1_text = item_1_content.get_text("\n\n")
print(item1_text[0:1500])

>Item 1.    Business

Company Background

The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company’s fiscal year is the 52- or 53-week period that ends on the last Saturday of September. The Company is a California corporation established in 1977.

Products

iPhone

iPhone

®

 is the Company’s line of smartphones based on its iOS operating system. During 2020, the Company released a new iPhone SE. In October 2020, the Company announced four new iPhone models with 5G technology: iPhone 12 and iPhone 12 Pro were available starting in October 2020, and iPhone 12 Pro Max and iPhone 12 mini are both expected to be available in November 2020.

Mac

Mac

®

 is the Company’s line of personal computers based on its macOS

®

 operating system. During 2020, the Company released a new 16-inch MacBook Pro

®

, a fully redesigned Mac Pro

®

, and updated versions of its MacBook Air

®


# <span style="color:navy"> STEP 5 : Counting Word Patterns in Item1 Using Regular Expressions
The word lists for each category are given below:
    
* **Strategic positioning**: differenti\*, unique\*, superior\*, premium\*, excellen\*, leading edge, upscale, high\* price\*,
high\* margin\*, high\* end\*, inelasticity\*, cost leader\*, low\* pric\*, low\* cost\*, cost advantage\*, competitive
pric\*, aggressive pric\*
* **Operations**: efficien\*, high\* yield\*, process\* improvement\*, asset\* utilization\*, capacity\* utilization\*, scope\*,
scale\*, breath\*, broad, mass, high\* volume\*, large\* volume\*, economy\* of scale, new\* product\*, quality\*,
reliab\*, durable\*
* **Marketing**: marketing\*, advertis\*, brand\*, reputation\*, trademark\*
* **Service**: customer\* service\*, consumer\* service\*, customer\* need\*, sales support\*, post-purchase service\*,
customer\* preference\*, consumer\* preference\*, consumer\* relation\*, consumer\* experience\*, consumer\*
support\*, loyalty\*, customiz\*, tailor\*, personaliz\*, responsive\*, on time, timely
* **Technology**: innovate\*, creativ\*, research and development, R&D, techni\*, technolog\*, patent\*, proprietar\*
* **Infrastructure**: control\* cost\*, control\* expense\*, control\* overhead\*, minimiz\* cost\*, minimiz\* expense\*,
minimiz\* overhead\*, reduce\* cost\*, reduce\* expense\*, reduce\* overhead\*, cut\* cost\*, cut\* expense\*, cut\*
overhead\*, decreas\* cost\*, decreas\* expense\*, decreas\* overhead\*, monitor\* cost\*, monitor\* expense\*,
monitor\* overhead\*, sav\* cost\*, sav\* expense\*, sav\* overhead\*, cost\* control\*, cost\* minimization\*, cost\*
reduction\*, cost\* saving\*, cost\* improvement\*, expense\* control\*, expense\* minimization\*, expense\*
reduction\*, expense\* saving\*, expense\* improvement\*, overhead\* control\*, overhead\* minimization\*,
overhead\* reduction\*, overhead\* saving\*, overhead\* improvement\*    
* **Human resources management**: talent\*, train\*, skill\*, intellectual propert\*, human capital\*
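
Each entry in these lists maps to a regular expression in a mechanical way: a trailing `*` becomes `\w*`, spaces become `\s+`, and the whole phrase is wrapped in word boundaries. The hand-written list in the next cell follows this scheme. As a sketch, the translation could also be automated; `wildcard_to_regex` is a hypothetical helper, not part of the original notebook.

```python
import re

def wildcard_to_regex(phrase):
    """Turn a list entry such as 'high* price*' into a word-boundary regex."""
    parts = []
    for token in phrase.lower().split():
        stem = re.escape(token.rstrip('*'))
        parts.append(stem + (r'\w*' if token.endswith('*') else ''))
    return r'\b' + r'\s+'.join(parts) + r'\b'

for phrase in ['differenti*', 'leading edge', 'high* price*']:
    print(phrase, '->', wildcard_to_regex(phrase))
```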

In [25]:
import re

regexes = [
	"\\bdifferenti\w*\\b",
	"\\bunique\w*\\b",
	"\\bsuperior\w*\\b",
	"\\bpremium\w*\\b",
	"\\bexcellen\w*\\b",
	"\\bleading\s+edge\\b",
	"\\bupscale\\b",
	"\\bhigh\w*\s+price\w*\\b",
	"\\bhigh\w*\s+margin\w*\\b",
	"\\bhigh\w*\s+end\w*\\b",
	"\\binelasticity\w*\\b",
	"\\bcost\s+leader\w*\\b",
	"\\blow\w*\s+pric\w*\\b",
	"\\blow\w*\s+cost\w*\\b",
	"\\bcost\s+advantage\w*\\b",
	"\\bcompetitive\s+pric\w*\\b",
	"\\baggressive\s+pric\w*\\b",
	"\\befficien\w*\\b",
	"\\bhigh\w*\s+yield\w*\\b",
	"\\bprocess\w*\s+improvement\w*\\b",
	"\\basset\w*\s+utilization\w*\\b",
	"\\bcapacity\w*\s+utilization\w*\\b",
	"\\bscope\w*\\b",
	"\\bscale\w*\\b",
	"\\bbreath\w*\\b",
	"\\bbroad\\b",
	"\\bmass\\b",
	"\\bhigh\w*\s+volume\w*\\b",
	"\\blarge\w*\s+volume\w*\\b",
	"\\beconomy\w*\s+of\s+scale\\b",
	"\\bnew\w*\s+product\w*\\b",
	"\\bquality\w*\\b",
	"\\breliab\w*\\b",
	"\\bdurable\w*\\b",
	"\\bmarketing\w*\\b",
	"\\badvertis\w*\\b",
	"\\bbrand\w*\\b",
	"\\breputation\w*\\b",
	"\\btrademark\w*\\b",
	"\\bcustomer\w*\s+service\w*\\b",
	"\\bconsumer\w*\s+service\w*\\b",
	"\\bcustomer\w*\s+need\w*\\b",
	"\\bsales\s+support\w*\\b",
	"\\bpost-purchase\s+service\w*\\b",
	"\\bcustomer\w*\s+preference\w*\\b",
	"\\bconsumer\w*\s+preference\w*\\b",
	"\\bconsumer\w*\s+relation\w*\\b",
	"\\bconsumer\w*\s+experience\w*\\b",
	"\\bconsumer\w*\s+support\w*\\b",
	"\\bloyalty\w*\\b",
	"\\bcustomiz\w*\\b",
	"\\btailor\w*\\b",
	"\\bpersonaliz\w*\\b",
	"\\bresponsive\w*\\b",
	"\\bon\s+time\\b",
	"\\btimely\\b",
	"\\binnovate\w*\\b",
	"\\bcreativ\w*\\b",
	"\\bresearch\s+and\s+development\\b",
	"\\br&d\\b",
	"\\btechni\w*\\b",
	"\\btechnolog\w*\\b",
	"\\bpatent\w*\\b",
	"\\bproprietar\w*\\b",
	"\\bcontrol\w*\s+cost\w*\\b",
	"\\bcontrol\w*\s+expense\w*\\b",
	"\\bcontrol\w*\s+overhead\w*\\b",
	"\\bminimiz\w*\s+cost\w*\\b",
	"\\bminimiz\w*\s+expense\w*\\b",
	"\\bminimiz\w*\s+overhead\w*\\b",
	"\\breduce\w*\s+cost\w*\\b",
	"\\breduce\w*\s+expense\w*\\b",
	"\\breduce\w*\s+overhead\w*\\b",
	"\\bcut\w*\s+cost\w*\\b",
	"\\bcut\w*\s+expense\w*\\b",
	"\\bcut\w*\s+overhead\w*\\b",
	"\\bdecreas\w*\s+cost\w*\\b",
	"\\bdecreas\w*\s+expense\w*\\b",
	"\\bdecreas\w*\s+overhead\w*\\b",
	"\\bmonitor\w*\s+cost\w*\\b",
	"\\bmonitor\w*\s+expense\w*\\b",
	"\\bmonitor\w*\s+overhead\w*\\b",
	"\\bsav\w*\s+cost\w*\\b",
	"\\bsav\w*\s+expense\w*\\b",
	"\\bsav\w*\s+overhead\w*\\b",
	"\\bcost\w*\s+control\w*\\b",
	"\\bcost\w*\s+minimization\w*\\b",
	"\\bcost\w*\s+reduction\w*\\b",
	"\\bcost\w*\s+saving\w*\\b",
	"\\bcost\w*\s+improvement\w*\\b",
	"\\bexpense\w*\s+control\w*\\b",
	"\\bexpense\w*\s+minimization\w*\\b",
	"\\bexpense\w*\s+reduction\w*\\b",
	"\\bexpense\w*\s+saving\w*\\b",
	"\\bexpense\w*\s+improvement\w*\\b",
	"\\boverhead\w*\s+control\w*\\b",
	"\\boverhead\w*\s+minimization\w*\\b",
	"\\boverhead\w*\s+reduction\w*\\b",
	"\\boverhead\w*\s+saving\w*\\b",
	"\\boverhead\w*\s+improvement\w*\\b",
	"\\btalent\w*\\b",
	"\\btrain\w*\\b",
	"\\bskill\w*\\b",
	"\\bintellectual\s+propert\w*\\b",
	"\\bhuman\s+capital\w*\\b",
	]


# Compile each pattern once so it can be reused for matching
wordRegExpressions = [re.compile(pattern) for pattern in regexes]

Now that we have compiled the list of word patterns, we use regular expression matching to count how often each pattern occurs in Item 1.

In [39]:
# Convert to lower case and replace all newlines with spaces
item1Text = item1_text.lower().replace('\n', " ")
# split() with no argument splits on any run of whitespace
item1_total_words = len(item1Text.split())

# word_counts holds the number of matches for each word pattern,
# word_freq the same counts relative to the total number of words
word_counts = []
word_freq   = []
for regWord in wordRegExpressions:
    # Get the number of matches for this word pattern
    matches = len(re.findall(regWord, item1Text))
    word_counts.append(matches)
    word_freq.append(matches/item1_total_words)

print(word_counts)

categories = ["Strategic positioning","Operations","Marketing","Service","Technology","Infrastructure","Human resources management"]
# Index of the first pattern of each category in `regexes`, plus the total number of patterns
category_index = [0, 17, 34, 39, 56, 64, 100, 105]
category_sum = []

# Sum the word counts in each category; the slice end is exclusive,
# so no pattern is counted in two categories
for i in range(len(categories)):
    category_sum.append(sum(word_counts[category_index[i]:category_index[i+1]]))

print("category sum:", category_sum)

[0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 22, 8, 2, 0, 5, 4, 4, 14, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 5, 0, 0, 2, 2, 5, 32, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 23, 0]
category sum: [9, 34, 30, 8, 57, 0, 26]


**OPTIONAL** In a Jupyter notebook, we can use a pandas DataFrame to display the category counts as a table.

In [40]:
df_category = pd.DataFrame([categories[i], category_sum[i]] for i in range(len(categories)))

df_category.columns = ['Category', 'Count']
df_category

| | Category | Count |
|---|---|---|
| 0 | Strategic positioning | 9 |
| 1 | Operations | 34 |
| 2 | Marketing | 30 |
| 3 | Service | 8 |
| 4 | Technology | 57 |
| 5 | Infrastructure | 0 |
| 6 | Human resources management | 26 |
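
The category totals above rely on hand-maintained index boundaries into the pattern list. An alternative that avoids this bookkeeping is to keep the patterns grouped by category in a dictionary and sum per group. The sketch below uses a shortened, hypothetical subset of the patterns and a made-up sentence; `count_by_category` is not part of the original notebook.

```python
import re

# Hypothetical, shortened pattern groups; the full lists are given in STEP 5
category_patterns = {
    'Marketing': [r'\bbrand\w*\b', r'\breputation\w*\b'],
    'Technology': [r'\bpatent\w*\b', r'\btechnolog\w*\b'],
}

def count_by_category(text, patterns_by_category):
    """Sum regex matches per category in the lower-cased text."""
    text = text.lower()
    return {
        category: sum(len(re.findall(p, text)) for p in patterns)
        for category, patterns in patterns_by_category.items()
    }

sample = 'Our brand and brands rely on patented technology and new technologies.'
counts = count_by_category(sample, category_patterns)
print(counts)
```

With this layout, adding or removing a pattern never shifts the boundaries of another category.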


In [34]:
df_word_counts = pd.DataFrame([regexes[i], word_counts[i]] for i in range(len(wordRegExpressions)))

df_word_counts.columns = ['Regular Expression for Word Pattern', 'Count']
pd.set_option('display.max_rows', 120)
df_word_counts

Unnamed: 0,Regular Expression for Word Pattern,Count
0,\bdifferenti\w*\b,0
1,\bunique\w*\b,3
2,\bsuperior\w*\b,0
3,\bpremium\w*\b,0
4,\bexcellen\w*\b,0
5,\bleading\s+edge\b,0
6,\bupscale\b,0
7,\bhigh\w*\s+price\w*\b,0
8,\bhigh\w*\s+margin\w*\b,0
9,\bhigh\w*\s+end\w*\b,0
