# SEC Web Scraper: Retrieving Financial Information

## Installation and Imports

We must first install the library and import the necessary functions from the `Downloader` and `Scraper` modules.

In [1]:
#Install the library
!pip install sec-web-scraper

In [2]:
#Import
from sec_web_scraper.Downloader import Downloader
from sec_web_scraper.Scraper import *

## Build the index for year range [2002,2006]

First, we create our `Downloader` object.

Since we haven't built the index yet, the list returned by `get_forms` should be empty.

In [3]:
d = Downloader()
print(type(d))
#Output below:

<class 'sec_web_scraper.Downloader.Downloader'>


In [4]:
d.get_forms()
#Output below:

[]

Next, build the index for our year range with the `build_index_sec` function.

In [5]:
d.build_index_sec(2002,2006)
#Output below:

 Downloading SEC files for Year: 2006 and QTR: 4 : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:16<00:00, 15.23s/it]


Now we can check that the forms list is populated and that the new `index_sec` directory was created.

In [8]:
len(d.get_forms())
#Output below:

503

As we can see above, the index contains 503 unique form types for the period 2002 to 2006.

Let's print the first 10 of them.

In [9]:
d.get_forms()[:10]
#Output below:

['5/A',
 '40-6B/A',
 'N-54A',
 'U-1',
 '40-8FC',
 '305B2/A',
 'F-4MEF',
 '19B-4E',
 '40-F',
 'NT 10-K']
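
To make concrete what "unique form types" means, the sketch below deduplicates the `Form Type` field over a handful of index rows. The rows are sample data for illustration and `unique_form_types` is a hypothetical helper, not part of `sec_web_scraper`; this notebook does not show how `get_forms` is actually implemented.

```python
# Hypothetical helper (not part of sec_web_scraper): collect distinct form types.
def unique_form_types(rows):
    """Return the sorted set of 'Form Type' values found in the index rows."""
    return sorted({row["Form Type"] for row in rows})

# Sample rows for illustration only; a real index holds many thousands of entries.
rows = [
    {"Company Name": "SURMODICS INC", "Form Type": "5/A"},
    {"Company Name": "NYMAGIC INC", "Form Type": "5/A"},
    {"Company Name": "EXAMPLE CO", "Form Type": "NT 10-K"},  # made-up company
]

forms = unique_form_types(rows)
print(forms, len(forms))
```

Two rows share the type `5/A`, so only two distinct types come back.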

In [11]:
%ls index_sec/
#Output below:

2002-QTR1.tsv  2003-QTR1.tsv  2004-QTR1.tsv  2005-QTR1.tsv  2006-QTR1.tsv
2002-QTR2.tsv  2003-QTR2.tsv  2004-QTR2.tsv  2005-QTR2.tsv  2006-QTR2.tsv
2002-QTR3.tsv  2003-QTR3.tsv  2004-QTR3.tsv  2005-QTR3.tsv  2006-QTR3.tsv
2002-QTR4.tsv  2003-QTR4.tsv  2004-QTR4.tsv  2005-QTR4.tsv  2006-QTR4.tsv


As we can see above, we have generated one index file for each (year, quarter) pair in our specified range.
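
The directory listing follows a simple `<year>-QTR<n>.tsv` pattern, so the expected file names can be enumerated ahead of time, for example to check that nothing is missing. This is a minimal sketch; `index_filenames` is a hypothetical helper and the naming pattern is inferred only from the listing above.

```python
from itertools import product

# Hypothetical helper (not part of sec_web_scraper).
def index_filenames(start_year, end_year):
    """List one '<year>-QTR<n>.tsv' name per (year, quarter), both years inclusive."""
    years = range(start_year, end_year + 1)
    return [f"{year}-QTR{qtr}.tsv" for year, qtr in product(years, range(1, 5))]

names = index_filenames(2002, 2006)
print(len(names))            # 5 years x 4 quarters
print(names[0], names[-1])
```

Comparing this list against the contents of `index_sec/` is a quick way to spot a quarter that failed to download.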

## Filter our index for forms of type 5/A

Now that we have our index, we can look for forms of type `5/A` filed within our year range.

The `find_files_by_type` function does this for us.

In [16]:
res = d.find_files_by_type('5/A')
res
#Output below:

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Filename,url
3,1000045,NICHOLAS FINANCIAL INC,5/A,2006-01-25,edgar/data/1000045/0000897069-06-000169.txt,edgar/data/1000045/0000897069-06-000169-index....
4,1000045,NICHOLAS FINANCIAL INC,5/A,2006-01-25,edgar/data/1000045/0000897069-06-000171.txt,edgar/data/1000045/0000897069-06-000171-index....
5,1000045,NICHOLAS FINANCIAL INC,5/A,2006-01-25,edgar/data/1000045/0000897069-06-000173.txt,edgar/data/1000045/0000897069-06-000173-index....
3987,1006057,NOVICH NEIL S,5/A,2006-01-06,edgar/data/1006057/0000790528-06-000018.txt,edgar/data/1006057/0000790528-06-000018-index....
4820,1008051,MANN ALFRED E,5/A,2006-02-24,edgar/data/1008051/0001209191-06-013293.txt,edgar/data/1008051/0001209191-06-013293-index....
...,...,...,...,...,...,...
94781,847431,NYMAGIC INC,5/A,2002-10-18,edgar/data/847431/0000902561-02-000488.txt,edgar/data/847431/0000902561-02-000488-index.html
105650,898173,O REILLY AUTOMOTIVE INC,5/A,2002-11-15,edgar/data/898173/0000898173-02-000059.txt,edgar/data/898173/0000898173-02-000059-index.html
112364,924717,SURMODICS INC,5/A,2002-11-20,edgar/data/924717/0000950134-02-014785.txt,edgar/data/924717/0000950134-02-014785-index.html
112365,924717,SURMODICS INC,5/A,2002-11-21,edgar/data/924717/0000950134-02-014829.txt,edgar/data/924717/0000950134-02-014829-index.html


In total, there are 3692 different 5/A filings in this period; the table above is truncated for display.
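
For readers curious what such a filter involves, here is a minimal standard-library sketch. It assumes tab-separated text with headers mirroring the columns shown above; the real layout of the `.tsv` files and the internals of `find_files_by_type` are not shown in this notebook. `filter_by_form_type` is a hypothetical name, and the last sample row is made up so the filter has something to exclude.

```python
import csv
import io

# Assumption: tab-separated text with these headers, mirroring the columns above.
SAMPLE_TSV = (
    "CIK\tCompany Name\tForm Type\tDate Filed\tFilename\n"
    "924717\tSURMODICS INC\t5/A\t2002-11-20\tedgar/data/924717/0000950134-02-014785.txt\n"
    "924717\tSURMODICS INC\t5/A\t2002-11-21\tedgar/data/924717/0000950134-02-014829.txt\n"
    "847431\tNYMAGIC INC\t5/A\t2002-10-18\tedgar/data/847431/0000902561-02-000488.txt\n"
    "847431\tNYMAGIC INC\t10-K\t2002-03-29\tedgar/data/847431/placeholder.txt\n"  # made-up row
)

def filter_by_form_type(tsv_text, form_type):
    """Keep only the rows whose 'Form Type' equals form_type exactly."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["Form Type"] == form_type]

matches = filter_by_form_type(SAMPLE_TSV, "5/A")
for row in matches:
    print(row["Company Name"], row["Date Filed"])
```

The comparison is an exact string match on purpose: `5` and `5/A` are separate form types, so a substring test would mix them.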

## Scraper 

Now that we have our 5/A filings, what if we want to scrape a particular company's filing?

Let's choose the 5/A that Surmodics Inc (CIK: 0000924717) filed on 2002-11-20.

First, let's look at the information available in the SEC database for Surmodics with `get_company_filings_given_cik`.

In [23]:
surmodics_dic = get_company_filings_given_cik('0000924717')
#Output below:

Surgical & Medical Instruments & Apparatus


In [26]:
surmodics_dic['addresses']
#Output below:

{'mailing': {'street1': '9924 WEST 74TH ST',
  'street2': None,
  'city': 'EDEN PRAIRIE',
  'stateOrCountry': 'MN',
  'zipCode': '55344',
  'stateOrCountryDescription': 'MN'},
 'business': {'street1': '9924 W 74TH ST',
  'street2': None,
  'city': 'EDEN PRAIRIE',
  'stateOrCountry': 'MN',
  'zipCode': '55344',
  'stateOrCountryDescription': 'MN'}}
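
The address entries are plain nested dictionaries, so they can be handled with ordinary Python. As a small sketch, the hypothetical helper `format_address` below (not part of the library) turns one entry into a single line and skips fields that are `None`, such as `street2`.

```python
# Hypothetical helper (not part of sec_web_scraper).
def format_address(address):
    """Join the non-empty parts of one address dict into a single line."""
    street = ", ".join(
        part for part in (address.get("street1"), address.get("street2")) if part
    )
    region = " ".join(
        part for part in (address.get("stateOrCountry"), address.get("zipCode")) if part
    )
    return ", ".join(part for part in (street, address.get("city"), region) if part)

# Values copied from the 'mailing' entry shown above.
mailing = {
    "street1": "9924 WEST 74TH ST",
    "street2": None,
    "city": "EDEN PRAIRIE",
    "stateOrCountry": "MN",
    "zipCode": "55344",
    "stateOrCountryDescription": "MN",
}
print(format_address(mailing))
```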

In [31]:
surmodics_dic['filings']['files']
#Output below:

[{'name': 'CIK0000924717-submissions-001.json',
  'filingCount': 438,
  'filingFrom': '1997-12-24',
  'filingTo': '2008-11-18'}]
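
The keys seen so far (`addresses`, `filings`, `files`) resemble the JSON served by the SEC's public submissions endpoint on `data.sec.gov`. Whether `get_company_filings_given_cik` reads from that endpoint is an assumption, as is the idea that overflow files such as the one listed above live under the same base path. The helper names below are hypothetical; the sketch only builds URLs and makes no network calls.

```python
SUBMISSIONS_BASE = "https://data.sec.gov/submissions/"

def pad_cik(cik):
    """Zero-pad a CIK to the 10-digit form used in submissions file names."""
    return str(int(cik)).zfill(10)

def submissions_url(cik):
    """URL of the main submissions document for a company (assumed endpoint)."""
    return f"{SUBMISSIONS_BASE}CIK{pad_cik(cik)}.json"

def extra_file_url(file_name):
    """URL for a file listed under filings -> files (assumed to share the base path)."""
    return SUBMISSIONS_BASE + file_name

main_url = submissions_url(924717)
older_url = extra_file_url("CIK0000924717-submissions-001.json")
print(main_url)
print(older_url)
```

Padding matters because the index stores the CIK as a plain integer (`924717`), while the file names use the 10-digit form (`0000924717`).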

Now, let's get the raw text of the 5/A.

We can use the `get_document_given_link` function for this.

In [32]:
surmodics_link = "https://www.sec.gov/Archives/edgar/data/924717/0000950134-02-014785.txt"
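
Note that this link is simply the `Filename` value from the index with the EDGAR archives base URL in front of it. A short sketch of that step, using a hypothetical helper `archive_url` that is not part of the library:

```python
ARCHIVES_BASE = "https://www.sec.gov/Archives/"

# Hypothetical helper (not part of sec_web_scraper).
def archive_url(filename):
    """Prefix an index 'Filename' value with the EDGAR archives base URL."""
    return ARCHIVES_BASE + filename.lstrip("/")

link = archive_url("edgar/data/924717/0000950134-02-014785.txt")
print(link)
```

This makes it easy to go from any row returned by `find_files_by_type` to a downloadable document.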

In [33]:
raw_txt = get_document_given_link(surmodics_link)
#Output below:

https://www.sec.gov/Archives/edgar/data/924717/0000950134-02-014785.txt
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox', 'Accept': 'application/json, text/javascript, */*; q=0.01'}
ok
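
The output above echoes the URL and the request headers that were sent. The SEC asks automated clients to identify themselves through the `User-Agent` header and to keep their request rate modest. The sketch below shows one way to do that with the standard library alone; `build_request`, `fetch_text`, and the contact string are placeholders for illustration, not the library's implementation.

```python
import urllib.request

def build_request(url, user_agent):
    """Create a GET request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_text(url, user_agent, timeout=30):
    """Download the URL and decode the body as text (performs a network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

url = "https://www.sec.gov/Archives/edgar/data/924717/0000950134-02-014785.txt"
user_agent = "Example Research example@example.com"  # placeholder contact string
request = build_request(url, user_agent)
# text = fetch_text(url, user_agent)  # uncomment to download; needs network access
```

Keeping request construction separate from the download makes the headers easy to inspect before anything is sent.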


In [34]:
raw_txt
#Output below:

'-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type: 2001,MIC-CLEAR\nOriginator-Name: webmaster@www.sec.gov\nOriginator-Key-Asymmetric:\n MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen\n TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB\nMIC-Info: RSA-MD5,RSA,\n Cp7qAyzP95tJvCUP8ipPYpigzAcjUT+1kQNxu+OvEfyg8h11sDb0IReng+8t/M0r\n VANkRIow3cmOm0XhWvBMkQ==\n\n<SEC-DOCUMENT>0000950134-02-014785.txt : 20021120\n<SEC-HEADER>0000950134-02-014785.hdr.sgml : 20021120\n<ACCEPTANCE-DATETIME>20021120143444\nACCESSION NUMBER:\t\t0000950134-02-014785\nCONFORMED SUBMISSION TYPE:\t5/A\nPUBLIC DOCUMENT COUNT:\t\t1\nCONFORMED PERIOD OF REPORT:\t20020930\nFILED AS OF DATE:\t\t20021120\n\nSUBJECT COMPANY:\t\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tSURMODICS INC\n\t\tCENTRAL INDEX KEY:\t\t\t0000924717\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tADHESIVES & SEALANTS [2891]\n\t\tIRS NUMBER:\t\t\t\t411356149\n\t\tSTATE OF INCORPORATION:\t\t\tMN\n\t\tFISCAL YEAR END:\t\t\t0

Whoa, that's a lot of text. Let's look for the tags in the document!

In [35]:
get_document_tags(raw_txt)
#Output below:

This is x, y, z: 1856 , 45504 , <TYPE>5/A


[(1856, 45504, '<TYPE>5/A')]

As we see above, there is only one tag in this document, for the 5/A. This means the document contains a single section, which starts at index 1856 and ends at index 45504.
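
Once the start and end indices are known, extracting a section is just a string slice. The sketch below shows the idea on a toy document. It assumes each section is wrapped in `<DOCUMENT>` ... `</DOCUMENT>` and carries a `<TYPE>` line, as in EDGAR full-text submissions. `find_sections` is a hypothetical helper that returns tuples in the same `(start, end, tag)` shape as above, and it is not claimed to match how `get_document_tags` computes its indices.

```python
import re

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>[^\n]+")

# Hypothetical helper (not part of sec_web_scraper).
def find_sections(text):
    """Return (start, end, type_tag) for every <DOCUMENT> block in text."""
    sections = []
    for match in DOC_RE.finditer(text):
        type_match = TYPE_RE.search(match.group(1))
        tag = type_match.group(0).strip() if type_match else "<TYPE>UNKNOWN"
        sections.append((match.start(), match.end(), tag))
    return sections

# Toy stand-in for raw_txt; a real filing is far longer.
sample = (
    "<SEC-HEADER>header text</SEC-HEADER>\n"
    "<DOCUMENT>\n<TYPE>5/A\n<SEQUENCE>1\n<TEXT>\nbody of the form\n</TEXT>\n</DOCUMENT>\n"
)

sections = find_sections(sample)
start, end, tag = sections[0]
section_text = sample[start:end]  # slice out just this section
print(tag, len(section_text))
```

With the real filing, the same slice would be `raw_txt[1856:45504]`, using the indices reported above.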