# Web Data Extraction (1)
by Dr Liang Jin

- Step 1: access crawler.idx files from SEC EDGAR
- Step 2: re-write crawler data to csv files
- Step 3: retrieve 10-K filing information including URLs
- Step 4: read text from HTML

## Step 0: Setup...

In [None]:
# import packages as usual
import os, requests, csv, webbrowser
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

In [None]:
# define some global variables such as sample periods
beg_yr = 2016
end_yr = 2017

## Step 1: Access Crawler.idx Files...
The SEC stores a very large number of filings in its EDGAR archives and, fortunately, it also provides index files that list them. The index files can be browsed starting from the following URL:

[https://www.sec.gov/Archives/edgar/full-index/](https://www.sec.gov/Archives/edgar/full-index/)

And individual crawler.idx files are stored in a structured way:

`https://www.sec.gov/Archives/edgar/full-index/{}/{}/crawler.idx`
where the two `{}` placeholders are the year and the quarter (`QTR1` to `QTR4`), in that order

In [None]:
# create a list containing all the URLs for the .idx files
idx_urls = []
for year in range(beg_yr, end_yr+1):
    for qtr in ['QTR1', 'QTR2', 'QTR3', 'QTR4']:
        idx_url = 'https://www.sec.gov/Archives/edgar/full-index/{}/{}/crawler.idx'.format(year, qtr)
        idx_urls.append(idx_url)

In [None]:
# check on our URLs
idx_urls

In [None]:
# let's try downloading one of the files
urlretrieve(idx_urls[0], './example.idx');
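
If the download above is refused, one possible cause is that EDGAR expects automated requests to declare a `User-Agent` header identifying the requester. The sketch below shows one way to attach such a header with `urllib`. The helper name `build_edgar_request` and the contact string are placeholders invented for illustration, and nothing is sent to the server until the request is passed to `urlopen`.

```python
from urllib.request import Request

def build_edgar_request(url, user_agent):
    """Return a Request carrying a User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})

# Placeholder contact string: replace it with your own name and email address.
example_req = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/crawler.idx",
    "Sample Name sample@example.com",
)

# To actually download, something like the following could be used (not run here):
# from urllib.request import urlopen
# with urlopen(example_req) as resp, open('./example.idx', 'wb') as f:
#     f.write(resp.read())
```

With `requests`, the equivalent would be to pass a `headers` dictionary to `requests.get`.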

### Task 1: Have a look at the downloaded file?

## Step 2: Rewrite Crawler data into CSV files...
The original Crawler.idx files come with extra information:
- **Company Name**: hmmm...not really useful
- **Form Type**: e.g., 10-K, 10-Q and others
- **CIK**: Central Index Key, intended to be a unique key identifying entities in the SEC universe
- **Date Filed**: the exact filing date; note that it is not necessarily the reporting date
- **URL**: address of the filing page, which contains the link to the actual filing in HTML format
- **Meta-data** on the crawler.idx itself
- **Other information** including headers and separators

### Retrieve the data inside the .idx file

In [None]:
# Ok, let's get cracking
url = idx_urls[0]

# use requests package to access the contents
r = requests.get(url)

# then focus on the text data only and split the whole file into lines
lines = r.text.splitlines()

### Raw data processing

In [None]:
# Let's peek at the contents
lines[:10]

In [None]:
# identify the location of the header row
# it's the eighth row, so in Python the index is 7
header_loc = 7
# double check
lines[header_loc]

In [None]:
# retrieve the location of individual columns
name_loc = lines[header_loc].find('Company Name')
type_loc = lines[header_loc].find('Form Type')
cik_loc = lines[header_loc].find('CIK')
date_loc = lines[header_loc].find('Date Filed')
url_loc = lines[header_loc].find('URL')
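
To make the idea behind these offsets concrete, here is a minimal sketch on a synthetic header and record. Both lines are invented for illustration (they are not real EDGAR data), and the column widths are arbitrary. Because the file is fixed-width, the position where each column name starts in the header is also where that column's value starts in every data line.

```python
# Synthetic fixed-width lines, built with ljust so the columns line up exactly.
header = ("Company Name".ljust(20) + "Form Type".ljust(12)
          + "CIK".ljust(10) + "Date Filed".ljust(12) + "URL")
record = ("Example Corp".ljust(20) + "10-K".ljust(12)
          + "1234567".ljust(10) + "2016-02-29".ljust(12)
          + "https://example.com/filing-index.htm")

# Starting offset of each column, taken from the header line.
column_names = ["Company Name", "Form Type", "CIK", "Date Filed", "URL"]
offsets = [header.find(name) for name in column_names]

# Each field runs from its own offset to the next column's offset;
# the last field runs to the end of the line (end = None).
fields = [record[start:end].strip()
          for start, end in zip(offsets, offsets[1:] + [None])]

print(offsets)
print(fields)
```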

### Re-organise the data

In [None]:
# identify the location of the first data row
# it's the tenth row, so in Python the index is 9
firstdata_loc = 9
# double check
lines[firstdata_loc]
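
Hard-coded positions such as 7 and 9 only work while every index file has the same preamble. As a hedged alternative, the sketch below searches for the header instead. It assumes that the header is the first line starting with `Company Name` and that the data begins two lines later, as in the file inspected above. The function name `find_header_and_data` and the sample lines are made up for illustration.

```python
def find_header_and_data(lines):
    """Return (header_index, first_data_index) for crawler.idx-style lines."""
    for i, line in enumerate(lines):
        if line.startswith("Company Name"):
            # Assumption: exactly one separator line sits between header and data.
            return i, i + 2
    raise ValueError("header row not found")

# Invented sample with a shorter preamble than the real file.
sample_lines = [
    "Description: sample index",
    "",
    "Company Name    Form Type   CIK   Date Filed   URL",
    "--------------------------------------------------",
    "Example Corp    10-K        1     2016-01-04   https://example.com/a",
]

header_idx, data_idx = find_header_and_data(sample_lines)
print(header_idx, data_idx)
```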

In [None]:
# create an empty list
rows = []

# loop through lines in .idx file
for line in lines[firstdata_loc:]:
    
    # collect the data from the beginning of the line up to the character before the Form Type column (e.g. 485BPOS)
    # then strip the string, i.e., remove the leading and trailing white spaces
    company_name = line[:type_loc].strip()
    form_type = line[type_loc:cik_loc].strip()
    cik = line[cik_loc:date_loc].strip()
    date_filed = line[date_loc:url_loc].strip()
    page_url = line[url_loc:].strip()
    
    # store these collected data to a row (tuple)
    row = (company_name, form_type, cik, date_filed, page_url)
    # then append this row to the empty list rows
    rows.append(row)

### Task 2: Can you update the code to store 10-K filings only?

In [None]:
# peek again
rows[:5]
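
One possible way to approach Task 2 is to filter on the form type, which is the second element of each row tuple. The sketch below uses invented sample rows in the same layout as `rows`. It keeps exact `10-K` matches only, which is an assumption: whether amendments such as `10-K/A` should also be kept depends on the research question.

```python
# Invented sample rows: (company_name, form_type, cik, date_filed, page_url).
sample_rows = [
    ("Example Corp", "10-K", "1", "2016-02-29", "https://example.com/a"),
    ("Example Fund", "485BPOS", "2", "2016-01-04", "https://example.com/b"),
    ("Example Inc", "10-K/A", "3", "2016-03-15", "https://example.com/c"),
    ("Sample Ltd", "10-Q", "4", "2016-03-15", "https://example.com/d"),
]

# Keep only rows whose form type is exactly '10-K'.
rows_10k = [row for row in sample_rows if row[1] == "10-K"]

print(rows_10k)
```

The same condition could instead be placed inside the loop that builds `rows`, so that unwanted filings are never stored.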

### Write to CSV file

In [None]:
# where to write?
# define directory to store data
csv_dir = './CSV/' # recommended: define this at the top of the notebook

# a future-proof way to create directory
# only create the folder when there is no existing one
if not os.path.isdir(csv_dir):
    os.mkdir(csv_dir)

In [None]:
# But what about file names? We will eventually have multiple files to process

# create file name based on the original idx file
_ = url.split('/')
_

How about creating a sensible naming scheme that is easy to refer to?

How about something like **2017Q4**?

In [None]:
# get year from idx URL
file_yr = url.split('/')[-3]

# get quarter from idx URL
file_qtr = url.split('/')[-2][-1]

# Combine year, quarter, and extension to create file name
file_name = file_yr + "Q" + file_qtr + '.csv'

# then create a path so that we can write the data to local drive
file_path = os.path.join(csv_dir, file_name)

In [None]:
# Check on the path
file_path
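
Since the same naming logic will be needed for every index file, it may be convenient to wrap it in a small function. This is a sketch only: the name `csv_name_from_idx_url` is hypothetical, and it assumes the URL always ends in `/<year>/QTR<n>/crawler.idx` as described in Step 1.

```python
def csv_name_from_idx_url(idx_url):
    """Build a name like '2017Q4.csv' from a crawler.idx URL (hypothetical helper)."""
    parts = idx_url.split('/')
    year = parts[-3]          # e.g. '2017'
    quarter = parts[-2][-1]   # last character of 'QTR4', i.e. '4'
    return year + 'Q' + quarter + '.csv'

example_name = csv_name_from_idx_url(
    'https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/crawler.idx'
)
print(example_name)
```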

In [None]:
# create and write to csv file
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(file_path, 'w', newline='') as wf:
    writer = csv.writer(wf, delimiter = ',')
    writer.writerows(rows)
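
As an optional variation, a header row can be written before the data so that the CSV is self-describing when it is read back later. The sketch below writes one invented row to a temporary folder and reads it back. The column names are my own choice and do not come from the index file.

```python
import csv
import os
import tempfile

# Invented sample row in the same layout as `rows`.
sample_rows = [
    ("Example Corp", "10-K", "1234567", "2016-02-29",
     "https://example.com/filing-index.htm"),
]
# Hypothetical column names chosen for illustration.
column_names = ("company_name", "form_type", "cik", "date_filed", "page_url")

# Write into a temporary directory so nothing in ./CSV/ is touched.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "2016Q1.csv")

with open(tmp_path, "w", newline="", encoding="utf-8") as wf:
    writer = csv.writer(wf)
    writer.writerow(column_names)   # header row first
    writer.writerows(sample_rows)   # then the data rows

# Read the file back to check the round trip.
with open(tmp_path, newline="", encoding="utf-8") as rf:
    read_back = list(csv.reader(rf))

print(read_back)
```

Company names that contain commas are quoted automatically by `csv.writer`, so they survive the round trip as a single field.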

### Task 3: Can you loop through idx files from 2016 to 2017?