# Scraping 10-Ks and 10-Qs for Alpha (Data Cleaning)

## THESIS:
Major text changes in 10-K and 10-Q filings over time indicate significant decreases in future returns. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.

### Introduction
Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the [Securities and Exchange Commission](https://www.sec.gov/) (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.

When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about ["significant pending lawsuits or other legal proceedings"](https://www.sec.gov/fast-answers/answersreada10khtm.html). As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.

These insights, however, can be difficult to access. The average 10-K was [42,000 words long](https://www.wsj.com/articles/the-109-894-word-annual-report-1433203762) in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.

The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their [recent paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1658471) that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns.

In this investigation, we attempt to replicate their results on Quantopian.

(For an overview of this paper from Lauren Cohen himself, see [the Lazy Prices interview](https://www.youtube.com/watch?v=g96gROyc3wE) from QuantCon 2018.)

### Hypothesis
Companies make major textual changes to their 10-Ks and 10-Qs when major things happen to their business. Thus, we expect that textual changes to 10-Ks and 10-Qs are a signal of future stock price movement.

Since the vast majority (86%) of textual changes have negative sentiment, we generally expect that major textual changes signal a decrease in stock price (Cohen et al. 2018).

Thus, we expect to find alpha by shorting companies with large textual changes in their 10-Ks and 10-Qs.

### Methodology
1. Scrape every publicly traded company's 10-Ks and 10-Qs from the [SEC EDGAR database](https://www.sec.gov/edgar/searchedgar/companysearch.html). Remove extraneous content from the 10-Ks and 10-Qs (numerical tables, HTML tags, XBRL tags, etc).
2. For each company, compute [cosine similarity](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [Jaccard similarity](http://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-score) scores over the sequence of its 10-Ks and 10-Qs. Each 10-K is compared to the previous year's 10-K; each 10-Q is compared to the 10-Q from the same quarter of the previous year.
3. Compile these scores into one dataset.
4. Upload the data to Quantopian using [Self-Serve Data](https://www.quantopian.com/posts/upload-your-custom-datasets-and-signals-with-self-serve-data), then use [Alphalens](http://quantopian.github.io/alphalens/) to analyze the performance of 10-K and 10-Q text changes as an alpha factor.

This notebook covers steps 1-3. For step 4, see the [Alphalens study](https://www.quantopian.com/posts/analyzing-alpha-in-10-ks-and-10-qs) notebook.
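
To make step 2 concrete, here is a minimal, dependency-free sketch of the two similarity measures applied to a pair of short texts. The tokenizer (lowercase, whitespace split) and the function names `tokenize`, `cosine_sim`, and `jaccard_sim` are illustrative assumptions, not the notebook's actual implementation; the notebook itself imports scikit-learn's `cosine_similarity`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is good enough for illustration
    return text.lower().split()


def cosine_sim(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is empty
    return dot / (norm_a * norm_b)


def jaccard_sim(text_a, text_b):
    """Jaccard similarity: shared unique words / all unique words."""
    words_a = set(tokenize(text_a))
    words_b = set(tokenize(text_b))
    union = words_a | words_b
    if not union:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(union)


# Toy stand-ins for the same section of two consecutive filings
last_year = "the company faces no material litigation"
this_year = "the company faces material litigation and regulatory risk"

print("cosine: ", cosine_sim(last_year, this_year))
print("jaccard:", jaccard_sim(last_year, this_year))
```

Both scores lie between 0 and 1, and a lower score means a larger textual change. Cosine similarity accounts for how often each word appears, while Jaccard similarity only considers whether a word appears at all.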

## 0. Running This Notebook

This notebook is intended to be run locally (on your own computer), *not* within the Quantopian Research environment. We run it locally in order to generate the .csv file for upload into the Self-Serve Data feature.

In order to run this notebook, you will need to have Python 3 and the following packages installed:  

- **jupyter notebook**
- **pandas** (version 0.23.0)
- **numpy**
- **requests**
- **scikit-learn**
- **BeautifulSoup**
- **lxml**
- **tqdm**

All of these packages can be installed using conda or pip. For detailed installation instructions, see the installation documentation for each package ([jupyter](http://jupyter.org/install), [pandas](https://pandas.pydata.org/pandas-docs/stable/install.html), [numpy](https://scipy.org/install.html), [Requests](http://docs.python-requests.org/en/master/user/install/#install), [scikit-learn](http://scikit-learn.org/stable/install.html), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup), [lxml](https://lxml.de/installation.html), [tqdm](https://pypi.org/project/tqdm/#installation)).

To run this notebook:

1. Clone it into your own Quantopian account and open it in your research environment.
2. Download it as a .ipynb file (Notebook > Download as > Notebook (.ipynb))
3. Move the .ipynb notebook file to the desired directory on your local machine.
4. Open a command line window.
5. Use `cd` in the command line to navigate to the directory containing the notebook file.
6. Run `jupyter notebook` in the command line to start a jupyter notebook session.
7. A window should open in your default web browser displaying the contents of your current directory. Click the name of the .ipynb notebook file to open it.
8. Run the cells just as you would in the Quantopian Research environment.

In [9]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /Users/daniel/opt/anaconda3/envs/ipykernel_py2/lib/python3.8/site-packages (0.23.1)


In [1]:
# Importing built-in libraries (no need to install these)
import re
import os
from time import gmtime, strftime
from datetime import datetime, timedelta
import unicodedata

# Importing libraries you need to install
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import requests
import bs4 as bs
from lxml import html
from tqdm import tqdm

## 1. Data Scraping

We need to know what we want to scrape, so we'll begin by compiling a complete* list of U.S. stock tickers.

*for our purposes, "complete" = everything traded on NASDAQ, NYSE, or AMEX.

In [3]:
# Get lists of tickers from NASDAQ, NYSE, AMEX
nasdaq_tickers = pd.read_csv('https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download')
nyse_tickers = pd.read_csv('https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download')
amex_tickers = pd.read_csv('https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download')

# Drop irrelevant cols
nasdaq_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
nyse_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
amex_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)

# Create full list of tickers/names across all 3 exchanges
tickers = list(set(list(nasdaq_tickers['Symbol']) + list(nyse_tickers['Symbol']) + list(amex_tickers['Symbol'])))

In [4]:
nasdaq_tickers


Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote
0,TXG,"10x Genomics, Inc.",88.14,$8.67B,2019.0,Capital Goods,Biotechnology: Laboratory Analytical Instruments,https://old.nasdaq.com/symbol/txg
1,YI,"111, Inc.",6.86,$564.99M,2018.0,Health Care,Medical/Nursing Services,https://old.nasdaq.com/symbol/yi
2,PIH,"1347 Property Insurance Holdings, Inc.",4.53,$27.49M,2014.0,Finance,Property-Casualty Insurers,https://old.nasdaq.com/symbol/pih
3,PIHPP,"1347 Property Insurance Holdings, Inc.",25.30,,,Finance,Property-Casualty Insurers,https://old.nasdaq.com/symbol/pihpp
4,TURN,180 Degree Capital Corp.,1.86,$57.89M,,Finance,Finance/Investors Services,https://old.nasdaq.com/symbol/turn
...,...,...,...,...,...,...,...,...
3622,ZS,"Zscaler, Inc.",108.57,$14.17B,2018.0,Technology,EDP Services,https://old.nasdaq.com/symbol/zs
3623,ZUMZ,Zumiez Inc.,26.74,$680.02M,2005.0,Consumer Services,Clothing/Shoe/Accessory Stores,https://old.nasdaq.com/symbol/zumz
3624,ZYNE,"Zynerba Pharmaceuticals, Inc.",6.04,$150.7M,2015.0,Health Care,Major Pharmaceuticals,https://old.nasdaq.com/symbol/zyne
3625,ZYXI,"Zynex, Inc.",24.06,$798.61M,,Health Care,Biotechnology: Electromedical & Electrotherape...,https://old.nasdaq.com/symbol/zyxi


Unfortunately, the SEC indexes company filings by its own internal identifier, the "Central Index Key" (CIK). We'll need to translate tickers into CIKs in order to search for company filings on EDGAR.

(The code below is an edited version of [this gist](https://gist.github.com/dougvk/8499335).)

In [5]:
# TODO: simplify this code by using the ticker/CIK pairings published by the SEC: https://www.sec.gov/include/ticker.txt

def MapTickerToCik(tickers):
    url = 'http://www.sec.gov/cgi-bin/browse-edgar?CIK={}&Find=Search&owner=exclude&action=getcompany'
    cik_re = re.compile(r'.*CIK=(\d{10}).*')

    cik_dict = {}
    for ticker in tqdm(tickers): # Use tqdm lib for progress bar
        results = cik_re.findall(requests.get(url.format(ticker)).text)
        if len(results):
            cik_dict[str(ticker).lower()] = str(results[0])
    
    return cik_dict
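
The comment at the top of the cell above points to a simpler route: the SEC publishes its own ticker-to-CIK pairings. Below is a hedged sketch of parsing that file, assuming it contains one tab-separated `ticker<TAB>cik` pair per line. The function name `parse_ticker_file` and the sample text are illustrative, and the format should be checked against the live file before relying on it.

```python
def parse_ticker_file(text):
    """Parse 'ticker<TAB>cik' lines into a {ticker: 10-digit CIK string} dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.strip().split('\t')
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        ticker, cik = parts
        # Zero-pad to 10 digits to match the CIK strings used in this notebook
        mapping[ticker.lower()] = cik.zfill(10)
    return mapping


# Illustrative sample in the assumed format (not downloaded from the SEC)
sample = "aapl\t320193\nmsft\t789019\n"
cik_lookup = parse_ticker_file(sample)
print(cik_lookup)
```

With a live download, `text` would come from something like `requests.get('https://www.sec.gov/include/ticker.txt').text`. That replaces thousands of individual EDGAR lookups with a single request.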

In [6]:
cik_dict = MapTickerToCik(tickers)

100%|██████████| 7009/7009 [24:16<00:00,  4.81it/s]


In [11]:
cik_dict

In [12]:
# Clean up the ticker-CIK mapping as a DataFrame
ticker_cik_df = pd.DataFrame.from_dict(data=cik_dict, orient='index')
ticker_cik_df.reset_index(inplace=True)
ticker_cik_df.columns = ['ticker', 'cik']
ticker_cik_df['cik'] = [str(cik) for cik in ticker_cik_df['cik']]

Our ultimate goal is to link each ticker to a unique CIK.

However, some CIKs might be linked to multiple tickers. For example, different [share classes](https://www.investopedia.com/terms/s/share_class.asp) within the same company would all be linked to the same CIK. Let's get rid of these duplicate mappings.

In [13]:
# Check for duplicated tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))

Number of ticker-cik pairings: 5205
Number of unique tickers: 5205
Number of unique CIKs: 4854


The counts above show 351 more tickers than CIKs (5,205 vs. 4,854), so at most 351 CIKs (roughly 7%) are linked to multiple tickers. To eliminate the duplicate mappings, we'll simply keep the ticker that comes first in the alphabet. In most cases, this means we'll keep the class A shares of the stock.

It's certainly possible to eliminate duplicates using other methods; for the sake of simplicity, we'll stick with alphabetizing for now. As long as we apply it uniformly across all stocks, it shouldn't introduce any bias.

In [14]:
# Keep first ticker alphabetically for duplicated CIKs
ticker_cik_df = ticker_cik_df.sort_values(by='ticker')
ticker_cik_df.drop_duplicates(subset='cik', keep='first', inplace=True)

In [None]:
ticker_cik_df

In [15]:
# Check that we've eliminated duplicate tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))

Number of ticker-cik pairings: 4854
Number of unique tickers: 4854
Number of unique CIKs: 4854


At this point, we have a list of the CIKs for which we want to obtain 10-Ks and 10-Qs. We can now begin scraping from EDGAR.

As with many web scraping projects, we'll need to keep some technical considerations in mind:

- We're scraping a lot of data, so it's unlikely that we'll be able to do it all in one session without something breaking (most likely scenario: the Wi-Fi disconnects briefly or your laptop goes to sleep). As such, we should make sure that our scraper can easily pick up where it left off without having to re-scrape anything.
- We also probably want to log warnings/errors and save that log, just in case we need to reference it later.
- The SEC limits users to [10 requests per second](https://www.sec.gov/developer), so we need to make sure we're not making requests too quickly.
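
The rate-limit consideration can be handled with a small throttle that is called before each request. The sketch below is one possible approach and is not part of the original scraper. The class name `RateLimiter` and the choice of 8 requests per second (a margin under the SEC's stated limit) are assumptions.

```python
import time


class RateLimiter:
    """Blocks so that consecutive calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None  # time of the previous wait() call, if any

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Assumption: stay a little under the SEC's 10 requests/second limit
limiter = RateLimiter(max_per_second=8)

# Hypothetical usage inside a scraping loop:
#     limiter.wait()
#     res = requests.get(url)
```

Because the limiter only sleeps for whatever time remains in the interval, slow requests are not penalized further; the delay only applies when responses come back faster than the allowed rate.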

In [17]:
def WriteLogFile(log_file_name, text):
    
    '''
    Helper function.
    Writes a log file with all notes and
    error messages from a scraping "session".
    
    Parameters
    ----------
    log_file_name : str
        Name of the log file (should be a .txt file).
    text : str
        Text to write to the log file.
        
    Returns
    -------
    None.
    
    '''
    
    with open(log_file_name, "a") as log_file:
        log_file.write(text)

    return

The function below scrapes all 10-Ks and 10-K405s for one particular CIK. Our web scraper primarily depends on the [`requests`](http://docs.python-requests.org/en/master/) and [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries.

Note that the scraper creates a different directory for each CIK, and puts all the filings for that CIK within that directory. After scraping, your file structure should look like this:


```
- 10Ks
    - CIK1
        - 10K #1
        - 10K #2
        ...
    - CIK2
        - 10K #1
        - 10K #2
        ...
    - CIK3
        - 10K #1
        - 10K #2
        ...
    ...
- 10Qs
    - CIK1
        - 10Q #1
        - 10Q #2
        ...
    - CIK2
        - 10Q #1
        - 10Q #2
        ...
    - CIK3
        - 10Q #1
        - 10Q #2
        ...
    ...
```

The scraper will create the directory for each CIK. However, we need to create different directories to hold our 10-K and 10-Q files. The exact pathname depends on your local setup, so you'll need to fill in the correct `pathname_10k` and `pathname_10q` for your machine.

In [18]:
pathname_10k = '/Volumes/xpg/scrapingEarnings/10-Ks'
pathname_10q = '/Volumes/xpg/scrapingEarnings/10-Qs'
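
If those directories do not exist yet, they can be created from Python before scraping. A minimal sketch using only the standard library follows. The helper name `ensure_dirs` is hypothetical, and the demonstration uses a temporary directory instead of the paths above.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory (and any missing parents) if it does not already exist."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstration in a throwaway location; in the notebook you would call
# ensure_dirs(pathname_10k, pathname_10q) instead.
demo_root = tempfile.mkdtemp()
demo_10k = os.path.join(demo_root, '10-Ks')
demo_10q = os.path.join(demo_root, '10-Qs')

ensure_dirs(demo_10k, demo_10q)
ensure_dirs(demo_10k, demo_10q)  # calling again is safe and does nothing
```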

In [19]:
def Scrape10K(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
    
    '''
    Scrapes all 10-Ks and 10-K405s for a particular 
    CIK from EDGAR.
    
    Parameters
    ----------
    browse_url_base : str
        Base URL for browsing EDGAR.
    filing_url_base : str
        Base URL for filings listings on EDGAR.
    doc_url_base : str
        Base URL for one filing's document tables
        page on EDGAR.
    cik : str
        Central Index Key.
    log_file_name : str
        Name of the log file (should be a .txt file).
        
    Returns
    -------
    None.
    
    '''
    
    # Check if we've already scraped this CIK
    try:
        os.mkdir(cik)
    except OSError:
        print("Already scraped CIK", cik)
        return
    
    # If we haven't, go into the directory for that CIK
    os.chdir(cik)
    
    print('Scraping CIK', cik)
    
    # Request list of 10-K filings
    res = requests.get(browse_url_base % cik)
    
    # If the request failed, log the failure and exit
    if res.status_code != 200:
        os.chdir('..')
        os.rmdir(cik) # remove empty dir
        text = "Request failed with error code " + str(res.status_code) + \
               "\nFailed URL: " + (browse_url_base % cik) + '\n'
        WriteLogFile(log_file_name, text)
        return

    # If the request doesn't fail, continue...
    
    # Parse the response HTML using BeautifulSoup
    soup = bs.BeautifulSoup(res.text, "lxml")

    # Extract all tables from the response
    html_tables = soup.find_all('table')
    
    # Check that the table we're looking for exists
    # If it doesn't, exit
    if len(html_tables)<3:
        os.chdir('..')
        return
    
    # Parse the Filings table
    filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
    filings_table['Filings'] = [str(x) for x in filings_table['Filings']]

    # Get only 10-K and 10-K405 document filings
    filings_table = filings_table[(filings_table['Filings'] == '10-K') | (filings_table['Filings'] == '10-K405')]

    # If filings table doesn't have any
    # 10-Ks or 10-K405s, exit
    if len(filings_table)==0:
        os.chdir('..')
        return
    
    # Get accession number for each 10-K and 10-K405 filing
    filings_table['Acc_No'] = [x.replace('\xa0',' ')
                               .split('Acc-no: ')[1]
                               .split(' ')[0] for x in filings_table['Description']]

    # Iterate through each filing and 
    # scrape the corresponding document...
    for index, row in filings_table.iterrows():
        
        # Get the accession number for the filing
        acc_no = str(row['Acc_No'])
        
        # Navigate to the page for the filing
        docs_page = requests.get(filing_url_base % (cik, acc_no))
        
        # If request fails, log the failure
        # and skip to the next filing
        if docs_page.status_code != 200:
            os.chdir('..')
            text = "Request failed with error code " + str(docs_page.status_code) + \
                   "\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue

        # If request succeeds, keep going...
        
        # Parse the table of documents for the filing
        docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
        docs_html_tables = docs_page_soup.find_all('table')
        if len(docs_html_tables)==0:
            continue
        docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
        docs_table['Type'] = [str(x) for x in docs_table['Type']]
        
        # Get the 10-K and 10-K405 entries for the filing
        docs_table = docs_table[(docs_table['Type'] == '10-K') | (docs_table['Type'] == '10-K405')]
        
        # If there aren't any 10-K or 10-K405 entries,
        # skip to the next filing
        if len(docs_table)==0:
            continue
        # If there are 10-K or 10-K405 entries,
        # grab the first document
        elif len(docs_table)>0:
            docs_table = docs_table.iloc[0]
        
        docname = docs_table['Document']
        
        # If that first entry is unavailable,
        # log the failure and skip to the next filing
        if str(docname) == 'nan':
            os.chdir('..')
            text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue       
        
        # If it is available, continue...
        
        # Request the file
        file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
        
        # If the request fails, log the failure and skip to the next filing
        if file.status_code != 200:
            os.chdir('..')
            text = "Request failed with error code " + str(file.status_code) + \
                   "\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue
        
        # If it succeeds, keep going...
        
        # Save the file in the appropriate format
        # (.txt if the source document is plain text, .html otherwise)
        date = str(row['Filing Date'])
        extension = '.txt' if '.txt' in docname else '.html'
        filename = cik + '_' + date + extension
        with open(filename, 'a') as out_file:
            out_file.write(file.text)
        
    # Move back to the main 10-K directory
    os.chdir('..')
        
    return
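
The accession-number parsing and URL construction inside `Scrape10K` are easy to check in isolation. The sketch below applies the same string operations to a sample `Description` value. The sample string is an illustrative assumption modelled on EDGAR's filing-list layout, and `extract_acc_no` and the document name are hypothetical.

```python
def extract_acc_no(description):
    """Pull the accession number out of an EDGAR filing 'Description' cell."""
    return (description.replace('\xa0', ' ')
                       .split('Acc-no: ')[1]
                       .split(' ')[0])


# Illustrative Description text (assumed format, including a non-breaking space)
sample_description = ('Annual report [Section 13 and 15(d), not S-K Item 405]'
                      'Acc-no: 0000320193-19-000119\xa0(34 Act)  Size: 12 MB')
acc_no = extract_acc_no(sample_description)

# URL templates copied from this notebook's 10-K scraping parameters
filing_url_base = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'

cik = '0000320193'
docname = 'example-10k.htm'  # hypothetical document name

# The filing index page keeps the dashes; the document path drops them
filing_url = filing_url_base % (cik, acc_no)
doc_url = doc_url_base % (cik, acc_no.replace('-', ''), docname)

print(acc_no)
print(filing_url)
print(doc_url)
```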

In [20]:
def Scrape10Q(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
    
    '''
    Scrapes all 10-Qs for a particular CIK from EDGAR.
    
    Parameters
    ----------
    browse_url_base : str
        Base URL for browsing EDGAR.
    filing_url_base : str
        Base URL for filings listings on EDGAR.
    doc_url_base : str
        Base URL for one filing's document tables
        page on EDGAR.
    cik : str
        Central Index Key.
    log_file_name : str
        Name of the log file (should be a .txt file).
        
    Returns
    -------
    None.
    
    '''
    
    # Check if we've already scraped this CIK
    try:
        os.mkdir(cik)
    except OSError:
        print("Already scraped CIK", cik)
        return
    
    # If we haven't, go into the directory for that CIK
    os.chdir(cik)
    
    print('Scraping CIK', cik)
    
    # Request list of 10-Q filings
    res = requests.get(browse_url_base % cik)
    
    # If the request failed, log the failure and exit
    if res.status_code != 200:
        os.chdir('..')
        os.rmdir(cik) # remove empty dir
        text = "Request failed with error code " + str(res.status_code) + \
               "\nFailed URL: " + (browse_url_base % cik) + '\n'
        WriteLogFile(log_file_name, text)
        return
    
    # If the request doesn't fail, continue...

    # Parse the response HTML using BeautifulSoup
    soup = bs.BeautifulSoup(res.text, "lxml")

    # Extract all tables from the response
    html_tables = soup.find_all('table')
    
    # Check that the table we're looking for exists
    # If it doesn't, exit
    if len(html_tables)<3:
        print("table too short")
        os.chdir('..')
        return
    
    # Parse the Filings table
    filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
    filings_table['Filings'] = [str(x) for x in filings_table['Filings']]

    # Get only 10-Q document filings
    filings_table = filings_table[filings_table['Filings'] == '10-Q']

    # If filings table doesn't have any
    # 10-Qs, exit
    if len(filings_table)==0:
        os.chdir('..')
        return
    
    # Get accession number for each 10-Q filing
    filings_table['Acc_No'] = [x.replace('\xa0',' ')
                               .split('Acc-no: ')[1]
                               .split(' ')[0] for x in filings_table['Description']]

    # Iterate through each filing and 
    # scrape the corresponding document...
    for index, row in filings_table.iterrows():
        
        # Get the accession number for the filing
        acc_no = str(row['Acc_No'])
        
        # Navigate to the page for the filing
        docs_page = requests.get(filing_url_base % (cik, acc_no))
        
        # If request fails, log the failure
        # and skip to the next filing    
        if docs_page.status_code != 200:
            os.chdir('..')
            text = "Request failed with error code " + str(docs_page.status_code) + \
                   "\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue
            
        # If request succeeds, keep going...
        
        # Parse the table of documents for the filing
        docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
        docs_html_tables = docs_page_soup.find_all('table')
        if len(docs_html_tables)==0:
            continue
        docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
        docs_table['Type'] = [str(x) for x in docs_table['Type']]
        
        # Get the 10-Q entries for the filing
        docs_table = docs_table[docs_table['Type'] == '10-Q']
        
        # If there aren't any 10-Q entries,
        # skip to the next filing
        if len(docs_table)==0:
            continue
        # If there are 10-Q entries,
        # grab the first document
        elif len(docs_table)>0:
            docs_table = docs_table.iloc[0]
        
        docname = docs_table['Document']
        
        # If that first entry is unavailable,
        # log the failure and skip to the next filing
        if str(docname) == 'nan':
            os.chdir('..')
            text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue       
        
        # If it is available, continue...
        
        # Request the file
        file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
        
        # If the request fails, log the failure and skip to the next filing
        if file.status_code != 200:
            os.chdir('..')
            text = "Request failed with error code " + str(file.status_code) + \
                   "\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
            WriteLogFile(log_file_name, text)
            os.chdir(cik)
            continue
            
        # If it succeeds, keep going...
        
        # Save the file in the appropriate format
        # (.txt if the source document is plain text, .html otherwise)
        date = str(row['Filing Date'])
        extension = '.txt' if '.txt' in docname else '.html'
        filename = cik + '_' + date + extension
        with open(filename, 'a') as out_file:
            out_file.write(file.text)
        
    # Move back to the main 10-Q directory
    os.chdir('..')
        
    return
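
One caveat of the `os.mkdir` check used in both scrapers: a CIK's directory is created before any filings are downloaded, so a session that dies midway through a CIK leaves a partial directory that later runs will skip. One possible refinement, sketched here under the assumption that an extra marker file in each CIK directory is acceptable, is to record completion explicitly. The names `DONE_MARKER`, `is_complete`, and `mark_complete` are hypothetical and do not appear in the original scraper.

```python
import os
import tempfile

DONE_MARKER = '_DONE'  # hypothetical marker file name


def is_complete(cik_dir):
    """True only if a previous run finished scraping this CIK."""
    return os.path.exists(os.path.join(cik_dir, DONE_MARKER))


def mark_complete(cik_dir):
    """Write an empty marker file once every filing for the CIK is saved."""
    with open(os.path.join(cik_dir, DONE_MARKER), 'w') as marker:
        marker.write('')


# Demonstration in a temporary directory
root = tempfile.mkdtemp()
cik_dir = os.path.join(root, '0000320193')
os.makedirs(cik_dir, exist_ok=True)

before = is_complete(cik_dir)   # directory exists, but scraping never finished
mark_complete(cik_dir)          # would be called at the very end of the scraper
after = is_complete(cik_dir)

print(before, after)
```

Under this scheme, the scraper would skip a CIK only when `is_complete` returns `True`, and would otherwise clear and re-scrape the partial directory.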

Now that we've defined our scraper functions, let's scrape. 

(A note from the future: we're scraping a lot of data, which takes *time* and *space*. For reference, these functions ultimately scraped 170 GB of 10-Qs and 125 GB of 10-Ks; the scraping took roughly 20 hours total.)

In [25]:
%%capture
# Run the function to scrape 10-Ks

# Define parameters
browse_url_base_10k = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-K'
filing_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'

# Set correct directory
os.chdir(pathname_10k)

# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
with open(log_file_name, 'a'):
    pass  # opening in append mode creates the file if it doesn't exist

# Iterate over CIKs and scrape 10-Ks
for cik in tqdm(ticker_cik_df['cik']):
    Scrape10K(browse_url_base=browse_url_base_10k, filing_url_base=filing_url_base_10k, doc_url_base=doc_url_base_10k, cik=cik, log_file_name=log_file_name)

 21%|██        | 996/4854 [00:00<00:00, 5185.61it/s]

Already scraped CIK 0001090872
Already scraped CIK 0001675149
Already scraped CIK 0001420529
Already scraped CIK 0000006201
Already scraped CIK 0001555074
Already scraped CIK 0000008177
Already scraped CIK 0000706688
Already scraped CIK 0001158114
Already scraped CIK 0000824142
Already scraped CIK 0001158449
Already scraped CIK 0000320193
Already scraped CIK 0001500217
Already scraped CIK 0001015647
Already scraped CIK 0001135185
Already scraped CIK 0001069183
Already scraped CIK 0000825313
Already scraped CIK 0001091587
Already scraped CIK 0001551152
Already scraped CIK 0001140859
Already scraped CIK 0000351569
Already scraped CIK 0000318306
Already scraped CIK 0001565025
Already scraped CIK 0001144980
Already scraped CIK 0000907654
Already scraped CIK 0000771497
Already scraped CIK 0000815094
Already scraped CIK 0001253986
Already scraped CIK 0000001800
Already scraped CIK 0001642081
Already scraped CIK 0001447028
Already scraped CIK 0001642122
Already scraped CIK 0001739445
Already 

 26%|██▌       | 1243/4854 [00:00<00:01, 3384.37it/s]

Already scraped CIK 0000811156
Already scraped CIK 0001026655
Already scraped CIK 0000023197
Already scraped CIK 0000021175
Already scraped CIK 0000812348
Already scraped CIK 0001071739
Already scraped CIK 0001367920
Already scraped CIK 0001677703
Already scraped CIK 0001376321
Already scraped CIK 0001733868
Already scraped CIK 0001502292
Already scraped CIK 0001567094
Already scraped CIK 0000016868
Already scraped CIK 0001385280
Already scraped CIK 0000816956
Already scraped CIK 0001787005
Already scraped CIK 0001704720
Already scraped CIK 0001224608
Already scraped CIK 0000712771
Already scraped CIK 0001130310
Already scraped CIK 0001017413
Already scraped CIK 0000883902
Already scraped CIK 0001284812
Already scraped CIK 0001304421
Already scraped CIK 0001729427
Already scraped CIK 0001434418
Already scraped CIK 0001757097
Already scraped CIK 0000911147
Already scraped CIK 0001070412
Already scraped CIK 0001610418
Already scraped CIK 0001050377
Already scraped CIK 0001467808
Already 

 40%|████      | 1955/4854 [00:00<00:00, 3209.35it/s]

Already scraped CIK 0001395213
Already scraped CIK 0001717556
Already scraped CIK 0001731388
Already scraped CIK 0001540159
Already scraped CIK 0001372920
Already scraped CIK 0000031667
Already scraped CIK 0000031978
Already scraped CIK 0001029199
Already scraped CIK 0001579214
Already scraped CIK 0001411342
Already scraped CIK 0000924168
Already scraped CIK 0001025835
Already scraped CIK 0000033185
Already scraped CIK 0001066194
Already scraped CIK 0001050441
Already scraped CIK 0001023731
Already scraped CIK 0001322439
Already scraped CIK 0000918608
Already scraped CIK 0001065332
Already scraped CIK 0000049600
Already scraped CIK 0000827871
Already scraped CIK 0000894627
Already scraped CIK 0001759783
Already scraped CIK 0000785161
Already scraped CIK 0001333493
Already scraped CIK 0001731831
Already scraped CIK 0001379041
Already scraped CIK 0001237746
Already scraped CIK 0001305253
Already scraped CIK 0000827052
Already scraped CIK 0001549084
Already scraped CIK 0001001250
Already 

 54%|█████▍    | 2613/4854 [00:00<00:00, 3127.49it/s]

Already scraped CIK 0000925528
Already scraped CIK 0000354707
Already scraped CIK 0001493761
Already scraped CIK 0001603993
Already scraped CIK 0001803901
Already scraped CIK 0001339605
Already scraped CIK 0000046619
Already scraped CIK 0000916789
Already scraped CIK 0001283140
Already scraped CIK 0001583771
Already scraped CIK 0000004447
Already scraped CIK 0001789832
Already scraped CIK 0001690947
Already scraped CIK 0001500375
Already scraped CIK 0000048039
Already scraped CIK 0001680873
Already scraped CIK 0001046025
Already scraped CIK 0000874766
Already scraped CIK 0001158420
Already scraped CIK 0001674168
Already scraped CIK 0001498828
Already scraped CIK 0001721181
Already scraped CIK 0000045919
Already scraped CIK 0001747661
Already scraped CIK 0001417398
Already scraped CIK 0001017480
Already scraped CIK 0001026785
Already scraped CIK 0001501585
Already scraped CIK 0001287808
Already scraped CIK 0001342338
Already scraped CIK 0000921082
Already scraped CIK 0001661053
Already 

 68%|██████▊   | 3312/4854 [00:00<00:00, 3308.03it/s]

Already scraped CIK 0001668397
Already scraped CIK 0001382574
Already scraped CIK 0001078099
Already scraped CIK 0000065270
Already scraped CIK 0001262104
Already scraped CIK 0001099590
Already scraped CIK 0000886977
Already scraped CIK 0001333274
Already scraped CIK 0000810332
Already scraped CIK 0001345099
Already scraped CIK 0001099219
Already scraped CIK 0001687187
Already scraped CIK 0001796514
Already scraped CIK 0001055160
Already scraped CIK 0001725872
Already scraped CIK 0001086888
Already scraped CIK 0001335730
Already scraped CIK 0001527762
Already scraped CIK 0001000209
Already scraped CIK 0000036506
Already scraped CIK 0001436126
Already scraped CIK 0000749098
Already scraped CIK 0001161728
Already scraped CIK 0001590750
Already scraped CIK 0001273931
Already scraped CIK 0000876779
Already scraped CIK 0000019411
Already scraped CIK 0000789570
Already scraped CIK 0001125345
Already scraped CIK 0001656936
Already scraped CIK 0000835011
Already scraped CIK 0000752714
Already 

 75%|███████▌  | 3646/4854 [00:01<00:00, 3317.27it/s]

Already scraped CIK 0000315131
Already scraped CIK 0001114995
Already scraped CIK 0001772720
Already scraped CIK 0000830122
Already scraped CIK 0001591890
Already scraped CIK 0000931015
Already scraped CIK 0001577916
Already scraped CIK 0001786117
Already scraped CIK 0001679826
Already scraped CIK 0001506293
Already scraped CIK 0001230245
Already scraped CIK 0001583648
Already scraped CIK 0001675634
Already scraped CIK 0001137774
Already scraped CIK 0001626115
Already scraped CIK 0001617406
Already scraped CIK 0001315399
Already scraped CIK 0000076267
Already scraped CIK 0000075677
Already scraped CIK 0000031791
Already scraped CIK 0000076282
Already scraped CIK 0000889132
Already scraped CIK 0000810136
Already scraped CIK 0001117057
Already scraped CIK 0001540755
Already scraped CIK 0001525769
Already scraped CIK 0001168455
Already scraped CIK 0001041859
Already scraped CIK 0001045609
Already scraped CIK 0001095052
Already scraped CIK 0001735556
Already scraped CIK 0001728205

 79%|███████▉  | 3844/4854 [00:25<01:08, 14.69it/s]  

Scraping CIK 0001794717
Scraping CIK 0001468666


 79%|███████▉  | 3845/4854 [00:27<11:39,  1.44it/s]

Scraping CIK 0000093676


 79%|███████▉  | 3846/4854 [00:39<1:08:04,  4.05s/it]

Scraping CIK 0001178253


 79%|███████▉  | 3847/4854 [00:42<1:05:11,  3.88s/it]

Scraping CIK 0001349436


 79%|███████▉  | 3848/4854 [00:50<1:25:21,  5.09s/it]

Scraping CIK 0001775625


 79%|███████▉  | 3849/4854 [00:51<1:02:10,  3.71s/it]

Scraping CIK 0001490978


 79%|███████▉  | 3850/4854 [00:51<46:44,  2.79s/it]  

Scraping CIK 0000911649


 79%|███████▉  | 3851/4854 [01:03<1:31:11,  5.46s/it]

Scraping CIK 0001600422


 79%|███████▉  | 3853/4854 [01:07<57:39,  3.46s/it]  

Scraping CIK 0001737706
Scraping CIK 0001703399


 79%|███████▉  | 3854/4854 [01:07<41:00,  2.46s/it]

Scraping CIK 0001019671


 79%|███████▉  | 3855/4854 [01:19<1:29:22,  5.37s/it]

Scraping CIK 0001564902


 79%|███████▉  | 3856/4854 [01:24<1:29:17,  5.37s/it]

Scraping CIK 0000088121


 79%|███████▉  | 3858/4854 [01:36<1:23:29,  5.03s/it]

Scraping CIK 0001633441
Scraping CIK 0001419612


 80%|███████▉  | 3859/4854 [01:39<1:15:35,  4.56s/it]

Scraping CIK 0001012100


 80%|███████▉  | 3861/4854 [01:51<1:18:35,  4.75s/it]

Scraping CIK 0001321851
Scraping CIK 0001017491


 80%|███████▉  | 3862/4854 [02:00<1:39:05,  5.99s/it]

Scraping CIK 0000350894


 80%|███████▉  | 3863/4854 [02:11<2:04:08,  7.52s/it]

Scraping CIK 0001453687


 80%|███████▉  | 3864/4854 [02:13<1:36:22,  5.84s/it]

Scraping CIK 0001031235


 80%|███████▉  | 3865/4854 [02:15<1:19:17,  4.81s/it]

Scraping CIK 0001320414


 80%|███████▉  | 3866/4854 [02:22<1:28:24,  5.37s/it]

Scraping CIK 0000088948


 80%|███████▉  | 3867/4854 [02:33<1:55:18,  7.01s/it]

Scraping CIK 0001616543


 80%|███████▉  | 3868/4854 [02:37<1:41:07,  6.15s/it]

Scraping CIK 0001428875


 80%|███████▉  | 3869/4854 [02:41<1:31:09,  5.55s/it]

Scraping CIK 0001485003


 80%|███████▉  | 3870/4854 [02:46<1:26:01,  5.24s/it]

Scraping CIK 0000720672


 80%|███████▉  | 3871/4854 [02:59<2:05:47,  7.68s/it]

Scraping CIK 0001541119


 80%|███████▉  | 3872/4854 [03:05<1:54:36,  7.00s/it]

Scraping CIK 0001430723


 80%|███████▉  | 3873/4854 [03:13<2:00:44,  7.39s/it]

Scraping CIK 0000086115


 80%|███████▉  | 3874/4854 [03:26<2:28:59,  9.12s/it]

Scraping CIK 0001725332


 80%|███████▉  | 3875/4854 [03:26<1:45:39,  6.48s/it]

Scraping CIK 0001576942


 80%|███████▉  | 3877/4854 [03:28<56:39,  3.48s/it]  

Scraping CIK 0001289877
Scraping CIK 0001575515


 80%|███████▉  | 3878/4854 [03:32<1:00:19,  3.71s/it]

Scraping CIK 0000090498


 80%|███████▉  | 3879/4854 [03:47<1:57:08,  7.21s/it]

Scraping CIK 0001090009


 80%|███████▉  | 3880/4854 [04:00<2:22:32,  8.78s/it]

Scraping CIK 0001753539


 80%|███████▉  | 3882/4854 [04:01<1:13:10,  4.52s/it]

Scraping CIK 0001294404
Scraping CIK 0001744894


 80%|███████▉  | 3883/4854 [04:03<1:01:36,  3.81s/it]

Scraping CIK 0000886136


 80%|████████  | 3884/4854 [04:16<1:45:19,  6.51s/it]

Scraping CIK 0001023994


 80%|████████  | 3885/4854 [04:21<1:41:32,  6.29s/it]

Scraping CIK 0000095574


 80%|████████  | 3886/4854 [04:34<2:10:02,  8.06s/it]

Scraping CIK 0001060736


 80%|████████  | 3887/4854 [04:44<2:23:14,  8.89s/it]

Scraping CIK 0001616533


 80%|████████  | 3888/4854 [04:47<1:50:12,  6.85s/it]

Scraping CIK 0000788611


 80%|████████  | 3889/4854 [04:54<1:50:40,  6.88s/it]

Scraping CIK 0000915358


 80%|████████  | 3890/4854 [05:06<2:16:12,  8.48s/it]

Scraping CIK 0001001233


 80%|████████  | 3891/4854 [05:17<2:30:33,  9.38s/it]

Scraping CIK 0000750004


 80%|████████  | 3892/4854 [05:32<2:53:58, 10.85s/it]

Scraping CIK 0001412095


 80%|████████  | 3893/4854 [05:33<2:07:59,  7.99s/it]

Scraping CIK 0001004989


 80%|████████  | 3894/4854 [05:47<2:36:31,  9.78s/it]

Scraping CIK 0001638833


 80%|████████  | 3895/4854 [05:51<2:08:38,  8.05s/it]

Scraping CIK 0001002590


 80%|████████  | 3896/4854 [06:04<2:31:29,  9.49s/it]

Scraping CIK 0001620533


 80%|████████  | 3897/4854 [06:08<2:06:06,  7.91s/it]

Scraping CIK 0001035092


 80%|████████  | 3898/4854 [06:23<2:39:25, 10.01s/it]

Scraping CIK 0000354963


 80%|████████  | 3900/4854 [06:36<2:02:50,  7.73s/it]

Scraping CIK 0001263043
Scraping CIK 0000908732


 80%|████████  | 3902/4854 [06:36<1:01:09,  3.85s/it]

Scraping CIK 0001448397
Scraping CIK 0001759631


 80%|████████  | 3903/4854 [06:37<46:08,  2.91s/it]  

Scraping CIK 0000904979


 80%|████████  | 3904/4854 [06:49<1:27:15,  5.51s/it]

Scraping CIK 0001610466


 80%|████████  | 3905/4854 [06:52<1:18:14,  4.95s/it]

Scraping CIK 0001295810


 80%|████████  | 3906/4854 [07:02<1:41:17,  6.41s/it]

Scraping CIK 0000913241


 81%|████████  | 3908/4854 [07:15<1:30:42,  5.75s/it]

Scraping CIK 0001594805
Scraping CIK 0001506439


 81%|████████  | 3909/4854 [07:20<1:30:47,  5.76s/it]

Scraping CIK 0000089800


 81%|████████  | 3910/4854 [07:32<1:59:23,  7.59s/it]

Scraping CIK 0000743238


 81%|████████  | 3911/4854 [07:46<2:27:26,  9.38s/it]

Scraping CIK 0001312109


 81%|████████  | 3912/4854 [07:46<1:45:18,  6.71s/it]

Scraping CIK 0001459839


 81%|████████  | 3913/4854 [07:47<1:19:36,  5.08s/it]

Scraping CIK 0001723866


 81%|████████  | 3915/4854 [07:49<44:06,  2.82s/it]  

Scraping CIK 0001049659
Scraping CIK 0000065596


 81%|████████  | 3916/4854 [07:59<1:19:45,  5.10s/it]

Scraping CIK 0001551693


 81%|████████  | 3917/4854 [08:03<1:14:07,  4.75s/it]

Scraping CIK 0000090168


 81%|████████  | 3919/4854 [08:16<1:17:19,  4.96s/it]

Scraping CIK 0001094324
Scraping CIK 0000832988


 81%|████████  | 3920/4854 [08:23<1:27:19,  5.61s/it]

Scraping CIK 0001010086


 81%|████████  | 3921/4854 [08:31<1:37:13,  6.25s/it]

Scraping CIK 0000230557


 81%|████████  | 3923/4854 [08:44<1:31:20,  5.89s/it]

Scraping CIK 0000916793
Scraping CIK 0001397702


 81%|████████  | 3925/4854 [08:45<47:12,  3.05s/it]  

Scraping CIK 0001659520
Scraping CIK 0000887153


 81%|████████  | 3927/4854 [08:45<24:18,  1.57s/it]

Scraping CIK 0001329394
Scraping CIK 0001094005


 81%|████████  | 3928/4854 [08:49<34:52,  2.26s/it]

Scraping CIK 0001422892


 81%|████████  | 3929/4854 [08:56<57:02,  3.70s/it]

Scraping CIK 0001269026


 81%|████████  | 3930/4854 [09:01<1:00:40,  3.94s/it]

Scraping CIK 0000908937


 81%|████████  | 3931/4854 [09:13<1:41:25,  6.59s/it]

Scraping CIK 0000894315


 81%|████████  | 3932/4854 [09:28<2:20:09,  9.12s/it]

Scraping CIK 0001650729


 81%|████████  | 3933/4854 [09:31<1:49:10,  7.11s/it]

Scraping CIK 0001451809


 81%|████████  | 3934/4854 [09:32<1:20:06,  5.22s/it]

Scraping CIK 0000719739


 81%|████████  | 3935/4854 [09:46<2:02:47,  8.02s/it]

Scraping CIK 0000701374


 81%|████████  | 3936/4854 [09:58<2:21:02,  9.22s/it]

Scraping CIK 0001753673


 81%|████████  | 3937/4854 [09:59<1:44:35,  6.84s/it]

Scraping CIK 0000091928


 81%|████████  | 3938/4854 [10:14<2:21:12,  9.25s/it]

Scraping CIK 0000091419


 81%|████████  | 3940/4854 [10:27<1:51:13,  7.30s/it]

Scraping CIK 0000932872
Scraping CIK 0000319655


 81%|████████  | 3941/4854 [10:38<2:05:30,  8.25s/it]

Scraping CIK 0000766829


 81%|████████  | 3943/4854 [10:50<1:40:14,  6.60s/it]

Scraping CIK 0001015650
Scraping CIK 0000899715


 81%|████████▏ | 3944/4854 [11:03<2:09:21,  8.53s/it]

Scraping CIK 0001065837


 81%|████████▏ | 3945/4854 [11:15<2:23:00,  9.44s/it]

Scraping CIK 0000090896


 81%|████████▏ | 3947/4854 [11:25<1:43:55,  6.87s/it]

Scraping CIK 0001594124
Scraping CIK 0000793733


 81%|████████▏ | 3948/4854 [11:37<2:05:44,  8.33s/it]

Scraping CIK 0001038074


 81%|████████▏ | 3949/4854 [11:47<2:14:40,  8.93s/it]

Scraping CIK 0000087347


 81%|████████▏ | 3950/4854 [11:59<2:25:58,  9.69s/it]

Scraping CIK 0001524741


 81%|████████▏ | 3951/4854 [12:04<2:04:33,  8.28s/it]

Scraping CIK 0001263762


 81%|████████▏ | 3952/4854 [12:13<2:06:52,  8.44s/it]

Scraping CIK 0001707502


 81%|████████▏ | 3954/4854 [12:15<1:09:24,  4.63s/it]

Scraping CIK 0001097362
Scraping CIK 0001040971


 81%|████████▏ | 3955/4854 [12:41<2:46:13, 11.09s/it]

Scraping CIK 0001621672


 82%|████████▏ | 3957/4854 [12:42<1:24:23,  5.64s/it]

Scraping CIK 0001684693
Scraping CIK 0000849869


 82%|████████▏ | 3958/4854 [12:54<1:51:26,  7.46s/it]

Scraping CIK 0001032033


 82%|████████▏ | 3959/4854 [13:07<2:18:20,  9.27s/it]

Scraping CIK 0001484565


 82%|████████▏ | 3960/4854 [13:12<1:57:42,  7.90s/it]

Scraping CIK 0001023459


 82%|████████▏ | 3962/4854 [13:19<1:20:40,  5.43s/it]

Scraping CIK 0001794783
Scraping CIK 0001418076


 82%|████████▏ | 3963/4854 [13:26<1:25:08,  5.73s/it]

Scraping CIK 0001615219


 82%|████████▏ | 3964/4854 [13:30<1:18:02,  5.26s/it]

Scraping CIK 0001390478


 82%|████████▏ | 3965/4854 [13:37<1:24:38,  5.71s/it]

Scraping CIK 0000893538


 82%|████████▏ | 3966/4854 [13:51<2:01:40,  8.22s/it]

Scraping CIK 0001366561


 82%|████████▏ | 3967/4854 [13:52<1:30:55,  6.15s/it]

Scraping CIK 0000916907


 82%|████████▏ | 3968/4854 [14:04<1:56:15,  7.87s/it]

Scraping CIK 0001038773


 82%|████████▏ | 3969/4854 [14:14<2:06:03,  8.55s/it]

Scraping CIK 0001375365


 82%|████████▏ | 3970/4854 [14:22<2:03:12,  8.36s/it]

Scraping CIK 0000898770


 82%|████████▏ | 3972/4854 [14:30<1:24:23,  5.74s/it]

Scraping CIK 0001022837
Scraping CIK 0000825542


 82%|████████▏ | 3973/4854 [14:42<1:53:54,  7.76s/it]

Scraping CIK 0001690334


 82%|████████▏ | 3974/4854 [14:44<1:29:29,  6.10s/it]

Scraping CIK 0000922612


 82%|████████▏ | 3975/4854 [14:56<1:52:23,  7.67s/it]

Scraping CIK 0001549922


 82%|████████▏ | 3976/4854 [15:00<1:38:47,  6.75s/it]

Scraping CIK 0000811808


 82%|████████▏ | 3978/4854 [15:13<1:26:57,  5.96s/it]

Scraping CIK 0001599298
Scraping CIK 0000093389


 82%|████████▏ | 3979/4854 [15:25<1:55:57,  7.95s/it]

Scraping CIK 0001702744


 82%|████████▏ | 3980/4854 [15:27<1:29:52,  6.17s/it]

Scraping CIK 0000884940


 82%|████████▏ | 3981/4854 [15:40<1:55:44,  7.95s/it]

Scraping CIK 0000948708


 82%|████████▏ | 3982/4854 [15:51<2:12:29,  9.12s/it]

Scraping CIK 0000088941


 82%|████████▏ | 3983/4854 [16:04<2:29:02, 10.27s/it]

Scraping CIK 0001705259


 82%|████████▏ | 3984/4854 [16:05<1:45:21,  7.27s/it]

Scraping CIK 0001108320


 82%|████████▏ | 3985/4854 [16:16<2:01:36,  8.40s/it]

Scraping CIK 0000091440


 82%|████████▏ | 3986/4854 [16:28<2:20:43,  9.73s/it]

Scraping CIK 0001564408


 82%|████████▏ | 3987/4854 [16:30<1:46:17,  7.36s/it]

Scraping CIK 0000827187


 82%|████████▏ | 3988/4854 [16:42<2:05:01,  8.66s/it]

Scraping CIK 0001357459


 82%|████████▏ | 3989/4854 [16:48<1:54:44,  7.96s/it]

Scraping CIK 0001131554


 82%|████████▏ | 3990/4854 [16:57<1:57:40,  8.17s/it]

Scraping CIK 0001529628


 82%|████████▏ | 3991/4854 [16:59<1:32:28,  6.43s/it]

Scraping CIK 0001326089


 82%|████████▏ | 3993/4854 [17:00<47:54,  3.34s/it]  

Scraping CIK 0001766600
Scraping CIK 0001692063


 82%|████████▏ | 3994/4854 [17:02<40:15,  2.81s/it]

Scraping CIK 0001395937


 82%|████████▏ | 3996/4854 [17:05<30:52,  2.16s/it]

Scraping CIK 0000313838
Scraping CIK 0001680378


 82%|████████▏ | 3997/4854 [17:15<1:03:20,  4.43s/it]

Scraping CIK 0000318673


 82%|████████▏ | 3998/4854 [17:29<1:42:11,  7.16s/it]

Scraping CIK 0000812796


 82%|████████▏ | 3999/4854 [17:35<1:39:33,  6.99s/it]

Scraping CIK 0001362705


 82%|████████▏ | 4001/4854 [17:44<1:15:13,  5.29s/it]

Scraping CIK 0000845982
Scraping CIK 0001367083


 82%|████████▏ | 4003/4854 [17:51<56:55,  4.01s/it]  

Scraping CIK 0001123658
Scraping CIK 0000883241


 82%|████████▏ | 4004/4854 [18:02<1:27:00,  6.14s/it]

Scraping CIK 0001610114


 83%|████████▎ | 4005/4854 [18:06<1:18:03,  5.52s/it]

Scraping CIK 0001061027


 83%|████████▎ | 4006/4854 [18:14<1:29:54,  6.36s/it]

Scraping CIK 0000018349


 83%|████████▎ | 4007/4854 [18:28<2:00:11,  8.51s/it]

Scraping CIK 0001177394


 83%|████████▎ | 4009/4854 [18:39<1:31:45,  6.52s/it]

Scraping CIK 0001121404
Scraping CIK 0000092122


 83%|████████▎ | 4011/4854 [19:01<1:50:10,  7.84s/it]

Scraping CIK 0001713947
Scraping CIK 0001301236


 83%|████████▎ | 4013/4854 [19:11<1:22:47,  5.91s/it]

Scraping CIK 0001734107
Scraping CIK 0001697500


 83%|████████▎ | 4015/4854 [19:13<47:04,  3.37s/it]  

Scraping CIK 0001417892
Scraping CIK 0001637736


 83%|████████▎ | 4016/4854 [19:13<33:27,  2.40s/it]

Scraping CIK 0001548187


 83%|████████▎ | 4017/4854 [19:14<27:38,  1.98s/it]

Scraping CIK 0000091767


 83%|████████▎ | 4018/4854 [19:26<1:10:29,  5.06s/it]

Scraping CIK 0001325670


 83%|████████▎ | 4019/4854 [19:35<1:26:05,  6.19s/it]

Scraping CIK 0001178697


 83%|████████▎ | 4020/4854 [19:36<1:03:44,  4.59s/it]

Scraping CIK 0001106838


 83%|████████▎ | 4021/4854 [19:44<1:18:18,  5.64s/it]

Scraping CIK 0001314727


 83%|████████▎ | 4022/4854 [19:46<1:01:04,  4.40s/it]

Scraping CIK 0001059262


 83%|████████▎ | 4023/4854 [19:59<1:37:08,  7.01s/it]

Scraping CIK 0001720990


 83%|████████▎ | 4024/4854 [20:00<1:12:48,  5.26s/it]

Scraping CIK 0000109177


 83%|████████▎ | 4026/4854 [20:14<1:16:25,  5.54s/it]

Scraping CIK 0001291855
Scraping CIK 0001706946


 83%|████████▎ | 4027/4854 [20:16<59:49,  4.34s/it]  

Scraping CIK 0001163668


 83%|████████▎ | 4028/4854 [20:16<45:00,  3.27s/it]

Scraping CIK 0001063761


 83%|████████▎ | 4029/4854 [20:28<1:20:12,  5.83s/it]

Scraping CIK 0000064040


 83%|████████▎ | 4030/4854 [20:40<1:43:54,  7.57s/it]

Scraping CIK 0001005210


 83%|████████▎ | 4031/4854 [20:52<2:03:51,  9.03s/it]

Scraping CIK 0001210618


 83%|████████▎ | 4032/4854 [20:56<1:42:10,  7.46s/it]

Scraping CIK 0001606268


 83%|████████▎ | 4033/4854 [20:59<1:25:46,  6.27s/it]

Scraping CIK 0001353283


 83%|████████▎ | 4034/4854 [21:04<1:19:28,  5.82s/it]

Scraping CIK 0001452857


 83%|████████▎ | 4035/4854 [21:10<1:19:43,  5.84s/it]

Scraping CIK 0000886835


 83%|████████▎ | 4036/4854 [21:22<1:43:53,  7.62s/it]

Scraping CIK 0001637761


 83%|████████▎ | 4038/4854 [21:25<1:00:22,  4.44s/it]

Scraping CIK 0000885740
Scraping CIK 0001289945


 83%|████████▎ | 4040/4854 [21:34<54:24,  4.01s/it]  

Scraping CIK 0001639920
Scraping CIK 0000831547


 83%|████████▎ | 4041/4854 [21:45<1:24:21,  6.23s/it]

Scraping CIK 0001364885


 83%|████████▎ | 4042/4854 [21:54<1:35:47,  7.08s/it]

Scraping CIK 0001701108


 83%|████████▎ | 4043/4854 [21:56<1:14:29,  5.51s/it]

Scraping CIK 0001104855


 83%|████████▎ | 4044/4854 [22:07<1:34:29,  7.00s/it]

Scraping CIK 0001092699


 83%|████████▎ | 4045/4854 [22:12<1:26:54,  6.45s/it]

Scraping CIK 0001517375


 83%|████████▎ | 4046/4854 [22:12<1:03:08,  4.69s/it]

Scraping CIK 0000877422


 83%|████████▎ | 4047/4854 [22:24<1:31:13,  6.78s/it]

Scraping CIK 0001132105


 83%|████████▎ | 4048/4854 [22:27<1:17:39,  5.78s/it]

Scraping CIK 0000867773


 83%|████████▎ | 4049/4854 [22:36<1:30:29,  6.74s/it]

Scraping CIK 0000088205


 83%|████████▎ | 4050/4854 [22:49<1:52:28,  8.39s/it]

Scraping CIK 0001512673


 83%|████████▎ | 4051/4854 [22:52<1:32:45,  6.93s/it]

Scraping CIK 0001648428


 83%|████████▎ | 4053/4854 [22:56<55:31,  4.16s/it]  

Scraping CIK 0000909037
Scraping CIK 0001383395


 84%|████████▎ | 4054/4854 [22:56<39:21,  2.95s/it]

Scraping CIK 0001126956


 84%|████████▎ | 4055/4854 [23:09<1:17:40,  5.83s/it]

Scraping CIK 0001781162


 84%|████████▎ | 4056/4854 [23:10<1:00:16,  4.53s/it]

Scraping CIK 0001538217


 84%|████████▎ | 4057/4854 [23:16<1:04:20,  4.84s/it]

Scraping CIK 0001308606


 84%|████████▎ | 4058/4854 [23:28<1:35:06,  7.17s/it]

Scraping CIK 0000034782


 84%|████████▎ | 4059/4854 [23:42<2:03:16,  9.30s/it]

Scraping CIK 0000861878


 84%|████████▎ | 4060/4854 [23:57<2:24:59, 10.96s/it]

Scraping CIK 0000924717


 84%|████████▎ | 4061/4854 [24:09<2:27:29, 11.16s/it]

Scraping CIK 0001032208


 84%|████████▎ | 4062/4854 [24:22<2:34:08, 11.68s/it]

Scraping CIK 0001310114


 84%|████████▎ | 4063/4854 [24:27<2:08:32,  9.75s/it]

Scraping CIK 0001628063


 84%|████████▎ | 4064/4854 [24:44<2:34:53, 11.76s/it]

Scraping CIK 0001043337


 84%|████████▎ | 4065/4854 [24:56<2:36:12, 11.88s/it]

Scraping CIK 0000016859


 84%|████████▍ | 4066/4854 [24:56<1:50:12,  8.39s/it]

Scraping CIK 0001525287


 84%|████████▍ | 4067/4854 [25:01<1:35:28,  7.28s/it]

Scraping CIK 0000850261


 84%|████████▍ | 4068/4854 [25:08<1:35:13,  7.27s/it]

Scraping CIK 0000873303


 84%|████████▍ | 4069/4854 [25:21<1:57:49,  9.01s/it]

Scraping CIK 0001290149


 84%|████████▍ | 4070/4854 [25:24<1:34:07,  7.20s/it]

Scraping CIK 0001727196


 84%|████████▍ | 4071/4854 [25:25<1:12:04,  5.52s/it]

Scraping CIK 0001031029


 84%|████████▍ | 4072/4854 [25:37<1:36:52,  7.43s/it]

Scraping CIK 0001494891


 84%|████████▍ | 4073/4854 [25:41<1:19:54,  6.14s/it]

Scraping CIK 0000764038


 84%|████████▍ | 4074/4854 [25:55<1:50:54,  8.53s/it]

Scraping CIK 0000920371


 84%|████████▍ | 4075/4854 [26:08<2:08:31,  9.90s/it]

Scraping CIK 0001051514


 84%|████████▍ | 4076/4854 [26:16<2:03:44,  9.54s/it]

Scraping CIK 0000314590


 84%|████████▍ | 4077/4854 [26:17<1:27:23,  6.75s/it]

Scraping CIK 0001402436


 84%|████████▍ | 4078/4854 [26:24<1:28:37,  6.85s/it]

Scraping CIK 0001236275


 84%|████████▍ | 4079/4854 [26:30<1:27:37,  6.78s/it]

Scraping CIK 0000832428


 84%|████████▍ | 4080/4854 [26:44<1:55:46,  8.97s/it]

Scraping CIK 0001779474


 84%|████████▍ | 4082/4854 [26:45<59:24,  4.62s/it]  

Scraping CIK 0000921638
Scraping CIK 0001509470


 84%|████████▍ | 4083/4854 [26:51<1:04:40,  5.03s/it]

Scraping CIK 0001351636


 84%|████████▍ | 4084/4854 [26:54<55:04,  4.29s/it]  

Scraping CIK 0001549346


 84%|████████▍ | 4085/4854 [26:59<56:15,  4.39s/it]

Scraping CIK 0000096793


 84%|████████▍ | 4086/4854 [27:12<1:29:41,  7.01s/it]

Scraping CIK 0001517396


 84%|████████▍ | 4087/4854 [27:12<1:03:41,  4.98s/it]

Scraping CIK 0001477294


 84%|████████▍ | 4088/4854 [27:19<1:10:23,  5.51s/it]

Scraping CIK 0000718937


 84%|████████▍ | 4089/4854 [27:32<1:40:46,  7.90s/it]

Scraping CIK 0001499717


 84%|████████▍ | 4090/4854 [27:38<1:31:22,  7.18s/it]

Scraping CIK 0001479094


 84%|████████▍ | 4091/4854 [27:44<1:28:16,  6.94s/it]

Scraping CIK 0001095651


 84%|████████▍ | 4092/4854 [27:57<1:50:45,  8.72s/it]

Scraping CIK 0001581164


 84%|████████▍ | 4093/4854 [28:02<1:36:45,  7.63s/it]

Scraping CIK 0000719220


 84%|████████▍ | 4094/4854 [28:15<1:55:05,  9.09s/it]

Scraping CIK 0000094344


 84%|████████▍ | 4095/4854 [28:26<2:03:48,  9.79s/it]

Scraping CIK 0000914712


 84%|████████▍ | 4096/4854 [28:39<2:16:54, 10.84s/it]

Scraping CIK 0001757898


 84%|████████▍ | 4097/4854 [28:40<1:39:09,  7.86s/it]

Scraping CIK 0000874977


 84%|████████▍ | 4099/4854 [28:53<1:22:53,  6.59s/it]

Scraping CIK 0001723935
Scraping CIK 0001227636


 84%|████████▍ | 4100/4854 [28:55<1:06:38,  5.30s/it]

Scraping CIK 0000351834


 84%|████████▍ | 4101/4854 [29:06<1:28:12,  7.03s/it]

Scraping CIK 0001399520


 85%|████████▍ | 4102/4854 [29:13<1:26:40,  6.92s/it]

Scraping CIK 0001070154


 85%|████████▍ | 4103/4854 [29:25<1:46:28,  8.51s/it]

Scraping CIK 0001022671


 85%|████████▍ | 4105/4854 [29:38<1:25:29,  6.85s/it]

Scraping CIK 0000932787
Scraping CIK 0001082923


 85%|████████▍ | 4107/4854 [29:49<1:10:56,  5.70s/it]

Scraping CIK 0001131383
Scraping CIK 0001492915


 85%|████████▍ | 4109/4854 [29:54<47:08,  3.80s/it]  

Scraping CIK 0001745431
Scraping CIK 0001623526


 85%|████████▍ | 4110/4854 [29:55<35:37,  2.87s/it]

Scraping CIK 0001753886


 85%|████████▍ | 4111/4854 [29:56<28:19,  2.29s/it]

Scraping CIK 0001538990


 85%|████████▍ | 4112/4854 [30:02<43:58,  3.56s/it]

Scraping CIK 0001013934


 85%|████████▍ | 4113/4854 [30:15<1:18:46,  6.38s/it]

Scraping CIK 0000874238


 85%|████████▍ | 4114/4854 [30:28<1:44:43,  8.49s/it]

Scraping CIK 0001008586


 85%|████████▍ | 4115/4854 [30:40<1:56:06,  9.43s/it]

Scraping CIK 0001382101


 85%|████████▍ | 4116/4854 [30:41<1:25:28,  6.95s/it]

Scraping CIK 0000885508


 85%|████████▍ | 4117/4854 [30:53<1:44:32,  8.51s/it]

Scraping CIK 0000933034


 85%|████████▍ | 4118/4854 [31:03<1:47:44,  8.78s/it]

Scraping CIK 0001692830


 85%|████████▍ | 4119/4854 [31:03<1:17:53,  6.36s/it]

Scraping CIK 0000093751


 85%|████████▍ | 4120/4854 [31:18<1:49:11,  8.93s/it]

Scraping CIK 0001465128


 85%|████████▍ | 4121/4854 [31:26<1:43:39,  8.48s/it]

Scraping CIK 0001137789


 85%|████████▍ | 4122/4854 [31:36<1:50:57,  9.09s/it]

Scraping CIK 0001499453


 85%|████████▍ | 4123/4854 [31:38<1:22:45,  6.79s/it]

Scraping CIK 0001289340


 85%|████████▍ | 4124/4854 [31:46<1:29:41,  7.37s/it]

Scraping CIK 0000016918


 85%|████████▌ | 4126/4854 [32:00<1:18:20,  6.46s/it]

Scraping CIK 0000311337
Scraping CIK 0000912593


 85%|████████▌ | 4127/4854 [32:14<1:47:21,  8.86s/it]

Scraping CIK 0001621563


 85%|████████▌ | 4128/4854 [32:18<1:27:56,  7.27s/it]

Scraping CIK 0001314772


 85%|████████▌ | 4129/4854 [32:25<1:25:48,  7.10s/it]

Scraping CIK 0001552275


 85%|████████▌ | 4130/4854 [32:30<1:17:32,  6.43s/it]

Scraping CIK 0001508171


 85%|████████▌ | 4131/4854 [32:35<1:12:40,  6.03s/it]

Scraping CIK 0001172631


 85%|████████▌ | 4132/4854 [32:41<1:13:50,  6.14s/it]

Scraping CIK 0000095552


 85%|████████▌ | 4133/4854 [32:53<1:33:29,  7.78s/it]

Scraping CIK 0001356576


 85%|████████▌ | 4135/4854 [32:58<58:37,  4.89s/it]  

Scraping CIK 0001517399
Scraping CIK 0001718108


 85%|████████▌ | 4137/4854 [32:59<32:53,  2.75s/it]

Scraping CIK 0000909327
Scraping CIK 0001084201


 85%|████████▌ | 4138/4854 [33:00<23:27,  1.97s/it]

Scraping CIK 0000868271


 85%|████████▌ | 4139/4854 [33:11<56:48,  4.77s/it]

Scraping CIK 0000945394


 85%|████████▌ | 4141/4854 [33:24<1:01:08,  5.14s/it]

Scraping CIK 0001340677
Scraping CIK 0001739936


 85%|████████▌ | 4142/4854 [33:25<46:44,  3.94s/it]  

Scraping CIK 0001160308


 85%|████████▌ | 4143/4854 [33:34<1:03:06,  5.33s/it]

Scraping CIK 0000089140


 85%|████████▌ | 4144/4854 [33:41<1:08:02,  5.75s/it]

Scraping CIK 0001642545


 85%|████████▌ | 4145/4854 [33:41<50:08,  4.24s/it]  

Scraping CIK 0001092796


 85%|████████▌ | 4146/4854 [33:51<1:09:50,  5.92s/it]

Scraping CIK 0001710583


 85%|████████▌ | 4147/4854 [33:53<56:10,  4.77s/it]  

Scraping CIK 0001739942


 85%|████████▌ | 4148/4854 [33:55<44:43,  3.80s/it]

Scraping CIK 0001111863


 85%|████████▌ | 4149/4854 [33:55<32:28,  2.76s/it]

Scraping CIK 0000093556


 85%|████████▌ | 4150/4854 [34:10<1:16:21,  6.51s/it]

Scraping CIK 0001089907


 86%|████████▌ | 4151/4854 [34:21<1:32:41,  7.91s/it]

Scraping CIK 0000004127


 86%|████████▌ | 4152/4854 [34:35<1:51:10,  9.50s/it]

Scraping CIK 0001000623


 86%|████████▌ | 4153/4854 [34:48<2:02:53, 10.52s/it]

Scraping CIK 0000007332


 86%|████████▌ | 4154/4854 [35:01<2:14:29, 11.53s/it]

Scraping CIK 0001773427


 86%|████████▌ | 4155/4854 [35:02<1:36:57,  8.32s/it]

Scraping CIK 0001692115


 86%|████████▌ | 4156/4854 [35:05<1:18:07,  6.72s/it]

Scraping CIK 0001514705


 86%|████████▌ | 4157/4854 [35:12<1:16:24,  6.58s/it]

Scraping CIK 0000310354


 86%|████████▌ | 4158/4854 [35:25<1:39:42,  8.60s/it]

Scraping CIK 0000310142


 86%|████████▌ | 4160/4854 [35:37<1:19:00,  6.83s/it]

Scraping CIK 0001723980
Scraping CIK 0001758530


 86%|████████▌ | 4161/4854 [35:37<55:51,  4.84s/it]  

Scraping CIK 0000835324


 86%|████████▌ | 4162/4854 [35:52<1:28:08,  7.64s/it]

Scraping CIK 0001527599


 86%|████████▌ | 4163/4854 [35:54<1:11:10,  6.18s/it]

Scraping CIK 0001601712


 86%|████████▌ | 4164/4854 [35:58<1:02:37,  5.45s/it]

Scraping CIK 0000310764


 86%|████████▌ | 4165/4854 [36:23<2:09:08, 11.25s/it]

Scraping CIK 0001010612


 86%|████████▌ | 4166/4854 [36:34<2:10:00, 11.34s/it]

Scraping CIK 0000894158


 86%|████████▌ | 4167/4854 [36:44<2:02:20, 10.68s/it]

Scraping CIK 0000817720


 86%|████████▌ | 4168/4854 [36:54<2:00:10, 10.51s/it]

Scraping CIK 0001408278


 86%|████████▌ | 4169/4854 [36:59<1:42:55,  9.02s/it]

Scraping CIK 0001610950


 86%|████████▌ | 4170/4854 [37:03<1:24:11,  7.38s/it]

Scraping CIK 0000095953


 86%|████████▌ | 4171/4854 [37:16<1:45:02,  9.23s/it]

Scraping CIK 0000864240


 86%|████████▌ | 4172/4854 [37:27<1:51:04,  9.77s/it]

Scraping CIK 0001556263


 86%|████████▌ | 4173/4854 [37:30<1:24:47,  7.47s/it]

Scraping CIK 0000945114


 86%|████████▌ | 4174/4854 [37:41<1:36:40,  8.53s/it]

Scraping CIK 0000096021


 86%|████████▌ | 4175/4854 [37:52<1:48:00,  9.54s/it]

Scraping CIK 0000732717


 86%|████████▌ | 4176/4854 [38:03<1:52:25,  9.95s/it]

Scraping CIK 0001378453


 86%|████████▌ | 4178/4854 [38:12<1:14:31,  6.62s/it]

Scraping CIK 0001144800
Scraping CIK 0001585583


 86%|████████▌ | 4179/4854 [38:16<1:07:56,  6.04s/it]

Scraping CIK 0001017303


 86%|████████▌ | 4180/4854 [38:27<1:25:16,  7.59s/it]

Scraping CIK 0000942126


 86%|████████▌ | 4182/4854 [38:36<1:01:19,  5.48s/it]

Scraping CIK 0001395064
Scraping CIK 0001499620


 86%|████████▌ | 4183/4854 [38:36<43:17,  3.87s/it]  

Scraping CIK 0001724965


 86%|████████▌ | 4185/4854 [38:37<24:30,  2.20s/it]

Scraping CIK 0001588084
Scraping CIK 0001552670


 86%|████████▌ | 4186/4854 [38:37<17:36,  1.58s/it]

Scraping CIK 0000024545


 86%|████████▋ | 4187/4854 [38:51<56:28,  5.08s/it]

Scraping CIK 0001647170


 86%|████████▋ | 4188/4854 [38:51<41:37,  3.75s/it]

Scraping CIK 0001359931


 86%|████████▋ | 4190/4854 [38:55<29:46,  2.69s/it]

Scraping CIK 0000906338
Scraping CIK 0000809248


 86%|████████▋ | 4191/4854 [39:05<52:23,  4.74s/it]

Scraping CIK 0001092289


 86%|████████▋ | 4193/4854 [39:12<43:36,  3.96s/it]  

Scraping CIK 0000808439
Scraping CIK 0000096536


 86%|████████▋ | 4194/4854 [39:18<47:08,  4.29s/it]

Scraping CIK 0001295401


 86%|████████▋ | 4195/4854 [39:30<1:14:48,  6.81s/it]

Scraping CIK 0000768899


 86%|████████▋ | 4196/4854 [39:42<1:32:19,  8.42s/it]

Scraping CIK 0001693415


 86%|████████▋ | 4197/4854 [39:44<1:09:20,  6.33s/it]

Scraping CIK 0001539638


 86%|████████▋ | 4198/4854 [39:49<1:06:31,  6.08s/it]

Scraping CIK 0001668370


 87%|████████▋ | 4199/4854 [39:51<50:46,  4.65s/it]  

Scraping CIK 0001447051


 87%|████████▋ | 4200/4854 [40:00<1:04:23,  5.91s/it]

Scraping CIK 0001583107


 87%|████████▋ | 4202/4854 [40:03<38:49,  3.57s/it]  

Scraping CIK 0001743340
Scraping CIK 0001077428


 87%|████████▋ | 4203/4854 [40:14<1:03:56,  5.89s/it]

Scraping CIK 0000356171


 87%|████████▋ | 4204/4854 [40:27<1:26:50,  8.02s/it]

Scraping CIK 0000096699


 87%|████████▋ | 4205/4854 [40:36<1:28:34,  8.19s/it]

Scraping CIK 0001595585


 87%|████████▋ | 4206/4854 [40:37<1:06:24,  6.15s/it]

Scraping CIK 0000019612


 87%|████████▋ | 4207/4854 [40:51<1:32:45,  8.60s/it]

Scraping CIK 0000855874


 87%|████████▋ | 4208/4854 [41:04<1:46:35,  9.90s/it]

Scraping CIK 0000733590


 87%|████████▋ | 4209/4854 [41:17<1:56:42, 10.86s/it]

Scraping CIK 0001027838


 87%|████████▋ | 4210/4854 [41:20<1:29:34,  8.35s/it]

Scraping CIK 0000890319


 87%|████████▋ | 4212/4854 [41:32<1:12:02,  6.73s/it]

Scraping CIK 0001269238
Scraping CIK 0001394319


 87%|████████▋ | 4213/4854 [41:36<1:03:34,  5.95s/it]

Scraping CIK 0001075607


 87%|████████▋ | 4214/4854 [41:48<1:20:23,  7.54s/it]

Scraping CIK 0001370755


 87%|████████▋ | 4215/4854 [41:54<1:15:48,  7.12s/it]

Scraping CIK 0001464963


 87%|████████▋ | 4216/4854 [42:01<1:15:29,  7.10s/it]

Scraping CIK 0001750019


 87%|████████▋ | 4217/4854 [42:02<55:50,  5.26s/it]  

Scraping CIK 0001411688


 87%|████████▋ | 4218/4854 [42:08<57:16,  5.40s/it]

Scraping CIK 0000909494


 87%|████████▋ | 4220/4854 [42:20<54:44,  5.18s/it]  

Scraping CIK 0000947263
Scraping CIK 0001673481


 87%|████████▋ | 4221/4854 [42:21<42:27,  4.03s/it]

Scraping CIK 0000816761


 87%|████████▋ | 4222/4854 [42:28<52:23,  4.97s/it]

Scraping CIK 0001260221


 87%|████████▋ | 4223/4854 [42:37<1:03:12,  6.01s/it]

Scraping CIK 0001477449


 87%|████████▋ | 4224/4854 [42:40<55:32,  5.29s/it]  

Scraping CIK 0001051512


 87%|████████▋ | 4225/4854 [42:52<1:15:17,  7.18s/it]

Scraping CIK 0000098222


 87%|████████▋ | 4226/4854 [43:06<1:36:16,  9.20s/it]

Scraping CIK 0001094285


 87%|████████▋ | 4227/4854 [43:19<1:49:21, 10.47s/it]

Scraping CIK 0001650372


 87%|████████▋ | 4228/4854 [43:20<1:18:45,  7.55s/it]

Scraping CIK 0000790703


 87%|████████▋ | 4229/4854 [43:34<1:39:36,  9.56s/it]

Scraping CIK 0000842023


 87%|████████▋ | 4230/4854 [43:46<1:48:24, 10.42s/it]

Scraping CIK 0000886986


 87%|████████▋ | 4231/4854 [43:47<1:16:47,  7.40s/it]

Scraping CIK 0001766526


 87%|████████▋ | 4232/4854 [43:48<56:58,  5.50s/it]  

Scraping CIK 0001592560


 87%|████████▋ | 4233/4854 [43:48<40:34,  3.92s/it]

Scraping CIK 0000814052


 87%|████████▋ | 4234/4854 [43:48<29:23,  2.85s/it]

Scraping CIK 0001385157


 87%|████████▋ | 4235/4854 [43:58<49:47,  4.83s/it]

Scraping CIK 0001561921


 87%|████████▋ | 4236/4854 [43:59<39:01,  3.79s/it]

Scraping CIK 0000061398


 87%|████████▋ | 4237/4854 [44:12<1:06:06,  6.43s/it]

Scraping CIK 0001024725


 87%|████████▋ | 4238/4854 [44:26<1:28:35,  8.63s/it]

Scraping CIK 0001660280



Scraping CIK 0001732845


 97%|█████████▋| 4723/4854 [1:41:00<16:26,  7.53s/it]

Scraping CIK 0000108385


 97%|█████████▋| 4725/4854 [1:41:13<13:34,  6.32s/it]

Scraping CIK 0001364125
Scraping CIK 0000203596


 97%|█████████▋| 4726/4854 [1:41:26<17:45,  8.32s/it]

Scraping CIK 0001569994


 97%|█████████▋| 4727/4854 [1:41:31<15:32,  7.34s/it]

Scraping CIK 0001647088


 97%|█████████▋| 4728/4854 [1:41:34<12:37,  6.01s/it]

Scraping CIK 0000828944


 97%|█████████▋| 4730/4854 [1:41:46<11:40,  5.65s/it]

Scraping CIK 0001771279
Scraping CIK 0000719955


 97%|█████████▋| 4731/4854 [1:41:58<15:05,  7.36s/it]

Scraping CIK 0000105016


 97%|█████████▋| 4732/4854 [1:42:09<17:10,  8.45s/it]

Scraping CIK 0001175535


 98%|█████████▊| 4733/4854 [1:42:20<18:49,  9.33s/it]

Scraping CIK 0000105770


 98%|█████████▊| 4734/4854 [1:42:33<20:35, 10.30s/it]

Scraping CIK 0000945983


 98%|█████████▊| 4735/4854 [1:42:45<21:24, 10.79s/it]

Scraping CIK 0001002135


 98%|█████████▊| 4736/4854 [1:42:57<21:58, 11.18s/it]

Scraping CIK 0001166928


 98%|█████████▊| 4737/4854 [1:43:07<21:17, 10.92s/it]

Scraping CIK 0001532390


 98%|█████████▊| 4738/4854 [1:43:12<17:24,  9.01s/it]

Scraping CIK 0001015328


 98%|█████████▊| 4739/4854 [1:43:32<23:49, 12.43s/it]

Scraping CIK 0001288403


 98%|█████████▊| 4740/4854 [1:43:43<22:40, 11.93s/it]

Scraping CIK 0000776867


 98%|█████████▊| 4741/4854 [1:43:58<24:12, 12.85s/it]

Scraping CIK 0001601669


 98%|█████████▊| 4742/4854 [1:43:59<17:22,  9.31s/it]

Scraping CIK 0000078128


 98%|█████████▊| 4743/4854 [1:44:13<19:52, 10.74s/it]

Scraping CIK 0001653247


 98%|█████████▊| 4744/4854 [1:44:16<15:24,  8.41s/it]

Scraping CIK 0000795403


 98%|█████████▊| 4745/4854 [1:44:30<18:10, 10.00s/it]

Scraping CIK 0000878828


 98%|█████████▊| 4746/4854 [1:44:40<18:16, 10.15s/it]

Scraping CIK 0001693256


 98%|█████████▊| 4747/4854 [1:44:42<13:43,  7.70s/it]

Scraping CIK 0001365135


 98%|█████████▊| 4749/4854 [1:44:50<09:34,  5.47s/it]

Scraping CIK 0001525494
Scraping CIK 0001631574


 98%|█████████▊| 4750/4854 [1:44:52<07:50,  4.53s/it]

Scraping CIK 0000910679


 98%|█████████▊| 4751/4854 [1:45:02<10:29,  6.11s/it]

Scraping CIK 0000838875


 98%|█████████▊| 4752/4854 [1:45:08<10:25,  6.13s/it]

Scraping CIK 0000105319


 98%|█████████▊| 4753/4854 [1:45:19<12:39,  7.52s/it]

Scraping CIK 0000108312


 98%|█████████▊| 4754/4854 [1:45:32<15:00,  9.00s/it]

Scraping CIK 0001091907


 98%|█████████▊| 4755/4854 [1:45:44<16:24,  9.95s/it]

Scraping CIK 0000839470


 98%|█████████▊| 4756/4854 [1:45:53<15:56,  9.76s/it]

Scraping CIK 0000110471


 98%|█████████▊| 4757/4854 [1:46:06<17:04, 10.56s/it]

Scraping CIK 0000106535


 98%|█████████▊| 4758/4854 [1:46:19<18:03, 11.28s/it]

Scraping CIK 0001361658


 98%|█████████▊| 4759/4854 [1:46:27<16:23, 10.35s/it]

Scraping CIK 0001174922


 98%|█████████▊| 4760/4854 [1:46:37<16:01, 10.23s/it]

Scraping CIK 0001034760


 98%|█████████▊| 4761/4854 [1:46:48<16:08, 10.42s/it]

Scraping CIK 0001163302


 98%|█████████▊| 4762/4854 [1:47:00<16:43, 10.90s/it]

Scraping CIK 0001641631


 98%|█████████▊| 4763/4854 [1:47:02<12:37,  8.32s/it]

Scraping CIK 0001332551


 98%|█████████▊| 4764/4854 [1:47:13<13:34,  9.05s/it]

Scraping CIK 0001534525


 98%|█████████▊| 4765/4854 [1:47:17<11:10,  7.53s/it]

Scraping CIK 0001626878


 98%|█████████▊| 4766/4854 [1:47:19<08:53,  6.06s/it]

Scraping CIK 0001698530


 98%|█████████▊| 4767/4854 [1:47:21<06:53,  4.75s/it]

Scraping CIK 0001168054


 98%|█████████▊| 4768/4854 [1:47:31<09:06,  6.35s/it]

Scraping CIK 0000072903


 98%|█████████▊| 4769/4854 [1:47:45<12:21,  8.72s/it]

Scraping CIK 0001620179


 98%|█████████▊| 4770/4854 [1:47:49<10:02,  7.18s/it]

Scraping CIK 0001083220


 98%|█████████▊| 4771/4854 [1:47:56<09:45,  7.05s/it]

Scraping CIK 0001582313


 98%|█████████▊| 4772/4854 [1:47:59<07:57,  5.83s/it]

Scraping CIK 0001271214


 98%|█████████▊| 4773/4854 [1:48:02<06:53,  5.10s/it]

Scraping CIK 0001346302


 98%|█████████▊| 4774/4854 [1:48:03<05:09,  3.87s/it]

Scraping CIK 0001501697


 98%|█████████▊| 4775/4854 [1:48:05<04:15,  3.23s/it]

Scraping CIK 0001274737


 98%|█████████▊| 4776/4854 [1:48:05<03:08,  2.42s/it]

Scraping CIK 0001616000


 98%|█████████▊| 4778/4854 [1:48:09<02:35,  2.05s/it]

Scraping CIK 0001398453
Scraping CIK 0000743988


 98%|█████████▊| 4779/4854 [1:48:21<06:15,  5.01s/it]

Scraping CIK 0001280600


 98%|█████████▊| 4780/4854 [1:48:25<05:45,  4.67s/it]

Scraping CIK 0001326732


 99%|█████████▊| 4782/4854 [1:48:28<03:36,  3.00s/it]

Scraping CIK 0001510593
Scraping CIK 0001655020


 99%|█████████▊| 4783/4854 [1:48:31<03:27,  2.93s/it]

Scraping CIK 0000034088


 99%|█████████▊| 4784/4854 [1:48:46<07:37,  6.53s/it]

Scraping CIK 0000791908


 99%|█████████▊| 4785/4854 [1:48:56<08:49,  7.67s/it]

Scraping CIK 0001561627


 99%|█████████▊| 4787/4854 [1:49:02<05:33,  4.97s/it]

Scraping CIK 0001787425
Scraping CIK 0001767258


 99%|█████████▊| 4789/4854 [1:49:03<02:53,  2.67s/it]

Scraping CIK 0001803696
Scraping CIK 0000917225


 99%|█████████▊| 4790/4854 [1:49:14<05:29,  5.14s/it]

Scraping CIK 0001166003


 99%|█████████▊| 4791/4854 [1:49:23<06:26,  6.13s/it]

Scraping CIK 0000818479


 99%|█████████▊| 4792/4854 [1:49:35<08:20,  8.08s/it]

Scraping CIK 0001346610


 99%|█████████▊| 4793/4854 [1:49:35<05:48,  5.71s/it]

Scraping CIK 0001770450


 99%|█████████▉| 4794/4854 [1:49:36<04:11,  4.19s/it]

Scraping CIK 0001410428


 99%|█████████▉| 4796/4854 [1:49:43<03:30,  3.64s/it]

Scraping CIK 0001023549
Scraping CIK 0001453593


 99%|█████████▉| 4797/4854 [1:49:50<04:10,  4.40s/it]

Scraping CIK 0001347858


 99%|█████████▉| 4799/4854 [1:49:56<03:12,  3.49s/it]

Scraping CIK 0001725033
Scraping CIK 0001524472


 99%|█████████▉| 4800/4854 [1:50:01<03:32,  3.93s/it]

Scraping CIK 0000775368


 99%|█████████▉| 4801/4854 [1:50:13<05:40,  6.43s/it]

Scraping CIK 0001644903


 99%|█████████▉| 4802/4854 [1:50:15<04:30,  5.21s/it]

Scraping CIK 0001345016


 99%|█████████▉| 4803/4854 [1:50:22<04:49,  5.67s/it]

Scraping CIK 0001670592


 99%|█████████▉| 4804/4854 [1:50:23<03:36,  4.33s/it]

Scraping CIK 0001614178


 99%|█████████▉| 4805/4854 [1:50:25<02:49,  3.46s/it]

Scraping CIK 0001569329


 99%|█████████▉| 4807/4854 [1:50:29<01:59,  2.54s/it]

Scraping CIK 0001738906
Scraping CIK 0001661125


 99%|█████████▉| 4809/4854 [1:50:29<00:59,  1.32s/it]

Scraping CIK 0001759614
Scraping CIK 0001722964


 99%|█████████▉| 4811/4854 [1:50:31<00:41,  1.03it/s]

Scraping CIK 0001513845
Scraping CIK 0000108985


 99%|█████████▉| 4813/4854 [1:50:42<01:57,  2.87s/it]

Scraping CIK 0000904851
Scraping CIK 0000716006


 99%|█████████▉| 4815/4854 [1:50:55<02:43,  4.19s/it]

Scraping CIK 0001631761
Scraping CIK 0001121702


 99%|█████████▉| 4817/4854 [1:51:03<02:16,  3.68s/it]

Scraping CIK 0001516899
Scraping CIK 0001041061


 99%|█████████▉| 4818/4854 [1:51:14<03:37,  6.04s/it]

Scraping CIK 0001673358


 99%|█████████▉| 4820/4854 [1:51:17<02:00,  3.54s/it]

Scraping CIK 0000884247
Scraping CIK 0001530238


 99%|█████████▉| 4821/4854 [1:51:17<01:23,  2.53s/it]

Scraping CIK 0001617640


 99%|█████████▉| 4822/4854 [1:51:20<01:20,  2.50s/it]

Scraping CIK 0001296205


 99%|█████████▉| 4823/4854 [1:51:27<01:58,  3.81s/it]

Scraping CIK 0001136869


 99%|█████████▉| 4824/4854 [1:51:38<03:01,  6.04s/it]

Scraping CIK 0000877212


 99%|█████████▉| 4826/4854 [1:51:51<02:39,  5.70s/it]

Scraping CIK 0001785566
Scraping CIK 0001667313


 99%|█████████▉| 4828/4854 [1:51:53<01:26,  3.33s/it]

Scraping CIK 0001674988
Scraping CIK 0001463172


 99%|█████████▉| 4829/4854 [1:51:57<01:25,  3.43s/it]

Scraping CIK 0000917470


100%|█████████▉| 4830/4854 [1:52:09<02:23,  5.96s/it]

Scraping CIK 0001375151


100%|█████████▉| 4832/4854 [1:52:15<01:32,  4.20s/it]

Scraping CIK 0001773086
Scraping CIK 0001794515


100%|█████████▉| 4833/4854 [1:52:15<01:02,  2.98s/it]

Scraping CIK 0000109380


100%|█████████▉| 4834/4854 [1:52:28<02:01,  6.06s/it]

Scraping CIK 0001107421


100%|█████████▉| 4835/4854 [1:52:34<01:52,  5.91s/it]

Scraping CIK 0000855612


100%|█████████▉| 4837/4854 [1:52:45<01:30,  5.34s/it]

Scraping CIK 0001687451
Scraping CIK 0001704292


100%|█████████▉| 4838/4854 [1:52:45<01:00,  3.78s/it]

Scraping CIK 0001585521


100%|█████████▉| 4839/4854 [1:52:46<00:42,  2.83s/it]

Scraping CIK 0001131312


100%|█████████▉| 4840/4854 [1:52:54<01:00,  4.35s/it]

Scraping CIK 0001439404


100%|█████████▉| 4842/4854 [1:52:59<00:39,  3.25s/it]

Scraping CIK 0001041668
Scraping CIK 0001725160


100%|█████████▉| 4843/4854 [1:52:59<00:25,  2.32s/it]

Scraping CIK 0001684144


100%|█████████▉| 4844/4854 [1:53:01<00:20,  2.06s/it]

Scraping CIK 0001713683


100%|█████████▉| 4845/4854 [1:53:02<00:17,  1.92s/it]

Scraping CIK 0001587221


100%|█████████▉| 4847/4854 [1:53:07<00:13,  1.94s/it]

Scraping CIK 0001677250
Scraping CIK 0001555280


100%|█████████▉| 4848/4854 [1:53:12<00:16,  2.71s/it]

Scraping CIK 0001318008


100%|█████████▉| 4849/4854 [1:53:20<00:22,  4.52s/it]

Scraping CIK 0001423774


100%|█████████▉| 4850/4854 [1:53:22<00:14,  3.53s/it]

Scraping CIK 0001305323


100%|█████████▉| 4851/4854 [1:53:28<00:13,  4.41s/it]

Scraping CIK 0001403752


100%|█████████▉| 4852/4854 [1:53:30<00:07,  3.65s/it]

Scraping CIK 0001621443


100%|█████████▉| 4853/4854 [1:53:33<00:03,  3.59s/it]

Scraping CIK 0000846475


100%|██████████| 4854/4854 [1:53:40<00:00,  1.41s/it]


In [39]:
%%capture
# Run the function to scrape 10-Qs

# Define parameters
browse_url_base_10q = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-Q&count=1000'
filing_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'

# Set correct directory (fill this out yourself!)
os.chdir(pathname_10q)

# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
log_file = open(log_file_name, 'a')
log_file.close()

# Iterate over CIKs and scrape 10-Qs
for cik in tqdm(ticker_cik_df['cik']):
    Scrape10Q(browse_url_base=browse_url_base_10q, filing_url_base=filing_url_base_10q, doc_url_base=doc_url_base_10q, cik=cik, log_file_name=log_file_name)

We now have 10-Ks and 10-Qs in HTML or plaintext (.txt) format for each CIK. Before computing our similarity scores, however, we need to clean the files up a bit.

As outlined in the paper, we will:

> ... remove all tables (if their numeric character content is 
>     greater than 15%), HTML tags, XBRL tables, exhibits, 
>     ASCII-encoded PDFs, graphics, XLS, and other binary files.

In [None]:
def RemoveNumericalTables(soup):
    
    '''
    Removes tables with >15% numerical characters.
    
    Parameters
    ----------
    soup : BeautifulSoup object
        Parsed result from BeautifulSoup.
        
    Returns
    -------
    soup : BeautifulSoup object
        Parsed result from BeautifulSoup
        with numerical tables removed.
        
    '''
    
    # Determines percentage of numerical characters
    # in a table
    def GetDigitPercentage(tablestring):
        if len(tablestring)>0.0:
            numbers = sum([char.isdigit() for char in tablestring])
            length = len(tablestring)
            return numbers/length
        else:
            return 1
    
    # Evaluates numerical character % for each table
    # and removes the table if the percentage is > 15%
    [x.extract() for x in soup.find_all('table') if GetDigitPercentage(x.get_text())>0.15]
    
    return soup
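
To make the 15% rule concrete, here is a minimal sketch of the digit-share test on plain strings. The name `digit_fraction` and the two sample strings are illustrative assumptions, not part of the notebook; the logic mirrors the inner helper above, including its choice to treat empty tables as fully numeric so that they are dropped.

```python
THRESHOLD = 0.15  # same cutoff as in the paper's description


def digit_fraction(text):
    """Share of characters in `text` that are digits.

    Empty text returns 1.0 so that empty tables are removed,
    mirroring the helper in the cell above.
    """
    if not text:
        return 1.0
    return sum(ch.isdigit() for ch in text) / len(text)


# Hypothetical table contents after calling get_text()
financial_table = "Revenue 1,234 1,198 Net income 256 231"
layout_table = "Item 1A. Risk Factors. Our business depends on key suppliers."

for label, text in [("financial", financial_table), ("layout", layout_table)]:
    share = digit_fraction(text)
    decision = "drop" if share > THRESHOLD else "keep"
    print(label, round(share, 2), decision)
```

Tables that are used purely for page layout contain mostly prose, so they fall under the threshold and their text survives the cleaning step.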

In [None]:
def RemoveTags(soup):
    
    '''
    Drops HTML tags, newlines and unicode text from
    filing text.
    
    Parameters
    ----------
    soup : BeautifulSoup object
        Parsed result from BeautifulSoup.
        
    Returns
    -------
    text : str
        Filing text.
        
    '''
    
    # Remove HTML tags with get_text
    text = soup.get_text()
    
    # Remove newline characters
    text = text.replace('\n', ' ')
    
    # Replace unicode characters with their
    # "normal" representations
    text = unicodedata.normalize('NFKD', text)
    
    return text
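
As a quick illustration of the last two steps, the sketch below runs the same newline replacement and `NFKD` normalization on a made-up string (the sample text is an assumption, chosen to include a non-breaking space, a ligature, and an accented letter).

```python
import unicodedata

# Hypothetical snippet: non-breaking space (\xa0), "fi" ligature (\ufb01),
# and an accented "e" (\u00e9)
raw = "Item\xa01A.\nRisk \ufb01nancing caf\u00e9"

text = raw.replace("\n", " ")
text = unicodedata.normalize("NFKD", text)

print(repr(text))
```

Note that `NFKD` replaces compatibility characters such as the non-breaking space and the ligature with plain equivalents, but it only decomposes accented letters into a base letter plus a combining mark; it does not strip the accent.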

In [None]:
def ConvertHTML(cik):
    
    '''
    Removes numerical tables, HTML tags,
    newlines, unicode text, and XBRL tables.
    
    Parameters
    ----------
    cik : str
        Central Index Key used to scrape files.
    
    Returns
    -------
    None.
    
    '''
    
    # Look for files scraped for that CIK
    try: 
        os.chdir(cik)
    # ...if we didn't scrape any files for that CIK, exit
    except FileNotFoundError:
        print("Could not find directory for CIK", cik)
        return
        
    print("Parsing CIK %s..." % cik)
    parsed = False # flag to tell if we've parsed anything
    
    # Try to make a new directory within the CIK directory
    # to store the text representations of the filings
    try:
        os.mkdir('rawtext')
    # If it already exists, continue
    # We can't exit at this point because we might be
    # partially through parsing text files, so we need to continue
    except OSError:
        pass
    
    # Get list of scraped files
    # excluding hidden files and directories
    file_list = [fname for fname in os.listdir() if not (fname.startswith('.') | os.path.isdir(fname))]
    
    # Iterate over scraped files and clean
    for filename in file_list:
            
        # Check if file has already been cleaned
        new_filename = filename.replace('.html', '.txt')
        text_file_list = os.listdir('rawtext')
        if new_filename in text_file_list:
            continue
        
        # If it hasn't been cleaned already, keep going...
        
        # Clean file
        with open(filename, 'r') as file:
            parsed = True
            soup = bs.BeautifulSoup(file.read(), "lxml")
            soup = RemoveNumericalTables(soup)
            text = RemoveTags(soup)
            with open('rawtext/'+new_filename, 'w') as newfile:
                newfile.write(text)
    
    # If all files in the CIK directory have been parsed
    # then log that
    if parsed==False:
        print("Already parsed CIK", cik)
    
    os.chdir('..')
    return

We can now apply this function to each of our 10-K and 10-Q files.

In [50]:
%%capture
# For 10-Ks...

os.chdir(pathname_10k)

# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
    ConvertHTML(cik)

In [53]:
%%capture
# For 10-Qs...

os.chdir(pathname_10q)

# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
    ConvertHTML(cik)

After running the two cells above, we have cleaned plaintext 10-K and 10-Q filings for each CIK. At this point, our file structure looks like this:


```
- 10Ks
    - CIK1
        - 10K #1
        - 10K #2
        ...
        - rawtext
    - CIK2
        - 10K #1
        - 10K #2
        ...
        - rawtext
    - CIK3
        - 10K #1
        - 10K #2
        ...
        - rawtext
    ...
- 10Qs
    - CIK1
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
    - CIK2
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
    - CIK3
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
    ...
```

We can now begin computing our alpha factor (similarity scores).

## 2. Computing Similarity Scores

We'll use [cosine similarity](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [Jaccard similarity](http://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-score) to compare documents.

(The original paper also uses two other, simpler similarity measures, but cosine and Jaccard appeared to result in the best alpha factor performance -- and are much less computationally intensive.)

In [None]:
def ComputeCosineSimilarity(words_A, words_B):
    
    '''
    Compute cosine similarity between document A and
    document B.
    
    Parameters
    ----------
    words_A : set
        Words in document A.
    words_B : set
        Words in document B
        
    Returns
    -------
    cosine_score : float
        Cosine similarity between document
        A and document B.
        
    '''
    
    # Compile complete set of words in A or B
    words = list(words_A.union(words_B))
    
    # Determine which words are in A
    vector_A = [1 if x in words_A else 0 for x in words]
    
    # Determine which words are in B
    vector_B = [1 if x in words_B else 0 for x in words]
    
    # Compute cosine score using scikit-learn
    array_A = np.array(vector_A).reshape(1, -1)
    array_B = np.array(vector_B).reshape(1, -1)
    cosine_score = cosine_similarity(array_A, array_B)[0,0]
    
    return cosine_score
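
Because the vectors above only record whether a word is present, the dot product equals the number of shared words and each vector's norm is the square root of its set size. The sketch below uses that observation to compute the same score without building dense vectors. It is an illustrative alternative under that assumption, not the notebook's method, and `cosine_from_sets` is a hypothetical name.

```python
from math import sqrt


def cosine_from_sets(words_a, words_b):
    """Cosine similarity of binary word-presence vectors, via set sizes."""
    if not words_a or not words_b:
        return 0.0
    shared = len(words_a & words_b)
    return shared / sqrt(len(words_a) * len(words_b))


doc_a = {"we", "expect", "demand", "to", "increase"}
doc_b = {"we", "expect", "worldwide", "demand", "to", "increase"}

print(round(cosine_from_sets(doc_a, doc_b), 2))
```

For filings with tens of thousands of distinct words, this avoids allocating two vectors over the full vocabulary for every comparison.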

In [None]:
def ComputeJaccardSimilarity(words_A, words_B):
    
    '''
    Compute Jaccard similarity between document A and
    document B.
    
    Parameters
    ----------
    words_A : set
        Words in document A.
    words_B : set
        Words in document B
        
    Returns
    -------
    jaccard_score : float
        Jaccard similarity between document
        A and document B.
        
    '''
    
    # Count number of words in both A and B
    words_intersect = len(words_A.intersection(words_B))
    
    # Count number of words in A or B
    words_union = len(words_A.union(words_B))
    
    # Compute Jaccard similarity score
    jaccard_score = words_intersect / words_union
    
    return jaccard_score

Before continuing, let's double-check that these functions are working properly.

The paper gives the following sample sentences to compare:

> $D_A$: We expect demand to increase.
> $D_B$: We expect worldwide demand to increase.
> $D_C$: We expect weakness in sales.

As noted in the paper, the cosine similarity between $D_A$ and $D_B$ should be $0.91$, and the cosine similarity between $D_A$ and $D_C$ should be $0.40$.

Meanwhile, the Jaccard similarity between $D_A$ and $D_B$ should be $0.83$, and the Jaccard similarity between $D_A$ and $D_C$ should be $0.25$.
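
These values can be checked by hand. Treating each sentence as a set of distinct words (which is what the binary word-presence vectors in this notebook amount to), $D_A$ has 5 words, $D_B$ has 6, and $D_C$ has 5; $D_A$ shares 5 words with $D_B$ and 2 words ("we", "expect") with $D_C$:

$$
\cos(D_A, D_B) = \frac{|D_A \cap D_B|}{\sqrt{|D_A|\,|D_B|}} = \frac{5}{\sqrt{5 \cdot 6}} \approx 0.91,
\qquad
\cos(D_A, D_C) = \frac{2}{\sqrt{5 \cdot 5}} = 0.40
$$

$$
J(D_A, D_B) = \frac{|D_A \cap D_B|}{|D_A \cup D_B|} = \frac{5}{6} \approx 0.83,
\qquad
J(D_A, D_C) = \frac{2}{8} = 0.25
$$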

Let's double-check that our functions return the correct results.

In [None]:
d_a = set(['we', 'expect', 'demand', 'to', 'increase'])
d_b = set(['we', 'expect', 'worldwide', 'demand', 'to', 'increase'])
d_c = set(['we', 'expect', 'weakness', 'in', 'sales'])

print("Cosine similarity between A and B:", ComputeCosineSimilarity(d_a, d_b))
print("Cosine similarity between A and C:", ComputeCosineSimilarity(d_a, d_c))
print("Jaccard similarity between A and B:", ComputeJaccardSimilarity(d_a, d_b))
print("Jaccard similarity between A and C:", ComputeJaccardSimilarity(d_a, d_c))

Everything looks good! Now, let's begin applying these similarity computations to the scraped 10-Ks and 10-Qs.

We'll start with 10-Qs. This is slightly difficult, because we want to compare each 10-Q to the 10-Q from the *same quarter of the previous year*.

Keep in mind that 10-Qs are filed three times per year, and 10-Ks are filed once per year. Thus, we should simply be able to order the 10-Qs by filing date, then compare each 10-Q to the one filed three positions earlier in the list. For example, if our sorted list of 10-Q files was: `[10Q-1, 10Q-2, 10Q-3, 10Q-4, 10Q-5, 10Q-6...]` then we would iterate over the list and compare `10Q-4` to `10Q-1`, `10Q-5` to `10Q-2`, and so on.

Unfortunately, filings aren't so clean in the real world. Sometimes, companies don't file 10-Qs for a quarter (or more) for a variety of reasons, which shifts every later filing's position in the sorted list. We don't want to compare 10-Qs from different quarters; quarter-on-quarter differences will create misleading noise in our ultimate factor values.

As such, we'll take each 10-Q and look for 10-Qs that are dated between 345 and 385 days earlier. If one exists, we'll compute the similarity; if no such file exists, we'll report the scores as `NaN`.
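
The date-window rule can be sketched on its own, independent of the file handling. The helper below is a simplified illustration rather than the notebook's implementation: `find_prior_year_filing` and the sample filing dates are assumptions, and the function simply returns the most recent filing dated 345 to 385 days before the current one (or `None`, which corresponds to a `NaN` score).

```python
from datetime import date, timedelta


def find_prior_year_filing(filing_dates, current, min_days=345, max_days=385):
    """Return the latest filing dated 345-385 days before `current`, or None."""
    earliest = current - timedelta(days=max_days)
    latest = current - timedelta(days=min_days)
    candidates = [d for d in filing_dates if earliest <= d <= latest]
    return max(candidates) if candidates else None


# Hypothetical filing dates; the third-quarter 10-Q for 2015 is missing
filings = [date(2015, 5, 8), date(2015, 8, 7), date(2016, 5, 6), date(2016, 11, 4)]

for d in filings:
    print(d, "->", find_prior_year_filing(filings, d))
```

Here the May 2016 filing is matched to the May 2015 filing (364 days earlier), while the November 2016 filing has no counterpart in the window and would be scored as `NaN`.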

In [None]:
def ComputeSimilarityScores10Q(cik):
    
    '''
    Computes cosine and Jaccard similarity scores
    over 10-Qs for a particular CIK.
    
    Compares each 10-Q to the 10-Q from the same
    quarter of the previous year.
    
    Parameters
    ----------
    cik: str
        Central Index Key used to scrape and name
        files.
        
    Returns
    -------
    None.
    
    '''
    
    # Define how stringent we want to be about 
    # "previous year"
    year_short = timedelta(345)
    year_long = timedelta(385)
    
    # Open directory that holds plain 10-Q textfiles
    # for the CIK
    os.chdir(cik+'/rawtext')
    print("Parsing CIK %s..." % cik)
    
    # Get list of files to compare
    file_list = [fname for fname in os.listdir() if not 
                 (fname.startswith('.') | os.path.isdir(fname))]
    file_list.sort()
    
    # Check if scores have already been calculated
    try:
        os.mkdir('../metrics')
    # ... if they have already been calculated, exit
    except OSError:
        print("Already parsed CIK %s..." % cik)
        os.chdir('../..')
        return
    
    # Check if enough files exist to compare
    # ... if there aren't enough files, exit
    if len(file_list) < 4:
        print("No files to compare for CIK", cik)
        os.chdir('../..')
        return
    
    # Initialize dataframe to hold similarity scores
    # (filings that are never compared keep NaN scores)
    dates = [x[-14:-4] for x in file_list]
    data = pd.DataFrame(columns=['cosine_score', 'jaccard_score'],
                        index=dates)
    
    # Iterate over each quarter...
    for j in range(3):
        
        # Get text and date of earliest filing from that quarter
        file_name_A = file_list[j]
        with open(file_name_A, 'r') as file:
            file_text_A = file.read()
        date_A = datetime.strptime(file_name_A[-14:-4], '%Y-%m-%d')
        
        # Iterate over the rest of the filings from that quarter...
        for i in range(j+3, len(file_list), 3):

            # Get name and date of the later file
            file_name_B = file_list[i]
            date_B = datetime.strptime(file_name_B[-14:-4], '%Y-%m-%d')
            
            # If B was not filed within ~1 year after A...
            if (date_B > (date_A + year_long)) or (date_B < (date_A + year_short)):
                
                print(date_B.strftime('%Y-%m-%d'), "is not within a year of", date_A.strftime('%Y-%m-%d'))
                
                # Record values as missing (a real NaN, not the string 'NaN')
                data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = np.nan
                data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = np.nan
                
                # Pretend as if we found new date_A in the next year
                date_A = date_A.replace(year=date_B.year)
                
                # Move to next filing
                continue
                
            # If B was filed within ~1 year of A...
            
            # Get file text
            with open(file_name_B, 'r') as file:
                file_text_B = file.read()

            # Get sets of words in A, B
            words_A = set(re.findall(r"[\w']+", file_text_A))
            words_B = set(re.findall(r"[\w']+", file_text_B))

            # Calculate similarity score
            cosine_score = ComputeCosineSimilarity(words_A, words_B)
            jaccard_score = ComputeJaccardSimilarity(words_A, words_B)

            # Store value (indexing by the date of document B)
            data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = cosine_score
            data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = jaccard_score

            # Reset value for next loop
            # Don't re-read files, for efficiency
            file_text_A = file_text_B
            date_A = date_B

    # Save scores
    os.chdir('../metrics')
    data.to_csv(cik+'_sim_scores.csv', index=True)
    os.chdir('../..')

Fortunately, 10-Ks are easier. Though there can still be time-jumps in 10-K filings, we don't mind as much if we're comparing a 10-K from 2006 to a 10-K from 2002. This is because we don't have to worry about non-substantive quarter-on-quarter differences as we would with 10-Qs. In fact, it might actually be better if our data reflects textual changes in 10-Ks across "absent" years.

In [None]:
def ComputeSimilarityScores10K(cik):
    
    '''
    Computes cosine and Jaccard similarity scores
    over 10-Ks for a particular CIK.
    
    Parameters
    ----------
    cik: str
        Central Index Key used to scrape and name
        files.
        
    Returns
    -------
    None.
    
    '''
    
    # Open the directory that holds plaintext
    # filings for the CIK
    os.chdir(cik+'/rawtext')
    print("Parsing CIK %s..." % cik)
    
    # Get list of files over which to compute scores
    # excluding hidden files and directories
    file_list = [fname for fname in os.listdir() if not 
                 (fname.startswith('.') | os.path.isdir(fname))]
    file_list.sort()
    
    # Check if scores have already been calculated...
    try:
        os.mkdir('../metrics')
    # ... if they have been, exit
    except OSError:
        print("Already parsed CIK %s..." % cik)
        os.chdir('../..')
        return
    
    # Check if enough files exist to compute sim scores...
    # If not, exit
    if len(file_list) < 2:
        print("No files to compare for CIK", cik)
        os.chdir('../..')
        return
    
    # Initialize dataframe to store sim scores
    # (the first filing has nothing to compare against and keeps NaN scores)
    dates = [x[-14:-4] for x in file_list]
    data = pd.DataFrame(columns=['cosine_score', 'jaccard_score'],
                        index=dates)
        
    # Open first file
    file_name_A = file_list[0]
    with open(file_name_A, 'r') as file:
        file_text_A = file.read()
        
    # Iterate over each 10-K file...
    for i in range(1, len(file_list)):

        file_name_B = file_list[i]

        # Get file text B
        with open(file_name_B, 'r') as file:
            file_text_B = file.read()

        # Get set of words in A, B
        words_A = set(re.findall(r"[\w']+", file_text_A))
        words_B = set(re.findall(r"[\w']+", file_text_B))

        # Calculate similarity scores
        cosine_score = ComputeCosineSimilarity(words_A, words_B)
        jaccard_score = ComputeJaccardSimilarity(words_A, words_B)

        # Store score values
        date_B = file_name_B[-14:-4]
        data.at[date_B, 'cosine_score'] = cosine_score
        data.at[date_B, 'jaccard_score'] = jaccard_score

        # Reset value for next loop
        # (We don't open the file again, for efficiency)
        file_text_A = file_text_B

    # Save scores
    os.chdir('../metrics')
    # Keep the date index so the filing dates are written to the file
    data.to_csv(cik+'_sim_scores.csv', index=True)
    os.chdir('../..')
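
Both scoring functions above build their word sets with `re.findall(r"[\w']+", ...)`. The short sketch below shows what that pattern returns on two made-up sentences. The sample text and the lowercasing step are assumptions added for illustration; the notebook itself compares words case-sensitively.

```python
import re

# Hypothetical filing snippets that differ only in capitalization
text_a = "We expect demand to increase. The Company's outlook is unchanged."
text_b = "we expect demand to increase. the company's outlook is unchanged."

pattern = r"[\w']+"  # runs of word characters and apostrophes

words_a = set(re.findall(pattern, text_a))
words_b = set(re.findall(pattern, text_b))
print(sorted(words_a))

# Case-sensitive sets treat "We" and "we" as different words
print(len(words_a & words_b), "shared of", len(words_a | words_b))

# Optional variation: lowercase before tokenizing
words_a_lower = set(re.findall(pattern, text_a.lower()))
print(words_a_lower == words_b)
```

Whether to lowercase is a design choice: it removes differences caused purely by capitalization, at the cost of departing from the scores this notebook computes.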

Note that we store factor values according to the date of the later document (document B). This is because we want our data to be point-in-time; we want to store the factor values according to the date that *we would have known about them in the past*.

In this case, our values (similarity scores) depend on two things: the content of document A, and the content of document B. We would have become aware of the text of document A at `date_A`, and we would have become aware of the text of document B at `date_B`.

Remember that A is stipulated to be the earlier document. Since A precedes B, we wouldn't know about the factor values at `date_A`; the content of B would not yet be available. However, we would know about the factor values at `date_B` -- at that point, both the content of A and the content of B would be available. As such, we'll store our values according to `date_B`.

(The above applies to both 10-Qs and 10-Ks.)

Let's go ahead and apply these functions to our stored 10-Qs and 10-Ks.

In [None]:
# Computing scores for 10-Qs...

os.chdir(pathname_10q)

for cik in tqdm(ticker_cik_df['cik']):
    ComputeSimilarityScores10Q(cik)

In [None]:
# Computing scores for 10-Ks...

os.chdir(pathname_10k)

for cik in tqdm(ticker_cik_df['cik']):
    ComputeSimilarityScores10K(cik)

After computing the similarity scores, our file structure looks like this:


```
- 10Ks
    - CIK1
        - 10K #1
        - 10K #2
        ...
        - rawtext
        - metrics
    - CIK2
        - 10K #1
        - 10K #2
        ...
        - rawtext
        - metrics
    - CIK3
        - 10K #1
        - 10K #2
        ...
        - rawtext
        - metrics
    ...
- 10Qs
    - CIK1
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
        - metrics
    - CIK2
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
        - metrics
    - CIK3
        - 10Q #1
        - 10Q #2
        ...
        - rawtext
        - metrics
    ...
```

The similarity scores for each CIK are stored in the `metrics` directory as a .csv file.

## 3. Compiling the Dataset

Now that we've scraped the data and computed the similarity scores, we're almost done. The final step is to format our data properly for upload to [Self-Serve Data](https://www.quantopian.com/posts/upload-your-custom-datasets-and-signals-with-self-serve-data).

We'll begin by consolidating the .csv files in the 10-K and 10-Q directories into a single DataFrame for each CIK.

In [None]:
def GetData(cik, pathname_10k, pathname_10q, pathname_data):
    
    '''
    Consolidate 10-K and 10-Q data into a single dataframe
    for a CIK.
    
    Parameters
    ----------
    cik : str
        Central Index Key used to scrape and
        store data.
    pathname_10k : str
        Path to directory holding 10-K files.
    pathname_10q : str
        Path to directory holding 10-Q files.
    pathname_data : str
        Path to directory holding newly
        generated data files.
        
    Returns
    -------
    None.
    
    '''
    
    # Flags to determine what data we have
    data_10k = True
    data_10q = True
    
    print("Gathering data for CIK %s..." % cik)
    file_name = ('%s_sim_scores_full.csv' % cik)
    
    # Check if data has already been gathered...
    os.chdir(pathname_data)
    file_list = [fname for fname in os.listdir() if not fname.startswith('.')]
    
    # ... if it has been, exit
    if file_name in file_list:
        print("Already gathered data for CIK", cik)
        return
    
    # Try to get 10-K data...
    path_10k = os.path.join(pathname_10k, cik, 'metrics', cik+'_sim_scores.csv')
    try:
        sim_scores_10k = pd.read_csv(path_10k)
    # ... if it doesn't exist, set 10-K flag to False
    except FileNotFoundError:
        print("No 10-K data to gather for CIK", cik)
        data_10k = False
    
    # Try to get 10-Q data...
    path_10q = os.path.join(pathname_10q, cik, 'metrics', cik+'_sim_scores.csv')
    try:
        sim_scores_10q = pd.read_csv(path_10q)
    # ... if it doesn't exist, set 10-Q flag to False
    except FileNotFoundError:
        print("No 10-Q data to gather for CIK", cik)
        data_10q = False
    
    # Merge depending on available data...
    # ... if there's neither 10-K nor 10-Q data, exit
    if not (data_10k or data_10q):
        return
    
    # ... if there's no 10-Q data (but there is 10-K data),
    # only use the 10-K data
    if not data_10q:
        sim_scores = sim_scores_10k
    # ... if the opposite is true, only use 10-Q data
    elif not data_10k:
        sim_scores = sim_scores_10q
    # ... if there's both 10-K and 10-Q data, merge
    else:
        sim_scores = pd.concat([sim_scores_10k, sim_scores_10q], 
                               axis='index')
    
    # Rename date column
    sim_scores.rename(columns={'Unnamed: 0': 'date'}, inplace=True)

    # Set CIK column
    sim_scores['cik'] = cik
    
    # Save file in the data dir
    os.chdir(pathname_data)
    sim_scores.to_csv('%s_sim_scores_full.csv' % cik, index=False)
    
    return

In [None]:
pathname_data = '< YOUR DATA PATHNAME HERE >' # Fill this out

In [None]:
for cik in tqdm(ticker_cik_df['cik']):
    GetData(cik, pathname_10k, pathname_10q, pathname_data)

Now, we have a "data" directory that looks like this:


```
- data
    - CIK1_sim_scores_full.csv
    - CIK2_sim_scores_full.csv
    ...
```

Of course, we need to consolidate each CIK's data into a single dataset.

In [None]:
def MakeDataset(file_list, pathname_full_data):
    
    '''
    Consolidates CIK datasets into a
    single dataset.
    
    Parameters
    ----------
    file_list : list
        List of .csv files to merge.
    pathname_full_data : str
        Path to directory to store
        full dataset.
        
    Returns
    -------
    None.
    
    '''
    
    # Read each per-CIK file (expected columns: date, cosine_score,
    # jaccard_score, cik), then concatenate once.
    # (DataFrame.append was removed in pandas 2.0.)
    frames = [pd.read_csv(file_name) for file_name in tqdm(file_list)]
    data = pd.concat(frames, sort=True, ignore_index=True)
    
    # Store result
    os.chdir(pathname_full_data)
    data.to_csv('all_sim_scores.csv', index=False)
    
    return

In [None]:
pathname_full_data = '< YOUR FULL DATA PATHNAME HERE >'

In [None]:
os.chdir(pathname_data)
file_list = [fname for fname in os.listdir() if not fname.startswith('.')]

MakeDataset(file_list, pathname_full_data)

The final step is to transform the data into a format appropriate for Self-Serve Data. This means that we want a dataset with one set of factor values per ticker per day. In other words, each day-ticker pair should have a `cosine_score` and a `jaccard_score` value.

In this step, we'll need to:

1. Map CIKs back to tickers. To do this, we'll simply merge our full dataset with `ticker_cik_df`.
2. Forward-fill values for 90 calendar days (approximately one quarter).
3. Construct a dataset with one set of factor values per ticker per day (with `NaN`s for missing values).

In [None]:
sim_scores_full = pd.read_csv('all_sim_scores.csv')

# Cast CIKs as strings
sim_scores_full['cik'] = sim_scores_full['cik'].astype(str)

# Merge to map tickers to CIKs
sim_scores_ticker = sim_scores_full.merge(ticker_cik_df, how='left', on='cik')

# Drop CIK column
sim_scores_ticker.drop(labels=['cik'], axis='columns', inplace=True)

# Drop NaN values
sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'], inplace=True)
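
The string cast matters because `read_csv` will typically parse an all-digit `cik` column as integers, while the `cik` column of `ticker_cik_df` is assumed here to hold strings. A minimal sketch of the cast-then-merge step on made-up data (the CIKs, tickers, and scores below are hypothetical):

```python
import pandas as pd

# Hypothetical scores as read_csv would return them: integer CIKs
scores = pd.DataFrame({'date': ['2017-02-01', '2017-02-03'],
                       'cosine_score': [0.98, 0.91],
                       'jaccard_score': [0.95, 0.80],
                       'cik': [1111, 2222]})

# Hypothetical ticker lookup with string CIKs
lookup = pd.DataFrame({'ticker': ['aaa', 'bbb'], 'cik': ['1111', '2222']})

# Cast to str so both merge keys share a dtype, then map CIKs to tickers
scores['cik'] = scores['cik'].astype(str)
merged = scores.merge(lookup, how='left', on='cik').drop(columns='cik')
```

With `how='left'`, a CIK that has no match in the lookup table would keep its row but get a missing ticker, so it is worth checking `merged['ticker'].isna().sum()` on real data.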

The `sim_scores_ticker` data has one row for each filing, listing the set of factor values, ticker, and date. However, some day-ticker pairs have no associated factor values. We need to manipulate this data so that we have one row per ticker per day.

To do this, we'll begin with an empty dataframe that contains one row per ticker per day. We'll then [join](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) this formatted empty dataframe (`empty_data`) with our actual data (`sim_scores_ticker`) in such a way that preserves all the rows of `empty_data`. We'll end up with a dataframe that contains all the data from `sim_scores_ticker`, with `NaN`s inserted for day-ticker pairs where we're missing values.
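
To make the idea concrete, here is a small sketch of that left join on a toy grid. It uses `merge`, the column-based counterpart of `join`, and every date, ticker, and score in it is hypothetical:

```python
from itertools import product

import pandas as pd

# Hypothetical grid: two days x two tickers, no factor values yet
dates = ['2017-01-01', '2017-01-02']
tickers = ['aaa', 'bbb']
grid = pd.DataFrame(list(product(dates, tickers)), columns=['date', 'ticker'])

# Hypothetical filing data: a single filing for 'aaa' on the second day
filings = pd.DataFrame({'date': ['2017-01-02'], 'ticker': ['aaa'],
                        'cosine_score': [0.9], 'jaccard_score': [0.8]})

# A left join keeps all four grid rows; unmatched pairs get NaN scores
joined = grid.merge(filings, how='left', on=['date', 'ticker'])
```

Only the (`2017-01-02`, `aaa`) row carries scores; the other three day-ticker pairs are kept with `NaN`s.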

In [None]:
def InitializeEmptyDataframe(start_date, end_date, tickers):
    
    '''
    Initializes an empty DataFrame with all correct indices 
    (1 entry/ticker/day)
    
    Parameters
    ----------
    start_date : datetime.datetime
        Start date of dataframe.
    end_date : datetime.datetime
        End date of dataframe.
    tickers : list
        List of tickers.
    '''
    
    # Every calendar day in [start_date, end_date), as 'YYYY-MM-DD' strings
    window_length_days = (end_date - start_date).days
    date_list = [(start_date + timedelta(days=x)).strftime('%Y-%m-%d')
                 for x in range(window_length_days)]
    
    # One row per (date, ticker) pair; both factor columns start as NaN
    index = pd.MultiIndex.from_product([date_list, tickers],
                                       names=['date', 'ticker'])
    empty = pd.DataFrame(np.nan, index=index,
                         columns=['jaccard_score', 'cosine_score'])
    
    return empty.sort_index()

In [None]:
# Initialize empty dataframe
start_date = datetime(2013, 1, 1)
end_date = datetime(2018, 1, 1)
tickers = list(set(sim_scores_ticker['ticker']))

empty_data = InitializeEmptyDataframe(start_date, end_date, tickers)

(Note that we set `start_date = datetime(2013, 1, 1)`. This is because Self-Serve Data has a maximum file size of 300 MB; too large a date range will exceed that maximum. The dataset spanning 2013-2018 is 250 MB.)
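
Since the grid has one row per ticker per calendar day, its row count is simply the number of days multiplied by the number of tickers, which gives a quick way to gauge how a wider date range would grow the file. A rough sketch, where `estimate_rows` is a hypothetical helper and the 3,000-ticker universe is an assumption rather than a figure from this notebook:

```python
from datetime import datetime

def estimate_rows(start_date, end_date, n_tickers):
    """Rows in a one-row-per-ticker-per-day grid over [start_date, end_date)."""
    return (end_date - start_date).days * n_tickers

# Hypothetical universe size; substitute len(tickers) on real data
rows = estimate_rows(datetime(2013, 1, 1), datetime(2018, 1, 1), 3000)
```

File size scales roughly linearly with this row count, so halving the date range should roughly halve the size of the resulting .csv.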

In [None]:
empty_data.head()

In [None]:
# Format sim_scores data for merging
sim_scores_formatted = sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'])

sim_scores_formatted = sim_scores_formatted.groupby(['date', 'ticker']).agg('mean')

Note that we use the `.agg('mean')` aggregator. This means that we'll take any rows that match each ticker-day pair and average the factor values.

In most cases, there should only be one row per ticker-day pair; however, there are some cases where 10-Ks and 10-Qs are filed on the same day, thus creating the need for `.agg('mean')`.
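
A small sketch of that same-day case, using made-up scores for a hypothetical ticker `aaa` that files a 10-K and a 10-Q on the same date:

```python
import pandas as pd

# Hypothetical rows: 'aaa' has two filings on 2017-02-01
rows = pd.DataFrame({'date': ['2017-02-01', '2017-02-01', '2017-02-02'],
                     'ticker': ['aaa', 'aaa', 'bbb'],
                     'cosine_score': [0.80, 0.90, 0.70],
                     'jaccard_score': [0.60, 0.80, 0.50]})

# Collapse duplicates: one row per (date, ticker), scores averaged
collapsed = rows.groupby(['date', 'ticker']).agg('mean')
```

The two `aaa` rows collapse into one whose scores are the averages of the two filings, while the single `bbb` row passes through unchanged.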

In [None]:
sim_scores_formatted.head()

In [None]:
formatted_data = empty_data.join(sim_scores_formatted,
                                 how='left', 
                                 on=['date', 'ticker'], 
                                 lsuffix='_empty')

formatted_data.drop(labels=['cosine_score_empty', 'jaccard_score_empty'], axis='columns', inplace=True)

In [None]:
formatted_data.head()

Our final step is to forward-fill the values by one quarter (approximately 90 calendar days). First, let's sort the data by ticker, then by calendar day:

In [None]:
forward_filled_data = formatted_data.reset_index().sort_values(by=['ticker', 'date'])

In [None]:
forward_filled_data.head()

In [None]:
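# Forward-fill within each ticker so scores never leak from one ticker into the next
score_cols = ['jaccard_score', 'cosine_score']
forward_filled_data[score_cols] = forward_filled_data.groupby('ticker')[score_cols].ffill(limit=90)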

In [None]:
forward_filled_data.head()
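
One thing to watch when forward-filling a frame that is sorted by ticker and then by date: a plain forward-fill treats the whole column as one sequence, so the last score of one ticker can be carried into the first rows of the next. The sketch below contrasts a plain fill with a per-ticker fill on toy data (all values are hypothetical, and the limit is shortened to 2 rows to keep the example small):

```python
import numpy as np
import pandas as pd

days = ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05']

# Hypothetical frame, already sorted by ticker and then date
toy = pd.DataFrame({'ticker': ['aaa'] * 5 + ['bbb'] * 5,
                    'date': days * 2,
                    'cosine_score': [np.nan, np.nan, np.nan, 0.9, np.nan,
                                     np.nan, 0.5, np.nan, np.nan, np.nan]})

# Plain fill: the 0.9 from 'aaa' spills into the first 'bbb' row
toy['naive'] = toy['cosine_score'].ffill(limit=2)

# Per-ticker fill: each ticker is filled independently, at most 2 rows ahead
toy['by_ticker'] = toy.groupby('ticker')['cosine_score'].ffill(limit=2)
```

In the `by_ticker` column the first `bbb` row stays `NaN`, and the last `bbb` row is also `NaN` because the 2-row limit has been used up.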

We now have one row per ticker per day and have forward-filled the values by up to 90 days, so we're ready to save our data as a .csv file and upload it to Self-Serve Data.

In [None]:
forward_filled_data.to_csv('lazy_prices_data.csv', index=False)
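
Before uploading, it can be worth confirming that the file really has one row per ticker per day. A minimal sketch of such a check, where `check_ticker_day_grid` is a hypothetical helper and the sample frame stands in for the saved dataset:

```python
import pandas as pd

def check_ticker_day_grid(df):
    """Return True if df has at most one row per (ticker, date) pair."""
    return not df.duplicated(subset=['ticker', 'date']).any()

# Hypothetical stand-in for the saved dataset
sample = pd.DataFrame({'ticker': ['aaa', 'aaa', 'bbb'],
                       'date': ['2017-01-01', '2017-01-02', '2017-01-01'],
                       'cosine_score': [0.9, 0.9, None],
                       'jaccard_score': [0.8, 0.8, None]})
ok = check_ticker_day_grid(sample)
```

On the real output, the same check could be run on `pd.read_csv('lazy_prices_data.csv')`.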