# Scraping the SEC for Document Filings

### Notebook Index:

- [Establishing Contact with the SEC Website](#Website)
- [Scraper Function](#Scraper)
- [Scraping the SEC](#Scraping)

-----

## Importing Libraries:

In [1]:
import requests
import json

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from scrapy import Selector

-----
<a class="anchor" id="Website"></a>

## Establishing Contact with the SEC:

Specifically, request the list of filings for Apple Inc. (AAPL).

In [2]:
url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAPL&action=getcompany&owner=exclude'

In [3]:
res = requests.get(url)
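SEC.gov asks automated clients to identify themselves with a descriptive `User-Agent` header, and requests that lack one may be refused. The sketch below shows one way to build such a header. `build_headers` and the contact string are hypothetical placeholders, not part of the original notebook.

```python
def build_headers(contact):
    """Return request headers that identify the client to the server."""
    contact = contact.strip()
    if not contact:
        raise ValueError("contact must be a non-empty string")
    return {"User-Agent": contact}

# Placeholder identity: replace with your own name and email address.
headers = build_headers("Sample Name sample@example.com")

# The dictionary can then be sent with the request, for example:
# res = requests.get(url, headers=headers, timeout=10)
```

Passing a `timeout` is optional, but it keeps a stalled connection from blocking the notebook indefinitely.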

### Checking the Status Code:

A status code of `200` means the request succeeded and the server returned the page.

In [4]:
res.status_code

200
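When the notebook runs unattended, it can help to turn the numeric code into a readable label and to treat anything outside the 2xx range as a failure. The sketch below uses only the standard library. `describe_status` and `is_success` are illustrative helper names, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

print(describe_status(200))  # 200 OK
print(is_success(403))       # False
```

With `requests` itself, calling `res.raise_for_status()` right after the request raises an exception for 4xx and 5xx responses, which avoids parsing an error page by mistake.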

<a class="anchor" id="Scraper"></a>

## Creating a Function to Scrape the SEC:

In [5]:
def scraper(ticker):
    """
    Returns the filings scraped from the SEC website for the given company.
    Additionally, the date column is converted to datetime and set as the index.

    Parameters
    ----------
    ticker : str
        The company's ticker symbol, e.g. 'AAPL'.
    """
    
    collect = []
    for page in range(0, 600, 40):
        url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=&dateb=&owner=exclude&start={}&count=40'.format(ticker, page)        
        res = requests.get(url)
        html = res.text
        # Creating a helper that strips the HTML tags from an extracted cell.
        cleaner = lambda x: BeautifulSoup(x, 'lxml').get_text()
        # Extracting the Document Type of each filing
        docs = Selector(text=html).xpath("//table[@class='tableFile2']//td[1]").extract()
        documents = [cleaner(tag) for tag in docs]
        # Extracting the Dates corresponding to the filings
        date = Selector(text=html).xpath("//table[@class='tableFile2']//td[4]").extract()
        dates = [cleaner(day) for day in date]
        # Extracting the filing number
        nums = Selector(text=html).xpath("//table[@class='tableFile2']//td[5]").extract()
        numbers = [cleaner(num).strip() for num in nums]
        # Extracting the Description of each filing
        description = Selector(text=html).xpath("//table[@class='tableFile2']//td[@class='small']").extract()
        descriptions = [cleaner(descript).strip() for descript in description]
        # Combining all features that were extracted
        x = list(zip(documents, descriptions, dates, numbers))
        collect.extend(x)
    # Turning the features into a Pandas dataframe & setting the date as the index.
    df = pd.DataFrame.from_records(collect, columns=['document_type','description', 'date', 'file_number'])
    # The scraped dates are in YYYY-MM-DD form, so no dayfirst hint is needed.
    df['date'] = pd.to_datetime(df['date'])
    df.set_index('date', inplace=True)
    df.sort_index(inplace=True, ascending=True)
    
    return df
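
The loop in `scraper` pages through the results 40 rows at a time by stepping the `start` query parameter from 0 to 560, which means 15 requests and at most 600 filings per company. The sketch below isolates that URL construction and uses `urllib.parse.urlencode` in place of string formatting. `page_urls` is a hypothetical helper, not part of the original function.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sec.gov/cgi-bin/browse-edgar"

def page_urls(ticker, max_rows=600, page_size=40):
    """Build one EDGAR listing URL per page of results."""
    urls = []
    for start in range(0, max_rows, page_size):
        params = {
            "action": "getcompany",
            "CIK": ticker,
            "type": "",
            "dateb": "",
            "owner": "exclude",
            "start": start,
            "count": page_size,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

urls = page_urls("JPM")
print(len(urls))  # 15
print(urls[-1])   # the last page starts at row 560
```

Building the query from a dictionary makes it easier to change a single parameter, such as filtering by filing type, without editing a long format string.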

------

<a class="anchor" id="Scraping"></a>

## Scraping the SEC:

- Apple, Inc. - AAPL
- Facebook, Inc. - FB
- Google LLC - GOOGL
- JPMorgan Chase & Co. - JPMorgan
- The Goldman Sachs Group, Inc. - GoldmanSachs
- Moody's Corporation - Moodys
- The International Business Machines Corporation (IBM) - IBM
- Twitter Inc. - Twitter
- BlackRock, Inc. - BlackRock
- Microsoft Corporation - Microsoft
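
To run the scraper for several companies in one pass, the label used in the output file name can be paired with the ticker passed to `scraper`. This is a sketch under stated assumptions: it includes only the four tickers that appear in this notebook, and `output_path` is a hypothetical helper that mirrors the file naming used in the save step.

```python
# Label used in the file name, mapped to the ticker sent to the SEC site.
companies = {
    "AAPL": "AAPL",
    "FB": "FB",
    "GOOGL": "GOOGL",
    "JPMorgan": "JPM",
}

def output_path(company_name):
    """Return the CSV path for a company label (hypothetical helper)."""
    return f"data/{company_name}_SEC.csv"

for company_name, ticker in companies.items():
    print(f"{ticker} -> {output_path(company_name)}")
    # In the notebook, the scrape itself would go here:
    # df = scraper(ticker)
```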

In [6]:
company_name = 'JPMorgan'

In [7]:
df = scraper('JPM')

### Inspecting the Data:

In [8]:
df.head()

| date | document_type | description | file_number |
|---|---|---|---|
| 2018-06-12 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0000950103-... | 333-22267218893960 |
| 2018-06-12 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218895296 |
| 2018-06-12 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0000891092-... | 333-22267218894263 |
| 2018-06-12 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218894207 |
| 2018-06-12 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218893968 |


In [9]:
df.tail()

| date | document_type | description | file_number |
|---|---|---|---|
| 2018-07-27 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218973978 |
| 2018-07-27 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0000891092-... | 333-22267218973571 |
| 2018-07-27 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218972893 |
| 2018-07-27 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218974543 |
| 2018-07-27 | 424B2 | Prospectus [Rule 424(b)(2)]Acc-no: 0001615774-... | 333-22267218975317 |


--------
## Saving the data into a CSV:

In [10]:
df.to_csv(f'data/{company_name}_SEC.csv')  # the index holds the filing dates, so it is written too
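
Because `scraper` stores the filing dates in the index, writing the CSV with `index=False` leaves them out of the file. The sketch below round-trips a tiny frame through an in-memory buffer to show the dates surviving when the index is written. The two rows are made-up placeholder values, not scraped data.

```python
import io

import pandas as pd

# Toy frame shaped like the scraper output (placeholder values only).
df = pd.DataFrame(
    {"document_type": ["10-K", "8-K"], "file_number": ["000-00000", "000-00001"]},
    index=pd.to_datetime(["2018-06-12", "2018-07-27"]),
)
df.index.name = "date"

buffer = io.StringIO()
df.to_csv(buffer)  # index=True is the default, so the dates are written
buffer.seek(0)

# Read the file back, restoring the dates as a datetime index.
restored = pd.read_csv(buffer, index_col="date", parse_dates=True)
print(restored.index.min(), restored.index.max())
```

Note that `to_csv` does not create missing folders, so the `data/` directory has to exist before the real file is saved.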