# Scraping the SEC for Document Filings

### Index:

- [Establishing Contact with the SEC Website](#Website)
- [Scraper Function](#Scraper)
- [Scraping the SEC](#Scraping)
- [Inspecting the Data](#Data)

-----

## Importing Libraries:

In [1]:
import requests
import json

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from scrapy import Selector

-----
<a class="anchor" id="Website"></a>

## Establishing Contact with the SEC:

Specifically, request the filings page for Apple, Inc. (AAPL).

In [2]:
url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAPL&action=getcompany&owner=exclude'

In [3]:
res = requests.get(url)

### Checking the Status Code:

A status code of `200` means the request succeeded and the server returned the page.

In [4]:
res.status_code

200
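A bare status code is easier to read next to its standard reason phrase. The sketch below uses the standard library's `http.HTTPStatus` for that. It also builds a headers dictionary, because SEC guidance for automated access asks clients to identify themselves with a `User-Agent`, and a request without one may come back as `403` rather than `200`. The helper names and the contact string are placeholders, not part of the original notebook.

```python
from http import HTTPStatus

# Placeholder identity (an assumption): replace with your own name and contact address.
HEADERS = {"User-Agent": "Example Research admin@example.com"}


def describe_status(code):
    """Return a readable label such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code} (non-standard status)"


def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(403))
# With requests, the headers would be passed as: requests.get(url, headers=HEADERS)
```

Checking `is_success(res.status_code)` before parsing avoids silently scraping an error page.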

<a class="anchor" id="Scraper"></a>

# Creating a Function to Scrape the SEC:

**The following function will:**
- Accept a company's stock ticker and use the symbol to query the [SEC Website](https://www.sec.gov/).
- Request each results page with `requests`, then use scrapy's `Selector` (XPath) and BeautifulSoup to pull out the table cells for that company.
- Strip the HTML markup from every extracted value, leaving plain text.
- Extract (scrape) the following fields into a pandas dataframe:
    - Document Type
    - Date Filed
    - Filing Number
    - Filing Description
- Convert the date to a datetime, set it as the index, and sort it in ascending order.
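The extraction steps listed above can be tried offline on a small table. This is a minimal sketch, not the notebook's method: it swaps in the standard library's `xml.etree.ElementTree` for scrapy's `Selector` and BeautifulSoup so that it runs without a network or third-party packages. The sample markup is made up to mirror the column positions the scraper relies on, and ElementTree needs well-formed markup, which a real EDGAR page may not be.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample that mimics the column order the scraper uses:
# td[1] = document type, td[@class='small'] = description,
# td[4] = date filed, td[5] = file number.
SAMPLE = """
<html><body><table class="tableFile2">
<tr><td>10-Q</td><td>Documents</td><td class="small">Quarterly report <b>[Sections 13 or 15(d)]</b></td><td>2018-05-02</td><td><a href="#">001-36743</a></td></tr>
<tr><td>8-K</td><td>Documents</td><td class="small">Current report</td><td>2018-05-07</td><td><a href="#">001-36743</a></td></tr>
</table></body></html>
"""


def column(root, cell_predicate):
    """Return the stripped text of every matching cell in the filings table."""
    path = f".//table[@class='tableFile2']//td[{cell_predicate}]"
    # itertext() walks nested tags, so inner markup such as <b> or <a> is dropped.
    return ["".join(cell.itertext()).strip() for cell in root.findall(path)]


root = ET.fromstring(SAMPLE.strip())
documents = column(root, "1")
descriptions = column(root, "@class='small'")
dates = column(root, "4")
numbers = column(root, "5")

# Same pairing step as the scraper: one tuple per filing row.
rows = list(zip(documents, descriptions, dates, numbers))
print(rows)
```

Because `zip` pairs the lists by position, the four columns must yield the same number of cells per page. If one selector matches an extra cell, every later row is misaligned.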

In [5]:
def scraper(ticker):
    """ 
    Return the filings scraped from the SEC website for the given company.
    The date column is converted to datetime and set as the index.

    Parameters
    ----------
    ticker : str
        The company's ticker symbol.
    """
    
    collect = []
    for page in range(0, 600, 40):
        url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=&dateb=&owner=exclude&start={}&count=40'.format(ticker, page)        
        res = requests.get(url)
        html = res.text
        # Helper that strips the HTML tags and returns only the text.
        cleaner = lambda x: BeautifulSoup(x, 'lxml').get_text()
        # Extracting the Document Type of each filing
        docs = Selector(text=html).xpath("//table[@class='tableFile2']//td[1]").extract()
        documents = [cleaner(tag) for tag in docs]
        # Extracting the Dates corresponding to the filings
        date = Selector(text=html).xpath("//table[@class='tableFile2']//td[4]").extract()
        dates = [cleaner(day) for day in date]
        # Extracting the filing number
        nums = Selector(text=html).xpath("//table[@class='tableFile2']//td[5]").extract()
        numbers = [cleaner(num).strip() for num in nums]
        # Extracting the Description of each filing
        description = Selector(text=html).xpath("//table[@class='tableFile2']//td[@class='small']").extract()
        descriptions = [cleaner(descript).strip() for descript in description]
        # Combining all features that were extracted
        x = list(zip(documents, descriptions, dates, numbers))
        collect.extend(x)
    # Turning the features into a pandas dataframe & setting the date as the index.
    df = pd.DataFrame.from_records(collect, columns=['document_type','description', 'date', 'file_number'])       
    # The scraped dates are in YYYY-MM-DD form, so no dayfirst hint is needed.
    df['date'] = pd.to_datetime(df['date'])
    df.set_index('date', inplace=True)
    df.sort_index(inplace=True, ascending=True)
    
    return df
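The loop in `scraper` always requests the offsets 0, 40, ..., 560 (15 pages), so it caps the result at 600 filings and keeps requesting pages even after a company's filings have run out. One possible variation, sketched below, stops at the first empty page. `collect_pages` and `fetch_rows` are hypothetical names, and a fake fetcher stands in for the network call so the example runs offline.

```python
def collect_pages(fetch_rows, page_size=40, max_start=600):
    """Gather rows page by page, stopping at the first empty page.

    fetch_rows is a hypothetical callable: it takes a start offset and
    returns the list of rows found on that page.
    """
    collected = []
    for start in range(0, max_start, page_size):
        rows = fetch_rows(start)
        if not rows:
            break  # nothing left to read, so skip the remaining offsets
        collected.extend(rows)
    return collected


# Stand-in for the network call: pretend the company has 95 filings.
FAKE_FILINGS = [f"filing-{i}" for i in range(95)]
requested = []


def fake_fetch(start):
    requested.append(start)
    return FAKE_FILINGS[start:start + 40]


rows = collect_pages(fake_fetch)
print(len(rows), requested)
```

Here four requests are made instead of fifteen. Raising `max_start` would lift the 600-row cap for companies with longer filing histories.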

------

<a class="anchor" id="Scraping"></a>

# Scraping the SEC:

The following companies will be scraped:

- Apple, Inc. - AAPL
- Facebook, Inc. - FB
- Alphabet Inc. (Google) - GOOGL
- JPMorgan Chase & Co. - JPM
- The Goldman Sachs Group, Inc. - GS
- Moody's Corporation - MCO
- The International Business Machines Corporation (IBM) - IBM
- Twitter Inc. - TWTR
- BlackRock, Inc. - BLK
- Microsoft Corporation - MSFT
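Running the scraper for each company by hand means editing `company_name` and the ticker together every time. A small mapping keeps the two in sync. This is a sketch under assumptions: `COMPANIES`, `output_path`, and `scrape_all` are illustrative names, only three of the listed companies are included, and a stand-in replaces the real `scraper` so the example runs offline.

```python
from pathlib import Path

# Display name -> ticker, for a subset of the companies listed above.
COMPANIES = {"Apple": "AAPL", "Microsoft": "MSFT", "IBM": "IBM"}


def output_path(company_name, data_dir="data"):
    """Build the CSV path for a company, e.g. data/Apple_SEC.csv."""
    return Path(data_dir) / f"{company_name}_SEC.csv"


def scrape_all(scrape, companies):
    """Call scrape(ticker) for each company and key the results by name."""
    return {name: scrape(ticker) for name, ticker in companies.items()}


# Stand-in scraper so the sketch runs offline; swap in the real `scraper`.
results = scrape_all(lambda ticker: f"<filings for {ticker}>", COMPANIES)
for name in results:
    print(name, "->", output_path(name).as_posix())
```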

In [11]:
company_name = 'Apple'

In [12]:
df = scraper('AAPL')

-------
<a class="anchor" id="Data"></a>

# Inspecting the Data

### Inspecting the Earliest Filing Scraped:

In [13]:
df.head()

| date | document_type | description | file_number |
|---|---|---|---|
| 1994-01-26 | 424B5 | Prospectus [Rule 424(b)(5)]Acc-no: 0000891618-... | 033-6231094502696 |
| 1994-01-26 | 10-Q | Quarterly report [Sections 13 or 15(d)]Acc-no:... | 000-1003094502732 |
| 1994-02-10 | SC 13G/A | [Amend] Statement of acquisition of beneficial... | 005-3363294505635 |
| 1994-02-17 | SC 13G/A | [Amend] Statement of acquisition of beneficial... | 005-3363294510471 |
| 1994-02-18 | SC 13G | Statement of acquisition of beneficial ownersh... | 005-3363200000000 |


### Inspecting the Latest Filing Scraped:

In [14]:
df.tail()

| date | document_type | description | file_number |
|---|---|---|---|
| 2018-03-07 | SD | Acc-no: 0001193125-18-073716 (34 Act) Size: 8... | 001-3674318674202 |
| 2018-05-01 | 8-K | Current report, items 2.02 and 9.01\nAcc-no: 0... | 001-3674318795935 |
| 2018-05-02 | 10-Q | Quarterly report [Sections 13 or 15(d)]Acc-no:... | 001-3674318800115 |
| 2018-05-07 | 8-K | Current report, items 8.01 and 9.01\nAcc-no: 0... | 001-3674318811649 |
| 2018-05-08 | 8-K/A | [Amend] Current report, item 8.01\nAcc-no: 000... | 001-3674318812776 |


--------
## Saving the data into a CSV:

In [10]:
# Keep the index when saving: it holds the filing dates.
df.to_csv(f'data/{company_name}_SEC.csv')
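Because `scraper` moves the filing date into the index, whether the index is written decides whether the dates reach the file. A minimal sketch with a made-up two-row frame and an in-memory buffer (so nothing touches the `data/` folder) shows the difference and one way to read the dates back as the index:

```python
import io

import pandas as pd

# Made-up frame shaped like the scraper's output: dates live in the index.
df_demo = pd.DataFrame(
    {"document_type": ["10-Q", "8-K"], "file_number": ["001-36743", "001-36743"]},
    index=pd.to_datetime(["2018-05-02", "2018-05-07"]).rename("date"),
)

# With index=False the date index is not written at all.
without_index = df_demo.to_csv(index=False)

# With the default (index=True) the dates become the first CSV column.
buffer = io.StringIO()
df_demo.to_csv(buffer)
buffer.seek(0)

# Read the file back, restoring the dates as a datetime index.
restored = pd.read_csv(buffer, index_col="date", parse_dates=True)
print(restored)
```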