# SEC Scraper
This notebook scrapes SEC 10-K filings for specific companies and extracts their risk factor sections. The extracted data is saved into CSV files for further analysis.

## Introduction
We build the scraper in four steps:
1. Retrieving the Central Index Key (CIK) for a given company ticker.
2. Getting the list of 10-K filings for the CIK.
3. Downloading and parsing the risk factors section from the filings.
4. Saving the extracted data into CSV files.

## Import Libraries
We start by importing the necessary libraries:
- `os` for directory and file operations.
- `requests` for making HTTP requests.
- `BeautifulSoup` from the `bs4` package for parsing HTML content.
- `pandas` for handling data in DataFrame format.
- `time` for adding delays between requests to avoid server overload.

In [None]:
# Import necessary libraries
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Define the base URL for the SEC EDGAR website
BASE_URL = "https://www.sec.gov"


## Get CIK for Ticker
The first step is to retrieve the Central Index Key (CIK) for a given company ticker. The CIK is a unique identifier assigned by the SEC to each company. We define a function `get_cik` that takes a company's stock ticker as input and returns its CIK.


In [None]:
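# Function to get the CIK (Central Index Key) for a given company ticker
def get_cik(ticker):
    """
    Take a company's stock ticker and return its CIK as a string,
    or None if the lookup fails.
    """
    url = f"{BASE_URL}/cgi-bin/browse-edgar?CIK={ticker}&owner=exclude&action=getcompany"
    # SEC asks automated clients to identify themselves; replace with your own contact details
    headers = {"User-Agent": "Sample Company admin@example.com"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Guard against a missing span so an unknown ticker does not raise AttributeError
    name_tag = soup.find('span', {'class': 'companyName'})
    cik_tag = name_tag.find('a') if name_tag else None
    
    if cik_tag:
        # The link text typically looks like "0000320193 (see all company filings)"; keep only the number
        cik = cik_tag.text.strip().split()[0]
        return cik
    else:
        print(f"CIK not found for ticker: {ticker}")
        return None

# Example usage of the function
ticker = 'AAPL'
cik = get_cik(ticker)
print(f"CIK for {ticker}: {cik}")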
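
The text scraped from the company page can carry extra words next to the identifier, and EDGAR writes CIKs zero-padded to ten digits. A minimal sketch of a normalizer, assuming the scraped text has a form like `0000320193 (see all company filings)`; `normalize_cik` is a hypothetical helper for illustration and is not one of the notebook's functions:

```python
import re

def normalize_cik(raw_text):
    """Return the first run of digits in raw_text, zero-padded to 10 digits, or None."""
    match = re.search(r"\d+", raw_text or "")
    if match is None:
        return None
    return match.group(0).zfill(10)

# Both forms normalize to the same 10-digit string
print(normalize_cik("0000320193 (see all company filings)"))
print(normalize_cik("320193"))
```

Keeping the CIK as a string preserves the leading zeros, which an integer conversion would drop.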


## Get 10-K Filings for CIK
Next, we define a function `get_10k_filings` that retrieves a list of 10-K filings for a given CIK. The function requests the company's EDGAR filing list in Atom format and collects the title, link, summary, and update time of each entry.


In [None]:
# Function to get the list of 10-K filings for a given CIK
def get_10k_filings(cik, start=0, count=100):
    """
    Retrieve a list of 10-K filings for a given CIK.
    Each filing is a dict with 'title', 'link', 'summary', and 'updated' keys.
    """
    url = f"{BASE_URL}/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=10-K&start={start}&count={count}&owner=exclude&output=atom"
    # SEC asks automated clients to identify themselves; replace with your own contact details
    headers = {"User-Agent": "Sample Company admin@example.com"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    entries = soup.find_all('entry')
    
    filings = []
    for entry in entries:
        filing = {
            'title': entry.find('title').text,
            'link': entry.find('link')['href'],
            'summary': entry.find('summary').text,
            'updated': entry.find('updated').text
        }
        filings.append(filing)
    
    return filings

# Example usage of the function (skipped if the CIK lookup failed)
filings = get_10k_filings(cik) if cik else []
print(f"Found {len(filings)} 10-K filings for CIK {cik}")
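
The entry-by-entry parsing can be tried offline on a small feed. The sketch below swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup and runs on a made-up sample. `SAMPLE_FEED`, its URLs, and `parse_filings` are assumptions for illustration, and real EDGAR output carries more fields:

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up sample feed for illustration only
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>10-K - Annual report</title>
    <link href="https://www.example.com/filing-2-index.htm" rel="alternate"/>
    <summary>Filed: 2023-11-03</summary>
    <updated>2023-11-03T18:08:27-04:00</updated>
  </entry>
  <entry>
    <title>10-K - Annual report</title>
    <link href="https://www.example.com/filing-1-index.htm" rel="alternate"/>
    <summary>Filed: 2022-10-28</summary>
    <updated>2022-10-28T18:01:14-04:00</updated>
  </entry>
</feed>"""

def parse_filings(feed_xml):
    """Turn an Atom feed string into a list of filing dicts."""
    root = ET.fromstring(feed_xml)
    filings = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        filings.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return filings

sample_filings = parse_filings(SAMPLE_FEED)
print(f"Parsed {len(sample_filings)} entries")
```

Using defaults for missing elements means one incomplete entry does not abort the whole loop.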


## Extract Risk Factors
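We define a function `get_risk_factors` that downloads a 10-K filing and extracts the Risk Factors section. The extraction takes the text between the "Item 1A. Risk Factors" heading and the next "Item" heading. The function expects the URL of the filing document itself; if it is given a filing index page instead, the section will usually not be found.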


In [None]:
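# Function to download and parse the risk factors section from a 10-K filing
import re

def get_risk_factors(filing_url):
    """
    Download a 10-K filing and return the text of its Risk Factors section,
    or None if the section cannot be located.
    """
    # SEC asks automated clients to identify themselves; replace with your own contact details
    headers = {"User-Agent": "Sample Company admin@example.com"}
    response = requests.get(filing_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text()
    
    # Assuming the section starts at an "Item 1A. Risk Factors" heading and ends at the
    # next "Item 1B." or "Item 2." heading. Matching a bare "Item" would stop at the first
    # cross-reference inside the section. A table-of-contents entry may still match first.
    start_match = re.search(r'Item\s+1A\.?\s*Risk\s+Factors', text, re.IGNORECASE)
    end_match = None
    if start_match:
        end_match = re.compile(r'Item\s+(?:1B|2)\.', re.IGNORECASE).search(text, start_match.end())
    
    if start_match and end_match:
        risk_factors = text[start_match.start():end_match.start()].strip()
        return risk_factors
    else:
        print("Risk Factors section not found")
        return None

# Example usage of the function
risk_factors = None  # defined up front so later cells can test it safely
if filings:
    first_filing_url = filings[0]['link']
    risk_factors = get_risk_factors(first_filing_url)
    if risk_factors:
        print("Extracted Risk Factors section:")
        print(risk_factors[:500])  # Print the first 500 characters of the risk factors section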
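
A 10-K usually repeats the "Item 1A. Risk Factors" heading in its table of contents, so the first match is often not the section body. One possible heuristic, sketched here on made-up text, is to collect every heading-to-heading span and keep the longest. `extract_risk_section` and `SAMPLE_TEXT` are hypothetical names, and the end markers "Item 1B." and "Item 2." are an assumption about the filing's layout:

```python
import re

START_RE = re.compile(r"Item\s+1A\.?\s*Risk\s+Factors", re.IGNORECASE)
END_RE = re.compile(r"Item\s+(?:1B|2)\.", re.IGNORECASE)

def extract_risk_section(text):
    """Return the longest span from an Item 1A heading to the next Item 1B/2 heading."""
    best = None
    for start in START_RE.finditer(text):
        end = END_RE.search(text, start.end())
        if end is None:
            continue  # heading with no closing marker after it
        candidate = text[start.start():end.start()].strip()
        if best is None or len(candidate) > len(best):
            best = candidate
    return best

# Made-up text with a table-of-contents entry followed by the real section
SAMPLE_TEXT = (
    "Table of Contents\n"
    "Item 1A. Risk Factors 12\n"
    "Item 1B. Unresolved Staff Comments 25\n"
    "Item 1A. Risk Factors\n"
    "Our business depends on a small number of suppliers.\n"
    "Item 1B. Unresolved Staff Comments\n"
    "None.\n"
)
section = extract_risk_section(SAMPLE_TEXT)
print(section)
```

The longest-span rule is only a heuristic; filings with unusual headings may need different patterns.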


## Save Risk Factors to CSV
Once we have extracted the Risk Factors section, we save it to a CSV file using the `save_risk_factors` function. The function creates a directory `sec_risk_factors` and saves the data with a filename based on the company ticker.


In [None]:
# Function to save the risk factors to a CSV file
def save_risk_factors(ticker, risk_factors):
    """
    Save the extracted risk factors to a CSV file with one row
    and the columns 'ticker' and 'risk_factors'.
    """
    output_dir = 'sec_risk_factors'
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, f"{ticker}_risk_factors.csv")
    
    # Writing through pandas quotes commas and newlines so the file is valid CSV
    df = pd.DataFrame([{'ticker': ticker, 'risk_factors': risk_factors}])
    df.to_csv(output_file, index=False, encoding='utf-8')
    
    print(f"Risk factors saved to {output_file}")

# Example usage of the function
if risk_factors:
    save_risk_factors(ticker, risk_factors)
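
Risk factor text contains commas, quotes, and line breaks, so it needs proper CSV quoting to be read back intact. A minimal sketch of a round trip using the standard library's `csv` module as an alternative to pandas; the helper names, the `DEMO` ticker, and the sample text are made up for illustration:

```python
import csv
import os
import tempfile

def write_risk_factors_csv(path, ticker, risk_factors):
    """Write a header row and one data row; csv.writer handles the quoting."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ticker", "risk_factors"])
        writer.writerow([ticker, risk_factors])

def read_risk_factors_csv(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Sample text with a comma, quotes, and an embedded newline
sample_text = 'Supply risk, including "single-source" parts.\nCurrency risk.'

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "DEMO_risk_factors.csv")
    write_risk_factors_csv(csv_path, "DEMO", sample_text)
    rows = read_risk_factors_csv(csv_path)

print(rows[0]["ticker"], len(rows))
```

Opening the file with `newline=""` lets the `csv` module control line endings, which keeps embedded newlines inside a field unchanged.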


## Main Cell
The main cell takes a list of company tickers and, for each one, retrieves the CIK, gets the 10-K filings, extracts the Risk Factors section from the first filing in the list, and saves it to a CSV file.

In [None]:
# Main cell to scrape and save risk factors for a list of company tickers
tickers = ['AAPL', 'MSFT', 'GOOGL']

for ticker in tickers:
    print(f"Processing ticker: {ticker}")
    cik = get_cik(ticker)
    
    if cik:
        filings = get_10k_filings(cik)
        
        if filings:
            first_filing_url = filings[0]['link']
            risk_factors = get_risk_factors(first_filing_url)
            
            if risk_factors:
                save_risk_factors(ticker, risk_factors)
    
    # Pause between tickers to avoid overloading the SEC servers
    time.sleep(1)
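
Delays between requests can also be enforced in one place rather than with scattered `time.sleep` calls. The sketch below is one possible approach, not part of the notebook: `Throttle` is a hypothetical helper, and the 0.2 second interval is an arbitrary value chosen for the demonstration:

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now

throttle = Throttle(min_interval=0.2)
for ticker in ["AAPL", "MSFT", "GOOGL"]:
    throttle.wait()  # an HTTP request for this ticker would follow here
    print(f"Ready to request data for {ticker}")
```

Calling `throttle.wait()` before every HTTP request spaces out all calls, including the several requests made for a single ticker.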
            
