# SEC Filings Data Retrieval for RAG

### Step 1: Library Imports and API Key Configuration

The first step is to import all the required libraries and set up our API key.

- `userdata`: Used to securely access the SEC API key stored in Google Colab.
- `sec_api`: The main library for querying and extracting data from SEC filings.
- `os`: To interact with the file system, specifically for saving the extracted data.

The `sec_api_key` is retrieved from Colab's secret manager.

In [16]:
from google.colab import userdata
from sec_api import ExtractorApi, QueryApi # !pip install sec_api -q
import os

In [17]:
# Read the SEC API key from the Colab secret named 'secdata'
sec_api_key = userdata.get('secdata')

### Step 2: Define the Data Retrieval Function

This cell defines the core logic for retrieving and extracting the most relevant information from SEC filings.

-   **`companies` list**: This list holds the stock tickers of the companies we want to analyze. You can easily add or remove companies here.
-   **`get_filings(ticker)` function**:
    -   It uses the `QueryApi` to find the most recent **10-K annual report** for the given ticker.
    -   It then uses the `ExtractorApi` to pull the raw text from two critical sections:
        -   **Item 1A (`Risk Factors`)**: This section details the potential risks and uncertainties that could impact the company's business.
        -   **Item 7 (`Management's Discussion and Analysis...`)**: This provides a narrative from management on the company's financial performance, results, and future outlook.
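    -   The function returns the two extracted sections as a list, `[risk factors text, management discussion text]`, or `None` when no 10-K is found or an API call fails. Both sections are long narrative text, which makes them a suitable source for a RAG system.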
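
As a minimal sketch, the search request that `get_filings` sends can be factored into a small helper. The name `build_filing_query` and its `form_type` and `size` parameters are hypothetical and not part of `sec_api`; only the dictionary layout is taken from this notebook.

```python
def build_filing_query(ticker, form_type="10-K", size=1):
    """Build the search payload for the most recent filings of one form type.

    Hypothetical helper: only the dictionary layout comes from this notebook.
    """
    return {
        "query": f'ticker:{ticker} AND formType:"{form_type}"',
        "from": "0",
        "size": str(size),
        # Newest filing first, so index 0 of the result is the latest report
        "sort": [{"filedAt": {"order": "desc"}}],
    }


query = build_filing_query("AAPL")
print(query["query"])  # ticker:AAPL AND formType:"10-K"
```

Keeping the payload in a pure function makes it easy to inspect the query string before spending an API call on it.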

In [None]:
companies = [
    "AAPL", # Apple Inc.
    "MSFT", # Microsoft Corporation
    "GOOG", # Alphabet Inc. (Google)
    "AMZN", # Amazon.com, Inc.
    "META", # Meta Platforms, Inc.
]

def get_filings(ticker):

    try:
        # Finding Recent Filings with QueryAPI
        queryApi = QueryApi(api_key=sec_api_key)
        query = {
          "query": f"ticker:{ticker} AND formType:\"10-K\"",
          "from": "0",
          "size": "1",
          "sort": [{ "filedAt": { "order": "desc" } }]
        }
        filings = queryApi.get_filings(query)

        # Check if any filings were found
        if not filings["filings"]:
            print(f"No 10-K filings found for {ticker}.")
            return None

        # Getting 10-K URL
        filing_url = filings["filings"][0]["linkToFilingDetails"]

        # Extracting Text with ExtractorAPI
        extractorApi = ExtractorApi(api_key=sec_api_key)

        # Section 1A - Risk Factors
        risk_factors = extractorApi.get_section(filing_url, "1A", "text")

        # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
        management_dis = extractorApi.get_section(filing_url, "7", "text")

        # Return both sections as a list: [risk factors, management discussion]
        filing_text = [risk_factors, management_dis]
        return filing_text

    except Exception as e:
        print(f"An error occurred while processing {ticker}: {e}")
        return None

In [20]:
# Dictionary to store the filing data for each company
all_companies_filing_data = {}

# Loop through the list of companies and get their filing data
for ticker in companies:
    print(f"-----")
    print(f"Getting Filing Data for {ticker}")
    filing_data = get_filings(ticker)

    # Store the data if it was successfully retrieved
    if filing_data:
        all_companies_filing_data[ticker] = filing_data

    print(f"Finished getting data for {ticker}")

-----
Getting Filing Data for AAPL
Finished getting data for AAPL
-----
Getting Filing Data for MSFT
Finished getting data for MSFT
-----
Getting Filing Data for GOOG
Finished getting data for GOOG
-----
Getting Filing Data for AMZN
Finished getting data for AMZN
-----
Getting Filing Data for META
Finished getting data for META


In [None]:
if "AAPL" in all_companies_filing_data:
    print("\n--- Apple (AAPL) Filing Data ---")
    print(all_companies_filing_data["AAPL"][0][:500] + "...")


--- Apple (AAPL) Filing Data ---
 Item 1A. Risk Factors 

The Company&#8217;s business, reputation, results of operations, financial condition and stock price can be affected by a number of factors, whether currently known or unknown, including those described below. When any one or more of these risks materialize from time to time, the Company&#8217;s business, reputation, results of operations, financial condition and stock price can be materially and adversely affected. 

Because of the following factors, as well as other fa...
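
The preview shows that the extracted text still carries HTML entities such as `&#8217;`. A minimal sketch of decoding them with the standard library's `html.unescape`; the sample string is copied from the output above.

```python
import html

# Fragment copied from the AAPL preview above
raw = "The Company&#8217;s business, reputation, results of operations"

# html.unescape turns numeric and named entities into their characters
decoded = html.unescape(raw)
print(decoded)
```

Decoding before chunking keeps entity codes out of the text that later gets embedded.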


In [24]:
print("--- Character count for each company's filings ---")

total = 0
for company in companies:
    # Each entry is a two-item list: [risk factors text, management discussion text]
    risks_text, management_dis_text = all_companies_filing_data[company]
    length = len(risks_text) + len(management_dis_text)
    total += length
    print(f"{company}: {length} characters")
print(f"\nTotal character count for all companies: {total}")

--- Character count for each company's filings ---
AAPL: 86707 characters
MSFT: 124930 characters
GOOG: 144093 characters
AMZN: 109633 characters
META: 253056 characters

Total character count for all companies: 718419


## Saving Extracted Filing Data to Files

The following code cell will iterate through the `all_companies_filing_data` dictionary and save the extracted text for each company into separate files.

A new folder named `companies` will be created if it doesn't already exist. For each company, two `.txt` files will be saved in this folder:

* `[TICKER]_risks.txt`: Contains the text from Section 1A (Risk Factors).

* `[TICKER]_management_dis.txt`: Contains the text from Section 7 (Management’s Discussion and Analysis of Financial Condition and Results of Operations).


In [None]:
# Define the folder name
output_folder = "companies"

# Import the os module if it's not already imported
import os

# Create the folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Loop through the combined filing data and save each company's text to a file
for ticker, data in all_companies_filing_data.items():
    print(f"Saving filing data for {ticker}...")

    # The Risk Factors text is the first item in the list
    risks_text = data[0]
    # The Management's Discussion is the second item in the list
    management_dis_text = data[1]

    # Create the filenames for the two sections
    risks_filename = os.path.join(output_folder, f"{ticker}_risks.txt")
    management_dis_filename = os.path.join(output_folder, f"{ticker}_management_dis.txt")

    # Save the Risk Factors to its own file
    with open(risks_filename, "w", encoding="utf-8") as f:
        f.write(risks_text)

    # Save the Management's Discussion to its own file
    with open(management_dis_filename, "w", encoding="utf-8") as f:
        f.write(management_dis_text)

    print(f"Saved {risks_filename} and {management_dis_filename}")

print("\nAll files have been saved to the 'companies' folder.")

Saving filing data for AAPL...
Saved companies/AAPL_risks.txt and companies/AAPL_management_dis.txt
Saving filing data for MSFT...
Saved companies/MSFT_risks.txt and companies/MSFT_management_dis.txt
Saving filing data for GOOG...
Saved companies/GOOG_risks.txt and companies/GOOG_management_dis.txt
Saving filing data for AMZN...
Saved companies/AMZN_risks.txt and companies/AMZN_management_dis.txt
Saving filing data for META...
Saved companies/META_risks.txt and companies/META_management_dis.txt

All files have been saved to the 'companies' folder.


## Loading the Saved Files Back into `all_companies_filing_data`

In [26]:
# Define the folder name
output_folder = "/content/companies"

# Dictionary to store the loaded filing data for each company
loaded_filing_data = {}

# Check if the folder exists
if not os.path.exists(output_folder):
    print(f"The folder '{output_folder}' was not found. Please run the previous cell to save the data first.")
else:
    print(f"Loading data from '{output_folder}' folder...")

    # Dictionary to temporarily hold text for a company before combining
    company_data_temp = {}

    # Loop through the files in the 'companies' folder
    for filename in os.listdir(output_folder):
        if filename.endswith(".txt"):
            filepath = os.path.join(output_folder, filename)

            # Parse the filename to get the ticker and category
            parts = filename.split('_')
            ticker = parts[0]
            if len(parts) == 2:
              category = parts[1].split('.')[0]
            else:
              category = parts[1] + '_' + parts[2].split('.')[0]

            # Read the content of the file
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()

            # Store the content in a temporary dictionary
            if ticker not in company_data_temp:
                company_data_temp[ticker] = {}
            company_data_temp[ticker][category] = content

    # Combine the temporary data into the final dictionary
    for ticker, data in company_data_temp.items():
        # Ensure both risks and management discussion files were found
        if 'risks' in data and 'management_dis' in data:
            loaded_filing_data[ticker] = [data['risks'], data['management_dis']]
            print(f"Successfully loaded data for {ticker}")
        else:
            print(f"Warning: Could not find both 'risks' and 'management_dis' files for {ticker}. Skipping.")

    print("\nFinished loading data.")

# Assign the loaded data to the variable name used in the next cell
all_companies_filing_data = loaded_filing_data

# Print a quick check to confirm the data is loaded
if "AAPL" in all_companies_filing_data:
    print("\n--- Apple (AAPL) Filing Data (First 500 characters) ---")
    print(all_companies_filing_data["AAPL"][0][:500] + "...")
else:
    print("\nNo data found for AAPL. Check your file paths.")

Loading data from '/content/companies' folder...
Successfully loaded data for META
Successfully loaded data for MSFT
Successfully loaded data for AMZN
Successfully loaded data for GOOG
Successfully loaded data for AAPL

Finished loading data.

--- Apple (AAPL) Filing Data (First 500 characters) ---
 Item 1A. Risk Factors 

The Company&#8217;s business, reputation, results of operations, financial condition and stock price can be affected by a number of factors, whether currently known or unknown, including those described below. When any one or more of these risks materialize from time to time, the Company&#8217;s business, reputation, results of operations, financial condition and stock price can be materially and adversely affected. 

Because of the following factors, as well as other fa...
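
The loading cell recovers the ticker and category by splitting each filename on underscores. A minimal alternative sketch splits only at the first underscore, so a category such as `management_dis` needs no special case. The helper name `parse_filing_filename` is hypothetical, and the sketch assumes the ticker itself contains no underscore.

```python
from pathlib import Path


def parse_filing_filename(filename):
    """Split '<TICKER>_<category>.txt' into (ticker, category).

    Hypothetical helper; assumes the ticker contains no underscore.
    """
    stem = Path(filename).stem                 # drop the '.txt' extension
    ticker, _, category = stem.partition("_")  # split at the first underscore only
    return ticker, category


print(parse_filing_filename("AAPL_risks.txt"))           # ('AAPL', 'risks')
print(parse_filing_filename("MSFT_management_dis.txt"))  # ('MSFT', 'management_dis')
```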


In [3]:
!pip install -q chromadb sentence-transformers langchain-text-splitters


In [27]:
import os
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions
from langchain_text_splitters import RecursiveCharacterTextSplitter
import html

# Step 1: Prepare Documents by Chunking Text

In [28]:
# Helper function to clean text before chunking
def clean_text(text):
    """Decodes HTML entities and cleans up extra whitespace."""
    # Decode HTML entities (e.g., &#8226; becomes •)
    cleaned_text = html.unescape(text)

    cleaned_text = ' '.join(cleaned_text.split())
    return cleaned_text


documents = []
metadatas = []
ids = []
doc_id_counter = 0

# Define chunking parameters
chunk_size = 2500
chunk_overlap = 200

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Loop through the all_companies_filing_data dictionary
for ticker, filing_data in all_companies_filing_data.items():
    # The first item is the Risk Factors text
    risks_text = filing_data[0]
    # The second item is the Management's Discussion text
    management_dis_text = filing_data[1]

    # Apply cleaning function to the raw text
    risks_text_cleaned = clean_text(risks_text)
    management_dis_text_cleaned = clean_text(management_dis_text)

    # Process and chunk the cleaned Risk Factors text
    risks_chunks = text_splitter.split_text(risks_text_cleaned)
    for chunk in risks_chunks:
        documents.append(chunk)
        metadatas.append({
            "company": ticker,
            "category": "risks"
        })
        ids.append(f"doc_{doc_id_counter}")
        doc_id_counter += 1

    # Process and chunk the cleaned Management Discussion text
    management_dis_chunks = text_splitter.split_text(management_dis_text_cleaned)
    for chunk in management_dis_chunks:
        documents.append(chunk)
        metadatas.append({
            "company": ticker,
            "category": "management_dis"
        })
        ids.append(f"doc_{doc_id_counter}")
        doc_id_counter += 1

print(f"Split {len(documents)} text chunks from the in-memory data.")

Split 316 text chunks from the in-memory data.
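
As a rough sanity check on that count, the number of fixed-size windows needed to cover a text can be estimated from its length, the chunk size, and the overlap. This is only a sketch: the helper `estimate_chunk_count` is hypothetical, it treats all text as one string, and it ignores that the splitter breaks at separators, that `clean_text` collapses whitespace, and that each section is split on its own.

```python
def estimate_chunk_count(n_chars, chunk_size=2500, chunk_overlap=200):
    """Rough number of fixed-size windows needed to cover n_chars characters."""
    if n_chars <= 0:
        return 0
    stride = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    # Ceiling division on integers, with at least one chunk for any non-empty text
    return max(1, -(-(n_chars - chunk_overlap) // stride))


# 718419 is the raw total character count reported earlier in this notebook
print(estimate_chunk_count(718419))  # 313
```

The estimate of 313 is in the same range as the 316 chunks reported above, which suggests the splitter is producing chunks close to the configured size.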


In [36]:
print(f"Text: {documents[0]}")
print(f"\nMetadata: {metadatas[0]}")

Text: Item 1A. Risk Factors Certain factors may have a material adverse effect on our business, financial condition, and results of operations. You should consider carefully the risks and uncertainties described below, in addition to other information contained in this Annual Report on Form 10-K, including our consolidated financial statements and related notes. The risks and uncertainties described below are not the only ones we face. Additional risks and uncertainties that we are unaware of, or that we currently believe are not material, may also become important factors that adversely affect our business. If any of the following risks actually occurs, our business, financial condition, results of operations, and future prospects could be materially and adversely affected. In that event, the trading price of our Class A common stock could decline, and you could lose part or all of your investment. Summary Risk Factors Our business is subject to a number of risks, including risks that

# Step 2: Initialize Embedding Model and ChromaDB Client

In [38]:
# Embedding model, loaded through sentence-transformers
model_name = 'BAAI/bge-small-en-v1.5'
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

# Initialize a persistent ChromaDB client
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Get or create the collection
collection_name = "financial_filings"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=ef
)

# Step 3: Add documents to ChromaDB collection

In [39]:
# Add documents in batches so each call embeds and writes a bounded number of chunks
batch_size = 100
for i in range(0, len(documents), batch_size):
    print(f"Adding batch {i // batch_size + 1} of documents...")
    batch_docs = documents[i:i + batch_size]
    batch_metadatas = metadatas[i:i + batch_size]
    batch_ids = ids[i:i + batch_size]

    collection.add(
        documents=batch_docs,
        metadatas=batch_metadatas,
        ids=batch_ids
    )

print("\nChromaDB collection created and populated successfully!")
print(f"The collection '{collection_name}' now contains {collection.count()} documents.")

Adding batch 1 of documents...
Adding batch 2 of documents...
Adding batch 3 of documents...
Adding batch 4 of documents...

ChromaDB collection created and populated successfully!
The collection 'financial_filings' now contains 316 documents.
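
The slicing pattern used in the loop can be isolated into a small generator. This is a minimal sketch; `iter_batches` is a hypothetical helper name, and the 316 placeholder items stand in for the chunks counted above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 316 placeholder items, matching the number of chunks in this notebook
sizes = [len(batch) for batch in iter_batches(list(range(316)), 100)]
print(sizes)  # [100, 100, 100, 16]
```

With a batch size of 100, the last batch holds the 16 leftover chunks, which matches the four batches printed above.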


# Querying the Vector Database

In [41]:
query_text = "What are the biggest risks for a tech company?"

# Combine the two metadata filters with '$and' so that only chunks matching
# BOTH conditions are considered for the similarity search.
results = collection.query(
    query_texts=[query_text],
    n_results=5,
    where={
        "$and": [
            {"company": {"$eq": "AAPL"}},  # Condition 1: company is AAPL
            {"category": {"$eq": "risks"}}    # Condition 2: category is risks
        ]
    }
)
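
Conceptually, a filtered query keeps only the records whose metadata match every condition and then returns the closest embeddings. The toy sketch below illustrates that idea in plain Python. It is not how ChromaDB is implemented: the function `filtered_top_k`, the two-dimensional vectors, and the use of squared Euclidean distance are all assumptions made for illustration.

```python
def filtered_top_k(query_vec, records, where, k):
    """Toy illustration: apply the metadata filter, then rank by squared L2 distance."""
    def matches(meta):
        # Every key in `where` must match, mirroring an '$and' of equality filters
        return all(meta.get(key) == value for key, value in where.items())

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    candidates = [r for r in records if matches(r["metadata"])]
    candidates.sort(key=lambda r: sq_dist(query_vec, r["embedding"]))
    return [r["id"] for r in candidates[:k]]


# Made-up records with made-up 2-d embeddings
records = [
    {"id": "doc_0", "embedding": [0.9, 0.1], "metadata": {"company": "AAPL", "category": "risks"}},
    {"id": "doc_1", "embedding": [0.2, 0.8], "metadata": {"company": "AAPL", "category": "risks"}},
    {"id": "doc_2", "embedding": [1.0, 0.0], "metadata": {"company": "MSFT", "category": "risks"}},
    {"id": "doc_3", "embedding": [0.8, 0.2], "metadata": {"company": "AAPL", "category": "management_dis"}},
]

top = filtered_top_k([1.0, 0.0], records, {"company": "AAPL", "category": "risks"}, k=2)
print(top)  # ['doc_0', 'doc_1']
```

Note that `doc_2` is the exact nearest vector to the query but is excluded because its `company` does not match the filter.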

In [46]:
print("\nFiltered Query Results:")
# The results['documents'] is a list of lists, where the outer list corresponds to the query.
# Since we only have one query, we access the first element with [0].
print("\n", "-" * 20)
for result_doc, result_metadata in zip(results['documents'][0], results['metadatas'][0]):
    print(f"Company: {result_metadata['company']}, Category: {result_metadata['category']}")
    print(f"Document: {result_doc}...")
    print("-" * 20)


Filtered Query Results:

 --------------------
Company: AAPL, Category: risks
Document: . As a result, from time to time the Company’s services have not performed as anticipated and may not meet customer expectations. The introduction of new and complex technologies, such as artificial intelligence features, can increase these and other safety risks, including exposing users to harmful, inaccurate or other negative content and experiences. There can be no assurance the Company will be able to detect and fix all issues and defects in the hardware, software and services it offers. Failure to do so can result in widespread technical and performance issues affecting the Company’s products and services. Errors, bugs and vulnerabilities can be exploited by third parties, compromising the safety and security of a user’s device. In addition, the Company can be exposed to product liability claims, recalls, product replacements or modifications, write-offs of inventory, property, plant and equi