## Long-Form Document Extraction: Mining Information from SEC 10-K/Q Forms

Companies listed on the US stock exchanges are required to file annual and quarterly reports with the SEC. These reports are called 10K (annual) and 10Q (quarterly) filings.
10K/Q filings are information dense and contain a lot of information about the company's business, operations, and financials.
The documents have a loosely defined structure and the reported metrics and sections may differ based on the company's operations. 

That said, there are enough commonalities that we may want to extract the information in a standardized format for downstream analysis. e.g. this could be 
used to extract financial metrics for a company and analysis of key risk factors after every earnings release.

Let's take a look at Nvidia's 10-K filing for the year 2024. Here's the SEC link for the [10-K filing](https://www.sec.gov/ix?doc=/Archives/edgar/data/0001045810/000104581025000023/nvda-20250126.htm).
As you can see, this is a pretty large document with a lot of information to parse through. 

> **Note:** This principle of what fields generalize across your target documents and what might be optional is an important one to keep in mind when designing your schema. 


In [3]:
import sys, os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-B00jwwtpQf9Nx9s1ImoFTRKq1Uo2fOOqGBpCWwl0NddZloRi"


In [4]:
from IPython.display import IFrame

IFrame(src="../data/sec_edgar_filings/0001065280/10-K/0001065280-25-000044/full-submission.txt", width=600, height=400)

Let us initialize the LlamaExtract client to extract our information of interest from these 10-K filings. 

In [5]:
from dotenv import load_dotenv
from llama_cloud_services import LlamaExtract


# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)
load_dotenv(override=True)

# Optionally, add your project id/organization id
llama_extract = LlamaExtract()

No project_id provided, fetching default project.


### 1. Defining the Extraction Schema

To begin with, we'll focus on extracting the following information from the 10K/Q filings which are common across different companies:
- *Filing Information*: Date of filing, type of filing, reporting period end date, fiscal year, fiscal quarter
- *Company Profile*: Name, ticker, reporting currency, stock exchanges, auditor
- *Financial Highlights*: Key metrics to assess the company's financial health - revenue, gross profit, operating income, net income, EPS, EBITDA, free cash flow
- *Business/Geographic Segments*: Revenue, operating income, year-over-year growth, outlook for each segment.
- *Risk Factors*: Key risks as identified by the company management.
- *Management Discussion & Analysis (MD&A)*: Key highlights from management discussion and analysis.


#### Using Pydantic Models for Schema Definition

We can use JSON to define the schema for the extraction or use Pydantic models to encapsulate the schema. In this example, we'll use Pydantic models for schema definition for a few reasons:
- **Extensibility**: They are more flexible, easier to extend and maintain. 
- **Readability**: Pydantic models are more readable (less verbose) and easier to understand. Nested models in particular are easier to read than deeply nested JSON schemas.
- **Type Safety**: By validating against the Pydantic model, your code is guaranteed to be type-safe for use downstream an part of an automated process. e.g. an extracted date field will not suddenly become a numeric type.

In this case, imagine that you have a daily ETL pipeline that searches for new 10-K/Q filings and extracts the relevant information for these companies. Once the extraction results are available in LlamaExtract, *it is guaranteed to comply with the schema definition and can be sent to the ETL pipeline without worrying about data type mismatches.*

We consider some key design considerations for the schema definition below.

In [6]:
from typing import Literal, Optional, List
from pydantic import BaseModel, Field


class FilingInfo(BaseModel):
    """Basic information about the SEC filing"""

    filing_type: Literal["10-K", "10-Q", "10-K/A", "10-Q/A"] = Field(
        description="Type of SEC filing"
    )
    filing_date: str = Field(description="Date when filing was submitted to SEC")
    reporting_period_end: str = Field(description="End date of reporting period")
    fiscal_year: int = Field(description="Fiscal year")
    fiscal_quarter: int = Field(description="Fiscal quarter (if 10-Q)", ge=1, le=4)


class CompanyProfile(BaseModel):
    """Essential company information"""

    name: str = Field(description="Legal name of company")
    ticker: str = Field(description="Stock ticker symbol")
    reporting_currency: str = Field(description="Currency used in financial statements")
    exchanges: Optional[List[str]] = Field(
        None, description="Stock exchanges where listed"
    )
    auditor: Optional[str] = Field(None, description="Company's auditor")


class FinancialHighlights(BaseModel):
    """Key financial metrics from this reporting period"""

    period_end: str = Field(description="End date of reporting period")
    comparison_period_end: Optional[str] = Field(
        None, description="End date of comparison period (typically prior year/quarter)"
    )
    currency: str = Field(description="Currency of financial figures")
    unit: str = Field(
        description="Unit of financial figures (thousands, millions, etc.)"
    )
    revenue: float = Field(description="Total revenue for period")
    revenue_prior_period: Optional[float] = Field(
        None, description="Revenue from comparison period"
    )
    revenue_growth: float = Field(description="Revenue growth percentage")
    gross_profit: Optional[float] = Field(None, description="Gross profit")
    gross_margin: float = Field(description="Gross margin percentage")
    operating_income: Optional[float] = Field(None, description="Operating income")
    operating_margin: Optional[float] = Field(
        None, description="Operating margin percentage"
    )
    net_income: float = Field(description="Net income")
    net_margin: Optional[float] = Field(None, description="Net margin percentage")
    eps: Optional[float] = Field(None, description="Basic earnings per share")
    diluted_eps: Optional[float] = Field(None, description="Diluted earnings per share")
    ebitda: Optional[float] = Field(
        None,
        description="EBITDA (Earnings Before Interest, Taxes, Depreciation, Amortization)",
    )
    free_cash_flow: Optional[float] = Field(None, description="Free cash flow")


class BusinessSegment(BaseModel):
    """Information about a business segment"""

    name: str = Field(description="Segment name")
    description: str = Field(description="Segment description")
    revenue: float = Field(None, description="Segment revenue")
    revenue_percentage: Optional[float] = Field(
        None, description="Percentage of total company revenue"
    )
    operating_income: Optional[float] = Field(
        None, description="Segment operating income"
    )
    operating_margin: Optional[float] = Field(
        None, description="Segment operating margin percentage"
    )
    year_over_year_growth: float = Field(
        None, description="Year-over-year growth percentage"
    )
    outlook: Optional[str] = Field(None, description="Future outlook for segment")


class GeographicSegment(BaseModel):
    """Information about a geographic segment"""

    region: str = Field(description="Geographic region")
    revenue: float = Field(None, description="Revenue from region")
    revenue_percentage: Optional[float] = Field(
        None, description="Percentage of total company revenue"
    )
    year_over_year_growth: Optional[float] = Field(
        None, description="Year-over-year growth percentage"
    )


class RiskFactor(BaseModel):
    """Information about a risk factor"""

    category: str = Field(
        description="Risk category (e.g., Market, Operational, Legal)"
    )
    title: Optional[str] = Field(None, description="Brief title of risk")
    description: str = Field(description="Description of risk factor")
    potential_impact: Optional[str] = Field(
        None, description="Potential business impact"
    )


class ManagementHighlights(BaseModel):
    """Key highlights from Management Discussion & Analysis"""

    business_overview: str = Field(description="Overview of business and strategy")
    key_trends: Optional[str] = Field(
        None, description="Key trends affecting performance"
    )
    liquidity_assessment: Optional[str] = Field(
        None, description="Management assessment of liquidity"
    )
    outlook_summary: str = Field(description="Future outlook/guidance")


class SECFiling(BaseModel):
    """Schema for parsing 10-K and 10-Q filings from the SEC"""

    filing_info: FilingInfo = Field(description="Basic information about the filing")
    company_profile: CompanyProfile = Field(description="Essential company information")
    financial_highlights: FinancialHighlights = Field(
        description="Key financial metrics from this reporting period"
    )
    business_segments: Optional[List[BusinessSegment]] = Field(
        None, description="Key business segments information"
    )
    geographic_segments: Optional[List[GeographicSegment]] = Field(
        None, description="Geographic segment information"
    )
    key_risks: List[RiskFactor] = Field(description="Most significant risk factors")
    mda_highlights: ManagementHighlights = Field(
        description="Key highlights from Management Discussion & Analysis"
    )

### 2. Extracting Information from $NVDA 10-K Filing

Take a look at the schema definition above. We've defined a few models to represent the different sections of the 10K/Q filing. 
We've also defined a `SECFiling` model that combines all the sections into a single model. 


#### Design Considerations for Schema Definition

- **Optional Fields**: There are quite a few optional fields in the schema. There are many fields that we would like to extract if present, but we know that they are not present in all filings. 
  e.g. companies which only has a US footprint will not have a geographic breakdown of their financials. It is important to designate these fields as optional so that the LLM is not 
  forced to make up values for these fields. Designating these fields as optional helps provide an escape hatch for the LLM to not hallucinate values for these fields. Note, however, that if aggressively marking fields as optional might result in the LLM being overly lazy and not attempt to extract information for these fields. So there's a balance in what fields to mark as optional! 
- **Descriptions for Fields**: While not mandatory, it is always a good idea to provide a description for each field. This helps the LLM understand the context in which the field is being extracted and can improve the accuracy of the extraction.  
- **Enums**: We use enums to limit the possible values for a field. e.g. the `FilingInfo` model has an enum for the possible values of `filing_type`.  

Now, let us create an agent to extract this information from the 10K/Q filing.

In [7]:
from llama_cloud.core.api_error import ApiError

try:
    existing_agent = llama_extract.get_agent(name="sec-10k-filing")
    if existing_agent:
        llama_extract.delete_agent(existing_agent.id)
except ApiError as e:
    if e.status_code == 404:
        pass
    else:
        raise

agent = llama_extract.create_agent(name="sec-10k-filing", data_schema=SECFiling)

In [8]:
nvda_10k_extract = agent.extract("../data/sec-edgar-filings/0001065280/10-K/0001065280-25-000044/full-submission.txt")

Uploading files:   0%|          | 0/1 [00:00<?, ?it/s]


FileNotFoundError: [Errno 2] No such file or directory: '../data/sec_edgar_filings/0001065280/10-K/0001065280-25-000044/full-submission.txt'

In [None]:
nvda_10k_extract.data

{'filing_info': {'filing_type': '10-K',
  'filing_date': '',
  'reporting_period_end': '2025-01-26',
  'fiscal_year': 2025,
  'fiscal_quarter': 0},
 'company_profile': {'name': 'NVIDIA Corporation',
  'ticker': 'NVDA',
  'reporting_currency': 'USD',
  'exchanges': ['The Nasdaq Global Select Market'],
  'auditor': None},
 'financial_highlights': {'period_end': '2025-01-26',
  'comparison_period_end': '2024-01-28',
  'currency': 'USD',
  'unit': 'thousands',
  'revenue': 68038.0,
  'revenue_prior_period': 26974.0,
  'revenue_growth': 0.0,
  'gross_profit': None,
  'gross_margin': 75.0,
  'operating_income': None,
  'operating_margin': None,
  'net_income': 72880.0,
  'net_margin': None,
  'eps': None,
  'diluted_eps': None,
  'ebitda': None,
  'free_cash_flow': None},
 'business_segments': [{'name': 'Compute & Networking',
   'description': 'Strong demand for our accelerated computing and AI solutions. Revenue from Data Center computing grew 162% driven primarily by demand for our Hopper

### 3. Assessing the Extraction Results

Let's take a look at the extraction results for Nvidia's 10K filing. The description for management highlights and key risks looks reasonable at first glance. It is hard to verify the accuracy of the financial metrics since this is a long document with many pages.

#### Adding Page Numbers to the Extraction Schema

One way to make it easier to verify the accuracy of the extraction results is to add the page numbers to the extraction schema. This way, we can see which page numbers contain the key financial information. Let us add a `page_numbers` as a sub-field to `FinancialHighlights`, `BusinessSegment` and `GeographicSegment` fields to make it easier for us to verify key financial metrics extracted. 

> **Note**: Page numbers might be off by one due to the relative placement of the page numbers and the surrounding context from which the information is extracted, but it is a quick way to navigate to the relevant sections of the document and sanity test some fields.


In [None]:
from pydantic.fields import FieldInfo

FinancialHighlights.__annotations__["page_numbers"] = List[int]
FinancialHighlights.model_fields["page_numbers"] = FieldInfo(
    annotation=List[int],
    description="Page numbers (at bottom of the page) where the financial metrics above are extracted from.",
)
FinancialHighlights.model_rebuild(force=True)

BusinessSegment.model_fields["page_numbers"] = FieldInfo(
    annotation=List[int],
    description="Page numbers (at bottom of the page) where the financial metrics above are extracted from.",
)
BusinessSegment.model_rebuild(force=True)

GeographicSegment.model_fields["page_numbers"] = FieldInfo(
    annotation=List[int],
    description="Page numbers (at bottom of the page) where the financial metrics above are extracted from.",
)
GeographicSegment.model_rebuild(force=True)

SECFiling.model_rebuild(force=True)
SECFiling.model_json_schema()

{'$defs': {'BusinessSegment': {'description': 'Information about a business segment',
   'properties': {'name': {'description': 'Segment name',
     'title': 'Name',
     'type': 'string'},
    'description': {'description': 'Segment description',
     'title': 'Description',
     'type': 'string'},
    'revenue': {'default': None,
     'description': 'Segment revenue',
     'title': 'Revenue',
     'type': 'number'},
    'revenue_percentage': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Percentage of total company revenue',
     'title': 'Revenue Percentage'},
    'operating_income': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Segment operating income',
     'title': 'Operating Income'},
    'operating_margin': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Segment operating margin percentage',
     'title': 'Operating Margin'},
    'year_over_

In [None]:
agent.data_schema = SECFiling

In [None]:
nvda_10k_extract = agent.extract("./data/sec_filings/nvda_10k.pdf")

In [None]:
nvda_10k_extract.data

{'filing_info': {'filing_type': '10-K',
  'filing_date': '2025-01-26',
  'reporting_period_end': '2025-01-26',
  'fiscal_year': 2025,
  'fiscal_quarter': 1},
 'company_profile': {'name': 'NVIDIA Corporation',
  'ticker': 'NVDA',
  'reporting_currency': 'USD',
  'exchanges': ['The Nasdaq Global Select Market'],
  'auditor': None},
 'financial_highlights': {'period_end': '2025-01-26',
  'comparison_period_end': '2024-01-28',
  'currency': 'USD',
  'unit': 'thousands',
  'revenue': 130497.0,
  'revenue_prior_period': 60922.0,
  'revenue_growth': 114.23,
  'gross_profit': 97858.0,
  'gross_margin': 75.0,
  'operating_income': 81453.0,
  'operating_margin': None,
  'net_income': 72880.0,
  'net_margin': 55.8,
  'eps': None,
  'diluted_eps': None,
  'ebitda': None,
  'free_cash_flow': None,
  'page_numbers': [40, 41, 55, 56, 68]},
 'business_segments': [{'name': 'Compute & Networking',
   'description': 'Includes Data Center accelerated computing platforms and AI solutions and software; netw

#### Verifying Financial Metrics

Now let use the page numbers to verify the accuracy of the financial metrics extracted.

Here's the relevant financial metrics extracted:

```python
{
 'financial_highlights': {'period_end': '2025-01-26',
  'comparison_period_end': '2024-01-28',
  'currency': 'USD',
  'unit': 'thousands',
  'revenue': 130497.0,
  'revenue_prior_period': 60922.0,
  'revenue_growth': 114.23,
  'gross_profit': 97858.0,
  'gross_margin': 75.0,
  'operating_income': 81453.0,
  'operating_margin': None,
  'net_income': 72880.0,
  'net_margin': 55.8,
  'eps': None,
  'diluted_eps': None,
  'ebitda': None,
  'free_cash_flow': None,
  'page_numbers': [40, 41, 55, 56, 68]},
}
```
We can see that the gross margin of 75% is extracted fro page 40. The revenue number of 130,497 is extracted from page 41 which also has the breakdown of the revenue by segment.

**Page 40 (showing gross margin of 75%):**
<img src="./data/sec_filings/nvda_10k_page_40.png" width="50%" alt="NVIDIA 10K Page 40">

**Page 41 (showing revenue of 130,497):**
<img src="./data/sec_filings/nvda_10k_page_41.png" width="50%" alt="NVIDIA 10K Page 41">

You can likewise verify that the geographic breakdown of revenue is extracted from page 79 correctly. 

### General Guidelines for Long-Form Document Extraction

- **Schema Iteration using the Web UI**: We have a Web UI with a schema builder that can help you define your schema and iterate on different documents. We have a 10-K/Q schema for you to get started with if you are interested in trying this out. 
  Start small and build from there! Refer to the tips above. Try your schema on different documents to see whether it generalizes to the target documents.
- **Citations**: You can ask the extraction agent to provide page numbers for key figures extracted. This will help you quickly navigate to the relevant section of the document and verify the veracity of the information extracted. 
  We will have a more robust and convenient citation feature in the future. 
- **Run scalable batch jobs**: Once you have confidence that the extraction agent is working well, you can use your agent via our [Python SDK](https://github.com/run-llama/llama_cloud_services) to run scalable batch jobs. 

![Web UI with the 10-K/Q Template](./data/sec_filings/web_ui.png)