## SEC 10K/Q Forms

Companies listed on US stock exchanges are required to file annual and quarterly reports with the SEC. These reports are called 10K (annual) and 10Q (quarterly) filings.
10K/Q filings are information dense, covering the company's business, operations, and financials.
The documents have a loosely defined structure, and the reported metrics and sections may differ based on the company's operations.

That said, there are enough commonalities that we may want to extract the information in a standardized format for downstream analysis. For example, we could
extract a company's financial metrics and analyze its key risk factors after every earnings release.

Let's take a look at Nvidia's 10K filing for fiscal year 2025 (the period ending January 26, 2025). Here's the SEC link for the [10-K filing](https://www.sec.gov/ix?doc=/Archives/edgar/data/0001045810/000104581025000023/nvda-20250126.htm).
As you can see, this is a large document with a lot of information to parse through.



In [None]:
from IPython.display import IFrame

# Preview the same PDF that is passed to the extraction agent later on
IFrame(src="./data/sec_filings/nvda_10k.pdf", width=600, height=400)

In [None]:
from dotenv import load_dotenv
from llama_cloud_services import LlamaExtract


# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)
load_dotenv(override=True)

# Optionally, add your project id/organization id
llama_extract = LlamaExtract()

### Defining the Extraction Schema

To begin with, we'll focus on extracting the following information from the 10K/Q filings:
- **Filing Information**: Date of filing, type of filing, reporting period end date, fiscal year, fiscal quarter
- **Company Profile**: Name, ticker, reporting currency, stock exchanges, auditor
- **Financial Highlights**: Key metrics to assess the company's financial health - revenue, gross profit, operating income, net income, EPS, EBITDA, free cash flow
- **Business/Geographic Segments**: Revenue, operating income, year-over-year growth, outlook for each segment.
- **Risk Factors**: Key risks as identified by the company management.
- **Management Discussion & Analysis (MD&A)**: Key highlights from management discussion and analysis.

We can define the extraction schema either as a JSON schema or as Pydantic models. In this example, we'll use Pydantic models
for a few reasons:
- Extensibility: Pydantic models are flexible and easy to extend and maintain.
- Readability: Pydantic models are easier to read and understand. Nested models in particular are easier to read than deeply nested JSON schemas.
- Type Safety: Validating the extracted data against the Pydantic model catches type mismatches early, e.g. a date field will not silently become a numeric type.
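To make the type-safety point concrete, here is a minimal sketch of Pydantic validation. `MiniFilingInfo` is a hypothetical, trimmed-down model used only for illustration. It is not part of the schema defined in this notebook.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class MiniFilingInfo(BaseModel):
    """Hypothetical, trimmed-down filing model used only for this illustration."""

    filing_type: Literal["10-K", "10-Q"] = Field(description="Type of SEC filing")
    fiscal_year: Optional[int] = Field(None, description="Fiscal year")


# A well-formed payload validates and yields typed attributes.
ok = MiniFilingInfo.model_validate({"filing_type": "10-K", "fiscal_year": 2025})

# A malformed payload is rejected instead of silently passing through.
try:
    MiniFilingInfo.model_validate({"filing_type": "8-K", "fiscal_year": "unknown"})
    error_locations = []
except ValidationError as exc:
    # Each error records which field failed validation.
    error_locations = [err["loc"] for err in exc.errors()]

print(ok)
print(error_locations)
```

Both invalid fields are reported in a single `ValidationError`, which makes it easy to see everything that is wrong with a payload at once.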





In [None]:
from typing import Literal, Optional, List
from pydantic import BaseModel, Field


class FilingInfo(BaseModel):
    """Basic information about the SEC filing"""

    filing_type: Literal["10-K", "10-Q", "10-K/A", "10-Q/A"] = Field(
        description="Type of SEC filing"
    )
    filing_date: str = Field(description="Date when filing was submitted to SEC")
    reporting_period_end: str = Field(description="End date of reporting period")
    fiscal_year: Optional[int] = Field(None, description="Fiscal year")
    fiscal_quarter: Optional[int] = Field(
        None, description="Fiscal quarter (if 10-Q)", ge=1, le=4
    )


class CompanyProfile(BaseModel):
    """Essential company information"""

    name: str = Field(description="Legal name of company")
    ticker: str = Field(description="Stock ticker symbol")
    reporting_currency: str = Field(description="Currency used in financial statements")
    exchanges: Optional[List[str]] = Field(
        None, description="Stock exchanges where listed"
    )
    auditor: Optional[str] = Field(None, description="Company's auditor")


class FinancialHighlights(BaseModel):
    """Key financial metrics from this reporting period"""

    period_end: str = Field(description="End date of reporting period")
    comparison_period_end: Optional[str] = Field(
        None, description="End date of comparison period (typically prior year/quarter)"
    )
    currency: Optional[str] = Field(None, description="Currency of financial figures")
    unit: Optional[str] = Field(
        None, description="Unit of financial figures (thousands, millions, etc.)"
    )
    revenue: float = Field(description="Total revenue for period")
    revenue_prior_period: Optional[float] = Field(
        None, description="Revenue from comparison period"
    )
    revenue_growth: Optional[float] = Field(
        None, description="Revenue growth percentage"
    )
    gross_profit: Optional[float] = Field(None, description="Gross profit")
    gross_margin: Optional[float] = Field(None, description="Gross margin percentage")
    operating_income: Optional[float] = Field(None, description="Operating income")
    operating_margin: Optional[float] = Field(
        None, description="Operating margin percentage"
    )
    net_income: float = Field(description="Net income")
    net_margin: Optional[float] = Field(None, description="Net margin percentage")
    eps: Optional[float] = Field(None, description="Basic earnings per share")
    diluted_eps: Optional[float] = Field(None, description="Diluted earnings per share")
    ebitda: Optional[float] = Field(
        None,
        description="EBITDA (Earnings Before Interest, Taxes, Depreciation, Amortization)",
    )
    free_cash_flow: Optional[float] = Field(None, description="Free cash flow")


class BusinessSegment(BaseModel):
    """Information about a business segment"""

    name: str = Field(description="Segment name")
    description: Optional[str] = Field(None, description="Segment description")
    revenue: Optional[float] = Field(None, description="Segment revenue")
    revenue_percentage: Optional[float] = Field(
        None, description="Percentage of total company revenue"
    )
    operating_income: Optional[float] = Field(
        None, description="Segment operating income"
    )
    operating_margin: Optional[float] = Field(
        None, description="Segment operating margin percentage"
    )
    year_over_year_growth: Optional[float] = Field(
        None, description="Year-over-year growth percentage"
    )
    outlook: Optional[str] = Field(None, description="Future outlook for segment")


class GeographicSegment(BaseModel):
    """Information about a geographic segment"""

    region: str = Field(description="Geographic region")
    revenue: Optional[float] = Field(None, description="Revenue from region")
    revenue_percentage: Optional[float] = Field(
        None, description="Percentage of total company revenue"
    )
    year_over_year_growth: Optional[float] = Field(
        None, description="Year-over-year growth percentage"
    )


class RiskFactor(BaseModel):
    """Information about a risk factor"""

    category: str = Field(
        description="Risk category (e.g., Market, Operational, Legal)"
    )
    title: Optional[str] = Field(None, description="Brief title of risk")
    description: str = Field(description="Description of risk factor")
    potential_impact: Optional[str] = Field(
        None, description="Potential business impact"
    )


class ManagementHighlights(BaseModel):
    """Key highlights from Management Discussion & Analysis"""

    business_overview: Optional[str] = Field(
        None, description="Overview of business and strategy"
    )
    key_trends: Optional[str] = Field(
        None, description="Key trends affecting performance"
    )
    liquidity_assessment: Optional[str] = Field(
        None, description="Management assessment of liquidity"
    )
    outlook_summary: Optional[str] = Field(None, description="Future outlook/guidance")


class SECFiling(BaseModel):
    """Schema for parsing 10-K and 10-Q filings from the SEC"""

    filing_info: FilingInfo = Field(description="Basic information about the filing")
    company_profile: CompanyProfile = Field(description="Essential company information")
    financial_highlights: FinancialHighlights = Field(
        description="Key financial metrics from this reporting period"
    )
    business_segments: Optional[List[BusinessSegment]] = Field(
        None, description="Key business segments information"
    )
    geographic_segments: Optional[List[GeographicSegment]] = Field(
        None, description="Geographic segment information"
    )
    key_risks: Optional[List[RiskFactor]] = Field(
        None, description="Most significant risk factors"
    )
    mda_highlights: Optional[ManagementHighlights] = Field(
        None, description="Key highlights from Management Discussion & Analysis"
    )

Take a look at the schema definition above. We've defined a few models to represent the different sections of the 10K/Q filing. 
We've also defined a `SECFiling` model that combines all the sections into a single model. 

- Many fields in the schema are optional. These are fields that we would like to extract if present, but we know that they do not appear in every filing.
  e.g. a company that only has a US footprint will not have a geographic breakdown of its financials. It is important to designate these fields as optional so that the LLM is not
  forced to make up values for them.
- While not mandatory, it is always a good idea to provide a description for each field. This helps the LLM understand the context in which the field is being extracted and
  can improve the accuracy of the extraction.
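As a small sketch of how optional fields and descriptions surface in the generated JSON schema, consider the hypothetical `MiniSegment` model below (an illustration only, not one of the models defined above). The required field is listed under `required`, while the optional field allows `null` and carries its description.

```python
from typing import Optional

from pydantic import BaseModel, Field


class MiniSegment(BaseModel):
    """Hypothetical model used only for this illustration."""

    region: str = Field(description="Geographic region")
    revenue: Optional[float] = Field(None, description="Revenue from region")


schema = MiniSegment.model_json_schema()

# Only fields without a default are listed as required.
required_fields = schema["required"]

# Optional fields accept null, default to None, and keep their description.
revenue_schema = schema["properties"]["revenue"]

print(required_fields)
print(revenue_schema)
```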

Now, let us create an agent to extract this information from the 10K/Q filing.

In [None]:
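from llama_cloud.core.api_error import ApiError

# Use a name that describes this agent's purpose
AGENT_NAME = "sec-10k-filing"

# Delete any existing agent with the same name so the notebook can be re-run
try:
    existing_agent = llama_extract.get_agent(name=AGENT_NAME)
    if existing_agent:
        llama_extract.delete_agent(existing_agent.id)
except ApiError as e:
    # A 404 just means no agent with this name exists yet
    if e.status_code != 404:
        raise

agent = llama_extract.create_agent(name=AGENT_NAME, data_schema=SECFiling)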

In [None]:
nvda_10k_extract = agent.extract("./data/sec_filings/nvda_10k.pdf")

In [None]:
nvda_10k_extract.data

{'key_risks': [{'title': 'Failure to meet the evolving needs of our industry and markets',
   'category': 'Market',
   'description': 'Our accelerated computing platforms experience rapid changes in technology, customer requirements, competitive products, and industry standards.',
   'potential_impact': 'Adversely impact our financial results.'},
  {'title': 'Competition',
   'category': 'Market',
   'description': 'Our target markets remain competitive, and competition may intensify with expanding and changing product and service offerings, industry standards, customer and market needs, new entrants and consolidations.',
   'potential_impact': 'Adversely impact our market share and financial results.'},
  {'title': 'Long manufacturing lead times and uncertain supply and component availability',
   'category': 'Operational',
   'description': 'Long manufacturing lead times and uncertain supply and component availability, combined with a failure to estimate customer demand accurately, h

In [None]:
from pydantic.fields import FieldInfo

PAGE_NUMBERS_DESCRIPTION = (
    "Page numbers where this information was found in the document. "
    "Page numbers are at the bottom of the page."
)

# Add a required `page_numbers` field to each model whose values we want to
# trace back to the source pages, then rebuild the model so the schema picks it up.
for model in (FinancialHighlights, BusinessSegment, GeographicSegment):
    model.__annotations__["page_numbers"] = List[int]
    model.model_fields["page_numbers"] = FieldInfo(
        annotation=List[int], description=PAGE_NUMBERS_DESCRIPTION
    )
    model.model_rebuild(force=True)

# Rebuild the top-level model so it references the updated nested models
SECFiling.model_rebuild(force=True)
SECFiling.model_json_schema()

{'$defs': {'BusinessSegment': {'description': 'Information about a business segment',
   'properties': {'name': {'description': 'Segment name',
     'title': 'Name',
     'type': 'string'},
    'description': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
     'default': None,
     'description': 'Segment description',
     'title': 'Description'},
    'revenue': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Segment revenue',
     'title': 'Revenue'},
    'revenue_percentage': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Percentage of total company revenue',
     'title': 'Revenue Percentage'},
    'operating_income': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Segment operating income',
     'title': 'Operating Income'},
    'operating_margin': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'default': None,
     'description': 'Segm

In [None]:
agent.data_schema = SECFiling

In [None]:
nvda_10k_extract = agent.extract("./data/sec_filings/nvda_10k.pdf")

In [None]:
nvda_10k_extract.data

{'filing_info': {'filing_type': '10-K',
  'filing_date': 'February 26, 2025',
  'reporting_period_end': 'January 26, 2025',
  'fiscal_year': 2025,
  'fiscal_quarter': None},
 'company_profile': {'name': 'NVIDIA Corporation',
  'ticker': 'NVDA',
  'reporting_currency': '',
  'exchanges': ['The Nasdaq Global Select Market'],
  'auditor': 'PricewaterhouseCoopers LLP'},
 'financial_highlights': {'period_end': '2025-01-26',
  'comparison_period_end': None,
  'currency': None,
  'unit': None,
  'revenue': 130497.0,
  'revenue_prior_period': None,
  'revenue_growth': None,
  'gross_profit': 97858.0,
  'gross_margin': 75.0,
  'operating_income': 81453.0,
  'operating_margin': None,
  'net_income': 72880.0,
  'net_margin': None,
  'eps': 2.97,
  'diluted_eps': 2.94,
  'ebitda': None,
  'free_cash_flow': None,
  'page_numbers': [29, 41, 57, 70]},
 'business_segments': [{'name': 'Compute & Networking',
   'description': 'Includes Data Center accelerated computing platforms and AI solutions and so
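
Several ratio fields in the output above (`operating_margin`, `net_margin`) came back as `None` even though the underlying figures were extracted. One option for downstream analysis is to derive them from the extracted values. The sketch below is a hypothetical post-processing step, not part of LlamaExtract. The `margin_pct` helper is an assumed name, and the numbers are copied from the extraction output above.

```python
from typing import Optional


def margin_pct(numerator: Optional[float], revenue: Optional[float]) -> Optional[float]:
    """Return numerator / revenue as a percentage, or None if it cannot be computed."""
    if numerator is None or not revenue:
        return None
    return round(100.0 * numerator / revenue, 2)


# Values copied from the `financial_highlights` section of the output above.
highlights = {
    "revenue": 130497.0,
    "gross_profit": 97858.0,
    "operating_income": 81453.0,
    "net_income": 72880.0,
}

derived = {
    "gross_margin": margin_pct(highlights["gross_profit"], highlights["revenue"]),
    "operating_margin": margin_pct(highlights["operating_income"], highlights["revenue"]),
    "net_margin": margin_pct(highlights["net_income"], highlights["revenue"]),
}

print(derived)
```

Recomputing `gross_margin` this way also serves as a sanity check against the value the extraction reported directly.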