In [1]:
import os
import re
import requests
from datetime import datetime

from bs4 import BeautifulSoup, NavigableString

import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    GenerationConfig,
    Tool,
    FunctionDeclaration,
    Part)

In [2]:
project_id = !gcloud config get project
project_id = project_id[0]

vertexai.init(project=project_id, location="us-central1")

## Download SEC filings and parse text to analyze
In this section, we will tackle the limitations of processing massive SEC filings with Gemini. We will explore utilizing the document structure to efficiently extract specific sections and confirm their suitability for analysis within Gemini's token limit.



In [3]:
# IMPORTANT! You must set these variables with REAL VALUES for the correct headers to be in place to access the API
PARTNER_COMPANY = "YOUR COMPANY'S NAME"
PARTNER_WEBSITE = "https://your-companys-website"

headers = {"User-Agent": f"{PARTNER_COMPANY} +{PARTNER_WEBSITE}"}


download_single_filing - Downloads a single SEC filing from a given URL and saves it under a name built from the company's Central Index Key (CIK), the form type (an annual 10-K or quarterly 10-Q), and the filing date.
download_range_of_filings - Queries a company's recent filings, selects those of the requested form type filed within a date range, and downloads them.

In [4]:
BASE_DIR = "filings" # Directory for storing raw filings

def download_single_filing(url, cik, form, date, file_extension):
    """
    This function downloads and saves a SEC filing from a specified URL.

    Args:
        url (str): URL of the filing to download.
        cik (str): Central Index Key (CIK) for the company.
        form (str): The type of SEC filing (e.g., 10-K, 10-Q).
        date (str): The filing date in YYYY-MM-DD format.
        file_extension (str): File extension for saved file (usually 'txt' or 'zip').
    """

    # Send the request with the User-Agent headers defined above
    response = requests.get(url, headers=headers)

    if response.status_code == 200:

        # Create the company's directory if it does not already exist
        dir_name = f"{BASE_DIR}/{cik}"
        os.makedirs(dir_name, exist_ok=True)
        file_path = f"{dir_name}/{form}_{date}.{file_extension}"

        with open(file_path, 'wb') as file:
            file.write(response.content)

        print(f"Downloaded {form} for CIK {cik} on {date} to {file_path}")

        return file_path

    else:
        print(f"Failed to download {form} for CIK {cik} on {date}. Status code: {response.status_code}")

        return None

def download_range_of_filings(cik, starting_year_and_quarter,
                            ending_year_and_quarter, include_10q = False):
    """
    Download 10-K (and optionally 10-Q) filings from EDGAR for a given CIK.

    Args:
        cik (str): Central Index Key (CIK)
        starting_year_and_quarter (str): Specified in the format "2023 Q1" (only the year is used for filtering)
        ending_year_and_quarter (str): Specified in the format "2024 Q4" (only the year is used for filtering)
        include_10q (bool): Whether to include 10-Qs in addition to 10-Ks
    """

    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    # Use the module-level headers built from PARTNER_COMPANY and PARTNER_WEBSITE
    response = requests.get(url, headers=headers)

    forms_to_download = {"10-K", "10-Q"} if include_10q else {"10-K"}

    start_year, start_quarter = starting_year_and_quarter.split()
    end_year, end_quarter = ending_year_and_quarter.split()

    if response.status_code == 200:
        data = response.json()

        filing_paths = []
        for filing_date, form, accession_number in zip(data['filings']['recent']['filingDate'], data['filings']['recent']['form'], data['filings']['recent']['accessionNumber']):
            if (form in forms_to_download) and (int(start_year) <= datetime.strptime(filing_date, '%Y-%m-%d').year <= int(end_year)):
                file_path = download_single_filing(
                    f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number.replace('-', '')}/{accession_number}.txt",
                    cik, form, filing_date, 'htm'
                )
                # Skip failed downloads so callers never try to open a None path
                if file_path:
                    filing_paths.append(file_path)
        return filing_paths
    else:
        print("Error from call.")
        print(response)
        return []

In [5]:
alphabet_cik = "0001652044"

download_range_of_filings(cik=alphabet_cik,
                starting_year_and_quarter="2024 Q1",
                ending_year_and_quarter="2024 Q4")

Downloaded 10-K for CIK 0001652044 on 2024-01-31 to filings/0001652044/10-K_2024-01-31.htm


['filings/0001652044/10-K_2024-01-31.htm']
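
Note that download_range_of_filings splits the quarter out of each argument but filters on the filing year only. A minimal sketch of a quarter-aware filter follows. It assumes calendar quarters and compares against the filing date; the helper names quarter_bounds and in_range are hypothetical and not part of the notebook.

```python
from datetime import date, timedelta
from typing import Tuple


def quarter_bounds(year_and_quarter: str) -> Tuple[date, date]:
    """Return the first and last calendar day of a quarter given as '2024 Q1'."""
    year_text, quarter_text = year_and_quarter.split()
    year = int(year_text)
    quarter = int(quarter_text.upper().lstrip("Q"))
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"Unknown quarter: {quarter_text}")
    first_month = 3 * (quarter - 1) + 1
    start = date(year, first_month, 1)
    if quarter == 4:
        end = date(year, 12, 31)
    else:
        # The last day of a quarter is the day before the next quarter starts.
        end = date(year, first_month + 3, 1) - timedelta(days=1)
    return start, end


def in_range(filing_date: str, starting: str, ending: str) -> bool:
    """Check whether a 'YYYY-MM-DD' filing date falls inside the quarter range."""
    filed = date.fromisoformat(filing_date)
    start, _ = quarter_bounds(starting)
    _, end = quarter_bounds(ending)
    return start <= filed <= end


print(in_range("2024-01-31", "2024 Q1", "2024 Q4"))
print(in_range("2024-01-31", "2024 Q2", "2024 Q4"))
```

Because a 10-K is filed after the fiscal year it covers, filtering on the filing date selects reports by when they were published, not by the period they describe.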

In [7]:
model = GenerativeModel("gemini-2.5-flash",
                        generation_config=GenerationConfig(temperature=0),
                       )

Determine if we can send the entire document to Gemini to analyze at once, using GenerativeModel's count_tokens method to see the number of tokens in the raw file.

In [8]:
downloaded_path = "filings/0001652044/10-K_2024-01-31.htm"

with open(downloaded_path, 'r') as f:
    filing_text = f.read()

response = model.count_tokens(filing_text)
print(response)    

total_tokens: 5798629
total_billable_characters: 13029439
prompt_tokens_details {
  modality: TEXT
  token_count: 5798629
}



With a token count of 5,798,629, we can see that even with Gemini 2.5 Flash's large context window of 1,048,576 tokens, this document is too long to read in a single pass.

We could implement a Retrieval-Augmented Generation (RAG) framework to query small chunks of these documents, but here we are looking for a broader understanding of whole sections rather than small chunks containing a few facts each. We don't mind passing a large number of tokens to Gemini, as this will be an internal tool used by a relatively small number of analysts, not a public tool handling thousands of queries.

SEC filings are required to adhere to a strict structure with named sections. We can use this standard structure to read the text between one section header and the next.
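
The comparison above can be written as a small budget check. This is a sketch only: the 1,048,576-token limit is the figure quoted in the text, and the helper name fits_in_context is hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 1_048_576  # limit quoted in the text above


def fits_in_context(token_count: int, limit: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True when a prompt of token_count tokens fits within the limit."""
    return token_count <= limit


raw_filing_tokens = 5_798_629  # total_tokens reported by count_tokens for the raw file
ratio = raw_filing_tokens / CONTEXT_WINDOW_TOKENS

print(fits_in_context(raw_filing_tokens))
print(f"The raw filing is about {ratio:.2f}x the context window")
```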


In [9]:
items = ['Business',
        'Risk Factors',
        'Unresolved Staff Comments',
        'Properties', 'Legal Proceedings',
        'Mine Safety Disclosures',
        'Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities',
        'Management’s Discussion and Analysis of Financial Condition and Results of Operations',
        'Quantitative and Qualitative Disclosures About Market Risk',
        'Financial Statements and Supplementary Data',
        'Changes in and Disagreements with Accountants on Accounting and Financial Disclosure',
        'Controls and Procedures',
        'Other Information',
        'Disclosure Regarding Foreign Jurisdictions that Prevent Inspections',
        'Directors, Executive Officers and Corporate Governance',
        'Executive Compensation',
        'Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters',
        'Certain Relationships and Related Transactions, and Director Independence',
        'Principal Accountant Fees and Services',
        'Exhibit and Financial Statement Schedules',
        'Form 10-K Summary']

def find_div_id(soup, item_name: str):
    found_tag = soup.find('a', string=item_name)
    if found_tag:
        return found_tag["href"].strip('#')
    else:
        print(f"Couldn't find a matching tag for: {item_name}")
        return None

def get_text_between(soup, cur_name, end_name):
    cur = soup.find('div', id=find_div_id(soup, cur_name)).next_sibling
    end = soup.find('div', id=find_div_id(soup, end_name))
    while cur and cur != end:
        if isinstance(cur, NavigableString):
            text = cur.strip()
            if len(text):
                yield text
        cur = cur.next_element

def get_items_from_filings(item_names, filing_paths):

    print("Items of interest: " + ", ".join(item_names) + "\n")
    item_strings = {item: f"<{item}>\n" for item in item_names}

    for path in filing_paths:
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()

        soup = BeautifulSoup(content, 'html.parser')

        for item in item_names:
            item_index = items.index(item)
            item_index = item_index if item_index < len(items) - 1 else 0
            item_output = ' '.join(text for text in get_text_between(soup, item, items[item_index + 1]))
            item_strings[item] += f"From {os.path.basename(path)}" + "\n" + item_output + "\n"

    return "\n\n".join(item_strings.values())    

In [10]:
item_names = ["Management’s Discussion and Analysis of Financial Condition and Results of Operations"]

report = get_items_from_filings(item_names, [downloaded_path])

# Print the first 2,000 characters as an example.
print(report[0:2000] + "...")

Items of interest: Management’s Discussion and Analysis of Financial Condition and Results of Operations

<Management’s Discussion and Analysis of Financial Condition and Results of Operations>
From 10-K_2024-01-31.htm
Table of Contents Alphabet Inc. ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS Please read the following discussion and analysis of our financial condition and results of operations together with “Note about Forward-Looking Statements,” Part I, Item 1 "Business," Part I, Item 1A "Risk Factors," and our consolidated financial statements and related notes included under Item 8 of this Annual Report on Form 10-K. The following section generally discusses 2023 results compared to 2022 results. Discussion of 2022 results compared to 2021 results to the extent not included in this report can be found in Item 7 of our 2022 Annual Report on Form 10-K. Understanding Alphabet’s Financial Results Alphabet is a collection of businesses 

In [11]:
response = model.count_tokens(report)
print(response)

total_tokens: 12688
total_billable_characters: 50620
prompt_tokens_details {
  modality: TEXT
  token_count: 12688
}
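
To illustrate the idea of reading the text between one section anchor and the next, here is a self-contained sketch on a toy HTML snippet. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; the class name SectionTextParser, the sample HTML, and the ids item7 and item7a are all made up for illustration.

```python
from html.parser import HTMLParser


class SectionTextParser(HTMLParser):
    """Collect text found between <div id=start_id> and <div id=end_id>."""

    def __init__(self, start_id, end_id):
        super().__init__()
        self.start_id = start_id
        self.end_id = end_id
        self.collecting = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        div_id = dict(attrs).get("id")
        if div_id == self.start_id:
            self.collecting = True   # entering the section of interest
        elif div_id == self.end_id:
            self.collecting = False  # reached the next section header

    def handle_data(self, data):
        text = data.strip()
        if self.collecting and text:
            self.chunks.append(text)


sample_html = """
<a href="#item7">Management's Discussion</a>
<a href="#item7a">Market Risk</a>
<div id="item7"></div><p>ITEM 7. Revenue grew.</p><span>Costs fell.</span>
<div id="item7a"></div><p>ITEM 7A. Rates changed.</p>
"""

parser = SectionTextParser("item7", "item7a")
parser.feed(sample_html)
section_text = " ".join(parser.chunks)
print(section_text)
```

As in get_text_between, only the text after the starting anchor and before the ending anchor is kept, which is why the extracted section is so much smaller than the raw filing.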



## Looking up companies' CIKs via Google Search
We were able to download Alphabet's annual report because we were provided its Central Index Key (CIK) number. Gemini already knows some CIKs for large public companies like Alphabet, but for some smaller companies it may hallucinate and invent inaccurate CIKs. We can instead look up correct numbers by having Gemini use everyone's favorite well-known, public, up-to-date source of information: Google Search.

The CIK of Summit Therapeutics is actually 0001599298, but without the ability to look it up, Gemini returns an incorrect number.

In [12]:
model.generate_content("What is Summit Therapeutics' CIK with the SEC?").text

"Summit Therapeutics' CIK with the SEC is **0001367739**."

Use Grounding with Google Search to provide Gemini with a tool to conduct Google searches.

In [13]:
from google import genai
from google.genai.types import GenerateContentConfig, GoogleSearch, HttpOptions
# Alias the google-genai Tool so it does not shadow the vertexai Tool imported earlier
from google.genai.types import Tool as GenAITool
from IPython.display import display, HTML
import os

# Reuse the project detected in the setup cell rather than a hard-coded project ID
os.environ['GOOGLE_CLOUD_PROJECT'] = project_id
location = os.environ['GOOGLE_CLOUD_LOCATION'] = 'us-central1'
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'True'

# Initialize the client, explicitly passing project and location for Vertex AI
client = genai.Client(project=project_id, location=location, http_options=HttpOptions(api_version="v1"))
model_id = "gemini-2.0-flash"

# Configure Google Search as a tool for grounding
search_tool = GenAITool(
    google_search=GoogleSearch()
)

# Helper function to lookup a company's SEC CIK
def lookup_cik(company_name: str) -> str:
    prompt = f"""
Find the Securities and Exchange Commission's Central Index Key (CIK)
for {company_name}.
Return only the 10-digit integer CIK including leading zeroes.
"""
    # Generate content with grounding via Google Search
    response = client.models.generate_content(
        model=model_id,
        contents=prompt,
        config=GenerateContentConfig(
            tools=[search_tool],
            response_modalities=["TEXT"],
        )
    )

    # Concatenate all text parts of the response
    cik = "".join(part.text for part in response.candidates[0].content.parts).strip()
    print(f"Lookup for CIK of {company_name} resulted in: {cik}\n")

    # Display the grounding metadata (search suggestions and sources), if any was returned
    metadata = response.candidates[0].grounding_metadata
    if metadata and metadata.search_entry_point:
        display(
            HTML(
                metadata.search_entry_point.rendered_content
            )
        )

    return cik

 When we use Grounding with Google Search, we are required to display the corresponding Google Search Suggestions which help us understand what Google search was conducted and allow us to investigate the results. 

In [14]:
lookup_cik("Summit Therapeutics")

Lookup for CIK of Summit Therapeutics resulted in: The Securities and Exchange Commission (SEC) Central Index Key (CIK) for Summit Therapeutics is 0001599298.



'The Securities and Exchange Commission (SEC) Central Index Key (CIK) for Summit Therapeutics is 0001599298.'
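
The prompt asks for only the 10-digit CIK, but the grounded response above is a full sentence. A minimal sketch of pulling the number out of such a reply with a regular expression is shown here; the helper name extract_cik is hypothetical, and it assumes the first standalone run of exactly ten digits in the text is the CIK.

```python
import re
from typing import Optional


def extract_cik(text: str) -> Optional[str]:
    """Return the first standalone 10-digit number in text, or None if absent."""
    # The lookarounds reject digit runs longer than ten characters.
    match = re.search(r"(?<!\d)\d{10}(?!\d)", text)
    return match.group(0) if match else None


reply = (
    "The Securities and Exchange Commission (SEC) Central Index Key (CIK) "
    "for Summit Therapeutics is 0001599298."
)
print(extract_cik(reply))
print(extract_cik("Please use the SEC's CIK lookup tool."))
```

Returning None when no number is present lets the caller decide whether to retry the lookup rather than passing a sentence on as if it were a CIK.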

Create a FunctionDeclaration that tells Gemini about the function for looking up a CIK.



In [15]:
lookup_cik_fd = FunctionDeclaration(
    name="lookup_cik",
    description="Look up a company's CIK used for its SEC filings.",
    parameters={
        "type": "object",
        "properties": {
            "company_name": {
                "type": "string",
                "description": "The name of the company to look up."
            },
        },
        "required": [
            "company_name"
        ]
    },
)

The search_tool loaded above cannot be combined with other Tools when passed to Gemini. To offer search alongside other functions, we use a dedicated model call for the grounded search and invoke it from a regular function (lookup_cik), which we then expose to Gemini through a FunctionDeclaration, as we are doing here.

## Empower Gemini to retrieve document sections from relevant years
We will equip Gemini to analyze specific sections of public company documents across different timeframes. We will define functions to retrieve relevant filings and extract information from desired sections within those documents for a more comprehensive analysis.

In [16]:
retrieve_filings_fd = FunctionDeclaration(
    name="retrieve_filings",
    description="Retrieve filings from the SEC EDGAR API for a company within a date range.",
    parameters={
        "type": "object",
        "properties": {
            "cik": {
                "type": "string",
                "description": "The CIK of a company whose documents will be retrieved"
            },
            "starting_year_and_quarter": {
                "type": "string",
                "description": "The first year and quarter of the reporting range, in the format: 2024 Q1"
            },
            "ending_year_and_quarter": {
                "type": "string",
                "description": "The last year and quarter of the reporting range, in the format: 2024 Q1"
            },
            "include_quarterly_reports": {
                "type": "boolean",
                "description": "Whether to include 10-Q quarterly filings in addition to annual 10-K filings"
            },
            "items_of_interest": {
                "description": "An array of one or more sections (called items) of interest from 10-K or 10-Q filings",
                "type": "array",
                "items": {
                "type": "string",
                "enum": [
                    "Business",
                    "Risk Factors",
                    "Unresolved Staff Comments",
                    "Properties",
                    "Legal Proceedings",
                    "Mine Safety Disclosures",
                    "Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities",
                    "Management’s Discussion and Analysis of Financial Condition and Results of Operations",
                    "Quantitative and Qualitative Disclosures About Market Risk",
                    "Financial Statements and Supplementary Data",
                    "Changes in and Disagreements with Accountants on Accounting and Financial Disclosure",
                    "Controls and Procedures",
                    "Other Information",
                    "Disclosure Regarding Foreign Jurisdictions that Prevent Inspections",
                    "Directors, Executive Officers and Corporate Governance",
                    "Executive Compensation",
                    "Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters",
                    "Certain Relationships and Related Transactions, and Director Independence",
                    "Principal Accountant Fees and Services",
                    "Exhibit and Financial Statement Schedules",
                    "Form 10-K Summary"
                ]
                }
        }
        },
        "required": [
            "cik",
            "starting_year_and_quarter",
            "ending_year_and_quarter",
            "items_of_interest"
        ]
    },
)    

In [17]:
sec_tool = Tool(function_declarations=[retrieve_filings_fd,
                                    lookup_cik_fd])    
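
The section names appear twice in this notebook: once in the items list used for parsing and once in the enum of the function declaration. Every enum value must also appear in items, since get_items_from_filings looks names up with items.index. One way to keep the two in sync is to generate the schema fragment from the list. This is a sketch under that assumption; SECTION_NAMES is a shortened sample and items_of_interest_schema is a hypothetical helper.

```python
SECTION_NAMES = [  # shortened sample of the full items list
    "Business",
    "Risk Factors",
    "Unresolved Staff Comments",
]


def items_of_interest_schema(section_names):
    """Build the items_of_interest parameter schema from one list of names."""
    return {
        "type": "array",
        "description": "An array of one or more sections (called items) of interest",
        # Copy the list so later edits to the schema do not alter the source list.
        "items": {"type": "string", "enum": list(section_names)},
    }


schema = items_of_interest_schema(SECTION_NAMES)
print(schema["items"]["enum"])
```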

Create a system_instruction and instantiate a new model that will follow the instructions to create analyses using the SEC filings. In the instructions, we will provide Gemini the current date so that it can calculate relevant dates for queries using relative terminology.



In [18]:
response1 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find the SEC CIK for Acme Corp. Return only the 10-digit CIK.",
    config=GenerateContentConfig(tools=[search_tool], response_modalities=["TEXT"])
)
cik = "".join(p.text for p in response1.candidates[0].content.parts).strip()

from vertexai.generative_models import GenerativeModel, GenerationConfig, Tool, FunctionDeclaration

# Define your function declarations (lookup_cik_fd, retrieve_filings_fd)
sec_tool = Tool(function_declarations=[lookup_cik_fd, retrieve_filings_fd])

system_instruction = """
    - You are a research assistant to a financial analyst.
    - Answer the user's question with an analysis, basing your response on information
    used in filings from the SEC EDGAR database.
    - Quote the SEC filing documents to support your analysis.
    - If you are not certain about a CIK, use your tool to look it up.
    - The current date is {current_date}.
    """.format(
        current_date=datetime.today().strftime("%Y-%m-%d")
    )

model = GenerativeModel(
    model_name="gemini-2.5-flash",
    generation_config=GenerationConfig(temperature=0),
    system_instruction=system_instruction,
    tools=[sec_tool]
)

response2 = model.generate_content(
    contents=f"Retrieve the 10-K for CIK {cik} covering 2023 Q4, and extract 'Risk Factors'."
)
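
Gemini works out the date arguments itself from the current date in the system instruction. The sketch below only illustrates the "2024 Q1" format that the function declaration expects and how a relative phrase such as "the past 3 years" might map onto it. Both helper names are hypothetical, and starting the range at Q1 of the earliest year is an assumption.

```python
from datetime import date
from typing import Tuple


def year_and_quarter(day: date) -> str:
    """Format a date as a calendar year and quarter, e.g. '2024 Q1'."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year} Q{quarter}"


def range_for_past_years(today: date, years: int) -> Tuple[str, str]:
    """Return (start, end) strings covering the given number of past years."""
    start = date(today.year - years, 1, 1)
    return year_and_quarter(start), year_and_quarter(today)


print(year_and_quarter(date(2024, 1, 31)))
print(range_for_past_years(date(2024, 5, 15), 3))
```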

When we ask Gemini a question related to a public company, it may return a text response or a request for a function call. Starting with Gemini 2.0, it may request function calls in separate rounds of chat or several parallel function calls in a single response. If multiple function calls are requested, our reply must include a function response for each one (built with Part.from_function_response). To handle these cases, it helps to define a handle_response function.

In [19]:
def handle_response(response):

    parts_for_inner_response = []
    for part in response.candidates[0].content.parts:
        # If the content part has an attribute called 'text'
        if hasattr(part, "text"):
            print("\n" + part.text)
        # If the content has an attribute called 'function_call'
        if part.function_call:
            function_call = part.function_call
            try:
                if function_call.name == "lookup_cik":
                    cik = lookup_cik(function_call.args["company_name"])
                    parts_for_inner_response.append(
                        Part.from_function_response(
                            name="lookup_cik",
                            response={
                                "content": cik,
                            },
                        )
                    )
                if function_call.name == "retrieve_filings":
                    context = ""
                    filing_paths = download_range_of_filings(function_call.args["cik"],
                                    function_call.args["starting_year_and_quarter"],
                                    function_call.args["ending_year_and_quarter"],
                                    # The declared parameter name is include_quarterly_reports
                                    function_call.args.get("include_quarterly_reports", False)
                    )
                    if filing_paths:
                        # TODO: Load cleaned filing docs
                        report = get_items_from_filings(function_call.args["items_of_interest"], filing_paths)
                        parts_for_inner_response.append(
                            Part.from_function_response(
                                name="retrieve_filings",
                                response={
                                    "content": report,
                                },
                            )
                        )
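                    else:
                        print("No valid filings found or an error was encountered in retrieving them.")
                        # Gemini expects one function response per function call, even on failure
                        parts_for_inner_response.append(
                            Part.from_function_response(
                                name="retrieve_filings",
                                response={
                                    "content": "No valid filings found.",
                                },
                            )
                        )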
            except AttributeError as e:
                print("Exception:")
                print(response)
                print(part)
                print(e)
    if parts_for_inner_response:
        inner_response = chat.send_message(parts_for_inner_response)
        handle_response(inner_response)   
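
The rule that every requested function call needs a matching function response can also be expressed with a dispatch table instead of an if-chain. The sketch below uses stand-in objects so it runs without the Vertex AI SDK: FakeFunctionCall, fake_lookup_cik, HANDLERS, and answer_function_calls are all hypothetical names. In the notebook, each resulting dictionary would correspond to one Part.from_function_response(name=..., response=...) call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeFunctionCall:
    """Stand-in for the function_call attribute of a response part."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def fake_lookup_cik(company_name: str) -> str:
    """Stand-in for lookup_cik that answers from a tiny in-memory table."""
    return {"Alphabet Inc.": "0001652044"}.get(company_name, "unknown")


HANDLERS: Dict[str, Callable[..., Any]] = {"lookup_cik": fake_lookup_cik}


def answer_function_calls(function_calls: List[FakeFunctionCall]) -> List[Dict[str, Any]]:
    """Produce exactly one response per requested call, including on failure."""
    responses = []
    for call in function_calls:
        handler = HANDLERS.get(call.name)
        if handler is None:
            content = f"No handler registered for {call.name}"
        else:
            try:
                content = handler(**call.args)
            except Exception as error:  # report the failure back to the model
                content = f"Error: {error}"
        responses.append({"name": call.name, "response": {"content": content}})
    return responses


calls = [
    FakeFunctionCall("lookup_cik", {"company_name": "Alphabet Inc."}),
    FakeFunctionCall("retrieve_filings", {"cik": "0001652044"}),
]
results = answer_function_calls(calls)
print(results)
```

Adding a new tool then means registering one more entry in the table, and a missing or failing handler still yields a response, so the number of replies always matches the number of requests.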

In [20]:
chat = model.start_chat()

Ask a question that compares the corresponding section across two companies' reports.

In [21]:
response = chat.send_message("How do Alphabet's risks in 2024 compare to Amazon's?")
handle_response(response)

Lookup for CIK of Alphabet Inc. resulted in: The Securities and Exchange Commission (SEC) Central Index Key (CIK) for Alphabet Inc. is 0001652044.



Lookup for CIK of Amazon.com Inc. resulted in: The Securities and Exchange Commission (SEC) assigns a unique Central Index Key (CIK) to each entity or individual that submits filings to the SEC. The CIK is a permanent identifier up to ten digits in length.

While the search results do not directly provide Amazon.com Inc.'s CIK, they offer methods to find it:
*   **CIK Lookup:** The SEC provides a CIK lookup tool on its website. You can search by company name to find the corresponding CIK.

Please use the SEC's CIK lookup tool to find the specific CIK for Amazon.com Inc.



Lookup for CIK of Amazon.com Inc. resulted in: The Securities and Exchange Commission's (SEC) Central Index Key (CIK) for Amazon.com Inc. is 0001018724.



Downloaded 10-K for CIK 0001652044 on 2024-01-31 to filings/0001652044/10-K_2024-01-31.htm
Items of interest: Risk Factors

Downloaded 10-K for CIK 0001018724 on 2024-02-02 to filings/0001018724/10-K_2024-02-02.htm
Items of interest: Risk Factors


Alphabet and Amazon, as leading technology and e-commerce companies, share several common risk factors in 2024, primarily stemming from their extensive global operations, reliance on technology, and exposure to evolving regulatory landscapes. However, their specific business models lead to different emphases and unique risks.

**Common Risk Factors:**

Both companies face intense competition, the complexities of international operations, and significant risks related to data privacy, cybersecurity, and evolving regulatory environments.

*   **Intense Competition:** Both Alphabet and Amazon acknowledge operating in highly competitive and rapidly evolving environments.
    *   **Alphabet:** "Our business environment is rapidly evolving and int

Compare one section across multiple 'versions' (or filing years) of a document.

In [22]:
chat = model.start_chat()
response = chat.send_message("How has Home Depot changed the way it describes its business over the past 3 years?")
handle_response(response)

Lookup for CIK of Home Depot resulted in: The Securities and Exchange Commission (SEC) Central Index Key (CIK) for Home Depot is 0000354950.



Downloaded 10-K for CIK 0000354950 on 2022-03-23 to filings/0000354950/10-K_2022-03-23.htm
Items of interest: Business

Downloaded 10-K for CIK 0000354950 on 2023-03-15 to filings/0000354950/10-K_2023-03-15.htm
Items of interest: Business

Downloaded 10-K for CIK 0000354950 on 2021-03-24 to filings/0000354950/10-K_2021-03-24.htm
Items of interest: Business


Home Depot's business description over the past three years reflects a strategic evolution from completing a major investment cycle to leveraging those investments, adapting to broader economic challenges, and deepening its focus on key customer segments and operational efficiencies.

Here's a breakdown of the changes observed in their 10-K filings:

**1. Strategic Investment and Adaptation to Market Dynamics:**
*   **Fiscal Year 2020 (10-K filed March 24, 2021):** The company stated, "Our multi-year accelerated investment program to create the One Home Depot experience... is now largely complete." The primary challenge highlighted