# EDGAR Ground Truth Extraction (Gemini 2.5 Flash)

**Goal**: Extract fields from SEC 10-K filings using Google Gemini (Flash).

**Key Features**:
1. **Toggle Mode**: Single (per-field) or Batch (all-at-once) queries.
2. **Integrated Cleanup**: Each output row keeps the full text of every section plus per-field `_found` and `_evidence_verified` flags.
3. **Evidence Verification**: Checks whether the evidence quote returned by the model actually appears in the filing text.


In [None]:
!pip list


In [1]:
!pip install -q google-genai pandas tqdm python-Levenshtein datasets

In [2]:
import os
import re
import json
import time
import pandas as pd
from tqdm import tqdm
from google import genai
from google.genai import types
from datasets import load_dataset
import Levenshtein
from google.colab import userdata

# Configure API
# os.environ["GEMINI_API_KEY"] = "YOUR_API_KEY" # Uncomment if not set in env
client = genai.Client(api_key=userdata.get('RESEARCH_GEMINI_API_KEY'))
print("Setup Complete.")

Setup Complete.


In [10]:
# 2. Configuration

# Model Name
MODEL_ID = "gemini-2.5-flash"

# Extraction Mode
# True = One prompt for all fields (Cheaper/Faster)
# False = One prompt per field (Better precision/isolation)
USE_BATCH_MODE = True

# Paths
OUTPUT_FILE = "edgar_gt_gemini_extraction.csv"
MAX_DOCUMENTS = 250
CHECKPOINT_INTERVAL = 10

In [4]:
# 3. Question Bank

QUESTION_BANK = [
    {
        "id": "registrant_name",
        "prompt": (
            "What is the exact legal name of the registrant as explicitly stated in the document? "
            "Return ONLY the legal name string or NOT_FOUND."
        ),
    },
    {
        "id": "headquarters_city",
        "prompt": (
            "What city is explicitly stated as the location of the registrant's principal executive offices? "
            "Return ONLY the city name or NOT_FOUND."
        ),
    },
    {
        "id": "headquarters_state",
        "prompt": (
            "What U.S. state is explicitly stated as the location of the registrant's principal executive offices? "
            "Return ONLY the state name or NOT_FOUND."
        ),
    },
    {
        "id": "incorporation_state",
        "prompt": (
            "Identify the state or other jurisdiction under the laws of which the registrant is **currently** organized or incorporated. "
            "Exclude former jurisdictions. "
            "Return ONLY the state name or NOT_FOUND."
        ),
    },
    {
        "id": "incorporation_year",
        "prompt": (
            "What is the year of the registrant's incorporation in its **current** jurisdiction (state of incorporation)? "
            "Return ONLY the year (YYYY) or NOT_FOUND."
        ),
    },
    {
        "id": "employees_count_total",
        "prompt": (
            "What is the **total** number of persons employed by the registrant? "
            "Include full-time and part-time employees. "
            "Return ONLY the integer (remove commas) or NOT_FOUND."
        ),
    },
    {
        "id": "employees_count_full_time",
        "prompt": (
            "What is the number of **full-time** employees explicitly stated? "
            "Return ONLY the integer (remove commas) or NOT_FOUND."
        ),
    },
    {
        "id": "irs_tax_id",
        "prompt": (
            "What is the IRS Employer Identification Number (EIN) or I.R.S. Employer Identification No. of the registrant? "
            "Return ONLY the number (format XX-XXXXXXX) or NOT_FOUND."
        ),
    },
    {
        "id": "ceo_lastname",
        "prompt": (
            "What is the last name of the individual explicitly identified as the Chief Executive Officer (CEO) of the registrant as explicitly stated in the document? "
            "Return ONLY the last name string or NOT_FOUND."
        ),
    },
    {
        "id": "holder_record_amount",
        "prompt": (
            "What is the number of holders of record of the registrant's common stock as explicitly stated in the document? "
            "Return ONLY the integer (remove commas) or NOT_FOUND."
        ),
    },
]
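Each prompt above asks for a specific output format (an integer without commas, `XX-XXXXXXX`, `YYYY`). A minimal sketch of a format check on the returned values follows. The helper `is_well_formed` and the `_VALIDATORS` table are hypothetical names, not part of the notebook, and the patterns are assumptions read off the prompt wording.

```python
import re

# Hypothetical validators keyed by QUESTION_BANK ids. The patterns are
# assumptions derived from the formats requested in the prompts.
_VALIDATORS = {
    "irs_tax_id": re.compile(r"\d{2}-\d{7}"),
    "incorporation_year": re.compile(r"(?:18|19|20)\d{2}"),
    "employees_count_total": re.compile(r"\d+"),
    "employees_count_full_time": re.compile(r"\d+"),
    "holder_record_amount": re.compile(r"\d+"),
}


def is_well_formed(field_id, value):
    """Return True if value is NOT_FOUND or matches the requested format."""
    text = str(value).strip()
    if text == "NOT_FOUND":
        return True
    pattern = _VALIDATORS.get(field_id)
    if pattern is None:
        # Free-text fields (names, cities, states): only require non-empty text
        return bool(text)
    return pattern.fullmatch(text) is not None
```

A check like this could flag rows for manual review without altering the extracted value itself.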

In [5]:
# 4. Helper Functions: Context & Verification

SECTION_KEYS = [
    "section_1", "section_1A", "section_1B", "section_2", "section_3",
    "section_4", "section_5", "section_6", "section_7", "section_7A",
    "section_8", "section_9", "section_9A", "section_9B", "section_10",
    "section_11", "section_12", "section_13", "section_14", "section_15"
]

def build_full_context(doc):
    """Concatenates all sections."""
    parts = []
    for key in SECTION_KEYS:
        section_text = doc.get(key, "")
        if section_text and section_text.strip():
            parts.append(f"\n\n--- [{key.upper()}] ---\n\n{section_text}")
    return "".join(parts) if parts else ""

def verify_evidence(full_text, evidence_quote, value):
    """
    Verifies if the evidence quote exists in the text.
    Returns: True (Found), False (Not Found/Hallucinated), or None (N/A)
    """
    if str(value) == "NOT_FOUND" or not value:
        return None

    if not evidence_quote or evidence_quote == "NOT_FOUND":
        return False

    # 1. Exact Match
    if evidence_quote in full_text:
        return True

    # 2. Normalized Match
    clean_text = " ".join(full_text.split()).lower()
    clean_evd = " ".join(evidence_quote.split()).lower()
    if clean_evd in clean_text:
        return True

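    # 3. Fingerprint Match (alphanumeric characters only)
    # Tolerates punctuation and spacing differences. This is a strict substring
    # test on the stripped text, not a Levenshtein-style fuzzy match.
    # Only applied if the evidence is substantial (>10 chars).
    if len(clean_evd) > 10: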
        fp_text = re.sub(r'[^a-z0-9]', '', clean_text)
        fp_evd = re.sub(r'[^a-z0-9]', '', clean_evd)
        if fp_evd in fp_text:
            return True

    return False
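The fingerprint step in `verify_evidence` is a strict substring test, so a quote with a single-character slip still fails. A minimal sketch of a true fuzzy check with a 90% threshold follows. It swaps in the standard library's `difflib.SequenceMatcher` ratio rather than Levenshtein distance, and `fuzzy_contains` is a hypothetical helper, not part of the notebook.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text, quote, threshold=0.9):
    """Return True if a window of `text` is at least `threshold` similar to `quote`.

    Similarity is difflib's ratio (2 * matches / total length), which is an
    assumption standing in for a Levenshtein-based score.
    """
    text = " ".join(text.split()).lower()
    quote = " ".join(quote.split()).lower()
    if not quote or not text:
        return False

    # Locate the longest block shared by both strings and use it to align
    # a quote-sized window inside the text.
    matcher = SequenceMatcher(None, text, quote, autojunk=False)
    match = matcher.find_longest_match(0, len(text), 0, len(quote))
    if match.size == 0:
        return False

    start = max(0, match.a - match.b)
    window = text[start:start + len(quote)]
    ratio = SequenceMatcher(None, window, quote, autojunk=False).ratio()
    return ratio >= threshold
```

Aligning on the longest shared block keeps this to two comparisons per quote, at the cost of missing quotes whose longest shared block happens to sit elsewhere in the filing.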

In [6]:
# 5. Gemini Extraction Logic

class ExtractionResult(pd.Series):
    # Just a helper for type hinting
    pass

# Schema for single field extraction
response_schema_single = {
    "type": "OBJECT",
    "properties": {
        "value": {"type": "STRING"},
        "evidence": {"type": "STRING"},
        "source_sentence": {"type": "STRING"}
    },
    "required": ["value", "evidence", "source_sentence"]
}

def call_gemini_single(full_text, prompt_question):
    """Calls Gemini for a single field using structured output."""
    prompt = f"""
    Read the following SEC 10-K filing text and answer the question.

    Question: {prompt_question}

    Instructions:
    - Extract the answer strictly from the text.
    - If not found, set value to "NOT_FOUND".
    - Provide the EXACT evidence quote from the text.
    - Provide the full source sentence containing the evidence.
    """

    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[prompt, full_text],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=response_schema_single,
                temperature=0.1
            )
        )
        return json.loads(response.text)
    except Exception as e:
        print(f"  [Error] Gemini Call Failed: {e}")
        return {"value": "ERROR", "evidence": "ERROR", "source_sentence": str(e)}

def call_gemini_batch(full_text, question_bank):
    """Calls Gemini for ALL fields in one go."""

    # Dynamically build schema for all fields
    properties = {}
    required = []

    for q in question_bank:
        fid = q["id"]
        properties[f"{fid}_value"] = {"type": "STRING"}
        properties[f"{fid}_evidence"] = {"type": "STRING"}
        properties[f"{fid}_source_sentence"] = {"type": "STRING"}
        required.extend([f"{fid}_value", f"{fid}_evidence", f"{fid}_source_sentence"])

    batch_schema = {
        "type": "OBJECT",
        "properties": properties,
        "required": required
    }

    questions_str = "\n".join([f"- {q['id']}: {q['prompt']}" for q in question_bank])

    prompt = f"""
    Read the following SEC 10-K filing and extract the following fields.
    For each field, provide the value, exact evidence quote, and source sentence.
    If not found, use "NOT_FOUND".

    Fields to Extract:
    {questions_str}
    """

    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[prompt, full_text],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=batch_schema,
                temperature=0.1
            )
        )
        return json.loads(response.text)
    except Exception as e:
        print(f"  [Error] Gemini Batch Call Failed: {e}")
        return None
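Both call functions above give up after a single failed request. A minimal sketch of retrying with exponential backoff follows. `with_retries` and its parameters are hypothetical, and the delays are illustrative rather than tuned to any API quota.

```python
import random
import time


def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds (plus a little jitter)
    between attempts and re-raises the last exception if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            sleep(delay)
```

Because `call_gemini_single` and `call_gemini_batch` catch exceptions themselves, a wrapper like this would need to go around the inner `client.models.generate_content(...)` call, for example as `with_retries(lambda: client.models.generate_content(...))`.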

In [9]:
# 6. Main Processing Loop

def process_document(doc):
    result = {
        "filename": doc["filename"],
        "cik": doc.get("cik"),
        "year": doc.get("year")
    }

    # 1. Build Context
    full_text = build_full_context(doc)

    # 2. Append Section Texts (Cleanup Logic)
    for key in SECTION_KEYS:
        result[key] = doc.get(key, None)

    if not full_text:
        # Fill Empty
        for q in QUESTION_BANK:
            result[f"{q['id']}_value"] = "NO_TEXT"
            result[f"{q['id']}_found"] = False
        return result

    # 3. Extraction
    if USE_BATCH_MODE:
        # -- BATCH MODE --
        data = call_gemini_batch(full_text, QUESTION_BANK)
        if data:
            for q in QUESTION_BANK:
                fid = q["id"]
                val = data.get(f"{fid}_value", "NOT_FOUND")
                ev = data.get(f"{fid}_evidence", "NOT_FOUND")
                sent = data.get(f"{fid}_source_sentence", "NOT_FOUND")

                result[f"{fid}_value"] = val
                result[f"{fid}_evidence"] = ev
                result[f"{fid}_source_sentence"] = sent

                # Verify
                verified = verify_evidence(full_text, ev, val)
                result[f"{fid}_evidence_verified"] = verified
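                result[f"{fid}_found"] = str(val) not in ("NOT_FOUND", "None", "ERROR")
        else:
            # Batch failed: mark every field so downstream columns stay consistent
            for q in QUESTION_BANK:
                result[f"{q['id']}_value"] = "ERROR"
                result[f"{q['id']}_found"] = False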

    else:
        # -- SINGLE MODE --
        for q in QUESTION_BANK:
            fid = q["id"]
            # print(f"  Extracting {fid}...")
            data = call_gemini_single(full_text, q["prompt"])
            time.sleep(0.1)

            val = data.get("value", "NOT_FOUND")
            ev = data.get("evidence", "NOT_FOUND")
            sent = data.get("source_sentence", "NOT_FOUND")

            result[f"{fid}_value"] = val
            result[f"{fid}_evidence"] = ev
            result[f"{fid}_source_sentence"] = sent

            # Verify
            verified = verify_evidence(full_text, ev, val)
            result[f"{fid}_evidence_verified"] = verified
            result[f"{fid}_found"] = str(val) not in ("NOT_FOUND", "None", "ERROR")

    return result
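Each row returned by `process_document` carries `{field}_found` and `{field}_evidence_verified` entries. A minimal sketch of tallying them per field over in-memory rows follows. `summarize_verification` is a hypothetical helper, and it assumes plain Python booleans, so values re-read from the CSV may need converting first.

```python
def summarize_verification(rows, field_ids):
    """Count found / verified / unverified values per field across result rows.

    rows: list of dicts shaped like the output of process_document.
    field_ids: the "id" values from QUESTION_BANK.
    """
    summary = {}
    for fid in field_ids:
        found = sum(1 for r in rows if r.get(f"{fid}_found") is True)
        verified = sum(1 for r in rows if r.get(f"{fid}_evidence_verified") is True)
        unverified = sum(1 for r in rows if r.get(f"{fid}_evidence_verified") is False)
        summary[fid] = {"found": found, "verified": verified, "unverified": unverified}
    return summary
```

A field with many `unverified` hits points to quotes that could not be located in the filing text and are worth a manual look.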

In [11]:
# 7. Run Extraction

# Load Dataset
dataset = load_dataset(
    "c3po-ai/edgar-corpus",
    "default",
    split="train",
    streaming=True,
    revision="refs/convert/parquet",
)

# Resume Check
if os.path.exists(OUTPUT_FILE):
    df_results = pd.read_csv(OUTPUT_FILE)
    processed_files = set(df_results["filename"].astype(str))
    print(f"Resuming: {len(processed_files)} documents processed.")
else:
    df_results = pd.DataFrame()
    processed_files = set()

buffer = []
count = 0

print(f"Starting Extraction (Batch={USE_BATCH_MODE}, Model={MODEL_ID})...")

for doc in tqdm(dataset):
    if count >= MAX_DOCUMENTS:
        break

    fname = str(doc["filename"])
    if fname in processed_files:
        continue

    # Process
    try:
        row = process_document(doc)
        if USE_BATCH_MODE:
            time.sleep(0.1)
        buffer.append(row)
        processed_files.add(fname)
        count += 1
    except Exception as e:
        print(f"Error processing {fname}: {e}")
        continue

    # Checkpoint
    if len(buffer) >= CHECKPOINT_INTERVAL:
        new_df = pd.DataFrame(buffer)
        df_results = pd.concat([df_results, new_df], ignore_index=True)
        df_results.to_csv(OUTPUT_FILE, index=False)
        buffer = []
        print(f"Saved checkpoint: {len(df_results)} total")

# Final Save
if buffer:
    new_df = pd.DataFrame(buffer)
    df_results = pd.concat([df_results, new_df], ignore_index=True)
    df_results.to_csv(OUTPUT_FILE, index=False)

print(f"Done. Saved to {OUTPUT_FILE}")

Resuming: 5 documents processed.
Starting Extraction (Batch=True, Model=gemini-2.5-flash)...


15it [01:46,  9.37s/it]

Saved checkpoint: 15 total


25it [03:38, 10.48s/it]

Saved checkpoint: 25 total


35it [05:40, 12.72s/it]

Saved checkpoint: 35 total


45it [07:30,  9.17s/it]

Saved checkpoint: 45 total


55it [10:01, 13.22s/it]

Saved checkpoint: 55 total


65it [11:52, 10.33s/it]

Saved checkpoint: 65 total


75it [13:49, 13.34s/it]

Saved checkpoint: 75 total


85it [16:00, 15.10s/it]

Saved checkpoint: 85 total


95it [18:02, 11.27s/it]

Saved checkpoint: 95 total


105it [20:20, 13.55s/it]

Saved checkpoint: 105 total


115it [22:10,  9.71s/it]

Saved checkpoint: 115 total


125it [24:02, 11.04s/it]

Saved checkpoint: 125 total


135it [26:05, 13.41s/it]

Saved checkpoint: 135 total


145it [27:54, 11.62s/it]

Saved checkpoint: 145 total


155it [30:08, 14.29s/it]

Saved checkpoint: 155 total


165it [31:57, 11.50s/it]

Saved checkpoint: 165 total


175it [33:37,  9.84s/it]

Saved checkpoint: 175 total


185it [35:40, 13.52s/it]

Saved checkpoint: 185 total


195it [37:40, 14.24s/it]

Saved checkpoint: 195 total


205it [40:06, 13.15s/it]

Saved checkpoint: 205 total


215it [42:21, 14.09s/it]

Saved checkpoint: 215 total


225it [44:14, 12.45s/it]

Saved checkpoint: 225 total


235it [46:23, 11.58s/it]

Saved checkpoint: 235 total


245it [48:29, 15.16s/it]

Saved checkpoint: 245 total


255it [50:36, 11.91s/it]

Saved checkpoint: 255 total
Done. Saved to edgar_gt_gemini_extraction.csv
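The resume logic depends on the checkpoint CSV being readable, and an interruption in the middle of `to_csv` could leave it truncated. A minimal sketch of an atomic write follows: write to a temporary file in the same directory, then swap it into place with `os.replace`. `atomic_write` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile


def atomic_write(path, write_fn):
    """Write via write_fn(tmp_path), then atomically replace `path`.

    The temporary file lives in the same directory as `path` so that
    os.replace stays on one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)
    finally:
        # Clean up only if the swap did not happen
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

In the checkpoint step this could be called as `atomic_write(OUTPUT_FILE, lambda p: df_results.to_csv(p, index=False))`.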



