# Statement Extractor Demo

This notebook demonstrates how to use the **corp-extractor** library to extract structured subject-predicate-object triples from unstructured text.

**Features:**
- Transform text into structured triples using T5-Gemma2
- Entity type recognition (ORG, PERSON, GPE, etc.)
- 5-stage extraction pipeline with pluggable components
- Entity database for organization and person lookup
- Document processing (URLs, PDFs)

**Resources:**
- [PyPI Package](https://pypi.org/project/corp-extractor/)
- [GitHub Repository](https://github.com/corp-o-rate/statement-extractor)
- [Hugging Face Model](https://huggingface.co/corp-o-rate/t5gemma2-statement-extractor)

## 1. Setup

### Prerequisites

Before running this notebook:

1. **Use a GPU runtime**: Runtime → Change runtime type → T4 GPU
2. **Accept the Gemma license**: Visit [google/gemma-3-12b-it-qat-q4_0-gguf](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf) and accept the license agreement
3. **Have a Hugging Face account**: You'll need to log in below

In [None]:
# Install the corp-extractor package
!pip install -q corp-extractor

# Verify installation
import statement_extractor
print(f"Installed version: {statement_extractor.__version__}")

In [None]:
# Log in to Hugging Face (required for gated models)
# This will prompt you to enter your HuggingFace token
# Get your token at: https://huggingface.co/settings/tokens

from huggingface_hub import login
login()

In [None]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
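
If you want later cells to adapt to the runtime instead of only printing its status, the check above can be wrapped in a small helper. This is a minimal sketch: `pick_device` is a hypothetical name, and the notebook does not show whether corp-extractor accepts an explicit device argument, so the result is only printed here.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing can run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

On a Colab T4 runtime this should report `cuda`; if it reports `cpu`, recheck the runtime type from the prerequisites.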

## 2. Simple Extraction

The simplest way to use the library is with the `extract_statements` function.

In [None]:
from statement_extractor import extract_statements

# Extract statements from text
text = "Apple Inc. announced a new iPhone today. Tim Cook presented the device at their Cupertino headquarters."

result = extract_statements(text)

# Display the extracted statements
print(f"Found {len(result)} statements:\n")
for stmt in result:
    print(f"  Subject: {stmt.subject.text} ({stmt.subject.entity_type})")
    print(f"  Predicate: {stmt.predicate}")
    print(f"  Object: {stmt.object.text} ({stmt.object.entity_type})")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
    print()
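
Because each statement carries a confidence score, a common next step is to drop low-confidence triples. The sketch below uses stand-in dataclasses that mirror only the attributes printed above (`subject`, `predicate`, `object`, `confidence_score`); the class names, the sample scores, and the 0.7 threshold are illustrative assumptions, not library output.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Stand-in for the library's entity objects (hypothetical).
    text: str
    entity_type: str


@dataclass
class Statement:
    # Stand-in exposing the same attribute names used in the cell above.
    subject: Entity
    predicate: str
    object: Entity
    confidence_score: float


def filter_by_confidence(statements, threshold=0.7):
    """Keep statements scoring at or above threshold, highest score first."""
    kept = [s for s in statements if s.confidence_score >= threshold]
    return sorted(kept, key=lambda s: s.confidence_score, reverse=True)


# Made-up scores, for illustration only.
sample = [
    Statement(Entity("Apple Inc.", "ORG"), "announced", Entity("iPhone", "PRODUCT"), 0.91),
    Statement(Entity("Tim Cook", "PERSON"), "presented", Entity("the device", "PRODUCT"), 0.64),
    Statement(Entity("Apple Inc.", "ORG"), "is headquartered in", Entity("Cupertino", "GPE"), 0.78),
]

for stmt in filter_by_confidence(sample):
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
```

The same function should work on the real `result` from `extract_statements`, provided its items expose `confidence_score` as shown above.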

### Output Formats

The same extraction is also available as a dictionary, a JSON string, or an XML string:

In [None]:
from statement_extractor import (
    extract_statements_as_dict,
    extract_statements_as_json,
    extract_statements_as_xml
)

text = "Microsoft acquired Activision Blizzard for $68.7 billion."

# Get as dictionary
data = extract_statements_as_dict(text)
print("As Dictionary:")
print(data)
print()

In [None]:
# Get as JSON
import json
json_str = extract_statements_as_json(text)
print("As JSON:")
print(json.dumps(json.loads(json_str), indent=2))

In [None]:
# Get as XML
xml_str = extract_statements_as_xml(text)
print("As XML:")
print(xml_str)
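
To see how one triple maps onto these three representations, here is a standard-library sketch. The key names and XML element names are assumptions chosen for illustration; the schema that corp-extractor actually emits is not shown in this notebook and may differ.

```python
import json
import xml.etree.ElementTree as ET

# A hand-written triple with a hypothetical layout.
triple = {
    "subject": {"text": "Microsoft", "type": "ORG"},
    "predicate": "acquired",
    "object": {"text": "Activision Blizzard", "type": "ORG"},
}


def triple_to_json(t: dict) -> str:
    """Serialize the triple dictionary as indented JSON."""
    return json.dumps(t, indent=2)


def triple_to_xml(t: dict) -> str:
    """Serialize the triple as a small XML fragment."""
    stmt = ET.Element("statement")
    subj = ET.SubElement(stmt, "subject", type=t["subject"]["type"])
    subj.text = t["subject"]["text"]
    pred = ET.SubElement(stmt, "predicate")
    pred.text = t["predicate"]
    obj = ET.SubElement(stmt, "object", type=t["object"]["type"])
    obj.text = t["object"]["text"]
    return ET.tostring(stmt, encoding="unicode")


print(triple_to_json(triple))
print(triple_to_xml(triple))
```

The dictionary form suits further processing in Python, JSON suits storage or APIs, and XML keeps entity types as attributes next to the text.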

## 3. Full Extraction Pipeline

For more comprehensive extraction, use the 5-stage pipeline:

| Stage | Name | Description |
|-------|------|-------------|
| 1 | Splitting | Text → raw triples (T5-Gemma2) |
| 2 | Extraction | Raw triples → typed statements (GLiNER2) |
| 3 | Qualification | Add identifiers + canonical names |
| 4 | Labeling | Add sentiment, relation type |
| 5 | Taxonomy | Classify against taxonomies |

In [None]:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

# Create pipeline with default config
pipeline = ExtractionPipeline()

# Process text through all stages
text = """
Amazon CEO Andy Jassy announced plans to invest $4 billion in AI infrastructure.
The company will build new data centers in Virginia and Oregon.
AWS, Amazon's cloud division, will lead the initiative.
"""

ctx = pipeline.process(text)

print(f"Pipeline completed. Found {len(ctx.labeled_statements)} labeled statements.\n")

In [None]:
# Explore the labeled statements
for i, labeled in enumerate(ctx.labeled_statements, 1):
    stmt = labeled.statement
    print(f"Statement {i}:")
    print(f"  {labeled.subject_fqn} → {stmt.predicate} → {labeled.object_fqn}")

    # Show labels
    if labeled.labels:
        print(f"  Labels: {labeled.labels}")

    # Show taxonomy classifications
    if labeled.taxonomy_results:
        top_topics = sorted(labeled.taxonomy_results, key=lambda x: x.score, reverse=True)[:2]
        print(f"  Topics: {[t.topic for t in top_topics]}")
    print()
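
The sort-and-slice used for the top topics can also be written with `heapq.nlargest`, which avoids sorting the full list when only a few results are needed. `TaxonomyResult` here is a stand-in with the two attributes used above (`topic`, `score`); the topics and scores are made up.

```python
import heapq
from dataclasses import dataclass


@dataclass
class TaxonomyResult:
    # Stand-in for the library's taxonomy result objects (hypothetical).
    topic: str
    score: float


results = [
    TaxonomyResult("cloud computing", 0.58),
    TaxonomyResult("capital investment", 0.83),
    TaxonomyResult("artificial intelligence", 0.91),
]


def top_topics(taxonomy_results, k=2):
    """Return the k highest-scoring topic names, best first."""
    best = heapq.nlargest(k, taxonomy_results, key=lambda r: r.score)
    return [r.topic for r in best]


print(top_topics(results))
```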

### Pipeline Configuration

You can customize which stages run and which plugins are enabled:

In [None]:
# Run only stages 1-3 (skip labeling and taxonomy)
config = PipelineConfig(
    enabled_stages={1, 2, 3},
)

pipeline = ExtractionPipeline(config)
ctx = pipeline.process("Google announced Gemini 2.0 at their Mountain View campus.")

print(f"Stages 1-3 only: {len(ctx.statements)} statements extracted")
for stmt in ctx.statements:
    print(f"  {stmt.subject.text} → {stmt.predicate} → {stmt.object.text}")
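
Conceptually, `enabled_stages` acts as a gate in front of each numbered stage. The toy pipeline below sketches that idea with plain functions and a shared context dictionary. Every function here is a hypothetical stand-in (for example, stage 1 is a naive sentence split rather than a model), so treat it as an illustration of the control flow, not the library's implementation.

```python
def split(ctx):
    # Stage 1 stand-in: naive sentence split instead of a model.
    ctx["raw"] = [s.strip() for s in ctx["text"].split(".") if s.strip()]
    return ctx


def extract(ctx):
    # Stage 2 stand-in: wrap each raw item as a statement record.
    ctx["statements"] = [{"text": raw} for raw in ctx["raw"]]
    return ctx


def qualify(ctx):
    # Stage 3 stand-in: mark statements as qualified.
    for stmt in ctx["statements"]:
        stmt["qualified"] = True
    return ctx


def label(ctx):
    # Stage 4 stand-in: attach a fixed label.
    for stmt in ctx["statements"]:
        stmt["labels"] = {"sentiment": "neutral"}
    return ctx


def classify(ctx):
    # Stage 5 stand-in: attach an empty topic list.
    for stmt in ctx["statements"]:
        stmt["topics"] = []
    return ctx


STAGES = {1: split, 2: extract, 3: qualify, 4: label, 5: classify}


def run_pipeline(text, enabled_stages=frozenset(STAGES)):
    """Run the enabled stages in numeric order over a shared context."""
    ctx = {"text": text}
    for number in sorted(STAGES):
        if number in enabled_stages:
            ctx = STAGES[number](ctx)
    return ctx


text = "Google announced Gemini at Mountain View. AWS will lead the initiative."
partial = run_pipeline(text, enabled_stages={1, 2, 3})
print(partial["statements"])
```

In this sketch each stage reads what earlier stages wrote, so disabling stage 2 while keeping stage 3 would fail with a `KeyError`. Skipping trailing stages, as in the cell above, is the safe pattern.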

In [None]:
# List available plugins
# Note: the underscore-prefixed registry attributes are private and may change between releases
from statement_extractor.pipeline import PluginRegistry

print("Available plugins by stage:")
print(f"  Splitters: {list(PluginRegistry._splitters.keys())}")
print(f"  Extractors: {list(PluginRegistry._extractors.keys())}")
print(f"  Qualifiers: {list(PluginRegistry._qualifiers.keys())}")
print(f"  Labelers: {list(PluginRegistry._labelers.keys())}")
print(f"  Taxonomy: {list(PluginRegistry._taxonomy_classifiers.keys())}")

## 4. Entity Database

The library includes an entity database for organization and person lookup. This enables entity qualification with canonical IDs.

In [None]:
# Download the entity database (lite version, ~500MB)
!corp-extractor db download

In [None]:
# Check database status
!corp-extractor db status

In [None]:
# Search for organizations
from statement_extractor.database import OrganizationDatabase

db = OrganizationDatabase()

# Search for Microsoft
results = db.search("Microsoft", limit=5)
print("Search results for 'Microsoft':")
for match in results:
    print(f"  {match.record.name} (score: {match.score:.3f})")
    print(f"    Type: {match.record.entity_type}")
    print(f"    Source: {match.record.source}")
    if match.record.lei:
        print(f"    LEI: {match.record.lei}")
    print()
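
The `score` on each match ranks candidates by how well they fit the query. The notebook does not say how the database computes it, so the sketch below substitutes plain string similarity from `difflib` over a tiny hand-written name list, purely to illustrate scored, limited search. The names and the `search` helper are hypothetical.

```python
from difflib import SequenceMatcher

# Tiny hand-written candidate list, for illustration only.
ORG_NAMES = [
    "Alphabet Inc.",
    "Micron Technology",
    "Microsoft Corporation",
    "Microsoft Ireland Operations Limited",
]


def search(query, names, limit=5):
    """Return up to `limit` (name, score) pairs, best match first."""
    q = query.lower()
    scored = [(name, SequenceMatcher(None, q, name.lower()).ratio()) for name in names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


for name, score in search("Microsoft", ORG_NAMES, limit=3):
    print(f"{name} (score: {score:.3f})")
```

Character-level similarity penalizes long names, so a long legal name containing the query can rank below a shorter, less relevant one. That limitation is one reason to prefer the library's own database search over ad hoc matching.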

In [None]:
# Search for people
from statement_extractor.database import PersonDatabase

people_db = PersonDatabase()

results = people_db.search("Elon Musk", limit=5)
print("Search results for 'Elon Musk':")
for match in results:
    print(f"  {match.record.name} (score: {match.score:.3f})")
    print(f"    Type: {match.record.person_type}")
    print(f"    Role: {match.record.role}")
    if match.record.org_name:
        print(f"    Organization: {match.record.org_name}")
    print()

## 5. Document Processing

Process entire documents including URLs and PDFs:

In [None]:
# Process a multi-sentence document supplied as an in-memory string
# (URL and PDF inputs are not exercised in this cell)

from statement_extractor.document import DocumentPipeline

doc_pipeline = DocumentPipeline()

# Sample document text
sample_doc = """
Tesla announced record quarterly deliveries of 500,000 vehicles.
CEO Elon Musk attributed the growth to strong demand in China.
The company's Shanghai Gigafactory produced 250,000 units.
Tesla stock rose 5% following the announcement.
"""

result = doc_pipeline.process_text(sample_doc)

print(f"Document processing found {len(result.statements)} statements:")
for stmt in result.statements[:5]:  # Show first 5
    print(f"  {stmt.subject.text} → {stmt.predicate} → {stmt.object.text}")
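
Longer documents usually have to be broken into pieces that fit a model's input window. How `DocumentPipeline` does this internally is not shown here, so the following is only a sketch of one simple approach: greedy packing of whole sentences into chunks under a character budget. The function name and the `max_chars` parameter are assumptions.

```python
import re

sample_doc = """
Tesla announced record quarterly deliveries of 500,000 vehicles.
CEO Elon Musk attributed the growth to strong demand in China.
The company's Shanghai Gigafactory produced 250,000 units.
Tesla stock rose 5% following the announcement.
"""


def chunk_sentences(text, max_chars=150):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_sentences(sample_doc)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```

Keeping sentences intact means a subject and its predicate are never split across chunks, at the cost of uneven chunk sizes.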

## 6. CLI Usage

The library also provides a command-line interface:

In [None]:
# Simple extraction
!corp-extractor split "Apple released the iPhone 16 with new AI features."

In [None]:
# Full pipeline with verbose output
!corp-extractor pipeline "Amazon acquired Whole Foods for \$13.7 billion." -v

In [None]:
# List available plugins
!corp-extractor plugins list
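
Outside a notebook, the same commands can be launched from Python. Passing the arguments as a list to `subprocess.run` bypasses the shell, so characters such as `$` need no escaping. The helpers below are hypothetical; only the `pipeline` subcommand and the `-v` flag come from the cells above, and `run_cli` is defined but not called so that nothing executes on import.

```python
import shutil
import subprocess


def build_command(subcommand, text, verbose=False):
    """Build an argv list for the corp-extractor CLI."""
    argv = ["corp-extractor", subcommand, text]
    if verbose:
        argv.append("-v")
    return argv


def run_cli(argv):
    """Run the command if the CLI is installed and return its stdout."""
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} is not on PATH")
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


argv = build_command("pipeline", "Amazon acquired Whole Foods for $13.7 billion.", verbose=True)
print(argv)
```

Call `run_cli(argv)` once the package is installed to capture the command's output as a string.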

## 7. Advanced: Custom Extraction Options

Adjust how statements are generated with custom options:

In [None]:
from statement_extractor import StatementExtractor, ExtractionOptions

# Create extractor with custom options
options = ExtractionOptions(
    num_beams=5,           # Number of beams for diverse beam search
    num_beam_groups=5,     # Number of beam groups
    diversity_penalty=0.5, # Penalty for similar beams
    max_new_tokens=512,    # Maximum output length
)

extractor = StatementExtractor(options=options)

text = "NVIDIA reported $26 billion in revenue, driven by AI chip demand from Microsoft and Google."
result = extractor.extract(text)

print(f"Extracted {len(result)} statements with custom options:")
for stmt in result:
    print(f"  {stmt.subject.text} → {stmt.predicate} → {stmt.object.text}")
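
Diverse beam search splits the beams into equally sized groups and penalizes groups for repeating each other's tokens, which constrains the three beam options relative to one another. The checker below sketches those constraints. `validate_beam_options` is a hypothetical helper, and whether `ExtractionOptions` performs similar validation itself is not shown in this notebook.

```python
def validate_beam_options(num_beams, num_beam_groups, diversity_penalty):
    """Return a list of problems with a diverse beam search configuration."""
    problems = []
    if num_beams < 1 or num_beam_groups < 1:
        problems.append("num_beams and num_beam_groups must be at least 1")
    elif num_beam_groups > num_beams:
        problems.append("num_beam_groups cannot exceed num_beams")
    elif num_beams % num_beam_groups != 0:
        # Groups must be the same size, so the division has to be exact.
        problems.append("num_beams must be divisible by num_beam_groups")
    if num_beam_groups > 1 and diversity_penalty <= 0:
        # Without a positive penalty the groups have no reason to differ.
        problems.append("diversity_penalty must be positive when using several groups")
    return problems


print(validate_beam_options(num_beams=5, num_beam_groups=5, diversity_penalty=0.5))
print(validate_beam_options(num_beams=5, num_beam_groups=2, diversity_penalty=0.5))
```

The configuration used in the cell above (5 beams in 5 groups with a 0.5 penalty) passes these checks; each group then holds a single beam.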

## Summary

This notebook demonstrated:

1. **Simple extraction** with `extract_statements()`
2. **Multiple output formats** (dict, JSON, XML)
3. **Full pipeline** with 5 stages of processing
4. **Pipeline configuration** to enable/disable stages
5. **Entity database** for organization and person lookup
6. **Document processing** for longer texts
7. **CLI commands** for terminal usage
8. **Custom extraction options** for tuning generation settings

For more information, see the [documentation](https://github.com/corp-o-rate/statement-extractor).