<a href="https://colab.research.google.com/github/anurag-code/LangExtract/blob/main/Lang_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Example docs:

1. https://github.com/google/langextract
2. https://github.com/google/langextract/blob/main/docs/examples/longer_text_example.md


In [None]:
!pip install langextract

Collecting langextract
  Downloading langextract-1.0.8-py3-none-any.whl.metadata (18 kB)
Collecting async_timeout>=4.0.0 (from langextract)
  Downloading async_timeout-5.0.1-py3-none-any.whl.metadata (5.1 kB)
Collecting exceptiongroup>=1.1.0 (from langextract)
  Downloading exceptiongroup-1.3.0-py3-none-any.whl.metadata (6.7 kB)
Collecting ml-collections>=0.1.0 (from langextract)
  Downloading ml_collections-1.1.0-py3-none-any.whl.metadata (22 kB)
Downloading langextract-1.0.8-py3-none-any.whl (84 kB)
Downloading async_timeout-5.0.1-py3-none-any.whl (6.2 kB)
Downloading exceptiongroup-1.3.0-py3-none-any.whl (16 kB)
Downloading ml_collections-1.1.0-py3-none-any.whl (76 kB)
Installing collected packages: ml-collections, exceptiongroup, async_ti

In [None]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.4


In [None]:
# Read the API key from Colab's secret store (Secrets panel in the sidebar)
from google.colab import userdata
import os

API_KEY = userdata.get('GOOGLE_API_KEY_1')
os.environ['GOOGLE_API_KEY_1'] = API_KEY
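
The extraction function further down looks this key up with `os.getenv`, which returns `None` without complaint when the variable is unset. A minimal sketch of a fail-fast lookup follows; `require_env` is a hypothetical helper, not part of LangExtract or Colab.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Usage (hypothetical): stop before any model call if the key is missing
# api_key = require_env("GOOGLE_API_KEY_1")
```

Raising early keeps a missing key from surfacing later as a less obvious authentication error.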

In [None]:
# ESG Data Extraction from Apple Annual Report using LangExtract
# This notebook demonstrates how to extract ESG-related information from a PDF annual report

# Cell 1: Installation and imports
# !pip install langextract pymupdf python-dotenv

import textwrap
import langextract as lx
import fitz  # PyMuPDF for PDF processing
import os
from dotenv import load_dotenv
import json

# Load environment variables (for API keys)
load_dotenv()

# Cell 2: PDF Text Extraction Function
def extract_text_from_pdf(pdf_path, max_pages=None):
    """
    Extract text from PDF file using PyMuPDF

    Args:
        pdf_path (str): Path to the PDF file
        max_pages (int): Maximum number of pages to process (None for all)

    Returns:
        str: Extracted text from the PDF
    """
    try:
        doc = fitz.open(pdf_path)
        text = ""

        pages_to_process = min(len(doc), max_pages) if max_pages else len(doc)

        for page_num in range(pages_to_process):
            page = doc.load_page(page_num)
            text += page.get_text()

        doc.close()
        return text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return ""

# Cell 3: ESG Extraction Configuration
def setup_esg_extraction():
    """
    Setup the LangExtract configuration for ESG data extraction
    """

    # Define a comprehensive prompt for ESG extraction
    prompt = textwrap.dedent("""\
    Extract Environmental, Social, and Governance (ESG) metrics, commitments, and performance data.
    Focus on quantitative metrics, specific targets, timelines, and concrete initiatives.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Prioritize data with numbers, percentages, dates, and specific commitments.""")

    # Provide high-quality examples to guide the model
    examples = [
        lx.data.ExampleData(
            text=(
                "Apple is carbon neutral for our global corporate operations, and by 2030, "
                "we plan to be carbon neutral across our entire business, including manufacturing supply chain and all product life cycles. "
                "In 2023, we reduced our carbon emissions by 75% compared to 2015 baseline. "
                "We invested $4.7 billion in our supplier clean energy programs."
            ),
            extractions=[
                lx.data.Extraction(
                    extraction_class="environmental_target",
                    extraction_text="by 2030, we plan to be carbon neutral across our entire business",
                    attributes={
                        "category": "carbon_neutrality",
                        "target_year": "2030",
                        "scope": "entire_business_including_supply_chain"
                    }
                ),
                lx.data.Extraction(
                    extraction_class="environmental_achievement",
                    extraction_text="reduced our carbon emissions by 75% compared to 2015 baseline",
                    attributes={
                        "category": "carbon_emissions",
                        "metric_value": "75%",
                        "metric_type": "reduction",
                        "baseline_year": "2015",
                        "reporting_year": "2023"
                    }
                ),
                lx.data.Extraction(
                    extraction_class="environmental_investment",
                    extraction_text="invested $4.7 billion in our supplier clean energy programs",
                    attributes={
                        "category": "clean_energy",
                        "amount": "$4.7 billion",
                        "program_type": "supplier_clean_energy"
                    }
                )
            ]
        ),
        lx.data.ExampleData(
            text=(
                "Our workforce diversity continues to improve with women representing 35% of our global workforce "
                "and 30% of our leadership positions. We provided skills training to over 500,000 people through "
                "our education initiatives. Apple maintains the highest standards of data privacy and security."
            ),
            extractions=[
                lx.data.Extraction(
                    extraction_class="social_diversity",
                    extraction_text="women representing 35% of our global workforce",
                    attributes={
                        "category": "gender_diversity",
                        "metric_value": "35%",
                        "scope": "global_workforce"
                    }
                ),
                lx.data.Extraction(
                    extraction_class="social_diversity",
                    extraction_text="30% of our leadership positions",
                    attributes={
                        "category": "leadership_diversity",
                        "metric_value": "30%",
                        "scope": "leadership"
                    }
                ),
                lx.data.Extraction(
                    extraction_class="social_education",
                    extraction_text="provided skills training to over 500,000 people",
                    attributes={
                        "category": "education_training",
                        "beneficiaries": "500,000",
                        "program_type": "skills_training"
                    }
                ),
                lx.data.Extraction(
                    extraction_class="governance_privacy",
                    extraction_text="maintains the highest standards of data privacy and security",
                    attributes={
                        "category": "data_governance",
                        "commitment_level": "highest_standards"
                    }
                )
            ]
        )
    ]

    return prompt, examples

# Cell 4: Main ESG Extraction Function
def extract_esg_data(pdf_path, output_prefix="apple_esg", model_id="gemini-2.5-pro"):
    """
    Extract ESG data from Apple annual report PDF

    Args:
        pdf_path (str): Path to the PDF file
        output_prefix (str): Prefix for output files
        model_id (str): LLM model to use for extraction

    Returns:
        dict: Extraction results and file paths
    """

    print(f"Processing PDF: {pdf_path}")

    # Extract text from PDF
    print("Extracting text from PDF...")
    pdf_text = extract_text_from_pdf(pdf_path, max_pages=50)  # Limit to first 50 pages for demo

    if not pdf_text:
        print("Failed to extract text from PDF")
        return None

    print(f"Extracted {len(pdf_text)} characters from PDF")

    # Setup extraction configuration
    prompt, examples = setup_esg_extraction()

    # Run the extraction
    print("Running ESG extraction...")
    try:
        # Load API key from environment variables
        LANGEXTRACT_API_KEY = os.getenv('GOOGLE_API_KEY_1')
        if not LANGEXTRACT_API_KEY:
            print("Error: API key not found in environment variables.")
            return None

        result = lx.extract(
            text_or_documents=pdf_text,
            prompt_description=prompt,
            examples=examples,
            model_id=model_id,
            api_key=LANGEXTRACT_API_KEY, # Pass the API key here
        )

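        # Save results to a JSONL file. save_annotated_documents writes into
        # "test_output/" unless output_dir is given, so pin it to the current
        # directory, where lx.visualize(output_jsonl) looks for the file.
        output_jsonl = f"{output_prefix}_results.jsonl"
        lx.io.save_annotated_documents([result], output_name=output_jsonl, output_dir=".")
        print(f"Results saved to: {output_jsonl}")

        # Generate interactive visualization
        print("Generating interactive visualization...")
        html_content = lx.visualize(output_jsonl)
        output_html = f"{output_prefix}_visualization.html"
        with open(output_html, "w", encoding="utf-8") as f:
            # In Jupyter/Colab lx.visualize returns an HTML object; write its .data string
            f.write(html_content.data if hasattr(html_content, "data") else html_content)
        print(f"Visualization saved to: {output_html}")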

        return {
            "result": result,
            "jsonl_file": output_jsonl,
            "html_file": output_html,
            "total_extractions": len(result.extractions)
        }

    except Exception as e:
        print(f"Error during extraction: {e}")
        return None

# Cell 5: Analysis and Summary Functions
def analyze_esg_extractions(result):
    """
    Analyze and categorize the extracted ESG data

    Args:
        result: LangExtract result object

    Returns:
        dict: Categorized ESG metrics and insights
    """

    if not result or not result.extractions:
        print("No extractions to analyze")
        return {}

    # Categorize extractions
    categories = {
        "Environmental": [],
        "Social": [],
        "Governance": []
    }

    environmental_keywords = ["environmental", "carbon", "energy", "emission", "renewable", "climate", "waste"]
    social_keywords = ["social", "diversity", "workforce", "education", "community", "training", "inclusion"]
    governance_keywords = ["governance", "privacy", "security", "compliance", "ethics", "board", "risk"]

    for extraction in result.extractions:
        extraction_text_lower = extraction.extraction_text.lower()
        extraction_class_lower = extraction.extraction_class.lower()

        # Categorize based on class and content
        if any(keyword in extraction_class_lower or keyword in extraction_text_lower for keyword in environmental_keywords):
            categories["Environmental"].append(extraction)
        elif any(keyword in extraction_class_lower or keyword in extraction_text_lower for keyword in social_keywords):
            categories["Social"].append(extraction)
        elif any(keyword in extraction_class_lower or keyword in extraction_text_lower for keyword in governance_keywords):
            categories["Governance"].append(extraction)

    # Generate summary statistics
    summary = {
        "total_extractions": len(result.extractions),
        "environmental_count": len(categories["Environmental"]),
        "social_count": len(categories["Social"]),
        "governance_count": len(categories["Governance"]),
        "categories": categories
    }

    return summary

def print_esg_summary(analysis):
    """
    Print a formatted summary of ESG extractions
    """

    if not analysis:
        print("No analysis data available")
        return

    print("\n" + "="*80)
    print("ESG EXTRACTION SUMMARY")
    print("="*80)

    print(f"Total Extractions: {analysis['total_extractions']}")
    print(f"Environmental: {analysis['environmental_count']}")
    print(f"Social: {analysis['social_count']}")
    print(f"Governance: {analysis['governance_count']}")

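    # Print the first few extractions in each category (document order, not ranked)
    for category, extractions in analysis["categories"].items():
        if extractions:
            print(f"\n{category.upper()} HIGHLIGHTS:")
            print("-" * 40)
            for i, extraction in enumerate(extractions[:5], start=1):  # Show first 5
                text = extraction.extraction_text
                # Add an ellipsis only when the text is actually truncated
                preview = text if len(text) <= 100 else text[:100] + "..."
                print(f"{i}. [{extraction.extraction_class}] {preview}")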
                if extraction.attributes:
                    print(f"   Attributes: {extraction.attributes}")

# Cell 6: Main execution function
def main():
    """
    Main function to run the ESG extraction pipeline
    """

    # Configuration
    PDF_PATH = '/content/220308_Informationforexternalpartners_SocialInvestment.pdf'  # Update this path
    OUTPUT_PREFIX = "bhp_esg_2024"
    MODEL_ID = "gemini-2.5-flash"  # faster; extract_esg_data() defaults to "gemini-2.5-pro"

    print("ESG Data Extraction")
    print("="*50)

    # Check if PDF file exists
    if not os.path.exists(PDF_PATH):
        print(f"Error: PDF file not found at {PDF_PATH}")
        print("Please upload your PDF and update the PDF_PATH variable")
        return None, None  # callers unpack two values, so return a pair on failure too

    # Run extraction
    extraction_result = extract_esg_data(
        pdf_path=PDF_PATH,
        output_prefix=OUTPUT_PREFIX,
        model_id=MODEL_ID
    )

    if not extraction_result:
        print("Extraction failed")
        return None, None

    print(f"\nExtraction completed successfully!")
    print(f"Found {extraction_result['total_extractions']} ESG-related entities")

    # Analyze results
    analysis = analyze_esg_extractions(extraction_result["result"])

    # Print summary
    print_esg_summary(analysis)

    print(f"\nOutput files:")
    print(f"- JSONL data: {extraction_result['jsonl_file']}")
    print(f"- HTML visualization: {extraction_result['html_file']}")

    return extraction_result, analysis

# Cell 7: Example usage for specific ESG queries
def query_specific_esg_metrics(result, query_type="environmental"):
    """
    Filter and display specific types of ESG metrics

    Args:
        result: LangExtract result object
        query_type: "environmental", "social", or "governance"
    """

    if not result or not result.extractions:
        return []

    query_keywords = {
        "environmental": ["carbon", "emission", "renewable", "energy", "climate", "waste", "water"],
        "social": ["diversity", "workforce", "employee", "community", "education", "training"],
        "governance": ["board", "compliance", "ethics", "privacy", "security", "risk"]
    }

    keywords = query_keywords.get(query_type.lower(), [])
    filtered_extractions = []

    for extraction in result.extractions:
        text_lower = extraction.extraction_text.lower()
        class_lower = extraction.extraction_class.lower()

        if any(keyword in text_lower or keyword in class_lower for keyword in keywords):
            filtered_extractions.append(extraction)

    print(f"\n{query_type.upper()} METRICS:")
    print("-" * 40)

    for i, extraction in enumerate(filtered_extractions):
        print(f"{i+1}. [{extraction.extraction_class}]")
        print(f"   Text: {extraction.extraction_text}")
        if extraction.attributes:
            print(f"   Attributes: {extraction.attributes}")
        print()

    return filtered_extractions
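
The keyword matching in `analyze_esg_extractions` and `query_specific_esg_metrics` is plain substring search, and in `analyze_esg_extractions` the first matching pillar wins. The sketch below reproduces that behavior on stand-in objects (`SimpleNamespace` instead of real LangExtract extractions; `first_matching_pillar` and the sample texts are hypothetical) to show that the order of the checks decides where an item lands.

```python
from types import SimpleNamespace

# Same keyword lists as analyze_esg_extractions, checked in the same order
PILLAR_KEYWORDS = {
    "Environmental": ["environmental", "carbon", "energy", "emission", "renewable", "climate", "waste"],
    "Social": ["social", "diversity", "workforce", "education", "community", "training", "inclusion"],
    "Governance": ["governance", "privacy", "security", "compliance", "ethics", "board", "risk"],
}


def first_matching_pillar(extraction):
    """Return the first pillar whose keyword occurs in the class or text, else None."""
    haystack = f"{extraction.extraction_class} {extraction.extraction_text}".lower()
    for pillar, keywords in PILLAR_KEYWORDS.items():
        if any(keyword in haystack for keyword in keywords):
            return pillar
    return None


# Made-up samples, not taken from any report
samples = [
    SimpleNamespace(extraction_class="governance_privacy",
                    extraction_text="maintains the highest standards of data privacy and security"),
    SimpleNamespace(extraction_class="social_education",
                    extraction_text="training on energy efficiency for employees"),
    SimpleNamespace(extraction_class="financial", extraction_text="revenue grew"),
]

for sample in samples:
    print(sample.extraction_class, "->", first_matching_pillar(sample))
```

The second sample carries a `social_education` class but is filed under Environmental, because the word "energy" in its text matches before the social keywords are tried. Matching on `extraction_class` alone would avoid that kind of cross-talk, at the cost of depending on the class names the model emits.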

In [None]:
# pdf_path='./240827_bhpannualreport2024.pdf'
# pdf_path='./10-Q4-2024-As-Filed.pdf'

# Run the main extraction pipeline
extraction_result, analysis = main()

# Check if extraction was successful before proceeding
if extraction_result and analysis:
    # Query specific metrics for each ESG pillar
    environmental_metrics = query_specific_esg_metrics(extraction_result["result"], "environmental")
    social_metrics = query_specific_esg_metrics(extraction_result["result"], "social")
    governance_metrics = query_specific_esg_metrics(extraction_result["result"], "governance")

    print("ESG Extraction script ready!")
    print("Instructions:")
    print("1. Install required packages: pip install langextract pymupdf python-dotenv")
    print("2. Set your Google AI API key: os.environ['GOOGLE_API_KEY_1'] = 'your-key'")
    print("3. Update PDF_PATH to point to your annual report PDF")
    print("4. Run main() to start extraction")
    print("5. Open the generated HTML file to view interactive results")
else:
    print("ESG Extraction failed. Please check the error messages above.")



ESG Data Extraction
Processing PDF: /content/220308_Informationforexternalpartners_SocialInvestment.pdf
Extracting text from PDF...
Extracted 6973 characters from PDF
Running ESG extraction...


DEBUG:absl:Initialized Annotator with prompt:
Extract Environmental, Social, and Governance (ESG) metrics, commitments, and performance data.
Focus on quantitative metrics, specific targets, timelines, and concrete initiatives.
Use exact text for extractions. Do not paraphrase or overlap entities.
Prioritize data with numbers, percentages, dates, and

✓ Extraction processing complete

INFO:absl:Finalizing annotation for document ID doc_2a784e45.
INFO:absl:Document annotation completed.

✓ Extracted 3 entities (2 unique types)
  • Time: 15.54s
  • Speed: 449 chars/sec
  • Chunks: 7

LangExtract: Saving to bhp_esg_2024_results.jsonl: 1 docs [00:00, 1394.38 docs/s]

✓ Saved 1 documents to bhp_esg_2024_results.jsonl
Results saved to: bhp_esg_2024_results.jsonl
Generating interactive visualization...
Error during extraction: JSONL file not found: bhp_esg_2024_results.jsonl
Extraction failed





TypeError: cannot unpack non-iterable NoneType object
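
The `TypeError` above is what Python raises when a call that returned a bare `None` is unpacked into two names. A minimal sketch, using hypothetical stand-ins (`run_pipeline_bare`, `run_pipeline_pair`) rather than the notebook's own functions, shows the failure and one way to keep the call site safe: return a pair on every path.

```python
def run_pipeline_bare(ok: bool):
    """Returns a pair on success but a bare None on failure."""
    if not ok:
        return None
    return "result", "analysis"


def run_pipeline_pair(ok: bool):
    """Returns a pair on every path, so callers can always unpack."""
    if not ok:
        return None, None
    return "result", "analysis"


try:
    result, analysis = run_pipeline_bare(ok=False)
except TypeError as exc:
    print(f"Unpacking failed: {exc}")

result, analysis = run_pipeline_pair(ok=False)
if result and analysis:
    print("Pipeline succeeded")
else:
    print("Pipeline failed, but the call site did not crash")
```

With the pair-returning variant, an `if result and analysis:` check at the call site can report the failure instead of being skipped by an exception.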

In [None]:
# Generate the interactive visualization
html_content = lx.visualize("/content/bhp_esg_2024_results.jsonl")
with open("/content/bhpesg.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)

print("Interactive visualization saved to bhpesg.html")

LangExtract: Loading bhp_esg_2024_results.jsonl: 100%|██████████| 8.66k/8.66k [00:00<00:00, 12.8MB/s]

✓ Loaded 1 documents from bhp_esg_2024_results.jsonl
Interactive visualization saved to bhpesg.html
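
The saved JSONL can also be inspected without the HTML viewer. The sketch below assumes each line is one JSON document with an `extractions` list whose items carry an `extraction_class` key; that layout is an assumption based on the `Extraction` fields used earlier, so check a line of your own file before relying on it. `count_by_class` is a hypothetical helper, and the demo file is synthetic.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_by_class(jsonl_path):
    """Count extractions per extraction_class across all documents in a JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            document = json.loads(line)
            # "extractions" may be missing or null; treat both as empty
            for item in document.get("extractions") or []:
                counts[item.get("extraction_class", "unknown")] += 1
    return counts


# Synthetic one-document file standing in for real output (assumed layout)
demo_doc = {
    "extractions": [
        {"extraction_class": "social_education", "extraction_text": "a", "attributes": {}},
        {"extraction_class": "social_education", "extraction_text": "b", "attributes": {}},
        {"extraction_class": "governance_privacy", "extraction_text": "c", "attributes": None},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_results.jsonl"
    demo_path.write_text(json.dumps(demo_doc) + "\n", encoding="utf-8")
    demo_counts = count_by_class(demo_path)

print(dict(demo_counts))
```

Pointing `count_by_class` at `/content/bhp_esg_2024_results.jsonl` would give a per-class tally of the extractions reported in the run above, provided the file follows the assumed layout.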



