# Method 4: LangExtract for News Article NER Extraction

This notebook demonstrates named-entity recognition (NER) on news articles using Google's LangExtract library, which uses an LLM guided by a prompt and few-shot examples to extract structured information from text.

## Overview
- **Approach**: Schema-guided extraction with multiple passes
- **Model**: Gemini 2.0 Flash (or other Gemini models)
- **Key Features**:
  - Multiple extraction passes for better recall
  - Parallel processing for speed
  - Smart chunking for long documents
  - Interactive visualizations
  - JSONL output for portability
- **Advantages**: 
  - High accuracy with world knowledge
  - Handles long documents efficiently
  - Rich entity attributes
  - No training required
- **Disadvantages**:
  - Requires Google API key
  - API costs per request
  - Internet connection required
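
To make the "smart chunking" feature concrete, here is a minimal sketch of character-budgeted chunking. It is an illustration only, not LangExtract's actual algorithm: the function name `chunk_text` and the naive sentence splitter are assumptions, and only the `max_char_buffer` parameter name comes from this notebook.

```python
import re


def chunk_text(text: str, max_char_buffer: int = 2000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_char_buffer characters."""
    # Naive sentence split on ., ! or ? followed by whitespace (illustrative only).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_char_buffer:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the budget is hard-split.
        while len(sentence) > max_char_buffer:
            chunks.append(sentence[:max_char_buffer])
            sentence = sentence[max_char_buffer:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences whole where possible reduces the chance that an entity mention is cut in half at a chunk boundary.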

## 1. Setup and Installation

In [None]:
# Install LangExtract if not already installed
!pip install -q langextract google-generativeai python-dotenv

In [None]:
import sys
sys.path.append('..')

from src.config import NERConfig, PROCESSED_DATA_DIR, RESULTS_DIR
from src.data_loader import NERDataLoader
from src.langextract_pipeline import LangExtractNERExtractor
from src.evaluation import NEREvaluator
from src.benchmark import NERBenchmark

import json
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Check for API key
if not os.getenv('GOOGLE_API_KEY'):
    print("⚠️ Warning: GOOGLE_API_KEY not found in environment")
    print("Please set it in .env file or set it now:")
    print("   export GOOGLE_API_KEY='your-api-key-here'")
    print("\nGet your API key from: https://ai.google.dev/")
else:
    print("✓ Google API key found")
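
The check above looks only at `GOOGLE_API_KEY`. A slightly more tolerant lookup could be sketched as follows. The helper `resolve_api_key` is hypothetical. `LANGEXTRACT_API_KEY` is the name used in LangExtract's own documentation, but whether `LangExtractNERExtractor` reads either variable is an assumption here.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    names: tuple[str, ...] = ("GOOGLE_API_KEY", "LANGEXTRACT_API_KEY"),
) -> Optional[str]:
    """Return the first non-blank key found under the given variable names."""
    for name in names:
        value = env.get(name, "").strip()
        if value:  # Treat empty or whitespace-only values as missing
            return value
    return None
```

Passing the environment as an argument keeps the helper easy to test without touching real credentials.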

## 2. Load Configuration

In [None]:
# Initialize configuration
config = NERConfig()

print("Configuration:")
print(f"  Entity types: {config.entity_types}")
print(f"  Model: Gemini 2.0 Flash (via LangExtract)")

## 3. Load Dataset

In [None]:
# Load processed dataset
val_dataset = NERDataLoader.load_json_dataset(PROCESSED_DATA_DIR / "validation.json")
test_dataset = NERDataLoader.load_json_dataset(PROCESSED_DATA_DIR / "test.json")

print(f"Validation set size: {len(val_dataset)}")
print(f"Test set size: {len(test_dataset)}")

# Show example
print("\nExample sample:")
print(f"Text: {val_dataset[0]['text'][:200]}...")
print(f"Entities: {val_dataset[0]['entities']}")

## 4. Initialize LangExtract Extractor

In [None]:
# Initialize extractor
extractor = LangExtractNERExtractor(
    config=config,
    model_id="gemini-2.0-flash-exp"  # Experimental Gemini 2.0 Flash variant (fast and cost-effective)
)

print("✓ LangExtract extractor initialized!")
print(f"\nPrompt:")
print(extractor.prompt)
print(f"\nNumber of examples: {len(extractor.examples)}")

## 5. Test on Sample Examples

In [None]:
# Test on a few examples
num_examples = 3

for i, sample in enumerate(val_dataset[:num_examples]):
    print(f"\n{'='*80}")
    print(f"Example {i+1}")
    print(f"{'='*80}")
    
    text = sample['text']
    ground_truth = sample['entities']
    
    print(f"\nText: {text[:300]}...\n")
    
    # Extract entities with LangExtract
    print("Extracting entities...")
    predicted = extractor.extract_entities(
        text,
        extraction_passes=2,  # Multiple passes for better recall
        max_workers=5,
        max_char_buffer=2000
    )
    
    print("Ground Truth:")
    print(json.dumps(ground_truth, indent=2, ensure_ascii=False))
    
    print("\nPredicted:")
    print(json.dumps(predicted, indent=2, ensure_ascii=False))
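
Why can `extraction_passes=2` improve recall? Each pass may surface mentions the other missed, and the union is kept. The sketch below shows that idea on a simplified output shape, `{entity_type: [mentions]}`, which is an assumption about what `extract_entities` returns. The helper `merge_passes` is hypothetical, and LangExtract's internal merge logic is not reproduced here.

```python
def merge_passes(passes: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Union mentions across passes, keeping first-seen order and the first-seen spelling."""
    merged: dict[str, list[str]] = {}
    seen: dict[str, set[str]] = {}
    for result in passes:
        for entity_type, mentions in result.items():
            bucket = merged.setdefault(entity_type, [])
            keys = seen.setdefault(entity_type, set())
            for mention in mentions:
                key = mention.strip().casefold()  # Case-insensitive dedup key
                if key and key not in keys:
                    keys.add(key)
                    bucket.append(mention.strip())
    return merged


pass_1 = {"person": ["Alice Smith"], "organizations": ["Acme Corp"]}
pass_2 = {"person": ["alice smith", "Bob Lee"], "address": ["12 Main St"]}
merged = merge_passes([pass_1, pass_2])
```

Here the second pass adds "Bob Lee" and an address that the first pass missed, while the duplicate "alice smith" is dropped.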

## 6. Extract with Detailed Attributes

In [None]:
# Test detailed extraction with attributes
sample_text = val_dataset[0]['text']

print("Extracting with detailed attributes...\n")
detailed_result = extractor.extract_with_details(
    sample_text,
    extraction_passes=2
)

print("Detailed Extraction Results:")
print("=" * 80)

for entity_type in ["person", "organizations", "address"]:
    entities = detailed_result[entity_type]
    print(f"\n{entity_type.upper()} ({len(entities)} entities):")
    for entity in entities[:5]:  # Show first 5
        attrs = ", ".join(f"{k}={v}" for k, v in entity['attributes'].items())
        attrs_str = f" [{attrs}]" if attrs else ""
        print(f"  - {entity['text']}{attrs_str}")

print(f"\nStatistics:")
print(json.dumps(detailed_result['statistics'], indent=2))

## 7. Evaluate on Validation Set

**Note**: This will make API calls for each sample. Start with a small subset to estimate costs.
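
A rough way to estimate the number of API calls before running the evaluation is sketched below. It assumes one request per chunk per pass and at least one request per document. The helper `estimate_api_calls` is hypothetical, and actual request counts depend on how LangExtract chunks and batches text.

```python
import math


def estimate_api_calls(
    texts: list[str],
    extraction_passes: int = 2,
    max_char_buffer: int = 2000,
) -> int:
    """Estimate requests as (chunks per document) x (passes), summed over documents."""
    total_chunks = sum(
        max(1, math.ceil(len(text) / max_char_buffer)) for text in texts
    )
    return total_chunks * extraction_passes


# A 500-character article fits in one chunk; a 4,500-character article needs three.
calls = estimate_api_calls(["a" * 500, "b" * 4500])
```

A simple `samples * passes` estimate matches this only when every article fits within a single `max_char_buffer` chunk; longer articles increase the count.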

In [None]:
# Use a small subset for testing
# Adjust this number based on your API quota and budget
EVAL_SUBSET_SIZE = 50  # Start small, increase if needed

val_subset = val_dataset[:EVAL_SUBSET_SIZE]

print(f"Evaluating on {len(val_subset)} samples...")
print(f"Estimated API calls: ~{len(val_subset) * 2} (with 2 extraction passes)")
print("\nThis may take a few minutes...\n")

# Run evaluation
predictions, ground_truth = extractor.evaluate_on_dataset(val_subset)

# Evaluate
evaluator = NEREvaluator(entity_types=config.entity_types)
results = evaluator.evaluate_all(predictions, ground_truth)

# Print results
evaluator.print_results(results)

# Save results
results_path = RESULTS_DIR / "langextract_validation.json"
evaluator.save_results(results, results_path)
print(f"Results saved to {results_path}")
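
For readers who want to sanity-check the reported numbers, here is a minimal sketch of exact-match accuracy and macro F1. It uses normalized string matching over a `{entity_type: [mentions]}` shape. Both the shape and the matching rule are assumptions: the internals of `NEREvaluator` are not shown in this notebook, and its `partial_match_metrics` may score partial overlaps differently.

```python
def _norm(sample: dict, entity_type: str) -> set[str]:
    """Lowercased, stripped set of mentions for one entity type."""
    return {m.strip().casefold() for m in sample.get(entity_type, [])}


def f1_for_type(predictions, ground_truth, entity_type):
    """Micro-averaged precision, recall and F1 for one entity type across samples."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truth):
        p, g = _norm(pred, entity_type), _norm(gold, entity_type)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(predictions, ground_truth, entity_types):
    """Unweighted mean of per-type F1 scores."""
    if not entity_types:
        return 0.0
    scores = [f1_for_type(predictions, ground_truth, t)[2] for t in entity_types]
    return sum(scores) / len(scores)


def exact_match_accuracy(predictions, ground_truth, entity_types):
    """Share of samples whose predicted sets equal the gold sets for every type."""
    if not ground_truth:
        return 0.0
    hits = sum(
        all(_norm(p, t) == _norm(g, t) for t in entity_types)
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

Macro averaging gives each entity type equal weight, so a rare type with poor extraction pulls the score down as much as a common one.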

## 8. Run Benchmark on Test Set (Optional)

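**Warning**: The commented-out cell below processes the first 100 test samples (`TEST_SUBSET_SIZE`). Running it, or raising the limit toward the full test set, may incur significant API costs.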

In [None]:
# Uncomment to run full benchmark
# TEST_SUBSET_SIZE = 100
# test_subset = test_dataset[:TEST_SUBSET_SIZE]

# benchmark = NERBenchmark(config=config)
# test_results = benchmark.run_benchmark(
#     method_name="LangExtract",
#     extractor=extractor,
#     test_dataset=test_subset,
#     verbose=True
# )

# # Save benchmark results
# benchmark.save_results(RESULTS_DIR / "langextract")

print("⚠️ Full benchmark commented out to avoid unexpected API costs.")
print("Uncomment the code above to run the full benchmark.")

## 9. Create Interactive Visualization

LangExtract can render extracted entities as an interactive HTML visualization.

In [None]:
# Process a few articles and create visualization
vis_samples = val_dataset[:10]
vis_texts = [s['text'] for s in vis_samples]

# Save annotated documents
jsonl_path = RESULTS_DIR / "langextract_samples.jsonl"
extractor.save_annotated_documents(
    vis_texts,
    output_path=str(jsonl_path),
    extraction_passes=2
)

# Create visualization
html_path = RESULTS_DIR / "langextract_visualization.html"
extractor.create_visualization(
    jsonl_path=str(jsonl_path),
    output_html_path=str(html_path)
)

print("\n✓ Visualization created!")
print(f"Open {html_path} in your browser to explore the results.")

## 10. Analyze Extraction Statistics

In [None]:
# Analyze the extraction results
stats = LangExtractNERExtractor.analyze_extraction_statistics(str(jsonl_path))

print("\n" + "="*80)
print("EXTRACTION STATISTICS")
print("="*80)

print(f"\nDocuments processed: {stats['total_documents']}")
print(f"Total characters: {stats['total_characters']:,}")
print(f"Total extractions: {stats['total_extractions']}")
print(f"Extractions per document: {stats['extractions_per_document']:.1f}")

print("\nExtractions by class:")
for entity_class, count in stats['class_counts'].items():
    percentage = (count / stats['total_extractions']) * 100
    print(f"  {entity_class}: {count} ({percentage:.1f}%)")

print("\nUnique entities:")
for entity_class, count in stats['unique_entities'].items():
    print(f"  {entity_class}: {count}")
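
The statistics above can also be recomputed directly from the JSONL file. The sketch below assumes each line is a JSON object with a `text` field and an `extractions` list whose items carry `extraction_class` and `extraction_text`. That schema is an assumption about LangExtract's output, and the helper `summarize_jsonl` is hypothetical, not the implementation of `analyze_extraction_statistics`.

```python
import json
from collections import Counter


def summarize_jsonl(lines) -> dict:
    """Aggregate per-class and per-document counts from JSONL lines."""
    class_counts: Counter = Counter()
    unique: dict[str, set[str]] = {}
    total_docs = 0
    total_chars = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        doc = json.loads(line)
        total_docs += 1
        total_chars += len(doc.get("text", ""))
        for ext in doc.get("extractions", []):
            cls = ext["extraction_class"]
            class_counts[cls] += 1
            unique.setdefault(cls, set()).add(ext["extraction_text"].strip().casefold())
    total = sum(class_counts.values())
    return {
        "total_documents": total_docs,
        "total_characters": total_chars,
        "total_extractions": total,
        "extractions_per_document": total / total_docs if total_docs else 0.0,
        "class_counts": dict(class_counts),
        "unique_entities": {cls: len(names) for cls, names in unique.items()},
    }
```

To run it on a file, pass an open file handle, for example `summarize_jsonl(open(jsonl_path, encoding="utf-8"))`.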

## 11. Compare with Other Methods

In [None]:
import pandas as pd

# Load results from other methods (if available)
comparison_data = []

# LangExtract results
comparison_data.append({
    "Method": "LangExtract",
    "Exact Match": results['exact_match_accuracy'],
    "Macro F1": results['partial_match_metrics']['macro_avg']['f1'],
    # Evaluated on a validation subset, not the test set used by the other methods
    "Samples": f"{len(val_subset)} (validation subset)",
    "Notes": "Gemini API, 2 passes"
})

# Try to load other methods
other_methods = {
    "Prompt Engineering": RESULTS_DIR / "prompt_engineering" / "Prompt Engineering_results.json",
    "RAG": RESULTS_DIR / "rag" / "RAG_results.json",
    "Fine-tuning": RESULTS_DIR / "finetuning" / "Fine-tuning_results.json",
}

for method_name, result_path in other_methods.items():
    if result_path.exists():
        with open(result_path, 'r') as f:
            method_results = json.load(f)
        comparison_data.append({
            "Method": method_name,
            "Exact Match": method_results['exact_match_accuracy'],
            "Macro F1": method_results['partial_match_metrics']['macro_avg']['f1'],
            "Samples": "Full test set",
            "Notes": "-"
        })

# Display comparison
if len(comparison_data) > 1:
    df = pd.DataFrame(comparison_data)
    print("\n" + "="*80)
    print("COMPARISON WITH OTHER METHODS")
    print("="*80 + "\n")
    print(df.to_string(index=False))
else:
    print("\n⚠️ No other method results found for comparison.")
    print("Run other method notebooks first to enable comparison.")

## 12. Key Insights and Recommendations

In [None]:
print("\n" + "="*80)
print("LANGEXTRACT METHOD: KEY INSIGHTS")
print("="*80)

print(f"\n📊 Performance on {len(val_subset)} samples:")
print(f"  - Exact Match Accuracy: {results['exact_match_accuracy']:.2%}")
print(f"  - Macro F1 Score: {results['partial_match_metrics']['macro_avg']['f1']:.2%}")

print("\n✅ Strengths:")
print("  - High accuracy with world knowledge enrichment")
print("  - Multiple extraction passes improve recall")
print("  - Rich entity attributes (role, context, type)")
print("  - Handles long documents efficiently with smart chunking")
print("  - Beautiful interactive visualizations")
print("  - JSONL format for portability")
print("  - No training required")

print("\n⚠️ Considerations:")
print("  - Requires Google API key and internet connection")
print("  - API costs per request (though Flash model is cost-effective)")
print("  - Slower than local models due to API calls")
print("  - Subject to API rate limits")

print("\n💡 Best Use Cases:")
print("  - One-time or periodic extraction tasks")
print("  - When high accuracy is critical")
print("  - Long documents or complex news articles")
print("  - When you need rich entity attributes")
print("  - Exploratory analysis with visualizations")

print("\n💰 Cost Optimization Tips:")
print("  - Use gemini-2.0-flash-exp for cost-effectiveness")
print("  - Reduce extraction_passes for simpler texts")
print("  - Batch process documents to minimize overhead")
print("  - Use max_char_buffer wisely for your text length")

print("\n" + "="*80)

## 13. Export for Comparison

Save results in the same format as other methods for fair comparison.

In [None]:
# Save in benchmark format
langextract_dir = RESULTS_DIR / "langextract"
langextract_dir.mkdir(parents=True, exist_ok=True)

# Add method name to results
results['method_name'] = 'LangExtract'
results['model_info'] = 'Gemini 2.0 Flash Experimental'
results['extraction_passes'] = 2

# Save results
with open(langextract_dir / "LangExtract_results.json", 'w') as f:
    json.dump(results, f, indent=2)

# Save predictions
with open(langextract_dir / "predictions.json", 'w') as f:
    json.dump({
        'predictions': predictions,
        'ground_truth': ground_truth,
        'sample_count': len(predictions)
    }, f, indent=2)

print("\n✓ Results saved for comparison!")
print(f"Location: {langextract_dir}")
print("\nYou can now run the comparison notebook to compare with other methods.")