# Document Metadata Extraction with Fenic

This notebook demonstrates how to extract structured metadata from unstructured document text using fenic's semantic operations. Fenic's structured extraction uses large language models to parse free-form text into typed fields. It works across diverse document types, including research papers, product announcements, meeting notes, news articles, and technical documentation.

## What You'll Learn

- Setting up fenic sessions with semantic capabilities
- Creating DataFrames from document data
- Extracting structured metadata using AI-powered operations

Let's dive in!

## Setup and Configuration

First, we need to import the necessary libraries and configure our fenic session. We'll set up:

- **Type hints** from `typing` (`Literal`, `List`) for declaring field types in the extraction schema
- **Pydantic** for defining the extraction schema
- **Fenic** as our main DataFrame library

For the session configuration, we register OpenAI's GPT-4o-mini model under the alias `mini`, with the following rate limits:
- **RPM (Requests Per Minute)**: 500 requests
- **TPM (Tokens Per Minute)**: 200,000 tokens

These limits tell fenic how fast it may call the model, so the extraction tasks stay within the API quota.
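To get a feel for what these two limits imply together, here is a rough back-of-the-envelope sketch in plain Python. It assumes one request per document, and the `tokens_per_doc` figures are illustrative guesses rather than measurements; the helper `min_minutes` is hypothetical and not part of fenic.

```python
# Rate limits copied from the session configuration.
RPM = 500        # requests per minute
TPM = 200_000    # tokens per minute


def min_minutes(num_docs: int, tokens_per_doc: int) -> float:
    """Lower bound on processing time (in minutes) imposed by the limits.

    Assumes one request per document and ignores network latency,
    so real runs can only be slower than this.
    """
    by_requests = num_docs / RPM
    by_tokens = num_docs * tokens_per_doc / TPM
    # Whichever limit is hit first determines the floor.
    return max(by_requests, by_tokens)


# If every request slot is used, this is the average token budget per request.
tokens_per_request_at_full_rpm = TPM / RPM  # 400.0

# Assumed workload: 10,000 documents at ~600 tokens each (prompt + output).
print(min_minutes(10_000, 600))
```

When the documents average more than 400 tokens per request, the token limit is the bottleneck; below that, the request limit is.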

In [None]:
from typing import Literal, List
from pydantic import BaseModel, Field
import fenic as fc

# Configure session with semantic capabilities
config = fc.SessionConfig(
    app_name="document_extraction",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000,
            )
        }
    ),
)

# Create session
session = fc.Session.get_or_create(config)

## Sample Document Data

Now let's create our test dataset. It contains five documents, one per document type, to show how the same extraction schema handles different kinds of text:

1. **Research Paper** (`doc_001`) - Academic study on neural networks and climate prediction
2. **Product Announcement** (`doc_002`) - CloudSync Pro file synchronization software launch
3. **Meeting Notes** (`doc_003`) - Engineering team standup with decisions and action items
4. **News Article** (`doc_004`) - Breaking news about a data breach incident
5. **Technical Documentation** (`doc_005`) - API reference for an authentication service

Each document contains different metadata (titles, dates, keywords, etc.) that we'll extract automatically. After creating the DataFrame, we display each document's ID and text length as a quick sanity check.

In [None]:
documents_data = [
    {
        "id": "doc_001",
        "text": "Neural Networks for Climate Prediction: A Comprehensive Study. Published March 15, 2024. This research presents a novel deep learning approach for predicting climate patterns using multi-layered neural networks. Our methodology combines satellite imagery data with ground-based sensor readings to achieve 94% accuracy in temperature forecasting. The study was conducted over 18 months across 12 research stations. Keywords: machine learning, climate modeling, neural networks, environmental science."
    },
    {
        "id": "doc_002",
        "text": "Introducing CloudSync Pro - Next-Generation File Synchronization. Release Date: January 8, 2024. CloudSync Pro revolutionizes how teams collaborate with real-time file synchronization across unlimited devices. Features include end-to-end encryption, automatic conflict resolution, and integration with over 50 productivity tools. Pricing starts at $12/month per user with enterprise discounts available. Contact our sales team for a personalized demo."
    },
    {
        "id": "doc_003",
        "text": "Weekly Engineering Standup - December 4, 2023. Attendees: Sarah Chen (Lead), Marcus Rodriguez (Backend), Lisa Park (Frontend), James Wilson (DevOps). Key decisions: Migration to Kubernetes approved for Q1 2024, new CI/CD pipeline reduces deployment time by 60%, API rate limiting implementation scheduled for next sprint. Action items: Sarah to finalize container specifications, Marcus to document database migration plan."
    },
    {
        "id": "doc_004",
        "text": "Breaking: Major Data Breach Affects 2.3 Million Users. December 12, 2023 - TechCorp announced today that unauthorized access to customer databases occurred between November 28-30, 2023. Compromised data includes email addresses, encrypted passwords, and partial payment information. The company has implemented additional security measures and is offering free credit monitoring to affected users. Stock prices dropped 8% in after-hours trading."
    },
    {
        "id": "doc_005",
        "text": "API Reference: Authentication Service v2.1. Last updated: February 20, 2024. The Authentication Service provides secure user login and session management for distributed applications. Supports OAuth 2.0, SAML, and multi-factor authentication. Rate limits: 1000 requests per hour for standard accounts, 10000 for premium. Available endpoints include /auth/login, /auth/refresh, /auth/logout. Response format: JSON with standardized error codes."
    }
]

# Create DataFrame
docs_df = session.create_dataframe(documents_data)

docs_df.select("id", fc.text.length("text").alias("text_length")).show()


## Structured Extraction

Fenic performs structured extraction by taking a Pydantic model as the target schema and asking an LLM to fill it in. Here's how it works:

1. **Schema Definition with Pydantic**

   Define a Pydantic model to represent the structure of the data you want to extract. Each field must include a natural language description. This schema drives prompt generation and model output parsing.

2. **LLM Orchestration**

   Fenic uses the model provider of your choice to call the LLM with a structured output or tool-calling interface. The LLM returns data that conforms to the schema you defined.

3. **Data Structuring**

   The extracted data is represented as a struct column in a DataFrame with native Fenic struct fields. From there, it can be:

   - Unnested into individual columns
   - Exploded if it contains arrays
   - Processed in place as nested data
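To make "unnest" and "explode" concrete, the sketch below shows what each reshaping step does to a single row, using plain Python dictionaries. The `unnest` and `explode` functions here are local illustrative helpers, not fenic calls, and the sample row is hand-written to stand in for an extracted struct rather than taken from real LLM output.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def unnest(rows: List[Row], struct_col: str) -> List[Row]:
    """Promote every field of a nested struct to a top-level column."""
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != struct_col}
        flat.update(row[struct_col])
        out.append(flat)
    return out


def explode(rows: List[Row], list_col: str) -> List[Row]:
    """Emit one output row per element of a list column.

    Rows whose list is empty produce no output rows in this sketch.
    """
    out = []
    for row in rows:
        for item in row[list_col]:
            out.append({**row, list_col: item})
    return out


# Hand-written stand-in for one row with an extracted struct column.
rows = [
    {
        "id": "doc_001",
        "metadata": {
            "title": "Neural Networks for Climate Prediction",
            "keywords": ["machine learning", "climate modeling"],
        },
    }
]

flat = unnest(rows, "metadata")      # columns: id, title, keywords
exploded = explode(flat, "keywords")  # one row per keyword
for row in exploded:
    print(row)
```

Unnesting keeps the row count and widens the table; exploding keeps the columns and multiplies the rows.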

Because Fenic maps Pydantic models to a strongly typed, columnar data model, certain Python types are not currently supported:

- **Non-Optional Union types**: Not expressible in Fenic's type system
- **Dictionaries**: Fenic does not yet support map types (future support via a JsonType is planned)
- **Custom classes / dataclasses**: These are stateful or logic-heavy constructs that don't fit the declarative nature of Fenic's data model

Despite these constraints, you can define complex extraction schemas using nested Pydantic models, optional fields, and lists, which covers most metadata extraction needs.
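One way to catch problematic annotations before sending anything to the LLM is a small pre-flight check over a schema's type hints. The sketch below is a hypothetical helper (`find_unsupported_fields` is not part of fenic) that encodes only the constraints listed above: non-Optional unions, dictionaries, and dataclasses. It uses only the standard library and inspects a plain annotated class; it does not attempt to detect arbitrary custom classes.

```python
import dataclasses
import types
from typing import (Dict, List, Literal, Optional, Union,
                    get_args, get_origin, get_type_hints)


def unsupported_reason(tp) -> Optional[str]:
    """Return which listed constraint `tp` violates, or None if it passes."""
    origin = get_origin(tp)
    args = get_args(tp)
    # Covers both Union[...] and the `X | Y` syntax (Python 3.10+).
    union_type = getattr(types, "UnionType", None)
    if origin is Union or (union_type is not None and origin is union_type):
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) > 1:
            return "non-Optional union"
        # Optional[X]: check the wrapped type instead.
        return unsupported_reason(non_none[0])
    if tp is dict or origin is dict:
        return "dictionary"
    if dataclasses.is_dataclass(tp):
        return "dataclass"
    if origin is list and args:
        # Lists are fine as long as their element type is.
        return unsupported_reason(args[0])
    return None


def find_unsupported_fields(schema_cls) -> Dict[str, str]:
    """Map each offending field name to the constraint it violates."""
    problems = {}
    for name, tp in get_type_hints(schema_cls).items():
        reason = unsupported_reason(tp)
        if reason is not None:
            problems[name] = reason
    return problems


@dataclasses.dataclass
class Author:  # hypothetical type, used only to trigger the dataclass rule
    name: str


class Candidate:  # hypothetical schema mixing passing and failing fields
    title: str
    date: Optional[str]
    keywords: List[str]
    doc_type: Literal["news article", "other"]
    page: Union[int, str]
    extra: Dict[str, str]
    author: Author


problems = find_unsupported_fields(Candidate)
print(problems)
```

In this sketch, `title`, `date`, `keywords`, and `doc_type` pass, while `page`, `extra`, and `author` are flagged. Replacing `Author` with a nested Pydantic model and `extra` with a list of key/value models would be one way to stay within the constraints.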

Let's see it in action on the documents!

In [None]:
# Define Pydantic model for document metadata
class DocumentMetadata(BaseModel):
    """Pydantic model for document metadata extraction."""
    title: str = Field(description="The main title or subject of the document")
    document_type: Literal["research paper", "product announcement", "meeting notes", "news article", "technical documentation", "other"] = Field(description="Type of document")
    date: str = Field(description="Any date mentioned in the document (publication date, meeting date, etc.)")
    keywords: List[str] = Field(description="List of key topics, technologies, or important terms mentioned in the document")
    summary: str = Field(description="Brief one-sentence summary of the document's main purpose or content")

# Apply extraction using Pydantic model
pydantic_extracted_df = docs_df.select(
    "id",
    fc.semantic.extract("text", DocumentMetadata).alias("metadata")
)

# Flatten the extracted metadata into separate columns
pydantic_results = pydantic_extracted_df.select(
    "id",
    pydantic_extracted_df.metadata.title.alias("title"),
    pydantic_extracted_df.metadata.document_type.alias("document_type"),
    pydantic_extracted_df.metadata.date.alias("date"),
    pydantic_extracted_df.metadata.keywords.alias("keywords"),
    pydantic_extracted_df.metadata.summary.alias("summary")
)

pydantic_results.show()


## Cleanup

Finally, we stop the fenic session to free up resources.

In [None]:
# Clean up
session.stop()