# Working with Document Metadata

Metadata provides critical context about documents that enhances search, retrieval, and organization. In this notebook, we'll explore how to extract, attach, and use metadata in document processing pipelines.

## Learning Objectives

By the end of this notebook, you will be able to:
- Extract metadata from document content and filenames
- Attach metadata to LlamaIndex Document objects
- Propagate metadata through node parsing
- Understand common metadata fields and their uses

## Why Metadata Matters

Metadata enriches documents with contextual information:
- **Context** - Author, date, source, topic information
- **Filtering** - Find documents from specific date ranges or sources
- **Relevance** - Rank results using signals such as recency or document type
- **Organization** - Group related documents together
- **Search enhancement** - Improve retrieval accuracy with metadata filters
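
To make the filtering idea concrete, here is a minimal sketch in plain Python, with no LlamaIndex involved. The records and the `matches` helper are hypothetical stand-ins for documents and a metadata filter; retrieval systems that support metadata filters typically apply a comparable exact-match test alongside text search.

```python
# Hypothetical records standing in for documents with attached metadata.
records = [
    {"text": "Annual report", "metadata": {"year": "2023", "document_type": "annual_report"}},
    {"text": "Q1 update", "metadata": {"year": "2023", "document_type": "quarterly_report"}},
    {"text": "Older annual report", "metadata": {"year": "2022", "document_type": "annual_report"}},
]


def matches(metadata, **required):
    """Return True if every required key is present with exactly the given value."""
    return all(metadata.get(key) == value for key, value in required.items())


# Keep only 2023 annual reports.
selected = [
    r for r in records
    if matches(r["metadata"], year="2023", document_type="annual_report")
]
print([r["text"] for r in selected])
```

Because the test is an exact match on values, it only works if every document uses the same field names and value formats, which is why consistent metadata matters.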

In [None]:
## Extracting and Attaching Metadata

from llama_index.core import Document
from llama_index.core.node_parser import SimpleNodeParser
import os
from datetime import datetime

# Sample document containing details worth capturing as metadata
sample_doc = """
# Annual Report 2023
## Financial Performance
Our company achieved record profits in 2023, with revenue increasing 15% compared to 2022.

## Product Launches
The new X1000 product line was launched in March 2023 and has exceeded sales expectations.

## Future Outlook
We expect continued growth in 2024, driven by expansion into European markets.
"""

# Document file information
filename = "annual_report_2023.md"
file_path = f"/documents/{filename}"

def extract_basic_metadata(content, filename, file_path):
    """
    Extract basic metadata from document content and file information.
    
    Args:
        content (str): Document text
        filename (str): Name of the file
        file_path (str): Full path to the file
        
    Returns:
        dict: Extracted metadata
    """
    # Extract file information
    file_metadata = {
        "file_name": filename,
        "file_path": file_path,
        "file_type": os.path.splitext(filename)[1][1:],  # Extension without dot
        "file_size": len(content),  # Size in characters
        "extracted_date": datetime.now().strftime("%Y-%m-%d")
    }

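    # Extract year, checking the filename before the content: body text can
    # mention other years (the sample report also refers to 2022 and 2024)
    for source in (filename, content):
        year = next((y for y in ["2024", "2023", "2022"] if y in source), None)
        if year:
            file_metadata["year"] = year
            break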

    # Detect document type
    content_lower = content.lower()
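    # Treat "_" and "-" as spaces so "annual_report_2023.md" matches "annual report"
    filename_lower = filename.lower().replace("_", " ").replace("-", " ")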
    
    if "annual report" in content_lower or "annual report" in filename_lower:
        file_metadata["document_type"] = "annual_report"
    elif "quarterly report" in content_lower or "quarterly" in filename_lower:
        file_metadata["document_type"] = "quarterly_report"

    return file_metadata

# Extract metadata
metadata = extract_basic_metadata(sample_doc, filename, file_path)

# Create document with metadata
doc = Document(text=sample_doc, metadata=metadata)

# Display document with metadata
print("Document Created:\n")
print("=" * 80)
print(f"Text (first 100 chars): {doc.text[:100]}...\n")

print("Metadata:")
for key, value in doc.metadata.items():
    print(f"  {key:<20}: {value}")

print("\n✓ Metadata extracted and attached to document")
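
The year lookup in `extract_basic_metadata` only recognizes three hard-coded years. One possible generalization, sketched below with the standard `re` module, looks for any four-digit year and checks the filename before the body, since body text often mentions several years. The helper name `find_year` and the 1900-2099 range are assumptions for illustration, not part of the notebook's code.

```python
import re

# Four digits starting with 19 or 20, not embedded in a longer digit run.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def find_year(filename, content):
    """Return the first plausible year, preferring the filename over the body."""
    for source in (filename, content):
        match = YEAR_PATTERN.search(source)
        if match:
            return match.group(0)
    return None


print(find_year("annual_report_2023.md", "We expect continued growth in 2024."))
print(find_year("notes.md", "Revenue rose 15% compared to 2022."))
print(find_year("notes.md", "No dates here."))
```

A pattern like this can still pick up four-digit numbers that are not years, so treat the result as a best guess rather than a verified date.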

In [None]:
## Metadata Propagation to Nodes

# Create nodes with metadata propagation
# (in llama_index.core, SimpleNodeParser is a backwards-compatible alias of SentenceSplitter)
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents([doc])

print("Metadata Propagation:\n")
print("=" * 80)
print(f"\nNodes created: {len(nodes)}")
print(f"Node text length: {len(nodes[0].text)} characters\n")

print("Node metadata (inherited from document):")
for key, value in nodes[0].metadata.items():
    print(f"  {key:<20}: {value}")

print("\n✓ Metadata automatically propagated from document to nodes")
print("\nThis ensures metadata is preserved throughout the processing pipeline,")
print("enabling filtered search and retrieval at the node level.")
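
Conceptually, propagation means each chunk carries its own copy of the parent document's metadata. The sketch below imitates that with plain Python; `split_with_metadata` is a hypothetical helper that splits on blank lines, and it is not how LlamaIndex's parsers are implemented.

```python
def split_with_metadata(text, metadata):
    """Split text on blank lines and give every chunk a copy of the metadata."""
    chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
    # dict(metadata) makes a shallow copy so chunks do not share one dict.
    return [{"text": chunk, "metadata": dict(metadata)} for chunk in chunks]


doc_metadata = {"file_name": "annual_report_2023.md", "year": "2023"}
chunks = split_with_metadata("First section.\n\nSecond section.", doc_metadata)

# Changing one chunk's metadata leaves the others and the original untouched.
chunks[0]["metadata"]["section"] = "intro"
print(len(chunks))
print(chunks[1]["metadata"])
print("section" in doc_metadata)
```

Copying rather than sharing the dict is the design choice worth noticing: it lets later steps add chunk-level fields without altering the document-level record.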

## Summary

We've explored how to work with document metadata in LlamaIndex:

### Key Concepts

1. **Metadata Extraction** - Extract from:
   - File information (name, path, type, size)
   - Content analysis (dates, document types)
   - Timestamps (extraction date)

2. **Metadata Attachment** - Attach to Document objects:
   - Pass as `metadata` parameter
   - Use dictionary format
   - Include relevant contextual fields

3. **Metadata Propagation** - Node parsers copy document metadata to:
   - Every node (chunk) created from the document
   - Later stages that consume those nodes, such as indexing and filtered retrieval

### Common Metadata Fields

- **Source info**: `file_name`, `file_path`, `file_type`
- **Temporal**: `year`, `date`, `extracted_date`
- **Classification**: `document_type`, `category`, `topic`
- **Size**: `file_size`, `page_count`, `word_count`
- **Origin**: `author`, `source`, `url`

### Best Practices

- Extract metadata early in the pipeline
- Use consistent field names across documents
- Include both file-based and content-based metadata
- Document your metadata schema
- Test metadata propagation through your pipeline
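
One way to act on the advice about consistent names and a documented schema is to write the schema down as data and check each metadata dict against it. The field list, the example values, and the `validate_metadata` helper below are hypothetical; adapt them to the fields your pipeline actually uses.

```python
# Hypothetical schema: field name -> expected Python type.
METADATA_SCHEMA = {
    "file_name": str,
    "file_path": str,
    "file_type": str,
    "file_size": int,
    "extracted_date": str,
}


def validate_metadata(metadata, schema=METADATA_SCHEMA):
    """Return a list of problems; an empty list means the dict fits the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Example values are made up for illustration.
good = {
    "file_name": "annual_report_2023.md",
    "file_path": "/documents/annual_report_2023.md",
    "file_type": "md",
    "file_size": 412,
    "extracted_date": "2024-01-15",
}
bad = {"file_name": "notes.md", "file_size": "412"}

print(validate_metadata(good))
for problem in validate_metadata(bad):
    print(problem)
```

Running the same check on document metadata and on node metadata is a simple way to test that nothing was lost during parsing.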

Metadata is essential for building sophisticated search and retrieval systems that go beyond simple text matching.