# Document Processing and Schema Discovery

In this notebook, we'll explore:
1. Understanding knowledge graph schemas
2. Schema discovery with Neo4j LLM Knowledge Graph Builder
3. Document processing with SimpleKGPipeline
4. Iterative schema refinement

In [None]:
# Import required libraries
import os

from dotenv import load_dotenv
from neo4j import GraphDatabase
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.retrievers import VectorRetriever

# Setup connections
load_dotenv()
driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(os.getenv('NEO4J_USERNAME'), os.getenv('NEO4J_PASSWORD'))
)

# Initialize components
# The OpenAI client reads OPENAI_API_KEY from the environment (loaded by load_dotenv above).
# The model name is an assumption; use any chat model you have access to.
llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={"temperature": 0, "response_format": {"type": "json_object"}}
)
embedder = OpenAIEmbeddings()

## 1. Understanding Knowledge Graph Schemas

A knowledge graph schema defines:
1. **Node Labels**: Types of entities (e.g., Product, Feature)
2. **Relationships**: Connections between entities (e.g., HAS_FEATURE)
3. **Properties**: Attributes of nodes and relationships
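
To make these three pieces concrete, here is a minimal sketch of a schema as plain Python data. The class names (`NodeType`, `RelationshipType`, `GraphSchema`), the `undefined_labels` helper, and the example properties are hypothetical illustrations, not part of `neo4j_graphrag`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeType:
    label: str                         # e.g. "Product"
    properties: tuple[str, ...] = ()   # attribute names on the node


@dataclass(frozen=True)
class RelationshipType:
    name: str                          # e.g. "HAS_FEATURE"
    source: str                        # label of the start node
    target: str                        # label of the end node
    properties: tuple[str, ...] = ()   # attribute names on the relationship


@dataclass
class GraphSchema:
    nodes: list[NodeType] = field(default_factory=list)
    relationships: list[RelationshipType] = field(default_factory=list)

    def undefined_labels(self) -> set[str]:
        """Return labels used by relationships but never declared as nodes."""
        declared = {node.label for node in self.nodes}
        used = {r.source for r in self.relationships} | {r.target for r in self.relationships}
        return used - declared


# Hypothetical example: "Category" is referenced but not declared.
schema = GraphSchema(
    nodes=[
        NodeType("Product", ("name", "price")),
        NodeType("Feature", ("name",)),
    ],
    relationships=[
        RelationshipType("HAS_FEATURE", "Product", "Feature"),
        RelationshipType("BELONGS_TO", "Product", "Category"),
    ],
)
print(schema.undefined_labels())
```

A check like `undefined_labels` catches a common early mistake: a relationship that points at an entity type nobody defined.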

There are two main approaches to building your schema:

1. **Domain Graph** (Top-down):
   - Start with structured data (databases, ontologies)
   - Define schema based on domain knowledge
   - Example: E-commerce product catalog

2. **Lexical Graph** (Bottom-up):
   - Start with unstructured data (documents, text)
   - Discover schema through analysis
   - Example: Processing technical documentation

Let's explore both approaches:

### Domain Graph Example

When you have a clear domain model:

In [None]:
# Define schema based on domain knowledge
domain_schema = {
    'nodes': [
        'Product',      # Products in catalog
        'Category',     # Product categories
        'Feature',      # Product features
        'Specification' # Technical specs
    ],
    'relationships': [
        'BELONGS_TO',   # Product -> Category
        'HAS_FEATURE',  # Product -> Feature
        'MEETS_SPEC'    # Product -> Specification
    ]
}

# Create pipeline with domain schema
domain_pipeline = SimpleKGPipeline(
    driver=driver,
    llm=llm,
    embedder=embedder,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    entities=domain_schema['nodes'],
    relations=domain_schema['relationships'],
    from_pdf=False  # Input is raw text, not a PDF file path
)

# Example structured document
product_doc = """
Product: XPS 13 Plus
Category: Premium Laptops
Features:
- 13.4-inch 4K OLED display
- Zero-lattice keyboard
- Haptic touchpad

Specifications:
- 32GB LPDDR5 RAM
- 12th Gen Intel Core i7
- 1TB NVMe SSD
"""

# Process with domain schema
# run_async is a coroutine; notebooks support top-level await
await domain_pipeline.run_async(text=product_doc)
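
The `chunk_size` and `chunk_overlap` arguments passed to the splitter control how the text is cut before extraction. The sketch below illustrates the general idea of fixed-size chunking with overlap on plain characters. `split_fixed_size` is a hypothetical helper written for illustration; the library's `FixedSizeSplitter` may differ in details such as how it treats word boundaries.

```python
def split_fixed_size(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so an entity that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_fixed_size("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Larger overlaps reduce the chance of splitting a fact across chunks, at the cost of more LLM calls and more duplicate extractions.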

### Lexical Graph Example

When exploring unstructured data:

In [None]:
# Start with no predefined schema
discovery_pipeline = SimpleKGPipeline(
    driver=driver,
    llm=llm,
    embedder=embedder,
    from_pdf=False  # Input is raw text, not a PDF file path
)

# Example unstructured document
tech_doc = """
The new XPS series revolutionizes mobile computing. Users praise its
innovative design and exceptional performance. The system integrates
seamlessly with peripherals through Thunderbolt ports. Early reviews
highlight the responsive keyboard and immersive display quality.

Customer feedback emphasizes productivity gains, particularly in
creative workflows. The laptop's thermal design maintains consistent
performance under heavy workloads.
"""

# Let the pipeline discover entities and relationships
# run_async is a coroutine; notebooks support top-level await
await discovery_pipeline.run_async(text=tech_doc)

# Query discovered schema
with driver.session() as session:
    # Get unique node labels
    labels = session.run("""
    CALL db.labels()
    YIELD label
    RETURN collect(label) as labels
    """).single()["labels"]
    
    # Get unique relationship types
    relationships = session.run("""
    CALL db.relationshipTypes()
    YIELD relationshipType
    RETURN collect(relationshipType) as types
    """).single()["types"]
    
print("Discovered Node Labels:", labels)
print("Discovered Relationships:", relationships)
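
The label list returned by `db.labels()` covers the whole database, so it usually mixes discovered entity types with bookkeeping labels written by the pipeline. The sketch below filters those out. It assumes the bookkeeping labels are `Document`, `Chunk`, and anything starting with a double underscore (such as `__Entity__`); check the labels in your own database before relying on this list. `domain_labels` is a hypothetical helper.

```python
# Assumption: these labels come from the pipeline's document/chunk structure,
# not from entities found in the text.
ASSUMED_INTERNAL_LABELS = {"Document", "Chunk"}


def domain_labels(labels):
    """Keep only labels that look like discovered entity types."""
    return sorted(
        label
        for label in labels
        if label not in ASSUMED_INTERNAL_LABELS and not label.startswith("__")
    )


# Hypothetical output of CALL db.labels()
example_labels = ["Chunk", "Document", "__Entity__", "__KGBuilder__", "Product", "Feature"]
print(domain_labels(example_labels))  # ['Feature', 'Product']
```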

## 2. Schema Discovery with Knowledge Graph Builder

The Neo4j LLM Knowledge Graph Builder helps you explore and refine schemas through a four-step workflow:

1. Upload documents
2. Generate initial graph
3. Preview and analyze results
4. Refine schema iteratively

This visual tool helps identify:
- Common entity types
- Natural relationship patterns
- Missing connections
- Data quality issues
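
The same kinds of checks can be approximated in code once you have exported nodes and relationships. This is a minimal sketch over in-memory data. `summarize_graph` and the sample records are hypothetical, and the input shape (a dict of node id to label, plus `(source_id, type, target_id)` tuples) is an assumption made for illustration.

```python
from collections import Counter


def summarize_graph(nodes, relationships):
    """Summarize entity types, relationship patterns, and unconnected nodes.

    nodes: dict mapping node id -> label
    relationships: list of (source_id, relationship_type, target_id)
    """
    # Common entity types
    label_counts = Counter(nodes.values())
    # Relationship patterns as (source label, type, target label)
    pattern_counts = Counter(
        (nodes[src], rel_type, nodes[dst]) for src, rel_type, dst in relationships
    )
    # Nodes with no relationships hint at missing connections
    connected = {src for src, _, _ in relationships} | {dst for _, _, dst in relationships}
    isolated = sorted(set(nodes) - connected)
    return label_counts, pattern_counts, isolated


# Hypothetical sample data
sample_nodes = {"p1": "Product", "f1": "Feature", "f2": "Feature", "r1": "Review"}
sample_rels = [("p1", "HAS_FEATURE", "f1"), ("p1", "HAS_FEATURE", "f2")]

label_counts, pattern_counts, isolated = summarize_graph(sample_nodes, sample_rels)
print(label_counts.most_common())
print(pattern_counts.most_common())
print("Isolated nodes:", isolated)  # Isolated nodes: ['r1']
```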

## 3. Document Processing Pipeline

Now let's build a complete pipeline that combines both approaches:

In [None]:
class DocumentProcessor:
    def __init__(self, driver, llm, embedder):
        self.driver = driver
        self.llm = llm
        self.embedder = embedder

    async def discover_schema(self, text):
        """First pass: Discover potential schema"""
        discovery_pipeline = SimpleKGPipeline(
            driver=self.driver,
            llm=self.llm,
            embedder=self.embedder,
            from_pdf=False  # Input is raw text, not a PDF file path
        )
        await discovery_pipeline.run_async(text=text)

        # Analyze discovered schema
        # Note: these procedures list every label and relationship type in the
        # database, including those created by earlier runs
        with self.driver.session() as session:
            schema = {
                'nodes': session.run("CALL db.labels()").value('label'),
                'relationships': session.run("CALL db.relationshipTypes()").value('relationshipType')
            }
        return schema

    async def process_with_schema(self, text, schema):
        """Second pass: Process with refined schema"""
        refined_pipeline = SimpleKGPipeline(
            driver=self.driver,
            llm=self.llm,
            embedder=self.embedder,
            entities=schema['nodes'],
            relations=schema['relationships'],
            from_pdf=False
        )
        await refined_pipeline.run_async(text=text)

# Example usage
processor = DocumentProcessor(driver, llm, embedder)

# First: Discover schema
discovered_schema = await processor.discover_schema(tech_doc)
print("Initial Schema:", discovered_schema)

# Second: Refine schema (you would typically review and modify)
refined_schema = {
    'nodes': ['Product', 'Feature', 'Review', 'Component'],
    'relationships': ['HAS_FEATURE', 'MENTIONED_IN', 'CONTAINS']
}

# Third: Process with refined schema
await processor.process_with_schema(tech_doc, refined_schema)

## 4. Iterative Schema Refinement

Best practices for refining your schema:

1. **Start Simple**
   - Begin with core entities and relationships
   - Add complexity gradually

2. **Analyze Results**
   - Review extracted entities
   - Check relationship patterns
   - Look for missing connections

3. **Refine Iteratively**
   - Add/remove node labels
   - Adjust relationship types
   - Update property names

4. **Validate**
   - Test with sample queries
   - Check data quality
   - Verify connections
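
When refining iteratively, it helps to record what changed between two schema versions. The sketch below compares two schemas in the `{'nodes': [...], 'relationships': [...]}` shape used in this notebook. `diff_schema` is a hypothetical helper, and the `discovered` values are invented for illustration rather than real pipeline output.

```python
def diff_schema(old, new):
    """Report node labels and relationship types added or removed between versions."""
    changes = {}
    for key in ("nodes", "relationships"):
        before = set(old.get(key, []))
        after = set(new.get(key, []))
        changes[key] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return changes


# Hypothetical first-pass result
discovered = {
    "nodes": ["Product", "Feature", "User"],
    "relationships": ["HAS_FEATURE", "PRAISES"],
}
# Manually refined schema
refined = {
    "nodes": ["Product", "Feature", "Review", "Component"],
    "relationships": ["HAS_FEATURE", "MENTIONED_IN", "CONTAINS"],
}

changes = diff_schema(discovered, refined)
print(changes["nodes"])
print(changes["relationships"])
```

Keeping such a diff alongside each iteration makes it easier to trace a drop in extraction quality back to a specific schema change.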

Remember:
- Schema development is iterative
- Start with discovery, then refine
- Balance flexibility and structure
- Consider your query patterns