# OCR Image Processing and Vector Search

## Overview
This notebook processes OCR (Optical Character Recognition) text extracted from images and indexes it in Neo4j for semantic search. The OCR text comes from a structured JSON file that holds descriptions of images found in legal documents.

## Workflow
1. **Data Loading**: Loads OCR text data from a nested JSON structure
2. **Text Processing**: Extracts sections from structured OCR text
3. **Chunking**: Splits sections into manageable text chunks
4. **Embedding Generation**: Creates vector embeddings using all-MiniLM-L6-v2
5. **Neo4j Storage**: Stores chunks as ImageChunk nodes with proper metadata
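6. **Relationship Creation**: Establishes relationships between chunks and document types
7. **Label Consolidation**: Adds the shared `UpdatedChunksAndImagesv4` label to chunk nodes so the vector index covers them
8. **Vector Search**: Enables semantic search across all image descriptions
9. **Reranking**: Improves search relevance using a cross-encoder reranking model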
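
The chunking step can be illustrated with a simplified sliding-window splitter. This is only a sketch under stated assumptions: it counts whitespace-separated words, whereas the notebook's `SentenceTransformersTokenTextSplitter` counts model tokens, and the function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, tokens_per_chunk: int = 256, chunk_overlap: int = 20) -> List[str]:
    """Split text into windows of whitespace tokens that overlap by chunk_overlap tokens."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        if start + tokens_per_chunk >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Toy parameters so the one-token overlap is easy to see
example_chunks = split_with_overlap("a b c d e f g h i j", tokens_per_chunk=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the last token of the previous one, so text that straddles a chunk boundary still appears whole in at least one chunk.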

## Data Structure
The OCR data follows a hierarchical structure:
- Folder (e.g., Acts, Regulations)
  - Subfolder (e.g., Election_Act)
    - File (e.g., 96106_greatseal.gif)
      - OCR Text with sections like "Image Type and Category", "Detailed Description", etc.
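
As a rough illustration of this hierarchy, the sketch below builds a tiny dictionary with the same nesting and walks it. The sample OCR text is a placeholder and the helper name `iter_ocr_entries` is hypothetical. Only the folder, subfolder, and file names come from the examples above.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical sample mirroring the folder -> subfolder -> file -> OCR text layout
sample_data: Dict[str, Dict[str, Dict[str, str]]] = {
    "Acts": {
        "Election_Act": {
            "96106_greatseal.gif": "1. Image Type and Category:\n(placeholder)\n2. Detailed Description:\n(placeholder)",
        }
    }
}


def iter_ocr_entries(data: Dict[str, Dict[str, Dict[str, str]]]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (folder, subfolder, filename, ocr_text) for every file in the nested dict."""
    for folder, subfolders in data.items():
        for subfolder, files in subfolders.items():
            for filename, ocr_text in files.items():
                yield folder, subfolder, filename, ocr_text


entries = list(iter_ocr_entries(sample_data))
print(entries[0][:3])
```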

## Neo4j Schema
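- **Node Labels**:
  - `ImageChunk`: Contains text chunks from OCR data
  - `UpdatedChunksAndImagesv4`: Shared label added to `UpdatedChunk` and `ImageChunk` nodes (Section 8) for unified vector search
  - Document types: `Act`, `Regulation`, `Appendix`, `Other`, `Schedule`, `PointInTime`, `Part`, `Rule`
- **Relationships**:
  - `(ImageChunk)-[:PART_OF]->(DocumentType)`
- **Vector Index**: `UpdatedChunksAndImagesv4`, on the `textEmbedding` property of nodes with that label (384 dimensions, cosine similarity)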
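
Section 3 configures the vector index with the cosine similarity function. The sketch below shows what cosine similarity measures, using toy 3-dimensional vectors instead of 384-dimensional embeddings. The function name `cosine_similarity` is an illustrative assumption, and the `score` that Neo4j returns is derived from this quantity but may be scaled differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Because only the angle matters, two chunks of very different length can still score as near neighbors when their embeddings point in a similar direction.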

## Section 0: Installs

In [None]:
%pip install langchain-huggingface
%pip install langchain-neo4j
%pip install langchain
%pip install langchain-text-splitters
%pip install neo4j
%pip install sentence-transformers
%pip install python-dotenv
%pip install numpy
%pip install ipywidgets

## Section 1: Imports

In [24]:
# Import necessary libraries
import json
import os
from typing import Dict, List, Any
import time

# LangChain imports
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_neo4j import Neo4jGraph
from dotenv import load_dotenv

# Reranking import
from sentence_transformers import CrossEncoder

## Section 2: Configure Environment and Connections

In [25]:
# Load environment variables
load_dotenv()

# Neo4j connection settings
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

# Initialize Neo4j connection
graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE
)

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize text splitter
text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=20, 
    tokens_per_chunk=256
)

## Section 3: Create Index in Neo4j

In [39]:
# Create vector index in Neo4j database for similarity search
def setup_neo4j_indexes():
    # Create constraint for unique ImageChunk IDs
    graph.query("""
    CREATE CONSTRAINT IF NOT EXISTS FOR (c:ImageChunk) REQUIRE c.id IS UNIQUE
    """)
    
    # Create vector index for embeddings on the shared UpdatedChunksAndImagesv4 label
    graph.query("""
    CREATE VECTOR INDEX UpdatedChunksAndImagesv4 IF NOT EXISTS
    FOR (m:UpdatedChunksAndImagesv4) 
    ON m.textEmbedding 
    OPTIONS { 
        indexConfig: { 
            `vector.dimensions`: 384, 
            `vector.similarity_function`: 'cosine'
        }
    }
    """)
    
    print("Neo4j indexes created successfully.")

## Section 4: Helper Functions

In [27]:
# Extract folder and file components from a path structure
def extract_path_components(path: str) -> Dict[str, str]:
    components = path.split('/')
    if len(components) >= 2:
        folder = components[0]
        subfolder = components[1]  # always present in this branch (len >= 2)
        filename = components[-1]
    else:
        folder = None
        subfolder = None
        filename = components[0]
    
    return {
        "folder": folder, 
        "subfolder": subfolder,
        "filename": filename
    }

# Extract sections from OCR text based on numbered headers
def extract_sections(text: str) -> Dict[str, str]:
    sections = {}
    lines = text.split('\n')
    
    current_section = None
    current_content = []
    
    for line in lines:
        line = line.strip()
        # Check for section headers like "1. Image Type and Category:"
        if any(line.startswith(f"{i}. ") for i in range(1, 7)):
            # Save previous section if exists
            if current_section:
                sections[current_section] = '\n'.join(current_content).strip()
            
            # Start new section
            current_section = line
            current_content = []
        else:
            # Add line to current section content
            if current_section:
                current_content.append(line)
    
    # Add the last section
    if current_section and current_content:
        sections[current_section] = '\n'.join(current_content).strip()
    
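    # If a line contains "Detailed Description:", also store everything after it
    # under its own key (this can overlap with the numbered sections above)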
    detailed_idx = next((i for i, line in enumerate(lines) if "Detailed Description:" in line), -1)
    if detailed_idx >= 0:
        detailed_text = '\n'.join(lines[detailed_idx+1:]).strip()
        sections["Detailed Description:"] = detailed_text
    
    return sections
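
To make the expected input format concrete, here is a condensed, regex-based variant of the numbered-header detection above, run on a made-up OCR snippet. It is a sketch, not a replacement for `extract_sections`: it omits the extra "Detailed Description:" handling, and both the helper name `split_numbered_sections` and the sample text are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Matches headers such as "1. Image Type and Category:" (numbers 1 to 6)
HEADER_PATTERN = re.compile(r"^[1-6]\. ")


def split_numbered_sections(text: str) -> Dict[str, str]:
    """Group each line under the most recent numbered header."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if HEADER_PATTERN.match(line):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {header: "\n".join(body).strip() for header, body in sections.items()}


# Hypothetical OCR text; real entries are longer
sample_text = """1. Image Type and Category:
Official seal
2. Detailed Description:
A circular emblem.
It contains a shield."""

parsed = split_numbered_sections(sample_text)
print(list(parsed.keys()))
```

The full header line becomes the dictionary key, which is why the section names stored in Neo4j keep their number and trailing colon.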

## Section 5: Document Processing Functions

In [28]:
# Load and parse JSON file containing OCR data
def process_json_file(file_path: str) -> Dict:
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

# Create metadata for a chunk with updated structure
def create_metadata(
    folder: str, 
    subfolder: str, 
    filename: str, 
    section_name: str
) -> Dict[str, Any]:
    # Base metadata
    metadata = {
        "folder": folder,
        "subfolder": subfolder,
        "file_name": filename,
        "type": "image",
        "section": section_name,
        "url": f"https://www.bclaws.gov.bc.ca/civix/document/id/complete/statreg/{filename}"
    }
    
    # Add specific ID based on folder type
    if folder == "Acts":
        metadata["ActId"] = subfolder
    elif folder == "Regulations":
        metadata["RegId"] = subfolder
    elif folder == "Appendix":
        metadata["AppendixId"] = subfolder
    elif folder == "Others":
        metadata["OthersId"] = subfolder
    elif folder == "Schedules":
        metadata["SchedulesId"] = subfolder
    elif folder == "Point_in_Times":
        metadata["PointInTimeId"] = subfolder
    elif folder == "Parts":
        metadata["PartsId"] = subfolder
    elif folder == "Rules":
        metadata["RulesId"] = subfolder
        
    return metadata

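# Create embedding for a chunk with given text and metadata
def create_chunk_embedding(text: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    import hashlib  # local import keeps this helper self-contained

    embedding = embedding_model.embed_query(text)
    
    # Create a unique ID from the metadata plus a short hash of the chunk text.
    # Without the hash, every chunk of the same section would share one ID and
    # MERGE would overwrite the earlier chunks.
    chunk_id = f"{metadata['folder']}_{metadata['subfolder']}_{metadata['file_name']}_{metadata['section']}"
    chunk_id = chunk_id.replace(" ", "_").replace(":", "")
    text_hash = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    chunk_id = f"{chunk_id}_{text_hash}"
    
    return {
        "id": chunk_id,
        "text": text,
        "metadata": metadata,
        "embedding": embedding
    }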

# Save chunk data to Neo4j with updated metadata and relationships
def save_chunk_to_neo4j(chunk_data: Dict[str, Any]):
    # Base query for creating the ImageChunk node
    query = """
    MERGE (c:ImageChunk {id: $id})
    SET c.text = $text,
        c.textEmbedding = $textEmbedding,
        c.url = $metadata.url,
        c.folder = $metadata.folder,
        c.subfolder = $metadata.subfolder,
        c.file_name = $metadata.file_name,
        c.type = $metadata.type,
        c.section = $metadata.section
    """
    
    # Map each metadata ID key to the label of its document node. Labels come
    # from this fixed table (Cypher cannot parameterize labels), so formatting
    # them into the query is safe.
    id_key_to_label = {
        "ActId": "Act",
        "RegId": "Regulation",
        "AppendixId": "Appendix",
        "OthersId": "Other",
        "SchedulesId": "Schedule",
        "PointInTimeId": "PointInTime",
        "PartsId": "Part",
        "RulesId": "Rule",
    }
    
    # Add the relationship for whichever ID key is present, and store the ID on
    # the chunk so the search queries in Section 9 can read it (e.g. node.ActId)
    for id_key, label in id_key_to_label.items():
        if id_key in chunk_data["metadata"]:
            query += f"""
        SET c.{id_key} = $metadata.{id_key}
        WITH c
        MERGE (d:{label} {{name: $metadata.{id_key}}})
        MERGE (c)-[:PART_OF]->(d)
        """
            break
    
    query += "\nRETURN c"
    
    result = graph.query(
        query=query,
        params={
            "id": chunk_data["id"],
            "text": chunk_data["text"],
            "textEmbedding": chunk_data["embedding"],
            "metadata": chunk_data["metadata"]
        }
    )
    
    return result

## Section 6: Main Processing Logic

In [29]:
# Process all OCR data from the loaded JSON
def process_ocr_data(data: Dict):
    total_chunks = 0
    
    # Process the nested structure
    for folder, subfolders in data.items():
        print(f"Processing folder: {folder}")
        
        for subfolder, files in subfolders.items():
            print(f"  Processing subfolder: {subfolder}")
            
            for filename, ocr_text in files.items():
                print(f"    Processing file: {filename}")
                
                # Extract sections from the OCR text
                sections = extract_sections(ocr_text)
                
                # Process each section
                for section_name, section_content in sections.items():
                    # Create chunks from the section text
                    chunks = text_splitter.split_text(section_content)
                    
                    for chunk in chunks:
                        # Create metadata
                        metadata = create_metadata(
                            folder=folder,
                            subfolder=subfolder,
                            filename=filename,
                            section_name=section_name
                        )
                        
                        # Create embedding
                        chunk_data = create_chunk_embedding(
                            text=chunk,
                            metadata=metadata
                        )
                        
                        # Save to Neo4j
                        save_chunk_to_neo4j(chunk_data)
                        total_chunks += 1
    
    print(f"Completed processing. Total chunks created: {total_chunks}")

## Section 7: Main Execution

In [None]:
# Main execution function
def main():
    # Setup Neo4j indexes
    setup_neo4j_indexes()
    
    # Specify path to your JSON file
    json_file_path = "./final_image_sonnet.json"
    
    # Process the OCR data
    print(f"Loading data from {json_file_path}...")
    data = process_json_file(json_file_path)
    
    # Process the data
    start_time = time.time()
    process_ocr_data(data)
    end_time = time.time()
    
    print(f"Processing completed in {end_time - start_time:.2f} seconds.")

# Run the main function when executing the notebook
if __name__ == "__main__":
    main()

## Section 8: Neo4j Node Label Consolidation

In [None]:
# Execute Cypher query to consolidate node labels for unified vector search
def execute_label_consolidation():
    """
    Consolidate existing nodes under a unified label (UpdatedChunksAndImagesv4) 
    to ensure all nodes are accessible through the same vector index.
    """
    # Consolidate UpdatedChunk nodes
    updated_chunk_query = """
    MATCH (m:UpdatedChunk) 
    WHERE NOT m:UpdatedChunksAndImagesv4
    SET m:UpdatedChunksAndImagesv4
    RETURN count(m) as updated_chunks
    """
    
    # Consolidate ImageChunk nodes
    image_chunk_query = """
    MATCH (m:ImageChunk) 
    WHERE NOT m:UpdatedChunksAndImagesv4
    SET m:UpdatedChunksAndImagesv4
    RETURN count(m) as updated_image_chunks
    """
    
    # Execute both queries
    updated_chunks_result = graph.query(updated_chunk_query)
    image_chunks_result = graph.query(image_chunk_query)
    
    # Extract counts
    updated_chunks_count = updated_chunks_result[0]['updated_chunks'] if updated_chunks_result else 0
    image_chunks_count = image_chunks_result[0]['updated_image_chunks'] if image_chunks_result else 0
    
    print(f"Consolidated {updated_chunks_count} UpdatedChunk nodes")
    print(f"Consolidated {image_chunks_count} ImageChunk nodes")
    print(f"Total nodes consolidated: {updated_chunks_count + image_chunks_count}")
    
    # Verify the consolidation
    verification_query = """
    MATCH (n:UpdatedChunksAndImagesv4)
    RETURN count(n) as consolidated_nodes
    """
    
    verification_result = graph.query(verification_query)
    total_consolidated = verification_result[0]['consolidated_nodes'] if verification_result else 0
    
    print(f"Total nodes with UpdatedChunksAndImagesv4 label: {total_consolidated}")
    
    return {
        "updated_chunks": updated_chunks_count,
        "image_chunks": image_chunks_count,
        "total_consolidated": total_consolidated
    }

# Execute the consolidation
consolidation_results = execute_label_consolidation()

## Section 9: Neo4j Test Query Functions

In [31]:
# Search for similar chunks based on the query text with updated metadata fields
def search_similar_chunks(query_text: str, top_k: int = 5):
    # Generate embedding for the query
    query_embedding = embedding_model.embed_query(query_text)
    
    # Search in Neo4j using the common index name
    search_query = """
    CALL db.index.vector.queryNodes('UpdatedChunksAndImagesv4', $top_k, $textEmbedding)
    YIELD node, score
    RETURN 
        node.id as id,
        node.url as url,
        node.text as text,
        node.section as section,
        node.file_name as file_name,
        node.type as type,
        node.folder as folder,
        node.subfolder as subfolder,
        score,
        CASE 
            WHEN node.ActId IS NOT NULL THEN {type: 'Act', id: node.ActId}
            WHEN node.RegId IS NOT NULL THEN {type: 'Regulation', id: node.RegId}
            WHEN node.AppendixId IS NOT NULL THEN {type: 'Appendix', id: node.AppendixId}
            WHEN node.OthersId IS NOT NULL THEN {type: 'Other', id: node.OthersId}
            WHEN node.SchedulesId IS NOT NULL THEN {type: 'Schedule', id: node.SchedulesId}
            WHEN node.PointInTimeId IS NOT NULL THEN {type: 'PointInTime', id: node.PointInTimeId}
            WHEN node.PartsId IS NOT NULL THEN {type: 'Part', id: node.PartsId}
            WHEN node.RulesId IS NOT NULL THEN {type: 'Rule', id: node.RulesId}
            ELSE null
        END as related_document
    ORDER BY score DESC
    """
    
    results = graph.query(
        query=search_query,
        params={"textEmbedding": query_embedding, "top_k": top_k}
    )
    
    return results

# Two-stage search with vector similarity followed by reranking.
def reranked_search(query_text: str, top_k: int = 5, candidates: int = 20):
    # Load the reranking model (reloaded on every call; cache it at module
    # level if this function is called often)
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    # Generate embedding for the query
    query_embedding = embedding_model.embed_query(query_text)
    
    # First stage: Retrieve candidates with vector search - use the standard index name
    initial_query = """
    CALL db.index.vector.queryNodes('UpdatedChunksAndImagesv4', $candidates, $textEmbedding)
    YIELD node, score
    RETURN 
        node.id as id,
        node.text as text,
        node.section as section,
        node.file_name as file_name,
        node.type as type,
        node.folder as folder,
        node.subfolder as subfolder,
        node.url as url,
        score as vector_score,
        CASE 
            WHEN node.ActId IS NOT NULL THEN {type: 'Act', id: node.ActId}
            WHEN node.RegId IS NOT NULL THEN {type: 'Regulation', id: node.RegId}
            WHEN node.AppendixId IS NOT NULL THEN {type: 'Appendix', id: node.AppendixId}
            WHEN node.OthersId IS NOT NULL THEN {type: 'Other', id: node.OthersId}
            WHEN node.SchedulesId IS NOT NULL THEN {type: 'Schedule', id: node.SchedulesId}
            WHEN node.PointInTimeId IS NOT NULL THEN {type: 'PointInTime', id: node.PointInTimeId}
            WHEN node.PartsId IS NOT NULL THEN {type: 'Part', id: node.PartsId}
            WHEN node.RulesId IS NOT NULL THEN {type: 'Rule', id: node.RulesId}
            ELSE null
        END as related_document
    """
    
    candidate_results = graph.query(
        query=initial_query,
        params={"textEmbedding": query_embedding, "candidates": candidates}
    )
    
    # Second stage: Rerank candidates
    if candidate_results:
        # Create pairs of (query, text) for reranking
        pairs = [(query_text, result["text"]) for result in candidate_results]
        
        # Score candidate pairs with reranker model
        reranker_scores = reranker.predict(pairs)
        
        # Add reranker scores to results
        for i, result in enumerate(candidate_results):
            result["reranker_score"] = float(reranker_scores[i])
        
        # Sort by reranker score (descending)
        reranked_results = sorted(candidate_results, key=lambda x: x["reranker_score"], reverse=True)
        
        # Return top_k results
        return reranked_results[:top_k]
    
    return []
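
The retrieve-then-rerank pattern used by `reranked_search` can be sketched without Neo4j or model downloads. The version below is a minimal illustration under stated assumptions: the two scoring functions are toy stand-ins for the embedding similarity and the cross-encoder, and `two_stage_search`, `shared_words`, and `shared_words_per_length` are hypothetical names.

```python
from typing import Callable, Dict, List


def two_stage_search(
    query: str,
    documents: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 5,
    candidates: int = 20,
) -> List[Dict[str, object]]:
    """Shortlist `candidates` documents with a cheap scorer, then rerank and keep `top_k`."""
    # Stage 1: cheap scoring over the whole collection
    shortlist = sorted(documents, key=lambda doc: first_stage_score(query, doc), reverse=True)[:candidates]
    # Stage 2: more expensive scoring over the shortlist only
    rescored = [{"text": doc, "reranker_score": rerank_score(query, doc)} for doc in shortlist]
    rescored.sort(key=lambda item: item["reranker_score"], reverse=True)
    return rescored[:top_k]


# Toy stand-in for vector similarity: number of shared words
def shared_words(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


# Toy stand-in for the cross-encoder: shared words relative to document length
def shared_words_per_length(query: str, doc: str) -> float:
    return shared_words(query, doc) / len(doc.split())


docs = [
    "coat of arms with a shield and crown",
    "a table of election fees",
    "coat of arms",
]
top = two_stage_search("coat of arms", docs, shared_words, shared_words_per_length, top_k=1, candidates=2)
print(top[0]["text"])
```

Keeping `candidates` larger than `top_k` gives the second stage room to promote results that the first stage ranked lower.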

## Section 10: Neo4j Test Queries

In [None]:
def test_rerankers():
    """Test different search methods without processing data."""
    query = "What does the official coat of arms of BC look like?"
    
    print("\nStandard Vector Search:")
    vector_results = search_similar_chunks(query, top_k=10)
    for result in vector_results:
        print(f"Score: {result['score']:.4f}")
        print(f"Document: {result['folder']}/{result['subfolder']}/{result['file_name']}")
        print(f"Section: {result['section']}")
        print(f"Text: {result['text'][:100]}...\n")
    
    print("\nReranked Search:")
    rerank_results = reranked_search(query, top_k=3)
    for result in rerank_results:
        print(f"Score: {result['reranker_score']:.4f}")
        print(f"Document: {result['folder']}/{result['subfolder']}/{result['file_name']}")
        print(f"Section: {result['section']}")
        print(f"Text: {result['text'][:100]}...\n")
        
# Run just the search tests without processing data
test_rerankers()