# OCR Image Processing and Vector Search

## Overview
This notebook processes OCR (Optical Character Recognition) text extracted from images and indexes it in Neo4j for semantic search. The OCR text comes from a structured JSON file containing descriptions of images from legal documents. The implementation chunks each text sequentially and links adjacent chunks with NEXT relationships to preserve context.

## Workflow
1. **Data Loading**: Loads OCR text data from a nested JSON structure
2. **Text Preprocessing**: Replaces line breaks with spaces for better readability
3. **Token-Based Chunking**: Splits the full text into chunks of at most 256 tokens with a 20-token overlap
4. **Metadata Creation**: Adds structured metadata including ImageUrl and document references
5. **Embedding Generation**: Creates vector embeddings using all-MiniLM-L6-v2
6. **Sequential Storage**: Stores chunks as ImageChunk nodes with NEXT relationships between them
7. **Document Relationships**: Establishes PART_OF relationships between chunks and document types
8. **Vector Search**: Enables semantic search across all image descriptions
9. **Context Retrieval**: Returns neighboring chunks for better context awareness
10. **Reranking**: Improves search relevance using a cross-encoder reranking model
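
The chunking step (256 tokens per chunk, 20-token overlap) can be pictured as a sliding window. The sketch below is a simplified illustration that works on an already-tokenized list; it is not the library's implementation, and `chunk_tokens` is a hypothetical helper name. The notebook itself relies on `SentenceTransformersTokenTextSplitter`, which uses the model's own tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], tokens_per_chunk: int = 256, overlap: int = 20) -> List[List[str]]:
    """Split a token sequence into overlapping windows (illustrative helper)."""
    if not 0 <= overlap < tokens_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than tokens_per_chunk")

    step = tokens_per_chunk - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Hypothetical input: 600 placeholder tokens.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk after the first starts with the final 20 tokens of the one before it, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.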

## Data Structure
The OCR data follows a hierarchical structure:
- Folder (e.g., Acts, Regulations)
  - Subfolder (e.g., Election Act) - underscores are replaced with spaces when the name is stored as `ActId`/`RegId`
    - File (e.g., 96106_greatseal.gif)
      - OCR Text (processed as a single continuous text)
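
A minimal sketch of walking this hierarchy follows. The sample dictionary is hypothetical (the real file's keys and text are not shown in this notebook), and `iter_ocr_entries` is an illustrative helper rather than a function used elsewhere in the notebook.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical sample mirroring the folder -> subfolder -> file -> text layout.
sample_data: Dict[str, Dict[str, Dict[str, str]]] = {
    "Acts": {
        "Election_Act": {
            "96106_greatseal.gif": "Line one\nline two",
        }
    }
}


def iter_ocr_entries(data: Dict[str, Dict[str, Dict[str, str]]]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (folder, subfolder, filename, text) for every file in the nested mapping."""
    for folder, subfolders in data.items():
        for subfolder, files in subfolders.items():
            for filename, ocr_text in files.items():
                # Flatten line breaks so the text is handled as one continuous string.
                yield folder, subfolder, filename, ocr_text.replace("\n", " ")


entries = list(iter_ocr_entries(sample_data))
```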

## Neo4j Schema
- **Node Labels**: 
  - `ImageChunk`: Contains text chunks from OCR data
  - `UpdatedChunksAndImagesv4`: Combined label for unified vector search
  - Document types: `Act`, `Regulation`, etc.
- **Node Properties**:
  - `text`: The chunk text content
  - `textEmbedding`: Vector representation of the text
  - `ImageUrl`: URL to the source image
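  - `id`, `chunkId`: Unique chunk identifier built from the folder, subfolder, file name, and sequence number
  - `chunkSeqId`: Sequential ID to maintain chunk order
  - `type`: Set to `image` for these nodes
  - `folder`, `subfolder`, `file_name`: Source location metadata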
  - `ActId`/`RegId`: Document identifiers with spaces (not underscores)
- **Relationships**: 
  - `(ImageChunk)-[:PART_OF]->(DocumentType)`: Links chunks to their document
  - `(ImageChunk)-[:NEXT]->(ImageChunk)`: Sequential link between adjacent chunks
- **Vector Index**: Named `UpdatedChunksAndImagesv4`, defined on the `textEmbedding` property of nodes with that label (384 dimensions, cosine similarity)
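
To make the `NEXT` chain concrete, the sketch below builds chunk IDs in the format the notebook uses (`{folder}_{subfolder}_{filename}-chunk-{seq:04d}`) and derives the pairs that would be linked. The helper names `make_chunk_ids` and `next_pairs` are hypothetical and exist only for this illustration; no database is involved.

```python
from typing import List, Tuple


def make_chunk_ids(folder: str, subfolder: str, filename: str, n_chunks: int) -> List[str]:
    """Build sequential chunk IDs for one file (illustrative helper)."""
    return [f"{folder}_{subfolder}_{filename}-chunk-{seq:04d}" for seq in range(n_chunks)]


def next_pairs(chunk_ids: List[str]) -> List[Tuple[str, str]]:
    """Return (previous, current) pairs, one per NEXT relationship."""
    return list(zip(chunk_ids, chunk_ids[1:]))


ids = make_chunk_ids("Acts", "Election_Act", "96106_greatseal.gif", 3)
links = next_pairs(ids)
```

A file split into `n` chunks yields `n - 1` links, and a single-chunk file yields none.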

## Key Features
- **Continuous Text Processing**: Processes OCR text as a whole without section splitting
- **Sequential Chunking**: Maintains relationships between adjacent chunks
- **Context-Aware Search**: Retrieves neighboring chunks for better context understanding
- **Standardized IDs**: Document IDs use spaces instead of underscores (e.g., "Health Act" not "Health_Act")
- **Unified Vector Search**: All chunks use the same vector index for cross-document search
- **Two-Stage Search**: Combines vector similarity with reranking for improved relevance
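
The two-stage idea can be sketched without any models: a cheap first-stage score shortlists candidates, and a second scorer reorders only that shortlist. Both scoring functions below are toy stand-ins chosen so the example runs anywhere; they are assumptions for illustration, not the embedding model or cross-encoder the notebook uses.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(query: str, documents: List[str], first_stage: Scorer,
                     second_stage: Scorer, candidates: int = 20, top_k: int = 5) -> List[Tuple[str, float]]:
    """Shortlist with a cheap scorer, then rerank the shortlist with a second scorer."""
    # Stage 1: score every document and keep the best `candidates`.
    shortlist = sorted(documents, key=lambda d: first_stage(query, d), reverse=True)[:candidates]
    # Stage 2: rescore only the shortlist and keep the best `top_k`.
    rescored = [(d, second_stage(query, d)) for d in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage score: number of distinct shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy second-stage score: fraction of document words that appear in the query."""
    words = doc.lower().split()
    query_words = set(query.lower().split())
    return sum(w in query_words for w in words) / len(words) if words else 0.0


docs = [
    "the great seal of the province",
    "coat of arms of the province",
    "schedule fees",
]
results = two_stage_search("coat of arms", docs, word_overlap, overlap_ratio, candidates=2, top_k=1)
```

Keeping `candidates` larger than `top_k` gives the second stage room to promote results that the first stage ranked lower.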

## Search Capabilities
- Standard vector search using cosine similarity
- Reranked search for improved relevance scoring
- Optional retrieval of neighboring chunks for context
- Combined search across different document types
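
For reference, cosine similarity between two vectors can be computed as below. This is a plain-Python sketch for intuition only; the notebook delegates the computation to the Neo4j vector index, and the `score` a vector index query returns is a normalized value that may not equal the raw cosine computed here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```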

## Section 0: Installs

In [None]:
%pip install langchain-huggingface
%pip install langchain-neo4j
%pip install langchain
%pip install langchain-text-splitters
%pip install neo4j
%pip install sentence-transformers
%pip install python-dotenv
%pip install numpy
%pip install ipywidgets

## Section 1: Imports

In [2]:
# Import necessary libraries
import json
import os
from typing import Dict, List, Any
import time

# LangChain imports
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_neo4j import Neo4jGraph
from dotenv import load_dotenv

# Reranking import
from sentence_transformers import CrossEncoder

## Section 2: Configure Environment and Connections

In [3]:
# Load environment variables
load_dotenv()

# Neo4j connection settings
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

# Initialize Neo4j connection
graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE
)

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize text splitter
text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=20, 
    tokens_per_chunk=256
)
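
Because the username and password default to empty strings, a missing `.env` entry only surfaces later as a connection error. One possible guard, sketched here under the assumption that all three settings are required, is to check for blank values before connecting. `missing_settings` is a hypothetical helper, and the dictionary passed to it stands in for the real environment.

```python
import os
from typing import Iterable, List, Mapping


def missing_settings(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


required = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

# Hypothetical environment used only to demonstrate the check.
example_env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USERNAME": ""}
missing = missing_settings(required, env=example_env)
```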

## Section 3: Remove all existing image nodes from the Neo4j database

In [None]:
# Remove all existing image nodes from the Neo4j database.
def clean_neo4j_image_data():
    # First, count how many nodes exist before deletion
    count_query = """
    MATCH (c) 
    WHERE c:ImageChunk OR 
        (c:UpdatedChunksAndImagesv4 AND c.type = 'image')
    RETURN count(c) as total_nodes
    """
    
    count_result = graph.query(count_query)
    initial_count = count_result[0]["total_nodes"] if count_result else 0
    
    print(f"Found {initial_count} image nodes to remove...")
    
    # Delete all image-related nodes with any label
    cleanup_query = """
    MATCH (c) 
    WHERE c:ImageChunk OR 
        (c:UpdatedChunksAndImagesv4 AND c.type = 'image')
    DETACH DELETE c
    RETURN count(c) as removed_nodes
    """
    
    result = graph.query(cleanup_query)
    removed_count = result[0]["removed_nodes"] if result else 0
    
    # Verify all nodes were deleted
    verify_query = """
    MATCH (c) 
    WHERE c:ImageChunk OR 
        (c:UpdatedChunksAndImagesv4 AND c.type = 'image')
    RETURN count(c) as remaining_nodes
    """
    
    verify_result = graph.query(verify_query)
    remaining = verify_result[0]["remaining_nodes"] if verify_result else 0
    
    print(f"Cleaned up Neo4j database: {removed_count} image nodes removed.")
    print(f"Remaining image nodes: {remaining}")
    
    if remaining > 0:
        print("WARNING: Not all image nodes were removed. You may need to run cleanup again.")
    
    return removed_count

# Comment out the following lines to skip the cleanup

print("Beginning cleanup of existing image nodes...")
clean_neo4j_image_data()
print("Cleanup completed.")

## Section 4: Create Index in Neo4j

In [5]:
# Create vector index in Neo4j database for similarity search
def setup_neo4j_indexes():
    # Create constraint for unique ImageChunk IDs
    graph.query("""
    CREATE CONSTRAINT IF NOT EXISTS FOR (c:ImageChunk) REQUIRE c.id IS UNIQUE
    """)
    
    # Create the vector index for embeddings; it shares its name with the combined label
    graph.query("""
    CREATE VECTOR INDEX UpdatedChunksAndImagesv4 IF NOT EXISTS
    FOR (m:UpdatedChunksAndImagesv4) 
    ON m.textEmbedding 
    OPTIONS { 
        indexConfig: { 
            `vector.dimensions`: 384, 
            `vector.similarity_function`: 'cosine'
        }
    }
    """)
    
    print("Neo4j indexes created successfully.")
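
The index is declared with 384 dimensions, which matches the output size of `all-MiniLM-L6-v2`. If the embedding model is ever swapped, the vectors and the index would disagree. A small guard such as the following sketch could catch that before any write; `check_embedding` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Sequence

EXPECTED_DIMENSIONS = 384  # must match `vector.dimensions` in the index definition


def check_embedding(embedding: Sequence[float], expected: int = EXPECTED_DIMENSIONS) -> List[float]:
    """Return the embedding as a list, or raise if its length does not match the index."""
    if len(embedding) != expected:
        raise ValueError(f"embedding has {len(embedding)} dimensions, index expects {expected}")
    return list(embedding)


# Placeholder vector standing in for a real model output.
valid = check_embedding([0.0] * 384)
```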

## Section 5: Helper Functions

In [6]:
# Extract folder and file components from a path structure
def extract_path_components(path: str) -> Dict[str, str]:
    components = path.split('/')
    if len(components) >= 2:
        folder = components[0]
        # Treat the second component as a subfolder only when a file name follows it
        subfolder = components[1] if len(components) > 2 else None
        filename = components[-1]
    else:
        folder = None
        subfolder = None
        filename = components[0]
    
    return {
        "folder": folder, 
        "subfolder": subfolder,
        "filename": filename
    }

## Section 6: Document Processing Functions

In [7]:
# Load and parse JSON file containing OCR data
def process_json_file(file_path: str) -> Dict:
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

# Create metadata for a chunk with updated structure
def create_metadata(token_split_texts, folder, subfolder, filename):
    chunks_with_metadata = []
    chunk_seq_id = 0
    
    # Format subfolder name - replace underscores with spaces
    display_subfolder = subfolder.replace("_", " ") if subfolder else ""
    
    # Create metadata for each chunk
    for chunk in token_split_texts:
        chunks_with_metadata.append({
            'text': chunk,
            'chunkSeqId': chunk_seq_id,
            'chunkId': f'{folder}_{subfolder}_{filename}-chunk-{chunk_seq_id:04d}',
            'folder': folder,
            'subfolder': subfolder,
            'file_name': filename,
            'type': 'image',
            'ImageUrl': f"https://www.bclaws.gov.bc.ca/civix/document/id/complete/statreg/{filename}"
        })
        
        # Add specific ID based on folder type - use display_subfolder with spaces
        if folder == "Acts":
            chunks_with_metadata[-1]["ActId"] = display_subfolder
        elif folder == "Regulations":
            chunks_with_metadata[-1]["RegId"] = display_subfolder
            
        chunk_seq_id += 1
        
    return chunks_with_metadata

# Create chunks from OCR text
def create_chunks(ocr_text, folder, subfolder, filename):
    # Reuse the module-level splitter instead of reloading the tokenizer for every file
    token_split_texts = text_splitter.split_text(ocr_text)
    meta_data = create_metadata(token_split_texts, folder, subfolder, filename)    
    return meta_data


# Save chunk data to Neo4j with updated metadata and NEXT relationships
def save_chunk_to_neo4j(chunk_id, text, metadata, embedding, previous_chunk_id=None):
    # Base query for creating the ImageChunk node
    query = """
    MERGE (c:ImageChunk:UpdatedChunksAndImagesv4 {id: $id})
    SET c.text = $text,
        c.textEmbedding = $embedding,
        c.ImageUrl = $metadata.ImageUrl,
        c.folder = $metadata.folder,
        c.subfolder = $metadata.subfolder,
        c.file_name = $metadata.file_name,
        c.type = $metadata.type,
        c.chunkSeqId = $metadata.chunkSeqId,
        c.chunkId = $metadata.chunkId
    """
    
    # Add ActId or RegId if present
    if "ActId" in metadata:
        query += """
        SET c.ActId = $metadata.ActId
        """
    
    if "RegId" in metadata:
        query += """
        SET c.RegId = $metadata.RegId
        """
    
    # Add relationship to previous chunk if available
    if previous_chunk_id:
        query += """
        WITH c
        MATCH (prev:ImageChunk {id: $prev_id})
        MERGE (prev)-[:NEXT]->(c)
        """
    
    # Add conditional relationship creation based on folder type
    if "ActId" in metadata:
        query += """
        WITH c
        MERGE (a:Act {name: $metadata.ActId})
        MERGE (c)-[:PART_OF]->(a)
        """
    elif "RegId" in metadata:
        query += """
        WITH c
        MERGE (r:Regulation {name: $metadata.RegId})
        MERGE (c)-[:PART_OF]->(r)
        """
    
    query += "\nRETURN c"
    
    result = graph.query(
        query=query,
        params={
            "id": chunk_id,  # required by the MERGE clause ({id: $id})
            "text": text,
            "embedding": embedding,
            "metadata": metadata,
            "prev_id": previous_chunk_id
        }
    )
    
    return result
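
Saving one chunk per query is simple but costs a round trip per chunk. As an alternative (a sketch only, not what this notebook does), rows can be grouped into batches and passed as a list parameter to a query that starts with `UNWIND`. The helper `batch_rows` and the example query text are assumptions for illustration, and the query is shown as a string without being executed.

```python
from typing import Dict, Iterator, List

# Illustrative Cypher for a batched write; each element of $rows is one chunk.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (c:ImageChunk:UpdatedChunksAndImagesv4 {id: row.chunkId})
SET c.text = row.text,
    c.textEmbedding = row.embedding
"""


def batch_rows(rows: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical rows; real rows would carry text and embeddings as well.
rows = [{"chunkId": f"chunk-{i:04d}"} for i in range(5)]
batches = list(batch_rows(rows, size=2))
```

Each batch could then be sent with `graph.query(BATCH_QUERY, params={"rows": batch})`. The `NEXT` and `PART_OF` relationships would still need their own clauses.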

## Section 7: Main Processing Logic

In [8]:
# Process all OCR data from the loaded JSON
def process_ocr_data(data: Dict):
    total_chunks = 0
    
    # Process the nested structure
    for folder, subfolders in data.items():
        print(f"Processing folder: {folder}")
        
        for subfolder, files in subfolders.items():
            print(f"  Processing subfolder: {subfolder}")
            
            for filename, ocr_text in files.items():
                print(f"    Processing file: {filename}")
                
                # Replace line breaks with spaces for better readability
                ocr_text = ocr_text.replace('\n', ' ')
                
                # Split the text into chunks and attach metadata
                chunks_with_metadata = create_chunks(ocr_text, folder, subfolder, filename)
                
                # Process each chunk and save to Neo4j
                previous_chunk_id = None
                
                for chunk_data in chunks_with_metadata:
                    # Create embedding
                    embedding = embedding_model.embed_query(chunk_data['text'])
                    
                    # Save to Neo4j with NEXT relationship to previous chunk
                    save_chunk_to_neo4j(
                        chunk_id=chunk_data['chunkId'],
                        text=chunk_data['text'],
                        metadata=chunk_data,
                        embedding=embedding,
                        previous_chunk_id=previous_chunk_id
                    )
                    
                    previous_chunk_id = chunk_data['chunkId']
                    total_chunks += 1
    
    print(f"Completed processing. Total chunks created: {total_chunks}")

## Section 8: Main Execution

In [None]:
# Main execution function
def main():
    # Setup Neo4j indexes
    setup_neo4j_indexes()
    
    # Specify path to your JSON file
    json_file_path = "./final_image_sonnet.json"
    
    # Check if file exists
    if not os.path.exists(json_file_path):
        print(f"ERROR: File '{json_file_path}' not found. Please check the path.")
        return
    
    # Process the OCR data
    print(f"Loading data from {json_file_path}...")
    data = process_json_file(json_file_path)
    
    # Verify data was loaded
    if not data:
        print("ERROR: No data loaded from the JSON file.")
        return
    
    # Process the data
    start_time = time.time()
    process_ocr_data(data)
    end_time = time.time()
    
    print(f"Processing completed in {end_time - start_time:.2f} seconds.")

# Run the main function to process and index all data
print("Starting main processing...")
main()

## Section 9: Neo4j Node Label Consolidation

In [None]:
# Execute Cypher query to consolidate node labels for unified vector search
def execute_label_consolidation():
    # Consolidate UpdatedChunk nodes
    updated_chunk_query = """
    MATCH (m:UpdatedChunk) 
    WHERE NOT m:UpdatedChunksAndImagesv4
    SET m:UpdatedChunksAndImagesv4
    RETURN count(m) as updated_chunks
    """
    
    # Consolidate ImageChunk nodes
    image_chunk_query = """
    MATCH (m:ImageChunk) 
    WHERE NOT m:UpdatedChunksAndImagesv4
    SET m:UpdatedChunksAndImagesv4
    RETURN count(m) as updated_image_chunks
    """
    
    # Execute both queries
    updated_chunks_result = graph.query(updated_chunk_query)
    image_chunks_result = graph.query(image_chunk_query)
    
    # Extract counts
    updated_chunks_count = updated_chunks_result[0]['updated_chunks'] if updated_chunks_result else 0
    image_chunks_count = image_chunks_result[0]['updated_image_chunks'] if image_chunks_result else 0
    
    print(f"Consolidated {updated_chunks_count} UpdatedChunk nodes")
    print(f"Consolidated {image_chunks_count} ImageChunk nodes")
    print(f"Total nodes consolidated: {updated_chunks_count + image_chunks_count}")
    
    # Verify the consolidation
    verification_query = """
    MATCH (n:UpdatedChunksAndImagesv4)
    RETURN count(n) as consolidated_nodes
    """
    
    verification_result = graph.query(verification_query)
    total_consolidated = verification_result[0]['consolidated_nodes'] if verification_result else 0
    
    print(f"Total nodes with UpdatedChunksAndImagesv4 label: {total_consolidated}")
    
    return {
        "updated_chunks": updated_chunks_count,
        "image_chunks": image_chunks_count,
        "total_consolidated": total_consolidated
    }

# Execute the consolidation
consolidation_results = execute_label_consolidation()

## Section 10: Neo4j Node Creation Verification Query

In [None]:
def verify_processing():
    # Check for ImageChunk nodes
    image_query = """
    MATCH (c:ImageChunk)
    RETURN count(c) as image_count
    """
    
    # Check for specific properties
    properties_query = """
    MATCH (c:ImageChunk)
    WHERE c.ActId IS NOT NULL OR c.RegId IS NOT NULL
    RETURN count(c) as doc_count
    """
    
    # Check for NEXT relationships
    next_query = """
    MATCH (c1:ImageChunk)-[:NEXT]->(c2:ImageChunk)
    RETURN count(c1) as relationship_count
    """
    
    image_result = graph.query(image_query)
    properties_result = graph.query(properties_query)
    next_result = graph.query(next_query)
    
    image_count = image_result[0]["image_count"] if image_result else 0
    doc_count = properties_result[0]["doc_count"] if properties_result else 0
    rel_count = next_result[0]["relationship_count"] if next_result else 0
    
    print(f"ImageChunk nodes: {image_count}")
    print(f"Nodes with ActId/RegId: {doc_count}")
    print(f"NEXT relationships: {rel_count}")
    
    return {
        "image_count": image_count,
        "doc_count": doc_count,
        "rel_count": rel_count
    }

# Run verification
verification = verify_processing()
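
One way to interpret the `NEXT` count is to compare it with what the chunking implies: a file with `n` chunks contributes `n - 1` links, so the total should equal the number of chunks minus the number of files that produced at least one chunk. The sketch below computes that expectation from hypothetical per-file chunk counts; `expected_next_count` is an illustrative helper.

```python
from typing import Dict


def expected_next_count(chunks_per_file: Dict[str, int]) -> int:
    """Number of NEXT relationships implied by the per-file chunk counts."""
    return sum(max(n - 1, 0) for n in chunks_per_file.values())


# Hypothetical counts: three chunks, one chunk, and an empty file.
example_counts = {"a.gif": 3, "b.gif": 1, "c.gif": 0}
expected = expected_next_count(example_counts)
```

If the count reported by `verify_processing()` is lower than this expectation, some links were probably not created.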

## Section 11: Neo4j Test Query Functions

In [14]:
# Search for similar chunks with related chunks via NEXT relationship
def search_similar_chunks(query_text: str, top_k: int = 5, include_related: bool = True):
    # Generate embedding for the query
    query_embedding = embedding_model.embed_query(query_text)
    
    # Search in Neo4j using vector index
    search_query = """
    CALL db.index.vector.queryNodes('UpdatedChunksAndImagesv4', $top_k, $textEmbedding)
    YIELD node, score
    """
    
    if include_related:
        search_query += """
        WITH node, score
        OPTIONAL MATCH (node)<-[:NEXT]-(prev:ImageChunk)
        OPTIONAL MATCH (node)-[:NEXT]->(next:ImageChunk)
        """
    
    search_query += """
    RETURN 
        node.id as id,
        node.ImageUrl as ImageUrl,
        node.text as text,
        node.type as type,
        node.folder as folder,
        node.subfolder as subfolder,
        node.file_name as file_name,
        node.chunkSeqId as chunkSeqId,
        score,
    """
    
    if include_related:
        search_query += """
        prev.text as prev_text,
        prev.id as prev_id,
        next.text as next_text,
        next.id as next_id,
        """
    
    search_query += """
        CASE 
            WHEN node.ActId IS NOT NULL THEN {type: 'Act', id: node.ActId}
            WHEN node.RegId IS NOT NULL THEN {type: 'Regulation', id: node.RegId}
            ELSE null
        END as related_document
    ORDER BY score DESC
    """
    
    results = graph.query(
        query=search_query,
        params={"textEmbedding": query_embedding, "top_k": top_k}
    )
    
    return results

# Two-stage search with vector similarity followed by reranking, including related chunks
def reranked_search(query_text: str, top_k: int = 5, candidates: int = 20, include_related: bool = True):
    # Load the reranking model
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    # Generate embedding for the query
    query_embedding = embedding_model.embed_query(query_text)
    
    # First stage: Retrieve candidates with vector search
    initial_query = """
    CALL db.index.vector.queryNodes('UpdatedChunksAndImagesv4', $candidates, $textEmbedding)
    YIELD node, score
    """
    
    if include_related:
        initial_query += """
        WITH node, score
        OPTIONAL MATCH (node)<-[:NEXT]-(prev:ImageChunk)
        OPTIONAL MATCH (node)-[:NEXT]->(next:ImageChunk)
        """
    
    initial_query += """
    RETURN 
        node.id as id,
        node.text as text,
        node.folder as folder,
        node.subfolder as subfolder,
        node.file_name as file_name,
        node.type as type,
        node.ImageUrl as ImageUrl,
        node.chunkSeqId as chunkSeqId,
        score as vector_score,
    """
    
    if include_related:
        initial_query += """
        prev.text as prev_text,
        prev.id as prev_id,
        next.text as next_text,
        next.id as next_id,
        """
    
    initial_query += """
        CASE 
            WHEN node.ActId IS NOT NULL THEN {type: 'Act', id: node.ActId}
            WHEN node.RegId IS NOT NULL THEN {type: 'Regulation', id: node.RegId}
            ELSE null
        END as related_document
    """
    
    candidate_results = graph.query(
        query=initial_query,
        params={"textEmbedding": query_embedding, "candidates": candidates}
    )
    
    # Second stage: Rerank candidates
    if candidate_results:
        # Create pairs of (query, text) for reranking
        pairs = [(query_text, result["text"]) for result in candidate_results]
        
        # Score candidate pairs with reranker model
        reranker_scores = reranker.predict(pairs)
        
        # Add reranker scores to results
        for i, result in enumerate(candidate_results):
            result["reranker_score"] = float(reranker_scores[i])
        
        # Sort by reranker score (descending)
        reranked_results = sorted(candidate_results, key=lambda x: x["reranker_score"], reverse=True)
        
        # Return top_k results
        return reranked_results[:top_k]
    
    return []

## Section 12: Neo4j Test Queries

In [None]:
# Test different search methods without processing data.
def test_rerankers():
    # Test the standard vector search.
    query = "What does the official coat of arms of BC look like?"
    
    print("\nStandard Vector Search:")
    vector_results = search_similar_chunks(query, top_k=10)
    for result in vector_results:
        print(f"Score: {result['score']:.4f}")
        print(f"Document: {result['folder']}/{result['subfolder']}/{result['file_name']}")
        print(f"ID: {result['id']}")
        print(f"ChunkSeqId: {result['chunkSeqId']}")
        print(f"Text: {result['text'][:100]}...\n")
        
        if 'prev_text' in result and result['prev_text']:
            print(f"Previous chunk: {result['prev_text'][:50]}...\n")
        
        if 'next_text' in result and result['next_text']:
            print(f"Next chunk: {result['next_text'][:50]}...\n")
    
    # Test the reranked search.
    print("\nReranked Search:")
    rerank_results = reranked_search(query, top_k=3)
    for result in rerank_results:
        print(f"Score: {result['reranker_score']:.4f}")
        print(f"Document: {result['folder']}/{result['subfolder']}/{result['file_name']}")
        print(f"ID: {result['id']}")
        print(f"ChunkSeqId: {result['chunkSeqId']}")
        print(f"Text: {result['text'][:100]}...\n")
        
        if 'prev_text' in result and result['prev_text']:
            print(f"Previous chunk: {result['prev_text'][:50]}...\n")
        
        if 'next_text' in result and result['next_text']:
            print(f"Next chunk: {result['next_text'][:50]}...\n")
            
test_rerankers()