# ICC Judgment Chunking System for Databricks

This notebook contains the chunking mechanism for processing ICC judgments on Databricks.
**Note**: PDF parsing is now isolated in `00_PDF_Parsing_Isolation.ipynb` - run that first!

## Prerequisites
1. **Parse PDF** → Run `00_PDF_Parsing_Isolation.ipynb` first
2. **Build Chunks** → This notebook (03_ICC_Judgment_Chunking.ipynb)
3. **Summarize Chunks** → 04_ICC_Judgment_Chunking_summary.ipynb
4. **Optimize Vector Search** → 01_Optimized_Vector_Search.ipynb
5. **Build and Deploy RAG** → 02_Production_RAG_Deployment.ipynb

## Features
- Conservative chunking approach prioritizing main text quality
- Advanced footnote detection with confidence scoring
- Comprehensive text cleaning
- Multiple export formats (JSON, JSONL, CSV, Parquet)
- Spark-optimized data structures
- Integration with isolated PDF parsing
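
The export formats listed above can be illustrated with a small standard-library sketch. This is not the API of `exporters.py`; the helper names `to_jsonl` and `to_csv` and the sample records are hypothetical, and Parquet is left out because it needs an extra dependency such as `pyarrow`.

```python
import csv
import io
import json
from typing import Dict, List


def to_jsonl(records: List[Dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def to_csv(records: List[Dict]) -> str:
    """Serialize records as CSV, using the first record's keys as the header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample records, not real pipeline output
sample_records = [
    {"chunk_id": "main_chunk_0001", "section_type": "OVERVIEW", "token_count": 120},
    {"chunk_id": "main_chunk_0002", "section_type": "VERDICT", "token_count": 95},
]
jsonl_text = to_jsonl(sample_records)
csv_text = to_csv(sample_records)
```

JSONL suits streaming ingestion because each line parses on its own, while CSV is convenient for quick inspection of flat metadata.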

## Architecture Overview

This system now integrates with the isolated PDF parser:

1. **PDF Parser** (`00_PDF_Parsing_Isolation.ipynb`): PyMuPDF-based parsing (28.8x faster)
2. **Conservative Chunker** (this notebook): Priority on main text quality over comprehensive footnote extraction
3. **Text Cleaner** (`effective_chunk_cleaner.py`): Removes ICC-specific noise patterns
4. **Data Exporter** (`exporters.py`): Multi-format export with Spark optimization
5. **Main Chunker** (`chunker.py`): Modular chunking engine with configurable parameters


In [None]:
# Install required packages
%pip install PyMuPDF pandas pyarrow

In [None]:
# Load configuration
import sys
sys.path.append('/Workspace/Users/christophe629@gmail.com/icc_rag_backend/databricks-deployment/config')

# Import unified configuration
from databricks_config import *

# Core imports
import fitz  # PyMuPDF
import json
import re
import pandas as pd
from typing import Dict, List, Tuple, Optional, Set
from dataclasses import dataclass, asdict
from collections import defaultdict
from pathlib import Path
import builtins

# Spark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, current_timestamp, count, avg, min as spark_min, max as spark_max, desc
from pyspark.sql.types import *

print("✅ Configuration and imports loaded")
print_config_summary()

## Data Models and Configuration

The system uses structured data models to maintain consistency and type safety:


In [None]:
@dataclass
class ChunkMetadata:
    """Metadata for a single chunk."""
    case_name: str
    case_number: str
    date: str
    chamber: str
    section: str
    para_numbers: List[int]
    page_numbers: List[int]
    section_type: str
    paragraph_range: str
    chunk_type: str = "main_text"
    # None is replaced with a fresh list in __post_init__ (avoids a shared mutable default)
    legal_citations: Optional[List[str]] = None

    def __post_init__(self):
        if self.legal_citations is None:
            self.legal_citations = []


@dataclass
class ConservativeFootnote:
    id: str
    number: str
    content: str
    page: int
    confidence: float  # 0.0-1.0 confidence score
    extraction_method: str


@dataclass
class CleanParagraph:
    id: str
    number: Optional[str]  # [123] legal paragraph number
    content: str
    page: int
    section_type: str
    token_count: int
    footnote_markers_removed: List[str]  # List of footnote numbers referenced


@dataclass
class MainTextChunk:
    id: str
    content: str
    paragraphs: List[CleanParagraph]
    metadata: Dict
    token_count: int


print("✅ Data models defined")
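
As a minimal sketch of how such models turn into plain records, the example below uses a hypothetical `ParagraphRecord` (a cut-down stand-in for `CleanParagraph`, not a class from this notebook). It shows `field(default_factory=list)` as an alternative to the `None` plus `__post_init__` idiom, and `asdict` for conversion to a dictionary.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ParagraphRecord:
    """Hypothetical stand-in for CleanParagraph with fewer fields."""
    id: str
    number: Optional[str]
    content: str
    page: int
    # default_factory builds a new list per instance
    footnote_markers_removed: List[str] = field(default_factory=list)


first = ParagraphRecord(id="para_OVERVIEW_1_1", number="1", content="Example text.", page=1)
second = ParagraphRecord(id="para_OVERVIEW_1_2", number="2", content="More text.", page=1)

# Each instance owns its list, so mutating one does not affect the other
first.footnote_markers_removed.append("12")

# asdict returns a plain dict, ready for json.dumps or a DataFrame row
record = asdict(first)
```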

In [None]:
# Use configuration from databricks_config
# Configuration is already loaded from the unified config file
print("✅ Using unified configuration:")
print(f"📁 Min paragraph length: {CHUNKING_CONFIG['min_paragraph_length']}")
print(f"📊 Max tokens per chunk: {CHUNKING_CONFIG['max_tokens_per_chunk']}")
print(f"🎯 Conservative threshold: {CHUNKING_CONFIG['conservative_footnote_threshold']}")
print(f"📋 Document: {DOCUMENT_INFO['case_name']}")
print(f"🗂️  Output table: {get_databricks_path('chunks_table')}")

## Conservative Chunker Class

Based on `conservative_chunker.py` - prioritizes main text quality over comprehensive footnote extraction:


In [None]:
class ConservativeChunker:
    """Conservative ICC Judgment Chunker optimized for main text quality."""

    def __init__(self, pdf_path: str, config: Dict = None):
        self.pdf_path = pdf_path
        # Fall back to the unified config loaded from databricks_config
        self.config = config or CHUNKING_CONFIG
        self.doc = fitz.open(pdf_path)

        # Conservative patterns - only catch obvious footnotes
        self.high_confidence_footnote_patterns = [
            # Very conservative: number + space + legal content at start of line
            r'^\s*(\d{1,3})\s+(See\s+|Cf\.\s+|ICC-|Trial\s+Chamber|Appeals\s+Chamber|Article\s+\d+|Rule\s+\d+)',
            # Case citations
            r'^\s*(\d{1,3})\s+[A-Z][a-z]+\s+v\.\s+[A-Z]',
            # Document references
            r'^\s*(\d{1,3})\s+.*?ICC-\d+/\d+',
        ]

        # Patterns that should NEVER be considered footnotes
        self.exclusion_patterns = [
            r'ICC-01/14-01/18-2784-Red',  # Document headers
            r'\d+/1616',  # Page numbers
            r'TRIAL CHAMBER|Appeals Chamber|Pre-Trial Chamber',  # Headers
            r'Judge .+ Presiding',
            r'☒|☐',  # Checkboxes
            r'Contents|Table of Contents',
            r'Decision to be notified',
            r'SITUATION IN THE CENTRAL AFRICAN REPUBLIC',
        ]

        # Legal section identifiers
        self.section_patterns = [
            ('OVERVIEW', r'I\.\s+OVERVIEW'),
            ('FINDINGS_OF_FACT', r'II\.\s+FINDINGS\s+OF\s+FACT'),
            ('EVIDENTIARY_CONSIDERATIONS', r'III\.\s+EVIDENTIARY\s+CONSIDERATIONS'),
            ('APPLICABLE_LAW', r'IV\.\s+APPLICABLE\s+LAW'),
            ('LEGAL_CHARACTERISATION', r'V\.\s+LEGAL\s+CHARACTERISATION'),
            ('SENTENCE', r'VI\.\s+SENTENCE'),
            ('VERDICT', r'VII\.\s+VERDICT'),
        ]

    def close(self):
        """Release the underlying PDF document."""
        self.doc.close()

    def identify_sections(self) -> Dict[int, str]:
        """Identify document sections for better context."""
        page_sections = {}
        current_section = "HEADER"

        for page_num in range(len(self.doc)):
            page = self.doc[page_num]
            text = page.get_text("text")

            # Check for section headers; the first matching pattern wins
            for section_name, pattern in self.section_patterns:
                if re.search(pattern, text, re.IGNORECASE):
                    current_section = section_name
                    break

            # Pages without a header inherit the most recent section
            page_sections[page_num + 1] = current_section

        return page_sections


print("✅ ConservativeChunker class initialized")
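
The carry-forward behaviour of `identify_sections` can be seen without a PDF. The sketch below is a simplified stand-in that takes page texts as strings; `label_pages` and the sample pages are hypothetical, and only two of the section patterns are included.

```python
import re
from typing import Dict, List, Tuple

# Two of the section patterns from the class, for illustration only
SECTION_PATTERNS: List[Tuple[str, str]] = [
    ("OVERVIEW", r"I\.\s+OVERVIEW"),
    ("FINDINGS_OF_FACT", r"II\.\s+FINDINGS\s+OF\s+FACT"),
]


def label_pages(page_texts: List[str]) -> Dict[int, str]:
    """Map 1-based page numbers to the most recently seen section."""
    labels: Dict[int, str] = {}
    current = "HEADER"
    for index, text in enumerate(page_texts):
        for name, pattern in SECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                current = name
                break
        labels[index + 1] = current
    return labels


# Hypothetical page texts
pages = [
    "Cover page",
    "I. OVERVIEW\nThe Chamber sets out the case.",
    "Text continuing the overview.",
    "II. FINDINGS OF FACT\nThe Chamber finds as follows.",
]
page_labels = label_pages(pages)
```

One consequence of this design is that a table-of-contents page listing several headings is labelled with whichever pattern comes first in the list, so such pages may need to be excluded upstream.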

In [None]:
# Add methods to ConservativeChunker class
def extract_conservative_footnotes(self) -> List[ConservativeFootnote]:
    """Extract footnotes with very conservative detection."""
    print("=== CONSERVATIVE FOOTNOTE EXTRACTION ===")

    footnotes = []

    for page_num in range(len(self.doc)):
        page = self.doc[page_num]
        text = page.get_text("text")
        lines = text.split('\n')
        page_number = page_num + 1

        for line in lines:
            line = line.strip()
            if not line or len(line) < 15:  # Very short lines unlikely to be footnotes
                continue

            # Skip obvious non-footnotes
            if any(re.search(pattern, line) for pattern in self.exclusion_patterns):
                continue

            # Try high-confidence patterns
            for pattern in self.high_confidence_footnote_patterns:
                match = re.match(pattern, line, re.IGNORECASE)
                if match:
                    footnote_num = match.group(1)

                    # Additional validation
                    try:
                        num_val = int(footnote_num)
                        if num_val > 2000:  # Unlikely to be a footnote number
                            continue
                    except ValueError:
                        continue

                    # Extract full footnote content: everything after the footnote
                    # number, so the citation text matched by the pattern is kept
                    footnote_content = line[match.end(1):].strip()

                    # Must have substantial legal content
                    legal_indicators = [
                        'see', 'cf.', 'ibid', 'supra', 'para', 'judgment', 'decision',
                        'ICC-', 'ICTR-', 'ICTY-', 'article', 'rule', 'statute'
                    ]

                    has_legal_content = any(
                        indicator.lower() in footnote_content.lower()
                        for indicator in legal_indicators
                    )

                    if has_legal_content and len(footnote_content) > 10:
                        confidence = self._calculate_footnote_confidence(footnote_content)

                        if confidence >= 0.7:  # High confidence threshold
                            footnote_id = f"fn_conservative_{page_number}_{footnote_num}"
                            footnotes.append(ConservativeFootnote(
                                id=footnote_id,
                                number=footnote_num,
                                content=footnote_content,
                                page=page_number,
                                confidence=confidence,
                                extraction_method="conservative_single_line"
                            ))
                    break  # Only one pattern match per line

    # Remove duplicates, keeping the highest-confidence entry per (page, number)
    unique_footnotes = {}
    for fn in footnotes:
        key = (fn.page, fn.number)
        if key not in unique_footnotes or fn.confidence > unique_footnotes[key].confidence:
            unique_footnotes[key] = fn

    final_footnotes = list(unique_footnotes.values())
    print(f"Extracted {len(final_footnotes)} high-confidence footnotes")
    return final_footnotes


def _calculate_footnote_confidence(self, content: str) -> float:
    """Calculate confidence score for footnote content."""
    score = 0.0

    # Legal citation patterns (high value)
    if re.search(r'ICC-\d+/\d+', content):
        score += 0.3
    if re.search(r'[A-Z][a-z]+ v\. [A-Z]', content):
        score += 0.3
    if re.search(r'para\.?\s+\d+', content, re.IGNORECASE):
        score += 0.2
    if re.search(r'Article\s+\d+', content, re.IGNORECASE):
        score += 0.2

    # Legal keywords (medium value)
    legal_keywords = ['judgment', 'decision', 'appeals', 'trial chamber', 'rule', 'statute']
    keyword_count = sum(1 for kw in legal_keywords if kw.lower() in content.lower())
    score += builtins.min(keyword_count * 0.1, 0.3)

    # Length penalty for very short content
    if len(content) < 20:
        score -= 0.3

    # Bonus for proper citation format
    if re.search(r'(See|Cf\.)\s+', content):
        score += 0.1

    # Clamp to the documented 0.0-1.0 range (the length penalty can push the raw score below zero)
    return builtins.max(0.0, builtins.min(score, 1.0))


# Add methods to ConservativeChunker
ConservativeChunker.extract_conservative_footnotes = extract_conservative_footnotes
ConservativeChunker._calculate_footnote_confidence = _calculate_footnote_confidence

print("✅ Footnote extraction methods added")
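
To make the scoring concrete, here is a standalone sketch that applies the same weights to two sample citations. The function name `score_footnote` and the sample strings are hypothetical, and the result is clamped to 0.0-1.0 on the assumption that this is the intended range of the `confidence` field.

```python
import re

# (pattern, flags, weight) triples mirroring the weights used in the notebook
PATTERN_WEIGHTS = [
    (r"ICC-\d+/\d+", 0, 0.3),              # ICC document reference
    (r"[A-Z][a-z]+ v\. [A-Z]", 0, 0.3),    # case citation
    (r"para\.?\s+\d+", re.IGNORECASE, 0.2),
    (r"Article\s+\d+", re.IGNORECASE, 0.2),
]
KEYWORDS = ["judgment", "decision", "appeals", "trial chamber", "rule", "statute"]


def score_footnote(content: str) -> float:
    """Score a candidate footnote; higher means more citation-like."""
    score = 0.0
    for pattern, flags, weight in PATTERN_WEIGHTS:
        if re.search(pattern, content, flags):
            score += weight
    hits = sum(1 for kw in KEYWORDS if kw in content.lower())
    score += min(hits * 0.1, 0.3)          # keyword bonus is capped at 0.3
    if len(content) < 20:
        score -= 0.3                       # penalty for very short content
    if re.search(r"(See|Cf\.)\s+", content):
        score += 0.1                       # bonus for a citation signal word
    return max(0.0, min(score, 1.0))


# Hypothetical citations, not taken from the judgment
weak = score_footnote("See ICC-01/14-01/18-403, para. 12")
strong = score_footnote("See Appeals Chamber, Judgment, ICC-01/04-01/06-3121, para. 20")
```

The first sample collects 0.3 (document reference) + 0.2 (paragraph) + 0.1 (`See`) and lands near 0.6, below the 0.7 threshold. The second adds 0.2 for the keywords "appeals" and "judgment" and lands near 0.8, so only it would be kept.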

## Main Processing Function

Complete pipeline for processing ICC judgments:


In [None]:
def process_icc_judgment(pdf_path: str,
                         config: Dict = None,
                         create_table: bool = True,
                         table_name: str = None) -> Dict:
    """Complete ICC judgment processing pipeline."""

    print(f"=== PROCESSING ICC JUDGMENT: {pdf_path} ===")

    # Initialize chunker with unified config
    config = config or CHUNKING_CONFIG
    table_name = table_name or get_databricks_path("chunks_table")
    chunker = ConservativeChunker(pdf_path, config)

    try:
        # Step 1: Extract footnotes
        footnotes = chunker.extract_conservative_footnotes()

        # Step 2: Extract main text (simplified for demo)
        # This would use the full extract_pristine_main_text method from conservative_chunker.py
        paragraphs = []
        page_sections = chunker.identify_sections()

        # Basic paragraph extraction for demo
        for page_num in range(builtins.min(len(chunker.doc), 10)):  # Limit for demo
            page = chunker.doc[page_num]
            text = page.get_text("text")
            section_type = page_sections.get(page_num + 1, "UNKNOWN")

            # Extract numbered paragraphs [123]
            numbered_para_pattern = r'\[(\d+)\]\s*([^[]*?)(?=\[|\Z)'
            for match in re.finditer(numbered_para_pattern, text, re.DOTALL):
                para_num = match.group(1)
                para_content = match.group(2).strip()

                if len(para_content) > 50:  # Skip very short paragraphs
                    clean_content = re.sub(r'\s+', ' ', para_content).strip()
                    # Rough token estimate: about 1.3 tokens per whitespace-separated word
                    token_count = int(len(clean_content.split()) * 1.3)

                    paragraph = CleanParagraph(
                        id=f"para_{section_type}_{page_num + 1}_{para_num}",
                        number=para_num,
                        content=clean_content,
                        page=page_num + 1,
                        section_type=section_type,
                        token_count=token_count,
                        footnote_markers_removed=[]
                    )
                    paragraphs.append(paragraph)

        # Step 3: Create chunks
        chunks = []
        chunk_id = 1

        # Group paragraphs by section
        section_paragraphs = defaultdict(list)
        for para in paragraphs:
            section_paragraphs[para.section_type].append(para)

        for section_type, section_paras in section_paragraphs.items():
            current_chunk_paras = []
            current_tokens = 0
            max_tokens = config["max_tokens_per_chunk"]

            for para in section_paras:
                if current_tokens + para.token_count > max_tokens and current_chunk_paras:
                    # Create chunk
                    chunk = create_main_text_chunk(current_chunk_paras, chunk_id, section_type)
                    chunks.append(chunk)
                    chunk_id += 1
                    current_chunk_paras = []
                    current_tokens = 0

                current_chunk_paras.append(para)
                current_tokens += para.token_count

            # Handle remaining paragraphs
            if current_chunk_paras:
                chunk = create_main_text_chunk(current_chunk_paras, chunk_id, section_type)
                chunks.append(chunk)
                chunk_id += 1

        # Step 4: Create Spark DataFrame and save as Delta table
        if create_table and len(chunks) > 0:
            spark = SparkSession.getActiveSession()
            if spark is None:
                spark = SparkSession.builder.appName("ICC_Chunking").getOrCreate()

            # Convert to flat format for Spark
            spark_data = []
            for chunk in chunks:
                flat_chunk = {
                    'chunk_id': chunk.id,
                    'content': chunk.content,
                    'token_count': chunk.token_count,
                    'case_name': chunk.metadata['case_name'],
                    'case_number': chunk.metadata['case_number'],
                    'chamber': chunk.metadata['chamber'],
                    'date': chunk.metadata['date'],
                    'section_type': chunk.metadata['section_type'],
                    'section_title': chunk.metadata['section_title'],
                    'paragraph_count': chunk.metadata['paragraph_count'],
                    'page_range': chunk.metadata['page_range'],
                    'extraction_quality': chunk.metadata['extraction_quality']
                }
                spark_data.append(flat_chunk)

            df = spark.createDataFrame(spark_data)
            df.write.format("delta").mode("overwrite").saveAsTable(table_name)
            print(f"✅ Delta table created: {table_name}")

        results = {
            "chunks": chunks,
            "footnotes": footnotes,
            "paragraphs": paragraphs,
            "statistics": {
                "main_text_chunks": len(chunks),
                "clean_paragraphs": len(paragraphs),
                "conservative_footnotes": len(footnotes),
                "avg_confidence": sum(fn.confidence for fn in footnotes) / len(footnotes) if footnotes else 0
            }
        }

        print("\n=== PROCESSING COMPLETE ===")
        print(f"Main text chunks: {len(chunks)}")
        print(f"Clean paragraphs: {len(paragraphs)}")
        print(f"Conservative footnotes: {len(footnotes)}")

        return results

    finally:
        # Close the PyMuPDF document held by the chunker
        chunker.doc.close()


def create_main_text_chunk(paragraphs: List[CleanParagraph],
                           chunk_id: int, section_type: str) -> MainTextChunk:
    """Create a main text chunk from paragraphs."""

    # Combine paragraph content
    content_parts = []
    for para in paragraphs:
        if para.number:
            content_parts.append(f"[{para.number}] {para.content}")
        else:
            content_parts.append(para.content)

    content = "\n\n".join(content_parts)

    # Aggregate metadata
    all_pages = set(p.page for p in paragraphs)
    paragraph_numbers = [p.number for p in paragraphs if p.number]
    total_tokens = sum(p.token_count for p in paragraphs)

    metadata = {
        "case_name": "Prosecutor v. Alfred Yekatom and Patrice-Edouard Ngaïssona",
        "case_number": "ICC-01/14-01/18",
        "chamber": "Trial Chamber V",
        "date": "24 July 2025",
        "chunk_type": "main_text_pristine",
        "section_type": section_type,
        "section_title": section_type.replace('_', ' ').title(),
        "paragraph_count": len(paragraphs),
        "numbered_paragraphs": len(paragraph_numbers),
        "paragraph_numbers": paragraph_numbers,
        "paragraph_range": f"{paragraph_numbers[0]}-{paragraph_numbers[-1]}" if paragraph_numbers else "unnumbered",
        "pages": sorted(list(all_pages)),
        "page_range": f"{builtins.min(all_pages)}-{builtins.max(all_pages)}",
        "estimated_tokens": total_tokens,
        "extraction_quality": "conservative_high_confidence"
    }

    return MainTextChunk(
        id=f"main_chunk_{chunk_id:04d}",
        content=content,
        paragraphs=paragraphs,
        metadata=metadata,
        token_count=total_tokens
    )


print("✅ Processing pipeline ready")

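The chunk-building step in the pipeline is a greedy packing of paragraphs under a token budget. The sketch below isolates that logic on plain integers; `estimate_tokens` and `pack_by_token_budget` are hypothetical names, and the 1.3 tokens-per-word factor is the same rough heuristic the pipeline uses.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)


def pack_by_token_budget(token_counts: List[int], max_tokens: int) -> List[List[int]]:
    """Group consecutive items so each group stays within max_tokens when possible."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for count in token_counts:
        # Start a new group when adding this item would exceed the budget
        if current and current_total + count > max_tokens:
            groups.append(current)
            current, current_total = [], 0
        current.append(count)
        current_total += count
    if current:
        groups.append(current)
    return groups


groups = pack_by_token_budget([300, 400, 200, 900, 100], max_tokens=800)
```

Note that a single paragraph larger than the budget (900 here) still becomes its own group, so `max_tokens` is a target rather than a hard limit unless long paragraphs are split separately.
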
## Usage Examples

### Example 1: Process a PDF File


In [None]:
# Load parsed data from PDF parsing notebook
# Note: Run 00_PDF_Parsing_Isolation.ipynb first to generate this data
try:
    # Try to load from the parsed data table
    parsed_table = get_databricks_path("parsed_for_chunking")
    spark = SparkSession.getActiveSession()
    if spark is None:
        spark = SparkSession.builder.appName("ICC_Chunking").getOrCreate()

    # Load parsed data
    parsed_df = spark.table(parsed_table)
    print(f"✅ Loaded parsed data from: {parsed_table}")
    print(f"📊 Records: {parsed_df.count()}")
    print(f"📋 Schema: {parsed_df.columns}")

    # Show sample data
    print("\n📄 Sample parsed data:")
    parsed_df.select("page_number", "section_type", "paragraph_number", "paragraph_content").show(5, truncate=False)

except Exception as e:
    print(f"⚠️  Could not load parsed data: {e}")
    print("💡 Please run 00_PDF_Parsing_Isolation.ipynb first to parse the PDF")
    print("   Then return to this notebook to continue with chunking")

    # Fallback: Use original PDF processing (for backward compatibility)
    pdf_path = PDF_SOURCE_PATH
    print(f"\n🔄 Falling back to direct PDF processing: {pdf_path}")

    results = process_icc_judgment(
        pdf_path=pdf_path,
        create_table=True,
        table_name=get_databricks_path('chunks_table')
    )

    print(f"\nProcessed {results['statistics']['main_text_chunks']} chunks")
    print(f"Clean paragraphs: {results['statistics']['clean_paragraphs']}")
    print(f"Conservative footnotes: {results['statistics']['conservative_footnotes']}")
    print("✅ ICC judgment processing complete!")

### Example 2: Query the Delta Table


In [None]:
# Example 2: Query the created Delta table
spark = SparkSession.getActiveSession()
if spark is None:
    spark = SparkSession.builder.appName("ICC_Analysis").getOrCreate()

# Load the table (after processing)
# chunks_df = spark.table("icc_judgment_chunks")

# Basic analysis examples:
print("After processing, you can run queries like:")
print()
print("# Basic info")
print("chunks_df.printSchema()")
print("chunks_df.count()")
print()
print("# Section distribution")
print('chunks_df.groupBy("section_type").count().orderBy(desc("count")).show()')
print()
print("# Token statistics")
print('chunks_df.select(avg("token_count"), spark_min("token_count"), spark_max("token_count")).show()')
print()
print("# Search for specific content")
print('chunks_df.filter(chunks_df.content.contains("Chamber")).select("chunk_id", "section_type").show()')

print("✅ Query examples ready")

### Example 3: Advanced Analytics


In [None]:
# Example 3: Advanced analytics and RAG preparation
def analyze_chunk_quality(table_name: str = "icc_judgment_chunks"):
    """Analyze the quality of generated chunks."""
    spark = SparkSession.getActiveSession()

    print("Sample analysis function - run after processing:")
    print(f"df = spark.table('{table_name}')")
    print()
    print("# Token distribution analysis")
    print("""
df.select(
    count(when(col("token_count") <= 200, 1)).alias("0-200_tokens"),
    count(when(col("token_count").between(201, 400), 1)).alias("201-400_tokens"),
    count(when(col("token_count").between(401, 600), 1)).alias("401-600_tokens"),
    count(when(col("token_count").between(601, 800), 1)).alias("601-800_tokens"),
    count(when(col("token_count") > 800, 1)).alias("800+_tokens")
).show()
    """)

    print("# Section coverage analysis")
    print("""
df.groupBy("section_type").agg(
    count("*").alias("chunk_count"),
    avg("token_count").alias("avg_tokens"),
    sum("token_count").alias("total_tokens")
).orderBy(desc("chunk_count")).show()
    """)

    print("# Prepare for vector search/RAG")
    print("""
embedding_ready_df = df.select(
    col("chunk_id").alias("id"),
    col("content").alias("text"),
    struct(
        col("case_name"),
        col("case_number"),
        col("section_type"),
        col("page_range"),
        col("token_count")
    ).alias("metadata")
)

# Save for vector search system
embedding_ready_df.write.mode("overwrite").format("delta").saveAsTable("icc_chunks_for_rag")
    """)


analyze_chunk_quality()
print("✅ Analytics examples ready")
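
For a quick check outside Spark, the token distribution buckets used in the analysis can be reproduced in plain Python. This is a stand-in for the Spark aggregation, not a replacement for it; `bucket_label`, `token_histogram`, and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def bucket_label(token_count: int) -> str:
    """Assign a token count to the same ranges used in the Spark query."""
    if token_count <= 200:
        return "0-200"
    if token_count <= 400:
        return "201-400"
    if token_count <= 600:
        return "401-600"
    if token_count <= 800:
        return "601-800"
    return "800+"


def token_histogram(token_counts: Iterable[int]) -> Counter:
    """Count how many chunks fall into each token range."""
    return Counter(bucket_label(count) for count in token_counts)


# Hypothetical token counts, not real pipeline output
histogram = token_histogram([150, 200, 201, 640, 800, 950])
```

A large "800+" bucket would suggest that individual paragraphs exceed the configured chunk budget and may need splitting.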

## Summary

This notebook provides a complete ICC judgment chunking system optimized for Databricks with the following key features:

### Core Components
1. **Conservative Chunker**: Prioritizes main text quality over comprehensive footnote extraction
2. **Section Identification**: Maintains document structure and legal paragraph numbering
3. **Spark Integration**: Native DataFrame/Delta table support for scalable processing
4. **RAG-Ready Output**: Optimized for vector search and retrieval applications

### Key Benefits
- **High-quality chunks**: Conservative approach ensures minimal corruption of legal text
- **Section awareness**: Maintains document structure and legal paragraph numbering
- **Spark-optimized**: Built for Databricks with Delta Lake support
- **Production-ready**: Includes quality monitoring and analytics

### Usage Workflow
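1. Upload PDF to DBFS: `dbutils.fs.cp("file:/local/path.pdf", "dbfs:/mnt/data/judgment.pdf")`
2. Run `process_icc_judgment()` with your PDF path (local file APIs such as `fitz.open` read DBFS files under `/dbfs/`, for example `/dbfs/mnt/data/judgment.pdf`)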
3. Query results using Spark SQL and DataFrame operations
4. Export for downstream applications (vector search, RAG, etc.)
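
Steps 1 and 2 use two spellings of the same location: `dbutils.fs` takes a `dbfs:/` URI, while local file APIs such as `fitz.open` need a filesystem path. The helper below is a hypothetical sketch of that conversion, assuming the cluster exposes DBFS through the `/dbfs` mount; check your workspace's access mode before relying on it.

```python
def to_local_path(dbfs_uri: str) -> str:
    """Convert a dbfs:/ URI into the /dbfs/ path used by local file APIs.

    Hypothetical helper; assumes the /dbfs mount is available on the cluster.
    """
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {dbfs_uri!r}")
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


local_pdf_path = to_local_path("dbfs:/mnt/data/judgment.pdf")
```

The resulting path can then be passed as `pdf_path` to `process_icc_judgment()`.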

### Architecture Overview

Based on analysis of the original `/src` directory:

- **Conservative Chunker** (`conservative_chunker.py`): Main chunking logic with footnote detection
- **Text Cleaner** (`effective_chunk_cleaner.py`): ICC-specific noise pattern removal  
- **Data Exporter** (`exporters.py`): Multi-format export capabilities
- **Configuration** (`chunking_config.yaml`): Centralized parameter management

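This system can be customized for different ICC judgment formats or extended to other legal document types. Note that `process_icc_judgment()` as written in this notebook limits paragraph extraction to the first 10 pages for demonstration purposes; remove that limit before processing full judgments.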
