# Document Processing for Croatian RAG System

## Learning Objectives

This notebook explains the document processing pipeline - the first critical step in building a Croatian RAG system:

1. **Why document processing matters for RAG quality**
2. **Text extraction from different file formats**
3. **Croatian language-specific cleaning challenges**
4. **Document chunking strategies and their impact**
5. **Building a complete preprocessing pipeline**

## 1. Why Document Processing Matters

### The Foundation of RAG Quality

Document processing is like preparing ingredients for cooking - poor preparation ruins the final dish. In RAG systems:

- **Garbage In = Garbage Out**: Bad preprocessing leads to poor retrieval
- **Chunk Quality**: How you split documents affects search relevance
- **Language-Specific Issues**: Croatian diacritics and encoding must be preserved
- **Metadata Preservation**: Source information helps with citations

### Croatian Language Challenges

Croatian text processing has unique challenges:
- **Diacritics**: č, ć, š, ž, đ must be preserved correctly
- **Encoding Issues**: Many documents use Windows-1250 or ISO-8859-2
- **Regional Variations**: Different dialects and spellings
- **Mixed Content**: Documents often contain English/German/Italian text
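
To make the encoding issue concrete, here is a small standard-library sketch (independent of the `preprocessing` package). It shows that `š` and `ž` sit at different byte positions in Windows-1250 and ISO-8859-2, so decoding with the wrong legacy codec silently corrupts those letters while `č`, `ć`, and `đ` survive.

```python
# Encode the five lowercase Croatian diacritics with each codec and compare bytes.
text = "čćšžđ"

encoded = {codec: text.encode(codec) for codec in ("utf-8", "cp1250", "iso-8859-2")}
for codec, raw in encoded.items():
    print(f"{codec:>10}: {raw.hex(' ')}")

# Decode Windows-1250 bytes with the wrong legacy codec.
# No exception is raised; the affected letters just become other characters.
wrong = encoded["cp1250"].decode("iso-8859-2")
damaged = [original for original, decoded in zip(text, wrong) if original != decoded]
print("Damaged letters:", damaged)
```

Because the failure is silent, a check for expected diacritics after extraction is usually worth the extra line of code.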

In [None]:
# Import our document processing components
import sys
sys.path.append('../src')

from preprocessing.extractors import DocumentExtractor, TextExtractor, PDFExtractor, DocxExtractor
from preprocessing.cleaners import CroatianTextCleaner, TextCleaningConfig
from preprocessing.chunkers import DocumentChunker, ChunkingConfig, ChunkingStrategy

import tempfile
import os
from pathlib import Path
import logging

# Set up logging to see what's happening
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
print("✅ Document processing components imported successfully!")

## 2. Text Extraction from Different Formats

Our system supports three main document types:

### 📄 Plain Text (.txt)
- Simplest format but encoding can be tricky
- Croatian documents often use Windows-1250 encoding
- Detecting the source encoding and converting to UTF-8 is crucial

### 📕 PDF Documents (.pdf)
- Complex format with fonts, images, layouts
- May contain scanned text (OCR needed)
- Croatian PDFs may have embedded font issues

### 📘 Word Documents (.docx)
- Structured format with styles, headers
- Generally good encoding support
- May contain tables, images, footnotes
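
For the plain-text case, one possible decoding fallback can be sketched as follows. The function name `decode_croatian` and its scoring heuristic are assumptions made for illustration; `DocumentExtractor` may detect encodings in a different way. Strict UTF-8 is tried first, and if that fails, the legacy codec that yields the most Croatian letters wins.

```python
CROATIAN_LETTERS = set("čćšžđČĆŠŽĐ")
LEGACY_CODECS = ("cp1250", "iso-8859-2")  # assumed candidates for Croatian files


def decode_croatian(raw: bytes) -> tuple[str, str]:
    """Return (text, codec_name) for raw bytes. Hypothetical helper."""
    try:
        # Legacy-encoded diacritics are rarely valid UTF-8, so a strict
        # UTF-8 decode that succeeds is a strong signal.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    best_score, best_text, best_codec = -1, "", ""
    for codec in LEGACY_CODECS:
        text = raw.decode(codec, errors="replace")
        # Score each candidate by how many Croatian letters it produces.
        score = sum(ch in CROATIAN_LETTERS for ch in text)
        if score > best_score:
            best_score, best_text, best_codec = score, text, codec
    return best_text, best_codec


sample = "Šuma i žaba"
for codec in ("utf-8", "cp1250", "iso-8859-2"):
    text, detected = decode_croatian(sample.encode(codec))
    print(f"written as {codec:>10} -> detected {detected:>10}: {text}")
```

The heuristic only separates the two legacy codecs when the text contains `š` or `ž`; for text using only `č`, `ć`, and `đ` both codecs give the same result, so the tie is harmless.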

In [None]:
# Let's create some sample Croatian documents for testing
import tempfile
from pathlib import Path

# Create temporary directory for our test documents
temp_dir = Path(tempfile.mkdtemp(prefix="croatian_docs_"))
print(f"📁 Created temporary directory: {temp_dir}")

# Sample Croatian texts with various challenges
sample_texts = {
    "zagreb_info.txt": """
Zagreb - Glavni Grad Hrvatske

Zagreb je glavni i najveći grad Republike Hrvatske. Smješten je u sjeverozapadnom dijelu zemlje, na rijeci Savi. 
Grad ima bogatu povijest koja seže u rimsko doba.

TURISTIČKA MJESTA:
• Gornji grad - povijesni dio s crkvom sv. Marka
• Donji grad - trgovački i poslovni centar
• Maksimir - najveći park u Zagrebu

Stanovništvo: ~800,000 stanovnika (2021.)
Površina: 641.4 km²

Zagreb je također kulturno središte Hrvatske s brojnim muzejima, kazalištima i galerijama.
""",
    
    "dubrovnik_guide.txt": """
Dubrovnik - Biser Jadrana

Dubrovnik je grad na jugu Hrvatske, poznat kao "biser Jadrana". 
Stari grad je upisan na UNESCO-ovu listu svjetske baštine 1979. godine.

ZNAMENITOSTI:
- Gradske zidine (duljine 1940m)
- Stradun - glavna ulica
- Knežev dvor
- Franjevački samostan

Dubrovnik je služio kao lokacija za snimanje serije "Igra prijestolja" (Game of Thrones).

Klima: mediteranska
Broj dana sa suncem: >250 godišnje
""",
    
    "plitvice_nature.txt": """
Plitvička jezera - Nacionalni park

Nacionalni park Plitvička jezera nalazi se u Lici i Kordunu. 
Park je osnovan 1949. godine i jedan je od najstarijih nacionalnih parkova u jugoistočnoj Europi.

KARAKTERISTIKE:
✓ 16 jezera povezanih slapovima
✓ Ukupna površina: 296.85 km²
✓ UNESCO svjetska baština od 1979.

Flora i fauna:
- Šume bukve, jele i smreke
- Smeđi medvjed, vuk, ris
- Više od 150 vrsta ptica

VAŽNO: Park je otvoren tijekom cijele godine, ali zimski period ima ograničen pristup.
"""
}

# Write sample texts to files
sample_files = {}
for filename, content in sample_texts.items():
    file_path = temp_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content.strip())
    sample_files[filename] = file_path
    print(f"📝 Created: {filename} ({len(content.strip())} chars)")

print(f"\n✅ Created {len(sample_files)} sample Croatian documents")

In [None]:
# Let's test text extraction from our sample documents
print("📂 DOCUMENT EXTRACTION TESTING:")
print("=" * 60)

# Create document extractor
extractor = DocumentExtractor()

for filename, file_path in sample_files.items():
    print(f"\n📄 Processing: {filename}")
    print("-" * 40)
    
    try:
        # Extract text and metadata
        result = extractor.extract_text(str(file_path))
        
        print(f"   ✅ Extraction successful")
        print(f"   📊 Text length: {len(result['content'])} characters")
        print(f"   🏷️  File type: {result['metadata']['file_type']}")
        print(f"   📅 File size: {result['metadata']['file_size']} bytes")
        print(f"   🔤 Encoding: {result['metadata']['encoding']}")
        
        # Show first 150 characters
        preview = result['content'][:150].replace('\n', ' ')
        print(f"   📖 Preview: {preview}...")
        
        # Check Croatian diacritics preservation
        croatian_chars = ['č', 'ć', 'š', 'ž', 'đ', 'Č', 'Ć', 'Š', 'Ž', 'Đ']
        found_diacritics = [char for char in croatian_chars if char in result['content']]
        if found_diacritics:
            print(f"   ✓ Croatian diacritics preserved: {', '.join(found_diacritics)}")
        
    except Exception as e:
        print(f"   ❌ Extraction failed: {e}")

print("\n💡 Observations:")
print("   • UTF-8 encoding preserves all Croatian diacritics")
print("   • Metadata extraction provides useful document information")
print("   • Text structure (headers, lists) is preserved")

In [None]:
# Let's create a more complex document with encoding challenges
print("🔤 ENCODING CHALLENGE TESTING:")
print("=" * 60)

# Create a document with mixed encoding (simulate real-world scenario)
problematic_content = """
Problematični tekst s različitim kodiranjima

Ovaj dokument sadrži:
• Hrvatska slova: čćšžđ ČĆŠŽĐ
• Specijalne znakove: razni znakovi
• Brojeve: 123,456.78 €
• Datum: 31.12.2023.

Problemi s kodiranjem:
- Windows-1250: Ã¡ Ä
- ISO-8859-2: ® ¼ ½
- Stari sistem: [NEŸITLJIVO]

Napomena: Ovaj tekst testira robusnost sistema.
"""

# Save with different encodings to test detection
encoding_test_files = {}

encodings_to_test = [
    ('utf-8', 'UTF-8 (standard)'),
    ('windows-1250', 'Windows-1250 (Croatian)'),
    ('iso-8859-2', 'ISO-8859-2 (Latin-2)')
]

for encoding, description in encodings_to_test:
    filename = f"encoding_test_{encoding.replace('-', '_')}.txt"
    file_path = temp_dir / filename
    
    try:
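        # errors='replace' writes '?' for characters the legacy codec lacks (e.g. ¼, Ÿ),
        # so the file is still created and čćšžđ are kept for the detection test
        with open(file_path, 'w', encoding=encoding, errors='replace') as f: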
            f.write(problematic_content)
        encoding_test_files[filename] = (file_path, encoding, description)
        print(f"📝 Created {filename} with {description}")
    except UnicodeEncodeError as e:
        print(f"❌ Cannot create {filename} with {encoding}: {e}")

print("\nTesting automatic encoding detection:")
print("-" * 50)

for filename, (file_path, original_encoding, description) in encoding_test_files.items():
    print(f"\n📄 {filename} (originally {description})")
    
    try:
        result = extractor.extract_text(str(file_path))
        detected_encoding = result['metadata']['encoding']
        
        print(f"   🔍 Detected encoding: {detected_encoding}")
        print(f"   ✅ Original encoding: {original_encoding}")
        
        # Check if Croatian characters are correctly preserved
        if 'čćšžđ' in result['content']:
            print(f"   ✓ Croatian diacritics correctly preserved")
        else:
            print(f"   ❌ Croatian diacritics lost or corrupted")
            
    except Exception as e:
        print(f"   ❌ Error: {e}")

print("\n💡 Encoding is crucial for Croatian text processing!")
print("   • Always use UTF-8 when possible")
print("   • Auto-detection helps with legacy documents")
print("   • Verify diacritic preservation after extraction")

## 3. Croatian Text Cleaning Challenges

### Why Clean Text?

Raw extracted text often contains:
- **Formatting artifacts**: Extra spaces, line breaks
- **Non-printable characters**: Control characters, BOM
- **Inconsistent whitespace**: Tabs, multiple spaces
- **OCR errors**: From scanned documents
- **Mixed languages**: Headers/footers in other languages
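
The whitespace-related items above can be handled with a few regular expressions. This is a minimal sketch with a hypothetical helper name, `normalize_whitespace`; it is not the `CroatianTextCleaner` implementation and does not attempt OCR or language filtering.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse formatting artifacts while keeping paragraph breaks."""
    text = text.replace("\ufeff", "")        # drop a byte order mark if present
    text = re.sub(r"[ \t]+", " ", text)      # tabs and repeated spaces -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


messy = "\ufeff  Zagreb   je \t grad.  \n\n\n\n Površina:  641 km²  "
print(repr(normalize_whitespace(messy)))
```

Diacritics and punctuation pass through untouched, which matches the conservative cleaning this notebook recommends for Croatian.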

### Croatian-Specific Cleaning:

1. **Diacritic Normalization**: Ensure a consistent Unicode representation of diacritics (for example, precomposed NFC forms) without stripping them
2. **Case Handling**: Proper Croatian title case rules
3. **Punctuation**: Handle Croatian-specific quotation marks
4. **Number Formats**: Croatian uses comma for decimals (123,45)
5. **Date Formats**: DD.MM.YYYY. format common in Croatia
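
Three of these points can be illustrated with the standard library alone. The helper names below are hypothetical and the sketch assumes strictly Croatian conventions (`.` for thousands, `,` for decimals, trailing dot after the year); mixed-convention input such as `641.4` would need extra handling.

```python
import unicodedata
from datetime import date, datetime


def to_nfc(text: str) -> str:
    """Compose base letter + combining mark into one code point (č, š, ž, ć)."""
    return unicodedata.normalize("NFC", text)


def parse_croatian_decimal(value: str) -> float:
    """Parse '1.234,56' style numbers; assumes Croatian separators."""
    return float(value.replace(".", "").replace(",", "."))


def parse_croatian_date(value: str) -> date:
    """Parse 'DD.MM.YYYY.' including the trailing dot."""
    return datetime.strptime(value.strip(), "%d.%m.%Y.").date()


decomposed = "c\u030cas"  # 'c' followed by a combining caron, as some PDFs emit
print(len(decomposed), len(to_nfc(decomposed)))
print(parse_croatian_decimal("123,45"))
print(parse_croatian_date("31.12.2023."))
```

NFC composition keeps the diacritic and only changes how it is stored, so it is safe to apply even when diacritic stripping is disabled.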

In [None]:
# Let's test Croatian text cleaning capabilities
print("🧹 CROATIAN TEXT CLEANING:")
print("=" * 60)

# Create text cleaner with Croatian-specific configuration
cleaning_config = TextCleaningConfig(
    normalize_whitespace=True,
    remove_extra_newlines=True,
    normalize_diacritics=False,  # Keep Croatian diacritics!
    lowercase=False,  # Preserve proper names
    remove_punctuation=False,
    min_word_length=2,
    language='hr'  # Croatian-specific rules
)

cleaner = CroatianTextCleaner(cleaning_config)

print(f"⚙️ Text cleaner configured for Croatian:")
print(f"   • Language: {cleaning_config.language}")
print(f"   • Preserve diacritics: {not cleaning_config.normalize_diacritics}")
print(f"   • Preserve case: {not cleaning_config.lowercase}")
print(f"   • Min word length: {cleaning_config.min_word_length}")

# Test with messy Croatian text
messy_text = """

   ZAGREB    -   GLAVNI GRAD    


Zagreb  je  glavni   grad  Republike   Hrvatske.   
    Nalazi    se   u   sjeverozapadnom    dijelu    zemlje.  


Stanovništvo:     ~800,000     stanovnika     (2021.)    

Površina:   641.4    km²   


VAŽNE   ČINJENICE:  
•    Osnovan    je    u    11.    stoljeću  
•  Glavni    grad   od    1991.  godine  
•    Sveučilište    osnovano    1669.    


"""

print("\n📝 Original messy text:")
print(f"   Length: {len(messy_text)} characters")
print(f"   Lines: {messy_text.count(chr(10))} newlines")
print(f"   Preview: {repr(messy_text[:100])}...")

# Clean the text
try:
    cleaned_result = cleaner.clean_text(messy_text)
    cleaned_text = cleaned_result['text']
    
    print("\n✨ Cleaned text:")
    print(f"   Length: {len(cleaned_text)} characters (reduced by {len(messy_text) - len(cleaned_text)})")
    print(f"   Lines: {cleaned_text.count(chr(10))} newlines")
    print(f"   Preview: {cleaned_text[:200]}...")
    
    # Show cleaning statistics
    stats = cleaned_result['metadata']['cleaning_stats']
    print("\n📊 Cleaning Statistics:")
    for operation, count in stats.items():
        if count > 0:
            print(f"   • {operation}: {count}")
            
    # Verify Croatian diacritics are preserved
    croatian_chars = ['č', 'ć', 'š', 'ž', 'đ']
    found_chars = [char for char in croatian_chars if char in cleaned_text]
    if found_chars:
        print(f"\n✓ Croatian diacritics preserved: {', '.join(found_chars)}")
    
except Exception as e:
    print(f"❌ Cleaning failed: {e}")

In [None]:
# Test different cleaning configurations
print("🔧 CLEANING CONFIGURATION COMPARISON:")
print("=" * 60)

test_text = "Dubrovnik je POZNAT kao „biser Jadrana”. Temperatura: 25,5°C (77.9°F). Datum: 15.08.2023."

print(f"📝 Test text: {test_text}")
print()

# Different cleaning strategies
configs = [
    ("Conservative", {
        'normalize_diacritics': False,
        'lowercase': False,
        'remove_punctuation': False,
        'normalize_quotes': True
    }),
    ("Aggressive", {
        'normalize_diacritics': True,
        'lowercase': True,
        'remove_punctuation': True,
        'normalize_quotes': True
    }),
    ("Balanced", {
        'normalize_diacritics': False,
        'lowercase': True,
        'remove_punctuation': False,
        'normalize_quotes': True
    })
]

for strategy_name, config_updates in configs:
    print(f"🛠️ {strategy_name} Cleaning:")
    print("-" * 30)
    
    # Create config with updates
    config = TextCleaningConfig(
        language='hr',
        **config_updates
    )
    
    cleaner_test = CroatianTextCleaner(config)
    
    try:
        result = cleaner_test.clean_text(test_text)
        cleaned = result['text']
        
        print(f"   Input:  {test_text}")
        print(f"   Output: {cleaned}")
        
        # Analysis
        changes = []
        if config.normalize_diacritics and any(c in test_text for c in 'čćšžđ'):
            changes.append("diacritics normalized")
        if config.lowercase and any(c.isupper() for c in test_text):
            changes.append("lowercased")
        if config.remove_punctuation:
            changes.append("punctuation removed")
        if config.normalize_quotes and any(q in test_text for q in '„“”'):
            changes.append("quotes normalized")
            
        if changes:
            print(f"   Changes: {', '.join(changes)}")
        else:
            print(f"   Changes: none")
            
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    print()

print("💡 Cleaning Strategy Recommendations:")
print("   • Conservative: Best for formal documents, citations")
print("   • Aggressive: Good for search indexing, analysis")
print("   • Balanced: Good compromise for most RAG applications")

## 4. Document Chunking Strategies

### Why Chunk Documents?

Large documents must be split into smaller pieces because:
- **Embedding limits**: Embedding models accept a limited input length (typically 512-1024 tokens, depending on the model)
- **Search precision**: Smaller chunks = more focused results
- **Context relevance**: Large chunks may contain irrelevant information
- **Processing efficiency**: Smaller pieces are faster to process

### Chunking Strategies:

1. **Sentence-based**: Split on sentence boundaries (good for Croatian)
2. **Fixed-size**: Split by character/token count
3. **Paragraph-based**: Split on paragraph breaks
4. **Semantic**: Split based on topic/meaning
5. **Hybrid**: Combine multiple approaches
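
As a baseline for comparison, fixed-size chunking with overlap can be sketched in a few lines. The function `chunk_fixed_size` is a hypothetical illustration; its parameter names mirror `ChunkingConfig`, but this is not the `DocumentChunker` implementation.

```python
def chunk_fixed_size(text: str, max_chunk_size: int, overlap_size: int) -> list[str]:
    """Slide a character window over the text, re-reading overlap_size chars each step."""
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than max_chunk_size")

    step = max_chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_fixed_size("abcdefghij", max_chunk_size=4, overlap_size=1))
```

The window ignores word and sentence boundaries, which is the main reason the sentence-based and hybrid strategies exist.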

### Croatian Considerations:

- **Sentence detection**: Croatian punctuation patterns
- **Long sentences**: Croatian can have very long complex sentences
- **Paragraph structure**: Formal vs informal text differences
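
Croatian sentence detection has two common traps: ordinal numbers written with a dot (`11. stoljeću`) and abbreviations such as `sv.`. The sketch below is one way to handle both and then pack sentences into size-limited chunks. The abbreviation list is a small illustrative assumption, and the function names are hypothetical rather than part of `DocumentChunker`.

```python
import re

# Illustrative, incomplete list of abbreviations that end with a dot.
ABBREVIATIONS = {"sv.", "dr.", "prof.", "npr.", "tj.", "itd."}

# Split after . ! ? only when the next word starts with an uppercase letter,
# so "11. stoljeću" stays in one piece.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZČĆŠŽĐ])")


def split_sentences(text: str) -> list[str]:
    sentences: list[str] = []
    for part in _BOUNDARY.split(text.strip()):
        # Re-join when the previous piece ended with a known abbreviation.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences


def pack_sentences(sentences: list[str], max_chunk_size: int) -> list[str]:
    """Greedily fill chunks with whole sentences up to max_chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Crkva sv. Marka je u Gornjem gradu. Osnovan je u 11. stoljeću. Park je velik!"
sentences = split_sentences(text)
print(sentences)
print(pack_sentences(sentences, max_chunk_size=50))
```

A single sentence longer than `max_chunk_size` is kept whole here; a production chunker would need a fallback split for the very long sentences mentioned above.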

In [None]:
# Let's test different chunking strategies on Croatian text
print("📝 DOCUMENT CHUNKING STRATEGIES:")
print("=" * 60)

# Use one of our cleaned documents as test content
test_document = """
Zagreb - Glavni Grad Hrvatske

Zagreb je glavni i najveći grad Republike Hrvatske. Smješten je u sjeverozapadnom dijelu zemlje, na rijeci Savi. Grad ima bogatu povijest koja seže u rimsko doba.

Povijesni razvoj grada možemo pratiti kroz nekoliko važnih razdoblja. U rimsko doba na ovom je prostoru postojao grad Andautonia. Srednjovjekovni Zagreb nastao je spajanjem dvaju gradova: Kaptola i Gradeca.

Turistička mjesta u Zagrebu uključuju Gornji grad - povijesni dio s crkvom sv. Marka, Donji grad - trgovački i poslovni centar, te Maksimir - najveći park u Zagrebu. Grad je također poznato kulturno središte s brojnim muzejima, kazalištima i galerijama.

Zagreb je dom mnogih važnih institucija. Tu se nalaze sveučilište, akademija znanosti, nacionalna knjižnica i mnoge druge ustanove. Grad je također važno gospodarsko središte regije.

Stanovništvo Zagreba broji oko 800.000 stanovnika, što ga čini najvećim gradom u Hrvatskoj. Površina grada iznosi 641.4 km². Klima je umjereno kontinentalna s toplim ljetima i hladnim zimama.
""".strip()

print(f"📄 Test document:")
print(f"   Length: {len(test_document)} characters")
print(f"   Paragraphs: {test_document.count(chr(10) + chr(10)) + 1}")
print(f"   Sentences: ~{test_document.count('.')}")
print()

# Test different chunking strategies
chunking_strategies = [
    ("Sentence-based", ChunkingStrategy.SENTENCE, {"max_chunk_size": 200, "overlap_size": 20}),
    ("Fixed-size", ChunkingStrategy.FIXED_SIZE, {"max_chunk_size": 150, "overlap_size": 30}),
    ("Paragraph-based", ChunkingStrategy.PARAGRAPH, {"max_chunk_size": 400, "overlap_size": 50}),
    ("Hybrid", ChunkingStrategy.HYBRID, {"max_chunk_size": 250, "overlap_size": 40})
]

for strategy_name, strategy, params in chunking_strategies:
    print(f"📑 {strategy_name} Chunking:")
    print("-" * 40)
    
    try:
        # Configure chunker
        config = ChunkingConfig(
            strategy=strategy,
            max_chunk_size=params["max_chunk_size"],
            overlap_size=params["overlap_size"],
            language='hr'
        )
        
        chunker = DocumentChunker(config)
        
        # Create chunks
        chunks = chunker.chunk_text(
            text=test_document,
            metadata={
                "source": "zagreb_info.txt",
                "title": "Zagreb - Glavni Grad Hrvatske",
                "language": "hr"
            }
        )
        
        print(f"   ✅ Created {len(chunks)} chunks")
        print(f"   📊 Config: max_size={params['max_chunk_size']}, overlap={params['overlap_size']}")
        
        # Analyze chunk sizes
        chunk_sizes = [len(chunk['content']) for chunk in chunks]
        avg_size = sum(chunk_sizes) / len(chunk_sizes) if chunk_sizes else 0
        print(f"   📏 Sizes: avg={avg_size:.1f}, min={min(chunk_sizes)}, max={max(chunk_sizes)}")
        
        # Show first few chunks
        for i, chunk in enumerate(chunks[:2]):
            preview = chunk['content'][:80].replace('\n', ' ')
            print(f"   📝 Chunk {i+1}: {preview}...")
            
        if len(chunks) > 2:
            print(f"   ⋯ ({len(chunks) - 2} more chunks)")
            
    except Exception as e:
        print(f"   ❌ Chunking failed: {e}")
    
    print()

In [None]:
# Analyze chunking quality for Croatian text
print("🔍 CHUNKING QUALITY ANALYSIS:")
print("=" * 60)

# Test with sentence-based chunking (good for Croatian)
config = ChunkingConfig(
    strategy=ChunkingStrategy.SENTENCE,
    max_chunk_size=300,
    overlap_size=50,
    language='hr'
)

chunker = DocumentChunker(config)

try:
    chunks = chunker.chunk_text(
        text=test_document,
        metadata={
            "source": "zagreb_analysis.txt",
            "title": "Zagreb Analysis"
        }
    )
    
    print(f"📊 Quality Analysis for {len(chunks)} chunks:")
    print()
    
    for i, chunk in enumerate(chunks):
        content = chunk['content']
        metadata = chunk['metadata']
        
        # Analyze chunk quality
        sentences = content.count('. ') + content.count('!') + content.count('?')
        words = len(content.split())
        has_header = any(line.isupper() or line.startswith('#') for line in content.split('\n'))
        
        print(f"📄 Chunk {i+1} (ID: {metadata['chunk_id'][:8]}...):")
        print(f"   📏 Length: {len(content)} chars, {words} words, ~{sentences} sentences")
        print(f"   🔢 Index: {metadata['chunk_index']}/{metadata['total_chunks']}")
        print(f"   🏷️  Header: {'Yes' if has_header else 'No'}")
        
        # Show content with line breaks preserved
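        all_lines = content.split('\n')
        for line in all_lines[:3]:  # First 3 lines
            if line.strip():
                preview = line[:60] + "..." if len(line) > 60 else line
                print(f"   📝 {preview}")
        
        # Count outside the f-string: a backslash inside an f-string expression
        # is a SyntaxError before Python 3.12
        remaining_lines = len(all_lines) - 3
        if remaining_lines > 0:
            print(f"   ⋯ ({remaining_lines} more lines)")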
        
        print()
        
    # Calculate overlap analysis
    if len(chunks) > 1:
        print("🔗 Overlap Analysis:")
        for i in range(len(chunks) - 1):
            current_end = chunks[i]['content'][-50:]
            next_start = chunks[i+1]['content'][:50]
            
            # Simple overlap detection
            overlap_words = set(current_end.split()) & set(next_start.split())
            print(f"   Chunk {i+1}→{i+2}: {len(overlap_words)} overlapping words")
            
except Exception as e:
    print(f"❌ Analysis failed: {e}")

print("\n💡 Good chunks should:")
print("   • Be complete thoughts (end at sentence boundaries)")
print("   • Have reasonable size (100-500 characters for Croatian)")
print("   • Preserve context through overlap")
print("   • Maintain structural information (headers, lists)")

In [None]:
# Test chunking with different types of Croatian content
print("📚 CHUNKING DIFFERENT CONTENT TYPES:")
print("=" * 60)

# Different Croatian content types
content_types = {
    "News Article": """
ZAGREB, 15. srpnja 2023. - Hrvatski premijer najavio je nova ulaganja u infrastrukturu. 

Prema riječima premijera, projekti će uključivati modernizaciju cesta, željeznica i digitalne infrastrukture. "Ovo su strateška ulaganja za budućnost Hrvatske", rekao je premijer na tiskovnoj konferenciji.

Ukupna vrijednost projekata procjenjuje se na 2,5 milijardi eura. Financiranje će biti osigurano iz EU fondova i državnog proračuna.
""",
    
    "Academic Text": """
Sintaksa hrvatskog jezika odlikuje se složenošću koja proizlazi iz bogate morfologije. Fleksijska priroda jezika omogućuje relativno slobodan red riječi, što se posebno očituje u poetskom diskursu.

Slavenske jezike, uključujući hrvatski, karakterizira razvijen sistem glagolskog aspekta. Perfektivnost i imperfektivnost glagola fundamentalne su kategorije koje utječu na temporalnu strukturu iskaza.
""",
    
    "Recipe": """
SARMA - tradicionalno zimsko jelo

Potrebno:
• 1 kg miješanog mesa
• 1 glavica kiselog kupusa
• 200g riže
• 2 luka
• Sol, papar, vegeta

Priprema:
1. Prokuhajte rižu na pola
2. Pomiješajte meso s rižom i začinima
3. Zamotajte u kupusove listove
4. Kuhajte 2 sata na laganoj vatri
"""
}

for content_name, content in content_types.items():
    print(f"📖 {content_name}:")
    print("-" * 30)
    
    # Choose appropriate chunking strategy based on content type
    if "News" in content_name:
        strategy = ChunkingStrategy.PARAGRAPH
        max_size = 200
    elif "Academic" in content_name:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 150  # Shorter for complex sentences
    elif "Recipe" in content_name:
        strategy = ChunkingStrategy.HYBRID  # Preserve lists
        max_size = 100
    else:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 200
    
    config = ChunkingConfig(
        strategy=strategy,
        max_chunk_size=max_size,
        overlap_size=20,
        language='hr'
    )
    
    chunker = DocumentChunker(config)
    
    try:
        chunks = chunker.chunk_text(
            text=content.strip(),
            metadata={"content_type": content_name.lower(), "language": "hr"}
        )
        
        print(f"   📊 Strategy: {strategy.value}, Max size: {max_size}")
        print(f"   📑 Chunks created: {len(chunks)}")
        
        for i, chunk in enumerate(chunks):
            preview = chunk['content'][:60].replace('\n', ' ')
            print(f"   {i+1}. {preview}... ({len(chunk['content'])} chars)")
        
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    print()

print("🎯 Content-specific chunking improves RAG quality:")
print("   • News: Paragraph-based preserves story flow")
print("   • Academic: Sentence-based handles complexity")
print("   • Recipes: Hybrid preserves structured lists")

## 5. Complete Document Processing Pipeline

### Putting It All Together

A complete document processing pipeline combines:
1. **Document detection and loading**
2. **Format-specific text extraction**
3. **Croatian-aware text cleaning**
4. **Intelligent document chunking**
5. **Metadata preservation and enrichment**

### Pipeline Benefits:
- **Consistency**: Same processing for all documents
- **Quality**: Each step improves text quality
- **Scalability**: Can process hundreds of documents
- **Traceability**: Track processing steps and errors
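
Traceability mostly comes down to attaching stable metadata to every chunk. The sketch below shows one way to do that, assuming the metadata keys `chunk_id`, `chunk_index`, and `total_chunks` that the chunks in this notebook expose. The helper `attach_chunk_metadata` and the hash-based ID scheme are hypothetical; `DocumentChunker` may generate IDs differently.

```python
import hashlib


def attach_chunk_metadata(chunks: list[str], source: str) -> list[dict]:
    """Wrap raw chunk strings in records that can be traced back to their source."""
    total = len(chunks)
    records = []
    for index, content in enumerate(chunks):
        # Deterministic ID: the same source, position, and text give the same hash,
        # so re-running the pipeline does not create duplicate entries.
        digest = hashlib.sha1(f"{source}:{index}:{content}".encode("utf-8")).hexdigest()
        records.append({
            "content": content,
            "metadata": {
                "source": source,
                "chunk_id": digest,
                "chunk_index": index,
                "total_chunks": total,
            },
        })
    return records


records = attach_chunk_metadata(
    ["Zagreb je glavni grad.", "Smješten je na rijeci Savi."],
    source="zagreb_info.txt",
)
for record in records:
    meta = record["metadata"]
    print(meta["chunk_id"][:8], f"{meta['chunk_index'] + 1}/{meta['total_chunks']}", record["content"])
```

Deterministic IDs also make it easy to detect which chunks changed when a source document is updated.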

In [None]:
# Let's create a complete document processing pipeline
print("🔄 COMPLETE DOCUMENT PROCESSING PIPELINE:")
print("=" * 60)

def process_document_pipeline(file_path: str, chunk_strategy: str = "sentence") -> list:
    """
    Complete pipeline to process a Croatian document from file to chunks.
    """
    results = []
    
    print(f"📂 Processing: {Path(file_path).name}")
    
    try:
        # Step 1: Extract text
        print("   1️⃣ Extracting text...")
        extractor = DocumentExtractor()
        extraction_result = extractor.extract_text(file_path)
        
        raw_text = extraction_result['content']
        file_metadata = extraction_result['metadata']
        
        print(f"      ✅ Extracted {len(raw_text)} characters")
        print(f"      📄 File type: {file_metadata['file_type']}")
        print(f"      🔤 Encoding: {file_metadata['encoding']}")
        
        # Step 2: Clean text
        print("   2️⃣ Cleaning Croatian text...")
        cleaning_config = TextCleaningConfig(
            language='hr',
            normalize_whitespace=True,
            remove_extra_newlines=True,
            normalize_diacritics=False,  # Preserve Croatian characters
            normalize_quotes=True
        )
        
        cleaner = CroatianTextCleaner(cleaning_config)
        cleaning_result = cleaner.clean_text(raw_text)
        
        cleaned_text = cleaning_result['text']
        cleaning_stats = cleaning_result['metadata']['cleaning_stats']
        
        print(f"      ✅ Cleaned to {len(cleaned_text)} characters")
        print(f"      🧹 Cleaning operations: {sum(cleaning_stats.values())}")
        
        # Step 3: Create chunks
        print(f"   3️⃣ Creating chunks ({chunk_strategy} strategy)...")
        
        strategy_map = {
            "sentence": ChunkingStrategy.SENTENCE,
            "paragraph": ChunkingStrategy.PARAGRAPH,
            "hybrid": ChunkingStrategy.HYBRID
        }
        
        chunking_config = ChunkingConfig(
            strategy=strategy_map.get(chunk_strategy, ChunkingStrategy.SENTENCE),
            max_chunk_size=300,
            overlap_size=50,
            language='hr'
        )
        
        chunker = DocumentChunker(chunking_config)
        
        # Combine metadata
        combined_metadata = {
            **file_metadata,
            'cleaning_stats': cleaning_stats,
            'chunking_strategy': chunk_strategy
        }
        
        chunks = chunker.chunk_text(cleaned_text, combined_metadata)
        
        print(f"      ✅ Created {len(chunks)} chunks")
        
        # Step 4: Quality check
        print("   4️⃣ Quality check...")
        
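        chunk_sizes = [len(chunk['content']) for chunk in chunks]
        if not chunk_sizes:
            # Fail with a clear message instead of a ZeroDivisionError below
            raise ValueError("chunker returned no chunks")
        avg_size = sum(chunk_sizes) / len(chunk_sizes)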
        
        quality_issues = []
        if avg_size < 50:
            quality_issues.append("chunks too small")
        if max(chunk_sizes) > 500:
            quality_issues.append("some chunks too large")
        if not any('č' in chunk['content'] or 'ć' in chunk['content'] or 
                  'š' in chunk['content'] or 'ž' in chunk['content'] or 
                  'đ' in chunk['content'] for chunk in chunks):
            if any(c in raw_text for c in 'čćšžđ'):
                quality_issues.append("Croatian diacritics lost")
        
        if quality_issues:
            print(f"      ⚠️ Quality issues: {', '.join(quality_issues)}")
        else:
            print(f"      ✅ Quality check passed")
        
        print(f"      📊 Avg chunk size: {avg_size:.1f} chars")
        
        return chunks
        
    except Exception as e:
        print(f"      ❌ Pipeline failed: {e}")
        return []

# Test the complete pipeline
print("🚀 Testing complete pipeline on our sample documents:")
print()

all_processed_chunks = []

files_to_process = list(sample_files.items())[:2]  # Process first 2 files

for filename, file_path in files_to_process:
    chunks = process_document_pipeline(str(file_path), "sentence")
    all_processed_chunks.extend(chunks)
    print()

print(f"📈 Pipeline Summary:")
# Report the number of files actually processed, not all sample files
print(f"   📁 Documents processed: {len(files_to_process)}")
print(f"   📄 Total chunks created: {len(all_processed_chunks)}")
print(f"   💾 Ready for vector database storage")

In [None]:
# Analyze the final processed chunks
if all_processed_chunks:
    print("🔍 FINAL CHUNK ANALYSIS:")
    print("=" * 60)
    
    print("📊 Chunk Statistics:")
    chunk_lengths = [len(chunk['content']) for chunk in all_processed_chunks]
    print(f"   • Total chunks: {len(all_processed_chunks)}")
    print(f"   • Average length: {sum(chunk_lengths) / len(chunk_lengths):.1f} characters")
    print(f"   • Min length: {min(chunk_lengths)} characters")
    print(f"   • Max length: {max(chunk_lengths)} characters")
    
    # Sample some chunks
    print("\n📝 Sample Processed Chunks:")
    for i, chunk in enumerate(all_processed_chunks[:3]):
        print(f"\nChunk {i+1}:")
        print(f"   📏 Length: {len(chunk['content'])} chars")
        print(f"   🏷️  Source: {chunk['metadata'].get('source', 'unknown')}")
        print(f"   🔢 Index: {chunk['metadata']['chunk_index']}/{chunk['metadata']['total_chunks']}")
        
        # Show content
        content_lines = chunk['content'].split('\n')[:2]
        for line in content_lines:
            if line.strip():
                print(f"   📄 {line[:70]}..." if len(line) > 70 else f"   📄 {line}")
        
        # Check Croatian preservation
        croatian_chars = [c for c in 'čćšžđČĆŠŽĐ' if c in chunk['content']]
        if croatian_chars:
            print(f"   ✓ Croatian chars: {', '.join(croatian_chars)}")
    
    print("\n🎯 These chunks are now ready for:")
    print("   1. Embedding generation (vector database)")
    print("   2. Metadata indexing")
    print("   3. Similarity search")
    print("   4. RAG retrieval")

else:
    print("⚠️  No processed chunks available for analysis")

## 6. Summary and Best Practices

### Key Takeaways:

🎯 **Document Processing is Critical**:
- Quality processing = better RAG results
- Each step (extract→clean→chunk) adds value
- Bad preprocessing ruins the entire pipeline

🇭🇷 **Croatian Language Considerations**:
- Always preserve diacritics (č, ć, š, ž, đ)
- Handle various encodings (UTF-8, Windows-1250)
- Use Croatian-aware sentence splitting
- Consider regional variations and dialects

📝 **Chunking Strategy Matters**:
- Sentence-based: Best for most Croatian content
- Paragraph-based: Good for structured documents
- Hybrid: Best for complex formats
- Always include overlap for context preservation

### Best Practices:

✅ **Text Extraction**:
- Auto-detect encoding for legacy documents
- Preserve file metadata for traceability
- Handle extraction errors gracefully

✅ **Text Cleaning**:
- Use conservative cleaning for Croatian
- Never normalize Croatian diacritics
- Preserve proper names and technical terms
- Normalize whitespace and quotes

✅ **Document Chunking**:
- Aim for 100-500 characters per chunk
- Use 20-50 character overlap
- Respect sentence boundaries
- Include chunk metadata for debugging

### Next Steps:

1. ✅ **Document Processing** - Just completed!
2. ✅ **Vector Database** - Already done
3. ⏳ **Retrieval System** - Next step
4. ⏳ **Generation** - Local LLM integration
5. ⏳ **Complete Pipeline** - End-to-end integration

The document processing pipeline creates clean, well-structured chunks that are ready for embedding and storage in our vector database!

In [None]:
# Clean up our temporary files
import shutil

try:
    if temp_dir.exists():
        shutil.rmtree(temp_dir)
        print(f"🧹 Cleaned up temporary directory: {temp_dir}")
except Exception as e:
    print(f"⚠️  Warning: Could not clean up temp directory: {e}")

print("\n🎉 Document Processing Learning Complete!")
print("\n📚 What we learned:")
print("   • Text extraction from multiple formats")
print("   • Croatian-specific cleaning challenges")
print("   • Chunking strategies and their impact")
print("   • Complete processing pipeline design")
print("\n➡️  Ready for the next step: the Retrieval System!")