# Multilingual Document Processing for RAG System

## Learning Objectives

This notebook explains the multilingual document processing pipeline - the first critical step in building a scalable multilingual RAG system:

1. **Why language-aware document processing matters for RAG quality**
2. **Text extraction from different file formats across languages**
3. **Language-specific cleaning challenges (Croatian, English, extensible)**
4. **Language-aware document chunking strategies and their impact**
5. **Building a complete multilingual preprocessing pipeline**
6. **Automatic language detection and folder organization**

## 1. Why Multilingual Document Processing Matters

### The Foundation of Cross-Language RAG Quality

Document processing is like preparing ingredients for international cuisine - language-specific preparation ensures authentic results. In multilingual RAG systems:

- **Language Preservation**: Maintain language-specific features (diacritics, scripts, etc.)
- **Cross-Language Consistency**: Unified processing while respecting language differences
- **Chunk Quality**: Language-aware splitting for better semantic coherence
- **Metadata Enhancement**: Language tags and source information for better routing
- **Encoding Robustness**: Handle various character encodings across languages

### Language-Specific Processing Challenges

#### 🇭🇷 Croatian Language Challenges
- **Diacritics**: č, ć, š, ž, đ must be preserved correctly
- **Encoding Issues**: Many documents use Windows-1250 or ISO-8859-2
- **Morphological Complexity**: Rich inflectional system
- **Regional Variations**: Different dialects and spellings

#### 🇬🇧 English Language Challenges
- **Business Documents**: Financial reports, legal contracts, technical specs
- **Encoding Variations**: UTF-8, cp1252, iso-8859-1
- **Technical Terminology**: Industry-specific vocabulary preservation
- **Document Structure**: Complex layouts in business documents

#### 🌐 Cross-Language Challenges
- **Language Detection**: Automatic identification for proper routing
- **Mixed Documents**: Handling multilingual content within single documents
- **Unified Metadata**: Consistent tagging across language boundaries

In [None]:
# Import our multilingual document processing components
import sys
sys.path.append('..')  # project root, so the `src` package imported below can be found

from src.preprocessing.extractors import DocumentExtractor
from src.preprocessing.cleaners import MultilingualTextCleaner
from src.preprocessing.chunkers import DocumentChunker, chunk_document

import tempfile
import os
from pathlib import Path
import logging

# Set up logging to see what's happening
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
print("✅ Multilingual document processing components imported successfully!")
print("🌍 Supporting Croatian, English, and extensible language framework")
print("📁 Language-aware processing with automatic detection capabilities")

## 2. Multilingual Document Organization

### 📁 Language-Based Folder Structure

Our system uses a language-aware folder organization for scalable document management:

```
data/
├── raw/
│   ├── hr/                    # Croatian documents
│   │   ├── NN-2025-115-1666.pdf
│   │   └── legal_document.docx
│   ├── en/                    # English documents
│   │   ├── financial_report.pdf
│   │   └── business_plan.docx
│   └── multilingual/          # Mixed-language documents
├── processed/
│   ├── hr/                    # Croatian processed chunks
│   ├── en/                    # English processed chunks
│   └── shared/                # Cross-language resources
└── test/
    ├── hr/sample_croatian.txt
    ├── en/sample_english.txt
    └── multilingual/mixed_content.txt
```

### 🔄 Language Detection and Routing

1. **Automatic Detection**: Identify document language during ingestion
2. **Folder Routing**: Move documents to appropriate language folders
3. **Language-Specific Processing**: Apply language-aware cleaning and chunking
4. **Unified Storage**: Store in multilingual vector database with language metadata
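
The first two routing steps can be sketched with the standard library alone. This is a minimal illustration, not the project's detector: `guess_language`, `route_document`, the stop-word sets, and the 0.5 mixing threshold are all assumptions made for this example.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Tiny, hypothetical stop-word lists; a real detector would use far more signal.
CROATIAN_WORDS = {"je", "su", "za", "na", "se", "koji", "što", "kako", "ili"}
ENGLISH_WORDS = {"the", "and", "of", "to", "for", "with", "is", "are", "this"}

def guess_language(text: str) -> str:
    """Rough heuristic: stop-word counts plus Croatian diacritics."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    hr_score = sum(w in CROATIAN_WORDS for w in words) + len(re.findall(r"[čćšžđ]", lowered))
    en_score = sum(w in ENGLISH_WORDS for w in words)
    if hr_score == 0 and en_score == 0:
        return "multilingual"  # no evidence either way; park it for review
    # If both languages score comparably, treat the document as mixed.
    if min(hr_score, en_score) / max(hr_score, en_score) > 0.5:
        return "multilingual"
    return "hr" if hr_score > en_score else "en"

def route_document(path: Path, raw_root: Path) -> Path:
    """Move a UTF-8 text file into raw_root/<language>/ and return its new path."""
    language = guess_language(path.read_text(encoding="utf-8"))
    target_dir = raw_root / language
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(path), str(target_dir / path.name)))

# Demo in a throwaway directory
workdir = Path(tempfile.mkdtemp(prefix="routing_demo_"))
inbox = workdir / "inbox"
inbox.mkdir()
(inbox / "a.txt").write_text("Zagreb je glavni grad Hrvatske i sjedište Vlade.", encoding="utf-8")
(inbox / "b.txt").write_text("This is the quarterly report for the board.", encoding="utf-8")

routed = {p.name: route_document(p, workdir / "raw") for p in sorted(inbox.glob("*.txt"))}
for name, new_path in routed.items():
    print(name, "->", new_path.parent.name)
```

Short or heavily mixed texts will defeat a heuristic like this, which is why ambiguous documents fall through to `multilingual/` instead of being forced into one language folder.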

## 3. Text Extraction from Different Formats

Our system supports three main document types across all languages:

### 📄 Plain Text (.txt)
- Simplest format but encoding can be tricky
- **Croatian**: Often uses Windows-1250 encoding; diacritics must survive decoding
- **English**: Usually UTF-8, but may encounter cp1252, iso-8859-1
- **Detection**: Detecting the source encoding and converting to UTF-8 is crucial
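
One common way to handle unknown encodings is to try strict UTF-8 first and fall back to the legacy code pages named above. The sketch below assumes that approach; `decode_with_fallback` and `CANDIDATE_ENCODINGS` are hypothetical names, not part of `DocumentExtractor`.

```python
from pathlib import Path

# Tried in order. UTF-8 goes first because it is strict: legacy single-byte
# data containing Croatian diacritics almost never decodes as valid UTF-8.
# "utf-8-sig" also strips a leading BOM if one is present.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1250", "iso-8859-2")

def decode_with_fallback(raw: bytes, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

def read_text_file(path):
    """Read a file as bytes and decode it with the fallback chain."""
    return decode_with_fallback(Path(path).read_bytes())

sample = "Sveučilište u Zagrebu, čćšžđ"
text_utf8, enc_utf8 = decode_with_fallback(sample.encode("utf-8"))
text_1250, enc_1250 = decode_with_fallback(sample.encode("cp1250"))
print(enc_utf8, enc_1250)
```

A fallback chain cannot tell single-byte encodings apart: nearly every byte sequence is valid in both cp1250 and ISO-8859-2, so whichever comes first wins. When that distinction matters, a statistical detector or a check for the expected diacritics after decoding is needed.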

### 📕 PDF Documents (.pdf)
- Complex format with fonts, images, layouts
- May contain scanned text (OCR needed)
- **Croatian**: Embedded font issues with diacritics
- **English**: Business documents with complex layouts
- **Multilingual**: Mixed-language content within single documents

### 📘 Word Documents (.docx)
- Structured format with styles, headers
- Generally good encoding support across languages
- May contain tables, images, footnotes
- **Language Detection**: Can identify primary language from content

In [None]:
# Let's create sample multilingual documents for testing
import tempfile
from pathlib import Path

# Create temporary directory for our test documents
temp_dir = Path(tempfile.mkdtemp(prefix="multilingual_docs_"))
print(f"📁 Created temporary directory: {temp_dir}")

# Create language-specific subdirectories
hr_dir = temp_dir / "hr"
en_dir = temp_dir / "en"
multilingual_dir = temp_dir / "multilingual"

for lang_dir in [hr_dir, en_dir, multilingual_dir]:
    lang_dir.mkdir(exist_ok=True)

print(f"📁 Created language directories: hr/, en/, multilingual/")

# Sample Croatian texts with various challenges
croatian_texts = {
    "zagreb_info.txt": """
Zagreb - Glavni Grad Hrvatske

Zagreb je glavni i najveći grad Republike Hrvatske. Smješten je u sjeverozapadnom dijelu zemlje, na rijeci Savi.
Grad ima bogatu povijest koja seže u rimsko doba.

TURISTIČKA MJESTA:
• Gornji grad - povijesni dio s crkvom sv. Marka
• Donji grad - trgovački i poslovni centar
• Maksimir - najveći park u Zagrebu

Stanovništvo: ~800,000 stanovnika (2021.)
Površina: 641.4 km²

Zagreb je također kulturno središte Hrvatske s brojnim muzejima, kazalištima i galerijama.
""",

    "financije_hr.txt": """
IZVJEŠTAJ O FINANCIJSKIM REZULTATIMA

Ukupni prihodi: 2.450.000,00 EUR
Ukupni troškovi: 1.890.000,00 EUR
DOBIT: 560.000,00 EUR

Ključni pokazatelji:
- Profitabilnost: 22.86%
- ROI: 15.3%
- Rast prihoda: +8.5% (godina)
"""
}

# Sample English texts
english_texts = {
    "business_report.txt": """
QUARTERLY FINANCIAL REPORT - Q3 2025

Executive Summary
Our company achieved significant growth in Q3 2025, with total revenue reaching €3,250,000.

Key Performance Indicators:
• Revenue: €3,250,000 (+12.5% YoY)
• Operating Profit: €785,000 (+18.2% YoY)
• Net Margin: 24.15%
• Customer Acquisition: 1,847 new clients

Market Analysis:
The European market showed strong demand for our services, particularly in the technology and financial sectors.

Outlook:
We expect continued growth in Q4 2025, projecting revenues of €3.8M.
""",

    "technical_specs.txt": """
SYSTEM SPECIFICATIONS

Hardware Requirements:
- CPU: Intel i7-12700K or AMD Ryzen 7 5800X
- RAM: 32GB DDR4-3200
- Storage: 1TB NVMe SSD
- GPU: NVIDIA RTX 4070 (optional, for acceleration)

Software Dependencies:
- Python 3.11+
- PyTorch 2.0+
- Transformers 4.30+
- ChromaDB 0.4+

Performance Benchmarks:
- Query Processing: <150ms
- Document Ingestion: 2.5 docs/sec
- Embedding Generation: 850 tokens/sec
"""
}

# Mixed language document
multilingual_texts = {
    "mixed_content.txt": """
MULTILINGUAL DOCUMENT EXAMPLE

English Section:
This document demonstrates processing of mixed-language content.
Total investment value: €1,500,000

Croatian Section:
Ovaj dokument prikazuje obradu sadržaja na više jezika.
Ukupna vrijednost investicije: 1.500.000,00 EUR

Technical Details:
- Processing language: Auto-detect
- Encoding: UTF-8
- Fallback: Language-specific processing
"""
}

# Write Croatian files
for filename, content in croatian_texts.items():
    file_path = hr_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"🇭🇷 Created Croatian file: {filename}")

# Write English files
for filename, content in english_texts.items():
    file_path = en_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"🇬🇧 Created English file: {filename}")

# Write multilingual files
for filename, content in multilingual_texts.items():
    file_path = multilingual_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"🌐 Created multilingual file: {filename}")

print(f"\n✅ Created {len(croatian_texts)} Croatian, {len(english_texts)} English, and {len(multilingual_texts)} multilingual test documents")

In [None]:
# Let's test text extraction from our sample documents
print("📂 DOCUMENT EXTRACTION TESTING:")
print("=" * 60)

# Create document extractor
extractor = DocumentExtractor()

# Collect the sample files written in the previous cell, across all language folders
sample_files = {path.name: path for path in sorted(temp_dir.rglob("*.txt"))}

for filename, file_path in sample_files.items():
    print(f"\n📄 Processing: {filename}")
    print("-" * 40)

    try:
        # Extract text and metadata
        result = extractor.extract_text(str(file_path))

        print(f"   ✅ Extraction successful")
        print(f"   📊 Text length: {len(result['content'])} characters")
        print(f"   🏷️  File type: {result['metadata']['file_type']}")
        print(f"   📅 File size: {result['metadata']['file_size']} bytes")
        print(f"   🔤 Encoding: {result['metadata']['encoding']}")

        # Show first 150 characters
        preview = result['content'][:150].replace('\n', ' ')
        print(f"   📖 Preview: {preview}...")

        # Check Croatian diacritics preservation
        croatian_chars = ['č', 'ć', 'š', 'ž', 'đ', 'Č', 'Ć', 'Š', 'Ž', 'Đ']
        found_diacritics = [char for char in croatian_chars if char in result['content']]
        if found_diacritics:
            print(f"   ✓ Croatian diacritics preserved: {', '.join(found_diacritics)}")

    except Exception as e:
        print(f"   ❌ Extraction failed: {e}")

print("\n💡 Observations:")
print("   • UTF-8 encoding preserves all Croatian diacritics")
print("   • Metadata extraction provides useful document information")
print("   • Text structure (headers, lists) is preserved")

In [None]:
chunker = DocumentChunker(config)

try:
    chunks = chunker.chunk_document(
        text=test_document,
        source_file="zagreb_analysis.txt",
        strategy="sliding_window"
    )

    print(f"📊 Quality Analysis for {len(chunks)} chunks:")
    print()

    for i, chunk in enumerate(chunks):
        content = chunk.content
        metadata = chunk.metadata

        # Analyze chunk quality
        sentences = content.count('. ') + content.count('!') + content.count('?')
        words = len(content.split())
        has_header = any(line.isupper() or line.startswith('#') for line in content.split('\n'))

        print(f"Chunk {i+1} (ID: {metadata['chunk_id'][:8]}...):")
        print(f"   📏 Length: {len(content)} chars, {words} words, ~{sentences} sentences")
        print(f"   Index: {metadata['chunk_index']}/{metadata['total_chunks']}")
        print(f"   🏷️  Language: {metadata.get('language', 'auto-detected')}")
        print(f"   📄 Source: {metadata['source_file']}")
        if has_header:
            print(f"   📋 Contains headers/structure")
        print()

except Exception as e:
    print(f"❌ Error in chunking: {e}")
    import traceback
    traceback.print_exc()

## 3. Croatian Text Cleaning Challenges

### Why Clean Text?

Raw extracted text often contains:
- **Formatting artifacts**: Extra spaces, line breaks
- **Non-printable characters**: Control characters, BOM
- **Inconsistent whitespace**: Tabs, multiple spaces
- **OCR errors**: From scanned documents
- **Mixed languages**: Headers/footers in other languages
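
The first three artifact types can be handled with the standard library. The sketch below is an illustration only; `basic_clean` is a hypothetical helper and does not reflect how the project's cleaner classes are implemented.

```python
import re
import unicodedata

def basic_clean(text: str) -> str:
    """Normalize Unicode, drop BOM/control characters, and tidy whitespace."""
    # Compose decomposed sequences, e.g. "s" + combining caron becomes "š".
    text = unicodedata.normalize("NFC", text)
    # The BOM is a format character (category Cf), so remove it explicitly.
    text = text.replace("\ufeff", "")
    # Drop control characters (category Cc) but keep newlines and tabs.
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    # Collapse runs of spaces/tabs, trim each line, allow at most one blank line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

messy = "\ufeff  ZAGREB    -   GLAVNI GRAD\n\n\n\nStanovnis\u030ctvo:\t ~800,000  \x0cstanovnika"
cleaned = basic_clean(messy)
print(repr(cleaned))
```

NFC composes characters without removing diacritics, so the Croatian letters stay intact while byte-level variants of the same letter become comparable.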

### Croatian-Specific Cleaning:

1. **Diacritic Normalization**: Ensure consistent diacritic representation
2. **Case Handling**: Proper Croatian title case rules
3. **Punctuation**: Handle Croatian-specific quotation marks
4. **Number Formats**: Croatian uses comma for decimals (123,45)
5. **Date Formats**: DD.MM.YYYY. format common in Croatia
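
Points 4 and 5 matter when numeric values or dates are pulled out of chunks as metadata. A minimal sketch, assuming hypothetical helpers `parse_hr_number` and `parse_hr_date` that are not part of the project's cleaner:

```python
import re
from datetime import date
from decimal import Decimal

# DD.MM.YYYY. with an optional space after each dot and an optional final dot.
HR_DATE = re.compile(r"(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\.?")

def parse_hr_number(token: str) -> Decimal:
    """Convert Croatian notation (dot = thousands, comma = decimal) to Decimal."""
    return Decimal(token.replace(".", "").replace(",", "."))

def parse_hr_date(token: str) -> date:
    """Parse a Croatian numeric date such as '15.07.2023.'."""
    match = HR_DATE.fullmatch(token.strip())
    if match is None:
        raise ValueError(f"not a DD.MM.YYYY. date: {token!r}")
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)

print(parse_hr_number("2.450.000,00"))
print(parse_hr_number("123,45"))
print(parse_hr_date("15.07.2023."))
```

The number parser assumes the input really is in Croatian notation. A value written English-style, such as the `641.4` in the sample text, would be misread as 6414, so mixed-convention documents need a per-document or per-field decision before parsing.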

In [None]:
# Let's test Croatian text cleaning capabilities
print("🧹 CROATIAN TEXT CLEANING:")
print("=" * 60)

# Create text cleaner with Croatian-specific configuration
cleaning_config = TextCleaningConfig(
    normalize_whitespace=True,
    remove_extra_newlines=True,
    normalize_diacritics=False,  # Keep Croatian diacritics!
    lowercase=False,  # Preserve proper names
    remove_punctuation=False,
    min_word_length=2,
    language='hr'  # Croatian-specific rules
)

cleaner = CroatianTextCleaner(cleaning_config)

print(f"⚙️ Text cleaner configured for Croatian:")
print(f"   • Language: {cleaning_config.language}")
print(f"   • Preserve diacritics: {not cleaning_config.normalize_diacritics}")
print(f"   • Preserve case: {not cleaning_config.lowercase}")
print(f"   • Min word length: {cleaning_config.min_word_length}")

# Test with messy Croatian text
messy_text = """

   ZAGREB    -   GLAVNI GRAD


Zagreb  je  glavni   grad  Republike   Hrvatske.
    Nalazi    se   u   sjeverozapadnom    dijelu    zemlje.


Stanovništvo:     ~800,000     stanovnika     (2021.)

Površina:   641.4    km²


VAŽNE   ČINJENICE:
•    Osnovan    je    u    11.    stoljeću
•  Glavni    grad   od    1991.  godine
•    Sveučilište    osnovano    1669.


"""

print("\n📝 Original messy text:")
print(f"   Length: {len(messy_text)} characters")
print(f"   Lines: {messy_text.count(chr(10))} newlines")
print(f"   Preview: {repr(messy_text[:100])}...")

# Clean the text
try:
    cleaned_result = cleaner.clean_text(messy_text)
    cleaned_text = cleaned_result['text']

    print("\n✨ Cleaned text:")
    print(f"   Length: {len(cleaned_text)} characters (reduced by {len(messy_text) - len(cleaned_text)})")
    print(f"   Lines: {cleaned_text.count(chr(10))} newlines")
    print(f"   Preview: {cleaned_text[:200]}...")

    # Show cleaning statistics
    stats = cleaned_result['metadata']['cleaning_stats']
    print("\n📊 Cleaning Statistics:")
    for operation, count in stats.items():
        if count > 0:
            print(f"   • {operation}: {count}")

    # Verify Croatian diacritics are preserved
    croatian_chars = ['č', 'ć', 'š', 'ž', 'đ']
    found_chars = [char for char in croatian_chars if char in cleaned_text]
    if found_chars:
        print(f"\n✓ Croatian diacritics preserved: {', '.join(found_chars)}")

except Exception as e:
    print(f"❌ Cleaning failed: {e}")

In [None]:
for content_name, content in content_types.items():
    print(f"📖 {content_name}:")
    print("-" * 30)

    # Choose appropriate chunking strategy based on content type
    if "News" in content_name:
        strategy = "paragraph"
        max_size = 200
    elif "Academic" in content_name:
        strategy = "sentence"
        max_size = 150  # Shorter for complex sentences
    elif "Recipe" in content_name:
        strategy = "sliding_window"  # Preserve structure
        max_size = 100
    else:
        strategy = "sentence"
        max_size = 200

    chunker = DocumentChunker(language='hr')

    try:
        chunks = chunker.chunk_document(
            text=content.strip(),
            source_file=f"{content_name.lower()}_sample.txt",
            strategy=strategy
        )

        print(f"   📊 Strategy: {strategy}, Max size: {max_size}")
        print(f"   📑 Chunks created: {len(chunks)}")

        for i, chunk in enumerate(chunks):
            preview = chunk.content[:60].replace('\n', ' ')
            print(f"   {i+1}. {preview}... ({len(chunk.content)} chars)")

    except Exception as e:
        print(f"   ❌ Error: {e}")

    print()

## 4. Document Chunking Strategies

### Why Chunk Documents?

Large documents must be split into smaller pieces because:
- **Embedding limits**: Embedding models cap input length (commonly around 512 tokens, though some accept considerably more), and text beyond the cap is truncated
- **Search precision**: Smaller chunks = more focused results
- **Context relevance**: Large chunks may contain irrelevant information
- **Processing efficiency**: Smaller pieces are faster to process

### Chunking Strategies:

1. **Sentence-based**: Split on sentence boundaries (good for Croatian)
2. **Fixed-size**: Split by character/token count
3. **Paragraph-based**: Split on paragraph breaks
4. **Semantic**: Split based on topic/meaning
5. **Hybrid**: Combine multiple approaches
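
The fixed-size strategy with overlap, which is one plausible reading of the `sliding_window` strategy name used in this notebook, fits in a few lines. `sliding_window_chunks` below is a stand-alone sketch with assumed parameter names, not the `DocumentChunker` implementation.

```python
def sliding_window_chunks(text: str, max_chars: int = 200, overlap: int = 20):
    """Split text into windows of max_chars that share `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks

# Digits make the overlap easy to see when printed.
text = "".join(str(i % 10) for i in range(450))
chunks = sliding_window_chunks(text, max_chars=200, overlap=20)
print([len(c) for c in chunks])
```

Character windows cut through words and sentences. The overlap softens that by repeating the boundary region in both neighbours, at the cost of storing and embedding some text twice.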

### Croatian Considerations:

- **Sentence detection**: Croatian punctuation patterns
- **Long sentences**: Croatian can have very long complex sentences
- **Paragraph structure**: Formal vs informal text differences
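
Croatian writes ordinals and dates with a trailing period ("11. stoljeću", "15. srpnja 2023."), so a naive split on every period breaks sentences apart. The sketch below shows one way to guard against that; `split_sentences_hr` and the small `ABBREVIATIONS` set are assumptions for illustration only.

```python
import re

# A few common abbreviations; a production list would be much longer.
ABBREVIATIONS = {"sv", "br", "str", "dr", "mr", "prof", "npr", "tzv", "itd"}

# Candidate boundary: sentence punctuation, whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZČĆŠŽĐ])")

def split_sentences_hr(text: str):
    """Split on sentence boundaries, skipping ordinals and known abbreviations."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].rstrip(".!?").lower()
        if last_token.isdigit() or last_token in ABBREVIATIONS:
            continue  # "15." or "sv." is not the end of a sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = (
    "Sveučilište je osnovano 1669. godine. "
    "Crkva sv. Marka nalazi se u Gornjem gradu! "
    "Glavni grad od 1991. godine."
)
sentences = split_sentences_hr(text)
for sentence in sentences:
    print("-", sentence)
```

The trade-off is deliberate: a sentence that genuinely ends in a number and is followed by a capitalised word stays merged with the next one. For chunking, an occasional merged sentence is usually less harmful than splitting a date in half.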

In [None]:
for content_name, content in content_types.items():
    print(f"📖 {content_name}:")
    print("-" * 30)

    # Choose appropriate chunking strategy based on content type
    if "News" in content_name:
        strategy = ChunkingStrategy.PARAGRAPH
        max_size = 200
    elif "Academic" in content_name:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 150  # Shorter for complex sentences
    elif "Recipe" in content_name:
        strategy = ChunkingStrategy.HYBRID  # Preserve lists
        max_size = 100
    else:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 200

    config = ChunkingConfig(
        strategy=strategy,
        max_chunk_size=max_size,
        overlap_size=20,
        language='hr'
    )

    chunker = DocumentChunker(config)

    try:
        # Use chunk_document with source_file parameter
        chunks = chunker.chunk_document(
            content=content.strip(),
            source_file=f"sample_{content_name.lower()}.txt",
            metadata={"content_type": content_name.lower(), "language": "hr"}
        )

        print(f"   📊 Strategy: {strategy.value}, Max size: {max_size}")
        print(f"   📑 Chunks created: {len(chunks)}")

        for i, chunk in enumerate(chunks):
            preview = chunk['content'][:60].replace('\n', ' ')
            print(f"   {i+1}. {preview}... ({len(chunk['content'])} chars)")

    except Exception as e:
        print(f"   ❌ Error: {e}")

    print()

In [None]:
# Test chunking with different types of Croatian content
print("📚 CHUNKING DIFFERENT CONTENT TYPES:")
print("=" * 60)

# Different Croatian content types
content_types = {
    "News Article": """
ZAGREB, 15. srpnja 2023. - Hrvatski premijer najavio je nova ulaganja u infrastrukturu.

Prema riječima premijera, projekti će uključivati modernizaciju cesta, željeznica i digitalne infrastrukture. "Ovo su strateška ulaganja za budućnost Hrvatske", rekao je premijer na tiskovnoj konferenciji.

Ukupna vrijednost projekata procjenjuje se na 2,5 milijardi eura. Financiranje će biti osigurano iz EU fondova i državnog proračuna.
""",

    "Academic Text": """
Sintaksa hrvatskog jezika odlikuje se složenošću koja proizlazi iz bogate morfologije. Fleksijska priroda jezika omogućuje relativno slobodan red riječi, što se posebno očituje u poetskom diskursu.

Slavenski jezici, uključujući hrvatski, karakterizira razvijen aspektualnost glagolski sistem. Perfektivnost i imperfektivnost glagola fundamentalne su kategorije koje utječu na temporalnu strukturu iskaza.
""",

    "Recipe": """
SARMA - tradicionalno zimsko jelo

Potrebno:
• 1 kg miješanog mesa
• 1 glavica kiselog kupusa
• 200g riže
• 2 luka
• Sol, papar, vegeta

Priprema:
1. Prokuhajte rižu na pola
2. Pomiješajte meso s rižom i začinima
3. Zamotajte u kupusove listove
4. Kuhajte 2 sata na laganoj vatri
"""
}

for content_name, content in content_types.items():
    print(f"📖 {content_name}:")
    print("-" * 30)

    # Choose appropriate chunking strategy based on content type
    if "News" in content_name:
        strategy = ChunkingStrategy.PARAGRAPH
        max_size = 200
    elif "Academic" in content_name:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 150  # Shorter for complex sentences
    elif "Recipe" in content_name:
        strategy = ChunkingStrategy.HYBRID  # Preserve lists
        max_size = 100
    else:
        strategy = ChunkingStrategy.SENTENCE
        max_size = 200

    config = ChunkingConfig(
        strategy=strategy,
        max_chunk_size=max_size,
        overlap_size=20,
        language='hr'
    )

    chunker = DocumentChunker(config)

    try:
        chunks = chunker.chunk_text(
            text=content.strip(),
            metadata={"content_type": content_name.lower(), "language": "hr"}
        )

        print(f"   📊 Strategy: {strategy.value}, Max size: {max_size}")
        print(f"   📑 Chunks created: {len(chunks)}")

        for i, chunk in enumerate(chunks):
            preview = chunk['content'][:60].replace('\n', ' ')
            print(f"   {i+1}. {preview}... ({len(chunk['content'])} chars)")

    except Exception as e:
        print(f"   ❌ Error: {e}")

    print()

print("🎯 Content-specific chunking improves RAG quality:")
print("   • News: Paragraph-based preserves story flow")
print("   • Academic: Sentence-based handles complexity")
print("   • Recipes: Hybrid preserves structured lists")

## 5. Complete Document Processing Pipeline

### Putting It All Together

A complete document processing pipeline combines:
1. **Document detection and loading**
2. **Format-specific text extraction**
3. **Croatian-aware text cleaning**
4. **Intelligent document chunking**
5. **Metadata preservation and enrichment**

### Pipeline Benefits:
- **Consistency**: Same processing for all documents
- **Quality**: Each step improves text quality
- **Scalability**: Can process hundreds of documents
- **Traceability**: Track processing steps and errors
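
Traceability in particular is easier to reason about with a concrete shape. The sketch below records which steps completed and where a failure occurred; `PipelineResult`, `run_pipeline`, and the two toy steps are hypothetical and only stand in for the real extractor, cleaner, and chunker.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

@dataclass
class PipelineResult:
    source: str
    output: Any = None
    steps_completed: List[str] = field(default_factory=list)
    error: Optional[str] = None

def run_pipeline(source: str, data: Any, steps: List[Tuple[str, Callable]]) -> PipelineResult:
    """Run named steps in order, recording progress and the first failure."""
    result = PipelineResult(source=source)
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            # Stop at the first failing step and keep its name for later triage.
            result.error = f"{name}: {exc}"
            return result
        result.steps_completed.append(name)
    result.output = data
    return result

# Toy stand-ins for the real cleaning and chunking stages.
steps = [
    ("clean", lambda text: " ".join(text.split())),
    ("chunk", lambda text: [s.strip() for s in text.split(".") if s.strip()]),
]

ok = run_pipeline("demo.txt", "Zagreb  je glavni grad.  Nalazi se na Savi.", steps)
failed = run_pipeline("broken.txt", None, steps)
print(ok.steps_completed, ok.output)
print(failed.steps_completed, failed.error)
```

Returning a result object instead of printing lets a batch run over hundreds of documents summarise failures per step afterwards.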

In [None]:
def process_document_complete(file_path, chunk_strategy="sentence"):
    """Complete document processing pipeline."""
    print(f"   Processing: {Path(file_path).name}")

    try:
        # Step 1: Extract text
        print("   1️⃣ Text extraction...")
        extractor = DocumentExtractor()
        raw_text, file_metadata = extractor.extract_text(file_path)
        print(f"      ✅ Extracted {len(raw_text)} characters")

        # Step 2: Clean text
        print("   2️⃣ Text cleaning...")
        cleaner = MultilingualTextCleaner(language='hr')
        cleaned_text, cleaning_stats = cleaner.clean_text(raw_text)
        print(f"      ✅ Cleaned: {cleaning_stats['characters_removed']} chars removed")

        # Step 3: Chunk text
        print("   3️⃣ Text chunking...")

        # Map strategy names to valid options
        strategy_mapping = {
            "sentence": "sentence",
            "paragraph": "paragraph",
            "sliding_window": "sliding_window"
        }

        strategy = strategy_mapping.get(chunk_strategy, "sentence")
        chunker = DocumentChunker(language='hr')

        chunks = chunker.chunk_document(
            text=cleaned_text,
            source_file=str(file_path),
            strategy=strategy
        )

        print(f"      ✅ Created {len(chunks)} chunks")

        # Step 4: Quality check
        print("   4️⃣ Quality check...")
        chunk_sizes = [len(chunk.content) for chunk in chunks]
        avg_size = sum(chunk_sizes) / len(chunk_sizes) if chunk_sizes else 0

        # Check for quality issues
        quality_issues = []
        if avg_size < 50:
            quality_issues.append("chunks too small")
        # Guard against an empty list before calling max()
        if chunk_sizes and max(chunk_sizes) > 500:
            quality_issues.append("some chunks too large")
        # Flag diacritics that exist in the raw text but appear in no chunk
        diacritics = 'čćšžđ'
        if not any(c in chunk.content for chunk in chunks for c in diacritics):
            if any(c in raw_text for c in diacritics):
                quality_issues.append("Croatian diacritics lost")

        if quality_issues:
            print(f"      ⚠️ Quality issues: {', '.join(quality_issues)}")
        else:
            print(f"      ✅ Quality check passed")

        print(f"      📊 Avg chunk size: {avg_size:.1f} chars")

        return chunks

    except Exception as e:
        print(f"      ❌ Pipeline failed: {e}")
        return []

## 6. Advanced Multilingual Document Examples

### 🌍 Real-World Document Processing Scenarios

Here we demonstrate processing diverse multilingual documents that reflect actual use cases in Croatian business and government environments.

In [None]:
# Advanced multilingual document examples for processing
multilingual_documents = {
    "🏛️ Croatian Legal Document": {
        "content": """ODLUKA VLADE REPUBLIKE HRVATSKE
O IZMJENI ZAKONA O RADU

Klasa: 022-03/25-01/89
Urbroj: 50301-01-25-4
Zagreb, 15. lipnja 2025.

Na temelju članka 89. Ustava Republike Hrvatske, Vlada Republike Hrvatske na sjednici održanoj 15. lipnja 2025. godine donosi

ODLUKU
O IZMJENI ZAKONA O RADU

Članak 1.
U Zakonu o radu (»Narodne novine«, broj 93/14, 127/17, 98/19 i 151/22) mijenja se članak 142. koji glasi:

»Minimalna mjesečna plaća za puno radno vrijeme iznosi 721,79 EUR.
Minimalna mjesečna plaća uvjetovana je indeksom potrošačkih cijena.
Vlada Republike Hrvatske donosi uredbu o visini minimalne plaće.«

Članak 2.
Ova Odluka stupa na snagu 1. srpnja 2025. godine.

Predsjednik Vlade Republike Hrvatske
Andrej Plenković""",
        "language": "hr",
        "domain": "legal",
        "complexity": "high",
        "features": ["formal_language", "legal_terminology", "official_formatting", "currency_amounts", "dates"]
    },

    "💼 Croatian Business Report": {
        "content": """IZVJEŠĆE O POSLOVANJU ZA Q3 2025
TECH SOLUTIONS d.o.o.

SAŽETAK FINANCIJSKIH REZULTATA

Ukupni prihodi: 2.450.000,00 EUR (+15,3% u odnosu na Q3 2024)
Neto dobit: 367.500,00 EUR (+22,7% YoY)
EBITDA: 490.000,00 EUR (20,0% marža)

KLJUČNI INDIKATORI PERFORMANSI (KPI)
• Broj zaposlenih: 47 (+6 nova zaposlenja)
• Customer Satisfaction Score: 4.7/5.0
• Net Promoter Score (NPS): +65
• Retention rate: 94,2%

PROJEKTI I INOVACIJE
Tijekom Q3 implementirali smo RAG (Retrieval-Augmented Generation) sustav za automatizaciju customer support-a. Sustav koristi BGE-M3 embeddings i Ollama lokalnu infrastrukturu.

Rezultati:
- 40% smanjenje response time-a
- 15% povećanje customer satisfaction
- Ušteda od 25.000 EUR mjesečno na operativnim troškovima

OUTLOOK ZA Q4 2025
Očekujemo daljnji rast prihoda od 12-18% te lansiranje novog AI-powered produkta.""",
        "language": "hr",
        "domain": "business",
        "complexity": "medium",
        "features": ["mixed_terminology", "financial_data", "percentages", "technical_terms", "english_acronyms"]
    },

    "🔬 English Technical Documentation": {
        "content": """RAG SYSTEM ARCHITECTURE SPECIFICATION
Version 2.1.3 | Build Date: September 5, 2025

OVERVIEW
This document outlines the technical architecture for a multilingual Retrieval-Augmented Generation (RAG) system supporting Croatian and English languages.

SYSTEM COMPONENTS

1. DOCUMENT PROCESSING PIPELINE
   - Input formats: PDF, DOCX, TXT
   - Encoding detection: UTF-8, Windows-1250, ISO-8859-2
   - Language detection: Pattern-based Croatian/English classification
   - Text extraction: pypdf, python-docx libraries
   - Cleaning: Unicode normalization, whitespace handling

2. CHUNKING STRATEGY
   - Sentence-based: 100-400 characters with 20-50 overlap
   - Paragraph-based: Semantic boundary preservation
   - Hybrid: Content-aware splitting for complex documents
   - Language-aware: Croatian morphology considerations

3. VECTOR DATABASE
   - Embeddings: BGE-M3 (1024 dimensions)
   - Storage: ChromaDB with persistence
   - Indexing: HNSW algorithm for similarity search
   - Metadata: Language tags, source references, timestamps

4. RETRIEVAL SYSTEM
   - Dense retrieval: Semantic similarity via embeddings
   - Sparse retrieval: BM25 with multilingual tokenization
   - Hybrid fusion: Weighted combination (0.7 dense + 0.3 sparse)
   - Reranking: Cross-encoder for result refinement

PERFORMANCE METRICS
- Query latency: <200ms (p95)
- Retrieval accuracy: >85% (HR), >92% (EN)
- Storage efficiency: 4.2MB per 1000 documents
- Throughput: 15 queries/second sustained""",
        "language": "en",
        "domain": "technical",
        "complexity": "high",
        "features": ["technical_specifications", "performance_metrics", "system_architecture", "acronyms", "version_numbers"]
    },

    "🌍 Mixed Language Business Communication": {
        "content": """ZAPISNIK SASTANKA / MEETING MINUTES
Tech Solutions d.o.o. - Q3 Strategy Review
Datum/Date: 15. rujan 2025. / September 15, 2025

SUDIONICI/ATTENDEES:
- Marko Horvat (CEO)
- Sarah Johnson (CTO)
- Ana Novak (Head of Operations)
- James Wilson (Product Manager)

AGENDA ITEMS:

1. Q3 PERFORMANCE REVIEW
Marko: "Rezultati za Q3 su odlični - 15.3% rast u odnosu na prošlu godinu."
Sarah: "The RAG implementation exceeded our expectations. Response time improved by 40%."

2. TECHNICAL ROADMAP
Sarah: "We need to expand our LLM capabilities. Predlažem da implementiramo nova multilingual models."
James: "I agree. Our customer base is 60% Croatian, 35% English, 5% other languages."

3. RESOURCE ALLOCATION
Ana: "Za Q4 trebamo zaposliti još 3 developera i 2 data scientista."
Sarah: "Budget approval needed for additional GPU infrastructure - approximately €45,000."

4. NEXT STEPS
- Implement BGE-M3 embeddings for better Croatian support
- Hire additional technical staff (deadline: 30. listopad 2025.)
- Present findings to board (meeting scheduled November 12th, 2025)

CLOSING REMARKS:
Marko: "Excellent progress team. Let's maintain this momentum kroz Q4."

SLJEDEĆI SASTANAK/NEXT MEETING: 30. rujan 2025. / September 30, 2025""",
        "language": "mixed",
        "domain": "business_communication",
        "complexity": "medium",
        "features": ["code_switching", "bilingual_formatting", "meeting_structure", "mixed_dates", "currency_amounts"]
    },

    "üìä Croatian Government Statistics": {
        "content": """DR≈ΩAVNI ZAVOD ZA STATISTIKU REPUBLIKE HRVATSKE
STATISTIƒåKO IZVJE≈†ƒÜE BR. 1.2.3/2025

DEMOGRAFSKI TRENDOVI U HRVATSKOJ - RUJAN 2025

STANOVNIŠTVO
Ukupno stanovništvo: 3.871.833 (-0,8% u odnosu na 2024.)
Prirodni prirast: -1,2 ‰
Migracijski saldo: +0,4 ‰

REGIONALNA DISTRIBUCIJA
• Zagreb (grad): 769.944 stanovnika (+0,2%)
• Split-Dalmatinska županija: 454.798 (-1,1%)
• Primorsko-goranska županija: 296.195 (-0,9%)
• Osječko-baranjska županija: 305.032 (-1,4%)

EKONOMSKI INDIKATORI
Prosječna neto plaća: 1.235,67 EUR (+7,2% YoY)
Stopa nezaposlenosti: 6,1% (-0,8 postotnih bodova)
Inflacija (mjeseƒçno): 2,4%
BDP per capita: 16.247 EUR

OBRAZOVANJE
Stopa pismenosti: 99,2%
Visokoškolska naobrazba: 28,7% (+1,2 p.p.)
STEM diplomanti: 23,4% ukupnih diplomanata

NAPOMENE METODOLOGIJE
Podaci se temelje na registru stanovništva i anketi radne snage.
Confidence interval: 95%
Margin of error: ±0,3%

Objavljeno: 5. rujan 2025.
Sljedeće izvješće: 5. prosinac 2025.""",
        "language": "hr",
        "domain": "statistics",
        "complexity": "high",
        "features": ["statistical_data", "percentages", "regional_data", "economic_indicators", "methodological_notes"]
    }
}

print("🌍 Advanced Multilingual Document Examples Prepared:")
print("=" * 65)
for doc_name, doc_data in multilingual_documents.items():
    print(f"{doc_name}")
    print(f"   Language: {doc_data['language']} | Domain: {doc_data['domain']}")
    print(f"   Complexity: {doc_data['complexity']} | Length: {len(doc_data['content'])} chars")
    print(f"   Features: {', '.join(doc_data['features'])}")
    print(f"   Preview: {doc_data['content'][:80].replace(chr(10), ' ')}...")
    print()

In [None]:
# Comprehensive multilingual document processing demonstration
import re
from collections import Counter

def detect_document_language(text):
    """Simple language detection for demonstration."""
    croatian_indicators = len(re.findall(r'[čćžšđ]', text.lower()))
    croatian_words = len(re.findall(r'\b(je|su|za|na|u|od|do|se|i|a|koji|koja|koje|gdje|kada|kako|što)\b', text.lower()))
    english_indicators = len(re.findall(r'\b(the|and|or|of|in|to|for|with|on|at|by|from)\b', text.lower()))

    total_words = len(text.split())
    if total_words == 0:
        return "unknown"

    croatian_score = (croatian_indicators * 3 + croatian_words) / total_words
    english_score = english_indicators / total_words

    if croatian_score > 0.1 and english_score > 0.05:
        return "mixed"
    elif croatian_score > english_score:
        return "hr"
    elif english_score > 0.05:
        return "en"
    else:
        return "unknown"

def extract_document_features(text, expected_features):
    """Extract and validate document features."""
    found_features = []

    # Check for each expected feature
    if "formal_language" in expected_features:
        formal_patterns = r'\b(Vlada|Republika|Odluka|članak|temelju|Ustava)\b'
        if re.search(formal_patterns, text, re.IGNORECASE):
            found_features.append("formal_language")

    if "financial_data" in expected_features:
        financial_patterns = r'\d+[.,]\d+\s*(EUR|€|%)'
        if re.search(financial_patterns, text):
            found_features.append("financial_data")

    if "technical_terms" in expected_features:
        tech_patterns = r'\b(RAG|API|BGE-M3|embeddings|vector|database|algorithm|pipeline)\b'
        if re.search(tech_patterns, text, re.IGNORECASE):
            found_features.append("technical_terms")

    if "dates" in expected_features:
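        # Croatian dates with genitive or nominative month names ("15. rujna 2025", "15. rujan 2025"),
        # numeric Croatian dates ("15.09.2025"), or English dates ("September 15, 2025")
        date_patterns = (
            r'\d{1,2}\.\s*(?:siječnja|siječanj|veljače|veljača|ožujka|ožujak|travnja|travanj|svibnja|svibanj|lipnja|lipanj'
            r'|srpnja|srpanj|kolovoza|kolovoz|rujna|rujan|listopada|listopad|studenoga|studenog|studeni|prosinca|prosinac|\d{1,2}\.)\s*\d{4}'
            r'|\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s*\d{4}'
        )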
        if re.search(date_patterns, text):
            found_features.append("dates")

    if "mixed_terminology" in expected_features:
        mixed_patterns = r'\b(customer|support|response|time|satisfaction|score)\b.*\b(sustav|implementacija|rezultati|troškovi)\b|\b(sustav|implementacija|rezultati|troškovi)\b.*\b(customer|support|response|time|satisfaction|score)\b'
        if re.search(mixed_patterns, text, re.IGNORECASE | re.DOTALL):
            found_features.append("mixed_terminology")

    if "code_switching" in expected_features:
        # Look for language switches within sentences
        sentences = text.split('.')
        for sentence in sentences:
            croatian_words = len(re.findall(r'\b(je|su|za|na|u|od|trebamo|možemo|godina)\b', sentence.lower()))
            english_words = len(re.findall(r'\b(the|and|we|need|can|year|meeting|next)\b', sentence.lower()))
            if croatian_words > 0 and english_words > 0:
                found_features.append("code_switching")
                break

    return found_features

# Process all multilingual documents
print("📊 Comprehensive Multilingual Document Processing")
print("=" * 70)

processing_results = []

for doc_name, doc_data in multilingual_documents.items():
    print(f"\n🔍 Processing: {doc_name}")
    print("-" * 50)

    content = doc_data['content']
    expected_lang = doc_data['language']
    expected_features = doc_data['features']

    # Language detection
    detected_lang = detect_document_language(content)
    lang_match = detected_lang == expected_lang

    # Feature extraction
    found_features = extract_document_features(content, expected_features)
    feature_accuracy = len(found_features) / len(expected_features) if expected_features else 0

    # Document statistics
    word_count = len(content.split())
    char_count = len(content)
    sentence_count = len([s for s in content.split('.') if s.strip()])

    # Chunking simulation (sentence-based)
    sentences = [s.strip() + '.' for s in content.split('.') if s.strip()]
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk + sentence) <= 300:  # Target chunk size
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk:
        chunks.append(current_chunk.strip())

    # Store results
    result = {
        'name': doc_name,
        'expected_lang': expected_lang,
        'detected_lang': detected_lang,
        'lang_match': lang_match,
        'word_count': word_count,
        'char_count': char_count,
        'sentence_count': sentence_count,
        'chunk_count': len(chunks),
        'expected_features': expected_features,
        'found_features': found_features,
        'feature_accuracy': feature_accuracy,
        'domain': doc_data['domain'],
        'complexity': doc_data['complexity']
    }
    processing_results.append(result)

    # Display results
    status = "✅" if lang_match else "❌"
    print(f"{status} Language Detection: {detected_lang} (expected: {expected_lang})")
    print(f"📏 Document Stats: {word_count} words, {sentence_count} sentences, {char_count} chars")
    print(f"🔪 Chunking Result: {len(chunks)} chunks created")
    print(f"🏷️  Feature Accuracy: {feature_accuracy:.1%} ({len(found_features)}/{len(expected_features)})")
    print(f"   Expected: {', '.join(expected_features[:3])}{'...' if len(expected_features) > 3 else ''}")
    print(f"   Found: {', '.join(found_features[:3])}{'...' if len(found_features) > 3 else ''}")

    # Show sample chunk
    if chunks:
        sample_chunk = chunks[0][:100] + "..." if len(chunks[0]) > 100 else chunks[0]
        print(f"📄 Sample Chunk: {sample_chunk}")

# Overall processing statistics
print("\n🎯 Overall Processing Results:")
print("=" * 50)

total_docs = len(processing_results)
correct_lang_detection = sum(1 for r in processing_results if r['lang_match'])
avg_feature_accuracy = sum(r['feature_accuracy'] for r in processing_results) / total_docs
total_chunks = sum(r['chunk_count'] for r in processing_results)
avg_chunk_per_doc = total_chunks / total_docs

print(f"📊 Language Detection Accuracy: {correct_lang_detection}/{total_docs} ({correct_lang_detection/total_docs:.1%})")
print(f"🎯 Average Feature Detection: {avg_feature_accuracy:.1%}")
print(f"🔪 Total Chunks Created: {total_chunks}")
print(f"📈 Average Chunks per Document: {avg_chunk_per_doc:.1f}")

# Domain and complexity analysis
domain_stats = Counter(r['domain'] for r in processing_results)
complexity_stats = Counter(r['complexity'] for r in processing_results)

print(f"\n📂 Domain Distribution: {dict(domain_stats)}")
print(f"⚖️  Complexity Distribution: {dict(complexity_stats)}")

# Language-specific performance
lang_performance = {}
for result in processing_results:
    lang = result['expected_lang']
    if lang not in lang_performance:
        lang_performance[lang] = {'correct': 0, 'total': 0, 'chunks': 0}
    lang_performance[lang]['total'] += 1
    lang_performance[lang]['chunks'] += result['chunk_count']
    if result['lang_match']:
        lang_performance[lang]['correct'] += 1

print("\n🌍 Language-Specific Performance:")
for lang, stats in lang_performance.items():
    accuracy = stats['correct'] / stats['total']
    avg_chunks = stats['chunks'] / stats['total']
    print(f"   {lang}: {accuracy:.1%} accuracy, {avg_chunks:.1f} avg chunks/doc")

In [None]:
# Test the complete pipeline
print("Testing the complete pipeline on our sample documents:")
print()

all_processed_chunks = []
files_to_process = list(sample_files.items())[:2]  # Process the first 2 files

for filename, file_path in files_to_process:
    chunks = process_document_complete(file_path, chunk_strategy="sentence")
    all_processed_chunks.extend(chunks)

# Average chunk size, guarded against an empty result
avg_chunk_size = (
    sum(len(chunk.content) for chunk in all_processed_chunks) / len(all_processed_chunks)
    if all_processed_chunks else 0.0
)

print("\nPipeline Summary:")
print(f"   📄 Documents processed: {len(files_to_process)}")
print(f"   📑 Total chunks created: {len(all_processed_chunks)}")
print(f"   📏 Average chunk size: {avg_chunk_size:.1f} chars")

# Diacritics check: count chunks containing Croatian-specific letters (case-insensitive)
croatian_chunks = [chunk for chunk in all_processed_chunks
                  if any(c in chunk.content.lower() for c in 'čćšžđ')]
print(f"   🇭🇷 Chunks with Croatian diacritics: {len(croatian_chunks)}")

print("\n✅ Complete multilingual pipeline ran successfully on the sample documents")
print("🌍 Ready for production Croatian document processing")

## 6. Summary and Best Practices

### Key Takeaways:

🎯 **Document Processing is Critical**:
- Quality processing = better RAG results
- Each step (extract→clean→chunk) adds value
- Bad preprocessing ruins the entire pipeline

🇭🇷 **Croatian Language Considerations**:
- Always preserve diacritics (č, ć, š, ž, đ)
- Handle various encodings (UTF-8, Windows-1250)
- Use Croatian-aware sentence splitting
- Consider regional variations and dialects

📝 **Chunking Strategy Matters**:
- Sentence-based: Best for most Croatian content
- Paragraph-based: Good for structured documents
- Hybrid: Best for complex formats
- Always include overlap for context preservation
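
A Croatian-aware sentence splitter can be sketched with the standard library alone. The version below is a minimal illustration, not the pipeline's actual splitter: the `ABBREVIATIONS` set and the name `split_sentences_hr` are assumptions made for this example. It avoids splitting after common abbreviations (`br.`, `d.o.o.`) and after ordinal day numbers (`15. rujna`), both of which a plain `text.split('.')` would break apart.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it for your corpus.
ABBREVIATIONS = {"br", "dr", "mr", "prof", "npr", "tj", "itd", "sl", "str", "čl", "d.o.o", "d.d"}

# Candidate boundary: ".", "!" or "?" followed by whitespace and an uppercase letter
# (including Croatian Č, Ć, Š, Ž, Đ), optionally behind an opening quote or bracket.
_BOUNDARY = re.compile(r'([.!?])\s+(?=["„(]?[A-ZČĆŠŽĐ])')


def split_sentences_hr(text):
    """Split text into sentences without breaking abbreviations or ordinal dates."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        candidate = text[start:match.end(1)]
        tokens = candidate.split()
        last_token = tokens[-1].rstrip(".!?").lower() if tokens else ""
        if match.group(1) == ".":
            # "d.o.o. Zagreb" or "br. Split": the period belongs to an abbreviation
            if last_token in ABBREVIATIONS:
                continue
            # "15. Rujna" style ordinals: one or two digits before the period
            if last_token.isdigit() and len(last_token) <= 2:
                continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = ("Tech Solutions d.o.o. Zagreb objavio je izvješće br. 12 dana 15. rujna 2025. "
          "Rast iznosi 15,3%. Što dalje?")
for sentence in split_sentences_hr(sample):
    print(sentence)
```

The heuristic only splits before an uppercase letter, so it keeps decimal numbers and lowercase month names intact. It will still miss sentences that start with a digit or a lowercase word.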

### Best Practices:

✅ **Text Extraction**:
- Auto-detect encoding for legacy documents
- Preserve file metadata for traceability
- Handle extraction errors gracefully
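
One way to auto-detect the encoding of legacy Croatian files, sketched with the standard library only. The function name `decode_with_fallback` and the scoring heuristic are assumptions for illustration; a dedicated detection library may do better on real corpora. UTF-8 is tried strictly first. If that fails, Windows-1250 and ISO-8859-2 are compared by how many Croatian letters and how few C1 control characters each decoding produces.

```python
CROATIAN_LETTERS = set("čćšžđČĆŠŽĐ")


def decode_with_fallback(raw, legacy_encodings=("cp1250", "iso-8859-2")):
    """Return (text, encoding_name) for raw bytes, preferring strict UTF-8."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    best = None  # (score, text, encoding)
    for encoding in legacy_encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Reward Croatian letters, penalize C1 control characters, which
        # usually signal that the wrong single-byte encoding was chosen.
        letters = sum(ch in CROATIAN_LETTERS for ch in text)
        controls = sum(0x80 <= ord(ch) <= 0x9F for ch in text)
        score = letters - controls
        if best is None or score > best[0]:
            best = (score, text, encoding)

    if best is None:
        # Last resort: keep going, but mark undecodable bytes visibly.
        return raw.decode("utf-8", errors="replace"), "utf-8-replace"
    return best[1], best[2]


legacy_bytes = "Što čini šumu".encode("cp1250")
text, encoding = decode_with_fallback(legacy_bytes)
print(encoding, text)
```

Returning the encoding name alongside the text lets the pipeline store it as chunk metadata, which helps when tracing a garbled chunk back to its source file.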

✅ **Text Cleaning**:
- Use conservative cleaning for Croatian
- Never normalize Croatian diacritics
- Preserve proper names and technical terms
- Normalize whitespace and quotes
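
Conservative cleaning might look like the sketch below. The function name `clean_text_conservative` and the selection of quote characters are assumptions for this example. It applies Unicode NFC normalization, which composes a base letter plus combining mark into a single character and leaves Croatian diacritics in place, then normalizes quotes and whitespace. It deliberately avoids any ASCII folding.

```python
import re
import unicodedata

# Typographic quotes mapped to plain ASCII quotes (an illustrative selection).
QUOTE_MAP = str.maketrans({
    "\u201e": '"',  # low double quote, common as the opening quote in Croatian
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u00bb": '"',  # right-pointing guillemet
    "\u00ab": '"',  # left-pointing guillemet
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
})


def clean_text_conservative(text):
    """Normalize quotes and whitespace while leaving diacritics untouched."""
    # NFC composes "c" + combining caron into "č"; it never strips diacritics.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTE_MAP)
    text = re.sub(r"[ \t\u00a0]+", " ", text)  # collapse spaces, tabs, no-break spaces
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line
    return text.strip()


print(clean_text_conservative("\u201eŠto  je\u00a0 novo?\u201c\n\n\n\nČetvrtak,   15. rujna"))
```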

✅ **Document Chunking**:
- Aim for 100-400 characters per chunk
- Use 20-50 character overlap
- Respect sentence boundaries
- Include chunk metadata for debugging
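
The chunking guidelines above can be sketched as a small function that packs whole sentences into chunks and carries a short tail of the previous chunk into the next one. The names `chunk_with_overlap` and `overlap_tail` and the default values are assumptions for illustration, chosen from the ranges listed above. Because the overlap is prepended and long sentences are never cut, a chunk can exceed `max_chars` slightly.

```python
def overlap_tail(chunk, overlap_chars):
    """Return up to overlap_chars from the end of chunk, starting at a word boundary."""
    if overlap_chars <= 0:
        return ""
    if len(chunk) <= overlap_chars:
        return chunk
    tail = chunk[-overlap_chars:]
    # Drop the leading (possibly partial) word so the overlap starts cleanly.
    _, _, rest = tail.partition(" ")
    return rest


def chunk_with_overlap(sentences, max_chars=400, overlap_chars=50):
    """Pack sentences into chunks of roughly max_chars with a character overlap."""
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Start the next chunk with the tail of the one just closed.
            current = f"{overlap_tail(current, overlap_chars)} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "Prva rečenica je kratka.",
    "Druga rečenica je malo duža od prve.",
    "Treća rečenica zaključuje primjer.",
]
for i, chunk in enumerate(chunk_with_overlap(sentences, max_chars=50, overlap_chars=20)):
    print(i, repr(chunk))
```

Sentence boundaries are respected because only complete sentences are appended; the overlap is the only place where a sentence fragment appears.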

### Next Steps:

1. ✅ **Document Processing** - Just completed!
2. ✅ **Vector Database** - Already done
3. ⏳ **Retrieval System** - Next step
4. ⏳ **Generation** - Local LLM integration
5. ⏳ **Complete Pipeline** - End-to-end integration

The document processing pipeline creates clean, well-structured chunks that are ready for embedding and storage in our vector database!

In [None]:
# Clean up our temporary files
import shutil

try:
    if temp_dir.exists():
        shutil.rmtree(temp_dir)
        print(f"🧹 Cleaned up temporary directory: {temp_dir}")
except Exception as e:
    print(f"⚠️  Warning: Could not clean up temp directory: {e}")

print("\n🎉 Document Processing Learning Complete!")
print("\n📚 What we learned:")
print("   • Text extraction from multiple formats")
print("   • Croatian-specific cleaning challenges")
print("   • Chunking strategies and their impact")
print("   • Complete processing pipeline design")
print("\n➡️  Ready for the next step: the Retrieval System!")