# CORPUS-INFORMED AUTO-TRANSLATOR: VECTOR CREATION NOTEBOOK
Day 3: Converting JSON Corpora into Semantic Vectors
(Create vectors for multiple corpus items)

## Notebook Outline

**Part 1: Data Preparation and Planning**

- STEP 1: LOAD THE REQUIRED LIBRARIES
- STEP 2: LOAD AND EXAMINE DATABASE STRUCTURE
- STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)

**Part 2: Function and Model Setup**

- STEP 4: FLEXIBLE TEXT EXTRACTION FUNCTIONS
- STEP 5: DEFINE VECTORIZATION FUNCTIONS
- STEP 6: INITIALIZE THE MULTILINGUAL EMBEDDING MODEL

**Part 3: Execution and Output**

- STEP 7: BATCH PROCESS DOCUMENTS

## =============================================================================


## Part 1: Data Preparation and Planning

### STEP 1: LOAD THE REQUIRED LIBRARIES

In [1]:
### STEP 1: LOAD THE REQUIRED LIBRARIES

# Core Python libraries
import json
import os
from pathlib import Path
import logging
from typing import Dict, List, Tuple, Optional
from datetime import datetime

# Data processing libraries
import pandas as pd
import numpy as np

# Text processing libraries
import re
from collections import defaultdict

# Machine learning and embedding libraries
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Progress tracking (helpful for long processing times)
from tqdm.notebook import tqdm

# Visualization libraries (for testing our vectors)
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration management
import sys
sys.path.append('../scripts')  # Add scripts folder to path
from config import *  # Import all our configuration settings

# Set up logging for debugging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("✅ All libraries loaded successfully!")
print(f"📁 Working with domain: {DOMAIN}")
print(f"🌐 Processing languages: {LANGUAGES}")
print(f"🤖 Using model: {MODEL_NAME}")

✅ All libraries loaded successfully!
📁 Working with domain: gai
🌐 Processing languages: ['eng', 'esp']
🤖 Using model: paraphrase-multilingual-MiniLM-L12-v2
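
The wildcard import above pulls `DOMAIN`, `LANGUAGES`, `MODEL_NAME`, and `CORPORA_DIR` from `../scripts/config.py`, which is not shown in this notebook. A minimal sketch of what such a module might contain follows; the values come from the printed output, while the directory layout and the `database_path` helper are assumptions made for illustration only.

```python
# Hypothetical sketch of ../scripts/config.py. Only the four names this
# notebook uses are shown; the real module may define more.
from pathlib import Path

DOMAIN = "gai"
LANGUAGES = ["eng", "esp"]
MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"

# Assumed layout: corpora/<domain>/<language>/<domain>-<language>_database.json
CORPORA_DIR = Path("..") / "corpora" / DOMAIN


def database_path(language: str) -> Path:
    """Build the expected database path for one language (illustrative helper)."""
    return CORPORA_DIR / language / f"{DOMAIN}-{language}_database.json"


for lang in LANGUAGES:
    # as_posix() prints forward slashes on every platform
    print(lang, database_path(lang).as_posix())
```

Keeping the path logic in one helper would avoid rebuilding the same f-string in several cells, though the notebook itself constructs the path inline.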


### STEP 2: LOAD AND EXAMINE DATABASE STRUCTURE

In [2]:
### STEP 2: LOAD AND EXAMINE DATABASE STRUCTURE

def load_database_file(language):
    """Load the database file for a specific language"""
    database_path = Path(CORPORA_DIR) / language / f"{DOMAIN}-{language}_database.json"
    
    if not database_path.exists():
        raise FileNotFoundError(f"Database file not found: {database_path}")
    
    with open(database_path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Load databases for all languages
databases = {}
all_documents = {}  # Store all documents for easy access

for language in LANGUAGES:
    try:
        print(f"🔍 Loading {language.upper()} database...")
        databases[language] = load_database_file(language)
        
        # Show database structure
        db = databases[language]
        print(f"📊 {language.upper()} database contains {len(db)} top-level keys:")
        for key in list(db.keys())[:5]:  # Show first 5 keys
            print(f"  • {key}")
        
        # Count actual documents (excluding template)
        # Handle mixed structure: check both nested 'documents' and top-level keys
        doc_prefix = f'{DOMAIN}-{language}_item'  # e.g. 'gai-eng_item'
        documents = {}
        
        # First, get documents from nested 'documents' key
        if 'documents' in db and isinstance(db['documents'], dict):
            nested_docs = {k: v for k, v in db['documents'].items() if k.startswith(doc_prefix)}
            documents.update(nested_docs)
        
        # Also check for direct top-level document keys; the prefix test already
        # excludes the template entry and the 'documents' container itself
        top_level_docs = {k: v for k, v in db.items() if k.startswith(doc_prefix)}
        documents.update(top_level_docs)
        
        # Store documents for this language
        all_documents[language] = documents
        
        print(f"📚 Found {len(documents)} documents in {language.upper()} database")
        
        # Show ALL document IDs (not just sample)
        if documents:
            print(f"  All document IDs:")
            for i, (doc_id, doc_data) in enumerate(documents.items(), 1):
                title = doc_data.get('document_metadata', {}).get('title', 'No title')
                print(f"    {i:2d}. {doc_id}: {title[:60]}{'...' if len(title) > 60 else ''}")
        
        print()  # Add blank line between languages
        
    except FileNotFoundError as e:
        print(f"❌ Error loading {language.upper()} database: {e}")
        print(f"Expected location: {Path(CORPORA_DIR) / language / f'{DOMAIN}-{language}_database.json'}")
        print()

print(f"✅ Successfully loaded databases for: {', '.join(databases.keys())}")

# Calculate correct total documents using the stored documents
total_documents = sum(len(docs) for docs in all_documents.values())
print(f"📊 Total documents across all languages: {total_documents}")

# Show breakdown by language
print(f"\n📊 Document count breakdown:")
for language, docs in all_documents.items():
    print(f"  • {language.upper()}: {len(docs)} documents")

# Show expected file locations for all languages
print(f"\n📂 Expected database file locations:")
for language in LANGUAGES:
    expected_path = Path(CORPORA_DIR) / language / f"{DOMAIN}-{language}_database.json"
    print(f"  • {language.upper()}: {expected_path}")

# Optional: Create a summary dictionary for easy access
document_summary = {
    'total_count': total_documents,
    'by_language': {lang: len(docs) for lang, docs in all_documents.items()},
    'all_documents': all_documents
}

print(f"\n📋 Summary:")
print(f"Total documents loaded: {document_summary['total_count']}")
for lang, count in document_summary['by_language'].items():
    print(f"  {lang.upper()}: {count} documents")

🔍 Loading ENG database...
📊 ENG database contains 1 top-level keys:
  • documents
📚 Found 4 documents in ENG database
  All document IDs:
     1. gai-eng_item001: Attention is All You Need
     2. gai-eng_item002: On the Dangers of Stochastic Parrots: Can Language Models Be...
     3. gai-eng_item003: Recommendation on the Ethics of Artificial Intelligence
     4. gai-eng_item004: The Age of AI has begun

🔍 Loading ESP database...
📊 ESP database contains 1 top-level keys:
  • documents
📚 Found 2 documents in ESP database
  All document IDs:
     1. gai-esp_item001: Propuesta de Agenda Nacional de la Inteligencia Artificial p...
     2. gai-esp_item002: Conversando con una computadora: ¿Cómo entienden las intelig...

✅ Successfully loaded databases for: eng, esp
📊 Total documents across all languages: 6

📊 Document count breakdown:
  • ENG: 4 documents
  • ESP: 2 documents

📂 Expected database file locations:
  • ENG: ..\corpora\gai\eng\gai-eng_database.json
  • ESP: ..\corpora\gai\esp\
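
The loader in this step accepts two database layouts: documents nested under a `documents` key, and documents stored directly at the top level. The sketch below reproduces that filtering on a toy dictionary. The function name `collect_documents` and the toy titles are made up for illustration; only the ID prefix pattern is taken from the cell above.

```python
# Illustrative sketch: gather document entries from a database dict that may
# keep them under 'documents', at the top level, or both.
def collect_documents(db: dict, domain: str, language: str) -> dict:
    prefix = f"{domain}-{language}_item"
    documents = {}

    # Layout 1: entries nested under the 'documents' key
    nested = db.get("documents")
    if isinstance(nested, dict):
        documents.update({k: v for k, v in nested.items() if k.startswith(prefix)})

    # Layout 2: entries stored as top-level keys (template keys never match the prefix)
    documents.update({k: v for k, v in db.items() if k.startswith(prefix)})
    return documents


# Toy data only; the titles are placeholders, not corpus content
toy_db = {
    "gai_template": {"document_metadata": {"title": "TEMPLATE"}},
    "documents": {
        "gai-eng_item001": {"document_metadata": {"title": "Toy title A"}},
    },
    "gai-eng_item002": {"document_metadata": {"title": "Toy title B"}},
}

found = collect_documents(toy_db, "gai", "eng")
print(sorted(found))
```

If the same ID appears in both layouts, the top-level entry wins because it is applied last, which matches the order of the `update` calls in the notebook cell.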

### STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)

In [3]:
### STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION) - HELPER FUNCTIONS

# Helper Functions (for use in step 3 and later)

def load_corpus_item(language, doc_id):
    """
    Load both the metadata and content data for a specific document

    Args:
        language: Language code ('eng', 'esp', etc.)
        doc_id: Document ID (e.g., 'gai-eng_item004')

    Returns:
        Dictionary containing both metadata and content
    """
    # Load metadata from database
    if language not in all_documents:
        raise ValueError(f"Language '{language}' not found in loaded databases")

    if doc_id not in all_documents[language]:
        raise ValueError(f"Document '{doc_id}' not found in {language} database")

    metadata = all_documents[language][doc_id]

    # Load content from separate content file
    content_file_path = Path(CORPORA_DIR) / language / 'processed' / f'{doc_id}.json'

    if not content_file_path.exists():
        raise FileNotFoundError(f"Content file not found: {content_file_path}")

    try:
        with open(content_file_path, 'r', encoding='utf-8') as f:
            content_data = json.load(f)
    except Exception as e:
        # Chain the original exception so the traceback keeps the root cause
        raise ValueError(f"Error loading content file {content_file_path}: {e}") from e

    # Merge metadata and content
    merged_data = {
        'document_metadata': metadata.get('document_metadata', {}),
        'processing_metadata': metadata.get('processing_metadata', {}),
        'document_id': content_data.get('document_id', doc_id),
        'content': content_data.get('content', {})
    }

    return merged_data

def get_document_metadata(databases_dict, language, doc_id):
    """
    Extract metadata for a specific document

    Args:
        databases_dict: The databases dictionary (not used but kept for compatibility)
        language: Language code
        doc_id: Document ID

    Returns:
        Document metadata dictionary
    """
    if language not in all_documents:
        raise ValueError(f"Language '{language}' not found in loaded databases")

    if doc_id not in all_documents[language]:
        raise ValueError(f"Document '{doc_id}' not found in {language} database")

    document_data = all_documents[language][doc_id]
    return document_data.get('document_metadata', {})

def merge_content_and_metadata(content_data, metadata):
    """
    Merge document content and metadata into a single structure

    Args:
        content_data: Full document data (already merged from load_corpus_item)
        metadata: Document metadata (usually already part of content_data)

    Returns:
        Merged document structure ready for processing
    """
    # The content_data from load_corpus_item already contains everything we need,
    # but we'll ensure metadata is properly merged.
    # dict.copy() is shallow, so 'document_metadata' is rebuilt as a new dict
    # below; updating it in place would also modify the caller's content_data.
    merged = content_data.copy()

    # Existing metadata first, then any additional metadata on top
    merged['document_metadata'] = {
        **merged.get('document_metadata', {}),
        **metadata
    }

    return merged

print("✅ Helper functions loaded and ready to use!")

✅ Helper functions loaded and ready to use!
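
One detail worth keeping in mind when merging nested dictionaries: `dict.copy()` is shallow, so an in-place `update` on a nested dictionary of the copy also changes the original. The sketch below shows a merge that leaves its input untouched. The function name `merge_without_mutation` and the `text_type` value are illustrative assumptions, not part of the notebook.

```python
import copy


def merge_without_mutation(content_data: dict, metadata: dict) -> dict:
    """Return a merged document dict while leaving content_data unchanged."""
    merged = dict(content_data)  # new top-level dict
    # Build a fresh nested dict instead of updating the shared one in place
    merged["document_metadata"] = {
        **content_data.get("document_metadata", {}),
        **metadata,
    }
    return merged


original = {
    "document_id": "gai-eng_item004",
    "document_metadata": {"title": "The Age of AI has begun"},
    "content": {},
}
snapshot = copy.deepcopy(original)

# 'example type' is a placeholder value, not real corpus metadata
merged = merge_without_mutation(original, {"text_type": "example type"})

print(original == snapshot)            # the input is unchanged
print(merged["document_metadata"])
```

This matters in a notebook because `all_documents` stays in memory across cells, so an accidental in-place change would persist into later steps.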


In [4]:
### STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)

import json
from pathlib import Path

def load_existing_vector_files_comprehensive(vector_dir='../vectors/gai'):
    """
    Comprehensively load all existing vector files and extract processed document IDs
    Returns detailed information about what's already processed
    """
    vector_dir = Path(vector_dir)
    
    processed_info = {
        'document_ids': set(),
        'section_ids': set(),
        'paragraph_ids': set(),
        'files_found': {},
        'total_vectors': 0
    }
    
    # Define the three vector files we expect
    vector_files = {
        'documents': 'gai-document-vectors.json',
        'sections': 'gai-section-vectors.json', 
        'paragraphs': 'gai-paragraph-vectors.json'
    }
    
    print(f"🔍 CHECKING EXISTING VECTORS IN: {vector_dir}")
    
    for vector_type, filename in vector_files.items():
        filepath = vector_dir / filename
        
        if filepath.exists():
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                
                # Handle both new schema and legacy formats
                vectors = data.get('vectors', data.get('vector_data', []))
                
                processed_info['files_found'][vector_type] = {
                    'path': str(filepath),
                    'count': len(vectors),
                    'size_mb': filepath.stat().st_size / (1024*1024)
                }
                
                # Extract IDs based on vector type
                id_set_key = {
                    'documents': 'document_ids',
                    'sections': 'section_ids',
                    'paragraphs': 'paragraph_ids'
                }[vector_type]
                for vector in vectors:
                    processed_info[id_set_key].add(vector.get('id', ''))
                    if vector_type != 'documents':
                        # Also track the parent document
                        doc_id = vector.get('document_id', '')
                        if doc_id:
                            processed_info['document_ids'].add(doc_id)
                
                processed_info['total_vectors'] += len(vectors)
                
                print(f"  ✅ {vector_type.title()}: {len(vectors)} vectors found")
                
            except Exception as e:
                print(f"  ❌ Error reading {filepath}: {e}")
                processed_info['files_found'][vector_type] = {'error': str(e)}
        else:
            print(f"  📝 {vector_type.title()}: No existing file found")
            processed_info['files_found'][vector_type] = {'status': 'not_found'}
    
    # Remove empty strings from sets
    processed_info['document_ids'].discard('')
    processed_info['section_ids'].discard('')
    processed_info['paragraph_ids'].discard('')
    
    print(f"\n📊 EXISTING VECTORS SUMMARY:")
    print(f"  • Unique documents with vectors: {len(processed_info['document_ids'])}")
    print(f"  • Total sections: {len(processed_info['section_ids'])}")
    print(f"  • Total paragraphs: {len(processed_info['paragraph_ids'])}")
    print(f"  • Grand total vectors: {processed_info['total_vectors']}")
    
    return processed_info

def find_truly_unprocessed_documents(all_documents, processed_info):
    """
    Find documents that have NO vectors at all (completely unprocessed)
    More robust than just checking document-level vectors
    """
    unprocessed = []
    processed_doc_ids = processed_info['document_ids']
    
    print(f"\n🔍 IDENTIFYING UNPROCESSED DOCUMENTS:")
    
    for language, docs in all_documents.items():
        for doc_id, doc_data in docs.items():
            if doc_id not in processed_doc_ids:
                # Get document info
                doc_metadata = doc_data.get('document_metadata', {})
                processing_metadata = doc_data.get('processing_metadata', {})
                
                unprocessed.append({
                    'document_id': doc_id,
                    'language': language,
                    'title': doc_metadata.get('title', 'No title'),
                    'text_type': doc_metadata.get('text_type', 'Unknown'),
                    'word_count': processing_metadata.get('word_count', 0),
                    'language_family': doc_metadata.get('language_family', 'unknown'),
                    'language_variant': doc_metadata.get('language_variant', 'unknown')
                })
                
                print(f"  📄 {doc_id} - Not processed")
            else:
                print(f"  ✅ {doc_id} - Already processed")
    
    return unprocessed

def load_documents_for_batch(documents_to_process):
    """
    Load the actual document content for processing
    Uses your existing load_corpus_item() helper function
    """
    loaded_documents = []
    
    print(f"\n📖 LOADING DOCUMENT CONTENT:")
    
    for doc_info in documents_to_process:
        doc_id = doc_info['document_id']
        language = doc_info['language']
        
        print(f"  Loading {doc_id}...")
        
        try:
            # Use your existing helper function
            document = load_corpus_item(language, doc_id)
            
            # Add the processing info
            document['processing_info'] = doc_info
            
            loaded_documents.append(document)
            print(f"    ✅ Success")
            
        except Exception as e:
            print(f"    ❌ Error: {e}")
    
    return loaded_documents

# Run the comprehensive check
existing_vectors_info = load_existing_vector_files_comprehensive()

# Find unprocessed documents using the comprehensive info
documents_to_process = find_truly_unprocessed_documents(all_documents, existing_vectors_info)

print(f"\n📊 FINAL PROCESSING STATUS:")
print(f"  • Total documents in database: {sum(len(docs) for docs in all_documents.values())}")
print(f"  • Documents with existing vectors: {len(existing_vectors_info['document_ids'])}")
print(f"  • Documents needing processing: {len(documents_to_process)}")

if documents_to_process:
    print(f"\n📝 DOCUMENTS TO PROCESS ({len(documents_to_process)} total):")
    for i, doc in enumerate(documents_to_process, 1):
        print(f"  {i:2d}. {doc['document_id']} ({doc['language'].upper()})")
        print(f"      {doc['title'][:50]}{'...' if len(doc['title']) > 50 else ''}")
        print(f"      Type: {doc['text_type']} | Words: {doc['word_count']:,}")
    
    # Load the documents for processing
    loaded_docs = load_documents_for_batch(documents_to_process)
    
    print(f"\n✅ LOADING COMPLETE:")
    print(f"  📦 Successfully loaded: {len(loaded_docs)} documents")
    
    if len(loaded_docs) != len(documents_to_process):
        print(f"  ⚠️  Failed to load: {len(documents_to_process) - len(loaded_docs)} documents")
else:
    print(f"\n🎉 ALL DOCUMENTS ALREADY PROCESSED!")
    print(f"    No new vectorization needed.")
    loaded_docs = []

print(f"\n✅ STEP 3 COMPLETE - Ready for vectorization functions in STEP 5!")

🔍 CHECKING EXISTING VECTORS IN: ..\vectors\gai
  ✅ Documents: 1 vectors found
  ✅ Sections: 22 vectors found
  ✅ Paragraphs: 70 vectors found

📊 EXISTING VECTORS SUMMARY:
  • Unique documents with vectors: 1
  • Total sections: 22
  • Total paragraphs: 70
  • Grand total vectors: 93

🔍 IDENTIFYING UNPROCESSED DOCUMENTS:
  ✅ gai-eng_item001 - Already processed
  📄 gai-eng_item002 - Not processed
  📄 gai-eng_item003 - Not processed
  📄 gai-eng_item004 - Not processed
  📄 gai-esp_item001 - Not processed
  📄 gai-esp_item002 - Not processed

📊 FINAL PROCESSING STATUS:
  • Total documents in database: 6
  • Documents with existing vectors: 1
  • Documents needing processing: 5

📝 DOCUMENTS TO PROCESS (5 total):
   1. gai-eng_item002 (ENG)
      On the Dangers of Stochastic Parrots: Can Language...
      Type: Academic paper | Words: 15,544
   2. gai-eng_item003 (ENG)
      Recommendation on the Ethics of Artificial Intelli...
      Type: Policy document | Words: 14,571
   3. gai-eng_item004 
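
The skip logic in this step comes down to a set-membership test: a document counts as processed if its ID appears in any vector file, either as a document vector's `id` or as the `document_id` of a section or paragraph vector. A compact sketch under those assumptions follows. The function names and the `sec-a` section ID are hypothetical; the document IDs mirror the output above.

```python
def processed_document_ids(document_vectors, child_vectors):
    """Collect IDs of documents that already have at least one vector."""
    ids = {v.get("id", "") for v in document_vectors}
    # Section and paragraph vectors point back at their parent document
    ids |= {v.get("document_id", "") for v in child_vectors}
    ids.discard("")
    return ids


def pending_documents(all_documents, processed_ids):
    """List (language, doc_id) pairs that still need vectorization."""
    return [
        (language, doc_id)
        for language, docs in all_documents.items()
        for doc_id in docs
        if doc_id not in processed_ids
    ]


all_documents = {
    "eng": {f"gai-eng_item{n:03d}": {} for n in range(1, 5)},
    "esp": {f"gai-esp_item{n:03d}": {} for n in range(1, 3)},
}
document_vectors = [{"id": "gai-eng_item001"}]
child_vectors = [{"id": "sec-a", "document_id": "gai-eng_item001"}]  # 'sec-a' is a made-up ID

done = processed_document_ids(document_vectors, child_vectors)
pending = pending_documents(all_documents, done)
print(len(pending), pending[0])
```

Tracking parent IDs from section and paragraph vectors means a document whose document-level vector is missing but whose sections were written is still treated as processed, which is the behaviour the notebook cell implements.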

## =============================================================================

## Part 2: Function and Model Setup

### STEP 4: FLEXIBLE TEXT EXTRACTION FUNCTIONS
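
Before the full extraction functions, it may help to see the ID scheme that `extract_section_content_flexible` uses for nested sections: a top-level section keeps its own ID, and each nested section appends the last underscore-separated part of its ID to the parent's full ID. The sketch below isolates only that rule. The function name `flatten_section_ids` and the toy IDs are illustrative assumptions, not the corpus's real IDs.

```python
def flatten_section_ids(section, parent_id="", level=0):
    """Return (full_section_id, level) pairs for a section tree, depth first."""
    section_id = section.get("id", "unknown")
    if level == 0:
        full_id = section_id
    else:
        # Same composition rule as the extraction function in this step
        full_id = f"{parent_id}_{section_id.split('_')[-1]}"

    pairs = [(full_id, level)]
    for child in section.get("subsections", []):
        pairs.extend(flatten_section_ids(child, full_id, level + 1))
    return pairs


# Toy tree with made-up IDs
toy_section = {
    "id": "sec_1",
    "subsections": [
        {"id": "sub_1", "subsections": [{"id": "subsub_2"}]},
        {"id": "sub_2"},
    ],
}

for full_id, level in flatten_section_ids(toy_section):
    print("  " * level + full_id)
```

Because only the final ID segment is kept, two siblings whose IDs end in the same segment would collide; whether that can happen depends on how the corpus assigns IDs.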

In [5]:
### STEP 4a: FLEXIBLE TEXT EXTRACTION FUNCTIONS
### FROM: create_vectors_batch.ipynb

# COMPREHENSIVE EXTRACTION FUNCTIONS

def extract_text_from_any_element(element, element_type="text"):
    """
    Flexible text extraction from any element type
    
    Handles:
    - Regular text fields
    - Terms with definitions
    - Figures with captions
    - Lists with items
    - References
    - Citations with various formats
    """
    if isinstance(element, dict):
        # Handle different element types
        if element_type == "term":
            term = element.get('term', '')
            definition = element.get('definition', '')
            return f"{term}: {definition}"
        
        elif element_type == "figure":
            caption = element.get('caption', '')
            description = element.get('description', '')
            # Handle figures that have paragraphs
            paragraph_texts = []
            if 'paragraphs' in element:
                for para in element.get('paragraphs', []):
                    para_text = extract_text_from_any_element(para)
                    if para_text:
                        paragraph_texts.append(para_text)
            
            all_texts = [caption, description] + paragraph_texts
            return ' '.join([t for t in all_texts if t])
        
        elif element_type == "reference":
            return element.get('text', '')
        
        elif element_type == "citation":
            # Handle both reference_id (string) and reference_ids (array)
            marker = element.get('marker', '')
            if 'reference_id' in element:
                ref_id = element.get('reference_id')
                if isinstance(ref_id, list):
                    # str() guards against non-string IDs such as integers
                    return f"[{marker}: {', '.join(map(str, ref_id))}]"
                else:
                    return f"[{marker}: {ref_id}]"
            elif 'reference_ids' in element:
                ref_ids = element.get('reference_ids', [])
                if ref_ids:  # Check if not empty
                    return f"[{marker}: {', '.join(map(str, ref_ids))}]"
                else:
                    return f"[{marker}]"
            else:
                # No reference fields at all: contribute no text
                return ""
        
        else:
            # Default: try to get 'text' field
            return element.get('text', '')
    
    elif isinstance(element, str):
        return element
    
    return ""

def extract_paragraph_text_flexible(paragraph):
    """
    Extract text from a paragraph, handling various embedded elements
    """
    if not isinstance(paragraph, dict):
        return "", ""
    
    paragraph_id = paragraph.get('id', 'unknown')
    texts = []
    
    # Main paragraph text
    main_text = paragraph.get('text', '')
    if main_text:
        texts.append(main_text)
    
    # Handle lists (like in journal articles)
    if 'list' in paragraph:
        for item in paragraph.get('list', []):
            item_text = extract_text_from_any_element(item)
            if item_text:
                texts.append(f"• {item_text}")
    
    # Handle footnotes (entries without text are skipped, as with equations below)
    if 'footnotes' in paragraph:
        for footnote in paragraph.get('footnotes', []):
            marker = footnote.get('marker', '')
            text = footnote.get('text', '')
            if text:
                texts.append(f"[Footnote {marker}]: {text}")
    
    # Handle citations
    if 'citations' in paragraph:
        citation_texts = []
        for citation in paragraph.get('citations', []):
            cite_text = extract_text_from_any_element(citation, 'citation')
            if cite_text:
                citation_texts.append(cite_text)
        if citation_texts:
            texts.append(f"Citations: {' '.join(citation_texts)}")
    
    # Handle external links (NEW - for gai-eng_item004 structure)
    if 'external_links' in paragraph:
        link_texts = []
        for link in paragraph.get('external_links', []):
            marker = link.get('marker', '')
            url = link.get('url', '')
            if marker and url:
                link_texts.append(f"[Link: {marker} -> {url}]")
            elif marker:
                link_texts.append(f"[Link: {marker}]")
        if link_texts:
            texts.append(f"External links: {' '.join(link_texts)}")
    
    # Handle inline equations
    if 'inline_equations' in paragraph:
        for eq in paragraph.get('inline_equations', []):
            eq_text = eq.get('text', '')
            if eq_text:
                texts.append(f"[Equation: {eq_text}]")
    
    # Handle internal references
    if 'internal_references' in paragraph:
        for ref in paragraph.get('internal_references', []):
            ref_text = ref.get('text', '')
            # Handle both field name variations
            target = ref.get('target_section_id', ref.get('target_id', ''))
            if ref_text:
                texts.append(f"[Ref: {ref_text} -> {target}]")
    
    # Handle any embedded paragraphs (for complex structures)
    if 'paragraphs' in paragraph:
        for sub_para in paragraph.get('paragraphs', []):
            _, sub_text = extract_paragraph_text_flexible(sub_para)
            if sub_text:
                texts.append(sub_text)
    
    # Combine all text
    full_text = ' '.join(texts)
    
    # Clean text
    full_text = extract_clean_text(full_text)
    
    return paragraph_id, full_text

def extract_section_content_flexible(section, parent_id="", level=0, corpus_item=None):
    """
    Flexibly extract content from sections with various structures
    """
    section_id = section.get('id', 'unknown')
    # Handle both 'title' and 'group' fields
    section_title = section.get('title', section.get('group', ''))
    section_type = section.get('type', 'standard')  # For special section types

    # Get document_id from corpus_item
    document_id = 'unknown'
    if corpus_item:
        document_id = corpus_item.get('document_id', 'unknown')
    
    # Build section ID
    if level == 0:
        full_section_id = section_id
    else:
        full_section_id = f"{parent_id}_{section_id.split('_')[-1]}"
    
    section_texts = []
    all_paragraphs = []
    subsection_results = []
    special_content = {}
    
    # Add section title if present
    if section_title:
        section_texts.append(section_title)
    
    # Handle direct text on section (some structures have this)
    if 'text' in section and isinstance(section['text'], str):
        section_texts.append(section['text'])
    
    # Extract paragraphs
    for paragraph in section.get('paragraphs', []):
        para_id, para_text = extract_paragraph_text_flexible(paragraph)
        if para_text:
            section_texts.append(para_text)
            all_paragraphs.append({
                'id': para_id,
                'text': para_text,
                'section_id': full_section_id,
                'section_title': section_title,
                'document_id': document_id, 
                'level': level
            })
    
    # Handle terms (for glossaries)
    if 'terms' in section:
        special_content['terms'] = []
        for term in section.get('terms', []):
            term_text = extract_text_from_any_element(term, 'term')
            if term_text:
                section_texts.append(term_text)
                special_content['terms'].append({
                    'id': term.get('id', ''),
                    'text': term_text
                })
    
    # Handle references
    if 'references' in section:
        special_content['references'] = []
        for ref in section.get('references', []):
            ref_text = extract_text_from_any_element(ref, 'reference')
            if ref_text:
                section_texts.append(ref_text)
                special_content['references'].append({
                    'id': ref.get('id', ''),
                    'text': ref_text
                })
    
    # Handle figures
    if 'figures' in section:
        special_content['figures'] = []
        for figure in section.get('figures', []):
            fig_text = extract_text_from_any_element(figure, 'figure')
            if fig_text:
                section_texts.append(fig_text)
                special_content['figures'].append({
                    'id': figure.get('id', ''),
                    'text': fig_text
                })
    
    # Handle tables
    if 'tables' in section:
        special_content['tables'] = []
        for table in section.get('tables', []):
            caption = table.get('caption', '')
            if caption:
                section_texts.append(f"Table: {caption}")
                special_content['tables'].append({
                    'id': table.get('id', ''),
                    'caption': caption
                })
    
    # Process nested sections recursively; some documents use 'subsubsections'
    # for deeper nesting, and both keys are handled the same way
    nested_sections = section.get('subsections', []) + section.get('subsubsections', [])
    for subsection in nested_sections:
        subsection_content = extract_section_content_flexible(
            subsection, 
            full_section_id, 
            level + 1,
            corpus_item
        )
        subsection_results.append(subsection_content)
        section_texts.append(subsection_content['section_text'])
        all_paragraphs.extend(subsection_content['all_paragraphs'])
    
    # Combine all text
    combined_section_text = ' '.join(section_texts)
    
    return {
        'section_id': full_section_id,
        'section_title': section_title,
        'section_type': section_type,
        'section_text': combined_section_text,
        'document_id': document_id,
        'subsections': subsection_results,
        'all_paragraphs': all_paragraphs,
        'special_content': special_content,
        'level': level
    }

def extract_document_content_flexible(corpus_item):
    """
    Extract all content from a document with flexible structure
    """
    print(f"📄 Extracting content from: {corpus_item.get('document_id', 'Unknown')}")
    
    document_texts = []
    all_sections = []
    all_paragraphs = []
    special_elements = {
        'figures': [],
        'tables': [],
        'equations': [],
        'pull_quotes': [],
        'external_links': [],
        'terms': [],
        'references': []
    }
    
    # Extract metadata
    metadata = corpus_item.get('document_metadata', {})
    doc_id = corpus_item.get('document_id', 'unknown')
    title = metadata.get('title', '')
    language_family = metadata.get('language_family', 'unknown')
    language_variant = metadata.get('language_variant', 'unknown')
    language = f"{language_family}-{language_variant}"
    text_type = metadata.get('text_type', 'unknown')
    
    # Add title
    if title:
        document_texts.append(title)
    
    # Extract content
    content = corpus_item.get('content', {})
    
    # Check for abstract (not all documents have this)
    abstract = content.get('abstract', '')
    if abstract:
        clean_abstract = extract_clean_text(abstract)
        document_texts.append(clean_abstract)
        all_paragraphs.append({
            'id': f"{doc_id}_abstract",
            'text': clean_abstract,
            'section_id': 'abstract',
            'section_title': 'Abstract',
            'document_id': doc_id,
            'level': -1
        })
    
    # Helper: flatten a section tree into all_sections and gather special content
    # (defined once here rather than redefined on every loop iteration)
    def collect_sections(section_data):
        all_sections.append({
            'id': section_data['section_id'],
            'title': section_data['section_title'],
            'type': section_data['section_type'],
            'text': section_data['section_text'],
            'document_id': section_data['document_id'],
            'level': section_data['level']
        })
        
        # Collect special content
        for content_type, items in section_data['special_content'].items():
            if items:
                special_elements[content_type].extend(items)
        
        for subsection in section_data['subsections']:
            collect_sections(subsection)
    
    # Process all sections
    for section in content.get('sections', []):
        section_content = extract_section_content_flexible(
            section, 
            "", 
            0,
            corpus_item
        )
        
        # Add to document text
        document_texts.append(section_content['section_text'])
        
        # Collect this section and all of its subsections
        collect_sections(section_content)
        all_paragraphs.extend(section_content['all_paragraphs'])
    
    # Extract top-level figures (if present)
    for figure in content.get('figures', []):
        fig_text = extract_text_from_any_element(figure, 'figure')
        if fig_text:
            special_elements['figures'].append({
                'id': figure.get('id', ''),
                'text': fig_text
            })
            document_texts.append(fig_text)
    
    # Extract top-level tables (if present)
    for table in content.get('tables', []):
        caption = table.get('caption', '')
        if caption:
            special_elements['tables'].append({
                'id': table.get('id', ''),
                'caption': caption
            })
            document_texts.append(f"Table: {caption}")
    
    # Extract top-level equations (if present)
    for equation in content.get('equations', []):
        eq_id = equation.get('id', '')
        latex = equation.get('latex', '')
        if latex:
            special_elements['equations'].append({
                'id': eq_id,
                'latex': latex
            })
            document_texts.append(f"Equation {eq_id}: {latex}")
    
    # Extract top-level references (if present)
    for reference in content.get('references', []):
        ref_text = reference.get('text', '')
        if ref_text:
            special_elements['references'].append({
                'id': reference.get('id', ''),
                'text': ref_text
            })
            document_texts.append(ref_text)
    
    # Extract pull quotes (if present)
    for quote in content.get('pull_quotes', []):
        quote_text = quote.get('text', '')
        if quote_text:
            special_elements['pull_quotes'].append({
                'id': quote.get('id', ''),
                'text': quote_text
            })
    
    # Combine all text
    full_document_text = ' '.join(document_texts)
    
    # Calculate statistics
    stats = {
        'total_sections': len(all_sections),
        'total_paragraphs': len(all_paragraphs),
        'document_length': len(full_document_text),
        'text_type': text_type,
        'has_abstract': bool(abstract),
        'has_figures': len(special_elements['figures']) > 0,
        'has_tables': len(special_elements['tables']) > 0,
        'has_equations': len(special_elements['equations']) > 0,
        'has_terms': len(special_elements['terms']) > 0,
        'has_references': len(special_elements['references']) > 0,
        'has_pull_quotes': len(special_elements['pull_quotes']) > 0
    }
    
    print(f"  ✅ Extracted: {stats['total_sections']} sections, {stats['total_paragraphs']} paragraphs")
    print(f"  📊 Document type: {text_type}")
    print(f"  🎯 Special elements: " + ', '.join([k for k, v in special_elements.items() if v]))
    
    return {
        'document_id': doc_id,
        'title': title,
        'language': language,
        'text_type': text_type,
        'document_text': full_document_text,
        'sections': all_sections,
        'paragraphs': all_paragraphs,
        'special_elements': special_elements,
        'statistics': stats
    }

# Helper function to clean text (used above for the abstract)
def extract_clean_text(text):
    """
    Clean text by removing extra whitespaces, normalizing quotes, etc.
    """
    if not text:
        return ""
    
    # Replace multiple spaces with single space
    text = ' '.join(text.split())
    
    # Normalize curly quotes to straight quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

print("✅ Flexible extraction functions defined and ready to use!")

✅ Flexible extraction functions defined and ready to use!


Potential minor improvements for future iterations:

- equation_numbers arrays in paragraphs are not explicitly handled by the current code
- external_links in some paragraphs are not currently extracted
- Only table captions are extracted, not the full table content (rows/columns)
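
The table limitation noted above could be addressed by flattening each table into text before vectorization. The sketch below is a minimal illustration, assuming a hypothetical table layout with `headers` and `rows` keys alongside the `caption` field the current code already reads. The actual corpus schema may differ, and `table_to_text` is not part of this notebook.

```python
from typing import Any, Dict, List


def table_to_text(table: Dict[str, Any]) -> str:
    """Flatten a table dict into one text string suitable for embedding.

    Assumes hypothetical 'headers' (list of str) and 'rows' (list of lists)
    keys; only 'caption' is confirmed by the existing extraction code.
    """
    lines: List[str] = []

    caption = table.get('caption', '')
    if caption:
        lines.append(f"Table: {caption}")

    headers = [str(h) for h in table.get('headers', [])]
    for row in table.get('rows', []):
        cells = [str(cell) for cell in row]
        if headers and len(headers) == len(cells):
            # Pair each cell with its column header, e.g. "Model: A; Score: 0.9"
            lines.append('; '.join(f"{h}: {c}" for h, c in zip(headers, cells)))
        else:
            # Fall back to a plain cell listing when headers are missing or mismatched
            lines.append('; '.join(cells))

    return ' | '.join(lines)


# Illustrative data only, not taken from the corpus
example_table = {
    'id': 'table_1',
    'caption': 'Evaluation results',
    'headers': ['Model', 'Score'],
    'rows': [['A', 0.9], ['B', 0.8]],
}
print(table_to_text(example_table))
# → Table: Evaluation results | Model: A; Score: 0.9 | Model: B; Score: 0.8
```

Pairing each cell with its header keeps column meaning attached to the value, which tends to matter more for sentence embeddings than preserving the grid layout.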

In [6]:
### STEP 4b: FLEXIBLE TEXT EXTRACTION FUNCTIONS
### FROM: create_vectors_batch.ipynb

# TEST AND VERIFY EXTRACTION

def verify_extraction_details(extracted_data, show_full_text=False):
    """
    Print detailed information about what was extracted from the document
    """
    print(f"\n🔍 DETAILED EXTRACTION VERIFICATION")
    print(f"=" * 60)
    
    # Basic stats
    print(f"📋 Document ID: {extracted_data['document_id']}")
    print(f"📑 Title: {extracted_data['title']}")
    print(f"🌐 Language: {extracted_data['language']}")
    print(f"📄 Text type: {extracted_data['text_type']}")
    
    # Section details
    print(f"\n📚 SECTIONS EXTRACTED ({len(extracted_data['sections'])} total):")
    for i, section in enumerate(extracted_data['sections'], 1):
        print(f"  {i}. [{section['id']}] {section['title']}")
        if section['type'] != 'standard':
            print(f"     (Type: {section['type']})")
    
    # Paragraph details  
    print(f"\n📝 PARAGRAPHS EXTRACTED ({len(extracted_data['paragraphs'])} total):")
    for i, para in enumerate(extracted_data['paragraphs'][:5], 1):  # Show first 5
        print(f"  {i}. [{para['id']}] in section '{para['section_title']}'")
        # Show first 100 characters of paragraph text
        preview = para['text'][:100] + "..." if len(para['text']) > 100 else para['text']
        print(f"     Text: {preview}")
    
    if len(extracted_data['paragraphs']) > 5:
        print(f"     ... and {len(extracted_data['paragraphs']) - 5} more paragraphs")
    
    # Special elements details
    print(f"\n🎯 SPECIAL ELEMENTS FOUND:")
    for element_type, items in extracted_data['special_elements'].items():
        if items:
            print(f"  • {element_type.title()}: {len(items)} found")
            # Show details for some types
            if element_type == 'pull_quotes' and items:
                for quote in items:
                    print(f"    - \"{quote['text'][:80]}...\"")
            elif element_type == 'figures' and items:
                for fig in items[:3]:  # Show first 3
                    print(f"    - [{fig['id']}]: {fig['text'][:60]}...")
    
    # Document text preview
    print(f"\n📄 DOCUMENT TEXT PREVIEW:")
    print(f"  Total length: {len(extracted_data['document_text']):,} characters")
    if show_full_text:
        print(f"  Full text:\n{extracted_data['document_text']}")
    else:
        # Show first 300 characters
        preview = extracted_data['document_text'][:300] + "..." if len(extracted_data['document_text']) > 300 else extracted_data['document_text']
        print(f"  First 300 chars: {preview}")
    
    # Statistics summary
    stats = extracted_data['statistics']
    print(f"\n📊 STATISTICS:")
    print(f"  • Total sections: {stats['total_sections']}")
    print(f"  • Total paragraphs: {stats['total_paragraphs']}")
    print(f"  • Document length: {stats['document_length']:,} characters")
    print(f"  • Has abstract: {stats['has_abstract']}")
    print(f"  • Has figures: {stats['has_figures']}")
    print(f"  • Has tables: {stats['has_tables']}")
    print(f"  • Has equations: {stats['has_equations']}")
    print(f"  • Has pull quotes: {stats['has_pull_quotes']}")
    
    return True

# Test specifically with the gai-eng_item004 document
def test_gai_eng_item004():
    """
    Test the extraction with the specific document structure
    """
    print(f"\n🧪 TESTING gai-eng_item004 EXTRACTION")
    print(f"=" * 60)
    
    try:
        # Load the document (adjust path as needed)
        doc_id = "gai-eng_item004"
        lang = "eng"
        
        # Load content and metadata
        content_data = load_corpus_item(lang, doc_id)
        metadata = get_document_metadata(databases, lang, doc_id)
        corpus_item = merge_content_and_metadata(content_data, metadata)
        
        # Extract with flexible function
        extracted = extract_document_content_flexible(corpus_item)
        
        # Show detailed verification
        verify_extraction_details(extracted, show_full_text=False)
        
        # Specific checks for this document
        print(f"\n✅ SPECIFIC CHECKS FOR gai-eng_item004:")
        print(f"  • Expected 7 sections: {'✓' if len(extracted['sections']) == 7 else '✗'} (found {len(extracted['sections'])})")
        print(f"  • Expected 2 pull quotes: {'✓' if len(extracted['special_elements']['pull_quotes']) == 2 else '✗'} (found {len(extracted['special_elements']['pull_quotes'])})")
        print(f"  • Has external links in text: {'✓' if 'Link:' in extracted['document_text'] else '✗'}")
        print(f"  • Document has substantial content: {'✓' if len(extracted['document_text']) > 10000 else '✗'} ({len(extracted['document_text']):,} chars)")
        
        # Store result globally for further use
        global test_result
        test_result = extracted
        
        return extracted
        
    except Exception as e:
        print(f"  ❌ Error during extraction: {e}")
        import traceback
        traceback.print_exc()
        return None

# Run the test
print("🚀 Running extraction test...")
result = test_gai_eng_item004()

# Show section titles to verify all 7 are there
if result:
    print(f"\n📚 ALL SECTION TITLES FOUND:")
    for i, section in enumerate(result['sections'], 1):
        print(f"  {i}. {section['title']}")
        
    print(f"\n💾 Result stored in 'test_result' variable for further use")
else:
    print(f"\n❌ Test failed - check the error messages above")

🚀 Running extraction test...

🧪 TESTING gai-eng_item004 EXTRACTION
📄 Extracting content from: gai-eng_item004
  ✅ Extracted: 7 sections, 56 paragraphs
  📊 Document type: Blog post
  🎯 Special elements: pull_quotes

🔍 DETAILED EXTRACTION VERIFICATION
📋 Document ID: gai-eng_item004
📑 Title: The Age of AI has begun
🌐 Language: eng-usa
📄 Text type: Blog post

📚 SECTIONS EXTRACTED (7 total):
  1. [section_1] Introduction
  2. [section_2] Defining artificial intelligence
  3. [section_3] Productivity enhancement
  4. [section_4] Health
  5. [section_5] Education
  6. [section_6] Risks and problems with AI
  7. [section_7] The next frontiers

📝 PARAGRAPHS EXTRACTED (56 total):
  1. [p1_1] in section 'Introduction'
     Text: In my lifetime, I’ve seen two demonstrations of technology that struck me as revolutionary.
  2. [p1_2] in section 'Introduction'
     Text: The first time was in 1980, when I was introduced to a graphical user interface—the forerunner of ev...
  3. [p1_3] in section 'Int

### STEP 5: DEFINE VECTORIZATION FUNCTIONS

In [None]:
### STEP 5: DEFINE VECTORIZATION FUNCTIONS (SCHEMA-COMPLIANT)

def create_schema_compliant_metadata(model):
    """Create metadata section matching the standardized schema"""
    return {
        "model": "paraphrase-multilingual-MiniLM-L12-v2",
        "dimension": model.get_sentence_embedding_dimension(),
        "version": "1.0"
    }

def create_document_vector_schema_compliant(extracted_content, model):
    """
    Create a single document-level vector matching existing schema
    """
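    # Inputs longer than model.max_seq_length tokens are truncated by the model;
    # show_progress_bar=False avoids one "Batches" bar per encode call
    doc_vector = model.encode(extracted_content['document_text'], show_progress_bar=False)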
    
    return {
        'id': extracted_content['document_id'],
        'title': extracted_content['title'],
        'vector': doc_vector.tolist(),  # Convert numpy array to list for JSON
        'text': extracted_content['document_text'][:500]  # Limit text for storage
    }

def create_section_vectors_schema_compliant(extracted_content, model):
    """
    Create vectors for all sections matching existing schema
    """
    section_vectors = []
    
    for section in extracted_content['sections']:
        if section['text'].strip():  # Only create vectors for non-empty sections
            vector = model.encode(section['text'], show_progress_bar=False)
            
            section_vectors.append({
                'id': section['id'],
                'document_id': section['document_id'],
                'title': section['title'],
                'level': section['level'],
                'vector': vector.tolist(),  # Convert to list for JSON
                'text': section['text'][:500]  # Limit text for storage
            })
    
    return section_vectors

def create_paragraph_vectors_schema_compliant(extracted_content, model):
    """
    Create vectors for all paragraphs matching existing schema
    """
    paragraph_vectors = []
    
    for paragraph in extracted_content['paragraphs']:
        if paragraph['text'].strip():  # Only create vectors for non-empty paragraphs
            vector = model.encode(paragraph['text'], show_progress_bar=False)
            
            paragraph_vectors.append({
                'id': paragraph['id'],
                'document_id': paragraph['document_id'],
                'vector': vector.tolist(),  # Convert to list for JSON
                'text': paragraph['text'][:300]  # Limit text for storage
            })
    
    return paragraph_vectors

def append_vectors_to_schema_file(new_vectors, vector_type, model, output_dir="../vectors/gai"):
    """
    Append new vectors to existing schema-compliant files
    Creates files if they don't exist
    
    Args:
        new_vectors: List of new vector objects to append
        vector_type: 'document', 'section', or 'paragraph'
        model: SentenceTransformer model for metadata
        output_dir: Output directory path
    
    Returns:
        Dictionary with file info and statistics
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Define filename based on type
    filename_map = {
        'document': 'gai-document-vectors.json',
        'section': 'gai-section-vectors.json', 
        'paragraph': 'gai-paragraph-vectors.json'
    }
    
    if vector_type not in filename_map:
        raise ValueError(f"Invalid vector_type: {vector_type}")
    
    filepath = output_dir / filename_map[vector_type]
    
    # Load existing file or create new structure
    if filepath.exists():
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                existing_data = json.load(f)
        except Exception as e:
            print(f"  ⚠️ Error reading existing file, creating new: {e}")
            existing_data = {"metadata": {}, "vectors": []}
    else:
        existing_data = {"metadata": {}, "vectors": []}
    
    # Ensure proper structure
    if 'vectors' not in existing_data:
        existing_data['vectors'] = []
    if 'metadata' not in existing_data:
        existing_data['metadata'] = {}
    
    # Add new vectors with proper count numbering
    current_count = len(existing_data['vectors'])
    
    for i, vector in enumerate(new_vectors):
        # Ensure each vector has a count field
        vector_with_count = vector.copy()
        vector_with_count['count'] = current_count + i + 1
        vector_with_count['created'] = datetime.now().isoformat()
        
        existing_data['vectors'].append(vector_with_count)
    
    # Update metadata
    existing_data['metadata'] = create_schema_compliant_metadata(model)
    
    # Save file
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(existing_data, f, ensure_ascii=False, indent=2)
    
    return {
        'file': str(filepath),
        'new_count': len(new_vectors),
        'total_count': len(existing_data['vectors']),
        'size_mb': filepath.stat().st_size / (1024*1024)
    }

def batch_process_and_append_vectors(loaded_docs, model, output_dir="../vectors/gai"):
    """
    Process documents and append to existing vector files with schema compliance
    Only processes documents that aren't already vectorized
    
    Args:
        loaded_docs: List of loaded document objects from STEP 3
        model: SentenceTransformer model
        output_dir: Output directory for vector files
    
    Returns:
        Dictionary with processing statistics
    """
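    if not loaded_docs:
        print("📝 No documents to process - all vectors are up to date!")
        # Return the same keys as the normal path so callers can read them safely
        return {'processed': 0, 'failed': 0, 'vectors_created': 0, 'file_results': {}, 'processing_stats': {}}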
    
    print(f"🚀 Processing {len(loaded_docs)} new documents...")
    
    # Collections for new vectors only
    new_document_vectors = []
    new_section_vectors = []
    new_paragraph_vectors = []
    
    processing_stats = {
        'documents_processed': 0,
        'documents_failed': 0,
        'total_doc_vectors': 0,
        'total_section_vectors': 0,
        'total_paragraph_vectors': 0
    }
    
    # Process each document
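    for idx, corpus_item in enumerate(loaded_docs):
        # Resolve the ID before the try block so the except handler never sees
        # an unbound or stale doc_id
        doc_id = corpus_item.get('document_id', 'Unknown') if isinstance(corpus_item, dict) else 'Unknown'
        try: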
            
            # Show progress every 10 documents or for small batches
            if len(loaded_docs) <= 10 or (idx + 1) % 10 == 0 or idx == 0:
                print(f"   Processing {idx + 1}/{len(loaded_docs)}: {doc_id}")
            
            # Extract content using your existing flexible extraction
            extracted_content = extract_document_content_flexible(corpus_item)
            
            # Create vectors using schema-compliant functions
            doc_vector_data = create_document_vector_schema_compliant(extracted_content, model)
            doc_section_vectors = create_section_vectors_schema_compliant(extracted_content, model)
            doc_paragraph_vectors = create_paragraph_vectors_schema_compliant(extracted_content, model)
            
            # Add to collections
            new_document_vectors.append(doc_vector_data)
            new_section_vectors.extend(doc_section_vectors)
            new_paragraph_vectors.extend(doc_paragraph_vectors)
            
            # Update statistics
            processing_stats['documents_processed'] += 1
            processing_stats['total_doc_vectors'] += 1
            processing_stats['total_section_vectors'] += len(doc_section_vectors)
            processing_stats['total_paragraph_vectors'] += len(doc_paragraph_vectors)
            
        except Exception as e:
            print(f"   ❌ Error processing {doc_id}: {e}")
            processing_stats['documents_failed'] += 1
            continue
    
    # Append all new vectors to files
    print(f"💾 Saving vectors to files...")
    
    file_results = {}
    
    # Append document vectors
    if new_document_vectors:
        file_results['documents'] = append_vectors_to_schema_file(
            new_document_vectors, 'document', model, output_dir
        )
    
    # Append section vectors
    if new_section_vectors:
        file_results['sections'] = append_vectors_to_schema_file(
            new_section_vectors, 'section', model, output_dir
        )
    
    # Append paragraph vectors
    if new_paragraph_vectors:
        file_results['paragraphs'] = append_vectors_to_schema_file(
            new_paragraph_vectors, 'paragraph', model, output_dir
        )
    
    # Compact final statistics
    total_new_vectors = (len(new_document_vectors) + len(new_section_vectors) + len(new_paragraph_vectors))
    
    print(f"✅ Processed {processing_stats['documents_processed']} docs, created {total_new_vectors} vectors")
    if processing_stats['documents_failed'] > 0:
        print(f"⚠️  {processing_stats['documents_failed']} documents failed")
    
    return {
        'processed': processing_stats['documents_processed'],
        'failed': processing_stats['documents_failed'],
        'vectors_created': total_new_vectors,
        'file_results': file_results,
        'processing_stats': processing_stats
    }

def verify_no_duplicates(vector_dir="../vectors/gai"):
    """
    Verify that there are no duplicate document IDs in the vector files
    This ensures our "no re-processing" logic worked correctly
    """
    from pathlib import Path
    import json
    
    vector_dir = Path(vector_dir)
    
    # Check each vector file
    vector_files = [
        'gai-document-vectors.json',
        'gai-section-vectors.json', 
        'gai-paragraph-vectors.json'
    ]
    
    all_good = True
    
    for filename in vector_files:
        filepath = vector_dir / filename
        
        if filepath.exists():
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                
                vectors = data.get('vectors', [])
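                # Section and paragraph IDs (e.g. 'section_1', 'p1_1') are only unique
                # within a document, so compare (document_id, id) pairs
                ids = [(v.get('document_id', ''), v.get('id', '')) for v in vectors]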
                unique_ids = set(ids)
                
                if len(ids) != len(unique_ids):
                    print(f"❌ Duplicates found in {filename}: {len(ids)} total, {len(unique_ids)} unique")
                    all_good = False
                        
            except Exception as e:
                print(f"❌ Error checking {filename}: {e}")
                all_good = False
    
    return all_good

print("✅ Cleaned batch processing functions loaded!")
print("📋 Functions available:")
print("  • batch_process_and_append_vectors() - Clean batch processing")
print("  • verify_no_duplicates() - Silent verification function")

✅ Cleaned batch processing functions loaded!
📋 Functions available:
  • batch_process_and_append_vectors() - Clean batch processing
  • verify_no_duplicates() - Silent verification function
🔄 Reduced output for better notebook experience!
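
`append_vectors_to_schema_file()` rewrites the whole JSON file on every call and, when the existing file cannot be parsed, starts over with an empty structure. An interrupted write or a corrupt file can therefore cost previously stored vectors. One possible safeguard, sketched below, is to write to a temporary file and swap it into place with `os.replace`, and to set an unreadable file aside rather than overwrite it. The helper names `save_json_atomically` and `load_or_quarantine` are hypothetical and not part of this notebook.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_json_atomically(data: Dict[str, Any], filepath: Path) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place."""
    filepath = Path(filepath)
    filepath.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=filepath.parent, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp_name, filepath)  # Atomic rename on the same filesystem
    except BaseException:
        # Clean up the partial temp file; the original file is left untouched
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_or_quarantine(filepath: Path) -> Dict[str, Any]:
    """Load a vector file; rename it to '<name>.corrupt' if it cannot be parsed."""
    filepath = Path(filepath)
    empty = {"metadata": {}, "vectors": []}
    if not filepath.exists():
        return empty
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return json.load(f)
    except json.JSONDecodeError:
        # Keep the unreadable file for inspection instead of overwriting it
        os.replace(filepath, filepath.with_name(filepath.name + '.corrupt'))
        return empty
```

Creating the temporary file in the target directory keeps both paths on the same filesystem, which is what lets the final rename replace the old file in a single step.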


### STEP 6: INITIALIZE THE MULTILINGUAL EMBEDDING MODEL

In [8]:
# Initialize the multilingual sentence transformer model
print("🤖 Loading multilingual embedding model...")
print("This may take a moment on first run as it downloads the model (~420MB)")

try:
    # Load the model that creates our unified vector space
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    print("✅ Model loaded successfully!")
    
    # Display model information
    print(f"\n📊 Model Information:")
    print(f"  • Model name: {model._modules['0'].auto_model.config._name_or_path}")
    print(f"  • Embedding dimension: {model.get_sentence_embedding_dimension()}")
    print(f"  • Max sequence length: {model.max_seq_length}")
    print(f"  • Supports 50+ languages including English and Spanish")
    
    # Verify this matches our existing vector schema
    expected_dimension = 384
    actual_dimension = model.get_sentence_embedding_dimension()
    
    if actual_dimension == expected_dimension:
        print(f"\n✅ Model dimension ({actual_dimension}) matches existing vector files!")
    else:
        print(f"\n⚠️  WARNING: Model dimension ({actual_dimension}) differs from expected ({expected_dimension})")
        print("This may cause compatibility issues with existing vectors.")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Try running: pip install sentence-transformers")
else:
    # Only announce readiness when the model actually loaded
    print("\n🚀 Ready for batch vectorization!")

2025-05-29 01:49:24,926 - INFO - Use pytorch device_name: cpu
2025-05-29 01:49:24,927 - INFO - Load pretrained SentenceTransformer: paraphrase-multilingual-MiniLM-L12-v2


🤖 Loading multilingual embedding model...
This may take a moment on first run as it downloads the model (~420MB)
✅ Model loaded successfully!

📊 Model Information:
  • Model name: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Embedding dimension: 384
  • Max sequence length: 128
  • Supports 50+ languages including English and Spanish

✅ Model dimension (384) matches existing vector files!

🚀 Ready for batch vectorization!
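
The reported max sequence length of 128 means the model truncates longer inputs, so a vector produced by encoding a full document's text in one call reflects only the opening of that document. One common workaround is to encode the text in chunks and average the chunk vectors. The sketch below is a minimal illustration, assuming word-based chunks as a rough stand-in for tokens (100 words can still exceed 128 tokens in some languages). `split_into_word_chunks` and `encode_long_text` are hypothetical helpers, and the encoder is passed in as a callable so the logic does not depend on a specific model.

```python
from typing import Callable, List, Sequence


def split_into_word_chunks(text: str, max_words: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def encode_long_text(
    text: str,
    encode: Callable[[str], Sequence[float]],
    max_words: int = 100,
) -> List[float]:
    """Encode each chunk separately and return the unweighted mean vector."""
    chunks = split_into_word_chunks(text, max_words)
    if not chunks:
        return []
    vectors = [list(encode(chunk)) for chunk in chunks]
    dim = len(vectors[0])
    # Average each dimension across all chunk vectors
    return [sum(float(v[d]) for v in vectors) / len(vectors) for d in range(dim)]


# Toy encoder for demonstration only: [word count, constant]
toy_encode = lambda s: [float(len(s.split())), 1.0]
pooled = encode_long_text('one two three four five', toy_encode, max_words=2)
print(len(pooled))
```

With the model loaded in this step, the callable could be `lambda s: model.encode(s, show_progress_bar=False)`. The unweighted mean gives a short final chunk the same influence as a full one; weighting by chunk length is a reasonable alternative.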


## =============================================================================

## Part 3: Execution and Output

### STEP 7: BATCH PROCESS DOCUMENTS

In [9]:
### STEP 7: SMART BATCH PROCESSING (NO RE-VECTORIZATION)

# Only proceed if we have documents to process
if documents_to_process:
    print("🚀 Starting Smart Batch Vectorization")
    print(f"📋 Processing {len(documents_to_process)} new documents")
    print(f"🤖 Model: {model._modules['0'].auto_model.config._name_or_path}")
    print(f"📐 Dimension: {model.get_sentence_embedding_dimension()}")
    
    # Load the unprocessed documents
    if 'loaded_docs' not in locals():
        print("📖 Loading document content...")
        loaded_docs = load_documents_for_batch(documents_to_process)
    
    if loaded_docs:
        print(f"🔄 Processing {len(loaded_docs)} documents...")
        
        # Use the schema-compliant batch processing
        batch_results = batch_process_and_append_vectors(
            loaded_docs=loaded_docs,
            model=model,
            output_dir="../vectors/gai"
        )
        
        # Show concise results
        if batch_results['processed'] > 0:
            print(f"✅ Processed: {batch_results['processed']} docs, Created: {batch_results['vectors_created']} vectors")
            
            # Show file updates in one line per type
            if 'file_results' in batch_results:
                for vector_type, file_info in batch_results['file_results'].items():
                    print(f"   {vector_type.title()}: +{file_info['new_count']} vectors (Total: {file_info['total_count']}, {file_info['size_mb']:.1f}MB)")
            
            # Compact language and type breakdown
            lang_counts = {}
            for doc in loaded_docs:
                doc_id = doc.get('document_id', '')
                lang = doc_id.split('_')[0].split('-')[-1] if '_' in doc_id else 'unknown'
                lang_counts[lang] = lang_counts.get(lang, 0) + 1
            
            type_counts = {}
            for doc_info in documents_to_process:
                text_type = doc_info.get('text_type', 'Unknown')
                type_counts[text_type] = type_counts.get(text_type, 0) + 1
            
            # Single line summaries
            lang_summary = ", ".join([f"{lang.upper()}:{count}" for lang, count in lang_counts.items()])
            type_summary = ", ".join([f"{text_type}:{count}" for text_type, count in type_counts.items()])
            print(f"🌐 Languages: {lang_summary}")
            print(f"📄 Types: {type_summary}")
        
        if batch_results['failed'] > 0:
            print(f"⚠️ Failed: {batch_results['failed']} documents")
        
        # Quick verification
        print("🔍 Verifying integrity...", end="")
        verification_passed = verify_no_duplicates("../vectors/gai")
        print(" ✅ Passed" if verification_passed else " ❌ Failed")
        
        # Final status - compact version
        updated_vectors_info = load_existing_vector_files_comprehensive("../vectors/gai")
        total_corpus_docs = sum(len(docs) for docs in all_documents.values())
        processed_docs = len(updated_vectors_info['document_ids'])
        coverage = (processed_docs/total_corpus_docs)*100 if total_corpus_docs else 0.0
        
        print(f"📊 Corpus Status: {processed_docs}/{total_corpus_docs} docs ({coverage:.1f}% coverage)")
        
        if processed_docs == total_corpus_docs:
            print("🎉 CORPUS VECTORIZATION COMPLETE!")
        else:
            print(f"📋 Remaining: {total_corpus_docs - processed_docs} documents")
    
    else:
        print("❌ No documents could be loaded for processing")

else:
    print("🎉 NO PROCESSING NEEDED - All documents already have vectors")
    
    # Compact current status
    total_corpus_docs = sum(len(docs) for docs in all_documents.values())
    processed_docs = len(existing_vectors_info['document_ids'])
    coverage = (processed_docs/total_corpus_docs)*100 if total_corpus_docs else 0.0
    
    print(f"📊 Status: {processed_docs}/{total_corpus_docs} docs ({coverage:.1f}% coverage)")
    print(f"📊 Total vectors: {existing_vectors_info['total_vectors']:,}")
    
    # Single line per vector type
    for vector_type, file_info in existing_vectors_info['files_found'].items():
        if 'count' in file_info:
            print(f"   {vector_type.title()}: {file_info['count']} vectors ({file_info['size_mb']:.1f}MB)")

# Store final results for potential use in next steps
# Fall back to safe defaults when no batch ran or coverage was never computed
# (e.g. when no documents could be loaded)
_summary_source = batch_results if 'batch_results' in locals() else {}
final_processing_summary = {
    'new_documents_processed': _summary_source.get('processed', 0),
    'total_vectors_added': _summary_source.get('vectors_created', 0),
    'files_updated': len(_summary_source.get('file_results', {})),
    'corpus_coverage': f"{coverage:.1f}%" if 'coverage' in locals() else "unknown"
}

print(f"\n🚀 Ready for Step 9: Visualize the Vectors!")
print(f"💡 Benefits: No re-processing, schema-compliant files, smart detection, verified integrity")

🚀 Starting Smart Batch Vectorization
📋 Processing 5 new documents
🤖 Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
📐 Dimension: 384
🔄 Processing 5 documents...
🚀 Processing 5 new documents...
   Processing 1/5: gai-eng_item002
📄 Extracting content from: gai-eng_item002
  ✅ Extracted: 19 sections, 73 paragraphs
  📊 Document type: Academic paper
  🎯 Special elements: figures, tables, references


   Processing 2/5: gai-eng_item003
📄 Extracting content from: gai-eng_item003
  ✅ Extracted: 36 sections, 194 paragraphs
  📊 Document type: Policy document
  🎯 Special elements: 


   Processing 3/5: gai-eng_item004
📄 Extracting content from: gai-eng_item004
  ✅ Extracted: 7 sections, 56 paragraphs
  📊 Document type: Blog post
  🎯 Special elements: pull_quotes


   Processing 4/5: gai-esp_item001
📄 Extracting content from: gai-esp_item001
  ✅ Extracted: 23 sections, 182 paragraphs
  📊 Document type: Policy document
  🎯 Special elements: terms, references


   Processing 5/5: gai-esp_item002
📄 Extracting content from: gai-esp_item002
  ✅ Extracted: 7 sections, 40 paragraphs
  📊 Document type: Journal article
  🎯 Special elements: figures, references


💾 Saving vectors to files...
✅ Processed 5 docs, created 642 vectors
✅ Processed: 5 docs, Created: 642 vectors
   Documents: +5 vectors (Total: 6, 0.1MB)
   Sections: +92 vectors (Total: 114, 1.4MB)
   Paragraphs: +545 vectors (Total: 615, 7.1MB)
🌐 Languages: ENG:3, ESP:2
📄 Types: Academic paper:1, Policy document:2, Blog post:1, Journal article:1
🔍 Verifying integrity...
❌ Duplicates found in gai-section-vectors.json: 114 total, 68 unique
❌ Duplicates found in gai-paragraph-vectors.json: 615 total, 452 unique
❌ Failed
🔍 CHECKING EXISTING VECTORS IN: ..\vectors\gai
  ✅ Documents: 6 vectors found
  ✅ Sections: 114 vectors found
  ✅ Paragraphs: 615 vectors found

📊 EXISTING VECTORS SUMMARY:
  • Unique documents with vectors: 6
  • Total sections: 68
  • Total paragraphs: 452
  • Grand total vectors: 735
📊 Corpus Status: 6/6 docs (100.0% coverage)
🎉 CORPUS VECTORIZATION COMPLETE!

🚀 Ready for Step 9: Visualize the Vectors!
💡 Benefits: No re-processing, schema-compliant files, smart detect
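
The integrity check in the output above failed: the section file holds 114 vectors but only 68 unique entries, and the paragraph file holds 615 but only 452 unique. The log does not show why the duplicates were written, so the following is only a minimal sketch of how they could be detected and collapsed before moving on. It assumes each stored vector is a dictionary with a unique identifier under an `id` key; that key name, the sample IDs, and the helper functions `find_duplicate_ids` and `deduplicate` are hypothetical and are not taken from the notebook's actual schema.

```python
from collections import Counter


def find_duplicate_ids(records, id_key="id"):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(rec[id_key] for rec in records)
    return {rid: n for rid, n in counts.items() if n > 1}


def deduplicate(records, id_key="id"):
    """Keep the most recent record for each id, in first-seen order."""
    latest = {}
    for rec in records:
        # A repeated id overwrites the earlier record but keeps its position.
        latest[rec[id_key]] = rec
    return list(latest.values())


# Hypothetical records standing in for the contents of a vector file.
records = [
    {"id": "doc1_sec_0", "vector": [0.1, 0.2]},
    {"id": "doc1_sec_1", "vector": [0.3, 0.4]},
    {"id": "doc1_sec_0", "vector": [0.5, 0.6]},  # same id written twice
]

duplicates = find_duplicate_ids(records)
unique_records = deduplicate(records)
print(f"{len(records)} total, {len(unique_records)} unique")
print(f"Duplicated ids: {duplicates}")
```

Keeping the last record per ID is one possible policy, chosen here on the assumption that a later write reflects the most recent vectorization run. If the first write should win instead, skip IDs that are already present rather than overwriting them.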