# CORPUS-INFORMED AUTO-TRANSLATOR: VECTOR CREATION NOTEBOOK
Converting JSON Corpus Contents into Semantic Vectors

## Vectorization Notebook Outline

**Part 1: Data Preparation and Planning**

- STEP 1: LOAD THE REQUIRED LIBRARIES
- STEP 2: LOAD AND EXAMINE DATABASE STRUCTURE
- STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)

**Part 2: Function and Model Setup**

- STEP 4: DEFINE TEXT EXTRACTION FUNCTIONS (STANDARDIZED)
- STEP 5: DEFINE VECTORIZATION FUNCTIONS
- STEP 6: INITIALIZE THE MULTILINGUAL EMBEDDING MODEL

**Part 3: Execution and Output**

- STEP 7: BATCH PROCESS DOCUMENTS
- STEP 8: GENERATE JSON VISUALIZATION REPORT

## =============================================================================

## Local Environment Setup

In [1]:
# ==============================================================================
# LOCAL ENVIRONMENT SETUP - Run this FIRST!
# ==============================================================================

print("🌍 Setting up local Pragmatic Auto-Translator environment...")

import os
import sys
from pathlib import Path

# Navigate to project root (assumes notebook is in scripts/vectorization/)
print("📁 Setting up project paths...")
current_notebook_dir = Path.cwd()

# Find project root by looking for config.py in scripts folder
project_root = None
search_dir = current_notebook_dir

# Search up the directory tree for scripts/config.py
for _ in range(5):  # Prevent infinite loop
    scripts_dir = search_dir / 'scripts'
    config_file = scripts_dir / 'config.py'
    
    if config_file.exists():
        project_root = search_dir
        break
    
    parent = search_dir.parent
    if parent == search_dir:  # Reached filesystem root
        break
    search_dir = parent

if project_root is None:
    print("❌ Could not find project root with scripts/config.py")
    print(f"💡 Current directory: {current_notebook_dir}")
    print("💡 Make sure you're running this notebook from within the project structure")
    sys.exit(1)

# Set working directory and Python path
os.chdir(project_root)
scripts_path = project_root / 'scripts'
if str(scripts_path) not in sys.path:
    sys.path.insert(0, str(scripts_path))

print(f"✅ Project root: {project_root}")
print(f"✅ Scripts path added: {scripts_path}")

# Import and validate configuration
print("📋 Loading project configuration...")
try:
    from config import *
    
    # Initialize directories
    ensure_directories(DOMAIN)
    
    # Verify corpus databases and show results
    corpus_ready = verify_corpus_databases(DOMAIN)
    
    if corpus_ready:
        print(f"\n✅ Configuration loaded: {DOMAIN.upper()} domain, {len(LANGUAGES)} languages, {MODEL_NAME}")
        setup_success = True
    else:
        print(f"\n⚠️ Missing corpus files - check DOMAIN setting or run corpus collection")
        setup_success = False

except ImportError as e:
    print(f"❌ Configuration import failed: {e}")
    setup_success = False
except Exception as e:
    print(f"❌ Configuration error: {e}")
    setup_success = False

# Final status
if setup_success:
    print("\n✅ Environment ready for STEP 1: Load Required Libraries")
else:
    print("\n❌ Setup failed - resolve issues before proceeding")
    print("💡 Check that:")
    print("   - You're in the correct project directory")
    print("   - config.py exists in scripts/ folder")
    print("   - Corpus database files exist for configured languages")

🌍 Setting up local Pragmatic Auto-Translator environment...
📁 Setting up project paths...
✅ Project root: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2
✅ Scripts path added: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2\scripts
📋 Loading project configuration...
Initializing local Pragmatic Auto-Translator environment...

🔍 VERIFYING CORPUS DATABASES FOR DOMAIN: GAI
--------------------------------------------------
✅ ENG: 6 documents
✅ ESP: 5 documents
✅ ZHO: 2 documents

✅ Configuration loaded: GAI domain, 3 languages, jinaai/jina-embeddings-v3

✅ Environment ready for STEP 1: Load Required Libraries
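
The upward search for `scripts/config.py` in the setup cell can also be packaged as a small function, which makes it easier to test outside the notebook. The sketch below is only an illustration: the name `find_project_root` and its `marker` and `max_levels` parameters are hypothetical and are not part of the project's `config.py`.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "scripts/config.py",
                      max_levels: int = 5) -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper that mirrors the loop in the setup cell: it checks the
    starting directory and then up to `max_levels - 1` of its parents.
    """
    search_dir = start.resolve()
    for _ in range(max_levels):
        if (search_dir / marker).exists():
            return search_dir
        parent = search_dir.parent
        if parent == search_dir:  # Reached the filesystem root
            return None
        search_dir = parent
    return None
```

Returning `None` instead of exiting leaves it to the caller to decide how to report a missing project root.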



## Part 1: Data Preparation and Planning

### STEP 1: LOAD THE REQUIRED LIBRARIES

In [2]:
# ==============================================================================
# STEP 1: LOAD THE REQUIRED LIBRARIES
# ==============================================================================

print("📚 Loading required libraries...")

# Core Python libraries (built-in)
import json
import logging
from typing import Dict, List, Optional
from datetime import datetime
import re
from collections import defaultdict

# Check and install required libraries
required_libs = ["numpy", "sentence-transformers", "scikit-learn", "matplotlib", "seaborn", "tqdm"]
missing_libs = []

# Test imports and collect missing libraries
import importlib

for lib in required_libs:
    try:
        if lib == "sentence-transformers":
            importlib.import_module("sentence_transformers")
        elif lib == "scikit-learn":
            importlib.import_module("sklearn")
        else:
            importlib.import_module(lib)
    except ImportError:
        missing_libs.append(lib)

# Install missing libraries
if missing_libs:
    print(f"📦 Installing missing libraries: {', '.join(missing_libs)}")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install"] + missing_libs)
    print("✅ Installation complete")

# Now import everything
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("🚀 All libraries loaded - ready for vectorization")
print(f"📊 Ready to process {len(LANGUAGES)} languages: {', '.join(LANGUAGES)}")
print(f"🎯 Domain: {DOMAIN}")

📚 Loading required libraries...


🚀 All libraries loaded - ready for vectorization
📊 Ready to process 3 languages: eng, esp, zho
🎯 Domain: gai
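
The library check above special-cases two packages whose pip name differs from their import name. A minimal sketch of the same idea with an explicit lookup table follows; `IMPORT_NAMES` and `find_missing` are hypothetical names introduced here, and `importlib.util.find_spec` is used to locate a module without importing it.

```python
import importlib.util

# pip distribution name -> importable module name (only where they differ)
IMPORT_NAMES = {
    "sentence-transformers": "sentence_transformers",
    "scikit-learn": "sklearn",
}


def find_missing(required_libs):
    """Return the pip names whose top-level module cannot be located."""
    missing = []
    for lib in required_libs:
        module_name = IMPORT_NAMES.get(lib, lib)
        # find_spec returns None when no such top-level module is installed
        if importlib.util.find_spec(module_name) is None:
            missing.append(lib)
    return missing
```

Because `find_spec` only locates the module, an installation that is present but fails at import time would not be reported as missing. The `try`/`import_module` approach in the cell above does catch that case.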


### STEP 2: LOAD AND EXAMINE DATABASE STRUCTURE

In [3]:
# ==============================================================================
# STEP 2: LOAD CORPUS DATABASES AND DEFINE CORPUS ITEM LOADING HELPER FUNCTION  
# ==============================================================================

# Load document metadata from corpus databases
document_metadata = load_all_databases()  # Note: This loads metadata, not actual content

if not document_metadata:
    raise RuntimeError("❌ No document metadata loaded - check corpus database files")

# Create summary for processing
document_summary = {}
for language, docs in document_metadata.items():
    document_summary[language] = {
        'count': len(docs),
        'sample_ids': list(docs.keys())[:3]  # Show first 3 IDs as examples
    }

print(f"✅ Document metadata loaded - ready to load unprocessed documents for vectorization")

# Helper: load one corpus item (metadata + content) by language and document ID

def load_corpus_item(language, doc_id):
    """
    Load both the metadata and content data for a specific document
    
    Args:
        language: Language code ('eng', 'esp', etc.)
        doc_id: Document ID (e.g., 'gai-eng_corpus-item001')
    
    Returns:
        Dictionary containing both metadata and content
    """
    # Load metadata from database
    if language not in document_metadata:
        raise ValueError(f"Language '{language}' not found in loaded databases")
    
    if doc_id not in document_metadata[language]:
        raise ValueError(f"Document '{doc_id}' not found in {language} database")
    
    metadata = document_metadata[language][doc_id]
    
    # Load content from separate content file using config paths
    content_file_path = PATHS[language]['processed'] / f'{doc_id}.json'
    
    if not content_file_path.exists():
        raise FileNotFoundError(f"Content file not found: {content_file_path}")
    
    try:
        with open(content_file_path, 'r', encoding='utf-8') as f:
            content_data = json.load(f)
    except Exception as e:
        # Chain the original exception so the underlying cause stays visible
        raise ValueError(f"Error loading content file {content_file_path}: {e}") from e
    
    # Merge metadata and content
    merged_data = {
        'document_metadata': metadata.get('document_metadata', {}),
        'processing_metadata': metadata.get('processing_metadata', {}),
        'document_id': content_data.get('document_id', doc_id),
        'content': content_data.get('content', {})
    }
    
    return merged_data

print(f"\n✅ Helper function loaded - load_corpus_item() ready to use")


📚 LOADING ALL CORPUS DATABASES FOR DOMAIN: GAI
--------------------------------------------------
✅ ENG: 6 documents loaded
✅ ESP: 5 documents loaded
✅ ZHO: 2 documents loaded

📊 TOTAL: 13 documents across 3 languages
--------------------------------------------------
✅ Document metadata loaded - ready to load unprocessed documents for vectorization

✅ Helper function loaded - load_corpus_item() ready to use
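
The merge step at the end of `load_corpus_item()` can be looked at in isolation. The sketch below uses a hypothetical pure function, `merge_corpus_item`, and made-up records to show how missing keys fall back to empty defaults and how the document ID falls back to the lookup key when the content file does not carry one.

```python
def merge_corpus_item(metadata: dict, content_data: dict, doc_id: str) -> dict:
    """Combine a database entry with its content file (hypothetical helper)."""
    return {
        "document_metadata": metadata.get("document_metadata", {}),
        "processing_metadata": metadata.get("processing_metadata", {}),
        # Prefer the ID stored in the content file; fall back to the lookup key
        "document_id": content_data.get("document_id", doc_id),
        "content": content_data.get("content", {}),
    }


# Illustrative records only; the field values are made up for this example
metadata = {"document_metadata": {"title": "Example title"}}
content_data = {"content": {"sections": []}}

item = merge_corpus_item(metadata, content_data, "gai-eng_corpus-item001")
print(item["document_id"])
print(item["processing_metadata"])
```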


### STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)

In [4]:
# ==============================================================================
# STEP 3: LOAD UNPROCESSED DOCUMENTS FOR BATCH PROCESSING (NO RE-VECTORIZATION)
# ==============================================================================

def load_existing_vectors():
    """
    Load all existing vector files and extract processed document IDs
    Returns set of document IDs that already have vectors
    """
    processed_doc_ids = set()
    vectors_dir = PATHS['vectors']
    
    print(f"🔍 CHECKING EXISTING VECTORS IN: {vectors_dir}")
    
    # Check each vector file type defined in config
    for vector_type, filename in OUTPUT_FILES.items():
        if not vector_type.endswith('_vectors'):
            continue  # Skip non-vector files
            
        filepath = vectors_dir / filename
        
        if filepath.exists():
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                
                vectors = data.get('vectors', [])
                
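                # Extract document IDs from vectors. Document-level entries store
                # the document ID under 'id' (see create_document_vectors), so
                # fall back to that key for the document vector file only.
                for vector in vectors:
                    doc_id = vector.get('document_id', '')
                    if not doc_id and vector_type == 'document_vectors':
                        doc_id = vector.get('id', '')
                    if doc_id:
                        processed_doc_ids.add(doc_id)
                
                print(f"  ✅ {vector_type}: {len(vectors)} vectors found")
                
            except Exception as e:
                print(f"  ❌ Error reading {filepath}: {e}")
        else:
            print(f"  📝 {vector_type}: No existing file")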
    
    print(f"\n📊 SUMMARY: {len(processed_doc_ids)} documents already have vectors")
    return processed_doc_ids

def find_unprocessed_documents(all_documents, processed_doc_ids):
    """
    Find documents that need vectorization
    """
    unprocessed = []
    
    for language, docs in all_documents.items():
        for doc_id, doc_data in docs.items():
            if doc_id not in processed_doc_ids:
                doc_metadata = doc_data.get('document_metadata', {})
                processing_metadata = doc_data.get('processing_metadata', {})
                
                unprocessed.append({
                    'document_id': doc_id,
                    'language': language,
                    'title': doc_metadata.get('title', 'No title'),
                    'text_type': doc_metadata.get('text_type', 'Unknown'),
                    'word_count': processing_metadata.get('word_count', 0)
                })
    
    return unprocessed

def load_documents_for_processing(documents_to_process):
    """
    Load the actual document content for vectorization
    """
    loaded_documents = []
    failed_count = 0
    
    print(f"\n📖 LOADING DOCUMENT CONTENT:")
    
    for doc_info in documents_to_process:
        doc_id = doc_info['document_id']
        language = doc_info['language']
        
        try:
            document = load_corpus_item(language, doc_id)
            document['processing_info'] = doc_info
            loaded_documents.append(document)
            
        except Exception as e:
            print(f"  ❌ {doc_id}: Error - {e}")
            failed_count += 1
    
    if failed_count == 0:
        print(f"  ✅ All {len(loaded_documents)} documents loaded successfully")
    else:
        print(f"  ✅ {len(loaded_documents)} loaded, {failed_count} failed")
    
    return loaded_documents

# Execute the loading process
print("🚀 STARTING DOCUMENT LOADING PROCESS")
print("="*50)

# Check for existing vectors
existing_processed_ids = load_existing_vectors()

# Find unprocessed documents
documents_to_process = find_unprocessed_documents(document_metadata, existing_processed_ids)

# Show processing summary
total_docs = sum(len(docs) for docs in document_metadata.values())
print(f"\n📊 PROCESSING SUMMARY:")
print(f"  • Total documents: {total_docs} | Already processed: {len(existing_processed_ids)} | Need processing: {len(documents_to_process)}")

# Load documents for processing (if any)
if documents_to_process:
    print(f"\n📝 DOCUMENTS TO PROCESS ({len(documents_to_process)} total):")
    for i, doc in enumerate(documents_to_process, 1):
        print(f"  {i:2d}. {doc['document_id']} ({doc['language'].upper()}) - {doc['word_count']:,} words")
        print(f"      {doc['title'][:60]}{'...' if len(doc['title']) > 60 else ''}")
    
    # Load the actual content
    loaded_docs = load_documents_for_processing(documents_to_process)
    
    print(f"\n✅ LOADING COMPLETE: {len(loaded_docs)} documents ready for text extraction")

else:
    print(f"\n🎉 ALL DOCUMENTS ALREADY PROCESSED!")
    loaded_docs = []

print(f"\n✅ STEP 3 COMPLETE - Ready for text extraction functions!")

🚀 STARTING DOCUMENT LOADING PROCESS
🔍 CHECKING EXISTING VECTORS IN: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2\corpora\gai\vectors
  📝 document_vectors: No existing file
  📝 section_vectors: No existing file
  📝 paragraph_vectors: No existing file

📊 SUMMARY: 0 documents already have vectors

📊 PROCESSING SUMMARY:
  • Total documents: 13 | Already processed: 0 | Need processing: 13

📝 DOCUMENTS TO PROCESS (13 total):
   1. gai-eng_corpus-item001 (ENG) - 5,817 words
      Attention is All You Need
   2. gai-eng_corpus-item002 (ENG) - 15,544 words
      On the Dangers of Stochastic Parrots: Can Language Models Be...
   3. gai-eng_corpus-item003 (ENG) - 14,571 words
      Recommendation on the Ethics of Artificial Intelligence
   4. gai-eng_corpus-item004 (ENG) - 3,672 words
      The Age of AI has begun
   5. gai-eng_corpus-item005 (ENG) - 4,200 words
      "If I Had Another Job, I Would Not Accept Data Annotation Ta...
   6. gai-eng_c
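
The "no re-vectorization" behavior in Step 3 comes down to a membership test against the set of already-processed document IDs. A minimal sketch of that filter, using a hypothetical function name and made-up document IDs, could look like this:

```python
def ids_needing_vectors(all_documents: dict, processed_ids: set) -> list:
    """Return (language, doc_id) pairs that do not have vectors yet."""
    return [
        (language, doc_id)
        for language, docs in all_documents.items()
        for doc_id in docs
        if doc_id not in processed_ids
    ]


# Made-up IDs for illustration; real IDs come from the corpus databases
all_documents = {"eng": {"doc-a": {}, "doc-b": {}}, "esp": {"doc-c": {}}}
processed = {"doc-a"}

pending = ids_needing_vectors(all_documents, processed)
print(pending)
```

Keeping the processed IDs in a `set` makes each lookup constant time on average, so the filter stays cheap as the corpus grows.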

## =============================================================================

## Part 2: Function and Model Setup

### STEP 4: STANDARDIZED TEXT EXTRACTION FUNCTIONS

In [5]:
# ==============================================================================
# STEP 4: STANDARDIZED TEXT EXTRACTION FUNCTIONS
# ==============================================================================

def extract_clean_text(text):
    """
    Clean and normalize text content
    
    Args:
        text: Raw text string
        
    Returns:
        Cleaned text string
    """
    if not text or text == "null":
        return ""
    
    # Convert to string if not already
    text = str(text)
    
    # Normalize whitespace
    text = ' '.join(text.split())
    
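    # Normalize quotation marks (straight quotes only as per QA checklist)
    # Unicode escapes are used so the curly quotes survive copy/paste and re-encoding
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")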
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

def extract_paragraph_content(paragraph):
    """
    Extract semantic content from a standardized paragraph object
    
    Args:
        paragraph: Paragraph object from standardized JSON structure
        
    Returns:
        tuple: (paragraph_id, combined_text)
    """
    if not isinstance(paragraph, dict):
        return "unknown", ""
    
    paragraph_id = paragraph.get('id', 'unknown')
    text_parts = []
    
    # Main paragraph text (includes anchor text for links)
    main_text = extract_clean_text(paragraph.get('text', ''))
    if main_text:
        text_parts.append(main_text)
    
    # Process inline equations (semantic content for technical domains)
    inline_equations = paragraph.get('inline_equations', [])
    for equation in inline_equations:
        marker = equation.get('marker', '')
        latex = equation.get('latex', '')
        if marker and latex:
            text_parts.append(f"[Equation: {marker} = {latex}]")
    
    # Combine all text parts
    combined_text = ' '.join(text_parts)
    
    return paragraph_id, combined_text

def extract_section_content(section, document_id, parent_section_id="", nesting_level=0):
    """
    Extract content from a section and all its nested subsections (arbitrary depth)
    
    Args:
        section: Section object from standardized JSON structure
        document_id: Document identifier
        parent_section_id: Parent section ID for building hierarchy
        nesting_level: Current nesting depth (for debugging/logging)
        
    Returns:
        dict: Section data with text and paragraph information
    """
    section_id = section.get('id', 'unknown')
    section_title = extract_clean_text(section.get('title', ''))
    
    # Build full section ID considering hierarchy
    if parent_section_id:
        full_section_id = f"{parent_section_id}_{section_id.split('_')[-1]}"
    else:
        full_section_id = section_id
    
    # Initialize containers
    section_text_parts = []
    all_paragraphs = []
    all_subsections = []
    
    # Add section title to text
    if section_title:
        section_text_parts.append(section_title)
    
    # Process paragraphs at this level
    paragraphs = section.get('paragraphs', [])
    for paragraph in paragraphs:
        para_id, para_text = extract_paragraph_content(paragraph)
        
        if para_text:
            section_text_parts.append(para_text)
            
            # Store paragraph with metadata
            all_paragraphs.append({
                'id': para_id,
                'text': para_text,
                'section_id': full_section_id,
                'section_title': section_title,
                'document_id': document_id,
                'nesting_level': nesting_level
            })
    
    # Process all possible nested subsection keys (handles arbitrary depth)
    nested_section_keys = [
        'subsections', 
        'subsubsections', 
        'subsubsubsections',
        'subsubsubsubsections'  # Just in case you go even deeper!
    ]
    
    # Also dynamically find any keys that contain 'section' for future-proofing
    for key in section.keys():
        if 'section' in key.lower() and key not in nested_section_keys and isinstance(section[key], list):
            nested_section_keys.append(key)
    
    # Process all found nested sections
    for subsection_key in nested_section_keys:
        nested_sections = section.get(subsection_key, [])
        
        for nested_section in nested_sections:
            nested_section_data = extract_section_content(
                nested_section, 
                document_id, 
                full_section_id,
                nesting_level + 1
            )
            
            # Add nested section text to current section
            section_text_parts.append(nested_section_data['text'])
            
            # Collect all paragraphs and subsections from nested levels
            all_paragraphs.extend(nested_section_data['paragraphs'])
            all_subsections.append(nested_section_data)
            all_subsections.extend(nested_section_data['subsections'])
    
    # Combine section text
    section_text = ' '.join(section_text_parts)
    
    return {
        'id': full_section_id,
        'title': section_title,
        'text': section_text,
        'document_id': document_id,
        'paragraphs': all_paragraphs,
        'subsections': all_subsections,
        'nesting_level': nesting_level
    }

def extract_document_content(corpus_item):
    """
    Extract all content from a standardized corpus document
    
    Args:
        corpus_item: Complete document object from load_corpus_item()
        
    Returns:
        dict: Complete document content with text at multiple granularities
    """
    # Get document metadata
    document_id = corpus_item.get('document_id', 'unknown')
    doc_metadata = corpus_item.get('document_metadata', {})
    
    title = extract_clean_text(doc_metadata.get('title', ''))
    text_type = doc_metadata.get('text_type', 'unknown')
    language_family = doc_metadata.get('language_family', 'unknown')
    language_variant = doc_metadata.get('language_variant', 'unknown')
    
    print(f"📄 Processing: {document_id} ({text_type})")
    
    # Initialize content containers
    document_text_parts = []
    all_sections = []
    all_paragraphs = []
    
    # Add title to document text
    if title:
        document_text_parts.append(title)
    
    # Get content structure
    content = corpus_item.get('content', {})
    
    # Process abstract (if present)
    abstract = content.get('abstract')
    if abstract and abstract != "null":
        clean_abstract = extract_clean_text(abstract)
        if clean_abstract:
            document_text_parts.append(clean_abstract)
            
            # Store abstract as special paragraph
            all_paragraphs.append({
                'id': f"{document_id}_abstract",
                'text': clean_abstract,
                'section_id': 'abstract',
                'section_title': 'Abstract',
                'document_id': document_id,
                'nesting_level': 0  # Same keys as paragraphs collected from sections
            })
    
    # Process sections
    sections = content.get('sections', [])
    for section in sections:
        section_data = extract_section_content(section, document_id, "", 0)  # Start at nesting level 0
        
        # Add section text to document
        document_text_parts.append(section_data['text'])
        
        # Store section information
        all_sections.append({
            'id': section_data['id'],
            'title': section_data['title'],
            'text': section_data['text'],
            'document_id': document_id,
            'nesting_level': section_data['nesting_level']
        })
        
        # Collect all paragraphs from section and all nested subsections
        all_paragraphs.extend(section_data['paragraphs'])
        
        # Collect all subsections (flattened from all nesting levels)
        all_sections.extend([
            {
                'id': sub['id'],
                'title': sub['title'],
                'text': sub['text'],
                'document_id': document_id,
                'nesting_level': sub['nesting_level']
            }
            for sub in section_data['subsections']
        ])
    
    # Process top-level figures (if present)
    figures = content.get('figures', [])
    for figure in figures:
        caption = extract_clean_text(figure.get('caption', ''))
        if caption:
            figure_text = f"Figure {figure.get('id', '')}: {caption}"
            document_text_parts.append(figure_text)
    
    # Process top-level tables (if present)
    tables = content.get('tables', [])
    for table in tables:
        caption = extract_clean_text(table.get('caption', ''))
        if caption:
            table_text = f"Table {table.get('id', '')}: {caption}"
            document_text_parts.append(table_text)
    
    # Combine all document text
    full_document_text = ' '.join(document_text_parts)
    
    # Calculate content statistics
    stats = {
        'total_sections': len(all_sections),
        'total_paragraphs': len(all_paragraphs),
        'document_length_chars': len(full_document_text),
        'document_length_words': len(full_document_text.split()),
        'text_type': text_type,
        'has_abstract': bool(abstract and abstract != "null"),
        'has_figures': len(figures) > 0,
        'has_tables': len(tables) > 0,
        'language': f"{language_family}-{language_variant}"
    }
    
    print(f"  ✅ Extracted: {stats['total_sections']} sections, {stats['total_paragraphs']} paragraphs")
    print(f"  📊 Length: {stats['document_length_words']:,} words")
    
    return {
        'document_id': document_id,
        'title': title,
        'text_type': text_type,
        'language': f"{language_family}-{language_variant}",
        'document_text': full_document_text,
        'sections': all_sections,
        'paragraphs': all_paragraphs,
        'statistics': stats,
        'processing_metadata': {
            'extracted_at': datetime.now().isoformat(),
            'extraction_method': 'standardized_schema_v1'
        }
    }

def validate_extracted_content(extracted_content):
    """
    Validate that content extraction produced expected results
    
    Args:
        extracted_content: Result from extract_document_content()
        
    Returns:
        bool: True if validation passes
    """
    required_fields = ['document_id', 'document_text', 'sections', 'paragraphs', 'statistics']
    
    for field in required_fields:
        if field not in extracted_content:
            print(f"❌ Missing required field: {field}")
            return False
    
    # Check that we have actual content
    if not extracted_content['document_text'].strip():
        print(f"❌ Empty document text for {extracted_content['document_id']}")
        return False
    
    if len(extracted_content['paragraphs']) == 0:
        print(f"❌ No paragraphs extracted for {extracted_content['document_id']}")
        return False
    
    print(f"✅ Content validation passed for {extracted_content['document_id']}")
    return True

# Test the extraction functions with a sample document
print("🧪 Testing extraction functions...")

if document_metadata and len(loaded_docs) > 0:
    # Test with first loaded document
    test_doc = loaded_docs[0]
    test_result = extract_document_content(test_doc)
    
    if validate_extracted_content(test_result):
        print("✅ Step 4 extraction functions ready for batch processing")
        print(f"📋 Test document: {test_result['document_id']}")
        print(f"📊 Test stats: {test_result['statistics']['total_sections']} sections, {test_result['statistics']['total_paragraphs']} paragraphs")
    else:
        print("❌ Extraction function validation failed")
else:
    print("⚠️ No documents available for testing - functions defined but not tested")
    print("✅ Step 4 extraction functions ready (untested)")

print("\n" + "="*60)
print("📋 STEP 4 COMPLETE: Standardized text extraction functions ready")
print("🎯 Functions defined:")
print("   • extract_clean_text() - Text normalization")
print("   • extract_paragraph_content() - Semantic paragraph processing")
print("   • extract_section_content() - Recursive section processing (arbitrary depth)")
print("   • extract_document_content() - Complete document processing")
print("   • validate_extracted_content() - Content validation")
print("🔄 Supports unlimited nesting: sections → subsections → subsubsections → subsubsubsections → ...")
print("📝 Extracts: Main text + section titles + equations (excludes footnotes & URLs)")
print("="*60)

🧪 Testing extraction functions...
📄 Processing: gai-eng_corpus-item001 (Academic paper)
  ✅ Extracted: 22 sections, 69 paragraphs
  📊 Length: 4,524 words
✅ Content validation passed for gai-eng_corpus-item001
✅ Step 4 extraction functions ready for batch processing
📋 Test document: gai-eng_corpus-item001
📊 Test stats: 22 sections, 69 paragraphs

📋 STEP 4 COMPLETE: Standardized text extraction functions ready
🎯 Functions defined:
   • extract_clean_text() - Text normalization
   • extract_paragraph_content() - Semantic paragraph processing
   • extract_section_content() - Recursive section processing (arbitrary depth)
   • extract_document_content() - Complete document processing
   • validate_extracted_content() - Content validation
🔄 Supports unlimited nesting: sections → subsections → subsubsections → subsubsubsections → ...
📝 Extracts: Main text + section titles + equations (excludes footnotes & URLs)
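
To make the recursive flattening and hierarchical ID construction in `extract_section_content()` easier to follow, here is a stripped-down sketch. It is not the notebook's function: `flatten_sections` is a hypothetical stand-in that follows only the `subsections` key, ignores paragraphs, and runs on a made-up section tree whose IDs are illustrative rather than taken from the corpus.

```python
def flatten_sections(section: dict, parent_id: str = "", level: int = 0) -> list:
    """Return (full_id, nesting_level) for a section and every nested subsection."""
    section_id = section.get("id", "unknown")
    # Same ID rule as the notebook: append the last '_' segment to the parent ID
    if parent_id:
        full_id = f"{parent_id}_{section_id.split('_')[-1]}"
    else:
        full_id = section_id

    flat = [(full_id, level)]
    for child in section.get("subsections", []):
        flat.extend(flatten_sections(child, full_id, level + 1))
    return flat


# Made-up section tree for illustration
tree = {
    "id": "sec_1",
    "subsections": [
        {"id": "sub_1", "subsections": [{"id": "subsub_1"}]},
        {"id": "sub_2"},
    ],
}

for full_id, level in flatten_sections(tree):
    print("  " * level + full_id)
```

Each nested section inherits its parent's full ID as a prefix, so a flattened list still records where every section sat in the hierarchy.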


### STEP 5: DEFINE VECTORIZATION FUNCTIONS

In [6]:
# ==============================================================================
# STEP 5: VECTORIZATION FUNCTIONS 
# ==============================================================================

def create_vector_metadata(model):
    """
    Create metadata section matching JS-expected schema
    """
    return {
        "model": MODEL_NAME,
        "dimension": MODEL_DIMENSIONS,
        "task": MODEL_TASK,
        "normalization": True,
        "created": datetime.now().isoformat(),
        "model_parameters": {
            "trust_remote_code": MODEL_TRUST_REMOTE_CODE,
            "normalize_embeddings": True
        }
    }

def create_document_vectors(extracted_content, model):
    """
    Create document-level vectors with metadata
    """
    print(f"🎯 Creating document vector for: {extracted_content['document_id']}")
    
    doc_text = extracted_content['document_text']
    word_count = len(doc_text.split())
    
    # Create vector using Jina-v3
    vector = model.encode(
        doc_text,
        task=MODEL_TASK,
        normalize_embeddings=True
    )
    
    return {
        'id': extracted_content['document_id'],
        'title': extracted_content['title'],  # Document title already included
        'text': doc_text,
        'word_count': word_count,
        'character_count': len(doc_text),
        'vector': vector.tolist()
    }

def create_section_vectors(extracted_content, model):
    """
    Create section-level vectors with metadata including document title
    """
    print(f"📚 Creating section vectors for: {extracted_content['document_id']}")
    
    section_vectors = []
    document_title = extracted_content.get('title', 'No title')  # Get document title
    
    for section in extracted_content['sections']:
        if section['text'].strip():  # Only process non-empty sections
            section_text = section['text']
            word_count = len(section_text.split())
            
            # Create vector
            vector = model.encode(
                section_text,
                task=MODEL_TASK,
                normalize_embeddings=True
            )
            
            section_vectors.append({
                'id': section['id'],
                'document_id': section['document_id'],
                'document_title': document_title,  
                'title': section['title'],         # Section title
                'text': section_text,
                'word_count': word_count,
                'character_count': len(section_text),
                'nesting_level': section['nesting_level'],
                'vector': vector.tolist()
            })
    
    print(f"  ✅ Created {len(section_vectors)} section vectors (with document titles)")
    return section_vectors

def create_paragraph_vectors(extracted_content, model):
    """
    Create paragraph-level vectors with metadata including document title
    """
    print(f"📝 Creating paragraph vectors for: {extracted_content['document_id']}")
    
    paragraph_vectors = []
    document_title = extracted_content.get('title', 'No title')  # Get document title
    
    for paragraph in extracted_content['paragraphs']:
        if paragraph['text'].strip():  # Only process non-empty paragraphs
            para_text = paragraph['text']
            word_count = len(para_text.split())
            
            # Create vector
            vector = model.encode(
                para_text,
                task=MODEL_TASK,
                normalize_embeddings=True
            )
            
            paragraph_vectors.append({
                'id': paragraph['id'],
                'document_id': paragraph['document_id'],
                'document_title': document_title,  
                'text': para_text,
                'word_count': word_count,
                'character_count': len(para_text),
                'section_id': paragraph['section_id'],
                'section_title': paragraph['section_title'],
                'vector': vector.tolist()
            })
    
    print(f"  ✅ Created {len(paragraph_vectors)} paragraph vectors (with document titles)")
    return paragraph_vectors

def append_vectors_to_file(new_vectors, vector_type, model):
    """
    Append vectors to JS-compatible JSON files with proper schema
    
    Args:
        new_vectors: List of vectors to append
        vector_type: 'document', 'section', or 'paragraph'
        model: SentenceTransformer model for metadata
    
    Returns:
        dict: File statistics
    """
    # Use correct filename format expected by JS
    filename_map = {
        'document': OUTPUT_FILES['document_vectors'],
        'section': OUTPUT_FILES['section_vectors'],
        'paragraph': OUTPUT_FILES['paragraph_vectors']
    }
    
    if vector_type not in filename_map:
        raise ValueError(f"Invalid vector_type: {vector_type}")
    
    filepath = PATHS['vectors'] / filename_map[vector_type]
    
    # Load existing file or create new structure
    if filepath.exists():
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                existing_data = json.load(f)
        except Exception as e:
            print(f"  ⚠️ Error reading existing file, creating new: {e}")
            existing_data = {"metadata": {}, "vectors": []}
    else:
        existing_data = {"metadata": {}, "vectors": []}
    
    # Ensure proper structure
    if 'vectors' not in existing_data:
        existing_data['vectors'] = []
    if 'metadata' not in existing_data:
        existing_data['metadata'] = {}
    
    # Get current count for sequential numbering
    current_count = len(existing_data['vectors'])
    timestamp = datetime.now().isoformat()
    
    # Add vectors with JS-compatible schema
    for i, vector in enumerate(new_vectors):
        # Create vector object matching JS expectations
        vector_obj = {
            'id': vector['id'],
            'count': current_count + i + 1,  # Sequential count (required by JS)
            'created': timestamp,
            'text': vector['text'],
            'word_count': vector['word_count'],
            'character_count': vector['character_count'],
            'vector': vector['vector']
        }
        
        # Add type-specific fields
        if vector_type == 'document':
            vector_obj['title'] = vector.get('title', 'No title')
        elif vector_type == 'section':
            vector_obj['document_id'] = vector['document_id']
            vector_obj['document_title'] = vector.get('document_title', 'No title')  # ADD: Document title
            vector_obj['title'] = vector.get('title', 'No title')  # Section title
            vector_obj['level'] = vector.get('nesting_level', 0)  # JS expects 'level' not 'nesting_level'
        elif vector_type == 'paragraph':
            vector_obj['document_id'] = vector['document_id']
            vector_obj['document_title'] = vector.get('document_title', 'No title')  # ADD: Document title
            # Include section context for paragraphs
            if 'section_id' in vector:
                vector_obj['section_id'] = vector['section_id']
            if 'section_title' in vector:
                vector_obj['section_title'] = vector['section_title']
        
        existing_data['vectors'].append(vector_obj)
    
    # Update metadata
    existing_data['metadata'] = create_vector_metadata(model)
    
    # Save file
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(existing_data, f, ensure_ascii=False, indent=2)
    
    # Calculate stats
    file_size_mb = filepath.stat().st_size / (1024*1024)
    
    print(f"  💾 Saved to: {filepath.name}")
    print(f"  📊 Added {len(new_vectors)} vectors (total: {len(existing_data['vectors'])})")
    print(f"  📁 File size: {file_size_mb:.2f} MB")
    
    return {
        'file': str(filepath),
        'vectors_added': len(new_vectors),
        'total_vectors': len(existing_data['vectors']),
        'file_size_mb': file_size_mb
    }

def process_document_vectors(extracted_content, model):
    """
    Process a single document into all vector types and append to files
    
    Args:
        extracted_content: Output from extract_document_content()
        model: Loaded SentenceTransformer model
    
    Returns:
        dict: Processing statistics
    """
    doc_id = extracted_content['document_id']
    doc_title = extracted_content.get('title', 'No title')
    print(f"\n🔄 PROCESSING VECTORS FOR: {doc_id}")
    print(f"📖 Document: {doc_title}")
    print("="*60)
    
    stats = {}
    
    # Create and save document vectors
    if CREATE_DOCUMENT_VECTORS:
        doc_vectors = [create_document_vectors(extracted_content, model)]
        stats['document'] = append_vectors_to_file(doc_vectors, 'document', model)
    
    # Create and save section vectors
    if CREATE_SECTION_VECTORS:
        section_vectors = create_section_vectors(extracted_content, model)
        if section_vectors:
            stats['section'] = append_vectors_to_file(section_vectors, 'section', model)
        else:
            print("  📚 No sections to vectorize")
    
    # Create and save paragraph vectors
    if CREATE_PARAGRAPH_VECTORS:
        paragraph_vectors = create_paragraph_vectors(extracted_content, model)
        if paragraph_vectors:
            stats['paragraph'] = append_vectors_to_file(paragraph_vectors, 'paragraph', model)
        else:
            print("  📝 No paragraphs to vectorize")
    
    print(f"✅ COMPLETED: {doc_id}")
    return stats

def verify_vector_files():
    """
    Verify that vector files match JS expectations and include document titles
    """
    print(f"\n🔍 VERIFYING VECTOR FILES (WITH DOCUMENT TITLES)")
    print("="*40)
    
    vector_types = ['document', 'section', 'paragraph']
    all_good = True
    
    for vector_type in vector_types:
        filename = OUTPUT_FILES[f'{vector_type}_vectors']
        filepath = PATHS['vectors'] / filename
        
        if filepath.exists():
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                
                # Check structure
                metadata = data.get('metadata', {})
                vectors = data.get('vectors', [])
                
                print(f"📄 {filename}:")
                print(f"   Model: {metadata.get('model', 'MISSING')}")
                print(f"   Dimension: {metadata.get('dimension', 'MISSING')}")
                print(f"   Vector count: {len(vectors)}")
                
                # Check first vector structure
                if vectors:
                    first_vector = vectors[0]
                    required_fields = ['id', 'count', 'created', 'text', 'vector', 'word_count']
                    
                    # Add type-specific required fields
                    if vector_type == 'document':
                        required_fields.extend(['title'])
                    elif vector_type in ['section', 'paragraph']:
                        required_fields.extend(['document_id', 'document_title'])
                    
                    missing = [f for f in required_fields if f not in first_vector]
                    
                    if missing:
                        print(f"   ❌ Missing fields: {missing}")
                        all_good = False
                    else:
                        print(f"   ✅ Schema complete (with document titles)")
                        
                        # Check vector dimension
                        actual_dim = len(first_vector['vector'])
                        if actual_dim != MODEL_DIMENSIONS:
                            print(f"   ❌ Vector dimension: {actual_dim} (expected {MODEL_DIMENSIONS})")
                            all_good = False
                        else:
                            print(f"   ✅ Vector dimension: {actual_dim}")
                        
                        # Verify document title field for section/paragraph vectors
                        if vector_type in ['section', 'paragraph']:
                            doc_title = first_vector.get('document_title', 'MISSING')
                            if doc_title == 'MISSING':
                                print(f"   ❌ Document title missing")
                                all_good = False
                            else:
                                print(f"   ✅ Document title: {doc_title[:30]}...")
                
            except Exception as e:
                print(f"   ❌ Error: {e}")
                all_good = False
        else:
            print(f"📄 {filename}: Not found")
        
        print()
    
    if all_good:
        print("✅ All files compatible with JS code and include document titles!")
    else:
        print("⚠️ Some compatibility issues found")
    
    return all_good

print("✅ STEP 5 COMPLETE: Enhanced vectorization functions ready")
print("🎯 Functions defined:")
print("   • create_document_vectors() - Document-level vectorization")
print("   • create_section_vectors() - Section-level vectorization (+ document title)")
print("   • create_paragraph_vectors() - Paragraph-level vectorization (+ document title)")
print("   • process_document_vectors() - Complete document processing")
print("   • verify_vector_files() - JS compatibility verification")
print(f"📁 Output files: {DOMAIN}-corpus-[type]-vectors.json")
print("💡 Includes word_count, character_count, and document_title metadata")
print("📖 Section and paragraph vectors now include document title for easier reference")
print("🔗 Fully compatible with existing JS translation pipeline")

✅ STEP 5 COMPLETE: Enhanced vectorization functions ready
🎯 Functions defined:
   • create_document_vectors() - Document-level vectorization
   • create_section_vectors() - Section-level vectorization (+ document title)
   • create_paragraph_vectors() - Paragraph-level vectorization (+ document title)
   • process_document_vectors() - Complete document processing
   • verify_vector_files() - JS compatibility verification
📁 Output files: gai-corpus-[type]-vectors.json
💡 Includes word_count, character_count, and document_title metadata
📖 Section and paragraph vectors now include document title for easier reference
🔗 Fully compatible with existing JS translation pipeline
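
The section and paragraph functions above call `model.encode` once per text, which shows up in the logs as one `Batches: 1/1` bar per section. `SentenceTransformer.encode` also accepts a list of texts, so one possible variation is to collect the non-empty texts and encode them in a single call. The sketch below is an illustration only: `DummyModel` and `encode_items_batched` are hypothetical names, and the dummy encoder stands in for the real model so the example runs without downloading it.

```python
from typing import Dict, List


class DummyModel:
    """Stand-in for the real embedding model (hypothetical, illustration only)."""

    def __init__(self):
        self.calls = 0

    def encode(self, texts: List[str], **kwargs) -> List[List[float]]:
        # Count calls so the example can show that batching needs only one.
        self.calls += 1
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def encode_items_batched(items: List[Dict], model, **encode_kwargs) -> List[Dict]:
    """Encode all non-empty items in one call and pair each vector with its item."""
    non_empty = [item for item in items if item["text"].strip()]
    if not non_empty:
        return []
    vectors = model.encode([item["text"] for item in non_empty], **encode_kwargs)
    return [
        {"id": item["id"], "text": item["text"], "vector": list(vec)}
        for item, vec in zip(non_empty, vectors)
    ]


sections = [
    {"id": "s1", "text": "Attention mechanisms"},
    {"id": "s2", "text": "   "},  # skipped: whitespace only
    {"id": "s3", "text": "Training data and batching"},
]
dummy = DummyModel()
batched = encode_items_batched(sections, dummy)
print([v["id"] for v in batched], "encode calls:", dummy.calls)
```

With the real model, `vec.tolist()` would replace `list(vec)`, and the keyword arguments the notebook already uses (`task=MODEL_TASK`, `normalize_embeddings=True`) would be passed through `encode_kwargs`. Whether a single batched call is actually faster on CPU for very long sections is something to measure rather than assume.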

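`append_vectors_to_file()` rewrites the whole JSON file on every call. If a run is interrupted mid-write, the file can be left truncated, and the next run would then treat it as unreadable and start from an empty structure. A common way to reduce that risk is to write to a temporary file in the same directory and swap it into place with `os.replace`. The helper below is a minimal sketch of that idea, not part of the notebook; `save_json_atomically` and the example filename are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_json_atomically(data: dict, filepath: Path) -> None:
    """Write JSON to a temp file next to the target, then swap it into place."""
    filepath = Path(filepath)
    fd, tmp_name = tempfile.mkstemp(dir=filepath.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp_name, filepath)  # atomic when both paths share a filesystem
    except BaseException:
        # Remove the partial temp file; the original target is left untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example-vectors.json"  # hypothetical filename
    save_json_atomically({"metadata": {}, "vectors": [{"id": "a", "count": 1}]}, target)
    save_json_atomically(
        {"metadata": {}, "vectors": [{"id": "a", "count": 1}, {"id": "b", "count": 2}]},
        target,
    )
    reloaded = json.loads(target.read_text(encoding="utf-8"))
    leftover = [p.name for p in Path(tmp_dir).iterdir() if p.suffix == ".tmp"]

print(len(reloaded["vectors"]), leftover)
```

In the notebook, the final `open(filepath, 'w')` / `json.dump` pair could be swapped for a call like this, assuming the vectors directory is on a local filesystem.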

### STEP 6: INITIALIZE THE MULTILINGUAL EMBEDDING MODEL

In [7]:
import subprocess
import sys

# Install einops in the same environment as the notebook
subprocess.check_call([sys.executable, "-m", "pip", "install", "einops"])

0
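
The cell above reinstalls `einops` on every run. A lighter variation, sketched here under the assumption that the import name matches the package name, is to check whether the module is already importable and only call pip when it is missing. `ensure_package` is a hypothetical helper, not something the notebook defines.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import Optional


def ensure_package(module_name: str, pip_name: Optional[str] = None, install: bool = True) -> bool:
    """Return True if the module is importable, installing it first if allowed."""
    if importlib.util.find_spec(module_name) is not None:
        return True
    if not install:
        return False
    # Install into the same interpreter that runs the notebook kernel.
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python, so this returns True without touching pip.
print(ensure_package("json"))
# With install=False a missing module is only reported, never installed.
print(ensure_package("surely_not_a_real_module_xyz", install=False))
```

For this notebook the call would be `ensure_package("einops")`.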

In [8]:
# ==============================================================================
# STEP 6: INITIALIZE THE MULTILINGUAL EMBEDDING MODEL
# ==============================================================================

print(f"🤖 Loading {MODEL_NAME}...")
print("📥 First run may take a moment to download model (~2GB)")

try:
    # Load jina-embeddings-v3 model
    model = SentenceTransformer(
        MODEL_NAME,
        trust_remote_code=MODEL_TRUST_REMOTE_CODE
    )
    
    # Verify model loaded correctly
    actual_dimension = model.get_sentence_embedding_dimension()
    max_length = model.max_seq_length
    
    print(f"✅ Model loaded successfully!")
    print(f"📊 Dimension: {actual_dimension} | Max length: {max_length}")
    
    # Check dimension matches config
    if actual_dimension != MODEL_DIMENSIONS:
        print(f"⚠️ WARNING: Expected {MODEL_DIMENSIONS}D, got {actual_dimension}D")
        print("Check your config.py MODEL_DIMENSIONS setting")
    
    # Quick functionality test
    print("🧪 Testing model...")
    test_vector = model.encode(
        "Test sentence for model verification.",
        task=MODEL_TASK,
        normalize_embeddings=True
    )
    
    # Verify output
    if len(test_vector) == MODEL_DIMENSIONS:
        print(f"✅ Model test passed - ready for vectorization!")
    else:
        print(f"❌ Test failed - vector dimension: {len(test_vector)}")
    
    print(f"🎯 Model optimized for: {MODEL_TASK}")
    print(f"🌐 Supports: English, Spanish and Simplified Chinese")

except Exception as e:
    print(f"❌ Model loading failed: {e}")
    # Quote the requirement so the shell does not treat ">=" as a redirect
    print('💡 Try: pip install "sentence-transformers>=2.7.0"')
    model = None

if model is not None:
    print(f"\n✅ STEP 6 COMPLETE: Model ready for batch processing")
else:
    print(f"\n❌ STEP 6 FAILED: Fix model loading before proceeding")

2025-07-11 13:34:56,498 - INFO - Use pytorch device_name: cpu
2025-07-11 13:34:56,500 - INFO - Load pretrained SentenceTransformer: jinaai/jina-embeddings-v3


🤖 Loading jinaai/jina-embeddings-v3...
📥 First run may take a moment to download model (~2GB)


2025-07-11 13:35:08,334 - INFO - 2 prompts are loaded, with the keys: ['retrieval.query', 'retrieval.passage']


✅ Model loaded successfully!
📊 Dimension: 1024 | Max length: 8194
🧪 Testing model...


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.72s/it]

✅ Model test passed - ready for vectorization!
🎯 Model optimized for: retrieval.passage
🌐 Supports: English, Spanish and Simplified Chinese

✅ STEP 6 COMPLETE: Model ready for batch processing
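
Every `encode` call in this notebook passes `normalize_embeddings=True`, so the stored vectors are expected to have unit length. For unit vectors, cosine similarity reduces to a plain dot product, which keeps downstream comparison code simple. The sketch below illustrates that relationship on small hand-made vectors in pure Python; the function names are hypothetical and the two-dimensional vectors are stand-ins for the 1024-dimensional lists stored in the JSON files.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_norm(vector: Sequence[float]) -> float:
    return math.sqrt(dot(vector, vector))


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = l2_norm(vector)
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """General cosine similarity, valid for vectors of any length."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def is_unit_length(vector: Sequence[float], tol: float = 1e-6) -> bool:
    """Sanity check that could be run on a vector loaded from a vector file."""
    return abs(l2_norm(vector) - 1.0) <= tol


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

print("unit length:", is_unit_length(a), is_unit_length(b))
print("dot product:", dot(a, b))
print("cosine     :", cosine_similarity(a, b))
```

A check like `is_unit_length` could be added alongside the dimension check in `verify_vector_files()` if unit length matters to the consumer of these files.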





## =============================================================================

## Part 3: Execution and Output

### STEP 7: BATCH PROCESS DOCUMENTS

In [9]:
# ==============================================================================
# STEP 7: BATCH PROCESS DOCUMENTS INTO VECTORS
# ==============================================================================

print("🚀 STARTING BATCH VECTORIZATION PROCESS")
print("=" * 60)

# Check if we have the model and documents ready
if 'model' not in locals() or model is None:
    print("❌ Model not loaded - run STEP 6 first")
    sys.exit(1)

if 'documents_to_process' not in locals():
    print("❌ Documents not identified - run STEP 3 first")
    sys.exit(1)

# Process documents if we have any to process
if documents_to_process and len(documents_to_process) > 0:
    print(f"📋 Processing {len(documents_to_process)} new documents")
    print(f"🤖 Model: {MODEL_NAME}")
    print(f"📐 Dimensions: {MODEL_DIMENSIONS}")
    print(f"🎯 Task: {MODEL_TASK}")
    print("-" * 40)
    
    # Load document content if not already loaded
    if 'loaded_docs' not in locals() or not loaded_docs:
        print("📖 Loading document content...")
        loaded_docs = load_documents_for_processing(documents_to_process)
        print(f"✅ Loaded {len(loaded_docs)} documents for processing")
    
    # Track processing statistics
    processing_stats = {
        'documents_processed': 0,
        'documents_failed': 0,
        'total_vectors_created': 0,
        'files_updated': []
    }
    
    # Process each document
    for i, corpus_item in enumerate(loaded_docs, 1):
        doc_id = corpus_item.get('document_id', f'unknown_{i}')
        
        # Show progress for every document
        print(f"\n📄 {i}/{len(loaded_docs)}: {doc_id}")
        
        try:
            # Extract text content using established function
            extracted_content = extract_document_content(corpus_item)
            
            # Validate extraction worked
            if not validate_extracted_content(extracted_content):
                print(f"  ❌ Content extraction failed")
                processing_stats['documents_failed'] += 1
                continue
            
            # Process into vectors using established function
            doc_stats = process_document_vectors(extracted_content, model)
            
            # Update statistics
            processing_stats['documents_processed'] += 1
            
            # Count vectors created
            vectors_created = 0
            if 'document' in doc_stats:
                vectors_created += doc_stats['document']['vectors_added']
            if 'section' in doc_stats:
                vectors_created += doc_stats['section']['vectors_added']
            if 'paragraph' in doc_stats:
                vectors_created += doc_stats['paragraph']['vectors_added']
            
            processing_stats['total_vectors_created'] += vectors_created
            print(f"  ✅ Created {vectors_created} vectors")
            
        except Exception as e:
            print(f"  ❌ Error processing {doc_id}: {e}")
            processing_stats['documents_failed'] += 1
            continue
    
    # Final processing summary
    print("\n" + "=" * 60)
    print("📊 BATCH PROCESSING COMPLETE")
    print("=" * 60)
    print(f"✅ Documents processed: {processing_stats['documents_processed']}")
    print(f"🎯 Total vectors created: {processing_stats['total_vectors_created']}")
    
    if processing_stats['documents_failed'] > 0:
        print(f"⚠️  Documents failed: {processing_stats['documents_failed']}")
    
    # Show language breakdown
    lang_counts = {}
    for doc in loaded_docs:
        # Extract language from processing_info
        lang = doc.get('processing_info', {}).get('language', 'unknown')
        lang_counts[lang] = lang_counts.get(lang, 0) + 1
    
    if lang_counts:
        lang_summary = ", ".join([f"{lang.upper()}: {count}" for lang, count in lang_counts.items()])
        print(f"🌐 Languages processed: {lang_summary}")
    
    # Verify vector files
    print(f"\n🔍 VERIFYING OUTPUT FILES")
    print("-" * 30)
    if verify_vector_files():
        print("✅ All vector files are properly formatted")
    else:
        print("⚠️  Some vector files may have issues")

else:
    print("🎉 NO PROCESSING NEEDED")
    print("All documents already have vectors")

# Show final corpus status
print(f"\n📊 FINAL CORPUS STATUS")
print("-" * 30)

# Reload existing vectors to get current totals
current_vectors = load_existing_vectors()
total_corpus_docs = sum(len(docs) for docs in document_metadata.values())
processed_docs = len(current_vectors)
coverage = (processed_docs / total_corpus_docs * 100) if total_corpus_docs > 0 else 0

print(f"Documents: {processed_docs}/{total_corpus_docs} ({coverage:.1f}% coverage)")
print(f"Model: {MODEL_NAME}")
print(f"Dimensions: {MODEL_DIMENSIONS}")

# Check if corpus is complete
if processed_docs >= total_corpus_docs:
    print("🎉 CORPUS VECTORIZATION COMPLETE!")
    print("🌐 Ready for cross-lingual analysis and clustering")
else:
    remaining = total_corpus_docs - processed_docs
    print(f"📋 Remaining: {remaining} documents to process")

# Store final results for next steps
# processing_stats only exists when documents were processed in this run,
# so fall back to an empty dict to avoid a NameError on the "no processing needed" path
run_stats = processing_stats if 'processing_stats' in locals() else {}
final_processing_summary = {
    'documents_processed': run_stats.get('documents_processed', 0),
    'total_vectors_created': run_stats.get('total_vectors_created', 0),
    'corpus_coverage_percent': round(coverage, 1),
    'total_documents_in_corpus': total_corpus_docs,
    'processed_documents_count': processed_docs,
    'model_used': MODEL_NAME,
    'vector_dimensions': MODEL_DIMENSIONS,
    'task_optimization': MODEL_TASK,
    'output_directory': str(PATHS['vectors'])
}

print(f"\n✅ STEP 7 COMPLETE")
print(f"📝 Results stored in 'final_processing_summary' variable")
print(f"📁 Vector files saved to: {PATHS['vectors']}")
print(f"🚀 Ready for STEP 8: Generate visualization report")

🚀 STARTING BATCH VECTORIZATION PROCESS
📋 Processing 13 new documents
🤖 Model: jinaai/jina-embeddings-v3
📐 Dimensions: 1024
🎯 Task: retrieval.passage
----------------------------------------

📄 1/13: gai-eng_corpus-item001
📄 Processing: gai-eng_corpus-item001 (Academic paper)
  ✅ Extracted: 22 sections, 69 paragraphs
  📊 Length: 4,524 words
✅ Content validation passed for gai-eng_corpus-item001

🔄 PROCESSING VECTORS FOR: gai-eng_corpus-item001
📖 Document: Attention is All You Need
🎯 Creating document vector for: gai-eng_corpus-item001

Batches: 100%|██████████| 1/1 [03:14<00:00, 194.17s/it]

  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 1)
  📁 File size: 0.06 MB
📚 Creating section vectors for: gai-eng_corpus-item001
  ✅ Created 22 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 22 vectors (total: 22)
  📁 File size: 0.73 MB
📝 Creating paragraph vectors for: gai-eng_corpus-item001
  ✅ Created 69 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 69 vectors (total: 69)
  📁 File size: 2.16 MB
✅ COMPLETED: gai-eng_corpus-item001
  ✅ Created 92 vectors

📄 2/13: gai-eng_corpus-item002
📄 Processing: gai-eng_corpus-item002 (Academic paper)
  ✅ Extracted: 15 sections, 71 paragraphs
  📊 Length: 9,297 words
✅ Content validation passed for gai-eng_corpus-item002

🔄 PROCESSING VECTORS FOR: gai-eng_corpus-item002
📖 Document: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
🎯 Creating document vector for: gai-eng_corpus-item002

Batches: 100%|██████████| 1/1 [02:51<00:00, 171.31s/it]

  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 2)
  📁 File size: 0.15 MB
📚 Creating section vectors for: gai-eng_corpus-item002
  ✅ Created 15 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 15 vectors (total: 37)
  📁 File size: 1.27 MB
📝 Creating paragraph vectors for: gai-eng_corpus-item002
  ✅ Created 71 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 71 vectors (total: 140)
  📁 File size: 4.41 MB
✅ COMPLETED: gai-eng_corpus-item002
  ✅ Created 87 vectors

📄 3/13: gai-eng_corpus-item003
📄 Processing: gai-eng_corpus-item003 (Policy document)
  ✅ Extracted: 36 sections, 194 paragraphs
  📊 Length: 14,050 words
✅ Content validation passed for gai-eng_corpus-item003

🔄 PROCESSING VECTORS FOR: gai-eng_corpus-item003
📖 Document: Recommendation on the Ethics of Artificial Intelligence
🎯 Creating document vector for: gai-eng_corpus-item003

Batches: 100%|██████████| 1/1 [02:50<00:00, 170.74s/it]

  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 3)
  📁 File size: 0.27 MB
📚 Creating section vectors for: gai-eng_corpus-item003
  ✅ Created 36 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 36 vectors (total: 73)
  📁 File size: 2.62 MB
📝 Creating paragraph vectors for: gai-eng_corpus-item003
  ✅ Created 194 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 194 vectors (total: 334)
  📁 File size: 10.49 MB
✅ COMPLETED: gai-eng_corpus-item003
  ✅ Created 231 vectors

üìÑ 4/13: gai-eng_corpus-item004
üìÑ Processing: gai-eng_corpus-item004 (Blog post)
  ‚úÖ Extracted: 7 sections, 56 paragraphs
  üìä Length: 3,594 words
‚úÖ Content validation passed for gai-eng_corpus-item004

üîÑ PROCESSING VECTORS FOR: gai-eng_corpus-item004
üìñ Document: The Age of AI has begun
üéØ Creating document vector for: gai-eng_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:00<00:00, 60.55s/it]


  üíæ Saved to: gai-corpus-document-vectors.json
  üìä Added 1 vectors (total: 4)
  üìÅ File size: 0.32 MB
üìö Creating section vectors for: gai-eng_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:07<00:00,  7.71s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.07s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.50s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.60s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.12s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.87s/it]


  ‚úÖ Created 7 section vectors (with document titles)
  üíæ Saved to: gai-corpus-section-vectors.json
  üìä Added 7 vectors (total: 80)
  üìÅ File size: 2.86 MB
üìù Creating paragraph vectors for: gai-eng_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.03s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.68s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.32s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.57s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.48s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.12s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.11s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.36s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.52s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.37s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 56 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 56 vectors (total: 390)
  📁 File size: 12.24 MB
✅ COMPLETED: gai-eng_corpus-item004
  ✅ Created 64 vectors

📄 5/13: gai-eng_corpus-item005
📄 Processing: gai-eng_corpus-item005 (Report)
  ✅ Extracted: 6 sections, 30 paragraphs
  📊 Length: 3,074 words
✅ Content validation passed for gai-eng_corpus-item005

🔄 PROCESSING VECTORS FOR: gai-eng_corpus-item005
📖 Document: "If I Had Another Job, I Would Not Accept Data Annotation Tasks": How Syrian Refugees in Lebanon Train AI
🎯 Creating document vector for: gai-eng_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:48<00:00, 48.74s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 5)
  📁 File size: 0.37 MB
📚 Creating section vectors for: gai-eng_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:09<00:00,  9.08s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.61s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.97s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:06<00:00,  6.05s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.16s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.70s/it]


  ✅ Created 6 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 6 vectors (total: 86)
  📁 File size: 3.06 MB
📝 Creating paragraph vectors for: gai-eng_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.61s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.88s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.38s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.29s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.75s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.45s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.28s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.93s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.26s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.44s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.11s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.75s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 30 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 30 vectors (total: 420)
  📁 File size: 13.18 MB
✅ COMPLETED: gai-eng_corpus-item005
  ✅ Created 37 vectors

📄 6/13: gai-eng_corpus-item006
📄 Processing: gai-eng_corpus-item006 (['Policy document', 'Report'])
  ✅ Extracted: 49 sections, 225 paragraphs
  📊 Length: 21,204 words
✅ Content validation passed for gai-eng_corpus-item006

🔄 PROCESSING VECTORS FOR: gai-eng_corpus-item006
📖 Document: Copyright and Artificial Intelligence Part 3: Generative AI Training, Pre-Publication Version
🎯 Creating document vector for: gai-eng_corpus-item006


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [02:44<00:00, 164.99s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 6)
  📁 File size: 0.53 MB
📚 Creating section vectors for: gai-eng_corpus-item006


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.87s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:24<00:00, 84.83s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.16s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:07<00:00,  7.44s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:13<00:00, 13.23s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:06<00:00,  6.42s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.70s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:09<00:00,  9.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.73s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:06<00:00,  6.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:12<00:00, 12.21s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.02s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 49 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 49 vectors (total: 135)
  📁 File size: 4.92 MB
📝 Creating paragraph vectors for: gai-eng_corpus-item006


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.53s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.31s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.78s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.60s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.21s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.57s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.25s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.46s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.42s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.48s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.62s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.38s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 225 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 225 vectors (total: 645)
  📁 File size: 20.27 MB
✅ COMPLETED: gai-eng_corpus-item006
  ✅ Created 275 vectors

📄 7/13: gai-esp_corpus-item001
📄 Processing: gai-esp_corpus-item001 (Policy document)
  ✅ Extracted: 16 sections, 202 paragraphs
  📊 Length: 10,317 words
✅ Content validation passed for gai-esp_corpus-item001

🔄 PROCESSING VECTORS FOR: gai-esp_corpus-item001
📖 Document: Propuesta de Agenda Nacional de la Inteligencia Artificial para México 2024-2030
🎯 Creating document vector for: gai-esp_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [02:48<00:00, 168.77s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 7)
  📁 File size: 0.63 MB
📚 Creating section vectors for: gai-esp_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:31<00:00, 31.84s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:18<00:00, 18.93s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.99s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:15<00:00, 15.10s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.52s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:52<00:00, 52.71s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.09s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:07<00:00,  7.16s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.55s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.90s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.91s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.48s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:20<00:00, 20.53s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 16 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 16 vectors (total: 151)
  📁 File size: 5.52 MB
📝 Creating paragraph vectors for: gai-esp_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.28s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.38s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.28s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.38s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.30s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.13s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.15s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.29s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.36s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.30s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.14s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 202 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 202 vectors (total: 847)
  📁 File size: 26.58 MB
✅ COMPLETED: gai-esp_corpus-item001
  ✅ Created 219 vectors

📄 8/13: gai-esp_corpus-item002
📄 Processing: gai-esp_corpus-item002 (Journal article)
  ✅ Extracted: 6 sections, 42 paragraphs
  📊 Length: 7,020 words
✅ Content validation passed for gai-esp_corpus-item002

🔄 PROCESSING VECTORS FOR: gai-esp_corpus-item002
📖 Document: Conversando con una computadora: ¿Cómo entienden las inteligencias artificiales lo que les pedimos?
🎯 Creating document vector for: gai-esp_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [02:44<00:00, 164.31s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 8)
  📁 File size: 0.71 MB
📚 Creating section vectors for: gai-esp_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.68s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.06s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:08<00:00,  8.95s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:06<00:00,  6.95s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:26<00:00, 26.69s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:21<00:00, 21.57s/it]


  ✅ Created 6 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 6 vectors (total: 157)
  📁 File size: 5.74 MB
📝 Creating paragraph vectors for: gai-esp_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.79s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.64s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.82s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.74s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.90s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.91s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.57s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.70s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.77s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.75s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.79s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 42 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 42 vectors (total: 889)
  📁 File size: 27.92 MB
✅ COMPLETED: gai-esp_corpus-item002
  ✅ Created 49 vectors

📄 9/13: gai-esp_corpus-item003
📄 Processing: gai-esp_corpus-item003 (Academic paper)
  ✅ Extracted: 4 sections, 61 paragraphs
  📊 Length: 6,855 words
✅ Content validation passed for gai-esp_corpus-item003

🔄 PROCESSING VECTORS FOR: gai-esp_corpus-item003
📖 Document: Inteligencia artificial generativa: irrupción y desafíos
🎯 Creating document vector for: gai-esp_corpus-item003


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [02:42<00:00, 162.04s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 9)
  📁 File size: 0.78 MB
📚 Creating section vectors for: gai-esp_corpus-item003


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:34<00:00, 34.78s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:06<00:00,  6.14s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.77s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:49<00:00, 49.19s/it]


  ✅ Created 4 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 4 vectors (total: 161)
  📁 File size: 5.91 MB
📝 Creating paragraph vectors for: gai-esp_corpus-item003


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.74s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.65s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.79s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.54s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.75s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.18s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.21s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.42s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.88s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.79s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.39s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.07s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 61 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 61 vectors (total: 950)
  📁 File size: 29.85 MB
✅ COMPLETED: gai-esp_corpus-item003
  ✅ Created 66 vectors

📄 10/13: gai-esp_corpus-item004
📄 Processing: gai-esp_corpus-item004 (News article)
  ✅ Extracted: 5 sections, 26 paragraphs
  📊 Length: 1,944 words
✅ Content validation passed for gai-esp_corpus-item004

🔄 PROCESSING VECTORS FOR: gai-esp_corpus-item004
📖 Document: México avanza con su plan nacional para el desarrollo ético de la inteligencia artificial
🎯 Creating document vector for: gai-esp_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:24<00:00, 24.14s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 10)
  📁 File size: 0.83 MB
📚 Creating section vectors for: gai-esp_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.73s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.59s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.41s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.80s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.30s/it]


  ✅ Created 5 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 5 vectors (total: 166)
  📁 File size: 6.07 MB
📝 Creating paragraph vectors for: gai-esp_corpus-item004


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.11s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.49s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.39s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.52s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.39s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.48s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.25s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.65s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.45s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.44s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.80s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.97s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.66s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 26 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 26 vectors (total: 976)
  📁 File size: 30.66 MB
✅ COMPLETED: gai-esp_corpus-item004
  ✅ Created 32 vectors

📄 11/13: gai-esp_corpus-item005
📄 Processing: gai-esp_corpus-item005 (News article)
  ✅ Extracted: 12 sections, 34 paragraphs
  📊 Length: 1,115 words
✅ Content validation passed for gai-esp_corpus-item005

🔄 PROCESSING VECTORS FOR: gai-esp_corpus-item005
📖 Document: Inteligencia Artificial Generativa: ¿Qué es? ¿Es un riesgo o ventaja?
🎯 Creating document vector for: gai-esp_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:10<00:00, 10.91s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 11)
  📁 File size: 0.86 MB
📚 Creating section vectors for: gai-esp_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.95s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.93s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.43s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.45s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.43s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.86s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.54s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.52s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.72s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.50s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.18s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.31s/it]


  ✅ Created 12 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 12 vectors (total: 178)
  📁 File size: 6.45 MB
📝 Creating paragraph vectors for: gai-esp_corpus-item005


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.14s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.26s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.62s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.58s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.30s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.17s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.32s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.30s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.34s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.22s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.43s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.20s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 34 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 34 vectors (total: 1010)
  📁 File size: 31.72 MB
✅ COMPLETED: gai-esp_corpus-item005
  ✅ Created 47 vectors

📄 12/13: gai-zho_corpus-item001
📄 Processing: gai-zho_corpus-item001 (Journal article)
  ✅ Extracted: 12 sections, 36 paragraphs
  📊 Length: 140 words
✅ Content validation passed for gai-zho_corpus-item001

🔄 PROCESSING VECTORS FOR: gai-zho_corpus-item001
📖 Document: 大模型关键技术与未来发展方向——从 ChatGPT 谈起
🎯 Creating document vector for: gai-zho_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:09<00:00, 69.02s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 12)
  📁 File size: 0.91 MB
📚 Creating section vectors for: gai-zho_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.29s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:31<00:00, 31.88s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.35s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.16s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.16s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.04s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.81s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:07<00:00,  7.39s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.34s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.47s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.21s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.39s/it]


  ✅ Created 12 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 12 vectors (total: 190)
  📁 File size: 6.86 MB
📝 Creating paragraph vectors for: gai-zho_corpus-item001


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.49s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.45s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.62s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.55s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.45s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.80s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.17s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.06s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.84s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.62s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.63s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.50s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.65s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ‚úÖ Created 36 paragraph vectors (with document titles)
  üíæ Saved to: gai-corpus-paragraph-vectors.json
  üìä Added 36 vectors (total: 1046)
  üìÅ File size: 32.85 MB
‚úÖ COMPLETED: gai-zho_corpus-item001
  ‚úÖ Created 49 vectors

üìÑ 13/13: gai-zho_corpus-item002
üìÑ Processing: gai-zho_corpus-item002 (Journal article)
  ‚úÖ Extracted: 19 sections, 22 paragraphs
  üìä Length: 192 words
‚úÖ Content validation passed for gai-zho_corpus-item002

üîÑ PROCESSING VECTORS FOR: gai-zho_corpus-item002
üìñ Document: ChatGPTÂèäÁîüÊàêÂºè‰∫∫Â∑•Êô∫ËÉΩÁé∞Áä∂ÂèäÊú™Êù•ÂèëÂ±ïÊñπÂêë
üéØ Creating document vector for: gai-zho_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:21<00:00, 81.09s/it]


  💾 Saved to: gai-corpus-document-vectors.json
  📊 Added 1 vectors (total: 13)
  📁 File size: 0.97 MB
📚 Creating section vectors for: gai-zho_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:04<00:00,  4.56s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:23<00:00, 23.11s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:10<00:00, 10.98s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.62s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.73s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.09s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:05<00:00,  5.77s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.80s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.61s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.75s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.85s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:18<00:00, 18.25s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.28s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 19 section vectors (with document titles)
  💾 Saved to: gai-corpus-section-vectors.json
  📊 Added 19 vectors (total: 209)
  📁 File size: 7.50 MB
📝 Creating paragraph vectors for: gai-zho_corpus-item002


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.50s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.01s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.71s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.73s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.85s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.19s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.64s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.03s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:03<00:00,  3.05s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.29s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.73s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.64s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:02<00:00,  2.68s/it]
Batches: 100%|‚ñà‚ñà‚ñà‚ñ

  ✅ Created 22 paragraph vectors (with document titles)
  💾 Saved to: gai-corpus-paragraph-vectors.json
  📊 Added 22 vectors (total: 1068)
  📁 File size: 33.55 MB
✅ COMPLETED: gai-zho_corpus-item002
  ✅ Created 42 vectors

📊 BATCH PROCESSING COMPLETE
✅ Documents processed: 13
🎯 Total vectors created: 1290
🌐 Languages processed: ENG: 6, ESP: 5, ZHO: 2

🔍 VERIFYING OUTPUT FILES
------------------------------

🔍 VERIFYING VECTOR FILES (WITH DOCUMENT TITLES)
📄 gai-corpus-document-vectors.json:
   Model: jinaai/jina-embeddings-v3
   Dimension: 1024
   Vector count: 13
   ✅ Schema complete (with document titles)
   ✅ Vector dimension: 1024

📄 gai-corpus-section-vectors.json:
   Model: jinaai/jina-embeddings-v3
   Dimension: 1024
   Vector count: 209
   ✅ Schema complete (with document titles)
   ✅ Vector dimension: 1024
   ✅ Document title: Attention is All You Need...

📄 gai-corpus-paragraph-vectors.json:
   Model: jinaai/jina-embeddings-v3
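
The totals reported in the log can be cross-checked with simple arithmetic. The following is a minimal sketch, with the counts copied by hand from the output above; the variable names are illustrative and not part of the notebook.

```python
# Final per-level totals, copied from the last "(total: N)" line of each level in the log.
level_totals = {"document": 13, "section": 209, "paragraph": 1068}

# "Created N vectors" values for the ten documents shown in this part of the log
# (items 4/13 through 13/13).
created_per_doc = [64, 37, 275, 219, 49, 66, 32, 47, 49, 42]

grand_total = sum(level_totals.values())
print(f"Total across levels: {grand_total}")        # matches "Total vectors created: 1290"
print(f"Created by documents 4-13: {sum(created_per_doc)}")
```

If the per-level totals ever disagree with the batch summary, the likely cause is a vector file that already held entries from an earlier run.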

### STEP 8: GENERATE JSON VISUALIZATION REPORT

In [10]:
import sys
import subprocess

# Install plotly and scikit-learn in the current Python environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "plotly", "scikit-learn"])
print("✅ Plotly installed in current environment")

✅ Plotly installed in current environment
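
The install cell runs pip on every execution. One possible refinement, sketched here as an assumption rather than as the notebook's approach, is to check first which modules are importable; `missing_packages` is a hypothetical helper name.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# These are import names, not pip names: scikit-learn is imported as `sklearn`.
needed = ["plotly", "sklearn"]
print("Missing modules:", missing_packages(needed))
```

Only the names returned by the helper would then need to be passed to pip, after mapping each import name back to its distribution name (for example `sklearn` to `scikit-learn`).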


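The Step 8 cell derives each vector's language from its document ID by splitting on `_` and `-`. A stricter variant is sketched below; it assumes IDs follow the `gai-eng_corpus-item004` shape seen in the log, and both `parse_language` and the pattern are hypothetical, not part of the notebook.

```python
import re

# Assumed ID shape, inferred from the log: "<prefix>-<lang>_<rest>",
# e.g. "gai-eng_corpus-item004", with a three-letter lowercase language code.
_ID_PATTERN = re.compile(r"^[a-z]+-(?P<lang>[a-z]{3})_")

def parse_language(doc_id):
    """Return the three-letter language code, or 'unknown' if the ID does not match."""
    if not isinstance(doc_id, str):
        return "unknown"
    match = _ID_PATTERN.match(doc_id)
    return match.group("lang") if match else "unknown"

for doc_id in ["gai-eng_corpus-item004", "gai-zho_corpus-item002", "unknown", None]:
    print(doc_id, "->", parse_language(doc_id))
```

A pattern-based check rejects malformed IDs outright instead of returning whatever follows the last hyphen, which keeps unexpected values out of the color lookup.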
In [12]:
# ==============================================================================
# STEP 8: GENERATE OPTIMIZED JSON VISUALIZATION DATA
# Exports clean JSON data for frontend visualization system
# ==============================================================================

import plotly.graph_objects as go
import plotly.express as px
import numpy as np
from sklearn.decomposition import PCA
from datetime import datetime
import json
from pathlib import Path

def extract_language_from_id(doc_id):
    """Extract the language code from a document ID (e.g., 'gai-eng_corpus-item001' -> 'eng')."""
    try:
        if '_' in doc_id:
            prefix = doc_id.split('_')[0]
            if '-' in prefix:
                return prefix.split('-')[-1]
        return 'unknown'
    except (TypeError, AttributeError):
        # Non-string IDs (e.g., None) fall back to 'unknown'
        return 'unknown'

def get_language_color_scheme():
    """Define color schemes for each language and granularity level"""
    return {
        'eng': {
            'Document': '#1f77b4',    # Dark blue
            'Section (L0)': '#5299c4', # Medium blue  
            'Section (L1+)': '#7db8d4', # Light blue (all sub-levels)
            'Paragraph': '#a8d1e8'     # Lightest blue
        },
        'esp': {
            'Document': '#2ca02c',     # Dark green
            'Section (L0)': '#5cb85c', # Medium green
            'Section (L1+)': '#8cc98c', # Light green (all sub-levels)
            'Paragraph': '#b8dab8'     # Lightest green
        },
        'zho': {
            'Document': '#d62728',     # Dark red
            'Section (L0)': '#e55858', # Medium red
            'Section (L1+)': '#f08888', # Light red (all sub-levels)
            'Paragraph': '#fbb8b8'     # Lightest red
        },
        'unknown': {
            'Document': '#666666',     # Gray
            'Section (L0)': '#888888',
            'Section (L1+)': '#aaaaaa',
            'Paragraph': '#cccccc'
        }
    }

def load_all_vectors_for_visualization():
    """Load all vectors with improved language and type classification"""
    vector_dir = PATHS['vectors']
    
    all_vectors = []
    all_labels = []
    all_types = []
    all_languages = []
    all_colors = []
    all_details = []
    
    color_scheme = get_language_color_scheme()
    
    # Load document vectors
    doc_file = vector_dir / OUTPUT_FILES['document_vectors']
    if doc_file.exists():
        print(f"📄 Loading document vectors from {doc_file.name}")
        with open(doc_file, 'r', encoding='utf-8') as f:
            doc_data = json.load(f)
        
        for vector in doc_data.get('vectors', []):
            doc_id = vector['id']
            language = extract_language_from_id(doc_id)
            vector_type = 'Document'
            
            all_vectors.append(vector['vector'])
            
            # Create readable label
            title = vector.get('title', 'No title')
            short_title = title[:25] + "..." if len(title) > 25 else title
            all_labels.append(f"DOC ({language.upper()}): {short_title}")
            
            all_types.append(vector_type)
            all_languages.append(language)
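            # Fall back to the 'unknown' palette for language codes without a color scheme
            all_colors.append(color_scheme.get(language, color_scheme['unknown'])[vector_type])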
            
            # Document popup data (NO excerpt for documents)
            all_details.append({
                'corpus_item': doc_id,
                'document_title': title,
                'type': 'DOCUMENT',
                'language': language.upper(),
                'word_count': vector.get('word_count', 'N/A')
            })
    
    # Load section vectors (sample to avoid clutter)
    section_file = vector_dir / OUTPUT_FILES['section_vectors']
    if section_file.exists():
        print(f"📚 Loading section vectors from {section_file.name}")
        with open(section_file, 'r', encoding='utf-8') as f:
            section_data = json.load(f)
        
        sections = section_data.get('vectors', [])
        # Take every 3rd section to avoid overcrowding
        for vector in sections[::3]:
            doc_id = vector.get('document_id', 'unknown')
            language = extract_language_from_id(doc_id)
            level = vector.get('level', 0)
            
            # Group all sub-levels (L1, L2, L3...) together
            if level == 0:
                vector_type = 'Section (L0)'
                level_display = 'L0'
            else:
                vector_type = 'Section (L1+)'
                level_display = f'L{level}'
            
            all_vectors.append(vector['vector'])
            
            # Create readable label
            title = vector.get('title', 'No title')
            short_title = title[:20] + "..." if len(title) > 20 else title
            all_labels.append(f"SEC-{level_display} ({language.upper()}): {short_title}")
            
            all_types.append(vector_type)
            all_languages.append(language)
            all_colors.append(color_scheme[language][vector_type])
            
            # Section popup data
            text = vector.get('text', '')
            words = text.split()
            excerpt = ' '.join(words[:10])  # First 10 words
            if len(words) > 10:
                excerpt += "..."
            
            all_details.append({
                'corpus_item': doc_id,
                'document_title': vector.get('document_title', 'No title'),
                'type': 'SECTION',
                'language': language.upper(),
                'section_id': vector['id'],
                'section_title': title,
                'excerpt': excerpt,
                'level': level
            })
    
    # Load paragraph vectors (smaller sample)
    para_file = vector_dir / OUTPUT_FILES['paragraph_vectors']
    if para_file.exists():
        print(f"📝 Loading paragraph vectors from {para_file.name}")
        with open(para_file, 'r', encoding='utf-8') as f:
            para_data = json.load(f)
        
        paragraphs = para_data.get('vectors', [])
        # Take every 8th paragraph to avoid overcrowding
        for vector in paragraphs[::8]:
            doc_id = vector.get('document_id', 'unknown')
            language = extract_language_from_id(doc_id)
            vector_type = 'Paragraph'
            
            all_vectors.append(vector['vector'])
            
            # Create readable label
            text = vector.get('text', '')
            short_text = text[:25] + "..." if len(text) > 25 else text
            all_labels.append(f"PARA ({language.upper()}): {short_text}")
            
            all_types.append(vector_type)
            all_languages.append(language)
            all_colors.append(color_scheme[language][vector_type])
            
            # Paragraph popup data
            words = text.split()
            excerpt = ' '.join(words[:10])  # First 10 words
            if len(words) > 10:
                excerpt += "..."
            
            all_details.append({
                'corpus_item': doc_id,
                'document_title': vector.get('document_title', 'No title'),
                'type': 'PARAGRAPH',
                'language': language.upper(),
                'paragraph_id': vector['id'],
                'section_title': vector.get('section_title', 'No section'),
                'excerpt': excerpt
            })
    
    print(f"✅ Loaded {len(all_vectors)} vectors for visualization")
    return np.array(all_vectors), all_labels, all_types, all_languages, all_colors, all_details
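
# A minimal sketch of a shared excerpt helper for the section and paragraph
# popups, assuming the same "first 10 words" rule used above. The name
# make_excerpt and its max_words parameter are hypothetical and not part of
# the original notebook.
def make_excerpt(text, max_words=10):
    """Return the first max_words words of text, adding "..." if truncated."""
    words = text.split()
    excerpt = ' '.join(words[:max_words])
    if len(words) > max_words:
        excerpt += "..."
    return excerpt

# Possible usage inside the loops above: excerpt = make_excerpt(vector.get('text', ''))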

def create_comprehensive_corpus_statistics():
    """Generate comprehensive corpus statistics including all required metrics"""
    vector_dir = PATHS['vectors']
    
    # Initialize statistics structure
    stats = {
        'total_documents': 0,
        'total_sections': 0,
        'total_paragraphs': 0,
        'total_vectors': 0,
        'languages': {},
        'vector_types': {
            'document_vectors': 0,
            'section_vectors': 0,
            'paragraph_vectors': 0
        },
        'coverage_percent': 0,
        'processed_documents': set(),
        'corpus_documents': 0
    }
    
    # Count vectors by type and language
    # Document vectors
    doc_file = vector_dir / OUTPUT_FILES['document_vectors']
    if doc_file.exists():
        try:
            with open(doc_file, 'r', encoding='utf-8') as f:
                doc_data = json.load(f)
            documents = doc_data.get('vectors', [])
            stats['total_documents'] = len(documents)
            stats['vector_types']['document_vectors'] = len(documents)
            
            for doc in documents:
                doc_id = doc.get('id', '')
                stats['processed_documents'].add(doc_id)
                language = extract_language_from_id(doc_id)
                stats['languages'][language] = stats['languages'].get(language, 0) + 1
        except Exception as e:
            print(f"Error reading document vectors: {e}")
    
    # Section vectors
    section_file = vector_dir / OUTPUT_FILES['section_vectors']
    if section_file.exists():
        try:
            with open(section_file, 'r', encoding='utf-8') as f:
                section_data = json.load(f)
            sections = section_data.get('vectors', [])
            stats['total_sections'] = len(sections)
            stats['vector_types']['section_vectors'] = len(sections)
        except Exception as e:
            print(f"Error reading section vectors: {e}")
    
    # Paragraph vectors
    para_file = vector_dir / OUTPUT_FILES['paragraph_vectors']
    if para_file.exists():
        try:
            with open(para_file, 'r', encoding='utf-8') as f:
                para_data = json.load(f)
            paragraphs = para_data.get('vectors', [])
            stats['total_paragraphs'] = len(paragraphs)
            stats['vector_types']['paragraph_vectors'] = len(paragraphs)
        except Exception as e:
            print(f"Error reading paragraph vectors: {e}")
    
    stats['total_vectors'] = stats['total_documents'] + stats['total_sections'] + stats['total_paragraphs']
    
    # Calculate coverage based on original corpus
    if 'document_metadata' in globals():
        total_corpus_docs = sum(len(docs) for docs in document_metadata.values())
        stats['corpus_documents'] = total_corpus_docs
        if total_corpus_docs > 0:
            stats['coverage_percent'] = round((len(stats['processed_documents']) / total_corpus_docs) * 100, 1)
    
    # Convert set to count for JSON serialization
    stats['processed_documents'] = len(stats['processed_documents'])
    
    return stats
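
# A minimal sketch of why the processed-document set is replaced by its size
# before export: json.dumps cannot serialize a Python set. The helper name
# coverage_percent, the example_stats dict, and the document IDs below are
# illustrative assumptions, not values from the corpus.
import json

def coverage_percent(processed_ids, total_corpus_docs):
    """Share of corpus documents that have a document vector, in percent."""
    if total_corpus_docs <= 0:
        return 0
    return round(len(set(processed_ids)) / total_corpus_docs * 100, 1)

example_stats = {'processed_documents': {'example-eng-001', 'example-esp-001'}}
try:
    json.dumps(example_stats)
    set_is_serializable = True
except TypeError:
    set_is_serializable = False  # json.dumps raises TypeError for sets

# Replacing the set with its length makes the structure JSON-safe
example_stats['processed_documents'] = len(example_stats['processed_documents'])
example_json = json.dumps(example_stats)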

def create_visualization_data(vectors, labels, types, languages, colors, details):
    """Create PCA projections and prepare all chart data"""
    
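    # Create PCA projections (the 3D projection needs at least 3 vectors)
    if len(vectors) < 3:
        raise ValueError(f"Need at least 3 vectors for PCA, got {len(vectors)}")
    pca_2d = PCA(n_components=2)
    pca_3d = PCA(n_components=3)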
    
    coords_2d = pca_2d.fit_transform(vectors)
    coords_3d = pca_3d.fit_transform(vectors)
    
    # Build chart data
    chart_2d_data = []
    chart_3d_data = []
    
    for i, (label, type_name, lang, color, detail) in enumerate(zip(labels, types, languages, colors, details)):
        # 2D data point
        point_2d = {
            "x": float(coords_2d[i, 0]),
            "y": float(coords_2d[i, 1]),
            "label": label,
            "type": type_name,
            "language": lang.upper(),
            "color": color,
            "popup": detail
        }
        chart_2d_data.append(point_2d)
        
        # 3D data point  
        point_3d = {
            "x": float(coords_3d[i, 0]),
            "y": float(coords_3d[i, 1]),
            "z": float(coords_3d[i, 2]),
            "label": label,
            "type": type_name,
            "language": lang.upper(),
            "color": color,
            "popup": detail
        }
        chart_3d_data.append(point_3d)
    
    # Language distribution data
    lang_counts = {}
    for lang in languages:
        lang_counts[lang] = lang_counts.get(lang, 0) + 1
    
    lang_colors = {'eng': '#1f77b4', 'esp': '#2ca02c', 'zho': '#d62728', 'unknown': '#666666'}
    
    dist_data = []
    for lang, count in lang_counts.items():
        dist_data.append({
            "language": lang.upper(),
            "count": count,
            "color": lang_colors.get(lang, '#666666')
        })
    
    return {
        "pca_2d": {
            "data": chart_2d_data,
            "variance_explained": [float(v) for v in pca_2d.explained_variance_ratio_],
            "title": f"Cross-lingual Vector Space (2D) - {MODEL_DIMENSIONS}D Embeddings"
        },
        "pca_3d": {
            "data": chart_3d_data,
            "variance_explained": [float(v) for v in pca_3d.explained_variance_ratio_],
            "title": f"Cross-lingual Vector Space (3D) - {MODEL_DIMENSIONS}D Embeddings"
        },
        "language_distribution": {
            "data": dist_data,
            "title": "Vector Distribution by Language"
        }
    }
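
# A minimal sketch of the language tally written with collections.Counter,
# assuming a color table shaped like lang_colors above. The function name
# build_language_distribution is hypothetical; languages missing from the
# color table fall back to grey, as in the original loop.
from collections import Counter

def build_language_distribution(languages, lang_colors):
    """Count vectors per language and attach a display color to each."""
    counts = Counter(languages)  # keeps first-seen order of languages
    return [
        {
            "language": lang.upper(),
            "count": count,
            "color": lang_colors.get(lang, '#666666'),
        }
        for lang, count in counts.items()
    ]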

def generate_json_visualization_export():
    """Generate complete JSON export for frontend visualization system"""
    
    print("🎨 GENERATING JSON VISUALIZATION EXPORT")
    print("=" * 60)
    
    # Use config paths
    vector_dir = PATHS['vectors']
    output_dir = PATHS['visualizations']
    
    # Ensure output directory exists
    output_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"📁 Vector source: {vector_dir}")
    print(f"📁 Output directory: {output_dir}")
    
    # Load vectors and create visualizations
    print("📊 Loading vectors for visualization...")
    vectors, labels, types, languages, colors, details = load_all_vectors_for_visualization()
    
    if len(vectors) == 0:
        print("❌ No vectors found to visualize!")
        print(f"💡 Check that vector files exist in: {vector_dir}")
        return None
    
    print(f"✅ Loaded {len(vectors)} vectors for visualization")
    print(f"📐 Vector dimensions: {len(vectors[0])} (expected: {MODEL_DIMENSIONS})")
    
    # Generate comprehensive statistics
    print("📈 Computing comprehensive corpus statistics...")
    stats = create_comprehensive_corpus_statistics()
    
    # Create visualization data
    print("🎯 Creating visualization data...")
    charts = create_visualization_data(vectors, labels, types, languages, colors, details)
    
    # Build complete JSON structure
    timestamp = datetime.now()
    visualization_export = {
        "metadata": {
            "model": MODEL_NAME,
            "dimensions": MODEL_DIMENSIONS,
            "task": MODEL_TASK,
            "domain": DOMAIN.upper(),
            "generated": timestamp.isoformat(),
            "generated_readable": timestamp.strftime('%Y-%m-%d %H:%M:%S'),
            "version": timestamp.strftime('%Y%m%d_%H%M%S')
        },
        "corpus_statistics": {
            "total_documents": stats['total_documents'],
            "total_sections": stats['total_sections'],
            "total_paragraphs": stats['total_paragraphs'],
            "total_vectors": stats['total_vectors'],
            "document_vectors": stats['vector_types']['document_vectors'],
            "section_vectors": stats['vector_types']['section_vectors'],
            "paragraph_vectors": stats['vector_types']['paragraph_vectors'],
            "coverage_percent": stats['coverage_percent'],
            "processed_documents": stats['processed_documents'],
            "corpus_documents": stats['corpus_documents'],
            "languages": stats['languages']
        },
        "charts": charts
    }
    
    # Save JSON file with timestamp
    json_filename = f"{DOMAIN}_visualization_data_{timestamp.strftime('%Y%m%d_%H%M%S')}.json"
    json_path = output_dir / json_filename
    
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(visualization_export, f, ensure_ascii=False, indent=2)
    
    # Calculate file size
    json_size_mb = json_path.stat().st_size / (1024*1024)
    
    print("✅ JSON VISUALIZATION EXPORT COMPLETE")
    print(f"📁 Saved to: {json_path}")
    print(f"📊 File size: {json_size_mb:.2f}MB")
    print("📈 Statistics included:")
    print(f"   • Total documents: {stats['total_documents']}")
    print(f"   • Total vectors: {stats['total_vectors']}")
    print(f"   • Coverage: {stats['coverage_percent']}%")
    print(f"   • Languages: {dict(stats['languages'])}")
    print("🎯 Charts included: 2D PCA, 3D PCA, Language Distribution")
    print("🔧 Document popups: NO excerpt (cleaned as requested)")
    print("📱 Ready for frontend integration")
    
    return {
        'json_file': str(json_path),
        'json_filename': json_filename,
        'size_mb': json_size_mb,
        'stats': stats,
        'timestamp': timestamp.strftime('%Y-%m-%d %H:%M:%S')
    }
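
# A minimal sketch for comparing exports over time, assuming files follow the
# <domain>_visualization_data_<YYYYmmdd_HHMMSS>.json naming used above.
# The helper name list_exports is hypothetical and not part of the notebook.
from pathlib import Path

def list_exports(output_dir, domain):
    """Return export files for a domain, oldest first.

    The timestamp format sorts lexicographically in chronological order,
    so sorting on the file name is enough.
    """
    pattern = f"{domain}_visualization_data_*.json"
    return sorted(Path(output_dir).glob(pattern), key=lambda p: p.name)

# Possible usage: exports = list_exports(PATHS['visualizations'], DOMAIN)
#                 latest = exports[-1] if exports else None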

# Execute the JSON export
print("🚀 Creating JSON visualization export...")
export_result = generate_json_visualization_export()

if export_result:
    print("\n🎉 JSON EXPORT COMPLETE!")
    print("💡 Frontend Integration:")
    print(f"   1. Upload {export_result['json_filename']} to your website")
    print("   2. Update visualizations.js to load this file")
    print("   3. Corpus statistics and charts will auto-populate")
    print()
    print("📊 Export Summary:")
    print(f"   • File: {export_result['json_filename']}")
    print(f"   • Size: {export_result['size_mb']:.2f}MB")
    print(f"   • Generated: {export_result['timestamp']}")
    print(f"   • Documents: {export_result['stats']['total_documents']}")
    print(f"   • Vectors: {export_result['stats']['total_vectors']}")
    print()
    print("🔄 Re-run this step after processing new documents")
    print("📈 Each export contains complete corpus state")
    print("🕐 Historical comparison: Compare multiple JSON files over time")
else:
    print("\n❌ Could not generate JSON export")
    print("💡 Make sure Step 7 batch processing completed successfully")

🚀 Creating JSON visualization export...
🎨 GENERATING JSON VISUALIZATION EXPORT
📁 Vector source: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2\corpora\gai\vectors
📁 Output directory: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2\corpora\gai\vectors\visualizations
📊 Loading vectors for visualization...
📄 Loading document vectors from gai-corpus-document-vectors.json
📚 Loading section vectors from gai-corpus-section-vectors.json
📝 Loading paragraph vectors from gai-corpus-paragraph-vectors.json
✅ Loaded 217 vectors for visualization
✅ Loaded 217 vectors for visualization
📐 Vector dimensions: 1024 (expected: 1024)
📈 Computing comprehensive corpus statistics...
🎯 Creating visualization data...
✅ JSON VISUALIZATION EXPORT COMPLETE
📁 Saved to: c:\Users\alain\OneDrive\Documents\GitHub\pragmatic-auto-translator-v2\corpora\gai\vectors\visualizations\gai_visualization_data_20250711_153240.json
📊 File size