# 🔬 Aavishkar.ai Expert System Notebook

<img src="https://github.com/astitvac/AI4Science/raw/main/assets/AA_Main_Banner.jpg" alt="Aavishkar.ai Banner" width="600"/>

### <span style="color:#6C5CE7;">AI for Science</span>
<small><i>Democratizing advanced AI capabilities for scientific research</i></small>

---

## 👋 Welcome to Aavishkar.ai Expert Systems!

<small>This notebook is part of the <b>Aavishkar.ai AI4Science</b> initiative, which develops LLM-based expert systems to enhance scientific research workflows. Our expert systems formalize scientific cognitive processes using Large Language Models, structured knowledge representations, and interactive interfaces.</small>

### 🧠 About This Expert System

<small>This notebook implements one of the five scientific cognitive archetypes developed by Aavishkar.ai:</small>

<small>
1. 📚 <b>Literature Synthesist</b>: Identifies patterns, contradictions, and knowledge gaps across research corpora<br>
2. 🧪 <b>Experimental Architect</b>: Translates abstract hypotheses into methodologically sound experimental designs<br>
3. 📊 <b>Analytical Navigator</b>: Constructs adaptive analytical pathways through complex datasets<br>
4. 📝 <b>Research Documentarian</b>: Structures and articulates scientific findings and methodologies<br>
5. 🔄 <b>Interdisciplinary Translator</b>: Establishes conceptual bridges between disparate knowledge domains
</small>

### 👥 Who Can Use This?

<small>
Aavishkar.ai tools are designed for all practitioners of hypothesis-driven science:<br>
• 🎓 Academic researchers and students<br>
• 🏢 Commercial/industrial researchers<br>
• 🏛️ Government scientists<br>
• 🔭 Citizen scientists<br>
• 🧩 Independent researchers
</small>

<small>No matter your technical background or institutional affiliation, this notebook provides accessible AI capabilities for rigorous scientific work.</small>

---

### ⚙️ Setup Instructions

<small>

**Google Colab**
* Have your API key for the LLM provider ready
* Click on "Runtime" in the menu
* Select "Run all" to install dependencies and initialize the system

**Local Environment**
* Ensure you have Python 3.8+ installed
* Install dependencies by running the installation cell below
* Set up your API keys as instructed in the initialization section

**Prerequisites**
* Python 3.8+
* API key for OpenAI or Google Vertex AI
* Basic familiarity with Jupyter notebooks

</small>

---

### 📜 License

<small>This project is licensed under the <b>MIT License</b></small>

<small>
<details>
<summary>View License Text</summary>
MIT License<br><br>
Copyright (c) 2023-2024 Aavishkar.ai<br><br>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:<br><br>
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
</details>
</small>

<small>

### 🔗 Connect with Aavishkar.ai
* 📦 **GitHub**: [github.com/astitvac/AI4Science](https://github.com/astitvac/AI4Science)
* 🌐 **Website**: [aavishkar.ai](https://aavishkar.ai)
* 💬 **Community**: [Discord](https://discord.gg/aavishkar)
* 🤝 **Contribute**: [Contribution Guidelines](https://github.com/astitvac/AI4Science/tree/main/Contributing)

</small>

## ⚙️ Installation
<small>

This cell installs all required dependencies for this expert system notebook. The installation process uses `uv` for faster package management when available, with automatic fallback to standard `pip`.

**Key components being installed:**
* LLM frameworks: LangChain and provider-specific libraries
* Data modeling: Pydantic
* UI: Gradio
* Core utilities: Data processing and visualization libraries

**Troubleshooting tips:**
* If you encounter errors, try running the cell again
* For persistent issues, check your Python version (3.8+ required)
* In Colab, restart the runtime if packages aren't recognized after installation

**Note**: Initial installation may take 1-2 minutes to complete. A confirmation message will appear when successful.

</small>
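
If the installation cell reports partial failures, a quick check like the one below can narrow the problem down. This is a minimal sketch using only the standard library; the helper names `python_ok` and `missing_modules` are illustrative and are not part of the notebook.

```python
import sys
from importlib.util import find_spec

def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(minimum)

def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Import names of the core packages that the installation cell verifies.
core = ["langchain", "pydantic", "gradio"]
print("Python version OK:", python_ok())
print("Missing modules:", missing_modules(core) or "none")
```

Any module listed as missing is a candidate for reinstalling on its own before rerunning the full cell.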

In [None]:
# Installation
import sys, os, subprocess, time
from IPython.display import HTML, display

# All required packages (no version constraints for better future-proofing)
# Note: uuid is part of the standard library, so it is not installed from PyPI
ALL_PACKAGES = {
    "core": "langchain pydantic python-dotenv",
    "providers": "langchain-openai langchain-google-vertexai langchain-community",
    "ui": "gradio",
    "data": "numpy pandas matplotlib plotly networkx",
    "documents": "pypdf PyPDF2 pillow",
    "vectors": "chromadb sentence-transformers scikit-learn",
    "parallel": "joblib"
}

# Environment detection
IN_COLAB = 'google.colab' in sys.modules

def show(msg, type="info"):
    """Display styled message"""
    colors = {"info": "#3a7bd5", "success": "#00c853", "warning": "#f57c00", "error": "#d50000"}
    icons = {"info": "ℹ️", "success": "✅", "warning": "⚠️", "error": "❌"}
    display(HTML(f"<div style='color:white; background:{colors[type]}; padding:5px; margin:2px 0; border-radius:3px'>{icons[type]} {msg}</div>"))

def install_packages():
    """Install all packages using uv when possible, with minimal messaging"""
    start = time.time()
    show("Starting installation...", "info")
    
    # Try to use uv for faster installation; fall back to pip if uv cannot be installed
    try:
        subprocess.run("pip install -q uv", shell=True, check=True, timeout=30)
        installer = "uv pip"
    except Exception:
        installer = "pip"
    
    # Install each category
    success_count = 0
    total_categories = len(ALL_PACKAGES)
    
    for category, packages in ALL_PACKAGES.items():
        try:
            # Install entire category at once for speed
            cmd = f"{installer} install -q {packages}"
            result = subprocess.run(cmd, shell=True, capture_output=True, timeout=120)
            
            if result.returncode == 0:
                success_count += 1
        except Exception:
            pass  # Silent failure, will be reflected in final success rate
    
    # Simple verification of core packages
    try:
        import langchain
        import pydantic
        import gradio
        verification = "with verification"
    except ImportError:
        verification = "with partial verification failures"
    
    # Single completion message with success rate
    elapsed = time.time() - start
    success_rate = int((success_count / total_categories) * 100)
    show(f"Installation completed in {elapsed:.1f}s ({success_rate}% success) {verification}", 
         "success" if success_rate > 80 else "warning")
    
    return success_rate > 80

# Run installation
install_packages()

## 🔧 Initialization

<small>This section configures the LLM provider, API keys, and core components needed for this expert system. The implementation follows a modular architecture that supports multiple AI providers and environments.</small>

## Purpose

<small>The initialization process:
1. **Sets up environment variables** including API keys
2. **Configures the LLM provider** with appropriate models and settings
3. **Initializes specialized capabilities** when needed (e.g., vision, embedding)
4. **Validates the environment** to ensure all requirements are met
</small>

## Configuration Options

<small>
You can customize the initialization by adjusting these parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| **Provider** | AI service to use (OpenAI, Google, etc.) | OpenAI |
| **Model** | Specific model name | Depends on provider |
| **Temperature** | Creativity level (0.0-1.0) | 0.7 |
| **Features** | Additional capabilities to enable | None |

**💡 Tip**: Lower temperature values (0.0-0.3) make outputs more deterministic, which helps reproducibility, although identical outputs across runs are still not guaranteed.
</small>

## Provider Support

<small>
This notebook supports these LLM providers:

- **OpenAI**: GPT-4, GPT-3.5-Turbo
- **Google**: Gemini Pro, PaLM
- **Anthropic**: Claude (optional)
- **Local**: Ollama with various models (optional)

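**Note**: The initialization cell in this notebook is wired for OpenAI (see its provider-specific block); using another provider requires adapting that block. Providers also differ in capabilities and pricing.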
</small>

## Setup Instructions

<small>
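**For Google Colab:**
1. Store your API key in Colab Secrets (the initialization cell looks up a secret named `openai_api_key`) or paste it into the form field
2. Select your model from the dropdown
3. Run the initialization cell

**For Local Environment:**
1. Create a `.env` file with your API keys, or set them as environment variables
2. Select your model
3. Run the initialization cell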

**API Key Variables:**
- OpenAI: `OPENAI_API_KEY`
- Google: `GOOGLE_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
</small>

## Troubleshooting

<small>
Common issues:
- **Authentication errors**: Check your API key is correctly set
- **Model unavailability**: Ensure you have access to the specified model
- **Import errors**: Run the installation cell first
- **Memory issues**: Select a smaller model or reduce context length

The initialization cell includes diagnostics that will help identify any configuration problems.
</small>
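
For authentication errors, it can help to confirm which key variables are visible to the kernel without printing the secrets themselves. The sketch below is an illustrative helper rather than part of the notebook (`mask` and `key_report` are hypothetical names); it only inspects environment variables and does not contact any provider.

```python
import os

# Variable names listed in the setup instructions above.
KEY_VARS = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY"]

def mask(secret, visible=4):
    """Hide all but the last few characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

def key_report(env=None):
    """Map each known key variable to a masked value, or None if unset."""
    env = os.environ if env is None else env
    return {name: (mask(env[name]) if env.get(name) else None) for name in KEY_VARS}

for name, value in key_report().items():
    print(f"{name}: {value or 'not set'}")
```

A variable reported as "not set" after editing a `.env` file usually means the file was not loaded in the current kernel session.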


In [None]:
# 🔧 LLM Setup
import os, sys
from IPython.display import Markdown, display
from typing import Dict, Any, Tuple, Optional

# Colab form fields for configuration
# @title LLM Configuration
api_key = "" # @param {type:"string"}
model = "gpt-4o" # @param ["gpt-4o", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"]
embedding_model = "text-embedding-3-small" # @param ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]
temperature = 0.7 # @param {type:"slider", min:0, max:1, step:0.1}
debug = False # @param {type:"boolean"}

# Environment detection
IN_COLAB = 'google.colab' in sys.modules

def show(msg, type="info"):
    """Display styled message"""
    if type == "debug" and not debug:
        return
    colors = {"success": "#00C853", "info": "#2196F3", "warning": "#FF9800", "error": "#F44336", "debug": "#9C27B0"}
    icons = {"success": "✅", "info": "ℹ️", "warning": "⚠️", "error": "❌", "debug": "🔍"}
    display(Markdown(f"<div style='padding:8px;border-radius:4px;background:{colors[type]};color:white'>{icons[type]} {msg}</div>"))

def get_api_key() -> Optional[str]:
    """Get API key from various possible sources"""
    # Check form input first
    key = api_key
    
    # Try Colab secret if empty and in Colab
    if not key and IN_COLAB:
        try:
            from google.colab import userdata
            key = userdata.get('openai_api_key')
            if key:
                show("API key loaded from Colab secret", "success")
        except Exception as e:
            show(f"Error accessing Colab secrets: {e}", "debug")
    
    # Try environment variable
    if not key:
        key = os.environ.get("OPENAI_API_KEY", "")
        if key:
            show("API key loaded from environment variable", "debug")
    
    # Try .env file
    if not key:
        try:
            from dotenv import load_dotenv
            load_dotenv()
            key = os.environ.get("OPENAI_API_KEY", "")
            if key:
                show("API key loaded from .env file", "debug")
        except Exception:
            pass  # python-dotenv missing or .env unreadable; fall through to the warning below
    
    # Final check and request if needed
    if not key:
        if IN_COLAB:
            show("""
            No API key found. Either:
            1. Add it in the form field above
            2. Set a Colab secret named 'openai_api_key'
            """, "warning")
        else:
            show("No API key found. Add it in the form field or set OPENAI_API_KEY environment variable", "warning")
        return None
        
    return key

# === PROVIDER-SPECIFIC: OPENAI ===
def initialize_models(api_key: str) -> Tuple[Optional[Any], Optional[Any]]:
    """Initialize OpenAI models with the provided API key"""
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    
    # Set environment variable for consistency
    os.environ["OPENAI_API_KEY"] = api_key
    
    try:
        llm = ChatOpenAI(
            model_name=model,
            temperature=temperature,
            openai_api_key=api_key
        )
        
        embeddings = OpenAIEmbeddings(
            model=embedding_model,
            openai_api_key=api_key
        )
        
        show(f"OpenAI initialized with {model} and {embedding_model}", "success")
        return llm, embeddings
        
    except Exception as e:
        show(f"Error initializing OpenAI: {e}", "error")
        return None, None
# === END PROVIDER-SPECIFIC ===

def initialize_llm() -> Tuple[Optional[Any], Optional[Any]]:
    """Main function to set up and initialize LLM"""
    show("Initializing LLM...", "info")
    
    # Get API key
    key = get_api_key()
    if not key:
        return None, None
    
    # Initialize models
    llm, embeddings = initialize_models(key)
    
    if llm and embeddings:
        show("Initialization complete! LLM and embeddings ready to use.", "success")
    
    return llm, embeddings

# Run initialization
llm, embeddings = initialize_llm()

In [None]:
# 🛠️ Core Utilities
"""
Core utilities for Aavishkar.ai expert systems.
Includes logging, caching, error handling, JSON parsing, and state management.
"""

import os, json, time, hashlib, functools, uuid
from typing import Dict, Any, Optional, Callable, Union, List, Type
from IPython.display import Markdown, display

# === GLOBAL SETTINGS ===
DEBUG_MODE = False
CACHE_ENABLED = True
CACHE_DIR = "./cache"
STATES_DIR = "./states"
os.makedirs(CACHE_DIR, exist_ok=True)
os.makedirs(STATES_DIR, exist_ok=True)

# === DISPLAY & ERROR HANDLING ===
def show(msg: str, level: str = "info") -> None:
    """Display formatted message with appropriate styling.
    
    Args:
        msg: Message to display
        level: Message level (success, info, warning, error, debug)
    """
    colors = {"success": "#00C853", "info": "#2196F3", "warning": "#FF9800", "error": "#F44336", "debug": "#9C27B0"}
    icons = {"success": "✅", "info": "ℹ️", "warning": "⚠️", "error": "❌", "debug": "🔍"}
    
    if level == "debug" and not DEBUG_MODE:
        return
        
    color = colors.get(level, colors["info"])
    icon = icons.get(level, icons["info"])
    display(Markdown(f"<div style='padding:6px;border-radius:4px;background:{color};color:white'>{icon} {msg}</div>"))

def retry(max_attempts: int = 3, delay: float = 1.0) -> Callable:
    """Decorator for retrying functions with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts:
                        raise
                    wait = delay * (2 ** (attempt - 1))
                    show(f"Attempt {attempt} failed: {str(e)}. Retrying in {wait:.1f}s...", "warning")
                    time.sleep(wait)
        return wrapper
    return decorator

# === CACHE SYSTEM ===
def cache_key(**kwargs) -> str:
    """Generate a cache key from input parameters."""
    # default=str keeps key generation from failing on non-JSON-serializable arguments
    serialized = json.dumps({k: v for k, v in kwargs.items() if v is not None}, sort_keys=True, default=str)
    return hashlib.md5(serialized.encode()).hexdigest()

def get_cache(key: str) -> Optional[Any]:
    """Get item from cache if available and not expired."""
    if not CACHE_ENABLED:
        return None
        
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if not os.path.exists(path):
        return None
        
    try:
        with open(path, 'r') as f:
            data = json.load(f)
            
        # Check if expired (default: 1 day)
        if time.time() - data.get("timestamp", 0) > 86400:
            return None
            
        return data.get("value")
    except:
        return None

def set_cache(key, value):
    """Store value in cache with current timestamp."""
    if not CACHE_ENABLED:
        return
        
    path = os.path.join(CACHE_DIR, f"{key}.json")
    try:
        # Handle Pydantic models by converting to dictionaries
        def serialize_pydantic(obj):
            if hasattr(obj, 'model_dump'):  # Pydantic v2 models use model_dump
                return obj.model_dump()
            elif hasattr(obj, 'dict'):      # Older Pydantic models use dict()
                return obj.dict()
            elif isinstance(obj, list):
                return [serialize_pydantic(item) for item in obj]
            elif isinstance(obj, dict):
                return {k: serialize_pydantic(v) for k, v in obj.items()}
            return obj
            
        serialized_value = serialize_pydantic(value)
        
        with open(path, 'w') as f:
            json.dump({"timestamp": time.time(), "value": serialized_value}, f)
            
    except Exception as e:
        show(f"Cache write error: {str(e)}", "debug")

def clear_cache(older_than: Optional[int] = None) -> int:
    """Clear cache entries, optionally only those older than specified seconds.
    
    Returns:
        Number of entries cleared
    """
    if not os.path.exists(CACHE_DIR):
        return 0
        
    count = 0
    for filename in os.listdir(CACHE_DIR):
        if not filename.endswith('.json'):
            continue
            
        path = os.path.join(CACHE_DIR, filename)
        
        if older_than:
            try:
                with open(path, 'r') as f:
                    data = json.load(f)
                if time.time() - data.get("timestamp", 0) <= older_than:
                    continue
            except:
                pass
        
        try:
            os.remove(path)
            count += 1
        except:
            pass
            
    return count

# === LLM & JSON HELPERS ===
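def call_llm_with_cache(llm, prompt: str, **kwargs) -> Any:
    """Call LLM with caching to avoid redundant API calls.

    Note: a cache hit returns the JSON-serialized form written by set_cache
    (e.g. a dict for Pydantic-based message objects), not the original object.
    """
    key = cache_key(prompt=prompt, **kwargs) if CACHE_ENABLED else None
    if key is not None:
        cached = get_cache(key)
        if cached is not None:  # Empty but valid cached values still count as hits
            show("Using cached response", "debug")
            return cached

    response = llm.invoke(prompt, **kwargs)

    if key is not None:
        set_cache(key, response)

    return response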

def parse_json_safely(text: str, default: Any = None) -> Any:
    """Extract and parse JSON from text with multiple fallback strategies."""
    import re
    
    # Try direct parsing first
    try:
        return json.loads(text)
    except:
        pass
    
    # Try to extract JSON blocks
    try:
        # Try code blocks with JSON
        if "```json" in text:
            json_block = text.split("```json")[1].split("```")[0].strip()
            return json.loads(json_block)
            
        # Try any code blocks
        if "```" in text:
            code_block = text.split("```")[1].split("```")[0].strip()
            if code_block.strip().startswith(("{", "[")):
                return json.loads(code_block)
        
        # Scan for the first decodable JSON object or array embedded in the text.
        # raw_decode handles nested structures, which a non-greedy regex cannot.
        decoder = json.JSONDecoder()
        for match in re.finditer(r'[\[{]', text):
            try:
                value, _ = decoder.raw_decode(text, match.start())
                return value
            except json.JSONDecodeError:
                continue
    except:
        pass
    
    return default

# === STATE MANAGEMENT SYSTEM ===
class ExpertSystemState:
    """State container for expert system notebooks.
    
    This class manages the entire system state, providing methods for
    initialization, updates, persistence, and interaction with the UI.
    
    Attributes:
        session_id: Unique identifier for this state session
        config: Configuration parameters
        data: Main data store
        history: Operation history for tracking changes
        status: Current status information
    """
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize a new state container.
        
        Args:
            config: Optional configuration parameters
        """
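        # Timestamp plus a short random suffix, so sessions created in the same second stay unique
        self.session_id = f"session_{int(time.time())}_{uuid.uuid4().hex[:6]}"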
        self.config = config or {}
        self.data = {}  # Main data store
        self.history = []  # Operation history
        self.status = {"initialized": True, "last_updated": time.time()}
        
    def update(self, key: str, value: Any, track_history: bool = True) -> Any:
        """Update a state element with change tracking.
        
        Args:
            key: Data key to update
            value: New value to store
            track_history: Whether to record this change in history
            
        Returns:
            The stored value
        """
        old_value = self.data.get(key)
        self.data[key] = value
        
        if track_history:
            self.history.append({
                "timestamp": time.time(),
                "operation": "update",
                "key": key,
                "old_value_type": type(old_value).__name__,
                "new_value_type": type(value).__name__
            })
        
        self.status["last_updated"] = time.time()
        return value
        
    def get(self, key: str, default: Any = None) -> Any:
        """Retrieve a state element.
        
        Args:
            key: Data key to retrieve
            default: Default value if key doesn't exist
            
        Returns:
            The stored value or default
        """
        return self.data.get(key, default)
        
    def save(self, path: Optional[str] = None) -> str:
        """Save state to disk.
        
        Args:
            path: Optional file path, defaults to a timestamped file
            
        Returns:
            Path where state was saved
        """
        path = path or os.path.join(STATES_DIR, f"{self.session_id}.json")
        
        # Prepare serializable data
        export_data = {
            "session_id": self.session_id,
            "timestamp": time.time(),
            "config": self.config,
            "data": {},
            "status": self.status
        }
        
        # Serialize data using our existing helper
        for key, value in self.data.items():
            if key not in ["history"]:  # Skip history to keep file size manageable
                try:
                    export_data["data"][key] = serialize_pydantic(value)
                except Exception as e:
                    show(f"Error serializing {key}: {str(e)}", "warning")
        
        # Save to file
        try:
            # dirname is "" for a bare filename; fall back to the current directory
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "w") as f:
                json.dump(export_data, f, indent=2)
            show(f"State saved to {path}", "success")
        except Exception as e:
            show(f"Error saving state: {str(e)}", "error")
            
        return path
        
    @classmethod
    def load(cls, path: str) -> 'ExpertSystemState':
        """Load state from disk.
        
        Args:
            path: File path to load from
            
        Returns:
            Reconstructed state object
        """
        try:
            with open(path, "r") as f:
                import_data = json.load(f)
            
            # Create new state
            state = cls(import_data.get("config"))
            state.session_id = import_data.get("session_id", state.session_id)
            state.status = import_data.get("status", state.status)
            
            # Load data
            for key, value in import_data.get("data", {}).items():
                state.data[key] = value
                
            show(f"State loaded from {path}", "success")
            return state
            
        except Exception as e:
            show(f"Error loading state: {str(e)}", "error")
            return cls()  # Return a new empty state on error
        
    def to_ui_dict(self) -> Dict[str, Any]:
        """Convert state to UI-friendly dictionary.
        
        Returns:
            Dictionary representation for UI
        """
        result = {
            "session_id": self.session_id,
            "status": self.status,
            "last_updated": self.status.get("last_updated")
        }
        
        # Add serializable data for UI
        for key, value in self.data.items():
            try:
                result[key] = serialize_pydantic(value)
            except:
                # Skip values that can't be serialized for UI
                pass
                
        return result
        
    @classmethod
    def from_ui_dict(cls, ui_dict: Dict[str, Any]) -> 'ExpertSystemState':
        """Reconstruct state from UI dictionary.
        
        Args:
            ui_dict: Dictionary from UI
            
        Returns:
            Reconstructed state object
        """
        state = cls()
        state.session_id = ui_dict.get("session_id", state.session_id)
        state.status = ui_dict.get("status", state.status)
        
        # Copy data (excluding metadata fields)
        for key, value in ui_dict.items():
            if key not in ["session_id", "status", "last_updated"]:
                state.data[key] = value
                
        return state

def serialize_pydantic(obj: Any) -> Any:
    """Serialize Pydantic models and other complex objects to JSON-compatible format.
    
    Args:
        obj: Object to serialize
        
    Returns:
        JSON-serializable representation
    """
    if hasattr(obj, 'model_dump'):  # Pydantic v2 models
        return obj.model_dump()
    elif hasattr(obj, 'dict'):      # Older Pydantic models
        return obj.dict()
    elif isinstance(obj, list):
        return [serialize_pydantic(item) for item in obj]
    elif isinstance(obj, dict):
        return {k: serialize_pydantic(v) for k, v in obj.items()}
    elif isinstance(obj, (str, int, float, bool, type(None))):
        return obj
    else:
        # Try to convert to dict if possible
        try:
            return dict(obj)
        except:
            # Fall back to string representation
            return str(obj)

def deserialize_pydantic(data: Any, model_class: Optional[Type] = None) -> Any:
    """Deserialize data into Pydantic models if model_class is provided.
    
    Args:
        data: Data to deserialize
        model_class: Optional Pydantic model class
        
    Returns:
        Deserialized object
    """
    if model_class is not None:
        if isinstance(data, list):
            return [model_class(**item) for item in data]
        elif isinstance(data, dict):
            return model_class(**data)
        else:
            return data
    
    return data

# Initialization
show("Core utilities and state management initialized", "success")
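
As a worked example of the `retry` decorator's backoff arithmetic: the wait before the next attempt is `delay * 2 ** (attempt - 1)`, and no wait follows the final attempt because the exception is re-raised. The helper `backoff_schedule` below is illustrative only; it reproduces that formula without calling anything.

```python
def backoff_schedule(max_attempts=3, delay=1.0):
    """Waits (in seconds) between attempts, mirroring retry()'s formula."""
    # Attempts 1..max_attempts-1 are followed by a wait; the last one re-raises.
    return [delay * (2 ** (attempt - 1)) for attempt in range(1, max_attempts)]

print(backoff_schedule())        # defaults: [1.0, 2.0]
print(backoff_schedule(5, 0.5))  # [0.5, 1.0, 2.0, 4.0]
```

With the defaults, a call that keeps failing therefore spends about three seconds waiting before the error surfaces.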

## 📚 Literature Synthesis: Multi-Document Analysis

## Purpose
Extracts, connects, and synthesizes knowledge across multiple scientific papers to identify patterns, agreements, contradictions, and research opportunities in the corpus as a whole.

## Core Functions

* **Multi-Document Information Extraction**
  * Identifies domain-specific elements (scientific claims, methodologies, contributions, research directions)
  * Tracks source provenance for all extracted information
  * Preserves context and maintains cross-document connections

* **Cross-Document Knowledge Integration**
  * Identifies common concepts and findings across papers
  * Maps relationships between elements from different documents
  * Detects supporting, contradicting, and extending relationships

* **Hierarchical Research Synthesis**
  * Generates category-specific syntheses (claims, methods, contributions, directions)
  * Creates comprehensive overview across all documents
  * Highlights consensus patterns and disagreements in the literature

* **Interactive Visualization**
  * Color-codes elements by document source
  * Visualizes cross-document relationships
  * Supports filtering by document, element type, and importance

* **Progressive Processing**
  * Processes documents incrementally with clear progress tracking
  * Allows adding new documents to existing analysis
  * Preserves session state with save/load capabilities
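
To make the idea of provenance-tracked, cross-document relationships concrete, here is a minimal sketch using plain dataclasses. The class names (`Element`, `Link`), the function `cross_document_links`, and the toy data are all made up for illustration; the notebook's actual models are Pydantic classes defined later and may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    element_id: str
    source_id: str   # provenance: which document the element came from
    kind: str        # e.g. "claim", "method", "contribution", "direction"
    text: str

@dataclass(frozen=True)
class Link:
    source: str      # element_id of the first element
    target: str      # element_id of the second element
    relation: str    # "supports", "contradicts", or "extends"

def cross_document_links(elements, links):
    """Group links by relation, keeping only those that span two documents."""
    doc_of = {e.element_id: e.source_id for e in elements}
    grouped = defaultdict(list)
    for link in links:
        if doc_of[link.source] != doc_of[link.target]:
            grouped[link.relation].append((link.source, link.target))
    return dict(grouped)

# Toy data, invented for this example.
elements = [
    Element("e1", "doc_a", "claim", "Method X improves accuracy."),
    Element("e2", "doc_b", "claim", "Method X shows no accuracy gain."),
    Element("e3", "doc_a", "method", "Method X uses contrastive pretraining."),
]
links = [Link("e1", "e2", "contradicts"), Link("e3", "e1", "supports")]
print(cross_document_links(elements, links))
```

Only the `e1`/`e2` pair is reported, because the `e3`/`e1` link connects two elements from the same document.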

## Input

* Multiple PDF uploads, plain text, DOIs, or arXiv IDs
* Supports processing of document collections of varying sizes
* Best results with complete papers containing clear section structure
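
Because several input kinds are accepted, some routing step has to decide how each one is handled. The sketch below shows one heuristic way to guess the kind of a text input; `guess_source_type` and both regular expressions are assumptions for illustration, not the notebook's implementation. The patterns are deliberately simple (for example, old-style arXiv IDs such as `hep-th/9901001` are not recognized).

```python
import re

# Simplified patterns; real DOIs and arXiv IDs have more variants.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_RE = re.compile(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)

def guess_source_type(value):
    """Guess a source type: 'doi', 'arxiv', 'url', 'pdf', or 'text'."""
    candidate = value.strip()
    if DOI_RE.match(candidate):
        return "doi"
    if ARXIV_RE.match(candidate):
        return "arxiv"
    if candidate.lower().startswith(("http://", "https://")):
        return "url"
    if candidate.lower().endswith(".pdf"):
        return "pdf"
    return "text"

for sample in ["10.1000/xyz123", "arXiv:2301.12345v2", "paper.pdf", "Some pasted abstract."]:
    print(sample, "->", guess_source_type(sample))
```

Anything that matches none of the patterns is treated as plain text, which is the safest default for pasted content.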

## Usage

1. Run setup cells (installation and initialization)
2. Upload multiple scientific papers to analyze
3. Monitor processing progress across documents
4. Explore extracted elements by category or document source
5. Review cross-document relationships and patterns
6. Examine the hierarchical synthesis of the entire document collection

**Note**: This multi-document analyzer builds on the original LitSynth system, adding cross-document analysis and synthesis capabilities.

---

*Implementation of the Literature Synthesist cognitive archetype from Aavishkar.ai*


## Data Models

## Purpose

<small>
This section defines the structured data representations that power our Literature Synthesis system. These Pydantic models perform two essential functions:

1. **Represent Knowledge**: Define how scientific concepts, relationships, and documents are structured
2. **Control System Behavior**: Configure how the system processes and analyzes content
</small>

## How It Works

<small>
Our system uses Pydantic models to ensure data validation and clear structure. Think of these models as "smart containers" that:

- Validate data to prevent errors
- Provide helpful error messages when something is wrong
- Include documentation for each field
- Support extensibility for specialized needs
</small>
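
The validation behavior described above can be seen in a small, self-contained sketch. `ExampleConcept` is a hypothetical model written for this illustration (it is not the notebook's `Concept` class), and the snippet assumes Pydantic is installed.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class ExampleConcept(BaseModel):
    """Hypothetical model for illustration only."""
    name: str = Field(min_length=1, description="Short label for the concept")
    importance: Literal["low", "medium", "high"] = "medium"
    confidence: float = Field(default=0.5, ge=0.0, le=1.0)

# Valid data passes through and is available as typed attributes.
ok = ExampleConcept(name="protein folding", importance="high", confidence=0.9)

# Invalid data raises a ValidationError that names the offending field.
try:
    ExampleConcept(name="protein folding", confidence=1.7)
    error_fields = []
except ValidationError as exc:
    error_fields = [err["loc"][0] for err in exc.errors()]
    print(exc)

print(ok.importance, error_fields)
```

The error message points at `confidence` and states the violated bound, which is what makes these models useful as a guard around LLM-generated output.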

## Core Models

<small>
Our implementation uses these key models:

| Model | Purpose |
|-------|---------|
| **LitSynthConfig** | Consolidated configuration for all system parameters |
| **Concept** | Scientific concepts extracted from literature |
| **Relationship** | Connections between scientific concepts |
| **ResearchGap** | Identified research gaps and opportunities |
| **LiteratureSynthesisOutput** | Complete analysis results container |

We've simplified the configuration into a single model (`LitSynthConfig`) to make customization easier. You can adjust parameters by modifying the `config` variable in the code cell.
</small>

## Customization

<small>
To customize the system behavior, simply modify the config variable:

```python
# Example: Increase sensitivity to detect more concepts
config.extraction_confidence = 0.6
config.max_concepts = 40

# Example: Focus only on high-importance concepts
config.min_concept_importance = "high"
```

This approach allows you to tune the system's behavior without changing the core models or implementation.
</small>


In [None]:
# 📋 Data Models
"""
This section defines the data structures for the LitSynth-Multidoc system.
These models support multi-document analysis with source tracking, cross-document
relationships, and hierarchical synthesis capabilities.

CUSTOMIZATION TIPS:
1. Adjust domain-specific fields to match your research area
2. Modify importance thresholds based on your analysis priorities
3. Extend relationship types for your specific scientific domain
4. Customize state persistence options for your environment
"""

from pydantic import BaseModel, Field, field_validator, ConfigDict
from typing import List, Dict, Optional, Literal, Any, Set, Union, TypeVar, Generic, Type
from datetime import datetime
import uuid
import os
import json
import pickle
import hashlib
import time
from IPython.display import Markdown, display
from pathlib import Path

def show(message, type="info"):
    """Display a styled message; unknown types fall back to the "info" style."""
    colors = {"success": "#00C853", "info": "#2196F3", "warning": "#FF9800", "error": "#F44336", "debug": "#9C27B0"}
    icons = {"success": "✅", "info": "ℹ️", "warning": "⚠️", "error": "❌", "debug": "🔍"}
    color = colors.get(type, colors["info"])
    icon = icons.get(type, icons["info"])
    display(Markdown(f"<div style='padding:8px;border-radius:4px;background:{color};color:white'>{icon} {message}</div>"))

# === ENVIRONMENT SETUP ===

# Create directories for cache and state
BASE_DIR = Path("LitSynthMulti_data")
CACHE_DIR = BASE_DIR / "cache"
STATES_DIR = BASE_DIR / "states"

# Create directories if they don't exist
CACHE_DIR.mkdir(parents=True, exist_ok=True)
STATES_DIR.mkdir(parents=True, exist_ok=True)

# === INPUT MODELS ===

class DocumentSource(BaseModel):
    """Input document with metadata and processing status.
    
    Represents a document in the analysis collection with tracking information
    for multi-document processing and synthesis.
    
    Attributes:
        source_id: Unique identifier for the document
        title: Document title (extracted or provided)
        authors: List of authors (when available)
        source_type: Type of document source (pdf, text, etc.)
        content: Document content or reference path
        metadata: Additional document information
        processed: Whether the document has been processed
    """
    source_id: str = Field(default_factory=lambda: f"doc_{uuid.uuid4().hex[:8]}")
    title: Optional[str] = None
    authors: List[str] = Field(default_factory=list)
    source_type: Literal["pdf", "text", "url", "doi", "arxiv"]
    content: str
    metadata: Dict[str, Any] = Field(default_factory=dict)
    processed: bool = False
    
    @field_validator('source_id')
    @classmethod
    def validate_source_id(cls, v):
        """Ensure source_id is properly formatted."""
        if not v or not isinstance(v, str) or len(v) < 5:
            return f"doc_{uuid.uuid4().hex[:8]}"
        return v
    
    def hash_content(self) -> str:
        """Generate a short hash of the content for deduplication.

        Only the first 1000 characters are hashed, so documents that share
        an identical opening are treated as duplicates.
        """
        if not self.content:
            return ""
        content_sample = self.content[:1000]
        return hashlib.md5(content_sample.encode('utf-8')).hexdigest()[:10]

# === KNOWLEDGE REPRESENTATION MODELS ===

class ScientificClaim(BaseModel):
    """Scientific claim extracted from literature.
    
    Represents a key scientific claim with source tracking for
    multi-document analysis and synthesis.
    
    Attributes:
        claim_id: Unique identifier for the claim
        claim_text: The textual content of the claim
        claim_type: Classification of the claim type
        importance: How important this claim is
        evidence: Supporting evidence for the claim
        source_documents: List of document IDs containing this claim
        confidence: Confidence score for extraction
    """
    claim_id: str = Field(default_factory=lambda: f"claim_{uuid.uuid4().hex[:8]}")
    claim_text: str
    claim_type: Literal["hypothesis", "finding", "assertion"] = "finding"
    importance: Literal["high", "medium", "low"] = "medium"
    evidence: Optional[str] = None
    source_documents: List[str] = Field(default_factory=list)
    confidence: float = Field(default=0.8, ge=0.0, le=1.0)
    
    model_config = ConfigDict(
        json_schema_extra={
            "example": {
                "claim_id": "claim_a1b2c3d4",
                "claim_text": "Increased temperature accelerates the reaction rate by a factor of 2.5",
                "claim_type": "finding",
                "importance": "high",
                "evidence": "Experimental results in Table 2 show rate constants at different temperatures",
                "source_documents": ["doc_12345678", "doc_23456789"],
                "confidence": 0.9
            }
        }
    )

class Methodology(BaseModel):
    """Research methodology used in the literature.
    
    Represents a research methodology with source tracking for
    multi-document analysis and synthesis.
    
    Attributes:
        method_id: Unique identifier for the methodology
        method_name: Name or title of the methodology
        description: Detailed description of the methodology
        context: Context in which the methodology was used
        limitations: Known limitations of the methodology
        source_documents: List of document IDs using this methodology
        confidence: Confidence score for extraction
    """
    method_id: str = Field(default_factory=lambda: f"method_{uuid.uuid4().hex[:8]}")
    method_name: str
    description: str
    context: Optional[str] = None
    limitations: Optional[str] = None
    source_documents: List[str] = Field(default_factory=list)
    confidence: float = Field(default=0.8, ge=0.0, le=1.0)
    
    model_config = ConfigDict(
        json_schema_extra={
            "example": {
                "method_id": "method_a1b2c3d4",
                "method_name": "CRISPR-Cas9 Gene Editing",
                "description": "A precise gene editing technique using Cas9 nuclease guided by RNA",
                "context": "Used to modify gene expression in mouse embryos",
                "limitations": "Potential off-target effects and delivery challenges",
                "source_documents": ["doc_12345678", "doc_87654321"],
                "confidence": 0.85
            }
        }
    )

class KeyContribution(BaseModel):
    """Key contribution or finding from the literature.
    
    Represents a significant research contribution with connections
    to claims and methods across documents.
    
    Attributes:
        contribution_id: Unique identifier for the contribution
        contribution_text: The textual content of the contribution
        contribution_type: Classification of the contribution type
        importance: How important this contribution is
        related_claims: List of related claim IDs
        related_methods: List of related methodology IDs
        source_documents: List of document IDs containing this contribution
        confidence: Confidence score for extraction
    """
    contribution_id: str = Field(default_factory=lambda: f"contrib_{uuid.uuid4().hex[:8]}")
    contribution_text: str
    contribution_type: Literal["theoretical", "empirical", "methodological", "practical"] = "empirical"
    importance: Literal["high", "medium", "low"] = "medium"
    related_claims: List[str] = Field(default_factory=list)
    related_methods: List[str] = Field(default_factory=list)
    source_documents: List[str] = Field(default_factory=list)
    confidence: float = Field(default=0.8, ge=0.0, le=1.0)

class ResearchDirection(BaseModel):
    """Future research direction identified in the literature.
    
    Represents a potential future research direction with connections
    to claims and contributions across documents.
    
    Attributes:
        direction_id: Unique identifier for the research direction
        direction_text: The textual content of the research direction
        rationale: Reasoning behind this research direction
        related_claims: List of related claim IDs
        related_contributions: List of related contribution IDs
        source_documents: List of document IDs suggesting this direction
        importance: How important this research direction is
        confidence: Confidence score for extraction
    """
    direction_id: str = Field(default_factory=lambda: f"direction_{uuid.uuid4().hex[:8]}")
    direction_text: str
    rationale: Optional[str] = None
    related_claims: List[str] = Field(default_factory=list)
    related_contributions: List[str] = Field(default_factory=list)
    source_documents: List[str] = Field(default_factory=list)
    importance: Literal["high", "medium", "low"] = "medium"
    confidence: float = Field(default=0.7, ge=0.0, le=1.0)

# === CROSS-DOCUMENT RELATIONSHIP MODEL ===

class CrossDocumentRelationship(BaseModel):
    """Relationship between elements across documents.
    
    Represents connections between knowledge elements from different
    documents, identifying patterns across the literature.
    
    Attributes:
        relationship_id: Unique identifier for the relationship
        source_element_id: ID of the source element
        source_element_type: Type of the source element
        target_element_id: ID of the target element
        target_element_type: Type of the target element
        relationship_type: Type of relationship between elements
        evidence: Supporting evidence for this relationship
        confidence: Confidence score for this relationship
    """
    relationship_id: str = Field(default_factory=lambda: f"rel_{uuid.uuid4().hex[:8]}")
    source_element_id: str
    source_element_type: Literal["claim", "methodology", "contribution", "direction"]
    target_element_id: str
    target_element_type: Literal["claim", "methodology", "contribution", "direction"]
    relationship_type: Literal["supports", "contradicts", "extends", "cites", "replicates"] = "supports"
    evidence: Optional[str] = None
    confidence: float = Field(default=0.7, ge=0.0, le=1.0)

# === SYNTHESIS MODELS ===

class CategorySynthesis(BaseModel):
    """Synthesized analysis for a specific category across documents.
    
    Represents the synthesis output for a specific knowledge category,
    summarizing patterns across multiple documents.
    
    Attributes:
        category: Category of elements being synthesized
        synthesis_text: The synthesized text summary
        element_count: Count of elements in this category
        document_count: Count of documents included in synthesis
        timestamp: When the synthesis was generated
    """
    category: Literal["claims", "methodologies", "contributions", "directions"]
    synthesis_text: str
    element_count: int
    document_count: int
    timestamp: datetime = Field(default_factory=datetime.now)

# === COMPLETE OUTPUT MODEL ===

class MultiDocSynthesisOutput(BaseModel):
    """Complete output from multi-document literature synthesis.
    
    Comprehensive container for all results from multi-document
    analysis, including source documents, extracted elements,
    cross-document relationships, and syntheses.
    
    Attributes:
        analysis_id: Unique identifier for this analysis
        documents: List of source documents
        claims: List of extracted scientific claims
        methodologies: List of extracted methodologies
        contributions: List of extracted key contributions
        research_directions: List of extracted research directions
        cross_document_relationships: List of relationships between elements
        category_syntheses: Dictionary of category-specific syntheses
        overall_synthesis: Overall synthesis across all documents
        timestamp: When the analysis was completed
    """
    analysis_id: str = Field(default_factory=lambda: f"analysis_{int(time.time())}")
    documents: List[DocumentSource] = Field(default_factory=list)
    claims: List[ScientificClaim] = Field(default_factory=list)
    methodologies: List[Methodology] = Field(default_factory=list)
    contributions: List[KeyContribution] = Field(default_factory=list)
    research_directions: List[ResearchDirection] = Field(default_factory=list)
    cross_document_relationships: List[CrossDocumentRelationship] = Field(default_factory=list)
    category_syntheses: Dict[str, CategorySynthesis] = Field(default_factory=dict)
    overall_synthesis: Optional[str] = None
    timestamp: datetime = Field(default_factory=datetime.now)
    
    def get_document_by_id(self, doc_id: str) -> Optional[DocumentSource]:
        """Helper method to retrieve a document by ID."""
        for doc in self.documents:
            if doc.source_id == doc_id:
                return doc
        return None
        
    def get_elements_for_document(self, doc_id: str, category: str) -> List[Any]:
        """Helper method to retrieve all elements of a category from a specific document."""
        if category == "claims":
            return [c for c in self.claims if doc_id in c.source_documents]
        elif category == "methodologies":
            return [m for m in self.methodologies if doc_id in m.source_documents]
        elif category == "contributions":
            return [c for c in self.contributions if doc_id in c.source_documents]
        elif category in ("directions", "research_directions"):
            # Accept both the synthesis category name and the state key
            return [d for d in self.research_directions if doc_id in d.source_documents]
        return []

# === CONFIGURATION MODEL ===

class LitSynthMultiConfig(BaseModel):
    """Configuration for the LitSynth-Multidoc system.
    
    Controls system behavior and processing parameters for
    multi-document analysis and synthesis.
    
    Attributes:
        text_chunk_size: Characters per text chunk
        text_chunk_overlap: Overlap between chunks
        min_importance: Minimum importance level to include
        extraction_confidence: Minimum extraction confidence
        max_claims: Maximum claims per document
        max_methodologies: Maximum methodologies per document
        max_contributions: Maximum contributions per document
        max_directions: Maximum research directions per document
        relationship_confidence: Minimum relationship confidence
        max_cross_relationships: Maximum cross-document relationships
        similarity_threshold: Threshold for element similarity detection
        parallel_processing: Whether to process documents in parallel
        scientific_domain: Scientific domain for specialized analysis
    """
    # Text Processing Parameters
    text_chunk_size: int = Field(
        default=4000, ge=500, le=8000, 
        description="Characters per text chunk"
    )
    text_chunk_overlap: int = Field(
        default=100, ge=50, le=1000,
        description="Overlap between chunks"
    )
    
    # Element Extraction Parameters
    min_importance: Literal["low", "medium", "high"] = Field(
        default="medium", 
        description="Minimum importance level to include"
    )
    extraction_confidence: float = Field(
        default=0.7, ge=0.0, le=1.0,
        description="Minimum extraction confidence"
    )
    
    # Category Limits
    max_claims: int = Field(
        default=20, ge=5, le=100,
        description="Maximum claims to extract per document"
    )
    max_methodologies: int = Field(
        default=10, ge=3, le=50,
        description="Maximum methodologies to extract per document"
    )
    max_contributions: int = Field(
        default=15, ge=3, le=50,
        description="Maximum contributions to extract per document"
    )
    max_directions: int = Field(
        default=10, ge=3, le=50,
        description="Maximum research directions to extract per document"
    )
    
    # Cross-Document Parameters
    relationship_confidence: float = Field(
        default=0.6, ge=0.0, le=1.0,
        description="Minimum relationship confidence"
    )
    max_cross_relationships: int = Field(
        default=50, ge=10, le=200,
        description="Maximum cross-document relationships"
    )
    similarity_threshold: float = Field(
        default=0.75, ge=0.5, le=0.95,
        description="Threshold for element similarity detection"
    )
    
    # Processing Parameters
    parallel_processing: bool = Field(
        default=False,
        description="Process documents in parallel when possible"
    )
    
    # Customization
    scientific_domain: Optional[str] = Field(
        default=None,
        description="Scientific domain for specialized analysis"
    )
    
    @field_validator('text_chunk_overlap')
    @classmethod
    def validate_overlap(cls, v, info):
        """Ensure overlap is less than chunk size."""
        if 'text_chunk_size' in info.data and v >= info.data['text_chunk_size']:
            raise ValueError("text_chunk_overlap must be less than text_chunk_size")
        return v

# === STATE MANAGEMENT CLASSES ===

# Helper functions for serialization and deserialization
def serialize_pydantic(obj):
    """Convert Pydantic models and other complex objects to JSON-compatible format."""
    if hasattr(obj, 'model_dump'):
        return obj.model_dump()
    elif isinstance(obj, datetime):
        return obj.isoformat()
    elif isinstance(obj, (list, tuple)):
        return [serialize_pydantic(item) for item in obj]
    elif isinstance(obj, dict):
        return {k: serialize_pydantic(v) for k, v in obj.items()}
    else:
        return obj

def deserialize_pydantic(data, model_class=None):
    """Reconstruct Pydantic models from data."""
    if model_class and data and isinstance(data, dict):
        return model_class.model_validate(data)
    elif isinstance(data, list):
        return [deserialize_pydantic(item, None) for item in data]
    elif isinstance(data, dict):
        return {k: deserialize_pydantic(v, None) for k, v in data.items()}
    else:
        return data

class ExpertSystemState(BaseModel):
    """Base state management class for expert systems.
    
    Provides core state management capabilities for expert systems,
    including state updates, persistence, and UI integration.
    
    Attributes:
        session_id: Unique identifier for this session
        config: Configuration parameters
        data: Main data container for the expert system
        history: History of state changes for tracking/undo
        status: Current status information
    """
    session_id: str = Field(default_factory=lambda: f"session_{time.strftime('%Y%m%d_%H%M%S')}")
    config: Dict[str, Any] = Field(default_factory=dict)
    data: Dict[str, Any] = Field(default_factory=dict)
    history: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)
    status: Dict[str, Any] = Field(default_factory=lambda: {"status": "initialized", "last_updated": datetime.now()})
    
    def update(self, key: str, value: Any, track_history: bool = True) -> None:
        """Update state with new data.
        
        Args:
            key: The state key to update
            value: The new value to store
            track_history: Whether to record this change in history
        """
        if track_history:
            # Save previous state in history
            if key in self.data:
                self.history.append({
                    "key": key,
                    "prev_value": self.data.get(key),
                    "timestamp": datetime.now()
                })
                # Limit history size
                if len(self.history) > 100:
                    self.history = self.history[-100:]
        
        # Update the state
        self.data[key] = value
        
        # Update status
        self.status.update({
            "status": "updated",
            "last_key_updated": key,
            "last_updated": datetime.now()
        })
    
    def get(self, key: str, default: Any = None) -> Any:
        """Get value from state.
        
        Args:
            key: The state key to retrieve
            default: Default value if key doesn't exist
            
        Returns:
            The value from state or default
        """
        return self.data.get(key, default)
    
    def save(self, filepath: Optional[str] = None) -> str:
        """Save state to file.
        
        Args:
            filepath: Optional custom filepath
            
        Returns:
            Path to the saved file
        """
        if not filepath:
            filepath = STATES_DIR / f"{self.session_id}.json"
        
        # Prepare state for serialization
        state_dict = serialize_pydantic(self.model_dump(exclude={"history"}))
        
        with open(filepath, 'w') as f:
            json.dump(state_dict, f, indent=2)
        
        self.status.update({
            "status": "saved",
            "save_location": str(filepath),
            "last_updated": datetime.now()
        })
        
        return str(filepath)
    
    @classmethod
    def load(cls, filepath: str) -> 'ExpertSystemState':
        """Load state from file.
        
        Args:
            filepath: Path to the saved state file
            
        Returns:
            Loaded state instance
        """
        with open(filepath, 'r') as f:
            state_dict = json.load(f)
        
        # Create new instance from saved data
        instance = cls.model_validate(state_dict)
        
        instance.status.update({
            "status": "loaded",
            "load_source": filepath,
            "last_updated": datetime.now()
        })
        
        return instance
    
    def to_ui_dict(self) -> Dict[str, Any]:
        """Convert state to UI-friendly dictionary.
        
        Returns:
            Dictionary representation for UI components
        """
        # Create UI-friendly representation for display
        return {
            "session_id": self.session_id,
            "status": self.status.get("status", "unknown"),
            "last_updated": self.status.get("last_updated", datetime.now()).isoformat(),
            "data_keys": list(self.data.keys()),
            "config_summary": self.config
        }
    
    @classmethod
    def from_ui_dict(cls, ui_dict: Dict[str, Any]) -> 'ExpertSystemState':
        """Create state from UI dictionary.
        
        Args:
            ui_dict: Dictionary from UI components
            
        Returns:
            New state instance
        """
        # This is a simplified implementation
        # In practice, you would map UI values to proper state structure
        return cls(
            session_id=ui_dict.get("session_id", f"session_{time.strftime('%Y%m%d_%H%M%S')}"),
            config=ui_dict.get("config", {}),
            data=ui_dict.get("data", {}),
            status={
                "status": "created_from_ui",
                "last_updated": datetime.now()
            }
        )

class LitSynthMultiState(ExpertSystemState):
    """State management for LitSynth-Multidoc system.
    
    Extends the base expert system state with document-specific
    state management for multi-document analysis workflows.
    
    This class provides methods for tracking document processing,
    managing extraction results, and handling synthesis generation.
    """
    def __init__(self, **data):
        """Initialize with default LitSynth-specific state."""
        super().__init__(**data)
        
        # Initialize default configuration if not provided
        if not self.config:
            self.config = LitSynthMultiConfig().model_dump()
        
        # Initialize list containers if not present
        list_models = {
            'documents': DocumentSource,
            'claims': ScientificClaim,
            'methodologies': Methodology,
            'contributions': KeyContribution,
            'research_directions': ResearchDirection,
            'cross_document_relationships': CrossDocumentRelationship,
        }
        for key, model_class in list_models.items():
            items = self.data.setdefault(key, [])
            # State loaded from JSON holds plain dicts; rebuild the models so
            # methods such as add_document() can rely on attribute access
            self.data[key] = [
                model_class.model_validate(item) if isinstance(item, dict) else item
                for item in items
            ]
        
        # Initialize synthesis containers if not present
        self.data.setdefault('category_syntheses', {})
        self.data.setdefault('overall_synthesis', None)
    
    def add_document(self, document: DocumentSource) -> bool:
        """Add a document to the collection.
        
        Args:
            document: The document to add
            
        Returns:
            True if document was added, False if duplicate
        """
        # Check for duplicates
        doc_hash = document.hash_content()
        for existing_doc in self.data['documents']:
            if existing_doc.hash_content() == doc_hash:
                show(f"Document '{document.title or 'Untitled'}' appears to be a duplicate", "warning")
                return False
        
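        # Add the document; build a new list so the history entry recorded
        # by update() keeps the previous list rather than the mutated one
        documents = self.data['documents'] + [document]
        self.update('documents', documents)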
        
        show(f"Added document: {document.title or 'Untitled'}", "success")
        return True
    
    def update_extraction_results(self, 
                                 document_id: str, 
                                 category: str, 
                                 results: List[Any]) -> None:
        """Update extraction results for a document and category.
        
        Args:
            document_id: ID of the document
            category: Category of extracted elements
            results: List of extraction results
        """
        if category not in ['claims', 'methodologies', 'contributions', 'research_directions']:
            show(f"Invalid category: {category}", "error")
            return
        
        # Get current elements
        current = self.data[category]
        
        # Remove existing elements for this document
        filtered = [item for item in current if document_id not in item.source_documents]
        
        # Add new results
        combined = filtered + results
        
        # Update state
        self.update(category, combined)
        
        show(f"Updated {len(results)} {category} for document {document_id}", "success")
    
    def mark_document_processed(self, document_id: str) -> None:
        """Mark a document as processed.
        
        Args:
            document_id: ID of the document to mark
        """
        documents = self.data['documents']
        for i, doc in enumerate(documents):
            if doc.source_id == document_id:
                doc.processed = True
                documents[i] = doc
                break
        
        self.update('documents', documents)
    
    def regenerate_cross_document_analysis(self) -> None:
        """Regenerate cross-document relationships and syntheses."""
        # This would trigger the cross-document analysis chain
        # For now we just update the status
        self.status.update({
            "status": "cross_document_analysis_needed",
            "last_updated": datetime.now()
        })
        
        show("Cross-document analysis needs to be regenerated", "info")
    
    def get_processed_document_count(self) -> int:
        """Get count of processed documents.
        
        Returns:
            Number of processed documents
        """
        return sum(1 for doc in self.data['documents'] if doc.processed)
    
    def get_progress_percentage(self) -> int:
        """Calculate overall processing progress.
        
        Returns:
            Progress percentage (0-100)
        """
        docs = self.data['documents']
        if not docs:
            return 0
        
        processed = self.get_processed_document_count()
        return int((processed / len(docs)) * 100)
    
    def to_output_model(self) -> MultiDocSynthesisOutput:
        """Convert state to MultiDocSynthesisOutput model.
        
        Returns:
            Complete analysis results in MultiDocSynthesisOutput format
        """
        # Helper to convert lists of dictionaries to lists of models
        def convert_list(items, model_class):
            return [
                item if isinstance(item, model_class) 
                else model_class.model_validate(item)
                for item in items
            ]
        
        # Convert category syntheses
        category_syntheses = {}
        for cat, synth in self.data.get('category_syntheses', {}).items():
            if isinstance(synth, CategorySynthesis):
                category_syntheses[cat] = synth
            elif isinstance(synth, dict):
                category_syntheses[cat] = CategorySynthesis.model_validate(synth)
        
        # Create output model
        return MultiDocSynthesisOutput(
            analysis_id=self.session_id,
            documents=convert_list(self.data.get('documents', []), DocumentSource),
            claims=convert_list(self.data.get('claims', []), ScientificClaim),
            methodologies=convert_list(self.data.get('methodologies', []), Methodology),
            contributions=convert_list(self.data.get('contributions', []), KeyContribution),
            research_directions=convert_list(self.data.get('research_directions', []), ResearchDirection),
            cross_document_relationships=convert_list(
                self.data.get('cross_document_relationships', []), 
                CrossDocumentRelationship
            ),
            category_syntheses=category_syntheses,
            overall_synthesis=self.data.get('overall_synthesis'),
            timestamp=datetime.now()
        )

# === INITIALIZE STATE ===

# Create default configuration
config = LitSynthMultiConfig()

# Create initial state container
state = LitSynthMultiState(config=config.model_dump())

# Show confirmation
show("Data models and state management initialized successfully", "success")

## Core Functions

<small>
This section contains the heart of our Literature Synthesis system - the functions that analyze documents, extract key information, and generate insights.

### What's Included

1. **Prompt Library**: The instructions we give to the AI model
2. **Function Definitions**: The code that processes documents and manages the analysis

### How to Customize

You can easily modify the system's behavior by:

- **Changing prompts**: Edit the instructions to focus on specific types of information
- **Adjusting parameters**: Fine-tune the analysis by modifying the `config` settings

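No coding knowledge is required: simply edit the text of the prompts in the first code cell below, leaving the curly-brace placeholders (such as `{text}` and `{document_id}`) and the JSON output format unchanged.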
</small>
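When editing prompts, the curly braces matter. The sketch below assumes the templates are filled with `str.format`-style substitution, which the doubled braces in the prompt library suggest; the notebook's actual formatting and parsing code is not shown in this section, so `TEMPLATE`, `fill_prompt`, and `parse_json_array` are hypothetical names. Single braces mark placeholders, while doubled braces produce the literal braces of the JSON example.

```python
import json
import re

# Shortened, hypothetical template in the same style as the prompt library
TEMPLATE = """
TEXT:
{text}

DOCUMENT ID: {document_id}

Return ONLY a JSON array:
[
  {{
    "claim_text": "Clear statement of the scientific claim",
    "source_documents": ["{document_id}"]
  }}
]
"""

def fill_prompt(template: str, text: str, document_id: str) -> str:
    """Substitute placeholders; '{{' and '}}' become literal braces."""
    return template.format(text=text, document_id=document_id)

def parse_json_array(reply: str) -> list:
    """Parse the span from the first '[' to the last ']' as a JSON array."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(0))

prompt = fill_prompt(TEMPLATE, "Heating doubled the yield.", "doc_1a2b3c4d")

# Simulated model reply wrapped in a code fence (not real model output)
fence = "`" * 3
reply = (
    fence + "json\n"
    '[{"claim_text": "Heating doubled the yield", '
    '"source_documents": ["doc_1a2b3c4d"]}]\n' + fence
)
claims = parse_json_array(reply)
```

Removing a placeholder or un-doubling a brace in the JSON example would make this kind of substitution fail or leave the document text out of the prompt, which is why the output format instructions are best left intact.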


In [None]:
# Prompts Library
"""
This section contains all prompts for the LitSynth-Multidoc system.
These prompts are optimized for scientific paper analysis across multiple documents,
with specific attention to extracting structured knowledge and cross-document synthesis.

CUSTOMIZATION TIPS:
1. Keep the output format instructions intact to ensure proper parsing
2. Add domain-specific terminology or examples for your research area
3. Emphasize particular aspects relevant to your scientific domain
4. Update section references to match conventions in your field
"""

PROMPTS = {
    # === CLAIM EXTRACTION ===
    # Purpose: Extract scientific claims with provenance tracking
    "claim_extraction": """
    You are a scientific claim extraction specialist. Extract key scientific claims from the following text segment of a scientific paper.
    
    TEXT:
    {text}
    
    DOCUMENT ID: {document_id}
    
    INSTRUCTIONS:
    1. Focus on extracting clear scientific claims, findings, and assertions
    2. Pay special attention to sentences containing empirical findings with statistical significance
    3. Look for claims in the Abstract, Results, and Discussion sections
    4. Classify each claim as "hypothesis", "finding", or "assertion"
    5. Rate importance based on: centrality to the paper's thesis, statistical significance, and novelty
    6. Extract supporting evidence when available (statistical results, experiment outcomes, citations)
    7. Include the document_id in the source_documents list for provenance tracking
    
    Return ONLY a JSON array of claims with this structure:
    ```json
    [
      {{
        "claim_id": "claim_12345",
        "claim_text": "Clear statement of the scientific claim",
        "claim_type": "hypothesis|finding|assertion",
        "importance": "high|medium|low",
        "evidence": "Supporting evidence from text (optional)",
        "source_documents": ["{document_id}"],
        "confidence": 0.8
      }}
    ]
    ```
    
    CONFIG PARAMETERS:
    {config}
    """,
    
    # === METHODOLOGY EXTRACTION ===
    # Purpose: Extract research methodologies with detailed information
    "methodology_extraction": """
    You are a research methodology specialist. Extract key research methodologies from the following text segment of a scientific paper.
    
    TEXT:
    {text}
    
    DOCUMENT ID: {document_id}
    
    INSTRUCTIONS:
    1. Focus on the Methods/Methodology section, but also check Introduction for methodological approaches
    2. Extract complete methodology descriptions including experimental designs, analytical techniques, and procedures
    3. Identify methodological limitations when mentioned
    4. Note the context in which each methodology was applied
    5. Include the document_id in the source_documents list for provenance tracking
    
    Return ONLY a JSON array of methodologies with this structure:
    ```json
    [
      {{
        "method_id": "method_12345",
        "method_name": "Name or title of methodology",
        "description": "Detailed description of the methodology",
        "context": "Context in which the methodology was applied (optional)",
        "limitations": "Known limitations of the methodology (optional)",
        "source_documents": ["{document_id}"],
        "confidence": 0.8
      }}
    ]
    ```
    
    CONFIG PARAMETERS:
    {config}
    """,
    
    # === CONTRIBUTION EXTRACTION ===
    # Purpose: Extract key contributions and findings
    "contribution_extraction": """
    You are a scientific contribution analyst. Extract key contributions from the following text segment of a scientific paper.
    
    TEXT:
    {text}
    
    DOCUMENT ID: {document_id}
    
    INSTRUCTIONS:
    1. Focus on significant contributions, innovations, and findings
    2. Look primarily in Abstract, Introduction, and Conclusion sections
    3. Classify each contribution as "theoretical", "empirical", "methodological", or "practical"
    4. Rate importance based on novelty, impact, and emphasis in the text
    5. Note connections to research claims and methods when evident
    6. Include the document_id in the source_documents list for provenance tracking
    
    Return ONLY a JSON array of contributions with this structure:
    ```json
    [
      {{
        "contribution_id": "contrib_12345",
        "contribution_text": "Clear statement of the contribution",
        "contribution_type": "theoretical|empirical|methodological|practical",
        "importance": "high|medium|low",
        "related_claims": [],
        "related_methods": [],
        "source_documents": ["{document_id}"],
        "confidence": 0.8
      }}
    ]
    ```
    
    CONFIG PARAMETERS:
    {config}
    """,
    
    # === RESEARCH DIRECTION EXTRACTION ===
    # Purpose: Extract future research directions
    "direction_extraction": """
    You are a research direction analyst. Extract future research directions from the following text segment of a scientific paper.
    
    TEXT:
    {text}
    
    DOCUMENT ID: {document_id}
    
    INSTRUCTIONS:
    1. Focus on explicit suggestions for future research in Discussion and Conclusion sections
    2. Look for phrases like "future research should", "further studies are needed"
    3. Extract implicit research directions from identified limitations and knowledge gaps
    4. Include rationale or justification for the research direction when available
    5. Note connections to claims or contributions that the direction builds upon
    6. Rate importance based on emphasis, specificity, and potential impact
    7. Include the document_id in the source_documents list for provenance tracking
    
    Return ONLY a JSON array of research directions with this structure:
    ```json
    [
      {{
        "direction_id": "direction_12345",
        "direction_text": "Clear statement of the research direction",
        "rationale": "Reasoning behind this direction (optional)",
        "related_claims": [],
        "related_contributions": [],
        "source_documents": ["{document_id}"],
        "importance": "high|medium|low",
        "confidence": 0.7
      }}
    ]
    ```
    
    CONFIG PARAMETERS:
    {config}
    """,
    
    # === CROSS-DOCUMENT RELATIONSHIP IDENTIFICATION ===
    # Purpose: Identify relationships between elements from different documents
    "cross_document_relationship": """
    You are a scientific relationship analyst. Identify meaningful relationships between the following scientific elements from potentially different documents.
    
    ELEMENTS:
    {elements}
    
    INSTRUCTIONS:
    1. Analyze the provided elements and identify significant relationships between them
    2. Focus on substantive relationships (supports, contradicts, extends, etc.)
    3. Only create relationships that are justified by the element content
    4. Prioritize relationships between elements from different documents (check the Documents field)
    5. Generate a unique relationship_id for each relationship
    6. Provide evidence or justification for each identified relationship
    7. Assign a confidence score based on the strength of the relationship (0.0-1.0)
    
    Return ONLY a JSON array of relationships with this structure:
    ```json
    [
      {{
        "relationship_id": "rel_[unique_id]",
        "source_element_id": "ID of source element",
        "source_element_type": "claim|methodology|contribution|direction",
        "target_element_id": "ID of target element",
        "target_element_type": "claim|methodology|contribution|direction",
        "relationship_type": "supports|contradicts|extends|cites|replicates",
        "evidence": "Justification for this relationship",
        "confidence": 0.8
      }}
    ]
    ```
    """,
    
    # === CLAIMS SYNTHESIS ===
    # Purpose: Synthesize claims across multiple documents
    "claims_synthesis": """
    You are a scientific claims synthesist. Create a comprehensive synthesis of scientific claims across multiple documents.
    
    CLAIMS:
    {items}
    
    DOCUMENT COUNT: {document_count}
    
    INSTRUCTIONS:
    1. Synthesize the key patterns, themes, and findings across all claims
    2. Identify areas of consensus where multiple documents make similar claims
    3. Highlight contradictions or disagreements between claims from different documents
    4. Note the progression of knowledge or evolution of claims if time differences are apparent
    5. Evaluate the strength of evidence across claims, noting where evidence is robust or lacking
    6. Consider the importance ratings in determining emphasis
    7. Use specific references to documents (as cited in source_documents)
    
    Format your response as a well-organized synthesis of the claims across {document_count} documents.
    Include sections for: 
    - Major Consensus Findings
    - Areas of Disagreement or Contradiction
    - Emerging Trends
    - Strength of Evidence Analysis
    """,
    
    # === METHODOLOGIES SYNTHESIS ===
    # Purpose: Synthesize methodologies across multiple documents
    "methodologies_synthesis": """
    You are a research methodology synthesist. Create a comprehensive synthesis of research methodologies across multiple documents.
    
    METHODOLOGIES:
    {items}
    
    DOCUMENT COUNT: {document_count}
    
    INSTRUCTIONS:
    1. Identify the predominant methodological approaches across the document set
    2. Categorize methodologies into broader methodological paradigms or techniques
    3. Compare and contrast how similar methodologies are applied across different documents
    4. Highlight methodological innovations or unique approaches
    5. Analyze common limitations or challenges across methodologies
    6. Identify methodological trends or evolutions if apparent
    7. Assess the appropriateness and rigor of methodologies for their research contexts
    8. Use specific references to documents (as cited in source_documents)
    
    Format your response as a well-organized synthesis of methodologies across {document_count} documents.
    Include sections for:
    - Predominant Methodological Approaches
    - Methodological Variations and Innovations
    - Common Limitations and Challenges
    - Methodological Rigor Assessment
    """,
    
    # === CONTRIBUTIONS SYNTHESIS ===
    # Purpose: Synthesize contributions across multiple documents
    "contributions_synthesis": """
    You are a scientific contribution synthesist. Create a comprehensive synthesis of research contributions across multiple documents.
    
    CONTRIBUTIONS:
    {items}
    
    DOCUMENT COUNT: {document_count}
    
    INSTRUCTIONS:
    1. Identify the major theoretical, empirical, methodological, and practical contributions
    2. Map how contributions from different documents build upon or complement each other
    3. Highlight particularly novel or high-impact contributions
    4. Identify patterns in how theoretical and empirical contributions relate
    5. Assess the cumulative impact of the contributions on the research field
    6. Consider how practical and theoretical contributions align or diverge
    7. Note any progression or evolution of contributions if time differences are apparent
    8. Use specific references to documents (as cited in source_documents)
    
    Format your response as a well-organized synthesis of contributions across {document_count} documents.
    Include sections for:
    - Major Theoretical Advances
    - Significant Empirical Findings
    - Methodological Innovations
    - Practical Applications and Implications
    - Cumulative Impact Assessment
    """,
    
    # === DIRECTIONS SYNTHESIS ===
    # Purpose: Synthesize research directions across multiple documents
    "directions_synthesis": """
    You are a research direction synthesist. Create a comprehensive synthesis of future research directions across multiple documents.
    
    RESEARCH DIRECTIONS:
    {items}
    
    DOCUMENT COUNT: {document_count}
    
    INSTRUCTIONS:
    1. Identify common research directions suggested across multiple documents
    2. Group related directions into coherent research streams or themes
    3. Prioritize directions based on frequency of mention and importance ratings
    4. Analyze how directions from different documents complement or extend each other
    5. Identify potential integrated research agendas that combine directions from multiple documents
    6. Highlight directions that address significant gaps identified across the document set
    7. Consider how theoretical and practical directions relate to each other
    8. Use specific references to documents (as cited in source_documents)
    
    Format your response as a well-organized synthesis of research directions across {document_count} documents.
    Include sections for:
    - Priority Research Streams
    - Cross-Cutting Research Opportunities
    - Theoretical Development Needs
    - Practical and Applied Research Directions
    - Integrated Research Agenda Recommendations
    """,
    
    # === MULTI-DOCUMENT SYNTHESIS ===
    # Purpose: Generate comprehensive synthesis across all documents and categories
    "multi_document_synthesis": """
    You are a scientific literature synthesist. Create a comprehensive synthesis of research across multiple scientific documents.
    
    DOCUMENT COUNT: {document_count}
    DOCUMENT TITLES: {document_titles}
    
    CLAIMS SYNTHESIS:
    {claims_synthesis}
    
    METHODOLOGIES SYNTHESIS:
    {methods_synthesis}
    
    CONTRIBUTIONS SYNTHESIS:
    {contributions_synthesis}
    
    RESEARCH DIRECTIONS SYNTHESIS:
    {directions_synthesis}
    
    RELATIONSHIPS IDENTIFIED: {relationship_count}
    RELATIONSHIP INFORMATION:
    {relationship_info}
    
    INSTRUCTIONS:
    1. Craft a comprehensive, integrated synthesis across all {document_count} documents
    2. Begin with an executive summary highlighting the most significant findings
    3. Present a conceptual framework that organizes the key insights across documents
    4. Analyze how claims, methods, and contributions relate to each other across the literature
    5. Highlight areas of consensus and well-established knowledge
    6. Identify knowledge controversies, contradictions, or competing perspectives
    7. Map the evolution or progression of research if temporal patterns are evident
    8. Articulate the most promising integrated research agenda based on identified directions
    9. Provide specific citations to documents when discussing findings (use document titles)
    10. Consider both theoretical implications and practical applications
    
    Format your response as a comprehensive research synthesis with the following sections:
    1. Executive Summary
    2. Conceptual Framework
    3. Knowledge Consensus
    4. Open Questions and Controversies
    5. Methodological Assessment
    6. Significant Contributions
    7. Integrated Research Agenda
    8. Theoretical and Practical Implications
    9. Conclusion
    
    Use references to specific documents throughout, and focus on cross-document patterns and insights.
    """
}

# Show confirmation
show("Prompt library initialized with specialized scientific paper analysis prompts", "success")
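
The extraction prompts above mix two kinds of braces: single braces such as `{text}` and `{document_id}` are placeholders, while the doubled braces around each JSON example are escapes that become literal braces once the template is filled. `ChatPromptTemplate.from_template` uses Python format-string syntax by default, so plain `str.format` is enough to illustrate the behaviour. The sketch below is only an illustration: `DEMO_PROMPT` and `render_demo` are hypothetical names, and the template is a shortened stand-in rather than an entry from `PROMPTS`.

```python
# Hypothetical, shortened stand-in for an extraction prompt (not part of PROMPTS)
DEMO_PROMPT = """
DOCUMENT ID: {document_id}

Return ONLY a JSON array of claims with this structure:
[
  {{
    "claim_id": "claim_12345",
    "source_documents": ["{document_id}"]
  }}
]
"""


def render_demo(document_id: str) -> str:
    """Fill the template using Python format-string rules.

    Single braces are substituted; doubled braces collapse to literal braces.
    """
    return DEMO_PROMPT.format(document_id=document_id)


rendered = render_demo("doc_001")
print(rendered)
```

When customizing a prompt, any new literal brace in a JSON example needs the same doubling; otherwise the formatter treats it as a placeholder and fails on the missing variable.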

In [None]:
# Core functions
"""
This section contains the core functionality for the LitSynth-Multidoc system.
Each function is designed to support multi-document analysis with a focus on
efficiency and clean architecture.

This implementation uses a Retrieval-Augmented Generation (RAG) approach for
more efficient and contextually aware extraction of elements from documents.
"""

import json, re, os, time, hashlib, uuid
from typing import List, Dict, Any, Optional, Type, Union, Tuple, Callable, Literal
from pathlib import Path
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableLambda
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document as LangchainDocument
from datetime import datetime

# ===== DOCUMENT PROCESSING =====

def load_document(source_type: str, content: str) -> Dict[str, Any]:
    """Load document from various sources (text or PDF)"""
    if source_type == "text":
        return {"text": content, "metadata": {"source_type": "text", "length": len(content)}}
    elif source_type == "pdf":
        try:
            from pypdf import PdfReader
            reader = PdfReader(content)
            # Guard against pages that yield no extractable text
            text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
            return {"text": text, "metadata": {"source_type": "pdf", "filename": os.path.basename(content), "pages": len(reader.pages)}}
        except Exception as e:
            return {"text": "", "metadata": {"error": str(e)}}
    else:
        return {"text": "", "metadata": {"error": "Unsupported source type"}}

def process_doc(source_type: str, content: str, doc_id: str, config: Union[Dict, Any]) -> Dict[str, Any]:
    """Process document: load, chunk, and create vector store for efficient retrieval
    
    This function combines document loading and chunking into a unified process
    that prepares the document for RAG-based extraction.
    """
    # Load document
    doc_data = load_document(source_type, content)
    text = doc_data.get("text", "")
    
    if not text:
        return {"error": "Failed to extract text", "vector_store": None, "metadata": doc_data.get("metadata", {})}
    
    # Get config values
    if isinstance(config, int):
        # A bare integer is treated as the chunk size itself
        chunk_size, overlap = config, 200
    else:
        # get_config_value supports both dict and object configs
        chunk_size = get_config_value(config, 'text_chunk_size', 2000)
        overlap = get_config_value(config, 'text_chunk_overlap', 200)
    
    # Safety bounds for config values
    chunk_size = min(max(chunk_size, 1000), 4000)
    overlap = min(overlap, chunk_size // 4)
    
    # Create a text splitter that respects scientific document structure
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n### ", "\n# ", "\n\n", "\n", ". ", "! ", "? "],
        length_function=len
    )
    
    # Process document with metadata
    langchain_docs = []
    
    # Try to identify scientific sections first
    section_headers = [
        r'\n+\s*ABSTRACT\s*\n+', r'\n+\s*INTRODUCTION\s*\n+', r'\n+\s*METHODS?\s*\n+',
        r'\n+\s*RESULTS\s*\n+', r'\n+\s*DISCUSSION\s*\n+', r'\n+\s*CONCLUSION\s*\n+',
        r'\n+\s*REFERENCES\s*\n+'
    ]
    
    # Attempt to find section boundaries
    section_splits = [0]
    for pattern in section_headers:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            section_splits.append(match.start())
    section_splits.append(len(text))
    section_splits = sorted(set(section_splits))
    
    # Extract sections with metadata
    if len(section_splits) > 2:
        # Document has identifiable sections
        for i in range(len(section_splits) - 1):
            start, end = section_splits[i], section_splits[i+1]
            if end - start > 100:  # Avoid tiny sections
                raw_section = text[start:end]
                
                # Identify section type. Match before stripping: the header
                # patterns expect the leading newline(s) to still be present.
                section_type = "unknown"
                for pattern in section_headers:
                    header_match = re.match(pattern, raw_section, re.IGNORECASE)
                    if header_match:
                        section_type = header_match.group(0).strip()
                        # Drop the header itself from the section body
                        raw_section = raw_section[header_match.end():]
                        break
                section_text = raw_section.strip()
                
                # Split section into chunks
                chunks = splitter.split_text(section_text)
                
                # Create LangChain documents with metadata
                for j, chunk in enumerate(chunks):
                    langchain_docs.append(
                        LangchainDocument(
                            page_content=chunk,
                            metadata={
                                "doc_id": doc_id,
                                "section": section_type,
                                "chunk_id": f"{doc_id}_chunk_{i}_{j}",
                                "section_idx": i,
                                "chunk_idx": j
                            }
                        )
                    )
    else:
        # No clear sections found, process as plain text
        chunks = splitter.split_text(text)
        for j, chunk in enumerate(chunks):
            langchain_docs.append(
                LangchainDocument(
                    page_content=chunk,
                    metadata={
                        "doc_id": doc_id,
                        "section": "unknown",
                        "chunk_id": f"{doc_id}_chunk_{j}",
                        "chunk_idx": j
                    }
                )
            )
    
    # Create vector store with embeddings
    try:
        if "embeddings" in globals() and embeddings is not None:
            vector_store = FAISS.from_documents(langchain_docs, embeddings)
            show(f"Vector store created with {len(langchain_docs)} chunks", "success")
        else:
            show("No embedding model available, using text-based retrieval", "warning")
            vector_store = None
    except Exception as e:
        show(f"Error creating vector store: {str(e)}", "error")
        vector_store = None
    
    return {
        "text": text,
        "vector_store": vector_store,
        "chunks": langchain_docs,
        "metadata": doc_data.get("metadata", {})
    }

# ===== LLM INTEGRATION =====

def create_chain(prompt_name: str, output_model=None):
    """Create an LCEL chain for a specific LLM task"""
    # Guard against missing prompts
    if prompt_name not in PROMPTS:
        show(f"Prompt '{prompt_name}' not found in prompt library", "error")
        return None
        
    template = ChatPromptTemplate.from_template(PROMPTS[prompt_name])
    
    def parse_output(response):
        """Parse LLM response with multi-strategy approach"""
        content = response.content if hasattr(response, 'content') else response
        
        # For text output (synthesis), return directly
        if not output_model:
            return content
        
        # For structured output, try multiple parsing strategies in order
        text_content = content if isinstance(content, str) else str(content)
        candidates = []
        
        # 1. Code block extraction (```json or ```)
        if "```json" in text_content:
            candidates.append(text_content.split("```json", 1)[1].split("```", 1)[0].strip())
        elif "```" in text_content:
            candidates.append(text_content.split("```", 1)[1].split("```", 1)[0].strip())
        
        # 2. Direct JSON parsing
        candidates.append(text_content.strip())
        
        # 3. Fallback: outermost bracketed span. The match is greedy so that nested
        #    arrays such as "source_documents": [...] stay inside it.
        bracket_match = re.search(r'\[.*\]', text_content, re.DOTALL)
        if bracket_match:
            candidates.append(bracket_match.group(0))
        
        last_error = None
        for candidate in candidates:
            try:
                data = json.loads(candidate)
                return [output_model(**item) for item in data]
            except Exception as e:
                last_error = e
        
        show(f"All parsing methods failed: {str(last_error)}", "debug")
        return []
    
    # Create and return the chain
    if "llm" in globals() and llm is not None:
        return template | llm | RunnableLambda(parse_output)
    else:
        return RunnableLambda(lambda _: [] if output_model else "Placeholder output (no LLM configured)")

def cached_run(chain, inputs: Dict, key_prefix: str = ""):
    """Run chain with caching to minimize API calls"""
    if not chain: 
        return [] if 'text' not in key_prefix else "No chain available"
    
    # Map key_prefix to the appropriate model class
    model_map = {
        "claims": ScientificClaim,
        "methodologies": Methodology,
        "contributions": KeyContribution,
        "directions": ResearchDirection,
        "cross_relationships": CrossDocumentRelationship
    }
    output_model = model_map.get(key_prefix)
    
    # Use the existing caching functions from Core Utilities
    if 'CACHE_ENABLED' in globals() and CACHE_ENABLED:
        # Prepare inputs for caching
        cache_inputs = {}
        for k, v in inputs.items():
            if k == 'text' and isinstance(v, str) and len(v) > 500:
                # Hash the full text so inputs that share a 500-char prefix get distinct keys
                cache_inputs[k] = hashlib.md5(v.encode()).hexdigest()
            elif isinstance(v, (dict, list)) and len(str(v)) > 500:
                # Handle large collections by hashing
                cache_inputs[k] = hashlib.md5(str(v).encode()).hexdigest()
            else:
                cache_inputs[k] = v
                
        # Add prefix to differentiate between similar calls
        cache_inputs['_function'] = key_prefix
        
        # Try to get cached result
        try:
            key = cache_key(**cache_inputs)
            cached_result = get_cache(key)
            
            if cached_result is not None:
                show(f"Using cached result for {key_prefix}", "debug")
                
                # Convert dictionaries back to Pydantic models if needed
                if output_model and isinstance(cached_result, list) and cached_result and isinstance(cached_result[0], dict):
                    try:
                        cached_result = [output_model(**item) for item in cached_result]
                    except Exception as e:
                        show(f"Model reconstruction error: {str(e)}", "debug")
                
                return cached_result
        except Exception as e:
            show(f"Cache access error: {str(e)}", "debug")
    
    # Run chain if not in cache or caching disabled
    try:
        result = chain.invoke(inputs)
        
        # Try to cache the result
        if 'CACHE_ENABLED' in globals() and CACHE_ENABLED:
            try:
                set_cache(key, result)
            except Exception as e:
                show(f"Cache storage error: {str(e)}", "debug")
                
        return result
    except Exception as e:
        error_msg = str(e)
        show(f"Error in {key_prefix}: {error_msg}", "error")
        return [] if 'text' not in key_prefix else f"Error: {error_msg}"

# ===== DOCUMENT COLLECTION MANAGEMENT =====

def load_document_collection(documents: List[DocumentSource]) -> Dict[str, Any]:
    """Load and prepare multiple documents for processing"""
    collection_stats = {
        "document_count": len(documents),
        "total_size": 0,
        "processing_estimate": 0,
        "formats": {}
    }
    
    for i, doc in enumerate(documents):
        # Load document content
        doc_data = load_document(doc.source_type, doc.content)
        
        # Update document with content and metadata
        doc.metadata.update(doc_data.get("metadata", {}))
        if not doc.title and "filename" in doc.metadata:
            doc.title = doc.metadata["filename"]
        if not doc.title:
            doc.title = f"Document {i+1}"
            
        # Generate document ID if not provided
        if not doc.source_id:
            doc.source_id = hashlib.md5(doc_data.get("text", "")[:1000].encode()).hexdigest()[:10]
        
        # Update collection stats
        collection_stats["total_size"] += len(doc_data.get("text", ""))
        collection_stats["formats"][doc.source_type] = collection_stats["formats"].get(doc.source_type, 0) + 1
    
    # Estimate processing time based on document count and size
    doc_count = collection_stats["document_count"]
    total_size = collection_stats["total_size"]
    base_time_per_doc = 20  # seconds - faster with RAG
    time_per_10kb = 5       # seconds - faster with RAG
    collection_stats["processing_estimate"] = int((doc_count * base_time_per_doc) + (total_size / 10240 * time_per_10kb))
    
    return collection_stats

# ===== ELEMENT EXTRACTION FUNCTIONS (RAG-BASED) =====

def get_config_value(config: Union[Dict, Any], key: str, default: Any) -> Any:
    """Helper to get a value from either dict or object config"""
    if isinstance(config, dict):
        return config.get(key, default)
    return getattr(config, key, default)

def extract_elements_with_rag(
    doc_data: Dict[str, Any],
    config: Union[Dict, Any], 
    document_id: str, 
    element_type: Literal["claims", "methodologies", "contributions", "directions"],
    output_model: Type,
    prompt_name: str
) -> List[Any]:
    """Extract elements using RAG approach with focused retrieval and context"""
    text = doc_data.get("text", "")
    vector_store = doc_data.get("vector_store")
    chunks = doc_data.get("chunks", [])
    
    if not text or len(text.strip()) < 20:
        show(f"Text too short for {element_type} extraction", "warning")
        return []
    
    show(f"Extracting {element_type} using RAG from document ({len(text)} chars)...", "info")
    
    # Define retrieval queries based on element type
    retrieval_queries = {
        "claims": "What are the main scientific claims, findings, and assertions in this paper?",
        "methodologies": "What research methodologies, techniques, and approaches are described in this paper?",
        "contributions": "What are the key contributions, innovations, and findings in this paper?",
        "directions": "What future research directions are suggested or implied in this paper?"
    }
    
    # Define section priorities for each element type
    section_priorities = {
        "claims": ["abstract", "results", "discussion", "conclusion"],
        "methodologies": ["methods", "methodology", "materials", "experimental"],
        "contributions": ["abstract", "introduction", "discussion", "conclusion"],
        "directions": ["discussion", "conclusion", "future work", "limitations"]
    }
    
    try:
        # Get retrieval context
        if vector_store is not None:
            # Use vector retrieval
            query = retrieval_queries.get(element_type, f"Information about {element_type}")
            retrieved_docs = vector_store.similarity_search(
                query,
                k=5  # Get top 5 relevant chunks
            )
            retrieved_texts = [doc.page_content for doc in retrieved_docs]
            
            # Combine retrieved texts
            retrieval_context = "\n\n---\n\n".join(retrieved_texts)
        else:
            # Fallback: Use prioritized sections or first/last parts
            prioritized_chunks = []
            priority_sections = section_priorities.get(element_type, [])
            
            # First, try to find chunks from priority sections
            for chunk in chunks:
                section = chunk.metadata.get("section", "").lower()
                for priority in priority_sections:
                    if priority in section:
                        prioritized_chunks.append(chunk)
                        break
            
            # If no priority chunks found, use document start and end
            if not prioritized_chunks and chunks:
                # Include beginning chunks
                prioritized_chunks.extend(chunks[:2])
                # Include ending chunks if document is long enough
                if len(chunks) > 4:
                    prioritized_chunks.extend(chunks[-2:])
            
            # Combine texts from prioritized chunks
            retrieval_context = "\n\n---\n\n".join([chunk.page_content for chunk in prioritized_chunks])
            
            # If still empty, use document summary (first and last portions)
            if not retrieval_context:
                doc_start = text[:2000] if len(text) > 2000 else text
                doc_end = text[-2000:] if len(text) > 4000 else ""
                # Only add the elision marker when there is a separate ending portion
                retrieval_context = f"{doc_start}\n\n[...]\n\n{doc_end}" if doc_end else doc_start
            
        # Create chain for extraction
        chain = create_chain(prompt_name, output_model)
        
        # Convert config to dict if needed
        config_dict = config if isinstance(config, dict) else config.model_dump()
        
        # Execute extraction with retrieved context
        elements = cached_run(chain, {
            "text": retrieval_context, 
            "document_id": document_id, 
            "config": config_dict
        }, element_type)
        
        # Add source document to all elements
        source_field_name = "source_documents"  # All models use the same field name
        for element in elements:
            sources = getattr(element, source_field_name, [])
            if document_id not in sources:
                setattr(element, source_field_name, sources + [document_id])
        
        if not elements:
            show(f"No {element_type} extracted", "warning")
            return []
        
        # Apply filtering by importance if applicable. Check an instance rather than
        # the class: Pydantic fields are not exposed as class attributes, so
        # hasattr(output_model, "importance") is False even when the field exists.
        if hasattr(elements[0], "importance"):
            importance_map = {"high": 3, "medium": 2, "low": 1}
            min_importance = get_config_value(config, "min_importance", "medium")
            min_value = importance_map.get(min_importance, 1)
            
            elements = [e for e in elements if importance_map.get(getattr(e, "importance", "low"), 0) >= min_value]
            elements.sort(key=lambda e: importance_map.get(getattr(e, "importance", "low"), 0), reverse=True)
        
        # Apply maximum limit from config
        max_config_key = f"max_{element_type}"
        max_elements = get_config_value(config, max_config_key, 20)
        
        show(f"Extracted {len(elements[:max_elements])} {element_type}", "info")
        return elements[:max_elements]
        
    except Exception as e:
        show(f"Error in {element_type} extraction: {str(e)}", "error")
        return []

# Element-specific extraction functions using the RAG-based extractor
def extract_scientific_claims(doc_data: Union[Dict[str, Any], str], config: Union[Dict, Any], document_id: str) -> List[ScientificClaim]:
    """Extract scientific claims using RAG with document source tracking"""
    # Handle both RAG and legacy input formats
    if isinstance(doc_data, str):
        # Legacy mode - convert to minimal doc_data for backward compatibility
        doc_data = {"text": doc_data, "vector_store": None, "chunks": []}
    
    return extract_elements_with_rag(doc_data, config, document_id, "claims", ScientificClaim, "claim_extraction")

def extract_methodologies(doc_data: Union[Dict[str, Any], str], config: Union[Dict, Any], document_id: str) -> List[Methodology]:
    """Extract research methodologies using RAG with document source tracking"""
    # Handle both RAG and legacy input formats
    if isinstance(doc_data, str):
        # Legacy mode - convert to minimal doc_data for backward compatibility
        doc_data = {"text": doc_data, "vector_store": None, "chunks": []}
    
    return extract_elements_with_rag(doc_data, config, document_id, "methodologies", Methodology, "methodology_extraction")

def extract_key_contributions(doc_data: Union[Dict[str, Any], str], config: Union[Dict, Any], document_id: str) -> List[KeyContribution]:
    """Extract key contributions using RAG with document source tracking"""
    # Handle both RAG and legacy input formats
    if isinstance(doc_data, str):
        # Legacy mode - convert to minimal doc_data for backward compatibility
        doc_data = {"text": doc_data, "vector_store": None, "chunks": []}
    
    return extract_elements_with_rag(doc_data, config, document_id, "contributions", KeyContribution, "contribution_extraction")

def extract_research_directions(doc_data: Union[Dict[str, Any], str], config: Union[Dict, Any], document_id: str) -> List[ResearchDirection]:
    """Extract research directions using RAG with document source tracking"""
    # Handle both RAG and legacy input formats
    if isinstance(doc_data, str):
        # Legacy mode - convert to minimal doc_data for backward compatibility
        doc_data = {"text": doc_data, "vector_store": None, "chunks": []}
    
    return extract_elements_with_rag(doc_data, config, document_id, "directions", ResearchDirection, "direction_extraction")

# ===== ELEMENT DEDUPLICATION AND MERGING =====

def text_similarity(text1: str, text2: str) -> float:
    """Calculate simple text similarity using Jaccard index"""
    # Convert to lowercase and remove punctuation
    t1 = re.sub(r'[^\w\s]', '', text1.lower())
    t2 = re.sub(r'[^\w\s]', '', text2.lower())
    
    # Word sets
    words1 = set(t1.split())
    words2 = set(t2.split())
    
    # Calculate Jaccard similarity
    intersection = len(words1.intersection(words2))
    union = len(words1.union(words2))
    
    return intersection / union if union > 0 else 0.0

def get_element_text(element: Any) -> str:
    """Extract primary text content from different element types"""
    element_type = type(element).__name__.lower()
    
    if element_type == "scientificclaim":
        return element.claim_text
    elif element_type == "methodology":
        return f"{element.method_name}: {element.description}"
    elif element_type == "keycontribution":
        return element.contribution_text
    elif element_type == "researchdirection":
        return element.direction_text
    return str(element)

def merge_source_documents(existing: Any, element: Any) -> None:
    """Merge source documents from element into existing element"""
    if hasattr(existing, 'source_documents') and hasattr(element, 'source_documents'):
        for doc_id in element.source_documents:
            if doc_id not in existing.source_documents:
                existing.source_documents.append(doc_id)

def deduplicate_elements(elements: List[Any], similarity_threshold: float = 0.75) -> List[Any]:
    """Deduplicate similar elements while preserving source document information"""
    if not elements or len(elements) <= 1:
        return elements
        
    # Use embeddings to calculate similarity when available
    if "embeddings" in globals() and embeddings is not None:
        try:
            # Get text to embed for each element
            element_texts = [get_element_text(element) for element in elements]
            
            # Generate embeddings and precompute each vector's norm once
            embedded = embeddings.embed_documents(element_texts)
            norms = [sum(x * x for x in vec) ** 0.5 for vec in embedded]
            
            # Find similar elements
            result = []
            added_indices = set()
            
            for i, element in enumerate(elements):
                if i in added_indices:
                    continue
                    
                result.append(element)
                added_indices.add(i)
                
                # Compare against every later element that has not been merged yet
                for j in range(i + 1, len(elements)):
                    if j in added_indices:
                        continue
                        
                    # Cosine similarity; a zero-length vector is treated as dissimilar
                    denominator = norms[i] * norms[j]
                    if denominator == 0:
                        continue
                    dot = sum(a * b for a, b in zip(embedded[i], embedded[j]))
                    similarity = dot / denominator
                    
                    if similarity >= similarity_threshold:
                        # Merge source documents
                        merge_source_documents(element, elements[j])
                        added_indices.add(j)
            
            return result
            
        except Exception as e:
            show(f"Error using embeddings for deduplication: {str(e)}", "debug")
    
    # Fallback implementation using word-level Jaccard similarity
    result = []
    kept_texts = []  # kept_texts[k] is the normalized text of result[k]
    
    for element in elements:
        # Create normalized text for comparison
        compare_text = get_element_text(element).lower()
        
        # Find the first kept element that is similar enough (list order keeps this deterministic)
        match = next(
            (existing for existing, existing_text in zip(result, kept_texts)
             if text_similarity(compare_text, existing_text) >= similarity_threshold),
            None
        )
        
        if match is not None:
            # Duplicate: merge its source documents into the kept element
            merge_source_documents(match, element)
        else:
            kept_texts.append(compare_text)
            result.append(element)
            
    return result

# ===== CROSS-DOCUMENT ANALYSIS =====

def format_elements_for_analysis(elements: List[Dict[str, Any]]) -> str:
    """Format elements for cross-document relationship analysis"""
    return "\n\n".join([
        f"ID: {item.get('element_id', f'element_{i}')}\n"
        f"Type: {item.get('type', 'unknown')}\n"
        f"Text: {item.get('text', '')}\n"
        f"Documents: {', '.join(item.get('documents', []))}"
        for i, item in enumerate(elements)
    ])

def identify_cross_document_relationships(
    claims: List[ScientificClaim], 
    methodologies: List[Methodology],
    contributions: List[KeyContribution],
    directions: List[ResearchDirection],
    config: Union[Dict, Any]
) -> List[CrossDocumentRelationship]:
    """Identify relationships between elements from different documents"""
    # Create a mapping of all elements for easy lookup
    element_map = {}
    
    # Helper to add elements to map
    def add_to_map(elements, element_type, id_field, text_getter):
        for element in elements:
            if hasattr(element, 'source_documents') and len(element.source_documents) > 0:
                element_map[getattr(element, id_field)] = {
                    "element_id": getattr(element, id_field),
                    "type": element_type,
                    "text": text_getter(element),
                    "documents": element.source_documents
                }
    
    # Add all element types to map
    add_to_map(claims, "claim", "claim_id", lambda e: e.claim_text)
    add_to_map(methodologies, "methodology", "method_id", lambda e: f"{e.method_name}: {e.description}")
    add_to_map(contributions, "contribution", "contribution_id", lambda e: e.contribution_text)
    add_to_map(directions, "direction", "direction_id", lambda e: e.direction_text)
    
    # Consider every element that has at least one source document
    cross_doc_elements = list(element_map)
    
    # Keep one representative element per distinct set of source documents.
    # This bounds the number of LLM calls, but other elements that share the
    # same document set are not sent for relationship analysis.
    doc_combos = set()
    unique_elements = []
    
    for element_id in cross_doc_elements:
        docs = frozenset(element_map[element_id]["documents"])
        if docs not in doc_combos:
            doc_combos.add(docs)
            unique_elements.append(element_id)
    
    # Get config values
    relationship_confidence = get_config_value(config, 'relationship_confidence', 0.6)
    max_relationships = get_config_value(config, 'max_cross_relationships', 50)
    
    # Process in batches to avoid LLM context limits; relationships are only
    # searched for within a batch, not across batches
    batch_size = 10
    relationships = []
    
    for i in range(0, len(unique_elements), batch_size):
        batch = unique_elements[i:i+batch_size]
        batch_elements = [element_map[element_id] for element_id in batch]
        
        # Run relationship analysis chain
        chain = create_chain("cross_document_relationship", CrossDocumentRelationship)
        batch_relationships = cached_run(chain, {
            "elements": format_elements_for_analysis(batch_elements)
        }, "cross_relationships")
        
        # Guard against a chain run that returns nothing
        relationships.extend(batch_relationships or [])
    
    # Filter and sort by confidence
    valid_relationships = [r for r in relationships if r.confidence >= relationship_confidence]
    valid_relationships.sort(key=lambda r: r.confidence, reverse=True)
    
    return valid_relationships[:max_relationships]

# ===== SYNTHESIS GENERATION =====

def format_items_for_synthesis(items: List[Any], category_type: str) -> str:
    """Format items for synthesis prompt"""
    formatted = []
    
    for item in items:
        if category_type == "claims":
            text = [
                f"• {item.claim_text}",
                f"  - Type: {item.claim_type}",
                f"  - Importance: {item.importance}"
            ]
            if item.evidence:
                text.append(f"  - Evidence: {item.evidence}")
            text.append(f"  - Source Documents: {', '.join(item.source_documents)}")
            formatted.append("\n".join(text))
            
        elif category_type == "methodologies":
            text = [
                f"• {item.method_name}",
                f"  - Description: {item.description}"
            ]
            if item.context:
                text.append(f"  - Context: {item.context}")
            if item.limitations:
                text.append(f"  - Limitations: {item.limitations}")
            text.append(f"  - Source Documents: {', '.join(item.source_documents)}")
            formatted.append("\n".join(text))
            
        elif category_type == "contributions":
            text = [
                f"• {item.contribution_text}",
                f"  - Type: {item.contribution_type}",
                f"  - Importance: {item.importance}"
            ]
            if item.related_claims:
                text.append(f"  - Related Claims: {len(item.related_claims)}")
            if item.related_methods:
                text.append(f"  - Related Methods: {len(item.related_methods)}")
            text.append(f"  - Source Documents: {', '.join(item.source_documents)}")
            formatted.append("\n".join(text))
            
        elif category_type == "directions":
            text = [
                f"• {item.direction_text}",
                f"  - Importance: {item.importance}"
            ]
            if item.rationale:
                text.append(f"  - Rationale: {item.rationale}")
            text.append(f"  - Source Documents: {', '.join(item.source_documents)}")
            formatted.append("\n".join(text))

    return "\n\n".join(formatted)

def synthesize_across_category(items: List[Any], category_type: str, document_count: int) -> CategorySynthesis:
    """Generate synthesis across documents for a specific category"""
    if not items:
        return CategorySynthesis(
            category=category_type,
            synthesis_text=f"No {category_type} were extracted from the documents.",
            element_count=0,
            document_count=document_count
        )

    # Format items for the prompt
    items_text = format_items_for_synthesis(items, category_type)

    # Use category-specific prompt
    prompt_key = f"{category_type}_synthesis"

    # Run synthesis chain
    chain = create_chain(prompt_key)
    synthesis_text = cached_run(chain, {
        "items": items_text,
        "document_count": document_count,
        "category": category_type
    }, f"{category_type}_synthesis")

    return CategorySynthesis(
        category=category_type,
        synthesis_text=synthesis_text,
        element_count=len(items),
        document_count=document_count
    )

def format_relationships_for_synthesis(relationships: List[CrossDocumentRelationship]) -> str:
    """Format relationships for synthesis prompt"""
    return "\n\n".join([
        f"• {rel.source_element_type.capitalize()} '{rel.source_element_id}' {rel.relationship_type} " +
        f"{rel.target_element_type.capitalize()} '{rel.target_element_id}'" +
        (f"\n  - Evidence: {rel.evidence}" if rel.evidence else "") +
        f"\n  - Confidence: {rel.confidence:.2f}"
        for rel in relationships
    ])

def generate_multi_document_synthesis(
    documents: List[DocumentSource],
    category_syntheses: Dict[str, CategorySynthesis],
    relationships: List[CrossDocumentRelationship]
) -> str:
    """Generate comprehensive synthesis across all documents and categories"""
    # Extract document information
    document_titles = [doc.title or f"Document {i+1}" for i, doc in enumerate(documents)]

    # Get category syntheses (safely): a missing category yields an empty string
    def get_synthesis(cat: str) -> str:
        synthesis = category_syntheses.get(cat)
        return synthesis.synthesis_text if synthesis else ""

    claims_synthesis = get_synthesis("claims")
    methods_synthesis = get_synthesis("methodologies")
    contributions_synthesis = get_synthesis("contributions")
    directions_synthesis = get_synthesis("directions")

    # Format relationship information
    relationship_info = format_relationships_for_synthesis(relationships)

    # Create overall synthesis
    chain = create_chain("multi_document_synthesis")
    synthesis = cached_run(chain, {
        "document_count": len(documents),
        "document_titles": document_titles,
        "claims_synthesis": claims_synthesis,
        "methods_synthesis": methods_synthesis,
        "contributions_synthesis": contributions_synthesis,
        "directions_synthesis": directions_synthesis,
        "relationship_count": len(relationships),
        "relationship_info": relationship_info
    }, "final_synthesis")

    return synthesis

# ===== MAIN MULTI-DOCUMENT ANALYSIS FUNCTION =====

def analyze_multiple_documents(
    documents: List[DocumentSource], 
    config: Optional[Union[Dict, Any]] = None
) -> MultiDocSynthesisOutput:
    """Complete end-to-end multi-document analysis using RAG approach"""
    if config is None:
        config = LitSynthMultiConfig().model_dump()

    try:
        # 1. Initialize results container
        results = MultiDocSynthesisOutput()
        results.documents = documents

        # 2. Load and prepare document collection
        collection_stats = load_document_collection(documents)
        show(f"Processing {len(documents)} documents...", "info")

        # 3. Process each document using RAG approach
        all_claims = []
        all_methodologies = []
        all_contributions = []
        all_directions = []

        for i, doc in enumerate(documents):
            show(f"Processing document {i+1}/{len(documents)}: {doc.title or f'Document {i+1}'}", "info")

            # Process document with RAG approach
            doc_data = process_doc(doc.source_type, doc.content, doc.source_id, config)

            if "error" in doc_data:
                show(f"Failed to process document {i+1}: {doc_data['error']}", "error")
                continue

            # Extract all element types using RAG
            doc_claims = extract_scientific_claims(doc_data, config, doc.source_id)
            doc_methodologies = extract_methodologies(doc_data, config, doc.source_id)
            doc_contributions = extract_key_contributions(doc_data, config, doc.source_id)
            doc_directions = extract_research_directions(doc_data, config, doc.source_id)

            # Add to collection
            all_claims.extend(doc_claims)
            all_methodologies.extend(doc_methodologies)
            all_contributions.extend(doc_contributions)
            all_directions.extend(doc_directions)

            # Mark document as processed
            doc.processed = True

        # 4. Deduplicate across documents
        similarity_threshold = get_config_value(config, 'similarity_threshold', 0.75)

        claims = deduplicate_elements(all_claims, similarity_threshold)
        methodologies = deduplicate_elements(all_methodologies, similarity_threshold)
        contributions = deduplicate_elements(all_contributions, similarity_threshold)
        directions = deduplicate_elements(all_directions, similarity_threshold)

        # 5. Identify cross-document relationships
        relationships = identify_cross_document_relationships(
            claims, methodologies, contributions, directions, config
        )

        # 6. Generate category-specific syntheses
        category_syntheses = {
            "claims": synthesize_across_category(claims, "claims", len(documents)),
            "methodologies": synthesize_across_category(methodologies, "methodologies", len(documents)),
            "contributions": synthesize_across_category(contributions, "contributions", len(documents)),
            "directions": synthesize_across_category(directions, "directions", len(documents))
        }

        # 7. Generate overall synthesis
        overall_synthesis = generate_multi_document_synthesis(
            documents, category_syntheses, relationships
        )

        # 8. Populate results
        results.claims = claims
        results.methodologies = methodologies
        results.contributions = contributions
        results.research_directions = directions
        results.cross_document_relationships = relationships
        results.category_syntheses = category_syntheses
        results.overall_synthesis = overall_synthesis

        return results

    except Exception as e:
        # Surface the failure, then return an error container
        show(f"Multi-document analysis failed: {str(e)}", "error")
        error_output = MultiDocSynthesisOutput(
            analysis_id=f"error_{int(time.time())}",
            documents=documents
        )
        error_output.overall_synthesis = f"Analysis error: {str(e)}"
        return error_output

# ===== STATE MANAGEMENT UTILITY FUNCTIONS =====

def save_session(state: LitSynthMultiState, filepath: Optional[str] = None) -> str:
    """Save current analysis session to disk"""
    if not filepath:
        filepath = STATES_DIR / f"{state.session_id}.json"

    saved_path = state.save(filepath)
    show(f"Analysis session saved to {saved_path}", "success")
    return saved_path

def load_session(filepath: str) -> LitSynthMultiState:
    """Load analysis session from disk"""
    try:
        state = LitSynthMultiState.load(filepath)
        show(f"Analysis session loaded from {filepath}", "success")
        return state
    except Exception as e:
        show(f"Error loading session: {e}", "error")
        return LitSynthMultiState()

def process_document_from_state(state: LitSynthMultiState, document_id: str) -> bool:
    """Process a single document from the state using RAG approach"""
    # Find the document in state
    document = next((doc for doc in state.data.get('documents', []) if doc.source_id == document_id), None)

    if not document:
        show(f"Document {document_id} not found in state", "error")
        return False

    # Skip if already processed
    if document.processed:
        show(f"Document {document.title or document_id} already processed", "info")
        return True

    try:
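        # Get config from state as a plain dict (dict, config model, or unset)
        if isinstance(state.config, dict):
            config = state.config
        elif state.config is None:
            config = LitSynthMultiConfig().model_dump()
        else:
            config = state.config.model_dump()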

        # Process document with RAG approach
        doc_data = process_doc(document.source_type, document.content, document.source_id, config)

        if "error" in doc_data:
            show(f"Failed to process document {document_id}: {doc_data['error']}", "error")
            return False

        # Extract all element types using RAG
        results = {
            'claims': extract_scientific_claims(doc_data, config, document.source_id),
            'methodologies': extract_methodologies(doc_data, config, document.source_id),
            'contributions': extract_key_contributions(doc_data, config, document.source_id),
            'research_directions': extract_research_directions(doc_data, config, document.source_id)
        }

        # Update state with all results
        for category, elements in results.items():
            state.update_extraction_results(document.source_id, category, elements)

        # Mark document as processed and signal cross-document update
        state.mark_document_processed(document.source_id)
        state.regenerate_cross_document_analysis()

        return True

    except Exception as e:
        show(f"Error processing document {document_id}: {e}", "error")
        return False

# Initialization complete
show("Core functions initialized for multi-document analysis with RAG approach", "success")
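
The text-based fallback used for deduplication can be tried in isolation. The sketch below is a minimal, self-contained approximation of that idea: `Item` is a hypothetical stand-in for the notebook's element models, and `jaccard` and `dedupe` are simplified illustrations written for this example, not the notebook's own functions.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical stand-in for an extracted element (not a notebook model)."""
    text: str
    source_documents: List[str] = field(default_factory=list)


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard index after lowercasing and stripping punctuation."""
    words_a = set(re.sub(r"[^\w\s]", "", a.lower()).split())
    words_b = set(re.sub(r"[^\w\s]", "", b.lower()).split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def dedupe(items: List[Item], threshold: float = 0.75) -> List[Item]:
    """Keep the first of each group of similar items and merge source IDs into it."""
    kept: List[Item] = []
    for item in items:
        match = next((k for k in kept if jaccard(k.text, item.text) >= threshold), None)
        if match is None:
            kept.append(item)
            continue
        # Duplicate: record where else the same statement was found
        for doc_id in item.source_documents:
            if doc_id not in match.source_documents:
                match.source_documents.append(doc_id)
    return kept


items = [
    Item("Dropout reduces overfitting in deep networks.", ["doc_a"]),
    Item("Dropout reduces overfitting in deep neural networks", ["doc_b"]),
    Item("Batch normalization speeds up training.", ["doc_c"]),
]
unique = dedupe(items)
for u in unique:
    print(u.text, u.source_documents)
```

With the default threshold of 0.75, the first two items are merged because they share 6 of 7 distinct words (a Jaccard index of about 0.86), and the surviving item lists both `doc_a` and `doc_b` as sources. The third item shares no words with the others and is kept separately.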

## Launch UI

In [None]:
# System Initialization
"""
This cell initializes the LitSynth-Multidoc system, connecting all components
defined in previous sections and preparing the system for document analysis.
"""

from pathlib import Path
import time

try:
    # Initialize the Literature Synthesis Multi-Document System
    class LitSynthMultiSystem:
        """Main system class that coordinates all components of the Multi-Document Literature Synthesis system."""
        
        def __init__(self, state=None, config=None):
            """Initialize the system with state management and configuration.
            
            Args:
                state: Existing state container or None to use global state
                config: Configuration parameters or None to use default/global config
            """
            # Use the given state, else the notebook's global state, else a fresh one
            if state is None:
                state = globals().get('state')
            self.state = state if state is not None else LitSynthMultiState()
            
            # Use existing config or create new one
            if config:
                self.state.config = config
            elif not self.state.config and 'config' in globals():
                self.state.config = globals()['config'].model_dump()
            
            # System status tracking
            self.system_id = f"litsynth_multi_{int(time.time())}"
            
            # Register system in state for tracking
            self.state.status.update({
                "system_initialized": True,
                "system_id": self.system_id,
                "initialization_time": time.time()
            })
        
        def analyze_document(self, document: DocumentSource) -> bool:
            """Process a single document.
            
            Args:
                document: The document to analyze
                
            Returns:
                True if processing was successful
            """
            # Add document to state if not already present
            if not any(doc.source_id == document.source_id for doc in self.state.data.get('documents', [])):
                self.state.add_document(document)
            
            # Process the document
            return process_document_from_state(self.state, document.source_id)
        
        def analyze_multiple_documents(self, documents: List[DocumentSource]) -> MultiDocSynthesisOutput:
            """Process multiple documents at once.
            
            Args:
                documents: List of documents to analyze
                
            Returns:
                Complete multi-document analysis results
            """
            # Delegate to the standalone function; results are returned to the caller
            # and are not written to self.state
            return analyze_multiple_documents(documents, self.state.config)
        
        def get_cross_document_relationships(self) -> List[CrossDocumentRelationship]:
            """Get relationships between documents in the current state.
            
            Returns:
                List of cross-document relationships
            """
            # Extract needed information from state
            claims = self.state.data.get('claims', [])
            methodologies = self.state.data.get('methodologies', [])
            contributions = self.state.data.get('contributions', [])
            directions = self.state.data.get('research_directions', [])
            
            # Generate relationships
            return identify_cross_document_relationships(
                claims, methodologies, contributions, directions, self.state.config
            )
        
        def generate_syntheses(self) -> Dict[str, Any]:
            """Generate syntheses for all categories.
            
            Returns:
                Dictionary with category syntheses and overall synthesis
            """
            # Get documents and extracted elements
            documents = self.state.data.get('documents', [])
            claims = self.state.data.get('claims', [])
            methodologies = self.state.data.get('methodologies', [])
            contributions = self.state.data.get('contributions', [])
            directions = self.state.data.get('research_directions', [])
            
            # Get or generate relationships
            relationships = self.state.data.get('cross_document_relationships', [])
            if not relationships:
                relationships = self.get_cross_document_relationships()
                
            # Generate category syntheses
            category_syntheses = {
                "claims": synthesize_across_category(claims, "claims", len(documents)),
                "methodologies": synthesize_across_category(methodologies, "methodologies", len(documents)),
                "contributions": synthesize_across_category(contributions, "contributions", len(documents)),
                "directions": synthesize_across_category(directions, "directions", len(documents))
            }
            
            # Generate overall synthesis
            overall_synthesis = generate_multi_document_synthesis(
                documents, category_syntheses, relationships
            )
            
            # Update state
            self.state.update('category_syntheses', category_syntheses)
            self.state.update('overall_synthesis', overall_synthesis)
            self.state.update('cross_document_relationships', relationships)
            
            return {
                "category_syntheses": category_syntheses,
                "overall_synthesis": overall_synthesis
            }
        
        def save_session(self, filepath: Optional[str] = None) -> str:
            """Save current session to disk.
            
            Args:
                filepath: Optional custom filepath
                
            Returns:
                Path to the saved file
            """
            return save_session(self.state, filepath)
        
        def load_session(self, filepath: str) -> bool:
            """Load session from disk.
            
            Args:
                filepath: Path to the saved state file
                
            Returns:
                True if loading was successful
            """
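            try:
                # Load directly so that a failure leaves the current state untouched.
                # The module-level load_session() returns an empty state on error
                # instead of raising, which would make this method always succeed.
                self.state = LitSynthMultiState.load(filepath)
                show(f"Analysis session loaded from {filepath}", "success")
                return True
            except Exception as e:
                show(f"Error loading session: {str(e)}", "error")
                return False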
        
        def get_current_results(self) -> MultiDocSynthesisOutput:
            """Get complete current results.
            
            Returns:
                Complete results in output model format
            """
            return self.state.to_output_model()

    # Initialize the system (connects previously defined components)
    # Use the global state that was already initialized in the Data Models section
    litsynth_multi = LitSynthMultiSystem(state=state)
    
    # Confirm successful initialization
    show("LitSynth Multi-Document System initialized successfully", "success")
    
except Exception as e:
    # Use show from Core Utilities for error (with level parameter)
    error_msg = f"System failed to initialize: {str(e)}"
    show(f"{error_msg}\nPlease make sure you run the cells in order: Installation-Initialization-Data Models-Core Functions", "error")
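
The register-once, process-once behaviour that `analyze_document` relies on can be sketched without the rest of the notebook. In the example below, `Doc` and `State` are hypothetical stand-ins for `DocumentSource` and `LitSynthMultiState`, and the extraction step is replaced by a placeholder flag, so treat it as an illustration of the control flow only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for DocumentSource."""
    source_id: str
    processed: bool = False


@dataclass
class State:
    """Hypothetical stand-in for LitSynthMultiState."""
    data: Dict[str, List[Doc]] = field(default_factory=lambda: {"documents": []})

    def add_document(self, doc: Doc) -> None:
        self.data["documents"].append(doc)


def analyze_document(state: State, doc: Doc) -> bool:
    """Register the document once, then process it only if still unprocessed."""
    docs = state.data.get("documents", [])
    if not any(d.source_id == doc.source_id for d in docs):
        state.add_document(doc)
    stored = next(d for d in state.data["documents"] if d.source_id == doc.source_id)
    if stored.processed:
        return True  # already done; nothing to repeat
    stored.processed = True  # placeholder for the real extraction work
    return True


state = State()
first = analyze_document(state, Doc("paper_1"))
second = analyze_document(state, Doc("paper_1"))  # same ID submitted again
print(len(state.data["documents"]))
```

Submitting the same `source_id` twice leaves a single entry in the state, and the second call returns early because the stored document is already marked as processed.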

In [None]:
# UI Structure
"""
This cell defines the Gradio UI layout for the LitSynth-Multidoc system.
The UI is designed for simplicity and follows the same visual style as the 
original LitSynth system, with adaptations for multi-document workflows.
"""

import gradio as gr
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

def create_multi_doc_ui():
    """Create optimized Literature Synthesis UI for multiple documents."""
    
    with gr.Blocks(theme=gr.themes.Default()) as app:
        gr.Markdown("# 📚 Multi-Document Literature Synthesis Expert System")
        
        with gr.Tabs() as tabs:
            # === DOCUMENT INPUT TAB ===
            with gr.Tab("📄 Document Upload"):
                with gr.Row():
                    with gr.Column(scale=3):
                        # Multiple PDF Upload
                        pdf_input = gr.File(
                            label="Upload Scientific PDFs (max 5 documents)", 
                            file_types=[".pdf"],
                            file_count="multiple"
                        )
                        
                        with gr.Row():
                            analysis_mode = gr.Radio(
                                choices=["quick", "balanced", "thorough"],
                                label="Analysis Mode",
                                value="balanced",
                                info="Quick: 30-40s per doc, Balanced: 1-2min per doc, Thorough: 3-5min per doc"
                            )
                            analyze_btn = gr.Button("Run Analysis", variant="primary", size="lg")
                    
                    with gr.Column(scale=2):
                        status_box = gr.Textbox(
                            label="Status", 
                            interactive=False,
                            value="Upload 1-5 PDF documents to begin analysis"
                        )
                        progress_bar = gr.Slider(
                            minimum=0, maximum=100, value=0, 
                            label="Processing Progress",
                            interactive=False
                        )
                
                with gr.Row():
                    document_list = gr.Dataframe(
                        headers=["Document", "Status", "Size", "Pages", "Processing Time"],
                        datatype=["str", "str", "str", "number", "str"],
                        label="Document Queue"
                    )
            
            # === EXTRACTION RESULTS TAB ===
            with gr.Tab("🔍 Extraction Results"):
                with gr.Row():
                    # Document selection dropdown
                    doc_selector = gr.Dropdown(
                        choices=["All Documents"], 
                        value="All Documents",
                        label="Document Filter"
                    )
                    
                    importance_filter = gr.Radio(
                        choices=["all", "high", "medium", "low"],
                        label="Importance Filter",
                        value="all"
                    )
                
                with gr.Tabs() as results_tabs:
                    # Claims Tab
                    with gr.Tab("Scientific Claims"):
                        claims_table = gr.DataFrame(
                            headers=["Claim", "Type", "Importance", "Source Documents", "Confidence"],
                            label="Extracted Claims"
                        )
                    
                    # Methodologies Tab
                    with gr.Tab("Research Methodologies"):
                        methods_table = gr.DataFrame(
                            headers=["Method Name", "Description", "Limitations", "Source Documents", "Confidence"],
                            label="Extracted Methodologies"
                        )
                    
                    # Contributions Tab
                    with gr.Tab("Key Contributions"):
                        contributions_table = gr.DataFrame(
                            headers=["Contribution", "Type", "Importance", "Source Documents", "Confidence"],
                            label="Extracted Contributions"
                        )
                    
                    # Research Directions Tab
                    with gr.Tab("Research Directions"):
                        directions_table = gr.DataFrame(
                            headers=["Direction", "Rationale", "Importance", "Source Documents", "Confidence"],
                            label="Extracted Research Directions"
                        )
                
                with gr.Row():
                    with gr.Column():
                        gr.Markdown("### Cross-Document Relationships")
                        relationships_table = gr.DataFrame(
                            headers=["Source Element", "Relationship", "Target Element", "Evidence", "Confidence"],
                            label="Identified Relationships"
                        )
                    
                    with gr.Column():
                        relationships_plot = gr.Plot(label="Cross-Document Network")
                        with gr.Row():
                            min_confidence = gr.Slider(
                                minimum=0.0, maximum=1.0, value=0.6, step=0.1,
                                label="Minimum Confidence"
                            )
                            refresh_viz_btn = gr.Button("Refresh Visualization")
            
            # === SYNTHESIS TAB ===
            with gr.Tab("📝 Research Synthesis"):
                with gr.Row():
                    run_synthesis_btn = gr.Button("Generate Synthesis", variant="primary")
                    synthesis_status = gr.Textbox(
                        label="Synthesis Status", 
                        value="Click 'Generate Synthesis' to create synthesis across all documents",
                        interactive=False
                    )
                
                with gr.Tabs() as synthesis_tabs:
                    # Overall Synthesis
                    with gr.Tab("Overall Synthesis"):
                        overall_synthesis = gr.Markdown()
                    
                    # Category Syntheses
                    with gr.Tab("Claims Synthesis"):
                        claims_synthesis = gr.Markdown()
                    
                    with gr.Tab("Methodologies Synthesis"):
                        methods_synthesis = gr.Markdown()
                    
                    with gr.Tab("Contributions Synthesis"):
                        contributions_synthesis = gr.Markdown()
                    
                    with gr.Tab("Research Directions Synthesis"):
                        directions_synthesis = gr.Markdown()
                
                with gr.Row():
                    export_format = gr.Dropdown(
                        choices=["Markdown", "Text", "JSON"],
                        label="Export Format",
                        value="Markdown"
                    )
                    export_btn = gr.Button("Export Synthesis")
                    save_session_btn = gr.Button("Save Session")
            
            # === SETTINGS TAB ===
            with gr.Tab("⚙️ Settings"):
                with gr.Row():
                    with gr.Column():
                        # Multi-document settings
                        gr.Markdown("#### Multi-Document Settings")
                        parallel_processing = gr.Checkbox(
                            label="Enable Parallel Processing",
                            value=False,
                            info="Process documents in parallel (requires more memory)"
                        )
                        
                        # Text processing settings
                        gr.Markdown("#### Text Processing")
                        chunk_size = gr.Slider(
                            minimum=1000, maximum=8000, value=4000, step=500,
                            label="Chunk Size (chars)",
                            info="Larger chunks capture more context but process slower"
                        )
                        chunk_overlap = gr.Slider(
                            minimum=50, maximum=1000, value=150, step=50,
                            label="Chunk Overlap"
                        )
                        
                    with gr.Column():
                        # Element Extraction Settings
                        gr.Markdown("#### Element Extraction Settings")
                        min_importance = gr.Dropdown(
                            choices=["low", "medium", "high"],
                            label="Min Importance",
                            value="medium"
                        )
                        
                        # Category limits
                        gr.Markdown("#### Maximum Elements per Document")
                        max_claims = gr.Slider(
                            minimum=5, maximum=100, value=20, step=5,
                            label="Max Claims"
                        )
                        max_methods = gr.Slider(
                            minimum=3, maximum=50, value=10, step=5,
                            label="Max Methodologies"
                        )
                        max_contributions = gr.Slider(
                            minimum=3, maximum=50, value=15, step=5,
                            label="Max Contributions"
                        )
                        max_directions = gr.Slider(
                            minimum=3, maximum=50, value=10, step=5,
                            label="Max Research Directions"
                        )
                        
                    with gr.Column():
                        # Cross-Document Settings
                        gr.Markdown("#### Cross-Document Analysis")
                        relationship_confidence = gr.Slider(
                            minimum=0.0, maximum=1.0, value=0.6, step=0.1,
                            label="Min Relationship Confidence"
                        )
                        max_cross_relationships = gr.Slider(
                            minimum=10, maximum=200, value=50, step=10,
                            label="Max Cross-Document Relationships"
                        )
                        similarity_threshold = gr.Slider(
                            minimum=0.5, maximum=0.95, value=0.75, step=0.05,
                            label="Element Similarity Threshold",
                            info="Higher values require more similarity for deduplication"
                        )
                        
                        scientific_domain = gr.Textbox(
                            label="Scientific Domain (Optional)",
                            placeholder="e.g., Molecular Biology, Quantum Physics",
                            info="Helps tailor analysis to specific domains"
                        )
                        
                        apply_settings_btn = gr.Button("Apply Settings", variant="primary")
                        settings_status = gr.Textbox(label="Settings Status", interactive=False)
                        
                        # Session Management
                        gr.Markdown("#### Session Management")
                        with gr.Row():
                            load_session_btn = gr.Button("Load Session")
                            session_file = gr.File(
                                label="Session File",
                                file_types=[".json"],
                                file_count="single"
                            )
        
        # === State Variables ===
        system_state = gr.State(None)
        results_state = gr.State(None)
    
    return app

# Create the UI
multidoc_ui = create_multi_doc_ui()

# Display success message in notebook
show("UI structure for multi-document system defined successfully", "success")
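
The launch cell that follows encodes each document in the filter dropdown as `Title (doc_id)` and later recovers the ID from that label with a regular expression. A minimal, standalone sketch of that round trip is given here. `make_label` and `parse_doc_id` are hypothetical helper names that do not exist in the notebook, and the pattern shown is only one possible choice: it accepts a final parenthesized group and nothing else.

```python
import re

ALL_DOCUMENTS = "All Documents"  # sentinel label used by the dropdown


def make_label(title: str, doc_id: str) -> str:
    """Build a dropdown label of the form 'Title (doc_id)' (hypothetical helper)."""
    return f"{title} ({doc_id})"


def parse_doc_id(label: str):
    """Return the trailing doc_id from a label, or None when no ID is present."""
    if label == ALL_DOCUMENTS:
        return None
    # Anchor at the end of the string and forbid nested parentheses so that
    # only the last "(...)" group is treated as the document ID.
    match = re.search(r"\(([^()]*)\)$", label)
    return match.group(1) if match else None


label = make_label("Document 1", "doc_0")
print(label, "->", parse_doc_id(label))
```

Anchoring the pattern to the end of the label and excluding nested parentheses keeps a title such as `Survey (2023 update)` from being mistaken for part of the ID.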

In [None]:
# UI Launch
"""Streamlined LitSynth-Multidoc UI launch with RAG-optimized document processing."""

import tempfile, os, time, json, re
import pandas as pd
from pathlib import Path

def launch_litsynth_multi_ui():
    """Launch the Multi-Document Literature Synthesis UI using RAG-based core functions."""
    
    # === Helper Functions (UI-specific only) ===
    
    def format_document_list(results):
        """Format document list for display in UI table."""
        if not results: return pd.DataFrame()
        
        docs = []
        for doc_id, result in results.items():
            if not isinstance(result, dict) or doc_id in ["summary", "cross_document_relationships"]:
                continue
                
            if "error" in result:
                docs.append({
                    "Document": result.get("title", doc_id),
                    "Status": "Error",
                    "Size": "N/A",
                    "Pages": "N/A",
                    "Processing Time": "N/A"
                })
            else:
                metadata = result.get("metadata", {})
                docs.append({
                    "Document": result.get("title", doc_id),
                    "Status": "Complete",
                    "Size": f"{metadata.get('original_length', 0):,} chars",
                    "Pages": metadata.get("pages", "N/A"),
                    "Processing Time": f"{metadata.get('processing_time', 0):.1f}s"
                })
        
        return pd.DataFrame(docs)
    
    def get_doc_selector_options(results):
        """Get document selector options for dropdown."""
        options = ["All Documents"]
        if results:
            for doc_id, result in results.items():
                if isinstance(result, dict) and doc_id not in ["summary", "cross_document_relationships"]:
                    options.append(f"{result.get('title', doc_id)} ({doc_id})")
        return options, "All Documents"
    
    def filter_and_format_elements(results, doc_filter, category, importance_filter="all"):
        """Filter elements by document and format for display."""
        if not results: return pd.DataFrame()
        
        # Get document ID from filter
        selected_doc = None
        if doc_filter != "All Documents":
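            # Match only the final parenthesized group so titles that contain
            # parentheses still resolve to the trailing "(doc_id)" suffix
            doc_id_match = re.search(r'\(([^()]*)\)$', doc_filter)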
            if doc_id_match:
                selected_doc = doc_id_match.group(1)
        
        # Collect elements
        elements = []
        if selected_doc:
            result = results.get(selected_doc, {})
            if isinstance(result, dict): elements.extend(result.get(category, []))
        else:
            for doc_id, result in results.items():
                if isinstance(result, dict) and doc_id not in ["summary", "cross_document_relationships"]:
                    elements.extend(result.get(category, []))
        
        # Filter by importance
        if importance_filter != "all" and elements and hasattr(elements[0], "importance"):
            elements = [e for e in elements if e.importance == importance_filter]
        
        # Format based on category
        data = []
        for element in elements:
            # Get source documents
            source_docs = []
            for doc_id in element.source_documents:
                doc_result = results.get(doc_id, {})
                doc_title = doc_result.get("title", doc_id) if isinstance(doc_result, dict) else doc_id
                source_docs.append(doc_title)
                
            # Format based on element type
            if category == "claims":
                data.append({
                    "Claim": element.claim_text,
                    "Type": element.claim_type.capitalize(),
                    "Importance": element.importance.capitalize(),
                    "Source Documents": ", ".join(source_docs),
                    "Confidence": f"{element.confidence:.2f}"
                })
            elif category == "methodologies":
                data.append({
                    "Method Name": element.method_name,
                    "Description": element.description,
                    "Limitations": element.limitations or "Not specified",
                    "Source Documents": ", ".join(source_docs),
                    "Confidence": f"{element.confidence:.2f}"
                })
            elif category == "contributions":
                data.append({
                    "Contribution": element.contribution_text,
                    "Type": element.contribution_type.capitalize(),
                    "Importance": element.importance.capitalize(),
                    "Source Documents": ", ".join(source_docs),
                    "Confidence": f"{element.confidence:.2f}"
                })
            elif category == "research_directions":
                data.append({
                    "Direction": element.direction_text,
                    "Rationale": element.rationale or "Not specified",
                    "Importance": element.importance.capitalize(),
                    "Source Documents": ", ".join(source_docs),
                    "Confidence": f"{element.confidence:.2f}"
                })
                
        df = pd.DataFrame(data)
        if not df.empty and "Importance" in df.columns:
            # Sort by importance
            importance_map = {"High": 0, "Medium": 1, "Low": 2}
            df["_imp"] = df["Importance"].map(importance_map)
            df = df.sort_values(by=["_imp", "Confidence"], ascending=[True, False])
            df = df.drop(columns=["_imp"])
            
        return df
    
    def format_relationships_table(results):
        """Format relationships for display in UI table."""
        if not results: return pd.DataFrame()
        relationships = results.get("cross_document_relationships", [])
        if not relationships: return pd.DataFrame()
        
        # Map elements to text and type
        element_info = {}
        for doc_id, result in results.items():
            if not isinstance(result, dict) or doc_id in ["summary", "cross_document_relationships"]:
                continue
                
            for category, id_field, text_field, type_label in [
                ("claims", "claim_id", "claim_text", "Claim"),
                ("methodologies", "method_id", "method_name", "Method"),
                ("contributions", "contribution_id", "contribution_text", "Contribution"),
                ("research_directions", "direction_id", "direction_text", "Direction")
            ]:
                for item in result.get(category, []):
                    if hasattr(item, id_field) and hasattr(item, text_field):
                        item_id = getattr(item, id_field)
                        text = getattr(item, text_field)
                        element_info[item_id] = {
                            "text": text[:40] + "..." if len(text) > 40 else text,
                            "type": type_label
                        }
        
        # Format data
        data = []
        for rel in relationships:
            source_info = element_info.get(rel.source_element_id, {"text": rel.source_element_id, "type": rel.source_element_type.capitalize()})
            target_info = element_info.get(rel.target_element_id, {"text": rel.target_element_id, "type": rel.target_element_type.capitalize()})
            
            data.append({
                "Source Element": f"{source_info['type']}: {source_info['text']}",
                "Relationship": rel.relationship_type.capitalize(),
                "Target Element": f"{target_info['type']}: {target_info['text']}",
                "Evidence": rel.evidence or "N/A",
                "Confidence": f"{rel.confidence:.2f}"
            })
        
        return pd.DataFrame(data).sort_values(by="Confidence", ascending=False)
    
    # === Main Processing Function (using RAG-based core functions) ===
    
    def process_documents(files, analysis_mode):
        """Process documents using RAG-based core analyze_multiple_documents function."""
        if not files:
            yield "Please upload at least one document.", None, 0
            return
        
        start_time = time.time()
        num_docs = len(files)
        yield f"Processing {num_docs} documents...", None, 5
        
        try:
            # Create DocumentSource objects from files
            documents = []
            for i, file in enumerate(files):
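                # Gradio may pass plain path strings or tempfile-like objects,
                # depending on the installed version
                if isinstance(file, str):
                    file_path = file
                else:
                    file_path = getattr(file, 'name', None) or getattr(file, 'path', None)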
                
                doc = DocumentSource(
                    source_id=f"doc_{i}",
                    title=f"Document {i+1}",
                    source_type="pdf",
                    content=file_path,
                    processed=False
                )
                documents.append(doc)
            
            # Set config based on analysis mode
            # Work on a copy so mode-specific overrides never mutate the global config
            config_dict = config.model_dump() if hasattr(config, 'model_dump') else dict(config)
            if analysis_mode == "quick":
                config_dict.update({
                    "text_chunk_size": min(config_dict.get("text_chunk_size", 4000), 2500),
                    "min_importance": "high",
                    "max_claims": min(config_dict.get("max_claims", 20), 10),
                    "max_methodologies": min(config_dict.get("max_methodologies", 10), 5),
                    "max_contributions": min(config_dict.get("max_contributions", 15), 8),
                    "max_directions": min(config_dict.get("max_directions", 10), 5)
                })
            elif analysis_mode == "thorough":
                config_dict.update({
                    "text_chunk_size": max(config_dict.get("text_chunk_size", 4000), 5000),
                    "min_importance": "low"
                })
            
            # Use the RAG-enabled analyze_multiple_documents function
            yield "Analyzing documents using RAG-based multi-document analysis...", None, 20
            results = analyze_multiple_documents(documents, config_dict)
            
            # Transform results to match UI expectations
            ui_results = {}
            for i, doc in enumerate(documents):
                doc_id = doc.source_id
                doc_elements = {
                    "document_id": doc_id,
                    "title": doc.title,
                    "claims": results.get_elements_for_document(doc_id, "claims"),
                    "methodologies": results.get_elements_for_document(doc_id, "methodologies"),
                    "contributions": results.get_elements_for_document(doc_id, "contributions"),
                    "research_directions": results.get_elements_for_document(doc_id, "directions"),
                    "metadata": {
                        "original_length": len(load_document(doc.source_type, doc.content).get("text", "")),
                        "processing_time": (time.time() - start_time) / num_docs,
                        "pages": doc.metadata.get("pages", "N/A")
                    }
                }
                ui_results[doc_id] = doc_elements
            
            # Add cross-document relationships
            ui_results["cross_document_relationships"] = results.cross_document_relationships
            
            # Add summary stats
            ui_results["summary"] = {
                "total_documents": num_docs,
                "total_processing_time": round(time.time() - start_time, 2),
                "total_elements_extracted": (
                    len(results.claims) + 
                    len(results.methodologies) + 
                    len(results.contributions) + 
                    len(results.research_directions)
                ),
                "analysis_mode": analysis_mode,
                "cross_document_relationships": len(results.cross_document_relationships),
                "processing_approach": "RAG-based"  # Indicate that RAG was used
            }
            
            yield f"Analysis complete: {num_docs} docs in {time.time() - start_time:.2f}s (using RAG)", ui_results, 100
            
        except Exception as e:
            yield f"Error in document processing: {str(e)}", None, 0
    
    # === Synthesis Function (using core functions) ===
    
    def generate_synthesis(results):
        """Generate synthesis using core functions."""
        if not results: 
            return {"error": "No results available"}
        
        try:
            # Collect documents from results
            documents = []
            elements = {"claims": [], "methodologies": [], "contributions": [], "research_directions": []}
            
            for doc_id, result in results.items():
                if not isinstance(result, dict) or doc_id in ["summary", "cross_document_relationships"]:
                    continue
                    
                doc = DocumentSource(
                    source_id=doc_id,
                    title=result.get("title", doc_id),
                    source_type="text",
                    content="",  # Empty content as we're only using for synthesis
                    processed=True
                )
                documents.append(doc)
                
                # Collect elements
                for category in elements:
                    elements[category].extend(result.get(category, []))
            
            # Get relationships
            relationships = results.get("cross_document_relationships", [])
            
            # Generate category syntheses using core functions
            category_syntheses = {
                "claims": synthesize_across_category(elements["claims"], "claims", len(documents)),
                "methodologies": synthesize_across_category(elements["methodologies"], "methodologies", len(documents)),
                "contributions": synthesize_across_category(elements["contributions"], "contributions", len(documents)),
                "directions": synthesize_across_category(elements["research_directions"], "directions", len(documents))
            }
            
            # Generate overall synthesis
            overall = generate_multi_document_synthesis(documents, category_syntheses, relationships)
            
            return {
                "overall": overall,
                "claims": category_syntheses["claims"].synthesis_text,
                "methodologies": category_syntheses["methodologies"].synthesis_text,
                "contributions": category_syntheses["contributions"].synthesis_text,
                "directions": category_syntheses["directions"].synthesis_text
            }
        except Exception as e:
            return {
                "error": str(e),
                "overall": f"Error: {str(e)}",
                "claims": "Synthesis failed",
                "methodologies": "Synthesis failed",
                "contributions": "Synthesis failed",
                "directions": "Synthesis failed"
            }
    
    # === Settings Update Function ===
    
    def update_settings(use_rag, parallel, chunk_size, chunk_overlap, min_importance, 
                       max_claims, max_methods, max_contribs, max_dirs, 
                       rel_conf, max_rels, sim_threshold, domain):
        """Update system configuration with RAG toggle."""
        try:
            new_config = LitSynthMultiConfig(
                # Add RAG toggle (will be ignored by older versions)
                use_rag=use_rag,
                text_chunk_size=chunk_size,
                text_chunk_overlap=chunk_overlap,
                min_importance=min_importance,
                extraction_confidence=0.7,
                max_claims=max_claims,
                max_methodologies=max_methods,
                max_contributions=max_contribs,
                max_directions=max_dirs,
                relationship_confidence=rel_conf,
                max_cross_relationships=max_rels,
                similarity_threshold=sim_threshold,
                parallel_processing=parallel,
                scientific_domain=domain if domain else None
            )
            
            # Update global config
            global config
            config = new_config
            
            rag_status = "enabled" if use_rag else "disabled"
            return f"Settings updated: RAG {rag_status}, {chunk_size} chunk size, {min_importance} min importance"
        except Exception as e:
            return f"Error updating configuration: {str(e)}"
    
    # === Session Management Functions ===
    
    def save_current_session(results, synthesis):
        """Save current session state to file."""
        if not results:
            return "No session data to save."
            
        try:
            session_data = {
                "results": results,
                "synthesis": synthesis,
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
                "processing_approach": "RAG-based"  # Add RAG indicator
            }
            
            # Save to temporary file
            with tempfile.NamedTemporaryFile(delete=False, suffix='.json', mode='w') as f:
                json.dump(session_data, f, default=lambda o: o.__dict__ if hasattr(o, '__dict__') else str(o))
                filepath = f.name
            
            return filepath
        except Exception as e:
            return f"Error saving session: {str(e)}"
            
    def load_saved_session(file):
        """Load session from file."""
        if not file:
            return None, "Please select a session file."
            
        try:
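            # Gradio may pass a plain path string or a tempfile-like object
            path = file if isinstance(file, str) else file.name
            with open(path, 'r') as f:
                session_data = json.load(f)

            # Note: JSON round-tripping restores extracted elements as plain dicts,
            # not the original model objects
            return session_data.get("results"), "Session loaded successfully."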
        except Exception as e:
            return None, f"Error loading session: {str(e)}"
    
    # === Filter Update Function ===
    
    def filter_update_handler(results, doc_filter, importance):
        """Update all tables based on filter changes."""
        return (
            filter_and_format_elements(results, doc_filter, "claims", importance),
            filter_and_format_elements(results, doc_filter, "methodologies", importance),
            filter_and_format_elements(results, doc_filter, "contributions", importance),
            filter_and_format_elements(results, doc_filter, "research_directions", importance)
        )
    
    # === Create UI and Connect Event Handlers ===
    
    # Setup Gradio UI components
    import gradio as gr
    
    with gr.Blocks() as app:
        gr.Markdown("# 📚 Multi-Document Literature Synthesis Expert System")
        
        # State containers
        results_state = gr.State(None)
        synthesis_state = gr.State(None)
        
        with gr.Tabs() as tabs:
            # Document Upload Tab
            with gr.Tab("📄 Document Upload"):
                with gr.Row():
                    with gr.Column(scale=3):
                        pdf_input = gr.File(label="Upload Scientific PDFs (max 5)", file_types=[".pdf"], file_count="multiple")
                        with gr.Row():
                            analysis_mode = gr.Radio(choices=["quick", "balanced", "thorough"], label="Analysis Mode", value="balanced")
                            analyze_btn = gr.Button("Run Analysis", variant="primary")
                    with gr.Column(scale=2):
                        status_box = gr.Textbox(label="Status", interactive=False, value="Upload 1-5 PDF documents to begin analysis", lines=5)
                        progress_bar = gr.Slider(minimum=0, maximum=100, value=0, label="Processing Progress", interactive=False)
                document_list = gr.DataFrame(headers=["Document", "Status", "Size", "Pages", "Processing Time"])
            
            # Extraction Results Tab
            with gr.Tab("🔍 Extraction Results"):
                with gr.Row():
                    doc_selector = gr.Dropdown(choices=["All Documents"], value="All Documents", label="Document Filter")
                    importance_filter = gr.Radio(choices=["all", "high", "medium", "low"], label="Importance Filter", value="all")
                
                with gr.Tabs() as results_tabs:
                    with gr.Tab("Scientific Claims"):
                        claims_table = gr.DataFrame(headers=["Claim", "Type", "Importance", "Source Documents", "Confidence"])
                    with gr.Tab("Research Methodologies"):
                        methods_table = gr.DataFrame(headers=["Method Name", "Description", "Limitations", "Source Documents", "Confidence"])
                    with gr.Tab("Key Contributions"):
                        contributions_table = gr.DataFrame(headers=["Contribution", "Type", "Importance", "Source Documents", "Confidence"])
                    with gr.Tab("Research Directions"):
                        directions_table = gr.DataFrame(headers=["Direction", "Rationale", "Importance", "Source Documents", "Confidence"])
                
                with gr.Row():
                    with gr.Column():
                        gr.Markdown("### Cross-Document Relationships")
                        relationships_table = gr.DataFrame(headers=["Source Element", "Relationship", "Target Element", "Evidence", "Confidence"])
                    
                    with gr.Column():
                        relationships_plot = gr.Plot(label="Cross-Document Network")
                        with gr.Row():
                            min_confidence = gr.Slider(
                                minimum=0.0, maximum=1.0, value=0.6, step=0.1,
                                label="Minimum Confidence"
                            )
                            refresh_viz_btn = gr.Button("Refresh Visualization")
            
            # Synthesis Tab
            with gr.Tab("📝 Research Synthesis"):
                with gr.Row():
                    run_synthesis_btn = gr.Button("Generate Synthesis", variant="primary")
                    synthesis_status = gr.Textbox(label="Synthesis Status", value="Click 'Generate Synthesis' to create synthesis across all documents", interactive=False)
                
                with gr.Tabs() as synthesis_tabs:
                    with gr.Tab("Overall Synthesis"):
                        overall_synthesis = gr.Markdown()
                    with gr.Tab("Claims Synthesis"):
                        claims_synthesis = gr.Markdown()
                    with gr.Tab("Methodologies Synthesis"):
                        methods_synthesis = gr.Markdown()
                    with gr.Tab("Contributions Synthesis"):
                        contributions_synthesis = gr.Markdown()
                    with gr.Tab("Research Directions Synthesis"):
                        directions_synthesis = gr.Markdown()
                
                with gr.Row():
                    export_format = gr.Dropdown(
                        choices=["Markdown", "Text", "JSON"],
                        label="Export Format",
                        value="Markdown"
                    )
                    export_btn = gr.Button("Export Synthesis")
                    save_session_btn = gr.Button("Save Session")
            
            # Settings Tab
            with gr.Tab("⚙️ Settings"):
                with gr.Row():
                    with gr.Column():
                        # RAG toggle (shown first because it changes how the text settings below are used)
                        use_rag_approach = gr.Checkbox(
                            label="Use RAG Approach (Recommended)",
                            value=True,
                            info="Uses embeddings to retrieve relevant sections instead of chunking"
                        )
                        parallel_processing = gr.Checkbox(
                            label="Enable Parallel Processing",
                            value=False,
                            info="Process documents in parallel (requires more memory)"
                        )
                        
                        # Text processing settings
                        gr.Markdown("#### Text Processing")
                        chunk_size = gr.Slider(
                            minimum=1000, maximum=8000, value=2000, step=500,
                            label="Chunk Size (chars)",
                            info="Smaller chunks (2000-3000) recommended with RAG for better performance"
                        )
                        chunk_overlap = gr.Slider(
                            minimum=50, maximum=1000, value=50, step=50,
                            label="Chunk Overlap",
                            info="Minimal overlap recommended with RAG approach"
                        )
                        
                    with gr.Column():
                        # Element Extraction Settings
                        gr.Markdown("#### Element Extraction Settings")
                        min_importance = gr.Dropdown(
                            choices=["low", "medium", "high"],
                            label="Min Importance",
                            value="medium"
                        )
                        
                        # Category limits
                        gr.Markdown("#### Maximum Elements per Document")
                        max_claims = gr.Slider(
                            minimum=5, maximum=100, value=20, step=5,
                            label="Max Claims"
                        )
                        max_methods = gr.Slider(
                            minimum=3, maximum=50, value=10, step=5,
                            label="Max Methodologies"
                        )
                        max_contributions = gr.Slider(
                            minimum=3, maximum=50, value=15, step=5,
                            label="Max Contributions"
                        )
                        max_directions = gr.Slider(
                            minimum=3, maximum=50, value=10, step=5,
                            label="Max Research Directions"
                        )
                        
                    with gr.Column():
                        # Cross-Document Settings
                        gr.Markdown("#### Cross-Document Analysis")
                        relationship_confidence = gr.Slider(
                            minimum=0.0, maximum=1.0, value=0.6, step=0.1,
                            label="Min Relationship Confidence"
                        )
                        max_cross_relationships = gr.Slider(
                            minimum=10, maximum=200, value=50, step=10,
                            label="Max Cross-Document Relationships"
                        )
                        similarity_threshold = gr.Slider(
                            minimum=0.5, maximum=0.95, value=0.75, step=0.05,
                            label="Element Similarity Threshold",
                            info="Higher values require more similarity for deduplication"
                        )
                        
                        scientific_domain = gr.Textbox(
                            label="Scientific Domain (Optional)",
                            placeholder="e.g., Molecular Biology, Quantum Physics",
                            info="Helps tailor analysis to specific domains"
                        )
                        
                        apply_settings_btn = gr.Button("Apply Settings", variant="primary")
                        settings_status = gr.Textbox(label="Settings Status", interactive=False)
                        
                        # Session Management
                        gr.Markdown("#### Session Management")
                        with gr.Row():
                            load_session_btn = gr.Button("Load Session")
                            session_file = gr.File(
                                label="Session File",
                                file_types=[".json"],
                                file_count="single"
                            )
        
        # Event Handlers
        analyze_btn.click(
            fn=process_documents,  # RAG-based processing function
            inputs=[pdf_input, analysis_mode],
            outputs=[status_box, results_state, progress_bar]
        ).then(
            fn=format_document_list,
            inputs=[results_state],
            outputs=[document_list]
        ).then(
            fn=get_doc_selector_options,
            inputs=[results_state],
            outputs=[doc_selector, doc_selector]
        ).then(
            fn=lambda r, d, i: filter_and_format_elements(r, d, "claims", i),
            inputs=[results_state, doc_selector, importance_filter],
            outputs=[claims_table]
        ).then(
            fn=lambda r, d, i: filter_and_format_elements(r, d, "methodologies", i),
            inputs=[results_state, doc_selector, importance_filter],
            outputs=[methods_table]
        ).then(
            fn=lambda r, d, i: filter_and_format_elements(r, d, "contributions", i),
            inputs=[results_state, doc_selector, importance_filter],
            outputs=[contributions_table]
        ).then(
            fn=lambda r, d, i: filter_and_format_elements(r, d, "research_directions", i),
            inputs=[results_state, doc_selector, importance_filter],
            outputs=[directions_table]
        ).then(
            fn=format_relationships_table,
            inputs=[results_state],
            outputs=[relationships_table]
        )
        
        # Filter update event
        for filter_input in [doc_selector, importance_filter]:
            filter_input.change(
                fn=filter_update_handler,
                inputs=[results_state, doc_selector, importance_filter],
                outputs=[claims_table, methods_table, contributions_table, directions_table]
            )
        
        # Synthesis events
        run_synthesis_btn.click(
            fn=lambda r: "Generating synthesis across all documents..." if r else "Please analyze documents first",
            inputs=[results_state],
            outputs=[synthesis_status]
        ).then(
            fn=generate_synthesis,
            inputs=[results_state],
            outputs=[synthesis_state]
        ).then(
            fn=lambda s: s.get("overall", "No overall synthesis available.") if s else "No synthesis available.",
            inputs=[synthesis_state],
            outputs=[overall_synthesis]
        ).then(
            fn=lambda s: s.get("claims", "No claims synthesis available.") if s else "No synthesis available.",
            inputs=[synthesis_state],
            outputs=[claims_synthesis]
        ).then(
            fn=lambda s: s.get("methodologies", "No methodologies synthesis available.") if s else "No synthesis available.",
            inputs=[synthesis_state],
            outputs=[methods_synthesis]
        ).then(
            fn=lambda s: s.get("contributions", "No contributions synthesis available.") if s else "No synthesis available.",
            inputs=[synthesis_state],
            outputs=[contributions_synthesis]
        ).then(
            fn=lambda s: s.get("directions", "No directions synthesis available.") if s else "No synthesis available.",
            inputs=[synthesis_state],
            outputs=[directions_synthesis]
        ).then(
            fn=lambda: "Synthesis complete! Review the tabs above for category-specific syntheses.",
            outputs=[synthesis_status]
        )
        
        # Settings update event (includes use_rag_approach)
        apply_settings_btn.click(
            fn=update_settings,
            inputs=[
                use_rag_approach, parallel_processing, chunk_size, chunk_overlap, min_importance,
                max_claims, max_methods, max_contributions, max_directions,
                relationship_confidence, max_cross_relationships, similarity_threshold, scientific_domain
            ],
            outputs=[settings_status]
        )
        
        # Session management events
        save_session_btn.click(
            fn=save_current_session,
            inputs=[results_state, synthesis_state],
            outputs=[synthesis_status]
        )
        
        load_session_btn.click(
            fn=load_saved_session,
            inputs=[session_file],
            outputs=[results_state, settings_status]
        ).then(
            fn=format_document_list,
            inputs=[results_state],
            outputs=[document_list]
        ).then(
            fn=get_doc_selector_options,
            inputs=[results_state],
            outputs=[doc_selector, doc_selector]
        ).then(
            fn=filter_update_handler,
            inputs=[results_state, doc_selector, importance_filter],
            outputs=[claims_table, methods_table, contributions_table, directions_table]
        ).then(
            fn=format_relationships_table,
            inputs=[results_state],
            outputs=[relationships_table]
        )
    
    return app

# Initialize the system
if 'config' not in globals():
    # Create default config with RAG enabled
    config = LitSynthMultiConfig(use_rag=True)
    
# Launch the UI
app = launch_litsynth_multi_ui()
app.launch(quiet=True, inbrowser=True)

# Show success message
show("Multi-Document Literature Synthesis system with RAG approach launched successfully", "success")
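
The five synthesis `.then()` steps in the event handlers above all follow one pattern: read a key from the synthesis dictionary and fall back to a placeholder string. A minimal sketch of a factory that produces such callbacks is given below. `make_section_getter` is a hypothetical helper that is not part of this notebook; the keys and fallback messages are copied from the lambdas above.

```python
from typing import Callable, Optional


def make_section_getter(key: str, label: str) -> Callable[[Optional[dict]], str]:
    """Build a callback that returns one synthesis section or a fallback message.

    Hypothetical helper: mirrors the inline lambdas used in the synthesis chain.
    """
    def getter(synthesis: Optional[dict]) -> str:
        # An empty or missing synthesis state gets the generic message
        if not synthesis:
            return "No synthesis available."
        # A missing key gets the section-specific message
        return synthesis.get(key, f"No {label} synthesis available.")
    return getter


get_overall = make_section_getter("overall", "overall")
get_claims = make_section_getter("claims", "claims")

print(get_claims({"claims": "Claims agree across documents."}))
print(get_overall(None))
```

Each getter could be passed as `fn=` to a `.then()` step in place of the corresponding lambda, which keeps the fallback wording in one place.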

## [Temp] Diagnostics

In [None]:
# PDF Processing Diagnostic Script

import os
import time
import traceback
from pathlib import Path

def diagnose_pdf_processing(file_path):
    """Diagnose PDF processing issues with detailed error reporting."""
    
    print(f"🔍 DIAGNOSTIC LOG: PDF Processing Test")
    print(f"File path: {file_path}")
    print(f"File exists: {os.path.exists(file_path)}")
    print(f"File size: {os.path.getsize(file_path) if os.path.exists(file_path) else 'N/A'} bytes")
    
    # 1. Test file object structure (simulating Gradio file object)
    print("\n== Testing File Structure ==")
    try:
        # Create mock document similar to what Gradio would provide
        from types import SimpleNamespace
        doc = SimpleNamespace(
            source_id="test_doc",
            title="Test Document",
            source_type="pdf",
            content=file_path,
            processed=False
        )
        print("✓ Created test document object")
    except Exception as e:
        print(f"✗ Error creating document object: {str(e)}")
        print(traceback.format_exc())
    
    # 2. Test PDF text extraction
    print("\n== Testing PDF Extraction ==")
    try:
        # Try different PDF extraction methods
        print("Method 1: PyPDF2")
        import PyPDF2
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            num_pages = len(reader.pages)
            # Guard against pages with no extractable text before slicing
            sample_text = (reader.pages[0].extract_text() or "")[:100]
            print(f"✓ Pages found: {num_pages}")
            print(f"✓ Sample text: {sample_text}...")
    except ImportError:
        print("✗ PyPDF2 not installed")
    except Exception as e:
        print(f"✗ PyPDF2 extraction failed: {str(e)}")
        print(traceback.format_exc())
    
    try:
        print("\nMethod 2: pdfplumber")
        import pdfplumber
        with pdfplumber.open(file_path) as pdf:
            num_pages = len(pdf.pages)
            # Guard against pages with no extractable text before slicing
            sample_text = (pdf.pages[0].extract_text() or "")[:100]
            print(f"✓ Pages found: {num_pages}")
            print(f"✓ Sample text: {sample_text}...")
    except ImportError:
        print("✗ pdfplumber not installed")
    except Exception as e:
        print(f"✗ pdfplumber extraction failed: {str(e)}")
        print(traceback.format_exc())
    
    # 3. Test the actual document processing function if available
    print("\n== Testing process_document Function ==")
    try:
        if 'process_document' in globals():
            # Create a simple status callback
            def status_callback(doc_id, status):
                print(f"Status ({doc_id}): {status}")
            
            # Run the processing function with detailed error capture
            start_time = time.time()
            result = process_document(doc, doc.source_id, "balanced", status_callback)
            duration = time.time() - start_time
            
            print(f"✓ Processing completed in {duration:.2f}s")
            
            if "error" in result:
                print(f"✗ Processing error: {result['error']}")
            else:
                print(f"✓ Extracted elements:")
                print(f"  - Claims: {len(result.get('claims', []))}")
                print(f"  - Methodologies: {len(result.get('methodologies', []))}")
                print(f"  - Contributions: {len(result.get('contributions', []))}")
                print(f"  - Directions: {len(result.get('research_directions', []))}")
        else:
            print("✗ process_document function not available in global scope")
            
            # Try load_document function if available
            if 'load_document' in globals():
                print("\nTesting load_document function directly")
                doc_data = load_document("pdf", file_path)
                text_length = len(doc_data.get("text", ""))
                print(f"✓ Extracted text length: {text_length} chars")
                print(f"✓ Text sample: {doc_data.get('text', '')[:100]}...")
            else:
                print("✗ load_document function not available in global scope")
    except Exception as e:
        print(f"✗ Error testing process_document: {str(e)}")
        print(traceback.format_exc())
    
    print("\n== DIAGNOSTIC COMPLETE ==")

# Run the diagnostic on a user-provided PDF
if __name__ == "__main__":
    pdf_path = input("Enter the path to the PDF file to test: ")
    diagnose_pdf_processing(pdf_path)
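
The diagnostic above prints the first 100 characters of the first page. Scanned PDFs often have no text layer, and an extractor may return an empty string or `None` for such pages, so slicing the raw result is fragile. The sketch below shows one way to build a safe preview; `preview_text` is a hypothetical helper and is not defined elsewhere in this notebook.

```python
from typing import Optional


def preview_text(text: Optional[str], limit: int = 100) -> str:
    """Return a short single-line preview, tolerating None or empty text."""
    if not text:
        return "(no text extracted)"
    # Collapse newlines and repeated spaces so the preview fits on one line
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    return collapsed[:limit] + "..."


print(preview_text(None))
print(preview_text("ABSTRACT\n  This study introduces a novel approach."))
```

An explicit "(no text extracted)" marker makes an image-only PDF easy to tell apart from an extraction error in the diagnostic log.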

## Test

In [None]:
# Simplified Testing Framework
"""
Efficient testing framework for LitSynth-Multidoc that samples a few chunks
rather than processing entire documents and displays actual extraction content.
"""

import time, random, tempfile, os
import gradio as gr
from typing import Dict, List, Any
from IPython.display import Markdown, display

# ===== TEST UTILITIES =====

def show(msg, type="info"):
    """Display formatted messages in the notebook"""
    colors = {"success": "#00C853", "info": "#2196F3", "warning": "#FF9800", "error": "#F44336"}
    icons = {"success": "✅", "info": "ℹ️", "warning": "⚠️", "error": "❌"}
    display(Markdown(f"<div style='padding:8px;border-radius:4px;background:{colors[type]};color:white'>{icons[type]} {msg}</div>"))

# ===== SIMPLE TEST RUNNER =====

def run_sample_test(func_name, func, *args, **kwargs):
    """Run a single function test and measure performance"""
    # Force debug mode off
    global debug, DEBUG_MODE
    debug = False
    if 'DEBUG_MODE' in globals():
        DEBUG_MODE = False
    
    start_time = time.time()
    error = None
    result = None
    
    try:
        result = func(*args, **kwargs)
        success = True
    except Exception as e:
        error = str(e)
        success = False
    
    duration = time.time() - start_time
    
    # Determine if this was an LLM call
    is_llm = any(name in func_name for name in ["extract_", "identify_", "synthesize_", "generate_"])
    
    # Print result immediately
    status = "✅ success" if success else f"❌ error: {error}"
    
    details = ""
    if success:
        if isinstance(result, list):
            details = f"count: {len(result)}"
            
            # Display content of first item if available
            if len(result) > 0:
                if hasattr(result[0], 'claim_text'):
                    details += f", first claim: \"{result[0].claim_text[:100]}...\""
                elif hasattr(result[0], 'method_name'):
                    details += f", first method: \"{result[0].method_name}\""
                elif hasattr(result[0], 'contribution_text'):
                    details += f", first contribution: \"{result[0].contribution_text[:100]}...\""
                elif hasattr(result[0], 'direction_text'):
                    details += f", first direction: \"{result[0].direction_text[:100]}...\""
        elif hasattr(result, "model_dump"):
            details = f"type: {result.__class__.__name__}"
    
    print(f"{func_name}\t{status}\t{duration:.2f}s\t{details}")
    
    return {
        "name": func_name,
        "success": success,
        "duration": duration,
        "is_llm": is_llm,
        "result": result,
        "error": error
    }

# ===== CORE TEST FUNCTIONS =====

def sample_chunks_from_document(document, config, num_chunks=2):
    """Extract a few sample chunks from a document for testing"""
    # Load document
    doc_data = load_document(document.source_type, document.content)
    text = doc_data.get("text", "")
    
    if not text:
        return []
    
    # Create chunks
    all_chunks = chunk_text(text, config)
    
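    # Return every chunk when the document has no more than requested
    if len(all_chunks) <= num_chunks:
        return all_chunks
    
    # A single requested chunk would divide by zero in the spacing formula below
    if num_chunks == 1:
        return [all_chunks[0]]
    
    # Select evenly spaced chunks (first through last) for better coverage
    indices = [int(i * (len(all_chunks) - 1) / (num_chunks - 1)) for i in range(num_chunks)]
    return [all_chunks[i] for i in indices]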

def run_extraction_tests(document, chunks, config):
    """Run extraction tests on a small sample of chunks"""
    results = []
    extracted_content = {
        "claims": [],
        "methodologies": [],
        "contributions": [],
        "directions": []
    }
    
    # Test only a few chunks instead of the entire document
    for i, chunk in enumerate(chunks):
        print(f"Testing chunk {i+1}/{len(chunks)} (length: {len(chunk)} chars)")
        
        # Test claim extraction
        claims_result = run_sample_test(
            "extract_scientific_claims",
            extract_scientific_claims,
            chunk, config, document.source_id
        )
        results.append(claims_result)
        if claims_result["success"] and claims_result["result"]:
            extracted_content["claims"].extend(claims_result["result"])
        
        # Test methodology extraction
        methods_result = run_sample_test(
            "extract_methodologies",
            extract_methodologies,
            chunk, config, document.source_id
        )
        results.append(methods_result)
        if methods_result["success"] and methods_result["result"]:
            extracted_content["methodologies"].extend(methods_result["result"])
        
        # Test contribution extraction
        contribs_result = run_sample_test(
            "extract_key_contributions",
            extract_key_contributions,
            chunk, config, document.source_id
        )
        results.append(contribs_result)
        if contribs_result["success"] and contribs_result["result"]:
            extracted_content["contributions"].extend(contribs_result["result"])
        
        # Test direction extraction
        dirs_result = run_sample_test(
            "extract_research_directions",
            extract_research_directions,
            chunk, config, document.source_id
        )
        results.append(dirs_result)
        if dirs_result["success"] and dirs_result["result"]:
            extracted_content["directions"].extend(dirs_result["result"])
    
    # Display summary of extracted content
    print("\nExtracted Content Summary:")
    for category, items in extracted_content.items():
        print(f"- {category.title()}: {len(items)} items")
        
        # Display first 2 items of each category
        for idx, item in enumerate(items[:2]):
            if hasattr(item, 'claim_text'):
                print(f"  {idx+1}. Claim: \"{item.claim_text}\"")
            elif hasattr(item, 'method_name'):
                print(f"  {idx+1}. Method: \"{item.method_name}\" - {item.description[:100]}...")
            elif hasattr(item, 'contribution_text'):
                print(f"  {idx+1}. Contribution: \"{item.contribution_text}\"")
            elif hasattr(item, 'direction_text'):
                print(f"  {idx+1}. Direction: \"{item.direction_text}\"")
    
    return results, extracted_content

def run_state_tests(config):
    """Test state management functionality"""
    results = []
    
    # Create test state
    state = LitSynthMultiState(config=config)
    
    # Test document addition
    results.append(run_sample_test(
        "state_add_document",
        state.add_document,
        DocumentSource(
            source_id="test_doc",
            title="Test Document",
            authors=["Test Author"],
            source_type="text",
            content="Test content"
        )
    ))
    
    # Test state serialization
    temp_file = os.path.join(tempfile.mkdtemp(), "test_state.json")
    results.append(run_sample_test(
        "save_session",
        save_session,
        state, temp_file
    ))
    
    # Test state loading
    if os.path.exists(temp_file):
        results.append(run_sample_test(
            "load_session",
            load_session,
            temp_file
        ))
    
    return results

# ===== MAIN TEST RUNNER =====

def run_quick_tests(pdf_files=None):
    """Run simplified tests focusing on core functionality"""
    print("Running simplified LitSynth-Multidoc tests...")
    print("Function\tStatus\tLatency\tDetails")
    
    all_results = []
    test_docs = []
    all_extracted_content = {}
    
    # Create config with sensible defaults for testing
    config = LitSynthMultiConfig().model_dump()
    
    # Test data models
    try:
        DocumentSource(source_id="test", title="Test", source_type="text", content="test")
        print("DocumentSource Model\t✅ success\t0.00s\t-")
    except Exception as e:
        print(f"DocumentSource Model\t❌ error\t0.00s\t{str(e)}")
    
    try:
        LitSynthMultiConfig()
        print("LitSynthMultiConfig Model\t✅ success\t0.00s\t-")
    except Exception as e:
        print(f"LitSynthMultiConfig Model\t❌ error\t0.00s\t{str(e)}")
    
    try:
        LitSynthMultiState(config=config)
        print("State Management Model\t✅ success\t0.00s\t-")
    except Exception as e:
        print(f"State Management Model\t❌ error\t0.00s\t{str(e)}")
    
    # Prepare test documents
    if pdf_files:
        # Use uploaded PDFs
        for i, pdf_file in enumerate(pdf_files[:2]):
            doc = DocumentSource(
                source_id=f"doc_{i}",
                title=f"Test Document {i+1}",
                authors=["Test Author"],
                source_type="pdf",
                content=pdf_file,
                processed=False
            )
            test_docs.append(doc)
    else:
        # Create a synthetic test document
        test_docs.append(DocumentSource(
            source_id="test_doc",
            title="Test Document",
            authors=["Test Author"],
            source_type="text",
            content="""
            ABSTRACT
            This study introduces a novel approach for climate prediction using transformer models.
            
            METHODOLOGY
            We combined transformer architecture with physics-informed constraints.
            
            RESULTS
            The model achieved 89% accuracy in predicting extreme weather events.
            
            FUTURE DIRECTIONS
            Future research should explore applications to extreme weather event prediction.
            """,
            processed=False
        ))
    
    # Test document loading
    all_results.append(run_sample_test(
        "load_document_collection",
        load_document_collection,
        test_docs
    ))
    
    # Sample chunks for testing
    for doc in test_docs:
        chunks = sample_chunks_from_document(doc, config, num_chunks=2)  # Two chunks keep the test fast
        if chunks:
            print(f"Sampled {len(chunks)} chunks from {doc.title}")
            # Run extraction tests on sample chunks
            extraction_results, extracted_content = run_extraction_tests(doc, chunks, config)
            all_results.extend(extraction_results)
            all_extracted_content[doc.source_id] = extracted_content
    
    # Test state management
    state_results = run_state_tests(config)
    all_results.extend(state_results)
    
    # Calculate summary metrics
    total_tests = len(all_results)
    success_tests = sum(1 for r in all_results if r.get("success", False))
    llm_calls = sum(1 for r in all_results if r.get("is_llm", False))
    avg_time = sum(r.get("duration", 0) for r in all_results) / total_tests if total_tests > 0 else 0
    
    # Get extraction counts
    total_extracted = sum(
        sum(len(items) for items in content.values())
        for content in all_extracted_content.values()
    )
    
    # Display summary
    summary = f"""
    # Test Summary
    Total Tests: {total_tests} | Passed: {success_tests}/{total_tests} | LLM Calls: {llm_calls} | Avg Time: {avg_time:.2f}s
    
    Total Extracted Elements: {total_extracted}
    
    Any failures are listed in the per-function rows above. Empty results for some extractions
    are expected when the sample chunks don't contain relevant information.
    """
    
    show(summary, "info")
    return all_results, all_extracted_content

# ===== TEST INTERFACE =====

def create_simple_test_interface():
    """Create simplified test interface with default Gradio styling"""
    
    def handle_pdf_upload(files):
        """Process uploaded PDF files"""
        if not files:
            return "Please upload PDF files to test."
        
        pdf_paths = [f.name for f in files[:2]]
        return f"Ready to test with {len(pdf_paths)} documents."
    
    def run_tests(files):
        """Run tests with uploaded PDFs"""
        if not files:
            return "Please upload at least one PDF file."
        
        pdf_paths = [f.name for f in files[:2]]
        
        # run_quick_tests builds the DocumentSource objects from these paths
        
        # Run the tests and capture output
        import io
        from contextlib import redirect_stdout
        
        output = io.StringIO()
        with redirect_stdout(output):
            results, extracted_content = run_quick_tests(pdf_paths)
        
        # Format results as text
        success_count = sum(1 for r in results if r.get("success", False))
        llm_calls = sum(1 for r in results if r.get("is_llm", False))
        avg_time = sum(r.get("duration", 0) for r in results) / len(results) if results else 0
        
        # Count total extracted items
        total_extracted = sum(
            sum(len(items) for items in content.values())
            for content in extracted_content.values()
        )
        
        output_text = f"""
## Quick Test Results

**Summary:** Passed {success_count}/{len(results)} tests | LLM Calls: {llm_calls} | Avg Time: {avg_time:.2f}s

**Total Extracted Elements:** {total_extracted}

### Test Output
{output.getvalue()}

### Extraction Samples

"""
        # Add extracted content samples
        for doc_id, content in extracted_content.items():
            output_text += f"\n#### Document: {doc_id}\n\n"
            
            for category, items in content.items():
                output_text += f"**{category.title()}:** {len(items)} items\n\n"
                
                for idx, item in enumerate(items[:3]):  # Show up to 3 items per category
                    if hasattr(item, 'claim_text'):
                        output_text += f"{idx+1}. **Claim:** {item.claim_text}\n\n"
                    elif hasattr(item, 'method_name'):
                        output_text += f"{idx+1}. **Method:** {item.method_name} - {item.description[:100]}...\n\n"
                    elif hasattr(item, 'contribution_text'):
                        output_text += f"{idx+1}. **Contribution:** {item.contribution_text}\n\n"
                    elif hasattr(item, 'direction_text'):
                        output_text += f"{idx+1}. **Direction:** {item.direction_text}\n\n"
        
        return output_text
    
    # Create interface with default Gradio styling
    with gr.Blocks() as demo:
        gr.Markdown("## LitSynth-Multidoc Testing")
        gr.Markdown("Tests core functionality and displays extraction results")
        
        file_input = gr.Files(
            label="Upload PDF Documents (1-2 files)",
            file_types=[".pdf"],
            file_count="multiple"
        )
        
        test_btn = gr.Button("Run Tests", variant="primary")
        status = gr.Textbox(label="Status", value="Upload PDFs to begin")
        results_output = gr.Markdown("Results will appear here")
        
        # Connect components
        file_input.change(
            fn=handle_pdf_upload,
            inputs=[file_input],
            outputs=[status]
        )
        
        test_btn.click(
            fn=run_tests,
            inputs=[file_input],
            outputs=[results_output]
        )
    
    return demo

# Launch the interface
test_interface = create_simple_test_interface()
test_interface.launch(inline=True, share=False)

show("Simplified testing framework initialized successfully", "success")
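
`sample_chunks_from_document` picks evenly spaced chunks from the first to the last. The index arithmetic can be sketched as a standalone function; `evenly_spaced_indices` is a hypothetical name used only for illustration, and the edge-case handling (empty input, a single requested item) is an assumption about the intended behavior.

```python
from typing import List


def evenly_spaced_indices(total: int, count: int) -> List[int]:
    """Pick `count` indices spread from 0 to total - 1, inclusive."""
    if total <= 0 or count <= 0:
        return []
    # Nothing to sample when every item fits
    if count >= total:
        return list(range(total))
    # Avoid dividing by (count - 1) == 0
    if count == 1:
        return [0]
    return [int(i * (total - 1) / (count - 1)) for i in range(count)]


print(evenly_spaced_indices(10, 3))  # [0, 4, 9]
print(evenly_spaced_indices(5, 1))   # [0]
print(evenly_spaced_indices(3, 5))   # [0, 1, 2]
```

Because the first and last indices are always included, a two-chunk sample covers the opening and closing sections of a paper, which is where abstracts and future-work statements tend to appear.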