# Biomapper API Client Tutorial

## Understanding the Thin Client Architecture

This notebook demonstrates how to use the BiomapperClient to execute mapping strategies and explains the architecture behind the scenes.

### Architecture Overview

```
Client Request → BiomapperClient → FastAPI Server → MinimalStrategyService
                                                     ↓
                                    Load YAML from configs/strategies/
                                                     ↓
                                    ACTION_REGISTRY (Global Dict)
                                                     ↓
                            Individual Action Classes (self-registered)
                                                     ↓
                                  Execution Context (Dict[str, Any])
                                                     ↓
                                    Results returned to client
```

## 1. Setup and Installation

### Starting the API Server

Before using the BiomapperClient, you need to start the API server. In a separate terminal:

```bash
# First, make sure no other process is listening on port 8000
# (lsof -t prints only PIDs; xargs -r skips kill when nothing is found)
sudo lsof -ti tcp:8000 -sTCP:LISTEN | xargs -r sudo kill

# Then start the server
cd /home/ubuntu/biomapper/biomapper-api
poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Or use a different port if 8000 is stuck:
poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8001
```

The API server will:
- Load all YAML strategies from `configs/` and `configs/strategies/`
- Initialize the action registry
- Set up the v2 endpoints for modern strategy execution
- Be ready to accept client requests

Once you see `INFO: Application startup complete`, the server is ready!

If using a different port, update the client initialization:
```python
client = BiomapperClient(base_url="http://localhost:8001")
```
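
If you would rather check readiness from Python than watch the server log, a plain TCP connection attempt is enough to tell whether something is listening. This is a minimal sketch using only the standard library. `is_port_open` and `wait_for_port` are hypothetical helper names, not part of `biomapper_client`, and an open port only shows that some process is listening, not that it is the Biomapper API.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def wait_for_port(host: str, port: int, max_wait: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until the port accepts connections or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 8000)` returns `True` once the server accepts connections, or `False` after 30 seconds.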

In [None]:
# Install the client if needed (usually done via poetry install)
# !poetry install --with dev,api

# Import required libraries
import asyncio
import json
from pathlib import Path

# Use the enhanced v2 client which has more features
from biomapper_client.client_v2 import BiomapperClient

print("BiomapperClient v2 imported successfully!")

## 2. Basic Client Usage - Health Check

Let's start with a simple health check to ensure the API is running:

In [None]:
async def check_api_health():
    """Check if the API server is running and healthy."""
    try:
        async with BiomapperClient() as client:
            health = await client.health_check()
            print("✅ API Status: Connected and healthy!")
            print(f"Response: {health}")
            return health
    except Exception as e:
        print("❌ API server is not running!")
        print("\nTo start the API server, run this in a terminal:")
        print("cd /home/ubuntu/biomapper/biomapper-api")
        print("poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000")
        print(f"\nError details: {e}")
        return None

# Run the health check
await check_api_health()

## Alternative: Direct Strategy Inspection (No API Required)

If the API server is not running, you can still explore the available strategies by reading the YAML files directly:

In [None]:
# Direct inspection of strategies without API
import yaml
from pathlib import Path

def explore_strategies_offline():
    """Explore available strategies by reading YAML files directly."""
    strategies_dir = Path("/home/ubuntu/biomapper/configs/strategies/")

    if not strategies_dir.exists():
        print(f"Strategies directory not found: {strategies_dir}")
        return []

    strategies = []
    for yaml_file in sorted(strategies_dir.glob("*.yaml")):
        with open(yaml_file) as f:
            strategy = yaml.safe_load(f)
            strategies.append({
                'name': strategy.get('name', yaml_file.stem),
                'description': strategy.get('description', 'No description'),
                'steps': len(strategy.get('steps', [])),
                'file': yaml_file.name
            })

    print(f"📚 Found {len(strategies)} strategies (offline inspection):\n")
    for s in strategies:
        print(f"📋 {s['name']}")
        print(f"   Description: {s['description']}")
        print(f"   Steps: {s['steps']}")
        print(f"   File: {s['file']}")
        print()

    return strategies

# Explore strategies without needing the API
offline_strategies = explore_strategies_offline()

## 3. Listing Available Strategies

The API automatically loads all YAML strategies from the `configs/strategies/` directory. It does not yet expose an endpoint for listing them, so the cell below reads the YAML files directly to show what's available:

In [None]:
async def list_strategies():
    """List all available mapping strategies from the configs directory."""
    # Note: The API doesn't have a list_strategies endpoint yet,
    # so we'll read directly from the configs directory
    strategies_dir = Path("/home/ubuntu/biomapper/configs/strategies/")
    
    strategies = []
    if strategies_dir.exists():
        for yaml_file in strategies_dir.glob("*.yaml"):
            # Read YAML to get name and description
            import yaml
            with open(yaml_file) as f:
                strategy = yaml.safe_load(f)
                strategies.append({
                    'name': strategy.get('name', yaml_file.stem),
                    'description': strategy.get('description', 'No description'),
                    'file': yaml_file.name
                })
    
    print(f"Found {len(strategies)} strategies:\n")
    for strategy in strategies:
        print(f"📋 {strategy['name']}")
        print(f"   {strategy['description']}")
        print(f"   File: {strategy['file']}")
        print()
    
    return strategies

strategies = await list_strategies()

## 4. Behind the Scenes: How Strategies Work

Let's examine a strategy YAML file to understand the structure:

In [None]:
# Let's look at a simple protein mapping strategy
strategy_path = Path("/home/ubuntu/biomapper/configs/strategies/arivale_to_kg2c_proteins.yaml")

if strategy_path.exists():
    with open(strategy_path, 'r') as f:
        # Show first 50 lines to see the structure
        lines = f.readlines()[:50]
        print("📄 Strategy YAML Structure (first 50 lines):")
        print("=" * 60)
        print(''.join(lines))
else:
    print("Strategy file not found. Let's check what strategies exist:")
    strategies_dir = Path("/home/ubuntu/biomapper/configs/strategies/")
    if strategies_dir.exists():
        yaml_files = list(strategies_dir.glob("*.yaml"))
        print(f"Found {len(yaml_files)} strategy files:")
        for f in yaml_files[:5]:  # Show first 5
            print(f"  - {f.name}")

## 5. Understanding Action Types

Each step in a strategy uses an **action type**. These are Python classes stored in `biomapper/core/strategy_actions/`. Let's see how they work:
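
As a rough illustration of the self-registration pattern, the sketch below shows how a decorator can add classes to a global dictionary at import time. The names `ACTION_REGISTRY` and `register_action` come from this tutorial, but the bodies here, along with `EchoAction` and its `execute` signature, are assumptions made for illustration and not Biomapper's actual implementation.

```python
from typing import Any, Callable, Dict, Type

# Global registry mapping action names to classes (illustrative stand-in)
ACTION_REGISTRY: Dict[str, Type] = {}


def register_action(name: str) -> Callable[[Type], Type]:
    """Class decorator that stores the class under `name` when it is defined."""
    def decorator(cls: Type) -> Type:
        if name in ACTION_REGISTRY:
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls
    return decorator


@register_action("ECHO")  # hypothetical action, not a real Biomapper action
class EchoAction:
    def execute(self, params: Dict[str, Any], context: Dict[str, Any]) -> None:
        # Write the configured value into the shared context
        context[params["output_key"]] = params["value"]


# Look the class up by name, as a strategy runner would for each step
action = ACTION_REGISTRY["ECHO"]()
context: Dict[str, Any] = {}
action.execute({"output_key": "greeting", "value": "hello"}, context)
print(context)  # {'greeting': 'hello'}
```

Because registration happens as a side effect of defining the class, a module only needs to be imported for its actions to become available by name.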

In [None]:
# Display information about action types and where they're stored
action_info = {
    "LOAD_DATASET_IDENTIFIERS": {
        "location": "biomapper/core/strategy_actions/load_dataset.py",
        "purpose": "Loads biological identifiers from TSV/CSV files",
        "key_params": ["file_path", "identifier_column", "output_key"]
    },
    "MERGE_WITH_UNIPROT_RESOLUTION": {
        "location": "biomapper/core/strategy_actions/uniprot_resolution.py",
        "purpose": "Maps identifiers to UniProt accessions with historical resolution",
        "key_params": ["input_key", "output_key", "uniprot_api_url"]
    },
    "CALCULATE_SET_OVERLAP": {
        "location": "biomapper/core/strategy_actions/calculate_overlap.py",
        "purpose": "Calculates Jaccard similarity between datasets",
        "key_params": ["dataset_keys", "output_key"]
    },
    "EXPORT_DATASET": {
        "location": "biomapper/core/strategy_actions/export_dataset.py",
        "purpose": "Exports results to various formats (TSV, JSON, HTML)",
        "key_params": ["input_key", "output_dir", "formats"]
    }
}

print("🔧 Action Types and Their Locations:\n")
print("=" * 70)
for action_type, info in action_info.items():
    print(f"\n{action_type}")
    print(f"  📁 Location: {info['location']}")
    print(f"  📝 Purpose: {info['purpose']}")
    print(f"  ⚙️  Key params: {', '.join(info['key_params'])}")

print("\n" + "=" * 70)
print("\n💡 How it works:")
print("  1. Each action is decorated with @register_action('ACTION_NAME')")
print("  2. Actions self-register into ACTION_REGISTRY when imported")
print("  3. MinimalStrategyService loads actions dynamically from the registry")
print("  4. Actions pass data through a shared execution context dictionary")

## 6. Executing a Strategy

Now let's execute a real strategy and see how the results are returned:
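
The cell below checks the job status once, one second after submission. For jobs that run longer, a polling loop with a timeout is a common alternative. The sketch here is generic: `poll_until_done`, the `fetch_status` callable, and the set of terminal status strings are assumptions for illustration, so adapt them to whatever the status endpoint actually returns.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict

# Assumed terminal states; check the API's real status values
TERMINAL_STATES = {"completed", "failed"}


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = await fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status.get('status')!r} after {timeout}s")
        # Yield to the event loop between checks instead of busy-waiting
        await asyncio.sleep(interval)
```

In this notebook, `fetch_status` could be a small coroutine that wraps the same status request used in the cell below and returns its parsed JSON.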

In [None]:
async def execute_simple_strategy():
    """Execute a strategy using the updated v2 endpoint."""
    
    print("✅ API V2 Endpoint Status:")
    print("- Endpoint is working at: /api/strategies/v2/execute")
    print("- Job submission successful, returns job_id")
    print("- Background execution implemented\n")
    
    print("⚠️ CURRENT LIMITATIONS:")
    print("1. Protein strategies (ARIVALE_TO_KG2C_PROTEINS, UKBB_TO_KG2C_PROTEINS):")
    print("   - Use unimplemented actions: CUSTOM_TRANSFORM, FILTER_DATASET, EXPORT_DATASET, GENERATE_REPORT")
    print("   - These are template strategies that need action implementation")
    print("2. Metabolomics strategies work but have Pydantic validation issues")
    print("3. Only basic actions are currently implemented\n")
    
    print("📝 Next Steps:")
    print("1. Implement missing action types for protein strategies")
    print("2. Fix Pydantic validation issues in StrategyExecutionContext")
    print("3. Create simpler demo strategies that use only implemented actions\n")
    
    # Use a test strategy that doesn't require unimplemented actions
    strategy_name = "TEST_STRATEGY"
    
    print(f"Testing with: {strategy_name}\n")
    
    async with BiomapperClient() as client:
        try:
            # Submit the job
            response = await client._async_client.post(
                "/api/strategies/v2/execute",
                json={
                    "strategy": strategy_name,
                    "parameters": {},
                    "options": {}
                }
            )
            
            if response.status_code == 200:
                result = response.json()
                print(f"✅ Job submitted successfully!")
                print(f"📊 Job ID: {result['job_id']}")
                print(f"📊 Status: {result['status']}")
                print(f"📊 Message: {result['message']}")
                
                # Give the background job a moment to start, then check its status once
                # (asyncio is already imported in the setup cell)
                job_id = result['job_id']
                await asyncio.sleep(1)
                
                status_response = await client._async_client.get(
                    f"/api/strategies/v2/jobs/{job_id}/status"
                )
                if status_response.status_code == 200:
                    status = status_response.json()
                    print(f"\n📈 Job Status Update:")
                    print(f"  - Status: {status.get('status')}")
                    print(f"  - Strategy: {status.get('strategy_name')}")
                    if 'error' in status:
                        print(f"  - Error: {status['error']}")
            else:
                print(f"❌ Error: {response.status_code}")
                print(response.json())
                # No job was created, so there is nothing to return
                result = None
            
            return result
            
        except Exception as e:
            print(f"❌ Error: {e}")
            return None

# Try to execute
result = await execute_simple_strategy()

## 7. Understanding Result Structure

The API returns the entire execution context from the strategy. Let's explore what this means:
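
To make this concrete, here is a toy pipeline in which each step writes into one shared dictionary and the whole dictionary is returned at the end. The step functions and the `run_steps` helper are hypothetical. Only the key names `datasets`, `statistics`, and `output_files` mirror the structure described in this section.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def load_step(context: Context) -> None:
    # Pretend to load identifiers and store them under a dataset key
    context["datasets"]["proteins"] = ["P12345", "Q67890", "P12345"]


def dedupe_step(context: Context) -> None:
    # Read an earlier step's output, then add a new dataset and a statistic.
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(context["datasets"]["proteins"]))
    context["datasets"]["unique_proteins"] = unique
    context["statistics"]["unique_count"] = len(unique)


def run_steps(steps: List[Callable[[Context], None]]) -> Context:
    """Run every step against one shared context and return all of it."""
    context: Context = {"datasets": {}, "statistics": {}, "output_files": []}
    for step in steps:
        step(context)
    return context


result = run_steps([load_step, dedupe_step])
print(sorted(result["datasets"]))  # ['proteins', 'unique_proteins']
print(result["statistics"])        # {'unique_count': 2}
```

Both datasets are present in the returned context, even though only the second step ran last.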

In [None]:
print("🔍 Understanding the Result Structure:\n")
print("=" * 70)
print("""
The result object contains:

1. **status**: The job execution status (completed, failed, etc.)
2. **job_id**: Unique identifier for tracking this execution
3. **result**: The final execution context dictionary containing:

   📂 Common keys in the execution context:
   
   • current_identifiers: Active dataset being processed
   • datasets: Dictionary of all loaded/processed datasets
     - Each action can store results with a unique key
     - E.g., 'arivale_proteins', 'ukbb_proteins', 'overlap_results'
   
   • statistics: Accumulated statistics from all actions
     - Match rates, counts, quality metrics
     - Each action can add its own statistics
   
   • output_files: List of files generated during execution
     - Export actions write files and record paths here
     - Includes TSV, JSON, HTML reports

🎯 Key Insight: The FINAL ACTION doesn't determine what's returned.
   Instead, the ENTIRE execution context is returned, containing
   all accumulated data from every action in the pipeline.
""")
print("=" * 70)

## 8. Accessing Specific Results

Let's demonstrate how to access specific parts of the results:

In [None]:
async def analyze_strategy_results():
    """Show how to access and use specific results from a strategy."""
    
    async with BiomapperClient() as client:
        try:
            # Execute a strategy using the simpler run interface and wait for it to finish.
            # Note: section 6 lists ARIVALE_TO_KG2C_PROTEINS as depending on
            # unimplemented actions, so this call may not succeed yet.
            result = await client._async_run(
                strategy="ARIVALE_TO_KG2C_PROTEINS",
                wait=True
            )
            
            if hasattr(result, 'success') and result.success and result.result_data:
                context = result.result_data
                
                # Access datasets
                print("📊 Loaded Datasets:")
                if 'datasets' in context:
                    for dataset_name, dataset_data in context['datasets'].items():
                        if isinstance(dataset_data, list):
                            print(f"  • {dataset_name}: {len(dataset_data)} items")
                        elif isinstance(dataset_data, dict) and 'identifiers' in dataset_data:
                            print(f"  • {dataset_name}: {len(dataset_data['identifiers'])} identifiers")
                
                # Access statistics
                print("\n📈 Statistics:")
                if 'statistics' in context:
                    stats = context['statistics']
                    for key, value in stats.items():
                        if isinstance(value, (int, float)):
                            if 'rate' in key.lower() or 'coefficient' in key.lower():
                                print(f"  • {key}: {value:.2%}")
                            else:
                                print(f"  • {key}: {value:,}")
                        else:
                            print(f"  • {key}: {value}")
                
                # Access output files
                print("\n📁 Generated Files:")
                if 'output_files' in context:
                    for file_path in context['output_files']:
                        print(f"  • {file_path}")
                
                return context
            else:
                print(f"Strategy execution failed or returned no data")
                return None
                
        except Exception as e:
            print(f"❌ Error: {e}")
            print("\nShowing the typical result structure instead...")
            # Fallback: describe the expected structure rather than retrying
            print("\nTypical result structure includes:")
            print("  • datasets: All loaded and processed datasets")
            print("  • statistics: Accumulated metrics from all actions")
            print("  • output_files: List of generated files")
            print("  • current_identifiers: Active dataset being processed")
            return None

# Analyze the results
context = await analyze_strategy_results()

## 9. Advanced: Executing a Metabolomics Strategy

Let's execute a more complex metabolomics strategy to see the progressive enhancement pattern:
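
The idea behind progressive enhancement is that each stage only works on the identifiers that earlier stages could not match. The sketch below illustrates that control flow with two toy stages: exact lookup, then fuzzy matching via the standard library's `difflib.get_close_matches`. The stage functions, the reference names, and the 0.8 cutoff are all assumptions for illustration. The real strategy's API enrichment and vector search stages are not reproduced here.

```python
import difflib
from typing import Callable, Dict, List, Optional

REFERENCE = ["glucose", "cholesterol", "lactate"]  # hypothetical reference names

Matcher = Callable[[str], Optional[str]]


def exact_match(name: str) -> Optional[str]:
    return name if name in REFERENCE else None


def fuzzy_match(name: str) -> Optional[str]:
    # get_close_matches returns the best candidates at or above the cutoff
    hits = difflib.get_close_matches(name, REFERENCE, n=1, cutoff=0.8)
    return hits[0] if hits else None


def progressive_match(names: List[str], stages: List[Matcher]) -> Dict[str, Optional[str]]:
    """Try each stage in order, passing on only what is still unmatched."""
    results: Dict[str, Optional[str]] = {name: None for name in names}
    remaining = list(names)
    for stage in stages:
        still_unmatched = []
        for name in remaining:
            match = stage(name)
            if match is None:
                still_unmatched.append(name)
            else:
                results[name] = match
        remaining = still_unmatched
    return results


matches = progressive_match(["glucose", "cholestrol", "xyz"], [exact_match, fuzzy_match])
print(matches)
```

Here `glucose` is resolved by the first stage, the misspelled `cholestrol` by the second, and `xyz` stays unmatched, which is the kind of per-stage progression the match-rate statistics are meant to show.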

In [None]:
async def execute_metabolomics_strategy():
    """Execute a metabolomics harmonization strategy."""
    
    strategy_name = "METABOLOMICS_PROGRESSIVE_ENHANCEMENT"
    
    print(f"🧪 Executing metabolomics strategy: {strategy_name}\n")
    print("This strategy uses progressive enhancement:")
    print("  1️⃣ Baseline: Fuzzy string matching")
    print("  2️⃣ Enhancement: API enrichment (CTS, PubChem)")
    print("  3️⃣ Advanced: Vector similarity search\n")
    
    async with BiomapperClient() as client:
        # First check if strategy exists by looking at YAML files
        strategies_dir = Path("/home/ubuntu/biomapper/configs/strategies/")
        strategy_files = list(strategies_dir.glob("*.yaml"))
        
        # Look for metabolomics strategies
        metabolomics_strategies = []
        for yaml_file in strategy_files:
            if 'metabol' in yaml_file.name.lower():
                import yaml
                with open(yaml_file) as f:
                    strategy = yaml.safe_load(f)
                    metabolomics_strategies.append(strategy.get('name', yaml_file.stem))
        
        if strategy_name not in metabolomics_strategies:
            print(f"⚠️ Strategy '{strategy_name}' not found.")
            print("Available metabolomics strategies:")
            for name in metabolomics_strategies:
                print(f"  • {name}")
            return None
        
        # Execute the strategy with progress watching
        print("⏳ Executing (this may take a few minutes)...\n")
        result = await client._async_run(
            strategy=strategy_name,
            wait=True,
            watch=True  # This will print progress updates
        )
        
        if hasattr(result, 'success') and result.success:
            print(f"\n✅ Strategy completed successfully!")
            
            # Show the progression of match rates
            if result.result_data and 'statistics' in result.result_data:
                stats = result.result_data['statistics']
                print("\n📊 Progressive Enhancement Results:")
                
                # Look for stage-specific statistics
                for key, value in stats.items():
                    if 'match' in key.lower() or 'rate' in key.lower():
                        print(f"  • {key}: {value}")
        else:
            print(f"❌ Strategy failed")
            if hasattr(result, 'error'):
                print(f"Error: {result.error}")
        
        return result

# Execute metabolomics strategy (if available)
metabolomics_result = await execute_metabolomics_strategy()

## 10. Creating Your Own Strategy

Here's how you would create a new strategy:
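
The example strategy in the next cell refers to its parameters with `${parameters.name}` placeholders. How Biomapper resolves these is not shown in this tutorial. The sketch below is one plausible way to do it with a regular expression, and `resolve_placeholders` is a hypothetical helper, not a Biomapper function.

```python
import re
from typing import Any, Dict

PLACEHOLDER = re.compile(r"\$\{parameters\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value: Any, parameters: Dict[str, Any]) -> Any:
    """Recursively replace ${parameters.x} in strings, dicts, and lists."""
    if isinstance(value, str):
        def lookup(match: "re.Match[str]") -> str:
            key = match.group(1)
            if key not in parameters:
                raise KeyError(f"Unknown parameter: {key}")
            return str(parameters[key])
        return PLACEHOLDER.sub(lookup, value)
    if isinstance(value, dict):
        return {k: resolve_placeholders(v, parameters) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(v, parameters) for v in value]
    # Numbers, booleans, and None pass through unchanged
    return value


params = {"source_file": "/path/to/my/proteins.tsv", "identifier_column": "uniprot_id"}
step_params = {
    "file_path": "${parameters.source_file}",
    "identifier_column": "${parameters.identifier_column}",
    "output_key": "my_proteins",
}
print(resolve_placeholders(step_params, params))
```

Raising on an unknown parameter is a design choice in this sketch: it surfaces typos in the YAML early instead of passing a literal `${...}` string to an action.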

In [None]:
# Example: Creating a simple custom strategy YAML
custom_strategy = """
name: MY_CUSTOM_PROTEIN_MAPPING
description: Custom mapping of protein datasets
parameters:
  source_file: /path/to/my/proteins.tsv
  identifier_column: uniprot_id
  output_dir: /tmp/results

steps:
  - name: load_proteins
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.source_file}"
        identifier_column: "${parameters.identifier_column}"
        output_key: my_proteins

  - name: resolve_uniprot
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        input_key: my_proteins
        output_key: resolved_proteins
        batch_size: 100

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: resolved_proteins
        output_dir: "${parameters.output_dir}"
        formats: ["tsv", "json"]
"""

print("📝 Example Custom Strategy YAML:")
print("=" * 60)
print(custom_strategy)
print("=" * 60)
print("\nTo use this strategy:")
print("1. Save it to: /home/ubuntu/biomapper/configs/strategies/my_custom_strategy.yaml")
print("2. The API will auto-load it (no restart needed)")
print("3. Execute with: client.execute_strategy('MY_CUSTOM_PROTEIN_MAPPING')")

## 11. Summary: The Complete Flow

Let's visualize the complete execution flow:

In [None]:
print("""
🔄 COMPLETE EXECUTION FLOW:
═══════════════════════════════════════════════════════════════════════

1️⃣ CLIENT LAYER (Your Code)
   📍 Location: Your script/notebook
   └─> BiomapperClient.execute_strategy("STRATEGY_NAME")

2️⃣ API LAYER (FastAPI)
   📍 Location: biomapper-api/app/main.py
   └─> POST /api/strategies/v2/execute
       └─> MapperServiceForStrategies.execute_strategy()

3️⃣ STRATEGY LOADING
   📍 Location: biomapper/core/minimal_strategy_service.py
   └─> MinimalStrategyService loads YAML from:
       configs/strategies/STRATEGY_NAME.yaml

4️⃣ ACTION REGISTRY
   📍 Location: biomapper/core/strategy_actions/registry.py
   └─> ACTION_REGISTRY maps action types to classes:
       • LOAD_DATASET_IDENTIFIERS → LoadDatasetAction
       • MERGE_WITH_UNIPROT_RESOLUTION → UniProtResolutionAction
       • etc.

5️⃣ ACTION EXECUTION
   📍 Location: biomapper/core/strategy_actions/*.py
   └─> Each action:
       • Receives params from YAML
       • Reads from execution context
       • Performs its operation
       • Updates execution context
       • Passes control to next action

6️⃣ CONTEXT ACCUMULATION
   📍 Throughout execution
   └─> Execution context grows with:
       • datasets['key1'] = data from action 1
       • datasets['key2'] = data from action 2
       • statistics['metric1'] = value
       • output_files.append('/path/to/file.tsv')

7️⃣ RESULT RETURN
   📍 Back through the layers
   └─> The ENTIRE execution context is returned
       └─> Contains ALL data from ALL actions
           └─> Not just the last action's output!

═══════════════════════════════════════════════════════════════════════

💡 KEY INSIGHTS:

• The final action DOES NOT determine what's returned
• The complete execution context is always returned
• Each action builds upon the shared context
• Results accumulate throughout the pipeline
• You can access any intermediate results in the final output
""")

## 12. Practical Examples

### Example 1: Check Job Status

In [None]:
async def check_job_status(job_id: str):
    """Check the status of a previously submitted job."""
    async with BiomapperClient() as client:
        try:
            status = await client.get_job_status(job_id)
            print(f"Job {job_id}:")
            print(f"  Status: {status.status}")
            if hasattr(status, 'message') and status.message:
                print(f"  Message: {status.message}")
            # Guard against a missing or None progress value before formatting
            progress = getattr(status, 'progress', None)
            if progress is not None:
                print(f"  Progress: {progress:.1f}%")
            return status
        except Exception as e:
            print(f"Error checking job status: {e}")
            return None

# Example usage (replace with actual job_id from a previous execution)
# await check_job_status("abc123-def456-ghi789")

### Example 2: List and Filter Strategies

In [None]:
async def find_protein_strategies():
    """Find all strategies related to protein mapping."""
    # Since list_strategies isn't implemented in the API yet,
    # we'll read directly from the configs directory
    strategies_dir = Path("/home/ubuntu/biomapper/configs/strategies/")
    
    all_strategies = []
    if strategies_dir.exists():
        for yaml_file in strategies_dir.glob("*.yaml"):
            import yaml
            with open(yaml_file) as f:
                strategy = yaml.safe_load(f)
                all_strategies.append({
                    'name': strategy.get('name', yaml_file.stem),
                    'description': strategy.get('description', ''),
                    'file': yaml_file.name
                })
    
    protein_strategies = [
        s for s in all_strategies 
        if 'protein' in s['name'].lower() or 
           'protein' in s.get('description', '').lower()
    ]
    
    print(f"Found {len(protein_strategies)} protein-related strategies:\n")
    for strategy in protein_strategies:
        print(f"🧬 {strategy['name']}")
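        if strategy['description']:
            desc = strategy['description']
            # Truncate long descriptions and add an ellipsis only when text was cut
            print(f"   {desc[:100]}{'...' if len(desc) > 100 else ''}")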
        print(f"   File: {strategy['file']}")
    
    return protein_strategies

protein_strategies = await find_protein_strategies()

## Conclusion

This notebook has demonstrated:

✅ **The Thin Client Pattern**: BiomapperClient provides a clean interface without importing core libraries

✅ **Behind the Scenes**: YAML strategies → Action Registry → Execution Context → Results

✅ **Key Files**:
- Strategies: `configs/strategies/*.yaml`
- Actions: `biomapper/core/strategy_actions/*.py`
- API: `biomapper-api/app/services/mapper_service.py`

✅ **Result Structure**: The entire execution context is returned, not just the final action's output

✅ **Extensibility**: Add new strategies by creating YAML files - no code changes needed!

### Next Steps

1. Try executing different strategies
2. Create your own custom strategy YAML
3. Explore the action types in `biomapper/core/strategy_actions/`
4. Build applications using the BiomapperClient

Happy mapping! 🚀