# Patchseq Metadata Processing Example

This notebook demonstrates how to use the refactored patchseq metadata modules to create procedures.json files for subjects. The modular design allows for flexible usage in notebooks while maintaining the same CLI functionality.

## Key Features:
- **Modular Architecture**: Import only the functions you need
- **Excel Integration**: Automatically loads specimen procedures from Excel files  
- **Schema Conversion**: Converts v1 to v2.0 schema with proper validation
- **CSV Tracking**: Tracks injection and specimen procedure data in CSV files
- **Smart Caching**: v1 files are cached (never overwritten); v2 files are regenerated by default

## Command Line Behavior:
- **Default**: Always regenerate v2 files (fast conversion)
- **--skip-existing**: Skip subjects that already have v2.0 procedures.json files
- **v1 caching**: v1 files from metadata service are never overwritten
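
The rules above can be condensed into one small decision function. This is a minimal sketch, not the CLI's actual code: `should_write` is a hypothetical helper that only encodes the three bullets.

```python
def should_write(version: int, file_exists: bool, skip_existing: bool = False) -> bool:
    """Decide whether a procedures file should be (re)written.

    Hypothetical helper: it only restates the rules listed above.
    """
    if version == 1:
        # v1 files come from the metadata service and are never overwritten.
        return not file_exists
    if version == 2:
        # v2 files are regenerated unless --skip-existing is set and a file exists.
        return not (file_exists and skip_existing)
    raise ValueError(f"unsupported schema version: {version}")


for version, exists, skip in [(1, True, False), (2, True, False), (2, True, True)]:
    print(version, exists, skip, "->", should_write(version, exists, skip))
```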

## 1. Import Required Libraries and Setup

In [None]:
# Standard library imports
import pandas as pd
import json
import argparse
from pathlib import Path
from datetime import date, datetime

# AIND data schema imports
from aind_data_schema.core.procedures import Procedures, Surgery, SpecimenProcedure

# Import functions from our modular architecture
from excel_loader import load_specimen_procedures_excel, get_specimen_procedures_for_subject
from metadata_service import fetch_procedures_metadata  
from schema_conversion import convert_procedures_to_v2
from file_io import save_procedures_json, check_existing_procedures
from csv_tracking import (
    extract_injection_details, 
    update_injection_tracking_csv, 
    extract_specimen_procedure_details, 
    update_specimen_procedure_tracking_csv
)
from subject_utils import get_subjects_list

print("✅ All modules imported successfully!")

## 2. Load Configuration and Command Line Arguments

The `NotebookConfig` class below mirrors the CLI arguments, which are unchanged from the original script:
- `--skip-existing`: Skip subjects that already have v2.0 procedures.json files  
- `--limit`: Limit number of subjects to process
- `--subjects`: Process specific subjects (space-separated list)
- `--excel`: Excel file with specimen procedures data

**Key Behavior**: v2 files are regenerated on every run (conversion is fast) unless `--skip-existing` is set.

In [None]:
# Configuration for notebook usage
class NotebookConfig:
    def __init__(self):
        self.skip_existing = False  # Set to True to skip existing v2 files
        self.limit = 3  # Limit processing for demo (set to None for all subjects)
        self.subjects = None  # Set to ["692912", "692913"] for specific subjects
        self.excel = "DT_HM_TissueClearingTracking_.xlsx"  # Excel file path
        
    def __repr__(self):
        return (
            f"Config(skip_existing={self.skip_existing}, limit={self.limit}, "
            f"subjects={self.subjects}, excel={self.excel!r})"
        )

# Create configuration instance
config = NotebookConfig()
print(f"📋 Configuration: {config}")

# Initialize tracking variables (same as CLI version)
success_count = 0
failed_count = 0 
skipped_count = 0
all_injection_data = {}
all_specimen_procedure_data = {}
subjects_with_missing_injection_coords = []
subjects_with_missing_specimen_procedures = []
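
For comparison, the four flags listed in this section could be declared with `argparse` roughly as follows. The flag names come from the list above; the types, defaults, and help strings are assumptions, and the real script's parser may differ. Passing an explicit argument list to `parse_args` keeps the parser from reading the notebook kernel's own `sys.argv`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented flags (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Create procedures.json files")
    parser.add_argument("--skip-existing", action="store_true",
                        help="skip subjects that already have a v2.0 procedures.json")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of subjects to process")
    parser.add_argument("--subjects", nargs="+", default=None,
                        help="space-separated list of subject IDs")
    parser.add_argument("--excel", default=None,
                        help="Excel file with specimen procedures data")
    return parser


# An explicit list stands in for the command line when running in a notebook.
args = build_parser().parse_args(["--limit", "3", "--subjects", "692912", "692913"])
print(args.skip_existing, args.limit, args.subjects, args.excel)
```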

## 3. Load Excel Data for Specimen Procedures

Load the Excel file containing specimen procedures data. The loader handles the workbook's multi-level headers and its batch tracking information.

In [None]:
print("📊 Loading Excel data for specimen procedures...")

try:
    excel_sheet_data = load_specimen_procedures_excel(config.excel)
    if excel_sheet_data:
        print(f"✅ Successfully loaded {len(excel_sheet_data)} sheets from Excel file")
        print(f"   Sheets: {list(excel_sheet_data.keys())}")
        
        # Show batch info summary
        if "Batch Info" in excel_sheet_data:
            batch_info = excel_sheet_data["Batch Info"]
            print(f"   Batch Info: {len(batch_info)} batches tracked")
    else:
        print("❌ Failed to load Excel data")
        excel_sheet_data = {}
        
except Exception as e:
    print(f"❌ Error loading Excel data: {e}")
    excel_sheet_data = {}
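
The implementation of `load_specimen_procedures_excel` is not shown in this notebook. As an illustration of what handling multi-level headers can involve, the sketch below flattens a two-row header into single column names, carrying each merged group label forward over the empty cells that follow it. `flatten_headers` and the column labels are hypothetical.

```python
from typing import List, Optional, Sequence


def flatten_headers(top: Sequence[Optional[str]],
                    bottom: Sequence[Optional[str]]) -> List[str]:
    """Combine two header rows into one list of column names.

    A merged cell in the top row typically reads as one value followed by
    empty cells, so the last non-empty group label is carried forward.
    """
    names = []
    current_group = ""
    for group, field in zip(top, bottom):
        if group and group.strip():
            current_group = group.strip()  # a new merged group starts here
        field = (field or "").strip()
        if current_group and field:
            names.append(f"{current_group} / {field}")
        else:
            names.append(current_group or field)
    return names


# Hypothetical header rows; the real workbook's labels may differ.
top_row = ["Subject", None, "Delipidation", None, None]
bottom_row = ["ID", "Batch", "Start", "End", "Notes"]
print(flatten_headers(top_row, bottom_row))
```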

## 4. Get Subject List and Apply Filters

Retrieve the list of subjects to process based on configuration settings.

In [None]:
# Get list of subjects to process
if config.subjects:
    subjects = config.subjects
    print(f"📋 Processing specific subjects: {subjects}")
else:
    subjects = get_subjects_list()
    print(f"📋 Found {len(subjects)} total subjects in CSV")
    
    if config.limit:
        subjects = subjects[:config.limit]
        print(f"📋 Limited to first {config.limit} subjects for demo")

print(f"📋 Will process {len(subjects)} subjects: {subjects}")
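
The same filtering can be written as a pure function, which is easier to test than inline cell logic. This is a sketch: `select_subjects` is a hypothetical name, and it follows the cell above in letting an explicit subject list bypass the limit.

```python
from typing import List, Optional, Sequence


def select_subjects(all_subjects: Sequence[str],
                    requested: Optional[Sequence[str]] = None,
                    limit: Optional[int] = None) -> List[str]:
    """Return the subjects to process.

    An explicit `requested` list wins and is not limited; otherwise the
    first `limit` entries of `all_subjects` are used (all if limit is None).
    """
    if requested:
        return [str(s) for s in requested]
    selected = [str(s) for s in all_subjects]
    if limit is not None:
        selected = selected[:max(limit, 0)]
    return selected


print(select_subjects(["692912", "692913", "699461", "699462"], limit=3))
print(select_subjects(["692912", "692913"], requested=["699461"], limit=1))
```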

## 5. Process Individual Subjects

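The cell below defines a function that processes one subject: it reuses or fetches the v1 data, converts it to the v2 schema, collects tracking data, and saves the result. It returns a `(success, skipped, error_msg)` tuple that the main loop uses for progress tracking.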

In [None]:
def process_single_subject_notebook(subject_id, excel_sheet_data, config):
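    """
    Process a single subject in the notebook environment.

    Side effects: appends to the module-level tracking lists and dicts
    defined in section 2 and writes the v1/v2 procedures files.
    Returns: (success: bool, skipped: bool, error_msg: str or None)
    """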
    print(f"\n--- Processing Subject: {subject_id} ---")
    
    # Check if files already exist  
    v1_exists, v2_exists, v1_data, v2_data = check_existing_procedures(subject_id)
    
    if v2_exists and config.skip_existing:
        print(f"  ✓ v2.0 procedures.json already exists, skipping (--skip-existing set)")
        return True, True, None
    
    # Fetch v1 data from API if not cached
    if v1_exists and v1_data:
        print(f"  ✓ Using cached v1 data")
        success, data, message = True, v1_data, "Loaded from cache"
    else:
        success, data, message = fetch_procedures_metadata(subject_id)
        
        if success and data:
            print(f"  ✓ Fetched procedures from API: {message}")
            # Cache the v1 data so later runs can reuse it instead of calling the API
            save_procedures_json(subject_id, data, is_v1=True)
        else:
            print(f"  ✗ Failed to fetch procedures: {message}")
            return False, False, message
    
    # Convert to v2 schema
    print(f"  📄 Converting to v2 schema...")
    v2_data, missing_injection_coords_info, batch_tracking_info = convert_procedures_to_v2(data, excel_sheet_data)
    
    if v2_data is None:
        print(f"  ✗ Schema conversion failed")
        return False, False, "Schema conversion failed"
    
    # Track injection coordinate issues
    if missing_injection_coords_info['has_missing_injection_coords']:
        subjects_with_missing_injection_coords.append(subject_id)
        print(f"    ⚠️  Missing injection coordinates detected:")
        for detail in missing_injection_coords_info['missing_injection_details']:
            print(f"      - {detail}")
    
    # Extract tracking data
    injection_details = extract_injection_details(data)
    if injection_details:
        all_injection_data[subject_id] = injection_details
        print(f"    ✓ Extracted {len(injection_details)} injection details for CSV tracking")
    
    # Extract specimen procedure details
    if v2_data.get('specimen_procedures'):
        specimen_procedure_details = extract_specimen_procedure_details(v2_data['specimen_procedures'], batch_tracking_info)
        all_specimen_procedure_data[subject_id] = specimen_procedure_details
        print(f"    ✓ Extracted specimen procedure details for CSV tracking")
    else:
        subjects_with_missing_specimen_procedures.append(subject_id)
        print(f"    ⚠️  No specimen procedures found for this subject")
        # Still add empty record for consistency
        empty_specimen_procedure_details = extract_specimen_procedure_details([], batch_tracking_info)
        all_specimen_procedure_data[subject_id] = empty_specimen_procedure_details
    
    # Save v2 data
    save_procedures_json(subject_id, v2_data, is_v1=False)
    print(f"  ✓ Successfully saved v2.0 procedures.json")
    
    return True, False, None

print("✅ Processing function defined!")

## 6. Run the Processing Loop

Call the function above for each subject and tally successes, failures, and skips.

In [None]:
# Execute the main processing loop
print("🚀 Starting procedures.json creation workflow...")

for subject_id in subjects:
    try:
        success, skipped, error_msg = process_single_subject_notebook(
            subject_id, 
            excel_sheet_data, 
            config
        )
        
        if success:
            if skipped:
                skipped_count += 1
            else:
                success_count += 1
        else:
            failed_count += 1
            print(f"  ❌ Error: {error_msg}")
            
    except KeyboardInterrupt:
        print(f"\n⚠️  Processing interrupted by user")
        break
    except Exception as e:
        print(f"  ❌ Unexpected error: {e}")
        failed_count += 1

print(f"\n📊 Processing completed!")
print(f"✅ Successful: {success_count}")
print(f"❌ Failed: {failed_count}")
print(f"⏭️  Skipped: {skipped_count}")
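
The three counters can also be derived from the returned tuples after the fact, which keeps the counting logic in one place. This is a sketch: `classify` and `tally` are hypothetical helpers that assume the `(success, skipped, error_msg)` shape returned by `process_single_subject_notebook`.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Result = Tuple[bool, bool, Optional[str]]


def classify(result: Result) -> str:
    """Map a (success, skipped, error_msg) tuple to a status label."""
    success, skipped, _error_msg = result
    if not success:
        return "failed"
    return "skipped" if skipped else "success"


def tally(results: Iterable[Result]) -> Counter:
    """Count how many results fall under each status label."""
    return Counter(classify(result) for result in results)


results = [
    (True, False, None),                       # processed and saved
    (True, True, None),                        # v2 file existed, skipped
    (False, False, "Schema conversion failed"),
]
counts = tally(results)
print(counts["success"], counts["skipped"], counts["failed"])
```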

## 7. File Handling Operations

The `file_io` module provides utilities for managing different versions of procedures files and handling file system operations with proper error handling and atomic writes.

In [None]:
# Example file handling operations
from file_io import get_procedures_file_path, read_json_file, write_json_file_safe

# Get file paths for both versions
subject_id = "123456"
v1_path = get_procedures_file_path(subject_id, version=1)
v2_path = get_procedures_file_path(subject_id, version=2)

print(f"📁 Version 1 path: {v1_path}")
print(f"📁 Version 2 path: {v2_path}")

# Check if files exist
import os
v1_exists = os.path.exists(v1_path) if v1_path else False
v2_exists = os.path.exists(v2_path) if v2_path else False

print(f"🔍 V1 exists: {v1_exists}")
print(f"🔍 V2 exists: {v2_exists}")

# Example of safe file writing with atomic operations
example_data = {
    "subject_id": subject_id,
    "procedures": ["perfusion", "imaging"],
    "created_at": "2024-01-01"
}

try:
    write_json_file_safe(example_data, "/tmp/example_procedures.json")
    print("✅ File written safely with atomic operation")
except Exception as e:
    print(f"❌ File write failed: {e}")

# Example of reading with error handling
try:
    data = read_json_file("/tmp/example_procedures.json")
    print(f"📖 Successfully read: {len(data)} keys")
except Exception as e:
    print(f"❌ File read failed: {e}")
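
How `write_json_file_safe` achieves atomicity is not shown here. A common pattern, sketched below under that caveat, is to write to a temporary file in the target's directory and then swap it into place with `os.replace`, so readers never see a half-written file. `atomic_write_json` is a hypothetical name, not part of `file_io`.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(data, path) -> None:
    """Write JSON to `path` so readers see the old or the new file, never a partial one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_name, path)     # atomic rename over the target
    except BaseException:
        # Remove the partial temp file; the original target is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example_procedures.json"
    atomic_write_json({"subject_id": "123456"}, target)
    print(json.loads(target.read_text(encoding="utf-8")))
```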

## 8. API Interactions

The `metadata_service` module handles all interactions with the AIND metadata service, including fetching existing records and managing caching behavior.

In [None]:
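# Example API interactions
# Assumption: unlike `fetch_procedures_metadata` (section 1), which returns a
# (success, data, message) tuple, this helper returns the data dict or None.
from metadata_service import fetch_procedures_from_metadata_service

# Example 1: Fetch procedures for a specific subject
subject_id = "699461"  # Example subject ID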
print(f"🌐 Fetching procedures for subject {subject_id}...")

try:
    procedures_data = fetch_procedures_from_metadata_service(subject_id)
    
    if procedures_data:
        print(f"✅ Successfully fetched procedures data")
        print(f"📋 Subject: {procedures_data.get('subject_id', 'Unknown')}")
        
        # Show procedure names if available
        procedures = procedures_data.get('procedures', [])
        if procedures:
            procedure_names = [p.get('procedure_name', 'Unknown') for p in procedures]
            print(f"🔬 Procedures: {', '.join(procedure_names)}")
        else:
            print("📝 No procedures found in data")
            
    else:
        print("⚠️  No procedures data found for this subject")
        
except Exception as e:
    print(f"❌ API call failed: {e}")

# Example 2: Understanding the caching behavior
print(f"\n💾 Caching behavior:")
print(f"   • V1 files (from metadata service): Never overwritten")
print(f"   • V2 files (converted): Always regenerated unless --skip-existing")
print(f"   • This ensures data integrity while allowing updates")

# Example 3: Show what happens with missing subjects
try:
    missing_data = fetch_procedures_from_metadata_service("999999")
    if missing_data:
        print(f"✅ Found data for test subject")
    else:
        print(f"ℹ️  Subject 999999 not found (expected for test)")
except Exception as e:
    print(f"ℹ️  Subject lookup failed: {e} (expected for missing subjects)")
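
The "v1 files are never overwritten" rule amounts to a read-through cache: use the file if it exists, otherwise fetch and store. The sketch below illustrates this with an injected fetch function so it runs without network access. `load_or_fetch_v1`, the cache file name, and `fake_fetch` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_or_fetch_v1(subject_id: str, cache_dir,
                     fetch: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Return cached v1 data if present; otherwise fetch it and cache the result."""
    cache_file = Path(cache_dir) / f"{subject_id}_procedures_v1.json"  # assumed layout
    if cache_file.exists():
        # An existing v1 file is read, never rewritten.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch(subject_id)
    if data:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data


calls = []


def fake_fetch(subject_id: str) -> Optional[dict]:
    """Stand-in for the metadata service; records each call it receives."""
    calls.append(subject_id)
    if subject_id == "699461":
        return {"subject_id": subject_id, "procedures": []}
    return None


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_fetch_v1("699461", cache_dir, fake_fetch)
    second = load_or_fetch_v1("699461", cache_dir, fake_fetch)  # served from the cache
    print(first == second, calls)
```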

## 9. Schema Conversion

The `schema_conversion` module handles converting between different versions of the procedures schema, ensuring compatibility and data integrity.

In [None]:
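# Example schema conversion operations
# Assumption: unlike `convert_procedures_to_v2` (section 5), which also takes the
# Excel sheet data and returns a 3-tuple, this helper takes only the v1 dict
# and returns the v2 dict.
from schema_conversion import convert_v1_to_v2_procedures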

# Example: Converting procedures from v1 to v2 format
# This shows the core transformation that happens for each subject

# Simulate v1 procedures data (what comes from metadata service)
example_v1_data = {
    "subject_id": "699461",
    "procedures": [
        {
            "procedure_name": "Perfusion",
            "start_date": "2023-12-17",
            "experimenter_full_name": "John Doe",
            "protocol_id": "dx.doi.org/10.17504/protocols.io.kxygx3pe6g8j/v4",
            "notes": "Standard perfusion protocol"
        },
        {
            "procedure_name": "Patch-seq",
            "start_date": "2023-12-17", 
            "experimenter_full_name": "Jane Smith",
            "protocol_id": "dx.doi.org/10.17504/protocols.io.bp2l6no99lqe/v3",
            "notes": "Multi-cell recording"
        }
    ]
}

print("🔄 Converting v1 to v2 schema...")
print(f"📋 Input: {len(example_v1_data['procedures'])} procedures")

try:
    v2_data = convert_v1_to_v2_procedures(example_v1_data)
    
    print(f"✅ Conversion successful!")
    print(f"📋 Output: {len(v2_data.get('procedures', []))} procedures")
    print(f"🆔 Subject ID: {v2_data.get('subject_id')}")
    
    # Show the structure differences
    if v2_data.get('procedures'):
        first_proc = v2_data['procedures'][0]
        print(f"🔍 V2 Structure Example:")
        for key in first_proc.keys():
            value = first_proc[key]
            if isinstance(value, str) and len(value) > 50:
                value = value[:47] + "..."
            print(f"   • {key}: {value}")
            
except Exception as e:
    print(f"❌ Conversion failed: {e}")

# Show key differences between versions
print(f"\n📊 Key Schema Differences:")
print(f"   • V1: Direct field mapping from metadata service")
print(f"   • V2: Enhanced validation and structure compliance")
print(f"   • V2: Better error handling and data consistency")
print(f"   • V2: Standardized field formats and validation")
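
Before converting, it can help to check that each v1 record has the fields the example above relies on. The sketch below is not part of `schema_conversion`: `find_v1_problems` is hypothetical, and treating `procedure_name` and `start_date` as required is an assumption based only on the sample data.

```python
from datetime import date
from typing import List

REQUIRED_FIELDS = ("procedure_name", "start_date")  # assumption, taken from the sample data


def find_v1_problems(v1_data: dict) -> List[str]:
    """Return human-readable problems found in a v1 procedures dict."""
    problems = []
    if not v1_data.get("subject_id"):
        problems.append("missing subject_id")
    for index, procedure in enumerate(v1_data.get("procedures", [])):
        for field in REQUIRED_FIELDS:
            if not procedure.get(field):
                problems.append(f"procedure {index}: missing {field}")
        start = procedure.get("start_date")
        if start:
            try:
                date.fromisoformat(start)
            except (TypeError, ValueError):
                problems.append(f"procedure {index}: start_date {start!r} is not an ISO date")
    return problems


sample = {
    "subject_id": "699461",
    "procedures": [
        {"procedure_name": "Perfusion", "start_date": "2023-12-17"},
        {"procedure_name": "Patch-seq", "start_date": "12/17/2023"},
    ],
}
for problem in find_v1_problems(sample):
    print(problem)
```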

## 10. CSV Progress Tracking

The `csv_tracking` module provides functionality to track processing progress and generate summary reports for analysis and monitoring.

In [None]:
# Example CSV tracking operations
from csv_tracking import update_processing_status, read_processing_status, generate_summary_report

# Initialize tracking for our processing session
csv_path = "processing_status_example.csv"

# Example: Track processing results for multiple subjects
example_results = [
    {"subject_id": "699461", "status": "success", "message": "Procedures created successfully"},
    {"subject_id": "699462", "status": "skipped", "message": "V2 file exists, skipping"},
    {"subject_id": "123456", "status": "failed", "message": "Subject not found in metadata service"},
    {"subject_id": "789012", "status": "success", "message": "Procedures updated from Excel data"}
]

print("📊 Updating processing status...")

for result in example_results:
    try:
        update_processing_status(
            csv_path,
            result["subject_id"],
            result["status"],
            result["message"]
        )
        print(f"  ✅ {result['subject_id']}: {result['status']}")
    except Exception as e:
        print(f"  ❌ Failed to update {result['subject_id']}: {e}")

# Read back the status to verify
print(f"\n📖 Reading processing status...")
try:
    status_data = read_processing_status(csv_path)
    print(f"📋 Found {len(status_data)} status records")
    
    # Show summary by status
    status_counts = {}
    for record in status_data:
        status = record.get('status', 'unknown')
        status_counts[status] = status_counts.get(status, 0) + 1
    
    print(f"📈 Status Summary:")
    for status, count in status_counts.items():
        print(f"   • {status}: {count}")
        
except Exception as e:
    print(f"❌ Failed to read status: {e}")

# Generate a summary report
print(f"\n📋 Generating summary report...")
try:
    summary = generate_summary_report(csv_path)
    print(f"✅ Summary generated:")
    for key, value in summary.items():
        print(f"   • {key}: {value}")
except Exception as e:
    print(f"❌ Failed to generate summary: {e}")

# Clean up example file
import os
if os.path.exists(csv_path):
    os.remove(csv_path)
    print(f"🧹 Cleaned up example file")
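
A status file like the one above is usually updated as an upsert keyed on subject ID, so that re-running a subject replaces its row instead of appending a duplicate. The sketch below shows one way to do that with the standard `csv` module. `upsert_status` and the three-column layout are assumptions, not the actual `csv_tracking` implementation.

```python
import csv
import tempfile
from pathlib import Path

FIELDNAMES = ["subject_id", "status", "message"]  # assumed column layout


def upsert_status(csv_path, subject_id: str, status: str, message: str) -> list:
    """Insert or replace the row for `subject_id` and return all rows."""
    path = Path(csv_path)
    rows = {}
    if path.exists():
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                rows[row["subject_id"]] = row
    # Replacing by key keeps one row per subject while preserving first-seen order.
    rows[subject_id] = {"subject_id": subject_id, "status": status, "message": message}
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows.values())
    return list(rows.values())


with tempfile.TemporaryDirectory() as tmp_dir:
    status_file = Path(tmp_dir) / "processing_status_example.csv"
    upsert_status(status_file, "699461", "failed", "Subject not found")
    upsert_status(status_file, "699462", "skipped", "V2 file exists")
    rows = upsert_status(status_file, "699461", "success", "Procedures created")
    print([(row["subject_id"], row["status"]) for row in rows])
```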

## 11. Summary and Next Steps

This notebook walked through each module of the patchseq procedures pipeline. The main benefits of the modular design are:

### ✅ Benefits
- **Modularity**: Each component has a single responsibility
- **Reusability**: Functions can be imported and used independently  
- **Testability**: Individual modules can be tested in isolation
- **Maintainability**: Changes are localized to specific modules
- **CLI Compatibility**: Original command-line interface preserved exactly

### 🚀 Usage Patterns
1. **CLI Usage**: `python create_procedures.py` (identical to original)
2. **Notebook Usage**: Import specific modules as needed
3. **Script Integration**: Use individual functions in your own scripts
4. **Batch Processing**: Leverage the modular structure for custom workflows

### 📁 Module Reference
- `excel_loader.py`: Excel file reading and validation
- `metadata_service.py`: API interactions and caching
- `schema_conversion.py`: Data format transformations  
- `file_io.py`: Safe file operations and path management
- `csv_tracking.py`: Progress tracking and reporting
- `subject_utils.py`: Subject ID validation and utilities

### 🎯 Next Steps
- Explore individual modules for specific use cases
- Customize the configuration for your environment
- Add your own processing logic using these building blocks
- Share modules with colleagues for collaborative development