# Genetic Variant Analysis with KEGG Pathway Data

This notebook demonstrates how to analyze genetic variants using KEGG pathway data and the Anthropic Claude API. For each variant, the analysis produces a structured reasoning path that explains the biological mechanism and its relationship to disease.

## Overview

The notebook includes functions to:
1. Load genetic variant data from TSV files
2. Process variants in batches using the Anthropic API
3. Generate detailed biological reasoning for each variant
4. Combine results into a comprehensive dataset

## Requirements

- Python 3.7+
- anthropic library
- tqdm for progress tracking
- Access to Anthropic Claude API

## Data Format

The input TSV file should contain columns for:
- Var_ID: Variant identifier
- ENTRY: Gene entry
- Chr: Chromosome
- Start: Position
- RefAllele: Reference allele
- AltAllele: Alternative allele
- Network Definition: Pathway information
- Gene: Gene information (JSON format)
- Disease: Associated disease (JSON format)
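
As an illustration of this layout, the sketch below builds one hypothetical row and decodes its JSON columns. Every value (variant ID, gene entry, disease name) is an invented placeholder, not a record from the KEGG dataset. The shape of the Gene and Disease objects is an assumption inferred from how the prompt code in this notebook reads them: gene values of the form `SYMBOL; description`, and disease names used as keys.

```python
import json

# Column names as listed in the Data Format section.
columns = ["Var_ID", "ENTRY", "Chr", "Start", "RefAllele", "AltAllele",
           "Network Definition", "Gene", "Disease"]

# Hypothetical row: all values are placeholders, not real KEGG data.
row = {
    "Var_ID": "VAR0001",
    "ENTRY": "1234",
    "Chr": "1",
    "Start": "100000",
    "RefAllele": "A",
    "AltAllele": "G",
    "Network Definition": "GENE1 -> GENE2",
    "Gene": json.dumps({"1234": "GENE1; example gene description"}),
    "Disease": json.dumps({"Example disease": "placeholder description"}),
}

# Serialize to a tab-separated header line and data line.
header_line = "\t".join(columns)
data_line = "\t".join(row[name] for name in columns)

# Parse the line back into a dict by splitting on tabs.
parsed = dict(zip(header_line.split("\t"), data_line.split("\t")))

# Decode the JSON columns: gene symbols precede the ';', disease names are the keys.
gene_symbols = [value.split(";")[0] for value in json.loads(parsed["Gene"]).values()]
disease_names = list(json.loads(parsed["Disease"]))
print(gene_symbols, disease_names)
```

Splitting on tabs is safe here only because the JSON values contain no tab characters.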

## Setup and Installation

Install required packages and set up the environment.

In [None]:
!pip install anthropic tqdm

import os
import json
import time
import glob
import datetime
import re
from tqdm.notebook import tqdm
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

# Create directories
output_dir = "processed_variants"
os.makedirs(output_dir, exist_ok=True)

# API key setup - replace with your preferred method
# Option 1: Set as environment variable (recommended for production)
api_key = os.getenv('ANTHROPIC_API_KEY')

# Option 2: For Google Colab, uncomment the following lines:
# from google.colab import userdata
# api_key = userdata.get('ANTHROPIC_API_KEY')

# Option 3: Direct input (not recommended for production)
if not api_key:
    api_key = input("Enter your Anthropic API key: ")

# Create Anthropic client
client = anthropic.Anthropic(api_key=api_key)



## Data Loading Functions

Functions to load and process genetic variant data from TSV files.
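
As an alternative sketch (not the loader this notebook uses), the same parsing can be done with Python's standard `csv` module. The function name `load_variants_with_csv` and the sample rows are hypothetical. Quoting is disabled so that JSON columns containing double quotes are passed through unchanged, which is an assumption about how the input file is written.

```python
import csv
import io

def load_variants_with_csv(handle):
    """Read tab-separated rows into dicts, skipping rows with the wrong column count."""
    # QUOTE_NONE treats double quotes as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    variants, skipped = [], 0
    for values in reader:
        if len(values) != len(header):
            skipped += 1  # malformed or blank row
            continue
        variants.append(dict(zip(header, values)))
    return variants, skipped

# In-memory sample with placeholder values; the second data row is one column short.
sample = io.StringIO(
    "Var_ID\tENTRY\tChr\n"
    "VAR0001\t1234\t1\n"
    "VAR0002\t5678\n"
    "VAR0003\t9012\tX\n"
)
variants, skipped = load_variants_with_csv(sample)
print(len(variants), skipped)
```

With a real file, the handle would come from `open(path, newline="", encoding="utf-8")`.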

In [None]:
# Load the variant data
def load_variant_data(file_path):
    """Load variant data from a TSV file."""
    variants = []

    with open(file_path, 'r', encoding='utf-8') as f:
        # Get header line
        header = f.readline().strip().split('\t')

        # Read each line and create a dictionary
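        for line in f:
            values = line.strip().split('\t')
            # Skip rows whose column count does not match the header
            if len(values) != len(header):
                print(f"Skipping malformed line: {line[:50]}...")
                continue
            variant = dict(zip(header, values))
            variants.append(variant)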

    return variants



In [None]:
def create_variant_prompt(variant):
    """Create a prompt for analyzing a genetic variant."""
    prompt = f"""# Genetic Variant Analysis Prompt

    You are a genetics expert analyzing disease-causing mutations. For the following variant data, create a detailed reasoning path explaining the biological mechanism and disease relationship.

    ## Variant Data:
    - Variant ID: {variant.get('Var_ID', 'Unknown')}
    - Gene: {variant.get('ENTRY', 'Unknown')} ({", ".join(g.split(';')[0] for g in json.loads(variant.get('Gene') or '{}').values())})
    - Chromosome: {variant.get('Chr', 'Unknown')}
    - Position: {variant.get('Start', 'Unknown')}
    - Reference Allele: {variant.get('RefAllele', 'Unknown')}
    - Alternative Allele: {variant.get('AltAllele', 'Unknown')}
    - Network: {variant.get('Network Definition', 'Unknown')}
    - Associated Disease: {next(iter(json.loads(variant.get('Disease') or '{}')), 'Unknown')}

    ## Instructions
    1. Based on this variant data, provide a structured analysis in valid JSON format with the following components:
      - Keep the complete raw_data object containing all original fields
      - Generate one detailed question about the biological effect of this variant and what disease it might contribute to
      - Provide a concise answer (2-3 sentences) summarizing the mechanism and disease relationship
      - Develop a comprehensive reasoning path containing:
        - The variant identifier
        - The HGVS notation
        - 8-12 sequential reasoning steps that trace the causal pathway from the genetic mutation to its cellular effects and disease manifestation
        - Relevant labels for pathways, diseases, and genes

    ## Output Format
    ```json
    {{
      "raw_data": {{
        // Complete original data object with all fields
      }},
      "question": "What is the biological effect of the [gene] mutation [id] ([ref]>[alt] at [position]) and what disease might it contribute to?",
      "answer": "Concise 2-3 sentence answer summarizing mechanism and disease",
      "reasoning": {{
        "variant_id": "ID",
        "hgvs": "Formal HGVS notation",
        "reasoning_steps": [
          "Step 1: Description of mutation at molecular level",
          "Step 2: Effect on protein structure/function",
          "Step 3: Effect on cellular pathway/process",
          // Additional steps showing causal chain
          "Final step: How this contributes to disease pathology"
        ],
        "labels": {{
          "pathway": ["Pathway identifiers"],
          "disease": ["Disease names"],
          "gene": ["Gene names"]
        }}
      }}
    }}
    ```

    Important notes:

    Ensure your response is VALID JSON without ANY explanatory text outside the JSON structure
    Do not include markdown code blocks (```) in your response - just provide the raw JSON
    Provide detailed, scientifically accurate reasoning steps that show the complete causal pathway
    For HGVS notation, include both genomic (g.) and protein (p.) level changes

    Analyze this variant data and provide your complete analysis in valid JSON format:
    """
    return prompt


## Prompt Creation

Function to create structured prompts for genetic variant analysis.
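
The prompt template in this notebook is an f-string that contains a JSON skeleton, which is why its braces are doubled. The sketch below shows the escaping rule on a much smaller template. `build_template` and the placeholder ID are hypothetical and exist only to illustrate it.

```python
import json

def build_template(variant_id):
    # Inside an f-string, {{ and }} produce literal braces,
    # while {variant_id} is replaced by the argument's value.
    return f"""{{
  "variant_id": "{variant_id}",
  "labels": {{"gene": []}}
}}"""

text = build_template("VAR0001")
print(text)

# With every literal brace escaped, the rendered text is valid JSON.
data = json.loads(text)
```

A single unescaped `{` or `}` in such a template is either treated as the start of an expression or rejected as a syntax error, so the doubling has to be applied consistently.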

In [None]:
def process_variants_in_batches(variants, batch_size=5, model="claude-3-7-sonnet-20250219", max_tokens=6000):
    """Process variants in batches using the Anthropic SDK."""
    print(f"Processing {len(variants)} variants in batches of {batch_size}")
    # Process in batches
    for i in range(0, len(variants), batch_size):
        batch_variants = variants[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1} with {len(batch_variants)} variants")

        # Create batch requests
        batch_requests = []
        for variant in batch_variants:
            # Create a custom_id (max 64 chars)
            var_id = variant.get('Var_ID', 'variant')
            gene = variant.get('ENTRY', '')
            custom_id = f"{var_id}_{gene}"[:64]

            # Create the prompt
            prompt = create_variant_prompt(variant)

            # Add to batch requests
            batch_requests.append(
                Request(
                    custom_id=custom_id,
                    params=MessageCreateParamsNonStreaming(
                        model=model,
                        max_tokens=max_tokens,
                        temperature=0.2,  # Low temperature: mostly consistent output with slight variation in reasoning
                        system="You are a genetics expert analyzing disease-causing mutations. Provide your analysis in VALID JSON format only, with no markdown formatting or explanatory text. Your JSON should contain raw_data, question, answer, and reasoning components.",
                        messages=[
                            {"role": "user", "content": prompt}
                        ]
                    )
                )
            )

        # Submit batch
        print(f"Submitting batch with {len(batch_requests)} requests...")
        batch = client.messages.batches.create(requests=batch_requests)
        print(f"Batch created with ID: {batch.id}")
        print(f"Initial status: {batch.processing_status}")

        # Poll for batch completion
        polling_interval = 10  # seconds
        while True:
            # Get batch status
            batch_status = client.messages.batches.retrieve(batch.id)

            # Print status
            print(f"Batch status: {batch_status.processing_status}")
            print(f"Processing: {batch_status.request_counts.processing}, "
                  f"Succeeded: {batch_status.request_counts.succeeded}, "
                  f"Errored: {batch_status.request_counts.errored}")

            # Exit loop if processing is complete
            if batch_status.processing_status == "ended":
                break

            # Wait before checking again
            print(f"Waiting {polling_interval} seconds...")
            time.sleep(polling_interval)

        # Process batch results
        print("Processing batch results...")
        try:
            for result in client.messages.batches.results(batch.id):
                custom_id = result.custom_id

                # Extract variant ID from custom_id
                variant_id = custom_id.split('_')[0]
                output_file = os.path.join(output_dir, f"{variant_id}_processed.json")

                # Handle different result types
                if result.result.type == "succeeded":
                    # Get the message content
                    message = result.result.message
                    content = message.content[0].text if message.content else ""

                    # Extract and parse the JSON
                    try:
                        # Try direct parsing first
                        try:
                            parsed_json = json.loads(content)
                        except json.JSONDecodeError:
                            # Remove markdown code blocks if present
                            if "```json" in content or "```" in content:
                                content = re.sub(r'```json\s*', '', content)
                                content = re.sub(r'```\s*', '', content)

                            # Extract just the JSON part
                            json_start = content.find('{')
                            json_end = content.rfind('}') + 1

                            if json_start >= 0 and json_end > json_start:
                                json_text = content[json_start:json_end]
                                parsed_json = json.loads(json_text)

                        # Save the parsed result
                        with open(output_file, 'w', encoding='utf-8') as f:
                            json.dump(parsed_json, f, indent=2)
                        print(f"✓ Saved result for {variant_id}")

                    except Exception as e:
                        print(f"✗ Error parsing result for {variant_id}: {e}")
                        # Save the raw content
                        with open(output_file, 'w', encoding='utf-8') as f:
                            json.dump({"error": str(e), "raw_content": content}, f, indent=2)

                        # Also save as text file for manual fixing
                        with open(f"{output_file}_raw.txt", 'w', encoding='utf-8') as f:
                            f.write(content)

                elif result.result.type == "errored":
                    error_message = "Unknown error"
                    if hasattr(result.result, 'error') and hasattr(result.result.error, 'message'):
                        error_message = result.result.error.message

                    print(f"✗ Error processing {variant_id}: {error_message}")
                    # Save the error
                    with open(output_file, 'w', encoding='utf-8') as f:
                        json.dump({"error": error_message}, f, indent=2)

        except Exception as e:
            print(f"Error processing batch results: {str(e)}")

        # Wait between batches
        if i + batch_size < len(variants):
            print("Waiting 5 seconds before next batch...")
            time.sleep(5)

    print("All batches processed!")



## Batch Processing Functions

Functions to process variants in batches using the Anthropic API.
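
One step of batch processing is recovering a JSON object from the model's reply, which may arrive wrapped in a Markdown code fence or surrounded by extra text. The sketch below isolates that fallback chain as a standalone helper. `extract_json` is a hypothetical name, and this is a simplified illustration of the approach rather than the exact code used in the batch function.

```python
import json
import re

# Three backticks, built this way so the example itself is easy to embed in Markdown.
FENCE = "`" * 3

def extract_json(text):
    """Return the parsed JSON value found in text, or None if nothing parses."""
    # 1. Try the reply exactly as received.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Remove Markdown fence markers (with or without a "json" tag) and retry.
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?\s*", "", text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # 3. Fall back to the span from the first '{' to the last '}'.
    start, end = cleaned.find("{"), cleaned.rfind("}") + 1
    if 0 <= start < end:
        try:
            return json.loads(cleaned[start:end])
        except json.JSONDecodeError:
            return None
    return None

# Placeholder reply: explanatory text followed by a fenced JSON object.
reply = "Here is the analysis:\n" + FENCE + 'json\n{"variant_id": "VAR0001"}\n' + FENCE
print(extract_json(reply))
```

Returning `None` instead of raising lets the caller decide whether to save the raw reply for manual repair.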

## This is the version I used for the curation


## Run for a Batch ##

In [None]:
import os
import json
import time
import glob
import datetime
import re
from tqdm.notebook import tqdm
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

# Create directories
output_dir = "processed_variants"
os.makedirs(output_dir, exist_ok=True)

# API key setup - multiple options for different environments
api_key = os.getenv('ANTHROPIC_API_KEY')

# For Google Colab users, uncomment these lines:
# from google.colab import userdata
# api_key = userdata.get('ANTHROPIC_API_KEY')

if not api_key:
    api_key = input("Enter your Anthropic API key: ")

# Create Anthropic client
client = anthropic.Anthropic(api_key=api_key)

# Function to load variant data
def load_variant_data(file_path):
    """Load variant data from a TSV file."""
    variants = []

    with open(file_path, 'r', encoding='utf-8') as f:
        # Get header line
        header = f.readline().strip().split('\t')

        # Read each line and create a dictionary
        for line in f:
            values = line.strip().split('\t')
            if len(values) == len(header):
                variant = {header[i]: values[i] for i in range(len(header))}
                variants.append(variant)
            else:
                print(f"Skipping malformed line: {line[:50]}...")

    return variants

# Function to create the prompt
def create_variant_prompt(variant):
    """Create a prompt for analyzing a genetic variant."""
    # Parse Gene field
    gene_info = {}
    gene_names = "Unknown"
    try:
        # First try to parse as JSON
        if variant.get('Gene') and variant.get('Gene').startswith('{'):
            gene_info = json.loads(variant.get('Gene', '{}'))
            gene_names = ", ".join([g.split(';')[0] for g in gene_info.values()]) if gene_info else "Unknown"
        else:
            # If not JSON, use as is
            gene_names = variant.get('Gene', 'Unknown')
    except (ValueError, AttributeError, TypeError):
        gene_names = variant.get('Gene', 'Unknown')

    # Parse Disease field
    disease_info = {}
    disease_name = "Unknown"
    try:
        # First try to parse as JSON
        if variant.get('Disease') and variant.get('Disease').startswith('{'):
            disease_info = json.loads(variant.get('Disease', '{}'))
            disease_name = list(disease_info.keys())[0] if disease_info else "Unknown"
        else:
            # If not JSON, use as is
            disease_name = variant.get('Disease', 'Unknown')
    except (ValueError, AttributeError, TypeError):
        disease_name = variant.get('Disease', 'Unknown')

    prompt = f"""# Genetic Variant Analysis Prompt

    You are a genetics expert analyzing disease-causing mutations. For the following variant data, create a detailed reasoning path explaining the biological mechanism and disease relationship.

    ## Variant Data:
    - Variant ID: {variant.get('Var_ID', 'Unknown')}
    - Gene: {variant.get('ENTRY', 'Unknown')} ({gene_names})
    - Chromosome: {variant.get('Chr', 'Unknown')}
    - Position: {variant.get('Start', 'Unknown')}
    - Reference Allele: {variant.get('RefAllele', 'Unknown')}
    - Alternative Allele: {variant.get('AltAllele', 'Unknown')}
    - Network: {variant.get('Network Definition', 'Unknown')}
    - Associated Disease: {disease_name}

    ## Instructions
    1. Based on this variant data, provide a structured analysis in valid JSON format with the following components:
      - Keep the complete raw_data object containing all original fields
      - Generate one detailed question about the biological effect of this variant and what disease it might contribute to
      - Provide a concise answer (2-3 sentences) summarizing the mechanism and disease relationship
      - Develop a comprehensive reasoning path containing:
        - The variant identifier
        - The HGVS notation
        - 8-12 sequential reasoning steps that trace the causal pathway from the genetic mutation to its cellular effects and disease manifestation
        - Relevant labels for pathways, diseases, and genes

    ## Output Format
    ```json
    {{
      "raw_data": {{
        // Complete original data object with all fields
      }},
      "question": "What is the biological effect of the [gene] mutation [id] ([ref]>[alt] at [position]) and what disease might it contribute to?",
      "answer": "Concise 2-3 sentence answer summarizing mechanism and disease",
      "reasoning": {{
        "variant_id": "ID",
        "hgvs": "Formal HGVS notation",
        "reasoning_steps": [
          "Step 1: Description of mutation at molecular level",
          "Step 2: Effect on protein structure/function",
          "Step 3: Effect on cellular pathway/process",
          // Additional steps showing causal chain
          "Final step: How this contributes to disease pathology"
        ],
        "labels": {{
          "pathway": ["Pathway identifiers"],
          "disease": ["Disease names"],
          "gene": ["Gene names"]
        }}
      }}
    }}
    ```

    Important notes:

    Ensure your response is VALID JSON without ANY explanatory text outside the JSON structure
    Do not include markdown code blocks (```) in your response - just provide the raw JSON
    Provide detailed, scientifically accurate reasoning steps that show the complete causal pathway
    For HGVS notation, include both genomic (g.) and protein (p.) level changes

    Analyze this variant data and provide your complete analysis in valid JSON format:
    """
    return prompt

## Function to process variants in batches
def process_variants_in_batches(variants, batch_size=5, model="claude-3-7-sonnet-20250219", max_tokens=6000):
    """Process variants in batches using the Anthropic SDK."""
    print(f"Processing {len(variants)} variants in batches of {batch_size}")

    # Process in batches
    for i in range(0, len(variants), batch_size):
        batch_variants = variants[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1} with {len(batch_variants)} variants")

        # Create batch requests
        batch_requests = []
        for variant in batch_variants:
            # Extract the Var_ID as the unique identifier
            var_id = variant.get('Var_ID', f'variant_{i}_{len(batch_requests)}')

            # Create the prompt
            prompt = create_variant_prompt(variant)

            # Add to batch requests
            batch_requests.append(
                Request(
                    custom_id=var_id,  # Use Var_ID directly as the custom_id
                    params=MessageCreateParamsNonStreaming(
                        model=model,
                        max_tokens=max_tokens,
                        temperature=0.2,
                        system="You are a genetics expert analyzing disease-causing mutations. Provide your analysis in VALID JSON format only, with no markdown formatting or explanatory text. Your JSON should contain raw_data, question, answer, and reasoning components.",
                        messages=[
                            {"role": "user", "content": prompt}
                        ]
                    )
                )
            )

        # Submit batch
        print(f"Submitting batch with {len(batch_requests)} requests...")
        batch = client.messages.batches.create(requests=batch_requests)
        print(f"Batch created with ID: {batch.id}")
        print(f"Initial status: {batch.processing_status}")

        # Poll for batch completion
        polling_interval = 10  # seconds
        while True:
            # Get batch status
            batch_status = client.messages.batches.retrieve(batch.id)

            # Print status
            print(f"Batch status: {batch_status.processing_status}")
            print(f"Processing: {batch_status.request_counts.processing}, "
                  f"Succeeded: {batch_status.request_counts.succeeded}, "
                  f"Errored: {batch_status.request_counts.errored}")

            # Exit loop if processing is complete
            if batch_status.processing_status == "ended":
                break

            # Wait before checking again
            print(f"Waiting {polling_interval} seconds...")
            time.sleep(polling_interval)

        # Process batch results
        print("Processing batch results...")
        try:
            for result in client.messages.batches.results(batch.id):
                # Get the variant ID from custom_id (which should be the Var_ID)
                variant_id = result.custom_id
                output_file = os.path.join(output_dir, f"{variant_id}_processed.json")

                # Handle different result types
                if result.result.type == "succeeded":
                    # Get the message content
                    message = result.result.message
                    content = message.content[0].text if message.content else ""

                    # Extract and parse the JSON
                    try:
                        # Try multiple approaches to extract and parse the JSON
                        json_text = None
                        parsed_json = None

                        # Try direct parsing first
                        try:
                            parsed_json = json.loads(content)
                            print(f"✓ Direct JSON parsing successful for {variant_id}")
                        except json.JSONDecodeError:
                            # Try removing markdown code blocks if present
                            if "```json" in content or "```" in content:
                                cleaned_content = re.sub(r'```json\s*', '', content)
                                cleaned_content = re.sub(r'```\s*', '', cleaned_content)
                                try:
                                    parsed_json = json.loads(cleaned_content)
                                    print(f"✓ JSON parsing after markdown removal successful for {variant_id}")
                                except json.JSONDecodeError:
                                    pass  # Will try next method

                            # Try extracting just the JSON part
                            if not parsed_json:
                                json_start = content.find('{')
                                json_end = content.rfind('}') + 1

                                if json_start >= 0 and json_end > json_start:
                                    json_text = content[json_start:json_end]
                                    try:
                                        parsed_json = json.loads(json_text)
                                        print(f"✓ JSON extraction and parsing successful for {variant_id}")
                                    except json.JSONDecodeError:
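                                        # Try fixing common JSON syntax issues (missing commas between adjacent items).
                                        # \s+ rather than \s* so that empty strings ("") are left untouched.
                                        fixed_json = re.sub(r'"\s+"', '", "', json_text)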
                                        fixed_json = re.sub(r'}\s*{', '}, {', fixed_json)
                                        fixed_json = re.sub(r']\s*{', '], {', fixed_json)
                                        fixed_json = re.sub(r'}\s*\[', '}, [', fixed_json)
                                        fixed_json = re.sub(r']\s*\[', '], [', fixed_json)

                                        try:
                                            parsed_json = json.loads(fixed_json)
                                            print(f"✓ JSON parsing after fixing syntax successful for {variant_id}")
                                        except json.JSONDecodeError as e:
                                            print(f"✗ All JSON parsing methods failed for {variant_id}: {e}")

                        # Save the parsed result or error
                        if parsed_json:
                            with open(output_file, 'w', encoding='utf-8') as f:
                                json.dump(parsed_json, f, indent=2)
                            print(f"✓ Saved result for {variant_id}")
                        else:
                            # Save the full raw response for manual fixing
                            with open(output_file, 'w', encoding='utf-8') as f:
                                json.dump({
                                    "error": "Invalid JSON in response",
                                    "raw_response": content
                                }, f, indent=2)
                            print(f"✗ JSON parsing error for {variant_id}, saved full raw response")

                            # Also save raw content to a text file for easier manual fixing
                            with open(f"{output_file}_raw.txt", 'w', encoding='utf-8') as f:
                                f.write(content)

                    except Exception as e:
                        print(f"✗ Error processing result for {variant_id}: {e}")
                        # Save the raw content
                        with open(output_file, 'w', encoding='utf-8') as f:
                            json.dump({"error": str(e), "raw_content": content}, f, indent=2)

                        # Also save as text file for manual fixing
                        with open(f"{output_file}_raw.txt", 'w', encoding='utf-8') as f:
                            f.write(content)

                elif result.result.type == "errored":
                    error_message = "Unknown error"
                    if hasattr(result.result, 'error') and hasattr(result.result.error, 'message'):
                        error_message = result.result.error.message

                    print(f"✗ Error processing {variant_id}: {error_message}")
                    # Save the error
                    with open(output_file, 'w', encoding='utf-8') as f:
                        json.dump({"error": error_message}, f, indent=2)

        except Exception as e:
            print(f"Error processing batch results: {str(e)}")

        # Wait between batches
        if i + batch_size < len(variants):
            print("Waiting 5 seconds before next batch...")
            time.sleep(5)

    print("All batches processed!")

## Function to combine all results
def combine_all_results():
    """Combine all processed results into a single JSON file."""
    all_results = []
    error_count = 0

    # List all JSON files in the output directory (the *_raw.txt dumps never match "*.json")
    json_files = glob.glob(os.path.join(output_dir, "*.json"))

    print(f"Found {len(json_files)} JSON files to combine")

    for file_path in json_files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)

            # Skip files with errors
            if "error" in data:
                error_count += 1
                print(f"Skipping file with error: {os.path.basename(file_path)}")
                continue

            all_results.append(data)
            print(f"Added {os.path.basename(file_path)} to combined results")

        except Exception as e:
            print(f"Error loading {os.path.basename(file_path)}: {e}")
            error_count += 1

    print(f"Successfully combined {len(all_results)} results. {error_count} files had errors.")

    # Save the combined collection
    with open("all_variant_analyses.json", 'w', encoding='utf-8') as f:
        json.dump(all_results, f, indent=2)

    print("Saved all results to 'all_variant_analyses.json'")




## Complete Processing Pipeline

This section contains the complete pipeline with all functions integrated for easier execution.

In [None]:
## Main function that you will call
def process_genetic_variants(file_path, num_variants=20, batch_size=5, model="claude-3-7-sonnet-20250219"):
    """
    Process genetic variants from a TSV file.
    Parameters:
    file_path (str): Path to the TSV file containing variant data
    num_variants (int): Number of variants to process (default: 20)
    batch_size (int): Number of variants to process in each batch (default: 5)
    model (str): Claude model to use (default: claude-3-7-sonnet-20250219)
    """
    print(f"Genetic Variant Analysis Script")
    print(f"===============================")
    print(f"Processing {num_variants} variants in batches of {batch_size} using {model}")

    # Load data
    print(f"Loading variant data from {file_path}...")
    all_variants = load_variant_data(file_path)
    print(f"Loaded {len(all_variants)} variants in total")

    # Limit to specified number of variants
    variants = all_variants[:num_variants]
    print(f"Limited to the first {len(variants)} variants for processing")

    # Process variants
    process_variants_in_batches(
        variants,
        batch_size=batch_size,
        model=model
    )

    # Combine results
    print("Combining results...")
    combine_all_results()

    print("Processing complete!")
    return f"Results saved to {output_dir} and combined in all_variant_analyses.json"


## Main Processing Function

Convenience function to run the complete analysis pipeline.

In [None]:
# Run the pipeline with your parameters
result = process_genetic_variants(
    file_path="final_network_with_variant.tsv",
    num_variants=20,
    batch_size=5,
    model="claude-3-7-sonnet-20250219"
)
print(result)

def process_genetic_variants(file_path, num_variants=20, batch_size=5, model="claude-3-7-sonnet-20250219"):
    """
    Process genetic variants from a TSV file using the Anthropic Claude API.
    
    Parameters:
    -----------
    file_path : str
        Path to the TSV file containing variant data (relative to notebook location)
    num_variants : int, optional
        Number of variants to process (default: 20)
        Set to None to process all variants in the file
    batch_size : int, optional
        Number of variants to process in each API batch (default: 5)
        Smaller batches provide better error handling but may be slower
    model : str, optional
        Claude model to use (default: "claude-3-7-sonnet-20250219")
        
    Returns:
    --------
    str
        Status message indicating completion and output locations
        
    Output Files:
    -------------
    - Individual analyses: saved in processed_variants/ directory
    - Combined results: saved as all_variant_analyses.json
    """
    print(f"Genetic Variant Analysis Script")
    print(f"===============================")
    print(f"Model: {model}")
    print(f"Batch size: {batch_size}")
    
    # Load data
    print(f"Loading variant data from {file_path}...")
    try:
        all_variants = load_variant_data(file_path)
        print(f"Loaded {len(all_variants)} variants in total")
    except FileNotFoundError:
        return f"Error: File '{file_path}' not found. Please check the file path."
    except Exception as e:
        return f"Error loading data: {str(e)}"
    
    # Limit to specified number of variants
    if num_variants is None:
        variants = all_variants
        print(f"Processing all {len(variants)} variants")
    else:
        variants = all_variants[:num_variants]
        print(f"Processing the first {len(variants)} variants")
    
    if not variants:
        return "Error: No variants to process"
    
    # Process variants
    try:
        process_variants_in_batches(
            variants,
            batch_size=batch_size,
            model=model
        )
    except Exception as e:
        return f"Error during processing: {str(e)}"
    
    # Combine results
    print("Combining results...")
    try:
        combine_all_results()
    except Exception as e:
        print(f"Warning: Error combining results: {str(e)}")
    
    print("Processing complete!")
    return f"Analysis complete. Results saved to '{output_dir}/' directory and combined in 'all_variant_analyses.json'"

Genetic Variant Analysis Script
Processing 440 variants in batches of 20 using claude-3-7-sonnet-20250219
Loading variant data from final_network_with_variant.tsv...
Loaded 289 variants in total
Limited to the first 289 variants for processing
Processing 289 variants in batches of 20
Processing batch 1 with 20 variants
Submitting batch with 20 requests...
Batch created with ID: msgbatch_013VgvncRWMwgGiuSD3ZU1Ug
Initial status: in_progress
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progress
Processing: 20, Succeeded: 0, Errored: 0
Waiting 10 seconds...
Batch status: in_progre

## Usage Examples

Examples of how to run the genetic variant analysis with different parameters.

## Notes and Considerations

### API Usage
- This notebook uses the Anthropic Claude API which requires an API key
- Processing large numbers of variants will consume significant API credits
- Consider rate limits and batch sizes based on your API plan
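
To gauge how many batches a run will submit, a rough sketch like the following can help. It assumes the same slicing scheme as `process_variants_in_batches` (consecutive slices of `batch_size` items). `chunked` and the placeholder IDs are hypothetical names used only for illustration. The figures 289 and 20 mirror the run logged earlier in this notebook.

```python
import math

def chunked(items, size):
    """Yield consecutive slices of items, each holding at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 289 placeholder IDs split into batches of 20.
variant_ids = [f"VAR{n:04d}" for n in range(1, 290)]
batches = list(chunked(variant_ids, 20))

# The batch count equals ceil(total / batch_size); the last batch holds the remainder.
expected_batches = math.ceil(len(variant_ids) / 20)
print(len(batches), expected_batches, len(batches[-1]))
```

Each batch is polled until it ends before the next one is submitted, so the batch count also gives a lower bound on the number of polling loops.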

### Data Requirements
- Input data should be in TSV format with required columns
- Gene and Disease fields should contain valid JSON when structured data is available
- Ensure your input file path is correct relative to the notebook location
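
A quick header check before submitting anything can catch a wrong file or a renamed column early. The sketch below is one way to do it. `REQUIRED_COLUMNS` restates the columns listed under Data Format, and `missing_columns` is a hypothetical helper, not part of the notebook.

```python
# Columns the prompt code reads, as listed in the Data Format section.
REQUIRED_COLUMNS = ["Var_ID", "ENTRY", "Chr", "Start", "RefAllele",
                    "AltAllele", "Network Definition", "Gene", "Disease"]

def missing_columns(header_line):
    """Return the required column names absent from a tab-separated header line."""
    present = set(header_line.rstrip("\r\n").split("\t"))
    return [name for name in REQUIRED_COLUMNS if name not in present]

# A complete header reports nothing missing; a partial one lists the gaps in order.
full_header = "\t".join(REQUIRED_COLUMNS) + "\n"
partial_header = "Var_ID\tENTRY\tChr\n"
print(missing_columns(full_header))
print(missing_columns(partial_header))
```

With a real file, `header_line` would be the first line returned by `readline()` on the opened TSV.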

### Output
- Individual variant analyses are saved in the `processed_variants/` directory
- Combined results are saved as `all_variant_analyses.json`
- Failed analyses are saved with error information for debugging
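
To see which variants need a second look, the error files can be listed with a short script. This is a sketch under two assumptions: outputs follow the `<Var_ID>_processed.json` naming used by `process_variants_in_batches`, and failures carry a top-level `error` key. `find_failed_outputs` is a hypothetical helper, and the demonstration writes placeholder files to a temporary directory.

```python
import glob
import json
import os
import tempfile

def find_failed_outputs(directory):
    """Return sorted names of *_processed.json files that are unreadable or contain an 'error' key."""
    failed = []
    for path in glob.glob(os.path.join(directory, "*_processed.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, json.JSONDecodeError):
            failed.append(os.path.basename(path))
            continue
        if isinstance(data, dict) and "error" in data:
            failed.append(os.path.basename(path))
    return sorted(failed)

# Demonstration on a throwaway directory with one good and one failed placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "VAR0001_processed.json"), "w", encoding="utf-8") as f:
        json.dump({"question": "placeholder", "answer": "placeholder"}, f)
    with open(os.path.join(tmp, "VAR0002_processed.json"), "w", encoding="utf-8") as f:
        json.dump({"error": "Invalid JSON in response", "raw_response": "..."}, f)
    failed = find_failed_outputs(tmp)

print(failed)
```

The returned names can be mapped back to variant IDs by stripping the `_processed.json` suffix, for example to resubmit only those variants.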

### Customization
- Adjust `num_variants` and `batch_size` parameters based on your needs
- Modify the prompt template in `create_variant_prompt()` for different analysis focuses
- Change the output directory by modifying the `output_dir` variable

### Example 1: Basic usage with default parameters
Process the first 20 variants from the KEGG dataset:
```python
file_path = "kegg_data/final_network_with_variant.tsv"

result = process_genetic_variants(
    file_path=file_path,
    num_variants=20,     # Process first 20 variants
    batch_size=5,        # Process 5 variants per batch
    model="claude-3-7-sonnet-20250219"
)
print(result)
```

### Example 2: Process more variants with larger batches
Run the following to process a larger slice of the file (it reuses `file_path` from Example 1):
```python
result = process_genetic_variants(
    file_path=file_path,
    num_variants=100,    # Process first 100 variants
    batch_size=10,       # Larger batches for efficiency
    model="claude-3-7-sonnet-20250219"
)
print(result)
```

### Example 3: Process all variants in the file
Run the following to process every variant in the file (it reuses `file_path` from Example 1; be aware of API costs):
```python
result = process_genetic_variants(
    file_path=file_path,
    num_variants=None,   # Process all variants
    batch_size=5,        # Conservative batch size
    model="claude-3-7-sonnet-20250219"
)
print(result)
```