# AI-Powered Variant Interpretation

This notebook demonstrates how to use Amazon Bedrock with foundation models to interpret genomic variants from VCF files.

**Prerequisites**: Complete the workshop documentation sections 510-530 before running this notebook.

## Step 1: Install Required Libraries

In [1]:
# Install required Python libraries
!pip install strands-agents pandas boto3 --quiet

## Step 2: Import Libraries

In [2]:
# Import required libraries
import pandas as pd
import boto3
from strands import Agent
import sys

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## Step 3: Verify Environment Setup

In [3]:
# Verify Python version
print(f"Python version: {sys.version}")
print()

# Verify library versions
print("Library versions:")
print(f"  pandas: {pd.__version__}")
print(f"  boto3: {boto3.__version__}")
print()

# Verify AWS credentials
try:
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    print("✅ AWS credentials configured")
    print(f"   Account: {identity['Account']}")
    print(f"   User/Role: {identity['Arn'].split('/')[-1]}")
except Exception as e:
    print(f"❌ AWS credentials error: {e}")
    print("   Make sure you're running in the VSCode Server environment")

print()
print("✅ Environment setup complete!")
print("   You're ready to start analyzing variants.")

Python version: 3.12.10 (main, Apr  8 2025, 11:35:47) [Clang 17.0.0 (clang-1700.0.13.3)]

Library versions:
  pandas: 2.3.3
  boto3: 1.42.2

✅ AWS credentials configured
   Account: 034362031504
   User/Role: Jacob

✅ Environment setup complete!
   You're ready to start analyzing variants.


## Step 4: Load VCF File from S3

**Action Required**: Replace `YOUR_BUCKET_NAME` with your actual S3 bucket name.

In [None]:
# Configure your S3 bucket and VCF file location
BUCKET_NAME = 'YOUR_BUCKET_NAME'  # Replace with your bucket name
VCF_KEY = 'SRR014820/SRR014820.chr19.vcf.gz'
LOCAL_VCF_FILE = 'variants.vcf.gz'

print(f"Downloading VCF file from S3...")
print(f"  Bucket: {BUCKET_NAME}")
print(f"  Key: {VCF_KEY}")
print()

try:
    # Download VCF file from S3
    s3 = boto3.client('s3')
    s3.download_file(BUCKET_NAME, VCF_KEY, LOCAL_VCF_FILE)
    print(f"✅ VCF file downloaded successfully to {LOCAL_VCF_FILE}")
except Exception as e:
    print(f"❌ Error downloading VCF file: {e}")
    print()
    print("Troubleshooting tips:")
    print("  1. Verify the bucket name is correct")
    print("  2. Check that the VCF file exists at the specified path")
    print("  3. Ensure your IAM role has S3 read permissions")
    print("  4. Confirm you completed the Batch Jobs module")
    raise

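In [16]:
# Optional: if the VCF is already in the working directory under its original name,
# point to it instead. Skip this cell if you ran the download step above, which
# saves the file as 'variants.vcf.gz'.
LOCAL_VCF_FILE = 'SRR014820.chr19.vcf.gz'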

## Step 5: Parse VCF into DataFrame

In [17]:
try:
    # Load VCF file into pandas DataFrame
    # Skip header lines (starting with #) and parse tab-separated values.
    # A single-sample VCF has 10 columns: the 8 fixed fields plus FORMAT and one sample column.
    # Naming all 10 here keeps CHROM and POS as regular columns instead of an index.
    vcf_df = pd.read_csv(
        LOCAL_VCF_FILE,
        sep='\t',
        comment='#',
        names=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE'],
        compression='gzip'
    )
    
    print(f"✅ VCF file loaded successfully!")
    print()
    print(f"Total variants found: {len(vcf_df):,}")
    print()
    print("Column summary:")
    print(f"  CHROM: Chromosome identifier")
    print(f"  POS: Position on chromosome")
    print(f"  REF: Reference allele")
    print(f"  ALT: Alternate allele (variant)")
    print(f"  QUAL: Quality score (higher = better)")
    print(f"  FILTER: Quality filter status")
    print(f"  INFO: Additional variant metadata")
    
except FileNotFoundError:
    print(f"❌ VCF file not found: {LOCAL_VCF_FILE}")
    print("   Did you download it from S3 in the previous step?")
    raise
except Exception as e:
    print(f"❌ Error loading VCF file: {e}")
    print()
    print("Troubleshooting tips:")
    print("  1. Verify the file is a valid VCF format")
    print("  2. Check that the file is not corrupted")
    print("  3. Ensure sufficient memory is available")
    raise


✅ VCF file loaded successfully!

Total variants found: 4,206,989

Column summary:
  CHROM: Chromosome identifier
  POS: Position on chromosome
  REF: Reference allele
  ALT: Alternate allele (variant)
  QUAL: Quality score (higher = better)
  FILTER: Quality filter status
  INFO: Additional variant metadata
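
The column names in the loading cell are hard-coded. As a hedged alternative, the names can be read from the file's own `#CHROM` header line, which also handles multi-sample files. The sketch below uses only the standard library; `vcf_header_columns` is a hypothetical helper name, not part of pandas or the workshop code.

```python
import gzip


def vcf_header_columns(path):
    """Return (column_names, header_line_count) for a gzip-compressed VCF.

    Hypothetical helper: scans only the leading header lines, so the
    variant records themselves are never loaded here.
    """
    columns = None
    header_lines = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                # Meta-information line (file format, contigs, INFO definitions, ...)
                header_lines += 1
            elif line.startswith("#CHROM"):
                # Column header line: strip the leading '#' and split on tabs
                columns = line.lstrip("#").rstrip("\r\n").split("\t")
                header_lines += 1
                break
            else:
                # First data record reached without seeing a column header
                break
    if columns is None:
        raise ValueError(f"No #CHROM header line found in {path}")
    return columns, header_lines
```

The result could then be passed to `pd.read_csv(path, sep='\t', names=columns, skiprows=header_lines, compression='gzip')` in place of the fixed list of names.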


## Step 6: Preview Variant Data

In [18]:
# Display the first 10 variants
print("Preview of variant data:")
print()
vcf_df.head(10)

Preview of variant data:



| | CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SAMPLE |
|---|-------|-----|----|-----|-----|------|--------|------|--------|--------|
| 0 | chr19 | 62222 | . | A | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 1 | chr19 | 62223 | . | A | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 2 | chr19 | 62224 | . | C | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 3 | chr19 | 62225 | . | T | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 4 | chr19 | 62226 | . | T | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 5 | chr19 | 62227 | . | G | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 6 | chr19 | 62228 | . | C | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 7 | chr19 | 62229 | . | T | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 8 | chr19 | 62230 | . | T | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |
| 9 | chr19 | 62231 | . | T | . | 38.5872 | . | DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0 | GT | 0/0 |


## Step 7: Basic Variant Statistics

In [19]:
# Display basic statistics 
print("Variant Statistics:")
print("=" * 50)
print()
print(f"Total variants: {len(vcf_df):,}")
print(f"Chromosomes: {vcf_df['CHROM'].unique().tolist()}")
print()

# Convert QUAL to numeric once (handles mixed types efficiently)
qual_numeric = pd.to_numeric(vcf_df['QUAL'], errors='coerce')

print("Quality score statistics:")
print(f"  Mean: {qual_numeric.mean():.2f}")
print(f"  Median: {qual_numeric.median():.2f}")
print(f"  Min: {qual_numeric.min():.2f}")
print(f"  Max: {qual_numeric.max():.2f}")
print()
print("Filter status distribution:")
print(vcf_df['FILTER'].value_counts())
print()
print("✅ VCF data loaded and ready for analysis!")

Variant Statistics:

Total variants: 4,206,989
Chromosomes: ['chr19']

Quality score statistics:
  Mean: 64.10
  Median: 37.59
  Min: 1.99
  Max: 284.59

Filter status distribution:
FILTER
.    4206989
Name: count, dtype: int64

✅ VCF data loaded and ready for analysis!


## Step 8: Filter Variants by Quality

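High-quality variant calls are essential for accurate interpretation. The preview above shows that this VCF also contains reference-matching sites (`ALT` is `.`, genotype `0/0`), so we keep only records that meet all three criteria:
- **ALT != '.'**: The record is an actual variant, not a reference call
- **QUAL > 30**: Quality score threshold (Phred-scaled; 30 corresponds to a 1-in-1,000 chance that the call is wrong)
- **DP > 10**: Read depth threshold (minimum coverage)

These thresholds help ensure we're analyzing reliable variant calls.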
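
To make the QUAL threshold concrete, here is a small sketch of the standard Phred conversion, where a score Q maps to an error probability of 10^(-Q/10). The function name `phred_to_error_probability` is chosen for illustration only.

```python
def phred_to_error_probability(qual: float) -> float:
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-qual / 10)


# Each step of 10 in QUAL makes an erroneous call ten times less likely
for qual in (10, 20, 30):
    print(f"QUAL {qual}: error probability {phred_to_error_probability(qual):.4f}")
```

A threshold of 30 therefore keeps calls with an estimated error probability below 0.001.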

In [21]:
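# Extract depth (DP) from INFO column
# Anchor the match to the start of the string or a ';' so that only the DP key
# itself is matched, not another key whose name happens to end in "DP"
vcf_df['DP'] = (
    vcf_df['INFO']
    .str.extract(r'(?:^|;)DP=(\d+)', expand=False)
    .fillna(0)
    .astype(int)
)

print("✅ Depth (DP) values extracted from INFO column")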
print()
print("Depth statistics:")
print(f"  Mean: {vcf_df['DP'].mean():.1f}")
print(f"  Median: {vcf_df['DP'].median():.1f}")
print(f"  Min: {vcf_df['DP'].min()}")
print(f"  Max: {vcf_df['DP'].max()}")

✅ Depth (DP) values extracted from INFO column

Depth statistics:
  Mean: 3.7
  Median: 1.0
  Min: 0
  Max: 251
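
The regular expression above pulls out a single key. When several INFO keys are needed, splitting the field into a dictionary may be easier to read. This is a minimal sketch, not the notebook's method; `parse_info` is a hypothetical helper, and values are left as strings.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into {key: value}; flag-only keys map to True."""
    fields = {}
    if info in ("", "."):
        # '.' marks an empty INFO field
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without '=' are flags (present or absent), e.g. INDEL
        fields[key] = value if sep else True
    return fields


example = parse_info("DP=3;MQ0F=1;AN=2;DP4=0,3,0,0;MQ=0")
print(example["DP"], example["DP4"])
```

Note that `DP` and `DP4` end up as separate keys, so there is no ambiguity between them.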


In [42]:
# Store original count for comparison
original_count = len(vcf_df)

# Apply quality filters
QUAL_THRESHOLD = 30
DP_THRESHOLD = 10

# Convert QUAL to numeric (handles '.' and other non-numeric values)
vcf_df['QUAL'] = pd.to_numeric(vcf_df['QUAL'], errors='coerce')

filtered_df = vcf_df[
    (vcf_df['ALT'] != '.') &           # Must be an actual variant, not reference
    (vcf_df['QUAL'] > QUAL_THRESHOLD) & 
    (vcf_df['DP'] > DP_THRESHOLD)
].copy()

filtered_count = len(filtered_df)

print("Quality Filtering Results")
print("=" * 50)
print()
print(f"Filter criteria:")
print(f"  ALT != '.' (actual variants only)")
print(f"  QUAL > {QUAL_THRESHOLD} (quality score)")
print(f"  DP > {DP_THRESHOLD} (read depth)")
print()
print(f"Variant counts:")
print(f"  Before filtering: {original_count:,}")
print(f"  After filtering:  {filtered_count:,}")
print(f"  Removed:          {original_count - filtered_count:,} ({(original_count - filtered_count) / original_count * 100:.1f}%)")
print()
print("✅ Quality filtering complete!")

Quality Filtering Results

Filter criteria:
  ALT != '.' (actual variants only)
  QUAL > 30 (quality score)
  DP > 10 (read depth)

Variant counts:
  Before filtering: 4,206,989
  After filtering:  125
  Removed:          4,206,864 (100.0%)

✅ Quality filtering complete!


In [43]:
# Sort by quality score (descending) and select top 10 variants
top_variants = filtered_df.sort_values('QUAL', ascending=False).head(10)

# Display top variants as a formatted table
display_columns = ['CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'DP', 'FILTER']

# Create a clean display DataFrame
display_df = top_variants[display_columns].reset_index(drop=True)
display_df.index = display_df.index + 1  # Start index at 1 for readability
display_df.index.name = 'Rank'

# Format QUAL column to 2 decimal places
display_df['QUAL'] = display_df['QUAL'].round(2)

print("Top 10 Variants")
display_df

Top 10 Variants


| Rank | CHROM | POS | REF | ALT | QUAL | DP | FILTER |
|------|-------|-----|-----|-----|------|----|--------|
| 1 | chr19 | 37565861 | C | G | 228.42 | 99 | . |
| 2 | chr19 | 55810048 | C | A | 228.42 | 98 | . |
| 3 | chr19 | 2853568 | A | G | 228.41 | 94 | . |
| 4 | chr19 | 20949405 | A | T | 228.41 | 31 | . |
| 5 | chr19 | 53479794 | A | G | 228.22 | 44 | . |
| 6 | chr19 | 21423627 | G | T | 225.93 | 60 | . |
| 7 | chr19 | 56424158 | A | G | 225.42 | 92 | . |
| 8 | chr19 | 20949588 | C | T | 225.42 | 25 | . |
| 9 | chr19 | 35831368 | C | T | 225.42 | 14 | . |
| 10 | chr19 | 2934284 | T | C | 225.42 | 12 | . |


### Understanding the Results

The table above shows the top 10 variants sorted by quality score:

| Column | Description |
|--------|-------------|
| CHROM | Chromosome where the variant is located |
| POS | Position on the chromosome |
| REF | Reference allele (original DNA base) |
| ALT | Alternate allele (variant DNA base) |
| QUAL | Quality score (higher = more confident) |
| DP | Read depth (number of reads covering this position) |
| FILTER | Filter status (`PASS` = passed all filters; `.` = no filters were applied, as in this file) |

These high-quality variants are ideal candidates for AI-powered interpretation.

## Step 9: Create Bioinformatics AI Agent

The system prompt defines the agent's expertise and behavior. This is where **prompt engineering** happens.

In [26]:
# Define the system prompt - keep it concise to avoid token limits
SYSTEM_PROMPT = """You are a clinical geneticist. Analyze genomic variants briefly.

For each variant, provide:
1. Variant type (SNV, indel, etc.)
2. Likely clinical significance (pathogenic/benign/uncertain)
3. One recommended follow-up action

Keep responses under 100 words. Be direct and concise."""

In [None]:
# Create the bioinformatics agent using Strands SDK
try:
    agent = Agent(
        model="us.amazon.nova-2-lite-v1:0",  # Amazon Nova 2 Lite model
        system_prompt=SYSTEM_PROMPT      # Our clinical geneticist prompt
    )
    
    print("✅ Bioinformatics AI agent created successfully!")
    print(f"   Model: Amazon Nova 2 Lite")
    print(f"   Region: us-west-2")
    print(f"   Role: Expert Clinical Geneticist")
    
except Exception as e:
    print(f"❌ Error creating agent: {e}")
    print("   Check IAM permissions for Bedrock (bedrock:InvokeModel)")
    print("   See workshop documentation section 530 for setup instructions")
    raise

✅ Bioinformatics AI agent created successfully!
   Model: Amazon Nova 2 Lite
   Region: us-west-2
   Role: Expert Clinical Geneticist


### Prompt Engineering Tips

You can customize the `SYSTEM_PROMPT` for different analysis needs:
- Focus on specific disease areas (e.g., "specializing in cancer genomics")
- Change response style (e.g., "explain in simple terms")
- Add specific guidelines (e.g., "always cite ClinVar")
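
One way to apply these tips without editing the prompt by hand each time is to assemble it from parts. The sketch below is illustrative only: `build_system_prompt` and `BASE_PROMPT` are hypothetical names, and the base text is a shortened version of the notebook's `SYSTEM_PROMPT`.

```python
BASE_PROMPT = "You are a clinical geneticist. Analyze genomic variants briefly."


def build_system_prompt(specialty=None, style=None, guidelines=()):
    """Compose a system prompt from a base role plus optional customizations."""
    parts = [BASE_PROMPT]
    if specialty:
        parts.append(f"You specialize in {specialty}.")
    if style:
        parts.append(f"Response style: {style}.")
    # Each guideline is expected to be a complete sentence
    parts.extend(guidelines)
    return "\n".join(parts)


oncology_prompt = build_system_prompt(
    specialty="cancer genomics",
    style="explain in simple terms",
    guidelines=["Always cite ClinVar."],
)
print(oncology_prompt)
```

The resulting string can be passed as `system_prompt` when creating an agent, in the same way as `SYSTEM_PROMPT`.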

### Model Switching

To try Claude Sonnet 4 instead of Nova:

```python
agent_claude = Agent(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt=SYSTEM_PROMPT
)
```

## Step 10: Interactive Variant Interpretation

Now let's use our AI agent to interpret variants! We'll demonstrate several query types:

1. **Basic Query**: Get an overview of the top variants
2. **Specific Variant Interpretation**: Analyze a single variant in detail
3. **Clinical Significance Assessment**: Identify potentially pathogenic variants
4. **Gene Identification**: Determine which genes are affected

### Helper Function for Displaying Agent Responses

This function formats the agent's responses for better readability.

In [44]:
def display_agent_response(response, query_type="Query"):
    """Display agent response with formatted output.
    
    Args:
        response: The agent response object
        query_type: A label describing the type of query
    """
    print(f"{'=' * 60}")
    print(f"🧬 {query_type}")
    print(f"{'=' * 60}")
    print()
    
    try:
        # Extract text content from response
        text = response.message["content"][0]["text"]
        print(text)
    except (KeyError, IndexError, TypeError) as e:
        print(f"❌ Error extracting response: {e}")
        print(f"Raw response: {response}")
    
    print()
    print(f"{'=' * 60}")

print("✅ Helper function defined")

✅ Helper function defined


### Query 1: Basic Overview of Top Variants

Let's start with a simple query asking the agent to summarize our top variants.

In [45]:
# Prepare variant data as a formatted string for the agent
def format_variants_for_agent(df, max_variants=10):
    """Format variant DataFrame as a string for the agent.
    
    Args:
        df: DataFrame containing variant data
        max_variants: Maximum number of variants to include
    
    Returns:
        Formatted string representation of variants
    """
    variants_str = "Variant Data:\n"
    variants_str += "-" * 80 + "\n"
    
    for idx, row in df.head(max_variants).iterrows():
        variants_str += f"Variant: {row['CHROM']}:{row['POS']} {row['REF']}>{row['ALT']}\n"
        variants_str += f"  Quality: {row['QUAL']:.2f}, Depth: {row['DP']}, Filter: {row['FILTER']}\n"
    
    return variants_str

# Format our top variants
variants_context = format_variants_for_agent(top_variants)
print("Variant data prepared for agent:")
print(variants_context)

Variant data prepared for agent:
Variant Data:
--------------------------------------------------------------------------------
Variant: chr19:37565861 C>G
  Quality: 228.42, Depth: 99, Filter: .
Variant: chr19:55810048 C>A
  Quality: 228.42, Depth: 98, Filter: .
Variant: chr19:2853568 A>G
  Quality: 228.41, Depth: 94, Filter: .
Variant: chr19:20949405 A>T
  Quality: 228.41, Depth: 31, Filter: .
Variant: chr19:53479794 A>G
  Quality: 228.22, Depth: 44, Filter: .
Variant: chr19:21423627 G>T
  Quality: 225.93, Depth: 60, Filter: .
Variant: chr19:56424158 A>G
  Quality: 225.42, Depth: 92, Filter: .
Variant: chr19:20949588 C>T
  Quality: 225.42, Depth: 25, Filter: .
Variant: chr19:35831368 C>T
  Quality: 225.42, Depth: 14, Filter: .
Variant: chr19:2934284 T>C
  Quality: 225.42, Depth: 12, Filter: .



In [46]:
# Basic Query: Ask for an overview of the variants
basic_query = f"""I have analyzed a VCF file and identified the following high-quality variants:

{variants_context}

Please provide a brief overview of these variants. What types of variants are present (SNVs vs indels)? 
Are there any patterns you notice in terms of position or quality?"""

print("Sending query to AI agent...")
print()

try:
    response = agent(basic_query)
    display_agent_response(response, "Basic Overview")
except Exception as e:
    print(f"❌ Error querying agent: {e}")
    print("   Check your Bedrock permissions and network connectivity")

Sending query to AI agent...

### Overview of Identified Variants

#### Types of Variants
All the variants listed are **Single Nucleotide Variants (SNVs)** rather than insertions or deletions (indels). This is indicated by the format `C>G`, `C>A`, `A>G`, etc., which represent single nucleotide substitutions.

#### Patterns in Position and Quality

1. **Position**:
   - The variants are spread across chromosome 19 but do not show a clear clustering pattern. They are located at various positions ranging from 2853568 to 56424158.

2. **Quality**:
   - The quality scores for the variants are uniformly high, ranging from 225.42 to 228.42. This indicates that these are high-confidence variant calls.

### Detailed Analysis

1. **Variant: chr19:37565861 C>G**
   - **Quality**: 228.42
   - **Depth**: 99

2. **Variant: chr19:55810048 C>A**
   - **Quality**: 228.42
   - **Depth**: 98

3. **Variant: chr19:2853568 A>G**
   - **Quality**: 228.41
   - **Depth**: 94

4. **Variant: chr19:20949405 A>T**

### Query 2: Specific Variant Interpretation

Now let's ask the agent to interpret a specific variant in detail.

In [47]:
# Select the highest quality variant for detailed interpretation
top_variant = top_variants.iloc[0]

specific_query = f"""Please provide a detailed interpretation of this genomic variant:

Chromosome: {top_variant['CHROM']}
Position: {top_variant['POS']}
Reference Allele: {top_variant['REF']}
Alternate Allele: {top_variant['ALT']}
Quality Score: {top_variant['QUAL']:.2f}
Read Depth: {top_variant['DP']}

Please include:
1. What type of variant is this (SNV, insertion, deletion)?
2. What is the potential molecular consequence?
3. What genes might be affected at this chromosomal location?
4. What databases would you recommend checking for more information?"""

print(f"Analyzing variant: {top_variant['CHROM']}:{top_variant['POS']} {top_variant['REF']}>{top_variant['ALT']}")
print()

try:
    response = agent(specific_query)
    display_agent_response(response, "Specific Variant Interpretation")
except Exception as e:
    print(f"❌ Error querying agent: {e}")

Analyzing variant: chr19:37565861 C>G

### Detailed Interpretation of the Genomic Variant

#### 1. Type of Variant
This variant is a **Single Nucleotide Variant (SNV)**. Specifically, it is a substitution where the reference allele C is changed to the alternate allele G at position 37565861 on chromosome 19.

#### 2. Potential Molecular Consequence
The potential molecular consequences of this SNV depend on its location within the gene:
- **Missense Mutation**: If the variant occurs in a coding region and changes an amino acid, it could potentially alter the protein's function.
- **Synonymous Mutation**: If the variant does not change the amino acid sequence of the protein, it is less likely to have a significant effect.
- **Regulatory Region**: If the variant occurs in a regulatory region, it could affect gene expression levels.

Given the high-quality score (228.42) and read depth (99), this is a reliable variant call.

#### 3. Genes Affected at This Chromosomal Location
The position 

### Query 3: Clinical Significance Assessment

Ask the agent to assess which variants might be clinically significant.

In [48]:
# Clinical significance query
clinical_query = f"""Based on the following variants from chromosome 19, please assess their potential clinical significance:

{variants_context}

For each variant, please:
1. Classify as: Pathogenic, Likely Pathogenic, Uncertain Significance, Likely Benign, or Benign
2. Explain your reasoning briefly
3. Identify which variants should be prioritized for follow-up

Note: This is for educational purposes. Clinical decisions should involve certified genetic counselors."""

print("Requesting clinical significance assessment...")
print()

try:
    response = agent(clinical_query)
    display_agent_response(response, "Clinical Significance Assessment")
except Exception as e:
    print(f"❌ Error querying agent: {e}")

Requesting clinical significance assessment...

### Assessment of Clinical Significance for Variants on Chromosome 19

#### 1. Variant: chr19:37565861 C>G
- **Classification**: Uncertain Significance
- **Reasoning**: This SNV has a high-quality score and read depth, indicating it is a reliable variant. However, without additional context, its clinical significance is uncertain. It falls within the BRCA2 gene, which is associated with hereditary breast and ovarian cancer syndrome.
- **Follow-up Priority**: Medium

#### 2. Variant: chr19:55810048 C>A
- **Classification**: Uncertain Significance
- **Reasoning**: Similar to the above, this SNV is reliable but its clinical significance is uncertain. It does not fall within a well-known cancer-associated gene.
- **Follow-up Priority**: Low

#### 3. Variant: chr19:2853568 A>G
- **Classification**: Uncertain Significance
- **Reasoning**: This SNV is reliable but its clinical significance is uncertain. It does not fall within a well-known cance

### Query 4: Gene Identification

Ask the agent to identify genes that might be affected by these variants.

In [49]:
# Gene identification query
gene_query = f"""I need to identify which genes are potentially affected by these chromosome 19 variants:

{variants_context}

Please:
1. List any known genes at or near these positions on chromosome 19
2. Describe the biological function of each identified gene
3. Explain what diseases or conditions are associated with variants in these genes
4. Suggest which variants are most likely to have functional consequences"""

print("Identifying affected genes...")
print()

try:
    response = agent(gene_query)
    display_agent_response(response, "Gene Identification")
except Exception as e:
    print(f"❌ Error querying agent: {e}")

Identifying affected genes...

### Identification of Genes Affected by Chromosome 19 Variants

#### 1. Known Genes at or Near These Positions on Chromosome 19

- **Variant chr19:37565861 C>G**
  - **Gene**: BRCA2
  - **Function**: BRCA2 is a tumor suppressor gene that plays a crucial role in the repair of DNA double-strand breaks through homologous recombination. It is involved in maintaining genomic stability.

- **Variant chr19:55810048 C>A**
  - **Gene**: No significant genes at this position.

- **Variant chr19:2853568 A>G**
  - **Gene**: No significant genes at this position.

- **Variant chr19:20949405 A>T**
  - **Gene**: No significant genes at this position.

- **Variant chr19:53479794 A>G**
  - **Gene**: No significant genes at this position.

- **Variant chr19:21423627 G>T**
  - **Gene**: No significant genes at this position.

- **Variant chr19:56424158 A>G**
  - **Gene**: No significant genes at this position.

- **Variant chr19:20949588 C>T**
  - **Gene**: No significant g

### Query 5: Custom Query (Try Your Own!)

Now it's your turn! Modify the query below to ask the agent anything about your variants.

In [None]:
# Custom query - modify this to ask your own questions!
custom_query = f"""Here are my variant data:

{variants_context}

YOUR QUESTION HERE: What follow-up tests would you recommend for validating these variants?
"""

print("Sending custom query...")
print()

try:
    response = agent(custom_query)
    display_agent_response(response, "Custom Query Response")
except Exception as e:
    print(f"❌ Error querying agent: {e}")
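
The summary cell that follows reads a list named `interpretations`, which is not defined in any cell shown in this notebook. The sketch below is one plausible way to build such a list, not the original code. `build_interpretations` is a hypothetical helper, and `ask` stands for any callable that takes a prompt string and returns the model's text.

```python
def build_interpretations(variants, ask):
    """Return one record per variant with an added 'interpretation' field.

    variants: iterable of dicts with CHROM, POS, REF, ALT, QUAL and DP keys
    ask: callable taking a prompt string and returning response text
    """
    records = []
    for variant in variants:
        prompt = (
            "Interpret this genomic variant in two or three sentences: "
            f"{variant['CHROM']}:{variant['POS']} {variant['REF']}>{variant['ALT']} "
            f"(QUAL {variant['QUAL']:.2f}, DP {variant['DP']})"
        )
        record = dict(variant)  # copy so the input is not modified
        record["interpretation"] = ask(prompt)
        records.append(record)
    return records
```

With the objects defined earlier, a call might look like `interpretations = build_interpretations(top_variants.to_dict("records"), lambda p: agent(p).message["content"][0]["text"])`.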

In [50]:
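# Create DataFrame with interpretations
# Expects `interpretations`: a list of dicts, one per variant, with the keys
# CHROM, POS, REF, ALT, QUAL, DP and 'interpretation'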
interpretations_df = pd.DataFrame(interpretations)

print("Variant Interpretations Summary")
print("=" * 60)
print()

for idx, row in interpretations_df.iterrows():
    print(f"Variant {idx + 1}: {row['CHROM']}:{row['POS']} {row['REF']}>{row['ALT']}")
    print(f"Quality: {row['QUAL']:.2f}, Depth: {row['DP']}")
    print(f"Interpretation:")
    print(f"  {row['interpretation'][:500]}..." if len(row['interpretation']) > 500 else f"  {row['interpretation']}")
    print()
    print("-" * 60)
    print()

Variant Interpretations Summary

Variant 1: chr19:42754284 A>.
Quality: 284.59, Depth: 107
Interpretation:
  This variant on chromosome 19 at position 42754284 is a deletion of the nucleotide A, indicated by the empty alternate allele field. Given its high-quality score and read depth, it is a reliable variant with potential clinical significance, particularly if it occurs within a gene known to be associated with hereditary cancer syndromes, such as BRCA2.

------------------------------------------------------------

Variant 2: chr19:36143123 A>.
Quality: 284.59, Depth: 52
Interpretation:
  This variant on chromosome 19 at position 36143123 is a deletion of the nucleotide A, indicated by the empty alternate allele field. With a high-quality score and read depth, it is a reliable variant that could potentially disrupt gene function, particularly if it occurs within a gene associated with hereditary cancer syndromes, such as BRCA2.

-----------------------------------------------------

### Understanding Agent Responses

The AI agent provides interpretations based on:
- **Genomic position**: Location on the chromosome
- **Variant type**: SNV, insertion, or deletion
- **Quality metrics**: Confidence in the variant call
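- **Known gene associations**: Genes the model recalls at or near the position. The agent has no annotation tool in this notebook, so these come from the model's training data and can be wrong.

**Important**: AI interpretations are for educational purposes. For example, the responses above place chr19 variants in BRCA2, which is actually located on chromosome 13. Clinical decisions should always involve: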
- Certified genetic counselors
- Validated clinical databases (ClinVar, OMIM)
- Laboratory confirmation
- Patient clinical context
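
Some errors in a response can be caught automatically. The responses above attribute chr19 variants to BRCA2, a gene on chromosome 13, and a simple consistency check would flag that. The sketch below is illustrative: `GENE_CHROMOSOMES` is a tiny hand-written table and `find_chromosome_mismatches` is a hypothetical helper. A real workflow would query an annotation source such as VEP or ClinVar instead.

```python
import re

# Illustrative lookup only; not a substitute for a real annotation database
GENE_CHROMOSOMES = {
    "BRCA1": "chr17",
    "BRCA2": "chr13",
    "TP53": "chr17",
    "APOE": "chr19",
}


def find_chromosome_mismatches(response_text, variant_chrom):
    """Return (gene, actual_chromosome) pairs for genes mentioned in the
    response that do not lie on the variant's chromosome."""
    mismatches = []
    for gene, chrom in GENE_CHROMOSOMES.items():
        mentioned = re.search(rf"\b{re.escape(gene)}\b", response_text)
        if mentioned and chrom != variant_chrom:
            mismatches.append((gene, chrom))
    return mismatches


text = "It falls within the BRCA2 gene, which is associated with hereditary cancer."
print(find_chromosome_mismatches(text, "chr19"))
```

A non-empty result means the response names a gene that cannot contain the variant, which is a strong signal to discard or re-check that interpretation.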

## Step 11: Model Comparison - Nova vs Claude

One of the powerful features of the Strands SDK is how easy it is to switch between different foundation models. Let's compare Amazon Nova 2 Lite with Claude Sonnet 4.5 to see how different models interpret the same variant.

### Model Characteristics

| Feature | Amazon Nova 2 Lite | Claude Sonnet 4.5 |
|---------|------------------|------------------|
| Provider | Amazon | Anthropic |
| Strengths | Fast responses, cost-effective | Nuanced reasoning, detailed explanations |
| Best for | Quick assessments, high throughput | Complex analysis, detailed reports |
| Response style | Concise, direct | Thorough, contextual |

Both models work with the same system prompt - the only change is the model identifier!

In [None]:
try:
    agent_claude = Agent(
        model="us.anthropic.claude-sonnet-4-5-20250929-v1:0", 
        system_prompt=SYSTEM_PROMPT  # Same prompt as Nova!
    )
    
    print("✅ Claude Sonnet 4.5 agent created successfully!")
    print(f"   Model: Claude Sonnet 4.5")
    print(f"   Role: Expert Clinical Geneticist (same as Nova)")
    print()
    print("💡 Notice: We only changed the model parameter!")
    print("   The system prompt remains identical.")
    
except Exception as e:
    print(f"❌ Error creating Claude agent: {e}")
    print("   Ensure your IAM policy includes Claude model permissions")
    print("   Required ARN: arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0")

✅ Claude Sonnet 4.5 agent created successfully!
   Model: Claude Sonnet 4.5
   Role: Expert Clinical Geneticist (same as Nova)

💡 Notice: We only changed the model parameter!
   The system prompt remains identical.


In [52]:
# Compare responses from both models on the same variant
comparison_variant = top_variants.iloc[0]

comparison_query = f"""Interpret this genomic variant concisely:
Position: {comparison_variant['CHROM']}:{comparison_variant['POS']}
Change: {comparison_variant['REF']} > {comparison_variant['ALT']}
Quality: {comparison_variant['QUAL']:.2f}

Provide: variant type, potential impact, and one recommended follow-up action."""

print("Comparing model responses for the same variant...")
print(f"Variant: {comparison_variant['CHROM']}:{comparison_variant['POS']} {comparison_variant['REF']}>{comparison_variant['ALT']}")
print()

# Get Nova response
print("=" * 60)
print("🔵 AMAZON NOVA LITE RESPONSE")
print("=" * 60)
try:
    nova_response = agent(comparison_query)
    print(nova_response.message["content"][0]["text"])
except Exception as e:
    print(f"Error: {e}")

print()

# Get Claude response
print("=" * 60)
print("🟣 CLAUDE SONNET 4.5 RESPONSE")
print("=" * 60)
try:
    claude_response = agent_claude(comparison_query)
    print(claude_response.message["content"][0]["text"])
except Exception as e:
    print(f"Error: {e}")

Comparing model responses for the same variant...
Variant: chr19:37565861 C>G

🔵 AMAZON NOVA LITE RESPONSE
**Variant Type**: Single Nucleotide Variant (SNV)

**Potential Impact**: The variant C > G at position chr19:37565861 is located within the BRCA2 gene. Given its high-quality score and the context of the BRCA2 gene, this variant is likely to have significant clinical implications. It could potentially disrupt the function of BRCA2, leading to increased cancer risk.

**Recommended Follow-up Action**: Consult ClinVar and OMIM to check for any reported clinical significance of this variant and assess its impact on the BRCA2 gene.

### Key Takeaways from Model Comparison

**What to notice:**
- Both models use the same system prompt and query
- Response styles may differ (length, structure, detail level)
- Clinical recommendations may vary based on model training

**When to use each model:**
- **Nova 2 Lite**: High-throughput screening, quick assessments, cost-sensitive workflows
- **Claude Sonnet 4.5**: Detailed case analysis, complex variant interpretation, report generation

**Switching models is just one line of code!**
```python
# Nova
agent = Agent(model="us.amazon.nova-2-lite-v1:0", system_prompt=SYSTEM_PROMPT)

# Claude
agent = Agent(model="us.anthropic.claude-sonnet-4-5-20250929-v1:0", system_prompt=SYSTEM_PROMPT)
```
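
Because each agent is called the same way, a comparison across several models can be written as a loop. The sketch below is a hedged illustration rather than part of the Strands SDK: `compare_models` is a hypothetical helper that accepts any callables returning text, and it records wall-clock time alongside each answer.

```python
import time


def compare_models(models, query):
    """Send one query to several models.

    models: dict mapping a label to a callable(query) -> response text
    Returns {label: (text, seconds)}; failures are captured as error text
    so one unavailable model does not stop the comparison.
    """
    results = {}
    for label, ask in models.items():
        start = time.perf_counter()
        try:
            text = ask(query)
        except Exception as exc:
            text = f"Error: {exc}"
        results[label] = (text, time.perf_counter() - start)
    return results
```

With the agents from this notebook, each callable could be a small wrapper such as `lambda q: agent(q).message["content"][0]["text"]`.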

You've successfully completed the AI-Powered Variant Interpretation module. Let's recap what you accomplished.

### What You Learned

In this notebook, you:

1. **Loaded and parsed VCF data** - Converted raw variant calls into a structured pandas DataFrame
2. **Applied quality filters** - Identified high-confidence variants using QUAL and DP thresholds
3. **Created an AI agent** - Built a specialized bioinformatics assistant using Amazon Bedrock and the Strands SDK
4. **Interpreted variants with natural language** - Asked questions about your data without writing complex queries
5. **Compared foundation models** - Saw how Nova 2 Lite and Claude Sonnet 4.5 provide different perspectives

### Key Takeaways

| Concept | What You Learned |
|---------|------------------|
| **Prompt Engineering** | System prompts shape AI behavior for domain-specific tasks |
| **Model Selection** | Different models excel at different tasks (speed vs. detail) |
| **Conversational Analysis** | Natural language queries simplify complex genomic analysis |
| **Quality Filtering** | QUAL > 30 and DP > 10 are reasonable starting thresholds |

### Next Steps

Here are some ways to extend what you've learned:

**Enhance Your Analysis**
- Integrate with annotation databases (ClinVar, gnomAD) for richer context
- Add gene annotation using tools like VEP or SnpEff
- Build automated pipelines that combine variant calling with AI interpretation

**Customize Your Agent**
- Modify the system prompt for specific disease areas (oncology, rare disease, pharmacogenomics)
- Add tools to your agent for database lookups or literature search
- Create multi-agent workflows for comprehensive variant review

### Useful Resources

- [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)
- [Strands Agents SDK](https://github.com/strands-agents/strands-agents-python)
- [ClinVar Database](https://www.ncbi.nlm.nih.gov/clinvar/)
- [VCF Specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf)