# Analyzing Variant Effects on Gene Expression: SNPs, Indels, and Multi-Variant Interactions

## Overview

This notebook demonstrates how to analyze the functional impact of genetic variants using VariantFormer. You can:

- **Test Individual Variants**: Examine how single SNPs or indels affect gene expression/embeddings
- **Multi-Variant Interactions**: Add new variants to an existing VCF file to study their combined effect
- **Compare Variant Types**: Understand differential impacts of SNPs vs indels (insertions/deletions)
- **Extract Gene Embeddings**: Access learned representations for downstream analysis
- **Tissue-Specific Responses**: Compare variant effects across different tissues

In [None]:
# Essential imports
import sys
import os
import subprocess
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import ipynbname

# Add project root to path
REPO_PATH = ipynbname.path().parent.parent
sys.path.insert(0, str(REPO_PATH))

from processors.vcfprocessor import VCFProcessor

print("✅ Imports successful!")


In [None]:
# Initialize VCFProcessor
print("🚀 Initializing VCFProcessor...")
vcf_processor = VCFProcessor(model_class='v4_ag')
print("✅ VCFProcessor initialized!")
print(f"📂 Fasta path: {vcf_processor.vcf_loader_config.fasta_path}")


## Step 1: Define Variants for Analysis

We'll test 5 variants near the APOE gene: 3 SNPs and 2 indels (1 insertion, 1 deletion).


In [None]:
# Create variant dataframe with pre-validated reference alleles
# These variants are located near/within the APOE gene on chr19
variant_df = pd.DataFrame([
    # SNPs (Single Nucleotide Polymorphisms)
    {'chrom': 'chr19', 'pos': 44900754, 'ref': 'A', 'alt': 'G', 'GT': '0/1', 'type': 'SNP'},
    {'chrom': 'chr19', 'pos': 44906754, 'ref': 'G', 'alt': 'T', 'GT': '1/1', 'type': 'SNP'},
    {'chrom': 'chr19', 'pos': 44907754, 'ref': 'A', 'alt': 'C', 'GT': '0/1', 'type': 'SNP'},
    
    # Indels (Insertions and Deletions)
    {'chrom': 'chr19', 'pos': 44908754, 'ref': 'T', 'alt': 'TTG', 'GT': '0/1', 'type': 'insertion'},
    {'chrom': 'chr19', 'pos': 44909754, 'ref': 'CCG', 'alt': 'C', 'GT': '1/1', 'type': 'deletion'},
])

print("✅ Variants defined:")
print(f"   3 SNPs: heterozygous (0/1) and homozygous alt (1/1)")
print(f"   2 Indels: 1 insertion (T→TTG), 1 deletion (CCG→C)")
print(f"\n📊 Variant DataFrame:")
print(variant_df[['chrom', 'pos', 'ref', 'alt', 'GT', 'type']])
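
Before writing the VCF, it can help to sanity-check the table. The sketch below is not part of VariantFormer: it uses only pandas, and the helper names `classify_variant` and `alt_dosage` are hypothetical. It infers each variant's type from its allele lengths and counts non-reference alleles per genotype, assuming diploid GT strings such as `0/1` or `1|1`.

```python
import pandas as pd

# Same five variants as in the cell above (the 'type' column is re-derived below).
variants = pd.DataFrame([
    {"chrom": "chr19", "pos": 44900754, "ref": "A", "alt": "G", "GT": "0/1"},
    {"chrom": "chr19", "pos": 44906754, "ref": "G", "alt": "T", "GT": "1/1"},
    {"chrom": "chr19", "pos": 44907754, "ref": "A", "alt": "C", "GT": "0/1"},
    {"chrom": "chr19", "pos": 44908754, "ref": "T", "alt": "TTG", "GT": "0/1"},
    {"chrom": "chr19", "pos": 44909754, "ref": "CCG", "alt": "C", "GT": "1/1"},
])


def classify_variant(ref: str, alt: str) -> str:
    """Hypothetical helper: classify a variant by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "MNP"  # equal-length, multi-base substitution


def alt_dosage(gt: str) -> int:
    """Hypothetical helper: count non-reference alleles in a diploid genotype."""
    alleles = gt.replace("|", "/").split("/")
    return sum(allele not in ("0", ".") for allele in alleles)


variants["inferred_type"] = [
    classify_variant(ref, alt) for ref, alt in zip(variants["ref"], variants["alt"])
]
variants["alt_dosage"] = variants["GT"].map(alt_dosage)
print(variants)
```

If `inferred_type` disagrees with a hand-written `type` label, the alleles were probably entered in the wrong order.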


## Step 2: Test VCF Creation

### Test Case 1: Create a New VCF File


In [None]:
# Create output directory for test files
output_dir = Path("/work/notebooks/test_output")
output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

# Test Case 1: Create new VCF file (no merging)
output_vcf_1 = output_dir / "test_variants_set1.vcf.gz"

print("🚀 Test Case 1: Creating new VCF file...")
print(f"   Output: {output_vcf_1}")

result_file = vcf_processor.create_vcf_from_variant(
    variant_df=variant_df,
    output_path=str(output_vcf_1),
    vcf_path=None  # No merging
)

print(f"✅ VCF file created: {result_file}")
print(f"   File exists: {Path(result_file).exists()}")
print(f"   Index exists: {Path(result_file + '.tbi').exists()}")
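
How `create_vcf_from_variant` writes its records is not shown in this notebook. As a rough illustration of the kind of file it produces, the sketch below renders a variant table as uncompressed VCF 4.2 text using only the standard library. The function name `to_vcf_text` and the sample name `SAMPLE` are assumptions made for this example, the header omits the `##contig` lines a complete file would usually carry, and compression and indexing are out of scope.

```python
def to_vcf_text(records, sample="SAMPLE"):
    """Render variant dicts as minimal VCF 4.2 text (illustrative only)."""
    header = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + sample,
    ]
    body = []
    # Records must be position-sorted for indexing; all records here share one
    # chromosome, so a plain (chrom, pos) sort is sufficient.
    for rec in sorted(records, key=lambda r: (r["chrom"], r["pos"])):
        fields = [
            rec["chrom"], str(rec["pos"]), ".", rec["ref"], rec["alt"],
            ".", "PASS", ".", "GT", rec["GT"],
        ]
        body.append("\t".join(fields))
    return "\n".join(header + body) + "\n"


records = [
    {"chrom": "chr19", "pos": 44908754, "ref": "T", "alt": "TTG", "GT": "0/1"},
    {"chrom": "chr19", "pos": 44900754, "ref": "A", "alt": "G", "GT": "0/1"},
]
vcf_text = to_vcf_text(records)
print(vcf_text)
```

Note that the insertion is written with its anchoring reference base (`T` to `TTG`), which is how VCF represents indels.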


### Test Case 2: Merge Additional Variants


In [None]:
# Create a second set of variants to test VCF merging
variant_df_2 = pd.DataFrame([
    {'chrom': 'chr19', 'pos': 44910754, 'ref': 'C', 'alt': 'A', 'GT': '0/1'},
    {'chrom': 'chr19', 'pos': 44911754, 'ref': 'A', 'alt': 'T', 'GT': '1/1'},
])

print("📊 Second variant set (for merge test):")
print(variant_df_2)


In [None]:
# Test Case 2: Merge with existing VCF
output_vcf_2 = output_dir / "test_variants_merged.vcf.gz"

print("\n🚀 Test Case 2: Merging with existing VCF...")
print(f"   Existing VCF: {result_file}")
print(f"   Output: {output_vcf_2}")

result_file_merged = vcf_processor.create_vcf_from_variant(
    variant_df=variant_df_2,
    output_path=str(output_vcf_2),
    vcf_path=str(result_file)  # Merge with first VCF
)

print(f"✅ Merged VCF file created: {result_file_merged}")
print(f"   File exists: {Path(result_file_merged).exists()}")
print(f"   Index exists: {Path(result_file_merged + '.tbi').exists()}")
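
The notebook does not say how a site present in both the existing VCF and the new table is resolved during merging. A quick collision check on the tables themselves can flag that case before any file is written. This is a pandas-only sketch, independent of `VCFProcessor`; the variable names are illustrative.

```python
import pandas as pd

# Positions and alleles of the two variant sets used in this notebook.
set1 = pd.DataFrame([
    {"chrom": "chr19", "pos": 44900754, "ref": "A", "alt": "G"},
    {"chrom": "chr19", "pos": 44906754, "ref": "G", "alt": "T"},
    {"chrom": "chr19", "pos": 44907754, "ref": "A", "alt": "C"},
    {"chrom": "chr19", "pos": 44908754, "ref": "T", "alt": "TTG"},
    {"chrom": "chr19", "pos": 44909754, "ref": "CCG", "alt": "C"},
])
set2 = pd.DataFrame([
    {"chrom": "chr19", "pos": 44910754, "ref": "C", "alt": "A"},
    {"chrom": "chr19", "pos": 44911754, "ref": "A", "alt": "T"},
])

# Stack both sets and order them by genomic position.
combined = (
    pd.concat([set1, set2], ignore_index=True)
    .sort_values(["chrom", "pos"])
    .reset_index(drop=True)
)

# Any site listed more than once would need an explicit decision before merging.
collisions = combined[combined.duplicated(subset=["chrom", "pos"], keep=False)]

print(f"{len(combined)} records in total, {len(collisions)} at colliding sites")
```

An empty `collisions` table means the two sets touch disjoint sites, so the merged file should simply contain the union of both.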


## Step 4: Validate VCF Files

Quick validation to confirm VCF creation was successful.


In [None]:
# Count variant records in the created VCF files.
# splitlines() returns [] for empty output, so an empty VCF counts as 0 rather than 1;
# check=True raises if bcftools exits with an error.
result = subprocess.run(
    ["bcftools", "view", "-H", str(result_file)],
    capture_output=True, text=True, check=True
)
vcf1_count = len(result.stdout.splitlines())

result_merged = subprocess.run(
    ["bcftools", "view", "-H", str(result_file_merged)],
    capture_output=True, text=True, check=True
)
vcf_merged_count = len(result_merged.stdout.splitlines())

print("✅ VCF Validation:")
print(f"   First VCF: {vcf1_count} variants (3 SNPs + 2 indels)")
print(f"   Merged VCF: {vcf_merged_count} variants (all variants combined)")
print(f"   Both files indexed and compressed (.vcf.gz + .tbi)")
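
If `bcftools` is not on the path, records can also be counted with the standard library, since bgzip-compressed files are readable as ordinary gzip. The sketch below is a fallback under that assumption; `count_vcf_records` is a hypothetical helper, and the demo writes a small throwaway file rather than touching the notebook's outputs.

```python
import gzip
import tempfile
from pathlib import Path


def count_vcf_records(path):
    """Count non-header, non-empty lines in a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# Demo on a temporary file with two header lines and two records.
demo_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr19\t44900754\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "chr19\t44908754\t.\tT\tTTG\t.\tPASS\t.\tGT\t0/1",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.vcf.gz"
    with gzip.open(demo_path, "wt") as handle:
        handle.write("\n".join(demo_lines) + "\n")
    n_records = count_vcf_records(demo_path)

print(f"Records found: {n_records}")
```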


## Step 5: Predict Gene Expression from Variants

Now let's use the created VCF file to predict gene expression and compare it with reference genome predictions.


In [None]:
# Prepare query for APOE gene across multiple tissues
tissues_of_interest = ["whole blood", "brain - cortex", "liver", "adipose - subcutaneous"]
tissues_str = ",".join(tissues_of_interest)

query_df = pd.DataFrame({
    "gene_id": ['ENSG00000130203.9'],
    "tissues": [tissues_str]
})

print("🔍 Query DataFrame for Expression Prediction:")
print(f"   Gene: APOE (ENSG00000130203.9)")
print(f"   Tissues: {tissues_of_interest}")
query_df
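
The query stores all tissues for a gene as one comma-separated string. For downstream work it is often easier to have one row per gene and tissue. The following pandas sketch shows one way to expand the query; the `long_query` name and the `tissue` column are illustrative, and nothing here is required by `VCFProcessor`.

```python
import pandas as pd

tissues = ["whole blood", "brain - cortex", "liver", "adipose - subcutaneous"]
query = pd.DataFrame({
    "gene_id": ["ENSG00000130203.9"],
    "tissues": [",".join(tissues)],
})

# Split the comma-separated string into a list, then emit one row per tissue.
long_query = (
    query.assign(tissue=query["tissues"].str.split(","))
    .explode("tissue")
    .drop(columns="tissues")
    .reset_index(drop=True)
)
print(long_query)
```

Splitting on a bare comma is safe here because the tissue names contain hyphens and spaces but no commas.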


In [None]:
# Load the model (this may take a moment)
print("🔄 Loading pre-trained model...")
import time
start_time = time.time()

model, checkpoint_path, trainer = vcf_processor.load_model()

load_time = time.time() - start_time
print(f"✅ Model loaded in {load_time:.2f} seconds")
print(f"📂 Checkpoint: {checkpoint_path}")

# Print model info
total_params = sum(p.numel() for p in model.parameters())
print(f"📊 Model parameters: {total_params:,}")


### Prediction 1: With Variants (from our created VCF)


In [None]:
# Create dataset with variants
print(f"📊 Creating dataset with variants from VCF: {output_vcf_2}")
vcf_dataset_variant, dataloader_variant = vcf_processor.create_data(
    vcf_path=str(output_vcf_2),  # Merged VCF containing both variant sets
    query_df=query_df
)

print("✅ Dataset created")
print(f"   Dataset size: {len(vcf_dataset_variant)}")


In [None]:
# Run predictions with variants
print("🔮 Running predictions with variants...")
start_time = time.time()

predictions_variant = vcf_processor.predict(
    model=model,
    checkpoint_path=checkpoint_path,
    trainer=trainer,
    dataloader=dataloader_variant,
    vcf_dataset=vcf_dataset_variant
)

pred_time = time.time() - start_time
print(f"✅ Variant predictions completed in {pred_time:.2f} seconds")
predictions_variant 

### Prediction 2: Reference Genome (without variants)


In [None]:
# Create dataset without variants (reference genome)
print("📊 Creating dataset with reference genome (no variants)...")
vcf_dataset_ref, dataloader_ref = vcf_processor.create_data(
    vcf_path=None,  # No VCF = reference genome
    query_df=query_df
)

print("✅ Reference dataset created")
print(f"   Dataset size: {len(vcf_dataset_ref)}")


In [None]:
# Run predictions with reference genome
print("🔮 Running predictions with reference genome...")
start_time = time.time()

predictions_ref = vcf_processor.predict(
    model=model,
    checkpoint_path=checkpoint_path,
    trainer=trainer,
    dataloader=dataloader_ref,
    vcf_dataset=vcf_dataset_ref
)

pred_time = time.time() - start_time
print(f"✅ Reference predictions completed in {pred_time:.2f} seconds")
predictions_ref
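
With both prediction sets in hand, the variant effect per tissue can be summarized as a difference or a log2 fold change. The structure of the object returned by `predict` is not shown in this notebook, so the sketch below assumes the per-tissue expression values have already been extracted into two equal-length arrays on a non-negative linear scale. The numbers are placeholders chosen for illustration, not model output, and `log2_fold_change` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def log2_fold_change(variant_expr, ref_expr, pseudocount=1e-6):
    """Hypothetical helper: log2(variant / reference), stabilized by a pseudocount."""
    variant_expr = np.asarray(variant_expr, dtype=float)
    ref_expr = np.asarray(ref_expr, dtype=float)
    return np.log2((variant_expr + pseudocount) / (ref_expr + pseudocount))


tissues = ["whole blood", "brain - cortex", "liver", "adipose - subcutaneous"]

# Placeholder values for illustration only; replace with values taken from
# predictions_ref and predictions_variant.
ref_expr = np.array([4.0, 8.0, 16.0, 2.0])
variant_expr = np.array([8.0, 8.0, 8.0, 2.0])

comparison = pd.DataFrame({
    "tissue": tissues,
    "reference": ref_expr,
    "variant": variant_expr,
    "delta": variant_expr - ref_expr,
    "log2_fc": log2_fold_change(variant_expr, ref_expr),
})
print(comparison)
```

If the model reports expression on a log scale already, the plain `delta` column is the more appropriate summary and the fold-change step should be skipped.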


## Summary: Analyzing Variant Effects on Gene Expression

### What We Demonstrated

**1. Custom VCF Creation ✅**
- Created a VCF file with 3 SNPs and 2 indels
- Merged a second variant set into the first VCF to study combined variant effects
- All files compressed and indexed for efficient access

**2. Expression Prediction with Variants ✅**
- Predicted APOE expression from the merged multi-variant VCF
- Generated reference-genome predictions as a baseline for comparison
- Queried 4 tissues to examine tissue-specific responses
