# VCF to Gene Expression Prediction

This notebook demonstrates how to use the VariantFormer VCFProcessor to predict gene expression from VCF (Variant Call Format) files. The VCFProcessor leverages deep learning models to predict gene expression levels based on genetic variants and tissue types.

## Overview

The VariantFormer pipeline can:
- Process VCF files containing genetic variants
- Predict gene expression levels for specific genes and tissues
- Generate embeddings that capture regulatory information
- Handle multiple tissues and genes in a single analysis

## Prerequisites

- GPU-enabled environment (CUDA required)
- Access to reference genome and model checkpoints
- VCF files with genetic variants

## Key Outputs

- **Gene Expression Predictions**: Quantitative predictions of expression levels
- **Embeddings**: An embedding vector for each gene-tissue pair
- **Structured Results**: Well-formatted DataFrame ready for downstream analysis


## Import Required Libraries

Let's start by importing the necessary libraries and modules.


In [4]:
import sys
import os
from pathlib import Path

# Add parent directory to path to import custom modules
sys.path.append(str(Path.cwd().parent))

import pandas as pd
from processors.vcfprocessor import VCFProcessor
import warnings

warnings.filterwarnings("ignore")

print("✅ All libraries imported successfully!")
print(f"📁 Working directory: {Path.cwd()}")
print(f"🖥️  Python version: {sys.version}")

# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"🚀 GPU available: {torch.cuda.get_device_name(0)}")
    print(
        f"💾 GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB"
    )
else:
    print("⚠️  No GPU available - this notebook requires CUDA")

✅ All libraries imported successfully!
📁 Working directory: /work/notebooks
🖥️  Python version: 3.12.10 (main, Apr  9 2025, 04:03:51) [Clang 20.1.0 ]
🚀 GPU available: NVIDIA H100 80GB HBM3
💾 GPU memory: 84.9 GB


## 1. Initialize VCFProcessor

The VCFProcessor is the main class that handles:
- Loading model configurations
- Managing tissue and gene vocabularies  
- Creating data loaders for VCF files
- Loading pre-trained models
- Running predictions

Let's initialize it with the `D2C_AG` model class, which is the one this run uses. A protein-coding gene (PCG) model class, `D2C_PCG`, is also available.


In [5]:
# Initialize the VCFProcessor with the D2C_AG model class
model_class = "D2C_AG"
vcf_processor = VCFProcessor(model_class=model_class)

print("🧬 VCFProcessor initialized successfully!")
print(f"📊 Model class: {model_class}")
print(f"⚙️  Configuration loaded from: {vcf_processor.config_location}")
print(f"🎯 Accelerator: {vcf_processor.accelerator}")

🧬 VCFProcessor initialized successfully!
📊 Model class: D2C_AG
⚙️  Configuration loaded from: /work/configs
🎯 Accelerator: gpu


## 2. Explore Available Tissues and Genes

Before making predictions, let's explore what tissues and genes are available in the system. This will help us understand the scope of the model and choose appropriate targets for our analysis.


In [6]:
# Get available tissues
tissues = vcf_processor.get_tissues()
print(f"🧪 Available tissues ({len(tissues)} total):")
print("=" * 50)
print("First 10 tissues in the dataset:")
for i, tissue in enumerate(tissues, 1):
    print(f"{i:2d}. {tissue}")
    if i == 10:
        break

print("\n" + "=" * 70)

# Get available genes
genes_df = vcf_processor.get_genes()
print(f"🧬 Available genes ({len(genes_df)} total):")
print("=" * 50)
print("First 10 genes in the dataset:")
print(genes_df[["gene_id", "gene_name"]].head(10).to_string(index=False))

print("\n📊 Gene statistics:")
print(f"   • Total genes: {len(genes_df):,}")

🧪 Available tissues (62 total):
First 10 tissues in the dataset:
 1. A549
 2. GM23248
 3. HepG2
 4. K562
 5. NCI-H460
 6. Panc1
 7. adipose - subcutaneous
 8. adipose - visceral (omentum)
 9. adrenal gland
10. artery - aorta

🧬 Available genes (51061 total):
First 10 genes in the dataset:
           gene_id gene_name
ENSG00000000419.12      DPM1
ENSG00000000457.13     SCYL3
ENSG00000000460.16  C1orf112
ENSG00000000938.12       FGR
ENSG00000000971.15       CFH
ENSG00000001036.13     FUCA2
ENSG00000001084.10      GCLC
ENSG00000001167.14      NFYA
ENSG00000001460.17     STPG1
ENSG00000001461.16    NIPAL3

📊 Gene statistics:
   • Total genes: 51,061
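
Because a query packs several tissue names into one comma-separated string, a small typo in a name is easy to miss. The sketch below checks a tissue field against the available list before any prediction is run. `find_unknown_tissues` is a hypothetical helper, not part of VariantFormer, and it assumes `get_tissues()` yields plain tissue-name strings as printed above. It matches names exactly (no whitespace trimming), since how `VCFProcessor` parses the field is not shown here.

```python
def find_unknown_tissues(tissue_field, available_tissues):
    """Return names in a comma-separated tissue field that are not available."""
    available = set(available_tissues)
    # Exact matching: "a, b" yields " b", which is reported as unknown.
    return [name for name in tissue_field.split(",") if name not in available]


# Illustrative subset of the tissue list printed above (not all 62 entries).
available_subset = ["A549", "adrenal gland", "artery - aorta"]

print(find_unknown_tissues("adrenal gland,artery - aorta", available_subset))  # []
print(find_unknown_tissues("adrenal gland,artery-aorta", available_subset))  # ['artery-aorta']
```

In the notebook, the second argument would be the `tissues` list returned by `vcf_processor.get_tissues()`.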


## 3. Prepare Query Data

Now we'll prepare our query data, which specifies:
- **gene_id**: Which genes we want to predict expression for
- **tissues**: Which tissues/cell types we're interested in

For this demo, we'll use the same example as in the test function: one gene queried in two rows, the first listing three tissues as a comma-separated string and the second listing a single tissue.


In [7]:
# Simple query (same as the test function)
simple_query = {
    "gene_id": ["ENSG00000001461.16"] * 2,
    "tissues": ["whole blood,thyroid,artery - aorta", "brain - amygdala"],
}
simple_query_df = pd.DataFrame(simple_query)

print("📋 Simple Query Example:")
print("=" * 40)
print(simple_query_df.to_string(index=False))

# We'll use the simple query for this demo (following the test function)
query_df = simple_query_df
print(f"\n✅ Using simple query for demonstration ({len(query_df)} rows)")

📋 Simple Query Example:
           gene_id                            tissues
ENSG00000001461.16 whole blood,thyroid,artery - aorta
ENSG00000001461.16                   brain - amygdala

✅ Using simple query for demonstration (2 rows)
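
For larger analyses it may be more convenient to build the query from a mapping instead of typing comma-separated strings by hand. The following is a minimal sketch under the assumption that the query format is exactly what the example above shows: a `gene_id` column and a `tissues` column holding comma-joined tissue names. `build_query_df` is a hypothetical helper, not part of VariantFormer.

```python
import pandas as pd


def build_query_df(gene_to_tissue_groups):
    """Build a query DataFrame with one row per (gene, tissue group)."""
    rows = []
    for gene_id, tissue_groups in gene_to_tissue_groups.items():
        for group in tissue_groups:
            # Join without spaces, matching the format used in the example above.
            rows.append({"gene_id": gene_id, "tissues": ",".join(group)})
    return pd.DataFrame(rows, columns=["gene_id", "tissues"])


example_query_df = build_query_df(
    {
        "ENSG00000001461.16": [
            ["whole blood", "thyroid", "artery - aorta"],
            ["brain - amygdala"],
        ]
    }
)
print(example_query_df.to_string(index=False))
```

This reproduces the two-row query used in this notebook.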


## 4. Specify VCF File and Create Dataset

Now we need to specify the path to our VCF file containing genetic variants. The VCFProcessor will:
- Load the VCF file and extract relevant variants
- Map variants to cis-regulatory elements (CREs) and genes
- Create sequence data for model input
- Prepare batches for efficient processing

**Note**: Update the `vcf_path` below to point to your actual VCF file.
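
Before building the dataset, it can be useful to confirm which samples a VCF file actually contains. The sketch below uses only the standard library and relies on the VCF convention that the `#CHROM` header line lists nine fixed columns followed by one column per sample. `read_vcf_samples` is a hypothetical helper and says nothing about how `VCFProcessor` parses the file; the demo file and its sample names are synthetic.

```python
import gzip
import os
import tempfile


def read_vcf_samples(path):
    """Return the sample names listed on the #CHROM header line of a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\n").split("\t")
                return columns[9:]  # columns 1-9 are fixed; samples follow FORMAT
            if not line.startswith("#"):
                break  # reached variant records without seeing a header line
    return []


# Demo on a tiny synthetic VCF (sample names here are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write("##fileformat=VCFv4.2\n")
        handle.write(
            "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B\n"
        )
        handle.write("chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n")
    demo_samples = read_vcf_samples(demo_path)

print(demo_samples)  # ['SAMPLE_A', 'SAMPLE_B']
```

In the notebook, the same function could be called as `read_vcf_samples(vcf_path)` once `vcf_path` is defined.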


In [13]:
# Specify VCF file path (update this path to your actual VCF file)
vcf_path = os.path.join(str(Path.cwd().parent), "_artifacts/HG00096.vcf.gz")

print(f"📁 VCF file path: {vcf_path}")

# Check if file exists (optional - comment out if using a different path)
if os.path.exists(vcf_path):
    print("✅ VCF file found!")
    file_size = os.path.getsize(vcf_path) / (1024 * 1024)  # Size in MB
    print(f"📊 File size: {file_size:.1f} MB")
else:
    print("⚠️  VCF file not found at specified path. Please update the path above.")

print("\n🔄 Creating dataset and dataloader...")

# Create dataset and dataloader
try:
    vcf_dataset, dataloader = vcf_processor.create_data(vcf_path, query_df)

    print("✅ Dataset and dataloader created successfully!")
    print(f"📊 Dataset size: {len(vcf_dataset)} samples")
    print(f"🔢 Number of batches: {len(dataloader)}")
    print(f"⚙️  Batch size: {dataloader.batch_size}")

except Exception as e:
    print(f"❌ Error creating dataset: {str(e)}")
    print("Please check your VCF file path and ensure all dependencies are available.")

📁 VCF file path: /work/_artifacts/HG00096.vcf.gz
✅ VCF file found!
📊 File size: 61.1 MB

🔄 Creating dataset and dataloader...
Loaded BPE vocabulary from /work/vocabs/bpe_vocabulary_500.json
Filtered query df to 2 genes reducing from 2
✅ Dataset and dataloader created successfully!
📊 Dataset size: 2 samples
🔢 Number of batches: 1
⚙️  Batch size: 8


## 5. Load Pre-trained Model

Now we'll load the pre-trained VariantFormer model. This includes:
- Loading the model architecture and weights
- Setting up the PyTorch Lightning trainer
- Configuring GPU acceleration and precision settings

The model loading process will download checkpoints if they're not already available locally.


In [14]:
import time

print("🔄 Loading pre-trained model...")
start_time = time.time()

try:
    model, checkpoint_path, trainer = vcf_processor.load_model()

    load_time = time.time() - start_time
    print(f"✅ Model loaded successfully in {load_time:.1f} seconds!")
    print(f"📂 Checkpoint path: {checkpoint_path}")
    print(f"🎯 Trainer device: {trainer.accelerator}")
    print(f"🔢 Number of devices: {trainer.num_devices}")
    print(f"⚡ Precision: {trainer.precision}")

    # Print model information
    total_params = sum(p.numel() for p in model.parameters())

    print("\n📊 Model Statistics:")
    print(f"   • Total parameters: {total_params:,}")
    print(f"   • Model type: {type(model).__name__}")

except Exception as e:
    print(f"❌ Error loading model: {str(e)}")
    print("Please ensure model checkpoints are available and accessible.")

🔄 Loading pre-trained model...
Loading Seq2Reg model...
Loading Seq2Reg gene model...
Creating Seq2Gene model...
Model class: <class 'seq2gene.model_combined_modulator.Seq2GenePredictorCombinedModulator'>
Model architecture:
Model: Seq2GenePredictorCombinedModulator
  start_tkn: 96,768 params
  cre_tokenizer: 31,826,153 params
  gene_tokenizer: 31,826,153 params
  gene_map: 787,968 params
  cre_map: 787,968 params
  combined_modulator: 1,157,298,176 params
  tissue_heads: 4,726,273 params
  gene_loss: 0 params
Total number of parameters: 1,227,349,459
Loading checkpoint from /work/_artifacts/v4_ag_epoch9_checkpoint.pth


Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Model loaded successfully on cuda
✅ Model loaded successfully in 10.5 seconds!
📂 Checkpoint path: /work/_artifacts/v4_ag_epoch9_checkpoint.pth
🎯 Trainer device: <lightning.pytorch.accelerators.cuda.CUDAAccelerator object at 0x7ddd9a60aab0>
🔢 Number of devices: 1
⚡ Precision: bf16-mixed

📊 Model Statistics:
   • Total parameters: 1,227,349,459
   • Model type: Seq2GenePredictorCombinedModulator
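
The per-component counts in the loading log can be cross-checked against the reported total, which also shows where most of the parameters sit. The sketch below simply redoes that arithmetic with the numbers copied from the log above. The memory figure at the end is a rough estimate that assumes the weights are stored as 4-byte float32 values and ignores activations and other runtime buffers.

```python
# Per-component parameter counts copied from the model-loading log above.
component_params = {
    "start_tkn": 96_768,
    "cre_tokenizer": 31_826_153,
    "gene_tokenizer": 31_826_153,
    "gene_map": 787_968,
    "cre_map": 787_968,
    "combined_modulator": 1_157_298_176,
    "tissue_heads": 4_726_273,
    "gene_loss": 0,
}

total = sum(component_params.values())
print(f"Total parameters: {total:,}")

# Share of the total held by each component, largest first.
for name, count in sorted(component_params.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {count / total:6.1%}")

# Rough size of the weights alone, assuming 4 bytes per parameter (float32).
print(f"Approx. float32 weight size: {total * 4 / 1e9:.1f} GB")
```

The component counts sum to 1,227,349,459, matching the logged total, and `combined_modulator` accounts for roughly 94% of it.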


## 6. Run Predictions

Now we're ready to run the actual predictions! The model will:
- Process the genetic variants 
- Generate an embedding for each gene-tissue combination
- Predict gene expression levels for each gene-tissue combination

This step may take some time depending on the size of your VCF file and the complexity of your queries.


In [15]:
print("🔄 Running predictions...")
print(f"📊 Processing {len(query_df)} gene-tissue combinations")
print(f"🧬 VCF samples: {len(vcf_dataset)}")

start_time = time.time()

try:
    # Run predictions
    predictions_df = vcf_processor.predict(
        model, checkpoint_path, trainer, dataloader, vcf_dataset
    )

    prediction_time = time.time() - start_time
    print(f"✅ Predictions completed in {prediction_time:.1f} seconds!")
    print(f"📊 Results shape: {predictions_df.shape}")

    # Basic validation
    assert len(predictions_df) == len(
        query_df
    ), "Predictions length should match query dataframe length"
    print("✅ Validation passed: predictions match input queries")

except Exception as e:
    print(f"❌ Error during prediction: {str(e)}")
    print("Please check that all previous steps completed successfully.")

Restoring states from the checkpoint path at /work/_artifacts/v4_ag_epoch9_checkpoint.pth


🔄 Running predictions...
📊 Processing 2 gene-tissue combinations
🧬 VCF samples: 2


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /work/_artifacts/v4_ag_epoch9_checkpoint.pth


Predicting: |          | 0/? [00:00<?, ?it/s]

2025-10-30 20:14:55 - utils.assets - INFO - Loading parquet file: /tmp/tmpubvgwkrn/model/common/cres_all_genes_manifest.parquet
2025-10-30 20:14:55 - utils.assets - INFO - Validated schema - found columns: {'file_path', 'gene_id'}


✅ Predictions completed in 14.8 seconds!
📊 Results shape: (2, 5)
✅ Validation passed: predictions match input queries


## 7. Analyze Results

Let's examine the prediction results in detail. The output contains:
- **Original query information** (gene_id, tissues)
- **predicted_expression**: Model's prediction of gene expression levels
- **embeddings**: High-dimensional representations capturing regulatory context

We'll explore the structure of the results and provide some basic analysis.


In [16]:
print("📊 PREDICTION RESULTS ANALYSIS")
print("=" * 50)

# Display basic information about results
print("🔍 Results Overview:")
print(f"   • Number of predictions: {len(predictions_df)}")
print(f"   • Columns: {list(predictions_df.columns)}")
print("   • Data types:")
for col in predictions_df.columns:
    print(f"     - {col}: {predictions_df[col].dtype}")

print("\n📋 Sample Results:")
print("=" * 30)
print("Predictions:")
predictions_df

📊 PREDICTION RESULTS ANALYSIS
🔍 Results Overview:
   • Number of predictions: 2
   • Columns: ['gene_id', 'tissues', 'tissue_names', 'predicted_expression', 'embeddings']
   • Data types:
     - gene_id: object
     - tissues: object
     - tissue_names: object
     - predicted_expression: object
     - embeddings: object

📋 Sample Results:
Predictions:


| | gene_id | tissues | tissue_names | predicted_expression | embeddings |
|---|---|---|---|---|---|
| 0 | ENSG00000001461.16 | [62, 59, 10] | [whole blood, thyroid, artery - aorta] | [[1.2511718], [2.8266606], [1.6822926]] | [[-5.875, -4.53125, 1.9375, 2.953125, -0.96484... |
| 1 | ENSG00000001461.16 | [15] | [brain - amygdala] | [[3.0634768]] | [[10.875, 5.46875, -2.53125, 4.28125, 20.125, ... |
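
Each result row holds lists (several tissues, several predictions), which is awkward for plotting or joining with other tables. The sketch below reshapes such a frame into one row per gene-tissue pair. It runs on a small stand-in built from the values shown above, with embeddings omitted, and assumes `predicted_expression` holds nested arrays with one value per tissue, as the display suggests. `to_long_format` is a hypothetical helper, and the multi-column `explode` call needs pandas 1.3 or newer.

```python
import numpy as np
import pandas as pd

# Stand-in for predictions_df using the values displayed above (embeddings omitted).
example_predictions = pd.DataFrame(
    {
        "gene_id": ["ENSG00000001461.16", "ENSG00000001461.16"],
        "tissue_names": [
            ["whole blood", "thyroid", "artery - aorta"],
            ["brain - amygdala"],
        ],
        "predicted_expression": [
            np.array([[1.2511718], [2.8266606], [1.6822926]], dtype=np.float32),
            np.array([[3.0634768]], dtype=np.float32),
        ],
    }
)


def to_long_format(df):
    """Return one row per gene-tissue pair with a scalar expression value."""
    # Flatten each (n_tissues, 1) array into a plain list of floats.
    flat = df.assign(
        predicted_expression=df["predicted_expression"].map(
            lambda values: np.asarray(values, dtype=float).ravel().tolist()
        )
    )
    # Explode both list columns in lockstep so tissues stay paired with values.
    long_df = flat.explode(["tissue_names", "predicted_expression"], ignore_index=True)
    long_df = long_df.rename(columns={"tissue_names": "tissue"})
    long_df["predicted_expression"] = long_df["predicted_expression"].astype(float)
    return long_df[["gene_id", "tissue", "predicted_expression"]]


long_predictions = to_long_format(example_predictions)
print(long_predictions.to_string(index=False))
```

The two query rows become four rows, one per tissue, each with a single float in `predicted_expression`.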


### Run the analysis on the reference genome (GRCh38)

In [17]:
# No VCF file provided, so pass an empty string
vcf_dataset, dataloader = vcf_processor.create_data("", query_df)
predictions_df = vcf_processor.predict(
    model, checkpoint_path, trainer, dataloader, vcf_dataset
)

Restoring states from the checkpoint path at /work/_artifacts/v4_ag_epoch9_checkpoint.pth


Loaded BPE vocabulary from /work/vocabs/bpe_vocabulary_500.json
Filtered query df to 2 genes reducing from 2


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /work/_artifacts/v4_ag_epoch9_checkpoint.pth


Predicting: |          | 0/? [00:00<?, ?it/s]

2025-10-30 20:15:20 - utils.assets - INFO - Loading parquet file: /tmp/tmpubvgwkrn/model/common/cres_all_genes_manifest.parquet
2025-10-30 20:15:20 - utils.assets - INFO - Validated schema - found columns: {'file_path', 'gene_id'}


In [18]:
predictions_df

| | gene_id | tissues | tissue_names | predicted_expression | embeddings |
|---|---|---|---|---|---|
| 0 | ENSG00000001461.16 | [62, 59, 10] | [whole blood, thyroid, artery - aorta] | [[1.1905905], [2.7972884], [1.6317137]] | [[-5.71875, -4.65625, 1.0859375, 1.9765625, -1... |
| 1 | ENSG00000001461.16 | [15] | [brain - amygdala] | [[2.915098]] | [[7.53125, 5.0, -4.1875, 2.765625, 19.5, -6.53... |
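
Putting the two runs side by side shows how much the HG00096 variants shift the predictions relative to the reference sequence. The sketch below does this with the values copied from the two result tables in this notebook. Note that the notebook reassigns `predictions_df` for the reference run, so doing this comparison live would need the two results kept under separate names.

```python
compared_tissues = ["whole blood", "thyroid", "artery - aorta", "brain - amygdala"]

# Predicted expression copied from the two result tables in this notebook.
with_variants = [1.2511718, 2.8266606, 1.6822926, 3.0634768]  # HG00096 VCF run
reference = [1.1905905, 2.7972884, 1.6317137, 2.915098]  # reference-only run

deltas = [v - r for v, r in zip(with_variants, reference)]
for tissue, delta in zip(compared_tissues, deltas):
    print(f"{tissue:>18}: {delta:+.4f}")
```

For this gene, all four differences are positive, and the largest (about +0.15) is in brain - amygdala.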


In [20]:
predictions_df['embeddings'].iloc[0][0]

array([-5.71875  , -4.65625  ,  1.0859375, ...,  2.78125  ,  5.28125  ,
        3.171875 ], shape=(1536,), dtype=float32)
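
Each embedding is a 1536-dimensional float32 vector, so one simple way to compare two of them (for example, the same gene-tissue pair with and without variants) is cosine similarity. The sketch below is illustrative only: the vectors are random placeholders of the right length, not model output, and `cosine_similarity` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder vectors with the same length as the embedding shown above (1536).
rng = np.random.default_rng(0)
reference_embedding = rng.normal(size=1536).astype(np.float32)
variant_embedding = reference_embedding + rng.normal(scale=0.1, size=1536).astype(np.float32)

similarity = cosine_similarity(variant_embedding, reference_embedding)
print(f"cosine similarity: {similarity:.4f}")
```

In the notebook, the inputs would be vectors such as `predictions_df['embeddings'].iloc[0][0]` taken from the VCF-based and reference-based results.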