# VCF to Gene Expression Prediction

This notebook demonstrates how to use the VariantFormer VCFProcessor to predict gene expression from VCF (Variant Call Format) files. The VCFProcessor leverages a transformer model to predict gene expression levels based on genetic variants and tissue types.

## Overview

The VariantFormer pipeline can:
- Process VCF files containing genetic variants
- Predict gene expression levels for specific genes and tissues
- Generate embeddings for specific genes
- Handle multiple tissues and genes in a single analysis

## Prerequisites

- GPU-enabled environment (CUDA required)
- Access to reference genome and model checkpoints (run `python download_artifacts.py` before running the notebook)
- VCF files with genetic variants

## Key Outputs:

- **Gene Expression Predictions**: Quantitative predictions of expression levels
- **Embeddings**: Embedding for each gene-tissue pair


In [None]:
import sys
import os
from pathlib import Path
import ipynbname
import pandas as pd
from processors.vcfprocessor import VCFProcessor
import warnings
 
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"🚀 GPU available: {torch.cuda.get_device_name(0)}")
    print(
        f"💾 GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB"
    )
else:
    print("⚠️  No GPU available - this notebook requires CUDA")

# Set repository path
REPO_PATH = ipynbname.path().parent.parent

## 1. Initialize VCFProcessor

The VCFProcessor is the main class that handles:
- Loading model configurations
- Managing tissue and gene vocabularies  
- Creating data loaders for VCF files
- Loading pre-trained models (v4_pcg, v4_ag)
- Running predictions

Let's initialize it. The cell below selects the `v4_ag` model; to use the protein-coding gene (PCG) model instead, set `model_class` to `v4_pcg`.


In [None]:
# Initialize the VCFProcessor with the selected model class
model_class = "v4_ag"
# model_class = "v4_pcg"  # Uncomment to use the protein-coding gene (PCG) model
vcf_processor = VCFProcessor(model_class=model_class)

print(f"📊 Model class: {model_class}")
print(f"⚙️  Configuration loaded from: {vcf_processor.config_location}")

## 2. Explore Available Tissues and Genes

Before making predictions, let's explore what tissues and genes are available in the system. This will help us understand the scope of the model and choose appropriate targets for our analysis.


In [None]:
# Get available tissues
tissues = vcf_processor.get_tissues()
print(f"🧪 Available tissues ({len(tissues)} total):")
print("=" * 50)
print("First 10 tissues in the dataset:")
for i, tissue in enumerate(tissues, 1):
    print(f"{i:2d}. {tissue}")
    if i == 10:
        break

print("\n" + "=" * 70)

# Get available genes
genes_df = vcf_processor.get_genes()
print(f"🧬 Available genes ({len(genes_df)} total):")
print("=" * 50)
print("First 10 genes in the dataset:")
print(genes_df[["gene_id", "gene_name"]].head(10).to_string(index=False))

print("\n📊 Gene statistics:")
print(f"   • Total genes: {len(genes_df):,}")
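
Before building a query, it can help to check that the tissue names you plan to use actually appear in the list printed above. The sketch below is a minimal, hypothetical helper (`find_unknown_tissues` is not part of VariantFormer). It assumes that `get_tissues()` yields plain name strings and that names must match exactly, including case; both are assumptions, not documented behavior.

```python
def find_unknown_tissues(tissue_strings, available_tissues):
    """Return tissue names from the query that are not in available_tissues.

    tissue_strings: list of comma-separated tissue strings, one per gene.
    available_tissues: iterable of known tissue names (assumed to be strings).
    """
    available = set(available_tissues)
    unknown = []
    for tissue_string in tissue_strings:
        for name in tissue_string.split(","):
            name = name.strip()
            # Record each unknown name once, in first-seen order
            if name and name not in available and name not in unknown:
                unknown.append(name)
    return unknown


# Hypothetical subset of tissue names, for illustration only
available = ["whole blood", "thyroid", "artery - aorta"]
unknown = find_unknown_tissues(["whole blood,thyroid", "brain - amygdla"], available)
print(unknown)
```

In the notebook, `available` would be replaced by the `tissues` value returned by `vcf_processor.get_tissues()`.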

## 3. Prepare Query Data

Now we'll prepare our query data, which specifies:
- **gene_id**: Which genes we want to predict expression for
- **tissues**: Which tissues/cell types we're interested in

For this demo, we'll use the same example as in the test function: two genes, each paired with its own comma-separated list of tissues.


In [None]:
# Example 1: Simple query
'''
General format for queries:
simple_query = {
    "gene_id": [gene ID],
    "tissues": [Comma separated tissue names],
}
'''
simple_query = {
    "gene_id": ["ENSG00000001461.16", "ENSG00000000419.12"],
    "tissues": ["whole blood,thyroid,artery - aorta", "brain - amygdala"],
} 
query_df = pd.DataFrame(simple_query)
print("🔍 Simple Query DataFrame:")
print(query_df.to_string(index=False))
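
When the genes and tissues come from another data structure, the same query layout can be assembled programmatically. The following is a small sketch, not part of the VariantFormer API: `build_query` is a hypothetical helper that turns a gene-to-tissues mapping into the column-oriented dictionary used above, joining each gene's tissues into one comma-separated string.

```python
def build_query(gene_to_tissues):
    """Turn {gene_id: [tissue, ...]} into {"gene_id": [...], "tissues": [...]}."""
    query = {"gene_id": [], "tissues": []}
    for gene_id, tissue_list in gene_to_tissues.items():
        if not tissue_list:
            raise ValueError(f"No tissues given for {gene_id}")
        query["gene_id"].append(gene_id)
        # One comma-separated string per gene, matching the notebook's format
        query["tissues"].append(",".join(t.strip() for t in tissue_list))
    return query


built_query = build_query(
    {
        "ENSG00000001461.16": ["whole blood", "thyroid", "artery - aorta"],
        "ENSG00000000419.12": ["brain - amygdala"],
    }
)
print(built_query)
```

The result can be passed to `pd.DataFrame(built_query)` to obtain the same `query_df` as in the cell above.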

## 4. Specify VCF File and Create Dataset

Now we need to specify the path to our VCF file containing genetic variants. The VCFProcessor will:
- Load the VCF file and extract relevant variants
- Map variants to regulatory regions (CREs) and genes
- Create sequence data for model input
- Prepare batches for efficient processing

**Note**: Update the `vcf_path` below to point to your actual VCF file.
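
A quick sanity check on the path can catch typos before the data loader is built. The sketch below is a hypothetical helper (`check_vcf_path` is not part of VariantFormer). It assumes VCF files end in `.vcf` or `.vcf.gz`, which is a convention rather than a documented requirement, and it lets an empty string through because this notebook later uses `""` to request a reference-only run.

```python
import os


def check_vcf_path(vcf_path):
    """Return vcf_path unchanged if it looks usable, otherwise raise."""
    if vcf_path == "":
        # Empty string: used later in this notebook to mean "no VCF"
        return vcf_path
    # Assumed naming convention; adjust if your files are named differently
    if not vcf_path.endswith((".vcf", ".vcf.gz")):
        raise ValueError(f"Not a .vcf or .vcf.gz file: {vcf_path}")
    if not os.path.isfile(vcf_path):
        raise FileNotFoundError(f"VCF file not found: {vcf_path}")
    return vcf_path


for candidate in ["", "notes.txt"]:
    try:
        print(repr(check_vcf_path(candidate)), "accepted")
    except (ValueError, FileNotFoundError) as err:
        print("rejected:", err)
```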


In [None]:
# Specify VCF file path (update this path to your actual VCF file)
vcf_path = os.path.join(REPO_PATH, "_artifacts/HG00096.vcf.gz")
# Create data loader from VCF and query
vcf_dataset, dataloader = vcf_processor.create_data(vcf_path, query_df)

## 5. Load Pre-trained Model

Next, load the pre-trained model, its checkpoint path, and the trainer for the selected model class.

In [None]:
import time

print("🔄 Loading pre-trained model...")
start_time = time.time()


model, checkpoint_path, trainer = vcf_processor.load_model()

load_time = time.time() - start_time
print(f"✅ Model loaded in {load_time:.1f} seconds")
print(f"📂 Checkpoint path: {checkpoint_path}")
print(f"⚡ Precision: {trainer.precision}")

# Print model information
total_params = sum(p.numel() for p in model.parameters())

print("\n📊 Model Statistics:")
print(f"   • Total parameters: {total_params:,}")


## 6. Run Predictions

Now we're ready to run the actual predictions! The model will:
- Process the genetic variants 
- Generate embeddings that capture gene representation
- Predict gene expression levels for each gene-tissue combination

This step may take some time depending on the size of your VCF file and the complexity of your queries. To improve speed, prune the VCF file beforehand to remove all reference (non-variant) calls.
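
As an illustration of the pruning step, the sketch below drops records whose genotype is homozygous reference. It rests on several assumptions: that "ref variant" means a `0/0` or `0|0` genotype call, that the VCF has a single sample in column 10, and that the lines are already decompressed text. `drop_hom_ref` is a hypothetical helper, not part of VariantFormer.

```python
def drop_hom_ref(lines):
    """Yield VCF lines, skipping records where the first sample carries no ALT allele."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            yield line  # sites-only record: nothing to inspect, keep it
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            yield line  # no genotype field, keep it
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        # Keep the record only if at least one allele is a called ALT allele
        if any(allele not in ("0", ".") for allele in alleles):
            yield line


vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\n",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\n",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\tGT:DP\t1/1:30\n",
]
pruned = list(drop_hom_ref(vcf_lines))
print("".join(pruned))
```

For a `.vcf.gz` file, the lines could be read with `gzip.open(path, "rt")`. If the processor expects a compressed or indexed file, the pruned output would need to be recompressed with your usual VCF tooling.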


In [None]:
print("🔄 Running predictions...")

start_time = time.time()

# Run predictions
predictions_df = vcf_processor.predict(
    model, checkpoint_path, trainer, dataloader, vcf_dataset
)

prediction_time = time.time() - start_time
print(f"✅ Predictions completed in {prediction_time:.1f} seconds!")
print(f"📊 Results shape: {predictions_df.shape}")



## 7. Analyze Results

In [None]:
print("📊 PREDICTION RESULTS ANALYSIS")
print("=" * 50)

# Display basic information about results
print("🔍 Results Overview:")
print(f"   • Number of predictions: {len(predictions_df)}")
print(f"   • Columns: {list(predictions_df.columns)}")
print("\n📋 Sample Results:")
print("=" * 30)
print("Predictions:")
predictions_df

**The output schema**
- Original query information: (gene_id, tissues)
- `predicted_expression`: Model's prediction of gene expression levels
- `embeddings`: High-dimensional representations of the gene conditioned on tissue and neighboring regulatory regions

## Run the analysis on ref genome hg38 without any mutations
This analysis provides a reference baseline, so we can measure how gene expression changes due to the presence of the mutations.

In [None]:
# No VCF file provided, so pass an empty string to run on the reference genome
vcf_dataset, dataloader = vcf_processor.create_data("", query_df)  # "" indicates no VCF
predictions_df_ref = vcf_processor.predict(
    model, checkpoint_path, trainer, dataloader, vcf_dataset
)

In [None]:
print("✅ Predictions made from the reference genome (no VCF provided):")
predictions_df_ref.head()

### Compute the log2 fold change relative to the REF genome

In [None]:
import numpy as np

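# Extract the predicted expression values for the first query row as float arrays
# (np.asarray makes the element-wise division work even if the values are plain lists)
ref_expr = np.asarray(predictions_df_ref['predicted_expression'].values[0], dtype=float)
vcf_expr = np.asarray(predictions_df['predicted_expression'].values[0], dtype=float)

# Calculate log2 fold change (VCF vs. reference), one value per tissue
print(f'Tissues: {predictions_df["tissue_names"].values[0]}')
print(f'Log2 fold change: {np.log2(vcf_expr / ref_expr)}')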
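
The cell above compares only the first query row. The sketch below shows the same calculation, log2(VCF / REF) per tissue, as a small reusable function. It assumes the predicted values are positive and on a linear scale, which this notebook does not state; non-positive inputs produce `nan` instead of an error. `log2_fold_change` is a hypothetical helper, not part of VariantFormer.

```python
import math


def log2_fold_change(vcf_expr, ref_expr):
    """Element-wise log2(vcf / ref) for two equal-length sequences of numbers."""
    if len(vcf_expr) != len(ref_expr):
        raise ValueError("vcf_expr and ref_expr must have the same length")
    result = []
    for vcf_value, ref_value in zip(vcf_expr, ref_expr):
        if vcf_value <= 0 or ref_value <= 0:
            # The ratio's log is undefined here; mark it rather than fail
            result.append(float("nan"))
        else:
            result.append(math.log2(vcf_value / ref_value))
    return result


# Toy numbers for illustration only, not model output
fold_changes = log2_fold_change([4.0, 1.0, 0.5], [2.0, 1.0, 1.0])
print(fold_changes)  # [1.0, 0.0, -1.0]
```

A positive value means higher predicted expression with the variants than on the reference genome, zero means no change, and a negative value means lower predicted expression.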

In [None]:
predictions_df