# VCF to Alzheimer's Disease Risk Prediction

This notebook demonstrates how to predict Alzheimer's Disease (AD) risk from genetic variants in VCF files using VariantFormer. The model predicts tissue-specific gene expression changes and estimates AD risk contributions.

## Overview

- Input: VCF file with genetic variants
- Output: AD risk scores per gene-tissue pair
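
Before running the pipeline, it can help to sanity-check the input VCF. The sketch below is not part of VariantFormer: it uses only the Python standard library and assumes a gzip-compressed, tab-delimited VCF in which header lines start with `#`. The helper name `count_variants_per_chrom` is hypothetical.

```python
import gzip
from collections import Counter


def count_variants_per_chrom(vcf_gz_path):
    """Count variant records per chromosome in a gzip-compressed VCF."""
    counts = Counter()
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            # Skip meta-information (##...), the #CHROM header, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # The first tab-separated field of a record is the chromosome.
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts
```

Calling the helper on the VCF path configured in section 5 returns a `Counter` keyed by chromosome name; an empty result usually means the file contains only header lines.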


## 1. Setup and Imports


In [8]:
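# Install plotly for the visualization in section 8 (IPython shell escape).
!uv pip install plotly

import os
import sys
from pathlib import Path

# Make the parent directory importable so that `processors` resolves.
sys.path.insert(0, str(Path().resolve().parent))

import plotly.express as px
from processors import ad_risk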

Using Python 3.12.10 environment at: /work/.venv
Audited 1 package in 2ms


## 2. Load AD Risk Model

Initialize the model with `model_class='v4_pcg'` (DNA-to-Cell, protein-coding genes) or `model_class='v4_ag'` (the model trained on all genes). Model artifacts are downloaded from S3 automatically on first run.


In [None]:
# Load the AD risk prediction model.
model_class = 'v4_pcg'
# model_class = 'v4_ag'  # Use this instead for the model trained on all genes.
adrisk = ad_risk.ADriskFromVCF(model_class=model_class)

2025-11-02 21:32:20 - processors.model_manager - INFO - Loading Seq2Reg model...
2025-11-02 21:32:21 - processors.model_manager - INFO - Loading Seq2Reg gene model...
2025-11-02 21:32:21 - processors.model_manager - INFO - Creating Seq2Gene model...
2025-11-02 21:32:26 - processors.model_manager - INFO - Model class: <class 'seq2gene.model_combined_modulator.Seq2GenePredictorCombinedModulator'>
2025-11-02 21:32:26 - processors.model_manager - INFO - Model architecture:
2025-11-02 21:32:26 - processors.model_manager - INFO - Model: Seq2GenePredictorCombinedModulator
2025-11-02 21:32:26 - processors.model_manager - INFO -   start_tkn: 96,768 params
2025-11-02 21:32:26 - processors.model_manager - INFO -   cre_tokenizer: 31,826,153 params
2025-11-02 21:32:26 - processors.model_manager - INFO -   gene_tokenizer: 31,826,153 params
2025-11-02 21:32:26 - processors.model_manager - INFO -   gene_map: 787,968 params
2025-11-02 21:32:26 - processors.model_manager - INFO -   cre_map: 787,968 para

## 3. Explore Available Genes


In [10]:
all_genes = adrisk.vcf_processor.get_genes()
print(f'First 5 of {len(all_genes)} genes:')
print(all_genes[:5])

First 5 of 18439 genes:
  chromosome      start        end strand             gene_id gene_name
0      chr20   50934867   50958555      -  ENSG00000000419.12      DPM1
1       chr1  169849631  169894267      -  ENSG00000000457.13     SCYL3
2       chr1  169662007  169854080      +  ENSG00000000460.16  C1orf112
3       chr1   27612064   27635277      -  ENSG00000000938.12       FGR
4       chr1  196651878  196747504      +  ENSG00000000971.15       CFH
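
Since `get_genes()` returns a table with both `gene_id` and `gene_name` columns, gene symbols can be mapped to the versioned Ensembl IDs that the later cells expect. The following is a minimal sketch using a toy DataFrame built from rows in the output above; the helper `gene_ids_for` is hypothetical and assumes gene names are unique in the table.

```python
import pandas as pd

# Toy stand-in for `all_genes`, using rows copied from the output above.
genes = pd.DataFrame(
    {
        "chromosome": ["chr20", "chr1", "chr1"],
        "start": [50934867, 169849631, 27612064],
        "end": [50958555, 169894267, 27635277],
        "strand": ["-", "-", "-"],
        "gene_id": [
            "ENSG00000000419.12",
            "ENSG00000000457.13",
            "ENSG00000000938.12",
        ],
        "gene_name": ["DPM1", "SCYL3", "FGR"],
    }
)


def gene_ids_for(names, genes_df):
    """Return gene IDs for the requested gene names, raising on unknown names."""
    # Assumes each gene_name appears once, so each lookup yields one ID.
    lookup = genes_df.set_index("gene_name")["gene_id"]
    missing = [name for name in names if name not in lookup.index]
    if missing:
        raise KeyError(f"Unknown gene names: {missing}")
    return [lookup.loc[name] for name in names]


selected = gene_ids_for(["DPM1", "SCYL3"], genes)
print(selected)
```

In the notebook, `all_genes` could be passed in place of the toy `genes` table, provided it has the same two columns.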


## 4. Explore Available Tissues

The model supports 62 tissue/cell types, including brain regions, cell lines, and GTEx tissues. Each has a unique integer ID, which is what the prediction call expects.


In [11]:
# Tissues vocabulary
tissue_vocab = adrisk.vcf_processor.tissue_vocab
print('Tissues: Ids')
for tissue, idx in tissue_vocab.items():
    print(f'  {tissue}: {idx}')

Tissues: Ids
  A549: 0
  GM23248: 2
  HepG2: 3
  K562: 4
  NCI-H460: 5
  Panc1: 6
  adipose - subcutaneous: 7
  adipose - visceral (omentum): 8
  adrenal gland: 9
  artery - aorta: 10
  artery - coronary: 11
  artery - tibial: 12
  bladder: 13
  blood: 14
  brain - amygdala: 15
  brain - anterior cingulate cortex (ba24): 16
  brain - caudate (basal ganglia): 17
  brain - cerebellar hemisphere: 18
  brain - cerebellum: 19
  brain - cortex: 20
  brain - frontal cortex (ba9): 21
  brain - hippocampus: 22
  brain - hypothalamus: 23
  brain - nucleus accumbens (basal ganglia): 24
  brain - putamen (basal ganglia): 25
  brain - spinal cord (cervical c-1): 26
  brain - substantia nigra: 27
  breast - mammary tissue: 28
  cells - cultured fibroblasts: 29
  cells - ebv-transformed lymphocytes: 30
  cervix - ectocervix: 31
  cervix - endocervix: 32
  colon - sigmoid: 33
  colon - transverse: 34
  esophagus - gastroesophageal junction: 35
  esophagus - mucosa: 36
  esophagus - muscularis: 37
  fa
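
Because the vocabulary maps tissue names to integer IDs, a group of related tissues can be selected by name prefix. The sketch below works on a small excerpt of the mapping printed above; `tissues_with_prefix` is a hypothetical helper, not part of the library.

```python
# Small excerpt of the vocabulary printed above (tissue name -> ID).
tissue_vocab_excerpt = {
    "blood": 14,
    "brain - amygdala": 15,
    "brain - cortex": 20,
    "brain - hippocampus": 22,
    "breast - mammary tissue": 28,
}


def tissues_with_prefix(vocab, prefix):
    """Return {tissue name: ID} for names starting with `prefix`."""
    return {name: idx for name, idx in vocab.items() if name.startswith(prefix)}


brain_tissues = tissues_with_prefix(tissue_vocab_excerpt, "brain - ")
print(brain_tissues)
```

Using the trailing `" - "` in the prefix avoids matching unrelated names that merely begin with the same letters, such as `breast - mammary tissue`.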

## 5. Configure Parameters

Specify the VCF file, target genes, and tissues for analysis.


In [12]:
# Path to the example VCF file.
vcf_path = str(Path.cwd().parent / "_artifacts" / "HG00096.vcf.gz")
tissue_vocab = adrisk.vcf_processor.tissue_vocab

# Genes to query and the tissue ID to use for each gene. The results in
# section 7 indicate that the two lists are paired element by element.
tissue_ids = [tissue_vocab['brain - cortex'], tissue_vocab['brain - hippocampus']]
gene_ids = ["ENSG00000000419.12", "ENSG00000000457.13"]
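
The results in section 7 contain one row per gene, each with a single tissue, which suggests that `gene_ids` and `tissue_ids` are matched position by position rather than crossed. Under that assumption, the two lists can be built from explicit (gene, tissue name) pairs so they cannot drift out of sync. `build_paired_query` is a hypothetical helper for illustration.

```python
def build_paired_query(pairs, tissue_vocab):
    """Split (gene_id, tissue_name) pairs into parallel gene and tissue-ID lists."""
    gene_ids, tissue_ids = [], []
    for gene_id, tissue_name in pairs:
        # Fail early on a misspelled tissue name instead of inside the model call.
        if tissue_name not in tissue_vocab:
            raise KeyError(f"Unknown tissue: {tissue_name!r}")
        gene_ids.append(gene_id)
        tissue_ids.append(tissue_vocab[tissue_name])
    return gene_ids, tissue_ids


# Excerpt of the tissue vocabulary shown in section 4.
vocab = {"brain - cortex": 20, "brain - hippocampus": 22}
pairs = [
    ("ENSG00000000419.12", "brain - cortex"),
    ("ENSG00000000457.13", "brain - hippocampus"),
]
paired_gene_ids, paired_tissue_ids = build_paired_query(pairs, vocab)
print(paired_gene_ids, paired_tissue_ids)
```

To query every gene in every tissue under the same pairing assumption, the pairs can be generated with `itertools.product` before calling the helper.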

## 6. Make Predictions

Run the prediction pipeline: variant processing → sequence extraction → expression prediction → AD risk calculation.


In [13]:
# Make predictions
preds = adrisk(vcf_path, gene_ids, tissue_ids)

Restoring states from the checkpoint path at /work/_artifacts/v4_pcg_epoch11_checkpoint.pth


Loaded BPE vocabulary from /work/vocabs/bpe_vocabulary_500.json
Filtered query df to 2 genes reducing from 2


/work/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py:283: Be aware that when using `ckpt_path`, callbacks used to create the checkpoint need to be provided during `Trainer` instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}"].
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /work/_artifacts/v4_pcg_epoch11_checkpoint.pth


Predicting: |          | 0/? [00:00<?, ?it/s]

2025-11-02 21:32:45 - utils.assets - INFO - Downloading from S3: s3://czi-variantformer/model/common/cres_all_genes_manifest.parquet
2025-11-02 21:32:47 - utils.assets - INFO - Loading parquet file: /tmp/tmpkysebm36/model/common/cres_all_genes_manifest.parquet
2025-11-02 21:32:47 - utils.assets - INFO - Validated schema - found columns: {'file_path', 'gene_id'}
2025-11-02 21:32:47 - utils.assets - INFO - Downloading from S3: s3://czi-variantformer/model/common/cres_all_genes/ENSG00000000419.12/gene_vocab.csv
2025-11-02 21:32:55 - utils.assets - INFO - Downloading from S3: s3://czi-variantformer/model/common/cres_all_genes/ENSG00000000457.13/gene_vocab.csv
2025-11-02 21:33:03 - utils.assets - INFO - Downloading from S3: s3://czi-variantformer/alzheimer_disease/v4_pcg/manifest.parquet
2025-11-02 21:33:05 - utils.assets - INFO - Loading parquet file: /tmp/tmp9xdpienp/alzheimer_disease/v4_pcg/manifest.parquet
2025-11-02 21:33:05 - utils.assets - INFO - Validated schema - found columns: {'f

## 7. View Results

The output DataFrame has one row per queried gene-tissue pair, with the gene ID and name, tissue ID and name, predicted expression, an embedding, and the AD risk score.


In [14]:
# Print predictions
preds

              gene_id  tissue_id          tissue_name  predicted_expression                                          embedding  ad_risk  gene_name
0  ENSG00000000419.12         20       brain - cortex              3.242956  [[8.6875, -1.0078125, 0.7421875, 1.125, 16.75,...    0.574       DPM1
1  ENSG00000000457.13         22  brain - hippocampus              1.025923  [[-9.75, -0.71875, -2.96875, -5.28125, -2.1562...     0.82      SCYL3
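
For larger queries, the `embedding` column makes the table hard to read. A minimal sketch of a compact summary, ranked by AD risk, is shown below. It uses a toy DataFrame with the scalar values from the table above, and the embedding values are short placeholders rather than real model output.

```python
import pandas as pd

# Toy stand-in for `preds`; embedding values are placeholders, not model output.
preds_example = pd.DataFrame(
    {
        "gene_id": ["ENSG00000000419.12", "ENSG00000000457.13"],
        "tissue_id": [20, 22],
        "tissue_name": ["brain - cortex", "brain - hippocampus"],
        "predicted_expression": [3.242956, 1.025923],
        "embedding": [[0.0, 0.0], [0.0, 0.0]],
        "ad_risk": [0.574, 0.82],
        "gene_name": ["DPM1", "SCYL3"],
    }
)

# Drop the bulky embedding column and rank rows by AD risk, highest first.
summary = (
    preds_example.drop(columns=["embedding"])
    .sort_values("ad_risk", ascending=False)
    .reset_index(drop=True)
)
print(summary[["gene_name", "tissue_name", "ad_risk"]])
```

The same chain could be applied to `preds` directly, assuming it carries the column names shown in the table.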


## 8. Visualize Results

Interactive bar chart showing AD risk predictions across tissues.


In [15]:
# Visualize predictions
fig = px.bar(
    preds, x="tissue_name", y="ad_risk", title="AD Risk Predictions across Tissues"
)
fig.show()

2025-11-02 21:33:07 - utils.assets - INFO - Loading parquet file: /tmp/tmp_xvluhh5/model/common/cres_all_genes_manifest.parquet
2025-11-02 21:33:07 - utils.assets - INFO - Validated schema - found columns: {'file_path', 'gene_id'}
