# VCF to Alzheimer's Disease Risk Prediction

This notebook demonstrates how to predict Alzheimer's Disease (AD) risk from genetic variants in VCF files using VariantFormer. The model predicts tissue-specific gene expression changes and estimates AD risk contributions.

## Overview

- Input: VCF file with genetic variants
- Output: AD risk scores per gene-tissue pair


## 1. Setup and Imports


In [None]:
import os

import ipynbname
from processors import ad_risk

# Install plotly into the active environment before importing it
!uv pip install plotly
import plotly.express as px

# Repository root: the parent of the directory containing this notebook
REPO_PATH = ipynbname.path().parent.parent

## 2. Load AD Risk Model

Initialize the model with `v4_pcg` (DNA-to-Cell, protein-coding genes) or `v4_ag` (trained on all genes). Model artifacts are downloaded automatically from S3 on first run.


In [None]:
# Load the AD risk prediction model
model_class = 'v4_pcg'
# model_class = 'v4_ag'  # Uncomment to use the model trained on all genes
adrisk = ad_risk.ADriskFromVCF(model_class=model_class)

## 3. Explore Available Genes


In [None]:
all_genes = adrisk.vcf_processor.get_genes()
print(f'First 5 of {len(all_genes)} genes:')
print(all_genes[:5])

## 4. Explore Available Tissues

The model supports 62 tissue/cell types, including brain regions, cell lines, and GTEx tissues. Each has a unique ID that is used to request predictions for that tissue.


In [None]:
# Tissue vocabulary: maps each tissue name to its ID
tissue_vocab = adrisk.vcf_processor.tissue_vocab
print('Tissue name: ID')
for tissue, idx in tissue_vocab.items():
    print(f'  {tissue}: {idx}')
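With 62 entries, it can help to narrow the vocabulary down by name before choosing tissue IDs. The following is a minimal sketch, assuming the vocabulary behaves like a plain `dict` of name to ID, as the loop above suggests. The helper `select_tissues` and the toy vocabulary (including its ID values and the `whole blood` entry) are hypothetical and only serve to illustrate the idea.

```python
# Toy stand-in for adrisk.vcf_processor.tissue_vocab.
# The ID values and the "whole blood" entry are made up for illustration.
toy_tissue_vocab = {
    "brain - cortex": 0,
    "brain - hippocampus": 1,
    "whole blood": 2,
}


def select_tissues(vocab, keyword):
    """Return {name: id} for tissues whose name contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return {name: idx for name, idx in vocab.items() if keyword in name.lower()}


brain_tissues = select_tissues(toy_tissue_vocab, "brain")
print(brain_tissues)
```

In the notebook, the same helper could be called on the real `tissue_vocab` instead of the toy dictionary.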

## 5. Configure Parameters

Specify the VCF file, target genes, and tissues for analysis.


In [None]:
# Path to the input VCF file
vcf_path = os.path.join(REPO_PATH, "_artifacts/HG00096.vcf.gz")
tissue_vocab = adrisk.vcf_processor.tissue_vocab

# Genes and tissue IDs to query
tissue_ids = [tissue_vocab['brain - cortex'], tissue_vocab['brain - hippocampus']]
gene_ids = ["ENSG00000000419.12", "ENSG00000000457.13"]
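Before running a prediction, it may be worth checking that every requested gene and tissue is actually known to the model. The sketch below is one possible way to do that, assuming the gene list is a sequence of ID strings and the tissue vocabulary is a `dict` keyed by tissue name. The helper `find_missing`, the toy data, and the placeholder ID `ENSG_NOT_A_REAL_ID` are hypothetical.

```python
def find_missing(requested, available):
    """Return the requested items that are absent from `available`, in order."""
    available = set(available)  # for a dict, this is the set of its keys
    return [item for item in requested if item not in available]


# Toy stand-ins; in the notebook these would come from
# adrisk.vcf_processor.get_genes() and adrisk.vcf_processor.tissue_vocab.
toy_genes = ["ENSG00000000419.12", "ENSG00000000457.13"]
toy_tissue_vocab = {"brain - cortex": 0, "brain - hippocampus": 1}  # made-up IDs

requested_genes = ["ENSG00000000419.12", "ENSG_NOT_A_REAL_ID"]  # second ID is a placeholder
requested_tissues = ["brain - cortex", "brain - hippocampus"]

missing_genes = find_missing(requested_genes, toy_genes)
missing_tissues = find_missing(requested_tissues, toy_tissue_vocab)
print("missing genes:", missing_genes)
print("missing tissues:", missing_tissues)
```

An empty list means every requested item was found; anything else points to a likely typo or a gene version mismatch.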

## 6. Make Predictions

Run the prediction pipeline: variant processing → sequence extraction → expression prediction → AD risk calculation.


In [None]:
# Make predictions
preds = adrisk(vcf_path, gene_ids, tissue_ids)

## 7. View Results

The output DataFrame contains gene IDs and names, tissue IDs and names, and AD risk scores.


In [None]:
# Display the predictions DataFrame
preds
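Once several genes and tissues are queried, a quick summary of the highest-scoring tissue per gene can make the table easier to read. The sketch below works on a list of row dictionaries, which pandas can produce with `preds.to_dict("records")`. The column names `tissue_name` and `ad_risk` match those used for plotting in this notebook, while `gene_id` is an assumed column name and the scores are invented toy values, not model output.

```python
# Toy rows with made-up scores; `gene_id` is an assumed column name.
toy_records = [
    {"gene_id": "ENSG00000000419.12", "tissue_name": "brain - cortex", "ad_risk": 0.40},
    {"gene_id": "ENSG00000000419.12", "tissue_name": "brain - hippocampus", "ad_risk": 0.55},
    {"gene_id": "ENSG00000000457.13", "tissue_name": "brain - cortex", "ad_risk": 0.30},
    {"gene_id": "ENSG00000000457.13", "tissue_name": "brain - hippocampus", "ad_risk": 0.20},
]


def top_tissue_per_gene(records):
    """Map each gene to the (tissue_name, ad_risk) pair with the highest score."""
    best = {}
    for row in records:
        current = best.get(row["gene_id"])
        if current is None or row["ad_risk"] > current["ad_risk"]:
            best[row["gene_id"]] = row
    return {gene: (row["tissue_name"], row["ad_risk"]) for gene, row in best.items()}


summary = top_tissue_per_gene(toy_records)
for gene, (tissue, score) in summary.items():
    print(f"{gene}: {tissue} ({score:.2f})")
```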

## 8. Visualize Results

An interactive bar chart shows the AD risk predictions across tissues.


In [None]:
# Visualize predictions
fig = px.bar(
    preds, x="tissue_name", y="ad_risk", title="AD Risk Predictions across Tissues"
)
fig.show()