# Explore HuggingFace Module Annotations

This notebook explores the annotation results from the HuggingFace modules pipeline.

**Source VCF**: `antonkulaga/personal-health` (HuggingFace)

**Modules**: longevitymap, lipidmetabolism, vo2max, superhuman, coronary

In [7]:
import polars as pl
from pathlib import Path
import json

# Detect repo root - works whether running from notebooks/ or repo root
base = (Path.cwd() if (Path.cwd() / "notebooks").exists() else Path.cwd().parent).resolve()
print(f"Repo root: {base}")

# Output directory
OUTPUT_DIR = base / "data" / "output" / "users" / "antonkulaga" / "genome" / "modules"

print(f"Output directory: {OUTPUT_DIR}")
print("Files in output directory:")
for f in sorted(OUTPUT_DIR.iterdir()):
    size_kb = f.stat().st_size / 1024
    print(f"  - {f.name}: {size_kb:.1f} KB")

Repo root: /home/antonkulaga/sources/just-dna-lite
Output directory: /home/antonkulaga/sources/just-dna-lite/data/output/users/antonkulaga/genome/modules
Files in output directory:
  - coronary_weights.parquet: 13.3 KB
  - lipidmetabolism_weights.parquet: 13.2 KB
  - longevitymap_weights.parquet: 25.9 KB
  - manifest.json: 1.3 KB
  - superhuman_weights.parquet: 8.9 KB
  - vo2max_weights.parquet: 10.2 KB
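
The repo-root check above only handles two cases: the working directory is the repo root, or it is one level below. A minimal sketch of a more general lookup walks up the parent directories until one contains a `notebooks` folder. The helper name `find_repo_root` and its `marker` parameter are hypothetical and not part of the pipeline; the demonstration runs on a temporary directory rather than the real repository.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path, marker: str = "notebooks") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demonstration on a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp).resolve()
    nested = fake_root / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found_from_nested = find_repo_root(nested) == fake_root
    found_from_root = find_repo_root(fake_root) == fake_root

print(found_from_nested, found_from_root)
```

In the notebook this would be called as `find_repo_root(Path.cwd())`, which also works when the notebook is started from a subfolder of `notebooks/`.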


## Load Manifest

The manifest describes all output files and their source modules.

In [8]:
# Load manifest
manifest_path = OUTPUT_DIR / "manifest.json"
with open(manifest_path) as f:
    manifest = json.load(f)

print(f"User: {manifest['user_name']}")
print(f"Sample: {manifest['sample_name']}")
print(f"Source VCF: {manifest['source_vcf']}")
print(f"Total variants annotated: {manifest['total_variants_annotated']}")
print(f"\nModules processed: {len(manifest['modules'])}")
for m in manifest['modules']:
    print(f"  - {m['module']}: {m['weights_path']}")

User: antonkulaga
Sample: genome
Source VCF: /home/antonkulaga/.cache/huggingface/hub/datasets--antonkulaga--personal-health/snapshots/3280bc64374f3ac8d3aaf362c0bffd216fb6cdb9/genetics/antonkulaga.vcf
Total variants annotated: 324

Modules processed: 5
  - longevitymap: data/output/users/antonkulaga/genome/modules/longevitymap_weights.parquet
  - lipidmetabolism: data/output/users/antonkulaga/genome/modules/lipidmetabolism_weights.parquet
  - vo2max: data/output/users/antonkulaga/genome/modules/vo2max_weights.parquet
  - superhuman: data/output/users/antonkulaga/genome/modules/superhuman_weights.parquet
  - coronary: data/output/users/antonkulaga/genome/modules/coronary_weights.parquet
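
Because the manifest stores each `weights_path` relative to the repo root, it can be worth confirming that every listed file exists before loading anything. The following is a minimal sketch under that assumption; the helper name `missing_weight_files` is hypothetical, and the demonstration uses a temporary directory with placeholder files rather than the real output folder.

```python
import tempfile
from pathlib import Path


def missing_weight_files(manifest: dict, repo_root: Path) -> list[str]:
    """Return module names whose weights file is absent under repo_root."""
    missing = []
    for entry in manifest["modules"]:
        # weights_path is assumed to be relative to the repo root
        if not (repo_root / entry["weights_path"]).exists():
            missing.append(entry["module"])
    return missing


# Demonstration with a made-up two-module manifest
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a_weights.parquet").write_bytes(b"")  # empty placeholder file
    demo_manifest = {
        "modules": [
            {"module": "a", "weights_path": "a_weights.parquet"},
            {"module": "b", "weights_path": "b_weights.parquet"},
        ]
    }
    demo_missing = missing_weight_files(demo_manifest, root)

print(demo_missing)
```

With the real data, the call would be `missing_weight_files(manifest, base)`.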


## Load All Module Results as LazyFrames

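Each module's Parquet file is opened with `pl.scan_parquet`, which returns a lazy frame: data is read only when `.collect()` is called, which keeps memory use low.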

In [9]:
# Load all module results as lazy frames
module_data = {}

for m in manifest['modules']:
    module_name = m['module']
    # Resolve path relative to repo root (manifest stores paths relative to repo root)
    weights_path = base / m['weights_path']
    
    if weights_path.exists():
        lf = pl.scan_parquet(weights_path)
        module_data[module_name] = lf
        
        # Get row count and schema
        row_count = lf.select(pl.len()).collect().item()
        schema = lf.collect_schema()
        print(f"\n{module_name.upper()}:")
        print(f"  Rows: {row_count}")
        print(f"  Columns: {len(schema.names())}")
    else:
        print(f"  {module_name}: File not found at {weights_path}")


LONGEVITYMAP:
  Rows: 295
  Columns: 31

LIPIDMETABOLISM:
  Rows: 9
  Columns: 31

VO2MAX:
  Rows: 6
  Columns: 31

SUPERHUMAN:
  Rows: 2
  Columns: 31

CORONARY:
  Rows: 12
  Columns: 31


---
## Longevitymap Results

Longevity-associated variants from the LongevityMap database.

In [10]:
if "longevitymap" in module_data:
    lf = module_data["longevitymap"]
    
    # Select key columns and show head
    result = lf.select([
        "chrom", "start", "rsid", "ref", "alt", "genotype",
        "weight", "state", "conclusion"
    ]).head(20).collect()
    
    print("Longevitymap - First 20 variants:")
    display(result)

Longevitymap - First 20 variants:


chrom,start,rsid,ref,alt,genotype,weight,state,conclusion
str,u32,str,str,str,list[str],f64,str,str
"""19""",18350648,"""""","""C""","""T""","[""T"", ""T""]",0.14,"""protective""","""281 SNPs were found to discrim…"
"""10""",1342438,"""""","""G""","""A""","[""A"", ""A""]",-0.35,"""risk""","""SNPs in the RNA editing genes …"
"""7""",109268325,"""""","""A""","""G""","[""A"", ""G""]",0.07,"""protective""","""281 SNPs were found to discrim…"
"""1""",238350009,"""""","""A""","""C""","[""C"", ""C""]",0.14,"""protective""","""281 SNPs were found to discrim…"
"""11""",74002546,"""""","""A""","""G""","[""A"", ""G""]",0.245,"""protective""","""The study showed that after 10…"
…,…,…,…,…,…,…,…,…
"""12""",4130326,"""""","""G""","""A""","[""A"", ""G""]",0.07,"""protective""","""281 SNPs were found to discrim…"
"""13""",96878715,"""""","""C""","""A""","[""A"", ""C""]",0.07,"""protective""","""281 SNPs were found to discrim…"
"""20""",44651586,"""""","""C""","""T""","[""C"", ""T""]",0.115,"""protective""","""When comparing all 3 age group…"
"""16""",56977273,"""""","""G""","""A""","[""A"", ""G""]",0.05,"""protective""","""The minor allele frequency of …"


In [11]:
if "longevitymap" in module_data:
    lf = module_data["longevitymap"]
    
    # Count by state
    state_counts = lf.group_by("state").agg(pl.len().alias("count")).collect()
    print("Longevitymap - Variants by state:")
    display(state_counts)

Longevitymap - Variants by state:


state,count
str,u32
"""protective""",202
"""risk""",14
,79


---
## Lipid Metabolism Results

Lipid metabolism and cardiovascular risk variants.

In [12]:
if "lipidmetabolism" in module_data:
    lf = module_data["lipidmetabolism"]
    
    result = lf.select([
        "chrom", "start", "rsid", "ref", "alt", "genotype",
        "weight", "state", "conclusion"
    ]).head(20).collect()
    
    print("Lipid Metabolism - All variants:")
    display(result)

Lipid Metabolism - All variants:


chrom,start,rsid,ref,alt,genotype,weight,state,conclusion
str,u32,str,str,str,list[str],f64,str,str
"""2""",43846742,"""""","""T""","""C""","[""C"", ""C""]",0.0,"""alt""","""CC genotype is NOT associated …"
"""8""",19962213,"""""","""C""","""G""","[""C"", ""G""]",-0.4,"""alt""","""You are heterozygous carrier o…"
"""19""",44908684,"""""","""T""","""C""","[""C"", ""T""]",-1.5,"""alt""","""You are a carrier of unfavoura…"
"""1""",109274968,"""""","""G""","""T""","[""T"", ""T""]",-0.8,"""alt""","""TT genotype is associated with…"
"""19""",44908822,"""""","""C""","""T""","[""C"", ""T""]",1.5,"""alt""","""You are a carrier of one rs741…"
"""2""",21008652,"""""","""G""","""A""","[""A"", ""G""]",-0.5,"""alt""","""G allele is unfavourable. Carr…"
"""9""",15289580,"""""","""C""","""T""","[""T"", ""T""]",0.3,"""alt""","""TT genotype is favourable rega…"
"""11""",116832924,"""""","""G""","""C""","[""C"", ""C""]",0.0,"""alt""","""CC variant is favourable."""
"""12""",120951159,"""""","""A""","""C""","[""C"", ""C""]",0.0,"""alt""","""CC genotype is favourable."""


---
## VO2max Results

Athletic performance and VO2max-associated variants.

In [13]:
if "vo2max" in module_data:
    lf = module_data["vo2max"]
    
    result = lf.select([
        "chrom", "start", "rsid", "ref", "alt", "genotype",
        "weight", "state", "conclusion"
    ]).head(20).collect()
    
    print("VO2max - All variants:")
    display(result)

VO2max - All variants:


chrom,start,rsid,ref,alt,genotype,weight,state,conclusion
str,u32,str,str,str,list[str],f64,str,str
"""19""",44908684,"""""","""T""","""C""","[""C"", ""T""]",0.0,"""alt""","""Normal VO2max training respons…"
"""2""",41904383,"""""","""G""","""A""","[""A"", ""G""]",0.0,"""alt""","""Normal VO2max training respons…"
"""15""",23762924,"""""","""T""","""C""","[""C"", ""T""]",0.0,"""alt""","""Normal VO2max training respons…"
"""19""",44908822,"""""","""C""","""T""","[""C"", ""T""]",0.0,"""alt""","""Normal VO2max training respons…"
"""11""",67585218,"""""","""A""","""G""","[""G"", ""G""]",1.0,"""alt""","""High VO2max training response"""
"""1""",6955045,"""""","""T""","""C""","[""C"", ""T""]",,,


---
## Superhuman Results

Elite performance and rare beneficial variants.

In [14]:
if "superhuman" in module_data:
    lf = module_data["superhuman"]
    
    result = lf.select([
        "chrom", "start", "rsid", "ref", "alt", "genotype",
        "weight", "state", "conclusion"
    ]).head(20).collect()
    
    print("Superhuman - All variants:")
    display(result)

Superhuman - All variants:


chrom,start,rsid,ref,alt,genotype,weight,state,conclusion
str,u32,str,str,str,list[str],f64,str,str
"""1""",161223893,"""""","""G""","""A""","[""A"", ""G""]",,,
"""19""",44908822,"""""","""C""","""T""","[""C"", ""T""]",,,


---
## Coronary Results

Coronary artery disease associations.

In [15]:
if "coronary" in module_data:
    lf = module_data["coronary"]
    
    result = lf.select([
        "chrom", "start", "rsid", "ref", "alt", "genotype",
        "weight", "state", "conclusion"
    ]).head(20).collect()
    
    print("Coronary - All variants:")
    display(result)

Coronary - All variants:


chrom,start,rsid,ref,alt,genotype,weight,state,conclusion
str,u32,str,str,str,list[str],f64,str,str
"""1""",222650187,"""""","""A""","""C""","[""C"", ""C""]",-1.2,"""alt""","""CC genotype increases increase…"
"""15""",67166301,"""""","""T""","""C""","[""C"", ""T""]",-1.1,"""alt""","""SMAD3 gene encodes an intracel…"
"""2""",226203364,"""""","""A""","""C""","[""A"", ""C""]",0.0,"""alt""","""rs2943634 C/A single nucleotid…"
"""9""",22125504,"""""","""G""","""C""","[""C"", ""G""]",0.9,"""alt""","""1.5x increased risk for CAD is…"
"""6""",150931849,"""""","""G""","""A""","[""A"", ""G""]",0.7,"""alt""","""Polymorphisms in MTHFD1L, incl…"
…,…,…,…,…,…,…,…,…
"""11""",116778201,"""""","""G""","""C""","[""C"", ""C""]",0.0,"""alt""","""rs964184 variant in the ZPR1 g…"
"""19""",29573489,"""""","""A""","""G""","[""G"", ""G""]",-0.8,"""ref""","""rs7250581 has been reported in…"
"""9""",22098575,"""""","""A""","""G""","[""A"", ""G""]",-1.3,"""ref""","""Several GWASs have revealed th…"
"""12""",111446804,"""""","""T""","""C""","[""C"", ""T""]",-0.1,"""alt""","""SH2B adaptor protein 3 (SH2B3)…"


---
## Summary Statistics

Overview of annotations across all modules: total rows, rows with a non-null weight, and counts of `protective` and `risk` states.

In [16]:
# Summary across all modules
summary_data = []

for module_name, lf in module_data.items():
    row_count = lf.select(pl.len()).collect().item()
    
    # Count annotated rows (those with a non-null weight)
    annotated = lf.filter(pl.col("weight").is_not_null()).select(pl.len()).collect().item()
    
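    # Get state distribution. Only longevitymap uses "protective"/"risk";
    # the other modules report "alt"/"ref", so their counts below stay 0.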
    states = lf.filter(pl.col("state").is_not_null()).group_by("state").agg(
        pl.len().alias("count")
    ).collect()
    
    protective = states.filter(pl.col("state") == "protective")["count"].to_list()
    risk = states.filter(pl.col("state") == "risk")["count"].to_list()
    
    summary_data.append({
        "module": module_name,
        "total_variants": row_count,
        "with_weight": annotated,
        "protective": protective[0] if protective else 0,
        "risk": risk[0] if risk else 0,
    })

summary_df = pl.DataFrame(summary_data)
print("\nSummary across all modules:")
display(summary_df)


Summary across all modules:


module,total_variants,with_weight,protective,risk
str,i64,i64,i64,i64
"""longevitymap""",295,216,202,14
"""lipidmetabolism""",9,9,0,0
"""vo2max""",6,5,0,0
"""superhuman""",2,0,0,0
"""coronary""",12,12,0,0

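The `protective` and `risk` columns are zero for every module except longevitymap, because the other modules label their rows `alt` or `ref`. One state-independent view is to tally weights by sign. The sketch below does this in plain Python for the nine lipidmetabolism weights shown earlier. The helper name `tally_by_sign` is hypothetical, and whether a positive weight means "favourable" in every module is an assumption this notebook does not confirm.

```python
def tally_by_sign(weights):
    """Count positive, negative, and zero weights; None values are skipped."""
    counts = {"positive": 0, "negative": 0, "zero": 0}
    for w in weights:
        if w is None:
            continue  # unannotated row, no weight assigned
        if w > 0:
            counts["positive"] += 1
        elif w < 0:
            counts["negative"] += 1
        else:
            counts["zero"] += 1
    return counts


# Weights copied by hand from the Lipid Metabolism table above
lipid_weights = [0.0, -0.4, -1.5, -0.8, 1.5, -0.5, 0.3, 0.0, 0.0]
lipid_tally = tally_by_sign(lipid_weights)
print(lipid_tally)
```

The same function could be fed a module's `weight` column converted to a Python list.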

In [17]:
# Weight statistics per module (modules with no weighted rows are skipped)
weight_summary = []

for module_name, lf in module_data.items():
    weights = lf.filter(pl.col("weight").is_not_null()).select("weight").collect()
    
    if weights.height > 0:
        total = weights["weight"].sum()
        mean = weights["weight"].mean()
        min_w = weights["weight"].min()
        max_w = weights["weight"].max()
        
        weight_summary.append({
            "module": module_name,
            "total_weight": round(total, 3),
            "mean_weight": round(mean, 3),
            "min_weight": round(min_w, 3),
            "max_weight": round(max_w, 3),
        })

weight_df = pl.DataFrame(weight_summary)
print("\nWeight statistics per module:")
display(weight_df)


Weight statistics per module:


module,total_weight,mean_weight,min_weight,max_weight
str,f64,f64,f64,f64
"""longevitymap""",26.728,0.124,-0.5,0.5
"""lipidmetabolism""",-1.4,-0.156,-1.5,1.5
"""vo2max""",1.0,0.2,0.0,1.0
"""coronary""",-2.8,-0.233,-1.3,0.9

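As a quick consistency check, each module's mean weight should equal its total weight divided by the number of rows with a weight. The sketch below recomputes the mean from the two summary tables above. The values are copied by hand from those outputs, and the tolerance is an assumption chosen to absorb the three-decimal rounding.

```python
# Values copied by hand from the two summary tables above
with_weight = {"longevitymap": 216, "lipidmetabolism": 9, "vo2max": 5, "coronary": 12}
total_weight = {"longevitymap": 26.728, "lipidmetabolism": -1.4, "vo2max": 1.0, "coronary": -2.8}
reported_mean = {"longevitymap": 0.124, "lipidmetabolism": -0.156, "vo2max": 0.2, "coronary": -0.233}

TOLERANCE = 0.001  # reported values are rounded to three decimals

mismatches = []
for module, n in with_weight.items():
    recomputed = total_weight[module] / n
    if abs(recomputed - reported_mean[module]) > TOLERANCE:
        mismatches.append(module)
    print(f"{module}: recomputed mean {recomputed:.3f}, reported {reported_mean[module]}")

print("Mismatches:", mismatches)
```

superhuman is left out because it has no weighted rows, so no mean is reported for it.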

---
## Full Schema

Complete column list for each module output.

In [18]:
for module_name, lf in module_data.items():
    schema = lf.collect_schema()
    print(f"\n{module_name.upper()} Schema ({len(schema.names())} columns):")
    for name, dtype in schema.items():
        print(f"  {name}: {dtype}")


LONGEVITYMAP Schema (31 columns):
  chrom: String
  start: UInt32
  end: UInt32
  rsid: String
  ref: String
  alt: String
  qual: Float64
  filter: String
  END: Int32
  GT: String
  GQ: Int64
  DP: Int64
  AD: List(Int64)
  VAF: List(Float64)
  PL: List(Int64)
  genotype: List(String)
  rsid_longevitymap: String
  module: String
  weight: Float64
  state: String
  priority: String
  conclusion: String
  curator: String
  method: String
  end_longevitymap: UInt32
  alts: List(String)
  clinvar: Boolean
  pathogenic: Boolean
  benign: Boolean
  likely_pathogenic: Boolean
  likely_benign: Boolean

LIPIDMETABOLISM Schema (31 columns):
  chrom: String
  start: UInt32
  end: UInt32
  rsid: String
  ref: String
  alt: String
  qual: Float64
  filter: String
  END: Int32
  GT: String
  GQ: Int64
  DP: Int64
  AD: List(Int64)
  VAF: List(Float64)
  PL: List(Int64)
  genotype: List(String)
  rsid_lipidmetabolism: String
  module: String
  weight: Float64
  state: String
  priority: String
  c