# ClinVar SNV and Non-SNV Processing Pipeline

This notebook processes ClinVar genetic variants to create machine learning datasets for variant effect prediction. See `Clinvar_SNV_Non_SNV_README.md` for detailed documentation.

## Quick Start

1. Update file paths in the configuration section
2. Ensure all dependencies are installed
3. Run cells in order
4. Monitor progress and memory usage

**⚠️ Important**: This pipeline requires significant computational resources and storage space.

## Configuration

Update these paths for your environment:

In [None]:
# Configuration - Update these paths for your environment
import os
from pathlib import Path

# File paths (update these for your system)
CONFIG = {
    # Input data
    'clinvar_vcf': 'data/clinvar_grch38.vcf.gz',
    'reference_genome': 'data/reference/GRCh38.fa',
    'hgnc_mapping': 'data/hgnc_complete_set.txt',
    
    # VEP configuration
    'vep_root': '/path/to/vep',
    'vep_cache': '/path/to/vep/cache',
    
    # Output paths
    'output_dir': 'output',
    'temp_dir': 'temp',
    
    # Processing parameters
    'window_size': 4096,
    'max_variant_size': 64,
    'num_threads': 8,
    'batch_size': 100000
}

SCRATCH_DIR = '/your/scratch/directory'  # Update this to your scratch directory

# Create output directories
for dir_path in [CONFIG['output_dir'], CONFIG['temp_dir']]:
    os.makedirs(dir_path, exist_ok=True)
    
print("Configuration loaded. Please verify all paths are correct:")
for key, value in CONFIG.items():
    if 'path' in key or 'dir' in key:
        exists = os.path.exists(value) if not key.endswith('dir') else True
        status = "✅" if exists else "❌"
        print(f"  {status} {key}: {value}")
        
print("\n📝 Update CONFIG dictionary above with your actual file paths")

# ClinVar SNV and Non-SNV Variant Processing Pipeline

This notebook processes ClinVar genetic variants (both SNVs and non-SNVs) to create a comprehensive machine learning dataset for variant effect prediction. The pipeline includes:

## Overview

1. **Data Processing**: Download and process ClinVar VCF data using VEP (Variant Effect Predictor)
2. **Sequence Window Extraction**: Generate 4096bp genomic windows centered on variants
3. **Feature Engineering**: Extract pathogenicity, disease associations, and gene information
4. **Dataset Creation**: Build training/test datasets with disjoint disease splits
5. **Quality Control**: Comprehensive statistics and validation

## Key Features

- **Genomic Windows**: 4096bp sequences with centered mutations
- **Variant Types**: Both SNVs and structural variants (insertions, deletions, etc.)
- **Clinical Annotations**: Pathogenicity classification and disease associations
- **Gene Mapping**: Integration with HGNC gene nomenclature
- **Disjoint Splits**: Train/test splits ensuring no disease overlap

## Requirements

- **Computational Resources**: High-memory system (recommended for large datasets)
- **Software Dependencies**: VEP, Python libraries (pandas, pysam, pyarrow, hgvs)
- **Reference Data**: GRCh38 genome assembly, HGNC gene mapping
- **Storage**: Sufficient space for intermediate files (~100GB+)

## Output

Final datasets suitable for:
- Variant effect prediction models
- Pathogenicity classification
- Disease association studies
- Genomic language model training

## Initial Setup (For HPC/Cluster Environments)

**Note**: This section contains setup instructions for high-performance computing environments. Adapt paths and module loading commands for your specific system.

### Prerequisites Installation
If running on a cluster, you may need to download Python wheels and reference data:

In [None]:
# Download required Python packages and reference data
# Adjust paths and module loading for your specific environment

# Example for cluster environments:
# module load python gcc arrow postgresql

# Create directory for Python wheels (adjust path as needed)
# mkdir -p /path/to/your/pywheels
# pip download hgvs -d /path/to/your/pywheels

# Download HGNC gene mapping data
# wget -O hgnc_complete_set.txt "https://storage.googleapis.com/public-download-files/hgnc/tsv/tsv/hgnc_complete_set.txt"

print("Setup instructions provided above. Adjust paths for your environment.")
print("Required data:")
print("- HGNC complete gene set")
print("- Python packages: hgvs, pandas, pyarrow, pysam, tqdm")
print("- VEP installation with cache")
print("- GRCh38 reference genome")

## Environment Setup

**For cluster/HPC environments**: Configure virtual environment and load required modules.
**For local environments**: Ensure all dependencies are installed.

In [None]:
# Environment setup for cluster/HPC systems
# Adjust module loading and paths for your specific environment

# Example cluster setup:
"""
# Create virtual environment
python -m venv /tmp/clinvar_env

# Load required modules (adjust for your system)
module load python gcc arrow postgresql
module load perl samtools tabix bcftools mariadb

# Activate virtual environment
source /tmp/clinvar_env/bin/activate

# Install packages
pip install notebook pandas pyarrow pysam hgvs tqdm networkx

# Start Jupyter (for remote access)
jupyter notebook --no-browser --ip=$(hostname -f) --port=8888
"""

# For local environments, ensure these packages are installed:
required_packages = [
    'pandas>=1.3.0',
    'pyarrow>=5.0.0', 
    'pysam>=0.19.0',
    'hgvs>=1.5.0',
    'tqdm>=4.60.0',
    'networkx>=2.6.0'
]

print("Required packages:")
for pkg in required_packages:
    print(f"  - {pkg}")
    
print("\nFor VEP processing, also required:")
print("  - VEP (Ensembl Variant Effect Predictor)")
print("  - BCFtools, SAMtools, Tabix")
print("  - Reference genome and VEP cache files")

Now code

In [None]:
!which python
# Verify Python environment and core dependencies
import sys
import subprocess

print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check for required packages
try:
    import pandas as pd
    import pyarrow as pa
    import pysam
    import hgvs
    import tqdm
    import networkx as nx
    
    print("\n✅ Core dependencies available:")
    print(f"  - pandas: {pd.__version__}")
    print(f"  - pyarrow: {pa.__version__}")
    print(f"  - pysam: {pysam.__version__}")
    print(f"  - hgvs: {hgvs.__version__}")
    print(f"  - networkx: {nx.__version__}")
    
except ImportError as e:
    print(f"❌ Missing dependency: {e}")
    print("Please install required packages first")

/localscratch/naimerja.43836119.0/clinvar_env/bin/python


In [None]:
# Install required packages
# Adjust installation method based on your environment

# For environments with pre-downloaded wheels:
# !pip install --no-index --find-links /path/to/pywheels hgvs
# !pip install --no-index tqdm pandas pyarrow

# For standard environments:
# !pip install hgvs tqdm pandas pyarrow pysam networkx

print("Package installation commands provided above.")
print("Choose the appropriate method for your environment:")
print("  - Standard: pip install <package>")
print("  - Offline: pip install --no-index --find-links <wheel-dir> <package>")
print("  - Conda: conda install <package>")

View possible fields from clinvar

## ClinVar VCF Data Exploration

Examine the structure and metadata of the ClinVar VCF file to understand available annotations.

In [None]:
# Explore ClinVar VCF file structure
# Update the file path to point to your ClinVar VCF file

import subprocess
import os

# Example VCF file path (update for your data)
vcf_file = "data/clinvar_grch38.vcf.gz"  # Update this path

# Check if file exists
if os.path.exists(vcf_file):
    try:
        # View VCF header to understand available fields
        result = subprocess.run(
            ["bcftools", "view", "-h", vcf_file],
            capture_output=True, text=True, check=True
        )
        
        print("ClinVar VCF Header (first 50 lines):")
        print("=" * 50)
        header_lines = result.stdout.split('\n')[:50]
        for line in header_lines:
            print(line)
            
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        print(f"Error reading VCF file: {e}")
        print("Please ensure bcftools is installed and VCF file path is correct")
else:
    print(f"VCF file not found: {vcf_file}")
    print("Please update the file path to point to your ClinVar VCF file")
    print("Download from: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/")

print("\nKey ClinVar INFO fields to look for:")
print("- CLNSIG: Clinical significance")
print("- CLNDN: Disease name")
print("- GENEINFO: Gene information")
print("- CLNREVSTAT: Review status")

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=2025-04-29
##source=ClinVar
##reference=GRCh38
##ID=<Description="ClinVar Variation ID">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID

VEP to clean raw clinvar vcf to cleaned coding only vcf

In [None]:
# 2) Point to your VEP install and cache, and wire up Perl libs:
import os

os.environ['VEP_ROOT']   = 'SCRATCH_DIR/DNASNVData113/clinvar_data/vep-code-113'
os.environ['VEP_CACHE']  = 'SCRATCH_DIR/DNASNVData113/clinvar_data/vep-cache-113'
os.environ['PERL5LIB']   = 'SCRATCH_DIR/perl5/lib/perl5:' + os.environ.get('PERL5LIB','')
# prepend VEP_ROOT onto the existing PATH
os.environ['PATH']       = os.environ['VEP_ROOT'] + ':' + os.environ.get('PATH','')

# now this will actually show your full, correct PATH:
!echo $PATH
!which bash
!which vep


In [None]:
%%bash

/usr/bin/time -v $VEP_ROOT/vep \
  --input_file  SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_grch38.vcf.gz \
  --output_file SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_coding_only.vcf \
  --cache \
  --dir_cache $VEP_CACHE \
  --offline \
  --fasta $VEP_CACHE/homo_sapiens/113_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa \
  --species homo_sapiens \
  --assembly GRCh38 \
  --vcf \
  --hgvs \
  --pick \
  --fork 48 \
  --force_overwrite \
  --verbose \
  --coding_only


## Step 1: VEP Processing

Process ClinVar VCF through VEP to add annotations and filter for coding variants.

In [None]:
#!/usr/bin/env python3
import hgvs.edit as HEdit
from hgvs.parser import Parser
from hgvs.exceptions import HGVSError
from hgvs.enums import Datum
import hgvs.location as loc

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def is_coding_pos(pos):
    """
    Return True if the given position is within the translated CDS.
    Excludes:
      - intronic offsets (BaseOffsetPosition.is_intronic)
      - 5′ UTR (datum=CDS_START and base < 1)
      - 3′ UTR    (datum=CDS_END)
    """
    p = pos.start if hasattr(pos, "start") else pos
    if isinstance(p, loc.BaseOffsetPosition):
        dbg = f"(base={p.base}, datum={p.datum}, offset={p.offset})"
        if p.is_intronic:
            return False
        if p.datum == Datum.CDS_START and p.base < 1:
            return False
        if p.datum == Datum.CDS_END:
            return False
        if p.datum == Datum.CDS_START and p.base >= 1:
            return True
        # any other datum we don’t recognize
        raise ValueError(f"Unrecognized BaseOffsetPosition {dbg}, full pos object: {pos!r}")

def _init_worker(idx):
    # runs once in each worker
    global parser, hgvsc_idx
    parser    = Parser()
    hgvsc_idx = idx


def _classify_line(line):
    # split on tabs to get INFO (column 7)
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return ("unmatched", None, "")

    info = cols[7]
    # pull CSQ=
    csq_entries = [kv.split("=",1)[1]
                   for kv in info.split(";")
                   if kv.startswith("CSQ=")]
    if not csq_entries:
        return ("unmatched", None, "")

    # first allele in CSQ, then HGVSc field
    hfull = csq_entries[0].split(",")[0].split("|")[hgvsc_idx]
    if not hfull:
        return ("unmatched", None, "")

    # parse HGVS
    try:
        var = parser.parse_hgvs_variant(hfull)
    except HGVSError:
        return ("unmatched", None, hfull)

    edit = var.posedit.edit
    pos  = var.posedit.pos

    # get 1-based start/end
    if hasattr(pos, "start") and hasattr(pos, "end"):
        start = pos.start.base
        end   = pos.end.base
    else:
        start = end = pos.base

    # generic type key
    etype = edit.type  # attribute, not method
    if etype in ("del", "dup", "inv"):
        key = f"{etype}_{'single' if start == end else 'range'}"
    else:
        key = etype    # covers sub, ins, delins, etc.

    # coding vs noncoding
    coding = is_coding_pos(pos)

    return (key, coding, None)


def scan_hgvsc_types(vcf_path, max_workers=24):
    # 1) find CSQ header → HGVSc index
    csq_fields = None
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("##INFO=<ID=CSQ"):
                desc = line.split("Format:")[1].split('">')[0].strip()
                csq_fields = desc.split("|")
                break
    if not csq_fields:
        raise RuntimeError("Couldn't find CSQ header in VCF")
    idx = csq_fields.index("HGVSc")

    # 2) count lines for progress bar
    total = sum(1 for _ in open(vcf_path) if not _.startswith("#"))

    coding_counts    = Counter()
    noncoding_counts = Counter()
    unmatched_counts = Counter()

    # 3) parallel processing
    with ProcessPoolExecutor(
        max_workers=max_workers,
        initializer=_init_worker,
        initargs=(idx,)
    ) as exe:
        # only non-header lines
        lines = (l for l in open(vcf_path) if not l.startswith("#"))
        for key, coding, extra in tqdm(
            exe.map(_classify_line, lines, chunksize=1000),
            total=total,
            desc="Scanning variants"
        ):
            if key == "unmatched":
                unmatched_counts[extra] += 1
            else:
                if coding:
                    coding_counts[key] += 1
                else:
                    noncoding_counts[key] += 1

    # 4) report
    print("\n=== Coding-region variants ===")
    for name, cnt in coding_counts.most_common():
        print(f"  {name}: {cnt}")

    print("\n=== Non-coding variants (UTR & intronic) ===")
    for name, cnt in noncoding_counts.most_common():
        print(f"  {name}: {cnt}")

    print("\n=== Unmatched HGVSc patterns ===")
    for h, cnt in unmatched_counts.most_common():
        print(f"  {h}: {cnt}")


if __name__ == "__main__":
    scan_hgvsc_types(
        "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_coding_only.vcf",
        max_workers=24
    )


Creating data table

In [None]:
#!/usr/bin/env python3
import os
import pandas as pd
# Use 24 threads for PyArrow encoding
os.environ["ARROW_NUM_THREADS"] = "24"

import pysam
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

def get_window(genome, chrom, pos0, window_size=4096, pad_char="N"):
    """
    Fetch exactly `window_size` bases centered at 0-based pos0
    from the pysam.FastaFile `genome`, padding with `pad_char`.
    """
    half  = window_size // 2
    start = pos0 - half
    end   = start + window_size

    parts = []
    chrom_len = genome.get_reference_length(chrom)

    # left padding
    if start < 0:
        parts.append(pad_char * -start)
        fetch_start = 0
    else:
        fetch_start = start

    # fetch middle
    fetch_end = min(end, chrom_len)
    parts.append(genome.fetch(chrom, fetch_start, fetch_end))

    # right padding
    if fetch_end < end:
        parts.append(pad_char * (end - fetch_end))

    return "".join(parts)


def main(vcf_path, genome_fasta_path, out_parquet_path):
    use_cols = ["symbol", "name", "entrez_id"]
    hgnc_df = pd.read_csv(
        "SCRATCH_DIR/DNASNVData113/clinvar_data/hgnc_complete_set.txt",
        sep="\t", usecols=use_cols,
        dtype={"entrez_id": "Int64"}
    )
    # build a dict mapping Entrez ID → approved name
    gene_desc_map = dict(zip(
        hgnc_df["entrez_id"].astype(str),  # ensure keys are strings if your gene_id is str
        hgnc_df["name"]
    ))

    missing_genes = 0
    # definitions
    PATHOGENIC_ALLOWED = {
        "pathogenic",
        "pathogenic/likely_pathogenic",
        "likely_pathogenic",
        "benign",
        "likely_benign",
        "benign/likely_benign",
    }

    REVIEW_STATUS_ALLOWED = {
        "criteria_provided,_multiple_submitters,_no_conflicts",
        "reviewed_by_expert_panel",
        "practice_guideline",
    }

    # 0) explicitly remove any old output
    try:
        os.remove(out_parquet_path)
    except FileNotFoundError:
        pass

    # count variants for progress bar
    total = sum(1 for line in open(vcf_path) if not line.startswith("#"))

    # open the genomic FASTA
    genome = pysam.FastaFile(genome_fasta_path)
    fasta_contigs = set(genome.references)  # <<< build this once

    # prepare for Parquet writing
    writer = None
    batch = {col: [] for col in (
        "clinvar_id",
        "original_window",
        "mutated_window",
        "cleaned_pathogenicity",
        "disease_name",
        "gene_name",
        "gene_desc",
        "chromosome",
        "chromosome_position",
        "variant_type",
        "clinvar_link",
        "gene_id",
        "mutation_instruction",
        "pathogenicity",
        "review_status"
    )}
    batch_size = 100_000

    def flush_batch():
        nonlocal writer, batch
        table = pa.Table.from_pydict(batch)
        if writer is None:
            writer = pq.ParquetWriter(
                out_parquet_path,
                table.schema,
                compression="snappy",
                use_dictionary=True
            )
        writer.write_table(table)
        for col in batch:
            batch[col].clear()

    # process VCF
    with open(vcf_path) as vf:
        for line in tqdm(vf, total=total, desc="Writing Parquet"):
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            chrom, pos1, clinvar_id, ref, alt = cols[:5]

            # --- SKIP if this contig is not in your FASTA --- or mitochondrial chromosome (keeps only nuclear chromosomes as in Evo2)
            if chrom not in fasta_contigs or chrom == "MT":
                continue

            # Skip variants too large to fit sensibly in a 4 096 bp window
            MAX_EDIT = 64 # 64 bp
            if len(ref) > MAX_EDIT or len(alt) > MAX_EDIT:
                continue


            info = {
                kv.split("=", 1)[0]: kv.split("=", 1)[1]
                for kv in cols[7].split(";") if "=" in kv
            }

            # mutation instruction
            instr = f"{ref}>{alt}"

            # extract 4096-bp window
            pos0 = int(pos1) - 1
            orig_win = get_window(genome, chrom, pos0, window_size=4096)

            # apply REF→ALT at center
            half = 4096 // 2
            i0   = half
            i1   = half + len(ref)
            mut_win = orig_win[:i0] + alt + orig_win[i1:]
            # enforce fixed length
            if len(mut_win) < 4096:
                mut_win = mut_win.ljust(4096, "N")
            elif len(mut_win) > 4096:
                mut_win = mut_win[:4096]

            # pathogenicity, disease, variant type
            path = info.get("CLNSIG", "").lower()
            dis  = info.get("CLNDN", "")
            gene_info = info.get("GENEINFO", "")

            #filter out variants with no gene info
            if gene_info =="":
                missing_genes +=1
                continue
            else:
                gene_name = gene_info.split(":")[0]
                gene_id = gene_info.split(":")[1]


            vart = "SNV" if len(ref) == 1 == len(alt) else "non_SNV"
            rev_stat = info.get("CLNREVSTAT", "").lower()

            # filter for pathogenic/(|)likely pathogenic or benign/(|)likely benign only
            # only keep if ANY of the pipe-delimited terms is in our allowed set
            terms = path.split("|")
            if not any(term in PATHOGENIC_ALLOWED for term in terms):
                continue

            # filter for review status
            if rev_stat not in REVIEW_STATUS_ALLOWED:
                continue

            if "pathogenic" in path:
                clean_pathogenicity = "pathogenic"
            elif "benign" in path:
                clean_pathogenicity = "benign"
            else:
                raise ValueError(f"Unknown pathogenicity: {path}")


            # collect row
            batch["clinvar_id"].append(clinvar_id)
            batch["mutation_instruction"].append(instr)
            batch["original_window"].append(orig_win)
            batch["mutated_window"].append(mut_win)
            batch["pathogenicity"].append(path)
            batch["cleaned_pathogenicity"].append(clean_pathogenicity)
            batch["disease_name"].append(dis)
            batch["variant_type"].append(vart)
            batch["review_status"].append(rev_stat)
            batch["gene_name"].append(gene_name)
            batch["gene_id"].append(gene_id)
            batch["chromosome"].append(chrom)
            batch["chromosome_position"].append(pos1) # 1-based position on chromosome
            batch["gene_desc"].append(gene_desc_map.get(gene_id))
            batch["clinvar_link"].append(f"https://www.ncbi.nlm.nih.gov/clinvar/variation/{clinvar_id}/")

            # flush when batch is full
            if len(batch["mutation_instruction"]) >= batch_size:
                flush_batch()

    # final flush & close
    if batch["mutation_instruction"]:
        flush_batch()
    if writer is not None:
        writer.close()

    print("Finished writing →", out_parquet_path)
    print(f"# Removed due to missing gene info: {missing_genes}")


if __name__ == "__main__":
    main(
        "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_coding_only.vcf",
        "SCRATCH_DIR/DNASNVData113/clinvar_data/"
        "vep-cache-113/homo_sapiens/113_GRCh38/"
        "Homo_sapiens.GRCh38.dna.toplevel.fa",
        "SCRATCH_DIR/DNASNVData113/clinvar_data/"
        "clinvar_windowed_4096.parquet"
    )

# note to visually inspect the dna sequences and modified sequences go to https://www.ncbi.nlm.nih.gov/gdv/browser/genome/?id=GCF_000001405.40 and then click tools and then sequence text view

In [17]:
df[df['clinvar_id']=='10152']['disease_name'][342667]
# https://www.ncbi.nlm.nih.gov/clinvar/variation/10152/
# shows that only diseases with stars are included in the associated diseases (since hemophelia not included)

'Hereditary_factor_VIII_deficiency_disease|not_provided'

In [None]:
[print(x) for x in (df[(df['pathogenicity']=='pathogenic') & df['disease_name'].str.contains(r'\|')]['clinvar_link'])]

On login node upload table to huggingface

In [None]:
!pip install --no-index huggingface-hub
from huggingface_hub import HfApi
import os
import glob

# 0) config
repo_id     = "wanglab/bioR_tasks"         # your dataset repo
repo_type   = "dataset"
subfolder   = "variant_effect_non_snv_and_snv"
local_dir   = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"

api = HfApi()

# 1) list all files in that subfolder
all_files = api.list_repo_files(repo_id, repo_type=repo_type)
old_files = [f for f in all_files if f.startswith(subfolder + "/")]

print(f"Will delete {len(old_files)} old files:")
for f in old_files:
    print("  ", f)

# 2) delete them (one commit per file, or you can batch by reusing the same commit_message)
for f in old_files:
    api.delete_file(
        path_in_repo = f,
        repo_id      = repo_id,
        repo_type    = repo_type,
        commit_message = f"remove old dataset file"
    )

# 3) upload your single Parquet file
new_file = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"
basename = os.path.basename(new_file)
dest_path = f"{subfolder}/{basename}"

print(f"Uploading {new_file!r} to {repo_id}/{dest_path} …")
api.upload_file(
    path_or_fileobj = new_file,
    path_in_repo    = dest_path,
    repo_id         = repo_id,
    repo_type       = repo_type,
    commit_message  = f"add updated parquet {basename}"
)

print("Done! Your dataset has been updated on the Hub.")


In [None]:
!pip install --no-index huggingface-hub
from huggingface_hub import HfApi
import os
import glob

# 0) config
repo_id     = "wanglab/bioR_tasks"         # your dataset repo
repo_type   = "dataset"
subfolder   = "variant_effect_non_snv_and_snv"
local_dir   = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"

api = HfApi()

# 1) list all files in that subfolder
all_files = api.list_repo_files(repo_id, repo_type=repo_type)
old_files = [f for f in all_files if f.startswith(subfolder + "/")]


import io

# Upload cleaned DataFrame
buffer = io.BytesIO()
final_df.to_parquet(buffer, index=False)
buffer.seek(0)

# Construct cleaned filename by appending '_cleaned'
basename = os.path.splitext(os.path.basename(local_dir))[0] + "_cleaned.parquet"
dest_path = f"{subfolder}/{basename}"

print(f"Uploading cleaned DataFrame to {repo_id}/{dest_path} …")
api.upload_file(
    path_or_fileobj=buffer,
    path_in_repo=dest_path,
    repo_id=repo_id,
    repo_type=repo_type,
    commit_message=f"add cleaned parquet {basename}"
)

print("Done! Cleaned DataFrame uploaded.")


read table

In [None]:
#!/usr/bin/env python3
import time, os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

from tqdm import tqdm

def load_parquet_to_pandas(parquet_dir, num_threads=24):
    # configure PyArrow global thread pool
    pa.set_cpu_count(num_threads)
    pa.set_io_thread_count(num_threads)

    start = time.time()
    print(f"→ Discovering data under {parquet_dir!r}")

    # Option A: use the ParquetDataset API
    # dataset = pq.ParquetDataset(parquet_dir)      # older PyArrow
    # table   = dataset.read(use_threads=True)      # uses all threads by default

    # Option B (recommended): use the Dataset API
    dataset = ds.dataset(parquet_dir, format="parquet")
    print("→ Scanning & reading all fragments in parallel …")
    # to_table will read all row-groups/files in parallel (use_threads defaults to True) :contentReference[oaicite:0]{index=0}
    table = dataset.to_table()

    print("→ Converting to pandas DataFrame…")
    df = table.to_pandas()

    end = time.time()
    print(f"✅ Loaded {len(df):,} rows in {end - start:.1f}s")
    print(f"DataFrame shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e9:.2f} GB")

    return df

if __name__ == "__main__":
    PARQUET_DIR = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"
    df = load_parquet_to_pandas(PARQUET_DIR, num_threads=24)


→ Discovering data under '/scratch/naimerja/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet'
→ Scanning & reading all fragments in parallel …
→ Converting to pandas DataFrame…
✅ Loaded 342,689 rows in 3.3s
DataFrame shape: (342689, 15)
Memory usage: 3.18 GB


Create final training dataframe

In [46]:
import numpy as np
import re

#list of 50 questions

question_synonyms = {
    "A genetic variant on chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, affects the gene <GENE_SYMBOL> (<GENE_FULL_NAME>). Is this variant benign or pathogenic? If pathogenic, what disease(s) does it cause?",
    "A mutation at chromosome position <CHROMOSOME_POSITION> on chromosome <CHROMOSOME_NUMBER> in gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, which disease(s) is it linked to?",
    "Considering the variant on chromosome <CHROMOSOME_NUMBER>, location <CHROMOSOME_POSITION>, involving gene <GENE_SYMBOL> (<GENE_FULL_NAME>), would you classify it as benign or pathogenic? What disease(s), if any, does a pathogenic variant indicate?",
    "Is the genetic mutation found on chromosome <CHROMOSOME_NUMBER> at position <CHROMOSOME_POSITION>, within the gene <GENE_SYMBOL> (<GENE_FULL_NAME>), considered benign or pathogenic? If pathogenic, specify the associated disease(s).",
    "Assess the clinical significance (benign or pathogenic) of the variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>). What disease(s) is it linked to if pathogenic?",
    "Does the genetic variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, impacting gene <GENE_SYMBOL> (<GENE_FULL_NAME>), appear benign or pathogenic? If pathogenic, name the associated disease(s).",
    "Variant in gene <GENE_SYMBOL> (<GENE_FULL_NAME>), located at chromosome <CHROMOSOME_NUMBER> position <CHROMOSOME_POSITION>: benign or pathogenic? What disease(s) does it cause if pathogenic?",
    "Gene <GENE_SYMBOL> (<GENE_FULL_NAME>) variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>—is it benign or pathogenic? If pathogenic, what are the associated condition(s)?",
    "A genetic alteration at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, in gene <GENE_SYMBOL> (<GENE_FULL_NAME>)—benign or pathogenic? If pathogenic, which disease(s) is involved?",
    "Chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): Is this mutation clinically benign or pathogenic? If pathogenic, identify the related disease(s).",
    "Does the variant on chromosome <CHROMOSOME_NUMBER> at location <CHROMOSOME_POSITION> affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>) have a clinical significance of benign or pathogenic? If pathogenic, what disease(s) is associated?",
    "Mutation at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, within <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, indicate the disease(s).",
    "Evaluate this variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, what are the disease connection(s)?",
    "Gene mutation in <GENE_SYMBOL> (<GENE_FULL_NAME>) at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>—is it benign or pathogenic? If pathogenic, specify the disease(s).",
    "Located at chromosome <CHROMOSOME_NUMBER> position <CHROMOSOME_POSITION>, the variant affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>)—benign or pathogenic? If pathogenic, which disease(s) does it relate to?",
    "Is the chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION> variant in <GENE_SYMBOL> (<GENE_FULL_NAME>) clinically benign or pathogenic? If pathogenic, what condition(s) is associated?",
    "Clinical significance of chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? Name the disease(s) if pathogenic.",
    "Is the genetic variant on chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>), benign or pathogenic? If pathogenic, what disease(s) is indicated?",
    "Regarding the variant at chromosome <CHROMOSOME_NUMBER> and position <CHROMOSOME_POSITION>, affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, what are the associated illness(es)?",
    "The mutation in gene <GENE_SYMBOL> (<GENE_FULL_NAME>) at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>—clinically benign or pathogenic? If pathogenic, identify the related disease(s).",
    "Assess the variant on chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, impacting <GENE_SYMBOL> (<GENE_FULL_NAME>): is it benign or pathogenic? If pathogenic, specify the associated condition(s).",
    "Variant in <GENE_SYMBOL> (<GENE_FULL_NAME>), chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>—is this benign or pathogenic? If pathogenic, what disease(s) is linked?",
    "Clinical impact (benign or pathogenic) of the variant at chromosome <CHROMOSOME_NUMBER>, location <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): what disease(s) if pathogenic?",
    "The chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION> genetic variant in gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, indicate disease(s).",
    "Determine if the mutation at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION> in gene <GENE_SYMBOL> (<GENE_FULL_NAME>) is benign or pathogenic. If pathogenic, what disease(s) is associated?",
    "Is chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>) variant benign or pathogenic? If pathogenic, what condition(s) is it related to?",
    "The mutation impacting <GENE_SYMBOL> (<GENE_FULL_NAME>) on chromosome <CHROMOSOME_NUMBER> at position <CHROMOSOME_POSITION>: benign or pathogenic? Name the associated disease(s) if pathogenic.",
    "Variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): clinically benign or pathogenic? If pathogenic, specify the disease(s) involved.",
    "Chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic variant? If pathogenic, what are the linked illness(es)?",
    "A genetic variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>)—is it benign or pathogenic? If pathogenic, identify the associated disorder(s).",
    "Mutation found at chromosome <CHROMOSOME_NUMBER> position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, indicate the relevant disease(s).",
    "Benign or pathogenic: chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>) variant? Disease(s) if pathogenic?",
    "Evaluate if the mutation on chromosome <CHROMOSOME_NUMBER> at position <CHROMOSOME_POSITION> in <GENE_SYMBOL> (<GENE_FULL_NAME>) is benign or pathogenic. Disease name(s) if pathogenic?",
    "Clinical classification of chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? Disease(s) if pathogenic?",
    "Variant chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? Disease(s)?",
    "Variant on chromosome <CHROMOSOME_NUMBER>, at position <CHROMOSOME_POSITION>, affecting <GENE_SYMBOL> (<GENE_FULL_NAME>): is it benign or pathogenic? If pathogenic, specify the associated disease(s).",
    "Does the chromosome <CHROMOSOME_NUMBER> mutation at position <CHROMOSOME_POSITION> within gene <GENE_SYMBOL> (<GENE_FULL_NAME>) classify as benign or pathogenic? If pathogenic, indicate the related illness(es).",
    "Determine whether the variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, in gene <GENE_SYMBOL> (<GENE_FULL_NAME>) is benign or pathogenic. If pathogenic, identify the relevant disease(s).",
    "Gene <GENE_SYMBOL> (<GENE_FULL_NAME>) variant at chromosome position <CHROMOSOME_POSITION> on chromosome <CHROMOSOME_NUMBER>: benign or pathogenic? If pathogenic, what disease(s) is it associated with?",
    "Considering the genetic mutation at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, impacting <GENE_SYMBOL> (<GENE_FULL_NAME>): is it clinically benign or pathogenic? Name the associated disease(s) if pathogenic.",
    "Evaluate the clinical significance of the mutation at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION> in gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? What disease(s) does a pathogenic variant suggest?",
    "Is the variant located on chromosome <CHROMOSOME_NUMBER> at position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>), benign or pathogenic? If pathogenic, specify the disease(s) linked.",
    "Classify the chromosome <CHROMOSOME_NUMBER> variant at position <CHROMOSOME_POSITION> affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>) as benign or pathogenic. If pathogenic, which disease(s) is associated?",
    "For chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic mutation? If pathogenic, what are the associated disease(s)?",
    "Is the genetic change at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, within gene <GENE_SYMBOL> (<GENE_FULL_NAME>) benign or pathogenic? Name the disease(s) if pathogenic.",
    "Does the variant impacting <GENE_SYMBOL> (<GENE_FULL_NAME>) on chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, classify as benign or pathogenic? If pathogenic, what disease(s) is it associated with?",
    "Variant at chromosome position <CHROMOSOME_POSITION>, chromosome <CHROMOSOME_NUMBER>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, what condition(s) does it relate to?",
    "Regarding the variant found on chromosome <CHROMOSOME_NUMBER> at position <CHROMOSOME_POSITION> in gene <GENE_SYMBOL> (<GENE_FULL_NAME>): is it benign or pathogenic? If pathogenic, identify the disease(s).",
    "The genetic variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, affecting gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? Disease name(s) if pathogenic?",
    "Clinically, how would you classify the variant at chromosome <CHROMOSOME_NUMBER>, position <CHROMOSOME_POSITION>, gene <GENE_SYMBOL> (<GENE_FULL_NAME>): benign or pathogenic? If pathogenic, specify the associated illness(es)."
}

question_df = pd.DataFrame({'question': list(question_synonyms)})
question_df.index.name = 'question_number'

# copy the df to training_df
training_df = df.copy()
training_df = training_df.rename(columns={'original_window': 'reference_sequence', 'mutated_window': 'mutated_sequence'})
training_df['question_number'] = np.random.randint(0, 50, size=len(training_df)) # generate random question number between 0 and 49 inclusive

# merge the training_df with the question_df
training_df = pd.merge(training_df, question_df, on='question_number', how='left')

# drop the question_number column
training_df = training_df.drop(columns=['question_number'])

def fill_placeholders(row):
    q = row['question']
    # always replace these
    q = q.replace('<CHROMOSOME_NUMBER>', str(row['chromosome']))
    q = q.replace('<CHROMOSOME_POSITION>', str(row['chromosome_position']))
    q = q.replace('<GENE_SYMBOL>', row['gene_name'])
    
    # gene_full_name may be None
    if pd.notnull(row['gene_desc']):
        q = q.replace('<GENE_FULL_NAME>', row['gene_desc'])
    else:
        # remove the entire "(<GENE_FULL_NAME>)" including surrounding space
        q = re.sub(r'\s*\(\s*<GENE_FULL_NAME>\s*\)', '', q)
    
    return q

training_df['question'] = training_df.apply(fill_placeholders, axis=1)



def format_answer(row):
    path = row['cleaned_pathogenicity']
    disease = row['disease_name']
    
    # If disease_name is exactly 'not_provided' or 'not_specified'
    if disease in ('not_provided', 'not_specified', 'not_specified|not_provided', 'not_provided|not_specified'):
        return path
    
    # Split on '|' into a list and drop 'not_provided'
    diseases = [d for d in disease.split('|') if d != 'not_provided']
    
    # Handle 'not_specified': note it, then drop it
    unspecified = 'not_specified' in diseases
    diseases = [d for d in diseases if d != 'not_specified']
    
    # Sort the disease names alphabetically
    diseases = sorted(diseases)
    
    # If unspecified, append the note as an element at the end
    if unspecified:
        diseases.append('likely other unspecified diseases')
    
    # Represent diseases as a Python-style list literal
    disease_text = str(diseases)  # e.g. "['DiseaseA', 'DiseaseB']"
    
    # Build the answer, adding semicolon only for pathogenic
    if path == 'pathogenic' and diseases:
        return f"{path}; {disease_text}"
    else:
        return path

# Apply to your DataFrame
training_df['answer'] = training_df.apply(format_answer, axis=1)




In [5]:
training_df['disease_name'].value_counts()

disease_name
not_provided                                                                                                      73241
not_specified|not_provided                                                                                         6405
not_provided|not_specified                                                                                         5466
Inborn_genetic_diseases|not_provided                                                                               2289
not_provided|Inborn_genetic_diseases                                                                               1929
                                                                                                                  ...  
not_provided|VAMP7-related_disorder                                                                                   1
46,XY_sex_reversal_1|not_provided                                                                                     1
Hereditary_factor_VIII_defi

In [6]:
!pip install --no-index networkx
import networkx as nx

Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/x86-64-v3, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Processing /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic/networkx-3.4.2+computecanada-py3-none-any.whl
Installing collected packages: networkx
Successfully installed networkx-3.4.2+computecanada


disjoint diseases

In [47]:
special_diseases = {"not_provided", "not_specified", "Inborn_genetic_diseases", "See_cases"}
import pandas as pd
import networkx as nx
import itertools
import numpy as np
import re
import time
from tqdm import tqdm
from math import comb
import multiprocessing as mp
from functools import partial
from collections import defaultdict

def _evaluate_subset(subset, G, disease_to_rows, train_frac):
    """
    Worker to evaluate one subset removal:
      - removes `subset` from G,
      - checks for ≥2 components,
      - if so, computes the train/test split score using disease_to_rows.
    Returns (score, subset, components) or None.
    """
    H = G.copy()
    H.remove_nodes_from(subset)
    ccs = list(nx.connected_components(H))
    if len(ccs) < 2:
        return None

    # compute unique row counts for each component
    sizes = []
    for comp in ccs:
        rows = set()
        for d in comp:
            rows |= disease_to_rows.get(d, set())
        sizes.append(len(rows))

    # pick two largest comps
    idx = np.argsort(sizes)[::-1][:2]
    train_count, test_count = sizes[idx[0]], sizes[idx[1]]
    frac = train_count / (train_count + test_count)
    score = abs(frac - train_frac)
    return (score, subset, ccs)

def assign_disjoint_splits(
    df: pd.DataFrame,
    special_diseases: set,
    train_frac: float = 0.9,
    max_remove: int = 3,
    random_state: int = 42,
    n_procs: int = 24
) -> (pd.DataFrame, dict):
    """
    Add a 'split' column to df (0=train, 1=test) so that:
      - No disease outside special_diseases appears in both splits.
      - The overall train/test row ratio is as close to train_frac as possible.
      - SNV/non-SNV and pathogenic/benign proportions stay balanced automatically
        by sampling at the end for any rows containing only special diseases.
    Uses up to `n_procs` parallel processes for the removal search, but only if needed.
    Prints progress at every major step.
    """
    rng = np.random.RandomState(random_state)
    start_time = time.time()
    print("Starting split assignment...")

    # 1) Build graph and disease→rows mapping
    print("Step 1/5: Building graph & disease→row index mapping (excluding specials)...")
    G = nx.Graph()
    disease_to_rows = defaultdict(set)
    for idx, name_str in enumerate(df['disease_name']):
        names = name_str.split('|')
        non_special = [d for d in names if d not in special_diseases]
        for d in non_special:
            disease_to_rows[d].add(idx)
            G.add_node(d)
        for u, v in itertools.combinations(non_special, 2):
            G.add_edge(u, v)
    elapsed = time.time() - start_time
    print(f"  → Built graph with {G.number_of_nodes()} nodes, {G.number_of_edges()} edges in {elapsed:.1f}s")

    # 2) Check connectivity
    print("Step 2/5: Checking for existing disconnected components...")
    comps = list(nx.connected_components(G))
    if len(comps) >= 2:
        print(f"  → Found {len(comps)} components; skipping node removal.")
        # compute rows-per-component sets
        comp_rows = []
        for comp in comps:
            rows_set = set()
            for d in comp:
                rows_set |= disease_to_rows[d]
            comp_rows.append((comp, rows_set))

        # total non-special rows
        total_ns_rows = len(set().union(*(rows for _, rows in comp_rows)))
        target_train_ns = train_frac * total_ns_rows

        # sort components by descending size
        comp_rows.sort(key=lambda x: len(x[1]), reverse=True)

        # greedy pack to hit target_train_ns
        train_comp = set()
        train_rows = set()
        for comp, rows_set in comp_rows:
            if len(train_rows | rows_set) <= target_train_ns or not train_rows:
                train_comp |= comp
                train_rows |= rows_set

        all_nodes = set(G.nodes())
        test_comp = all_nodes - train_comp
        dropped = []
    else:
        # 3) Removal search
        print("Step 3/5: Graph is connected; searching for node removals…")
        best = {'score': float('inf')}
        all_nodes = list(G.nodes())
        worker = partial(_evaluate_subset,
                         G=G,
                         disease_to_rows=disease_to_rows,
                         train_frac=train_frac)
        for k in range(1, max_remove + 1):
            total_combs = comb(len(all_nodes), k)
            print(f"  → Trying removals of size {k} ({total_combs} combos)…")
            with mp.Pool(processes=n_procs) as pool:
                for result in tqdm(pool.imap_unordered(worker, itertools.combinations(all_nodes, k)),
                                   total=total_combs,
                                   desc=f"    size={k}"):
                    if not result:
                        continue
                    score, subset, ccs = result
                    if score < best['score']:
                        best.update(score=score, subset=subset, components=ccs)
            elapsed_k = time.time() - start_time
            print(f"    → Done size-{k} in {elapsed_k:.1f}s; best score = {best['score']:.4f}")
            if best['score'] < float('inf'):
                break

        dropped = list(best['subset'])
        comps = best['components']

        # 4) select two largest comps
        print("Step 4/5: Selecting two largest components for train/test…")
        comp_counts = []
        for comp in comps:
            rows_set = set()
            for d in comp:
                rows_set |= disease_to_rows[d]
            comp_counts.append((comp, rows_set))
        comp_counts.sort(key=lambda x: len(x[1]), reverse=True)
        train_comp, test_comp = comp_counts[0][0], comp_counts[1][0]

    # 5) Assign rows
    print("Step 5/5: Assigning rows to splits…")
    def which_split(dlist):
        non_special = [d for d in dlist if d not in special_diseases]
        if any(d in train_comp for d in non_special):
            return 0
        if any(d in test_comp for d in non_special):
            return 1
        return None

    df_out = df.copy()
    df_out['split'] = df_out['disease_name'].str.split('|').apply(which_split)

    # fill None rows to achieve exact train_frac
    mask_none = df_out['split'].isna()
    n_none = mask_none.sum()
    n_train_desired = int(train_frac * len(df_out))
    n_current_train = (df_out['split'] == 0).sum()
    n_to_train = max(0, n_train_desired - n_current_train)
    assign = np.array([0]*n_to_train + [1]*(n_none - n_to_train))
    rng.shuffle(assign)
    df_out.loc[mask_none, 'split'] = assign
    df_out['split'] = df_out['split'].astype(int)

    total_elapsed = time.time() - start_time
    print(f"Done! Total time: {total_elapsed:.1f}s; achieved train fraction = {df_out['split'].mean():.4f}")

    info = {
        'dropped_nodes': dropped,
        'dropped_row_count': int(sum(len(disease_to_rows[d]) for d in dropped)),
        'achieved_frac': float(df_out['split'].mean())
    }
    return df_out, info

# ── Usage ──
new_df, report = assign_disjoint_splits(
    training_df,
    special_diseases,
    train_frac=0.9,
    max_remove=3,
    random_state=42,
    n_procs=24
)
print("Dropped diseases:", report['dropped_nodes'])
print("Rows dropped:", report['dropped_row_count'])
print(f"Final train fraction: {report['achieved_frac']:.3f}")


Starting split assignment...
Step 1/5: Building graph & disease→row index mapping (excluding specials)...
  → Built graph with 13326 nodes, 48265 edges in 1.0s
Step 2/5: Checking for existing disconnected components...
  → Found 3099 components; skipping node removal.
Step 5/5: Assigning rows to splits…
Done! Total time: 9.8s; achieved train fraction = 0.1000
Dropped diseases: []
Rows dropped: 0
Final train fraction: 0.100


In [48]:
new_df['split'].value_counts()

split
0    308420
1     34269
Name: count, dtype: int64

In [49]:
new_df[new_df['split']==1]['disease_name'].unique()

array(['SAMD11-related_disorder|not_provided',
       'not_provided|SAMD11-related_disorder', 'not_provided', ...,
       'not_provided|VAMP7-related_disorder',
       '46,XY_sex_reversal_1|not_provided',
       'TBL1Y-related_disorder|Deafness,_Y-linked_2|not_provided'],
      shape=(11445,), dtype=object)

In [50]:
new_df['split'].value_counts()

split
0    308420
1     34269
Name: count, dtype: int64

In [51]:
import pandas as pd

# assuming new_df is your DataFrame with a 'split' column (0=train, 1=test)

def print_ratio_stats(df, split_label):
    sub = df[df['split'] == split_label]
    total = len(sub)
    print(f"\n=== Split {split_label} (n={total}) ===")
    
    # Pathogenic vs. Benign
    p_counts = sub['cleaned_pathogenicity'].value_counts()
    p_ratios = p_counts / total
    print("\nPathogenicity counts:")
    print(p_counts.to_string())
    print("\nPathogenicity ratios:")
    print(p_ratios.to_string())
    
    # SNV vs. non-SNV
    v_counts = sub['variant_type'].value_counts()
    v_ratios = v_counts / total
    print("\nVariant-type counts:")
    print(v_counts.to_string())
    print("\nVariant-type ratios:")
    print(v_ratios.to_string())

# Overall
print_ratio_stats(new_df, 0)  # train
print_ratio_stats(new_df, 1)  # test

# If you also want a quick cross-tab view:
print("\nCross-tab: split × pathogenicity")
print(pd.crosstab(new_df['split'], new_df['cleaned_pathogenicity'], normalize='index'))

print("\nCross-tab: split × variant_type")
print(pd.crosstab(new_df['split'], new_df['variant_type'], normalize='index'))



=== Split 0 (n=308420) ===

Pathogenicity counts:
cleaned_pathogenicity
benign        230709
pathogenic     77711

Pathogenicity ratios:
cleaned_pathogenicity
benign        0.748035
pathogenic    0.251965

Variant-type counts:
variant_type
SNV        274147
non_SNV     34273

Variant-type ratios:
variant_type
SNV        0.888876
non_SNV    0.111124

=== Split 1 (n=34269) ===

Pathogenicity counts:
cleaned_pathogenicity
benign        30279
pathogenic     3990

Pathogenicity ratios:
cleaned_pathogenicity
benign        0.883568
pathogenic    0.116432

Variant-type counts:
variant_type
SNV        32454
non_SNV     1815

Variant-type ratios:
variant_type
SNV        0.947037
non_SNV    0.052963

Cross-tab: split × pathogenicity
cleaned_pathogenicity    benign  pathogenic
split                                      
0                      0.748035    0.251965
1                      0.883568    0.116432

Cross-tab: split × variant_type
variant_type       SNV   non_SNV
split                    

In [62]:


final_df = new_df.copy()[['question', 'answer', 'reference_sequence', 'mutated_sequence', 'split', 'variant_type', 'cleaned_pathogenicity']]

# if len(final_df['variant_type'].value_counts().keys().tolist()) > 2:
#     raise ValueError("variant_type has more than 2 values, should just be SNV and non_SNV")

train_split_df = final_df[final_df['split']==0]
test_split_df = final_df[final_df['split']==1]

train_split_df = train_split_df.drop('split', axis=1)
test_split_df = test_split_df.drop('split', axis=1)

snv_train_split_df = train_split_df[train_split_df['variant_type']=='SNV']
non_snv_train_split_df = train_split_df[train_split_df['variant_type']=='non_SNV']

snv_test_split_df = test_split_df[test_split_df['variant_type']=='SNV']
non_snv_test_split_df = test_split_df[test_split_df['variant_type']=='non_SNV']

snv_test_split_df = snv_test_split_df.drop('variant_type', axis=1)
non_snv_test_split_df = non_snv_test_split_df.drop('variant_type', axis=1)

snv_train_split_df = snv_train_split_df.drop('variant_type', axis=1)
non_snv_train_split_df = non_snv_train_split_df.drop('variant_type', axis=1)



In [None]:
# save all the final dataframes to parquet files
snv_train_split_df.to_parquet('SCRATCH_DIR/DNASNVData113/finaldata/snv_train_split_df.parquet')
non_snv_train_split_df.to_parquet('SCRATCH_DIR/DNASNVData113/finaldata/non_snv_train_split_df.parquet')
snv_test_split_df.to_parquet('SCRATCH_DIR/DNASNVData113/finaldata/snv_test_split_df.parquet')
non_snv_test_split_df.to_parquet('SCRATCH_DIR/DNASNVData113/finaldata/non_snv_test_split_df.parquet')

#now upload to huggingface
!pip install --no-index huggingface-hub
from huggingface_hub import HfApi
import os
import glob

# 0) config
repo_id     = "wanglab/bioR_tasks"         # your dataset repo
repo_type   = "dataset"
subfolder   = "task4-variant_effect_non_snv_and_snv_with_split"
local_dir   = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"

api = HfApi()

# 1) list all files in that subfolder
all_files = api.list_repo_files(repo_id, repo_type=repo_type)
old_files = [f for f in all_files if f.startswith(subfolder + "/")]

print(f"Will delete {len(old_files)} old files:")
for f in old_files:
    print("  ", f)

# 2) delete them (one commit per file, or you can batch by reusing the same commit_message)
for f in old_files:
    api.delete_file(
        path_in_repo = f,
        repo_id      = repo_id,
        repo_type    = repo_type,
        commit_message = f"remove old dataset file"
    )

# 3) upload your single Parquet file
new_file = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_windowed_4096.parquet"
basename = os.path.basename(new_file)
dest_path = f"{subfolder}/{basename}"

print(f"Uploading {new_file!r} to {repo_id}/{dest_path} …")
api.upload_file(
    path_or_fileobj = new_file,
    path_in_repo    = dest_path,
    repo_id         = repo_id,
    repo_type       = repo_type,
    commit_message  = f"add updated parquet {basename}"
)

print("Done! Your dataset has been updated on the Hub.")


Unnamed: 0,question,answer,reference_sequence,mutated_sequence,cleaned_pathogenicity
0,"Assess the variant on chromosome 1, position 9...",benign,GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTG...,GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTG...,benign
1,Gene SAMD11 (sterile alpha motif domain contai...,benign,TGACTAACACGGTGAAACCCGTCTCTACTAAAAATACAAAAAATTA...,TGACTAACACGGTGAAACCCGTCTCTACTAAAAATACAAAAAATTA...,benign
2,The mutation in gene SAMD11 (sterile alpha mot...,benign,CCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATG...,CCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATG...,benign
3,"Determine whether the variant at chromosome 1,...",benign,GAGGGAGTGAGTTAGACGCTCTCAAGGGCTCTGCCACCTCCCGGAG...,GAGGGAGTGAGTTAGACGCTCTCAAGGGCTCTGCCACCTCCCGGAG...,benign
4,"Variant on chromosome 1, at position 935779, a...",benign,CCTATGTGCCTGGGGGGGGCTTCCTTTCCCACTGGGAGCCGGTGGG...,CCTATGTGCCTGGGGGGGGCTTCCTTTCCCACTGGGAGCCGGTGGG...,benign
...,...,...,...,...,...
342678,"Variant at chromosome X, position 155524483, g...",benign,GTGTGCATAGCTCTATGCAGTGTAATTACATGTGTAACTTTGTGTA...,GTGTGCATAGCTCTATGCAGTGTAATTACATGTGTAACTTTGTGTA...,benign
342680,"Mutation at chromosome X, position 155900534, ...",benign,AGCATTAAAGATCATCTAGTTGAACTACCCATCTGATGCTTAAATG...,AGCATTAAAGATCATCTAGTTGAACTACCCATCTGATGCTTAAATG...,benign
342681,Does the variant on chromosome X at location 1...,benign,CAATTAGTCCCTTGATTATTGATCCTTCTCTTTTGGCTGTATTCTC...,CAATTAGTCCCTTGATTATTGATCCTTCTCTTTTGGCTGTATTCTC...,benign
342685,Assess the clinical significance (benign or pa...,benign,TTTAGTCTTTCCAAAATGTATACATGCATGATGTCATAATTTTTAA...,TTTAGTCTTTCCAAAATGTATACATGCATGATGTCATAATTTTTAA...,benign


In [13]:
final_training_df.iloc[0][['question', 'answer', 'reference_sequence', 'mutated_sequence']].to_dict()

{'question': 'Assess the variant on chromosome 1, position 930204, impacting SAMD11 (sterile alpha motif domain containing 11): is it benign or pathogenic? If pathogenic, specify the associated condition(s).',
 'answer': 'benign',
 'reference_sequence': 'GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGACTAACACGGTGAAACCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTGGCGGGTGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATGAACCCGGGAGGCGGAGCTTGCAGTGAGCCCAGATTGTGCCACCGCACTCCAGCCTGGGCAACAGAGTGAGACTCCGTCTCAAAAAACTAAAAAAGAAGAGAGGTGGGAGAGGAGAGGCTGTCAGAGCCTCTAAGCCCTGGTGCTTGGGCTGCAGAAGGGCAGAGCTAAGCGGGACTTCCCAGCACAGCACACTCCGGACAGGCTGTGGCTGTTGAAGGGACCCCCGAGCTCCAGCTGACACGCGGAGGCCCGGGCACAGACAGGCATCATACCTTCGGCCTTGGCCGCACTCTGTGGTCATTGGTGTTGGGGGCAGCCCAGGGTCAGGGCAGGGTCTCAGCCTCGGACCCCAGGCCCCACCCCTTGCCCAGCAGTGCTGCGTTTTCCCAGTGAGCTGTCGTGGAGAGAGCAGAGGGGACCCAGCGCAGGCCCAGTGGCCGGTGAGGGGAGACGTGGCTCTGGGACGGGGGCCTCCACCTGGGTGGGGGGATGCTCCAGCTTCCAGACCCTTGGGGAGGGGGCACTGCCCAAACTAAGCTGGCACTGGGGCTGTGCATTTGAAGGTGATGGTGGTTCTAGGTCTGAGGAG

In [15]:
df.iloc[0].to_dict()

{'clinvar_id': '1170208',
 'original_window': 'GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGACTAACACGGTGAAACCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTGGCGGGTGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATGAACCCGGGAGGCGGAGCTTGCAGTGAGCCCAGATTGTGCCACCGCACTCCAGCCTGGGCAACAGAGTGAGACTCCGTCTCAAAAAACTAAAAAAGAAGAGAGGTGGGAGAGGAGAGGCTGTCAGAGCCTCTAAGCCCTGGTGCTTGGGCTGCAGAAGGGCAGAGCTAAGCGGGACTTCCCAGCACAGCACACTCCGGACAGGCTGTGGCTGTTGAAGGGACCCCCGAGCTCCAGCTGACACGCGGAGGCCCGGGCACAGACAGGCATCATACCTTCGGCCTTGGCCGCACTCTGTGGTCATTGGTGTTGGGGGCAGCCCAGGGTCAGGGCAGGGTCTCAGCCTCGGACCCCAGGCCCCACCCCTTGCCCAGCAGTGCTGCGTTTTCCCAGTGAGCTGTCGTGGAGAGAGCAGAGGGGACCCAGCGCAGGCCCAGTGGCCGGTGAGGGGAGACGTGGCTCTGGGACGGGGGCCTCCACCTGGGTGGGGGGATGCTCCAGCTTCCAGACCCTTGGGGAGGGGGCACTGCCCAAACTAAGCTGGCACTGGGGCTGTGCATTTGAAGGTGATGGTGGTTCTAGGTCTGAGGAGGACACCCTCCTAACAGCCTCATCCCCAAGCTCCGGGCTGTGTTGTGGCAATGGGAGGGAGGAAGTCTGAGGAGACCCTGGTGACTGAACGGAGGAGGGAGTGAGTTAGACGCTCTCAAGGGCTCTGCCACCTCCCGGAGCCAGCGGCCTGTTACTACATTTAAAAAAGCCTCCCGCCCACTGGAAAATAATCAATAACTTTCCTTTAT

In [5]:
training_df['answer'].sample(100).value_counts()

answer
benign                                                                                                                                                                                                         80
pathogenic; ['Intellectual_disability,_X-linked_102']                                                                                                                                                           1
pathogenic; ['Familial_adenomatous_polyposis_2', 'Hereditary_cancer-predisposing_syndrome']                                                                                                                     1
pathogenic; ['Familial_thoracic_aortic_aneurysm_and_aortic_dissection', 'Hereditary_cancer-predisposing_syndrome', 'Juvenile_polyposis_syndrome']                                                               1
pathogenic; ['Familial_cancer_of_breast', 'Hereditary_cancer-predisposing_syndrome']                                                                     

visualization of table

In [15]:
df[df['pathogenicity']=='benign']

Unnamed: 0,clinvar_id,original_window,mutated_window,cleaned_pathogenicity,disease_name,variant_type,clinvar_link,mutation_instruction,pathogenicity,review_status
0,1170208,GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTG...,GGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTG...,benign,SAMD11-related_disorder|not_provided,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,G>A,benign,"criteria_provided,_multiple_submitters,_no_con..."
2,1170010,CCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATG...,CCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCATG...,benign,SAMD11-related_disorder|not_provided,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,C>T,benign,"criteria_provided,_multiple_submitters,_no_con..."
3,1170044,GAGGGAGTGAGTTAGACGCTCTCAAGGGCTCTGCCACCTCCCGGAG...,GAGGGAGTGAGTTAGACGCTCTCAAGGGCTCTGCCACCTCCCGGAG...,benign,not_provided|SAMD11-related_disorder,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,C>T,benign,"criteria_provided,_multiple_submitters,_no_con..."
5,1170011,AGCCGTCATCTAGGTCTCCTGGAAGGTTTAGAGCCCAGCCTGGGAG...,AGCCGTCATCTAGGTCTCCTGGAAGGTTTAGAGCCCAGCCTGGGAG...,benign,SAMD11-related_disorder|not_provided,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,C>G,benign,"criteria_provided,_multiple_submitters,_no_con..."
7,1169668,GGTTTAGAGCCCAGCCTGGGAGTCTTTGGTGCTGAAACGGATCTGC...,GGTTTAGAGCCCAGCCTGGGAGTCTTTGGTGCTGAAACGGATCTGC...,benign,not_provided,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,C>T,benign,"criteria_provided,_multiple_submitters,_no_con..."
...,...,...,...,...,...,...,...,...,...,...
342875,522717,TGTCATCCCTCTTATTAATCATCATCCTAGCCCTAAGTCTGGCCTA...,TGTCATCCCTCTTATTAATCATCATCCTAGCCCTAAGTCTGGCCTA...,benign,Mitochondrial_disease|not_specified,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,G>A,benign,"criteria_provided,_multiple_submitters,_no_con..."
342878,65510,CTAAAACTAATCGTCCCAACAATTATATTACTACCACTGACATGAC...,CTAAAACTAATCGTCCCAACAATTATATTACTACCACTGACATGAC...,benign,Leber_optic_atrophy|Leigh_syndrome|Mitochondri...,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,T>C,benign,reviewed_by_expert_panel
342905,140592,AGTTACAATCGGCATCAACCAACCACACCTAGCATTCCTGCACATC...,AGTTACAATCGGCATCAACCAACCACACCTAGCATTCCTGCACATC...,benign,Familial_cancer_of_breast|Mitochondrial_diseas...,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,A>G,benign,reviewed_by_expert_panel
342907,235623,TAAACGCCTGGCAGCCGGAAGCCTATTCGCAGGATTTCTCATTACT...,TAAACGCCTGGCAGCCGGAAGCCTATTCGCAGGATTTCTCATTACT...,benign,Leigh_syndrome|not_provided,SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,A>G,benign,"criteria_provided,_multiple_submitters,_no_con..."


In [23]:
df[df['variant_type']=='non_SNV']

Unnamed: 0,clinvar_id,original_window,mutated_window,cleaned_pathogenicity,disease_name,variant_type,clinvar_link,mutation_instruction,pathogenicity,review_status
42,1185392,TTATTGATGTGAAATTCATATAACATAAAACTAACCATTTTAAAGA...,TTATTGATGTGAAATTCATATAACATAAAACTAACCATTTTAAAGA...,benign,Mendelian_susceptibility_to_mycobacterial_dise...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,T>TA,benign,"criteria_provided,_multiple_submitters,_no_con..."
67,666960,TGGTGCAGGGAGGTGACTGGGTCCTTGGCCATGGGGTTGGGACCTG...,TGGTGCAGGGAGGTGACTGGGTCCTTGGCCATGGGGTTGGGACCTG...,pathogenic,Congenital_myasthenic_syndrome|Congenital_myas...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,G>GGGGCC,pathogenic/likely_pathogenic,"criteria_provided,_multiple_submitters,_no_con..."
69,970311,ATCAGCAGGTGCCCGTTGGATTTGGACTGGGAGTCCCAGGGCCTTG...,ATCAGCAGGTGCCCGTTGGATTTGGACTGGGAGTCCCAGGGCCTTG...,pathogenic,Congenital_myasthenic_syndrome_8,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,G>GC,pathogenic/likely_pathogenic,"criteria_provided,_multiple_submitters,_no_con..."
80,930633,GTGCCTGAGGCAGCTTTGTTGGCCACGTTGAGGTCTGGTGATGGGA...,GTGCCTGAGGCAGCTTTGTTGGCCACGTTGAGGTCTGGTGATGGGA...,pathogenic,Presynaptic_congenital_myasthenic_syndrome|Con...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,CGCTCCGGCCAGTGCCAGGGTCGAGGTGAGCGGCTCCCCCGGGGGA...,likely_pathogenic,"criteria_provided,_multiple_submitters,_no_con..."
90,263160,TCGCGGGACCCCTGCTCCAACGTGACCTGCAGCTTCGGCAGCACCT...,TCGCGGGACCCCTGCTCCAACGTGACCTGCAGCTTCGGCAGCACCT...,benign,not_provided|not_specified|Congenital_myasthen...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,CCT>C,benign,"criteria_provided,_multiple_submitters,_no_con..."
...,...,...,...,...,...,...,...,...,...,...
342844,9654,TACATAAAATCTAGACAAAAAAGGAAGGAATCGAACCCCCCAAAGC...,TACATAAAATCTAGACAAAAAAGGAAGGAATCGAACCCCCCAAAGC...,pathogenic,Mitochondrial_disease|Mitochondrial_complex_IV...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,TTTTTTCTTCGCAGGA>T,likely_pathogenic,reviewed_by_expert_panel
342845,9656,CAAGCCAACCCCATGGCCTCCATGACTTTTTCAAAAAGGTATTAGA...,CAAGCCAACCCCATGGCCTCCATGACTTTTTCAAAAAGGTATTAGA...,pathogenic,Mitochondrial_disease|Mitochondrial_complex_IV...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,A>AC,likely_pathogenic,reviewed_by_expert_panel
342876,693440,ATGAGTGACTACAAAAAGGATTAGACTGAACCGAATTGGTATATAG...,ATGAGTGACTACAAAAAGGATTAGACTGAACCGAATTGGTATATAG...,pathogenic,Mitochondrial_myopathy_with_reversible_cytochr...,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,CA>C,likely_pathogenic,reviewed_by_expert_panel
342895,800503,ACCTTTATTATCAGTCTCTTCCCCACAACAATATTCATGTGCCTAG...,ACCTTTATTATCAGTCTCTTCCCCACAACAATATTCATGTGCCTAG...,pathogenic,Mitochondrial_disease,non_SNV,https://www.ncbi.nlm.nih.gov/clinvar/variation...,CTA>C,likely_pathogenic,reviewed_by_expert_panel


In [19]:
df['variant_type'].value_counts()

variant_type
SNV        306816
non_SNV     36097
Name: count, dtype: int64

In [17]:
df['pathogenicity'].value_counts().keys()

Index(['likely_benign', 'benign', 'benign/likely_benign', 'pathogenic',
       'pathogenic/likely_pathogenic', 'likely_pathogenic',
       'pathogenic|drug_response', 'likely_pathogenic|drug_response',
       'benign/likely_benign|other', 'likely_benign|other', 'benign|other',
       'pathogenic/likely_pathogenic|other', 'pathogenic|other',
       'benign|association', 'likely_benign|drug_response|other',
       'pathogenic/likely_pathogenic|risk_factor', 'benign|drug_response',
       'benign/likely_benign|drug_response|other',
       'likely_pathogenic|risk_factor', 'pathogenic|risk_factor',
       'benign/likely_benign|drug_response', 'benign|risk_factor',
       'likely_benign|association', 'benign/likely_benign|other|risk_factor',
       'benign/likely_benign|association', 'likely_pathogenic|affects',
       'likely_pathogenic|other', 'benign/likely_benign|risk_factor',
       'likely_pathogenic|association',
       'pathogenic/likely_pathogenic|association',
       'benign|confer

In [11]:
# ─── Basic cohort statistics ─────────────────────────────────

print(f"Total variants: {len(df):,}\n")

# Variant type
print("Variant type counts:")
display(df['variant_type'].value_counts())

# Pathogenicity
print("\nPathogenicity counts:")
display(df['pathogenicity'].value_counts())

# Top diseases
print("\nTop 10 disease names:")
display(df['disease_name']
        .replace('', 'Unknown')             # collapse blanks
        .value_counts()
        .head(10))

# ─── Indel vs. SNP breakdown ────────────────────────────────

# parse ref/alt lengths
ref_alt = df['mutation_instruction'].str.split('>', expand=True)
df['ref_len'] = ref_alt[0].str.len().astype(int)
df['alt_len'] = ref_alt[1].str.len().astype(int)
df['len_diff'] = df['alt_len'] - df['ref_len']

print("\nLength‐difference (alt − ref) distribution:")
display(df['len_diff']
        .value_counts()
        .sort_index())

# ─── Transition / transversion in SNVs ─────────────────────

# only look at true SNVs (ref_len==alt_len==1)
snv = df[(df['variant_type']=='SNV') & (df['len_diff']==0)].copy()
def is_transition(instr):
    pur = {'A','G'}
    pyr = {'C','T'}
    r,a = instr.split('>')
    return (r in pur and a in pur) or (r in pyr and a in pyr)

snv['is_transition'] = snv['mutation_instruction'].map(is_transition)
t1 = snv['is_transition'].sum()
t2 = (~snv['is_transition']).sum()
print(f"\nSNVs: {len(snv):,}  →  Transitions: {t1:,}   Transversions: {t2:,}\n")

# ─── GC‐content in windows (sampled) ─────────────────────────

# sampling to speed up
sample = df.sample(min(len(df), 10000), random_state=0)
def gc_frac(s): return (s.count('G')+s.count('C'))/len(s)

sample['orig_gc'] = sample['original_window'].map(gc_frac)
sample['mut_gc' ] = sample['mutated_window'].map(gc_frac)

print("Original‐window GC content (sample):")
display(sample['orig_gc'].describe())

print("\nMutated‐window GC content (sample):")
display(sample['mut_gc'].describe())


# ─── Better Non-SNV event breakdown ────────────────────────────────

non_snv = df[df['variant_type'] != 'SNV']

# counts
n_ins   = (non_snv['len_diff'] >  0).sum()
n_del   = (non_snv['len_diff'] <  0).sum()
n_bal   = ((non_snv['len_diff']==0) & (non_snv['ref_len']>1)).sum()

print("Non-SNV events by net length change:")
print(f"  Insertions      (len_diff>0) : {n_ins:,}")
print(f"  Deletions       (len_diff<0) : {n_del:,}")
print(f"  Balanced Delins (len_diff=0) : {n_bal:,}")

# catch any explicit VCF-style inversions (<INV>) if they exist
n_inv = df['mutation_instruction'].str.contains('<INV>').sum()
if n_inv:
    print(f"  Inversions                 : {n_inv:,}")



Total variants: 3,493,400

Variant type counts:


variant_type
SNV        3226063
non_SNV     267337
Name: count, dtype: int64


Pathogenicity counts:


pathogenicity
not_pathogenic    3043681
pathogenic         449719
Name: count, dtype: int64


Top 10 disease names:


disease_name
not_provided                               861927
not_specified                              719547
Inborn_genetic_diseases                    133139
Hereditary_cancer-predisposing_syndrome     47592
Cardiovascular_phenotype                    25149
Primary_ciliary_dyskinesia                  17996
Inborn_genetic_diseases|not_provided        16863
not_specified|not_provided                  16518
not_provided|Inborn_genetic_diseases        15874
not_provided|not_specified                  14489
Name: count, dtype: int64


Length‐difference (alt − ref) distribution:


len_diff
-2046    1
-2037    1
-2032    1
-2031    1
-2030    1
        ..
 1951    1
 1989    1
 1992    1
 2004    1
 2019    1
Name: count, Length: 1266, dtype: int64


SNVs: 3,226,063  →  Transitions: 2,104,260   Transversions: 1,121,803

Original‐window GC content (sample):


count    10000.000000
mean         0.471380
std          0.094873
min          0.244629
25%          0.389404
50%          0.461914
75%          0.548340
max          0.744385
Name: orig_gc, dtype: float64


Mutated‐window GC content (sample):


count    10000.000000
mean         0.471290
std          0.094818
min          0.244385
25%          0.389404
50%          0.461792
75%          0.548157
max          0.744385
Name: mut_gc, dtype: float64

Non-SNV events by net length change:
  Insertions      (len_diff>0) : 86,857
  Deletions       (len_diff<0) : 169,730
  Balanced Delins (len_diff=0) : 10,750


In [23]:
df.sample(10)

Unnamed: 0,mutation_instruction,original_window,mutated_window,pathogenicity,disease_name,variant_type,ref_len,alt_len,len_diff,abs_len_diff
370378,A>C,GAACTGAGGAGATAGTTTTTGTTTTTAATGATTGTGCTCTTTTAAC...,GAACTGAGGAGATAGTTTTTGTTTTTAATGATTGTGCTCTTTTAAC...,not_pathogenic,Hereditary_cancer-predisposing_syndrome,SNV,1,1,0,0
47441,C>A,TCTTGCTGGTTTCAGGGGAGGAGCCCGCTGTGCCAGGCCCTCATCT...,TCTTGCTGGTTTCAGGGGAGGAGCCCGCTGTGCCAGGCCCTCATCT...,not_pathogenic,not_specified,SNV,1,1,0,0
2370658,C>G,ACAGAAATAATGGAGTTAGAAAATCATTTAGTAGCCATCATAGTAA...,ACAGAAATAATGGAGTTAGAAAATCATTTAGTAGCCATCATAGTAA...,not_pathogenic,DICER1-related_tumor_predisposition,SNV,1,1,0,0
2479341,C>A,TGAATGCTTTTAGTTGTATGTGTTTTACGTTCATAAAAGTAAAATC...,TGAATGCTTTTAGTTGTATGTGTTTTACGTTCATAAAAGTAAAATC...,not_pathogenic,not_specified,SNV,1,1,0,0
2340733,G>A,TAAGTGGGGAAGGGCCTGCTTCCTGAGTCGGAGGCTGAGAGGATGG...,TAAGTGGGGAAGGGCCTGCTTCCTGAGTCGGAGGCTGAGAGGATGG...,not_pathogenic,not_specified,SNV,1,1,0,0
312980,C>T,GTCGGCCAGGGCCGCCGCGGGGCTACCGGGCGGGCTCGGGGCGGCG...,GTCGGCCAGGGCCGCCGCGGGGCTACCGGGCGGGCTCGGGGCGGCG...,not_pathogenic,Intellectual_developmental_disorder_with_micro...,SNV,1,1,0,0
1829920,T>G,GAAGGGAATACAAGGAAGGAGGAAAGGGAGTGTTAGTTTGGGCTAT...,GAAGGGAATACAAGGAAGGAGGAAAGGGAGTGTTAGTTTGGGCTAT...,not_pathogenic,Dilated_cardiomyopathy_1DD|Cardiovascular_phen...,SNV,1,1,0,0
315617,C>T,TCCTGGTCCCAACCCCCTGCGCAGTATCTCTGGACGGGGCTAGACC...,TCCTGGTCCCAACCCCCTGCGCAGTATCTCTGGACGGGGCTAGACC...,not_pathogenic,not_provided,SNV,1,1,0,0
2279534,C>T,TTACTTAGAAAAGCTCAACAAGTCTTTGGATATTTAGAGACTTTTT...,TTACTTAGAAAAGCTCAACAAGTCTTTGGATATTTAGAGACTTTTT...,not_pathogenic,not_provided,SNV,1,1,0,0
2536550,C>T,GGGTGACACACCGGGAGAGGCTAGCAGTAAACAAAGGGAAAGGCGG...,GGGTGACACACCGGGAGAGGCTAGCAGTAAACAAAGGGAAAGGCGG...,not_pathogenic,not_provided|Hereditary_cancer-predisposing_sy...,SNV,1,1,0,0


check to see which variant types from the vep vcf are not included in the fasta

In [None]:
import pysam
from collections import Counter

vcf_path   = "SCRATCH_DIR/DNASNVData113/clinvar_data/clinvar_coding_only.vcf"
fasta_path = "SCRATCH_DIR/DNASNVData113/clinvar_data/vep-cache-113/homo_sapiens/113_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa"

# 1) open VCF
vcf = pysam.VariantFile(vcf_path)

# 2) get contigs from header if present, else from records
vcf_contigs = set(vcf.header.contigs)
if not vcf_contigs:
    vcf_contigs = { rec.contig for rec in vcf }
    vcf = pysam.VariantFile(vcf_path)  # reopen to iterate again

# 3) open FASTA and get its contigs
fa = pysam.FastaFile(fasta_path)
fasta_contigs = set(fa.references)

# 4) compute sets
missing = sorted(vcf_contigs - fasta_contigs)
common  = sorted(vcf_contigs & fasta_contigs)

print("In VCF but not in FASTA:", missing)
print("In both VCF and FASTA:", common)

# 5) count variants by category
counts_missing = Counter()
counts_common  = 0

for rec in vcf:
    chrom = rec.contig
    if chrom in missing:
        counts_missing[chrom] += 1
    elif chrom in fasta_contigs:
        counts_common += 1

# 6) report
print("\nCounts of variants on missing contigs:")
for contig in missing:
    print(f"  {contig}: {counts_missing[contig]}")

print(f"\nTotal variants on contigs present in both VCF and FASTA: {counts_common}")


[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaroun

In VCF but not in FASTA: ['NT_113889.1', 'NT_187633.1', 'NT_187661.1', 'NT_187693.1', 'NW_009646201.1']
In both VCF and FASTA: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y']


[W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '11' is not defined in the header. (Quick workarou


Counts of variants on missing contigs:
  NT_113889.1: 1
  NT_187633.1: 10
  NT_187661.1: 8
  NT_187693.1: 10
  NW_009646201.1: 1

Total variants on contigs present in both VCF and FASTA: 3494465


[W::vcf_parse] Contig 'Y' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'MT' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'NT_113889.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'NT_187633.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'NT_187661.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'NT_187693.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'NW_009646201.1' is not defined in the header. (Quick workaround: index the file with tabix.)
