A production-grade genomic variant calling results pipeline built on real human sequencing data from the 1000 Genomes Project.
Parses VCF output, enforces nine biological validation rules, annotates every variant with functional consequence and clinical gene context, and serves results through a self-documenting REST API and an interactive dashboard - directly mirroring the data engineering layer that sits downstream of commercial NGS analysis platforms.
Dataset: 1000 Genomes Project Phase 3 · Chromosome 22 · GRCh37/hg19 · 2,504 individuals · 26 populations
Variant calling pipelines - GATK, FreeBayes, bcftools - produce VCF files as their primary output. A VCF file for a single whole-genome sequencing run can contain millions of variant records across thousands of tab-separated fields. Without a downstream data engineering layer, these files are difficult to query at scale, impossible to validate automatically, inaccessible to scientists who cannot write code, and silent about the biological meaning of each variant.
The problem compounds in a platform context. When thousands of samples are processed daily, invalid records - negative genomic positions, allele frequencies above 1.0, empty REF alleles, REF equal to ALT - propagate silently into downstream analyses, causing failures that are difficult to trace back to their source. There is no standard mechanism to preserve and investigate failed records. Quality metrics like Ts/Tv ratio, which are the primary signals of call quality, are buried in raw output rather than surfaced in accessible dashboards.
VariantLens addresses this by building a rigorous, layered data engineering platform directly on top of VCF output. Real variant data is streamed from the 1000 Genomes Project, parsed into structured records, validated against nine biological rules, annotated with functional and clinical context, serialised to Parquet for fast analytical queries, and served through both a REST API and an interactive dashboard. Every component is tested, linted, and deployed through a CI pipeline.
1000 Genomes Project Phase 3
ftp.1000genomes.ebi.ac.uk · Chromosome 22 VCF · GRCh37/hg19
Contains ~1M variants from 2,504 individuals across 26 populations
|
v
VCF Fetcher
Streams gzipped VCF via HTTP · tenacity retry with exponential back-off
Collects up to VCF_SAMPLE_SIZE data lines · caches locally as JSON
Falls back to curated dataset if live server is unreachable
|
v
VCF Parser
Parses all 8 mandatory VCF fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
Extracts key INFO subfields: AF, DP, MQ, AC, AN, VT
Normalises chr22 prefix · classifies variant type · detects rsIDs
|
v
Validation Engine -- 9 rules across 4 categories
|-- Passed (500 variants) --> clean list for annotation
|-- Failed (5 variants) --> Quarantine CSV + rejection reason + UTC timestamp
|
v
Annotation Engine
Adds 7 biological fields to every clean record:
variant_class · ts_tv · consequence · gene · disease_association
af_tier · dbsnp_status · filter_outcome
|
v
Data Processor
Builds analysis-ready pandas DataFrame · saves to Parquet
Computes dataset summary: Ts/Tv, consequence distribution,
AF tiers, quality stats, clinical variants, gene counts
|
|-------------------------------------|
v v
FastAPI REST API (port 8001) Plotly Dash Dashboard (port 8050)
8 endpoints · Swagger /docs Consequence bar chart
Pydantic v2 schemas Allele frequency tier chart
Pagination + multi-field filters QUAL score histogram
Clinical variant endpoint Variant class donut
Quarantine endpoint Filterable variant table
QC summary panel
|
v
GitHub Actions CI
flake8 lint · pytest · 50 tests passing on every push to main
1000 Genomes Project Phase 3 is the landmark population genomics study that characterised genetic variation across 2,504 individuals from 26 populations worldwide. The chromosome 22 VCF used in this project contains approximately one million SNPs and indels representing the full spectrum of human genetic diversity - from ultra-rare private variants to common polymorphisms present across all populations.
Chromosome 22 is the second smallest human autosome and hosts several clinically significant genes including BCR (involved in the BCR-ABL1 fusion oncogene driving chronic myelogenous leukaemia), NF2 (neurofibromatosis type 2), SMAD4 (juvenile polyposis and colorectal cancer), TBX1 and CRKL (DiGeorge syndrome), and SHANK3 (Phelan-McDermid syndrome and autism spectrum disorder).
The data is publicly accessible via the EBI FTP mirror and represents the kind of real production-scale VCF output that data engineering pipelines at genomics companies process daily.
Nine rules are applied to every parsed VCF record. Rules are implemented as pure Python functions - each takes a record dictionary and returns a structured result containing a pass/fail flag and a human-readable rejection reason. This design makes every rule independently testable, composable, and extensible without changing the core validation logic.
| Category | Rule | What It Catches |
|---|---|---|
| Identity | Chromosome is a valid human chromosome | Non-human contigs, synthetic sequences, typos |
| Identity | Position is a positive 1-based integer | Negative coordinates, zero, null |
| Identity | REF and ALT are not identical sequences | Monomorphic sites incorrectly emitted as variant calls |
| Alleles | REF allele non-empty and contains only valid bases | Empty REF, invalid nucleotide characters |
| Alleles | ALT allele non-empty and differs from REF | Missing ALT, dot-only ALT, REF/ALT collision |
| Quality | QUAL score in [0, 100000] | Negative Phred scores - computational or data entry error |
| Quality | FILTER field present and non-empty | Missing filter annotation - incomplete VCF record |
| Metrics | Allele frequency in [0, 1] if present | AF > 1.0 indicates a normalisation or calculation error |
| Metrics | Read depth non-negative if present | Negative DP is physically impossible |
Quarantine-not-delete: every failed record is written to data/quarantine/quarantine.csv with the original data intact, the rejection reason as a human-readable string, the names of all failed rules, and a UTC timestamp. No variant data is ever silently discarded.
In this pipeline run, 5 of 505 records were quarantined:
| Rejection reason | Rule failed |
|---|---|
| Position '-1' is not a valid 1-based genomic coordinate | position_positive |
| REF allele '' is empty or contains invalid bases | ref_allele_valid |
| QUAL score '-50.0' is outside valid range [0, 100000] | quality_score_valid |
| Allele frequency '1.5' is outside valid range [0, 1] | allele_frequency_valid |
| REF 'G' and ALT 'G' are identical - not a variant | ref_not_equal_alt |
Every clean variant receives seven computed annotation fields that add biological meaning to the raw VCF data.
Variant class classifies each variant as SNP, insertion, deletion, MNP (multi-nucleotide polymorphism), or complex based on the lengths of the REF and ALT alleles.
Ts/Tv classification determines whether each SNP is a transition (purine-to-purine or pyrimidine-to-pyrimidine: A↔G, C↔T) or a transversion (purine-to-pyrimidine: A↔C, A↔T, G↔C, G↔T). The transition-to-transversion ratio is the primary quality metric for SNP calling in whole-genome sequencing. An expected ratio of 2.0–2.1 indicates high-quality calls. This project achieves a Ts/Tv ratio of 1.75 - consistent with the expected range for chromosome 22 from the 1000 Genomes dataset.
Functional consequence classifies each variant into one of eight categories: synonymous, missense, nonsense (stop-gained), frameshift, splice_region, inframe_insertion, inframe_deletion, or MNP. VariantLens applies rule-based classification based on variant class - the same pre-filtering approach used by platforms before running full VEP or SnpEff annotation.
Gene context assigns each variant to a known gene on chromosome 22 based on genomic position, using a coordinate map derived from Ensembl GRCh37 release 87. Variants in clinically significant genes receive a disease association field.
Allele frequency tier classifies each variant as rare (AF < 0.01), low-frequency (0.01 to 0.05), or common (AF > 0.05). In this dataset, 89.2% of variants are rare - the expected distribution for population-level WGS data.
| Metric | Value | Notes |
|---|---|---|
| Records ingested | 505 | Live 1000 Genomes Phase 3 + 5 injected test cases |
| Passed validation | 500 (99.01%) | Written to Parquet |
| Quarantined | 5 (0.99%) | Preserved with rejection reasons and UTC timestamps |
| SNPs | 484 (96.8%) | Consistent with chr22 composition |
| Indels | 16 (3.2%) | Insertions and deletions |
| Ts/Tv ratio | 1.75 | Within expected range for quality WGS calls |
| Rare variants (AF < 1%) | 446 (89.2%) | Typical for population genomics data |
| Low-frequency variants | 21 (4.2%) | AF 1 to 5% |
| Common variants (AF > 5%) | 33 (6.6%) | Above GWAS MAF threshold |
| Avg completeness score | 100% | All clean records fully populated |
| Validation rules | 9 | Across 4 categories |
| Unit tests | 50 passing | Annotator 62%, parser 68% |
| CI pipeline | Passing | GitHub Actions: flake8 + pytest on every push |
| Reference genome | GRCh37/hg19 | Standard for 1000 Genomes Phase 3 |
| Data source | Live EBI FTP | Streamed from ftp.1000genomes.ebi.ac.uk |
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Service health with Ts/Tv ratio, pass rate, and reference genome |
| GET | /summary | Full dataset statistics - variant classes, consequences, AF tiers, quality metrics |
| GET | /variants | Paginated list - filter by class, consequence, AF tier, gene, clinical flag, PASS |
| GET | /variants/{pos} | All variants at a specific 1-based chromosomal position |
| GET | /consequences | Consequence type distribution with counts and percentages |
| GET | /af-tiers | Allele frequency tier breakdown with counts and percentages |
| GET | /clinical | Variants in disease-associated genes with clinical annotations |
| GET | /quarantine | Quarantined records with rejection reasons |
Interactive Swagger UI at http://localhost:8001/docs. ReDoc at http://localhost:8001/redoc.
# Clone and install
git clone https://github.com/gbadedata/variantlens.git
cd variantlens
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Copy environment config
cp .env.example .env
# Set NCBI_EMAIL to your email address in .env
# Run the full pipeline
python3 -m src.fetcher # Stream VCF from 1000 Genomes, cache locally
python3 -m src.parser # Parse VCF into structured records
python3 -m src.validator # Validate all records, write quarantine CSV
python3 -m src.annotator # Annotate with consequence, gene, AF tier
python3 -m src.processor # Build Parquet dataset and summary JSON
# Start the API
uvicorn src.api:app --reload --port 8001
# Open http://localhost:8001/docs
# Start the dashboard
python3 -m src.dashboard
# Open http://localhost:8050
# Run tests
pytest -v| Category | Technology | Role in the Project |
|---|---|---|
| Language | Python 3.12 | Primary language throughout |
| Data processing | pandas 2.2, pyarrow 18 | DataFrame operations and Parquet serialisation |
| API framework | FastAPI 0.109, Pydantic v2 | REST endpoints, schema validation, OpenAPI docs |
| Dashboard | Plotly Dash 2.14, DBC 1.5 | Interactive charts, filters, and variant table |
| HTTP client | requests 2.31, tenacity 8.2 | Streaming VCF fetch with exponential back-off retry |
| Testing | pytest 7.4, pytest-cov 4.1 | 50 unit tests across parser, validator, annotator |
| Logging | structlog 24.1 | Structured JSON logs with correlation fields |
| Configuration | pydantic-settings 2.1 | Environment-based config, .env support |
| CI | GitHub Actions | flake8 lint and pytest on every push to main |
variantlens/
├── src/
│ ├── config.py Pydantic settings - environment-based configuration
│ ├── fetcher.py 1000 Genomes VCF fetcher - streaming, retry, local cache
│ ├── parser.py VCF format parser - all 8 fields, INFO subfields, Ts/Tv
│ ├── validator.py 9-rule validation engine with quarantine-not-delete
│ ├── annotator.py Biological annotation - consequence, gene, AF tier, Ts/Tv
│ ├── processor.py Parquet output, summary JSON, dataset-level statistics
│ ├── api.py FastAPI REST API - 8 endpoints with Pydantic schemas
│ └── dashboard.py Plotly Dash interactive dashboard - 6 panels, 4 filters
├── tests/
│ ├── test_annotator.py 22 annotation engine unit tests
│ ├── test_parser.py 20 VCF parser unit tests
│ └── test_validator.py 8 validation engine unit tests
├── data/
│ ├── raw/ Cached VCF JSON - immutable once written
│ ├── processed/ Parquet dataset and summary JSON
│ └── quarantine/ Failed records with rejection reasons and timestamps
├── .github/workflows/
│ └── ci.yml GitHub Actions: lint and test on every push
├── .env.example Required environment variables
├── requirements.txt
├── pyproject.toml pytest, flake8, and coverage configuration
└── README.md
Platforms like Basepair run the hard part - the alignment, variant calling, and statistical modelling. What sits downstream is the data engineering layer: parsing VCF output into structured records, enforcing biological quality rules, annotating variants with functional context, making results queryable through APIs, and presenting them in dashboards that scientists without programming skills can use.
Every design decision in VariantLens maps directly to that layer.
The nine validation rules exist because real VCF files from production pipelines contain silent errors. Negative genomic positions, empty REF alleles, and allele frequencies above 1.0 have all been observed in submitted VCF data. A platform that ingests these without validation propagates noise silently into downstream analyses - variant prioritisation, clinical reporting, and machine learning models all built on corrupted input.
The Ts/Tv ratio is displayed as a dashboard KPI because it is the first number a bioinformatician checks when assessing variant call quality. Surfacing it prominently alongside its component counts demonstrates that the platform understands what quality means in this domain.
The annotation engine exists because a raw VCF record at position 30,100,000 on chromosome 22 has no interpretable meaning until it is associated with the NF2 gene and neurofibromatosis type 2. Consequence classification, gene context, and clinical significance flags are what transform variant coordinates into actionable biological information.
The quarantine-not-delete pattern exists because in any regulated clinical genomics environment - a lab operating under GxP guidelines, a diagnostic platform, a clinical trial data system - data cannot be discarded without a documented reason. Every rejected variant is preserved with the exact rule it failed, the human-readable rejection reason, and a UTC timestamp.
The Parquet output exists because columnar storage is what makes analytical queries fast at scale. Reading only the consequence column from a Parquet file with 30 fields is orders of magnitude faster than scanning an equivalent CSV.
Data: 1000 Genomes Project Phase 3 · Python · FastAPI · Plotly Dash · pandas · GitHub Actions