VarFish DB Downloader

The purpose of this repository is to collect various data from public sources that are eventually used in VarFish for annotation and display to the user. This repository contains a Snakemake workflow with supporting code for downloading the data and preparing it for use with VarFish.

Quick Facts

License: MIT
Programming Language: Python / Snakemake

Running

Use the utility rule help to get a list of all available rules:

# snakemake --cores=1 help

Run all of them with the all rule:

# snakemake --cores=1 all

Note that this will take a long time, use a lot of disk space, and download a lot of data.

To run on a Slurm cluster, you can use the Snakemake --slurm option. See run-slurm.sh for an example.

Development Setup

Prerequisites: Install `mamba` for Conda Package Management

Install conda, ideally via miniforge. A quickstart:

# wget -O /tmp/Mambaforge-Linux-x86_64.sh \
    https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
# bash /tmp/Mambaforge-Linux-x86_64.sh -b -p ~/mambaforge3 -s
# source ~/mambaforge3/bin/activate

Clone Project

# git clone git@github.com:bihealth/varfish-db-downloader.git
# cd varfish-db-downloader

Setup Environment and Install Tools

This will set up the conda environment:

# mamba env create --file environment.yml
# conda activate varfish-db-downloader

This will install the varfish-db-downloader tools:

# pip install -e .

Developer Rules

Download Commands

We use only wget and aria2c for downloads, not curl. The rationale is that test mode overrides these two executables with helper commands.

Development Subsets

Besides the full output, we also build a subset of the data suitable for development. At the time of writing, the subset is limited to the BRCA1 gene. The rationale is that this gene and its variants are heavily annotated: breast cancer predisposition screening is a common task, so users and data are plentiful.
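
As a rough illustration of how such a subset could be derived, the following Python sketch keeps only the rows of a tab-separated table that overlap a gene region. The function name `subset_rows`, the column layout (chromosome, start, end in the first three columns), and the coordinates are assumptions made for this example. They are not the repository's actual implementation, and the region is a placeholder rather than the real BRCA1 interval.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical placeholder region (chromosome, start, end).
# These are NOT the real BRCA1 coordinates; look them up for your genome release.
EXAMPLE_REGION: Tuple[str, int, int] = ("17", 1000, 2000)


def subset_rows(
    rows: Iterable[List[str]], region: Tuple[str, int, int]
) -> Iterator[List[str]]:
    """Yield rows whose (chrom, start, end) columns overlap the region.

    Assumes 1-based, inclusive coordinates in columns 0 to 2.
    """
    chrom, region_start, region_end = region
    for row in rows:
        row_chrom, row_start, row_end = row[0], int(row[1]), int(row[2])
        # Two closed intervals overlap if each starts before the other ends.
        if row_chrom == chrom and row_start <= region_end and row_end >= region_start:
            yield row


rows = [
    ["17", "900", "1000", "a"],   # touches the region start, so it is kept
    ["17", "2001", "2100", "b"],  # starts just past the region end, so it is dropped
    ["13", "1500", "1600", "c"],  # different chromosome, so it is dropped
]
kept = list(subset_rows(rows, EXAMPLE_REGION))
```

Restricting by a single region keeps the development data small while still exercising every annotation source that covers the gene.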

Running in Test Mode

File downloads can be disabled to enable a test mode. When CI=true is set in the environment, the files in excerpt-data are used instead.

This is done by overriding the download executables wget and aria2c in the Snakemake file when CI=true is set, which in turn is done by overriding the PATH environment variable.

The files can be updated by calling:

# varfish-db-downloader wget urls-download

The known URLs are managed in download_urls.yml.
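
The PATH override described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the Snakemake file: the function name `build_env` and the helper directory passed to it are hypothetical. The idea is that a directory containing stand-in `wget` and `aria2c` executables is placed first on PATH, so they shadow the real tools.

```python
import os
from typing import Dict, Mapping


def build_env(environ: Mapping[str, str], helper_dir: str) -> Dict[str, str]:
    """Return a copy of environ, with helper_dir first on PATH if CI=true.

    helper_dir is assumed to contain replacement executables named like
    the real download tools, so they are found first during lookup.
    """
    env = dict(environ)
    if env.get("CI") == "true":
        old_path = env.get("PATH", "")
        # Prepend so that the helper executables shadow the real ones.
        env["PATH"] = helper_dir + (os.pathsep + old_path if old_path else "")
    return env


# Example: outside of CI the environment is returned unchanged.
ci_env = build_env({"CI": "true", "PATH": "/usr/bin"}, "/opt/helpers")
plain_env = build_env({"PATH": "/usr/bin"}, "/opt/helpers")
```

A command started with such an environment, for example via `subprocess.run(..., env=ci_env)`, would resolve `wget` to the helper instead of the system binary, which is why restricting downloads to two known executables makes the override manageable.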

Managing GitHub Project with Terraform

# export GITHUB_OWNER=bihealth
# export GITHUB_TOKEN=ghp_<thetoken>

# cd utils/terraform
# terraform init
# terraform import github_repository.varfish-db-downloader varfish-db-downloader

# terraform validate
# terraform fmt
# terraform plan
# terraform apply

Uploading Data to S3

For example:

# s5cmd --dry-run --profile ext-varfish-public \
    --endpoint-url https://ceph-s3-ext.cubi.bihealth.org \
    sync \
    'output/full/mehari/genes-txs-grch3*' \
    s3://varfish-public/public/

Semantic Commits

Generally, follow Semantic Commits v1; see also the examples.

Here is a list of the commit message prefixes that we use:

prefix	description
feat	Features
fix	Bug Fixes
perf	Performance Improvements
deps	Dependencies
revert	Reverts
docs	Documentation
style	Styles
chore	Miscellaneous Chores
refactor	Code Refactoring
test	Tests
build	Build System
ci	Continuous Integration
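
A commit subject can be checked against the prefixes above with a small script. The sketch below is one possible check, not tooling shipped with this repository. It assumes the common `prefix(scope)!: description` shape, where the scope and the `!` marker are optional, and the name `has_valid_prefix` is hypothetical.

```python
import re

# Prefixes taken from the table above.
PREFIXES = (
    "feat", "fix", "perf", "deps", "revert", "docs",
    "style", "chore", "refactor", "test", "build", "ci",
)

# Assumed shape: prefix, optional "(scope)", optional "!", then ": description".
SUBJECT_RE = re.compile(
    r"^(?:" + "|".join(PREFIXES) + r")(?:\([^()\s]+\))?!?: \S.*$"
)


def has_valid_prefix(subject: str) -> bool:
    """Return True if the commit subject starts with a known prefix."""
    return SUBJECT_RE.match(subject) is not None


print(has_valid_prefix("feat: add new download rule"))
print(has_valid_prefix("feature: add new download rule"))
```

The check is case-sensitive and only looks at the subject line. A stricter or looser policy would need a different pattern.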

