aofarrel/SRANWRP

SRAnwrp

SRAnwrp ("Saran Wrap") envelops several SRA-related tools in the warm, polyethylene embrace of a single Ubuntu-based Docker image and some optional assorted workflows. For the sake of simplicity, releases on main follow the same versioning scheme as the Docker image.

What tasks can it perform?

The combination of e-direct and sra-tools allows it to do basically anything you can do from SRA's website. These tasks exist in the form of WDL workflows -- more on WDL here.

Pulling FASTQs

Getting Organism + TaxID from a list of BioProject/BioSample accessions

There are a lot of BioProjects on SRA, and some of them are multi-species. Use this workflow to get a list of all run accessions, and said run accessions' species and TaxIDs, from a list of BioProject accessions. If you instead have a list of BioSamples, use this workflow to get species and TaxID (as well as a list of all run accessions).
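
To make the multi-species point concrete, here is a minimal Python sketch of how you might check a result table for BioProjects that span more than one organism. It is not part of the workflow: the `(bioproject, run, organism, taxid)` row layout is an assumption about how you have loaded the results, and every accession below is a made-up placeholder.

```python
from collections import defaultdict


def species_per_bioproject(rows):
    """Map each BioProject to the set of (organism, taxid) pairs seen in its runs.

    `rows` is an iterable of (bioproject, run, organism, taxid) tuples.
    This layout is an assumption for illustration, not the workflow's output format.
    """
    species = defaultdict(set)
    for bioproject, _run, organism, taxid in rows:
        species[bioproject].add((organism, taxid))
    return dict(species)


def multi_species_bioprojects(rows):
    """Return the BioProjects whose runs cover more than one organism, sorted."""
    per_project = species_per_bioproject(rows)
    return sorted(bp for bp, found in per_project.items() if len(found) > 1)


# Placeholder accessions; only the organism names and TaxIDs are real.
example_rows = [
    ("PRJNA0000001", "SRR0000001", "Mycobacterium tuberculosis", 1773),
    ("PRJNA0000001", "SRR0000002", "Mycobacterium bovis", 1765),
    ("PRJNA0000002", "SRR0000003", "Mycobacterium tuberculosis", 1773),
]

print(multi_species_bioprojects(example_rows))
```

Grouping on the (organism, TaxID) pair rather than the organism name alone keeps two differently spelled names for the same TaxID visible instead of silently merging them.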

Getting sample accessions from run accessions (SRR/ERR/DRR)

If you have a list of run accessions, this workflow will get a list of sample accessions that they cover. Some samples have more than one run -- those samples will only appear in the output once.
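
The once-per-sample behaviour can be illustrated with a small Python sketch. This is not the workflow's implementation: the `run_to_sample` mapping stands in for whatever a lookup against SRA would return, and the accessions in it are made-up placeholders.

```python
import re

# Run accessions start with SRR, ERR, or DRR, followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def unique_samples(runs, run_to_sample):
    """Return the sample accessions covered by `runs`, each once, in first-seen order."""
    seen = set()
    samples = []
    for run in runs:
        if not RUN_ACCESSION.match(run):
            raise ValueError(f"not a run accession: {run!r}")
        sample = run_to_sample[run]  # hypothetical lookup table
        if sample not in seen:
            seen.add(sample)
            samples.append(sample)
    return samples


# Placeholder data: the first two runs belong to the same sample.
run_to_sample = {
    "SRR0000001": "SAMN00000001",
    "SRR0000002": "SAMN00000001",
    "ERR0000003": "SAMEA0000002",
}

print(unique_samples(["SRR0000001", "SRR0000002", "ERR0000003"], run_to_sample))
```

Three runs go in and two samples come out, because the sample with two runs is only reported once.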

Other stuff?

Here are some other tasks that can help you convert between data types.

What's included in the Docker image?

Non-exhaustive list:

  • The TB reference genome and a BED of its commonly masked regions
  • bash-5.1.16(1)-release
  • bedtools-latest
  • bc-latest
  • bcftools-1.16
  • cpan-latest
  • curl-latest
  • entrez-direct-latest (aka edirect)
  • gcc-latest
  • git-latest
  • htslib-1.16
  • make-latest
  • Matplotlib-latest
  • numpy-latest
  • pandas-latest
  • pigz-latest
  • python-3.12
    • note: must be called with python3 instead of python (and pip3 instead of pip) when running non-interactively
  • samtools-1.16
    • mpileup, minimap2, fixmate, etc
  • seqtk-latest
  • sra-tools-3.0.1 (aka SRAtools, SRA tools, SRA toolkit, etc)
    • align-info, fastq-dump, fasterq-dump, prefetch, sam-dump, sra-pileup, etc
    • fyi: ncbi/ncbi-vdb was merged with sra-tools in sra-tools-3.0.0 and vdb-get was retired in 3.0.1
  • sudo-latest
  • taxoniumtools-latest
  • tree-latest
  • vim-latest
  • wget-latest

Who builds?

Right now, the image is built and pushed manually. You'll need to include your own copy of the TB reference tarball -- it can be created with clockwork refprep, or downloaded from this Google bucket. MD5s are provided in this repo as a double-check.
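
If you want to script the MD5 double-check, a minimal Python sketch could look like the following. The function names are illustrative only, and the tarball filename and expected checksum are things you would supply yourself from your download and from the MD5s in this repo.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 of a file, reading in chunks so large tarballs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Return True if the file's MD5 equals the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Call `matches_expected(path_to_tarball, expected_md5)` before building, and stop the build if it returns `False`.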

Why?

  • Docker Hub's latest version of staphb/sratoolkit, as of my writing this in October 2022, runs version 2.9.2 (see command 15), which doesn't work at all anymore
  • Existing Docker images tend to contain either the SRA toolkit or Entrez Direct, not both
  • Building SRA Toolkit on your own, without conda, is not intuitive
  • Building SRA Toolkit on your own, with conda, is also not intuitive (you usually end up with v2.10 which only sometimes works)
  • No need to run vdb-config --interactive or any other interactive process before using anything in this image; SRA Toolkit's config file is generated while building the image