# SRA Toolkit Tutorial
Access to data remains a major hurdle for reproducibility in science despite the increased availability of large-scale repositories for online data storage, and even journal policies that require data archiving. In this post I’ll try to make an argument for archiving data in the SRA through demonstration of the sratools wrapper from the ipyrad.analysis toolkit, which makes it easy and elegant to download data from the SRA when working in Python/Jupyter.

There are many reasons people skip the SRA and instead dump their data on a generic archive like Dryad, chief among them that uploading data to the SRA is time-consuming and difficult. It requires entering pages upon pages of metadata by hand into arcane web forms or spreadsheets to define numerous objects that are continually referenced by redundant names or prefix tags (e.g., SUB, SAMN, SRX, SRP, PRJNA, BioSamples and BioProjects). These objects have a relational structure that defies understanding: for example, 1 SRA can have 4 SRXs which produce data for 96 SRRs from 96 SRSs for 4 SRPs (see the table below, which I reference frequently when trying to understand this stuff). So it should be no surprise that people often forgo data archiving.

| Prefix | Accession Name | Definition |
|--------|----------------|------------|
| SRX | Experiment | Metadata about library, platform, selection. |
| SRR | Run | The actual sequence data for an experiment. |
| SRP | Study | Metadata about project (BioProject). |
| SRS | Sample | Metadata about the physical Sample (BioSample). |
| SRZ | Analysis | Mapped/aligned reads file (BAM) & metadata. |
| SRA | Submission | Metadata of other 5 linked objects. |
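As a small aid for keeping the prefixes straight, the table above can be turned into a lookup. This is only a sketch: `ACCESSION_TYPES` and `accession_type` are hypothetical names, not part of ipyrad or sra-tools, and `SRR0000001` is a placeholder ID. It only knows the six prefixes in the table, so tags such as PRJNA or ERR come back as "Unknown".

```python
# Map each accession prefix from the table above to its object type.
ACCESSION_TYPES = {
    "SRX": "Experiment",
    "SRR": "Run",
    "SRP": "Study",
    "SRS": "Sample",
    "SRZ": "Analysis",
    "SRA": "Submission",
}


def accession_type(accession: str) -> str:
    """Return the object type for an SRA-style accession, or 'Unknown'."""
    prefix = accession.strip().upper()[:3]
    return ACCESSION_TYPES.get(prefix, "Unknown")


# SRP065788 is the study used later in this tutorial; SRR0000001 is a placeholder.
for acc in ["SRP065788", "SRR0000001"]:
    print(acc, "->", accession_type(acc))
```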

I do almost all of my work these days in Jupyter notebooks, so I aimed to write a simple wrapper around sra-tools + entrez-direct that would function in a Pythonic way and overcome some of the problems listed above. This tool is available through conda and distributed with the ipyrad analysis toolkit.


**Sources**
- [Helpful Getting Started Guide on how to search the SRA Database](https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/)
- [Search in SRA (how to use the Advanced Search Builder)](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch)
- [Original source for this notebook (although it's out of date)](https://eaton-lab.org/articles/sra-downloads/)
- [ipyrad library documentation](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-sratools.html?highlight=analysis)

First install the dependencies with conda, then import the ipyrad analysis tools (renamed as ipa) and initiate an sratools object with a Study accession ID (SRP). You can also provide a “workdir” argument, the location where your fastq files (and temporary .sra files) will be downloaded, which will be created if it doesn’t yet exist.

In [None]:
# Install dependencies (this may take several minutes so go grab a coffee while you wait)
! conda install -c bioconda sra-tools -y
! conda install -c bioconda entrez-direct -y
! conda install -c bioconda ipyrad -y
! conda install -c conda-forge toytree -y
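Once the installs finish, it can be worth confirming that the command-line programs are actually on your PATH before going further. The sketch below uses only the standard library. The list of program names is an assumption about what the wrapper needs (`prefetch` and `fastq-dump` ship with sra-tools, `esearch` and `efetch` with entrez-direct), and `find_tools` is a hypothetical helper.

```python
import shutil

# Programs assumed to be needed; adjust if your setup differs.
REQUIRED_TOOLS = ["prefetch", "fastq-dump", "esearch", "efetch"]


def find_tools(names):
    """Map each program name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(REQUIRED_TOOLS).items():
    print(f"{name:12s} {path or 'NOT FOUND'}")
```

If anything prints NOT FOUND, restarting the notebook kernel after the conda installs is a reasonable first thing to try.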

In [None]:
import ipyrad.analysis as ipa

## Fetch info for a published data set by its accession ID

You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad takes as input one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).

In [None]:
# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")

By providing just the SRP ID you can now query information about the study and have it returned as a Pandas DataFrame, which is easy to read and manipulate. The call below fetches every field; the commented-out line shows how to request only fields 1, 4, 6, 29, and 30 instead. From those five fields you can see the Run IDs, the number of reads (spots), the fact that the data are single-end (i.e., no “spots with mates”), the ScientificName, and the SampleName provided by the study authors.

In [None]:
# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()
# to fetch only selected fields, pass their numbers as a tuple:
# stable = sra.fetch_runinfo((1,4,6,29,30))

In [None]:
# the dataframe has all information about this study
stable.head()
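Because the result is an ordinary DataFrame, the usual pandas operations apply. The sketch below runs on a tiny made-up table rather than real SRA records so that it works offline. The column names (`Run`, `spots`, `spots_with_mates`, `ScientificName`, `SampleName`) follow the fields described above, and all values are invented for illustration.

```python
import pandas as pd

# Made-up rows that mimic a few runinfo columns; these are not real SRA records.
runinfo = pd.DataFrame({
    "Run": ["SRR0000001", "SRR0000002", "SRR0000003"],
    "spots": [1_200_000, 850_000, 2_000_000],
    "spots_with_mates": [0, 0, 0],
    "ScientificName": ["Genus speciesA", "Genus speciesA", "Genus speciesB"],
    "SampleName": ["sample_1", "sample_2", "sample_3"],
})

# Single-end data have no spots with mates in any run.
is_single_end = (runinfo["spots_with_mates"] == 0).all()

# Number of runs and total reads per species.
summary = runinfo.groupby("ScientificName")["spots"].agg(["count", "sum"])

print("single-end:", is_single_end)
print(summary)
```

The same two expressions should work on `stable` in place of `runinfo`, provided the full table was fetched.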

## File names
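You can select columns by position to preview the fields you might use for file names. Note that `iloc` positions are zero-based, so columns 0, 28, and 29 below are fields 1, 29, and 30 in the numbering used above.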

In [None]:
stable.iloc[:5, [0, 28, 29]]
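The field numbers used elsewhere in this tutorial start at 1, while `iloc` counts from 0, so it is easy to be off by one. A minimal conversion sketch follows; `fields_to_positions` is a hypothetical helper, not an ipyrad function.

```python
def fields_to_positions(fields):
    """Convert 1-based field numbers to 0-based positions for DataFrame.iloc."""
    if isinstance(fields, int):
        fields = (fields,)
    if any(f < 1 for f in fields):
        raise ValueError("field numbers start at 1")
    return [f - 1 for f in fields]


print(fields_to_positions((1, 29, 30)))  # prints [0, 28, 29]
```

With this helper, the cell above could be written as `stable.iloc[:5, fields_to_positions((1, 29, 30))]`.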

## Download the data
From an sratools object you can fetch just the info, or you can download the files as well. Here we call .run() to download the data into the designated workdir. The name_fields argument chooses which fields of the fetch_runinfo table are used to name the files. The accessions argument here is a list of the first five SRR Run IDs in the table above.

In [None]:
# select first 5 samples
list_of_srrs = stable.Run[:5].tolist()
list_of_srrs

In [None]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))
# sra2.run(auto=True, name_fields=(1,30), name_separator="_") # does this work?

## Check the data files
You can see that the files were named from fields 1 and 30 of the table, i.e., the SRR Run ID and the SampleName. The intermediate .sra files were removed and only the fastq files were saved.

In [None]:
! ls -l downloaded
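If you would rather check the result from Python than from the shell, something like the following sketch works. `summarize_downloads` is a hypothetical helper, and the assumption that outputs end in `.fastq` or `.fastq.gz` may not match every configuration.

```python
from pathlib import Path


def summarize_downloads(workdir):
    """Return sorted FASTQ file names and any leftover .sra file names."""
    root = Path(workdir)
    fastqs = sorted(
        p.name for p in root.iterdir()
        if p.name.endswith((".fastq", ".fastq.gz"))
    )
    leftovers = sorted(p.name for p in root.glob("*.sra"))
    return fastqs, leftovers


# Only report if the workdir used above actually exists.
if Path("downloaded").is_dir():
    fastqs, leftovers = summarize_downloads("downloaded")
    print(f"{len(fastqs)} fastq files, {len(leftovers)} leftover .sra files")
    for name in fastqs:
        print(" ", name)
```

An empty list of leftovers is what you would expect if the intermediate .sra files were cleaned up.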