# Welcome to the SRA Toolkit getting started notebook
In this notebook we will walk you through how to download SRA data using the NCBI docker container. Let's go step by step.

In [1]:
# Questions:
#     Do we need the conda install if we pull the docker container?
# ! conda install -c bioconda sra-tools==2.10.9 -y
! docker pull ncbi/sra-tools

Using default tag: latest
latest: Pulling from ncbi/sra-tools
Digest: sha256:bc9cf88569fdf5dd905ac3305ea9ce222eefca4ebdb99626bb8a736e668d7952
Status: Image is up to date for ncbi/sra-tools:latest
docker.io/ncbi/sra-tools:latest


## Some setup
The toolkit expects to find the preconfigured settings file in `${HOME}/.ncbi/user-settings.mkfg`. During `docker build`, the preconfigured settings file is put in `/root/.ncbi/user-settings.mkfg`. If you change (or your workflow engine changes) the HOME environment variable, you should move or copy `/root/.ncbi/user-settings.mkfg` to `${HOME}/.ncbi/user-settings.mkfg`. Otherwise, you will get a message saying that the toolkit requires configuration.
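
The lookup rule above can be sketched in Python. This is only an illustration, assuming the toolkit reads `${HOME}/.ncbi/user-settings.mkfg` as described; the helper names `settings_path` and `needs_copy` are hypothetical and not part of the SRA Toolkit.

```python
import os
from pathlib import Path

# Where the Docker image keeps its preconfigured settings file (from the text above).
IMAGE_SETTINGS = Path("/root/.ncbi/user-settings.mkfg")


def settings_path(home=None):
    """Return the path where the toolkit is expected to look for its settings file."""
    # Fall back to the current HOME when no directory is given.
    if home is None:
        home = os.environ.get("HOME", "/root")
    return Path(home) / ".ncbi" / "user-settings.mkfg"


def needs_copy(home=None):
    """True when HOME points away from /root and no settings file exists there yet."""
    target = settings_path(home)
    return target != IMAGE_SETTINGS and not target.exists()


print(settings_path("/home/jupyter"))
```

If `needs_copy()` returns `True` for your HOME, the settings file still has to be copied out of the container, which is what the cells below do.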

In [13]:
# Ensure home is at /home/jupyter
! echo $HOME
# If not, run this command
# %env HOME=/home/jupyter

/home/jupyter
env: HOME=/home/jupyter


In [5]:
# Create new .ncbi directory
! mkdir -p $HOME/.ncbi

In [7]:
# Copy user-settings file to new folder
! docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools cp /root/.ncbi/user-settings.mkfg .ncbi

In [12]:
# And you can verify the path for the configuration file with vdb-config -o n NCBI_SETTINGS, e.g.
! docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools vdb-config -o n NCBI_SETTINGS

NCBI_SETTINGS = "/home/jupyter/.ncbi/user-settings.mkfg"



## Explore

Here we will explore the docker container. For more info on the NCBI docker container, please visit: https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker

In [9]:
# This command shows you all the hidden folders in the current directory
! ls -a

.	 .docker	     .local			       src
..	 .ipynb_checkpoints  .ncbi			       tutorials
.cache	 .ipython	     SRA_Toolkit_DockerDownload.ipynb
.config  .jupyter	     sample


In [25]:
# Edit permissions of the user-settings file so we can view and edit it
! sudo chmod 777 .ncbi/user-settings.mkfg

In [20]:
# Are these settings all we need????
! cat .ncbi/user-settings.mkfg

## auto-generated configuration file - DO NOT EDIT ##

/LIBS/GUID = "ed62e7c0-5f27-4de3-b3d2-986b91413efe"
/LIBS/IMAGE_GUID = "ed62e7c0-5f27-4de3-b3d2-776250d04266"
/libs/cloud/report_instance_identity = "true"


In [23]:
# Testing - can we edit the config?
# This command does not work to edit the config, produces the following error and locks the config again
# 2022-07-13T14:56:46 vdb-config.3.0.0 err: condition violated while updating node - Warning: normally this application should not be run as root/superuser
# ! docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools vdb-config --set /repository/user/main/public/root="/repo"



In [21]:
# Instead let's try to move the settings file to a folder you can access and take a look at it

# Make a new folder 
! mkdir ncbi_settings
# Copy the settings file into this new folder
! cp .ncbi/user-settings.mkfg ncbi_settings

mkdir: cannot create directory ‘ncbi_settings’: File exists


This is what my mkfg file looks like when I use the interactive mode per NCBI's instructions:

```
/LIBS/GUID = "a6b0149c-7b28-467c-81f0-82960cab4c7c"
/config/default = "false"
/libs/cloud/accept_gcp_charges = "true"
/libs/cloud/report_instance_identity = "true"
/libs/temp_cache = "/home/jupyter/downloaded"
/repository/remote/disabled = "false"
/repository/remote/main/CGI/resolver-cgi = "https://trace.ncbi.nlm.nih.gov/Traces/names/names.fcgi"
/repository/remote/protected/CGI/resolver-cgi = "https://trace.ncbi.nlm.nih.gov/Traces/names/names.fcgi"
/repository/user/ad/public/apps/file/volumes/flatAd = "."
/repository/user/ad/public/apps/refseq/volumes/refseqAd = "."
/repository/user/ad/public/apps/sra/volumes/sraAd = "."
/repository/user/ad/public/apps/sraPileup/volumes/ad = "."
/repository/user/ad/public/apps/sraRealign/volumes/ad = "."
/repository/user/ad/public/apps/wgs/volumes/wgsAd = "."
/repository/user/ad/public/root = "."
/repository/user/default-path = "/home/jupyter/ncbi"
/repository/user/main/public/root = "/home/jupyter/local_sra_file_caching"
```
Do we need to update the Docker settings file to look like this? Probably not `/LIBS/GUID`, `/LIBS/IMAGE_GUID`, and `/libs/cloud/report_instance_identity`, but we need to ask NCBI.
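
To see which settings differ between two `.mkfg` files, a small Python sketch can help. It assumes the format is one `key = "value"` pair per line with `#` comment lines, which is inferred from the listings above and is not a documented specification. The names `parse_mkfg` and `missing_keys` are hypothetical.

```python
import re

# Matches lines such as: /LIBS/GUID = "ed62e7c0-..."
_LINE = re.compile(r'^\s*(/\S+)\s*=\s*"(.*)"\s*$')


def parse_mkfg(text):
    """Parse key = "value" lines into a dict, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def missing_keys(current, reference):
    """Keys present in the reference config but absent from the current one."""
    return sorted(set(reference) - set(current))


# Shortened copies of the two files shown above, for illustration only.
docker_cfg = parse_mkfg('''## auto-generated configuration file - DO NOT EDIT ##

/LIBS/GUID = "ed62e7c0-5f27-4de3-b3d2-986b91413efe"
/libs/cloud/report_instance_identity = "true"
''')
interactive_cfg = parse_mkfg('''/LIBS/GUID = "a6b0149c-7b28-467c-81f0-82960cab4c7c"
/config/default = "false"
/libs/cloud/report_instance_identity = "true"
/repository/user/main/public/root = "/home/jupyter/local_sra_file_caching"
''')
print(missing_keys(docker_cfg, interactive_cfg))
```

The sketch only reports which keys are absent; whether any of them are actually required in the Docker setup is still the open question for NCBI.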

In [29]:
# Now edit it adding x,y,z .... need to figure out if we even need to edit it
# Then push the changes back to the .ncbi folder
# Caution! Only do this if you are sure about the settings adjustment
! cp ncbi_settings/user-settings.mkfg .ncbi/user-settings.mkfg

In [30]:
# Verify that changes went through
! cat .ncbi/user-settings.mkfg

## auto-generated configuration file - DO NOT EDIT ##

/LIBS/GUID = "ed62e7c0-5f27-4de3-b3d2-986b91413efe"
/LIBS/IMAGE_GUID = "ed62e7c0-5f27-4de3-b3d2-776250d04266"
/libs/cloud/report_instance_identity = "true"


## Get started accessing NCBI's data


### Find accession ID
First we need to find the "Run accession" ID that corresponds to the data we want to download from SRA. Here's how we do that.

**Example**: find RNA-Seq records for lymph node tissue in BALB/c mice in [SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/)

- In the Entrez search bar enter the query: `((("mus musculus"[Organism]) AND BALB/c*) AND "lymph*") AND "rna seq"[Strategy]`
- To limit your search to only aligned data, add `AND "aligned data"[Properties]` to the above query.
- Click the checkboxes next to records (experiments) to select data of interest. Leave all checkboxes unchecked to select all records (experiments) from your search.
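
The query string from the example can also be assembled programmatically. The sketch below is plain string building that reproduces the nesting used in the example query; the function name `build_query` is hypothetical, and the sketch does not contact Entrez.

```python
def build_query(terms, aligned_only=False):
    """Combine Entrez search terms with AND, nesting parentheses left to right."""
    if not terms:
        raise ValueError("at least one search term is required")
    query = f"({terms[0]})"
    # Wrap every intermediate term so the nesting matches the example query.
    for term in terms[1:-1]:
        query = f"({query} AND {term})"
    if len(terms) > 1:
        query = f"{query} AND {terms[-1]}"
    if aligned_only:
        query = f'{query} AND "aligned data"[Properties]'
    return query


terms = ['"mus musculus"[Organism]', "BALB/c*", '"lymph*"', '"rna seq"[Strategy]']
print(build_query(terms))
print(build_query(terms, aligned_only=True))
```

The printed strings can be pasted into the Entrez search bar as-is.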

**Obtain run accessions**
If you are only interested in one particular dataset, simply copy its Run accession and paste it into the `SRA_ACCESSION` variable below. To download a list of Run accessions selected from your [Entrez search](https://www.ncbi.nlm.nih.gov/sra/), follow the two steps listed after the table below. You might see several different types of accession IDs; the table describes what the different prefixes mean:

**Prefix: Accession Name: Definition**
- SRX: Experiment: Metadata about library, platform, selection.
- SRR: Run: The actual sequence data for an experiment.
- SRP: Study: Metadata about project (BioProject).
- SRS: Sample: Metadata about the physical Sample (BioSample)
- SRZ: Analysis: Mapped/aligned reads file (BAM) & metadata.
- SRA: Submission: Metadata of other 5 linked objects.
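
The prefix table can be turned into a small lookup, sketched here in Python. The function name `accession_type` is hypothetical. One assumption goes beyond the table: accessions starting with `E` or `D` (issued by ENA and DDBJ, such as the `ERR*` and `DRR*` runs that appear later in this notebook) are treated as following the same third-letter scheme as the `S` prefixes.

```python
import re

# Third letter of the prefix -> accession name, taken from the table above.
ACCESSION_TYPES = {
    "X": "Experiment",
    "R": "Run",
    "P": "Study",
    "S": "Sample",
    "Z": "Analysis",
    "A": "Submission",
}

# Assumption: E (ENA) and D (DDBJ) prefixes mirror the S (NCBI) scheme.
_ACCESSION = re.compile(r"^[SED]R([XRPSZA])\d+$")


def accession_type(accession):
    """Return the accession name for an ID, or None if the prefix is not recognised."""
    match = _ACCESSION.match(accession.strip().upper())
    return ACCESSION_TYPES[match.group(1)] if match else None


print(accession_type("SRR10985476"))
```

Only IDs that resolve to `Run` are suitable inputs for the download commands later in this notebook.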

- Click **Send to** in the top right-hand corner of the search page, select the **File** radio button, then select **Accession List**. This downloads a .txt file to the downloads folder on your local computer.
- Upload this file to this JupyterLab instance (see the folders on the left), into the main base directory unless otherwise specified.

*To learn how to use Advanced Search Builder please refer to Search in* [SRAHELP](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch)
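
Before handing the downloaded list to the toolkit, it can be worth checking its contents. The sketch below assumes the Accession List file contains one Run accession per line; the function name `read_run_accessions` is hypothetical.

```python
import re

# Run accessions: SRR (NCBI), ERR (ENA) or DRR (DDBJ) followed by digits.
_RUN = re.compile(r"^[SED]RR\d+$")


def read_run_accessions(text):
    """Return the unique Run accessions in text, in order, rejecting anything else."""
    accessions = []
    for number, line in enumerate(text.splitlines(), start=1):
        accession = line.strip()
        if not accession:
            continue  # ignore blank lines
        if not _RUN.match(accession):
            raise ValueError(f"line {number}: {accession!r} is not a Run accession")
        if accession not in accessions:
            accessions.append(accession)
    return accessions


print(read_run_accessions("DRR058709\nDRR058710\n"))
```

For an uploaded file, pass its contents, for example `read_run_accessions(open("SraAccList.txt").read())`.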

## Download data

Now we are ready to download some SRA data using `fasterq-dump` from the docker container. SRA data is downloaded by Run accession, an alphanumeric code such as SRR10985476. The `docker run` commands below use two options:
- `-v $PWD:/output:rw` mounts the current host directory into the container at `/output` so the tool can write to it
- `-w /output` sets the container working directory to that mounted volume
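
The same command line can be built from Python, which is convenient when looping over several accessions. This is a sketch only: the helper name `docker_cmd` is hypothetical, and it just assembles the argument list used throughout this notebook without running anything.

```python
import os
import shlex

IMAGE = "ncbi/sra-tools"


def docker_cmd(tool_args, host_dir=None, image=IMAGE):
    """Build a docker run argument list that mounts host_dir at /output."""
    host_dir = host_dir or os.getcwd()
    return [
        "docker", "run", "-t", "--rm",
        "-v", f"{host_dir}:/output:rw",  # host volume the tool writes to
        "-w", "/output",                 # container working directory
        image, *tool_args,
    ]


cmd = docker_cmd(["fasterq-dump", "-p", "SRR10985476"], host_dir="/home/jupyter")
print(shlex.join(cmd))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`.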

## Downloading with prefetch (optional but preferred best practice)
**Prefetch** is part of the SRA Toolkit. This program downloads Runs (sequence files in the compressed SRA format) and all additional data necessary to convert the Run from the SRA format to a more commonly used format. **Prefetch** can be used to correct and finish an incomplete Run download, and running it first is also much faster than simply using `fasterq-dump` directly.

`prefetch` + `fasterq-dump`

In [31]:
# Indicate which SRA code you want to download (this is for one at a time)
# To set an env variable in a Jupyter notebook, use a % magic command,
# either %env or %set_env, e.g., %env MY_VAR=MY_VALUE or %env MY_VAR MY_VALUE.
# (Use %env by itself to print out current environment variables.)

# other examples: SRR10985476
%env SRA_ACCESSION=SRR2017944
! echo $SRA_ACCESSION

env: SRA_ACCESSION=SRR2017944
SRR2017944


In [46]:
# # Download one accession to a new folder called "output" under the current directory
# # ! docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -e 2 -p SRR10985476
# %env output_folder=output
# # Create the output folder
# ! mkdir $PWD/$output_folder

env: output_folder=output


In [None]:
# Prefetch (optional but best practice)
# Question for NCBI: what does the "output" part here do? Seems like it should create a folder named output but it doesn't
! docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools prefetch $SRA_ACCESSION

# Actual Download
# You'll get an error if you run this more than once for the same accession: (rcExe,rcFile,rcPacking,rcName,rcExists)
! docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -p $SRA_ACCESSION

join   :|  0.0 0.0 0.0 ... 2.4 2.4 2.4 [progress output truncated]
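
Because running `fasterq-dump` a second time for the same accession fails when the output already exists, it can help to check for output files first. The sketch below assumes the default output names `<accession>.fastq`, `<accession>_1.fastq` and `<accession>_2.fastq`; the helper name `existing_fastq` is hypothetical.

```python
from pathlib import Path


def existing_fastq(accession, out_dir="."):
    """List FASTQ files for this accession that are already present in out_dir."""
    out = Path(out_dir)
    # Assumed default output names for unsplit and split output.
    candidates = [
        f"{accession}.fastq",
        f"{accession}_1.fastq",
        f"{accession}_2.fastq",
    ]
    return [name for name in candidates if (out / name).exists()]


found = existing_fastq("SRR2017944")
if found:
    print("Already downloaded, skipping:", ", ".join(found))
else:
    print("No existing output, safe to run fasterq-dump")
```

If you do want to regenerate the files, delete the listed files first, or check `fasterq-dump --help` for its option to overwrite existing output.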

## Splitting files
`--split-files` splits each spot into separate files according to how the reads were actually defined. If the original dataset was 2x44, the output is simply split in half. The problem with SRA is that a fair number of uploaded datasets are misformatted. For all ERR* datasets, do not use SRA; download the original fastq files from ENA instead. If those files have different numbers of reads, then that is what was uploaded.
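
One way to check whether the two mate files contain the same number of reads is to count FASTQ records. The sketch below assumes uncompressed FASTQ with exactly four lines per record; the function names `count_fastq_reads` and `mates_match` are hypothetical.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file, assuming four lines per record."""
    with open(path, "r") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def mates_match(path_1, path_2):
    """True when both mate files hold the same number of reads."""
    return count_fastq_reads(path_1) == count_fastq_reads(path_2)
```

For example, `mates_match("SRR11180057_1.fastq", "SRR11180057_2.fastq")` compares the two files once a split download has finished, assuming those are the names the files were written under.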

In [33]:
! docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump --split-files SRR11180057

spots read      : 903,571
reads read      : 1,807,142
reads written   : 1,807,142


In [35]:
# A list of Runs:
! docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools prefetch --option-file SraAccList.txt


2022-07-12T15:47:09 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-07-12T15:47:10 prefetch.3.0.0: 1) Downloading 'DRR058709'...
2022-07-12T15:47:10 prefetch.3.0.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2022-07-12T15:47:10 prefetch.3.0.0:  Downloading via HTTPS...
2022-07-12T15:50:00 prefetch.3.0.0:  HTTPS download succeed
2022-07-12T15:50:42 prefetch.3.0.0:  'DRR058709' is valid
2022-07-12T15:50:42 prefetch.3.0.0: 1) 'DRR058709' was downloaded successfully
2022-07-12T15:50:42 prefetch.3.0.0: 'DRR058709' has 0 unresolved dependencies

2022-07-12T15:50:42 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-07-12T15:50:43 prefetch.3.0.0 warn: Maximum file size download limit is 20GB 
2022-07-12T15:50:43 prefetch.3.0.0: 2) 'DRR058710' (28GB) is larger than max