Commit

Adding paragraph for NCBI SRA download and file conversion
tjakobi committed Sep 21, 2018
1 parent 6ed4b24 commit 31a7694
Showing 1 changed file with 40 additions and 4 deletions.
44 changes: 40 additions & 4 deletions docs/Detect.rst
@@ -56,15 +56,51 @@ In this tutorial, we use the data set from `Jakobi et al. 2016 <https://www.sci

Throughout this tutorial, we will employ Bash wrapper scripts that automate the analysis for multiple samples. While these scripts have been designed to be used with the `SLURM <https://slurm.schedmd.com/man_index.html>`_ workload manager, it is also possible to use them in conjunction with `GNU parallel <https://www.gnu.org/software/parallel/>`_ without SLURM.


Download of raw data files from the NCBI SRA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The raw data of the `Jakobi et al. 2016 <https://www.sciencedirect.com/science/article/pii/S167202291630033X>`_ study were uploaded to the NCBI Sequence Read Archive (SRA) and are stored there in the NCBI SRA format. Before we can start processing the data, the `sra-toolkit <https://github.com/ncbi/sra-tools/wiki/Downloads>`_ needs to be installed. To simplify the download process, we use the `wonderdump <http://data.biostarhandbook.com/scripts/README.html>`_ script.
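
Before starting a download of this size, it can help to confirm that the required command-line tools are on the ``PATH``. The following is a minimal sketch, not part of the tutorial scripts: ``check_tools`` is a hypothetical helper, and the tool list assumes that wonderdump relies on ``fastq-dump`` from the sra-toolkit and that ``wget`` and GNU ``parallel`` are used for the remaining steps.

.. code-block:: bash

    # hypothetical helper: report required tools that are missing from PATH
    # returns 0 if all tools are available, 1 otherwise
    check_tools() {
        local missing=0
        local tool
        for tool in "$@"; do
            if ! command -v "$tool" > /dev/null 2>&1; then
                echo "missing: $tool" >&2
                missing=1
            fi
        done
        return "$missing"
    }

    # assumed tool list: fastq-dump (sra-toolkit), wget, GNU parallel
    check_tools fastq-dump wget parallel || echo "Please install the missing tools before continuing." >&2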

.. code-block:: bash

    # create main folder and raw reads folder
    mkdir -p workflow/reads
    cd workflow/reads

    # wonderdump.sh is assumed to be present in this directory and executable
    # change the default download directory of wonderdump to the current directory
    sed -i 's/SRA_DIR=~\/ncbi\/public\/sra/SRA_DIR=.\//g' wonderdump.sh

    # get the list of accession numbers to download
    # also get a mapping file from SRA accession to original file name
    wget https://data.dieterichlab.org/s/jakobi2016_sra_list/download -O jakobi2016_sra_list.txt
    wget https://data.dieterichlab.org/s/sra_mapping/download -O mapping.txt

    # start wonderdump with the accession list and download the data (~29GB)
    # downloading and rewriting the files as gzipped .fastq files will take some time
    # in the end, the process will generate a set of 16 files (8 samples x 2 mates)
    cat jakobi2016_sra_list.txt | xargs -n 1 echo ./wonderdump.sh --split-files --gzip | bash

    # rename files from SRA accessions to the file names used throughout this tutorial
    # (--gzip produces .fastq.gz files, so the links carry the same extension)
    # for mate 1:
    parallel --link ln -s {2}_1.fastq.gz {1}1.fastq.gz :::: mapping.txt :::: jakobi2016_sra_list.txt
    # for mate 2:
    parallel --link ln -s {2}_2.fastq.gz {1}2.fastq.gz :::: mapping.txt :::: jakobi2016_sra_list.txt
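
``ln -s`` creates a symbolic link even when its target does not exist, so a mismatch between the downloaded file names and the link targets goes unnoticed until a later step fails. The following is a minimal sketch for checking the links after the renaming step; ``broken_links`` is a hypothetical helper and not part of the tutorial scripts.

.. code-block:: bash

    # hypothetical helper: list symbolic links in a directory whose targets do not exist
    # prints one broken link per line; prints nothing if all links resolve
    broken_links() {
        local dir="${1:-.}"
        local link
        for link in "$dir"/*; do
            # -L: the entry is a symbolic link; ! -e: its target is missing
            if [ -L "$link" ] && [ ! -e "$link" ]; then
                echo "$link"
            fi
        done
    }

    # check the current directory (workflow/reads in this tutorial)
    broken_links .

If the call prints nothing, every link in the directory points to an existing file.
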

Data structure
^^^^^^^^^^^^^^^^

Once the data is downloaded, the following data structure is assumed:

.. code-block:: bash

    cd workflow/reads
    ls -la