Running end to end pipeline

The overall pipeline has two different computational requirements. Running Guppy is GPU intensive while training the model and running inference is CPU intensive. So, downloading fast5s and basecalling is done on a GPU instance and training / inference is done on a large CPU instance.

Start an ubuntu18.08 image on AWS with 1000 GB of storage (tested with g4dn.xlarge for basecalling)
Download fast5 data from European Nucleotide Archive (ENA) under accession number PRJEB48183 into /home/ubuntu/fast5/. (~300 GB)
1. Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
  - cat aspera_fast5_paths.txt | xargs -I{} $HOME/.aspera/connect/bin/ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh {} /home/ubuntu/fast5/.
2. Download files from fast5_paths.txt.
  - cat fast5_paths.txt | xargs -I{} wget {} -P /home/ubuntu/fast5/
3. Download files from ENA directly. ENA creates a zip file called ena_files.zip. Unzip the archive and then move all tar.gz files to /home/ubuntu/fast5/
  - ls | xargs -I{} ls {} | grep tar.gz | xargs -I{} mv {} /home/ubuntu/fast5/
Untar fast5 files ls /home/ubuntu/fast5 | xargs -I{} tar -xzf /home/ubuntu/fast5/{} -C /home/ubuntu/fast5/
Get fastq files
1. OPTION 1
  1. Download fastq data from ENA under accession number PRJEB48183 into /home/ubuntu/fastq/
    1. Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
      - cat aspera_fastq_paths.txt | xargs -I{} $HOME/.aspera/connect/bin/ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh {} /home/ubuntu/fastq/.
    2. Download files from fast5_paths.txt.
      - cat fastq_paths.txt | xargs -I{} wget {} -P /home/ubuntu/fastq/
    3. Download files from ENA directly. ENA creates a zip file called ena_files.zip. Unzip the archive and then move all fastq.gz files to /home/ubuntu/fastq/
      - ls | xargs -I{} ls {} | grep fastq.gz | xargs -I{} mv {} /home/ubuntu/fastq/
  2. Gunzip fastq files ls /home/ubuntu/fastq | xargs -I{} gunzip /home/ubuntu/fastq/{}
  3. Run python3 prep_fastqs.py to place fastq files into the expected subdirectories
2. OPTION 2: Basecall fast5 files using guppy. It is recommended to switch to a GPU node to basecall
  1. ```
  sudo apt-get install git
  git clone https://github.com/adbailey4/yeast_rrna_modification_detection
  source /home/ubuntu/yeast_rrna_modification_detection/load_environment.sh
  bash /home/ubuntu/yeast_rrna_modification_detection/ubuntu18_setup.sh
  bash /home/ubuntu/yeast_rrna_modification_detection/basecalling/install_guppy.sh
  bash /home/ubuntu/yeast_rrna_modification_detection/basecalling/run_guppy.sh /home/ubuntu/fast5 /home/ubuntu/fastq
```
Run entire pipeline on cpu heavy compute node (tested on c5.metal). bash /home/ubuntu/yeast_rrna_modification_detection/end_to_end/run_entire_pipeline.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Running end to end pipeline

Files

README.md

Latest commit

History

README.md

File metadata and controls

Running end to end pipeline