- The overall pipeline has two different computational requirements. Running Guppy is GPU intensive while training the model and running inference is CPU intensive. So, downloading fast5s and basecalling is done on a GPU instance and training / inference is done on a large CPU instance.
- Start an ubuntu18.08 image on AWS with 1000 GB of storage (tested with g4dn.xlarge for basecalling)
- Download fast5 data from European Nucleotide Archive (ENA) under accession number PRJEB48183 into
/home/ubuntu/fast5/
. (~300 GB)- Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
cat aspera_fast5_paths.txt | xargs -I{} $HOME/.aspera/connect/bin/ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh {} /home/ubuntu/fast5/.
- Download files from
fast5_paths.txt
.cat fast5_paths.txt | xargs -I{} wget {} -P /home/ubuntu/fast5/
- Download files from ENA directly. ENA creates a zip file called
ena_files.zip
. Unzip the archive and then move all tar.gz files to/home/ubuntu/fast5/
ls | xargs -I{} ls {} | grep tar.gz | xargs -I{} mv {} /home/ubuntu/fast5/
- Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
- Untar fast5 files
ls /home/ubuntu/fast5 | xargs -I{} tar -xzf /home/ubuntu/fast5/{} -C /home/ubuntu/fast5/
- Get fastq files
-
OPTION 1
- Download fastq data from ENA under accession number PRJEB48183 into
/home/ubuntu/fastq/
- Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
cat aspera_fastq_paths.txt | xargs -I{} $HOME/.aspera/connect/bin/ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh {} /home/ubuntu/fastq/.
- Download files from
fast5_paths.txt
.cat fastq_paths.txt | xargs -I{} wget {} -P /home/ubuntu/fastq/
- Download files from ENA directly. ENA creates a zip file called
ena_files.zip
. Unzip the archive and then move all fastq.gz files to/home/ubuntu/fastq/
ls | xargs -I{} ls {} | grep fastq.gz | xargs -I{} mv {} /home/ubuntu/fastq/
- Use IBM's Aspera connect to download files. Here is a gist I made for easy installation.
- Gunzip fastq files
ls /home/ubuntu/fastq | xargs -I{} gunzip /home/ubuntu/fastq/{}
- Run
python3 prep_fastqs.py
to place fastq files into the expected subdirectories
- Download fastq data from ENA under accession number PRJEB48183 into
-
OPTION 2: Basecall fast5 files using guppy. It is recommended to switch to a GPU node to basecall
-
sudo apt-get install git git clone https://github.com/adbailey4/yeast_rrna_modification_detection source /home/ubuntu/yeast_rrna_modification_detection/load_environment.sh bash /home/ubuntu/yeast_rrna_modification_detection/ubuntu18_setup.sh bash /home/ubuntu/yeast_rrna_modification_detection/basecalling/install_guppy.sh bash /home/ubuntu/yeast_rrna_modification_detection/basecalling/run_guppy.sh /home/ubuntu/fast5 /home/ubuntu/fastq
-
-
- Run entire pipeline on cpu heavy compute node (tested on c5.metal).
bash /home/ubuntu/yeast_rrna_modification_detection/end_to_end/run_entire_pipeline.sh