The purpose of this notebook is to download the assembled contigs, BLAST them against nt and nr, then upload the BLAST results to S3 alongside the contigs.

1. Set up an m5a.12xlarge instance, with keys and ports for jupyter from the `czbiohub-miniconda` AMI. Storage is at `/mnt/data`.

`aegea launch --iam-role S3fromEC2 --ami-tags Name=czbiohub-jupyter -t m5a.12xlarge  batson-blast`

Set up Jupyter on the remote

`aegea ssh batson@batson-blast`
`tmux`
`jupyter notebook`

Port forwarding from laptop

`aegea ssh batson@batson-blast -NL localhost:8899:localhost:8888`

2. Download contigs

`mkdir /mnt/data/contigs`
`aws s3 sync s3://czbiohub-mosquito/contigs/ /mnt/data/contigs --exclude "*" --include "*.fasta" --dryrun`

3. Install requirements

`conda install -c bioconda -c conda-forge blast`

`mkdir /mnt/data/blast`
`cd /mnt/data/blast`
`update_blastdb.pl --decompress nt nr taxdb`

4. Run BLAST

Loop over all samples and run both searches on each sample's contigs

`BLASTDB=/mnt/data/blast blastn -db nt -num_threads 48 -query /mnt/data/contigs/{SAMPLE}/contigs.fasta -outfmt 7 -out /mnt/data/contigs/{SAMPLE}/blastn_nt.m9 -evalue 1e-1`

`BLASTDB=/mnt/data/blast blastx -db nr -num_threads 48 -query /mnt/data/contigs/{SAMPLE}/contigs.fasta -outfmt 7 -out /mnt/data/contigs/{SAMPLE}/blastx_nr.m9 -evalue 1e-1`

5. Upload results

`aws s3 sync /mnt/data/contigs/ s3://czbiohub-mosquito/contigs/ --exclude "*" --include "*.m9" --dryrun`

In [2]:
!mkdir /mnt/data/contigs

mkdir: cannot create directory ‘/mnt/data/contigs’: File exists


In [3]:
!aws s3 sync s3://czbiohub-mosquito/contigs/ /mnt/data/contigs --exclude "*" --include "*.fasta" --dryrun | wc -l

160


In [4]:
!aws s3 sync s3://czbiohub-mosquito/contigs/ /mnt/data/contigs --exclude "*" --include "*.fasta"

download: s3://czbiohub-mosquito/contigs/CMS001_004_Ra_S2/contigs.fasta to ../../../../mnt/data/contigs/CMS001_004_Ra_S2/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_007_Ra_S12/contigs.fasta to ../../../../mnt/data/contigs/CMS001_007_Ra_S12/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_002_Ra_S1/contigs.fasta to ../../../../mnt/data/contigs/CMS001_002_Ra_S1/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_010_Ra_S1/contigs.fasta to ../../../../mnt/data/contigs/CMS001_010_Ra_S1/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_006_Ra_S5/contigs.fasta to ../../../../mnt/data/contigs/CMS001_006_Ra_S5/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_001_Ra_S1/contigs.fasta to ../../../../mnt/data/contigs/CMS001_001_Ra_S1/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_008_Ra_S3/contigs.fasta to ../../../../mnt/data/contigs/CMS001_008_Ra_S3/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_011_R

download: s3://czbiohub-mosquito/contigs/CMS002_013a_Rb_S120_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_013a_Rb_S120_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_016a_Rb_S121_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_016a_Rb_S121_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_060_Ra_S12/contigs.fasta to ../../../../mnt/data/contigs/CMS001_060_Ra_S12/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS001_051_Ra_S8/contigs.fasta to ../../../../mnt/data/contigs/CMS001_051_Ra_S8/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_017b_Rb_S123_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_017b_Rb_S123_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_017c_Rb_S124_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_017c_Rb_S124_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_018a_Rb_S128_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_018a_

download: s3://czbiohub-mosquito/contigs/CMS002_045f_Rb_S189_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_045f_Rb_S189_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_044e_Rb_S182_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_044e_Rb_S182_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_045b_Rb_S184_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_045b_Rb_S184_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_045c_Rb_S185_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_045c_Rb_S185_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_045a_Rb_S183_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_045a_Rb_S183_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_038a_Rb_S172_L004/contigs.fasta to ../../../../mnt/data/contigs/CMS002_038a_Rb_S172_L004/contigs.fasta
download: s3://czbiohub-mosquito/contigs/CMS002_045d_Rb_S186_L004/contigs.fasta to ../../../..
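
A quick sanity check after the sync can catch samples that did not download. The following is a minimal sketch, assuming each sample has its own subdirectory under `/mnt/data/contigs` containing a `contigs.fasta`, as the download log suggests. `find_missing_contigs` is a hypothetical helper, not part of the original workflow.

```python
from pathlib import Path


def find_missing_contigs(contigs_root):
    """Return names of sample directories lacking a non-empty contigs.fasta.

    Hypothetical helper: assumes one subdirectory per sample under contigs_root.
    """
    root = Path(contigs_root)
    missing = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        fasta = sample_dir / "contigs.fasta"
        # Treat both an absent file and a zero-byte file as a failed download.
        if not fasta.is_file() or fasta.stat().st_size == 0:
            missing.append(sample_dir.name)
    return missing
```

Calling `find_missing_contigs("/mnt/data/contigs")` should return an empty list if every sample directory received its FASTA file.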

To download the contigs, we sync from

`s3://czbiohub-mosquito/contigs/SAMPLE/contigs.fasta`

To set up the BLAST databases, follow https://czbiohub.atlassian.net/wiki/spaces/DS/pages/903905690/nt+nr+BLAST+etc+on+EC2


In [1]:
samples = !ls /mnt/data/contigs

In [17]:
test_samples = ['CMS002_045c_Rb_S185_L004']

In [None]:
for sample in samples:
    print("NT: Beginning sample ", sample)
    
    !BLASTDB=/mnt/data/blast blastn -db nt -num_threads 48 \
         -query /mnt/data/contigs/{sample}/contigs.fasta -outfmt "7 std staxid ssciname scomname stitle" \
         -out /mnt/data/contigs/{sample}/blast_nt.m9 -evalue 1e-1

for sample in samples:
    print("NR: Beginning sample ", sample)

    !BLASTDB=/mnt/data/blast blastx -db nr -num_threads 48 \
         -query /mnt/data/contigs/{sample}/contigs.fasta -outfmt "7 std staxid ssciname scomname stitle" \
         -out /mnt/data/contigs/{sample}/blast_nr.m9 -evalue 1e-1

NT: Beginning sample  CMS001_001_Ra_S1
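
Running 160 samples through nt and nr takes a long time, so it may help to be able to restart the loop without redoing finished samples. The sketch below is one way to do that with `subprocess` instead of `!` shell escapes. The helper names (`blast_command`, `pending_samples`, `run_pending`) are hypothetical, and the check only looks for a missing or empty output file, so a partial file left by an interrupted run would be treated as finished and needs to be deleted by hand.

```python
import os
import subprocess
from pathlib import Path

BLAST_OUTFMT = "7 std staxid ssciname scomname stitle"


def blast_command(program, db, sample, contigs_root="/mnt/data/contigs",
                  threads=48, evalue="1e-1"):
    """Build the argument list for one blastn/blastx run (hypothetical helper)."""
    sample_dir = Path(contigs_root) / sample
    return [
        program, "-db", db, "-num_threads", str(threads),
        "-query", str(sample_dir / "contigs.fasta"),
        "-outfmt", BLAST_OUTFMT,
        "-out", str(sample_dir / f"blast_{db}.m9"),
        "-evalue", evalue,
    ]


def pending_samples(samples, db, contigs_root="/mnt/data/contigs"):
    """Return samples whose blast_<db>.m9 output is missing or empty."""
    pending = []
    for sample in samples:
        out = Path(contigs_root) / sample / f"blast_{db}.m9"
        if not out.is_file() or out.stat().st_size == 0:
            pending.append(sample)
    return pending


def run_pending(samples, program, db, blastdb="/mnt/data/blast"):
    """Run BLAST only for samples without output; raises on the first failure."""
    env = {**os.environ, "BLASTDB": blastdb}  # same BLASTDB as the cells above
    for sample in pending_samples(samples, db):
        print(f"{db.upper()}: Beginning sample {sample}")
        subprocess.run(blast_command(program, db, sample), env=env, check=True)
```

For example, `run_pending(samples, "blastn", "nt")` followed by `run_pending(samples, "blastx", "nr")` would mirror the two loops above.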


In [None]:
BLAST_OUTFMT = "7 std staxid ssciname scomname stitle"  # -outfmt string shared by the blastn and blastx calls

In [None]:
"7 std staxid ssciname scomname stitle"

In [4]:
!blastx -help

USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge flo

In [None]:
!aws s3 sync /mnt/data/contigs/ s3://czbiohub-mosquito/contigs/ --exclude "*" --include "*_nt.m9" --dryrun
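
The `.m9` files written with `-outfmt "7 std staxid ssciname scomname stitle"` are tab-separated, with `#` comment lines between queries. A minimal sketch for reading them back in Python follows. The column names assume the `std` field list of recent BLAST+ releases (older releases call the first two columns `qseqid` and `sseqid`), and `parse_m9` is a hypothetical helper.

```python
# Column order for "std" in recent BLAST+ releases, followed by the four
# extra fields requested in the -outfmt string used above.
STD_FIELDS = [
    "qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FIELDS = STD_FIELDS + ["staxid", "ssciname", "scomname", "stitle"]


def parse_m9(handle):
    """Yield one dict per hit line, skipping blank and '#' comment lines.

    Values are left as strings; convert pident, evalue, etc. as needed.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))
```

Usage would look like `with open("/mnt/data/contigs/SAMPLE/blast_nt.m9") as f: hits = list(parse_m9(f))`, with `SAMPLE` replaced by a sample directory name.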

In [14]:
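!mkdir -p /mnt/data/contigs/prefixed
for sample in samples:
    # Prefix each header with its own sample name (not samples[0]) so contigs stay traceable after concatenation.
    !sed 's/^>N/>{sample}~N/' /mnt/data/contigs/{sample}/contigs.fasta > /mnt/data/contigs/prefixed/{sample}_contigs.fasta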

In [15]:
!ls /mnt/data/contigs/prefixed | wc -l

160
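
Each header should carry the name of the sample it came from, so that contigs remain identifiable once all files are concatenated. The same prefixing can be sketched in pure Python, which avoids shell quoting issues. `prefix_fasta_headers` is a hypothetical helper; unlike the `sed` pattern, which only matches headers starting with `>N`, it prefixes every header line.

```python
def prefix_fasta_headers(lines, sample, sep="~"):
    """Yield FASTA lines with '<sample><sep>' inserted after '>' in each header.

    Sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield f">{sample}{sep}{line[1:]}"
        else:
            yield line
```

A file could be rewritten with `dst.writelines(prefix_fasta_headers(src, sample))`, where `src` and `dst` are open file handles for the input and prefixed output.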


In [1]:
!cat /mnt/data/contigs/prefixed/*.fasta > /mnt/data/contigs/all.fasta 

In [2]:
!head /mnt/data/contigs/all.fasta

>CMS001_001_Ra_S1~NODE_1_length_10868_cov_37.868316
GATCTCTTGGTGACTGTTTTGTTGACAATTGCCAAGCCGGCGGTAGCTGTATTGTTACTC
TATCGCTTCAAATCGTTCTCGTGACGCCACGCGAACAACCATGCCGACGTTTCCGAACCA
ACGCAAAACCCTGGCATCTGCCAAGCCGTCGAGGACGCCCCGAGACGGTACTAAAGTCGC
GGAGAATCGCGGGTATTGTTACCTCGCGCTCTTTGAGGCTTTAAACGCGTCGTCTGAGAA
GAAGTTAGACGTTGCGTCAATCAAGGCTCGTTTGGGAGCCTTTCCCCTCGTGAGACGCGT
TGTTGGGGAGTTGTACGCTCACGTGACATTTGACTTGTTTGTTCCCTGCGTGCGTAGGGT
AAGCAATACGATGTTCCACGTGGACGAATGGCGCCCCCCAATGTTGTTCTCTGAGGTACT
CGCGATGACGATCTTTTCGAGTGCGAGAATTGGTGCGGATGACCGTGCCCACTTGCAGCA
GCAACAGCTGATCAGGGTGCAGGACTTATGTAAGACTGCAGGCCTGGACCGCAACACCGT
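
Downstream, hits against `all.fasta` need to be mapped back to a sample and contig. A minimal sketch of splitting the prefixed header, assuming every header follows the `SAMPLE~NODE_<n>_length_<len>_cov_<cov>` pattern of the one shown above (this is inferred from that single header, and `parse_header` is a hypothetical helper):

```python
import re

# Pattern inferred from the header shown above; other assemblers or naming
# schemes would need a different expression.
HEADER_RE = re.compile(
    r"^>?(?P<sample>[^~]+)~"
    r"(?P<contig>NODE_(?P<node>\d+)_length_(?P<length>\d+)_cov_(?P<cov>[\d.]+))$"
)


def parse_header(header):
    """Split a prefixed contig header into sample, contig name, and stats."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return {
        "sample": match.group("sample"),
        "contig": match.group("contig"),
        "node": int(match.group("node")),
        "length": int(match.group("length")),
        "coverage": float(match.group("cov")),
    }
```

For the header above, this yields sample `CMS001_001_Ra_S1`, contig `NODE_1_length_10868_cov_37.868316`, length 10868 and coverage 37.868316.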


In [None]:
!/home/ubuntu/plastbinary_linux_20160121/plast -p plastx \
    -i /mnt/data/contigs/all.fasta \
    -d /mnt/data/blast/nr.pal \
    -o /mnt/data/plast_output.tab \
    -e 1e-2 -a 48 -max-hit-per-query 30 -outfmt 1 \
    -bargraph -verbose \
    -max-database-size 200000000

In [4]:
!/home/ubuntu/plastbinary_linux_20160121/plast -h

plast 2.3.1
  - Build date: 2016-01-21 13:53:54
  - OS: Linux-2.6.32-279.el6.x86_64
  - Compiler: /usr/bin/gcc (4.4.6)
  - Host CPU: 48 cores available

[*] denotes mandatory argument.
	-p [*]:	 Program Name [plastp, tplastn, plastx, tplastx or plastn]
	-d [*]:	 Subject database file
	-i [*]:	 Query database file
	-o :	 PLAST report Output File
	-e :	 Expectation value
	-n :	 Size of neighbourhood peforming ungapped extension
	-s :	 Ungapped threshold trigger a small gapped extension
	-g :	 threshold for small gapped extension
	-b :	 bandwith for small gapped extension
	-a :	 Number of processors to use
	-G :	 Cost to open a gap
	-E :	 Cost to extend a gap
	-xdrop-ungap :	 X dropoff value for Ungapped alignment (in bits) (zero invokes default behavior 20 bits)
	-X :	 X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
	-Z :	 X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
	-index-threshold :	 Index thres