# BLASTing Transcriptome 2.0 Against Genomes from NCBI

Aidan Coyle, afcoyle@uw.edu

Roberts Lab, UW-SAFS

2021-03-02

## Script Description

This script takes a transcriptome of mixed _Chionoecetes bairdi_ and _Hematodinium sp._ sequences and BLASTs it against genomes, downloaded from NCBI, of species closely related to each. The purpose is to determine which sequences in our transcriptome originate from _C. bairdi_ and which originate from _Hematodinium sp._

The most closely related available genome to _C. bairdi_ is a _C. opilio_ genome, and the most closely related to _Hematodinium sp._ is an _Amoebophrya sp._ genome.

To speed up the processing time, we will run these jobs on Mox, meaning that commands will generally be copy-pasted rather than run directly in this notebook.

### Transcriptome

By Roberts Lab notation, cbai_transcriptome_v2.0.fasta, or 20200507.C_bairdi.Trinity.fasta

Created **with no taxonomic filter**, meaning that it contains both _C. bairdi_ and _Hematodinium_ sequences

[Directly available here](https://owl.fish.washington.edu/halfshell/genomic-databank/cbai_transcriptome_v2.0.fasta), and [lab notebook describing creation available here](https://robertslab.github.io/sams-notebook/2020/05/02/Transcriptome-Assembly-C.bairdi-All-RNAseq-Data-Without-Taxonomic-Filters-with-Trinity-on-Mox.html)

### _Chionoecetes opilio_ genome

FASTA file containing all protein sequences from the snow crab genome, _Chionoecetes opilio_. This is the most closely related publicly available genome to _C. bairdi_ at time of writing. Snow crab and Tanner crab are extremely closely related - in fact, they can hybridize - so our BLAST should be quite good at specifically pulling _Chionoecetes_ sequences.

Importantly, in our BLAST, we will build our database from **protein** sequences, not nucleotide sequences. Since our transcriptome contains nucleotide sequences, we will use BLASTx, not BLASTn.

The genome is available from the [NCBI taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). Here is a [specific link to the genome](https://www.ncbi.nlm.nih.gov/genome/?term=txid41210[Organism:exp]). Genome was assembled by researchers at Seoul National University and submitted on 2021-01-08. To download the protein sequences we will be using, follow [this link](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/584/305/GCA_016584305.1_ASM1658430v1/GCA_016584305.1_ASM1658430v1_protein.faa.gz).

### _Amoebophrya sp._ genome

FASTA file containing all nucleotide sequences from the genome of the parasitic dinoflagellate _Amoebophrya sp._ _Amoebophrya sp._ and _Hematodinium sp._ are far more distantly related than _C. opilio_ and _C. bairdi_, but the two parasitic dinoflagellates do share an order, Syndiniales. Our BLAST may therefore be less accurate at identifying which sequences originate from _Hematodinium_, particularly as parasite genomes tend to be somewhat reduced. However, it is the most closely related available genome, and should hopefully let us identify which sequences in our transcriptome originate from _Hematodinium_.

As background, _Amoebophrya sp._ is a parasite of other dinoflagellates, while _Hematodinium sp._ is, of course, a parasite of crustaceans.

Importantly, this genome does not have protein sequences available, and so our BLAST will involve a comparison of nucleotide sequences, meaning we will use BLASTn.
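The choice between BLASTx and BLASTn above follows from the sequence types of the query and the database. A minimal sketch of that mapping is below; `pick_blast_program` is a hypothetical helper written for illustration (it is not part of BLAST+), and `nucl`/`prot` simply reuse the values passed to `makeblastdb -dbtype`.

```shell
# Hypothetical helper: map query and database sequence types to a BLAST+ program.
pick_blast_program() {
    local query_type="$1" db_type="$2"
    case "${query_type}:${db_type}" in
        nucl:prot) echo "blastx" ;;   # translated nucleotide query vs. protein database
        nucl:nucl) echo "blastn" ;;   # nucleotide vs. nucleotide (tblastx would compare translations instead)
        prot:prot) echo "blastp" ;;   # protein vs. protein
        prot:nucl) echo "tblastn" ;;  # protein query vs. translated nucleotide database
        *) echo "unknown sequence type pair: ${query_type}:${db_type}" >&2; return 1 ;;
    esac
}

pick_blast_program nucl prot   # transcriptome vs. C. opilio proteins: prints blastx
pick_blast_program nucl nucl   # transcriptome vs. Amoebophrya genome: prints blastn
```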

The genome is available from the [NCBI taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). Here is a [specific link to the genome](https://www.ncbi.nlm.nih.gov/genome/?term=txid1775427[Organism:exp]). Genome was assembled as part of [John et al. 2019 "An aerobic eukaryotic parasite with functional mitochondria that likely lacks a mitochondrial genome", _Science Advances_ 5(4): DOI: 10.1126/sciadv.aav1110](https://pubmed.ncbi.nlm.nih.gov/31032404/).

To download the nucleotide sequences we will be using, follow [this link](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/005/223/375/GCA_005223375.1_ASM522337v1/GCA_005223375.1_ASM522337v1_genomic.fna.gz)

### Mox

We will be running most commands (except the slurm script) from the login node of Mox (specifically /gscratch/srlab/afcoyle). Generally, commands prefaced with a ! were run on the local machine. Standard commands were run on either Mox or Gannet, as indicated by notes.

In [None]:
# Download C. opilio protein sequences from genome
[afcoyle@mox2 afcoyle]$ curl -o projects/21_ncbi_genome_blasts/data/opilio_genome.fasta.gz \
-k https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/584/305/GCA_016584305.1_ASM1658430v1/GCA_016584305.1_ASM1658430v1_protein.faa.gz

In [None]:
# Unzip
[afcoyle@mox2 afcoyle]$ gunzip projects/21_ncbi_genome_blasts/data/opilio_genome.fasta.gz

In [None]:
# Download Amoebophyra sp. genome
[afcoyle@mox2 afcoyle]$ curl -o projects/21_ncbi_genome_blasts/data/amoebo_genome.fasta.gz \
-k https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/005/223/375/GCA_005223375.1_ASM522337v1/GCA_005223375.1_ASM522337v1_genomic.fna.gz

In [None]:
# Unzip
[afcoyle@mox2 afcoyle]$ gunzip projects/21_ncbi_genome_blasts/data/amoebo_genome.fasta.gz

In [None]:
# Download Transcriptome v2.0
[afcoyle@mox2 afcoyle]$ curl -o projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
> -k https://owl.fish.washington.edu/halfshell/genomic-databank/cbai_transcriptome_v2.0.fasta

#  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                 Dload  Upload   Total   Spent    Left  Speed
# 100  903M  100  903M    0     0  26.9M      0  0:00:33  0:00:33 --:--:-- 28.0M

In [None]:
# Check against checksum for Transcriptome v2.0
[afcoyle@mox2 afcoyle]$ md5sum projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
| grep 01adbd54298495c147767b19ee5c0de9
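Piping `md5sum` into `grep` prints the line when the checksum matches and prints nothing when it does not, so a mismatch is easy to miss. The sketch below makes the failure explicit; `verify_md5` is a hypothetical helper name, and it assumes GNU `md5sum` and `awk` are available (as they appear to be on Mox).

```shell
# Hypothetical helper: succeed only if FILE's MD5 digest equals EXPECTED.
verify_md5() {
    local file="$1" expected="$2" actual
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}

# Usage with the transcriptome checksum quoted above:
# verify_md5 projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
#     01adbd54298495c147767b19ee5c0de9
```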

In [None]:
# Create blast database for opilio genome
[afcoyle@mox2 afcoyle]$ /gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/makeblastdb \
> -in projects/21_ncbi_genome_blasts/data/opilio_genome.fasta \
> -dbtype prot \
> -parse_seqids \
> -out projects/21_ncbi_genome_blasts/output/blastdbs/opilio/opilio_blastdb


# Building a new DB, current time: 03/02/2021 21:33:13
# New DB name:   /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/opilio/opilio_blastdb
# New DB title:  projects/21_ncbi_genome_blasts/data/opilio_genome.fasta
# Sequence type: Protein
# Keep MBits: T
# Maximum file size: 1000000000B
# Adding sequences from FASTA; added 22637 sequences in 0.909929 seconds.

In [None]:
# Create blast database for amoebo genome
[afcoyle@mox2 afcoyle]$ /gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/makeblastdb \
> -in projects/21_ncbi_genome_blasts/data/amoebo_genome.fasta \
> -dbtype nucl \
> -parse_seqids \
> -out projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb


# Building a new DB, current time: 03/02/2021 22:52:11
# New DB name:   /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb
# New DB title:  projects/21_ncbi_genome_blasts/data/amoebo_genome.fasta
# Sequence type: Nucleotide
# Keep MBits: T
# Maximum file size: 1000000000B
# Adding sequences from FASTA; added 13796 sequences in 1.52016 seconds.

### Mox slurm script for opilio BLASTx

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210302_opilio_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_opilioblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle


/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastx \
-task="blastx" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/opilio/opilio_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_blastres.tab \
-evalue 1E-05 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [3]:
# Submitted the Mox script above for our opilio blast
[afcoyle@mox1 afcoyle]$ sbatch jobs/20210302_opilio_blast.sh
Submitted batch job 1608412

### Mox slurm script for Amoebo BLASTn

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210302_amoebo_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastn \
-task="blastn" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_blastres.tab \
-evalue 1E-05 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
# Submit slurm script for amoebo blastn
[afcoyle@mox1 afcoyle]$ sbatch jobs/20210302_amoebo_blast.sh
Submitted batch job 1608413

### Transfer Files to Gannet, then local machine

In [None]:
# Transfer opilio BLAST output from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_blastres.tab \
hemat_proj/

In [None]:
# Transfer amoebo BLAST output from Mox to Gannet, same folder
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_blastres.tab \
hemat_proj/

In [None]:
# Transfer both BLAST outputs to local machine
# Ran command from local machine
# Using absolute path, as relative path fails
rsync -chavzP --stats \
afcoyle@gannet.fish.washington.edu:/volume2/web/nerka/mox_transfers/scrubbed/hemat_proj/ \
/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/output/BLASTs/

In [None]:
# Move each file into a separate folder and rename for clarity
!mv ../output/BLASTs/amoebo_blastres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_5_blastnres.tab

In [None]:
!mv ../output/BLASTs/opilio_blastres.tab ../output/BLASTs/opilio_genome/cbai2.0_eval10_5_blastxres.tab

## Exploratory Analysis of BLAST results

At this stage, since all relevant files are now on our local machine, we will revert to running commands directly in this Jupyter notebook rather than copy-pasting

In [2]:
# See how many sequences are in our original cbai_hemat transcriptome
!grep -c ">" ../data/transcriptomes/cbai_hemat_transcriptome_v2.0.fasta

1412254


In [12]:
# See how many sequences got matched in our BLASTx against the opilio genome
!wc -l ../output/BLASTs/opilio_genome/cbai2.0_eval10_5_blastxres.tab

439514 ../output/BLASTs/opilio_genome/cbai2.0_eval10_5_blastxres.tab


In [11]:
# See how many sequences got matched in our BLASTn against the amoebo genome
!wc -l ../output/BLASTs/amoebo_genome/cbai2.0_eval10_5_blastnres.tab

39080 ../output/BLASTs/amoebo_genome/cbai2.0_eval10_5_blastnres.tab


That sums to ~480,000 total matches to both genomes combined out of 1.4 million sequences in the original transcriptome. This might be worth redoing with a less stringent e-value cutoff - maybe 10^-5 was too strict. Still, with the majority of our sequences presumably belonging to _C. bairdi_, I definitely would've expected to see more matches to the opilio genome at least.
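One caveat on the counts above: `wc -l` counts rows in the tabular output, and a single query can contribute more than one row (for example, several HSPs against the same subject), so the row count may overstate the number of transcripts with a hit. A minimal sketch for reporting both numbers follows; `count_rows_and_queries` is a hypothetical helper, and it assumes the default `-outfmt 6` layout in which column 1 is the query ID.

```shell
# Hypothetical helper: print "<rows><TAB><distinct query IDs>" for a tabular BLAST file.
count_rows_and_queries() {
    local tab="$1" rows queries
    rows=$(wc -l < "$tab" | tr -d ' ')
    queries=$(cut -f1 "$tab" | sort -u | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$rows" "$queries"
}

# Usage:
# count_rows_and_queries ../output/BLASTs/opilio_genome/cbai2.0_eval10_5_blastxres.tab
```

If the two numbers differ noticeably, the second is probably the better estimate of how many of the 1.4 million sequences matched.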

Alright, why wonder - let's redo the BLASTs, but with e-value set to 10^-3 instead of 10^-5! Slurm scripts will have just two changes from the two above: Evalue changed from 10^-5 to 10^-3, and output names changed to speciesabbr_highereval_blastres.tab. Still, for the sake of thoroughness, the full slurm scripts are pasted below. 

Note: I initially did this without changing the output names and accidentally overwrote our amoebo_blastres.tab and opilio_blastres.tab files on Mox - luckily, we had already transferred them to both Gannet and our local machine! Lesson here: back your stuff up!
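One way to make this kind of overwrite harder is to derive each output name from the run parameters and to refuse to start if the file is already there. The sketch below is only an illustration: `blast_out_name` and `refuse_overwrite` are hypothetical helpers, and the naming pattern is an assumption modelled on the file names used later in this notebook.

```shell
# Hypothetical naming helper: species abbreviation + e-value exponent + program.
blast_out_name() {
    local species="$1" exponent="$2" program="$3"
    printf '%s_eval10_%s_%sres.tab\n' "$species" "$exponent" "$program"
}

# Hypothetical guard: fail if the target file is already present.
refuse_overwrite() {
    local path="$1"
    if [ -e "$path" ]; then
        echo "refusing to overwrite existing file: $path" >&2
        return 1
    fi
}

# Usage inside a job script (the blastx call is left as a comment here):
# out=$(blast_out_name opilio 3 blastx)
# refuse_overwrite "$out" || exit 1
# blastx ... -out "$out"
```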

# BLAST Round 2

Again, we're performing two BLASTs - a BLASTn of cbai_transcriptomev2.0 (which contains both _C. bairdi_ and _Hematodinium_ sequences) against a genome for _Amoebophrya sp._ and a BLASTx against the protein sequences from the _Chionoecetes opilio_ genome. 

### Mox slurm script for BLASTx vs. _C. opilio_. 

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210303_opilio_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_opilioblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle


/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastx \
-task="blastx" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/opilio/opilio_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_highereval_blastres.tab \
-evalue 1E-03 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 afcoyle]$ sbatch jobs/20210303_opilio_blast.sh
Submitted batch job 1611098

### Mox slurm script for BLASTn vs. _Amoebophrya sp._

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210303_amoebo_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastn \
-task="blastn" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_highereval_blastres.tab \
-evalue 1E-03 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 afcoyle]$ sbatch jobs/20210303_amoebo_blast.sh
Submitted batch job 1611099

## Transfer files from Mox to Gannet to local machine

In [None]:
# Transfer opilio BLAST output from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_highereval_blastres.tab \
hemat_proj/

In [None]:
# Transfer amoebo BLAST output from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_highereval_blastres.tab \
hemat_proj/

In [None]:
# Check our md5sums to ensure correct transfer. Here's what they should be:
# To obtain, run md5sum *highereval_blastres.tab on both Mox and Gannet

# 973414e99379fb9f8876b6fc757a03a1  amoebo_highereval_blastres.tab
# ad402d109a59f3d2c2869ab056269a60  opilio_highereval_blastres.tab
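Rather than comparing digests by eye on each machine, the checksums can be written to a manifest once and checked with `md5sum -c` after every transfer. This is a sketch assuming GNU coreutils `md5sum`; `write_checksums` and `check_checksums` are hypothetical wrapper names, and the manifest has to be copied along with the data files.

```shell
# On the source machine: record digests for every file passed in.
write_checksums() {
    local manifest="$1"; shift
    md5sum "$@" > "$manifest"
}

# On the destination machine, from the directory holding the copied files:
# md5sum -c exits non-zero if any listed file is missing or differs.
check_checksums() {
    local manifest="$1"
    md5sum -c "$manifest"
}

# Usage:
# write_checksums highereval.md5 amoebo_highereval_blastres.tab opilio_highereval_blastres.tab
# check_checksums highereval.md5
```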

In [None]:
# Transfer both BLAST outputs to local machine
# Ran command from local machine
# Using absolute path, as relative path fails
rsync -chavzP --stats \
afcoyle@gannet.fish.washington.edu:/volume2/web/nerka/mox_transfers/scrubbed/hemat_proj/*highereval_blastres.tab \
/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/output/BLASTs

In [None]:
# Verify checksums still match by running md5sum output/BLASTs/*highereval_blastres.tab on local machine

# 973414e99379fb9f8876b6fc757a03a1  output/BLASTs/amoebo_highereval_blastres.tab
# ad402d109a59f3d2c2869ab056269a60  output/BLASTs/opilio_highereval_blastres.tab

In [None]:
# Move each file into a separate folder and rename for clarity
!mv ../output/BLASTs/amoebo_highereval_blastres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_3_blastnres.tab

In [None]:
# Move each file into a separate folder and rename for clarity
!mv ../output/BLASTs/opilio_highereval_blastres.tab ../output/BLASTs/opilio_genome/cbai2.0_eval10_3_blastxres.tab

## Exploratory Analysis of Higher Evalue BLAST results

At this stage, since all relevant files are now on our local machine, we will revert to running commands directly in this Jupyter notebook rather than copy-pasting

In [2]:
# See how many sequences are in our original cbai_hemat transcriptome
!grep -c ">" ../data/transcriptomes/cbai_hemat_transcriptome_v2.0.fasta

1412254


In [5]:
# See how many sequences got matched in our BLASTx against the opilio genome
!wc -l ../output/BLASTs/opilio_genome/cbai2.0_eval10_3_blastxres.tab

551494 ../output/BLASTs/opilio_genome/cbai2.0_eval10_3_blastxres.tab


In [8]:
# See how many sequences got matched in our BLASTn against the amoebo genome
!wc -l ../output/BLASTs/amoebo_genome/cbai2.0_eval10_3_blastnres.tab

64312 ../output/BLASTs/amoebo_genome/cbai2.0_eval10_3_blastnres.tab


That sums to ~616,000 total matches to both genomes combined out of 1.4 million sequences in the original transcriptome. We'll redo this BLAST one more time with an even less stringent e-value cutoff, changing from 10^-3 to 10^-2. This should let us graph the relationship between e-value cutoff and number of matches. Again, with the majority of our sequences presumably belonging to _C. bairdi_, I definitely would've expected to see more matches to the opilio genome at least.
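For graphing matches against e-value cutoff, the counts at stricter cutoffs can also be estimated from a single permissive run, since each row of the default `-outfmt 6` output carries its e-value in column 11. The sketch below counts distinct query IDs at or below each cutoff; `hits_per_cutoff` is a hypothetical helper. Because `-max_target_seqs 1` is in use, filtering one run may not reproduce a separate stricter run exactly, so treat the numbers as an approximation.

```shell
# Hypothetical helper: for each cutoff, print "<cutoff><TAB><distinct queries with e-value <= cutoff>".
# Assumes default -outfmt 6 columns: column 1 = query ID, column 11 = e-value.
hits_per_cutoff() {
    local tab="$1"; shift
    local cutoff n
    for cutoff in "$@"; do
        n=$(awk -F'\t' -v c="$cutoff" '
            ($11 + 0) <= (c + 0) { seen[$1] = 1 }
            END { n = 0; for (q in seen) n++; print n }' "$tab")
        printf '%s\t%s\n' "$cutoff" "$n"
    done
}

# Usage:
# hits_per_cutoff ../output/BLASTs/opilio_genome/cbai2.0_eval10_3_blastxres.tab 1e-5 1e-4 1e-3
```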

# BLAST Round 3

Again, we're performing two BLASTs - a BLASTn of cbai_transcriptomev2.0 (which contains both _C. bairdi_ and _Hematodinium_ sequences) against a genome for _Amoebophrya sp._ and a BLASTx against the protein sequences from the _Chionoecetes opilio_ genome. 

Only two lines are changed in our slurm scripts - the e-value and the output file
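Since only the e-value and output file change between rounds, the job scripts could be generated from one template instead of copied by hand. The following is a sketch under stated assumptions: `make_blast_job` is a hypothetical generator, the paths and SBATCH settings are copied from the scripts in this notebook, the output naming pattern is my own, and the exponent is assumed to be a single digit.

```shell
# Hypothetical generator: print a slurm script for one program / database / e-value exponent.
make_blast_job() {
    local program="$1" db="$2" exponent="$3"
    local base=/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=afcoyle_${db}blast
#SBATCH --account=srlab
#SBATCH --partition=srlab
#SBATCH --nodes=1
#SBATCH --time=1-12:00:00
#SBATCH --mem=120G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
#SBATCH --chdir=/gscratch/scrubbed/afcoyle

/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/${program} \\
-query ${base}/data/cbai_hemat_transcriptomev2.0.fasta \\
-db ${base}/output/blastdbs/${db}/${db}_blastdb \\
-out ${base}/output/blastres/${db}_eval10_${exponent}_${program}res.tab \\
-evalue 1E-0${exponent} \\
-num_threads 40 \\
-max_target_seqs 1 \\
-outfmt 6
EOF
}

# Usage: write the script to a file, then submit it with sbatch as before.
# make_blast_job blastx opilio 2 > jobs/opilio_eval10_2_blastx.sh
```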

### Mox slurm script for BLASTx vs. _C. opilio_. 

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210304_opilio_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_opilioblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle


/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastx \
-task="blastx" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/opilio/opilio_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_eval10_2_blastres.tab \
-evalue 1E-02 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 afcoyle]$ sbatch jobs/20210304_opilio_blast.sh
Submitted batch job 1612487

### Mox slurm script for BLASTn vs. _Amoebophrya sp._

Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210304_amoebo_blast.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastn \
-task="blastn" \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10_2_blastres.tab \
-evalue 1E-02 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 afcoyle]$ sbatch jobs/20210304_amoebo_blast.sh
Submitted batch job 1612488

In [None]:
# Transfer opilio BLAST output from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/opilio_eval10_2_blastres.tab \
hemat_proj/

In [None]:
# Transfer amoebo BLAST output from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10_2_blastres.tab \
hemat_proj/

In [None]:
# Check our md5sums to ensure correct transfer. Here's what they should be:
# To obtain, run md5sum *eval10_2_blastres.tab on both Mox and Gannet

# 0408b229a37d5f7c76629cc3f24e545e  amoebo_eval10_2_blastres.tab
# fd9e420b186de728ffc0a11a28fd1371  opilio_eval10_2_blastres.tab

In [None]:
# Transfer both BLAST outputs to local machine
# Ran command from local machine
# Using absolute path, as relative path fails
rsync -chavzP --stats \
afcoyle@gannet.fish.washington.edu:/volume2/web/nerka/mox_transfers/scrubbed/hemat_proj/*eval10_2_blastres.tab \
/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/output/BLASTs/

In [None]:
# Verify checksums still match by running md5sum ../output/BLASTs/*eval10_2_blastres.tab on local machine

# 0408b229a37d5f7c76629cc3f24e545e  amoebo_eval10_2_blastres.tab
# fd9e420b186de728ffc0a11a28fd1371  opilio_eval10_2_blastres.tab

In [2]:
# Move each file into a separate folder and rename for clarity
!mv ../output/BLASTs/amoebo_eval10_2_blastres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_2_blastnres.tab

In [3]:
# Move each file into a separate folder and rename for clarity
!mv ../output/BLASTs/opilio_eval10_2_blastres.tab ../output/BLASTs/opilio_genome/cbai2.0_eval10_2_blastxres.tab

## Exploratory Analysis of Eval = 10^-2 BLAST results

At this stage, since all relevant files are now on our local machine, we will revert to running commands directly in this Jupyter notebook rather than copy-pasting

In [2]:
# See how many sequences are in our original cbai_hemat transcriptome
!grep -c ">" ../data/transcriptomes/cbai_hemat_transcriptome_v2.0.fasta

1412254


In [1]:
# See how many sequences got matched in our BLASTx against the opilio genome
!wc -l ../output/BLASTs/opilio_genome/cbai2.0_eval10_2_blastxres.tab

631078 ../output/BLASTs/opilio_genome/cbai2.0_eval10_2_blastxres.tab


In [3]:
# See how many sequences got matched in our BLASTn against the amoebo genome
!wc -l ../output/BLASTs/amoebo_genome/cbai2.0_eval10_2_blastnres.tab

121305 ../output/BLASTs/amoebo_genome/cbai2.0_eval10_2_blastnres.tab


That sums to ~750,000 total matches to both genomes combined out of 1.4 million sequences in the original transcriptome. Again, with the majority of our sequences presumably belonging to _C. bairdi_, I definitely would've expected to see more matches to the opilio genome at least.
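Summing the two counts treats the opilio and Amoebophrya hits as separate sets, but a transcript could appear in both files. As a possible follow-up (not something run in this notebook), the sketch below splits query IDs into those hitting only the first file, only the second, or both; `query_ids` and `compare_hits` are hypothetical helpers that assume column 1 holds the query ID.

```shell
# Hypothetical helper: sorted, de-duplicated query IDs (column 1) from a tabular BLAST file.
query_ids() {
    cut -f1 "$1" | LC_ALL=C sort -u
}

# Hypothetical helper: count queries found only in the first file, only in the second, or in both.
compare_hits() {
    local a="$1" b="$2" dir
    dir=$(mktemp -d)
    query_ids "$a" > "$dir/a.ids"
    query_ids "$b" > "$dir/b.ids"
    printf 'first_only\t%s\n'  "$(LC_ALL=C comm -23 "$dir/a.ids" "$dir/b.ids" | wc -l | tr -d ' ')"
    printf 'second_only\t%s\n' "$(LC_ALL=C comm -13 "$dir/a.ids" "$dir/b.ids" | wc -l | tr -d ' ')"
    printf 'both\t%s\n'        "$(LC_ALL=C comm -12 "$dir/a.ids" "$dir/b.ids" | wc -l | tr -d ' ')"
    rm -rf "$dir"
}

# Usage:
# compare_hits ../output/BLASTs/opilio_genome/cbai2.0_eval10_2_blastxres.tab \
#              ../output/BLASTs/amoebo_genome/cbai2.0_eval10_2_blastnres.tab
```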

# BLAST Round 4: Changing Amoebo BLASTn to tBLASTx

To maximize our number of matches, we should be BLASTing to protein sequences rather than nucleotide sequences, as protein sequences are more conserved. Since a BLASTn is quite fast to run for Amoebo, we can put up with the longer wait time of the tBLASTx, which translates both the query (transcriptome 2.0) and the database (the Amoebophrya genome) to protein sequences in all six reading frames. Since it takes much longer, we'll also increase the walltime from 1.5 days to 2.5 days.

## tBLASTx for Amoebophrya genome, eval = 10^-5
Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210304_amoebo_eval10_5_tblastx.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=2-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/tblastx \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10_5_tblastxres.tab \
-evalue 1E-05 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 jobs]$ sbatch 20210304_amoebo_eval10_5_tblastx.sh
Submitted batch job 1613682

## tBLASTx for Amoebophrya genome, eval = 10^-3
Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210304_amoebo_eval10_3_tblastx.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=2-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/tblastx \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10_3_tblastxres.tab \
-evalue 1E-03 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 jobs]$ sbatch 20210304_amoebo_eval10_3_tblastx.sh
Submitted batch job 1614716

## tBLASTx for Amoebophrya genome, eval = 10^-2
Initial reference should be [here](https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/scripts/slurm_scripts/20210304_amoebo_eval10_2_tblastx.sh). However in case of repo reformatting breaking the link, the script is also copy-pasted below.

In [None]:
#!/bin/bash
## Job Name
#SBATCH --job-name=afcoyle_amoeboblast
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=2-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=afcoyle@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/afcoyle



/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/tblastx \
-query /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/data/cbai_hemat_transcriptomev2.0.fasta \
-db /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastdbs/amoebo/amoebo_blastdb \
-out /gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10_2_tblastxres.tab \
-evalue 1E-02 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6

In [None]:
[afcoyle@mox2 jobs]$ sbatch 20210304_amoebo_eval10_2_tblastx.sh
Submitted batch job 1617553

In [None]:
# Transfer BLAST output for all amoebo tblastx from Mox to Gannet
# These commands performed from a folder in Gannet
# Gannet folder: /volume2/web/nerka/mox_transfers/scrubbed/
rsync -avz --progress \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/projects/21_ncbi_genome_blasts/output/blastres/amoebo_eval10*tblastxres.tab \
hemat_proj/

In [None]:
# Check our md5sums to ensure correct transfer. Here's what they should be:
# To obtain, run md5sum amoebo_eval10*tblastxres.tab on both Mox and Gannet

# 97195ee1a50f985ec1d869cba5a1a3f8  amoebo_eval10_2_tblastxres.tab
# 162fb370e477fd857f53599f65eb97a5  amoebo_eval10_3_tblastxres.tab
# 13276381cddb23c2e420beb64c4bae8a  amoebo_eval10_5_tblastxres.tab

In [None]:
# Transfer all three tBLASTx outputs to local machine
# Ran command from local machine
# Using absolute path, as relative path fails
rsync -chavzP --stats \
afcoyle@gannet.fish.washington.edu:/volume2/web/nerka/mox_transfers/scrubbed/hemat_proj/amoebo_eval10*tblastxres.tab \
/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/output/BLASTs/amoebo_genome

In [None]:
# Verify checksums still match by running md5sum ../output/BLASTs/amoebo_genome/amoebo_eval10*tblastxres.tab on local machine

# 97195ee1a50f985ec1d869cba5a1a3f8  amoebo_eval10_2_tblastxres.tab
# 162fb370e477fd857f53599f65eb97a5  amoebo_eval10_3_tblastxres.tab
# 13276381cddb23c2e420beb64c4bae8a  amoebo_eval10_5_tblastxres.tab

In [1]:
# Rename each file for clarity
!mv ../output/BLASTs/amoebo_genome/amoebo_eval10_5_tblastxres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_5_tblastxres.tab

In [2]:
# Rename each file for clarity
!mv ../output/BLASTs/amoebo_genome/amoebo_eval10_3_tblastxres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_3_tblastxres.tab

In [3]:
# Rename each file for clarity
!mv ../output/BLASTs/amoebo_genome/amoebo_eval10_2_tblastxres.tab ../output/BLASTs/amoebo_genome/cbai2.0_eval10_2_tblastxres.tab

# Move all slurm scripts over to local machine

In [None]:
# Move over all slurm scripts. Ran from scripts directory
rsync -chavzP --stats \
afcoyle@mox.hyak.uw.edu:/gscratch/srlab/afcoyle/jobs/2021*blast*sh \
slurm_scripts/