Aidan Coyle, afcoyle@uw.edu

2021/01/20

Roberts lab at SAFS

## BLASTing DEGs against database of all Alveolata nucleotide sequences and all Arthropoda nucleotide sequences

This script takes a newline-separated file of transcript IDs for all DEGs, pulls the matching sequences from our transcriptome, and BLASTs them (using BLASTn) against all Alveolata nucleotide sequences. Alveolata is the superphylum containing all dinoflagellates, including _Hematodinium_. This should exclude all but the most highly conserved _C. bairdi_ genes from the DEGs, allowing us to examine _Hematodinium_ DEGs on an individual basis.

It then repeats the process using all Arthropoda nucleotide sequences. Arthropoda is, of course, the phylum containing the Tanner crab, _C. bairdi_. By performing this second BLAST, we will get an idea of how specific our BLAST results are, and be able to determine whether a particular DEG is more likely to be _C. bairdi_ or _Hematodinium_.
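
The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in this notebook: it assumes both BLAST runs write default `-outfmt 6` tables, and it labels each DEG by whichever database gave the higher best bit score. The function names `best_bitscores` and `likely_origin` are hypothetical; the sample rows are copied from the elev0_vs_elev2 outputs shown further down.

```python
import csv
import io

# Default field order of BLAST tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_bitscores(handle):
    """Return {query ID: highest bit score} for an -outfmt 6 table."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        score = float(row["bitscore"])
        if score > best.get(row["qseqid"], float("-inf")):
            best[row["qseqid"]] = score
    return best


def likely_origin(alveolata, arthropoda):
    """Label each query by the database with the higher best bit score."""
    labels = {}
    for query in sorted(set(alveolata) | set(arthropoda)):
        alv = alveolata.get(query, 0.0)  # 0.0 stands in for "no hit"
        art = arthropoda.get(query, 0.0)
        if alv > art:
            labels[query] = "Alveolata"
        elif art > alv:
            labels[query] = "Arthropoda"
        else:
            labels[query] = "tie"
    return labels


# Two rows per table, copied from the outputs in this notebook
alveolata_tab = (
    "TRINITY_DN63011_c2_g1_i4\tHBLT01014068.1_cds_CAE3167617.1_1\t92.593\t27\t2\t0\t235\t261\t574\t600\t0.15\t41.0\n"
    "TRINITY_DN28849_c0_g1_i5\tLR865372.1_cds_CAD2093610.1_322\t88.889\t45\t2\t2\t1784\t1827\t1228\t1186\t0.001\t53.6\n"
)
arthropoda_tab = (
    "TRINITY_DN63011_c2_g1_i4\tIADR01057942.1\t83.721\t43\t5\t2\t198\t240\t397\t357\t3.2\t40.1\n"
    "TRINITY_DN28849_c0_g1_i5\tXM_037929934.1\t68.863\t2444\t727\t20\t5768\t8202\t2637\t219\t0.0\t918\n"
)

labels = likely_origin(best_bitscores(io.StringIO(alveolata_tab)),
                       best_bitscores(io.StringIO(arthropoda_tab)))
for query, label in labels.items():
    print(query, label)
```

Bit scores are used here rather than e-values because e-values depend on database size and the two databases differ a lot in size. A label is only as good as the hits behind it: both hits for TRINITY_DN63011_c2_g1_i4 are weak (e-values 0.15 and 3.2), so its label carries little weight.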

Nucleotide sequences were downloaded from the NCBI Taxonomy Browser at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi as FASTA files. The Alveolata sequences were downloaded at 00:06 on 2021-02-17, and the Arthropoda sequences at 12:24 on 2021-02-24.

## BLASTn of DEGs against all Alveolata nucleotide sequences

In [36]:
# Create blast database from all Alveolata nucleotide sequences
!makeblastdb \
-in ../data/uniprot_taxa_seqs/alveolata_sequences.fasta \
-dbtype nucl \
-parse_seqids \
-out ../data/blast_db/alveolata_nucleotides_2021_02_22/alveolata_uniprot_2021_02



Building a new DB, current time: 02/22/2021 23:57:21
New DB name:   /mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/data/blast_db/alveolata_nucleotides_2021_02_22/alveolata_uniprot_2021_02
New DB title:  ../data/alveolata_sequences.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Ignoring sequence 'lcl|MT078137.1_cds_QJQ82428.1_1' as it has no sequence data
Adding sequences from FASTA; added 1566546 sequences in 81.6705 seconds.


In [59]:
# Show what the first column of the DEG list looks like, used next
# (the "Broken pipe" message is harmless: head exits after 5 lines while cut is still writing)
!cut -f1 < ../graphs/DESeq2_output/amb2_vs_elev2_indiv/DEGlist.txt | head -n 5

TRINITY_DN978_c2_g1_i3
TRINITY_DN29869_c0_g1_i18
TRINITY_DN135171_c0_g1_i12
TRINITY_DN311_c0_g1_i18
TRINITY_DN183158_c0_g1_i1
cut: write error: Broken pipe


In [22]:
# Select first column - transcript ID - of the DEG list we made earlier, 
# turn into temporary file
!cut -f1 < ../graphs/DESeq2_output/amb2_vs_elev2_indiv/DEGlist.txt \
> ../output/BLASTn/tempids.txt

In [23]:
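# Cross-reference transcript IDs with our transcriptome to get sequences for DEGs
# Pull each line containing a whole-word match (the header with the transcript ID)
# and the one following (the fasta sequence), write to file
# Note: this assumes each sequence in the transcriptome sits on a single line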
!grep -w -A 1 -Ff ../output/BLASTn/tempids.txt \
../data/transcriptomes/cbai_hemat_transcriptome_v2.0.fasta --no-group-separator > \
../output/BLASTn/input_seqs/amb2_vs_elev2_indiv_DEGs.fasta
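
The `grep -A 1` approach above relies on every sequence occupying exactly one line. If that could not be guaranteed, the same extraction might be sketched in Python as below. This is a hedged alternative, not what this notebook ran: `extract_records` is a hypothetical helper, and the toy sequences and `len=` header text are made up for illustration (only the transcript IDs are taken from the DEG list, plus one invented near-duplicate ID).

```python
def extract_records(fasta_lines, wanted_ids):
    """Keep whole FASTA records whose header ID is in wanted_ids.

    The ID is taken as the first whitespace-delimited token after '>',
    so multi-line sequences are kept intact and partial ID matches
    (e.g. ..._i3 vs ..._i31) are not confused.
    """
    wanted = set(wanted_ids)
    keep = False
    kept_lines = []
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep and line:
            kept_lines.append(line)
    return kept_lines


# Toy transcriptome: sequences are invented, the first record spans two lines
toy_fasta = [
    ">TRINITY_DN978_c2_g1_i3 len=8",
    "ACGT",
    "ACGT",
    ">TRINITY_DN978_c2_g1_i31 len=4",  # invented ID, must NOT be matched
    "GGCC",
    ">TRINITY_DN311_c0_g1_i18 len=4",
    "TTAA",
]
wanted = ["TRINITY_DN978_c2_g1_i3", "TRINITY_DN311_c0_g1_i18"]

records = extract_records(toy_fasta, wanted)
n_headers = sum(1 for line in records if line.startswith(">"))
print("\n".join(records))
```

Counting the `>` lines in the result plays the same role as the `grep -c ">"` check in the next cell.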

In [24]:
# Check length of original DEG list file
!wc -l ../graphs/DESeq2_output/amb2_vs_elev2_indiv/DEGlist.txt

2067 ../graphs/DESeq2_output/amb2_vs_elev2_indiv/DEGlist.txt


In [25]:
# See how many FASTA sequences we have
!grep -c ">" ../output/BLASTn/input_seqs/amb2_vs_elev2_indiv_DEGs.fasta

2067


In [26]:
# Looks good! Remove temporary file
!rm ../output/BLASTn/tempids.txt

In [102]:
# Blast all DEG sequences against our database of Alveolata sequences
!blastn \
-task="blastn" \
-query ../output/BLASTn/input_seqs/amb2_vs_elev2_indiv_DEGs.fasta \
-db ../data/blast_db/alveolata_nucleotides_2021_02_22/alveolata_uniprot_2021_02 \
-out ../output/BLASTn/alveolata_publicseqs/amb2_vs_elev2.tab \
-max_target_seqs 1 \
-outfmt 6 \
-num_threads 4



In [1]:
# See how many alignment rows we got (one row per HSP, so a query can appear more than once)
!wc -l ../output/BLASTn/alveolata_publicseqs/amb2_vs_elev2.tab

2391 ../output/BLASTn/alveolata_publicseqs/amb2_vs_elev2.tab


In [85]:
# Look at those matches
!head ../output/BLASTn/alveolata_publicseqs/amb2_vs_elev2.tab

TRINITY_DN17_c0_g1_i12	HBJU01001350.1_cds_1	75.340	515	117	8	809	1318	519	10	1.73e-60	239
TRINITY_DN61_c0_g1_i6	HBLK01034498.1_cds_CAE2923986.1_1	85.507	276	38	2	1321	1595	398	124	7.42e-75	287
TRINITY_DN8754_c1_g1_i1	HBLO01048875.1_cds_CAE3072170.1_1	73.904	456	108	9	268	719	656	208	1.29e-40	172
TRINITY_DN8756_c1_g1_i3	HBJK01013588.1_cds_CAE1088085.1_1	89.231	65	7	0	647	711	274	210	2.18e-13	82.4
TRINITY_DN43328_c0_g1_i10	HBLC01024351.1_cds_CAE2698820.1_1	100.000	32	0	0	631	662	277	308	1.20e-06	60.2
TRINITY_DN43328_c0_g1_i10	HBLC01024351.1_cds_CAE2698820.1_1	100.000	29	0	0	634	662	268	296	5.58e-05	54.7
TRINITY_DN431_c0_g2_i18	HBKY01019556.1_cds_1	72.336	1070	255	29	1730	2792	1103	68	6.00e-78	298
TRINITY_DN41764_c0_g1_i3	HBLO01010512.1_cds_CAE3048796.1_1	86.517	89	12	0	563	651	359	447	4.47e-18	99.0
TRINITY_DN883_c0_g1_i21	HBNJ01035065.1_cds_1	72.068	1407	356	34	147	1533	149	1538	1.28e-104	387
TRINITY_DN8273_c0_g1_i64	HBHF01001998.1_cds_1	100.000	28	0	0	20	47	297	270	4.49e-04	52.8
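
The ten rows above list TRINITY_DN43328_c0_g1_i10 twice, because tabular output holds one row per alignment (HSP) rather than one per query. That is why the line count can exceed the number of DEGs. A small sketch of how rows and distinct queries could be counted separately follows; `rows_per_query` is a hypothetical helper, and the three rows are copied from the output above.

```python
from collections import Counter

# Three rows copied from the Alveolata amb2_vs_elev2 output
rows = [
    "TRINITY_DN8756_c1_g1_i3\tHBJK01013588.1_cds_CAE1088085.1_1\t89.231\t65\t7\t0\t647\t711\t274\t210\t2.18e-13\t82.4",
    "TRINITY_DN43328_c0_g1_i10\tHBLC01024351.1_cds_CAE2698820.1_1\t100.000\t32\t0\t0\t631\t662\t277\t308\t1.20e-06\t60.2",
    "TRINITY_DN43328_c0_g1_i10\tHBLC01024351.1_cds_CAE2698820.1_1\t100.000\t29\t0\t0\t634\t662\t268\t296\t5.58e-05\t54.7",
]


def rows_per_query(lines):
    """Count how many alignment rows each query ID (column 1) has."""
    return Counter(line.split("\t")[0] for line in lines if line.strip())


counts = rows_per_query(rows)
n_rows = sum(counts.values())   # what `wc -l` reports
n_queries = len(counts)         # number of distinct DEGs with a hit
print(n_rows, "rows from", n_queries, "queries")
```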


## Repeat the same process for Elevated Day 0 vs. Elevated Day 2

In [27]:
# Select first column - transcript ID - of the DEG list we made earlier, 
# turn into temporary file
!cut -f1 < ../graphs/DESeq2_output/elev0_vs_elev2_indiv/DEGlist.txt \
> ../output/BLASTn/tempids.txt

In [28]:
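# Cross-reference transcript IDs with our transcriptome to get sequences for DEGs
# Pull each line containing a whole-word match (the header with the transcript ID)
# and the one following (the fasta sequence), write to file
# Note: this assumes each sequence in the transcriptome sits on a single line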
!grep -w -A 1 -Ff ../output/BLASTn/tempids.txt \
../data/transcriptomes/cbai_hemat_transcriptome_v2.0.fasta --no-group-separator > \
../output/BLASTn/input_seqs/elev0_vs_elev2_indiv_DEGs.fasta

In [29]:
# Check length of original DEG list file
!wc -l ../graphs/DESeq2_output/elev0_vs_elev2_indiv/DEGlist.txt

338 ../graphs/DESeq2_output/elev0_vs_elev2_indiv/DEGlist.txt


In [30]:
# See how many FASTA sequences we have
!grep -c ">" ../output/BLASTn/input_seqs/elev0_vs_elev2_indiv_DEGs.fasta

338


In [31]:
# Looks good! Remove temporary file
!rm ../output/BLASTn/tempids.txt

In [101]:
# Blast all DEG sequences against our database of Alveolata sequences
!blastn \
-task="blastn" \
-query ../output/BLASTn/input_seqs/elev0_vs_elev2_indiv_DEGs.fasta \
-db ../data/blast_db/alveolata_nucleotides_2021_02_22/alveolata_uniprot_2021_02 \
-out ../output/BLASTn/alveolata_publicseqs/elev0_vs_elev2.tab \
-max_target_seqs 1 \
-outfmt 6 \
-num_threads 4



In [4]:
# See how many alignment rows we got (one row per HSP, so a query can appear more than once)
!wc -l ../output/BLASTn/alveolata_publicseqs/elev0_vs_elev2.tab

351 ../output/BLASTn/alveolata_publicseqs/elev0_vs_elev2.tab


In [5]:
# Look at the first few
!head ../output/BLASTn/alveolata_publicseqs/elev0_vs_elev2.tab

TRINITY_DN63011_c2_g1_i4	HBLT01014068.1_cds_CAE3167617.1_1	92.593	27	2	0	235	261	574	600	0.15	41.0
TRINITY_DN28882_c0_g1_i12	XM_721189.2_cds_XP_726282.2_1	78.333	60	11	1	348	407	1030	973	0.010	48.2
TRINITY_DN28849_c0_g1_i5	LR865372.1_cds_CAD2093610.1_322	88.889	45	2	2	1784	1827	1228	1186	0.001	53.6
TRINITY_DN28899_c0_g1_i14	HBOA01032109.1_cds_1	90.000	30	3	0	6	35	769	740	0.25	41.9
TRINITY_DN28899_c0_g1_i14	HBOA01032109.1_cds_1	86.667	30	4	0	42	71	769	740	3.0	37.4
TRINITY_DN71214_c0_g2_i1	LT990236.1_cds_SPJ09677.1_173	82.979	47	3	3	167	212	5104	5146	0.34	41.0
TRINITY_DN0_c116_g2_i2	HBGB01045696.1_cds_CAD9071815.1_1	82.353	34	6	0	191	224	640	607	5.5	35.6
TRINITY_DN0_c809_g2_i1	XM_013494334.1_cds_XP_013349788.1_1	74.118	85	20	2	20	104	1582	1500	0.001	48.2
TRINITY_DN4_c1_g1_i18	HBNJ01029381.1_cds_CAE4303349.1_1	90.000	30	3	0	279	308	568	539	0.54	41.9
TRINITY_DN2_c4_g2_i4	HBNF01086722.1_cds_1	92.857	28	0	1	217	244	29	4	0.50	40.1
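
Several of the rows above have e-values well above what would usually be treated as a meaningful match (for example 3.0 and 5.5). One possible follow-up, sketched here and not part of the original analysis, is to drop rows above an e-value cutoff before interpreting them. The helper `filter_by_evalue` and the cutoff of 0.01 are assumptions chosen only for illustration; the rows are copied from the output above.

```python
# Four rows copied from the Alveolata elev0_vs_elev2 output
rows = [
    "TRINITY_DN63011_c2_g1_i4\tHBLT01014068.1_cds_CAE3167617.1_1\t92.593\t27\t2\t0\t235\t261\t574\t600\t0.15\t41.0",
    "TRINITY_DN28882_c0_g1_i12\tXM_721189.2_cds_XP_726282.2_1\t78.333\t60\t11\t1\t348\t407\t1030\t973\t0.010\t48.2",
    "TRINITY_DN28849_c0_g1_i5\tLR865372.1_cds_CAD2093610.1_322\t88.889\t45\t2\t2\t1784\t1827\t1228\t1186\t0.001\t53.6",
    "TRINITY_DN28899_c0_g1_i14\tHBOA01032109.1_cds_1\t86.667\t30\t4\t0\t42\t71\t769\t740\t3.0\t37.4",
]

MAX_EVALUE = 0.01  # arbitrary illustrative cutoff, not a value from this project


def filter_by_evalue(lines, max_evalue):
    """Keep -outfmt 6 rows whose e-value (column 11) is <= max_evalue."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[10]) <= max_evalue:
            kept.append(line)
    return kept


kept = filter_by_evalue(rows, MAX_EVALUE)
kept_ids = [line.split("\t")[0] for line in kept]
print(kept_ids)
```

The same effect could be had at search time with blastn's `-evalue` option; filtering afterwards keeps the full table available for inspection.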


## BLASTn of DEGs against all Arthropoda nucleotide sequences


In [18]:
# Create blast database from all Arthropoda nucleotide sequences
!makeblastdb \
-in ../data/uniprot_taxa_seqs/arthropoda_sequences.fasta \
-dbtype nucl \
-parse_seqids \
-out ../data/blast_db/arthropoda_nucleotides_2021_02_25/arthropoda_uniprot_2021_02



Building a new DB, current time: 02/25/2021 11:09:14
New DB name:   /mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/data/blast_db/arthropoda_nucleotides_2021_02_25/arthropoda_uniprot_2021_02
New DB title:  ../data/arthropoda_sequences.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 138246570: 4, 7, 10, 16, 21-23, 32, 35-36, 41, 44-45, 47
Adding sequences from FASTA; added 5321483 sequences in 568.979 seconds.


### We already created .fasta files for both comparisons, so we don't need to recreate them

In [32]:
# Blast all DEG sequences for Ambient Day 2 vs. Elevated Day 2 against our database of Arthropoda sequences
!blastn \
-task="blastn" \
-query ../output/BLASTn/input_seqs/amb2_vs_elev2_indiv_DEGs.fasta \
-db ../data/blast_db/arthropoda_nucleotides_2021_02_25/arthropoda_uniprot_2021_02 \
-out ../output/BLASTn/arthropoda_publicseqs/amb2_vs_elev2.tab \
-max_target_seqs 1 \
-outfmt 6 \
-num_threads 4



In [35]:
# See how many alignment rows we got (one row per HSP, so a query can appear more than once)
!wc -l ../output/BLASTn/arthropoda_publicseqs/amb2_vs_elev2.tab

2736 ../output/BLASTn/arthropoda_publicseqs/amb2_vs_elev2.tab


In [36]:
# Look at the first few
!head ../output/BLASTn/arthropoda_publicseqs/amb2_vs_elev2.tab

TRINITY_DN88410_c0_g1_i1	XM_011182754.2	84.783	46	3	3	107	152	373	414	0.39	42.8
TRINITY_DN21405_c0_g1_i2	IAUX01002007.1	71.345	171	41	3	590	756	303	469	2.38e-12	82.4
TRINITY_DN21405_c0_g1_i2	IAUX01002007.1	71.429	140	40	0	1372	1511	1040	1179	1.23e-09	73.4
TRINITY_DN21441_c0_g1_i11	XM_027991394.1	78.689	61	12	1	181	240	688	748	0.011	49.1
TRINITY_DN5655_c0_g1_i17	IADJ01078929.1	86.667	45	5	1	11	55	1900	1857	0.002	51.8
TRINITY_DN5680_c0_g1_i10	FO029351.1	87.755	49	6	0	1190	1238	623	671	1.47e-06	62.6
TRINITY_DN5607_c0_g1_i14	FQ853449.1	82.979	47	8	0	388	434	301	347	0.003	50.0
TRINITY_DN5644_c31_g1_i2	ICDQ01033203.1	90.909	33	3	0	404	436	2754	2786	0.043	47.3
TRINITY_DN5669_c0_g1_i9	XM_022111106.2	96.296	27	1	0	323	349	949	975	0.10	45.5
TRINITY_DN5672_c0_g1_i7	FQ999213.1	92.683	41	2	1	1	40	389	349	2.43e-05	58.1


In [33]:
# Blast all DEG sequences for Elevated Day 0 vs. Elevated Day 2 against our database of Arthropoda sequences
!blastn \
-task="blastn" \
-query ../output/BLASTn/input_seqs/elev0_vs_elev2_indiv_DEGs.fasta \
-db ../data/blast_db/arthropoda_nucleotides_2021_02_25/arthropoda_uniprot_2021_02 \
-out ../output/BLASTn/arthropoda_publicseqs/elev0_vs_elev2.tab \
-max_target_seqs 1 \
-outfmt 6 \
-num_threads 4



In [37]:
# See how many alignment rows we got (one row per HSP, so a query can appear more than once)
!wc -l ../output/BLASTn/arthropoda_publicseqs/elev0_vs_elev2.tab

349 ../output/BLASTn/arthropoda_publicseqs/elev0_vs_elev2.tab


In [38]:
# Look at the first few
!head ../output/BLASTn/arthropoda_publicseqs/elev0_vs_elev2.tab

TRINITY_DN63011_c2_g1_i4	IADR01057942.1	83.721	43	5	2	198	240	397	357	3.2	40.1
TRINITY_DN28882_c0_g1_i12	XM_037934825.1	71.084	166	48	0	1607	1772	1048	883	8.36e-13	84.2
TRINITY_DN28849_c0_g1_i5	XM_037929934.1	68.863	2444	727	20	5768	8202	2637	219	0.0	918
TRINITY_DN28899_c0_g1_i14	XM_023854192.2	100.000	21	0	0	17	37	1434	1454	5.3	39.2
TRINITY_DN28899_c0_g1_i14	XM_023854192.2	100.000	21	0	0	53	73	1434	1454	5.3	39.2
TRINITY_DN71214_c0_g2_i1	XM_039455535.1	92.188	64	5	0	170	233	96	33	3.78e-16	94.2
TRINITY_DN0_c116_g2_i2	ICKD01001377.1	81.481	54	4	1	181	228	119	66	1.25e-04	54.5
TRINITY_DN0_c809_g2_i1	XM_037932011.1	74.359	117	27	3	14	129	4803	4917	2.15e-08	66.2
TRINITY_DN4_c1_g1_i18	BB996776.1	80.597	67	6	4	286	350	239	178	0.002	51.8
TRINITY_DN2_c4_g2_i4	XM_033471051.1	78.409	88	17	2	167	252	7348	7261	2.20e-08	67.1


In [43]:
# Look at which query sequences appear more than once, along with their row counts
!cut -f1 ../output/BLASTn/arthropoda_publicseqs/elev0_vs_elev2.tab | \
sort | uniq -cd

      2 TRINITY_DN10357_c5_g1_i9
      2 TRINITY_DN10480_c0_g1_i4
      2 TRINITY_DN111895_c0_g1_i7
      2 TRINITY_DN112784_c3_g1_i21
      3 TRINITY_DN14311_c1_g1_i9
      5 TRINITY_DN151000_c0_g1_i5
      2 TRINITY_DN1755_c2_g2_i1
      2 TRINITY_DN185199_c0_g1_i1
      3 TRINITY_DN1_c39_g1_i13
      2 TRINITY_DN20315_c0_g1_i1
      2 TRINITY_DN24228_c0_g1_i4
      2 TRINITY_DN268934_c0_g1_i6
      2 TRINITY_DN28899_c0_g1_i14
      2 TRINITY_DN3132_c4_g1_i6
      4 TRINITY_DN32708_c0_g1_i14
      2 TRINITY_DN32736_c2_g1_i9
      2 TRINITY_DN36181_c0_g1_i5
      2 TRINITY_DN44184_c1_g2_i1
      2 TRINITY_DN48889_c0_g1_i16
      5 TRINITY_DN57937_c0_g1_i12
      2 TRINITY_DN59827_c1_g1_i5
      2 TRINITY_DN60021_c3_g1_i2
      3 TRINITY_DN61628_c0_g1_i10
      7 TRINITY_DN6804_c0_g1_i28
      2 TRINITY_DN69810_c0_g2_i2
      2 TRINITY_DN7000_c0_g1_i49
      5 TRINITY_DN75802_c0_g1_i3
      6 TRINITY_DN8210_c2_g1_i29
      3 TRINITY_DN99555_c0_g1_i19
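
Queries listed more than once have several alignments (HSPs) against their single target. If one row per DEG is wanted downstream, the table could be collapsed to the highest-scoring row per query, as in this sketch. `best_hsp_per_query` is a hypothetical helper, and the three sample rows are copied from the Arthropoda amb2_vs_elev2 output earlier in this section.

```python
# Three rows copied from the Arthropoda amb2_vs_elev2 output
rows = [
    "TRINITY_DN88410_c0_g1_i1\tXM_011182754.2\t84.783\t46\t3\t3\t107\t152\t373\t414\t0.39\t42.8",
    "TRINITY_DN21405_c0_g1_i2\tIAUX01002007.1\t71.345\t171\t41\t3\t590\t756\t303\t469\t2.38e-12\t82.4",
    "TRINITY_DN21405_c0_g1_i2\tIAUX01002007.1\t71.429\t140\t40\t0\t1372\t1511\t1040\t1179\t1.23e-09\t73.4",
]


def best_hsp_per_query(lines):
    """Return one row per query: the row with the highest bit score.

    Queries keep the order in which they first appear; on a tie the
    earlier row wins.
    """
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, bitscore = fields[0], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, line)
    return [line for _, line in best.values()]


best_rows = best_hsp_per_query(rows)
for row in best_rows:
    print(row)
```

Rerunning blastn with `-max_hsps 1` would be another way to get a single alignment per query-subject pair; collapsing afterwards leaves the original tables untouched.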
