**Analyzing hematodinium transcriptome blasts**

I'll be analyzing the two BLAST runs (one with max_hsps set to 1, the other with no max_hsps value) and determining some of the differences between them.


**Determining length differences between our two blasts**


In [None]:
%%bash
#determining number of lines for max_hsps blast
wc -l hemat_uniprot_maxhsps_blastx.tab

In [None]:
%%bash
#determining number of lines for blast w/o max_hsps value
wc -l hemat_uniprot_nomaxhsps_blastx.tab

Note: For both blasts, max_target_seqs was set to 1.
After some googling, it looks like the discrepancy between maxhsps and nomaxhsps comes down to what each option limits. max_target_seqs limits the number of subjects reported per query, not the number of alignments, so if there are multiple HSPs between a query and its subject, they'll all be included. Adding max_hsps 1 reduces that to only 1 HSP per query-subject pair.
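
One way to check this explanation is to count how many rows each query/subject pair contributes to the tabular output. The sketch below assumes the first two tab-separated columns are the query ID and the subject ID, as they appear to be in these files; `count_multi_hsp_pairs` is a hypothetical helper name, not part of BLAST.

```shell
# Count query/subject pairs that appear on more than one line of a
# tab-separated BLAST output file (i.e. pairs with multiple HSPs).
# Assumption: column 1 = query ID, column 2 = subject ID.
count_multi_hsp_pairs() {
    cut -f1,2 "$1" | sort | uniq -c | awk '$1 > 1 { n++ } END { print n + 0 }'
}
```

If the explanation holds, `count_multi_hsp_pairs hemat_uniprot_maxhsps_blastx.tab` should print 0, while the same call on `hemat_uniprot_nomaxhsps_blastx.tab` should print a positive number.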

**Finding which queries were unmatched**

In [2]:
%%bash
#Checking how many sequences there were in our original file
grep -c ">" hemat.fasta

6348


In [3]:
!grep -c "TRINITY" hemat.fasta
#Alright, all queries start with TRINITY

6348


In [None]:
!grep -o 'TRINITY[^ ]*' hemat.fasta > hemat_queries.txt
#selecting only query names. NOTE: Keep em as a text file, not .fasta or .tab or anything!

In [None]:
!grep -o 'TRINITY[^sp]*' hemat_uniprot_maxhsps_blastx.tab > hemat_uniprot_maxhsps_subjects_blastx.txt
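# selecting only the query IDs that had a match. NOTE: [^sp] is a character class (any character except s or p), not the string "sp". It works here because these IDs contain neither letter, but it also grabs the trailing tab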

In [4]:
!head hemat_uniprot_maxhsps_subjects_blastx.txt
#Confirmation that our formatting works!

TRINITY_DN5655_c0_g1_i1	
TRINITY_DN5691_c0_g1_i1	
TRINITY_DN5627_c0_g1_i1	
TRINITY_DN5653_c0_g1_i1	
TRINITY_DN5610_c0_g1_i1	
TRINITY_DN5607_c0_g1_i1	
TRINITY_DN5630_c0_g1_i1	
TRINITY_DN5664_c0_g1_i1	
TRINITY_DN5613_c0_g1_i1	
TRINITY_DN5644_c0_g1_i1	


In [None]:
!diff -u -w hemat_uniprot_maxhsps_subjects_blastx.txt hemat_queries.txt | grep -o '+[^ ]*' > unmatched_queries.txt
#find differences between our matches and queries, output all unmatched queries to a new file

In [None]:
!head unmatched_queries.txt
# looks like we've still got some pesky notation lying around...

In [None]:
!tr -d \+ < unmatched_queries.txt >cleaned_unmatched_queries.txt
#Eliminate all + signs from the file

In [None]:
!head cleaned_unmatched_queries.txt
#The header from our diff command is still there...

In [None]:
!sed -i '1,2d;' cleaned_unmatched_queries.txt
#Delete the first two lines of the file - first line is empty (what's left of diff's +++ header line), second is 1,223 (from the first hunk header)

In [None]:
!head cleaned_unmatched_queries.txt
#Success!! We now have a full list of unmatched queries

In [None]:
!grep -c "TRINITY" cleaned_unmatched_queries.txt
#Okay, it's got the correct number of queries

In [None]:
!wc -l cleaned_unmatched_queries.txt
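#But the total number of lines is off by 50. Ran cat, and looks like there's 50 lines that are just numbers - most likely the remaining hunk headers (@@ -a,b +c,d @@) from diff -u, since our grep grabbed everything after a +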

In [None]:
!sort hemat_uniprot_maxhsps_subjects_blastx.txt | uniq -d
#Checking it's not from duplicates in either input file - uniq collapses adjacent identical lines (hence the sort first), -d prints only the lines that are repeated

In [None]:
!sort hemat_queries.txt | uniq -d
#Finishing the duplicate check - again, none found

In [None]:
!grep -v '^[0-9]' cleaned_unmatched_queries.txt > final_unmatched_queries.txt
#Alright, let's get rid of all those lines that begin with numbers

In [None]:
!wc -l final_unmatched_queries.txt
#Perfect! We now have a full list of all unmatched queries!
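
The diff-based steps above can probably be condensed into a single set difference. This is a minimal sketch, not the method used in this notebook. It assumes query IDs sit on FASTA header lines (starting with `>`) and in the first tab-separated column of the BLAST table; `list_unmatched_queries` is a hypothetical helper name.

```shell
# Print query IDs present in a FASTA file but absent from column 1 of a
# tab-separated BLAST output file.
list_unmatched_queries() {
    fasta="$1"
    blast_tab="$2"
    all_ids=$(mktemp)
    hit_ids=$(mktemp)
    # Header lines only: strip the leading ">" and anything after the first whitespace
    grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | LC_ALL=C sort -u > "$all_ids"
    # Column 1 of the BLAST table is taken to be the query ID (no trailing tab)
    cut -f1 "$blast_tab" | LC_ALL=C sort -u > "$hit_ids"
    # comm -23 keeps only lines unique to the first of two sorted files
    LC_ALL=C comm -23 "$all_ids" "$hit_ids"
    rm -f "$all_ids" "$hit_ids"
}
```

For example, `list_unmatched_queries hemat.fasta hemat_uniprot_maxhsps_blastx.tab | wc -l` should give the same count as the final file above. Because `comm` compares sorted lists, there are no diff headers to clean up, though the IDs come out sorted rather than in FASTA order.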

**Re-run BLAST using DIAMOND BLASTx**
To download DIAMOND, follow the instructions at http://www.diamondsearch.org/index.php

In [6]:
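!cd bin
#NOTE: !cd runs in a temporary subshell, so the notebook's working directory doesn't actually change (the ls below still shows the original directory). %cd bin would make it stick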

In [15]:
!ls

analyzing-hemat-data.ipynb     hemat_uniprot_maxhsps_blastx.tab
bin			       hemat_uniprot_maxhsps_subjects_blastx.txt
cleaned_unmatched_queries.txt  hemat_uniprot_nomaxhsps_blastx.tab
diamond			       hematodinium-script.ipynb
diamond-linux64.tar	       results
doc			       uniprot_sprot_diamond.dmnd
hemat.fasta		       unmatched_queries.txt
hemat_queries.txt


In [2]:
!diamond blastx -d uniprot_sprot_diamond -q hemat.fasta -o diamondblast.m8

/bin/bash: diamond: command not found
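
The `command not found` error above means the shell could not locate a `diamond` executable on its `PATH`. Below is a minimal sketch for locating the binary before running the search. It assumes the executable may have been unpacked into the working directory: the `ls` output lists a `diamond` entry, though whether that is the binary itself is an assumption. `find_diamond` is a hypothetical helper name.

```shell
# Print a usable path to the diamond executable, or return non-zero.
# Checks PATH first, then the current directory.
find_diamond() {
    if command -v diamond >/dev/null 2>&1; then
        command -v diamond
    elif [ -f ./diamond ] && [ -x ./diamond ]; then
        echo ./diamond
    else
        echo "diamond not found on PATH or in $(pwd)" >&2
        return 1
    fi
}
```

The search could then be launched with `DIAMOND=$(find_diamond) && "$DIAMOND" blastx -d uniprot_sprot_diamond -q hemat.fasta -o diamondblast.m8`, which only runs if an executable was actually found.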


In [3]:
!head hemat_uniprot_maxhsps_blastx.tab

TRINITY_DN5655_c0_g1_i1	sp|P20035|HGXR_PLAFG	56.481	216	90	3	63	707	15	227	6.36e-81	246
TRINITY_DN5691_c0_g1_i1	sp|Q8C4J7|TBL3_MOUSE	42.857	245	126	5	712	17	400	643	1.84e-57	198
TRINITY_DN5627_c0_g1_i1	sp|Q9NX58|LYAR_HUMAN	39.706	136	77	3	20	418	1	134	4.59e-26	105
TRINITY_DN5653_c0_g1_i1	sp|P48598|IF4E_DROME	43.506	154	78	4	271	729	81	226	8.12e-30	117
TRINITY_DN5610_c0_g1_i1	sp|P34736|TKT_PICST	50.725	276	130	2	824	15	311	586	7.46e-86	273
TRINITY_DN5607_c0_g1_i1	sp|Q9VXK6|IF5_DROME	43.860	171	89	4	520	14	4	169	3.54e-38	141
TRINITY_DN5630_c0_g1_i1	sp|Q803X4|DCA13_DANRE	52.809	445	207	2	1393	68	1	445	1.77e-166	480
TRINITY_DN5664_c0_g1_i1	sp|Q8K339|KIN17_MOUSE	41.065	263	148	4	811	38	1	261	2.07e-61	202
TRINITY_DN5613_c0_g1_i1	sp|Q54ED4|GRWD1_DICDI	44.796	221	110	4	16	657	264	479	4.84e-61	202
TRINITY_DN5644_c0_g1_i1	sp|Q9FNK4|OAT_ARATH	30.928	291	183	8	70	936	171	445	3.71e-28	118
