# Dicistroviridae Analysis I / II
```
Lead     : ababaian
Issue    : #239
start    : 2020 11 13
complete : ...
files    : ~/serratus/notebook/201113_dicistro/
s3 files : s3://serratus-public/notebook/201113_dicistro/
```

## Introduction

As part of the last set of protein runs, "Dicistroviridae" sequences were included. The objective is to assemble and identify novel dicistrovirus species, with the intent of isolating the IRES elements between the two ORFs for a collaboration with E. Jan at UBC.

For preamble, see [Dev Test: Dicistro](http://localhost:8889/notebooks/serratus/notebook/200822_DevTest_v0.3.5.ipynb) for the set-up of the reference genomes used in `protref5`.

### Objectives
- Process the protein run data from the last set of runs


## Materials and Methods


### EC2 Workstation Initialization

Launched `c5.2xlarge` instance with 200 GB SSD as a workstation.

In [None]:
# ON EC2 (PREVIOUS VERSION)
mkdir -p dicistro/psummary; cd dicistro

# Download full set of protein data into ./psummary/
aws s3 sync s3://serratus-public/out/200828_pviro5/psummary/ ./psummary/
aws s3 sync s3://serratus-public/out/200830_pmeta5/psummary/ ./psummary/
aws s3 sync s3://serratus-public/out/200905_pvert5/psummary/ ./psummary/

# Total Dicistro listing (run from dicistro/, where psummary/ was synced)
grep -R "Dicistro" psummary/    > dicistro.psummary
grep "famcvg" dicistro.psummary > dicistro.fam.psummary

aws s3 cp dicistro.psummary     s3://serratus-public/notebook/201113_dicistro/
aws s3 cp dicistro.fam.psummary s3://serratus-public/notebook/201113_dicistro/

# Clean-up to TSV
echo -e "sra\tcvg\tfam\tscore\tpctid\talns\tavgcols" \
  > dicistro.fam.tsv

sed 's/.*:sra=//g' dicistro.fam.psummary \
  | sed 's/;famcvg=/\t/g' - \
  | sed 's/;fam=/\t/g' - \
  | sed 's/;score=/\t/g' - \
  | sed 's/;pctid=/\t/g' - \
  | sed 's/;alns=/\t/g' - \
  | sed 's/;avgcols=/\t/g' - \
  | sed 's/;$//g' - \
  >> dicistro.fam.tsv

aws s3 cp dicistro.fam.tsv s3://serratus-public/notebook/201113_dicistro/


In [None]:
#LOCAL
WORK='/home/artem/serratus/notebook/201113_dicistro/'
cd $WORK

aws s3 cp s3://serratus-public/notebook/201113_dicistro/dicistro.fam.tsv ./

In [3]:
head dicistro.fam.tsv

sra	cvg	fam	score	pctid	alns	avgcols
ERR1300950	___u___________:.________	Dicistroviridae	6	54	7	72
ERR1111183	_._______________________	Dicistroviridae	1	97	1	33
DRR037362	______________a:__m__o:._	Dicistroviridae	30	50	130	88
ERR1300956	__:.__.__________________	Dicistroviridae	6	50	4	68
DRR023333	^^^^^^^^^^^^^^^^^^^^^^^^^	Dicistroviridae	100	77	998016	65
DRR042075	_WmmUmaaauaUmmaaAMMWmUaa_	Dicistroviridae	100	50	8997	67
ERR1300953	________________.:_______	Dicistroviridae	4	56	4	40
DRR053208	_____________:___u_______	Dicistroviridae	4	50	8	54
ERR1300954	__.______________________	Dicistroviridae	2	50	1	61
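
To shortlist libraries from a table like the one above, the TSV can be filtered on its numeric columns. This is a minimal sketch, assuming the column order given in the header (`sra cvg fam score pctid alns avgcols`). The helper name `filter_fam_tsv` and the thresholds in the example are hypothetical and are not values used in this analysis.

```shell
# filter_fam_tsv MIN_SCORE MIN_ALNS < dicistro.fam.tsv
# Keeps the header line plus rows with score >= MIN_SCORE and alns >= MIN_ALNS.
# Column positions (score = 4, alns = 6) are assumed from the TSV header.
filter_fam_tsv() {
  awk -F'\t' -v s="$1" -v a="$2" 'NR == 1 || ($4 >= s && $6 >= a)'
}

# Example on three rows copied from the table above (thresholds are illustrative)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  sra cvg fam score pctid alns avgcols \
  ERR1300950 ___u___________:.________ Dicistroviridae 6 54 7 72 \
  DRR023333 '^^^^^^^^^^^^^^^^^^^^^^^^^' Dicistroviridae 100 77 998016 65 \
  DRR042075 _WmmUmaaauaUmmaaAMMWmUaa_ Dicistroviridae 100 50 8997 67 \
  | filter_fam_tsv 50 100 | cut -f1
```

With these example thresholds only the two high-coverage libraries (`DRR023333`, `DRR042075`) remain after the header.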


## Spot checking R+R work

See github issue 239:

> RC: Pathracer analysis versus all assembly graphs (not the gene_clusters this time):

>all files are in s3://serratus-public/assemblies/dicistro/analysis/

>(1) the a.a. translations of RdRp according to PR

>(PR applied to each assembly graph and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)

>all_assembly_graphs.RdRP_1234q.seqs.fa.gz

>(2) those RdRp a.a. sequences clustered at 97%id

>all_assembly_graphs.RdRP_1234q.seqs.centroids.fa.gz

>(3) diamond search of the clustered sequences vs. dicistro.protref.aa

>all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz

>(4) diamond search of the clustered sequences vs. rdrp0.

>all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz

>scripts used:
> https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_run.sh
> https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_organize.sh

-------------

> RCE: Results uploaded to s3://serratus-public/rce/dicistro_analysis/pr_graphs/, same filenames as in my comment above. If this analysis is correct, there are 59k new species. Summary totals:
> ```
> 21886   NovelGenus_Other
> 20666   NovelFamily
> 10955   NovelSpecies_Other
> 6439    KnownSpecies
> 4310    Novel_NoMatch
> 1340    NovelGenus_Dicistro
> 99      NovelSpecies_Dicistro
> ```

#### Re-analyzed with scaffold-only sequences

> ```
> s3://serratus-public/rce/dicistro_analysis/scaffolds/
> 15538   NovelFamily
> 9746    NovelGenus_Other
> 2252    NovelSpecies_Other
> 1713    KnownSpecies
> 877     Novel_NoMatch
> 481     NovelGenus_Dicistro
> 94      NovelSpecies_Dicistro
> ```


In [None]:
# ON EC2
mkdir -p dicistro; cd dicistro

# Download RCE's analysis
mkdir -p rce
aws s3 sync s3://serratus-public/rce/dicistro_analysis/pr_graphs/ ./rce

# Download RC's analysis
mkdir -p rc
aws s3 sync s3://serratus-public/assemblies/dicistro/analysis/ ./rc
gzip -d rc/*.gz

# Monkey work
# less -NS rce/novel_species.fa

### Error Type 1: Low complexity sequences

From: `novel_species.fa`
```
Line: 285 >SRR6400004;Type=Novel_NoMatch;Rdrp0=none;Dicistro=none;
KHTQTLKHTHTLKHTHKHTHKHTLKHSNTLKHTHTHTLKHTLKHTHTQTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTN
THTHTHTHTHTHTNTHTHTHTHTQTHTQTHTHSNTHTHTLKHTLKHTHSNTHSNTHTHTHTHTHTHTHTQTHTHTHTHTH
KHTLKHTHTQTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHKHTHTHTHTHTHTHKHTHTHTHTHTHTHSNTHSNTHTQTH
TQTHTHTHTHTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTHTHTHTHTQTH
TQTHTHSNTHTHTLKHTHTHTHTHTHTHTHTHTLKHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTLKHTLKHTHSNTHSN
THTHTHTHTHTHTHTQTHTHTHKHTHTHTHTHTNTHTNTHTHKHTHTHTQTHTQTHTHTHTHTHTHTQTHTHTHTHTHTH
TLKHTLKHTHSNTHSNTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTHTHTHTHSNTHSNTHTHTHTHSNT
``` 


Diamond finds no similarity for this sequence in either the rdrp0 or dicistro databases, yet the HMM models report a hit. Anton is working on tracing this hit back through PathRacer.
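
Sequences like the one above are dominated by a couple of residues (here T and H), so a crude composition check could flag them before any database search. The following is a rough sketch and not the filter used in this analysis. The helper name `top2_fraction` is hypothetical, and any cutoff applied to its output would be an assumption to tune.

```shell
# top2_fraction < proteins.fa
# For each FASTA record, print: header <TAB> fraction of residues that belong
# to the two most frequent amino acids in that record.
top2_fraction() {
  awk '
    function report(   i, c, n, m1, m2, aa) {
      if (hdr == "") return
      n = length(seq); m1 = 0; m2 = 0
      split("", cnt)                                   # reset residue counts
      for (i = 1; i <= n; i++) cnt[substr(seq, i, 1)]++
      for (aa in cnt) {
        c = cnt[aa]
        if (c > m1) { m2 = m1; m1 = c } else if (c > m2) { m2 = c }
      }
      printf "%s\t%.2f\n", hdr, (n ? (m1 + m2) / n : 0)
    }
    /^>/ { report(); hdr = substr($0, 2); seq = ""; next }
         { seq = seq $0 }                              # join wrapped sequence lines
    END  { report() }
  '
}

# Usage (file name from this notebook):
#   top2_fraction < rce/novel_species.fa | sort -t "$(printf '\t')" -k2,2nr | head
```

A repeat-like record scores close to 1, while a typical protein with mixed composition scores far lower.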

### Traceback 1


```
Line: 1336 >SRR10695052;Type=NovelGenus_Other;rdrp0=rdrp2.yaOV90.orf25381(58.7%);Dicistro=Dicistroviridae.ORF1.NC_014793.1(26.9%);
PIIEKVMGVANQWGPPKMTPNYDAFNKTLEHVVDPADTFDPDLLQMAMQDWIQPLNTAMKSWKKEEGFAPLTEKESIMGI
DGKRFIDAIPMNTSTGFPLFQSKHKWFLETRDDSNILLDRKPHPDIHVEMERLLSAWRLGQRGYPVTSATLKDEPTLLGK
DKVRVFQGGSIAFGLQLRKYFLPVLRFLHFHPTLSESAVGVNAFGPEWEILMTHAEKYAEDDKMIAWDYSKYDVRMNSQM
TRAVLYLFIELAETGGYSQADLKIMRTMVVDLVHPLIDWNGVMFMAFNMNTSGNNLTVDINGTAGSLYVRTAFFNLFQTV
KVGDFRSKVAALTYGDDFIGSVQKDYRDFNFEYFKSFLMKHKMKVTLPSKDDSSSEFLDKSDVDFLKRKSSYISEIGCSI
GRLDEMSIFKSLHSNVKSKNITSSELQVSVIRGAMHEWFAHGRDVYDLRRDQMEEVCT
```


blastp results:

```
Description	Scientific Name	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
hypothetical protein 1 [Beihai picorna-like virus 17]	Beihai picorna-like virus 17	568	568	99%	0.0	57.68%	1669	YP_009333556.1
hypothetical protein 1 [Beihai sesarmid crab virus 1]	Beihai sesarmid crab virus 1	568	568	99%	0.0	57.80%	1703	YP_009333397.1
hypothetical protein 1 [Changjiang crawfish virus 1]	Changjiang crawfish virus 1	551	551	99%	2e-177	57.89%	1772	YP_009336771.1
hypothetical protein [Lindernia crustacea marnavirus]	Lindernia crustacea marnavirus	561	561	99%	4e-177	57.89%	2829	QKQ15127.1
```

Hit: [Penguin Poop](https://www.ncbi.nlm.nih.gov/sra/?term=SRR10695052)

A dead-on hit for a real virus, at 57.68% identity to a known virus in GenBank. This virus was likely already picked up by the paper that searched these samples for viruses: [Sustained RNA virome diversity in Antarctic penguins and their ticks](https://www.nature.com/articles/s41396-020-0643-1).

### Putative Error 2: Known polio, Assembly-variants


From: `rce/novel_species.fa`
```
Line: 5573 >SRR1036663;Type=NovelGenus_Other;rdrp0=rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1(70.4%);Dicistro=none;
EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINTEQMCLEDAMYGTD
GLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYV
AMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK
VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFE
ALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSI
FNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLLAWH
NGEEEYNKFLAKIRSVPI
```

Taken from [SRR1036663: WT poliovirus serially passaged through HelaS3 - sequenced using CirSeq - Acevedo et al. Nature 2013](https://www.ncbi.nlm.nih.gov/sra/SRR1036663/)

[Blastp hit](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=YHPMW4UR016): 70.93% from polio

```
RNA polymerase 3D [Human poliovirus 1]	Human poliovirus 1	781	781	100%	0.0	70.93%	461	AFQ38633.1

Query  1    EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  60
            EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL
Sbjct  26   EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  85

Query  61   MSLDINTEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLS  120
            MSLDINTEQMCLEDAMYGTDGLEALDLSTS                              
Sbjct  86   MSLDINTEQMCLEDAMYGTDGLEALDLSTS------------------------------  115

Query  121  TSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEM  180
              AGYPYVA+GKKKRDI+NKQTRDTKE                                M
Sbjct  116  --AGYPYVALGKKKRDIMNKQTRDTKE--------------------------------M  141

Query  181  QKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK  240
            Q+LLDTYGINLPLVTYVKDEL                                  RS+TK
Sbjct  142  QRLLDTYGINLPLVTYVKDEL----------------------------------RSRTK  167

Query  241  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEE  300
            VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGV+TGSAVGCDPDLFWSKIPVLMEE
Sbjct  168  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVVTGSAVGCDPDLFWSKIPVLMEE  227

Query  301  KLFAFDYTGYDASLSPAWFEALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  360
            KLFAFDYTGYDASLSPAWFEALKM LEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP
Sbjct  228  KLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  287

Query  361  SGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKG  420
            SGCSGTSIFNSM                                INNLIIRTLLLKTYKG
Sbjct  288  SGCSGTSIFNSM--------------------------------INNLIIRTLLLKTYKG  315

Query  421  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK  480
            IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT   
Sbjct  316  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT---  372

Query  481  RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRW  540
                                         FLKRFFRADEKYPFLIHPVMPMKEIHESIRW
Sbjct  373  -----------------------------FLKRFFRADEKYPFLIHPVMPMKEIHESIRW  403

Query  541  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  578
            TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI
Sbjct  404  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  441

```


That library contains a good, clean hit to polio, so I am guessing the hit above comes from "another assembly" of the same virus.

From `rc/all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa`
```
>Score=79.0939|Bitscore=161.231|PartialBitscore=161.231|Seq=SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9|Position=4995|Frameshifts=0|Alignment=29M3D12M1D9M1D21M6D39M2D15M8D56M1I7M1D12M3D28M1D10M4D13M1D46M10D16M2D21M12D10M57D5M
```
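
The `Alignment=` field in the PathRacer header reads like a CIGAR string, so the extent of insertions and deletions in a hit can be totalled directly from it. This is a small sketch that assumes the usual CIGAR meaning of `M`, `I` and `D` (match, insertion and deletion columns), which is an assumption about this output rather than something stated in these notes. `cigar_totals` is a hypothetical helper name.

```shell
# cigar_totals STRING
# STRING is either a full header line containing "Alignment=..." or a bare
# alignment string such as "29M3D12M1D9M". Prints summed M / I / D lengths.
cigar_totals() {
  printf '%s\n' "$1" \
    | sed 's/.*Alignment=//; s/|.*//' \
    | grep -oE '[0-9]+[MID]' \
    | awk '{ op = substr($0, length($0))                 # trailing operation letter
             n[op] += substr($0, 1, length($0) - 1) }    # leading run length
           END { printf "M=%d I=%d D=%d\n", n["M"], n["I"], n["D"] }'
}

# Example on the first few operations of the header above
cigar_totals '29M3D12M1D9M'    # prints: M=50 I=0 D=4
```

Running it on the full header line gives a quick measure of how fragmented a reported path is relative to the profile.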

From: `rc/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6`
```
SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9  rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1      99.7    339     1       0       5029    6045    1       339     1.5e-195        680.2   6046    425     GTGAAGGAACCAGCAGTCCTCACTAAAAACGATCCCAGGCTTAAGACAGACTTTGAGGAGGCAATTTTCTCCAAGTACGTGGGTAACAAAATTACTGAAGTGGATGAGTACATGAAAGAGGCAGTAGACCACTATGCTGGCCAGCTCATGTCACTAGACATCAACACAGAACAAATGTGCTTGGAGGATGCCATGTATGGCACTGATGGTCTAGAAGCACTTGATTTGTCCACCAGTGCTGGCTACCCTTATGTAGCAATGGGAAAGAAGAAGAGAGACATCTTGAACAAACAAACCAGAGACACTAAGGAAATGCAAAAACTGCTCGACACATATGGAATCAACCTCCCACTGGTGACTTATGTAAAGGATGAACTTAGATCCAAAACAAAGGTTGAGCAGGGGAAATCCAGATTAATTGAAGCTTCTAGTTTGAATGACTCAGTGGCAATGAGAATGGCTTTTGGGAACCTATATGCTGCTTTTCACAAAAACCCAGGAGTGATAACAGGTTCAGCAGTGGGGTGCGATCCAGATTTGTTTTGGAGCAAAATTCCGGTATTGATGGAAGAGAAGCTGTTTGCTTTTGACTACACAGGGTATGATGCATCTCTCAGCCCTGCTTGGTTCGAGGCACTAAAGATGGTGCTTGAGAAAATCGGATTCGGAGACAGAGTTGACTACATCGACTACCTAAACCACTCACACCACCTGTACAAGAATAAAACATACTGTGTCAAGGGCGGTATGCCATCTGGCTGCTCAGGCACTTCAATTTTTAACTCAATGATTAACAACTTGATTATCAGGACACTCTTACTGAAAACCTACAAGGGCATAGATTTAGACCACCTAAAAATGATTGCCTATGGTGATGATGTAATTGCTTCCTACCCCCATGAAGTTGACGCTAGTCTCCTAGCCCAATCAGGAAAAGACTATGGACTAACTATGACTCCAGCTGACAAATCAGCTACATTTGAAACAGTCACATGGGAGAATGTAACATTCTTGAAG       VKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
```

#### Proposed Solution

Here is the assembly graph from @asl for poliovirus ![poliovirus assembly](/home/artem/Desktop/serratus/notebook/201113_dicistro/poliovirus_exclusion.png)

The issue arises when a single assembly graph contains multiple paths that each provide an RdRp match: every path is returned, which yields the indels seen above in slippage-prone viruses like polio (or other RNA viruses).

Proposed solution: for each RdRp output graph, allow each RdRp-containing edge (red highlight above) to be used at most ONCE across the reported sequences. The "top hit" is selected by percent identity to a known virus (thus assuming the hit is not novel). The risk is that if a novel virus is in the same library as a known virus, and the two share homology over an edge of, say, 50 amino acids, then the novel virus would be excluded because the known virus takes priority. The benefit is that intra-sample viral variants are reduced to the most conservative one.

One more caveat: assume the RdRp is the red region in the graph above and the viral genome is green. If a sub-graph (blue) has a higher-identity match than the longest match (green) but does not contain an end-to-end RdRp domain, the longest match with variants should take priority.
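
The edge-exclusion rule can be read as a greedy selection: rank candidate paths by percent identity, then accept a path only if none of its edges has already been claimed. The sketch below assumes the candidates have been tabulated beforehand; the three-column layout and the name `pick_edge_exclusive` are hypothetical. It does not implement the caveat about preferring the longest end-to-end RdRp match.

```shell
# pick_edge_exclusive < candidates.tsv
# Input columns (hypothetical layout):
#   candidate_id <TAB> pctid <TAB> comma-separated IDs of RdRp-containing edges
# Output: IDs of accepted candidates, best percent identity first.
pick_edge_exclusive() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr \
    | awk -F'\t' '
        {
          n = split($3, e, ",")
          clash = 0
          for (i = 1; i <= n; i++) if (e[i] in used) clash = 1
          if (clash) next                          # an edge is already taken
          for (i = 1; i <= n; i++) used[e[i]] = 1  # claim all edges of this path
          print $1
        }'
}

# Toy example: pathB shares edges with the better-scoring pathA and is dropped
printf 'pathB\t70.4\te1,e9,e3\nnovel\t58.7\te20,e21\npathA\t99.7\te1,e2,e3\n' \
  | pick_edge_exclusive
```

In the toy input, `pathA` and `novel` are kept because they share no edges, which mirrors the intent of keeping one conservative variant per RdRp region.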

### Traceback 2

study: `SRR5214089: Wetland microbial communities from Old Woman Creek Reserve in Ohio, USA - Plant_9_15_B metatranscriptome`

```
Line:   8091 
>SRR5214089;Type=NovelSpecies_Dicistro;rdrp0=rdrp2.yaOV72.orf339(75.2%);Dicistro=Dicistroviridae.ORF2.MF189972.1(77.2%);
MMSGLKKCGVTPALLNDDDLKACVNDIARTLRTNYSRIDESVYKRVLSYEEAVQGANDEYMTAVNRLTSPGMPYSLMREG
KVGKTKWLGSNENFDFVSPDALEMRNDVAKLIDDCRNGIIRGVYCSDTLKDEKRDLAKVAVGKTRVFSACPVHFVLAFRR
YFLGFSAWCMHNRIDNEVAVGTNQYSLDWHKIAIRLQKKGEAVIAGDFSNFDGSLNAQVLWAILDIVNEWYDDGEENVKI
RTGLWAHVVHSTHIFEDNVYMWTHSQPSGNPFTVIINSIYNSIIMRMAWQIVMKEQGMAGMDQFQKYVSMISYGDDNCLN
ISHSIIEQFNQQTIADALSTIAHTYTDEGKTGEIVKARKLNDVNFLKRGFMFSSELQRYVAPLEERVIYEMLNWTRNTVD
PDEILKTNVETAAREMALHGKVKFDNFCKEIRQIE
```

blastp: 	

`putative ORF1 [Caledonia beadlet anemone dicistro-like virus 2] 	Caledonia beadlet anemone dicistro-like virus 2 	734 	734 	100% 	0.0 	75.86% 	1746 	ASM93984.1`

```
Query  1     MMSGLKKCGVTPALLNDDDLKACVNDIARTLRTNYSRIDESVYKRVLSYEEAVQGANDEY  60
             +M+GLKKCG TPALL++D++ AC NDI+R +RTN++ ID + YKRVLSYE+AV+GA+D++
Sbjct  1284  LMNGLKKCGKTPALLDNDEINACSNDISRLIRTNFANIDINFYKRVLSYEQAVKGADDDF  1343

Query  61    MTAVNRLTSPGMPYSLMREGKVGKTKWLGSNENFDFVSPDALEMRNDVAKLIDDCRNGII  120
             MT+VNR+TSPG PYS  R G+VGKTKWLGSNE+FDF S ++L+M+ DV  LI+DC+NGII
Sbjct  1344  MTSVNRVTSPGFPYSQQRSGQVGKTKWLGSNEDFDFTSENSLQMQQDVKNLINDCKNGII  1403

Query  121   RGVYCSDTLKDEKRDLAKVAVGKTRVFSACPVHFVLAFRRYFLGFSAWCMHNRIDNEVAV  180
             RGVYC+DTLKDEKRDL KVAVGKTRVFSACP+HFVLAFR+YFLGFSAWCMHNR+DNE+AV
Sbjct  1404  RGVYCADTLKDEKRDLEKVAVGKTRVFSACPIHFVLAFRQYFLGFSAWCMHNRVDNEIAV  1463

Query  181   GTNQYSLDWHKIAIRLQKKGEAVIAGDFSNFDGSLNAQVLWAILDIVNEWYDDGEENVKI  240
             GTNQYSLDW+KIA+RL++KG  VIAGDFSNFDGSLNAQ+LWAIL+I+N+WYDD  EN KI
Sbjct  1464  GTNQYSLDWNKIALRLKRKGNPVIAGDFSNFDGSLNAQILWAILEIINDWYDDDIENKKI  1523

Query  241   RTGLWAHVVHSTHIFEDNVYMWTHSQPSGNPFTVIINSIYNSIIMRMAWQIVMKEQGMAG  300
             R GLW HVVHSTH+F+DNVYMWTHSQPSGNPFTVIINSIYNSI+MRMAW+IVM   G++G
Sbjct  1524  RIGLWTHVVHSTHVFDDNVYMWTHSQPSGNPFTVIINSIYNSIVMRMAWRIVMSSHGLSG  1583

Query  301   MDQFQKYVSMISYGDDNCLNISHSIIEQFNQQTIADALSTIAHTYTDEGKTGEIVKARKL  360
             M+ F K+VSM+SYGDDN LNIS S+I+ FNQQTIA+AL +IAHTYTDE K+GE++K R L
Sbjct  1584  MNHFNKFVSMVSYGDDNVLNISQSVIDLFNQQTIAEALESIAHTYTDETKSGEMIKFRSL  1643

Query  361   NDVNFLKRGFMFSSELQRYVAPLEERVIYEMLNWTRNTVDPDEILKTNVETAAREMALHG  420
              DVNFLKR F+FS ELQRY+APLEERVIYEMLNWTRNT+DPDEIL  NVETAAREMALHG
Sbjct  1644  KDVNFLKRSFVFSEELQRYIAPLEERVIYEMLNWTRNTIDPDEILMMNVETAAREMALHG  1703

Query  421   KVKFDNFCKEIRQIE  435
             + KF+NF  E+RQIE
Sbjct  1704  RNKFNNFVSELRQIE  1718
```


"Top hit" from library: `all_assembly_graphs.RdRP_1234q.seqs.fa`

```
Line: 4622803 >SRR5214089.Score=197.009|Edges=5987|Position=940|Alignment=29M2D13M1D31M1D38M1D22M4I190M2D26M2I17M1I27M2I61M|ScaffoldSuperpaths=16349025'_2200
RSAIYGEVLEPISAPAVLSRRVVLPDGTLHDPVLAGLKKTGKIPPFMETKLIKAAVNDVL
RIHQTNDRTRKRVLTNEEALSGVLDDIYSNPLNRGSSPGFPWVLNRSGKGKMKWTADENG
EYKMNEELKKAIDEREEMALRNERFPTIWIDTLKDERRPLEKVRVGKTRVFAAGPMDFVV
CARKYFLGFCAHLAEHRIDNEVAVGINPYSYDWTQLAVHLKKKGERVVAGDFGNFDGTLI
LQILEEIGNAINEWYDDGEDNRQIRTILWKELINSIHIEADNIYFWTHGHPSGHPLTAIL
NSLYNSVVCRIVFVMCARKAGQFMTMKDFNENVSMISYGDDNVLNISERITEWFNQHTMS
EVFAEIGMEYTDELKSSATDALPFRKLEDVSFLKRKFKFDEERHCYNAPLEYGVCMEMVN
WIRGELDPEDACCVNCQTSAMELSLHGRKVFEKSTKLIKQACL
```

`hypothetical protein 1 [Wenzhou channeled applesnail virus 2] 	Wenzhou channeled applesnail virus 2 	821 	821 	99% 	0.0 	82.90% 	2007 	YP_009336777.1`

```
Query  2     SAIYGEVLEPISAPAVLSRRVVLPDGTLHDPVLAGLKKTGKIPPFMETKLIKAAVNDVLR  61
             S I+G VL PISAPAVLSRRVVL DGT+HDPVLAGLKKTGKIPP+M+  LIKAAVNDVLR
Sbjct  1510  SLIHGAVLPPISAPAVLSRRVVLEDGTIHDPVLAGLKKTGKIPPYMDPNLIKAAVNDVLR  1569

Query  62    IHQTNDRTRKRVLTNEEALSGVLDDIYSNPLNRGSSPGFPWVLNRSGKGKMKWTADENGE  121
             +HQTNDRTRKRVLTN EALSGVLDD YSNPLNR SSPG+PWV +R GKGKMKWT+D +GE
Sbjct  1570  VHQTNDRTRKRVLTNLEALSGVLDDPYSNPLNRSSSPGYPWVKDRVGKGKMKWTSDFDGE  1629

Query  122   YKMNEELKKAIDEREEMALRNERFPTIWIDTLKDERRPLEKVRVGKTRVFAAGPMDFVVC  181
             YKM++EL  AI+ERE MAL NER+PT+WIDTLKDERRP+EKV++GKTRVFAAGPMDF+VC
Sbjct  1630  YKMHKELADAIEEREFMALNNERYPTVWIDTLKDERRPIEKVKIGKTRVFAAGPMDFIVC  1689

Query  182   ARKYFLGFCAHLAEHRIDNEVAVGINPYSYDWTQLAVHLKKKGERVVAGDFGNFDGTLIL  241
             ARKY+LGFCAHLAE+RI+NEVAVGINPYSYDWT LA HLKK G +VVAGDFGNFDGTLIL
Sbjct  1690  ARKYYLGFCAHLAENRINNEVAVGINPYSYDWTHLARHLKKFGNKVVAGDFGNFDGTLIL  1749

Query  242   QILEEIGNAINEWYDDGEDNRQIRTILWKELINSIHIEADNIYFWTHGHPSGHPLTAILN  301
             QILE IG AI+EWYDDGEDN QIR ILWKELINS+H+E +N+YFWTHGHPSGHPLTAILN
Sbjct  1750  QILESIGEAISEWYDDGEDNAQIRRILWKELINSVHVEGNNLYFWTHGHPSGHPLTAILN  1809

Query  302   SLYNSVVCRIVFVMCARKAGQFMTMKDFNENVSMISYGDDNVLNISERITEWFNQHTMSE  361
             SLYNSVVCRIVFV+CARKAG+   MKDFNENVSMISYGDDNVLNIS+R+ ++FNQHTMSE
Sbjct  1810  SLYNSVVCRIVFVLCARKAGKIANMKDFNENVSMISYGDDNVLNISDRVIDYFNQHTMSE  1869

Query  362   VFAEIGMEYTDELKSSATDALPFRKLEDVSFLKRKFKFDEERHCYNAPLEYGVCMEMVNW  421
              F EIGMEYTDELKSSA DA PFR LE+VSFLKRKF++DEER C+ APLE GVCMEMVNW
Sbjct  1870  CFTEIGMEYTDELKSSAADAKPFRSLEEVSFLKRKFRWDEERACFTAPLELGVCMEMVNW  1929

Query  422   IRGELDPEDACCVNCQTSAMELSLHGRKVFEKSTKLIKQACL  463
             IRGELDPE+ACCVNCQTSAMELSLHGR+VFEK TK+IK+ACL
Sbjct  1930  IRGELDPEEACCVNCQTSAMELSLHGREVFEKCTKMIKRACL  1971
```

### Traceback 3

```
Line    131 >ERR4147417;Type=NovelFamily;rdrp0=rdrp2.Dicistroviridae.Ancient_Northwest_Territories_cripavirus:AIM55450.1(43.3%);Dicistro=Dicistroviridae.OR
VDKGVRVHDTSAVKPSDLSGAWPCEQMGPGVLTNNAYEKARARYARESILLPTGDLKTVVRTLFEYYENVSTYDVQRNLL
SFEEAAEGLPDDPDYKPISRKTSCGYPISIHDDPRMRSKASYFGSDGPFVYTPKARALEDELREIVDKAKQGFRTQFVYT
DNLKDERVSVVKILNEKTRLFSGVPLAYLLLVRMYFGKFMLWIAKNRISNSSAIGINAYEIEWAIMFRHLVGQNGVDDVC
FFAGDFKGFDQSGKPTIYLMILDEINAWYDDSLENQRIRRVLWLELIQSKHIRGRLIYEWSRSLPSGHPMTTIVNTIYNH
IAYRYCFFRIVGHNTYMLSNFTDYVRLMSFGDDVVGTVVEALREDFNEMTIAPYMEEIGLVYTTDLKESAEVPLRTYQQV
TFLKRSFMFCPETDKLLAPLNLQVVLNIPMWTKRGADDGAITRDNVCTALRELSLWGRVIYNQHAPVIIKACK
```

blastp:

```
Description	Scientific Name	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
RNA-dependent RNA polymerase [Picornavirales sp.]	Picornavirales sp.	367	367	99%	3e-118	42.58%	540	QDH88488.1
RNA-dependent RNA polymerase [Picornavirales sp.]	Picornavirales sp.	363	363	99%	5e-112	42.37%	985	QDH89953.1
RNA-dependent RNA polymerase [Picornavirales sp.]	Picornavirales sp.	306	306	74%	4e-96	44.63%	411	QDH88023.1
nonstructural polyprotein [Dicistroviridae sp.]	Dicistroviridae sp.	319	319	93%	6e-93	39.78%	1574	QJI52026.1
nonstructural protein [Dicistroviridae sp.]	Dicistroviridae sp.	301	301	97%	1e-92	34.97%	541	AVA30705.1
NS [Ancient Northwest Territories cripavirus]	Ancient Northwest Territories cripavirus	280	280	70%	2e-86	44.14%	369	AIM55450.1
```

Back to the `psummary` file

```
mkdir -p ~/ab; cd ~/ab
aws s3 cp s3://lovelywater2/psummary/ERR4147417.psummary ./
aws s3 cp s3://lovelywater2/pro/ERR4147417.pro.gz ./

SUMZER_COMMENT=sra=ERR4147417,genome=protref5,date=200906-01:54,type=protein;totalalns=2571;readlength=100;truncated=no;
famcvg=.ww:uuawwwwawaauuwwuwowwu;fam=Dicistroviridae;score=100;pctid=68;alns=265;avgcols=31;

gencvg=_::_::awwwww.::::uwuuowwu;gen=Dicistroviridae.ORF1;score=78;pctid=68;alns=186;avgcols=32;
gencvg=.uu:::.....uwww.::._:.___;gen=Dicistroviridae.ORF2;score=45;pctid=67;alns=79;avgcols=31;

seqcvg=__.___:_.uuw_____.:._____;seq=Dicistroviridae.ORF1.KJ938718.1;score=20;pctid=61;alns=29;avgcols=32;
seqcvg=_____.:u_:____::.._:_.:u:;seq=Dicistroviridae.ORF1.MH320557.1;score=19;pctid=69;alns=28;avgcols=32;

```
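
The `psummary` lines are `key=value;` records, so the per-ORF (`gen`) rows can be pulled into a table by key name instead of by position. This is a minimal sketch based only on the line format shown above. The helper name `psummary_gen` is hypothetical.

```shell
# psummary_gen < ERR4147417.psummary
# For every "gencvg=" line, print: gen <TAB> score <TAB> pctid <TAB> alns
psummary_gen() {
  awk -F';' '
    /^gencvg=/ {
      split("", kv)                          # reset key/value map for this line
      for (i = 1; i <= NF; i++) {
        p = index($i, "=")
        if (p) kv[substr($i, 1, p - 1)] = substr($i, p + 1)
      }
      printf "%s\t%s\t%s\t%s\n", kv["gen"], kv["score"], kv["pctid"], kv["alns"]
    }'
}

# Example with the two gen-level lines shown above
printf '%s\n' \
  'gencvg=_::_::awwwww.::::uwuuowwu;gen=Dicistroviridae.ORF1;score=78;pctid=68;alns=186;avgcols=32;' \
  'gencvg=.uu:::.....uwww.::._:.___;gen=Dicistroviridae.ORF2;score=45;pctid=67;alns=79;avgcols=31;' \
  | psummary_gen
```

Looking fields up by key keeps the extraction working even if the order of the `key=value` pairs changes between summarizer versions.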

and `.pro` file

```
ERR4147417.721131       Dicistroviridae.ORF1.MH320557.1
98      6       100     709     739     1825    67.7    4.0e-07 31M     -
GTCAAGTCAGGGACAAAATGCAATTTTCCAGAAGTAATTTGTTTCCTATATTTTTGGAAAGTATAAATTTCGGTGGAAGTAGGTACTCTATATCCCGCTG
AGYRVPITTDIYNFVKYRKLIVSGKTQFQPD
```

### Traceback 4 - Putative novel family; by-catch?

Study: [SRR6824969: Metatranscriptome of coastal salt marsh microbial communities from the Groves Creek Marsh, Georgia, USA - 041409AS metaT metatranscriptome](https://www.ncbi.nlm.nih.gov/sra/?term=SRR6824969)


Hit to RdRp, no blastp homology in `NR` or `TSA`

```
>SRR6824969;Type=Novel_NoMatch;Rdrp0=none;Dicistro=none;
SWIRRTRKHFTHTVSTSYGPSQFPPHFVFRSFRRSGGVIFFKIWESETRSWSCVGDSWEEIARRFVYRVVSHSWIQGEKM
SLQLSLAGFTLRAAQAGKLRAVNGYPHMALGGASSQPPPPDVVELCEELQRELGVDVSDWEYAGDKTGEQTMNKLEKLLI
PRKNREYSAEDFIMTLFELMLKTPHYGNEWIKHTPPTLEELLEEMVIKDSSSGYDKPGNPGGSSKNENALWAYAEHCRTA
RGYEPSSNAPPVYAVFSKGNEWLKKTKDQRKIVGESTQQNIALQACFGTIIRERANPFGMSMIGYKTAGNSSLQMLKKLS
QKHITHEDIAKLEEIQVHASDKKQWEYTMSGALKALYASHLMLRVNWTESAMEHLLPLVEALGSYLHPLIGVGKNLVVEA
EHYMPSGTYPTLNGNTHKHMAIVCDYVQQQLAKCDVDQAKELIEWRRTISILGDDFIARWHDEHSPALDEFSDSLYGTVT
ESEGKLPIGKAAFCQRIMRLENGMPVFAYNVERAKIKMCLPRRNMGIQLDAIRSLAADACLLGDDLATVRRITEPYEIGM
IAKLDQDSVQKYKTTSVYDQNGMLRLWSPNHDVHRDVATAHDMANHVSMH
```

Anton grabbed the nucleotide sequence:
```
>NODE_71_length_4866_cov_12.060411
GTGACCAGATGGTTTATGATCCTTTTGACATTCATCGCTTTCTGCGCAGCAGCAGTTGGT
ACCGTCATCGTCGTTATCTTCTACCCAAAGAGGGAGAAACATACTGACGAAGACAAAAAG
GTGATGGAAAACTACGAAGACTTGGACAAATTCATGAATGAGGGAGAACCTTTCTACTAC
AAAGGTATTAAGTTCAACCCCGAAGAAGACAAGTTGGATGACTTCTCCGCCCAGGCAGAG
GCAGCAAAGCTTGCTTCGGACCTTAAGGACAACGTCTTTCCCTCAATCGAGGGACAAAGT
ACGCTTTCGGACACTATCATGATGCTTATATCATTTGTTGTGATGGTTTCCTCTTGCGTC
AGCATTGGCGCTGCCAGAAAAGTTCACGAAGCAGCACGTTCTCTTGCCACAGCGGGCACC
TCCCTGATCGCATTTTTCAGGGCAGGTAACAATTGGCTAGGAATCACCAAATTATTCCGC
GCCCAATCCGAATTCAAAGTACTCCGAGACCTCACCAGGTTACAGATCCTGATGGTCGCA
CTCCTCACAGGAATACTCGCCGCAGCGGCGGGTTTTGCGGCCACTCGAAAGTACAAATCT
TCACCCAAAGTGAAGAGGACCAAAGAGACGAAACCCAGGGGGTTTAGGGTCATGGATCCA
ATGGAAGAGGTTAGCTTTCTCCAGGCAGTCTGGGAGGAAGAAAGGTTCCGGTCACTCGCC
GAAGGCGGGAGCTGGGCCGACGAGGAGGATAATCTCGCCGAAGACTACGCAAACTTCAGA
CGCGAATGGGAAGATAAAAATGTCCGCGAACTCGACGAAGTCGATCTCGAGGATTACTCC
CCAGGCGCAAGTAAGTCTAAGGCATCTCGGAAGGGCAACAAGCGAATCCACGAAACAGCA
AAGGTCACTCGCAAAAAGAAGGTTTCCAAGGCTACTCAGACCGACAAGGTCGCCGAAGAA
CCCGAAAAGGATGACACAAAGGAATCCAAAACTAAGACCGAGACCAAGAAGAAGCCGTCC
ACTAAAAGTGAGTCAGCTAAGAACACTACCAAGACGGCCAAAACTCAGACCGATAAAGGC
AAGGACAATGAAAAGCCTAGAGTCCAAGTCAGAATTGCTGTAAAGTTGACAGGTGAAACC
CAGCACATACATGTAGATCCTACCAACCAACTTGAGTACCTTTGGGTCGTTGGTGACAAG
TACATCGCTTGCCCCAACAAGACCGTCACACTCGCTGACACCAATCAGACATTCAAGTTC
AGCAACACCACACCTGAGGTTCGTACTCCGGAGCTAAGTGTTTACCTTAAGGCCCCAACG
AGGGTTAGCTACAATGGCAAGGTTTATTACACCCCTCTCGAGACGAGGAAGGACATCGTT
CGTACTGAAGATGATACCAAAAACTCTTGGTGGTCACGTTTCCAGTCAACCGTGCAAAAG
TTCGGTTTCGACAACGTGCGTGTCACCGAGTCCCTCAACCCCAGTTCACCTACATTTGTC
GCGTTTTACACCAAATGTGTTGAGGTCCATCAGGTCCTCTCGGGCAGCGACTGCAAAATG
GTCCTTAAGTACAACGGACTGTACTACCCAATGGAGATCATATCTCAAGGACCTACAGTT
GCGGCCTGGACAGACGGACGTTCATACTGTGCCTTACCCGCTGGATGCCCGTCATTCAAG
GTGACAACCAGAAACCCGGCGACGGTGGCTATCTCATACCCAAGAGATGGTCAACCTGCT
TTCAGCGTCGGTCAATCTTCGAAATACACCCAACAAGGTGAAGTTCTTTACAACATGTCT
ACGGCAAAGGGCGATTCCGGACTGCCCGTCTTGGACACCTCCGGGCGGGTCGTGTGCCTT
CACAATGGTCACGCCCCGGCCACCGAGTCTGCTTACAACGGTATGGGAATCGTGCCGGTG
GACCCCGTCGGATCGTACACCATGCGATCTGTGATGCCCCTCAAGGAGCAGAAGCTTGAG
ACCGAGCGTAAGTCTTGGGTCACGCCGATCAAGGCTGTGGGCCATCTCTTCGCTCATCAG
GTTTTTCAGTAAGCCCTTCCCAGATGACCTTTGAGTACCGCAGTGCCACACCAACAAAAG
GCACATTTGCGGGACTCTCCTCTGAGGTCATCTGGACTAAGAAGGGCGGCTGGAGAACTC
TCTTGACGGAACGACCGAAGAAGTTCCGACTTGAGAATCTCAAGTACGTGGGAGGTTTGC
CTTACTACGTGAGGCAAAAGGAACAGACACGCAGGGTGAATGAGGTCGTGCACGAGCTTA
TCAGGAAGCACTTTCCAGATTTTGACGCCAATGCATACGGTAAGTCACGACCCACCTTCG
AAAAGATTGAAAATGAGGCTAAGCGGATAAACCCGGCTCCCACATGGGAACTGGATGAAG
AAGCTAGGGCCTTTGCAGATTACACACTCTACAACTACATCCTGCCATTGGGCCCCTTCC
CTCTGGCGACCGACGATGAAATCATCAACGAGCTCTTGCCTAAGTCTCCGGGACTATGGT
ATGAGGATCTGCATAAAAACGCAACCAAGCGCACCACTTGTGGTGTTGCTGTCAAACAAG
CCCGAGGACTGCTCGATGGCACATTAAAAGACTTTGAGGATTCGGACCCGCTGTTTAAAT
TCGTCGGCAAGACGGAAATACTCTCAGCGGTGAAGATTCAAGAGAAAAGGAACCGGAACT
ACGCCGTCCCCCCTCTCTATCTCTATATTTTGGAGATCATTTTCTTTCACCATCTAAGTA
ATGCATACAAAGAGCATGACAAAGGTTACACACTGACACTCCAACACGGCGGACTCTATG
AACTGTTCGCACAAATGGACAAATACGGAACCGTTATCTCAGACGACAAGACTGGTTTCG
ACTTACGACAACAATGGATCGTTCAGAAGAGCGTGGCACGAATCTGTGCCTACACCATTG
ACATGTCTGACAAACATCACATGATGTACATGCATCTCGCACACCAATTCTGCGCAAAGA
ACATTCTCATGCCAGACGGAAGCGTGTTCTACACGCCGTACCATCACGCCTCAGGGCGAT
ACGTGACGACTATGAAAGGTTCGTTATTCCACAGGTGGGAACAAGCGTATGTATTTTATG
TCGTGATGCAACAACAAACACCCGATCGAGCAACTGAGGGCCTGGGCCGCGAAGCCTTAA
GGCTCCTATTCGAACGGATGAATACGCGAATTGCGTCAGACGACGTTTTGCAGGGGGTCC
CGAATGACCCTATCTACGAACCGTGGTTGGACCAAGAGAAGCGCACATCAGTGTGGGAAA
CGCTGGTGCCGATTAAACCCGGCAGCGCTCTTCAAACCCACTCCTCGACCGGACACTTCT
ACCTAGGATGGACGGTCGAAAATGGCAAAGTGAAGCACAACTCGCGGACAAAACTTCTCG
CAAAGCTACTCTACTCTAGTATCAAAGACAAACAGGGAGTCATCACAGGTTTGGTTCATA
CCTCACCCTACGACACCCCAATGTGTGATTTCTTGAGAGAGCTGGCGAACAAATGGGGAG
TGGAGTTCTGTGGGCAACGCATTGCCCAGCTGTTATGGAGACGACGCTTGGACACGGCCC
CTTTTTTGCCGAGTCCCCCTACCCCGGAGGGCGTGATTATGATGCACCCGGTGGACAATG
TGCACAAAATTCGCAACAGTGTTACAATGGCACGTAAAAAGCAAGTAAAGCAAGCCGCAA
AGCGCAAGGAGGAGCGTAAGATCGAGCGTGCAGTGGAAAGACGCCAAGCCGTCCACCCAC
TCCGCCATCTTATGACTCCCAAGCTTCACGGCAACACATACAAGTATCTCGAGACTCTCG
TCGATCCTTACAACACACCTGCTGGTGCGTCTACGCCCGACAAGGTTGTGAGGAAGAGCG
CGAGGTACAAATCCTACGTCAAGACAACAATGACAACCGGGACCGGAGGTTTCGGTTTTT
GTCTTTTCAACCCATACTCCGCTGCGTGGGGCGATCAGTTCGCTGTCGCCACCTCGAACG
GGACCTACGCTGGAACGCTGACGGACCCGAACAGCGCCACTGCTGGAGTCAACAACTACA
AGTCAACAGCGCCCCTTAATTGGGCCGACGCTAGCGCAAACGGCTACTCATGCCGTGTTG
TTGGCGCTGGAATTCGAATCTTGAACAATACTGCCCTCCTGGATATGGGAGGTTCAGTTA
CAGGGATTCGTGAGCCAGGAAACCAGTATCTCACAGGATATACCTTCGCAGACATTCTGA
ACTACAACCAATGCCACGTGATTCGGCCAGAGGCCGGCAAATGGATTCATGTGGAATGGG
CTCCGACCGGAAATATCGGCCAGGAGACTGGAGACCAGTTCGAATTCGACGATGAGAACC
CTGCTTCTGCAGCCAGCATGATTGGTGGGCAGATCGCAATCGTCGCGACCTCCGCGGGAC
CTGCTGCAAATGCGCAGTCCTACGATGTGGAGGCCGTGGTCATATACGAAATGCTAGGGG
AGAAACTGCCCCTACAGGATGTTCACGTCGACCCGCTGGGCCAAGCAGCGTGTCTCGAAG
TTATTGCTGAGCTAGATGGAAGCGCATCTGACAAGTGGATCACACAAGTCGGTAAAGTTG
CGAAGAGAACTGGCAAAGCTCTGAGGGAAATCAGTGCAGCGGCGATGCCCGCAGCCCTCA
TGTTGGGGCTCGCATGAATTGGTGCGAGGGTATCAACGCGGTGTGGATGGATGTATGTAT
TTTTATTTGCGGTATAAGGGACCCAACCCCTCTGTACATCTCGGAGTCTGATGATGACGA
GATTAATAGGCTATTCCATTAGCCTCAAAATGGTTAAGAGAAGTCCCAGAGACTCCTCGG
CGCACG
```

Two ORFs were found, which together make up ~95% of the sequence. Domains were checked with InterPro.
```
>ORF1-peptidase_containing
MILLTFIAFCAAAVGTVIVVIFYPKREKHTDEDKKVMENYEDLDKFMNEG
EPFYYKGIKFNPEEDKLDDFSAQAEAAKLASDLKDNVFPSIEGQSTLSDT
IMMLISFVVMVSSCVSIGAARKVHEAARSLATAGTSLIAFFRAGNNWLGI
TKLFRAQSEFKVLRDLTRLQILMVALLTGILAAAAGFAATRKYKSSPKVK
RTKETKPRGFRVMDPMEEVSFLQAVWEEERFRSLAEGGSWADEEDNLAED
YANFRREWEDKNVRELDEVDLEDYSPGASKSKASRKGNKRIHETAKVTRK
KKVSKATQTDKVAEEPEKDDTKESKTKTETKKKPSTKSESAKNTTKTAKT
QTDKGKDNEKPRVQVRIAVKLTGETQHIHVDPTNQLEYLWVVGDKYIACP
NKTVTLADTNQTFKFSNTTPEVRTPELSVYLKAPTRVSYNGKVYYTPLET
RKDIVRTEDDTKNSWWSRFQSTVQKFGFDNVRVTESLNPSSPTFVAFYTK
CVEVHQVLSGSDCKMVLKYNGLYYPMEIISQGPTVAAWTDGRSYCALPAG
CPSFKVTTRNPATVAISYPRDGQPAFSVGQSSKYTQQGEVLYNMSTAKGD
SGLPVLDTSGRVVCLHNGHAPATESAYNGMGIVPVDPVGSYTMRSVMPLK
EQKLETERKSWVTPIKAVGHLFAHQVFQ

>ORF2-rdrp_containing
MTFEYRSATPTKGTFAGLSSEVIWTKKGGWRTLLTERPKKFRLENLKYVG
GLPYYVRQKEQTRRVNEVVHELIRKHFPDFDANAYGKSRPTFEKIENEAK
RINPAPTWELDEEARAFADYTLYNYILPLGPFPLATDDEIINELLPKSPG
LWYEDLHKNATKRTTCGVAVKQARGLLDGTLKDFEDSDPLFKFVGKTEIL
SAVKIQEKRNRNYAVPPLYLYILEIIFFHHLSNAYKEHDKGYTLTLQHGG
LYELFAQMDKYGTVISDDKTGFDLRQQWIVQKSVARICAYTIDMSDKHHM
MYMHLAHQFCAKNILMPDGSVFYTPYHHASGRYVTTMKGSLFHRWEQAYV
FYVVMQQQTPDRATEGLGREALRLLFERMNTRIASDDVLQGVPNDPIYEP
WLDQEKRTSVWETLVPIKPGSALQTHSSTGHFYLGWTVENGKVKHNSRTK
LLAKLLYSSIKDKQGVITGLVHTSPYDTPMCDFLRELANKWGVEFCGQRI
AQLLWRRRLDTAPFLPSPPTPEGVIMMHPVDNVHKIRNSVTMARKKQVKQ
AAKRKEERKIERAVERRQAVHPLRHLMTPKLHGNTYKYLETLVDPYNTPA
GASTPDKVVRKSARYKSYVKTTMTTGTGGFGFCLFNPYSAAWGDQFAVAT
SNGTYAGTLTDPNSATAGVNNYKSTAPLNWADASANGYSCRVVGAGIRIL
NNTALLDMGGSVTGIREPGNQYLTGYTFADILNYNQCHVIRPEAGKWIHV
EWAPTGNIGQETGDQFEFDDENPASAASMIGGQIAIVATSAGPAANAQSY
DVEAVVIYEMLGEKLPLQDVHVDPLGQAACLEVIAELDGSASDKWITQVG
KVAKRTGKALREISAAAMPAALMLGLA
```
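
The coding fraction quoted above is simple arithmetic: three nucleotides per translated residue, summed over the ORFs and divided by the contig length. This is a small sketch of that calculation, ignoring stop codons. The helper name `coding_fraction` and the input file name in the usage line are hypothetical.

```shell
# coding_fraction CONTIG_LENGTH_NT < orfs.faa
# Sums residues over all protein records on stdin and prints the percentage
# of the contig they encode (3 nt per residue, stop codons not counted).
coding_fraction() {
  awk -v len="$1" '
    /^>/ { next }                       # skip FASTA headers
         { gsub(/[ \t\r]/, ""); aa += length($0) }
    END  { printf "%.1f\n", 100 * 3 * aa / len }
  '
}

# Usage, with the contig length taken from the NODE_71 header above:
#   coding_fraction 4866 < orfs.faa
```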

```
 Line 176210 >ERR2737793;Type=NovelGenus_Dicistro;rdrp0=rdrp2.Dicistroviridae.Israeli_acute_paralysis_virus:AQY03950.1(63.2%);Dicistro=Dicistroviridae.ORF1.
CIDVVASPTKTALRPSLLYGKLEEVKTRPSALFASEFNIKYKNLEKCAGNIPYIEQELIDNACFYVKEKWLNNINLELAR
VLTYEEAISGRADVSEYMGPIHRQSSPGYPWIKSRKSNFPGKTGWFGNDEVYLYDAEVKNVVEHRINQAKLGIRTPTLWT
DTLKDERRSHEKVLAYKTRVFSNGPMDFNIAFRMYYLGFIAHLMENRIDNEVSIGTNVYSRDWTKTAHKLMEKGEKVIAG
DFSGFDGSLHTAMMLKFVEIANEFYDDGDENALVRLVLMLEIINSVHICDRSVYQMTHSQPSGNPATTPLNCLINSLGLR
MCFSYLAKKYNKPYTLKDFEKYVSIVSYGDDNVINFADEVAEWYNMETLTMAFKVFGFTYTDELKGKNGEVPKWRKLDQV
AYLKRKFRKREDFPIYDAPLDIETIMEMPNWCRETVDVFEGTKINAEIAIMELHMHDKETFNIKSNLIKRQF
```



### Performing traceback à la Anton

```
Anton Korobeynikov 
well… gene_clusters might lack the corresponding sequence if this is a PathRacer alignment
as far as I can see there is lots of things in this SRR
but it’s NODE_4658_length_1308_cov_1.955466 in SRR5214089.rnaviralspades.scaffolds.fasta


This is the path:
Check the alignment:

>SRR5214089.Score=142.92|Edges=5275154'|Position=3|Alignment=2M3D19M9D11M1D9M7D15M12D50M7I120M1I23M1D46M3D25M2I21M1D22M2I51M1D9M|ScaffoldSuperpaths=5275154':9917/0|OriginScaffoldPath=
MMSGLKKCGVTPALLNDDDLKACVNDIARTLRTNYSRIDESVYKRVLSYEEAVQGANDEYMTAVNRLTSPGMPYSLMREGKVGKTKWLGSNENFDFVSPDALEMRNDVAKLIDDCRNGIIRGVYCSDTLKDEKRDLAKVAVGKTRVFSACPVHFVLAFRRYFLGFSAWCMHNRIDNEVAVGTNQYSLDWHKIAIRLQKKGEAVIAGDFSNFDGSLNAQVLWAILDIVNEWYDDGEENVKIRTGLWAHVVHSTHIFEDNVYMWTHSQPSGNPFTVIINSIYNSIIMRMAWQIVMKEQGMAGMDQFQKYVSMISYGDDNCLNISHSIIEQFNQQTIADALSTIAHTYTDEGKTGEIVKARKLNDVNFLKRGFMFSSELQRYVAPLEERVIYEMLNWTRNTVDPDEILKTNVETAAREMALHGKVKFDNFCKEIRQIE


So, it’s on the edge 5275154 (and this edge is a single scaffold)


2. Check  SRR5214089.rnaviralspades.scaffolds.paths and look for the mapping between contig/scaffold name and set of edges:
NODE_4658_length_1308_cov_1.955466
5275154+
NODE_4658_length_1308_cov_1.955466'
5275154-


3. We done

>NODE_4658_length_1308_cov_1.955466
TTCAATTTGGCGTATTTCTTTACAGAAATTGTCGAATTTAACCTTGCCGTGTAACGCCAT
TTCACGAGCTGCAGTTTCCACATTTGTTTTAAGAATTTCGTCAGGGTCAACAGTATTTCT
AGTCCAATTTAACATTTCATAAATCACTCTTTCTTCCAATGGGGCAACATATCTTTGCAA
TTCCGACGAGAACATGAAGCCTCGCTTCAAAAAATTTACATCATTCAATTTCCTTGCTTT
AACAATTTCACCAGTTTTTCCTTCATCTGTGTAAGTATGGGCGATTGTAGATAACGCATC
CGCAATTGTTTGTTGATTGAATTGTTCAATAATGGAATGTGAAATATTCAAACAGTTATC
ATCCCCATAACTAATCATTGAGACATATTTTTGAAATTGATCCATTCCTGCCATCCCTTG
TTCTTTCATAACAATTTGCCATGCCATTCTCATTATTATTGAATTGTAAATTGAATTGAT
AATTACGGTAAACGGATTTCCAGAAGGCTGACTATGAGTCCACATATAAACATTATCCTC
AAAAATATGAGTAGAATGTACAACATGAGCCCACAGACCTGTTCTTATTTTAACATTTTC
TTCTCCATCATCATACCACTCGTTTACGATGTCTAAAATAGCCCACAATACTTGCGCATT
CAAGGATCCATCAAAATTCGAAAAATCTCCCGCAATCACTGCTTCACCTTTCTTTTGCAA
ACGTATAGCTATTTTGTGCCAATCCAATGAGTATTGATTAGTACCAACTGCCACCTCATT
GTCGATTCGATTATGCATACACCAAGCTGAAAAACCCAGAAAATATCGTCTAAAAGCCAA
AACAAAATGTACAGGACAGGCAGAAAATACACGTGTTTTTCCAACAGCAACCTTTGCCAA
ATCTCGTTTTTCATCCTTAAGAGTATCCGAACAATACACACCACGAATTATACCATTTCT
ACAATCGTCAATTAACTTTGCCACATCATTCCTCATTTCAAGTGCGTCTGGCGACACGAA
ATCAAAATTTTCATTGCTACCAAGCCATTTTGTTTTTCCAACTTTACCTTCTCTCATTAA
AGAATATGGCATTCCTGGAGAAGTTAATCGATTAACTGCTGTCATATATTCATCATTTGC
TCCTTGCACTGCTTCTTCATAACTCAACACGCGCTTATAAACTGATTCGTCAATTCGTGA
GTAATTAGTTCTTAGAGTTCTAGCTATATCATTCACACATGCTTTTAAGTCATCATCGTT
CAAAAGTGCTGGTGTTACTCCACATTTCTTGAGTCCACTCATCATCGG


For gene_clusters, check SRR5214089.rnaviralspades.bgc_statistics.txt


we’re having there:
BGC subgraph 491
# domains in the component - 1
# Strong/weak edges in the component - 0/0
BGC candidate 1
Dicistroviridae.ORF2.MF189972.1
Predicted type: Custom
Domain cordinates:
3 1235
Edge order: 
5275154-
Path is linear

so, it will be NODE_499_length_1308_cluster_491_candidate_1_domains_1 there (“cluster 491” is the key)
```
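
The traceback steps above can be sketched as two small lookups: edge ID to scaffold name via the `.paths` file, then scaffold name to sequence via the scaffolds FASTA. This assumes the `.paths` layout shown in the chat (a `NODE_...` name line followed by lines of edge IDs with a `+`/`-` orientation suffix, comma-separated when a path has several edges). The helper names `scaffolds_for_edge` and `fasta_record` are hypothetical.

```shell
# scaffolds_for_edge EDGE_ID PATHS_FILE
# Prints every scaffold name whose path contains EDGE_ID, in either orientation.
scaffolds_for_edge() {
  awk -v edge="$1" '
    /^NODE_/ { name = $0; next }            # scaffold name line
    {
      n = split($0, parts, /[,;]/)          # edge IDs on a path line
      for (i = 1; i <= n; i++) {
        id = parts[i]
        sub(/[+-]$/, "", id)                # drop the orientation suffix
        if (id == edge) { print name; break }
      }
    }' "$2"
}

# fasta_record NAME FASTA_FILE
# Prints the FASTA record whose header is exactly ">NAME".
fasta_record() {
  awk -v want=">$1" '
    /^>/ { keep = ($0 == want) }
    keep' "$2"
}

# Usage with the files named in the chat:
#   scaffolds_for_edge 5275154 SRR5214089.rnaviralspades.scaffolds.paths
#   fasta_record NODE_4658_length_1308_cov_1.955466 SRR5214089.rnaviralspades.scaffolds.fasta
```

The first lookup returns both the forward name and the primed (reverse-complement) name when an edge appears in both orientations, as in the chat example.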