# Mitovirus / Narnaviridae Set-up
```
Lead     : ababaian
Issue    : NA
start    : 2020 10 12
complete : 2020 10 26
files    : ~/serratus/notebook/201012_ab/
s3 files : s3://serratus-public/notebook/201012_ab/
```

## Introduction

I received an email from Adam Bergman and Samantha Lewis at UC Berkeley. In brief, they are interested in a potential collaboration to apply Serratus to the search for novel Mitoviruses within Metazoa, specifically human. Mitoviruses are a class of viruses characterized as infecting the mitochondria of fungi.

[Mitoviruses](https://viralzone.expasy.org/304) are short (2-3 kb) ssRNA+ viruses encoding an RdRp. They fall within the family Narnaviridae.

There are several routes by which this question can be explored from a technical standpoint. The greatest 'complication' arises from the differences between the standard genetic code and the vertebrate and fungal mitochondrial genetic codes.

- [Mitovirus Taxa: 186768](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=186768)
- [Narnavirus Taxa: 186766](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Tree&id=186766)
- [Lenarviricota Phylum: 2732407](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2732407)

**Standard Genetic Code (1)**
```
   AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M------**--*----M---------------M----------------------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
```

**Vertebrate Mitochondria (2)**
```
        Code 2          Standard

 AGA    Ter  *          Arg  R
 AGG    Ter  *          Arg  R
 AUA    Met  M          Ile  I
 UGA    Trp  W          Ter  *
```

**Fungal Mitochondria (4)**
```
        Code 4          Standard

 UGA    Trp  W          Ter  *
```
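
The differences tabulated above can be recomputed from the 64-letter amino-acid strings that NCBI publishes for each code. This is a minimal sketch: only the code 1 string appears in this notebook, so the code 2 and code 4 strings are assumed to be copied correctly from the NCBI genetic code tables, and `code_diff` is a helper name made up for this example. Codons are printed with T rather than U.

```bash
# 64-letter amino-acid strings in NCBI order (Base1, Base2, Base3 each cycle through T, C, A, G).
STD='FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'   # code 1 (from this notebook)
VMT='FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG'   # code 2 (assumed, from NCBI)
MOLD='FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'  # code 4 (assumed, from NCBI)

# Print every codon whose amino acid differs between two tables.
# usage: code_diff <table A> <table B>
code_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split("T C A G", base, " ")
    for (i = 0; i < 64; i++) {
      codon = base[int(i / 16) + 1] base[int(i / 4) % 4 + 1] base[i % 4 + 1]
      x = substr(a, i + 1, 1)
      y = substr(b, i + 1, 1)
      if (x != y) print codon, x, y
    }
  }'
}

code_diff "$STD" "$VMT"    # columns: codon, standard, code 2
code_diff "$STD" "$MOLD"   # columns: codon, standard, code 4
```

The first call lists the four code 2 differences (TGA, ATA, AGA, AGG) and the second lists the single code 4 difference (TGA), matching the two tables above.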

**Genetic Code**
![Genetic code](201012_ab/genetic_code.jpg)


### Spearfish Method

The direct method will be to use the RdRp protein sequences from the known Mitoviruses (gc 4) and the Narnavirus outgroup (gc standard), each translated with its respective genetic code. These protein sequences can then be searched with `diamond` using a code 2 (vertebrate MT) translation of the reads (the `--query-gencode` option in `diamond`).

This is the most direct and specific means of searching for a Mito/Narna protein sequence in vertebrate samples. This search is not compatible with any other sequences (i.e. protref5), since the altered translation in diamond will render other standard-code RdRp translations incorrect.

The advantage of including other (non-narna) sequences, though, is that mitochondrially encoded RdRp from other families may also result in a hit.

### Trolling Method

An indirect method is to extract the genetic-code-appropriate CDS sequence from each Mitovirus/Narnavirus and force translation via the standard code. Assuming a hypothetical vertebrate Mitovirus would use the vertebrate mtDNA code, the following changes would apply:

```
        Code 2          Standard

 AGA    Ter  *          Arg  R
 AGG    Ter  *          Arg  R
 AUA    Met  M          Ile  I
 UGA    Trp  W          Ter  *
```

The TER --> ARG changes are perfectly acceptable; these will be defined by CDS boundaries.

The AUA M --> I substitution scores `+1` in `BLOSUM62` instead of the `+5` of an M match, so it won't significantly alter the score (still positive, not a loss).

The primary alteration is that UGA (W) will be interpreted as a stop codon by the standard code, which I can manually set to \* in the reference (with `sed` in practice, since `grep` can only find the positions).
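
To make these substitutions concrete, the sketch below translates one 60-nt segment of the HE586976.1 CDS (pseudo-read `r3`, recorded further down in this notebook) under the three codes. It is an illustration only, not the EMBOSS `transeq` implementation: `translate` is a made-up helper, and the code 2 and code 4 strings are assumed to be copied correctly from the NCBI genetic code tables.

```bash
STD='FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'   # code 1
VMT='FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG'   # code 2 (assumed, from NCBI)
MOLD='FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'  # code 4 (assumed, from NCBI)

# Translate frame 1 of a nucleotide sequence with a 64-letter table.
# usage: translate <table> <sequence>
translate() {
  awk -v tbl="$1" -v seq="$2" 'BEGIN {
    seq = toupper(seq)
    gsub(/U/, "T", seq)                       # accept RNA input as well
    out = ""
    for (i = 1; i + 2 <= length(seq); i += 3) {
      idx = 0
      known = 1
      for (j = 0; j < 3; j++) {
        p = index("TCAG", substr(seq, i + j, 1))
        if (p == 0) known = 0                 # ambiguous base: emit X
        idx = idx * 4 + p - 1
      }
      out = out (known ? substr(tbl, idx + 1, 1) : "X")
    }
    print out
  }'
}

R3='TGAGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT'
translate "$STD"  "$R3"   # standard code: both UGA codons become *
translate "$MOLD" "$R3"   # code 4: UGA read as W
translate "$VMT"  "$R3"   # code 2: UGA read as W, AGA read as *
```

Under the standard code both UGA codons come out as `*`, under code 4 they are read as W, and under code 2 the AGA near the end is additionally read as a stop where the standard-code reference holds an R.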

There is a `diamond` option, `-l 1`, which should make it possible to use these translated sequences. From the `diamond` docs:


>`--min-orf/-l #`
>
>Ignore translated sequences that do not contain an open reading frame of at least this length. By default this feature is disabled for sequences of length below 30, set to 20 for sequences of length below 100, and set to 40 otherwise. Setting this option to 1 will disable this feature.

Overall this design may lead to some false-positives where mt-encoded stop codons are read as "R" by diamond and some false-negatives where the stop-codon 

## Materials and Methods


### Reference Sequences (Nucleotide)

#### mitovirus.fa

- Query: `txid186768[Organism:exp]`
- Date: `2020-10-12`
- Results: `2364`

Downloaded the complete sequences as `mitovirus.fa`, as well as the "nucleotide CDS" as `mitovirus.cds.fa`.


#### lenarviricota.fa (call it narnavirus for simplicity)

- Query: `txid2732407[Organism:exp] NOT txid186768[Organism:exp]`
- Date: `2020-10-12`
- Results: `4878`


In [5]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201012_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201012_ab/'

# date and version
date

Sun Oct 25 15:51:15 PDT 2020


In [4]:
# Mitovirus
# EMBOSS 6.6.0
# Retrieve CDS sequence from mitovirus from nuccore

NAME='mitovirus'
seqkit sort -l -r mitovirus.cds.fa > mitovirus.cds.1.fa

# Remove hypothetical protein cds
seqkit grep -n -v -r -p "hypo" mitovirus.cds.1.fa \
  > mitovirus.cds.2.fa

# Save full headers
grep ">" mitovirus.cds.2.fa > mitovirus.cds.headers

# Reduce header to accession name
sed 's/lcl|//g' mitovirus.cds.2.fa \
  | sed 's/_cds.*//g' \
  > mitovirus.cds.3.fa
  
# Remove duplicates (in effect, keep only the longest ORF from each accession:
# records were sorted longest-first above and rmdup keeps the first occurrence)
seqkit rmdup -n mitovirus.cds.3.fa >  mitovirus.rdrp.fa

samtools faidx mitovirus.rdrp.fa

# Final: Mitovirus.RdRP.MN036117.1

[INFO][0m read sequences ...
[INFO][0m 2200 sequences loaded
[INFO][0m sorting ...
[INFO][0m output ...
[INFO][0m 45 duplicated records removed


In [None]:
# EMBOSS 6.6.0
# Translate sequence using standard genetic code
NAME='mitovirus'

transeq \
  -sequence mitovirus.rdrp.fa \
  -frame 1 \
  -table 0 \
  -methionine \
  -outseq mitovirus.rdrp.aa.fa
  
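# Strip the frame suffix that transeq appends to each ID (e.g. "_1").
# NOTE: the pattern is unanchored and matches any "x_y" triplet, so
# underscores inside accessions (e.g. RefSeq "NC_...") are hit as well.
sed -i 's/._.//g' mitovirus.rdrp.aa.fa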
sed -i 's/>/>Mitovirus.rdrp./g' mitovirus.rdrp.aa.fa

In [None]:
# Lenarviricota.cds.fa
# EMBOSS 6.6.0
# Retrieve CDS sequence from lenarviricota from nuccore
cd $WORK

NAME='lenarviricota'
seqkit sort -l -r $NAME.cds.fa > $NAME.cds.1.fa

# Remove hypothetical protein cds
seqkit grep -n -v -r -p "hypo" $NAME.cds.1.fa \
  > $NAME.cds.2.fa

# Save full headers
grep ">" $NAME.cds.2.fa > $NAME.cds.headers

# Reduce header to accession name
sed 's/lcl|//g' $NAME.cds.2.fa \
  | sed 's/_cds.*//g' \
  > $NAME.cds.3.fa
  
# Remove duplicates (in effect, keep only the longest ORF from each accession)
seqkit rmdup -n $NAME.cds.3.fa >  $NAME.rdrp.fa

samtools faidx $NAME.rdrp.fa

In [None]:
transeq \
  -sequence $NAME.rdrp.fa \
  -frame 1 \
  -table 0 \
  -methionine \
  -outseq $NAME.rdrp.aa.fa
  
sed -i 's/._.//g' $NAME.rdrp.aa.fa
sed -i 's/>/>Narnavirus.rdrp./g' $NAME.rdrp.aa.fa

In [None]:
cd $WORK

# Clean-up folders
mkdir mito
mv mitov* mito/

mkdir lena
mv lenar* lena/

In [14]:
# Create protein reference at 95%
NAME='narnavirus'

# Sort priority
# Merge and cluster for protref add-on
cat mito/mitovirus.rdrp.aa.fa lena/lenarviricota.rdrp.aa.fa \
  > $NAME.aa.fa

# Prune to 95% amino-acid identity
usearch -cluster_smallmem $NAME.aa.fa \
   -id 0.95 \
   -sortedby other \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $NAME.id95.uc \
   -centroids $NAME.id95.fa
   
grep "^>" $NAME.id95.fa > $NAME.id95.headers


usearch v11.0.667_i86linux32, 4.0Gb RAM (16.3Gb total), 4 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:02 132Mb   100.0% 1948 clusters, max size 444, avg 1.8
00:02 132Mb   100.0% Writing centroids to narnavirus.id95.fa
                                                            
      Seqs  3422
  Clusters  1948
  Max size  444
  Avg size  1.8
  Min size  1
Singletons  1620, 47.3% of seqs, 83.2% of clusters
   Max mem  132Mb
      Time  2.00s
Throughput  1711.0 seqs/sec.
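
The summary above can be cross-checked from the `.uc` file written by `-uc`. The sketch below assumes the tab-separated UC layout described in the USEARCH documentation (record type in column 1, cluster size in column 3 of `C` records); `uc_summary` is a made-up helper and the toy records are invented.

```bash
# Summarise a USEARCH .uc file: S/H records are input sequences, C records are clusters.
# usage: uc_summary <file.uc>
uc_summary() {
  awk -F'\t' '
    $1 == "S" || $1 == "H" { seqs++ }
    $1 == "C" {
      clusters++
      if ($3 == 1) singletons++
      if ($3 + 0 > max) max = $3 + 0
    }
    END {
      if (clusters == 0) { print "no cluster records found"; exit 1 }
      printf "seqs=%d clusters=%d max=%d avg=%.1f singletons=%d\n",
             seqs, clusters, max, seqs / clusters, singletons
    }
  ' "$1"
}

# Toy .uc file (invented labels): three sequences in two clusters.
toy_uc=$(mktemp)
printf 'S\t0\t100\t*\t*\t*\t*\t*\tseqA\t*\n'         >  "$toy_uc"
printf 'H\t0\t90\t97.0\t+\t0\t0\t90M\tseqB\tseqA\n'  >> "$toy_uc"
printf 'S\t1\t80\t*\t*\t*\t*\t*\tseqC\t*\n'          >> "$toy_uc"
printf 'C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*\n'           >> "$toy_uc"
printf 'C\t1\t1\t*\t*\t*\t*\t*\tseqC\t*\n'           >> "$toy_uc"
uc_summary "$toy_uc"
```

Run against `narnavirus.id95.uc`, the counts should agree with the usearch report (3422 sequences in 1948 clusters gives the reported average of 1.8).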



In [15]:
aws s3 sync ./ $S3_WORK

upload: ./genetic_code.jpg to s3://serratus-public/notebook/201012_ab/genetic_code.jpg
upload: lena/lenarviricota.rdrp.fa.fai to s3://serratus-public/notebook/201012_ab/lena/lenarviricota.rdrp.fa.fai
upload: lena/lenarviricota.cds.headers to s3://serratus-public/notebook/201012_ab/lena/lenarviricota.cds.headers
upload: mito/mitovirus.cds.headers to s3://serratus-public/notebook/201012_ab/mito/mitovirus.cds.headers
upload: lena/lenarviricota.rdrp.aa.fa to s3://serratus-public/notebook/201012_ab/lena/lenarviricota.rdrp.aa.fa
upload: mito/mitovirus.cds.1.fa to s3://serratus-public/notebook/201012_ab/mito/mitovirus.cds.1.fa
upload: mito/mitovirus.cds.2.fa to s3://serratus-public/notebook/201012_ab/mito/mitovirus.cds.2.fa
upload: mito/mitovirus.rdrp.aa.fa to s3://serratus-public/notebook/201012_ab/mito/mitovirus.rdrp.aa.fa
upload: mito/mitovirus.cds.3.fa to s3://serratus-public/notebook/201012_ab/mito/mitovirus.cds.3.fa
upload: mito/mitovirus.rdrp.fa.fai to s3://serratus-public/notebook/201

In [None]:
# Fire up EC2 Instance
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

# Download latest serratus repo
git clone -b diamond-dev https://github.com/ababaian/serratus.git; cd serratus/containers

# If you want to upload containers to your repository, include this.
export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

# Build all containers and upload them to the Docker Hub repo (if available)
./build_containers.sh

# Launch aligner container
sudo docker run --rm --entrypoint /bin/bash -it serratus-align:latest


Nucleotide Sequence (CDS)
```
>HE586976.1
ATTGTAGCTATGTTAGATTATACAACACAGCTATTTCTTCGACCTATACATTCTGACTTG
TTTAAACTTCTAAAAAAGTTACCACAAGATAGAACTTTTACCCAAAATCCATTAAATGAT
TGAGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT
CCCATTAGTTTACAACGGCGGTTATTACTATATATATATAGTGATCCCGAAATTGCAAAT
TCTTGGCAAAATCTATTAGTACATAGAGAATATGCCCGTAATGGGTTAAGTCCAATAAAA
TATTCTGTTGGACAGCCCATGGGAGCATATTCATCCTGACCTGCTTTCACATTATCTCAT
CACCTTGTAGTTCATTGATGTGCACATTTATGCAACATCAATAAATTCAAGGATTATATA
ATTCTTGGTGACGATATTGTTATACATAACGATAACGTTGCTAAAAAATATATTGAAATA
ATGGGAAAATTAGGAGTGGGTCTATCAGATAGTAAAACACATGTATCAAAAG
```
Protein Sequence
```
>Mitovirus.rdrp.HE586976
IVAMLDYTTQLFLRPIHSDLFKLLKKLPQDRTFTQNPLND*EDNEHSF*SIDLTAATDRF
PISLQRRLLLYIYSDPEIANSWQNLLVHREYARNGLSPIKYSVGQPMGAYSS*PAFTLSH
HLVVH*CAHLCNINKFKDYIILGDDIVIHNDNVAKKYIEIMGKLGVGLSDSKTHVSKX
```

Pseudo-reads
```
>r1
ATTGTAGCTATGTTAGATTATACAACACAGCTATTTCTTCGACCTATACATTCTGACTTG
>r2
TTTAAACTTCTAAAAAAGTTACCACAAGATAGAACTTTTACCCAAAATCCATTAAATGAT
>r3
TGAGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT
>r3_firststop
TGTGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT
>r5
CCCATTAGTTTACAACGGCGGTTATTACTATATATATATAGTGATCCCGAAATTGCAAAT
>r6
TCTTGGCAAAATCTATTAGTACATAGAGAATATGCCCGTAATGGGTTAAGTCCAATAAAA
>r7
TATTCTGTTGGACAGCCCATGGGAGCATATTCATCCTGACCTGCTTTCACATTATCTCAT
>r8
CACCTTGTAGTTCATTGATGTGCACATTTATGCAACATCAATAAATTCAAGGATTATATA
>r9
ATTCTTGGTGACGATATTGTTATACATAACGATAACGTTGCTAAAAAATATATTGAAATA
```

Diamond Test
```
r1      Mitovirus.rdrp.HE586971 1       60      60      1       20      178     100.0   7.0e-07 20      20M     +       ATTGTAGCTATGTTAGATTATACAACACAGCTATTTCTTCGACCTATACATTCTGACTTG    IVAMLDYTTQLFLRPIHSDL
r2      Mitovirus.rdrp.AF534641 1       60      60      266     285     742     100.0   3.1e-07 20      20M     +       TTTAAACTTCTAAAAAAGTTACCACAAGATAGAACTTTTACCCAAAATCCATTAAATGAT    FKLLKKLPQDRTFTQNPLND
r3      Mitovirus.rdrp.AF534641 1       60      60      286     305     742     100.0   1.6e-06 20      20M     +       TGAGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT    *EDNEHSF*SIDLTAATDRF
r5      Mitovirus.rdrp.AF534641 1       60      60      306     325     742     100.0   9.1e-07 20      20M     +       CCCATTAGTTTACAACGGCGGTTATTACTATATATATATAGTGATCCCGAAATTGCAAAT    PISLQRRLLLYIYSDPEIAN
r6      Mitovirus.rdrp.HE586971 1       60      60      81      100     178     100.0   6.3e-08 20      20M     +       TCTTGGCAAAATCTATTAGTACATAGAGAATATGCCCGTAATGGGTTAAGTCCAATAAAA    SWQNLLVHREYARNGLSPIK
r7      Mitovirus.rdrp.AF534641 1       60      60      346     365     742     100.0   4.1e-07 20      20M     +       TATTCTGTTGGACAGCCCATGGGAGCATATTCATCCTGACCTGCTTTCACATTATCTCAT    YSVGQPMGAYSS*PAFTLSH
r8      Mitovirus.rdrp.AF534641 1       60      60      366     385     742     100.0   2.2e-08 20      20M     +       CACCTTGTAGTTCATTGATGTGCACATTTATGCAACATCAATAAATTCAAGGATTATATA    HLVVH*CAHLCNINKFKDYI
r9      Mitovirus.rdrp.HE586971 1       60      60      141     160     178     100.0   1.2e-06 20      20M     +       ATTCTTGGTGACGATATTGTTATACATAACGATAACGTTGCTAAAAAATATATTGAAATA    ILGDDIVIHNDNVAKKYIEI
```

From this, it looks like this set-up will work out of the box with diamond; it is handling stop codons quite efficiently.
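
To see at a glance which hits span a stop codon in the reference, the aligned subject column can be scanned for `*`. This is a small sketch that assumes the 15-column `-f 6` layout used in this test, with `sseq` as the last field; `count_subject_stops` is a made-up helper, and the three rows are copied from the Diamond Test output above.

```bash
# For each hit print: query id, subject id, number of '*' in the aligned subject (column 15).
count_subject_stops() {
  awk '{ n = gsub(/[*]/, "*", $15); print $1, $2, n }' "$@"
}

hits=$(mktemp)
cat > "$hits" <<'EOF'
r1 Mitovirus.rdrp.HE586971 1 60 60 1 20 178 100.0 7.0e-07 20 20M + ATTGTAGCTATGTTAGATTATACAACACAGCTATTTCTTCGACCTATACATTCTGACTTG IVAMLDYTTQLFLRPIHSDL
r3 Mitovirus.rdrp.AF534641 1 60 60 286 305 742 100.0 1.6e-06 20 20M + TGAGAAGATAATGAACACTCATTTTGATCAATCGACCTTACAGCTGCAACTGATAGATTT *EDNEHSF*SIDLTAATDRF
r7 Mitovirus.rdrp.AF534641 1 60 60 346 365 742 100.0 4.1e-07 20 20M + TATTCTGTTGGACAGCCCATGGGAGCATATTCATCCTGACCTGCTTTCACATTATCTCAT YSVGQPMGAYSS*PAFTLSH
EOF
count_subject_stops "$hits"
```

awk's default field splitting handles both the space-aligned rows pasted here and the tab-separated rows diamond writes, since none of the fields contain spaces.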



In [None]:
# In serratus-align
mkdir tmp; cd tmp
# The sync above uploaded this file under the mito/ prefix
aws s3 cp s3://serratus-public/notebook/201012_ab/mito/mitovirus.rdrp.aa.fa ./mito.fa

# Make mitovirus database
diamond makedb --in mito.fa -d mito

# Test diamond search with interrupting stop codons
# based on standard genetic code

cat reads.fa \
  | diamond blastx \
   -d mito.dmnd \
   --unal 0 \
   -k 1 \
   -p 1 \
   -b 0.2 \
   -f 6 qseqid sseqid qstart qend qlen sstart send slen pident evalue btop cigar qstrand qseq sseq \
   > tmp.bam

## Update to protref5b

In [None]:
mkdir p5b; cd p5b
aws s3 sync s3://serratus-public/seq/protref5/ ./
# S3_WORK already ends in '/', so avoid a double slash in the object key
aws s3 cp ${S3_WORK}narnavirus.id95.fa ./

mkdir update; cd update
cat ../protref5.fa ../narnavirus.id95.fa > protref5b.fa
cp ../protref5.msa protref5b.msa

# Make fasta index
samtools faidx protref5b.fa
mv protref5b.fa.fai protref5b.sumzer.tsv

# Make diamond index
diamond makedb --in protref5b.fa -d protref5b

md5sum * > protref5b.md5

# UPLOAD
aws s3 sync ./ s3://serratus-public/seq/protref5b/

```
3426e5b4d178776b0b5737417372bf5a  protref5b.dmnd
55b7ac3cd282703e4a512b36361f7da5  protref5b.fa
0da30de8910dd6bdb3c7f150dc06f65b  protref5b.md5
e094fc7db19c07ffcedf8bc42963ab80  protref5b.msa
96c81da14ae47216ec9267ae0e2d1b42  protref5b.sumzer.tsv
```
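
After downloading `protref5b`, the checksums can be verified with `md5sum -c`. Because the manifest lists itself, its own entry will most likely not match once the file has been fully written, so this sketch filters that line out first. `verify_manifest` is a made-up helper and the demo files are invented.

```bash
# Check every file named in an md5sum manifest, skipping the manifest's own entry.
# usage: verify_manifest <directory> <manifest file name>
verify_manifest() {
  (
    cd "$1" || exit 1
    awk -v self="$2" '$2 != self' "$2" | md5sum -c -
  )
}

# Demo on throwaway files; the manifest includes itself, as protref5b.md5 does.
demo=$(mktemp -d)
printf 'ACGT\n' > "$demo/a.fa"
printf 'MKV\n'  > "$demo/b.fa"
( cd "$demo" && touch demo.md5 && md5sum * > demo.md5 )
verify_manifest "$demo" demo.md5
```

For the real reference this would be `verify_manifest . protref5b.md5`, run from the directory holding the synced files.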