# Coronaviridae Index Genomes
```
Lead     : ababaian
Issue    : #101
start    : 2020 05 17
complete : 2020 05 17
s3 files : s3://serratus-public/notebook/200517_ab/
```

## Introduction

To begin phylogenomics and to organize CoV fragment sequences / contigs into a unified 'pan-genome', we need a central annotation and a multiple sequence alignment for Coronaviridae.

This is the rationale for choosing 12 representative, well-annotated sequences that span Coronaviridae, plus three toroviruses as an outgroup. From these we will define our pan-genome reference.

### Objectives
- Select 12 representative and divergent CoV sequences and 3 toroviruses as an outgroup
- Each sequence should have well-annotated ORFs in GenBank format
- Try an MSA between these sequences, and fix it by hand if need be


## Materials and Methods

- FLOM1 reference sequences where available:
Nucleotide Search:
```
Viruses[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT cellular organisms[ORGN] NOT AC_000001:AC_999999[PACC] AND ("vhost human"[Filter] AND "vhost vertebrates"[Filter])
```

Also used: [Virus Genome Browser](https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118)

## iCOV - Index Search Query
Date accessed: 2020 05 17

```
NC_003436 OR NC_005831 OR NC_002306 OR NC_001846 OR NC_045512 OR NC_003045 OR NC_001451 OR NC_010646 OR NC_046965 OR NC_011547 OR NC_011549 OR NC_016994 OR NC_007447 OR NC_022787 OR NC_026812
```
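
The query above is a plain `OR` join of the 15 accessions. As a minimal sketch (the helper name `build_or_query` is hypothetical and not part of the original notebook), the same string can be rebuilt from a one-per-line accession list, which makes it easier to add or drop a genome later:

```shell
# Join accessions (one per line on stdin) into an "A OR B OR C" search string.
# Blank lines are skipped; only the first whitespace-delimited field is used.
build_or_query() {
  awk 'NF { printf "%s%s", sep, $1; sep = " OR " } END { print "" }'
}

# Example with three of the accessions from the query above
printf '%s\n' NC_003436 NC_005831 NC_002306 | build_or_query
# prints: NC_003436 OR NC_005831 OR NC_002306
```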

### Files Downloaded

- `iCOV.gb` : Full Genbank records
- `iCOV.fa` : Genome nucleotide sequences
- `iCOV_protein.fa` : Annotated Coding Sequences (protein)
- `iCOV_cds.fa` : Annotated Coding Sequences (DNA)


## Name/accession
`iCOV.names`

```
aFIPV	NC_002306.3	29355
aNL63	NC_005831.2	27553
aPEDV	NC_003436.1	28033

bBOV	NC_003045.1	31028
bMHV	NC_001846.1	31357
hCOV	NC_045512.2	29903

gIBV	NC_001451.1	27608
gBWV	NC_010646.1	31686
gCGV	NC_046965.1	28539

dBUV	NC_011547.1	26487
dTHV	NC_011549.1	26396
dNHV	NC_016994.1	26077

tCSB	NC_026812.1	27004
tPTO	NC_022787.1	28301
tBTO	NC_007447.1	28475
```


```
# Create a simple index file as the starting point for a TSV
cd /home/artem/Desktop/serratus/notebook/200517_ab
samtools faidx iCOV.fa
cat iCOV.fa.fai
```

```
NC_046965.1	28539	81	70	71
NC_045512.2	29903	29126	70	71
NC_011547.1	26487	59517	70	71
NC_026812.1	27004	86454	70	71
NC_022787.1	28301	113904	70	71
NC_016994.1	26077	142671	70	71
NC_002306.3	29355	169188	70	71
NC_011549.1	26396	199023	70	71
NC_010646.1	31686	225857	70	71
NC_007447.1	28475	258039	70	71
NC_005831.2	27553	286975	70	71
NC_003436.1	28033	314985	70	71
NC_003045.1	31028	343469	70	71
NC_001846.1	31357	375020	70	71
NC_001451.1	27608	406890	70	71
```
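
The first two columns of a `.fai` index are the sequence name and its length, so the index can be used to cross-check the lengths recorded in `iCOV.names`. The following is only a sketch: the function name `check_lengths` is hypothetical, and it assumes both files are tab-separated as shown above.

```shell
# check_lengths NAMES FAI
# For every entry in NAMES, print: name, accession, length, OK|MISMATCH,
# where OK means the .fai index reports the same length for that accession.
check_lengths() {
  awk -F '\t' '
    NR == FNR { fai_len[$1] = $2; next }   # first file (.fai): accession -> length
    NF >= 3 {                               # second file (names): skip blank lines
      status = (($2 in fai_len) && fai_len[$2] == $3) ? "OK" : "MISMATCH"
      printf "%s\t%s\t%s\t%s\n", $1, $2, $3, status
    }
  ' "$2" "$1"
}

# Run only when both files are present in the working directory
if [ -f iCOV.names ] && [ -f iCOV.fa.fai ]; then
  check_lengths iCOV.names iCOV.fa.fai
fi
```

An accession that is missing from the index is also reported as `MISMATCH`, so a typo in either file shows up in the same pass.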


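```
# Extract orf1ab protein sequences
mkdir -p tmp orf1ab
cp iCOV_protein.fa tmp/
cd tmp

# fastaexplode (exonerate utilities) writes one FASTA file per record
fastaexplode iCOV_protein.fa
rm iCOV_protein.fa   # drop the copied input so it is not counted below
# Sort by file size and copy the 15 largest to orf1ab/;
# these are the orf1ab (gene 1) proteins, one per genome
ls -S | head -n 15 | xargs -I{} cp {} ../orf1ab/
cd ..; rm -rf tmp/
```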
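
The "sort by size, take the top 15" step can also be sketched without splitting the file, by computing each record's sequence length directly. This is an illustrative alternative, not the method used in the notebook; the function name `longest_records` is hypothetical, and the assumption that the 15 longest proteins are the orf1ab products comes from the comments above.

```shell
# longest_records FASTA N
# Print "length<TAB>record_id" for the N longest records in FASTA,
# longest first. The record id is the header up to the first space.
longest_records() {
  awk '
    /^>/ { if (id != "") print len "\t" id   # flush the previous record
           id = substr($1, 2); len = 0; next }
    { len += length($0) }                     # accumulate sequence length
    END { if (id != "") print len "\t" id }
  ' "$1" | sort -k1,1nr | head -n "$2"
}

# Run only when the protein file is present in the working directory
if [ -f iCOV_protein.fa ]; then
  longest_records iCOV_protein.fa 15
fi
```

The listed ids could then be pulled out of `iCOV_protein.fa` individually, for example with `samtools faidx iCOV_protein.fa <id>`.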

# Alphacoronavirus

## Porcine Epidemic Diarrhea Virus (PEDV)

[`NC_003436`](https://www.ncbi.nlm.nih.gov/nuccore/19387576)

## Human Coronavirus NL63 (NL63)

[`NC_005831`](https://www.ncbi.nlm.nih.gov/nuccore/49169782)

## Feline Infectious Peritonitis Virus (FIPV)

[`NC_002306`](https://www.ncbi.nlm.nih.gov/nuccore/315192962)


# Betacoronavirus

## Murine Hepatitis Virus (MHV)

[`NC_001846`](https://www.ncbi.nlm.nih.gov/nuccore/9629812)

## SARS-CoV-2 (hCOV2)

[`NC_045512`](https://www.ncbi.nlm.nih.gov/nuccore/1798174254)

## Bovine Coronavirus (BOV)

[`NC_003045`](https://www.ncbi.nlm.nih.gov/nuccore/15081544)


# Gammacoronavirus

## Infectious Bronchitis Virus (IBV)

[`NC_001451`](https://www.ncbi.nlm.nih.gov/nuccore/9626535)

## Beluga Whale Coronavirus (BWV)

[`NC_010646`](https://www.ncbi.nlm.nih.gov/nuccore/187251953)

## Canada Goose Coronavirus (CGV)

[`NC_046965`](https://www.ncbi.nlm.nih.gov/nuccore/1830345784)

# Deltacoronavirus

## Bulbul Coronavirus (BUV)

[`NC_011547`](https://www.ncbi.nlm.nih.gov/nuccore/1464306524)

## Thrush Coronavirus (THV)

[`NC_011549`](https://www.ncbi.nlm.nih.gov/nuccore/212681378)

## Night Heron Coronavirus (NHV)

[`NC_016994`](https://www.ncbi.nlm.nih.gov/nuccore/383080775)


# Torovirus

## Breda (BTO)

[`NC_007447`](https://www.ncbi.nlm.nih.gov/nuccore/77118348)

## Porcine Torovirus (PTO)

[`NC_022787`](https://www.ncbi.nlm.nih.gov/nuccore/557745614)

## Chinook Salmon Bafinivirus (CSB)

[`NC_026812`](https://www.ncbi.nlm.nih.gov/nuccore/807743898)


```
cd /home/artem/Desktop/serratus/notebook/200517_ab
aws s3 sync ./ s3://serratus-public/notebook/200517_ab/
```