## Lesson on microbial genomics and protein homology modelling using the SARS-CoV-2 example

### About

SARS-CoV-2 was first identified in late December 2019 in Wuhan, China, in patients suffering from respiratory illnesses such as pneumonia. Since then, the virus has been detected in several other countries. There has been considerable discussion and uncertainty over its origin.

The SARS-CoV-2 genome is a single-stranded, positive-sense RNA approximately 30,000 nucleotides long, and its overall organization is similar to that of other coronaviruses. It encodes the open reading frames (ORFs) common to all betacoronaviruses: ORF1ab, which encodes many enzymatic proteins, the spike surface glycoprotein (S), the small envelope protein (E), the matrix protein (M), and the nucleocapsid protein (N), as well as several nonstructural proteins.

### Links

* http://virological.org/t/the-proximal-origin-of-sars-cov-2/398
* https://www.gisaid.org/
* https://www.ncbi.nlm.nih.gov/genome/genomes/86693
* https://www.ncbi.nlm.nih.gov/assembly/?term=Betacoronavirus
* https://theprepared.com/blog/no-the-2019-ncov-genome-doesnt-actually-seem-engineered-from-hiv/

In [72]:
import os,sys,glob
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import Entrez
from pygenefinder import app,tools
#from pybioviz import viewers

In [73]:
filenames = glob.glob('*.fasta')
info = [tools.get_fasta_info(f) for f in filenames]
fastatable = pd.DataFrame(info)
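
For readers without `pygenefinder` installed, the table-building step can be approximated as below. This is a minimal sketch, not the library's implementation: `fasta_info` is a hypothetical stand-in for `tools.get_fasta_info`. Only the `filename` and `label` fields are known from the later cells; `id` and `length` are assumptions added for illustration. It reads just the first record of each file using the standard library.

```python
import os

def fasta_info(path):
    """Hypothetical stand-in for tools.get_fasta_info (returned fields are assumptions)."""
    header = None
    length = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    break  # stop at the start of the second record
                header = line[1:]
            elif header is not None:
                length += len(line)  # count residues of the first record only
    # label = file name without extension, used to tag each genome
    label = os.path.splitext(os.path.basename(path))[0]
    seq_id = header.split()[0] if header else None
    return {'filename': path, 'label': label, 'id': seq_id, 'length': length}
```

Passing `[fasta_info(f) for f in filenames]` to `pd.DataFrame` would then give one row per genome with `filename` and `label` columns, which is all the annotation loop relies on.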


In [None]:
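res = []
annot = {}
# annotate each genome; keep both the feature table and the annotated records
for i,row in fastatable.iterrows():
    featdf,recs = app.run_annotation(row.filename, threads=10, kingdom='viruses')
    featdf['label'] = row.label
    res.append(featdf)
    annot[row.label] = recs
# combine the per-genome feature tables into one
res = pd.concat(res)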

In [78]:
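protname = 'Spike glycoprotein'
seqs = []
# take the first feature annotated as the spike protein in each genome
for label,df in res.groupby('label'):
    s = df[df['product']==protname].iloc[0]
    #print (s.sequence)
    seq = SeqRecord(Seq(s.sequence),id=s.label)
    seqs.append(seq)

#print (seqs)

# align the spike sequences and print alignment columns 110 to 219
aln = tools.clustal_alignment(seqs=seqs)
print (aln[:,110:220])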

clustalw -infile=temp.faa
SingleLetterAlphabet() alignment with 5 rows and 110 columns
DSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWME...QGF EPI_ISL_411950
DSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWME...QGF EPI_ISL_411953
DSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWME...QGF EPI_ISL_411955
DSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWME...QGF EPI_ISL_412029
DSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWME...QGF EPI_ISL_411956
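
The visible part of this window is identical in all five rows. Rather than inspecting slices by eye, one option is to list the alignment columns where the rows disagree. The sketch below works on plain strings and uses toy data; `variable_columns` is a hypothetical helper, not part of `pygenefinder` or Biopython.

```python
def variable_columns(rows):
    """Return (index, residues) for each column that is not identical across rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError('aligned rows must all have the same length')
    variable = []
    # zip(*rows) walks the alignment column by column
    for i, column in enumerate(zip(*rows)):
        if len(set(column)) > 1:
            variable.append((i, ''.join(column)))
    return variable

# Toy aligned sequences (illustrative only, not taken from the genomes above)
toy = ['DSKTQSLL',
       'DSKTQSLL',
       'DSRTQ-LL']
print(variable_columns(toy))  # [(2, 'KKR'), (5, 'SS-')]
```

With a Biopython alignment object, the rows could be obtained as `[str(rec.seq) for rec in aln]` before calling the helper.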


In [46]:
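prot = 'Spike glycoprotein'
seqs = []
# collect the spike protein translation from each annotated genome
for r in annot:
    recs = annot[r]
    for s in recs[0].features:
        quals = s.qualifiers
        # .get skips features that have no product qualifier instead of raising KeyError
        if quals.get('product') == prot:
            print (s)
            seqs.append(quals['translation'])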

type: CDS
location: [21563:25384](+)
qualifiers:
    Key: gene, Value: S
    Key: locus_tag, Value: BCBDFAD_00003
    Key: product, Value: Spike glycoprotein
    Key: translation, Value: MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPS