## Lesson on microbial genomics, phylogenies and protein homology modelling using a betacoronavirus/SARS-CoV-2 example

### About

SARS-CoV-2 was first identified in late December 2019 in the city of Wuhan, China, in patients suffering from respiratory illnesses such as pneumonia. Since then, the virus has been detected in several other countries. There has been considerable discussion and uncertainty over its origin.

The SARS-CoV-2 genome consists of a single, positive-stranded RNA approximately 30,000 nucleotides long. Its overall organization is similar to that of other coronaviruses. The genome encodes the open reading frames (ORFs) common to all betacoronaviruses, including ORF1ab, which encodes many enzymatic proteins, the spike surface glycoprotein (S), the small envelope protein (E), the matrix protein (M), and the nucleocapsid protein (N), as well as several nonstructural proteins.
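
To make the term "open reading frame" concrete, here is a minimal sketch of how ORFs can be located in a nucleotide string: scan each of the three forward frames for an ATG and extend to the first in-frame stop codon. The function name `find_orfs`, the `min_codons` parameter and the toy sequence are illustrative assumptions, not part of the lesson's pipeline. Real gene finders are considerably more sophisticated, and this sketch ignores the reverse strand and features such as ribosomal frameshifting.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # walk the frame one codon at a time
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# hypothetical toy sequence with a single short ORF in frame 2
toy = "CCATGAAATTTTAGGG"
orfs = find_orfs(toy)
for start, end, frame in orfs:
    print(frame, start, end, toy[start:end])
```

For an RNA genome the same scan applies after replacing U with T, which is how the sequences are stored in the FASTA files used here.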

### Links

* http://virological.org/t/the-proximal-origin-of-sars-cov-2/398
* https://theprepared.com/blog/no-the-2019-ncov-genome-doesnt-actually-seem-engineered-from-hiv/
* https://www.ncbi.nlm.nih.gov/labs/virus/vssi
* https://biopython.org/wiki/Phylo

In [32]:
import os,sys,glob,random,subprocess
from importlib import reload
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import Phylo, AlignIO
from pygenefinder import app,tools
from pybioviz import plotters
from bokeh.io import show, output_notebook, output_file
output_notebook()
from ete3 import Tree, NodeStyle, TreeStyle, PhyloTree

In [2]:
filenames = glob.glob('*.fasta')
info = [tools.get_fasta_info(f) for f in filenames]
fastatable = pd.DataFrame(info)

### We used the NCBI viruses database to download betacoronavirus sequences

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Betacoronavirus,%20taxid:694002&Flags_csv=complete

In [6]:
ncbidata = pd.read_csv('ncbi_betacoronavirus_25-02-20.csv')
print (ncbidata.columns)
ncbidata['Release_Date']=pd.to_datetime(ncbidata.Release_Date)

#print (ncbidata.Host.value_counts()[:10])
#put sequences into a dict keyed by record id
seqrecs = SeqIO.to_dict(SeqIO.parse('ncbi_betacoronavirus.fasta','fasta'))

Index(['Details', 'Accession', 'Release_Date', 'Species', 'Genus', 'Family',
       'Length', 'Nuc._Completeness', 'Genotype', 'Geo_Location', 'Host',
       'Isolation_Source', 'Collection_Date', 'BioSample', 'GenBank_Title'],
      dtype='object')


### Get a subset of the sequences

In [4]:
subset = ncbidata[ncbidata.Host=='Chiroptera']
#ncbidata[(ncbidata.Release_Date>'2019-01-01') & (ncbidata.Host=='Chiroptera')]
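
The commented-out line above hints at combining several conditions. As a rough sketch of how such boolean masks work in pandas, the example below filters a small toy table by host and release date. The column names follow `ncbidata`, but the rows and accession labels are invented for illustration.

```python
import pandas as pd

# toy stand-in for the NCBI metadata table; rows are invented
toy = pd.DataFrame({
    "Accession": ["ACC1", "ACC2", "ACC3", "ACC4"],
    "Host": ["Chiroptera", "Homo sapiens", "Chiroptera", "Chiroptera"],
    "Release_Date": pd.to_datetime(
        ["2006-11-01", "2020-01-20", "2019-06-15", "2020-02-01"]),
})

# each comparison yields a boolean Series; & combines them row by row
is_bat = toy.Host == "Chiroptera"
is_recent = toy.Release_Date > "2019-01-01"
recent_bats = toy[is_bat & is_recent]

# counting hosts is a quick way to decide which subsets are worth taking
host_counts = toy.Host.value_counts()
print(recent_bats)
print(host_counts)
```

Each condition needs its own parentheses when written inline, as in the commented line, because `&` binds more tightly than `==` and `>`.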

### Annotate the nucleotide sequences and save in a dataframe

In [18]:
outdir = 'annot'
#make sure the output and temp folders exist
os.makedirs(outdir, exist_ok=True)
os.makedirs('temp', exist_ok=True)
res = []
for i,row in subset.iterrows():
    label = row.Accession
    #print(row)
    gbfile = os.path.join(outdir,label+'.gbk')
    if os.path.exists(gbfile):
        #reuse a previously saved annotation
        featdf = tools.genbank_to_dataframe(gbfile)
        featdf['sequence'] = featdf.translation
    else:
        #annotate the genome and save the result as genbank
        seq = seqrecs[label]
        filename = os.path.join('temp',label+'.fasta')
        SeqIO.write(seq,filename,'fasta')
        featdf,recs = app.run_annotation(filename, threads=10, kingdom='viruses')
        tools.recs_to_genbank(recs, gbfile)
    featdf['label'] = label
    featdf['host'] = row.Host
    res.append(featdf)

#combine the per-genome feature tables into one dataframe
res = pd.concat(res)

### Get a protein sequence of interest across all the annotations

In [24]:
protname = 'Spike glycoprotein'
seqs = []
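for i,df in res.groupby('label'):
    #rows annotated with the product of interest for this genome
    s = df[df['product']==protname]
    if len(s)==0:
        continue
    #keep the first hit if there are several
    s = s.iloc[0]
    #print (s)
    seq = SeqRecord(Seq(s.sequence),id=s.label,description=s.host)
    seqs.append(seq)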
        
#print (seqs)

aln = tools.muscle_alignment(seqs=seqs)
AlignIO.write(aln, 'aligned.fasta', 'fasta')
#print (aln[:,110:220])
p=plotters.plot_sequence_alignment(aln)
show(p)
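
Once sequences are aligned, a simple way to summarise how similar two of them are is percent identity over the aligned columns. The sketch below works on plain strings. The function name `percent_identity` and the toy rows are assumptions for illustration; with a Biopython alignment the strings could be obtained with `str(rec.seq)` for each record. Definitions of identity vary, and this one skips any column where either sequence has a gap.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# invented toy alignment rows of equal length
rows = {
    "seq1": "MFVFLVLLPL",
    "seq2": "MFIFLLFL-L",
    "seq3": "MF-FLVLLPL",
}

names = sorted(rows)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        print(n1, n2, round(percent_identity(rows[n1], rows[n2]), 1))
```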

### Make a tree from the alignment

In [None]:
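from Bio.Phylo.Applications import PhymlCommandline
#convert the alignment to relaxed phylip for use as phyml input
AlignIO.convert("aligned.fasta", "fasta", "aligned.phy", "phylip-relaxed")
#requires the phyml executable on the PATH; note that the Bio.Phylo.Applications
#wrappers are deprecated in newer Biopython releases
cmdline = PhymlCommandline(input='aligned.phy', datatype='aa', model='WAG', alpha='e', bootstrap=10)
print (cmdline)
out_log, err_log = cmdline()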
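
To show what the conversion step produces, here is a minimal sketch that writes a sequential relaxed PHYLIP file by hand: a header line with the number of sequences and the alignment length, then one line per sequence with the name and the aligned residues separated by whitespace. The relaxed variant lifts the 10-character name limit of strict PHYLIP. The function `to_relaxed_phylip` and the toy records are assumptions for illustration; the exact layout written by `AlignIO.convert` may differ (for example in padding or interleaving).

```python
def to_relaxed_phylip(records):
    """records: list of (name, aligned_sequence) pairs of equal length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    for name, _ in records:
        # names and sequences are whitespace separated, so names cannot contain spaces
        if any(c.isspace() for c in name):
            raise ValueError(f"name contains whitespace: {name!r}")
    width = max(len(name) for name, _ in records) + 2
    lines = [f" {len(records)} {lengths.pop()}"]
    lines += [f"{name:<{width}}{seq}" for name, seq in records]
    return "\n".join(lines) + "\n"

# invented toy records; the second name exceeds the strict 10-character limit
toy_records = [
    ("taxonA", "MFVF-L"),
    ("taxon_with_long_name", "MFIFLL"),
]
text = to_relaxed_phylip(toy_records)
print(text)
```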


In [43]:
tree = Phylo.read("aligned.phy_phyml_tree.txt", "newick")
Phylo.draw_ascii(tree)

                                                           __ DQ071615
                                __________________________|
                               |                          |___ KU973692
       ________________________|
      |                        |                                  , EF065513
      |                        |__________________________________|
      |                                                           | NC_009021
  ____|
 |    |        , EF065509
 |    |       ,|
 |    |       || NC_009020
 |    |_______|
 |            | , EF065510
 |            |_|
 |              , EF065511
 |              |
 |              | EF065512
 |
 | EF065508
 |
_| EF065506
 |
 | EF065507
 |
 , EF065505
 |
 | NC_009019
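
Beyond drawing the tree, `Bio.Phylo` can be used to query it. The sketch below parses a small hand-written Newick string (an invented example, not the PhyML output) and reports the leaf names and the branch-length distance between pairs of leaves. The same calls should apply to the `tree` object read above, with accession labels in place of the toy names.

```python
from io import StringIO
from Bio import Phylo

# invented toy tree: A and B are sisters, C is the outgroup
newick = "((A:0.1,B:0.2):0.3,C:0.4);"
toy_tree = Phylo.read(StringIO(newick), "newick")

# terminal clades are the leaves (here, the taxa)
leaf_names = [leaf.name for leaf in toy_tree.get_terminals()]

# distance sums branch lengths along the path between two leaves
dist_ab = toy_tree.distance("A", "B")
dist_ac = toy_tree.distance("A", "C")

print(leaf_names)
print(round(dist_ab, 3), round(dist_ac, 3))
```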



In [None]:
#t = PhyloTree('RAxML_bipartitions.variants')
t = PhyloTree('aligned.phy_phyml_tree.txt', quoted_node_names=True)
#slice of the alignment that could be linked to the tree
alf = aln[:,650:700].format('fasta')
#t.link_to_alignment(alf)
ts = TreeStyle()
ts.scale = 180

#render inline in the notebook
t.render("%%inline", tree_style=ts)
#t.render("tree.png", dpi=150, tree_style=ts)