<a href="https://colab.research.google.com/github/cappelchi/Bioinformatics-for-Infectious-Diseases/blob/master/Pine_Bio_Infection_Dseases_presentation_part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Reference publications

[Genetic Predisposition To Acquire a Polybasic Cleavage Site for Highly Pathogenic Avian Influenza Virus Hemagglutinin](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5312086/#B26)

[SARS-CoV-2 furin cleavage site revisited](https://www.virology.ws/2020/05/14/sars-cov-2-furin-cleavage-site-revisited/)

<img src="https://biopython.org/assets/images/biopython_logo_white.png" width="170" height="135">

In [3]:
!pip install biopython



In [4]:
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
from Bio.Data import CodonTable

In [5]:
print(CodonTable.unambiguous_dna_by_id[1])

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [6]:
Entrez.email = "cappelchi@gmail.com"  # Always tell NCBI who you are

In [7]:
#@title get_position function
def get_position(access, find_poly, offset = -1, after = 6000):
    """Find the amino-acid motif find_poly in the translated genome `access`.

    offset is the reading frame (0, 1 or 2). With offset = -1 all three
    frames are tried and the first hit beyond amino-acid index `after` wins.
    """
    if find_poly == '':
        raise ValueError('find_poly must be a non-empty amino-acid motif')
    # Download the record once and reuse it for every reading frame
    get_seq = Entrez.efetch(db="nucleotide", id=access, retmode="text", rettype="fasta")
    record = SeqIO.read(get_seq, "fasta")
    get_seq.close()
    frames = range(3) if offset == -1 else [offset]
    for offset in frames:
        reference1 = record[offset:].translate()  # in amino acids
        pos = reference1.seq.find(find_poly)
        if pos != -1 and pos > after:
            break
    print('position: ', pos, ' offset: ', offset)  # where the polybasic site is
    return pos, offset
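The frame scan in `get_position` can be tried offline on a short string. The following is a minimal sketch, not the notebook's implementation: `CODON_TABLE`, `translate` and `find_in_frames` are hypothetical helpers written in plain Python so that no download is needed, and the toy input is the 30-nt SARS-CoV-2 window printed further down in this notebook with one made-up leading base.

```python
from itertools import product

# Standard genetic code (NCBI table 1); codons are ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def find_in_frames(dna, motif):
    """Return (amino-acid position, frame) for the first frame containing motif."""
    for frame in range(3):
        pos = translate(dna[frame:]).find(motif)
        if pos != -1:
            return pos, frame
    return -1, None


# Toy input: one extra leading base shifts the motif into frame 1
toy = "A" + "TCTCCTCGGCGGGCACGTAGTGTAGCTAGT"
pos, frame = find_in_frames(toy, "RRAR")
start = frame + pos * 3  # nucleotide index of the first codon of the motif
print(pos, frame, toy[start:start + 12])  # 2 1 CGGCGGGCACGT
```

The conversion `frame + pos * 3` is the same arithmetic that `print_sequence` uses to go from an amino-acid index back to nucleotide coordinates.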

In [8]:
#@title print_sequence function
def print_sequence(access, pos, offset, off_back = 6, off_forw = 23):
    # pos: amino-acid index from get_position; offset: reading frame (0, 1 or 2)
    # off_back / off_forw: nucleotides shown before / after the start of the site;
    # keep off_back a multiple of 3 so the window stays in frame
    get_seq = Entrez.efetch(db="nucleotide", id=access, retmode="text", rettype="fasta")
    start = offset + pos * 3  # nucleotide index of the first codon of the site
    cut_seq = SeqIO.read(get_seq, "fasta")[start - off_back:start + off_forw]  # window around the cleavage site, in nucleotides
    poly = cut_seq[off_back:off_back + 12]  # the site itself: 12 nt = 4 codons
    amino_cut_seq = cut_seq.translate()
    print(len(cut_seq.seq))
    print(cut_seq.seq)
    print(amino_cut_seq.seq)
    print('Polybasic nucleotide:', poly.seq, ' - amino:', poly.translate().seq)
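The slice bounds used by `print_sequence` can be checked without downloading anything. This is a small sketch of that arithmetic only; `site_window` is a hypothetical helper, and the frame check on `off_back` is an added assumption that the original function does not enforce. The input values are the ones this notebook reports for NC_045512.2.

```python
def site_window(pos, offset, off_back=6, off_forw=23):
    """Return the 0-based (start, end) nucleotide slice around a site.

    pos is an amino-acid index in reading frame `offset`, as returned by
    get_position; off_back and off_forw are counted in nucleotides.
    """
    if off_back % 3 != 0:
        raise ValueError("off_back must be a multiple of 3 to stay in frame")
    site_start = offset + pos * 3
    return site_start - off_back, site_start + off_forw


start, end = site_window(pos=7868, offset=1, off_back=6, off_forw=24)
print(start, end, end - start)  # 23599 23629 30
```

The window length is always `off_back + off_forw`, which is why the runs in this notebook print 30 or 35 as their first output line.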

### **Study SARS-CoV-2 NC_045512.2 for predisposition to pathogenesis**

1. Find the SARS-CoV-2 reference genome **NC_045512.2** on NCBI.
2. Get the position of the polybasic cleavage site '**RRAR**' in the reference.

In [9]:
access = 'NC_045512.2' # SARS-CoV-2 (COVID-19) reference genome
find_poly = 'RRAR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7868  offset:  1


3. Print the sequence across the cleavage site.

In [11]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 6, 
    off_forw = 24
    )

30
TCTCCTCGGCGGGCACGTAGTGTAGCTAGT
SPRRARSVAS
Polybasic nucleotide: CGGCGGGCACGT  - amino: RRAR


**SARS-CoV-2 already contains a polybasic site. The question is whether this site predisposes the virus to pathogenesis.**

4. Run the **[RNAfold](http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/RNAfold.cgi)** & **[Quickfold](http://unafold.rna.albany.edu/?q=DINAMelt/Quickfold)** web servers with the sequence printed above.

5. [**RNAfold**](http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/RNAfold.cgi?PAGE=3&ID=tv8mTBzhsa) & [**Quickfold**](http://unafold.rna.albany.edu/results2/quikfold/200712/182622/A_1.png) results:

<img src="https://raw.githubusercontent.com/cappelchi/Bioinformatics-for-Infectious-Diseases/master/images/SARS-CoV-2_1_rnafold.png" width="134" height="182"> <img src="https://raw.githubusercontent.com/cappelchi/Bioinformatics-for-Infectious-Diseases/master/images/SARS-CoV-2_1_quickfold.jpg" width="134" height="182">



6. Change some nucleotides while keeping the amino-acid sequence as close to the original as possible, then fold the modified sequence to check the result. [**RNAfold**](http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/RNAfold.cgi?PAGE=3&ID=QhFdXK9lJ5) & [**Quickfold**](http://unafold.rna.albany.edu/results2/quikfold/200713/050433/A_1.png)

![corona](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/coronafold.svg)
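For step 6, candidate sequences that change nucleotides but keep the peptide exactly the same can be listed programmatically. The sketch below is one possible way to do that, not the procedure used for the figures above: it covers only the strictly synonymous case, and `SYNONYMS`, `synonymous_variants` and `n_changes` are hypothetical names built on the standard genetic code in plain Python.

```python
from itertools import product

# Standard genetic code (NCBI table 1); codons are ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Reverse lookup: amino acid -> list of codons that encode it
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def synonymous_variants(dna):
    """Yield every sequence encoding the same peptide (len(dna) must be a multiple of 3)."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    choices = [SYNONYMS[CODON_TABLE[c]] for c in codons]
    for combo in product(*choices):
        yield "".join(combo)


def n_changes(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


site = "CGGCGGGCACGT"  # the RRAR site printed for NC_045512.2
variants = list(synonymous_variants(site))
single = [v for v in variants if n_changes(v, site) == 1]
print(len(variants), len(single))  # 864 14
```

Each entry of `single` differs from the original site by one nucleotide and still encodes RRAR, so it can be pasted into RNAfold or Quickfold in place of the original 12 nucleotides.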

###**Study Pangolin coronavirus isolate PCoV_GX-P2V, accession MT072864.1 for predisposition to acquire a Polybasic Cleavage Site**

![align](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/pangolin_covid19_spike_align.png)

In [None]:
access = 'MT072864.1'
#accession MT072864.1 Pangolin coronavirus isolate PCoV_GX-P2V, complete genome.
find_poly = 'SSFR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7850  offset:  2


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 6, 
    off_forw = 24
    )

30
TCCATGTCATCATTTCGTAGTGTCAACCAG
SMSSFRSVNQ
Polybasic nucleotide: TCATCATTTCGT  - amino: SSFR


<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/Pangolin.png" width="134" height="280">
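The two peptide windows printed so far can be compared position by position. This is a naive sketch with a hypothetical helper, `compare_windows`. It is not a sequence alignment, and a gapped alignment such as the one in the figure above may pair the residues differently.

```python
def compare_windows(a, b):
    """Mark identical positions with '|' and differing positions with '.'."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return "".join("|" if x == y else "." for x, y in zip(a, b))


sars_cov_2 = "SPRRARSVAS"  # window printed for NC_045512.2
pangolin = "SMSSFRSVNQ"    # window printed for MT072864.1
print(sars_cov_2)
print(compare_windows(sars_cov_2, pangolin))  # |....|||..
print(pangolin)
```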

###**Study MERS CoV reference, accession NC_019843**

In [None]:
access = 'NC_019843' #get MERS CoV reference
find_poly = 'RSVR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7898  offset:  2


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 12, 
    off_forw = 23
    )

35
ACTCTCACACCTCGCAGTGTGCGCTCTGTTCCAGG
TLTPRSVRSVP
Polybasic nucleotide: CGCAGTGTGCGC  - amino: RSVR




###**Study MERS-CoV from Camelus dromedarius, accession KJ477102.1**

In [None]:
access = 'KJ477102.1' #accession KJ477102.1, MERS-CoV from Camelus dromedarius
find_poly = 'RSSR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = 1
    )

position:  8016  offset:  1


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 12, 
    off_forw = 23
    )

35
TCTACTGGCAGTCGTAGTTCACGTATTGCTATTGA
STGSRSSRIAI
Polybasic nucleotide: CGTAGTTCACGT  - amino: RSSR




###**Study Human coronavirus NL63 (HCoV-NL63), accession NC_005831.2** 

In [None]:
access = 'NC_005831.2' #accession Human coronavirus NL63 (HCoV-NL63)
find_poly = 'RPR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = 2
    )

position:  7568  offset:  2




In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 6, 
    off_forw = 24
    )

30
CCTGTTCGTCCGCGTAATTCTAGTGATAAT
PVRPRNSSDN
Polybasic nucleotide: CGTCCGCGTAAT  - amino: RPRN


<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/hcov-nl63.png" width="250" height="280"> <img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/hcov-nl63-2.png" width="250" height="280">
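The peptide windows printed in the sections above can be summarized side by side. The sketch below counts arginine and lysine residues and searches for the minimal R-X-X-R pattern commonly associated with furin recognition. Treating that pattern as the test is an assumption made for illustration, and `summarize` is a hypothetical helper rather than part of the notebook's analysis.

```python
import re

# Minimal furin recognition pattern R-X-X-R (assumed here for illustration)
FURIN_MINIMAL = re.compile(r"R..R")

# Peptide windows exactly as printed by print_sequence in this notebook
windows = {
    "SARS-CoV-2 NC_045512.2": "SPRRARSVAS",
    "Pangolin CoV MT072864.1": "SMSSFRSVNQ",
    "MERS-CoV NC_019843": "TLTPRSVRSVP",
    "MERS-CoV KJ477102.1": "STGSRSSRIAI",
    "HCoV-NL63 NC_005831.2": "PVRPRNSSDN",
}


def summarize(peptide):
    """Return (number of R/K residues, first R-X-X-R match or None)."""
    basic = sum(peptide.count(aa) for aa in "RK")
    match = FURIN_MINIMAL.search(peptide)
    return basic, (match.group() if match else None)


for name, peptide in windows.items():
    print(name, *summarize(peptide))
```

On these windows only the SARS-CoV-2 and the two MERS-CoV sequences match the pattern. The pangolin and HCoV-NL63 windows contain arginines but not in the R-X-X-R spacing.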