
### Reference publications

[Genetic Predisposition To Acquire a Polybasic Cleavage Site for Highly Pathogenic Avian Influenza Virus Hemagglutinin](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5312086/#B26)

[SARS-CoV-2 furin cleavage site revisited](https://www.virology.ws/2020/05/14/sars-cov-2-furin-cleavage-site-revisited/)

In [None]:
!pip install biopython

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/a8/66/134dbd5f885fc71493c61b6cf04c9ea08082da28da5ed07709b02857cbd0/biopython-1.77-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
Installing collected packages: biopython
Successfully installed biopython-1.77


In [None]:
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
from Bio.Data import CodonTable

In [None]:
print(CodonTable.unambiguous_dna_by_id[1])

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
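
The sites examined in this notebook are rich in arginine, so it can help to see which codons of the standard table encode the basic residues. The following is a minimal sketch that rebuilds Table 1 in plain Python, without Biopython or network access. The names `STANDARD_TABLE` and `codons_for` are illustrative and are not part of Biopython.

```python
from itertools import product

# Standard genetic code (NCBI Table 1), with each codon position in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return the sorted codons that encode one amino acid (one-letter code)."""
    return sorted(c for c, aa in STANDARD_TABLE.items() if aa == amino_acid)


print("R:", codons_for("R"))  # arginine has six codons
print("K:", codons_for("K"))  # lysine has two codons
```

Arginine is encoded by six codons and lysine by two, which is why the same basic motif can appear with quite different nucleotide sequences in different genomes.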

In [None]:
Entrez.email = "cappelchi@gmail.com"  # Always tell NCBI who you are

In [None]:
def get_position(access, find_poly, offset=-1, after=6000):
    """Find the amino-acid motif `find_poly` in the translated genome `access`.

    With offset=-1, reading frames 0, 1 and 2 are tried in turn and the search
    stops at the first frame where the motif lies beyond amino-acid position
    `after`. Returns (pos, offset); pos is -1 if the motif was not found.
    """
    if find_poly == '':
        raise ValueError('find_poly must be a non-empty search motif')
    # Fetch the record once and reuse it for every reading frame
    get_seq = Entrez.efetch(db="nucleotide", id=access, retmode="text", rettype="fasta")
    record = SeqIO.read(get_seq, "fasta")
    get_seq.close()
    frames = range(3) if offset == -1 else [offset]
    for frame in frames:
        reference1 = record[frame:].translate()  # in amino acids
        pos = reference1.seq.find(find_poly)
        if pos != -1 and pos > after:
            break
    print('position: ', pos, ' offset: ', frame)  # where the polybasic site is
    return pos, frame
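
The frame search in `get_position` can be tried offline on a toy input. The sketch below is an assumption-laden simplification: it uses a plain-Python codon table instead of Biopython, accepts only uppercase A/C/G/T, drops any trailing partial codon, and omits the `after` threshold. The helper names `translate` and `find_motif_frame` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI Table 1), with each codon position in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nt):
    """Translate complete codons only; a trailing partial codon is dropped."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, usable, 3))


def find_motif_frame(nt, motif):
    """Return (pos, offset) for the first forward frame containing motif.

    Returns (-1, None) when no frame contains it.
    """
    for offset in range(3):
        pos = translate(nt[offset:]).find(motif)
        if pos != -1:
            return pos, offset
    return -1, None


# Toy input: one extra base in front of the 24-nucleotide SARS-CoV-2 window
# that this notebook prints for NC_045512.2.
toy = "A" + "CCTCGGCGGGCACGTAGTGTAGCT"
print(find_motif_frame(toy, "RRAR"))
```

The extra leading base pushes the motif out of frame 0, so the search only succeeds in frame 1, mirroring how the real genome needs a non-zero offset.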

In [None]:
def print_sequence(access, pos, offset, off_back=6, off_forw=23):
    """Print the nucleotides and amino acids around a cleavage site.

    pos and offset come from get_position. off_back and off_forw are the
    numbers of nucleotides shown before and after the start of the motif;
    keep off_back a multiple of 3 so the window stays in frame.
    """
    get_seq = Entrez.efetch(db="nucleotide", id=access, retmode="text", rettype="fasta")
    record = SeqIO.read(get_seq, "fasta")
    get_seq.close()
    start = offset + pos * 3  # 0-based nucleotide index of the motif's first codon
    cut_seq = record[start - off_back:start + off_forw]  # window around the site
    poly = cut_seq[off_back:off_back + 12]  # four codons from the motif start
    amino_cut_seq = cut_seq.translate()
    print(len(cut_seq.seq))
    print(cut_seq.seq)
    print(amino_cut_seq.seq)
    print('Polybasic nucleotide:', poly.seq, ' - amino:', poly.translate().seq)

### **NC_045512.2** SARS-CoV-2 (COVID-19) reference genome

In [None]:
access = 'NC_045512.2' # SARS-CoV-2 (COVID-19) reference genome
find_poly = 'RRAR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7868  offset:  1


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 3, 
    off_forw = 21
    )

24
CCTCGGCGGGCACGTAGTGTAGCT
PRRARSVA
Polybasic nucleotide: CGGCGGGCACGT  - amino: RRAR
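
The position reported by `get_position` is an amino-acid index in the chosen reading frame, so it has to be converted back to nucleotide coordinates before slicing. The arithmetic can be sketched as follows; `window_bounds` is a hypothetical helper, and all indices are 0-based Python indices into the FASTA sequence.

```python
def window_bounds(pos, offset, off_back, off_forw):
    """Return (motif_start, window_start, window_end) as 0-based nucleotide indices."""
    motif_start = offset + pos * 3  # first base of the motif's first codon
    return motif_start, motif_start - off_back, motif_start + off_forw


# Values taken from the NC_045512.2 cells above.
motif_start, start, end = window_bounds(pos=7868, offset=1, off_back=3, off_forw=21)
print(motif_start, start, end, end - start)
```

For these inputs the motif starts at index 1 + 3 × 7868 = 23605 and the window spans indices 23602 to 23626, which gives the 24 nucleotides printed above.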


![corona](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/coronafold.svg)

### **MT072864.1** Pangolin coronavirus isolate PCoV_GX-P2V, complete genome

![align](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/pangolin_covid19_spike_align.png)

In [None]:
access = 'MT072864.1'
#accession MT072864.1 Pangolin coronavirus isolate PCoV_GX-P2V, complete genome.
find_poly = 'SSFR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7850  offset:  2


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 6, 
    off_forw = 24
    )

30
TCCATGTCATCATTTCGTAGTGTCAACCAG
SMSSFRSVNQ
Polybasic nucleotide: TCATCATTTCGT  - amino: SSFR


![pangolin](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/Pangolin.png)

### **NC_019843** MERS-CoV reference genome

In [None]:
access = 'NC_019843' # MERS-CoV reference genome
find_poly = 'RSVR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = -1
    )



position:  7898  offset:  2


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 12, 
    off_forw = 23
    )

35
ACTCTCACACCTCGCAGTGTGCGCTCTGTTCCAGG
TLTPRSVRSVP
Polybasic nucleotide: CGCAGTGTGCGC  - amino: RSVR




### **KJ477102.1** MERS-CoV, *Camelus dromedarius*

In [None]:
access = 'KJ477102.1' # MERS-CoV, Camelus dromedarius
find_poly = 'RSSR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = 1
    )

position:  8016  offset:  1


In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 12, 
    off_forw = 23
    )

35
TCTACTGGCAGTCGTAGTTCACGTATTGCTATTGA
STGSRSSRIAI
Polybasic nucleotide: CGTAGTTCACGT  - amino: RSSR




### **NC_005831.2** Human coronavirus NL63 (HCoV-NL63)

In [None]:
access = 'NC_005831.2' #accession Human coronavirus NL63 (HCoV-NL63)
find_poly = 'RPR'
pos, offset = get_position(
    access = access, 
    find_poly = find_poly, 
    offset = 2
    )

position:  7568  offset:  2




In [None]:
print_sequence(
    access = access, 
    pos = pos, 
    offset = offset, 
    off_back = 6, 
    off_forw = 24
    )

30
CCTGTTCGTCCGCGTAATTCTAGTGATAAT
PVRPRNSSDN
Polybasic nucleotide: CGTCCGCGTAAT  - amino: RPRN


![nl63](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/hcov-nl63.png)
![nl63-2](https://raw.githubusercontent.com/cappelchi/T-Bio/master/infections/hcov-nl63-2.png)
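
As a rough way to compare the five sites, the amino-acid windows printed above can be scanned for R-X-X-R, which is commonly cited as the minimal furin recognition motif. This is only a sketch: the windows are copied by hand from the cell outputs, the helper `rxxr_hits` is hypothetical, and a match in such a short window is suggestive at best, not evidence of cleavage.

```python
import re

# Amino-acid windows copied from the outputs of print_sequence above.
windows = {
    "NC_045512.2": "PRRARSVA",
    "MT072864.1": "SMSSFRSVNQ",
    "NC_019843": "TLTPRSVRSVP",
    "KJ477102.1": "STGSRSSRIAI",
    "NC_005831.2": "PVRPRNSSDN",
}

# A lookahead is used so that overlapping matches are all reported.
RXXR = re.compile(r"(?=(R..R))")


def rxxr_hits(protein):
    """Return (index, match) pairs for every R-X-X-R occurrence in protein."""
    return [(m.start(), m.group(1)) for m in RXXR.finditer(protein)]


for accession, window in windows.items():
    print(accession, window, rxxr_hits(window))
```

With these windows the SARS-CoV-2 and both MERS-CoV sequences contain an R-X-X-R match, while the pangolin coronavirus and HCoV-NL63 windows do not.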