# Covid-19 Protein Analysis - Protein Synthesis


<br>

 <div class="alert alert-block alert-info">
    Learn the basics of protein analysis with the <b>Covid-19 antibody</b> (part 1 of 5 of the tutorial)
</div>

<hr>

__What we will go over__
- Collecting Data
- Understanding the Covid-19 Genome
- Protein Synthesis with the Biopython Library
- Understanding Protein Data

<hr>

## Collecting the data

Here we use the Covid-19 FASTA data from GenBank: the complete SARS-CoV-2 genome, accession NC_045512.2.

#### Linux

In [1]:
# Get the data. The space in the directory name must be URL-encoded as %20;
# left unencoded, wget treats the two halves as separate URLs and the real file is never fetched.
!wget "https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/covid19_basic%20_protien_analysis/sequence.fasta"


#### Windows/macOS

In [18]:
import urllib.request
url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/covid19_basic%20_protien_analysis/sequence.fasta'
filename = 'sequence.fasta'
urllib.request.urlretrieve(url, filename)

('sequence.fasta', <http.client.HTTPMessage at 0x7ff865dba3c8>)
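A mistyped URL can still return an HTML error page that gets saved under the expected filename, so it may be worth confirming that the download really is FASTA before parsing it. The sketch below relies only on the fact that a FASTA record starts with a `>` header line; `looks_like_fasta` is a hypothetical helper name, not part of any library.

```python
def looks_like_fasta(path):
    """Return True if the first non-empty line of the file starts with '>'."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                # FASTA headers begin with '>'; an HTML page would begin with '<'
                return line.startswith(">")
    return False  # an empty file is not FASTA
```

Calling `looks_like_fasta('sequence.fasta')` after the download is expected to return `True`; a `False` result suggests the URL was wrong and the file should be fetched again.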

<hr>

## Understanding the FASTA File

In [3]:
import Bio
from Bio import SeqIO # library used to parse the file
from Bio import Seq

covid19 = SeqIO.parse('sequence.fasta', 'fasta')

In [4]:
for rec in covid19:
    seq = rec.seq
    print(rec.description)
    print(seq[:10])
    print(seq.alphabet)  # works only on Biopython < 1.78; the alphabet attribute was removed in 1.78

NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTT
SingleLetterAlphabet()
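To make the file layout concrete, here is a minimal pure-Python sketch of what a FASTA reader does: a line starting with `>` is the description, and the lines that follow are joined into one sequence. This is an illustration only, not Biopython's implementation; `read_fasta` is a hypothetical name, and the two short sequence lines in `example` are an assumed line wrapping chosen for readability.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


example = [
    ">NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
    "ATTAAAGGTT",
    "TATACCTTCC",
]
records = list(read_fasta(example))
print(records[0][0])
print(records[0][1])
```

Like `SeqIO.parse`, the function is a generator, so it can be iterated only once; wrapping it in `list(...)` keeps the records around for reuse.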


In [5]:
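from Bio import Seq
from Bio.Alphabet import IUPAC  # requires Biopython < 1.78; Bio.Alphabet was removed in 1.78
# Rebuild the sequence with an explicit unambiguous-DNA alphabet.
# On Biopython >= 1.78, drop the alphabet argument: seq = Seq.Seq(str(seq))
seq = Seq.Seq(str(seq), IUPAC.unambiguous_dna)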

In [6]:
print(seq)

ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCT

In [7]:
from collections import defaultdict
count = defaultdict(int)

# Tally how often each nucleotide occurs in the genome
for letter in seq:
    count[letter] += 1

total = sum(count.values())
count

defaultdict(int, {'A': 8954, 'T': 9594, 'G': 5863, 'C': 5492})

In [8]:
# Use a separate loop variable so the `count` dictionary is not overwritten
for letter, n in count.items():
    print(f'{letter}: {100.*n / total} {n}')

A: 29.943483931378122 8954
T: 32.083737417650404 9594
G: 19.60672842189747 5863
C: 18.366050229074006 5492
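A common summary of these nucleotide counts is the GC content, the share of bases that are G or C. The short sketch below recomputes it from the counts printed above; the variable names are illustrative.

```python
# Nucleotide counts taken from the output above
counts = {'A': 8954, 'T': 9594, 'G': 5863, 'C': 5492}

total_bases = sum(counts.values())
gc_percent = 100.0 * (counts['G'] + counts['C']) / total_bases

print(f"Genome length: {total_bases}")
print(f"GC content: {gc_percent:.2f}%")  # GC content: 37.97%
```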


## Protein Synthesis With Biopython

### Transcription 

In [9]:
rna = seq.transcribe()

In [10]:
rnaCount = defaultdict(int)

for letter in rna:
    rnaCount[letter] += 1
    
total = sum(rnaCount.values())
rnaCount

defaultdict(int, {'A': 8954, 'U': 9594, 'G': 5863, 'C': 5492})
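The counts are identical to the DNA counts except that `T` has become `U`, because `transcribe()` treats the sequence as the coding strand and simply swaps thymine for uracil. A minimal pure-Python sketch of the same idea, assuming an uppercase coding-strand input (`transcribe_dna` is a hypothetical name):

```python
def transcribe_dna(dna):
    """Transcribe a coding-strand DNA string to RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


# First ten bases of the genome, as printed earlier
print(transcribe_dna("ATTAAAGGTT"))  # AUUAAAGGUU
```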

### Translation 

In [11]:
protien = rna.translate(stop_symbol="*")  # each '*' in the result marks a stop codon



In [12]:
protien

Seq('IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVTRLHA*CTHAV*LITNYCR*QD...KKK', HasStopCodon(IUPACProtein(), '*'))
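To show what translation does under the hood, the sketch below builds the standard genetic code as a lookup table and reads the RNA three bases at a time from the first position. It is a simplified illustration, not Biopython's implementation: `translate_rna` is a hypothetical name, only the standard table is covered, and trailing bases that do not form a full codon are silently dropped (the 29903-base genome leaves two such bases).

```python
from itertools import product

BASES = "UCAG"
# Amino acids for all 64 codons in UCAG order (standard genetic code); '*' is a stop
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_rna(rna):
    """Translate an RNA string codon by codon, starting at position 0."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete final codon
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


# First 27 bases of the transcribed genome
print(translate_rna("AUUAAAGGUUUAUACCUUCCCAGGUAA"))  # IKGLYLPR*
```

The result matches the start of the Biopython translation above, including the first stop codon.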

In [13]:
# Split the translation at every stop codon to get candidate peptide fragments
aa = protien.split("*")

ncov = [str(i) for i in aa]
ncov_len = [len(str(i)) for i in aa]

# Store the fragments and their lengths in a DataFrame
import pandas as pd
df = pd.DataFrame({'Amino Acids': ncov, 'Lenght': ncov_len })

df.head()

|   | Amino Acids | Lenght |
|---|---|---|
| 0 | IKGLYLPR | 8 |
| 1 | QTNQLSISCRSVL | 13 |
| 2 | TNFKICVAVTRLHA | 14 |
| 3 | CTHAV | 5 |
| 4 | LITNYCR | 7 |


In [14]:
df.nlargest(5, "Lenght")

|   | Amino Acids | Lenght |
|---|---|---|
| 548 | CTIVFKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFL... | 2701 |
| 694 | ASAQRSQITLHINELMDLFMRIFTIGTVTLKQGEIKDATPSDFVRA... | 290 |
| 719 | TNMKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNS... | 123 |
| 695 | AQADEYELMYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALR... | 83 |
| 718 | QQMFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSL... | 63 |


In [15]:
df.nsmallest(5, "Lenght")

|   | Amino Acids | Lenght |
|---|---|---|
| 14 |  | 0 |
| 52 |  | 0 |
| 59 |  | 0 |
| 64 |  | 0 |
| 93 |  | 0 |
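The zero-length rows appear because two stop codons sit next to each other in the translation, and splitting on `*` leaves an empty string between them. The toy string below (made up for illustration, not taken from the genome) shows the effect and one way to drop the empty fragments.

```python
toy = "MK**LV*"

fragments = toy.split("*")
print(fragments)  # ['MK', '', 'LV', '']

# Keep only fragments that contain at least one amino acid
non_empty = [fragment for fragment in fragments if fragment]
print(non_empty)  # ['MK', 'LV']
```

On the DataFrame built earlier, the equivalent filter would be `df[df['Lenght'] > 0]`.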


In [16]:
from collections import Counter

Counter(protien).most_common(10)

[('L', 886),
 ('S', 810),
 ('*', 774),
 ('T', 679),
 ('C', 635),
 ('F', 593),
 ('R', 558),
 ('V', 548),
 ('Y', 505),
 ('N', 472)]
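Note that `*` appears in the ranking because stop symbols are counted alongside real residues. If relative amino acid frequencies are wanted instead, one option is to drop the stop symbol before normalising. The sketch below uses a short made-up string rather than the real translation, and the variable names are illustrative.

```python
from collections import Counter

toy_protein = "LLS*TL*S"  # made-up example; '*' marks a stop codon

residues = Counter(toy_protein)
residues.pop("*", None)  # remove stop symbols from the tally

n_residues = sum(residues.values())
frequencies = {aa: n / n_residues for aa, n in residues.most_common()}

print(frequencies)
```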

<hr>

 <div class="alert alert-block alert-success">
    Check out the Medium Article Tutorial: <b>Coming Soon</b>
    <br>
    <br>
    Check out my GitHub profile: <a href="https://github.com/VarunSendilraj">https://github.com/VarunSendilraj</a>
    <br>
    <br>
    Part 2: <b>Coming Soon</b>
</div>

<hr>