## Collecting Coronavirus Proteins from the UniProt Database

**What is the UniProt database?**

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. [Source: Wikipedia](https://en.wikipedia.org/wiki/UniProt)

**Why is this protein list important for the COVID-19 dataset?**

To extract information about biomolecular mechanisms from the scientific literature (the COVID-19 dataset), one needs lists of the associated proteins, genes, pathways, drugs, and so on. This notebook presents the steps to gather coronavirus-associated protein names, gene names, and pathways from the UniProt database. These lists can be used to search the text documents during further NLP processing and to present entity relationships.

#### Step I

Go to the UniProt database (https://www.uniprot.org/) and select UniProtKB in the search bar. Then enter `corona virus` into the search bar.


![img](img/uniprot-search.png)

#### Step II
After you run the search, the results are displayed as a table that spans multiple pages.

![img](img/uniprot-table.png)

#### Step III
Look at the toolbar at the far right of this table. The pen-like icon opens a window where you can select the columns you want to gather (e.g., Name, Gene, Pathways).

![img](img/pen.png)

#### Step IV

Once you have selected the columns, go back to the results table and click the download button. You can choose the format of the data; an Excel file is one option.


![img](img/uniprot-download.png)
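
The manual export above could, in principle, also be scripted. The sketch below only builds a request URL and does not contact any server. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the return-field names (`accession`, `id`, `reviewed`, `protein_name`, `gene_names`, `organism_name`, `virus_hosts`, `cc_pathway`) are assumptions about UniProt's REST service and should be checked against its current documentation. `build_uniprot_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

# Assumed endpoint and field names; verify them against the UniProt API docs.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"
FIELDS = [
    "accession", "id", "reviewed", "protein_name",
    "gene_names", "organism_name", "virus_hosts", "cc_pathway",
]


def build_uniprot_url(query, fields=FIELDS, fmt="tsv", size=500):
    """Return a search URL asking for the chosen columns in the chosen format."""
    params = {
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
        "size": size,
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


url = build_uniprot_url("corona virus")
print(url)
```

If these assumptions hold, `pd.read_csv(url, sep="\t")` would load the first page of results; a multi-page result set would still need pagination handling, which this sketch leaves out.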

#### What to do after getting the protein data?

Let's play around with this data.

In [5]:
import pandas as pd

In [10]:
ls data

DATA.json                            noncomm_use_subset.tar.gz
II - COVID19-Citation-Network.ipynb  pmc_custom_license.tar.gz
biorxiv_medrxiv.tar.gz               virus-proteins.xlsx
comm_use_subset.tar.gz


In [11]:
df = pd.read_csv("./data/corona.csv")

In [12]:
df.head()

| | Entry | Entry name | Status | Protein names | Gene names | Organism | Virus hosts | Pathway |
|---|---|---|---|---|---|---|---|---|
| 0 | A0A3R5SMJ6 | A0A3R5SMJ6_WNV | unreviewed | Genome polyprotein | | West Nile virus (WNV) | Aedes [TaxID: 7158]; Amblyomma variegatum (Tro... | |
| 1 | M1UFP6 | M1UFP6_9FLAV | unreviewed | Genome polyprotein | | Bovine viral diarrhea virus 1b | | |
| 2 | P11223 | SPIKE_IBVB | reviewed | Spike glycoprotein (S glycoprotein) (E2) (Pepl... | S 2 | Avian infectious bronchitis virus (strain Beau... | Gallus gallus (Chicken) [TaxID: 9031] | |
| 3 | P11224 | SPIKE_CVMA5 | reviewed | Spike glycoprotein (S glycoprotein) (E2) (Pepl... | S 3 | Murine coronavirus (strain A59) (MHV-A59) (Mur... | Mus musculus (Mouse) [TaxID: 10090] | |
| 4 | P0C6X9 | R1AB_CVMA5 | reviewed | Replicase polyprotein 1ab (pp1ab) (ORF1ab poly... | rep 1a-1b | Murine coronavirus (strain A59) (MHV-A59) (Mur... | Mus musculus (Mouse) [TaxID: 10090] | |
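
The first rows already include entries that are not coronaviruses (West Nile virus, Bovine viral diarrhea virus 1b), so the free-text search matched more than the target family. Below is a minimal sketch of one way to narrow the table, assuming a case-insensitive match on the word "coronavirus" in the `Organism` column is acceptable. `keep_coronaviruses` is a hypothetical helper, and the sample frame copies a few values from the truncated display above rather than the real file.

```python
import pandas as pd

# Toy frame with values copied from the (truncated) df.head() display.
sample = pd.DataFrame({
    "Entry": ["A0A3R5SMJ6", "M1UFP6", "P11224", "P0C6X9"],
    "Organism": [
        "West Nile virus (WNV)",
        "Bovine viral diarrhea virus 1b",
        "Murine coronavirus (strain A59) (MHV-A59) (Mur...",
        "Murine coronavirus (strain A59) (MHV-A59) (Mur...",
    ],
})


def keep_coronaviruses(frame, column="Organism", keyword="coronavirus"):
    """Keep rows whose organism name contains the keyword (case-insensitive)."""
    mask = frame[column].str.contains(keyword, case=False, na=False)
    return frame[mask].reset_index(drop=True)


filtered = keep_coronaviruses(sample)
print(filtered)
```

A name-based filter like this is crude: it would also drop coronaviruses whose organism name does not contain the word, such as Avian infectious bronchitis virus in row 2, so a taxonomy-based filter may be a better choice for real use.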


In [13]:
df.groupby("Organism").count()

| Organism | Entry | Entry name | Status | Protein names | Gene names | Virus hosts | Pathway |
|---|---|---|---|---|---|---|---|
| 229E-related bat coronavirus | 39 | 39 | 39 | 39 | 31 | 0 | 0 |
| Alpaca respiratory coronavirus | 5 | 5 | 5 | 5 | 3 | 0 | 0 |
| Alphacoronavirus 1 | 21 | 21 | 21 | 21 | 21 | 0 | 0 |
| Alphacoronavirus Bat-CoV/P.kuhlii/Italy/206645-41/2011 | 6 | 6 | 6 | 6 | 6 | 0 | 0 |
| Alphacoronavirus Bat-CoV/P.kuhlii/Italy/206679-3/2010 | 6 | 6 | 6 | 6 | 6 | 0 | 0 |
| Alphacoronavirus Bat-CoV/P.kuhlii/Italy/3398-19/2015 | 6 | 6 | 6 | 6 | 6 | 0 | 0 |
| Alphacoronavirus Eptesicus fuscus/Appalachian Ridge/P1-C265/IT/USA/2009 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Alphacoronavirus Eptesicus fuscus/Appalachian Ridge/P3-C450/IT/USA/2009 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Alphacoronavirus Eptesicus fuscus/Appalachian Ridge/P3-C766/IT/USA/2009 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Alphacoronavirus Mink/China/1/2016 | 5 | 5 | 5 | 5 | 5 | 0 | 0 |
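
Because `count()` reports the number of non-null values per column, the table repeats nearly the same number across columns. If only the number of entries per organism and the number of those with a gene name are of interest, a named aggregation gives a more compact summary. The sketch below uses a toy frame with made-up values (`Virus A`, `Virus B`, `E1`...), not the real export.

```python
import pandas as pd

# Made-up rows that mimic the shape of the UniProt export.
toy = pd.DataFrame({
    "Entry": ["E1", "E2", "E3", "E4", "E5"],
    "Gene names": ["S", None, "N", "M", None],
    "Organism": ["Virus A", "Virus A", "Virus A", "Virus B", "Virus B"],
})

summary = (
    toy.groupby("Organism")
    .agg(
        entries=("Entry", "size"),             # all rows in the group
        with_gene_name=("Gene names", "count"),  # rows with a non-null gene name
    )
    .sort_values("entries", ascending=False)
)
print(summary)
```

Sorting by `entries` puts the most heavily annotated organisms first, which is usually more informative than the alphabetical order shown above.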


In [14]:
df.groupby("Virus hosts").count()

| Virus hosts | Entry | Entry name | Status | Protein names | Gene names | Organism | Pathway |
|---|---|---|---|---|---|---|---|
| Aedes [TaxID: 7158]; Amblyomma variegatum (Tropical bont tick) [TaxID: 34610]; Aves [TaxID: 8782]; Culex [TaxID: 53527]; Homo sapiens (Human) [TaxID: 9606]; Hyalomma marginatum [TaxID: 34627]; Mansonia uniformis [TaxID: 308735]; Mimomyia [TaxID: 308737]; Rhipicephalus [TaxID: 34630] | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| Alliaria petiolata (Garlic mustard) (Arabis petiolata) [TaxID: 126270]; Brassica [TaxID: 3705]; Calanthe [TaxID: 38206]; Capsella bursa-pastoris (Shepherd's purse) (Thlaspi bursa-pastoris) [TaxID: 3719]; Hesperis matronalis [TaxID: 264418]; Stellaria media (Common chickweed) (Alsine media) [TaxID: 13274]; Trifolium hybridum (Alsike clover) [TaxID: 74517] | 33 | 33 | 33 | 33 | 0 | 33 | 0 |
| Bos taurus (Bovine) [TaxID: 9913] | 78 | 78 | 78 | 78 | 77 | 78 | 0 |
| Camelus dromedarius (Dromedary) (Arabian camel) [TaxID: 9838]; Homo sapiens (Human) [TaxID: 9606] | 6 | 6 | 6 | 6 | 6 | 6 | 0 |
| Canis lupus familiaris (Dog) (Canis familiaris) [TaxID: 9615] | 16 | 16 | 16 | 16 | 16 | 16 | 0 |
| Capsicum annuum (Capsicum pepper) [TaxID: 4072]; Myzus [TaxID: 13163]; Petunia [TaxID: 4101]; Spinacia oleracea (Spinach) [TaxID: 3562]; Vicia faba (Broad bean) (Faba vulgaris) [TaxID: 3906] | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| Chlorocebus pygerythrus (Vervet monkey) (Cercopithecus pygerythrus) [TaxID: 60710] | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| Cucumis sativus (Cucumber) [TaxID: 3659] | 2 | 2 | 2 | 2 | 0 | 2 | 0 |
| Cymbidium [TaxID: 14366]; Trifolium repens (Creeping white clover) [TaxID: 3899] | 2 | 2 | 2 | 2 | 0 | 2 | 0 |
| Dianthus caryophyllus (Carnation) (Clove pink) [TaxID: 3570] | 2 | 2 | 2 | 2 | 0 | 2 | 0 |
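
Grouping on the raw `Virus hosts` string treats each semicolon-separated combination of hosts as its own group, so a host such as Homo sapiens is spread across several rows. Here is a minimal sketch of counting entries per individual host instead, assuming hosts are always separated by `;` and carry a `[TaxID: ...]` suffix as in the output above. The toy frame reuses two host strings from that output with made-up entry IDs.

```python
import pandas as pd

# Host strings copied from the grouped output; entry IDs are made up.
toy = pd.DataFrame({
    "Entry": ["E1", "E2", "E3"],
    "Virus hosts": [
        "Camelus dromedarius (Dromedary) (Arabian camel) [TaxID: 9838]; "
        "Homo sapiens (Human) [TaxID: 9606]",
        "Bos taurus (Bovine) [TaxID: 9913]",
        None,  # entries without host annotation are dropped below
    ],
})

hosts = (
    toy.dropna(subset=["Virus hosts"])
    .assign(host=lambda d: d["Virus hosts"].str.split(";"))
    .explode("host")            # one row per (entry, host) pair
    .reset_index(drop=True)
)
hosts["host"] = hosts["host"].str.strip()
# Pull the numeric taxonomy ID out of the "[TaxID: 1234]" suffix.
hosts["taxid"] = hosts["host"].str.extract(r"\[TaxID: (\d+)\]", expand=False)

host_counts = hosts.groupby("host")["Entry"].nunique().sort_values(ascending=False)
print(host_counts)
```

The extracted `taxid` column gives a stable key for joining against other taxonomy resources, which is more robust than matching on the display names.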
