<a href="https://colab.research.google.com/github/geovalexis/TFG/blob/main/notebooks/1_Data_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data retrieval



## SwissProt dataset

It may be worth restricting the analysis to Swiss-Prot proteins. It is still unclear whether all QfO proteins are in Swiss-Prot or whether some of them are TrEMBL entries.

- Swiss-Prot: manually reviewed proteins (suitable for prediction algorithms)
- TrEMBL: unreviewed, computationally annotated proteins

### Extracting UniProtKB/Swiss-Prot accessions from the flat file

In [None]:
!wget ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

In [None]:
!zcat uniprot_sprot.dat.gz | head -10000 | gzip > uniprot_sprot_prova.dat.gz

In [None]:
import gzip

uniprotKBs_list = []

# Stream the flat file line by line instead of loading it all into memory
with gzip.open("uniprot_sprot.dat.gz", "rt") as swiss_prot_file:
    uniprotKBs = []
    for line in swiss_prot_file:
        if line.startswith("AC   "):
            # AC lines look like "AC   P12345; Q67890;" and may span several lines per entry
            uniprotKBs.extend(acc.strip() for acc in line[2:].split(";") if acc.strip())
        elif line.startswith("//"):
            # "//" terminates an entry: store its accessions as one comma-separated row
            uniprotKBs_list.append(",".join(uniprotKBs))
            uniprotKBs = []

with open("drive/MyDrive/TFG/swiss_prot_ids.csv", "w") as output:
    output.write("\n".join(uniprotKBs_list))
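
The file written above can later be loaded back into a set for membership checks. The following is only a minimal sketch: it assumes one entry per line with comma-separated accessions and the primary accession first, and `load_swissprot_accessions` is a hypothetical helper that is not part of the original notebook. The accessions in the toy input are illustrative.

```python
import io

def load_swissprot_accessions(handle, primary_only=True):
    """Return a set of accessions from a file-like object that holds one
    comma-separated entry per line (primary accession first)."""
    accessions = set()
    for line in handle:
        # Strip whitespace around every accession and drop empty fields
        ids = [acc.strip() for acc in line.split(",") if acc.strip()]
        if not ids:
            continue  # skip blank lines
        accessions.update(ids[:1] if primary_only else ids)
    return accessions

# Toy input standing in for swiss_prot_ids.csv (illustrative accessions)
example = io.StringIO("P12345, Q11111\nQ99999\n\n")
swissprot = load_swissprot_accessions(example)

# Keep only the candidate proteins that are Swiss-Prot primary accessions
candidates = ["P12345", "A0A009GUA7", "Q99999"]
reviewed = [acc for acc in candidates if acc in swissprot]
```

With the real file, the handle would be `open("drive/MyDrive/TFG/swiss_prot_ids.csv")` instead of the `io.StringIO` stand-in.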


### Making a query to Uniprot


In [None]:
import requests

def downloadSwissProtIds() -> list:
    # Documentation in https://www.uniprot.org/help/api_queries
    endpoint = "https://www.uniprot.org/uniprot/"
    params = {
        'query': "reviewed:yes",
        'format': 'list'
    }
    response = requests.get(endpoint, params=params)
    response.raise_for_status()  # Raise an HTTPError on 4xx/5xx responses
    return response.text.splitlines()

## Id mapping file

The official UniProt FTP server provides a file that maps UniProtKB accessions to identifiers in other bioinformatics databases:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/


In [None]:
!wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz 

In [None]:
# We only need a subset of this file:
# * Columns: UniProtKB, GO (might be useful for a second check) and NCBI taxon
# * Rows: only those taxIDs that are within the QfO dataset

# According to the README file on the FTP server:
# - UniProtKB corresponds to column 1
# - GO corresponds to column 7
# - NCBI taxon corresponds to column 13

# So we cut the file by these columns as follows:
!zcat idmapping_selected.tab.gz | cut -d$'\t' -f1,7,13 | gzip > idmapping_selected_subset.tab.gz

# Then we can do the same as with the prot.accession2taxid file and keep only the taxa from QfO.
# It is important to set "\t" (tab) as the delimiter here (with the -F flag) and to set the
# Output Field Separator (OFS) to the same value as the Field Separator (FS), which is the tab in this case.
!awk -F '\t' 'BEGIN{OFS=FS} FNR==NR {a[$2]; next} ($3 in a)' drive/MyDrive/TFG/QFO_2018/QfO_statistics.tsv <(gzip -dc idmapping_selected_subset.tab.gz) | gzip > idmapping_selected_qfo_subset.tab.gz
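
For reference, the same two-file filter can be sketched in Python with the standard library. This is a rough equivalent of the awk command above, not a replacement for it: `filter_by_taxa` is a hypothetical helper, and the in-memory strings are toy stand-ins for `QfO_statistics.tsv` (taxID in the second column) and for the three-column subset file.

```python
import csv
import io

def filter_by_taxa(rows, taxa, taxon_col=2):
    """Yield only the rows whose taxon column (0-based index) is in `taxa`."""
    for row in rows:
        if len(row) > taxon_col and row[taxon_col] in taxa:
            yield row

# Toy stand-ins for the two input files (illustrative values)
qfo_stats = io.StringIO("UP000005640\t9606\nUP000000589\t10090\n")
subset = io.StringIO(
    "P12345\tGO:0003677\t9606\n"
    "Q99999\t\t562\n"
    "Q11111\tGO:0005886; GO:0006310\t10090\n"
)

# First pass: collect the taxIDs (second column), like `a[$2]` in awk
qfo_taxa = {row[1] for row in csv.reader(qfo_stats, delimiter="\t")}

# Second pass: keep the rows whose third column is one of those taxIDs
kept = list(filter_by_taxa(csv.reader(subset, delimiter="\t"), qfo_taxa))
```

For the real gzipped file, the second reader would wrap `gzip.open("idmapping_selected_subset.tab.gz", "rt", newline="")` so that rows are still streamed one at a time.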

## Gene Ontology Annotation (GOA) in GAF format 
 
 Documentation at http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/ 

### GOA for all species (full and unfiltered)



In [None]:
# Latest version available as of 15/01/2021
!wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz # See https://www.ebi.ac.uk/GOA/ and https://www.ebi.ac.uk/QuickGO/annotations


In [None]:
!zcat goa_uniprot_all.gaf.gz | head -n 10

!gaf-version: 2.1
!
!Generated: 2020-12-02 07:56
!GO-version: http://purl.obolibrary.org/obo/go/releases/2020-11-28/extensions/go-plus.owl
!
UniProtKB	A0A009GUA7	J508_4179		GO:0003677	GO_REF:0000002	IEA	InterPro:IPR006119|InterPro:IPR036162	F	Resolvase/invertase-type recombinase catalytic domain-containing protein	J508_4179	protein	taxon:1310609	20201128	InterPro		
UniProtKB	A0A009GUA7	J508_4179		GO:0006310	GO_REF:0000002	IEA	InterPro:IPR006119|InterPro:IPR036162	P	Resolvase/invertase-type recombinase catalytic domain-containing protein	J508_4179	protein	taxon:1310609	20201128	InterPro		
UniProtKB	A0A009GUA7	J508_4179		GO:0000150	GO_REF:0000002	IEA	InterPro:IPR006119|InterPro:IPR036162	F	Resolvase/invertase-type recombinase catalytic domain-containing protein	J508_4179	protein	taxon:1310609	20201128	InterPro		
UniProtKB	A0A009HCR2	J517_0313		GO:0005886	GO_REF:0000044	IEA	UniProtKB-SubCell:SL-0039	C	Biopolymer transport ExbD/TolR family protein	J517_0313	protein	taxon:1310618	20201128	U

In [None]:
# The original GAF file is too large, so we filter it by the QfO taxa.
# IMPORTANT: this could not be done with awk because lines in the GAF file do not always have the same number of fields, so do NOT parse it with pandas either.
# grep with a list of patterns: https://stackoverflow.com/questions/17863301/how-to-grep-with-a-list-of-words
!grep -wFf <(awk '{print "taxon:"$2}' drive/MyDrive/TFG/QFO_2018/QfO_statistics.tsv | tail -n +1) <(zcat -dc goa_uniprot_all.gaf.gz | tail -n +13) | gzip > goa_uniprot_qfo.gaf.gz
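
A line-by-line alternative in Python is sketched below. It assumes the GAF 2.1 layout, where column 13 holds the taxon and may contain two IDs separated by `|`, the first being the organism that encodes the gene product. `gaf_taxa`, `filter_gaf` and `make_line` are hypothetical helpers, and the toy lines are built by hand. Unlike the `grep -w` call above, this checks only column 13 rather than the whole line.

```python
def gaf_taxa(line):
    """Return the taxon IDs found in column 13 of a GAF line, as strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 13:
        return []
    # Column 13 may hold two taxa, e.g. "taxon:562|taxon:9606"
    return [t.replace("taxon:", "", 1) for t in fields[12].split("|")]

def filter_gaf(lines, taxa):
    """Yield annotation lines whose first taxon is in `taxa`; skip '!' comments."""
    for line in lines:
        if line.startswith("!"):
            continue
        ids = gaf_taxa(line)
        if ids and ids[0] in taxa:
            yield line

def make_line(acc, taxon):
    # Build a minimal 15-column GAF-like line; only columns 2 and 13 matter here
    fields = ["UniProtKB", acc, "SYMBOL", "", "GO:0003677", "GO_REF:0000002",
              "IEA", "", "F", "", "", "protein", taxon, "20201128", "InterPro"]
    return "\t".join(fields) + "\n"

lines = [
    "!gaf-version: 2.1\n",
    make_line("A0A009GUA7", "taxon:1310609"),
    make_line("P12345", "taxon:9606"),
    make_line("Q99999", "taxon:562|taxon:9606"),
]
kept = list(filter_gaf(lines, {"9606"}))
```

On the real data, `lines` would be the handle returned by `gzip.open("goa_uniprot_all.gaf.gz", "rt")`, so the file is never loaded into memory as a whole.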

### GOA for Homo Sapiens

In [None]:
!wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz

--2021-01-29 17:58:18--  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz
           => ‘goa_human.gaf.gz’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/GO/goa/HUMAN ... done.
==> SIZE goa_human.gaf.gz ... 8256026
==> PASV ... done.    ==> RETR goa_human.gaf.gz ... done.
Length: 8256026 (7.9M) (unauthoritative)


2021-01-29 17:58:21 (6.30 MB/s) - ‘goa_human.gaf.gz’ saved [8256026]



## Orthologs dataset

### From MetaPhOrs

Website: http://orthology.phylomedb.org/ 

In [3]:
# Orthologs input from MetaPhOrs that was used for the QfO benchmarking. 
# File generated and provided by Manu 
# MetaPhOrs database version: 201603
!ls drive/MyDrive/TFG/MtP_201912.tab

drive/MyDrive/TFG/MtP_201912.tab


In [None]:
!zcat goa_human.gaf.gz | head -n 20

!gaf-version: 2.1
!
!The set of protein accessions included in this file is based on UniProt reference proteomes, which provide one protein per gene.
!They include the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record.
!If a particular protein accession is not annotated with GO, then it will not appear in this file.
!
!Note that the annotation set in this file is filtered in order to reduce redundancy; the full, unfiltered set can be found in
!ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gz
!
!Generated: 2020-12-01 16:40
!GO-version: http://purl.obolibrary.org/obo/go/releases/2020-11-28/extensions/go-plus.owl
!
UniProtKB	A0A024R1R8	hCG_2014768		GO:0002181	PMID:21873635	IBA	PANTHER:PTN002008372|SGD:S000007246	P	Coiled-coil domain-containing protein 72	hCG_2014768	protein	taxon:9606	20171102	GO_Central		
UniProtKB	A0A024RBG1	NUDT4B		GO:0000298	PMID:21873635	IBA	PANTHER:PTN000290327|SGD:S000005689	F

# Deprecated

## Reference Proteome Dataset

In [None]:
#### CELL DEPRECATED ####

# Quest for Orthologs (QfO) dataset
!ls drive/MyDrive/TFG/QFO_2018/

# QfO species and statistics
# We got this information from README file of the QfO release 
# It was formatted into a tabulated file by the following command: 
# cat QfO_statistics.txt | tr -s ' ' | cut -f1-6 -d" " | tr ' ' '\t' > QfO_statistics.tsv
!head -10 drive/MyDrive/TFG/QFO_2018/QfO_statistics.tsv


In [None]:
# Get QfO species (taxIDs) and all their corresponding uniprotIDs available in the QfO database

!wget ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/previous_releases/qfo_release-2018_04/QfO_release_2018_04.tar.gz 
!tar -xvzf QfO_release_2018_04.tar.gz --one-top-level

In [None]:
import os
import pandas as pd
qfoDir='./QfO_release_2018_04/'
QfO_proteome = {}
for subdir, dirs, files in os.walk(qfoDir):
    for file in files:
        if file.endswith(".gene2acc"):
            taxID = file.split("_")[1].split(".")[0]
            QfO_proteome[taxID] = pd.read_csv(os.path.join(subdir, file), sep="\t", names=["GeneSymbol", "UniprotKB", "CanonicalGeneSymbol"], usecols=["UniprotKB"])["UniprotKB"]
QfO_proteome = pd.DataFrame(QfO_proteome)
QfO_proteome.to_csv("drive/MyDrive/TFG/QFO_2018/QfO_uniprotKBs.tsv", sep="\t", header=True, index=False)
QfO_proteome

Unnamed: 0,8090,36329,7165,237631,10090,10116,284812,330879,418459,6945,3055,6239,9031,367110,13616,5722,684364,9615,7227,8364,559292,5664,321614,4577,9595,7918,184922,9913,35128,7955,9606,3702,6412,164328,7719,7070,214684,81824,237561,665079,5888,44689,39947,3218,45351,7739,284591,9598,243232,188937,64091,273057,374847,436308,69014,1111708,189518,100226,224308,243231,243230,515635,243090,324602,243274,83333,251221,226186,224324,85962,224911,289376,190304,208964,122586,83332,272561,243273
0,Q01798,Q25823,Q7QEI4,Q0H8Y6,P01753,P01681,Q9US57,Q8X176,E3JPW3,B7P404,P35006,P34257,P40666,P22702,Q28462,A2GII8,F4NY12,O97945,Q8T8R6,P84383,P34111,Q01782,A7UG00,P19950,P30375,W5LV44,A8B2Y5,Q2M2T3,B8BT19,P0C8Y2,Q6P435,Q9SP32,T1ECX9,H3G4X1,F7A355,Q94R76,Q5KQA0,A9V4W6,A0A1D8PCG7,A7E3X9,A0BAJ9,Q23890,Q8RZL9,Q1XGB8,A7S0L9,O47426,Q36258,P30686,Q60328,Q8TUR2,O51955,Q97TX9,B1L7D4,A9A4Z2,P77933,P77971,Q8FA34,O50499,P05648,Q74GG6,Q9RYE8,B8DYG0,Q7UZ97,A9WAN1,Q9WYH1,P23843,Q7NPQ4,Q8ABV9,O67904,O24866,Q89YF1,B5YGS5,Q8RHA7,Q9I7C5,Q9K1R8,P71591,O84004,P47377
1,Q91178,Q8IBG1,Q7PRX6,Q0H8Y5,P01750,Q63010,Q9US56,Q6A3P9,E3JPW4,B7P3D3,Q32063,Q11075,P40665,Q6MVL6,Q864T8,A2H602,F4NZ78,Q95155,Q9W1R9,P84386,P39702,Q01782,A7UG01,P19656,Q28426,W5LV45,A8B2Y7,Q2M2T2,B8CCW4,P0C8Y3,Q6AWC8,F4HQG6,T1ECY0,H3G4X2,P16240,Q94R77,Q5KQ99,A9V4W9,A0A1D8PCJ9,A7E3Y0,A0BAK0,Q23891,Q8RZL9,Q1XGB6,A7S0I6,C3Y0R7,Q36257,Q68US2,Q60338,Q8TUR1,O51955,Q981E3,B1L7D5,A9A4Z3,Q5JEA6,Q55524,Q8FA33,O50500,P05649,Q74H91,Q9RYE7,B8DYG1,Q7UKR2,A9WAN2,Q9WYH1,P23843,Q7NPQ3,Q8ABV8,O66428,O24866,Q89YF0,B5YGS6,P68997,Q9I7C4,Q9K1R7,P71591,O84005,P47441
2,Q6E211,Q25822,Q7QEI6,Q0H8W9,P01748,Q63003,Q9US55,Q4WJA1,E3JPW5,B7P3D4,P37825,Q09234,P35331,Q6MVL6,O77618,A2HMN3,F4NYV6,Q95154,Q9W283,P84385,P32471,Q01782,A7UG02,P17571,P30388,W5LV46,A8B2Y9,Q2KIS6,B8CCW8,F1QYH9,Q5XG85,Q9MAL9,T1ECY1,H3G4X3,O02367,Q94R78,Q5KQ98,A9V4X1,A0A1D8PCL1,A7E3Y1,A0BAK1,Q8T132,Q8RZL9,Q1XGD2,A7S0J6,C3Y0R9,Q37695,Q28814,P58415,Q8TUR0,O51956,Q981E2,B1L7D6,A9A4Z4,Q5JEA5,Q55525,Q8FA32,O50501,P05650,Q74H90,Q9RYE6,B8DYG2,Q7UKR1,A9WAN3,Q9WYH1,P0AFH2,Q7NPQ2,Q8ABV7,O66429,O24867,Q89YE9,B5YGS7,Q8RHA6,Q9I7C3,Q9K1R6,P71594,O84006,P47485
3,Q9I8F9,Q8IBP1,Q7QEI7,Q0H8X3,P01747,Q62716,Q9US54,Q4WJ38,E3JPW6,B7P2B5,P36443,Q8IG42,P35062,Q7RYA2,O02789,A2FFW5,F4NYY7,Q075B4,A0A0B4LG21,P84384,P10591,Q9NE83,A7UG03,P15719,P30387,W5LV47,A8B2Z3,Q2KIL1,B8CCX0,F1QNN1,Q5W150,F4HQJ3,T1ECY2,H3G4X4,Q94425,Q94R75,Q5KQ97,A9V4X7,A0A1D8PCW6,A7E3Y1,Q6BFB0,Q86HD6,Q8RZL9,A9U1T9,A7S0H7,C3Y0S0,Q9B6D0,Q28813,P58416,Q8TH29,O51956,Q981E1,B1L7D7,A9A4Z5,Q5JEA4,Q55526,Q8FA31,O50502,P05651,Q74H89,Q9RYE5,B8DYG3,Q7UKR0,A9WAN4,Q9WYC5,P0AFH2,Q7NPQ1,Q8ABV6,O66429,O24867,Q89YE8,B5YGS8,Q8RHA5,Q9I7C2,Q9K1R5,P71594,O84007,Q49329
4,P87368,Q7KQL5,Q7PRW8,Q0H8Y4,P01746,Q62713,Q9US53,Q4WSM6,E3JPW7,B7P325,Q32065,A0A1S7LE80,P30373,Q7RYA1,P0C593,A2FFW6,F4NYQ8,P99506,P29555,P84387,P39704,Q9NE83,A7UG04,P15718,P30386,W5LV48,A8B8W8,Q2KID8,B8CCX2,E7F8I1,Q5VV11,A0A1P8AVN2,T1ECY3,H3G4X5,F6X2V8,Q94R68,Q5KQ96,A9V4X9,P0CU35,A7E3Y2,Q6BFB1,Q75K15,Q75KY3,A9U1T9,A7S0Q7,C3Y0S6,Q9B6E7,Q28812,P54110,Q8TH29,O51957,Q981E0,B1L7D8,A9A4Z6,Q5JEA3,Q55527,Q8FA30,O50503,P37525,Q74H88,Q9RYE4,B8DYG4,Q7UKQ9,A9WAN5,Q9WYC5,P0AFH6,Q7NPQ0,Q8ABV5,O66430,O25029,Q89YE7,B5YGS9,Q8RHA4,Q9I7C1,P63622,P71599,O84008,P47530
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100307,,,,,,,,,,,,,,,,,,,,,,,,P27324,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
100308,,,,,,,,,,,,,,,,,,,,,,,,P46642,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
100309,,,,,,,,,,,,,,,,,,,,,,,,P03938,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
100310,,,,,,,,,,,,,,,,,,,,,,,,Q36997,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
# Get species taxIDs from QfO (it includes many strain or subspecies taxIDs)
!pip install --upgrade ete3
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

In [None]:
from ete3 import NCBITaxa
import pandas as pd

# Snippet taken from mapQfO2MtP.py (Gabaldonlab/qfo-2020 repo), slightly modified to also return the species name
ncbi = NCBITaxa()  # Open the local taxonomy database once instead of on every call

def getSpecie(taxID: int):
    """Find the species-level taxID for a strain or subspecies taxID (MUST BE AN INTEGER) in the NCBI taxonomy.
    Args:
        taxID (int): taxID of the taxon of interest.
    Returns:
        tuple: (specieID, species_name), the taxID and scientific name of the corresponding species.
    """
    specieID = taxID # If the taxID is already a species, the same taxID is returned
    if ncbi.get_rank([taxID])[taxID] != 'species':
        lineage = ncbi.get_lineage(taxID)
        for j in reversed(lineage): # Reversed because the species sits near the end of a strain or subspecies lineage
            if ncbi.get_rank([j])[j] == 'species':
                specieID = j
                break # Stop at the first species-level ancestor
    species_name = ncbi.get_taxid_translator([specieID])[specieID]
    return specieID, species_name

# The first line (header) of the tsv file corresponds to the taxIDs of the reference species
QfO_taxa = pd.read_csv("drive/MyDrive/TFG/QFO_2018/QfO_uniprotKBs.tsv", sep="\t", nrows=0).columns.astype("int32")
QfO_reference_species = [(QfO_taxa[i], *getSpecie(QfO_taxa[i])) for i in range(QfO_taxa.size)]
QfO_reference_species = pd.DataFrame(QfO_reference_species, columns=["QfO_taxID", "specie_taxID", "species_name"])
QfO_reference_species.to_csv("drive/MyDrive/TFG/QFO_2018/subspecieID2specieID.tsv", sep="\t", header=True, index=False)
assert len(QfO_taxa) == len(QfO_reference_species)
QfO_reference_species



Unnamed: 0,QfO_taxID,specie_taxID,species_name
0,8090,8090,Oryzias latipes
1,36329,5833,Plasmodium falciparum
2,7165,7165,Anopheles gambiae
3,237631,5270,Ustilago maydis
4,10090,10090,Mus musculus
...,...,...,...
73,208964,287,Pseudomonas aeruginosa
74,122586,487,Neisseria meningitidis
75,83332,1773,Mycobacterium tuberculosis
76,272561,813,Chlamydia trachomatis
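
The lineage walk behind this mapping can be illustrated without the NCBI database. The sketch below is a toy version only: `first_species_in_lineage` is a hypothetical helper, and the lineage and ranks are hand-written and abbreviated for the strain 208964 from the table above, not retrieved from ete3.

```python
def first_species_in_lineage(lineage, ranks):
    """Walk a root-to-leaf lineage from the leaf upwards and return the
    first taxID ranked 'species', or None if there is none."""
    for taxid in reversed(lineage):
        if ranks.get(taxid) == "species":
            return taxid
    return None

# Hand-written, abbreviated lineage and ranks (illustrative, not from ete3)
lineage = [1, 2, 286, 287, 208964]
ranks = {1: "no rank", 2: "superkingdom", 286: "genus",
         287: "species", 208964: "strain"}

species_id = first_species_in_lineage(lineage, ranks)
```

Starting from the leaf means a strain or subspecies reaches its species after one or two steps, which is why the lineage is traversed in reverse.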


In [None]:
# Human reference proteome dataset - CANONICAL PROTEOME
# These are only the canonical proteins (the XML file contains only the canonical sequences).
# IMPORTANT NOTE: We are not adding isoforms or variant proteins so far.
# This is the reason why we do not take the UniProt IDs directly from the previous file.
!pip install biopython

In [None]:
from Bio import SeqIO
import pandas as pd
records = list(SeqIO.parse("drive/MyDrive/TFG/QFO_2018/UP000005640_9606.xml", "seqxml"))
records_ids = [record.id for record in records]
df_records_ids = pd.DataFrame(records_ids)
df_records_ids.to_csv("drive/MyDrive/TFG/QFO_2018/human_reference_proteome.tsv", sep="\t", index=False, header=False)
df_records_ids

Unnamed: 0,0
0,A0A024R161
1,A0A024R1R8
2,A0A075B6F4
3,A0A075B6H5
4,A0A075B6H7
...,...
20991,V9GYY9
20992,V9GZ38
20993,W5XKT8
20994,X5D2U9


## UniprotKB to TaxID 

Necessary for retrieving the taxID of each protein.


In [None]:
!wget ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
!wget ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz #There are also some proteins that have been suppressed or withdrawn ("dead")

--2021-02-01 21:22:02--  http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::11, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz [following]
--2021-02-01 21:22:02--  https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6458643029 (6.0G) [application/x-gzip]
Saving to: ‘prot.accession2taxid.gz’


2021-02-01 21:24:22 (44.0 MB/s) - ‘prot.accession2taxid.gz’ saved [6458643029/6458643029]



In [None]:
# Keep only the proteins whose taxon is a QfO reference taxon. This saves a lot of memory when assigning taxIDs.

## awk is much more memory-efficient than pandas. Credits: https://stackoverflow.com/questions/14062402/awk-using-a-file-to-filter-another-one-out-tr 
!awk 'FNR==NR {a[$2]; next} ($3 in a)' drive/MyDrive/TFG/QFO_2018/QfO_statistics.tsv <(gzip -dc prot.accession2taxid.gz) | gzip > prot.accession2QfOtaxid.gz
## Check that all QfO taxa are present (78 in the 2018 dataset)
!zcat prot.accession2QfOtaxid.gz | cut -d$'\t' -f3 | sort | uniq | wc -l

78


In [None]:
## This approach ran out of memory with pandas
QfO_taxIDs = set(pd.read_csv("drive/MyDrive/TFG/QFO_2018/QfO_uniprotKBs.tsv", sep="\t", nrows=0).columns.astype("int32"))
uniprot2taxid = pd.read_csv("prot.accession2taxid.gz", sep="\t", 
                            header=0, compression="gzip", 
                            usecols=["accession", "taxid"],
                            dtype={"accession":"string", "taxid": "int32"})
uniprot2taxid_qfo_subset = uniprot2taxid[uniprot2taxid["taxid"].isin(QfO_taxIDs)]
uniprot2taxid_qfo_subset.to_csv("uniprot2QFOtaxid.tsv", sep="\t", header=True, index=False)
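
If the filtering has to stay in pandas, reading the table in chunks keeps only one slice in memory at a time. The sketch below is one possible approach under that assumption, not what the notebook ended up using (the awk command above was the working solution). `filter_taxids_in_chunks` is a hypothetical helper, and the toy table only mimics the `accession`, `accession.version`, `taxid` and `gi` columns.

```python
import io
import pandas as pd

def filter_taxids_in_chunks(source, taxids, chunksize=1_000_000):
    """Read a tab-separated accession2taxid table in chunks and keep only
    the rows whose taxid is in `taxids`."""
    reader = pd.read_csv(source, sep="\t", header=0,
                         usecols=["accession", "taxid"],
                         dtype={"accession": "string", "taxid": "int32"},
                         chunksize=chunksize)
    # Filter each chunk as it is read so that only matches accumulate
    kept = [chunk[chunk["taxid"].isin(taxids)] for chunk in reader]
    if not kept:
        return pd.DataFrame(columns=["accession", "taxid"])
    return pd.concat(kept, ignore_index=True)

# Toy table standing in for prot.accession2taxid.gz (illustrative rows)
toy = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "P12345\tP12345.1\t9606\t1\n"
    "Q99999\tQ99999.1\t562\t2\n"
    "Q11111\tQ11111.2\t10090\t3\n"
)
result = filter_taxids_in_chunks(toy, {9606, 10090}, chunksize=2)
```

With the real file, `source` would be the path `"prot.accession2taxid.gz"`; pandas infers gzip compression from the extension by default.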