In [1]:
import networkx as nx
import pickle
import numpy as np

# Notes

This notebook introduces the graph data created by Synergetic. We constructed 2 graphs of human protein-protein interactions to help research on COVID-19. The graphs are intended for internal and external machine learning research with Python. We gathered and consolidated data from several databases:
* Uniprot : https://www.uniprot.org/
* QuickGO : https://www.ebi.ac.uk/QuickGO/
* Bgee : https://bgee.org/
* HMDB : https://hmdb.ca/
* STRING : https://string-db.org/
* SMPDB : https://smpdb.ca/
* covid-data:
    * https://covid-19.uniprot.org/uniprotkb?query=*
    * https://www.ebi.ac.uk/ena/pathogens/covid-19

# Load data

In [14]:
pp_interactions_undirected = nx.read_gpickle("../data/pp_interactions_undirected.gpickle")
pp_interactions_directed = nx.read_gpickle("../data/pp_interactions_directed.gpickle")
cc_ontology = nx.read_gpickle("../data/cc_ontology.gpickle")
mf_ontology = nx.read_gpickle("../data/mf_ontology.gpickle")
bp_ontology = nx.read_gpickle("../data/bp_ontology.gpickle")
cc_union_dict = pickle.load(open("../data/cc_union_dict.p", "rb"))
mf_union_dict = pickle.load(open("../data/mf_union_dict.p", "rb"))
bp_union_dict = pickle.load(open("../data/bp_union_dict.p", "rb"))
go_to_name = pickle.load(open("../data/string_go_to_name.p", "rb"))
metabolite_id_to_name = pickle.load(open("../data/matabolites_id_to_name.p", "rb"))
tissue_num_mapping = pickle.load(open("../data/tissue_num_mapping.p", "rb"))
index_tissue = {index: tissue for tissue, index in tissue_num_mapping.items()}
string_gene_to_proteins = pickle.load(open("../data/string_gene_to_proteins.p", "rb"))
covid_data = pickle.load(open("../data/covid_data.p", "rb"))
covid_go_to_name = pickle.load(open("../data/covid_go_to_name.p", "rb"))
covid_interacting_nodes = pickle.load(open("../data/covid_interacting_nodes.p", "rb"))
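
Note that `nx.read_gpickle` was removed in NetworkX 3.0. The `.gpickle` files are ordinary pickle files, so on recent NetworkX versions they can usually be loaded with `pickle` directly. The helper below is a minimal sketch of that approach; the name `load_pickle` is our own and is not part of the repository. It also closes the file handle after reading, which the `pickle.load(open(...))` pattern above does not.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load any pickled object (graph or dict) and close the file afterwards."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Usage with the paths from the cell above (needs the data files and networkx installed):
# pp_interactions_undirected = load_pickle("../data/pp_interactions_undirected.gpickle")
# go_to_name = load_pickle("../data/string_go_to_name.p")
```

As with any pickle, only load files that come from a trusted source.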

# Protein-protein interactions undirected

pp_interactions_undirected : networkX undirected nx.Graph (see https://networkx.github.io/documentation/stable/reference/classes/index.html)

### nodes attributes:
   * label : uniprot_id from UniProt (https://www.uniprot.org/)
   * string node_type : metabolome_protein (with associated pathways and metabolites) or other_protein (not referenced as a metabolome protein : no metabolites and no pathway on SMPDB : https://smpdb.ca/)
   * string info : short text describing the products of the mRNA that codes for the protein, from the STRING database (https://string-db.org/)
   * list cellular_components : list of GO IDs of the cellular components the protein belongs to in QuickGO (see gene ontology : https://www.ebi.ac.uk/QuickGO/). The dict go_to_name maps GO IDs to names.
   * list molecular_functions : list of GO IDs as above but for molecular functions.
   * list biological_processes : list of GO IDs as above but for biological processes.
   * list expression_data : vector of 308 floats giving the expression ranks of the mRNA coding for the protein, renormalized from 0 to 1, in 308 tissues (see https://bgee.org/). index_tissue is a dict mapping an index in the vector to the tissue name.
   * list metabolites : list of HMDB IDs of the metabolites associated with the protein if it is a metabolome_protein (see https://hmdb.ca/). metabolite_id_to_name is a dict mapping an ID to the metabolite name.
   * list pathways : list of names of the pathways the metabolome_protein belongs to. For more information on a pathway, search for it on SMPDB (it might not be referenced).
   * string sequence : amino acid sequence of the protein
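
The node attributes are meant to be combined with the lookup dicts loaded earlier. The sketch below shows one way to do that on toy stand-ins for a node's attribute dict, `index_tissue` and `go_to_name`. The function name `highest_value_tissues`, the toy values and the toy GO name are illustrative assumptions. It also assumes you want the largest values in `expression_data`; the source does not say whether a larger renormalized rank means stronger expression.

```python
def highest_value_tissues(expression, index_to_tissue, k=3):
    """Return the k (tissue, value) pairs with the largest expression_data values."""
    ranked = sorted(enumerate(expression), key=lambda pair: pair[1], reverse=True)
    return [(index_to_tissue[index], value) for index, value in ranked[:k]]


# Toy stand-ins (assumptions): the real vector has 308 entries.
toy_index_tissue = {0: "heart", 1: "brain", 2: "testis", 3: "blood"}
toy_go_to_name = {"GO:0000001": "toy component name"}
toy_attributes = {
    "expression_data": [0.0, 0.33, 0.28, 0.0],
    "cellular_components": ["GO:0000001", "GO:0000002"],
}

top_tissues = highest_value_tissues(toy_attributes["expression_data"], toy_index_tissue, k=2)
# Fall back to the raw GO ID when it has no entry in the name dict.
component_names = [toy_go_to_name.get(go_id, go_id) for go_id in toy_attributes["cellular_components"]]
print(top_tissues)
print(component_names)
```

On the real data, the attribute dict of a protein would be `pp_interactions_undirected.nodes["Q16671"]`.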

In [8]:
list(pp_interactions_undirected.nodes(data=True))[300]

('Q16671',
 {'node_type': 'metabolome_protein',
  'info': 'Anti-Muellerian hormone type-2 receptor; On ligand binding, forms a receptor complex consisting of two type II and two type I transmembrane serine/threonine kinases. Type II receptors phosphorylate and activate type I receptors which autophosphorylate, then bind and activate SMAD transcriptional regulators. Receptor for anti-Muellerian hormone',
  'cellular_components': ['GO:0005887', 'GO:0043235'],
  'biological_processes': ['GO:0005515',
   'GO:0005524',
   'GO:0046872',
   'GO:0042562',
   'GO:0005026',
   'GO:1990272'],
  'molecular_functions': ['GO:0005515',
   'GO:0005524',
   'GO:0046872',
   'GO:0042562',
   'GO:0005026',
   'GO:1990272'],
  'expression_data': [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.3336732673267327,
   0.0,
   0.23565346534653464,
   0.28515841584158413,
   0.30298019801980197,
   0.0,
   0.0,
   0.0,
   0.3316930693069307,
   0.0,
   0.3346633663366337,
   0.0,
   0.2346633663366336

### edges attribute:
* float score : a score between 0 and 1 representing the strength of the interaction (see https://string-db.org/)

In [9]:
list(pp_interactions_undirected.edges(data=True))[300]

('P84085', 'P42336', {'score': 0.192})
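
Because every edge carries a `score`, a common first step is to keep only the high-confidence interactions. The following is a minimal sketch on a toy `nx.Graph`; the helper name `high_confidence_subgraph`, the node names and the scores are illustrative assumptions, not values from the dataset.

```python
import networkx as nx


def high_confidence_subgraph(graph, threshold):
    """Return a copy of an undirected graph restricted to edges with score >= threshold."""
    kept = [(u, v) for u, v, score in graph.edges(data="score") if score >= threshold]
    return graph.edge_subgraph(kept).copy()


# Toy graph standing in for pp_interactions_undirected.
toy_graph = nx.Graph()
toy_graph.add_edge("A", "B", score=0.192)
toy_graph.add_edge("B", "C", score=0.902)
toy_graph.add_edge("C", "D", score=0.95)

strong = high_confidence_subgraph(toy_graph, 0.9)
print(sorted(strong.edges(data="score")))
```

Nodes that lose all their edges are dropped from the result, since `edge_subgraph` keeps only the endpoints of the selected edges.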

# Protein-protein interactions directed

pp_interactions_directed : networkX directed nx.MultiDiGraph (https://networkx.github.io/documentation/stable/reference/classes/index.html)

### nodes attributes: same as undirected

### edges attributes:
* string link : type of edge between 2 proteins :
    * binding_activation : directed binding activation
    * binding_inhibition : directed binding inhibition
    * binding : directed binding without further information
    * activation : directed protein activation
    * inhibition : directed protein inhibition
    * reaction : the product of the first protein is involved in the reactants or products of the target protein
    * catalysis : the product of the first protein catalyzes the target protein's reaction/function
    * ptmod : first protein modifies second protein post-translationally
    * expression : first protein increases expression of target protein (example : transcription factor)
    * expression_inhibition : first protein inhibits expression of target protein (example : transcription factor)
* float score : a score between 0 and 1 from the STRING database representing the confidence/strength of the connection

In [10]:
list(pp_interactions_directed.edges(data=True))[300]

('Q9Y587', 'Q6ULP2', {'link': 'binding', 'score': 0.902})
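
Since the directed graph is a `MultiDiGraph`, the same ordered pair of proteins can be connected by several edges with different `link` values. The sketch below counts the link types and extracts the edges of one type on a toy graph; the node names and scores are illustrative assumptions.

```python
from collections import Counter

import networkx as nx

# Toy graph standing in for pp_interactions_directed.
toy_directed = nx.MultiDiGraph()
toy_directed.add_edge("A", "B", link="binding", score=0.9)
toy_directed.add_edge("A", "B", link="activation", score=0.8)  # parallel edge, different link
toy_directed.add_edge("B", "C", link="inhibition", score=0.4)

# Number of edges per link type.
link_counts = Counter(link for _, _, link in toy_directed.edges(data="link"))
print(link_counts)

# All (source, target, score) triples for one link type.
inhibitions = [
    (source, target, attrs["score"])
    for source, target, attrs in toy_directed.edges(data=True)
    if attrs["link"] == "inhibition"
]
print(inhibitions)
```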

# Ontology graphs

* cc_ontology : networkX directed nx.DiGraph of cellular component ontology from QuickGO
* mf_ontology : same for molecular_function
* bp_ontology : same for biological processes

see https://networkx.github.io/documentation/stable/reference/classes/index.html

### node attributes:
* string node_type : type of the node:
     * metabolome_protein : protein involved in metabolic processes
     * other_protein : other protein not involved or not referenced in metabolic processes
     * cellular_component : GO ID from QuickGO of a cellular component (for cc_ontology graph)
     * biological_process : GO ID from QuickGO of a biological process (for bp_ontology graph)
     * molecular_function : GO ID from QuickGO of a molecular function (for mf_ontology graph)

### edge attributes :
* string link : type of edge in the ontology :
    *  is_a : the source term is a subtype of the target term
    *  part_of : the source term is part of the target term

In [11]:
list(cc_ontology.edges(data=True))[0]

('GO:0005737', 'GO:0110165', {'link': 'is_a'})

In [13]:
print(list(cc_ontology.nodes(data=True))[0])
print(list(cc_ontology.nodes(data=True))[-1])

('GO:0005737', {'node_type': 'cellular_component', 'size': 11252})
('Q9NX36', {'node_type': 'other_protein'})
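
The edge shown above points from the more specific term to the broader one, so following edges forward moves up the ontology. The sketch below uses this on a toy `nx.DiGraph` to collect the broader terms of a GO term and the proteins found below a term. The toy IDs are placeholders, and the protein-to-term edge is an assumption: the outputs above show that proteins are nodes of the ontology graphs, but not how they are connected.

```python
import networkx as nx

# Toy ontology standing in for cc_ontology.
toy_ontology = nx.DiGraph()
toy_ontology.add_node("GO:A", node_type="cellular_component")
toy_ontology.add_node("GO:B", node_type="cellular_component")
toy_ontology.add_node("GO:C", node_type="cellular_component")
toy_ontology.add_node("P1", node_type="other_protein")
toy_ontology.add_edge("GO:C", "GO:B", link="part_of")
toy_ontology.add_edge("GO:B", "GO:A", link="is_a")
toy_ontology.add_edge("P1", "GO:C")  # assumption: a protein points to its annotated term

# Everything reachable by following edges forward: the broader terms.
broader_terms = nx.descendants(toy_ontology, "GO:C")

# Everything that can reach the term: narrower terms and, under the assumption above, proteins.
below = nx.ancestors(toy_ontology, "GO:A")
proteins_below = {
    node for node in below if toy_ontology.nodes[node]["node_type"].endswith("_protein")
}
print(broader_terms, proteins_below)
```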


# Other data

* dict string_gene_to_proteins : mapping of genes to their protein products
* dict cc_union_dict : mapping of a cellular component GO ID to all proteins included in the category
* dict mf_union_dict : mapping of a molecular function GO ID to all proteins included in the category
* dict bp_union_dict : mapping of a biological process GO ID to all proteins included in the category
* dict go_to_name : mapping of a GO ID to the name of the category
* dict metabolite_id_to_name : mapping of an HMDB ID to the name of the metabolite
* dict tissue_num_mapping : mapping of a tissue name to its index in the expression vector
* covid_data : dict containing data about proteins involved in COVID-19 from https://covid-19.uniprot.org/uniprotkb?query=*:
    * key human : whether the protein is a Homo sapiens protein or not
    * key sequance : amino acid sequence of the protein
    * key molecular_functions : same as for nodes of protein-protein interaction graph
    * key cellular_components : same as for nodes of protein-protein interaction graph
    * key biological_processes : same as for nodes of protein-protein interaction graph
    * key info : info from publications about the protein
* covid_go_to_name : dict mapping GO IDs from the COVID-19 data to names
* covid_interacting_nodes : a list of nodes (proteins) that the COVID-19 proteins interact with. This list is meant to be used as a test set for edge regression/classification machine learning models. The proteins were extracted from https://www.biorxiv.org/content/10.1101/2020.03.22.002386v1
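
The notebook does not spell out how `covid_interacting_nodes` should be turned into a test set. One possible reading, sketched below under that assumption, is to hold out every edge that touches one of the listed proteins and to train on the remaining edges. The function name `split_edges_by_held_out_nodes` and the toy data are hypothetical.

```python
def split_edges_by_held_out_nodes(edges, held_out_nodes):
    """Split (u, v, attrs) edges into (train, test); test edges touch a held-out node."""
    held_out = set(held_out_nodes)
    train, test = [], []
    for u, v, attrs in edges:
        if u in held_out or v in held_out:
            test.append((u, v, attrs))
        else:
            train.append((u, v, attrs))
    return train, test


# Toy stand-ins for pp_interactions_undirected.edges(data=True) and covid_interacting_nodes.
toy_edges = [
    ("A", "B", {"score": 0.9}),
    ("B", "C", {"score": 0.5}),
    ("C", "D", {"score": 0.7}),
]
toy_held_out = ["D"]

train_edges, test_edges = split_edges_by_held_out_nodes(toy_edges, toy_held_out)
print(len(train_edges), len(test_edges))
```

On the real data, the call would take `pp_interactions_undirected.edges(data=True)` and `covid_interacting_nodes` as arguments.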

# Tissues

In [15]:
for key in tissue_num_mapping.keys():
    print(key)

male germ cell
sperm
epithelial cell of pancreas
endothelial cell
leukocyte
bronchial epithelial cell
buccal mucosa cell
uterine cervix
islet of Langerhans
pituitary gland
zone of skin
lymph node
tendon
dorsal root ganglion
urethra
large intestine
renal glomerulus
metanephros
adult mammalian kidney
intestine
oral cavity
amniotic fluid
blood
colonic mucosa
pharyngeal mucosa
renal medulla
jejunal mucosa
prefrontal cortex
endocervix
anatomical system
multi-cellular organism
testis
female reproductive system
embryo
stomach
aorta
heart
brain
cerebral cortex
retina
pleura
tibia
penis
female gonad
uterus
vagina
mammalian vulva
seminal vesicle
adipose tissue
central nervous system
esophagus
saliva-secreting gland
right lobe of liver
right lobe of thyroid gland
left lobe of thyroid gland
skeletal muscle tissue
smooth muscle tissue
body of pancreas
caecum
vermiform appendix
colon
transverse colon
sigmoid colon
fundus of stomach
body of stomach
cardia of stomach
pylorus
mucosa of stomach
cortex o

# Create a protein-protein interaction dataset

In [3]:
import sys
sys.path.insert(0,'..')
from utils import PPInteraction_dataset

The class PPInteraction_dataset extracts edges from a protein-protein interaction graph and returns a torch Dataset. The constructor takes the following arguments:
* nx graph : the directed or undirected graph to create the data from
* bool directed : True if the graph is directed, else False
* float score_threshold : a value between 0 and 1 compared against the score attribute of the edges. The constructor will extract the edges that have a score greater than this threshold.
* string node_attribute : the type of node attribute to extract from the graph (ex "sequence", "cellular_components", "info"...)
* bool regression : True for a regression task (keep scores) or False for classification (existing edges will be labeled 1.0)
* float no_interactions_ratio : a value between 0 and 1 giving the number of negative examples relative to the number of extracted edges. The constructor will create them from pairs of proteins that are not connected in the graph and label them with 0.0

The method __getitem__(ix) will return : 
* (node_attribute_protein_a, node_attribute_protein_b, link_type, label) if directed, with label between 0 and 1 if edge regression task else 0 or 1
* (node_attribute_protein_a, node_attribute_protein_b, label) if undirected, with label between 0 and 1 if edge regression task else 0 or 1

Note: to create the classification dataset, the constructor creates the 0.0 labels by selecting pairs of proteins that are not connected by an edge in the graph. To create a balanced classification dataset, use a ratio of 1.0 so that there are as many 0.0 labels as 1.0 labels. You can adjust score_threshold to restrict the number of existing edges.
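
To make the role of no_interactions_ratio concrete, here is a minimal sketch of negative sampling on an undirected graph. It is not the code of PPInteraction_dataset, whose implementation is not shown here; the function name `sample_non_edges`, the `seed` parameter and the toy graph are assumptions. The sketch assumes a simple graph without self-loops and with sortable node labels.

```python
import random

import networkx as nx


def sample_non_edges(graph, ratio, seed=0):
    """Sample ratio * |E| unconnected node pairs and label each with 0.0."""
    rng = random.Random(seed)
    nodes = list(graph.nodes)
    n_edges = graph.number_of_edges()
    # Never ask for more negatives than there are unconnected pairs.
    available = len(nodes) * (len(nodes) - 1) // 2 - n_edges
    target = min(int(ratio * n_edges), available)
    negatives = set()
    while len(negatives) < target:
        u, v = rng.sample(nodes, 2)
        if not graph.has_edge(u, v):
            negatives.add(tuple(sorted((u, v))))  # store each pair once
    return [(u, v, 0.0) for u, v in sorted(negatives)]


# Toy graph: a path on 5 nodes has 4 edges and 6 unconnected pairs.
toy_graph = nx.path_graph(5)
negative_examples = sample_non_edges(toy_graph, ratio=1.0)
print(negative_examples)
```

With a ratio of 1.0 the number of sampled pairs equals the number of edges, which gives the balanced dataset described in the note above.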

In [4]:
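# Arguments in order: graph, directed, score_threshold, node_attribute, regression, no_interactions_ratio
dset = PPInteraction_dataset(pp_interactions_undirected, False, 0.95, "sequence", False, 1.0)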

100%|██████████| 64232/64232 [04:28<00:00, 161.35it/s]
