# Heterogeneous graphs in DGL


- A heterogeneous graph can have nodes and edges of different types.
- Examples include knowledge graphs, social networks, and biological networks.

<img src='https://d2908q01vomqb2.cloudfront.net/77de68daecd823babbb58edb1c8e14d7106e83bb/2019/01/10/Neptune-Metaphactory-1.png' align='center' width="300px" height="300px" />


In this tutorial, you will learn:

* How to create a DGL heterogeneous graph from external files.
* How to access and modify the attributes of a DGL heterogeneous graph.
* How to load the drug repurposing knowledge graph (DRKG), presented in [this work](https://github.com/gnn4dr/DRKG), into DGL.
* How to load a knowledge graph benchmark from the DGL datasets.

In [9]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools

## Loading the drug repurposing knowledge graph in DGL

- Drug repurposing is a drug discovery paradigm that uses existing drugs for new therapies.

- [DRKG](https://www.amazon.science/blog/amazon-web-services-open-sources-biological-knowledge-graph-to-fight-covid-19) is an effort by the DGL team at AWS to construct a drug repurposing knowledge graph.


<img src='https://assets.amazon.science/dims4/default/974331b/2147483647/strip/true/crop/886x624+0+0/resize/1200x845!/quality/90/?url=http%3A%2F%2Famazon-topics-brightspot.s3.amazonaws.com%2Fscience%2F9a%2Fd8%2F9df6b0f7425191c4a69fac4caeaf%2Fdrkg.png' align='center' width="500px" height="300px" />


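- First, we load a dictionary of edge lists. Each key is an edge type written as a (source node type, relation, destination node type) triple, and each value holds the source and destination nodes of the edges of that type.
- The function create_drkg_edge_lists creates this edge list dictionary from a list of files.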
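The helper `create_drkg_edge_lists` comes from `tutorial_utils` and its implementation is not shown in this tutorial. As a rough illustration of the kind of grouping such a helper might perform, the sketch below assumes a tab-separated triplet layout (head, relation, tail) in which every entity is written as `<NodeType>::<identifier>`. The function name `group_triplets` and the sample rows are hypothetical, and the sketch keeps the raw entity names instead of converting them to the integer node IDs that DGL expects.

```python
from collections import defaultdict

# Hypothetical sample rows: head <TAB> relation <TAB> tail.
# Each entity is assumed to be written as "<NodeType>::<identifier>".
SAMPLE_TSV = (
    "Gene::2157\tbioarx::HumGenHumGen:Gene:Gene\tGene::5264\n"
    "Compound::DB00001\tDRUGBANK::target::Compound:Gene\tGene::2147\n"
    "Compound::DB00002\tDRUGBANK::target::Compound:Gene\tGene::2147\n"
)


def group_triplets(tsv_text):
    """Group (head, tail) pairs by (source type, relation, destination type)."""
    edge_lists = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        head, relation, tail = line.split("\t")
        # The node type is the part of the entity name before the first "::".
        src_type = head.split("::", 1)[0]
        dst_type = tail.split("::", 1)[0]
        edge_lists[(src_type, relation, dst_type)].append((head, tail))
    return dict(edge_lists)


edge_lists = group_triplets(SAMPLE_TSV)
for canonical_etype, pairs in edge_lists.items():
    print(canonical_etype, len(pairs))
```

The keys of the resulting dictionary have the same three-part shape as the keys printed by the tutorial's own dictionary.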

In [10]:
from tutorial_utils import create_drkg_edge_lists
edge_list_dictionary=create_drkg_edge_lists()
print(edge_list_dictionary.keys())

dict_keys([('Gene', 'bioarx::HumGenHumGen:Gene:Gene', 'Gene'), ('Gene', 'bioarx::VirGenHumGen:Gene:Gene', 'Gene'), ('Compound', 'bioarx::DrugVirGen:Compound:Gene', 'Gene'), ('Compound', 'bioarx::DrugHumGen:Compound:Gene', 'Gene'), ('Disease', 'bioarx::Covid2_acc_host_gene::Disease:Gene', 'Gene'), ('Disease', 'bioarx::Coronavirus_ass_host_gene::Disease:Gene', 'Gene'), ('Gene', 'DGIDB::INHIBITOR::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::ANTAGONIST::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::OTHER::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::AGONIST::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::BINDER::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::MODULATOR::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::BLOCKER::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::CHANNEL BLOCKER::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::ANTIBODY::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::POSITIVE ALLOSTERIC MODULATOR::Gene:Compound', 'Compound'), ('Gene', 'DGIDB::ALLOSTERIC MODULATOR::Gene:C

- Given these edge lists, we can create the graph.
- The metagraph defines the schema of the graph: its nodes are the node types and its edges are the edge types.
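To make the idea of a schema concrete, the sketch below derives one from a handful of edge-type triples using plain Python. The function `metagraph_schema` is a hypothetical helper written for this illustration and is not part of DGL. On a real graph, DGL exposes the same information through `graph.canonical_etypes` and `graph.metagraph()`.

```python
def metagraph_schema(canonical_etypes):
    """Return the node types and the typed edges that make up the schema."""
    # Every source or destination type that appears in a triple is a schema node.
    node_types = sorted({t for src, _, dst in canonical_etypes for t in (src, dst)})
    # Every (source type, relation, destination type) triple is a schema edge.
    schema_edges = sorted(canonical_etypes)
    return node_types, schema_edges


# Three of the edge types that appear in the DRKG outputs of this tutorial.
canonical_etypes = [
    ("Compound", "DRUGBANK::treats::Compound:Disease", "Disease"),
    ("Compound", "DRUGBANK::target::Compound:Gene", "Gene"),
    ("Gene", "bioarx::HumGenHumGen:Gene:Gene", "Gene"),
]

node_types, schema_edges = metagraph_schema(canonical_etypes)
print(node_types)
for src, relation, dst in schema_edges:
    print(src, "--[", relation, "]->", dst)
```

The schema holds one entry per edge type regardless of how many edges of that type exist in the graph itself.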

In [11]:
graph = dgl.heterograph(edge_list_dictionary)
print(graph)

Graph(num_nodes={'Anatomy': 400, 'Atc': 4048, 'Biological Process': 11381, 'Cellular Component': 1391, 'Compound': 24313, 'Disease': 5103, 'Gene': 39220, 'Molecular Function': 2884, 'Pathway': 1822, 'Pharmacologic Class': 345, 'Side Effect': 5701, 'Symptom': 415, 'Tax': 215},
      num_edges={('Anatomy', 'Hetionet::AdG::Anatomy:Gene', 'Gene'): 102240, ('Anatomy', 'Hetionet::AeG::Anatomy:Gene', 'Gene'): 526407, ('Anatomy', 'Hetionet::AuG::Anatomy:Gene', 'Gene'): 97848, ('Compound', 'DRUGBANK::carrier::Compound:Gene', 'Gene'): 720, ('Compound', 'DRUGBANK::ddi-interactor-in::Compound:Compound', 'Compound'): 1379271, ('Compound', 'DRUGBANK::enzyme::Compound:Gene', 'Gene'): 4923, ('Compound', 'DRUGBANK::target::Compound:Gene', 'Gene'): 19158, ('Compound', 'DRUGBANK::treats::Compound:Disease', 'Disease'): 4968, ('Compound', 'DRUGBANK::x-atc::Compound:Atc', 'Atc'): 15750, ('Compound', 'GNBR::A+::Compound:Gene', 'Gene'): 1568, ('Compound', 'GNBR::A-::Compound:Gene', 'Gene'): 1108, ('Compound',

- Nodes and edges of different types have independent ID spaces and feature storage.

- For example, drug and protein node IDs both start from zero, and the two node types can have different features.
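The following sketch shows what independent ID spaces mean in practice, using only plain Python. The helper `assign_type_local_ids` and the node names are hypothetical and serve only as an illustration; DGL manages the ID spaces internally.

```python
def assign_type_local_ids(typed_names):
    """Give every (node type, name) pair a consecutive integer ID within its type."""
    id_maps = {}
    for ntype, name in typed_names:
        per_type = id_maps.setdefault(ntype, {})
        if name not in per_type:
            per_type[name] = len(per_type)  # next free ID for this node type
    return id_maps


# Hypothetical nodes; the repeated entry must not receive a second ID.
nodes = [
    ("drug", "aspirin"),
    ("protein", "COX1"),
    ("drug", "ibuprofen"),
    ("protein", "COX2"),
    ("drug", "aspirin"),
]

id_maps = assign_type_local_ids(nodes)
print(id_maps)
```

Both the first drug and the first protein receive ID 0, so a node is only identified unambiguously by its type together with its ID.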

### Print the statistics of the created graph

- Number of nodes for each node-type
- Number of edges for each edge-type

In [12]:
total_nodes = 0
for ntype in graph.ntypes:
    print(ntype, '\t', graph.number_of_nodes(ntype))
    total_nodes += graph.number_of_nodes(ntype)
print("Graph contains {} nodes from {} node-types.".format(total_nodes, len(graph.ntypes)))

Anatomy 	 400
Atc 	 4048
Biological Process 	 11381
Cellular Component 	 1391
Compound 	 24313
Disease 	 5103
Gene 	 39220
Molecular Function 	 2884
Pathway 	 1822
Pharmacologic Class 	 345
Side Effect 	 5701
Symptom 	 415
Tax 	 215
Graph contains 97238 nodes from 13 node-types.


In [13]:
total_edges = 0
for etype in graph.etypes:
    print(etype, '\t', graph.number_of_edges(etype))
    total_edges += graph.number_of_edges(etype)
print("Graph contains {} edges from {} edge-types.".format(total_edges, len(graph.etypes)))

Hetionet::AdG::Anatomy:Gene 	 102240
Hetionet::AeG::Anatomy:Gene 	 526407
Hetionet::AuG::Anatomy:Gene 	 97848
DRUGBANK::carrier::Compound:Gene 	 720
DRUGBANK::ddi-interactor-in::Compound:Compound 	 1379271
DRUGBANK::enzyme::Compound:Gene 	 4923
DRUGBANK::target::Compound:Gene 	 19158
DRUGBANK::treats::Compound:Disease 	 4968
DRUGBANK::x-atc::Compound:Atc 	 15750
GNBR::A+::Compound:Gene 	 1568
GNBR::A-::Compound:Gene 	 1108
GNBR::B::Compound:Gene 	 7170
GNBR::C::Compound:Disease 	 1739
GNBR::E+::Compound:Gene 	 1970
GNBR::E-::Compound:Gene 	 2918
GNBR::E::Compound:Gene 	 32743
GNBR::J::Compound:Disease 	 1020
GNBR::K::Compound:Gene 	 12411
GNBR::Mp::Compound:Disease 	 495
GNBR::N::Compound:Gene 	 12521
GNBR::O::Compound:Gene 	 5573
GNBR::Pa::Compound:Disease 	 2619
GNBR::Pr::Compound:Disease 	 966
GNBR::Sa::Compound:Disease 	 16923
GNBR::T::Compound:Disease 	 54020
GNBR::Z::Compound:Gene 	 2821
Hetionet::CbG::Compound:Gene 	 11571
Hetionet::CcSE::Compound:Side Effect 	 138944
Hetionet::

### Assigning node and edge features

- Node features are assigned per node type, with one row per node of that type.
- Edge features are assigned per edge type, with one row per edge of that type.

In [14]:
graph.nodes['Compound'].data['hv'] = torch.ones(graph.number_of_nodes('Compound'), 1)
graph.edges['DRUGBANK::treats::Compound:Disease'].data['he'] = torch.zeros(graph.number_of_edges('DRUGBANK::treats::Compound:Disease'), 1)
print('Node features')
print(graph.nodes['Compound'].data['hv'])
print('Edge features')
print(graph.edges['DRUGBANK::treats::Compound:Disease'].data['he'])

Node features
tensor([[1.],
        [1.],
        [1.],
        ...,
        [1.],
        [1.],
        [1.]])
Edge features
tensor([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])


## Loading a heterogeneous graph benchmark from DGL

- AIFB is a popular knowledge graph benchmark.
- It records the organizational structure of AIFB at the University of Karlsruhe.
- The graph contains 7 node types and 109 edge types.
- The persons in the KG are associated with a label indicating which research group they belong to.


In [15]:
from dgl.data.rdf import AIFBDataset

dataset = AIFBDataset()
g = dataset[0]
print("Node types")
print(g.ntypes)
print("Edge types")
print(g.etypes)

Done loading data from cached files.
Node types
['Forschungsgebiete', 'Forschungsgruppen', 'Kooperationen', 'Personen', 'Projekte', 'Publikationen', '_Literal']
Edge types
['ontology#dealtWithIn', 'ontology#isWorkedOnBy', 'ontology#name', 'rdftype', 'rev-ontology#isAbout', 'rev-ontology#isAbout', 'ontology#carriesOut', 'ontology#head', 'ontology#homepage', 'ontology#member', 'ontology#name', 'ontology#publishes', 'rev-ontology#carriedOutBy', 'ontology#finances', 'ontology#name', 'rev-ontology#financedBy', 'ontology#fax', 'ontology#homepage', 'ontology#name', 'ontology#phone', 'ontology#photo', 'ontology#publication', 'ontology#worksAtProject', 'rev-ontology#author', 'rev-ontology#editor', 'rev-ontology#head', 'rev-ontology#isWorkedOnBy', 'rev-ontology#member', 'rev-ontology#member', 'ontology#carriedOutBy', 'ontology#financedBy', 'ontology#homepage', 'ontology#isAbout', 'ontology#member', 'ontology#name', 'ontology#projectInfo', 'rev-ontology#carriesOut', 'rev-ontology#dealtWithIn', 'r

* Access node properties stored as node features
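The next cell turns the boolean `train_mask` and `test_mask` features into index tensors with `torch.nonzero(...).squeeze()`. As a minimal sketch of what that conversion computes, here is the same idea in plain Python. The helper `mask_to_indices` and the toy masks and labels are made up for illustration and are not taken from AIFB.

```python
def mask_to_indices(mask):
    """Return the positions at which a boolean mask is set."""
    return [i for i, flag in enumerate(mask) if flag]


# Toy data for five nodes of the prediction category (hypothetical values).
train_mask = [True, False, False, True, False]
test_mask = [False, True, False, False, False]
labels = [1, 2, -1, 0, -1]

train_idx = mask_to_indices(train_mask)
test_idx = mask_to_indices(test_mask)

# The index lists select the labels of the training and test nodes.
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]
print(train_idx, train_labels)
print(test_idx, test_labels)
```

One difference from the plain-Python version: `torch.nonzero(mask, as_tuple=False)` returns a column of shape `(N, 1)`, and `squeeze()` flattens it to shape `(N,)`. If the mask has exactly one set entry, `squeeze()` yields a zero-dimensional tensor instead.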

In [16]:
# The node type whose nodes carry the class labels to predict
category = dataset.predict_category
print(category)

train_mask = g.nodes[category].data['train_mask']
test_mask = g.nodes[category].data['test_mask']
train_idx = torch.nonzero(train_mask, as_tuple=False).squeeze()
test_idx = torch.nonzero(test_mask, as_tuple=False).squeeze()
labels = g.nodes[category].data['labels']
print(g.nodes[category])

Personen
NodeSpace(data={'labels': tensor([ 1, -1, -1,  2, -1, -1,  0,  1, -1, -1, -1, -1, -1, -1, -1,  3,  3,  3,
         2,  3,  3,  1, -1,  1,  1,  2,  2,  1,  2, -1, -1, -1,  1,  1,  0,  0,
        -1, -1,  1,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0, -1, -1,  1,
         1, -1,  1, -1,  0, -1,  1,  1,  1,  0,  2,  2,  2, -1,  1,  0, -1,  1,
         0,  0,  0,  0,  0,  2,  0,  2,  2,  2,  1,  0,  0,  0,  0,  0,  0,  0,
        -1,  1,  0,  0,  0, -1, -1,  0,  0, -1, -1, -1,  0,  0,  0,  0,  0, -1,
         0,  0,  0,  0,  1, -1,  0,  0,  0,  0,  1,  3,  2,  1, -1,  0,  1,  2,
         1,  1,  1,  2,  1,  0, -1,  0, -1,  3,  1,  3,  3,  1,  3,  0, -1,  2,
         1, -1,  1,  1,  1,  1,  1, -1,  3, -1,  1,  2,  1,  1,  0,  0,  0,  0,
         1,  0,  2,  0,  0,  0,  0, -1,  2,  0,  1,  0,  0,  0,  0,  0,  3,  0,
         1,  0,  1,  1,  1,  1,  1,  1,  0,  1, -1,  1,  1,  1,  0,  1,  1,  1,
        -1,  1,  2,  1, -1, -1,  2, -1,  1,  2, -1, -1,  0,  0,  2, -1,  2,  3,
     