# Heterogeneous graphs in DGL


- A heterogeneous graph can have nodes and edges of different types.
- Examples include knowledge graphs, social networks, and biological networks.

<img src='https://d2908q01vomqb2.cloudfront.net/77de68daecd823babbb58edb1c8e14d7106e83bb/2019/01/10/Neptune-Metaphactory-1.png' align='center' width="300px" height="300px" />


In this tutorial, you will learn:

* How to create and access a DGL heterogeneous graph
* How to load a subset of the Microsoft Academic Graph presented [here](https://www.microsoft.com/en-us/research/publication/microsoft-academic-graph-when-experts-are-not-enough/)
* How to load the drug repurposing knowledge graph (DRKG) into DGL, presented in the following [work](https://github.com/gnn4dr/DRKG)

In [28]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools

import pandas as pd
import numpy as np


## Creating a heterogeneous graph

- Load the edges of each type as pairs of source and destination node IDs
- Load the node features and node labels

In [29]:

# Read node features and labels; [:, 1:] drops the first CSV column.
node_features = pd.read_csv("mag_small/node-feat.csv").values[:,1:].squeeze()

node_labels = pd.read_csv("mag_small/node-label.csv").values[:,1:].squeeze()

# Read one edge list per relation; each row is a (source, destination) pair.
author_write_paper = pd.read_csv("mag_small/author_write_paper_edge.csv")
author_affiliated_with_institution = pd.read_csv("mag_small/author_affiliated_with_institution_edge.csv")
paper_cites_paper = pd.read_csv("mag_small/paper_cites_paper_edge.csv")
paper_has_topic_field_of_study = pd.read_csv("mag_small/paper_has_topic_field_of_study_edge.csv")

edges = {
    ('author', 'affiliated_with', 'institution'): list(author_affiliated_with_institution.itertuples(index=False)),
    ('author', 'writes', 'paper'): list(author_write_paper.itertuples(index=False)),
    ('paper', 'cites', 'paper'): list(paper_cites_paper.itertuples(index=False)),
    ('paper', 'has_topic', 'field_of_study'): list(paper_has_topic_field_of_study.itertuples(index=False)),
}
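
The dictionary above stores each relation as a list of `(source, destination)` tuples. Assuming the installed DGL version also accepts a relation's edges as a pair of parallel ID sequences (the form its documentation describes), the reshaping could look like the plain-Python sketch below. The helper `split_pairs` and the toy pairs are illustrative assumptions, not part of DGL or of the `mag_small` data.

```python
def split_pairs(pairs):
    """Turn [(src, dst), ...] into two parallel lists (srcs, dsts)."""
    if not pairs:
        return [], []
    srcs, dsts = zip(*pairs)  # transpose the list of pairs
    return list(srcs), list(dsts)


# Toy edge list (hypothetical): three 'writes' edges from authors to papers.
toy_pairs = [(0, 2), (1, 2), (1, 0)]
toy_src, toy_dst = split_pairs(toy_pairs)
print(toy_src, toy_dst)
```

With pandas, the same two sequences could be taken directly from the two columns of each edge DataFrame.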


- Given these edge lists, we can create the graph.
- The metagraph defines the schema of the graph: which node types each edge type connects.

In [30]:
g = dgl.heterograph(edges)
print(g)

Graph(num_nodes={'author': 1579, 'field_of_study': 584, 'institution': 421, 'paper': 99},
      num_edges={('author', 'affiliated_with', 'institution'): 2312, ('author', 'writes', 'paper'): 1790, ('paper', 'cites', 'paper'): 197, ('paper', 'has_topic', 'field_of_study'): 959},
      metagraph=[('author', 'institution', 'affiliated_with'), ('author', 'paper', 'writes'), ('paper', 'paper', 'cites'), ('paper', 'field_of_study', 'has_topic')])
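
The printed summary can be cross-checked by hand. The sketch below derives per-type node counts and metagraph triples from an edge dictionary of the same shape. It assumes that the node count of a type is its largest ID plus one, which is our understanding of how DGL infers counts when none are passed explicitly. The function `summarize_schema` and the toy edges are hypothetical. Note that the printed metagraph lists each triple as (source type, destination type, edge type).

```python
from collections import defaultdict


def summarize_schema(edges):
    """Return (num_nodes per type, metagraph triples) for a typed edge dict."""
    max_id = defaultdict(lambda: -1)
    for (src_type, _etype, dst_type), pairs in edges.items():
        for src, dst in pairs:
            max_id[src_type] = max(max_id[src_type], src)
            max_id[dst_type] = max(max_id[dst_type], dst)
    # Assumption: IDs are dense, so count = largest ID + 1.
    num_nodes = {ntype: top + 1 for ntype, top in sorted(max_id.items())}
    # Reorder keys to match the printed metagraph: (src_type, dst_type, etype).
    metagraph = [(s, d, e) for (s, e, d) in edges]
    return num_nodes, metagraph


# Toy data (hypothetical), using the same key layout as `edges` above.
toy_edges = {
    ("author", "writes", "paper"): [(0, 0), (1, 0), (2, 1)],
    ("paper", "cites", "paper"): [(1, 0), (3, 1)],
}
toy_num_nodes, toy_metagraph = summarize_schema(toy_edges)
print(toy_num_nodes)
print(toy_metagraph)
```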


- Nodes and edges of different types have independent ID spaces and feature storage.

- For example, author and paper node IDs both start from zero, and the two node types have different features.

<img src='https://data.dgl.ai/asset/image/user_guide_graphch_2.png' align='center' width="600px" height="400px" />
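
Because IDs restart at zero for every type, a node is only identified by the pair (node type, ID). One way to see this is to map typed IDs into a single flat ID space with per-type offsets. The sketch below does so in plain Python using the node counts printed for the graph above; `build_offsets` and `to_global` are hypothetical helpers, and the alphabetical type order is an assumption of this example rather than a statement about DGL internals.

```python
# Node counts per type, copied from the printed graph summary.
num_nodes = {"author": 1579, "field_of_study": 584, "institution": 421, "paper": 99}


def build_offsets(counts):
    """Assign each node type a starting offset in a flat ID space."""
    offsets, running = {}, 0
    for ntype in sorted(counts):  # assumed ordering: alphabetical
        offsets[ntype] = running
        running += counts[ntype]
    return offsets, running


offsets, total = build_offsets(num_nodes)


def to_global(ntype, local_id):
    """Map a (type, per-type ID) pair to a unique flat ID."""
    if not 0 <= local_id < num_nodes[ntype]:
        raise IndexError(f"{ntype} has no node with ID {local_id}")
    return offsets[ntype] + local_id


# Author 0 and paper 0 share a local ID but are different nodes.
print(to_global("author", 0), to_global("paper", 0), total)
```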

- Assign node features per node type
- Assign edge features per edge type

In [34]:
g.nodes['paper'].data['hv'] = torch.tensor(node_features)
g.edges['affiliated_with'].data['he'] = torch.zeros(g.number_of_edges('affiliated_with'), 1)
print('Node features')
print(g.nodes['paper'].data['hv'])
print('Edge features')
print(g.edges['affiliated_with'].data['he'])

Node features
tensor([[ 0.0000e+00, -9.5379e-02,  4.0758e-02,  ...,  6.1569e-02,
         -2.7663e-02, -1.3383e-01],
        [ 1.0000e+00, -1.5105e-01, -1.0731e-01,  ...,  3.4575e-01,
         -2.7737e-02, -2.1853e-01],
        [ 2.0000e+00, -1.1480e-01, -1.7598e-01,  ...,  1.7306e-01,
         -1.5645e-01, -2.7795e-01],
        ...,
        [ 9.6000e+01, -2.0896e-01, -2.0555e-01,  ...,  1.8896e-01,
         -2.3225e-02, -4.2441e-01],
        [ 9.7000e+01, -1.6796e-01, -3.0568e-01,  ..., -1.5304e-01,
          1.5082e-02, -4.0195e-01],
        [ 9.8000e+01, -1.2331e-01, -5.4199e-02,  ...,  1.7348e-01,
         -8.1349e-02, -1.9037e-01]], dtype=torch.float64)
Edge features
tensor([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])


## Loading the drug repurposing knowledge graph in DGL

- Drug repurposing is a drug discovery paradigm that uses existing drugs for new therapeutic indications.

- Compared to de novo drug discovery, it significantly reduces time and cost.

- [DRKG](https://www.amazon.science/blog/amazon-web-services-open-sources-biological-knowledge-graph-to-fight-covid-19) is an effort by the DGL team at AWS to construct a drug repurposing knowledge graph.


<img src='https://assets.amazon.science/dims4/default/974331b/2147483647/strip/true/crop/886x624+0+0/resize/1200x845!/quality/90/?url=http%3A%2F%2Famazon-topics-brightspot.s3.amazonaws.com%2Fscience%2F9a%2Fd8%2F9df6b0f7425191c4a69fac4caeaf%2Fdrkg.png' align='center' width="500px" height="300px" />


In [5]:
from tutorial_utils import create_drkg_edge_lists
edge_list_dictionary = create_drkg_edge_lists()
graph = dgl.heterograph(edge_list_dictionary)

### Print the statistics of the created graph

- Number of nodes for each node-type


In [6]:
total_nodes = 0
for ntype in graph.ntypes:
    print(ntype, '\t', graph.number_of_nodes(ntype))
    total_nodes += graph.number_of_nodes(ntype)
print("Graph contains {} nodes from {} node-types.".format(total_nodes, len(graph.ntypes)))

Anatomy 	 400
Atc 	 4048
Biological Process 	 11381
Cellular Component 	 1391
Compound 	 24313
Disease 	 5103
Gene 	 39220
Molecular Function 	 2884
Pathway 	 1822
Pharmacologic Class 	 345
Side Effect 	 5701
Symptom 	 415
Tax 	 215
Graph contains 97238 nodes from 13 node-types.


- Number of edges for each edge-type
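
The edge-type names printed by the next cell pack several fields into one string, for example `DRUGBANK::treats::Compound:Disease`. The sketch below splits such a name into source database, relation, head type, and tail type, then totals a few of the printed counts per source. The field layout is inferred from the names shown in this tutorial and may not hold for every DRKG relation; `parse_drkg_etype` is a hypothetical helper, not part of DGL or `tutorial_utils`.

```python
from collections import Counter, namedtuple

EdgeType = namedtuple("EdgeType", ["source", "relation", "head", "tail"])


def parse_drkg_etype(name):
    """Split 'SOURCE::relation::Head:Tail' into its four fields."""
    # The last two ':'-separated fields are the head and tail node types.
    prefix, head, tail = name.rsplit(":", 2)
    # What remains is 'SOURCE::relation' plus a leftover ':' from the '::'.
    source, _, relation = prefix.rstrip(":").partition("::")
    return EdgeType(source, relation, head, tail)


# A few edge counts copied from the printed statistics.
sample_counts = {
    "DRUGBANK::treats::Compound:Disease": 4968,
    "DRUGBANK::target::Compound:Gene": 19158,
    "GNBR::T::Compound:Disease": 54020,
    "Hetionet::CcSE::Compound:Side Effect": 138944,
}

edges_per_source = Counter()
for etype_name, count in sample_counts.items():
    edges_per_source[parse_drkg_etype(etype_name).source] += count

print(parse_drkg_etype("Hetionet::CcSE::Compound:Side Effect"))
print(dict(edges_per_source))
```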

In [7]:
total_edges = 0
for etype in graph.etypes:
    print(etype, '\t', graph.number_of_edges(etype))
    total_edges += graph.number_of_edges(etype)
print("Graph contains {} edges from {} edge-types.".format(total_edges, len(graph.etypes)))

Hetionet::AdG::Anatomy:Gene 	 102240
Hetionet::AeG::Anatomy:Gene 	 526407
Hetionet::AuG::Anatomy:Gene 	 97848
DRUGBANK::carrier::Compound:Gene 	 720
DRUGBANK::ddi-interactor-in::Compound:Compound 	 1379271
DRUGBANK::enzyme::Compound:Gene 	 4923
DRUGBANK::target::Compound:Gene 	 19158
DRUGBANK::treats::Compound:Disease 	 4968
DRUGBANK::x-atc::Compound:Atc 	 15750
GNBR::A+::Compound:Gene 	 1568
GNBR::A-::Compound:Gene 	 1108
GNBR::B::Compound:Gene 	 7170
GNBR::C::Compound:Disease 	 1739
GNBR::E+::Compound:Gene 	 1970
GNBR::E-::Compound:Gene 	 2918
GNBR::E::Compound:Gene 	 32743
GNBR::J::Compound:Disease 	 1020
GNBR::K::Compound:Gene 	 12411
GNBR::Mp::Compound:Disease 	 495
GNBR::N::Compound:Gene 	 12521
GNBR::O::Compound:Gene 	 5573
GNBR::Pa::Compound:Disease 	 2619
GNBR::Pr::Compound:Disease 	 966
GNBR::Sa::Compound:Disease 	 16923
GNBR::T::Compound:Disease 	 54020
GNBR::Z::Compound:Gene 	 2821
Hetionet::CbG::Compound:Gene 	 11571
Hetionet::CcSE::Compound:Side Effect 	 138944
Hetionet::