In [1]:
%matplotlib inline


Node Classification with DGL
============================

GNNs are powerful tools for many machine learning tasks on graphs. In
this introductory tutorial, you will learn the basic workflow of using
GNNs for node classification, i.e. predicting the category of a node in
a graph.

By completing this tutorial, you will be able to

-  Load a DGL-provided dataset.
-  Build a GNN model with DGL-provided neural network modules.
-  Train and evaluate a GNN model for node classification on either CPU
   or GPU.

This tutorial assumes that you have experience in building neural
networks with PyTorch.

(Time estimate: 13 minutes)


In [6]:
import os 
import time
import numpy as np
import pandas as pd
import networkx as nx

os.system("pip install dgl -f https://data.dgl.ai/wheels/repo.html >> out.txt")
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


Using backend: pytorch


In [None]:
from google.colab import drive
drive.mount('/content/drive')
! ls
%cd drive/MyDrive/ProjetLong
! git clone https://github.com/Viperine2022/projet_long_GCN_internet

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
drive  out.txt	sample_data


In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/ProjetLong/projet_long_GCN_internet
! git pull

Mounted at /content/drive
/content/drive/MyDrive/ProjetLong/projet_long_GCN_internet
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 15 (delta 7), reused 15 (delta 7), pack-reused 0
Unpacking objects: 100% (15/15), done.
From https://github.com/Viperine2022/projet_long_GCN_internet
   fefe422..2e61bb5  main       -> origin/main
Updating fefe422..2e61bb5
Fast-forward
 .../creation_datasetgraph-checkpoint.ipynb         |   8 ++++----
 IMPLANTATION/CAIDA/creation_datasetgraph.ipynb     |  22 ++++++++++++++-------
 .../CAIDA/data_GCN/graph_array_202001.pickle       | Bin 20596499 -> 0 bytes
 .../CAIDA/data_GCN/graph_float_202001.pickle       | Bin 20596499 -> 31361167 bytes
 4 files changed, 19 insertions(+), 11 deletions(-)
 delete mode 100644 IMPLANTATION/CAIDA/data_GCN/graph_array_202001.pickle


Overview of Node Classification with GNN
----------------------------------------

One of the most popular and widely adopted tasks on graph data is node
classification, where a model needs to predict the ground truth category
of each node. Before graph neural networks, many proposed methods used
either connectivity alone (such as DeepWalk or node2vec) or simple
combinations of connectivity and the node's own features. GNNs, by
contrast, offer an opportunity to obtain node representations by
combining the connectivity and features of a *local neighborhood*.

`Kipf et al. <https://arxiv.org/abs/1609.02907>`__ is an example that
formulates node classification as a semi-supervised task. With the help
of only a small portion of labeled nodes, a graph neural network (GNN)
can accurately predict the node category of the others.

This tutorial will show how to build such a GNN for semi-supervised node
classification with only a small number of labels on the Cora dataset,
a citation network with papers as nodes and citations as edges. The task
is to predict the category of a given paper. Each paper node contains a
word count vector as its features, normalized so that they sum up to one,
as described in Section 5.2 of
`the paper <https://arxiv.org/abs/1609.02907>`__.
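
The row normalization mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the preprocessing code used by DGL or by the paper; the function name ``row_normalize`` and the toy counts are assumptions made for the example.

```python
def row_normalize(counts):
    """Scale a word-count vector so its entries sum to one.

    An all-zero vector is mapped to all zeros to avoid dividing by zero.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Toy word counts for three hypothetical papers (not taken from Cora).
raw_counts = [[2, 0, 1, 1], [0, 0, 0, 0], [5, 5, 0, 0]]
features = [row_normalize(row) for row in raw_counts]
```

After this step every non-empty feature row sums to one, so papers of different lengths become comparable.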

Loading Cora Dataset
--------------------




In [None]:
! ls

dataset_202001.csv  graph_202001.pickle  node_features_202001.csv
graph_202001_DGL    labels_202001.csv	 out.txt


In [3]:
%cd IMPLANTATION/CAIDA/data_GCN

/content/drive/MyDrive/ProjetLong/projet_long_GCN_internet/IMPLANTATION/CAIDA/data_GCN


In [7]:
! pip3 install pickle5
import pickle5 as pickle





# Read the 'node_features' table from the CSV file
dataset_path = 'dataset_202001.csv'
dataset = pd.read_csv(dataset_path)

# Load the NetworkX graph, then convert it to a DGL graph: G_dgl without features
path_to_protocol5 = 'graph_202001.pickle'
with open(path_to_protocol5, "rb") as fh:
  G = pickle.load(fh)

start_time = time.time()
G_dgl = dgl.from_networkx(G)
end_time = time.time()


# Load the NetworkX graph, then convert it to a DGL graph: G_dgl with float edge features
path_to_protocol5 = 'graph_float_202001.pickle'
with open(path_to_protocol5, "rb") as fh:
  G_float = pickle.load(fh)

# edge_attrs takes a list of attribute names; the bare string 'type' raised a KeyError here
G_dgl_float = dgl.from_networkx(G_float, edge_attrs=['type'])

# Load the NetworkX graph, then convert it to a DGL graph: G_dgl with array edge features

In [18]:
type(G_float)

networkx.classes.digraph.DiGraph

In [None]:
G_dgl_float

In [19]:
path_to_protocol5 = 'graph_float_202001.pickle'
with open(path_to_protocol5, "rb") as fh:
  G_float = pickle.load(fh)

G_dgl_float = dgl.from_networkx(G_float, edge_attrs=['type'])
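
Passing the attribute names as a list matters here. One plausible reading of the ``KeyError`` seen with ``edge_attrs='type'``, assuming the converter loops over ``edge_attrs`` and looks up each item on every edge, is that a bare string gets iterated character by character. The helper below is a hypothetical stand-in written for this illustration; it is not DGL's implementation.

```python
def collect_edge_attrs(edge_data, edge_attrs):
    """Gather the named attributes from a list of per-edge dictionaries."""
    collected = {}
    for name in edge_attrs:  # a bare string yields 't', 'y', 'p', 'e'
        collected[name] = [attrs[name] for attrs in edge_data]
    return collected


# Two toy edges, each carrying a 'type' attribute.
edges = [{"type": 0.0}, {"type": 1.0}]

# A list of names works as intended.
as_list = collect_edge_attrs(edges, ["type"])

# A bare string fails on the first character, 't'.
try:
    collect_edge_attrs(edges, "type")
    string_failed = False
except KeyError:
    string_failed = True
```

Wrapping a single attribute name in a list avoids the problem.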

In [26]:
path_to_protocol5 = 'graph_array_202001.pickle'
with open(path_to_protocol5, "rb") as fh:
  G_array = pickle.load(fh)

G_dgl_array = dgl.from_networkx(G_array, edge_attrs=['type'])

  return th.as_tensor(data, dtype=dtype)


A DGL Dataset object may contain one or multiple graphs. The Cora
dataset used in this tutorial consists of a single graph.




In [27]:
g = G_dgl_array

In [None]:
from dgl.data import DGLDataset

class KarateClubDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='karate_club')

    def process(self):
        nodes_data = pd.read_csv('./members.csv')
        edges_data = pd.read_csv('./interactions.csv')
        node_features = torch.from_numpy(nodes_data['Age'].to_numpy())
        node_labels = torch.from_numpy(nodes_data['Club'].astype('category').cat.codes.to_numpy())
        edge_features = torch.from_numpy(edges_data['Weight'].to_numpy())
        edges_src = torch.from_numpy(edges_data['Src'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['Dst'].to_numpy())

        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['feat'] = node_features
        self.graph.ndata['label'] = node_labels
        self.graph.edata['weight'] = edge_features

        # If your dataset is a node classification dataset, you will need to assign
        # masks indicating whether a node belongs to training, validation, and test set.
        n_nodes = nodes_data.shape[0]
        n_train = int(n_nodes * 0.6)
        n_val = int(n_nodes * 0.2)
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        val_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[:n_train] = True
        val_mask[n_train:n_train + n_val] = True
        test_mask[n_train + n_val:] = True
        self.graph.ndata['train_mask'] = train_mask
        self.graph.ndata['val_mask'] = val_mask
        self.graph.ndata['test_mask'] = test_mask

    def __getitem__(self, i):
        return self.graph

    def __len__(self):
        return 1
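
The 60/20/20 split used in ``process`` can be checked in isolation. The sketch below mirrors the same arithmetic with plain Python lists instead of tensors; ``split_masks`` is a name introduced for this example, and the node count of 34 is only an illustrative value.

```python
def split_masks(n_nodes, train_frac=0.6, val_frac=0.2):
    """Return boolean train/validation/test masks over node indices.

    Nodes are assigned by position: the first block trains, the next
    block validates, and the remainder is held out for testing.
    """
    n_train = int(n_nodes * train_frac)
    n_val = int(n_nodes * val_frac)
    train_mask = [i < n_train for i in range(n_nodes)]
    val_mask = [n_train <= i < n_train + n_val for i in range(n_nodes)]
    test_mask = [i >= n_train + n_val for i in range(n_nodes)]
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = split_masks(34)
```

Because the split follows node order rather than a random permutation, shuffling the node indices first may be preferable when the node order correlates with the labels.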

A DGL graph can store node features and edge features in two
dictionary-like attributes called ``ndata`` and ``edata``.
In the DGL Cora dataset, the graph contains the following node features:

- ``train_mask``: A boolean tensor indicating whether the node is in the
  training set.

- ``val_mask``: A boolean tensor indicating whether the node is in the
  validation set.

- ``test_mask``: A boolean tensor indicating whether the node is in the
  test set.

- ``label``: The ground truth node category.

-  ``feat``: The node features.




In [28]:
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)
print(g.num_edges())


Node features
{}
Edge features
{'type': tensor([[0, 1, 0],
        [0, 0, 1],
        [0, 0, 1],
        ...,
        [0, 0, 1],
        [1, 0, 0],
        [0, 0, 1]], dtype=torch.int32)}
893266


Defining a Graph Convolutional Network (GCN)
--------------------------------------------

This tutorial will build a two-layer `Graph Convolutional Network
(GCN) <http://tkipf.github.io/graph-convolutional-networks/>`__. Each
layer computes new node representations by aggregating neighbor
information.

To build a multi-layer GCN you can simply stack ``dgl.nn.GraphConv``
modules, which inherit ``torch.nn.Module``.
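
To make "aggregating neighbor information" concrete, here is a minimal sketch of one propagation step in plain Python. It follows the rule from the Kipf and Welling paper, with self-loops added and symmetric degree normalization. ``dgl.nn.GraphConv`` may differ in details such as self-loop handling, so treat this as an illustration of the idea rather than of the library's implementation. The helper names and the two-node toy graph are assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the square roots of both endpoint degrees.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes with one scalar feature each and an identity weight.
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0], [3.0]]
weight = [[1.0]]
h = gcn_layer(adj, feats, weight)
```

In this toy graph both nodes end up with the average of the two input features, which is the smoothing effect that stacking such layers builds on.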




In [None]:
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)

KeyError: ignored

DGL provides implementations of many popular neighbor aggregation
modules. You can invoke each of them with one line of code.




Training the GCN
----------------

Training this GCN is similar to training other PyTorch neural networks.




In [None]:
def train(g, model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(100):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)

AttributeError: ignored
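
The key detail in the loop above is that the loss and the accuracies are computed only on the nodes selected by a mask. The sketch below reproduces that masking with plain Python lists; it is a simplified stand-in for ``F.cross_entropy`` on masked tensors, and the function names and toy logits are assumptions made for the example.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the rows where mask is True."""
    losses = []
    for row, label, keep in zip(logits, labels, mask):
        if not keep:
            continue
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)


def masked_accuracy(logits, labels, mask):
    """Fraction of masked rows whose highest logit matches the label."""
    hits = [row.index(max(row)) == label
            for row, label, keep in zip(logits, labels, mask) if keep]
    return sum(hits) / len(hits)


# Three toy nodes with two classes; only the first two are training nodes.
logits = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
labels = [0, 0, 0]
train_mask = [True, True, False]

loss = masked_cross_entropy(logits, labels, train_mask)
train_acc = masked_accuracy(logits, labels, train_mask)
all_acc = masked_accuracy(logits, labels, [True, True, True])
```

The third node is misclassified, but it does not affect ``loss`` or ``train_acc`` because it lies outside the training mask.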

Training on GPU
---------------

Training on GPU requires putting both the model and the graph onto the
GPU with the ``to`` method, similar to what you would do in PyTorch.

.. code:: python

   g = g.to('cuda')
   model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to('cuda')
   train(g, model)
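
When the same notebook has to run on machines with and without a GPU, the device can be chosen at run time instead of hard-coding ``'cuda'``. The helper below is a small sketch of that pattern; ``select_device`` is a name introduced for this example, and the fallback for a missing PyTorch install is included only so that the snippet runs anywhere.

```python
def select_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: fall back to CPU if PyTorch is absent.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = select_device()
```

The resulting string can then be passed to ``g.to(device)`` and ``model.to(device)`` in place of the literal ``'cuda'``.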




What’s next?
------------

-  :doc:`How does DGL represent a graph <2_dglgraph>`?
-  :doc:`Write your own GNN module <3_message_passing>`.
-  :doc:`Link prediction (predicting existence of edges) on full
   graph <4_link_predict>`.
-  :doc:`Graph classification <5_graph_classification>`.
-  :doc:`Make your own dataset <6_load_data>`.
-  `The list of supported graph convolution
   modules <apinn-pytorch>`.
-  `The list of datasets provided by DGL <apidata>`.




In [None]:
# Thumbnail Courtesy: Stanford CS224W Notes
# sphinx_gallery_thumbnail_path = '_static/blitz_1_introduction.png'