# DGCNN experiments as embedding method (LION15 SI)
In this notebook we implement a Deep Graph Convolutional Neural Network (DGCNN) [1], following the Colab notebook [Supervised graph classification with Deep Graph CNN](https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/graph-classification/dgcnn-graph-classification.ipynb).

The DGCNN implementation uses the [StellarGraph](https://www.stellargraph.io/) library and has been modified to function as an inductive graph embedding method. To this end, the DGCNN model is fed the input graphs, and the output of the last hidden layer is extracted and used as the embedding of each input graph. The graph embeddings are then used as input to an SVM with a linear kernel, evaluated with 10-fold cross-validation.

In addition, our modified version of the DGCNN accepts GraphML files as input by means of the [NetworkX](https://networkx.org/) library.

**References**

[1] An End-to-End Deep Learning Architecture for Graph Classification, M. Zhang, Z. Cui, M. Neumann, Y. Chen, AAAI-18. ([link](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/17146))

[2] Semi-supervised Classification with Graph Convolutional Networks, T. N. Kipf and M. Welling, ICLR 2017. ([link](https://arxiv.org/abs/1609.02907))

## Install StellarGraph
The following code installs the StellarGraph library and the other Python packages required for execution (TensorFlow, NetworkX, Pandas, etc.).

In [1]:
# install StellarGraph if running on Google Colab
!pip install -q stellargraph
!pip install -q scikit-learn



## Import all required packages

In [2]:
import os
import warnings
warnings.filterwarnings("ignore")
from time import time
import stellargraph as sg
import pandas as pd
import numpy as np
import networkx as nx
import tensorflow as tf
from tensorflow import keras
from stellargraph.mapper import PaddedGraphGenerator
from stellargraph.layer import DeepGraphCNN, GCNSupervisedGraphClassification
from stellargraph import StellarGraph
from stellargraph import datasets
from stellargraph.datasets.dataset_loader import DatasetLoader
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow.keras.losses import binary_crossentropy, categorical_crossentropy
from tensorflow.keras import backend as K
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,matthews_corrcoef,accuracy_score,precision_score,f1_score, recall_score
import tqdm as tq
import operator
import random

## The loading function
The following code loads a dataset in TU format.

In [17]:
def load_dataset(dataset, root_dir='.', edge_labels_as_weights=False):

    def _load_from_txt_file(filename, names=None, dtype=None, index_increment=None):
        df = pd.read_csv(
            f"{root_dir}/{dataset}/{dataset}_{filename}.txt",
            header=None,
            index_col=False,
            dtype=dtype,
            names=names,
        )
        # We optionally increment the index by 1 because indexing (e.g. node IDs) for this dataset starts
        # at 1, whereas the implicit Pandas DataFrame index starts at 0, which could cause confusion when
        # selecting rows later on.
        if index_increment:
            df.index = df.index + index_increment
        return df

    # edge information:
    df_graph = _load_from_txt_file(filename="A", names=["source", "target"])

    if edge_labels_as_weights and os.path.exists(f"{root_dir}/{dataset}/{dataset}_edge_labels.txt"):
        # there's some edge labels, that can be used as edge weights
        df_edge_labels = _load_from_txt_file(
            filename="edge_labels", names=["weight"], dtype=int
        )
        df_graph = pd.concat([df_graph, df_edge_labels], axis=1)

    # node information:
    df_graph_ids = _load_from_txt_file(
        filename="graph_indicator", names=["graph_id"], index_increment=1
    )

    df_node_labels = _load_from_txt_file(
        filename="node_labels", dtype="category", index_increment=1
    )
    # One-hot encode the node labels because these are used as node features in graph classification
    # tasks.
    df_node_features = pd.get_dummies(df_node_labels)

    if os.path.exists(f"{root_dir}/{dataset}/{dataset}_node_attributes.txt"):
        # there's some actual node attributes
        df_node_attributes = _load_from_txt_file(
            filename="node_attributes", dtype=np.float32, index_increment=1
        )

        df_node_features = pd.concat([df_node_features, df_node_attributes], axis=1)

    # graph information:
    df_graph_labels = _load_from_txt_file(
        filename="graph_labels", dtype="category", names=["label"], index_increment=1
    )

    # split the data into each of the graphs, based on the nodes in each one
    def graph_for_nodes(nodes):
        # each graph is disconnected, so the source is enough to identify the graph for an edge
        edges = df_graph[df_graph["source"].isin(nodes.index)]
        return StellarGraph(nodes, edges)

    groups = df_node_features.groupby(df_graph_ids["graph_id"])
    graphs = [graph_for_nodes(nodes) for _, nodes in groups]

    return graphs, df_graph_labels["label"]
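
To make the grouping step of the loader concrete, the following is a minimal pure-Python sketch of how a TU-style node-to-graph indicator and edge list can be split into per-graph edge lists. The helper `split_edges_by_graph` and the toy data are hypothetical and only illustrate the idea; they are not part of StellarGraph or of the loader above.

```python
from collections import defaultdict

def split_edges_by_graph(graph_indicator, edges):
    """Group edges by graph, using the graph id of each edge's source node.

    graph_indicator[i] is the graph id of node i + 1 (node ids are 1-based).
    """
    node_to_graph = {node: gid for node, gid in enumerate(graph_indicator, start=1)}
    per_graph = defaultdict(list)
    for source, target in edges:
        # Graphs are disjoint, so the source node alone identifies the graph.
        per_graph[node_to_graph[source]].append((source, target))
    return dict(per_graph)

# Toy data (illustrative only): nodes 1-3 belong to graph 1, nodes 4-5 to graph 2.
toy_indicator = [1, 1, 1, 2, 2]
toy_edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
toy_split = split_edges_by_graph(toy_indicator, toy_edges)
print(toy_split)
```

As in the toy data, each undirected edge is assumed to be listed once per direction, which is why a loaded graph can report twice as many edges as it has undirected bonds.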

## The edge removal function
This function implements the graph attack strategies: it removes a given percentage of edges, chosen either at random or by edge betweenness.

In [21]:
def nx_edgeattack(G: nx.Graph, criteria="random", percentage=30, verbose=False, random_state=42):
  """Remove `percentage` percent of the edges of G in place, ranked by `criteria`."""
  at = percentage / 100.0
  if criteria == "betweeness":
    # rank edges by edge betweenness centrality (highest first)
    score = nx.edge_betweenness_centrality(G).items()
  elif criteria == "degree":
    raise NotImplementedError("The 'degree' criteria is not implemented")
  elif criteria == "random":
    # shuffle the edges reproducibly and use the shuffled position as the score
    score = list(G.edges())
    random.Random(random_state).shuffle(score)
    score = list(dict(zip(score, range(len(score)))).items())
  else:
    raise ValueError(f"Wrong criteria: {criteria}")
  # select the top-scoring fraction of the edges (rounded down)
  edges_to_remove = sorted(score, key=operator.itemgetter(1, 0), reverse=True)[0:int(len(score)*at)]
  for e, w in edges_to_remove:
    G.remove_edge(e[0], e[1])
  if verbose:
    print("removed", edges_to_remove)
  # number of removed nodes (always 0) and number of removed edges
  return 0, len(edges_to_remove)
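
The selection rule used by the attack can be sketched in isolation, without NetworkX. The helper `select_edges_to_remove` and the toy scores below are hypothetical; the sketch only mirrors the ranking step (sort by descending score, break ties by the edge tuple, keep the top fraction rounded down).

```python
import operator

def select_edges_to_remove(scores, percentage):
    """Return the top `percentage` percent of edges, ranked by descending score.

    `scores` maps an edge (u, v) to a number; ties are broken by the edge tuple.
    """
    fraction = percentage / 100.0
    ranked = sorted(scores.items(), key=operator.itemgetter(1, 0), reverse=True)
    return [edge for edge, _ in ranked[: int(len(ranked) * fraction)]]

# Toy scores (illustrative only), e.g. edge betweenness values.
toy_scores = {(1, 2): 0.10, (2, 3): 0.40, (3, 4): 0.25, (4, 5): 0.40, (1, 5): 0.05}
selected = select_edges_to_remove(toy_scores, percentage=50)
print(selected)
```

Because the count is truncated with `int`, five edges at 50% yield two removals rather than three, so small graphs can lose slightly fewer edges than the nominal percentage.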

## Loading the dataset (and attacking a copy)

In [27]:
import tqdm as tq
dataname = 'MUTAG' #@param ['MUTAG', 'KIDNEY', 'Kidney_9.2', 'PROTEINS', 'JE']
criteria = 'random' #@param ['random', 'betweeness']
percentage = 50 #@param [0, 5, 10, 20, 30, 40, 50] {type:"raw"}
ontest = False #@param {type:"boolean"}
import shutil
shutil.unpack_archive(f'../datasets/{dataname}.zip', '../datasets/')
graphs, graph_labels = load_dataset(dataname, root_dir='../datasets')
graphsadv = []  # attacked copies of the graphs (plain copies when percentage is 0)
for G in tq.tqdm(graphs, desc="Copy/Attack"):
  Gadv = G.to_networkx()
  ec = Gadv.number_of_edges()
  if percentage > 0: 
    n,e = nx_edgeattack(Gadv, criteria=criteria, percentage=percentage, random_state=42)
  graphsadv += [sg.StellarGraph.from_networkx(Gadv,node_type_default="default", edge_type_default="default", node_type_attr="type", edge_type_attr="type", edge_weight_attr="weight", node_features='feature')]
if graph_labels.nunique() > 2:
  y = pd.get_dummies(graph_labels)
else:
  y = pd.get_dummies(graph_labels, drop_first=True)
print(graphs[0].node_features())
print(graphs[0].info())
print(y)
nclasses = len(y.columns)
print("No Classes %d\n"%nclasses)
summary = pd.DataFrame(
    [(g.number_of_nodes(), g.number_of_edges(),ga.number_of_nodes(), ga.number_of_edges()) for g,ga in zip(graphs,graphsadv)],
    columns=["(Graph) nodes", "(Graph) edges", "(Attacked Graph) nodes", "(Attacked Graph) edges"],
)

Copy/Attack: 100%|██████████| 188/188 [00:01<00:00, 152.95it/s]

[[1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]]
StellarGraph: Undirected multigraph
 Nodes: 17, Edges: 38

 Node types:
  default: [17]
    Features: float32 vector, length 7
    Edge types: default-default->default

 Edge types:
    default-default->default: [38]
        Weights: all 1 (default)
        Features: none
     1
1    1
2    0
3    0
4    1
5    0
..  ..
184  1
185  0
186  0
187  1
188  0

[188 rows x 1 columns]
No Classes 1
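
The single label column reported above for a two-class dataset follows from one-hot encoding with the first indicator dropped. The sketch below reproduces that idea in plain Python; `one_hot` is a hypothetical helper that mimics `pd.get_dummies` only in spirit, and the labels are toy values.

```python
def one_hot(labels, drop_first=False):
    """Encode labels as 0/1 indicator columns, one per sorted distinct label."""
    classes = sorted(set(labels))
    if drop_first:
        classes = classes[1:]  # drop the indicator of the first class
    return classes, [[int(label == c) for c in classes] for label in labels]

toy_labels = ["1", "-1", "-1", "1"]  # illustrative two-class labels
full_columns, full_rows = one_hot(toy_labels)
kept_columns, kept_rows = one_hot(toy_labels, drop_first=True)
print(full_columns, full_rows)
print(kept_columns, kept_rows)
```

With two classes the second column is fully determined by the first, so keeping one column loses no information and matches a single sigmoid output unit.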





## Print graph statistics

In [29]:
print(summary.describe().round(2))

       (Graph) nodes  (Graph) edges  (Attacked Graph) nodes  \
count         188.00         188.00                  188.00   
mean           17.93          39.59                   17.93   
std             4.59          11.40                    4.59   
min            10.00          20.00                   10.00   
25%            14.00          28.00                   14.00   
50%            17.50          38.00                   17.50   
75%            22.00          50.00                   22.00   
max            28.00          66.00                   28.00   

       (Attacked Graph) edges  
count                  188.00  
mean                    29.99  
std                      8.56  
min                     15.00  
25%                     21.00  
50%                     29.00  
75%                     38.00  
max                     50.00  


## Create the model

In [30]:
# The DGCNN model
def create_dgcnn_model(generator, size, nouts, learnrate=0.0001):
    k = 35  # the number of rows for the output tensor
    layer_sizes = [size, size, size, nouts]
    dgcnn_model = DeepGraphCNN(
        layer_sizes=layer_sizes,
        activations=["tanh", "tanh", "tanh", "tanh"],
        k=k,
        bias=False,
        generator=generator,
    )
    x_inp, x_out = dgcnn_model.in_out_tensors()
    x_out = Conv1D(filters=16, kernel_size=sum(layer_sizes), strides=sum(layer_sizes))(x_out)
    x_out = MaxPool1D(pool_size=2)(x_out)
    x_out = Conv1D(filters=32, kernel_size=5, strides=1)(x_out)
    x_out = Flatten()(x_out)
    x_out = Dense(units=size, activation="relu")(x_out)
    # the output of this layer is also exposed as the graph embedding
    x_out = embedlayer = Dropout(rate=0.5)(x_out)
    if nouts > 2:
        # multi-class: softmax over the classes
        predictions = Dense(units=nouts, activation="softmax")(x_out)
        loss = categorical_crossentropy
    else:
        # binary: a single sigmoid unit
        predictions = Dense(units=1, activation="sigmoid")(x_out)
        loss = binary_crossentropy
    model = Model(inputs=x_inp, outputs=predictions)
    model.compile(optimizer=Adam(learning_rate=learnrate), loss=loss, metrics=["acc"])
    # shares its weights with `model` and returns the last hidden layer output
    embedding = Model(inputs=x_inp, outputs=embedlayer)
    return model, embedding
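
The layer sizes in the model can be checked with a short shape calculation. This is a sketch under stated assumptions: `size=32` and `nouts=1` (the binary MUTAG setting used later), the DGCNN output for one graph is a sequence of length `k * sum(layer_sizes)` with one channel (as described in the StellarGraph demo this notebook follows), and the Keras layers use their default `'valid'` padding. The helper `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride):
    """Output length of a 1D convolution or pooling layer with 'valid' padding."""
    return (length - kernel_size) // stride + 1

k = 35
layer_sizes = [32, 32, 32, 1]       # assumes size=32 and nouts=1
width = sum(layer_sizes)            # features per sorted node
sortpool_len = k * width            # length of the flattened SortPooling output

after_conv1 = conv1d_out_len(sortpool_len, kernel_size=width, stride=width)
after_pool = conv1d_out_len(after_conv1, kernel_size=2, stride=2)  # MaxPool1D(pool_size=2)
after_conv2 = conv1d_out_len(after_pool, kernel_size=5, stride=1)
flattened = after_conv2 * 32        # 32 filters in the second Conv1D

print(after_conv1, after_pool, after_conv2, flattened)
```

The first convolution uses a kernel and stride equal to the per-node feature width, so it produces exactly one output step per retained node. The embedding itself has `size` dimensions, since it is taken after the final `Dense(units=size)` layer.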

## The Experimental Pipeline

In [31]:
# Load model parameters
params = {"layerdim": 32, "epochs": 100, "learningrate": 0.0001, "verbose": False, "seed": 42}

# Validation
start = time()
tot_preds = np.array([])
tot_targets = np.array([])
tot_acc = np.array([])
tot_prec = np.array([])
tot_F1 = np.array([])
tot_recall = np.array([])
tot_MCC = np.array([])
cv_folds = 10 #@param {type:"slider", min:2, max:10, step:1}
verbose = 1 if params['verbose'] else 0
skf = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=params['seed'])
setlist = []
if nclasses > 2:
  yy = np.argmax(y.values, axis=1)
  y = y.to_numpy()
else:
  y = y.to_numpy()
  yy = y
for train_index, test_index in tq.tqdm(list(skf.split(graphs,yy)), desc="fold: "):
  train_graphs = graph_labels[train_index+1]
  test_graphs = graph_labels[test_index+1]
  setlist += [set(test_index)]
  gen = PaddedGraphGenerator(graphs=graphs)
  genadv = PaddedGraphGenerator(graphs=graphsadv)
  train_gen = gen.flow(list(train_graphs.index - 1),targets=y[np.array(train_graphs.index-1)],batch_size=1,symmetric_normalization=False)
  test_gen = genadv.flow(list(test_graphs.index - 1),targets=y[np.array(test_graphs.index-1)],batch_size=1,symmetric_normalization=False)
  # 1-D class labels for scikit-learn (the one-hot rows are only needed by Keras)
  y_train = yy[train_index].ravel()
  y_test = yy[test_index].ravel()
  model, embedding = create_dgcnn_model(gen, params['layerdim'], nclasses, learnrate=params['learningrate'])
  history = model.fit(train_gen, validation_data=test_gen, shuffle=False, epochs=params['epochs'], verbose=verbose)
  X_test = embedding.predict(test_gen)
  X_train = embedding.predict(train_gen)
  y_pred = SVC(kernel='linear').fit(X_train,y_train).predict(X_test)
  tot_preds = np.append(tot_preds,y_pred)
  tot_targets = np.append(tot_targets,y_test)
  tot_acc = np.append(tot_acc, accuracy_score(y_test, y_pred))
  tot_prec = np.append(tot_prec, precision_score(y_test, y_pred, average='macro'))
  tot_F1 = np.append(tot_F1, f1_score(y_test, y_pred, average='macro'))
  tot_recall = np.append(tot_recall, recall_score(y_test, y_pred, average='macro'))
  tot_MCC = np.append(tot_MCC, matthews_corrcoef(y_test, y_pred))
temp = time() - start
hours = temp//3600
temp = temp - 3600*hours
minutes = temp//60
seconds = temp - 60*minutes
expired = '%d:%d:%d' %(hours,minutes,seconds)
print()
print(confusion_matrix(tot_targets, tot_preds))
print("Acc\t%.2f\u00B1%.2f"%((tot_acc * 100).mean(), (tot_acc * 100).std()))
print("Prec\t%.2f\u00B1%.2f"%(tot_prec.mean(), tot_prec.std()))
print("F1\t%.2f\u00B1%.2f"%(tot_F1.mean(), tot_F1.std()))
print("Recall\t%.2f\u00B1%.2f"%(tot_recall.mean(), tot_recall.std()))
print('MCC\t%.2f\u00B1%.2f'%(tot_MCC.mean(), tot_MCC.std()))

fold: 100%|██████████| 10/10 [04:55<00:00, 29.57s/it]


[[ 44  19]
 [ 11 114]]
Acc	84.09±6.13
Prec	0.85±0.08
F1	0.81±0.08
Recall	0.81±0.09
MCC	0.65±0.15
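
Note that the confusion matrix pools the predictions of all folds, while the reported scores are means over the per-fold values, so the two need not agree exactly. As a worked check, the sketch below recomputes accuracy and MCC from the pooled matrix printed above, assuming the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import math

# Pooled confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
tn, fp = 44, 19
fn, tp = 11, 114

total = tn + fp + fn + tp
pooled_acc = (tp + tn) / total
pooled_mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"pooled accuracy: {100 * pooled_acc:.2f}")
print(f"pooled MCC: {pooled_mcc:.2f}")
```

The pooled accuracy is about 84.04 and the pooled MCC about 0.63, close to but not identical with the per-fold means of 84.09 and 0.65.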



