<a href="https://colab.research.google.com/github/giordamaug/LION15_Experiments/blob/main/notebook/iNP2V_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inductive Netpro2vec iNP2V experiments as embedding method (LION15 SI)
In this notebook we run experiments with the inductive Netpro2vec (iNP2V) method [1].

The iNP2V implementation uses the [igraph](https://igraph.org/) library for graph management, and [Gensim](https://radimrehurek.com/gensim), with its doc2vec model, as the paragraph embedding method.

**References**

[1] I. Manipur, M. Manzo, I. Granata, M. Giordano, L. Maddalena and M. R. Guarracino, "Netpro2vec: a Graph Embedding Framework for Biomedical Applications," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi: [10.1109/TCBB.2021.3078089](https://ieeexplore.ieee.org/document/9425591).

## Mount Google Drive
The following code snippet mounts the Google Drive directory where the test datasets are stored.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd '/content/drive/MyDrive/CDS-GROUP-ROOT'

Mounted at /content/drive
/content/drive/MyDrive/CDS-GROUP-ROOT


## Download the iNP2V (github) software 
The iNP2V software is installed in the Colab runtime with `pip`, directly from its GitHub repository.

In [None]:
!pip install --upgrade -q git+https://github.com/cds-group/Netpro2vec.git
!pip install -q python-igraph

## The edge attack routine
Two strategies are currently implemented for edge removal attacks:
1. random selection
2. selection based on the betweenness centrality of edges

The routine takes as input the removal criteria (`random` or `betweeness`), the amount of edges to remove (as a `percentage`), and a random seed (for reproducibility of experiments). The selected edges are deleted from the input graph in place.

In [None]:
import igraph as ig
import operator
import random

def ig_edgeattack(G: ig.Graph, criteria = "random", percentage=30, verbose=False, random_state=42):
  """Remove `percentage`% of the edges of G in place; return (0, number of removed edges)."""
  at = percentage/100.0
  if criteria == "betweeness":
    # score each edge by its betweenness centrality
    score = list(G.edge_betweenness())
  elif criteria == "random":
    # score each edge by its position in a seeded random permutation
    score = list(range(G.ecount()))
    random.Random(random_state).shuffle(score)
  else:
    raise Exception("Wrong criteria")
  # pair each edge index with its score, then keep the top-scoring fraction
  score = list(enumerate(score))
  b = sorted(score, key=operator.itemgetter(1, 0), reverse=True)[0:int(len(score)*at)]
  edges_to_remove = [e for e,w in b]
  if verbose:
    print("removed", [ (G.es[edge].source, G.es[edge].target) for edge in edges_to_remove])
  G.delete_edges(edges_to_remove)
  return 0,len(edges_to_remove)
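
The selection step of the routine can be looked at in isolation, without igraph. The following is a minimal sketch: `select_edges` is a hypothetical helper (not part of the notebook or of Netpro2vec) and the scores are made-up values. It shows that ties are broken by edge index and that `int()` truncation can leave small graphs untouched at low percentages.

```python
import operator
import random

def select_edges(scores, percentage):
    """Hypothetical helper: indices of the top `percentage`% edges, ranked by (score, index) descending."""
    ranked = sorted(enumerate(scores), key=operator.itemgetter(1, 0), reverse=True)
    return [idx for idx, _ in ranked[: int(len(scores) * percentage / 100.0)]]

# Made-up betweenness scores for a graph with 6 edges.
betweenness = [4.0, 9.0, 1.0, 9.0, 2.5, 6.0]
top_half = select_edges(betweenness, 50)
print(top_half)

# Random criteria: a seeded shuffle of the edge indices plays the role of the score.
ranks = list(range(6))
random.Random(42).shuffle(ranks)
random_half = select_edges(ranks, 50)
print(random_half)

# With 5 edges and 10%, int(0.5) == 0, so nothing is selected.
print(select_edges([1.0] * 5, 10))
```

The truncation is one reason why the average number of deleted edges can be well below the nominal percentage on small graphs.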

## The dataset loading function
The dataset of graphs (in `graphml` format) is converted into [igraph](https://igraph.org/python/doc/tutorial/tutorial.html) data structures. The loading procedure (`load_graphs`) generates two copies of the dataset: the first is the original graph set, while the second is the dataset modified by the attacking routine (edge removal). The third returned value is the array of integer-encoded class labels, read with pandas from a tab-separated file that associates each graph name with its label (used for classification/training).

In [None]:
import os
import numpy as np
import pandas as pd
import tqdm.notebook as tq
import igraph as ig
from sklearn import preprocessing
#
# load_graphs : load graphs into igraph format, together with an attacked copy
#     - input_path : root directory of the datasets
#     - dataname : name of the dataset
#     - fmt='graphml' : input graph format (graphml or edgelist)
#     - ontest : if True return (original, attacked, labels), else (attacked, original, labels)
#     - percentage : percentage of edges to remove from each graph
#     - criteria : edge removal criteria ('random' or 'betweeness')
#
def load_graphs(input_path, dataname, fmt='graphml', ontest=True, percentage=20, criteria='random'):
    datapath = f'{input_path}/{dataname}/{fmt}'
    if not os.path.isdir(datapath):
        raise Exception(f'Wrong input path! {datapath}')
    print("Loading " + dataname + " graphs with igraph...")
    graphs = []
    graphsadv = []
    targets = []
    # label file: tab-separated, sample names in the first column and labels in the last one
    dfl = pd.read_csv(f'{input_path}/{dataname}/{dataname}.txt', sep='\t')
    last_column = dfl.iloc[:,[0] + [-1]]
    for file in tq.tqdm(last_column['Samples'].values):
        igG = ig.load(os.path.join(datapath,f'{file}.{fmt}'))
        igGadv = igG.copy()
        if percentage > 0:
            # the attack modifies the copy in place
            ig_edgeattack(igGadv, criteria=criteria, percentage=percentage, random_state=42)
        graphs.append(igG)
        graphsadv.append(igGadv)
        targets.append(last_column[last_column['Samples'].astype(str) == file].iloc[:,-1:])
    # encode the class labels as integers
    le = preprocessing.LabelEncoder()
    le.fit(np.ravel(targets))
    y = le.transform(np.ravel(targets))
    if ontest:
        return graphs,graphsadv,y
    else:
        return graphsadv,graphs,y
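
To make the expected input layout concrete, here is a small standard-library sketch of how a label file is read and how labels become integers. The file content, the sample names and the `TUDatasets` root are hypothetical; the sorted-classes encoding mirrors what `LabelEncoder` does, but this is an illustration rather than the notebook's code.

```python
import csv
import io

# Hypothetical label file: tab-separated, a `Samples` column first and the class label last.
label_text = "Samples\tlabel\ng1\tmutagen\ng2\tnonmutagen\ng3\tmutagen\n"

rows = list(csv.reader(io.StringIO(label_text), delimiter="\t"))
header, records = rows[0], rows[1:]
names = [r[0] for r in records]
raw_labels = [r[-1] for r in records]

# Integer encoding: sorted unique classes are mapped to 0..n_classes-1.
classes = sorted(set(raw_labels))
y = [classes.index(label) for label in raw_labels]

# Graph file paths following the <input_path>/<dataname>/<fmt>/<sample>.<fmt> layout.
paths = [f"TUDatasets/MUTAG/graphml/{name}.graphml" for name in names]
print(y, paths[0])
```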

## Load the dataset in graphml format
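This code snippet loads the dataset of graphs (in `graphml` format) and performs the graph attacks (see the `load_graphs` procedure defined above). At the end, a summary of the graph modifications is printed out. Note that with `ontest = False`, `load_graphs` returns the attacked set first, so the summary columns labelled "(Graph)" describe the attacked graphs and those labelled "(Attacked Graph)" describe the originals.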



In [None]:
import pandas as pd
import numpy as np
import random
import operator
#@title  { form-width: "30%" }

dataset = 'MUTAG' #@param ['MUTAG', 'KIDNEY', 'Kidney_9.2', 'KIDNEYFOR', 'PROTEINS', 'BREAST', 'JE']
criteria = 'betweeness' #@param ['random', 'betweeness']
percentage = 50 #@param ["0.5", "1.0", "5.0", "10", "20", "30", "40", "50", "0"] {type:"raw"}
ontest = False #@param {type:"boolean"}
path = '/content/drive/MyDrive/CDS-GROUP-ROOT/TUDatasets' #@param {type:"string"}
graphs, graphs_adv, y = load_graphs(path,dataset, criteria=criteria, percentage=percentage, ontest=ontest)
nclasses = len(set(y))
print("No Classes: %d\n"%nclasses)
summary = pd.DataFrame(
    [(g.vcount(), g.ecount(),ga.vcount(), ga.ecount()) for g,ga in zip(graphs,graphs_adv)],
    columns=["(Graph) nodes", "(Graph) edges", "(Attacked Graph) nodes", "(Attacked Graph) edges"],
)
summary.describe().round(2)

Loading MUTAG graphs with igraph...



No Classes: 2



| | (Graph) nodes | (Graph) edges | (Attacked Graph) nodes | (Attacked Graph) edges |
|---|---|---|---|---|
| count | 188.0 | 188.0 | 188.0 | 188.0 |
| mean | 17.93 | 10.2 | 17.93 | 19.79 |
| std | 4.59 | 2.87 | 4.59 | 5.7 |
| min | 10.0 | 5.0 | 10.0 | 10.0 |
| 25% | 14.0 | 7.0 | 14.0 | 14.0 |
| 50% | 17.5 | 10.0 | 17.5 | 19.0 |
| 75% | 22.0 | 13.0 | 22.0 | 25.0 |
| max | 28.0 | 17.0 | 28.0 | 33.0 |


## Load default parameters for the model

We can load a default parameter setting for the model from a file.


In [None]:
import os
import json
method='iNP2V'
confpath = '/content/drive/MyDrive/CDS-GROUP-ROOT/TUDatasets/LION15_results' #@param {type:"string"}
path = os.path.join(confpath, f'{method}_{dataset}_params.json')
if os.path.isfile(path):
  with open(path, 'r') as f:
    params = json.load(f)
  print(params)
else:
  print("No default found!")

{'agg_by': [1], 'cut_off': [0.1], 'dimensions': 512, 'encodew': False, 'epochs': 400, 'extractor': [1], 'min_count': 2, 'prob_type': ['ndd'], 'save_vocab': True, 'seed': 1, 'verbose': False, 'vertex_attribute': 'label', 'workers': 4}


## ... or tune the parameters (skip this cell if using default)
Alternatively, we can experiment with different parameter settings.
Once you have chosen a parameter configuration, you can store it in a file by enabling the `save_params` flag.

In [None]:

dimensions = 512 #@param {type:"slider", min:32, max:1024, step:32}
extractors = [1,1] #@param {type:"raw"}
distributions = ['ndd','tm1'] #@param {type:"raw"}
cutoffs = [0.1,0.1] #@param {type:"raw"}
aggregator = [1,0] #@param {type:"raw"}
epochs = 200 #@param {type:"slider", min:0, max:1000, step:10}
mincount = 2 #@param {type:"slider", min:1, max:10, step:1}
seed =  1#@param {type:"integer"}
verbose = False #@param {type:"boolean"}
encode_words = False #@param {type:"boolean"}
vertex_label = 'label' #@param ['None', 'label']
vlabel = vertex_label if vertex_label != 'None' else None
workers = 1 #@param {type:"slider", min:1, max:10, step:1}
save_params = False #@param {type:"boolean"}
outpath = '/content/drive/MyDrive/CDS-GROUP-ROOT/TUDatasets/LION15_results' #@param {type:"string"}
from netpro2vec.Netpro2vec import Netpro2vec
params = {'dimensions': dimensions, 
          'extractor':extractors,
          'prob_type':distributions,
          'cut_off':cutoffs, 
          'agg_by':aggregator,
          'verbose':verbose,
          'vertex_attribute':vlabel, 
          'encodew':encode_words,
          'epochs':epochs,
          'seed':seed,
          'min_count':mincount,
          'workers':workers
          }
import json
if save_params:
  path = os.path.join(outpath, f'{method}_{dataset}_params.json')
  # use a context manager so the file is flushed and closed
  with open(path, 'w') as f:
    json.dump(params, f)
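
The save/load round trip of a parameter file can be checked on its own. This is a minimal sketch: the temporary directory stands in for the Drive results folder, and the `params` dictionary is a shortened, illustrative subset of the real configuration.

```python
import json
import os
import tempfile

method, dataset = "iNP2V", "MUTAG"
# Illustrative subset of the parameters; None becomes JSON null and is restored as None.
params = {"dimensions": 512, "epochs": 200, "prob_type": ["ndd", "tm1"], "vertex_attribute": None}

outdir = tempfile.mkdtemp()  # stand-in for the results folder on Drive
path = os.path.join(outdir, f"{method}_{dataset}_params.json")

with open(path, "w") as f:
    json.dump(params, f)
with open(path, "r") as f:
    restored = json.load(f)

print(os.path.basename(path), restored == params)
```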

## Validation
This is the main part of the experiment. Once the data have been loaded and modified (by the attacking routine), we carry out a stratified 10-fold cross-validation of the pipeline consisting of the iNP2V embedding model followed by an SVM classifier with a linear kernel.

In each fold an iNP2V model is trained on 90% of the dataset; the trained model is then used to embed both the training and the test samples (inductively). The embedding arrays are used as training/test inputs of the SVM classifier. We collect the prediction results for all folds and print the mean and standard deviation of several metrics (Accuracy, Precision, F-measure, Recall and Matthews Correlation Coefficient).
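
The idea of a stratified split can be illustrated with the standard library alone. The sketch below is a simplified round-robin assignment, not scikit-learn's `StratifiedKFold` algorithm; `stratified_folds` is a hypothetical helper, and the class sizes (63 and 125) are those of the MUTAG confusion matrix shown in this notebook.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=42):
    """Hypothetical helper: spread the indices of each class evenly over n_splits folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds

labels = [0] * 63 + [1] * 125
folds = stratified_folds(labels, 10)

splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # In the notebook, training graphs come from one set and test graphs from the other.
    splits.append((train_idx, test_idx))

print([len(test) for _, test in splits])
```

Each fold keeps roughly the same class proportions as the whole dataset, which matters here because the two classes are unbalanced.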

In [None]:

from netpro2vec.Netpro2vec import Netpro2vec
import warnings
warnings.filterwarnings("ignore")
#@title  { form-width: "30%" }
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,matthews_corrcoef,accuracy_score,precision_score,f1_score, recall_score

from time import time
start = time()
G = np.array(graphs)
Gadv = np.array(graphs_adv)
cv_folds = 10 #@param {type:"slider", min:2, max:10, step:1}
tot_preds = np.array([])
tot_targets = np.array([])
tot_acc = np.array([])
tot_prec = np.array([])
tot_F1 = np.array([])
tot_recall = np.array([])
tot_MCC = np.array([])
skf = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
for train_index, test_index in tq.tqdm(list(skf.split(G,y)), desc="fold: "):
    G_train, G_test = G[train_index], Gadv[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = Netpro2vec(**params)
    X_train = model.fit(G_train).get_embedding()
    X_test = np.array(model.infer_vector(G_test))
    y_pred = SVC(kernel='linear').fit(X_train,y_train).predict(X_test)
    tot_preds = np.append(tot_preds,y_pred)
    tot_targets = np.append(tot_targets,y_test)
    tot_acc = np.append(tot_acc, accuracy_score(y_test, y_pred))
    tot_prec = np.append(tot_prec, precision_score(y_test, y_pred, average='macro'))
    tot_F1 = np.append(tot_F1, f1_score(y_test, y_pred, average='macro'))
    tot_recall = np.append(tot_recall, recall_score(y_test, y_pred, average='macro'))
    tot_MCC = np.append(tot_MCC, matthews_corrcoef(y_test, y_pred))
temp = time() - start
hours = temp//3600
temp = temp - 3600*hours
minutes = temp//60
seconds = temp - 60*minutes
expired = '%d:%d:%d' %(hours,minutes,seconds)
print()
print(confusion_matrix(tot_targets, tot_preds))
print("Acc\t%.2f\u00B1%.2f"%((tot_acc * 100).mean(), (tot_acc * 100).std()))
print("Prec\t%.2f\u00B1%.2f"%(tot_prec.mean(), tot_prec.std()))
print("F1\t%.2f\u00B1%.2f"%(tot_F1.mean(), tot_F1.std()))
print("Recall\t%.2f\u00B1%.2f"%(tot_recall.mean(), tot_recall.std()))
print('MCC\t%.2f\u00B1%.2f'%(tot_MCC.mean(), tot_MCC.std()))



[[  0  63]
 [  0 125]]
Acc	66.49±2.28
Prec	0.33±0.01
F1	0.40±0.01
Recall	0.50±0.00
MCC	0.00±0.00
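
The printed scores can be related to the pooled confusion matrix with a few lines of plain Python. This is a sketch under two assumptions: rows are true classes and columns are predicted classes (the scikit-learn convention), and undefined ratios are set to 0. `binary_metrics` is a hypothetical helper; the notebook averages per-fold scores, so pooled values may differ slightly from the printed means.

```python
import math

def safe_div(num, den):
    # Undefined ratios (zero denominator) are reported as 0.0.
    return num / den if den else 0.0

def binary_metrics(cm):
    """Hypothetical helper: accuracy, macro precision, macro recall and MCC from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    acc = (tn + tp) / (tn + fp + fn + tp)
    macro_prec = (safe_div(tn, tn + fn) + safe_div(tp, tp + fp)) / 2
    macro_rec = (safe_div(tn, tn + fp) + safe_div(tp, tp + fn)) / 2
    mcc = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return acc, macro_prec, macro_rec, mcc

# Pooled confusion matrix from the run above: every graph is predicted as class 1.
acc, macro_prec, macro_rec, mcc = binary_metrics([[0, 63], [0, 125]])
print("%.4f %.2f %.2f %.2f" % (acc, macro_prec, macro_rec, mcc))
```

An accuracy near 66% with an MCC of 0 therefore corresponds to a classifier that always predicts the majority class.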


## Saving results in CSV
This code snippet appends the execution results to a CSV file in the `outpath` directory.

In [None]:
method = 'iNP2V'
#@title  { form-width: "30%" }
outpath = '/content/drive/MyDrive/CDS-GROUP-ROOT/TUDatasets/LION15_results' #@param {type:"string"}
from datetime import datetime
import pandas as pd
path = os.path.join(outpath, f'{method}_{dataset}_e{params["epochs"]}.csv')
if not os.path.exists(path):
  dfres = pd.DataFrame(columns=['mode', 'criteria', '% attack','avg edge del', 'acc','prec','f1','recall','MCC','cm', 'date', 'time'])
  dfres.to_csv(path, index=False)
dfres = pd.read_csv(path)
mode = 'test' if ontest else 'train'
row = {'mode' : mode, 'criteria' : criteria, '% attack': str(percentage), 
       'avg edge del' : "%.2f"%(abs(float(summary.describe().iat[1,1]) - float(summary.describe().iat[1,3]))),
       'acc' : "%.2f\u00B1%.2f"%(tot_acc.mean(), tot_acc.std()), 
       'prec' : "%.2f\u00B1%.2f"%(tot_prec.mean(), tot_prec.std()),
       'f1' : "%.2f\u00B1%.2f"%(tot_F1.mean(), tot_F1.std()),
       'recall' : "%.2f\u00B1%.2f"%(tot_recall.mean(), tot_recall.std()),
       'MCC' : "%.2f\u00B1%.2f"%(tot_MCC.mean(), tot_MCC.std()),
       'cm' : f'{confusion_matrix(tot_targets, tot_preds)}'.replace('\n',''),
       'date': datetime.now().strftime("%m/%d/%Y, %H:%M:%S"),
       'time' : expired}
# DataFrame.append was removed in pandas 2.0; pd.concat appends the new row instead
dfres = pd.concat([dfres, pd.DataFrame([row])], ignore_index=True)
dfres.to_csv(path, index=False)
dfres

| | mode | criteria | % attack | avg edge del | acc | prec | f1 | recall | MCC | cm | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | test | betweeness | 0 | 0.0 | 0.82±0.07 | 0.82±0.08 | 0.79±0.09 | 0.80±0.10 | 0.61±0.17 | [[ 45 18] [ 16 109]] | 09/28/2021, 08:38:15 | 0:0:40 |
| 1 | test | betweeness | 5 | 0.49 | 0.76±0.09 | 0.76±0.10 | 0.74±0.09 | 0.76±0.09 | 0.52±0.19 | [[48 15] [30 95]] | 09/28/2021, 08:39:22 | 0:0:41 |
| 2 | test | betweeness | 10 | 1.53 | 0.63±0.12 | 0.69±0.09 | 0.63±0.12 | 0.69±0.10 | 0.39±0.19 | [[55 8] [61 64]] | 09/28/2021, 08:40:58 | 0:0:40 |
| 3 | test | betweeness | 20 | 3.53 | 0.44±0.18 | 0.40±0.29 | 0.36±0.19 | 0.55±0.09 | 0.14±0.21 | [[56 7] [99 26]] | 09/28/2021, 08:42:18 | 0:0:41 |
| 4 | test | betweeness | 30 | 5.5 | 0.41±0.15 | 0.30±0.27 | 0.31±0.12 | 0.52±0.03 | 0.07±0.13 | [[ 52 11] [100 25]] | 09/28/2021, 08:43:44 | 0:0:41 |
| 5 | test | betweeness | 40 | 7.46 | 0.46±0.20 | 0.33±0.28 | 0.36±0.18 | 0.54±0.08 | 0.11±0.22 | [[49 14] [87 38]] | 09/28/2021, 08:44:43 | 0:0:40 |
| 6 | test | betweeness | 50 | 9.59 | 0.37±0.11 | 0.19±0.05 | 0.27±0.05 | 0.50±0.00 | 0.00±0.00 | [[ 57 6] [112 13]] | 09/28/2021, 08:45:49 | 0:0:41 |
| 7 | train | random | 0 | 0.0 | 0.82±0.07 | 0.82±0.08 | 0.79±0.09 | 0.80±0.10 | 0.61±0.17 | [[ 45 18] [ 16 109]] | 09/28/2021, 08:47:23 | 0:0:41 |
| 8 | train | random | 5 | 0.49 | 0.70±0.10 | 0.70±0.11 | 0.69±0.10 | 0.71±0.12 | 0.41±0.22 | [[47 16] [40 85]] | 09/28/2021, 08:49:23 | 0:0:41 |
| 9 | train | random | 10 | 1.53 | 0.76±0.11 | 0.76±0.12 | 0.74±0.11 | 0.76±0.11 | 0.52±0.23 | [[49 14] [32 93]] | 09/28/2021, 08:50:24 | 0:0:41 |
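
The `mean±std` cells in this table can be reproduced with the standard library. This is a small sketch: `mean_pm_std` is a hypothetical helper and the per-fold accuracies are made-up values. The population standard deviation is used because NumPy's `std()` defaults to it.

```python
import statistics

def mean_pm_std(values, scale=1.0):
    """Hypothetical helper: format values as 'mean±std' using the population standard deviation."""
    scaled = [v * scale for v in values]
    return "%.2f\u00B1%.2f" % (statistics.mean(scaled), statistics.pstdev(scaled))

# Made-up per-fold accuracies.
fold_acc = [0.8, 0.9, 0.7, 0.8]

as_fraction = mean_pm_std(fold_acc)             # the form stored in the CSV file
as_percent = mean_pm_std(fold_acc, scale=100)   # the form printed by the validation cell
print(as_fraction, as_percent)
```

Note that the validation cell prints accuracy as a percentage, while the CSV file stores it as a fraction.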
