In this notebook, we perform **node classification** using **graph embeddings**. We extract node embeddings from the graph with two techniques, **DeepWalk** and **Node2Vec**, and then use these embeddings as input features to train and evaluate **classification models**.

In [None]:
from deepwalk_skipgram import deepwalk_skipgram
from evaluate_embedding_node_classification import evaluate_embedding_node_classification 
from evaluate_embedding_node_classification_rf import evaluate_embedding_node_classification_rf
from evaluate_embedding_node_classification_svm import evaluate_embedding_node_classification_svm

from torch_geometric.datasets import Planetoid
import numpy as np
from torch_geometric.utils import to_networkx
import networkx as nx
from node2vec import Node2Vec




In [2]:
dataset = Planetoid(root='data/CiteSeer', name='CiteSeer')
data = dataset[0]

print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of features: {data.num_node_features}')
print(f'Number of classes: {dataset.num_classes}')

Number of nodes: 3327
Number of edges: 9104
Number of features: 3703
Number of classes: 6




In [3]:
G = to_networkx(data, node_attrs=['x'], to_undirected=True)

In [4]:
adj_matrix=nx.to_numpy_array(G)

# Node2Vec

In [20]:
# Initialize Node2Vec
node2vec = Node2Vec(
    G, dimensions=64, walk_length=20, num_walks=10, workers=4
)

# Train Node2Vec
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Get embeddings
embeddings = model.wv  # Word2Vec model, embeddings accessible via model.wv

# Stack the vectors in node-ID order so that row i lines up with label data.y[i]
embedding_matrix = np.array([model.wv[str(node)] for node in range(G.number_of_nodes())])

embedding_matrix



Computing transition probabilities: 100%|██████████| 3327/3327 [00:00<00:00, 15781.92it/s]


array([[-0.09893435, -0.4263198 ,  0.8625401 , ..., -1.2171283 ,
        -0.6660046 , -0.6723451 ],
       [-0.2682902 , -0.6559411 ,  0.70598024, ..., -1.1611893 ,
         0.07304472, -0.2867335 ],
       [-0.16343555, -0.29298434,  0.8810016 , ..., -0.20318702,
        -0.5131983 , -1.4762343 ],
       ...,
       [-0.02471617, -0.8680827 , -0.43844387, ..., -0.5989203 ,
         0.03432708, -0.29996353],
       [ 0.35099536, -0.6044491 ,  0.02776687, ..., -0.03767155,
         0.02319954, -0.07565139],
       [-0.48768967, -0.3172055 ,  0.9945586 , ..., -0.78757244,
        -0.42890227, -0.09246605]], dtype=float32)

In this cell, we use the **Node2Vec** algorithm to generate node embeddings for our graph G. First, we initialize the Node2Vec model with an embedding dimension of 64, a walk length of 20, 10 walks per node, and 4 parallel workers. The model is then trained with the **skip-gram** approach using a window size of 10, a minimum count of 1 (so no node is dropped from the vocabulary), and batch_words=4, which sets the target number of words per batch handed to each worker thread rather than a training batch size in the usual sense. After training, the node embeddings are available through the model.wv object, keyed by the node ID as a string. Finally, we stack them into an embedding matrix, one row per node, for further analysis or classification.
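To illustrate what the window size controls during skip-gram training, here is a minimal sketch of how one random walk is turned into (center, context) training pairs. The helper name skipgram_pairs and the toy walk are illustrative assumptions, not part of the node2vec or gensim APIs. The sketch uses a fixed window, whereas the real Word2Vec implementation also shrinks the window randomly per position.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for one walk using a fixed window.

    Hypothetical helper for illustration only; it is not the gensim code path.
    """
    pairs = []
    for i, center in enumerate(walk):
        # Context positions are those within `window` steps of position i
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# A toy walk; node IDs are strings, as in the Word2Vec vocabulary
walk = ["0", "5", "2", "7", "3"]
pairs = skipgram_pairs(walk, window=1)
print(pairs)
```

With window=1 each node is paired only with its immediate neighbors in the walk. A window of 10 on walks of length 20, as used above, pairs each node with up to 10 nodes on either side.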

In [18]:
R, S = evaluate_embedding_node_classification(embedding_matrix, data.y.numpy())

Classification Report for seed 1:
              precision    recall  f1-score   support

           0       0.67      0.01      0.02       238
           1       0.39      0.41      0.40       531
           2       0.66      0.67      0.67       601
           3       0.42      0.60      0.50       631
           4       0.63      0.68      0.65       537
           5       0.56      0.41      0.47       457

    accuracy                           0.52      2995
   macro avg       0.55      0.46      0.45      2995
weighted avg       0.54      0.52      0.50      2995

Classification Report for seed 2:
              precision    recall  f1-score   support

           0       0.07      0.00      0.01       238
           1       0.48      0.42      0.45       531
           2       0.65      0.67      0.66       601
           3       0.44      0.71      0.54       631
           4       0.64      0.69      0.66       537
           5       0.64      0.45      0.53       457

    accur



In [12]:
R, S = evaluate_embedding_node_classification_svm(embedding_matrix, data.y.numpy())

Classification Report for SVM (seed 1):
              precision    recall  f1-score   support

           0       0.33      0.01      0.02       238
           1       0.36      0.46      0.40       531
           2       0.67      0.66      0.66       601
           3       0.42      0.59      0.49       631
           4       0.66      0.66      0.66       537
           5       0.60      0.41      0.49       457

    accuracy                           0.52      2995
   macro avg       0.51      0.46      0.45      2995
weighted avg       0.52      0.52      0.50      2995

Classification Report for SVM (seed 2):
              precision    recall  f1-score   support

           0       0.18      0.02      0.04       238
           1       0.47      0.43      0.45       531
           2       0.69      0.66      0.67       601
           3       0.42      0.73      0.53       631
           4       0.70      0.65      0.68       537
           5       0.62      0.42      0.50       45



In [13]:
R, S = evaluate_embedding_node_classification_rf(embedding_matrix, data.y.numpy())

Classification Report for Random Forest (seed 1):
              precision    recall  f1-score   support

           0       0.49      0.14      0.22       238
           1       0.44      0.53      0.48       531
           2       0.64      0.65      0.65       601
           3       0.49      0.63      0.55       631
           4       0.68      0.72      0.70       537
           5       0.70      0.45      0.55       457

    accuracy                           0.57      2995
   macro avg       0.58      0.52      0.53      2995
weighted avg       0.58      0.57      0.56      2995

Classification Report for Random Forest (seed 2):
              precision    recall  f1-score   support

           0       0.24      0.05      0.09       238
           1       0.50      0.50      0.50       531
           2       0.65      0.67      0.66       601
           3       0.50      0.65      0.56       631
           4       0.66      0.74      0.70       537
           5       0.63      0.5

# DeepWalk

In [None]:
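# Positional arguments, assumed to follow the order of the description below:
# adjacency matrix, embedding dimension (64), walk length (80), walks per node (10),
# workers (8), context window size (10), negative samples (1)
embedding = deepwalk_skipgram(adj_matrix, 64, 80, 10, 8, 10, 1)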
print(embedding)


[[-0.37095743  0.05554513  0.39400974 ...  0.0966529   0.14258593
   0.41375005]
 [ 0.08652798 -0.03317666  0.20059016 ...  0.08500014  0.02109049
   0.59266287]
 [-0.35050303 -0.59170437  0.3946816  ...  0.44476083  0.2118759
   0.16605836]
 ...
 [-0.04889694 -0.11678085 -0.30817577 ...  0.48883709  0.4220567
  -0.05476727]
 [ 0.3297362  -0.36138195  0.66463429 ... -0.25822192  0.22414826
  -0.58632076]
 [ 0.05145163 -0.64699495  0.63103318 ... -0.44660971  0.08342396
   0.74478781]]


The cell above calls **deepwalk_skipgram**, which computes node embeddings for a graph using the DeepWalk algorithm with the skip-gram model. The function takes the adjacency matrix adj_matrix and several hyperparameters: the embedding dimension, the walk length, the number of random walks per node, the number of workers for parallel processing, the context window size, and the number of negative samples used in training. It first samples random walks from the graph with the **sample_random_walks** function, converts the walks into a format compatible with Word2Vec, and then trains a Word2Vec model using the skip-gram approach. The resulting embedding of each node is retrieved and returned as a NumPy array. If a node is not present in the learned embeddings, it is assigned a random vector.
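The walk-sampling step described above can be sketched as follows. The source of sample_random_walks is not shown in this notebook, so sample_uniform_walks below is a hypothetical stand-in written under simple assumptions: an unweighted adjacency matrix, uniform neighbor choice, and walks that stop early at nodes with no neighbors.

```python
import random


def sample_uniform_walks(adj, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks from an adjacency matrix.

    Hypothetical stand-in for sample_random_walks; the original is not shown.
    Returns walks as lists of string tokens, the input format Word2Vec expects.
    """
    rng = random.Random(seed)
    n = len(adj)
    # Precompute neighbor lists from the nonzero entries of each row
    neighbors = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    walks = []
    for _ in range(walks_per_node):
        starts = list(range(n))
        rng.shuffle(starts)  # visit start nodes in a random order on each pass
        for start in starts:
            walk = [start]
            # Extend the walk until it is long enough or hits an isolated node
            while len(walk) < walk_length and neighbors[walk[-1]]:
                walk.append(rng.choice(neighbors[walk[-1]]))
            walks.append([str(v) for v in walk])
    return walks


# Toy undirected graph with 4 nodes
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
walks = sample_uniform_walks(adj, walk_length=5, walks_per_node=2)
print(len(walks), walks[0])
```

Each walk plays the role of a sentence and each node ID the role of a word, which is why the tokens are converted to strings before training.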

In [15]:
# Now, pass the embeddings to the node classification evaluation function
R, S = evaluate_embedding_node_classification(embedding, data.y.numpy())

Classification Report for seed 1:
              precision    recall  f1-score   support

           0       0.45      0.02      0.04       238
           1       0.39      0.41      0.40       531
           2       0.63      0.66      0.65       601
           3       0.45      0.57      0.51       631
           4       0.62      0.72      0.66       537
           5       0.60      0.47      0.53       457

    accuracy                           0.53      2995
   macro avg       0.52      0.48      0.46      2995
weighted avg       0.53      0.53      0.51      2995

Classification Report for seed 2:
              precision    recall  f1-score   support

           0       0.12      0.01      0.02       238
           1       0.41      0.51      0.46       531
           2       0.66      0.67      0.67       601
           3       0.47      0.57      0.51       631
           4       0.65      0.70      0.68       537
           5       0.59      0.49      0.54       457

    accur



In [None]:
# Now, pass the embeddings to the node classification evaluation function (SVM)
R, S = evaluate_embedding_node_classification_svm(embedding, data.y.numpy())

Classification Report for SVM (seed 1):
              precision    recall  f1-score   support

           0       0.41      0.05      0.08       238
           1       0.36      0.47      0.41       531
           2       0.65      0.64      0.65       601
           3       0.44      0.61      0.51       631
           4       0.70      0.66      0.68       537
           5       0.70      0.46      0.56       457

    accuracy                           0.53      2995
   macro avg       0.54      0.48      0.48      2995
weighted avg       0.55      0.53      0.52      2995

Classification Report for SVM (seed 2):
              precision    recall  f1-score   support

           0       0.15      0.02      0.03       238
           1       0.37      0.58      0.45       531
           2       0.68      0.65      0.67       601
           3       0.46      0.55      0.50       631
           4       0.71      0.67      0.69       537
           5       0.65      0.43      0.52       45

In [None]:
# Now, pass the embeddings to the node classification evaluation function (RandomForest)
R, S = evaluate_embedding_node_classification_rf(embedding, data.y.numpy())

Classification Report for Random Forest (seed 1):
              precision    recall  f1-score   support

           0       0.43      0.14      0.21       238
           1       0.46      0.52      0.49       531
           2       0.63      0.66      0.64       601
           3       0.48      0.61      0.54       631
           4       0.68      0.73      0.70       537
           5       0.75      0.51      0.61       457

    accuracy                           0.57      2995
   macro avg       0.57      0.53      0.53      2995
weighted avg       0.58      0.57      0.56      2995

Classification Report for Random Forest (seed 2):
              precision    recall  f1-score   support

           0       0.30      0.09      0.14       238
           1       0.41      0.51      0.46       531
           2       0.70      0.65      0.68       601
           3       0.50      0.62      0.55       631
           4       0.72      0.73      0.72       537
           5       0.64      0.5

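**Node2Vec** and **DeepWalk** both generate **node embeddings** by running random walks over the graph and training a skip-gram model on them; they differ in how the walks are sampled.

**DeepWalk** uses uniform random walks, which makes it simpler and faster. **Node2Vec** biases each step with a return parameter p and an in-out parameter q, so walks can be steered toward staying near the start node (emphasizing local neighborhood structure) or toward moving away from it (reaching farther into the graph). The Node2Vec call above leaves p and q at their defaults of 1, so its walks are close to uniform, which may explain why the two methods score similarly in the reports above.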
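To make the difference in walk sampling concrete, here is a minimal sketch of the unnormalized step weights used by Node2Vec's second-order walk, assuming an unweighted graph. The function name node2vec_weights and the toy neighbors dictionary are illustrative and not part of the node2vec package; p is the return parameter and q is the in-out parameter.

```python
def node2vec_weights(neighbors, prev, cur, p=1.0, q=1.0):
    """Unnormalized weights for the next step of a walk that moved prev -> cur.

    Illustrative sketch for an unweighted graph; not the node2vec package code.
    """
    weights = {}
    for nxt in neighbors[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # step back to the previous node
        elif nxt in neighbors[prev]:
            weights[nxt] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q  # move farther away from the previous node
    return weights


# Toy undirected graph stored as neighbor sets
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

# The walk has just moved from node 0 to node 2
uniform = node2vec_weights(neighbors, prev=0, cur=2)               # p = q = 1
biased = node2vec_weights(neighbors, prev=0, cur=2, p=4.0, q=0.5)  # favors exploring
print(uniform)
print(biased)
```

With p = q = 1 every neighbor gets the same weight, which matches the uniform walk that DeepWalk uses. A large p and a small q make returning to node 0 unlikely and moving outward to node 3 more likely.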