## Link prediction with node2vec

The goal is to predict whether a woman attended an event in the Southern Women bipartite network using node2vec embeddings. To do so, I load and verify the graph, build positive and negative pairs, and do one full train/test run. I then repeat the entire procedure ten times and average the results.


Load libraries and set seeds



In [10]:
import numpy as np
import networkx as nx
from node2vec import Node2Vec
import random

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)



Read the network and confirm it loaded correctly. The file describes a bipartite network in which nodes of type 1 and 2 are the 18 women and nodes of type 3 are the 14 events. Each edge represents a woman's attendance at an event.


In [12]:
G = nx.Graph()

with open("data/southern_women.net", 'r') as file:
    # Read the header line; its second token is the node count
    first_line = file.readline()
    n_nodes = int(first_line.split()[1])
    
    # Read all nodes
    for line in file:
        if line.startswith("*"):
            break
        else:
            # Parsing node_id "name" type
            parts = line.split("\"")
            node_id = parts[0].strip()
            name = parts[1].strip()
            node_type = int(parts[2].strip())
            
            G.add_node(node_id, name=name, type=node_type)
    
    # Read all edges
    for line in file:
        if line.startswith("*"):
            continue
        edge = line.split()[:2]
        if len(edge) == 2:
            G.add_edge(edge[0], edge[1])

print(f"Loaded graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

Loaded graph with 32 nodes and 89 edges


Example nodes

In [14]:
for node, data in list(G.nodes(data=True))[:10]:
    print(f"  Node {node}: {data['name']} (type {data['type']})")



  Node 1: Evelyn (type 1)
  Node 2: Laura (type 1)
  Node 3: Theresa (type 1)
  Node 4: Brenda (type 1)
  Node 5: Charlotte (type 1)
  Node 6: Frances (type 1)
  Node 7: Eleanor (type 1)
  Node 8: Pearl (type 1)
  Node 9: Ruth (type 1)
  Node 10: Verne (type 2)


Example edges

In [15]:
for u, v in list(G.edges())[:5]:
    print(f"  {G.nodes[u]['name']} → {G.nodes[v]['name']}")

  Evelyn → Event 1
  Evelyn → Event 2
  Evelyn → Event 3
  Evelyn → Event 4
  Evelyn → Event 5


Let's separate the women nodes from the event nodes so that negative examples only pair a woman with an event, never woman-woman or event-event.


In [16]:
women = [n for n in G.nodes() if G.nodes[n]['type'] in [1, 2]]
events = [n for n in G.nodes() if G.nodes[n]['type'] == 3]

print(f"{len(women)} women")
print(f"{len(events)} events")

18 women
14 events


Verified: there are 18 women and 14 events, as expected. Now let's create positive examples (the edge exists) and negative examples (the edge does not exist).

In [18]:
positive_edges = list(G.edges())

print(f"{len(positive_edges)} positive examples (existing edges)")


89 positive examples (existing edges)
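Before sampling negatives, it can help to confirm that every positive edge really joins one woman and one event. The following is a minimal sketch of such a check, not part of the original run. The helper name `split_bipartite_edges` and the toy node IDs are hypothetical stand-ins for `positive_edges`, `women`, and `events`.

```python
def split_bipartite_edges(edges, women, events):
    """Orient each edge as (woman, event); raise if an edge is not woman-event."""
    women, events = set(women), set(events)
    oriented = []
    for u, v in edges:
        if u in women and v in events:
            oriented.append((u, v))
        elif v in women and u in events:
            oriented.append((v, u))
        else:
            raise ValueError(f"edge ({u}, {v}) does not join a woman and an event")
    return oriented

# Toy stand-in data (hypothetical IDs, not the real network)
toy_women = ["1", "2"]
toy_events = ["19", "20"]
toy_edges = [("1", "19"), ("20", "2")]

oriented_edges = split_bipartite_edges(toy_edges, toy_women, toy_events)
print(oriented_edges)  # [('1', '19'), ('2', '20')]
```

On the real data the call would presumably be `split_bipartite_edges(positive_edges, women, events)`, which either returns all 89 edges in (woman, event) order or raises on the first offending edge.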


We need 89 negative examples too. 

In [19]:
negative_edges = []

# Keep track of existing edges for faster lookup
existing_edges = set(G.edges()) | set((v, u) for u, v in G.edges())

# Sample 89 negative pairs (drawn with replacement, so the same pair can appear more than once)
attempts = 0
max_attempts = len(positive_edges) * 100  # Safety limit

while len(negative_edges) < len(positive_edges) and attempts < max_attempts:
    # Randomly pick a woman and an event
    woman = random.choice(women)
    event = random.choice(events)
    
    # Check if this pair is NOT connected
    if (woman, event) not in existing_edges and (event, woman) not in existing_edges:
        negative_edges.append((woman, event))
    
    attempts += 1

print(f"{len(negative_edges)} negative examples")

89 negative examples
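With 18 women and 14 events there are 18 × 14 = 252 possible woman-event pairs, of which 89 are observed, leaving 163 non-edges. The loop above draws from these with replacement, so duplicates can occur. A minimal sketch of an alternative that enumerates the non-edges and samples without replacement is shown below; the function name `sample_negative_pairs` and the toy data are assumptions for illustration, and this is not the procedure used for the results in this notebook.

```python
import random
from itertools import product

def sample_negative_pairs(women, events, positive_edges, k, seed=42):
    """Draw k distinct woman-event pairs that are not observed edges."""
    # frozenset makes the lookup independent of edge orientation
    observed = {frozenset(e) for e in positive_edges}
    candidates = [(w, e) for w, e in product(women, events)
                  if frozenset((w, e)) not in observed]
    if k > len(candidates):
        raise ValueError("not enough non-edges to sample from")
    # random.Random(seed).sample draws without replacement
    return random.Random(seed).sample(candidates, k)

# Toy stand-in data (hypothetical names)
toy_women = ["w1", "w2", "w3"]
toy_events = ["e1", "e2"]
toy_positive = [("w1", "e1"), ("e2", "w2")]

toy_negative = sample_negative_pairs(toy_women, toy_events, toy_positive, k=3)
print(len(toy_negative), len(set(toy_negative)))  # 3 3
```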


Let's now make the train/test split.

In [27]:
all_edges = positive_edges + negative_edges
all_labels = [1] * len(positive_edges) + [0] * len(negative_edges)


# Shuffle all examples randomly
indices = list(range(len(all_edges)))
random.shuffle(indices)

# Split 80/20
split_point = int(0.8 * len(indices))
train_idx = indices[:split_point]
test_idx = indices[split_point:]

# Create train and test sets
train_edges = [all_edges[i] for i in train_idx]
test_edges = [all_edges[i] for i in test_idx]
train_labels = [all_labels[i] for i in train_idx]
test_labels = [all_labels[i] for i in test_idx]

print(f"Training set: {len(train_edges)} edges")
print(f"Positive: {sum(train_labels)} | Negative: {len(train_labels) - sum(train_labels)}")
print(f"Test set: {len(test_edges)} edges")
print(f"Positive: {sum(test_labels)} | Negative: {len(test_labels) - sum(test_labels)}")

# Remove positive test edges from graph to prevent data leakage
test_positive_edges = [e for i, e in enumerate(test_edges) if test_labels[i] == 1]

print(f"Original graph: {G.number_of_edges()} edges")
G_train = G.copy()
G_train.remove_edges_from(test_positive_edges)
print(f"Training graph: {G_train.number_of_edges()} edges")



Training set: 142 edges
Positive: 68 | Negative: 74
Test set: 36 edges
Positive: 21 | Negative: 15
Original graph: 89 edges
Training graph: 68 edges
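The split above shuffles all 178 pairs together, so the class balance of the test set is left to chance (here 21 positives against 15 negatives). If a fixed balance is preferred, a stratified split can be sketched as follows. This is an illustrative alternative under the assumption of an 80/20 split per class; `stratified_split` is a hypothetical helper, not what produced the numbers in this notebook.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so that each class contributes the same test fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    # Shuffle again so the classes are not grouped together
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Same label layout as all_labels: 89 positives followed by 89 negatives
labels = [1] * 89 + [0] * 89
train_idx, test_idx = stratified_split(labels)
n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_pos)  # 142 36 18
```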


Let's now generate node2vec embeddings.

In [28]:
# Generate embeddings on the TRAINING graph 
node2vec = Node2Vec(
    G_train,
    dimensions=64,
    walk_length=30,
    num_walks=200,
    p=1,
    q=1,
    workers=4,
    seed=42,
    quiet=False  # Show progress
)

# Fit the Word2Vec model
model = node2vec.fit(window=10, min_count=1, batch_words=4, seed=42)
wv = model.wv


Computing transition probabilities:   0%|          | 0/32 [00:00<?, ?it/s]
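With `p=1` and `q=1` the node2vec return and in-out biases are both neutral, so on an unweighted graph each step of a walk picks uniformly among the current node's neighbours. The sketch below illustrates that special case on a toy adjacency list. It is a simplified illustration rather than the library's implementation, and `uniform_random_walk` and the toy graph are hypothetical.

```python
import random

def uniform_random_walk(adjacency, start, walk_length, rng):
    """Walk of up to walk_length nodes, choosing each next node uniformly."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # isolated node: the walk cannot continue
            break
        walk.append(rng.choice(neighbours))
    return walk

# Toy bipartite graph (hypothetical): women w1, w2 and events e1, e2
toy_adjacency = {
    "w1": ["e1", "e2"],
    "w2": ["e2"],
    "e1": ["w1"],
    "e2": ["w1", "w2"],
}

rng = random.Random(42)
walk = uniform_random_walk(toy_adjacency, "w1", walk_length=6, rng=rng)
print(walk)
```

Because the graph is bipartite, every walk alternates between women and events, so a woman and an event end up close in a walk when they share attendees or co-attended events.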

Let's now create edge features from the node embeddings. Each woman-event pair is represented by the Hadamard product, that is, the element-wise multiplication, of its two node vectors.

In [30]:

# Function to create edge features (Hadamard product)
def create_edge_features(edges, wv):
    features = []
    for u, v in edges:
        u_emb = wv[u]  # Get embedding for node u
        v_emb = wv[v]  # Get embedding for node v
        edge_emb = u_emb * v_emb  # Hadamard product (element-wise multiply)
        features.append(edge_emb)
    return np.array(features)

# Create features for training and test sets

X_train = create_edge_features(train_edges, wv)
X_test = create_edge_features(test_edges, wv)

y_train = np.array(train_labels)
y_test = np.array(test_labels)

print(f"Training features: {X_train.shape}")
print(f"{X_train.shape[0]} examples, each with {X_train.shape[1]} features")
print(f"Test features: {X_test.shape}")
print(f"{X_test.shape[0]} examples, each with {X_test.shape[1]} features")



Training features: (142, 64)
142 examples, each with 64 features
Test features: (36, 64)
36 examples, each with 64 features
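The same features can be built in one vectorised step by stacking the left and right endpoint vectors and multiplying the two matrices. The sketch below uses a plain dictionary as a stand-in for `wv`; the name `hadamard_features` and the 3-dimensional toy vectors are assumptions for illustration.

```python
import numpy as np

def hadamard_features(edges, embeddings):
    """Return an (n_edges, dim) matrix of element-wise products."""
    left = np.stack([embeddings[u] for u, _ in edges])
    right = np.stack([embeddings[v] for _, v in edges])
    return left * right

# Toy stand-in for the embedding lookup (hypothetical values)
toy_embeddings = {
    "w1": np.array([1.0, 2.0, -1.0]),
    "e1": np.array([0.5, 1.0, 2.0]),
    "e2": np.array([-1.0, 0.0, 3.0]),
}

toy_features = hadamard_features([("w1", "e1"), ("w1", "e2")], toy_embeddings)
print(toy_features.shape)  # (2, 3)
```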


Baseline

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score


clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

Make predictions on test set

In [32]:
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # Probability of class 1

Evaluate performance

In [35]:

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
f1 = f1_score(y_test, y_pred)

print(f"  Accuracy:  {acc:.4f}  ({acc*100:.1f}%)")
print(f"  ROC AUC:   {auc:.4f}")
print(f"  F1 Score:  {f1:.4f}")




  Accuracy:  0.5833  (58.3%)
  ROC AUC:   0.7746
  F1 Score:  0.4444


The accuracy looks low, but a ROC AUC of 0.775 is decent: it is well above the 0.5 expected from random scores, so the classifier is not labelling links at random.
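Accuracy and AUC can disagree because accuracy depends on the default 0.5 probability cut-off, while ROC AUC only depends on how the scores are ordered: it equals the fraction of positive-negative pairs in which the positive gets the higher score. The toy example below, with made-up scores, shows a ranking that is mostly correct even though every score falls below 0.5. The helper names are hypothetical.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, threshold=0.5):
    """Accuracy after turning scores into labels at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up scores: positives mostly outrank negatives, but all are below 0.5
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.45, 0.40, 0.30, 0.35, 0.20, 0.10]

toy_auc = pairwise_auc(toy_labels, toy_scores)
toy_acc = accuracy_at(toy_labels, toy_scores)
print(round(toy_auc, 3), toy_acc)  # 0.889 0.5
```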

Note: I have not done any hyperparameter tuning here because there is not enough data to hold out a validation set. It would also be impractical because 10 random runs follow. Let's now run logistic regression 10 times. The rationale is that the dataset is small, so the results have high variance: a single run might get lucky (or unlucky) with the split. Running the experiment 10 times gives more confidence in the results.

In [37]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from tqdm import tqdm

# Store results from all runs
results_lr = []


for run in tqdm(range(10), desc="LR Runs"):
    
    # split data
    indices = list(range(len(all_edges)))
    random.shuffle(indices)
    
    split_point = int(0.8 * len(indices))
    train_idx = indices[:split_point]
    test_idx = indices[split_point:]
    
    train_edges_run = [all_edges[i] for i in train_idx]
    test_edges_run = [all_edges[i] for i in test_idx]
    train_labels_run = [all_labels[i] for i in train_idx]
    test_labels_run = [all_labels[i] for i in test_idx]
    
    # create training graph
    test_positive = [e for i, e in enumerate(test_edges_run) if test_labels_run[i] == 1]
    G_train_run = G.copy()
    G_train_run.remove_edges_from(test_positive)
    
    # generate node2vec embeddings
    node2vec = Node2Vec(
        G_train_run,
        dimensions=64,
        walk_length=30,
        num_walks=200,
        p=1, q=1,
        workers=4,
        seed=42 + run,  # Different seed for each run
        quiet=True
    )
    model = node2vec.fit(window=10, min_count=1, batch_words=4, seed=42 + run)
    wv = model.wv
    
    # create edge features
    X_train_run = create_edge_features(train_edges_run, wv)
    X_test_run = create_edge_features(test_edges_run, wv)
    y_train_run = np.array(train_labels_run)
    y_test_run = np.array(test_labels_run)
    
    # train classifier
    clf_lr = LogisticRegression(max_iter=1000, random_state=42)
    clf_lr.fit(X_train_run, y_train_run)
    
    # evaluate
    y_pred = clf_lr.predict(X_test_run)
    y_proba = clf_lr.predict_proba(X_test_run)[:, 1]
    
    acc = accuracy_score(y_test_run, y_pred)
    auc = roc_auc_score(y_test_run, y_proba)
    f1 = f1_score(y_test_run, y_pred)
    
    results_lr.append({
        'accuracy': acc,
        'auc': auc,
        'f1': f1
    })

LR Runs: 100%|██████████| 10/10 [01:05<00:00,  6.59s/it]


In [None]:


# average and std dev of acc, auc, f1

acc_mean = np.mean([r['accuracy'] for r in results_lr])
acc_std = np.std([r['accuracy'] for r in results_lr])
auc_mean = np.mean([r['auc'] for r in results_lr])
auc_std = np.std([r['auc'] for r in results_lr])
f1_mean = np.mean([r['f1'] for r in results_lr])
f1_std = np.std([r['f1'] for r in results_lr])

print(f"\n{'Metric':<12} {'Mean':<12} {'Std Dev':<12} {'Range':<20}")
print("-" * 50)
print(f"{'Accuracy':<12} {acc_mean:.4f}      {acc_std:.4f}      [{min(r['accuracy'] for r in results_lr):.4f}, {max(r['accuracy'] for r in results_lr):.4f}]")
print(f"{'ROC AUC':<12} {auc_mean:.4f}      {auc_std:.4f}      [{min(r['auc'] for r in results_lr):.4f}, {max(r['auc'] for r in results_lr):.4f}]")
print(f"{'F1 Score':<12} {f1_mean:.4f}      {f1_std:.4f}      [{min(r['f1'] for r in results_lr):.4f}, {max(r['f1'] for r in results_lr):.4f}]")



Metric       Mean         Std Dev      Range               
--------------------------------------------------
Accuracy     0.5917      0.1224      [0.2778, 0.7500]
ROC AUC      0.7513      0.0586      [0.6571, 0.8297]
F1 Score     0.5398      0.1840      [0.0714, 0.7273]
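The mean, standard deviation, and range are computed the same way for every metric, and again for the random forest runs later on. A small helper could do this in one place. The sketch below is one possible shape for it; `summarize_runs` is a hypothetical name, the toy numbers are made up, and `pstdev` is used because it matches the population standard deviation that `np.std` returns by default.

```python
from statistics import mean, pstdev

def summarize_runs(results):
    """Mean, population std, min and max for each metric in a list of dicts."""
    summary = {}
    for metric in results[0]:
        values = [r[metric] for r in results]
        summary[metric] = {
            "mean": mean(values),
            "std": pstdev(values),  # population std, like np.std's default
            "min": min(values),
            "max": max(values),
        }
    return summary

# Made-up results from two runs, in the same format as results_lr
toy_results = [
    {"accuracy": 0.50, "auc": 0.70},
    {"accuracy": 0.70, "auc": 0.80},
]

toy_summary = summarize_runs(toy_results)
for metric, s in toy_summary.items():
    print(f"{metric:<12} {s['mean']:.4f}  {s['std']:.4f}  "
          f"[{s['min']:.4f}, {s['max']:.4f}]")
```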


A mean accuracy of 0.59 is moderate, and its variance is high (std 0.12). The AUC of 0.75, on the other hand, is fairly good, which means the model ranks the links reasonably well. The mean F1 of 0.54 is reasonable for such a small dataset. The high standard deviations of accuracy and F1 show how sensitive they are to the small amount of data, so AUC is the more stable and informative metric here.

Let's try to improve on this by building a random forest model, running the same 10 experiments, and averaging the metrics.

In [41]:

from sklearn.ensemble import RandomForestClassifier

results_rf = []


for run in tqdm(range(10), desc="RF Runs"):
    
    # split data
    indices = list(range(len(all_edges)))
    random.shuffle(indices)
    
    split_point = int(0.8 * len(indices))
    train_idx = indices[:split_point]
    test_idx = indices[split_point:]
    
    train_edges_run = [all_edges[i] for i in train_idx]
    test_edges_run = [all_edges[i] for i in test_idx]
    train_labels_run = [all_labels[i] for i in train_idx]
    test_labels_run = [all_labels[i] for i in test_idx]
    
    # create training graph
    test_positive = [e for i, e in enumerate(test_edges_run) if test_labels_run[i] == 1]
    G_train_run = G.copy()
    G_train_run.remove_edges_from(test_positive)
    
    # generate node2vec embeddings
    node2vec = Node2Vec(
        G_train_run,
        dimensions=64,
        walk_length=30,
        num_walks=200,
        p=1, q=1,
        workers=4,
        seed=42 + run,
        quiet=True
    )
    model = node2vec.fit(window=10, min_count=1, batch_words=4, seed=42 + run)
    wv = model.wv
    
    # create edge features
    X_train_run = create_edge_features(train_edges_run, wv)
    X_test_run = create_edge_features(test_edges_run, wv)
    y_train_run = np.array(train_labels_run)
    y_test_run = np.array(test_labels_run)
    
    # train RF
    clf_rf = RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1  # Use all CPU cores
    )
    clf_rf.fit(X_train_run, y_train_run)
    
    # evaluate
    y_pred = clf_rf.predict(X_test_run)
    y_proba = clf_rf.predict_proba(X_test_run)[:, 1]
    
    acc = accuracy_score(y_test_run, y_pred)
    auc = roc_auc_score(y_test_run, y_proba)
    f1 = f1_score(y_test_run, y_pred)
    
    results_rf.append({
        'accuracy': acc,
        'auc': auc,
        'f1': f1
    })



RF Runs: 100%|██████████| 10/10 [01:43<00:00, 10.35s/it]


In [43]:


acc_mean_rf = np.mean([r['accuracy'] for r in results_rf])
acc_std_rf = np.std([r['accuracy'] for r in results_rf])
auc_mean_rf = np.mean([r['auc'] for r in results_rf])
auc_std_rf = np.std([r['auc'] for r in results_rf])
f1_mean_rf = np.mean([r['f1'] for r in results_rf])
f1_std_rf = np.std([r['f1'] for r in results_rf])

print(f"\n{'Metric':<12} {'Mean':<12} {'Std Dev':<12} {'Range':<20}")
print("-" * 50)
print(f"{'Accuracy':<12} {acc_mean_rf:.4f}      {acc_std_rf:.4f}      [{min(r['accuracy'] for r in results_rf):.4f}, {max(r['accuracy'] for r in results_rf):.4f}]")
print(f"{'ROC AUC':<12} {auc_mean_rf:.4f}      {auc_std_rf:.4f}      [{min(r['auc'] for r in results_rf):.4f}, {max(r['auc'] for r in results_rf):.4f}]")
print(f"{'F1 Score':<12} {f1_mean_rf:.4f}      {f1_std_rf:.4f}      [{min(r['f1'] for r in results_rf):.4f}, {max(r['f1'] for r in results_rf):.4f}]")




Metric       Mean         Std Dev      Range               
--------------------------------------------------
Accuracy     0.7556      0.0984      [0.6111, 0.9444]
ROC AUC      0.9155      0.0577      [0.8359, 0.9938]
F1 Score     0.7159      0.1096      [0.5333, 0.9286]


Accuracy clearly improved, from 0.59 to 0.76. AUC and F1 also improved by a large margin (0.75 to 0.92 and 0.54 to 0.72).
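One caveat on this comparison: both loops call `random.shuffle` on the global generator, so run *i* of logistic regression and run *i* of random forest are evaluated on different splits. A paired comparison would generate the splits once and reuse them for both classifiers. The sketch below shows one way to do that; `make_splits` is a hypothetical helper and was not used for the results above.

```python
import random

def make_splits(n_items, n_runs=10, train_fraction=0.8, seed=42):
    """Pre-compute (train_idx, test_idx) pairs that several models can share."""
    rng = random.Random(seed)  # private generator, independent of global state
    split_point = int(train_fraction * n_items)
    splits = []
    for _ in range(n_runs):
        indices = list(range(n_items))
        rng.shuffle(indices)
        splits.append((indices[:split_point], indices[split_point:]))
    return splits

# 178 = 89 positive + 89 negative pairs, as in all_edges
splits = make_splits(178)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 142 36
```

Each model's loop would then iterate over `splits` instead of reshuffling, so per-run differences between the two classifiers reflect the models rather than the luck of the split.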

**Final commentary:** I framed the problem as a binary classification task on woman–event pairs. Positives were the 89 observed attendance links; negatives were 89 uniformly sampled missing woman–event pairs. For each run I did a random 80–20 split (shuffled, not stratified), removed the positive test edges from the graph to avoid leakage, learned node2vec embeddings on the pruned graph (dim=64, walk length=30, 200 walks per node, p=q=1), and represented each pair by the Hadamard product of its two node vectors. I repeated the split, embedding, and classification steps 10 times; the negative sample was drawn once and reused across runs.

Random Forest clearly outperformed Logistic Regression on these embeddings. Averaged over 10 runs, RF reached about 0.76 accuracy, 0.92 ROC-AUC, and 0.72 F1, compared with LR at about 0.59 accuracy, 0.75 AUC, and 0.54 F1. The gap suggests that non-linear interactions between the two node embeddings matter: an ensemble of trees captures these patterns better than a linear boundary. The high AUC indicates that the model ranks true links above non-links reliably, while F1 summarises precision and recall on the roughly balanced test sets. Some variance remains due to the small graph and the stochastic nature of the walks and sampling, which is why averaging across 10 runs is important.

Lastly, the rationale for using the Hadamard product is that it is a simple, widely used operator for link prediction with node2vec and directly models feature interactions between the two endpoints. For example, if both embeddings have a large value of the same sign in some dimension, the edge feature in that dimension is large and positive; if they disagree in sign, it is negative.
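As a small worked example with made-up 3-dimensional vectors (the real embeddings have 64 dimensions), write $u$ for the woman's embedding and $v$ for the event's. The Hadamard product multiplies them dimension by dimension:

$$
(u \odot v)_i = u_i \, v_i, \qquad
u = \begin{pmatrix} 0.9 \\ -0.8 \\ 0.1 \end{pmatrix},\quad
v = \begin{pmatrix} 0.7 \\ 0.6 \\ -0.2 \end{pmatrix}
\;\Longrightarrow\;
u \odot v = \begin{pmatrix} 0.63 \\ -0.48 \\ -0.02 \end{pmatrix}
$$

The first dimension, where both values are large and positive, gives a large positive feature. The second, where the signs disagree, gives a clearly negative one. The third, where both values are small, contributes almost nothing.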