# Notebook 03 – Generate Embeddings

**Author:** Demetrios Agourakis  
**ORCID:** [0000-0002-8596-5097](https://orcid.org/0000-0002-8596-5097)  
**License:** MIT License  
**Code DOI:** [10.5281/zenodo.16752238](https://doi.org/10.5281/zenodo.16752238)  
**Data DOI:** [10.17605/OSF.IO/2AQP7](https://doi.org/10.17605/OSF.IO/2AQP7)  
**Version:** 1.0 – Last updated: 2025-08-07

This notebook computes a vector embedding for each node in the symbolic graph.  
Primary method: **Truncated SVD** on the row-normalized sparse adjacency matrix (robust, few dependencies).  
Optional method: **node2vec** random-walk embeddings, used only when `METHOD = "node2vec"` and the `node2vec` package is installed.
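
As a rough illustration of the primary method, the sketch below applies the same three steps (row-normalize the adjacency matrix, take a rank-k SVD, L2-normalize the rows) to a toy graph. The six-node graph, `k = 2`, and the use of dense `np.linalg.svd` as a stand-in for scikit-learn's `TruncatedSVD` are assumptions chosen to keep the example small; they are not the notebook's data or solver.

```python
import numpy as np

# Toy undirected graph (hypothetical): two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n, k = 6, 2

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Row-normalize: each row of P sums to 1 (rows of isolated nodes stay all-zero).
row_sums = A.sum(axis=1)
row_sums[row_sums == 0.0] = 1.0
P = A / row_sums[:, None]

# Rank-k SVD. U[:, :k] * S[:k] plays the role of TruncatedSVD.fit_transform
# (up to column signs and the approximation error of an iterative solver).
U, S, Vt = np.linalg.svd(P, full_matrices=False)
X = U[:, :k] * S[:k]

# L2-normalize rows so that dot products become cosine similarities.
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0.0] = 1.0
X = X / norms

print(X.shape)
```

Because every row has unit length, `X @ X.T` gives the pairwise cosine similarities between nodes.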


In [1]:
import numpy as np
import pandas as pd
import networkx as nx
from pathlib import Path
import random

SEED = 42
random.seed(SEED)
np.random.seed(SEED)


def get_root_path():
    current = Path.cwd()
    while current != current.parent:
        if (current / "README.md").exists():
            return current
        current = current.parent
    return Path.cwd()


ROOT = get_root_path()
DATA = ROOT / "data"
RESULTS = ROOT / "results"
DATA.mkdir(exist_ok=True)
RESULTS.mkdir(exist_ok=True)

graph_path = RESULTS / "word_network.graphml"
if not graph_path.exists():
    raise FileNotFoundError(f"Graph not found at: {graph_path}")
G = nx.read_graphml(graph_path)
print(f"Graph loaded: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Parameters
EMBED_DIM = 128  # embedding dimensionality
METHOD = "svd"  # 'svd' (default) or 'node2vec'

Graph loaded: 77165 nodes, 542600 edges


In [2]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix

# Fix the node order so that row i of every matrix refers to nodes[i]
nodes = list(G.nodes())
node_index = {n: i for i, n in enumerate(nodes)}

# Weighted sparse adjacency matrix in CSR format
A = nx.to_scipy_sparse_array(
    G, nodelist=nodes, weight="weight", dtype=np.float64, format="csr"
)

# Row-normalize: A_norm = D^-1 A; all-zero rows (isolated nodes) are left as zeros
row_sums = np.array(A.sum(axis=1)).ravel()
row_sums[row_sums == 0.0] = 1.0
D_inv = csr_matrix(
    (1.0 / row_sums, (np.arange(A.shape[0]), np.arange(A.shape[0]))), shape=A.shape
)
A_norm = D_inv @ A

# Rank-EMBED_DIM truncated SVD, then scale each row to unit L2 norm
svd = TruncatedSVD(n_components=EMBED_DIM, random_state=SEED)
X = svd.fit_transform(A_norm)
X = normalize(X, norm="l2", axis=1)

emb_cols = [f"emb_{i}" for i in range(X.shape[1])]
emb_df = (
    pd.DataFrame(X, index=nodes, columns=emb_cols)
    .reset_index()
    .rename(columns={"index": "node"})
)
print(emb_df.head())

       node     emb_0     emb_1     emb_2     emb_3     emb_4     emb_5  \
0     there  0.028328  0.001385 -0.006058  0.003627  0.006712  0.006427   
1  position  0.169598  0.006578 -0.016645  0.002050  0.006176 -0.001688   
2      true  0.038515  0.007363  0.000378 -0.002922  0.007982  0.011319   
3    honest  0.291430 -0.007006 -0.015285 -0.018848  0.018152 -0.005375   
4      beat  0.044751  0.060966 -0.005857  0.011766 -0.003583 -0.004401   

      emb_6     emb_7     emb_8  ...   emb_118   emb_119   emb_120   emb_121  \
0  0.003370  0.005006  0.009515  ... -0.208297  0.191555  0.054995 -0.328719   
1 -0.000508  0.007833  0.032174  ... -0.160407 -0.073331  0.167797 -0.135133   
2 -0.002269  0.009037  0.008605  ...  0.002440 -0.002906 -0.113090  0.016463   
3 -0.003696  0.000438  0.003351  ... -0.039796  0.052120 -0.083340  0.070672   
4 -0.018799  0.203360  0.935140  ...  0.002254  0.008646 -0.004605  0.028234   

    emb_122   emb_123   emb_124   emb_125   emb_126   emb_127  
0 -0

In [3]:
# Optional node2vec method: runs only when METHOD == "node2vec".
# On any failure (e.g. package not installed) the SVD embeddings are kept.
try:
    if METHOD.lower() == "node2vec":
        from node2vec import Node2Vec

        node2vec = Node2Vec(
            G,
            dimensions=EMBED_DIM,
            walk_length=40,
            num_walks=10,
            p=1,
            q=1,
            workers=1,
            seed=SEED,
            weight_key="weight",
            quiet=True,
        )
        # Train the skip-gram model on the generated walks
        model = node2vec.fit(window=10, min_count=1, batch_words=64)

        # Look up each node's vector; nodes missing from the vocabulary get zeros
        X_nv = np.vstack(
            [
                model.wv[str(n)] if str(n) in model.wv else np.zeros(EMBED_DIM)
                for n in nodes
            ]
        )
        X_nv = normalize(X_nv, norm="l2", axis=1)

        # Replace the SVD table with the node2vec embeddings (same schema)
        emb_cols = [f"emb_{i}" for i in range(X_nv.shape[1])]
        emb_df = (
            pd.DataFrame(X_nv, index=nodes, columns=emb_cols)
            .reset_index()
            .rename(columns={"index": "node"})
        )
        print("node2vec embeddings computed.")
except Exception as e:
    print(f"node2vec not used: {e}")
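
To make the random-walk step more concrete: with `p=1` and `q=1`, as in the cell above, node2vec's biased walk reduces to a plain first-order random walk in which the next node is drawn in proportion to edge weight. The sketch below shows that special case using only the standard library. The toy adjacency dictionary and the helper names `random_walk` and `generate_walks` are hypothetical, and this is not the `node2vec` package's internal implementation.

```python
import random


def random_walk(adj, start, walk_length, rng):
    """Weighted first-order walk; stops early at a node with no neighbors."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj.get(walk[-1], {})
        if not neighbors:
            break
        candidates = list(neighbors)
        weights = [neighbors[c] for c in candidates]
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, num_walks, walk_length, seed):
    """Start num_walks walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(adj, node, walk_length, rng))
    return walks


# Toy weighted graph (hypothetical): node -> {neighbor: edge weight}
toy_adj = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 2.0, "b": 1.0},
}

walks = generate_walks(toy_adj, num_walks=3, walk_length=5, seed=42)
print(len(walks), "walks generated")
```

The resulting walks are treated as sentences and fed to a skip-gram model, which is the role `node2vec.fit(...)` plays in the cell above.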

In [4]:
emb_path = DATA / "symbolic_embeddings.csv"
emb_df.to_csv(emb_path, index=False)
print(f"Embeddings saved to: {emb_path}")

Embeddings saved to: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/data/symbolic_embeddings.csv


## ✅ Notebook Summary

In this notebook, we generated node embeddings for the symbolic network using Truncated SVD on the row-normalized adjacency matrix, with node2vec available as an optional alternative when installed.  
Results were saved to `data/symbolic_embeddings.csv` with columns `node, emb_0, ..., emb_{d-1}`, where `d = EMBED_DIM` (128 in this run).
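
Since the rows are L2-normalized, a dot product between two rows is their cosine similarity. The sketch below shows one way the saved table could be queried for a node's nearest neighbors. It uses a small in-memory table with the same `node, emb_0, ...` schema rather than the real file, and the helper `nearest_neighbors` and the toy vectors are hypothetical.

```python
import numpy as np
import pandas as pd


def nearest_neighbors(emb_df, query, k=5):
    """Return the k nodes most similar to `query` as (node, cosine) pairs.

    Assumes the emb_* rows are already L2-normalized.
    """
    emb_cols = [c for c in emb_df.columns if c.startswith("emb_")]
    nodes = emb_df["node"].astype(str).to_numpy()
    X = emb_df[emb_cols].to_numpy(dtype=float)

    matches = np.flatnonzero(nodes == query)
    if matches.size == 0:
        raise KeyError(f"node not found: {query}")
    idx = matches[0]

    sims = X @ X[idx]       # cosine similarity to every node
    sims[idx] = -np.inf     # exclude the query node itself
    order = np.argsort(-sims)[:k]
    return [(str(nodes[i]), float(sims[i])) for i in order]


# Toy table with unit-length rows (hypothetical values, same schema as the CSV)
toy_df = pd.DataFrame(
    {
        "node": ["alpha", "beta", "gamma"],
        "emb_0": [1.0, 0.8, 0.0],
        "emb_1": [0.0, 0.6, 1.0],
    }
)

result = nearest_neighbors(toy_df, "alpha", k=2)
print(result)
```

When loading the real file, it may be worth passing `keep_default_na=False` to `pd.read_csv`: by default pandas parses strings such as `null` or `nan` as missing values, which could blank out words with those spellings in the `node` column.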

---

## ▶️ Next Step

Proceed to **Notebook 04 – Merge Metrics and Embeddings** to join `symbolic_metrics.csv` and `symbolic_embeddings.csv` into a consolidated table for clustering and manifold visualization.
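
A minimal sketch of such a join, assuming both tables are keyed on a `node` column. The schema of `symbolic_metrics.csv` is not shown in this notebook, so the `degree` column and all values below are hypothetical placeholders, and Notebook 04 may do this differently.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
metrics_df = pd.DataFrame(
    {"node": ["alpha", "beta", "gamma"], "degree": [12, 7, 9]}
)
embeddings_df = pd.DataFrame(
    {
        "node": ["beta", "gamma", "delta"],
        "emb_0": [0.6, 0.0, 1.0],
        "emb_1": [0.8, 1.0, 0.0],
    }
)

# Inner join keeps only nodes present in both tables;
# validate raises if a node appears more than once on either side.
merged = metrics_df.merge(
    embeddings_df, on="node", how="inner", validate="one_to_one"
)
print(merged.shape)
```

An inner join silently drops nodes missing from either table, so comparing `len(merged)` with the input lengths is a cheap sanity check.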
