# Notebook 04 – Merge Metrics and Embeddings

**Author:** Demetrios Agourakis  
**ORCID:** [0000-0002-8596-5097](https://orcid.org/0000-0002-8596-5097)  
**License:** MIT License  
**Code DOI:** [10.5281/zenodo.16752238](https://doi.org/10.5281/zenodo.16752238)  
**Data DOI:** [10.17605/OSF.IO/2AQP7](https://doi.org/10.17605/OSF.IO/2AQP7)  
**Version:** 1.0 – Last updated: 2025-08-07

This notebook merges symbolic network metrics with node embeddings to build a consolidated dataset for clustering and manifold visualization.


In [1]:
import pandas as pd
from pathlib import Path


def get_root_path():
    current = Path.cwd()
    while current != current.parent:
        if (current / "README.md").exists():
            return current
        current = current.parent
    return Path.cwd()


ROOT = get_root_path()
DATA = ROOT / "data"
RESULTS = ROOT / "results"
DATA.mkdir(exist_ok=True)
RESULTS.mkdir(exist_ok=True)

metrics_path = DATA / "symbolic_metrics.csv"
emb_path = DATA / "symbolic_embeddings.csv"

if not metrics_path.exists():
    raise FileNotFoundError(f"Missing metrics file at: {metrics_path}")
if not emb_path.exists():
    raise FileNotFoundError(f"Missing embeddings file at: {emb_path}")

metrics = pd.read_csv(metrics_path)
emb = pd.read_csv(emb_path)

print(f"Loaded metrics: {metrics.shape}, embeddings: {emb.shape}")

Loaded metrics: (77165, 11), embeddings: (77165, 129)
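
Because the node keys here are words, it is worth knowing that `pd.read_csv` by default converts tokens such as `nan`, `null`, and `NA` into missing values. The sketch below uses a toy CSV (not the project data) to show how `keep_default_na=False` keeps such tokens as plain strings. Whether this affects `symbolic_metrics.csv` is an assumption that has not been checked here, and any downstream filter that treats the string `"nan"` as invalid would need to be adjusted as well.

```python
import io

import pandas as pd

# Toy CSV standing in for the metrics file (hypothetical content).
# "nan" and "null" are meant as ordinary words, not missing values.
csv_text = "node,in_degree\nthere,84\nnan,3\nnull,5\n"

# Default parsing: "nan" and "null" are silently converted to NaN.
default = pd.read_csv(io.StringIO(csv_text))

# Literal parsing: only a truly empty field is treated as missing.
literal = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # disable the built-in list of NA tokens
    na_values=[""],         # keep empty fields as missing
)

print("missing keys with default parsing:", default["node"].isna().sum())
print("keys with literal parsing:", literal["node"].tolist())
```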


In [4]:
# --- Schema checks and robust key normalization for 'node' ---

import numpy as np


def _ensure_node_column(df, name="metrics"):
    """Return a copy of df with a clean, unique, lower-cased 'node' key column."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    # The key may have been saved as an unnamed index column ('Unnamed: 0')
    if "node" not in df.columns:
        if "Unnamed: 0" in df.columns:
            df = df.rename(columns={"Unnamed: 0": "node"})
        else:
            # Fallback: promote the index to a column
            df = df.reset_index().rename(columns={"index": "node"})
    # Normalize: cast to string, strip whitespace, lower-case
    df["node"] = df["node"].astype(str).str.strip().str.lower()
    # Drop empty/placeholder keys (NaN becomes the string "nan" after astype(str))
    bad = df["node"].isin(["", "nan", "none"])
    if bad.any():
        print(f"[{name}] dropping {bad.sum()} rows with empty/invalid node")
    df = df[~bad]
    # Deduplicate, keeping the first occurrence of each node
    if df["node"].duplicated().any():
        dups = df["node"].duplicated().sum()
        print(f"[{name}] dropping {dups} duplicated node entries")
        df = df.drop_duplicates(subset=["node"], keep="first")
    return df


metrics = _ensure_node_column(metrics, name="metrics")
emb = _ensure_node_column(emb, name="emb")

# Quick diagnostic: which keys appear in only one of the two tables
left_only = set(metrics["node"]) - set(emb["node"])
right_only = set(emb["node"]) - set(metrics["node"])
print(f"nodes in metrics only: {len(left_only)} | in emb only: {len(right_only)}")

# Inner merge on node
merged = pd.merge(metrics, emb, on="node", how="inner")
print(f"Merged shape: {merged.shape}")

# Identify the embedding columns (NaNs are checked by the assertions below)
emb_cols = [c for c in merged.columns if c.startswith("emb_")]
if len(emb_cols) == 0:
    raise RuntimeError(
        "No embedding columns found. Expected columns like 'emb_0', 'emb_1', ..."
    )

# Post-merge checks (these should now pass)
assert merged["node"].isna().sum() == 0, "Merged dataset contains NaN in 'node'."
assert (
    merged[emb_cols].isna().sum().sum() == 0
), "Merged dataset contains NaN in embeddings."

merged.head()

[metrics] dropping 2 rows with empty/invalid node
[emb] dropping 2 rows with empty/invalid node
nodes in metrics only: 0 | in emb only: 0
Merged shape: (77163, 139)


,node,in_degree,out_degree,total_degree,in_strength,out_strength,total_strength,pagerank,closeness,betweenness,...,emb_118,emb_119,emb_120,emb_121,emb_122,emb_123,emb_124,emb_125,emb_126,emb_127
0,there,84,36,120,238.0,113.0,351.0,0.0001,0.056718,5.7e-05,...,-0.208297,0.191555,0.054995,-0.328719,-0.102927,-0.063031,-0.139056,-0.034042,-0.285167,0.024825
1,position,83,62,145,158.0,119.0,277.0,4.6e-05,0.057992,0.000128,...,-0.160407,-0.073331,0.167797,-0.135133,-0.140792,0.00581,-0.108114,0.119782,-0.058968,-0.023226
2,true,161,36,197,504.0,115.0,619.0,0.00015,0.059237,7.5e-05,...,0.00244,-0.002906,-0.11309,0.016463,-0.006559,-0.034161,-0.044763,0.099563,0.082961,0.089605
3,honest,108,51,159,453.0,112.0,565.0,8.4e-05,0.057231,9.9e-05,...,-0.039796,0.05212,-0.08334,0.070672,0.214605,-0.025857,-0.074921,0.076047,0.18043,0.143099
4,beat,100,52,152,276.0,113.0,389.0,7.9e-05,0.057367,0.000127,...,0.002254,0.008646,-0.004605,0.028234,-0.011257,-0.023648,-0.036888,-0.022101,-0.003045,-0.005218
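
As a complement to the set-difference diagnostic above, the following is a minimal sketch of how `pd.merge` itself can audit the join. It runs on toy frames (`metrics_toy` and `emb_toy` are hypothetical names, not the project data) and uses two standard options: `validate="one_to_one"`, which raises if a key is duplicated on either side, and `indicator=True`, which records where each row came from.

```python
import pandas as pd

# Toy frames standing in for the metrics and embedding tables (hypothetical).
metrics_toy = pd.DataFrame(
    {"node": ["there", "true", "beat"], "in_degree": [84, 161, 100]}
)
emb_toy = pd.DataFrame(
    {"node": ["there", "true", "honest"], "emb_0": [0.1, 0.2, 0.3]}
)

# An outer merge with an indicator column shows which side each key came from;
# validate="one_to_one" raises MergeError if 'node' is duplicated in either frame.
audit = pd.merge(
    metrics_toy,
    emb_toy,
    on="node",
    how="outer",
    validate="one_to_one",
    indicator=True,
)
print(audit["_merge"].value_counts())

# Keeping only rows present on both sides is equivalent to an inner join.
merged_toy = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(merged_toy)
```

The outer merge costs little extra on a table of this size and makes unmatched keys visible in the frame itself rather than only in a printed count.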


In [5]:
out_csv = DATA / "symbolic_metrics_embeddings.csv"
merged.to_csv(out_csv, index=False)
print(f"Consolidated dataset saved to: {out_csv}")

# Save embedding column names as a sidecar file (useful for downstream notebooks)
emb_cols_path = DATA / "embedding_columns.txt"
with open(emb_cols_path, "w", encoding="utf-8") as f:
    for c in [c for c in merged.columns if c.startswith("emb_")]:
        f.write(f"{c}\n")
print(f"Embedding column list saved to: {emb_cols_path}")

Consolidated dataset saved to: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/data/symbolic_metrics_embeddings.csv
Embedding column list saved to: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/data/embedding_columns.txt
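
A downstream notebook could recover the embedding matrix from these two files roughly as follows. This is a sketch under stated assumptions: it writes toy files with the same names into a temporary directory instead of reading the real `data/` folder, and the variable names (`emb_cols_loaded`, `missing`, `X`) are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # stands in for the project's data/ directory

    # Write toy versions of the two output files (hypothetical content).
    toy = pd.DataFrame(
        {
            "node": ["there", "true"],
            "pagerank": [1.0e-4, 1.5e-4],
            "emb_0": [0.1, 0.2],
            "emb_1": [-0.3, 0.4],
        }
    )
    toy.to_csv(data_dir / "symbolic_metrics_embeddings.csv", index=False)
    (data_dir / "embedding_columns.txt").write_text("emb_0\nemb_1\n", encoding="utf-8")

    # Read the sidecar file: one column name per line.
    emb_cols_loaded = (
        (data_dir / "embedding_columns.txt").read_text(encoding="utf-8").split()
    )

    # Load the consolidated table and confirm every listed column is present.
    df = pd.read_csv(data_dir / "symbolic_metrics_embeddings.csv")
    missing = [c for c in emb_cols_loaded if c not in df.columns]
    if missing:
        raise KeyError(f"Columns listed in sidecar but absent from CSV: {missing}")

    # Embedding matrix with shape (n_nodes, n_dimensions).
    X = df[emb_cols_loaded].to_numpy()

print(X.shape)
```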


## ✅ Notebook Summary

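We merged node-level symbolic network metrics with 128-dimensional node embeddings via an inner join on the normalized `node` key, producing a unified dataset of 77,163 rows and 139 columns:
- Input: `data/symbolic_metrics.csv` and `data/symbolic_embeddings.csv` (77,165 rows each; 2 rows with an empty or invalid `node` value were dropped from each file)
- Output: `data/symbolic_metrics_embeddings.csv`
- A helper file `data/embedding_columns.txt` lists the embedding column names (`emb_0` through `emb_127`), one per line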

---

## ▶️ Next Step

Proceed to **Notebook 05 – Clustering Analysis**, where we will select an optimal number of clusters and evaluate cluster validity (e.g., silhouette analysis) on the merged feature space.
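
For orientation, silhouette-based selection of the number of clusters can be sketched as below. This is not necessarily the procedure used in Notebook 05: the choice of k-means, the candidate range of `k`, and the synthetic two-dimensional data are all assumptions made for illustration, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (a stand-in for the embedding matrix).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is taken as the preferred choice.
best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

On the full merged table, computing the silhouette over all 77,163 rows can be slow because it relies on pairwise distances; `silhouette_score` accepts a `sample_size` argument to evaluate on a random subsample instead.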
