# Notebook 01 – Data Preprocessing and Network Construction

**Author:** Demetrios Agourakis  
**ORCID:** [0000-0002-8596-5097](https://orcid.org/0000-0002-8596-5097)  
**License:** MIT License  
**Code DOI:** [10.5281/zenodo.16752238](https://doi.org/10.5281/zenodo.16752238)  
**Data DOI:** [10.17605/OSF.IO/2AQP7](https://doi.org/10.17605/OSF.IO/2AQP7)  
**Version:** 1.0 – Last updated: 2025-08-07

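This notebook loads and cleans the SWOW-EN (Small World of Words, English) word-association dataset.  
It keeps each cue together with its first response (`R1`), builds a weighted directed graph from these pairs, and saves both the cleaned table and the graph for subsequent cognitive manifold modelling.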


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random
from pathlib import Path

SEED = 42
random.seed(SEED)
np.random.seed(SEED)


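def get_root_path():
    """Walk upward from the working directory until a folder containing
    README.md is found; fall back to the working directory otherwise."""
    current = Path.cwd()
    while current != current.parent:
        if (current / "README.md").exists():
            return current
        current = current.parent
    return Path.cwd()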


ROOT = get_root_path()
DATA = ROOT / "data"
RESULTS = ROOT / "results"
DATA.mkdir(exist_ok=True)
RESULTS.mkdir(exist_ok=True)

print(f"Project root: {ROOT}")
print(f"Data path: {DATA}")
print(f"Results path: {RESULTS}")

Project root: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold
Data path: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/data
Results path: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/results


In [14]:
# 🔍 Dataset filename and expected location
filename = "SWOW-EN.complete.20180827.csv"
file_path = DATA / filename

# ❗ Manual file placement required
if not file_path.exists():
    raise FileNotFoundError(
        f"Dataset not found at {file_path}. Please download it manually from https://osf.io/2AQP7 "
        f"and place it in the 'data/' directory as '{filename}'."
    )

# 📖 Load dataset
df = pd.read_csv(file_path)
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (1356362, 18)


Unnamed: 0.1,Unnamed: 0,id,participantID,created_at,age,nativeLanguage,gender,education,city,country,section,cue,R1Raw,R2Raw,R3Raw,R1,R2,R3
0,1,1500428,130332,2018-01-07 04:29:38,61,United States,Ma,5.0,Pepperell,United States,seed,there,position,place,point,position,place,point
1,2,1500426,130332,2018-01-07 04:29:38,61,United States,Ma,5.0,Pepperell,United States,seed,true,honest,fact,indisputable,honest,fact,indisputable
2,3,1500424,130332,2018-01-07 04:29:38,61,United States,Ma,5.0,Pepperell,United States,seed,beat,drum,policeman,beatnik,drum,policeman,beatnik
3,4,1500438,130332,2018-01-07 04:29:38,61,United States,Ma,5.0,Pepperell,United States,seed,like,affection,simile,compare,affection,simile,compare
4,5,1500430,130332,2018-01-07 04:29:38,61,United States,Ma,5.0,Pepperell,United States,seed,telephone,receiver,hamdset,wires,receiver,handset,wires


In [16]:
# ✅ Column mapping: cue -> stimulus, first response (R1) -> association
df = df.rename(columns={"cue": "stimulus", "R1": "association"})

initial_shape = df.shape
df = df.dropna(subset=["stimulus", "association"])
df = df[df["stimulus"].apply(lambda x: isinstance(x, str))]
df = df[df["association"].apply(lambda x: isinstance(x, str))]
print(f"Dropped {initial_shape[0] - df.shape[0]} rows with missing or invalid entries.")

df["stimulus"] = df["stimulus"].str.strip().str.lower()
df["association"] = df["association"].str.strip().str.lower()

cleaned_path = DATA / "symbolic_cleaned_data.csv"
df.to_csv(cleaned_path, index=False)
print(f"Cleaned data saved to: {cleaned_path}")

Dropped 163 rows with missing or invalid entries.
Cleaned data saved to: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/data/symbolic_cleaned_data.csv


In [17]:
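# Directed graph: stimulus -> association
G = nx.DiGraph()

for _, row in df.iterrows():
    stim = row["stimulus"]
    assoc = row["association"]
    # The loaded table has no "strength" column, so every row contributes 1.0
    # and the final edge weight equals the number of times the pair occurs.
    weight = row.get("strength", 1.0)
    if not G.has_edge(stim, assoc):
        G.add_edge(stim, assoc, weight=weight)
    else:
        G[stim][assoc]["weight"] += weight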

print(
    f"Graph constructed with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges."
)

# Sanity check: nodes are only created through edges, so this should be 0
isolated_nodes = list(nx.isolates(G))
print(f"Isolated nodes: {len(isolated_nodes)}")

graph_path = RESULTS / "word_network.graphml"
nx.write_graphml(G, graph_path)
print(f"Graph saved to: {graph_path}")

Graph constructed with 77165 nodes and 542600 edges.
Isolated nodes: 0
Graph saved to: /Users/demetriosagourakis/Library/Mobile Documents/com~apple~CloudDocs/Biologia Fractal/entropic-symbolic-society/NHB_Symbolic_Mainfold/results/word_network.graphml


## ✅ Notebook Summary

In this notebook, we:

- Checked that the manually downloaded SWOW-EN file from OSF is present and loaded it,
- Cleaned and standardised the cue and first-response (`R1`) columns into lowercase string pairs,
- Constructed a directed graph with NetworkX in which each edge weight counts how often a response was given to a cue,
- Saved the cleaned data (`symbolic_cleaned_data.csv`) and the resulting graph (`word_network.graphml`) for downstream analysis.
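
The edge weighting used in the graph-construction cell can be illustrated on a handful of rows. This is a minimal sketch rather than the notebook's own code: the `toy_pairs` list is hypothetical, and a `Counter` stands in for the graph so that only the accumulation logic is shown.

```python
from collections import Counter

# Hypothetical toy rows: (stimulus, association) pairs after cleaning.
toy_pairs = [
    ("beat", "drum"),
    ("beat", "drum"),
    ("beat", "rhythm"),
    ("true", "honest"),
]

# Each row adds 1.0 to its directed edge, so the final weight is a count
# of how many times that response was given to that cue.
edge_weights = Counter()
for stimulus, association in toy_pairs:
    edge_weights[(stimulus, association)] += 1.0

for (stimulus, association), weight in edge_weights.items():
    print(f"{stimulus} -> {association}: weight={weight}")
```

Triples of the form `(stimulus, association, weight)` built from such counts could also be passed to NetworkX's `DiGraph.add_weighted_edges_from` in one call, which avoids the per-row `has_edge` check.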

---

## ▶️ Next Step

Proceed to **Notebook 02 – Network Metrics**, where we will compute centrality and topological properties over the constructed symbolic graph, including:

- Degree and strength distributions,
- Clustering coefficients,
- PageRank and other symbolic influence measures.
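
As a preview of these measures, the following sketch computes them on a small hand-made graph. It is only an illustration under stated assumptions: the toy edges are hypothetical, the choice of weighted variants is ours rather than necessarily what Notebook 02 does, and recent NetworkX versions rely on SciPy for `pagerank`.

```python
import networkx as nx

# Hypothetical toy graph with count-based edge weights.
toy = nx.DiGraph()
toy.add_weighted_edges_from(
    [
        ("beat", "drum", 2.0),
        ("beat", "rhythm", 1.0),
        ("drum", "beat", 1.0),
        ("rhythm", "drum", 1.0),
    ]
)

# Degree counts outgoing edges; strength sums their weights.
out_degree = dict(toy.out_degree())
out_strength = dict(toy.out_degree(weight="weight"))

# Weighted clustering coefficient per node (directed variant).
clustering = nx.clustering(toy, weight="weight")

# PageRank using edge weights as transition preferences.
pagerank = nx.pagerank(toy, weight="weight")

print(out_degree["beat"], out_strength["beat"])
print(max(pagerank, key=pagerank.get))
```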

Ensure that the file `word_network.graphml` is available in the `results/` directory before running the next notebook.
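
One way to enforce this precondition at the top of the next notebook is sketched below. The helper name `require_graph_file` is hypothetical and not part of the project; the temporary directory only stands in for `results/` so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def require_graph_file(results_dir: Path, name: str = "word_network.graphml") -> Path:
    """Return the path to the saved graph, or raise if the file is missing."""
    graph_path = results_dir / name
    if not graph_path.is_file():
        raise FileNotFoundError(
            f"{graph_path} not found. Run Notebook 01 first to generate it."
        )
    return graph_path


# Demonstration in a throwaway directory standing in for results/.
with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp)
    try:
        require_graph_file(results_dir)
        found_before = True
    except FileNotFoundError:
        found_before = False
    (results_dir / "word_network.graphml").touch()
    found_after = require_graph_file(results_dir).name

print(found_before, found_after)
```

In the real notebook, the returned path could then be handed to `nx.read_graphml` to restore the graph.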
