# Traverse: Spotify End-to-End with Clustering

Load your Spotify listening history, enrich it with genre/style metadata from a records CSV,
build a co-occurrence graph with **genre vs. style category tracking**, and visualize it,
clustered by category, in the Cosmograph frontend.

**Prerequisites:**
```bash
pip install -e ".[dev]"
cd src/traverse/cosmograph/app && npm install && npm run build
```

## 1. Configuration

Update these paths to match your local setup. Set `FORCE_REBUILD = True` to ignore any
cached tables and rebuild them from the raw export.

In [None]:
from pathlib import Path

EXTENDED_DIR = Path(r"C:\Users\xtrem\Documents\Datasets\Spotify\anthony\ExtendedStreamingHistory")
RECORDS_CSV  = Path(r"C:\Users\xtrem\Documents\Datasets\records.csv")
CACHE_DIR    = Path("_out")
FORCE_REBUILD = False
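
Before running the heavier cells, it can help to confirm that the configured paths actually exist. The following is a minimal sketch; `check_inputs` is a hypothetical helper and not part of `traverse`, and the placeholder paths only illustrate the failure case.

```python
from pathlib import Path


def check_inputs(extended_dir: Path, records_csv: Path) -> list[str]:
    """Return human-readable problems; an empty list means both inputs look usable."""
    problems = []
    if not extended_dir.is_dir():
        problems.append(f"Export directory not found: {extended_dir}")
    if not records_csv.is_file():
        problems.append(f"Records CSV not found: {records_csv}")
    return problems


# Placeholder paths for illustration; in the notebook, pass EXTENDED_DIR and RECORDS_CSV.
for problem in check_inputs(Path("missing_export_dir"), Path("missing_records.csv")):
    print(problem)
```

An empty result means the later cells should at least find their inputs; it says nothing about whether the files have the expected contents.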

## 2. Load and Cache Canonical Tables

On the first run, this cell ingests the Spotify Extended Streaming History, enriches it with
genres/styles from records.csv via `FastGenreStyleEnricher`, and caches the result
as parquet in `_out/`. Subsequent runs load the cached tables and skip ingestion and
enrichment, unless `FORCE_REBUILD` is set.

In [None]:
from traverse.data.spotify_extended_minimal import load_spotify_extended_minimal
from traverse.processing.enrich_fast import FastGenreStyleEnricher
from traverse.processing.cache import CanonicalTableCache

cache = CanonicalTableCache(
    cache_dir=CACHE_DIR,
    build_fn=lambda: load_spotify_extended_minimal(EXTENDED_DIR),
    enrich_fn=lambda t: FastGenreStyleEnricher(records_csv=str(RECORDS_CSV)).run(t),
    force=FORCE_REBUILD,
)
plays_wide, tracks_wide = cache.load_or_build()

print(f"plays_wide:  {plays_wide.shape[0]:,} rows, {plays_wide.shape[1]} cols")
print(f"tracks_wide: {tracks_wide.shape[0]:,} rows, {tracks_wide.shape[1]} cols")
plays_wide.head(3)
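
The load-or-build pattern used by the cache can be illustrated in isolation. This is a sketch of the general idea only, not the implementation of `CanonicalTableCache`: the `load_or_build` function is hypothetical, and JSON stands in for parquet so that the example needs nothing beyond the standard library.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(cache_path: Path, build_fn: Callable[[], list], force: bool = False) -> list:
    """Return cached rows if present; otherwise build, write the cache, and return them."""
    if cache_path.exists() and not force:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    rows = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(rows), encoding="utf-8")
    return rows


calls = []


def expensive_build() -> list:
    calls.append(1)  # record each real build so cache hits are observable
    return [{"track": "demo", "genres": "Rock"}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "_out" / "plays.json"
    first = load_or_build(path, expensive_build)              # builds and writes the cache
    second = load_or_build(path, expensive_build)             # served from the cache file
    third = load_or_build(path, expensive_build, force=True)  # forced rebuild

print(len(calls))  # 2 real builds: the initial one and the forced one
```

The `force` flag plays the same role as `FORCE_REBUILD` above: it bypasses a cache that may be stale after the source data or enrichment logic has changed.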

## 3. Build the Co-occurrence Graph with Category Tracking

Walk every play, split its genre and style tags separately, and feed the combined tag
list to `CooccurrenceBuilder` along with a `tag_categories` mapping, so that each node
accumulates a majority-vote `category` field ("genre" or "style").

Each play's `played_at` timestamp is converted to epoch milliseconds and passed as
`timestamp_ms`, which feeds the timeline value (`first_seen_ts`). Plays without a
timestamp pass `None`.
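
To make the idea concrete, here is a small standalone sketch of co-occurrence counting with a majority-vote category and an earliest-seen timestamp. It is an illustration under assumptions, not the internals of `CooccurrenceBuilder`: the sample rows are invented, and the variable names (`pair_counts`, `category_votes`, `first_seen_ts`) are chosen for this example only.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical rows: (timestamp_ms, genre tags, style tags)
rows = [
    (1_000, ["rock"], ["indie rock", "shoegaze"]),
    (2_000, ["rock", "electronic"], ["shoegaze"]),
    (3_000, ["electronic"], ["rock"]),  # "rock" appears as a style once
]

pair_counts = Counter()               # (tag_a, tag_b) -> number of plays containing both
category_votes = defaultdict(Counter) # tag -> votes for "genre" / "style"
first_seen_ts = {}                    # tag -> earliest timestamp in ms

for ts_ms, genre_tags, style_tags in rows:
    for tag in genre_tags:
        category_votes[tag]["genre"] += 1
    for tag in style_tags:
        category_votes[tag]["style"] += 1

    # Deduplicate and sort so each unordered pair has one canonical key.
    tags = sorted(set(genre_tags + style_tags))
    for tag in tags:
        first_seen_ts[tag] = min(ts_ms, first_seen_ts.get(tag, ts_ms))
    pair_counts.update(combinations(tags, 2))

# Majority vote: the category with the most votes wins for each tag.
category = {tag: votes.most_common(1)[0][0] for tag, votes in category_votes.items()}

# Keep only pairs seen together at least `min_cooccurrence` times.
min_cooccurrence = 2
links = {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}

print(category["rock"])  # genre (2 genre votes vs. 1 style vote)
print(sorted(links))     # [('electronic', 'rock'), ('rock', 'shoegaze')]
```

Majority voting matters because the same tag can appear as a genre on one record and as a style on another; the vote settles each node on its most common role.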

In [None]:
import pandas as pd
from traverse.processing.normalize import split_tags, pretty_label
from traverse.graph.cooccurrence import CooccurrenceBuilder

builder = CooccurrenceBuilder(min_cooccurrence=2, max_nodes=500)

for played_at, genres, styles in plays_wide[
    ["played_at", "genres", "styles"]
].itertuples(index=False):
    genre_tags = split_tags(genres)
    style_tags = split_tags(styles)
    tags = genre_tags + style_tags

    # Build tag -> category mapping for this row
    tag_categories = {}
    for t in genre_tags:
        tag_categories[t] = "genre"
    for t in style_tags:
        tag_categories[t] = "style"

    ts_ms = (
        int(pd.Timestamp(played_at).value // 1_000_000)
        if pd.notna(played_at)
        else None
    )
    builder.add(tags, timestamp_ms=ts_ms, label_fn=pretty_label,
                tag_categories=tag_categories)

graph = builder.build()
print(f"Graph: {len(graph['points'])} nodes, {len(graph['links'])} edges")

# Category breakdown
from collections import Counter
cat_counts = Counter(p.get("category") for p in graph["points"])
for cat, n in cat_counts.most_common():
    print(f"  {cat}: {n} nodes")
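
The `ts_ms` expression in the cell above relies on `pd.Timestamp(...).value` being nanoseconds since the Unix epoch, so integer division by 1,000,000 yields milliseconds. The same conversion can be sketched with the standard library alone; `to_epoch_ms` is a hypothetical helper for illustration and assumes timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer epoch milliseconds; None passes through."""
    if dt is None:
        return None
    # timedelta // timedelta is exact integer division, so no float rounding is involved.
    return (dt - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # 1704067200000
print(to_epoch_ms(None))                                       # None
```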

## 4. Export JSON and Serve

Write the graph to the frontend's `dist/` directory, then start the built-in
static server. The exported JSON includes `meta.clusterField` so the frontend
automatically clusters and colors nodes by genre vs. style.

Use the **Cluster by category** checkbox in the header to toggle clustering on/off.

In [None]:
from traverse.graph.adapters_cosmograph import CosmographAdapter, detect_cluster_field
from traverse.cosmograph.server import serve, _default_dist_dir

# Detect if graph has category data and build meta
cluster_field = detect_cluster_field(graph)
meta = {"clusterField": cluster_field} if cluster_field else None
print(f"Cluster field: {cluster_field}")

# Write JSON into the frontend dist/
out_path = _default_dist_dir() / "cosmo_genres_spotify_cluster.json"
CosmographAdapter.write(graph, out_path, meta=meta)
print(f"Wrote {out_path}")
print()
print("Starting server. Open in browser:")
print("  http://127.0.0.1:8080/?data=/cosmo_genres_spotify_cluster.json")
print()
print("Press Ctrl+C (or interrupt the kernel) to stop.")

serve(port=8080)
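
For orientation, the sketch below shows one plausible way a cluster field could be detected and attached as `meta.clusterField`. It is an assumption-laden illustration: `guess_cluster_field` is a hypothetical stand-in rather than the real `detect_cluster_field`, and the exact JSON layout written by `CosmographAdapter.write` may differ from the `points` / `links` / `meta` shape assumed here.

```python
import json


def guess_cluster_field(graph: dict, field: str = "category"):
    """Return `field` if at least one point has a non-empty value for it, else None."""
    points = graph.get("points", [])
    return field if any(p.get(field) for p in points) else None


# Invented sample graph using the keys seen earlier in this notebook.
sample_graph = {
    "points": [
        {"id": "rock", "label": "Rock", "category": "genre"},
        {"id": "shoegaze", "label": "Shoegaze", "category": "style"},
    ],
    "links": [{"source": "rock", "target": "shoegaze", "weight": 2}],
}

field = guess_cluster_field(sample_graph)
payload = dict(sample_graph)
if field:
    payload["meta"] = {"clusterField": field}  # assumed location of the clustering hint

text = json.dumps(payload, indent=2)
print(json.loads(text)["meta"])  # {'clusterField': 'category'}
```

If no point carries a category, no `meta` is attached, which mirrors the `meta = ... if cluster_field else None` line in the cell above.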

---

## Appendix: PyCosmograph Inline Widget

Render the clustered graph directly in the notebook without starting a server.
Nodes are colored and clustered by their `category` (genre vs. style).

In [None]:
# pip install cosmograph  # uncomment to install
import pandas as pd
from cosmograph import cosmo

points_df = pd.DataFrame(graph["points"])
links_df = pd.DataFrame(graph["links"])

BRIGHT_PALETTE = [
    "#00e5ff",  # cyan
    "#ff4081",  # pink
    "#76ff03",  # lime
    "#ffea00",  # yellow
    "#e040fb",  # purple
    "#ff6e40",  # orange
]

w = cosmo(
    points=points_df,
    links=links_df,
    point_id_by="id",
    link_source_by="source",
    link_target_by="target",
    point_label_by="label",
    link_include_columns=["weight"],
    point_size=0.2,
    show_labels=True,
    # Clustering by genre/style category
    point_color_by="category",
    point_color_palette=BRIGHT_PALETTE,
    point_cluster_by="category",
    simulation_cluster=0.8,
    show_cluster_labels=True,
    scale_cluster_labels=True,
    use_point_color_strategy_for_cluster_labels=True,
    point_include_columns=["category"],
)
w  # renders inline
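
If the widget renders fewer edges than expected, one thing worth checking is whether every link endpoint refers to an existing point `id`. A minimal sketch, assuming the `id` / `source` / `target` keys used in the call above; `dangling_links` is a hypothetical helper and the sample data is invented.

```python
def dangling_links(points: list, links: list) -> list:
    """Return links whose source or target does not match any point id."""
    ids = {p["id"] for p in points}
    return [link for link in links if link["source"] not in ids or link["target"] not in ids]


# Invented sample data; in the notebook, pass graph["points"] and graph["links"].
sample_points = [{"id": "rock"}, {"id": "shoegaze"}]
sample_links = [
    {"source": "rock", "target": "shoegaze", "weight": 2},
    {"source": "rock", "target": "ambient", "weight": 1},  # "ambient" is not a point
]

bad = dangling_links(sample_points, sample_links)
print(len(bad))  # 1
```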