# Semantic Pattern Synthesis
This notebook brings together several experiments on the topological and functional structure of contextual meaning in BERT. We focus on the verb **'run'** and examine its layer-wise embedding trajectories, their curvature, how they cluster, and how the clusters map onto semantic functions.

## 1. Load contextual sentences
We define 48 contextual uses of *'run'*, 12 for each of four semantic functions (descriptive, evaluative, narrative, performative):

In [None]:
sentences = [
    # descriptive (12)
    "She likes to run early in the morning before work.",
    "The marathon runner collapsed after a grueling 26-mile run.",
    "The river runs along the eastern border.",
    "The software update runs automatically.",
    "His nose began to run during the allergy season.",
    "The ink began to run in the rain.",
    "We saw a deer run across the highway at dusk.",
    "The children ran through the field chasing butterflies.",
    "The mountain stream runs fast in spring.",
    "The play ran for three consecutive seasons.",
    "They ran diagnostics to isolate the error.",
    "She ran her fingers along the dusty shelf.",
    # evaluative (12)
    "That joke is starting to run thin.",
    "The show has had a good run but it's time to end.",
    "Don’t let your temper run away with you.",
    "The campaign is running out of steam.",
    "He let the story run its course.",
    "This film had a limited theatrical run.",
    "That rumor ran rampant on social media.",
    "He was caught in a run of bad luck.",
    "He’s running behind schedule again.",
    "She’s running on empty after a tough week.",
    "The idea ran against conventional wisdom.",
    "His excuses are beginning to run thin.",
    # narrative (12)
    "He ran into trouble with the tax authorities.",
    "She ran the numbers twice before presenting them.",
    "The suspect tried to run from the scene.",
    "We did a test run before the final performance.",
    "The faucet was left on and water began to run.",
    "She screamed and began to run down the street.",
    "He let his imagination run wild.",
    "They ran a tight ship in that department.",
    "They run tests on samples before publishing results.",
    "Let’s run through the list one more time.",
    "He ran his hand over the sculpture's surface.",
    "The children ran screaming through the sprinkler.",
    # performative (12)
    "Run the idea by me again.",
    "Let the engine run for five minutes.",
    "Run for your life!",
    "She decided to run the experiment again.",
    "They’re running a sale on electronics this weekend.",
    "She’s running for president of the council.",
    "The manager ran the meeting efficiently.",
    "He used to run track in college.",
    "They run a family-owned bakery downtown.",
    "He decided to run for office in the next election.",
    "Please run this report by end of day.",
    "Run the installer before rebooting."
]
# 🧠 Matching semantic function per sentence
semantic_functions = (
    ["descriptive"] * 12 +
    ["evaluative"] * 12 +
    ["narrative"] * 12 +
    ["performative"] * 12
)
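
The labels above line up with `sentences` purely by position, so a miscounted group would silently shift every later label. A minimal sketch of a sanity check follows; the helper name `check_alignment` and the toy data are hypothetical and not part of the original notebook.

```python
from collections import Counter

def check_alignment(items, labels, expected_per_label):
    """Verify positional alignment and return the count of each label.

    Hypothetical helper: raises ValueError if the two lists differ in
    length or if any label does not occur `expected_per_label` times.
    """
    if len(items) != len(labels):
        raise ValueError(f"{len(items)} items but {len(labels)} labels")
    counts = Counter(labels)
    for label, n in counts.items():
        if n != expected_per_label:
            raise ValueError(
                f"label {label!r} occurs {n} times, expected {expected_per_label}"
            )
    return dict(counts)

# Toy data standing in for `sentences` and `semantic_functions`.
toy_items = ["s1", "s2", "s3", "s4"]
toy_labels = ["descriptive"] * 2 + ["evaluative"] * 2
print(check_alignment(toy_items, toy_labels, expected_per_label=2))
```

In the notebook itself the call would presumably be `check_alignment(sentences, semantic_functions, 12)`.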

## 2. BERT Trajectory Extraction
We extract the contextual embedding of each 'run' token at each of BERT's 12 transformer layers, dropping the initial embedding-layer output.

In [None]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from tqdm import tqdm

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

trajectories = []
valid_sentences = []

print("🔍 Matching token: 'run'")

for sent in tqdm(sentences):
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    input_ids = inputs["input_ids"][0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    print(f"🧾 Sentence: {sent}")
    print(f"🔤 Tokens: {tokens}")

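    # Match inflected forms exactly: a substring test on "run" misses "ran"
    # (those sentences would be skipped) and also hits words like "runner".
    target_forms = {"run", "runs", "ran", "running"}
    match_indices = [i for i, tok in enumerate(tokens) if tok.lower() in target_forms]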
    if not match_indices:
        print("⚠️ No 'run' token found — skipping.")
        continue

    idx = match_indices[0]
    layers = outputs.hidden_states
    curve = torch.stack([layer[0, idx] for layer in layers])  # shape: (13, 768)
    curve = curve[-12:]  # last 12 layers only

    if curve.shape == (12, 768):
        trajectories.append(curve.numpy())
        valid_sentences.append(sent)
    else:
        print("❌ Unexpected shape — skipping.")

print(f"\n✅ Extracted {len(trajectories)} valid trajectories.")
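
Which token gets picked as the target determines the whole trajectory, so it can help to isolate that step. The sketch below assumes a hypothetical helper `find_target_index` and a hand-written token list (not actual tokenizer output); it selects the first token that exactly equals one of the assumed inflections of 'run'.

```python
# Assumed set of inflected forms to treat as the target verb.
TARGET_FORMS = frozenset({"run", "runs", "ran", "running"})

def find_target_index(tokens, forms=TARGET_FORMS):
    """Return the index of the first token equal to a target form, else None."""
    for i, tok in enumerate(tokens):
        if tok.lower() in forms:
            return i
    return None

# Hand-written token list for illustration only.
tokens = ["[CLS]", "the", "marathon", "runner", "collapsed",
          "after", "a", "run", ".", "[SEP]"]
print(find_target_index(tokens))  # 7: skips "runner", matches "run"
```

Exact matching trades recall for precision: a form split into word pieces by the tokenizer would not be found, so the skipped-sentence count is worth checking after extraction.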

## 3. PCA Dimensionality Reduction

In [None]:
from sklearn.decomposition import PCA
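# Flatten each (12, 768) trajectory into a single 9216-dimensional vector.
flat = np.stack(trajectories).reshape(len(trajectories), -1)
# PCA cannot return more components than there are samples.
pca = PCA(n_components=min(30, len(trajectories)))
reduced = pca.fit_transform(flat)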


## 4. Scree Plot

In [None]:
import matplotlib.pyplot as plt
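# Plot cumulative explained variance against the component count (starting at 1).
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(1, len(cum_var) + 1), cum_var)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Scree Plot (Cumulative Explained Variance)')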
plt.grid(True)
plt.show()
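
The curve can also be read off numerically. As a rough sketch, the hypothetical helper `components_for_variance` below returns the smallest number of components whose cumulative explained variance reaches a chosen threshold; the 0.9 default and the toy ratios are assumptions made for illustration.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold=0.9):
    """Smallest k whose first k ratios sum to at least `threshold`, else None."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return None

# Toy explained-variance ratios; cumulative sums are 0.5, 0.75, 0.875, 0.9375.
toy_ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for_variance(toy_ratios, threshold=0.8))   # 3
print(components_for_variance(toy_ratios, threshold=0.99))  # None
```

With the fitted model from the previous section, the call would be `components_for_variance(pca.explained_variance_ratio_)`.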

## 5. UMAP Projection + DBSCAN

In [None]:
from umap import UMAP
from sklearn.cluster import DBSCAN
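# Note: UMAP is fit on the full flattened trajectories (`flat`), not on the PCA output.
reducer = UMAP(n_components=3, random_state=42)  # fixed seed for a reproducible layout
embed = reducer.fit_transform(flat)
# DBSCAN assigns the label -1 to points that belong to no cluster (noise).
db = DBSCAN(eps=0.8, min_samples=3)
labels = db.fit_predict(embed)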
print("Labels:", labels)
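
The raw label array is hard to scan, partly because DBSCAN reports noise points with the label -1 alongside the real cluster ids. A small sketch of a summary, using a hypothetical helper `summarize_labels` and toy labels, could look like this:

```python
from collections import Counter

def summarize_labels(labels, noise_label=-1):
    """Count clusters, noise points, and per-cluster sizes from cluster labels."""
    counts = Counter(int(lbl) for lbl in labels)
    n_noise = counts.pop(noise_label, 0)  # noise is not a cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }

# Toy labels standing in for the output of `db.fit_predict(embed)`.
toy_labels = [0, 0, 1, -1, 1, 1, -1]
print(summarize_labels(toy_labels))
```

If most points end up as noise or everything falls into one cluster, `eps` and `min_samples` are the first parameters to revisit.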

## 6. Visualize UMAP Projection

In [None]:
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(embed[:, 0], embed[:, 1], embed[:, 2], c=labels, cmap='tab10')
ax.legend(*sc.legend_elements(), title="Cluster")  # label -1 is DBSCAN noise
ax.set_title("UMAP + DBSCAN Clustering")
plt.show()

## 7. Cluster to Function Distribution

In [None]:
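# ⛳ Align function labels with the sentences that survived extraction.
# Indexing `semantic_functions` by position in `trajectories` misaligns
# the labels as soon as any sentence is skipped.
func_by_sentence = dict(zip(sentences, semantic_functions))
valid_funcs = [func_by_sentence[s] for s in valid_sentences]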

# 🧭 Map functions to clusters
from collections import defaultdict
cluster_map = defaultdict(list)

for lbl, func in zip(labels, valid_funcs):
    cluster_map[int(lbl)].append(func)

# 🖨️ Print function composition of each cluster
for cl, fs in cluster_map.items():
    print(f"\n🔹 Cluster {cl}:")
    for func in sorted(set(fs)):
        print(f"  {func}: {fs.count(func)} / {len(fs)}")
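
The per-cluster counts can be condensed into one number per cluster. The sketch below is one possible summary rather than part of the original analysis: a hypothetical helper `cluster_purity` reports each cluster's majority function and the fraction of members carrying it, ignoring DBSCAN noise (label -1). The toy inputs are invented for illustration.

```python
from collections import Counter

def cluster_purity(cluster_labels, functions, noise_label=-1):
    """Map each cluster id to (majority function, fraction of members with it)."""
    groups = {}
    for cl, func in zip(cluster_labels, functions):
        if cl == noise_label:
            continue  # noise points are not assigned to any cluster
        groups.setdefault(int(cl), []).append(func)
    purity = {}
    for cl, funcs in sorted(groups.items()):
        top_func, top_count = Counter(funcs).most_common(1)[0]
        purity[cl] = (top_func, top_count / len(funcs))
    return purity

# Toy inputs standing in for `labels` and `valid_funcs`.
toy_clusters = [0, 0, 0, 1, 1, -1]
toy_funcs = ["narrative", "narrative", "evaluative",
             "performative", "performative", "descriptive"]
for cl, (func, frac) in cluster_purity(toy_clusters, toy_funcs).items():
    print(f"Cluster {cl}: {func} ({frac:.2f})")
```

A purity near 0.25 on four balanced functions would suggest the clusters carry little information about semantic function.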