
# Transformer Fundamentals – Guided Notebook 04: Softmax Weighting
**Date:** 2025-10-29  
**Style:** Guided, hands-on; from-scratch first, then frameworks; interactive visuals

## Learning Objectives

- Deepen intuition for softmax as a temperature-controlled probability distribution.
- Explore temperature τ and its effect on focus vs diffusion.
- Observe sparsity patterns and entropy of attention weights.


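## TL;DR
Softmax converts scores into a probability distribution. The temperature τ controls how concentrated that distribution is: low τ yields peaked, confident weights, and high τ yields diffuse, exploratory ones.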
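
For reference, the temperature-scaled softmax and the entropy used in this notebook can be written as follows. The symbols $z$ (scores), $p$ (weights), $n$ (number of scores), and $H$ (entropy) are notation chosen for this note rather than names taken from the code.

$$
p_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{n} \exp(z_j / \tau)}, \qquad
H(p) = -\sum_{i=1}^{n} p_i \log p_i, \qquad
0 \le H(p) \le \log n .
$$

As $\tau \to 0^{+}$ the weights concentrate on the largest score (assuming it is unique) and $H \to 0$; as $\tau \to \infty$ the weights approach the uniform distribution and $H \to \log n$.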


## Concept Overview
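- Softmax depends only on the differences between scores, so adding a constant to every score leaves the output unchanged. Dividing the scores by a temperature τ rescales those differences: τ < 1 sharpens the distribution, τ > 1 smooths it.
- Entropy summarizes sharpness in a single number: it is 0 for a one-hot distribution and log n for a uniform distribution over n items.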
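
A small numeric example may make the two points above concrete. This is a minimal sketch: the helper names `softmax_with_temperature` and `shannon_entropy` and the three scores are illustrative choices, not part of the notebook's utilities. It prints the weights and entropy at three temperatures and checks that shifting every score by a constant leaves the weights unchanged.

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Temperature-scaled softmax over the last axis (illustrative helper)."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def shannon_entropy(p, eps=1e-12):
    """Entropy in nats of a 1-D distribution; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

scores = np.array([2.0, 1.0, 0.0])  # arbitrary example scores

for tau in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(scores, tau)
    print(f"tau={tau:<4} p={np.round(p, 3)} entropy={shannon_entropy(p):.3f}")

# Adding the same constant to every score should not change the weights.
shifted = softmax_with_temperature(scores + 100.0)
print("shift-invariant:", np.allclose(shifted, softmax_with_temperature(scores)))
```

The entropy should rise with τ while staying below log 3, the value for a uniform distribution over three items.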


In [None]:

# %% [setup] Environment check & minimal installs (run once per kernel)
# Target: Python 3.12.12, PyTorch 2.5+, transformers 4.44+, datasets 3+, ipywidgets 8+, matplotlib 3.8+
import sys, platform

print("Python:", sys.version)
print("Platform:", platform.platform())

# Optional: uncomment to install/upgrade on this machine (internet required)
# !pip install --upgrade pip
# !pip install "torch>=2.5" "transformers>=4.44" "datasets>=3.0.0" "ipywidgets>=8.1.0" "matplotlib>=3.8" "umap-learn>=0.5.6"

try:
    import torch
    print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device name:", torch.cuda.get_device_name(0))
except Exception as e:
    print("PyTorch not available yet:", e)

%config InlineBackend.figure_format = 'retina'
from IPython.display import display, HTML
try:
    import ipywidgets as widgets
    from ipywidgets import interact, interactive
    print("ipywidgets:", widgets.__version__)
except Exception as e:
    print("ipywidgets not available yet:", e)

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)


In [None]:

# %% [utils] Small helpers used throughout
import numpy as np
import matplotlib.pyplot as plt  # needed by show_heatmap if this cell runs on its own

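def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    # Subtracting the max leaves the result unchanged but prevents overflow in exp.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)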

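def cosine_sim(a, b, eps=1e-9):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    # eps guards against division by zero for all-zero vectors.
    a_norm = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_norm = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.dot(a_norm, b_norm.T)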

def show_heatmap(mat, xticklabels=None, yticklabels=None, title=""):
    plt.figure()
    plt.imshow(mat, aspect="auto")
    plt.colorbar()
    if xticklabels is not None: plt.xticks(range(len(xticklabels)), xticklabels, rotation=45, ha="right")
    if yticklabels is not None: plt.yticks(range(len(yticklabels)), yticklabels)
    plt.title(title)
    plt.tight_layout()
    plt.show()
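
The `softmax` helper above subtracts the maximum before exponentiating. The sketch below suggests why, using arbitrary large scores. `naive_softmax` and `stable_softmax` are illustrative names; the latter is intended to mirror the helper's computation.

```python
import numpy as np

def naive_softmax(x):
    """Softmax without the max shift; shown only to illustrate overflow."""
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    """Shift by the max first, as the notebook's helper does."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])  # arbitrary large scores

# exp(1000) overflows float64 to inf, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big)

stable = stable_softmax(big)

print("naive :", naive)
print("stable:", stable)
```

Because softmax depends only on score differences, the stable result for these inputs should match the softmax of `[0, 1, 2]`.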


In [None]:

# %% [experiment] Temperature & entropy exploration (NumPy)
import ipywidgets as widgets
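def entropy(p, eps=1e-12):
    """Mean Shannon entropy (in nats) over the last axis; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1).mean()

def softmax_temp(x, tau=1.0, axis=-1):
    """Softmax of x / tau; tau is floored at 1e-6 to avoid division by zero."""
    x = x / max(tau, 1e-6)
    return softmax(x, axis=axis)

def temp_demo(tau=1.0):
    T, d = 10, 16
    # Fixed seed so every slider position sees the same scores and only tau varies.
    X = np.random.default_rng(42).standard_normal((T, d))
    A = X @ X.T  # symmetric scores
    P = softmax_temp(A, tau=tau, axis=-1)
    print(f"Temperature: {tau:.3g} | mean entropy: {entropy(P):.4f} (max {np.log(T):.4f})")
    show_heatmap(P, title=f"Softmax with Temperature={tau:.3g}")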

widgets.interact(temp_demo, tau=widgets.FloatLogSlider(base=10, min=-2, max=1, step=0.05, value=1.0))
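
The slider shows one temperature at a time, so a non-interactive sweep over a fixed score matrix can make the trend easier to compare. The following is a minimal sketch: the helper names, the chosen temperatures, and the `1e-3` sparsity threshold are assumptions made for illustration. It reports the mean row entropy and the fraction of near-zero weights at each temperature.

```python
import numpy as np

def softmax_rows(scores, tau=1.0):
    """Row-wise temperature softmax (illustrative stand-in for softmax_temp)."""
    z = scores / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_row_entropy(P, eps=1e-12):
    """Average entropy (nats) across rows; clipping avoids log(0)."""
    P = np.clip(P, eps, 1.0)
    return float(-(P * np.log(P)).sum(axis=-1).mean())

def sparsity(P, threshold=1e-3):
    """Fraction of weights below `threshold` (the threshold is an arbitrary choice)."""
    return float((P < threshold).mean())

rng = np.random.default_rng(42)   # local generator: same scores for every tau
X = rng.standard_normal((10, 16))
A = X @ X.T                       # symmetric scores, as in the demo

taus = [0.1, 0.3, 1.0, 3.0, 10.0]
entropies = [mean_row_entropy(softmax_rows(A, t)) for t in taus]
sparsities = [sparsity(softmax_rows(A, t)) for t in taus]

print(f"max possible entropy: log(10) = {np.log(10):.3f}")
for t, h, s in zip(taus, entropies, sparsities):
    print(f"tau={t:<5} mean entropy={h:.3f}  fraction below 1e-3={s:.2f}")
```

Mean entropy should not decrease as τ grows, and it is bounded above by log 10 for rows of length 10. The sparsity figure depends on the threshold, so treat it as a rough indicator rather than a fixed property of the weights.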


### Framework Tie-in
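- In generation APIs, the logits are divided by the temperature before the softmax that precedes sampling. Values below 1 concentrate probability on the highest-scoring tokens; values above 1 spread it across more candidates. Under greedy decoding (no sampling), temperature does not change which token is selected.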
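
To illustrate what "temperature before sampling" means, here is a minimal NumPy sketch. The function `sample_with_temperature`, the logits, and the toy vocabulary are all hypothetical; real generation APIs typically combine temperature with further steps such as top-k or top-p filtering, which are omitted here.

```python
import numpy as np

def sample_with_temperature(logits, tau, rng, n_samples=1):
    """Divide logits by tau, apply softmax, then sample token indices."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                 # shift for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    samples = rng.choice(len(probs), size=n_samples, p=probs)
    return samples, probs

logits = np.array([3.0, 2.0, 1.0, 0.0])    # hypothetical next-token logits
vocab = ["the", "a", "cat", "xylophone"]   # hypothetical toy vocabulary
rng = np.random.default_rng(0)

for tau in (0.3, 1.0, 3.0):
    samples, probs = sample_with_temperature(logits, tau, rng, n_samples=1000)
    counts = np.bincount(samples, minlength=len(vocab))
    print(f"tau={tau}: p(top token)={probs[0]:.3f}, "
          f"counts={dict(zip(vocab, counts.tolist()))}")
```

At low τ nearly all samples should land on the top-scoring token, while at high τ the counts spread across the vocabulary. The ranking of the tokens by probability stays the same at every temperature.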



---
### Bonus: Multilingual Extension
- Swap the tokenizer/model for a multilingual variant (e.g., `bert-base-multilingual-cased` or `xlm-roberta-base`).
- Repeat a small slice of the notebook (tokenization, attention map) on non-English sentences and compare.



---
## Reflection & Next Steps
- What changed when you tweaked dimensions, temperatures, or prompts?
- Where did the attention concentrate, and did it match your intuition?
- Re-run the interactive widgets on your own text.
- Save a copy of the figures that best illustrate your understanding.
