# 🧪 Workshop 2: Objective Design and Local Dynamics

> This notebook is the clean, non-executed version, provided so that you can attempt the problem yourself.
>
> A model answer and executive summary can be found in the *worked* version.

This workshop builds directly on Workshop 1 and marks the transition from gradient interpretation to explicit optimisation dynamics, opening Part 2 of the series.

Where Workshop 1 focused on what gradients are and how they are structured, this workshop focuses on what gradients do when they are applied repeatedly. Gradients are no longer treated as static sensitivity maps, but as drivers of parameter evolution.

The central shift in perspective is:
- from *“what does this gradient look like?”*
- to *“what happens when I follow it?”*

---

**Conceptual emphasis**

The workshop develops intuition for:
- how a single gradient descent step turns sensitivity into motion,
- how repeated local updates accumulate into global behaviour,
- how objective structure influences optimisation trajectories,
- and how gradient geometry affects stability, speed, and convergence.

Rather than introducing full training pipelines, the focus remains on controlled, interpretable systems where optimisation dynamics can be reasoned about directly.

--- 

**Key ideas explored include**:
- gradient descent as repeated application of vector–Jacobian products,
- implicit objective functions defined by upstream weighting,
- the relationship between gradient magnitude, direction, and parameter motion,
- conditioning and anisotropy in gradient-driven updates,
- how symmetry, nonlinearity, and curvature shape optimisation paths,
- and visualising optimisation as movement through parameter space.

--- 

**How this workshop fits in the series**

This workshop serves as the conceptual bridge between:
- gradient flow and sensitivity analysis (Workshop 1),
- and more advanced optimisation topics such as learning rates, curvature, and second-order effects (later in Part 2).

By the end of this workshop, gradient descent is no longer a formula, but a geometric process whose behaviour can be anticipated from gradient structure alone.

---

**What this workshop deliberately does not cover**
- Neural network modules (`nn.Module`)
- Optimisers such as Adam, RMSProp, etc.
- Datasets, batching, or training loops

Those elements are introduced only after optimisation dynamics are conceptually understood.

---

**Recommended prerequisites**
- Completion of Tutorials 1–4
- Workshop 1: From Gradient Flow to Optimisation Intuition
- Comfort with gradients, Jacobians, and basic optimisation ideas
- Familiarity with linear algebra and nonlinear mappings

---

**Author: Angze Li**

**Last updated: 2026-02-19**

**Version: v1.0**

## 🧩 Problem: Designing an Objective via Upstream Gradients

> Optimisation is not only about how to minimise a loss;
> it is also about what objective you choose.

In this problem, you will implicitly define an objective by choosing an upstream gradient.

Consider:
```python
X = torch.randn(5, 3, requires_grad=True)

Y = torch.tanh(X @ X.T)
```
Here:
- `Y` is a **5×5 tensor** measuring pairwise interactions,
- the output is *symmetric and non-scalar*.

---

### Task
1. Construct an upstream gradient matrix V such that:
    - diagonal entries of Y are emphasised,
    - off-diagonal entries are penalised.
2. Call:
```python
Y.backward(V)
```
3. Inspect `X.grad`.

---

### Questions to think about
- What implicit scalar objective are you optimising?
- How does changing the diagonal/off-diagonal weighting affect `X.grad`?
- Which entries of `X` are encouraged to grow or shrink?
- Can you interpret this as encouraging **self-similarity** over **cross-similarity**?

---

### Hint

> You are not optimising `Y` directly.
> You are optimising a **weighted trace-like** functional of `Y`.

---

### Why this problem matters (Bridge to Part 2)

This problem quietly introduces:
- custom objective design,
- structure-aware optimisation,
- gradients as *design tools*, not just training signals.

Without using:
- optimisers,
- learning rates,
- training loops,

you have already answered:

> “If I *were* to optimise this system, what direction would the parameters move?”

That is exactly the mindset needed for Part 2.

## Solution

## 🔗 Trailer: From Gradient Structure to a Single Update Step

So far in Part 1, we have treated gradients as objects to inspect rather than tools to use.
We decomposed them, visualised them, and asked where sensitivity lives inside a tensor.

Now we briefly connect that structure to motion.

### What is gradient descent?

At its simplest, gradient descent is a rule for updating parameters in order to reduce a scalar objective.

Given a scalar function
$$L(X),$$
a single gradient descent step with step size $\eta > 0$ is:
$$X_{\text{new}} = X - \eta \,\nabla_X L.$$

That’s it.

There is no optimiser, no momentum, no learning rate schedule, just:
- a gradient (direction),
- and a step size (scale).
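
As a minimal sketch of the update rule above, the snippet below takes one gradient descent step by hand in PyTorch. The sum-of-squares objective and the step size `eta = 0.1` are assumptions chosen only because the result is easy to check (the gradient is `2 * X`, so the step shrinks `X` by a factor of `1 - 2 * eta`); they are not the objective used in this workshop.

```python
import torch

torch.manual_seed(0)  # reproducible illustration

X = torch.randn(5, 3, requires_grad=True)
eta = 0.1  # assumed step size, chosen only for illustration

# Toy scalar objective (an assumption, not the workshop's objective):
# L(X) = sum of squares, whose gradient is 2 * X.
L = (X ** 2).sum()
L.backward()

# One gradient descent step, taken outside the autograd graph.
with torch.no_grad():
    X_new = X - eta * X.grad
    L_new = (X_new ** 2).sum()

print(f"L before: {L.item():.4f}, L after: {L_new.item():.4f}")
```

Because the step is applied under `torch.no_grad()`, it is plain tensor arithmetic: no optimiser object is involved.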

### What is the “loss” in our case?

In this notebook, we did not define a conventional loss function.

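Instead, we implicitly defined a scalar objective via the upstream gradient `V`:
$$L(X) = v^\top \cdot Y := \sum_{i,j} V_{ij}\, Y_{ij},
\quad \text{where } Y = \tanh(X X^\top).$$

Here $v^\top \cdot Y$ is shorthand for the elementwise (Frobenius) inner product of `V` and `Y`: both are read as flattened vectors, with $v$ denoting the upstream weighting passed to `Y.backward(V)`.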

The gradient we computed and visualised throughout this workshop is therefore:
$$\nabla_X (v^\top \cdot Y).$$

Every heatmap you plotted is a **map of how a single gradient descent step would move `X`.**
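
The correspondence between an upstream gradient and an implicit scalar objective can be checked numerically. The sketch below compares `Y.backward(V)` with the gradient of the explicit scalar `(V * Y).sum()`. The particular `V` is a hypothetical choice for illustration, not the model answer: it uses -1 on the diagonal and +1 off the diagonal, assuming the objective is treated as a loss to be minimised.

```python
import torch

torch.manual_seed(0)  # reproducible illustration

X = torch.randn(5, 3, requires_grad=True)

# Hypothetical upstream weighting (an assumption, not the model answer):
# -1 on the diagonal, +1 off the diagonal.
V = torch.ones(5, 5) - 2.0 * torch.eye(5)

# Route 1: vector-Jacobian product via a non-scalar backward call.
Y = torch.tanh(X @ X.T)
Y.backward(V)
grad_vjp = X.grad.clone()

# Route 2: build the implicit scalar objective explicitly and differentiate it.
X.grad = None  # clear the accumulated gradient before the second pass
L = (V * torch.tanh(X @ X.T)).sum()
L.backward()
grad_scalar = X.grad.clone()

same = torch.allclose(grad_vjp, grad_scalar, atol=1e-6)
print("Gradients agree:", same)
```

Any other `V` of shape 5×5 could be substituted; the agreement between the two routes does not depend on the weighting chosen.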

### One explicit update step

Using the averaged gradient you computed, a single update would be:
$$X_{\text{new}} = X - \eta \,\nabla_X (v^\top \cdot Y).$$

What does this mean in practice?
- Each entry of `X` moves against the sign of its gradient entry in the heatmap (because of the minus sign in the update), scaled by $\eta$.
- Regions with larger magnitude move **more strongly**.
- Positive and negative regions correspond to opposing update directions, not just strength.
- The update respects:
    - the symmetry of $X X^\top$,
    - the structure imposed by the upstream weighting `v`,
    - and the nonlinear gating of tanh.
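
A single explicit update for this system might look like the sketch below. The weighting `V`, the step size `eta = 1e-3`, and the helper `objective` are all illustrative assumptions rather than the workshop's reference solution; `V` again uses -1 on the diagonal and +1 off the diagonal, so that descending the objective favours diagonal entries of `Y` over off-diagonal ones.

```python
import torch

torch.manual_seed(0)  # reproducible illustration


def objective(X, V):
    """Implicit scalar objective: sum over i, j of V_ij * tanh(X X^T)_ij."""
    return (V * torch.tanh(X @ X.T)).sum()


X = torch.randn(5, 3, requires_grad=True)
V = torch.ones(5, 5) - 2.0 * torch.eye(5)  # hypothetical weighting
eta = 1e-3  # assumed small step size

# Gradient of the implicit objective, obtained from the upstream weighting.
Y = torch.tanh(X @ X.T)
Y.backward(V)

# One gradient descent step, then compare the objective before and after.
with torch.no_grad():
    X_new = X - eta * X.grad
    L_old = objective(X, V)
    L_new = objective(X_new, V)

print(f"objective before: {L_old.item():.6f}")
print(f"objective after:  {L_new.item():.6f}")
```

With a small enough step the objective should decrease slightly; larger values of `eta` are not guaranteed to do so.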

Nothing “new” happens here.

Optimisation is simply **repeated application of the sensitivity patterns** you have already analysed.

### Why this closes Part 1 (and previews Part 2)

In Part 1, gradients were treated as:
- quantities to compute,
- structures to interpret,
- and signals to decompose.

This final step shows that:
- **every gradient-based optimisation algorithm is a rule for turning gradients into motion**,
- the heatmaps you plotted literally encode *where parameters will move next*,
- upstream objectives shape optimisation *before* any optimiser is introduced.

In **Part 2**, we will:
- repeat this step many times,
- vary step size $\eta$,
- introduce conditioning, curvature, and geometry,
- and study how these local updates accumulate into global behaviour.

Conceptually, nothing new is added, only repetition.

That repetition is optimisation.