Let’s take a **concrete example** of a **mouse navigating a maze** to understand what inputs and outputs go into the **policy gradient neural network**, and how it learns from a few episodes.

---

##  Setup: Mouse Maze

### Environment:

* A **5x5 maze** grid.
* The **mouse starts at (0, 0)**.
* The **goal (cheese)** is at (4, 4).
* The mouse can take 4 actions: `up`, `down`, `left`, `right`.

Each step gives:

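* -1 reward (time penalty) for every step that does not reach the goal
* +10 reward for the step that reaches the goal (in place of the -1, which is how the episode totals below are computed)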
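
As a rough sketch, the environment above could be written like this. The names `step` and `ACTIONS`, the `(row, col)` coordinate convention, and the choice to leave the mouse in place when it walks into the outer wall are assumptions made for illustration; the setup above does not specify them.

```python
# Hypothetical 5x5 maze matching the setup above (no interior walls).
SIZE = 5
START = (0, 0)
GOAL = (4, 4)

# Action index -> (name, (row change, col change)). Assumed convention: the
# first coordinate grows when moving down, the second when moving right.
ACTIONS = {
    0: ("up", (-1, 0)),
    1: ("down", (1, 0)),
    2: ("left", (0, -1)),
    3: ("right", (0, 1)),
}

def step(state, action):
    """Apply one action and return (next_state, reward, done)."""
    _, (d_row, d_col) = ACTIONS[action]
    # Assumed: a move into the outer wall leaves that coordinate unchanged.
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    next_state = (row, col)
    if next_state == GOAL:
        return next_state, 10, True    # +10 on the step that reaches the cheese
    return next_state, -1, False       # -1 time penalty otherwise
```

For example, `step(START, 1)` moves the mouse down one cell and returns a reward of -1.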

---

##  Neural Network (Policy)

We have a neural network (NN) that **takes the current state** (position of the mouse) and **outputs probabilities for the 4 actions**.

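* **Input to NN**: current state (e.g., (0, 0), (1, 3), etc.)

  * We'll use the 2D position as a vector, like `[x, y]`. In the steps below, the first coordinate increases when the mouse moves `down` and the second when it moves `right`.
* **Output of NN**: 4 numbers, one probability per action in the order `up`, `down`, `left`, `right`:

  * $0.1, 0.6, 0.2, 0.1$ → (e.g., 60% chance to go down)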
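
To make the input/output shape concrete, here is a minimal sketch of a policy that maps a 2D state to 4 action probabilities. It uses a single linear layer followed by softmax as a stand-in for the neural network; the `weights` and `biases` values are made up, and a real network would normally have hidden layers.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(state, weights, biases):
    """Linear policy: one logit per action from the 2D state, then softmax."""
    logits = [w[0] * state[0] + w[1] * state[1] + b
              for w, b in zip(weights, biases)]
    return softmax(logits)

# Hypothetical parameters: 4 actions x 2 state features, plus 4 biases.
weights = [[-0.5, 0.0], [0.5, 0.0], [0.0, -0.5], [0.0, 0.5]]
biases = [0.0, 0.0, 0.0, 0.0]

# Probabilities in the order up, down, left, right.
probs = policy([0, 0], weights, biases)
```

With these particular parameters every logit is 0 at state `[0, 0]`, so the four probabilities come out equal; training is what would move them toward something like the $0.1, 0.6, 0.2, 0.1$ shown above.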

---

##  Let’s Simulate 3 Episodes

We'll walk through:

1. What the NN sees (input)
2. What it outputs (action probabilities)
3. What it selects (sampled action)
4. What reward it gets
5. What it sends to the gradient update

---

### Episode 1:

#### Step 1:

* **State**: \[0, 0]
* **NN Output**: $0.1, 0.6, 0.2, 0.1$
* **Sampled Action**: `down` (index 1)
* **New State**: \[1, 0], reward = -1

#### Step 2:

* **State**: \[1, 0]
* **NN Output**: $0.2, 0.5, 0.2, 0.1$
* **Action**: `down` → \[2, 0], reward = -1

#### Step 3:

* ...
* **Eventually** reaches goal at step 8
* **Total reward $R(\tau_1)$**: +3 (because -1×7 + 10)

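For each step of the episode:

* Compute:

  $$
  \nabla_\theta \log \pi_\theta(a_t | s_t)
  $$
* Multiply by $R(\tau_1) = 3$. The return is positive, so the update makes the actions taken in this episode more likely.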

---

### Episode 2:

* Different path.
* Reaches goal in 12 steps.
* $R(\tau_2) = -1$ (because -1×11 + 10)

The return is negative, so the gradient update will **reduce** the probability of the actions it took, even though the mouse did find the cheese.

---

### Episode 3:

* Failed to reach goal in 15 steps.
* $R(\tau_3) = -15$

Bad episode → the large negative return will strongly **discourage those actions**.

---

##  What Goes Into the Policy Gradient Update

Let’s summarize it:

### From each episode:

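We collect sequences of:

| Time | State   | Action | log\_prob(action) | Reward |
| ---- | ------- | ------ | ----------------- | ------ |
| 0    | \[0, 0] | down   | -0.51             | -1     |
| 1    | \[1, 0] | down   | -0.69             | -1     |
| ...  | ...     | ...    | ...               | ...    |
| 7    | ...     | right  | -0.36             | +10    |

The +10 belongs to the last action (time 7), the move that enters the goal cell \[4, 4]. The terminal state itself has no action and no log-prob.

Then compute:

$$
\text{Loss} = -\sum_t \log \pi_\theta(a_t|s_t) \cdot R_t
$$

Here $R_t$ is the return credited to step $t$: either the whole-episode total $R(\tau)$ used in the episodes above, or the sum of rewards from step $t$ onward. Minimizing this loss is the same as gradient ascent on the expected return, and it is the quantity used to **compute gradients and update the NN's weights**.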
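
As a small numeric illustration, the loss can be evaluated for the first two rows of the table. This sketch assumes the returns credited to those steps are 3 and 4, i.e. the sum of rewards from each step onward in Episode 1 (-1×7 + 10 and -1×6 + 10); the function name `reinforce_loss` is made up. In a real framework the log-probs would be tensors, so that the gradient of this loss can flow back to the weights.

```python
import math

def reinforce_loss(log_probs, returns):
    """Loss = -sum_t log pi(a_t | s_t) * R_t."""
    return -sum(lp * r for lp, r in zip(log_probs, returns))

# Probabilities the policy gave to the actions actually taken at times 0 and 1.
taken_probs = [0.6, 0.5]
log_probs = [math.log(p) for p in taken_probs]   # about -0.51 and -0.69

# Assumed returns for those two steps (rewards summed from each step onward).
returns = [3.0, 4.0]

loss = reinforce_loss(log_probs, returns)
print(round(loss, 2))   # 4.31
```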

---

## Input-Output Summary

| **Input to Neural Net**   | **Output from Neural Net**              |
| ------------------------- | --------------------------------------- |
| State = position `[x, y]` | Action probabilities `[p1, p2, p3, p4]` |
| Example: `[2, 1]`         | `[0.1, 0.7, 0.1, 0.1]`                  |

This output is used to **sample an action**, and the log-probability of that action is used to compute the gradient, scaled by the return.

---

##  Repeat:

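After 3 episodes, we average the gradients and **update the weights** so that:

* Actions taken in episodes with a positive return become more likely
* Actions taken in episodes with a negative return become less likely

Then we collect new episodes with the updated policy and repeat.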

---



## Let's walk through **Episode 1** step by step, showing **what happens at each step**:

---

##  Goal:

At **every step**, we:

* Feed the current **state** into the policy network.
* Get the **probability distribution over actions**.
* **Sample an action** from that distribution.
* Take the action → get next state & reward.
* Store:

  * $\log \pi_\theta(a_t | s_t)$ → log probability of the taken action
  * $r_t$ → the reward for this step (used to compute the return $R_t$ once the episode ends)
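
The per-step loop above can be sketched as follows. The names `policy`, `env_step`, and `run_episode` are hypothetical, the uniform `policy` is only a placeholder for the network, and the 15-step cap and the "walls block the move" rule are assumptions.

```python
import math
import random

GOAL, SIZE = (4, 4), 5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def policy(state):
    """Placeholder for the network: uniform probabilities over the 4 actions."""
    return [0.25, 0.25, 0.25, 0.25]

def env_step(state, action):
    """Assumed dynamics: a move into the outer wall leaves the mouse in place."""
    row = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
    col = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
    nxt = (row, col)
    return (nxt, 10, True) if nxt == GOAL else (nxt, -1, False)

def run_episode(max_steps=15, seed=0):
    rng = random.Random(seed)
    state, log_probs, rewards = (0, 0), [], []
    for _ in range(max_steps):
        probs = policy(state)                              # state -> action probabilities
        action = rng.choices(range(4), weights=probs)[0]   # sample an action
        state, reward, done = env_step(state, action)      # act, observe next state and reward
        log_probs.append(math.log(probs[action]))          # store log-prob of the taken action
        rewards.append(reward)                             # store the step reward
        if done:
            break
    return log_probs, rewards

log_probs, rewards = run_episode()
```

Nothing is updated inside the loop; the two lists are all that training needs once the episode is over.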

---

##  Episode 1 — Suppose the Mouse Reaches Cheese in 8 Steps

| Step | State   | NN Output (action probs) | Sampled Action | Next State | Reward |
| ---- | ------- | ------------------------ | -------------- | ---------- | ------ |
| 0    | \[0, 0] | $0.1, 0.6, 0.2, 0.1$     | `down` (1)     | \[1, 0]    | -1     |
| 1    | \[1, 0] | $0.2, 0.5, 0.2, 0.1$     | `down` (1)     | \[2, 0]    | -1     |
| 2    | \[2, 0] | $0.25, 0.4, 0.25, 0.1$   | `right` (3)    | \[2, 1]    | -1     |
| 3    | \[2, 1] | $0.1, 0.2, 0.6, 0.1$     | `down` (1)     | \[3, 1]    | -1     |
| 4    | \[3, 1] | $0.1, 0.1, 0.1, 0.7$     | `right` (3)    | \[3, 2]    | -1     |
| 5    | \[3, 2] | $0.3, 0.2, 0.1, 0.4$     | `right` (3)    | \[3, 3]    | -1     |
| 6    | \[3, 3] | $0.2, 0.5, 0.1, 0.2$     | `down` (1)     | \[4, 3]    | -1     |
| 7    | \[4, 3] | $0.1, 0.1, 0.1, 0.7$     | `right` (3)    | \[4, 4]    | +10    |

---

##  What Happens Internally at Each Step?

At **each step $t$**:

1. You pass state $s_t$ into the neural network.
2. Get softmax output → action probabilities: $\pi_\theta(a|s_t)$
3. Sample action $a_t$ from this distribution.
4. Compute $\log \pi_\theta(a_t | s_t)$
5. Save it for training.

Yes, **every step gives one probability per action**, but you only use the **log-prob of the sampled action** in the gradient:

$$
\nabla_\theta \log \pi_\theta(a_t | s_t)
$$

In practice you save $\log \pi_\theta(a_t | s_t)$ itself and let backpropagation compute this gradient at update time. You don't use the log-probs of the other actions, only the one you actually **sampled and took**.

---

##  Example:

Let’s take Step 0 again:

* State: \[0, 0]
* NN output: $0.1, 0.6, 0.2, 0.1$
* Chosen action: `down` (index 1)
* So:

  * $\pi_\theta(a_0 | s_0) = 0.6$
  * $\log \pi_\theta(a_0 | s_0) = \log(0.6) \approx -0.51$

This value is saved and **multiplied by the return $R_0$ later** in training.
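
The same lookup in code, as a minimal sketch using the numbers from this example:

```python
import math

probs = [0.1, 0.6, 0.2, 0.1]   # up, down, left, right at state [0, 0]
action = 1                      # `down` was sampled

# Only the probability of the action actually taken is used.
log_prob = math.log(probs[action])
print(round(log_prob, 2))       # -0.51
```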

---

##  After the Episode Ends

We compute **returns** $R_t$ for each time step (total future reward from that step onward).

For example (discount factor $\gamma = 1$):

| Step | Reward $r_t$ | Return $R_t$ |
| ---- | ------------ | ------------ |
| 0    | -1           | 3            |
| 1    | -1           | 4            |
| 2    | -1           | 5            |
| 3    | -1           | 6            |
| 4    | -1           | 7            |
| 5    | -1           | 8            |
| 6    | -1           | 9            |
| 7    | +10          | 10           |
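
The table can be reproduced by summing rewards backwards from the last step. A minimal sketch (the function name `returns_to_go` is made up):

```python
def returns_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * R_{t+1}, computed backwards from the last step."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [-1, -1, -1, -1, -1, -1, -1, 10]   # Episode 1
print(returns_to_go(rewards))   # [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

With $\gamma < 1$ the same function would shrink the contribution of the distant +10 for the early steps.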

---

##  Final Gradient Computation

For each step:

$$
\text{Gradient} = \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R_t
$$

These gradients are **accumulated**, and we **take an average over the episode** to do the **parameter update**.
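
To see what one term of this sum does, here is a sketch for step 0 of Episode 1. It makes a simplifying assumption: the parameters $\theta$ are the four logits themselves, for which the gradient of the log-softmax is `one_hot(action) - probs`. In a real network, backpropagation would carry this gradient further back to the weights. The learning rate of 0.1 is arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log softmax w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

# Logits chosen so that softmax reproduces the step-0 probabilities.
logits = [math.log(p) for p in [0.1, 0.6, 0.2, 0.1]]
probs = softmax(logits)
action, ret = 1, 3.0             # `down` was taken; R_0 = 3

# Policy-gradient term for this step: grad log pi(a|s) * R_t
grad = [g * ret for g in grad_log_prob(probs, action)]

# One gradient-ascent step on the logits.
new_logits = [z + 0.1 * g for z, g in zip(logits, grad)]
new_probs = softmax(new_logits)  # the probability of `down` goes up
```

Because the return is positive, the logit of the taken action rises and the others fall; with a negative return the signs would flip.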

---

##  Summary

* Yes, **every step** outputs a full action probability vector.
* We **only use the log-prob of the taken action**.
* That log-prob is scaled by the **total return** from that point onward.
* Gradient ascent makes **good actions more probable**, and **bad ones less**.

---



**SUMMARY**


---

##  What’s Actually Happening?

At the **end of every episode**, the agent:

* Looks back at **all the steps it took** (states, actions, rewards).
* Evaluates how **good or bad the entire episode was** based on **total reward or return**.
* Then uses that signal to **update the weights of the policy network**, so it:

  * **Increases** the likelihood of the good actions.
  * **Decreases** the likelihood of the bad actions.

---

##  Per Episode vs Per Step

* In **Policy Gradient (REINFORCE)**:

  * The **policy network is updated only after complete episodes** (once per episode, or once per small batch such as the 3 episodes above), not after every step.
  * The log-probs and rewards are **collected** during the episode.
  * At the end, the network uses the **final return or discounted returns** to evaluate all the actions taken.

> The **final result** (the cumulative reward) decides whether the actions of the episode are **reinforced or discouraged**.
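
Putting the pieces together, a complete REINFORCE loop for the maze might look like the sketch below. It is an illustration under several assumptions, not a reference implementation: a table of 4 logits per cell stands in for the neural network, moves into the outer wall leave the mouse in place, episodes are cut off after 15 steps, and `num_episodes`, `lr`, and all function names are made up.

```python
import math
import random

SIZE, GOAL = 5, (4, 4)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def env_step(state, action):
    row = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
    col = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
    nxt = (row, col)
    return (nxt, 10.0, True) if nxt == GOAL else (nxt, -1.0, False)

def train(num_episodes=200, max_steps=15, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Tabular stand-in for the network: 4 logits per grid cell.
    theta = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(num_episodes):
        # Collect one full episode without touching the parameters.
        state, trajectory = (0, 0), []
        for _ in range(max_steps):
            probs = softmax(theta[state])
            action = rng.choices(range(4), weights=probs)[0]
            nxt, reward, done = env_step(state, action)
            trajectory.append((state, action, probs, reward))
            state = nxt
            if done:
                break
        # Single update at the end of the episode.
        ret = 0.0
        for s, a, probs, r in reversed(trajectory):
            ret += r   # return from this step onward (gamma = 1)
            for i in range(4):
                grad = (1.0 if i == a else 0.0) - probs[i]   # d log pi / d logit_i
                theta[s][i] += lr * ret * grad / len(trajectory)
    return theta

theta = train()
```

The gradients use the probabilities recorded while the episode was being collected, so the in-place additions amount to one averaged update per episode.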

---

