## **What is SARSA?**

SARSA is a **reinforcement learning algorithm** similar to Q-learning, but it updates its action-value estimates $Q(S, A)$ **using the action actually taken by the current policy** instead of the highest-valued next action.

The name **SARSA** comes from the tuple it uses for updates:

$$
(S, A, R, S', A')
$$

* $S$ = current state
* $A$ = action taken
* $R$ = reward received
* $S'$ = next state
* $A'$ = next action chosen by **the same policy**

---

## **On-Policy Updates**

* **On-policy** means the agent learns about the policy it is currently following.
* In SARSA, if the policy is ε-greedy, both the action in the current state and the action in the next state are chosen **by that same ε-greedy policy**.
* This makes SARSA’s updates reflect **the actual exploration** the agent does.
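
To make the ε-greedy choice concrete, here is a minimal sketch of action selection. The function name `epsilon_greedy`, the list-of-Q-values layout, and the random tie-breaking are assumptions made for illustration, not part of any particular library.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of a single state.

    q_values: list of floats, one per action (hypothetical layout).
    epsilon:  probability of exploring with a uniformly random action.
    rng:      the `random` module or a `random.Random` instance.
    """
    if rng.random() < epsilon:
        # Explore: any action, including the greedy one, may be drawn.
        return rng.randrange(len(q_values))
    # Exploit: pick a highest-valued action, breaking ties at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

In SARSA this same function would be called twice per step: once to pick $A$ in $S$ and once to pick $A'$ in $S'$.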

---

## **Update Rule**

SARSA’s update equation is:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma Q(S', A') - Q(S, A) \big]
$$

* Compare with Q-learning:

  * Q-learning uses $\max_{a'} Q(S', a')$ (best possible future action).
  * SARSA uses $Q(S', A')$ for the **action the policy actually picks**.
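
The two targets can be put side by side in code. This is a sketch under assumptions: `Q` is a plain dict mapping each state to a list of action values, and the function names and default `alpha` and `gamma` values are illustrative only. Terminal states are not handled here; in practice the bootstrap term is usually treated as 0 when $S'$ is terminal.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: bootstrap from the action a_next the policy actually chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: bootstrap from the best action in s_next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Same transition, different targets (values chosen so the arithmetic is exact).
Q1 = {0: [0.0, 0.0], 1: [1.0, 5.0]}
Q2 = {0: [0.0, 0.0], 1: [1.0, 5.0]}
print(sarsa_update(Q1, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.5))  # 0.75
print(q_learning_update(Q2, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.5))       # 1.75
```

The only difference is the bootstrap term: SARSA needs `a_next` as an input, while Q-learning ignores whatever action comes next.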

---

## **Key Difference from Q-learning**

* **Q-learning** is **off-policy**: its update target assumes the highest-valued action is taken in the next state, regardless of which action the agent actually takes.
* **SARSA** is **on-policy**: it learns about the actions it *really* takes, including exploratory ones.
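
One way to see the gap is to compare the two bootstrap terms in expectation. As a sketch, assume the ε-greedy policy explores uniformly over all $|\mathcal{A}|$ actions (an assumption; other exploration schemes give different weights). Then, for a given next state $S'$:

$$
\mathbb{E}\big[ Q(S', A') \mid S' \big] = (1 - \varepsilon) \max_{a'} Q(S', a') + \frac{\varepsilon}{|\mathcal{A}|} \sum_{a'} Q(S', a')
$$

This is never larger than the Q-learning term $\max_{a'} Q(S', a')$. For example, with two actions valued $1$ and $5$ and $\varepsilon = 0.2$, the SARSA term averages $0.8 \cdot 5 + 0.1 \cdot (1 + 5) = 4.6$, while Q-learning uses $5$.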

---

## **Tiny Example**

Imagine the agent has a risky shortcut to the goal:

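* **Q-learning** tends to learn “always take the shortcut” because its target assumes greedy behavior afterward, so occasional exploratory slips do not lower the shortcut's estimated value.
* **SARSA** tends to learn “take the safer route” while it is still exploring, because it updates using the actual ε-greedy moves, which include risky mistakes near the shortcut.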
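
The shortcut story can be reproduced in a toy problem. Everything below is a hypothetical setup invented for illustration: a `start` state with a `SHORTCUT` and a `SAFE` action, a `ledge` state where the agent either crosses carefully (+10) or slips (-100), and a safe route worth +5. Modelling the slip as an action that exploration can pick is a simplification, and ε is set fairly high (0.4) so the effect is easy to see.

```python
import random

SHORTCUT, SAFE = 0, 1   # actions in "start" (hypothetical names)
CAREFUL, SLIP = 0, 1    # actions on the "ledge"

def step(state, action):
    """Return (reward, next_state); next_state None means the episode ended."""
    if state == "start":
        return (0.0, "ledge") if action == SHORTCUT else (5.0, None)
    return (10.0, None) if action == CAREFUL else (-100.0, None)

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def train(algorithm, episodes=20000, alpha=0.01, gamma=0.9, epsilon=0.4, seed=0):
    """Train with 'sarsa' or 'q_learning'; both behave ε-greedily."""
    rng = random.Random(seed)
    Q = {"start": [0.0, 0.0], "ledge": [0.0, 0.0]}
    for _ in range(episodes):
        s = "start"
        a = epsilon_greedy(Q[s], epsilon, rng)
        while s is not None:
            r, s_next = step(s, a)
            if s_next is None:
                target, a_next = r, None             # no bootstrap at episode end
            else:
                # States never repeat within an episode here, so choosing
                # a_next before the update is harmless for Q-learning too.
                a_next = epsilon_greedy(Q[s_next], epsilon, rng)
                if algorithm == "sarsa":
                    bootstrap = Q[s_next][a_next]    # action actually chosen
                else:
                    bootstrap = max(Q[s_next])       # best action, taken or not
                target = r + gamma * bootstrap
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q

if __name__ == "__main__":
    for name in ("sarsa", "q_learning"):
        Q = train(name)
        print(name, "shortcut:", round(Q["start"][SHORTCUT], 2),
              "safe:", round(Q["start"][SAFE], 2))
```

With these assumed numbers, the slip is picked with probability $\varepsilon / 2 = 0.2$ on the ledge, so SARSA's estimate for the shortcut should hover around $0.9 \cdot (0.8 \cdot 10 - 0.2 \cdot 100) = -10.8$, below the safe route's $5$. Q-learning's estimate should approach $0.9 \cdot 10 = 9$, above $5$, even though the agent slips just as often while it explores.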