## **What is SARSA?**

SARSA is a **temporal-difference reinforcement learning algorithm** closely related to Q-learning. The difference lies in the update target: SARSA bootstraps from **the action the current policy actually takes in the next state**, whereas Q-learning bootstraps from the highest-valued next action.

The name **SARSA** comes from the tuple it uses for updates:

$$
(S, A, R, S', A')
$$

* $S$ = current state
* $A$ = action taken
* $R$ = reward received
* $S'$ = next state
* $A'$ = next action chosen by **the same policy**

---

## **On-Policy Updates**

* **On-policy** means the agent learns about the policy it is currently following.
* In SARSA, if the policy is ε-greedy, both the action in the current state and the action in the next state are chosen **by that same ε-greedy policy**.
* As a result, SARSA’s updates reflect **the exploration the agent actually does**, including its occasional random moves.
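
To make this concrete, an ε-greedy policy can be written as a probability distribution over actions. The notation $\mathcal{A}$ for the action set is an assumption introduced here, and the formula assumes a single greedy action and uniform exploration over all actions (the greedy one included):

$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$

For example, with $\varepsilon = 0.2$ and two actions, the greedy action is taken with probability $0.9$ and the other action with probability $0.1$. SARSA draws both $A$ and $A'$ from this distribution.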

---

## **Update Rule**

SARSA’s update equation is:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma Q(S', A') - Q(S, A) \big]
$$

* Compare with Q-learning:

  * Q-learning uses $\max_{a'} Q(S', a')$ (best possible future action).
  * SARSA uses $Q(S', A')$ for the **action the policy actually picks**.
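
The difference between the two targets is easiest to see on a single transition. The sketch below applies both update rules to the same made-up numbers; the Q-table, the transition, and the values of `alpha` and `gamma` are all hypothetical and chosen only for illustration.

```python
import numpy as np

alpha, gamma = 0.5, 0.9          # hypothetical step size and discount factor

# Hypothetical Q-table with 2 states (rows) and 2 actions (columns).
Q = np.array([[0.0, 1.0],
              [2.0, 5.0]])

# One observed transition (S, A, R, S', A').
# A' = 0 is an exploratory, non-greedy choice because Q[1, 1] > Q[1, 0].
s, a, r, s_next, a_next = 0, 1, -1.0, 1, 0

sarsa_target = r + gamma * Q[s_next, a_next]       # action the policy actually picked
q_learning_target = r + gamma * Q[s_next].max()    # best next action

# Apply each rule to the same starting value Q[s, a] (without modifying Q).
sarsa_new = Q[s, a] + alpha * (sarsa_target - Q[s, a])
q_learning_new = Q[s, a] + alpha * (q_learning_target - Q[s, a])

print(f"SARSA:      target={sarsa_target:.2f}, new Q(S, A)={sarsa_new:.2f}")
print(f"Q-learning: target={q_learning_target:.2f}, new Q(S, A)={q_learning_new:.2f}")
```

With these numbers the SARSA target is 0.8 and the Q-learning target is 3.5, so the same transition moves $Q(S, A)$ from 1.0 to 0.9 under SARSA and to 2.25 under Q-learning. The two rules agree whenever $A'$ happens to be the greedy action.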

---

## **Key Difference from Q-learning**

* **Q-learning** is **off-policy**: it learns the value of the greedy policy, as if it will always take the best action from the next state onward, regardless of how it actually behaves.
* **SARSA** is **on-policy**: it learns about the actions it *really* takes, including exploratory ones.

---

## **Tiny Example**

Imagine the agent can reach the goal through a risky shortcut, where a single wrong move is costly:

* **Q-learning** tends to learn “always take the shortcut” because its target assumes every later action is the greedy one.
* **SARSA** may learn “take the safer route” because its targets come from the actual ε-greedy moves, which occasionally turn into costly mistakes on the shortcut.
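
A rough back-of-the-envelope version of this argument, with made-up numbers: suppose the state on the shortcut has one action worth 10 (reach the goal) and one worth -100 (a costly mistake), and the safe route is worth 5. None of these values come from a real environment. SARSA's target uses a single sampled $A'$; the sketch computes the average of those samples under ε-greedy, which is what the estimate tends toward.

```python
import numpy as np

def epsilon_greedy_probs(q_row, eps):
    """Action probabilities of an ε-greedy policy (assumes a unique greedy action)."""
    n = len(q_row)
    probs = np.full(n, eps / n)              # every action gets eps / n
    probs[np.argmax(q_row)] += 1.0 - eps     # the greedy action gets the rest
    return probs

eps = 0.2
q_shortcut = np.array([10.0, -100.0])   # hypothetical: [reach goal, costly mistake]
safe_route_value = 5.0                  # hypothetical value of the safe alternative

probs = epsilon_greedy_probs(q_shortcut, eps)
sarsa_expected = float(probs @ q_shortcut)     # average of Q(S', A') over A'
q_learning_value = float(q_shortcut.max())     # max over a' of Q(S', a')

print("ε-greedy probabilities:", probs)
print("Q-learning sees the shortcut as worth:", q_learning_value)
print("SARSA sees it, on average, as worth:  ", sarsa_expected)
print("Q-learning prefers shortcut:", q_learning_value > safe_route_value)
print("SARSA prefers shortcut:     ", sarsa_expected > safe_route_value)
```

Here the shortcut looks like 10 to Q-learning but averages out to 0.9 × 10 + 0.1 × (-100) = -1 for SARSA, which is below the safe route's 5. Lowering `eps` shrinks the gap, since the exploratory mistakes become rarer.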

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Tiny 1D world: states 0..4 (4 is terminal/goal) ---
n_states = 5
actions = {0: "left", 1: "right"}
n_actions = len(actions)

def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    if state == 4:                      # terminal
        return state, 0, True
    if action == 0:                     # left
        next_state = max(0, state - 1)
    else:                               # right
        next_state = min(4, state + 1)
    reward = 10 if next_state == 4 else -1
    done = (next_state == 4)
    return next_state, reward, done

# --- SARSA hyperparams ---
alpha = 0.5       # learning rate
gamma = 0.9       # discount factor
epsilon = 0.2     # ε-greedy exploration
episodes = 300

Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state, eps):
    """Pick action with ε-greedy policy from Q."""
    if rng.random() < eps:
        return rng.integers(n_actions)           # explore
    # break ties randomly to avoid sticking to action 0
    best = np.flatnonzero(Q[state] == Q[state].max())
    return rng.choice(best)

# --- SARSA training loop ---
for ep in range(episodes):
    s = 0
    a = epsilon_greedy(s, epsilon)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        if not done:
            a_next = epsilon_greedy(s_next, epsilon)    # on-policy next action
            td_target = r + gamma * Q[s_next, a_next]
        else:
            a_next = None
            td_target = r                               # terminal; no bootstrap
        td_error = td_target - Q[s, a]
        Q[s, a] += alpha * td_error

        # a_next is None only when done is True, which ends the loop
        s, a = s_next, a_next

# --- Show learned Q and policy ---
print("Q-table:")
print(np.round(Q, 2))

# State 4 is terminal: its row stays all zeros, so the argmax of 0 ("left")
# reported for it below carries no meaning.
policy = np.argmax(Q, axis=1)
print("\nGreedy policy by state (0=left, 1=right):")
for s in range(n_states):
    print(f"state {s}: {policy[s]} ({actions[policy[s]]})")
```


```text
Q-table:
[[ 1.9   3.18]
 [ 1.21  5.71]
 [ 1.09  6.08]
 [ 2.11 10.  ]
 [ 0.    0.  ]]

Greedy policy by state (0=left, 1=right):
state 0: 1 (right)
state 1: 1 (right)
state 2: 1 (right)
state 3: 1 (right)
state 4: 0 (left)
```
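
As a sanity check on these numbers, the optimal action values for this tiny world can be computed exactly, since the dynamics are deterministic and known. The sketch below is not part of SARSA; it swaps in plain value iteration on Q (the max-based backup that Q-learning approximates from samples) and re-declares the same `step` function so that it runs on its own. The name `Q_star` and the sweep count are choices made for this illustration.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9

def step(state, action):
    """Same dynamics as the 1D world above: return (next_state, reward, done)."""
    if state == 4:
        return state, 0, True
    next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
    reward = 10 if next_state == 4 else -1
    return next_state, reward, next_state == 4

# Value iteration on Q: repeatedly apply the greedy (max) backup to every
# non-terminal state-action pair. 100 sweeps is far more than this chain needs.
Q_star = np.zeros((n_states, n_actions))
for _ in range(100):
    Q_new = np.zeros_like(Q_star)
    for s in range(n_states - 1):            # skip terminal state 4
        for a in range(n_actions):
            s_next, r, done = step(s, a)
            bootstrap = 0.0 if done else gamma * Q_star[s_next].max()
            Q_new[s, a] = r + bootstrap
    Q_star = Q_new

print(np.round(Q_star, 2))
```

The "right" column of `Q_star` comes out as 4.58, 6.2, 8 and 10 for states 0 to 3. The SARSA estimates printed above for the same entries (3.18, 5.71, 6.08, 10) are at or below these values. That is the direction one would expect, because SARSA evaluates the ε-greedy policy it follows, exploratory left moves included, rather than the purely greedy one. Part of the gap is likely also sampling noise from the large step size α = 0.5. The greedy policy is the same in both cases: move right in every non-terminal state.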
