# 🧠 Unit 5.3: SARSA (State-Action-Reward-State-Action)

**Course:** Advanced Machine Learning (AICC 303)  
**Topic:** 5.4 Model-Free Algorithms (SARSA)

**Difference from Q-Learning:**
*   **Q-Learning (Off-Policy):** Learns the value of the *optimal* (greedy) policy regardless of how the agent actually behaves, so exploratory actions do not enter the update target.
*   **SARSA (On-Policy):** Learns the value of the *policy being followed*, including its exploratory actions.

**Update Rule:**
$Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma Q(s',a') - Q(s,a)]$
(Notice that the target uses the next action $a'$ actually selected by the current policy, not $\max_{a'} Q(s',a')$ as in Q-Learning.)
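
To make the difference concrete, here is a minimal sketch of a single update on a hypothetical 2-state, 2-action Q-table. The numbers and the helper names `sarsa_target` and `q_learning_target` are illustrative assumptions, not part of the course code. It assumes the agent took an exploratory (non-greedy) action in the next state, which is exactly the case where the two targets differ.

```python
import numpy as np

# Hypothetical Q-table: 2 states x 2 actions (values chosen for illustration only)
q = np.array([[0.0, 0.5],
              [1.0, 2.0]])

alpha, gamma = 0.5, 0.9
s, a, r, s_next = 0, 1, 1.0, 1
a_next = 0  # action actually chosen in s_next (assumed to be an exploratory move)

def sarsa_target(q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action the agent will actually take
    return r + gamma * q[s_next, a_next]

def q_learning_target(q, r, s_next, gamma):
    # Off-policy: bootstrap from the greedy action, whatever the agent does next
    return r + gamma * np.max(q[s_next])

t_sarsa = sarsa_target(q, r, s_next, a_next, gamma)   # 1 + 0.9 * 1.0
t_qlearn = q_learning_target(q, r, s_next, gamma)     # 1 + 0.9 * 2.0

# One SARSA update of Q(s, a): 0.5 + 0.5 * (t_sarsa - 0.5)
q_sarsa = q.copy()
q_sarsa[s, a] += alpha * (t_sarsa - q_sarsa[s, a])

print("SARSA target:     ", t_sarsa)
print("Q-Learning target:", t_qlearn)
print("Q(s,a) after one SARSA update:", q_sarsa[s, a])
```

When the next action happens to be the greedy one, both targets coincide; they only diverge on exploratory steps.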

In [None]:
import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=False)
q_table = np.zeros([env.observation_space.n, env.action_space.n])

alpha = 0.8
gamma = 0.95
epsilon = 0.1
episodes = 1000

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[state, :]))

for i in range(episodes):
    state, _ = env.reset()
    action = choose_action(state)
    
    done = False

    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # On-policy: choose the next action with the same epsilon-greedy policy
        next_action = choose_action(next_state)

        # SARSA update; do not bootstrap from a terminal state
        q_next = 0.0 if terminated else q_table[next_state, next_action]
        q_target = reward + gamma * q_next
        q_table[state, action] += alpha * (q_target - q_table[state, action])
        
        state = next_state
        action = next_action

print("Final Q-Table (SARSA):\n", np.round(q_table, 2))
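
The raw table can be hard to read, so one possible way to inspect it is to print the greedy action for each state as a grid of arrows. The sketch below is an assumption-laden helper, not part of the course code: `greedy_policy_grid` and `ARROWS` are hypothetical names, and the default 4x4 shape assumes the standard FrozenLake map. It is demonstrated on a small made-up 2x2 table so that it runs without the environment.

```python
import numpy as np

# FrozenLake action encoding: 0=Left, 1=Down, 2=Right, 3=Up
ARROWS = np.array(["<", "v", ">", "^"])

def greedy_policy_grid(q_table, n_rows=4, n_cols=4):
    """Return the greedy action of every state as a grid of arrow characters."""
    best = np.argmax(q_table, axis=1)          # greedy action index per state
    return ARROWS[best].reshape(n_rows, n_cols)

# Hypothetical Q-table for a 2x2 grid (4 states x 4 actions), for illustration only
demo_q = np.array([[0.0, 0.1, 0.9, 0.0],
                   [0.0, 0.8, 0.1, 0.0],
                   [0.0, 0.0, 0.7, 0.2],
                   [0.0, 0.0, 0.0, 0.0]])

grid = greedy_policy_grid(demo_q, n_rows=2, n_cols=2)
print(grid)
```

For the table trained above, the call would be `greedy_policy_grid(q_table)`. Rows that are all zeros (holes, the goal, or states never visited) show `<` only because `np.argmax` breaks ties by returning the first index, so those arrows carry no information.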