# TD Learning on FrozenLake (SARSA & Q-learning)

This notebook implements **Temporal Difference (TD) learning** on Gymnasium's **FrozenLake-v1** environment using two control algorithms:

- **SARSA (on-policy)**
- **Q-learning (off-policy)**

Goal: understand the *code flow* (the code is commented throughout) and see how the two algorithms' learning styles differ.

> **Note:** Run the cells *in order from top to bottom*. If Gymnasium is not installed yet, run `pip install` in the Setup cell first.

## 0) Setup & Imports
If Gymnasium is not installed in your environment, uncomment and run the `pip install` line in the cell below.
Skip it if the packages are already installed.

In [None]:
# If needed, uncomment the following line to install the dependencies:
# !pip install gymnasium==0.29.1 numpy matplotlib

import numpy as np
import matplotlib.pyplot as plt
import random
import gymnasium as gym

## 1) Environment: FrozenLake
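FrozenLake is a 4x4 grid whose tiles are START (S), FROZEN (F), HOLE (H), and GOAL (G).
The agent must reach GOAL without falling into a HOLE. With `is_slippery=True`, actions are stochastic (the ice is *slippery*): the agent moves in the intended direction with probability 1/3 and otherwise slips to one of the two perpendicular directions.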
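
To make the slippery dynamics concrete, here is a minimal standalone sketch that does not use Gymnasium. The map layout, the action encoding (0 = left, 1 = down, 2 = right, 3 = up), and the equal 1/3 split are assumptions taken from the Gymnasium documentation for FrozenLake-v1, not from this notebook; `move`, `tile`, and `slippery_outcomes` are hypothetical helper names.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed default 4x4 layout and action encoding of FrozenLake-v1.
MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}  # (d_row, d_col)
N = 4


def tile(state):
    """Return the map letter (S, F, H, or G) for a state index."""
    row, col = divmod(state, N)
    return MAP_4X4[row][col]


def move(state, action):
    """Deterministic move; walking into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col


def slippery_outcomes(state, action):
    """Return {next_state: probability} for one slippery step."""
    outcomes = defaultdict(Fraction)
    # The intended action and its two perpendicular neighbours are equally likely.
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        outcomes[move(state, actual)] += Fraction(1, 3)
    return dict(outcomes)


for next_state, prob in slippery_outcomes(0, RIGHT).items():
    print(f"state {next_state} ({tile(next_state)}): probability {prob}")
```

Under these assumptions, choosing RIGHT in the start state reaches the tile to the right only a third of the time, which is why even a good policy fails in a noticeable share of episodes.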

In [None]:
env = gym.make("FrozenLake-v1", is_slippery=True)
n_states = env.observation_space.n
n_actions = env.action_space.n
n_states, n_actions

## 2) Common Utilities
- **epsilon-greedy**: balances exploration and exploitation.
- **evaluate_policy**: measures the success rate of the policy that is greedy with respect to Q.
- **rolling_mean**: smooths the result curves.
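
As a rough illustration of what epsilon-greedy selection means in terms of probabilities, the sketch below computes the chance of picking each action. It assumes the random branch draws uniformly over all actions (including the greedy one) and that ties go to the first maximal index, as `np.argmax` does; `epsilon_greedy_probs` is a hypothetical helper, not part of the notebook.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Per-action selection probabilities under epsilon-greedy.

    Every action gets epsilon / n from the random branch; the greedy action
    additionally gets the remaining 1 - epsilon.
    """
    n = len(q_values)
    # max() returns the first maximal index, mirroring np.argmax tie-breaking.
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
for action, p in enumerate(probs):
    print(f"action {action}: {p:.2f}")
```

With four actions and `epsilon=0.2`, the greedy action is chosen about 85% of the time and each other action about 5% of the time.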

In [None]:
def epsilon_greedy(Q, s, epsilon):
    """Pick an action epsilon-greedily from the Q table.
    - With probability epsilon: pick a random action (exploration)
    - Otherwise: pick the argmax-Q action (exploitation)
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)
    else:
        return int(np.argmax(Q[s]))

def evaluate_policy(Q, episodes=500):
    """Evaluate the greedy policy for Q: return the success rate (fraction of episodes ending with a positive reward)."""
    wins = 0
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = int(np.argmax(Q[s]))
            s, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            if done and r > 0:
                wins += 1
    return wins / episodes

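def rolling_mean(x, w=50):
    """Moving average over a window of w points; returns x as floats if it is shorter than w."""
    x = np.asarray(x, dtype=float)
    if len(x) < w:
        return x
    c = np.cumsum(np.insert(x, 0, 0.0))  # prefix sums with a leading zero
    return (c[w:] - c[:-w]) / float(w)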

## 3) SARSA (on-policy)
**Update rule:**
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)]$$

- The update uses the **next action that is actually taken** (on-policy).
- Suited to risky or stochastic environments; because the target includes exploratory actions, the learned behavior tends to be more conservative.
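
A single update can be worked through by hand. The sketch below applies the rule above to one transition with made-up numbers; `sarsa_update` and all values are hypothetical and only illustrate the arithmetic.

```python
def sarsa_update(q_sa, r, q_next, alpha=0.1, gamma=0.99, terminal=False):
    """Return the new Q(s, a) after one SARSA step.

    q_next is Q(s', a') for the action a' that will actually be taken in s'.
    For a terminal transition the target is just the reward.
    """
    target = r if terminal else r + gamma * q_next
    td_error = target - q_sa
    return q_sa + alpha * td_error


# Hypothetical transition: Q(s, a) = 0.5, reward 0, Q(s', a') = 0.8.
new_q = sarsa_update(q_sa=0.5, r=0.0, q_next=0.8)
print(f"updated Q(s, a): {new_q:.4f}")

# Hypothetical terminal transition that reaches the goal (reward 1).
new_q_terminal = sarsa_update(q_sa=0.5, r=1.0, q_next=0.0, terminal=True)
print(f"updated Q(s, a) at the goal: {new_q_terminal:.4f}")
```

In the first case the target is 0.99 × 0.8 = 0.792, the TD error is 0.292, and the value moves a tenth of the way toward the target.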

In [None]:
def train_sarsa(
    num_episodes=5000,
    alpha=0.1,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.05,
):
    Q = np.zeros((n_states, n_actions), dtype=float)
    eps_history = []
    perf_history = []  # success rate from periodic evaluations

    for ep in range(1, num_episodes + 1):
        # Linearly decay epsilon from eps_start toward eps_end
        epsilon = max(eps_end, eps_start - (eps_start - eps_end) * ep / num_episodes)
        eps_history.append(epsilon)

        s, _ = env.reset()
        a = epsilon_greedy(Q, s, epsilon)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            if not done:
                # Sample the next action once: it appears in the target AND
                # is the action executed on the next step (on-policy).
                a2 = epsilon_greedy(Q, s2, epsilon)
                target = r + gamma * Q[s2, a2]
            else:
                a2 = None  # episode ended, no next action
                target = r  # terminal: no bootstrap term
            td_error = target - Q[s, a]
            Q[s, a] += alpha * td_error
            s, a = s2, a2

        # periodic evaluation (every 200 episodes)
        if ep % 200 == 0:
            perf = evaluate_policy(Q, episodes=300)
            perf_history.append(perf)

    return Q, eps_history, perf_history


## 4) Q-learning (off-policy)
**Update rule:**
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)]$$

- The target uses the **greedy (highest-valued) action** in the next state, regardless of which action the agent takes next (off-policy).
- Tends to pursue the optimal policy more aggressively; popular because it is simple and robust.
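
The only difference from SARSA is the bootstrap term in the target. The sketch below computes both targets for the same hypothetical transition; the Q-values, reward, and the exploratory next action are made up for illustration, and the function names are not part of the notebook.

```python
def sarsa_target(r, q_next_row, a_next, gamma=0.99):
    """SARSA bootstraps from the action actually taken in the next state."""
    return r + gamma * q_next_row[a_next]


def q_learning_target(r, q_next_row, gamma=0.99):
    """Q-learning bootstraps from the best action in the next state."""
    return r + gamma * max(q_next_row)


q_next_row = [0.2, 0.8, 0.1, 0.4]  # hypothetical Q(s', .) values
a_next = 3                         # hypothetical exploratory action in s'

print(f"SARSA target:      {sarsa_target(0.0, q_next_row, a_next):.3f}")  # uses 0.99 * 0.4
print(f"Q-learning target: {q_learning_target(0.0, q_next_row):.3f}")     # uses 0.99 * 0.8
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.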

In [None]:
def train_q_learning(
    num_episodes=5000,
    alpha=0.1,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.05,
):
    Q = np.zeros((n_states, n_actions), dtype=float)
    eps_history = []
    perf_history = []

    for ep in range(1, num_episodes + 1):
        epsilon = max(eps_end, eps_start - (eps_start - eps_end) * ep / num_episodes)
        eps_history.append(epsilon)

        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            best_next = 0 if done else np.max(Q[s2])
            target = r + gamma * best_next
            td_error = target - Q[s, a]
            Q[s, a] += alpha * td_error
            s = s2

        if ep % 200 == 0:
            perf = evaluate_policy(Q, episodes=300)
            perf_history.append(perf)

    return Q, eps_history, perf_history


## 5) Train & Compare
We train both algorithms with the same hyperparameters (feel free to change them).
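
Both training functions use the same linear epsilon schedule, so it helps to see what it produces before changing the hyperparameters. The sketch below isolates that formula in a hypothetical helper, `linear_epsilon`, and prints epsilon at a few episodes of a 10,000-episode run.

```python
def linear_epsilon(ep, num_episodes, eps_start=1.0, eps_end=0.05):
    """Epsilon at episode ep (1-based), decayed linearly and clamped at eps_end."""
    return max(eps_end, eps_start - (eps_start - eps_end) * ep / num_episodes)


for ep in (1, 2500, 5000, 10000):
    print(f"episode {ep:>5}: epsilon = {linear_epsilon(ep, 10000):.4f}")
```

Because the decay is tied to `num_episodes`, a longer run also means more exploration early on: epsilon only reaches `eps_end` at the final episode.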

In [None]:
EPISODES = 10000
ALPHA = 0.1
GAMMA = 0.99
EPS_START = 1.0
EPS_END = 0.05

Q_sarsa, eps_sarsa, perf_sarsa = train_sarsa(EPISODES, ALPHA, GAMMA, EPS_START, EPS_END)
Q_q, eps_q, perf_q = train_q_learning(EPISODES, ALPHA, GAMMA, EPS_START, EPS_END)

avg_sarsa = evaluate_policy(Q_sarsa, episodes=1000)
avg_q = evaluate_policy(Q_q, episodes=1000)
avg_sarsa, avg_q

## 6) Performance Plot (Success Rate)
We plot the success rate from the periodic evaluations (every 200 episodes).
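
Each point on the curve is an estimate from 300 episodes, so the raw curve can be noisy. A moving average is one way to smooth it. The sketch below is a standard-library analogue of the notebook's `rolling_mean` helper, using a hypothetical name (`rolling_mean_py`) and made-up success rates.

```python
from itertools import accumulate


def rolling_mean_py(values, window=3):
    """Moving average via prefix sums; returns the input as floats if too short."""
    values = [float(v) for v in values]
    if len(values) < window:
        return values
    prefix = [0.0] + list(accumulate(values))
    # Mean of values[i : i + window] is (prefix[i + window] - prefix[i]) / window.
    return [
        (prefix[i + window] - prefix[i]) / window
        for i in range(len(values) - window + 1)
    ]


success_rates = [0.0, 0.3, 0.6, 0.3, 0.9]  # hypothetical evaluation results
smoothed = rolling_mean_py(success_rates, window=3)
print([round(v, 3) for v in smoothed])
```

The smoothed series is `window - 1` points shorter than the input, so the x-axis has to be trimmed to match (for example `x[window - 1:]`) before plotting it.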

In [None]:
x = np.arange(200, EPISODES+1, 200)
plt.figure()
plt.plot(x, perf_sarsa, marker='o', label='SARSA')
plt.plot(x, perf_q, marker='s', label='Q-learning')
plt.xlabel('Episodes')
plt.ylabel('Success rate (eval 300 eps)')
plt.title('FrozenLake TD Control: SARSA vs Q-learning')
plt.legend()
plt.show()

## 7) Comparison Summary
- **SARSA**: on-policy → the target uses the action actually taken; often more conservative.
- **Q-learning**: off-policy → the target uses the best action ($\max$); tends to be more aggressive.
- Final performance is often similar; in slippery environments, SARSA may be more stable, while Q-learning may approach the optimum faster but with more fluctuation.

Try changing the *hyperparameters* (episodes, alpha, gamma, epsilon schedule) to explore further.