# TD Learning on CartPole (SARSA & Q-learning)

This notebook applies **Temporal Difference (TD) learning** to the **CartPole-v1** environment (Gymnasium),
using **discretization** (mapping the continuous state to discrete indices) so that a tabular Q-function can be used.

Algorithms:
- **SARSA (on-policy)**
- **Q-learning (off-policy)**

The code is commented in detail so that the flow is easy to follow.

## 0) Setup & Imports
If needed, run `pip install` to install Gymnasium.
You can skip the `pip` cell if it is already installed.

In [None]:
# Uncomment the following line if needed:
# !pip install gymnasium==0.29.1 numpy matplotlib

import numpy as np
import matplotlib.pyplot as plt
import random
import gymnasium as gym

## 1) Environment: CartPole-v1
CartPole has a continuous 4-dimensional state:
1) cart position, 2) cart velocity, 3) pole angle, 4) pole angular velocity.

Actions (discrete): 0 = push left, 1 = push right.

An episode terminates when the pole tilts beyond the angle limit (about ±12°) or the cart moves beyond the position limit (±2.4). The reward is +1 per step, and CartPole-v1 truncates an episode after 500 steps.

In [None]:
env = gym.make("CartPole-v1")
n_actions = env.action_space.n
n_actions

## 2) Discretization (State Binning)
Because the state is continuous, we need to map it to discrete indices before a Q-table can be used.

A simple strategy:
- Set a (min, max) range for each dimension (clipping for stability).
- Split each dimension into a number of bins (e.g. [6, 6, 12, 12]).
- Use `np.digitize` to map each value to a bin index.
- Combine the per-dimension indices into a single flat index with a mixed-radix encoding, or keep them as a tuple of indices (the approach used here).
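
As a quick illustration of the steps above, the sketch below bins a few cart positions with `np.digitize` and then flattens a tuple of bin indices with `np.ravel_multi_index`. The sample values and the names `edges`, `positions`, and `dims` are illustrative assumptions. Building the edges from interior points only is one reasonable choice, not necessarily the one the notebook makes.

```python
import numpy as np

# Assumed example: 6 bins over the cart-position range [-2.4, 2.4].
# Six bins need five interior edges; np.digitize then returns indices 0..5.
edges = np.linspace(-2.4, 2.4, 6 + 1)[1:-1]

positions = np.array([-2.0, -1.0, 0.3, 1.0, 2.0])
bin_idx = np.digitize(positions, edges)
print(bin_idx)  # one bin index per position

# Mixed-radix flattening of a 4-tuple of bin indices into a single integer,
# and the inverse mapping back to the tuple.
dims = (6, 6, 12, 12)
flat = np.ravel_multi_index((1, 2, 3, 4), dims)
print(flat, np.unravel_index(flat, dims))
```

Keeping the tuple form lets the Q-table be indexed directly as a 5-dimensional array, while the flat form is convenient when the table is stored as a 2-D `(n_states, n_actions)` array.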

In [None]:
# Number of bins for each state dimension
NUM_BINS = (6, 6, 12, 12)  # cart_pos, cart_vel, pole_angle, pole_vel

# Clipping bounds per dimension so that unbounded values (the velocities) still map to a finite bin
STATE_BOUNDS = np.array([
    [-2.4, 2.4],        # cart position (env termination bound)
    [-3.0, 3.0],        # cart velocity (clipped to a plausible range)
    [-0.2095, 0.2095],  # pole angle in radians (~12 degrees)
    [-3.5, 3.5]         # pole angular velocity (clipped)
], dtype=float)

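def create_bins(low, high, bins):
    """Return the bins - 1 interior edges that split [low, high] into `bins` equal-width bins for np.digitize."""
    # Dropping the two outer edges keeps every index 0..bins-1 reachable after clipping
    return np.linspace(low, high, bins + 1)[1:-1]

# Precompute the bin edges for each dimension
BIN_EDGES = [create_bins(STATE_BOUNDS[i, 0], STATE_BOUNDS[i, 1], NUM_BINS[i]) for i in range(4)]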

def discretize_state(state):
    """Map a continuous state to a tuple of discrete indices (i0, i1, i2, i3)."""
    s = np.array(state, dtype=float)
    # Clip into the configured bounds
    s = np.clip(s, STATE_BOUNDS[:, 0], STATE_BOUNDS[:, 1])
    idxs = [int(np.digitize(s[i], BIN_EDGES[i])) for i in range(4)]
    # Make sure every index lies in [0, bins - 1]
    idxs = [min(NUM_BINS[i] - 1, max(0, idxs[i])) for i in range(4)]
    return tuple(idxs)

def q_shape():
    return (*NUM_BINS, n_actions)  # e.g. (6, 6, 12, 12, 2)

q_shape()

## 3) Utilities: epsilon-greedy, evaluation, rolling mean
- `epsilon_greedy`: selects an action from the discrete Q-table, exploring with probability epsilon.
- `evaluate`: average total reward per episode under the greedy policy.
- `rolling_mean`: smooths the reward curve for plotting.
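
The cumulative-sum trick behind `rolling_mean` can be checked on a tiny input. The sketch below re-declares the function so that it runs on its own and compares it with `np.convolve` in `"valid"` mode. The sample list is an arbitrary assumption.

```python
import numpy as np

def rolling_mean(x, w=50):
    """Mean over a sliding window of width w (x is returned as-is if shorter than w)."""
    if len(x) < w:
        return np.array(x, dtype=float)
    c = np.cumsum(np.insert(x, 0, 0))
    return (c[w:] - c[:-w]) / float(w)

sample = [1, 2, 3, 4, 5]
smoothed = rolling_mean(sample, w=2)

# Reference: a uniform moving average computed by convolution.
reference = np.convolve(sample, np.ones(2) / 2, mode="valid")
print(smoothed, reference)
```

The result has `len(x) - w + 1` entries, so the smoothed curve is `w - 1` points shorter than the raw reward curve when both are plotted.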

In [None]:
def epsilon_greedy(Q, state_idx, epsilon):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[state_idx]))

def evaluate(Q, episodes=20):
    """Evaluate the greedy policy derived from Q; return the average total reward per episode."""
    tot = 0.0
    for _ in range(episodes):
        s, _ = env.reset()
        s_idx = discretize_state(s)
        done = False
        ep_reward = 0
        while not done:
            # Greedy action: no exploration during evaluation
            a = int(np.argmax(Q[s_idx]))
            s, r, terminated, truncated, _ = env.step(a)
            s_idx = discretize_state(s)
            ep_reward += r
            done = terminated or truncated
        tot += ep_reward
    return tot / episodes

def rolling_mean(x, w=50):
    if len(x) < w:
        return np.array(x, dtype=float)
    c = np.cumsum(np.insert(x, 0, 0))
    return (c[w:] - c[:-w]) / float(w)

## 4) SARSA (on-policy)
The update uses the next action that the policy actually takes (on-policy).
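
Written out, the tabular SARSA update takes the standard form below, with $\alpha$ as the learning rate (`alpha`), $\gamma$ as the discount factor (`gamma`), and $a'$ as the action that the epsilon-greedy policy selects in the next state $s'$:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[\, r + \gamma\, Q(s', a') - Q(s, a) \,\bigr]
$$

When $s'$ is terminal there is no successor value, so the target reduces to $r$.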

In [None]:
def train_sarsa(
    episodes=2000,
    alpha=0.1,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.05,
):
    Q = np.zeros(q_shape(), dtype=float)
    rewards = []
    eval_hist = []

    for ep in range(1, episodes+1):
        epsilon = max(eps_end, eps_start - (eps_start - eps_end) * ep / episodes)
        s, _ = env.reset()
        s_idx = discretize_state(s)
        a = epsilon_greedy(Q, s_idx, epsilon)

        done = False
        ep_reward = 0.0
        while not done:
            s_next, r, term, trunc, _ = env.step(a)
            ep_reward += r
            done = term or trunc
            s_next_idx = discretize_state(s_next)
            # On-policy: sample the next action from the same epsilon-greedy policy
            a_next = epsilon_greedy(Q, s_next_idx, epsilon)
            # Bootstrap unless the state is truly terminal; a time-limit
            # truncation still has a successor value to bootstrap from
            target = r if term else r + gamma * Q[s_next_idx + (a_next,)]
            td_error = target - Q[s_idx + (a,)]
            Q[s_idx + (a,)] += alpha * td_error
            s_idx, a = s_next_idx, a_next

        rewards.append(ep_reward)
        if ep % 50 == 0:
            eval_hist.append(evaluate(Q, episodes=10))

    return Q, rewards, eval_hist


## 5) Q-learning (off-policy)
The update uses the greedy (highest-valued) action in the next state, regardless of the action actually taken (off-policy).
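
To make the contrast with SARSA concrete, the sketch below computes both TD targets for a single non-terminal transition. All numbers, and the names `q_next` and `a_next`, are made-up assumptions for illustration.

```python
import numpy as np

# Assumed toy values for one non-terminal transition (s, a, r, s').
gamma = 0.99
r = 1.0
q_next = np.array([2.0, 5.0])  # hypothetical Q[s'] for actions 0 and 1
a_next = 0                     # action an epsilon-greedy policy happened to pick in s'

# SARSA bootstraps from the action actually selected in s'.
sarsa_target = r + gamma * q_next[a_next]

# Q-learning bootstraps from the best action in s', whatever is taken next.
q_learning_target = r + gamma * np.max(q_next)

print(sarsa_target, q_learning_target)
```

Here the exploratory choice `a_next = 0` pulls the SARSA target down to about 2.98, while the Q-learning target stays at about 5.95. The two targets coincide whenever the selected action is also the greedy one.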

In [None]:
def train_q_learning(
    episodes=2000,
    alpha=0.1,
    gamma=0.99,
    eps_start=1.0,
    eps_end=0.05,
):
    Q = np.zeros(q_shape(), dtype=float)
    rewards = []
    eval_hist = []

    for ep in range(1, episodes+1):
        epsilon = max(eps_end, eps_start - (eps_start - eps_end) * ep / episodes)
        s, _ = env.reset()
        s_idx = discretize_state(s)
        done = False
        ep_reward = 0.0
        while not done:
            a = epsilon_greedy(Q, s_idx, epsilon)
            s_next, r, term, trunc, _ = env.step(a)
            ep_reward += r
            done = term or trunc
            s_next_idx = discretize_state(s_next)
            # Off-policy: bootstrap from the greedy value of the next state.
            # Only a true terminal state has zero future value; a time-limit
            # truncation does not.
            best_next = 0.0 if term else np.max(Q[s_next_idx])
            target = r + gamma * best_next
            td_error = target - Q[s_idx + (a,)]
            Q[s_idx + (a,)] += alpha * td_error
            s_idx = s_next_idx

        rewards.append(ep_reward)
        if ep % 50 == 0:
            eval_hist.append(evaluate(Q, episodes=10))

    return Q, rewards, eval_hist


## 6) Train & Compare
You can modify the hyperparameters to see how they affect performance.
The reward per episode equals the number of steps the pole stays up (at most 500).
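
One hyperparameter effect worth inspecting before training is the exploration schedule. The sketch below reproduces the linear epsilon decay expression used in both training functions. The helper name `linear_epsilon` is hypothetical and introduced only for this illustration.

```python
def linear_epsilon(ep, episodes, eps_start=1.0, eps_end=0.05):
    """Linearly anneal epsilon from eps_start toward eps_end over `episodes`."""
    return max(eps_end, eps_start - (eps_start - eps_end) * ep / episodes)

episodes = 4000
schedule = {ep: linear_epsilon(ep, episodes) for ep in (1, 1000, 2000, 4000)}
for ep, eps in schedule.items():
    print(ep, round(eps, 4))
```

With this schedule the agent only reaches `eps_end` in the final episode, so most of training is spent with substantial exploration. Changing `episodes` stretches or compresses the decay accordingly.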

In [None]:
EPISODES = 4000
ALPHA = 0.1
GAMMA = 0.99
EPS_START = 1.0
EPS_END = 0.05

Q_sarsa, rew_sarsa, eval_sarsa = train_sarsa(EPISODES, ALPHA, GAMMA, EPS_START, EPS_END)
Q_q, rew_q, eval_q = train_q_learning(EPISODES, ALPHA, GAMMA, EPS_START, EPS_END)

print('Greedy eval (avg reward) SARSA:', evaluate(Q_sarsa, episodes=30))
print('Greedy eval (avg reward) Q-learn:', evaluate(Q_q, episodes=30))

## 7) Plot Reward per Episode (with Rolling Mean)
Each algorithm is plotted separately so that the learning trend is clear.

In [None]:
plt.figure()
plt.plot(rew_sarsa, alpha=0.4, label='per episode')
plt.plot(rolling_mean(rew_sarsa, 50), label='rolling mean (w=50)')
plt.title('SARSA: Reward per Episode (CartPole)')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend()
plt.show()

plt.figure()
plt.plot(rew_q, alpha=0.4, label='per episode')
plt.plot(rolling_mean(rew_q, 50), label='rolling mean (w=50)')
plt.title('Q-learning: Reward per Episode (CartPole)')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend()
plt.show()

## 8) Notes & Variations
- Discretization makes a tabular Q-function easy to implement, at the cost of **quantization error**.
- You can increase the number of bins for particular dimensions (e.g. the pole angle) for finer resolution, but the Q-table grows accordingly.
- More powerful alternatives: **function approximation** (linear, tile coding) or a **Deep Q-Network (DQN)** if you want to avoid discretization.
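
The growth of the Q-table mentioned above is easy to quantify. The sketch below counts the entries of a dense table for the bin configuration used in this notebook and for an assumed variant with twice as many pole-angle bins. The helper name `q_table_entries` is hypothetical.

```python
import math

def q_table_entries(num_bins, n_actions=2):
    """Number of entries in a dense Q-table with the given bins per dimension."""
    return math.prod(num_bins) * n_actions

base = q_table_entries((6, 6, 12, 12))   # the notebook's NUM_BINS
finer = q_table_entries((6, 6, 24, 12))  # assumed variant: double the pole-angle bins
print(base, finer)
```

Doubling the bins in one dimension doubles the table, so refining all four dimensions at once grows it multiplicatively. Each entry also needs enough visits to be estimated well, which is why finer grids tend to need more training episodes.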