# Chapter 18: Reinforcement Learning

**Objective:** Understand the basic concepts of Reinforcement Learning (RL), the Markov Decision Process, tabular Q-Learning, Policy Gradient (REINFORCE), and Deep Q-Network (DQN) through practical examples.

---

## 1. RL Fundamentals

* The **agent** interacts with the **environment**.

* At each timestep $t$:

  * The agent observes the state $s_t$
  * It chooses an action $a_t$
  * It receives a reward $r_{t+1}$
  * It moves to the next state $s_{t+1}$

* Goal: maximize the *expected discounted return*, where the return from timestep $t$ is:

  $
  G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
  $

  where $\gamma \in [0,1)$ is the discount factor; values closer to 0 weight immediate rewards more heavily.
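
As a small illustration of the return formula above, the sketch below computes $G_t$ for every timestep of a finite episode by working backwards with $G_t = r_{t+1} + \gamma G_{t+1}$. The function name `discounted_returns` and the sample rewards are made up for this example.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for a finite episode.

    rewards[t] holds r_{t+1}, the reward received after acting at timestep t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}, with G = 0 after the last step.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical 3-step episode with a reward only at the end.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

The final reward of 1 is worth only 0.25 at the first timestep because it is discounted twice.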

---

### 1.1 Markov Decision Process (MDP)

* An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  * $\mathcal{S}$: set of states
  * $\mathcal{A}$: set of actions
  * $P(s'|s,a)$: transition probability
  * $R(s,a)$: expected reward
  * $\gamma$: discount factor

* **Policy** $\pi(a|s)$: probability of choosing action $a$ in state $s$

* **Value function** (expected return when starting in state $s$ and following $\pi$):

  $
  V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]
  $

* **Q-function** (expected return when taking action $a$ in state $s$ and following $\pi$ afterwards):

  $
  Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]
  $
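
To make these definitions concrete, the following sketch evaluates a fixed policy on a hypothetical two-state MDP by iterating the Bellman expectation equations $Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')$ and $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$. The transition probabilities, rewards, and policy are invented for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP with 2 states and 2 actions.
# P[s, a, s2] = P(s2 | s, a); R[s, a] = expected immediate reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],   # from state 1: action 0 moves to state 0, action 1 stays
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],                 # only "stay in state 1" is rewarded
])
# pi[s, a] = pi(a | s): always choose action 1.
pi = np.array([[0.0, 1.0], [0.0, 1.0]])

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V            # Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)
    V_new = np.sum(pi * Q, axis=1)   # V(s) = sum_a pi(a|s) Q(s,a)
    converged = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if converged:
        break

print(np.round(V, 3))
print(np.round(Q, 3))
```

Under this policy the agent collects a reward of 1 at every step once it is in state 1, so $V^\pi(1) = 1/(1-\gamma) = 10$ and $V^\pi(0) = \gamma \cdot 10 = 9$.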

---

## 2. Tabular Q-Learning (Off-Policy)

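* Q-learning is an off-policy temporal-difference (TD) control method: it learns the values of the greedy policy while the agent follows a different, exploratory behavior policy.

* **Update rule:**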

  $
  Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]
  $

  * $\alpha$: learning rate
  * $\gamma$: discount factor
  * $r$: immediate reward
  * $s'$: next state
  * $a'$: action at $s'$ over which the maximum is taken (the greedy action under the current $Q$)

* **Exploration vs. exploitation**:

  * Use an **ε-greedy** strategy:

    * Choose a random action with probability $\varepsilon$
    * Choose the best action (according to the Q-table) with probability $1 - \varepsilon$
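
The update rule and the ε-greedy choice above can be sketched in isolation as follows. This is a minimal illustration, not the training code itself: the helper names `epsilon_greedy` and `q_update` and the tiny 2x2 table are assumptions made for the example.

```python
import random
import numpy as np

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return int(np.argmax(q_row))

def q_update(Q, s, a, r, s2, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_error = r + gamma * np.max(Q[s2]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table with 2 states and 2 actions.
Q = np.zeros((2, 2))
Q[1] = [0.0, 1.0]

# One transition: in state 0, action 1 gave reward 0 and led to state 1.
delta = q_update(Q, s=0, a=1, r=0.0, s2=1, alpha=0.5, gamma=0.9)
print(delta, Q[0, 1])  # 0.9 0.45
```

The target is $0 + 0.9 \cdot \max(0, 1) = 0.9$, and with $\alpha = 0.5$ the entry moves halfway from 0 toward it.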

---

In [None]:
import numpy as np
# Compatibility shim: older gym releases reference np.bool8, which recent NumPy versions no longer provide.
if not hasattr(np, "bool8"):
    np.bool8 = np.bool_

import gym, random

env = gym.make("FrozenLake-v1", is_slippery=False)
nS = env.observation_space.n
nA = env.action_space.n

α = 0.8
γ = 0.95

# ε schedule: start at ε_start, multiply by ε_decay after each episode, never go below ε_min
ε_start, ε_min, ε_decay = 1.0, 0.01, 0.995

n_episodes = 10000
eval_interval = 1000

Q = np.zeros((nS, nA))

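def unpack_reset(ret):
    # reset() returns obs in older gym and (obs, info) in gym>=0.26
    return ret if not isinstance(ret, tuple) else ret[0]

def unpack_step(ret):
    # step() returns 4 values in older gym and 5 values in gym>=0.26
    if len(ret) == 4:
        s2, r, done, _ = ret
    else:
        s2, r, terminated, truncated, _ = ret
        done = terminated or truncated
    return s2, r, done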

def choose_action(s, ε):
    return env.action_space.sample() if random.random() < ε else np.argmax(Q[s])

def evaluate(n_eval):
    """Average reward of the greedy policy (ε=0) over n_eval episodes."""
    total = 0
    for _ in range(n_eval):
        s = unpack_reset(env.reset())
        done, ep_r = False, 0
        while not done:
            a = np.argmax(Q[s])
            s, r, done = unpack_step(env.step(a))
            ep_r += r
        total += ep_r
    return total / n_eval

# Track performance
scores = []
ε = ε_start
for ep in range(1, n_episodes+1):
    s = unpack_reset(env.reset())
    done = False
    while not done:
        a = choose_action(s, ε)
        s2, r, done = unpack_step(env.step(a))
        # Q-learning update: move Q[s,a] toward the TD target r + γ max_a' Q[s2,a']
        Q[s,a] += α * (r + γ * np.max(Q[s2]) - Q[s,a])
        s = s2
    # decay ε
    ε = max(ε_min, ε * ε_decay)

    # periodically evaluate the greedy policy
    if ep % eval_interval == 0:
        avg = evaluate(100)
        scores.append((ep, avg))
        print(f"Episode {ep:5d}, ε={ε:.3f}, eval avg reward={avg:.2f}")

# final evaluation
print("Final average reward over 200 eval episodes:", evaluate(200))

Episode  1000, ε=0.010, eval avg reward=1.00
Episode  2000, ε=0.010, eval avg reward=1.00
Episode  3000, ε=0.010, eval avg reward=1.00
Episode  4000, ε=0.010, eval avg reward=1.00
Episode  5000, ε=0.010, eval avg reward=1.00
Episode  6000, ε=0.010, eval avg reward=1.00
Episode  7000, ε=0.010, eval avg reward=1.00
Episode  8000, ε=0.010, eval avg reward=1.00
Episode  9000, ε=0.010, eval avg reward=1.00
Episode 10000, ε=0.010, eval avg reward=1.00
Final average reward over 200 eval episodes: 1.0
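
The log above shows ε already at its floor of 0.010 by the first evaluation at episode 1000. A quick sketch of the same schedule used in the training loop (ε is multiplied by `ε_decay` once per episode and clipped at `ε_min`) shows why. The helper name `episodes_until_floor` is invented here.

```python
def episodes_until_floor(eps_start, eps_min, eps_decay):
    """Count episodes of multiplicative decay until epsilon reaches eps_min."""
    eps, n = eps_start, 0
    while eps > eps_min:
        eps = max(eps_min, eps * eps_decay)
        n += 1
    return n


print(episodes_until_floor(1.0, 0.01, 0.995))  # 919
```

With these settings exploration drops to 1% after 919 of the 10,000 episodes, so most of the training run is almost purely greedy. That appears to be enough on the deterministic FrozenLake map used here; a slower decay may be worth trying on harder environments.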


## 3. Policy Gradient (REINFORCE)

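* **Objective**: maximize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[G_0]$ from the start of an episode under the parameterized policy $\pi_\theta$
* **Gradient estimator** (averaged over $N$ sampled episodes, where episode $i$ has $T_i$ steps):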

  $
    \nabla_\theta J(\theta)
    \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^{T_i-1}
      \nabla_\theta \log\pi_\theta(a_t|s_t)\;G_t
  $
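
As a hedged illustration of the estimator, the sketch below computes it with NumPy for a single episode ($N = 1$) under a tabular softmax policy whose parameters are logits `theta[s, a]`. For this parameterization, the gradient of $\log \pi_\theta(a|s)$ with respect to the logits of state $s$ is $\mathrm{onehot}(a) - \pi_\theta(\cdot|s)$. The policy form, the function names, and the sample trajectory are assumptions made for the example and are separate from the Keras model used in this chapter.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_gradient(theta, states, actions, returns):
    """Single-episode estimate: sum_t grad log pi(a_t|s_t) * G_t."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta[s])
        grad_log_pi = -probs          # d log pi(a|s) / d theta[s, :] = onehot(a) - pi(.|s)
        grad_log_pi[a] += 1.0
        grad[s] += grad_log_pi * G
    return grad

# Hypothetical episode in a problem with 2 states and 2 actions.
theta = np.zeros((2, 2))              # all-zero logits give a uniform policy
states, actions = [0, 1], [1, 0]
returns = [1.5, 1.0]                  # made-up returns G_0, G_1
grad = reinforce_gradient(theta, states, actions, returns)
print(grad)

# Gradient ascent step: actions followed by positive returns become more likely.
theta += 0.1 * grad
```

Each row of the gradient sums to zero, because raising the probability of the chosen action must lower the probabilities of the others.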

---

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import gym

env = gym.make("CartPole-v1")
n_inputs  = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network: maps a state to action probabilities
model = keras.Sequential([
    keras.Input(shape=(n_inputs,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_actions, activation="softmax")
])
optimizer = keras.optimizers.Adam(learning_rate=0.01)

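gamma = 0.99  # discount factor

def run_episode(env, model):
    states, actions, rewards = [], [], []
    # reset() returns obs in older gym and (obs, info) in gym>=0.26
    ret = env.reset()
    s = ret[0] if isinstance(ret, tuple) else ret
    done = False
    while not done:
        probs = model(np.array([s], dtype=np.float32)).numpy()[0]
        a = np.random.choice(n_actions, p=probs)
        # step() returns 4 values in older gym and 5 values in gym>=0.26
        ret = env.step(a)
        s2, r = ret[0], ret[1]
        done = ret[2] if len(ret) == 4 else (ret[2] or ret[3])
        states.append(s); actions.append(a); rewards.append(r)
        s = s2

    # compute discounted returns G_t, working backwards from the end of the episode
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = np.array(returns, dtype=np.float32)

    return np.array(states, dtype=np.float32), \
           np.array(actions, dtype=np.int32), \
           returns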

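# A fixed input signature avoids retracing the graph for every new episode length
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, n_inputs], dtype=tf.float32),
    tf.TensorSpec(shape=[None], dtype=tf.int32),
    tf.TensorSpec(shape=[None], dtype=tf.float32),
])
def train_step(states, actions, returns):
    with tf.GradientTape() as tape:
        probs = model(states, training=True)   # softmax output: action probabilities, not logits
        act_masks = tf.one_hot(actions, n_actions)
        log_probs = tf.reduce_sum(act_masks * tf.math.log(probs + 1e-8), axis=1)
        # REINFORCE loss: minimizing -mean(log π(a|s) · G) ascends the policy gradient
        loss = -tf.reduce_mean(log_probs * returns)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss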

# Training loop
for episode in range(300):
    states, actions, returns = run_episode(env, model)
    loss = train_step(states, actions, returns)
    if episode % 50 == 0:
        print(f"Episode {episode}, loss={loss.numpy():.3f}")

Episode 0, loss=6.569
Episode 50, loss=11.049
Episode 100, loss=9.164
Episode 150, loss=21.965
Episode 200, loss=23.967
Episode 250, loss=6.766


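## 4. Deep Q-Network (DQN)

* Use a **neural network** to approximate $Q(s,a;\theta)$
* Stabilize training with a **replay buffer**, a **target network** (parameters $\theta^-$, copied periodically from $\theta$), and **ε-greedy** exploration
* **Update** (for a terminal $s'$ the target is simply $y = r$):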

  $
    y = r + \gamma \max_{a'}Q(s',a';\theta^-) ,\quad
    \theta \leftarrow \arg\min_\theta (Q(s,a;\theta)-y)^2
  $
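
The target $y$ can be computed for a whole minibatch at once. The sketch below does this with NumPy, using a `dones` mask to drop the bootstrap term for terminal transitions, which is the same `(1 - dones)` convention as in the training cell. The function name `dqn_targets` and the sample batch are assumptions for illustration; `q_next_target` stands in for the target network's outputs on the next states.

```python
import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with y = r for terminal transitions."""
    max_next = np.max(q_next_target, axis=1)
    return rewards + gamma * max_next * (1.0 - dones)

# Hypothetical batch of 3 transitions with 2 actions.
rewards = np.array([1.0, 1.0, 1.0], dtype=np.float32)
q_next = np.array([[0.5, 2.0], [3.0, 1.0], [4.0, 4.0]], dtype=np.float32)
dones = np.array([0.0, 0.0, 1.0], dtype=np.float32)   # the last transition is terminal

targets = dqn_targets(rewards, q_next, dones, gamma=0.5)
print(targets)  # values: 2.0, 2.5, 1.0
```

The third target ignores its next-state values entirely, since no further reward can follow a terminal state.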

In [2]:
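import numpy as np
# Compatibility shim: older gym releases reference np.bool8, which recent NumPy versions no longer provide.
if not hasattr(np, "bool8"):
    np.bool8 = np.bool_
import tensorflow as tf
from tensorflow import keras
import gym, random
from collections import deque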

env = gym.make("CartPole-v1")
n_inputs  = env.observation_space.shape[0]
n_actions = env.action_space.n

def unpack_reset(ret):
    return ret if not isinstance(ret, tuple) else ret[0]

def unpack_step(ret):
    if len(ret) == 4:
        obs, reward, done, _ = ret
    else:
        obs, reward, terminated, truncated, _ = ret
        done = terminated or truncated
    return obs, reward, done

# Build the Q-network: maps a state to one Q-value per action
def build_q_network():
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(n_actions)   # linear output: unbounded Q-values
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())

buffer = deque(maxlen=2000)
epsilon = 1.0
gamma = 0.99
batch_size = 64
update_target_every = 50

episodes = 300
for ep in range(episodes):
    s = unpack_reset(env.reset())
    done = False
    ep_reward = 0
    while not done:
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            q_vals = q_net.predict(np.array([s], dtype=np.float32), verbose=0)[0]
            a = np.argmax(q_vals)
        s2, r, done = unpack_step(env.step(a))
        buffer.append((s, a, r, s2, done))
        s = s2
        ep_reward += r

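        if len(buffer) >= batch_size:
            # Sample a random minibatch of past transitions from the replay buffer
            batch = random.sample(buffer, batch_size)
            states  = np.array([b[0] for b in batch], dtype=np.float32)
            actions = np.array([b[1] for b in batch], dtype=np.int32)
            rewards = np.array([b[2] for b in batch], dtype=np.float32)
            next_s  = np.array([b[3] for b in batch], dtype=np.float32)
            dones   = np.array([b[4] for b in batch], dtype=np.float32)

            # TD target from the target network; (1 - dones) drops the bootstrap term
            # for transitions that ended the episode (unpack_step also marks
            # time-limit truncations as done)
            q_next = target_net.predict(next_s, verbose=0)
            targets = rewards + gamma * np.max(q_next, axis=1) * (1 - dones)
            # Only the Q-value of the action actually taken is regressed toward its
            # target; the other outputs keep their current predictions (zero error)
            q_vals  = q_net.predict(states, verbose=0)
            q_vals[np.arange(batch_size), actions] = targets
            q_net.train_on_batch(states, q_vals)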

    # Update target network
    if ep % update_target_every == 0:
        target_net.set_weights(q_net.get_weights())
    epsilon = max(0.1, epsilon * 0.995)

    if ep % 50 == 0:
        print(f"Ep {ep:3d}, reward {ep_reward:.1f}, eps {epsilon:.3f}")

Ep   0, reward 59.0, eps 0.995
Ep  50, reward 18.0, eps 0.774
Ep 100, reward 37.0, eps 0.603
Ep 150, reward 14.0, eps 0.469
Ep 200, reward 15.0, eps 0.365
Ep 250, reward 72.0, eps 0.284


---

## Chapter 18 Summary

1. Tabular **Q-Learning** learns a Q-table through Bellman updates.
2. **Policy Gradient (REINFORCE)** optimizes the policy parameters directly.
3. **DQN** uses a neural network, a replay buffer, and a target network for stability.
4. Many variants exist: Double DQN, Dueling DQN, Prioritized Experience Replay, Actor-Critic, PPO, and others.