# Dueling Double DQN

**Dueling Double DQN** combines two powerful ideas:
1. **Dueling Network Architecture**  
2. **Double DQN (decoupling action selection from action evaluation)**

This combination yields one of the most stable and best-performing value-based
algorithms for discrete control.

---

## 🎭 1. Dueling Network Architecture

Instead of having the network predict the Q-value of each action directly, Dueling DQN
splits the estimate into two streams:

### 🔹 Value Stream (V)

Estimates the "value" of the state:

$$
V(s)
$$

### 🔹 Advantage Stream (A)

Estimates the advantage of each action:

$$
A(s, a)
$$

### 🔹 Final combination (mean-centering the advantages)

$$
Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a')
$$

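This formulation helps in states where *the exact action does not matter much*: the network can learn $V(s)$ without having to learn the effect of every action. Subtracting the mean advantage also makes the split identifiable, because otherwise a constant could be moved freely between $V$ and $A$ without changing $Q$.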
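
To make the mean-centering concrete, here is a minimal pure-Python sketch of the combination formula. The helper name `dueling_q` and the numeric values are assumptions chosen for illustration; the notebook's own networks do the same thing with TensorFlow tensors.

```python
def dueling_q(value, advantages):
    """Combine a scalar V(s) and a list of A(s, a) into a list of Q(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Illustrative numbers: V(s) = 10 and three actions.
q = dueling_q(10.0, [1.0, 2.0, 3.0])
print(q)  # [9.0, 10.0, 11.0]

# Shifting every advantage by the same constant leaves Q unchanged,
# which is what makes the V / A decomposition identifiable.
q_shifted = dueling_q(10.0, [101.0, 102.0, 103.0])
print(q_shifted)  # [9.0, 10.0, 11.0]
```

Note that the mean of the resulting Q-values equals $V(s)$, so the value stream can be read as the average action value in that state.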

---

## 🎯 2. Double DQN

Double DQN reduces Q-value overestimation by decoupling action selection from action evaluation:

- The **online** network selects the action  
- The **target** network evaluates it  

$$
y = r + \gamma \, Q_{\text{target}}
\big(s',\, \arg\max_a Q_{\text{online}}(s',a)\big)
$$

For terminal transitions the bootstrap term is dropped, so $y = r$. This mechanism substantially stabilizes learning.
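
The difference from the standard DQN target can be sketched on a single transition. The function names and Q-values below are hypothetical, picked so that the two networks disagree about the best action; this is a sketch, not the notebook's batched TensorFlow code.

```python
GAMMA = 0.99  # same discount factor as in the notebook's hyperparameters


def argmax(values):
    """Index of the largest element in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=GAMMA):
    """Online network selects the action, target network evaluates it."""
    if done:
        return reward  # no bootstrapping past a terminal state
    best_action = argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]


def dqn_target(reward, done, q_target_next, gamma=GAMMA):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


q_online_next = [1.0, 3.0]  # online network prefers action 1
q_target_next = [5.0, 2.0]  # target network (over)values action 0

y_double = double_dqn_target(1.0, False, q_online_next, q_target_next)
y_vanilla = dqn_target(1.0, False, q_target_next)
print(round(y_double, 2), round(y_vanilla, 2))  # 2.98 5.95
```

Because the Double DQN target evaluates one particular action instead of taking the maximum, it can never exceed the standard DQN target computed from the same target network.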

---

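## 🔧 3. Soft Target Updates (τ)

Instead of a hard copy of the online network into the target network, we use a smooth update:

$$
\theta_{\text{target}} \leftarrow 
\tau \, \theta_{\text{online}} +
(1 - \tau) \, \theta_{\text{target}}
$$

where τ ≈ 0.005.

Note that the first implementation below still syncs the target with a periodic hard copy (every `UPDATE_STEPS` learning steps); the soft update is used in the optimized version.

Advantages:
- considerably better stability  
- smoother convergence  
- used in modern algorithms (SAC, TD3)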
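
How slowly the target tracks the online network can be seen in a small pure-Python sketch. The parameter lists are stand-ins for network weights, and the online values are held fixed purely for illustration (during training they keep moving).

```python
TAU = 0.005  # value quoted above


def soft_update(target_params, online_params, tau=TAU):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]


target = [0.0, 0.0]
online = [1.0, -2.0]  # assumed constant here, for illustration only

for _ in range(1000):
    target = soft_update(target, online)

# Against a fixed online value, the remaining gap shrinks as (1 - tau)^k.
remaining = (1.0 - TAU) ** 1000
print(round(remaining, 4))  # roughly 0.0067
print(target)
```

Under these assumptions, roughly a thousand updates are needed before the target is within 1% of the online weights, which is the smoothing effect the update relies on.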

---

## 🚀 Key benefits

✔ Faster learning in "uninformative" states  
✔ Targets that are less prone to overestimation  
✔ Reduced variability  
✔ Excellent performance on complex problems such as:
- LunarLander  
- MountainCar  
- Acrobot  
- Atari  

---

## 🧩 Pipeline (in brief)

1. Select the action with the **online network**  
2. Compute the target value with the **target network**  
3. Combine V + A → Q using the dueling formula  
4. Take a gradient descent step  
5. Apply the soft update (τ)  
---

## 📌 In short

**Dueling Double DQN =  
Double DQN (stable) + Dueling Network (efficient) + Soft Updates (modern)**

One of the best value-based architectures for discrete control and
a key component of **Rainbow DQN**.


In [None]:
import gymnasium as gym
import tensorflow as tf
import numpy as np
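# Note: SummaryWriter comes from PyTorch's TensorBoard utilities,
# so torch must be installed alongside TensorFlow for this cell to run.
from torch.utils.tensorboard import SummaryWriter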
from collections import deque
import random
import matplotlib.pyplot as plt
from IPython.display import clear_output
from tensorflow.keras import Model, layers, optimizers

# ============================================================
# Replay Buffer
# ============================================================

class Memory(object):
    def __init__(self, memory_size: int) -> None:
        self.memory_size = memory_size
        self.buffer = deque(maxlen=memory_size)

    def add(self, experience) -> None:
        self.buffer.append(experience)

    def size(self):
        return len(self.buffer)

    def sample(self, batch_size: int):
        batch_size = min(batch_size, len(self.buffer))
        indexes = np.random.choice(np.arange(len(self.buffer)), size=batch_size, replace=False)
        return [self.buffer[i] for i in indexes]


# ============================================================
# Dueling Network
# ============================================================

class Qnetwork(Model):
    def __init__(self):
        super(Qnetwork, self).__init__()

        self.fc1 = layers.Dense(64, activation='relu')
        self.fc_value = layers.Dense(256, activation='relu')
        self.fc_adv = layers.Dense(256, activation='relu')

        self.value = layers.Dense(1)     # V(s)
        self.adv = layers.Dense(2)       # A(s,a); CartPole has 2 actions

    def call(self, x):
        x = tf.convert_to_tensor(x, dtype=tf.float32)
        x = self.fc1(x)

        value = self.fc_value(x)
        adv = self.fc_adv(x)

        value = self.value(value)
        adv = self.adv(adv)

        adv_mean = tf.reduce_mean(adv, axis=1, keepdims=True)
        Q = value + adv - adv_mean  # dueling combine

        return Q

    def select_action(self, obs):
        obs = np.array(obs, dtype=np.float32)[np.newaxis, :]
        Q = self.call(obs)
        return int(tf.argmax(Q[0]).numpy())


# ============================================================
# Hyperparameters
# ============================================================

GAMMA = 0.99
INITIAL_EPSILON = 0.99
FINAL_EPSILON = 0.0001
EXPLORATION_STEPS = 20000
REPLAY_MEMORY = 50000
BATCH = 32
UPDATE_STEPS = 10
LR = 5e-3
NUM_EPISODES = 1000

# ============================================================
# Main
# ============================================================

if __name__ == "__main__":
    tf.keras.backend.set_floatx('float32')

    memory_replay = Memory(REPLAY_MEMORY)
    epsilon = INITIAL_EPSILON
    learn_steps = 0
    begin_learn = False

    writer = SummaryWriter('ddqn-tf2')

    env = gym.make("CartPole-v1")

    online_q = Qnetwork()
    target_q = Qnetwork()
    target_q.set_weights(online_q.get_weights())

    mse = tf.keras.losses.MeanSquaredError()
    optim = optimizers.Adam(LR)

    reward_history = []

    for epoch in range(NUM_EPISODES):

        state, info = env.reset()
        episode_reward = 0

        # Cap each episode at 200 steps (CartPole-v1 itself only truncates at 500)
        for t in range(200):

            # ------------------------------------------------
            # Epsilon-greedy
            # ------------------------------------------------
            if random.random() < epsilon:
                action = random.randint(0, 1)
            else:
                action = online_q.select_action(state)

            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            episode_reward += reward

            # store experience
            memory_replay.add((state, next_state, action, reward, done))
            state = next_state

            # ------------------------------------------------
            # Learn
            # ------------------------------------------------
            if memory_replay.size() > 100:
                if not begin_learn:
                    print("Learning starts!")
                    begin_learn = True

                learn_steps += 1

                # sync target network
                if learn_steps % UPDATE_STEPS == 0:
                    target_q.set_weights(online_q.get_weights())

                batch = memory_replay.sample(BATCH)
                batch_state, batch_next_state, batch_action, batch_reward, batch_done = zip(*batch)

                batch_state = np.asarray(batch_state, dtype=np.float32)
                batch_next_state = np.asarray(batch_next_state, dtype=np.float32)
                batch_action = np.asarray(batch_action, dtype=np.int32)
                batch_reward = np.asarray(batch_reward, dtype=np.float32)
                batch_done = np.asarray(batch_done, dtype=np.float32)

                # -------------------------------
                # Double DQN update
                # -------------------------------
                next_action_online = tf.argmax(online_q(batch_next_state), axis=1)
                next_action_online = tf.cast(next_action_online, tf.int32)
                next_index = tf.stack([tf.range(tf.shape(next_action_online)[0]), next_action_online], axis=1)

                q_target_next = target_q(batch_next_state)
                q_target_selected = tf.gather_nd(q_target_next, next_index)

                y = batch_reward + (1.0 - batch_done) * GAMMA * q_target_selected

                # -------------------------------
                # Compute loss
                # -------------------------------
                with tf.GradientTape() as tape:
                    q_pred = online_q(batch_state)
                    batch_action_idx = tf.stack([tf.range(tf.shape(batch_action)[0]), batch_action], axis=1)
                    q_pred_selected = tf.gather_nd(q_pred, batch_action_idx)
                    loss = mse(q_pred_selected, y)

                grads = tape.gradient(loss, online_q.trainable_variables)
                optim.apply_gradients(zip(grads, online_q.trainable_variables))

                # update epsilon
                if epsilon > FINAL_EPSILON:
                    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORATION_STEPS

                writer.add_scalar("loss", float(loss), learn_steps)

            if done:
                break

        reward_history.append(episode_reward)
        writer.add_scalar("episode_reward", episode_reward, epoch)

        # ----------------------------------------------------
        # LIVE PLOT
        # ----------------------------------------------------
        if epoch % 5 == 0:
            clear_output(wait=True)
            plt.figure(figsize=(10,4))

            # Reward
            plt.plot(
                reward_history,
                label="Reward",
                color="blue",
                alpha=0.3,  
                linewidth=1
            )

            # Moving average
            if len(reward_history) > 20:
                ma = np.convolve(reward_history, np.ones(20)/20, mode='valid')
                plt.plot(
                    range(19, len(reward_history)),
                    ma,
                    label="Moving Avg (20 eps)",
                    color="orange",
                    linewidth=2.5
                )

            plt.title("Dueling Double DQN: Training Progress")
            plt.xlabel("Episode")
            plt.ylabel("Reward")
            plt.grid(True, alpha=0.3)
            plt.legend()
            plt.show()

    # Flush the TensorBoard logs and release the environment once training ends
    writer.close()
    env.close()


# Optimizations for Dueling Double DQN

## ✔ Soft Updates (τ = 0.005)

Replaces the abrupt copy into the target network with a smooth update:

$$
\theta_{\text{target}} \leftarrow 
\tau\,\theta_{\text{online}} + (1-\tau)\,\theta_{\text{target}}
$$

→ reduces oscillations and stabilizes learning.

---

## ✔ Gradient Clipping

Limits the gradient norm to avoid divergence and NaN values (the code below clips each gradient tensor to a norm of 5.0).  
Keeps updates small, controlled, and stable.
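
The notebook's training loop clips every gradient tensor on its own with `tf.clip_by_norm`. A common alternative is to clip by the global norm across all tensors (TensorFlow offers `tf.clip_by_global_norm` for this), which rescales all gradients by one shared factor and so preserves the update direction. The pure-Python sketch below illustrates that variant; the helper and the toy gradients are assumptions for illustration, not the notebook's code.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient lists so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(tensor) for tensor in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm


# Toy gradients for two "layers": global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)

clipped_norm = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm)                    # 13.0
print(round(clipped_norm, 6))  # 5.0
```

With per-tensor clipping, only the third value here would be rescaled (its tensor has norm 12, the first has norm 5), so the relative size of the layers' updates would change; global-norm clipping keeps their ratio intact.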

---

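## ✔ Q-value Clipping

We clip the predicted Q-values:

$$
Q \leftarrow \text{clip}(Q,\,-500,\,500)
$$

→ keeps values from "exploding" in unstable states. In the code below the clip is applied to the network output inside `call`, so it bounds both the predictions and the bootstrapped targets.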
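
A quick sanity check on the ±500 bound is to compare it with the largest discounted return the agent can actually collect. The sketch below assumes CartPole's reward of +1 per step, together with the discount factor and the 200-step episode cap used in the training loops.

```python
GAMMA = 0.99      # discount factor from the hyperparameters
MAX_STEPS = 200   # per-episode step cap used in the training loops
REWARD = 1.0      # assumed: CartPole gives +1 for every step survived

# Geometric series: sum_{t=0}^{inf} gamma^t = 1 / (1 - gamma)
infinite_horizon_bound = REWARD / (1.0 - GAMMA)

# Same series truncated after MAX_STEPS terms
capped_return = REWARD * (1.0 - GAMMA ** MAX_STEPS) / (1.0 - GAMMA)

print(round(infinite_horizon_bound, 2))  # 100.0
print(round(capped_return, 1))           # about 86.6
```

Under these assumptions the true Q-values stay below about 100, so the clip at 500 acts as a safety net against divergence and should rarely be active once training is healthy.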

---

## ✔ Lower Learning Rate + Deeper Network

- **LR = 3e-4**  
- two dense layers of **128 neurons** each

→ greater stability and a better representation of the Q-function.

---

## ✔ Smooth Epsilon Decay (0.995)

The agent reduces epsilon gradually, multiplying it by 0.995 after every episode (with a floor of 0.02):

- exploration → exploitation without abrupt jumps  
- more natural and more stable behavior
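
It can help to work out how long this schedule takes to reach its floor. The sketch below uses the constants from the hyperparameter block of the optimized code and a closed-form version of the per-episode update; the helper `epsilon_at` is a hypothetical name introduced for illustration.

```python
import math

INITIAL_EPS, FINAL_EPS, EPS_DECAY = 1.0, 0.02, 0.995
NUM_EPISODES = 600


def epsilon_at(episode):
    """Epsilon after `episode` multiplicative decays, floored at FINAL_EPS."""
    return max(FINAL_EPS, INITIAL_EPS * EPS_DECAY ** episode)


# Smallest number of episodes after which epsilon sits at the floor:
# solve INITIAL_EPS * EPS_DECAY^n <= FINAL_EPS for n.
episodes_to_floor = math.ceil(math.log(FINAL_EPS / INITIAL_EPS) / math.log(EPS_DECAY))

print(episodes_to_floor)                      # 781
print(round(epsilon_at(NUM_EPISODES), 3))     # about 0.049
```

With these settings epsilon is still around 0.05 when the 600 training episodes end, so the floor of 0.02 is never reached; a faster decay or more episodes would be needed if that floor is meant to take effect.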

---

## ✔ Vectorized Replay Buffer

The memory uses preallocated NumPy arrays instead of lists:

- fast sampling  
- efficient updates  
- smooth integration with TensorFlow  

---

## 📌 In short

These optimizations make Dueling Double DQN **considerably more stable, faster, and more robust**, giving consistent training even in difficult environments.


In [None]:
import gymnasium as gym
import tensorflow as tf
import numpy as np
import random
from collections import deque
import matplotlib.pyplot as plt
from IPython.display import clear_output
from tensorflow.keras import Model, layers, optimizers


# ============================================================
# Replay Buffer (optimized)
# ============================================================

class ReplayBuffer:
    def __init__(self, size, state_dim):
        self.mem_size = size
        self.mem_cntr = 0

        self.state_memory = np.zeros((size, state_dim), dtype=np.float32)
        self.next_state_memory = np.zeros((size, state_dim), dtype=np.float32)
        self.action_memory = np.zeros(size, dtype=np.int32)
        self.reward_memory = np.zeros(size, dtype=np.float32)
        self.done_memory = np.zeros(size, dtype=np.float32)

    def store(self, state, next_state, action, reward, done):
        idx = self.mem_cntr % self.mem_size
        self.state_memory[idx] = state
        self.next_state_memory[idx] = next_state
        self.action_memory[idx] = action
        self.reward_memory[idx] = reward
        self.done_memory[idx] = done
        self.mem_cntr += 1

    def sample(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)

        return (
            self.state_memory[batch],
            self.next_state_memory[batch],
            self.action_memory[batch],
            self.reward_memory[batch],
            self.done_memory[batch]
        )


# ============================================================
# Dueling Network (optimized)
# ============================================================

class DuelingQNetwork(Model):
    def __init__(self, n_actions):
        super().__init__()

        self.fc1 = layers.Dense(128, activation='relu')
        self.fc2 = layers.Dense(128, activation='relu')

        self.value_fc = layers.Dense(128, activation='relu')
        self.adv_fc = layers.Dense(128, activation='relu')

        self.value = layers.Dense(1)
        self.advantage = layers.Dense(n_actions)

    def call(self, x):
        x = tf.convert_to_tensor(x, dtype=tf.float32)
        x = self.fc1(x)
        x = self.fc2(x)

        V = self.value(self.value_fc(x))
        A = self.advantage(self.adv_fc(x))

        A_mean = tf.reduce_mean(A, axis=1, keepdims=True)
        Q = V + (A - A_mean)

        return tf.clip_by_value(Q, -500, 500)

    def act(self, obs):
        obs = obs[np.newaxis, :].astype(np.float32)
        Q = self.call(obs)[0]
        return int(tf.argmax(Q).numpy())


# ============================================================
# Hyperparameters
# ============================================================

GAMMA = 0.99
LR = 3e-4
BATCH = 32
MEM_SIZE = 50000
NUM_EPISODES = 600

INITIAL_EPS = 1.0
FINAL_EPS = 0.02
EPS_DECAY = 0.995

TAU = 0.005   # soft update


# ============================================================
# Soft update function
# ============================================================

def soft_update(target, online, tau=TAU):
    target_weights = target.get_weights()
    online_weights = online.get_weights()
    new_weights = []

    for tw, ow in zip(target_weights, online_weights):
        new_weights.append(tw * (1 - tau) + ow * tau)

    target.set_weights(new_weights)


# ============================================================
# Training Loop
# ============================================================

env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

buffer = ReplayBuffer(MEM_SIZE, state_dim)
online = DuelingQNetwork(n_actions)
target = DuelingQNetwork(n_actions)
target.set_weights(online.get_weights())

optimizer = optimizers.Adam(LR)
mse = tf.keras.losses.MeanSquaredError()

epsilon = INITIAL_EPS
reward_history = []

for episode in range(NUM_EPISODES):

    state, _ = env.reset()
    ep_reward = 0

    # Cap each episode at 200 steps (CartPole-v1 itself only truncates at 500)
    for step in range(200):

        # ----------------------------
        # Epsilon-greedy action
        # ----------------------------
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = online.act(state)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ep_reward += reward

        # store transition
        buffer.store(state, next_state, action, reward, float(done))
        state = next_state

        # ----------------------------
        # TRAINING
        # ----------------------------
        if buffer.mem_cntr > 1000:

            states, next_states, actions, rewards, dones = buffer.sample(BATCH)

            # Double DQN target: the online network selects, the target network evaluates
            next_actions = tf.argmax(
                online(next_states), axis=1, output_type=tf.int32
            )

            indices = tf.stack(
                [tf.range(BATCH, dtype=tf.int32), next_actions], axis=1
            )

            next_Q = tf.gather_nd(target(next_states), indices)
            y = rewards + GAMMA * (1 - dones) * next_Q

            # ----------------------------
            # Compute loss
            # ----------------------------
            with tf.GradientTape() as tape:
                q_pred = online(states)
                action_idx = tf.stack(
                    [tf.range(BATCH, dtype=tf.int32), actions], axis=1
                )
                pred = tf.gather_nd(q_pred, action_idx)
                loss = mse(pred, y)

            grads = tape.gradient(loss, online.trainable_variables)
            grads = [tf.clip_by_norm(g, 5.0) for g in grads]
            optimizer.apply_gradients(zip(grads, online.trainable_variables))

            # Soft update for stability
            soft_update(target, online)

        if done:
            break

    # ----------------------------
    # Epsilon decay
    # ----------------------------
    epsilon = max(FINAL_EPS, epsilon * EPS_DECAY)
    reward_history.append(ep_reward)

    # ----------------------------
    # LIVE PLOT
    # ----------------------------
    if episode % 5 == 0:
        clear_output(wait=True)
        plt.figure(figsize=(10,4))

        # Reward
        plt.plot(
            reward_history,
            label="Reward",
            color="blue",
            alpha=0.3,  
            linewidth=1
        )

        # Moving average
        if len(reward_history) > 20:
            ma = np.convolve(reward_history, np.ones(20)/20, mode='valid')
            plt.plot(
                range(19, len(reward_history)),
                ma,
                label="Moving Avg (20 eps)",
                color="orange",
                linewidth=2.5
            )

        plt.title("Dueling Double DQN: Training Progress")
        plt.xlabel("Episode")
        plt.ylabel("Reward")
        plt.grid(True, alpha=0.3)
        plt.legend()
        plt.show()
