# Deep Q-Learning

If the number of states becomes too large, tabular learning becomes infeasible. A way to circumvent this problem is to model $Q$ as a deep neural network that receives the state and the action as inputs and outputs the corresponding $Q$ value. The parameters of this network can then be learned via gradient descent by considering a loss that measures how far $Q$ is from satisfying Bellman's equation.

This is the basic conceptual idea behind the major breakthrough made in 2013 by a team from DeepMind (https://arxiv.org/abs/1312.5602), which allowed them to train a DQN (Deep Q-Network) to play several Atari games; in fact, the network learned "by itself", starting from the pixel data of the game frames. This remarkable breakthrough was the starting point for a spectacular series of achievements that included mastering high-complexity strategy games such as chess and Go and, most importantly, the computation of the 3D structure of proteins (https://www.nobelprize.org/prizes/chemistry/2024/summary/).

A naive version of the DQN algorithm alluded to above can be summarized as follows:

1. Initialize $Q(s, a)$ using a standard deep learning initialization.

2. By interacting with the environment, obtain the tuple $(s, a, r, s')$.

3. Compute the loss:

$${\cal L} =
\left\{
\begin{array}{l}
 \left(Q(s, a) - r\right)^2  \quad,\text{ if episode ended}\;, \\
 \left(Q(s, a) - \left(r + \gamma \max_{a'} Q(s' ,a')\right)\right)^2 \quad, \text{ otherwise} \;.
\end{array}
\right.
$$

4. Update $Q(s, a)$ using an appropriate gradient descent based algorithm, to minimize the loss with respect to the $Q$ model parameters.

5. Repeat from step 2 until convergence (in the sense described for tabular-$Q$ learning) is achieved.
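As a minimal sketch of step 3, assume the network outputs are already available as plain numbers. The function name `naive_td_loss` and the sample values are illustrative and do not come from the original code. Under those assumptions, the loss for a single transition could be computed as follows:

```python
import numpy as np

def naive_td_loss(q_sa, reward, next_q_values, done, gamma=0.99):
    """Squared Bellman error for one transition (s, a, r, s')."""
    if done:
        # Terminal transition: there is no next state to bootstrap from.
        target = reward
    else:
        # Bootstrap from the best action value in the next state.
        target = reward + gamma * np.max(next_q_values)
    return (q_sa - target) ** 2

# Hypothetical values: Q(s, a) = 0.5 and Q(s', .) = [0.2, 0.7].
next_q = np.array([0.2, 0.7])
loss_terminal = naive_td_loss(0.5, 1.0, next_q, done=True)
loss_running = naive_td_loss(0.5, 0.0, next_q, done=False, gamma=0.9)
```

In the actual algorithm `q_sa` and `next_q_values` both come from the same network, which is one reason the naive version is unstable.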

We'll now focus on a single Atari game, the iconic Pong https://ale.farama.org/environments/pong/.

In [1]:
# !pip install 'stable_baselines3'

In [2]:
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common import atari_wrappers
import ale_py

# Download and install the ROMs manually (this did not work for us)
#wget http://www.atarimania.com/roms/Roms.rar
#unrar x Roms.rar
#ale-import-roms ROMS/

env = gym.make("PongNoFrameskip-v4", render_mode="rgb_array")

obs,_ = env.reset()

print(obs.shape)

env.render()

(210, 160, 3)


A.L.E: Arcade Learning Environment (version 0.11.1+2750686)
[Powered by Stella]


array([[[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [109, 118,  43],
        [109, 118,  43],
        [109, 118,  43]],

       [[109, 118,  43],
        [109, 118,  43],
        [109, 118,  43],
        ...,
        [109, 118,  43],
        [109, 118,  43],
        [109, 118,  43]],

       [[109, 118,  43],
        [109, 118,  43],
        [109, 118,  43],
        ...,
        [109, 118,  43],
        [109, 118,  43],
        [109, 118,  43]],

       ...,

       [[ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24],
        ...,
        [ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24]],

       [[ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24],
        ...,
        [ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24]],

       [[ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24],
        ...,
        [ 53,  95,  24],
        [ 53,  95,  24],
        [ 53,  95,  24]]

Unfortunately, the previous version of DQN doesn't work very well, so we need to make some upgrades.

- On one hand, we need to explore the environment (using random actions); on the other, we want to exploit the knowledge already gained by the $Q$-function; this is the famous "exploration versus exploitation dilemma". We will resolve this dilemma by introducing a parameter $\epsilon\in[0,1]$ that decreases with the number of iterations: with probability $\epsilon$ the agent takes a random action, and otherwise it chooses the action prescribed by $Q$.

- We will also introduce a **replay buffer**. This will store a "large" number of transitions $(s, a, r, s')$ that will be used to construct training data batches to update the parameters of $Q$ using gradient descent.

- Finally, using $Q$ itself to generate the targets for the loss will make the training very unstable. To circumvent this problem, we will introduce another DQN $\hat Q$, called the **target network**, that is periodically synchronized with the main $Q$ network, but otherwise remains unchanged for a given number of iterations.   

For instance, once the decreasing $\epsilon$ reaches 0.5, the agent explores half of the time and takes the best action according to $Q$ the other half.
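The $\epsilon$-greedy rule can be sketched in a few lines of NumPy. This is only an illustration: the helper name `epsilon_greedy` and the three Q values are made up for the example, and the agent class later in this notebook implements the same idea against the real network.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])  # hypothetical Q(s, .) for 3 actions

# With epsilon = 0 the choice is always greedy.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# With epsilon = 0.5 the greedy action is still the most frequent one,
# because random draws also land on it one third of the time.
actions = [epsilon_greedy(q_values, 0.5, rng) for _ in range(10_000)]
greedy_fraction = float(np.mean(np.array(actions) == 1))
```

With three actions and $\epsilon = 0.5$, the greedy action is expected roughly $0.5 + 0.5/3 \approx 0.67$ of the time.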


Upgraded DQN algorithm:  

1. Initialize parameters for $Q(s,a)$ and $\hat Q(s,a)$,$\;\epsilon \leftarrow 1.0$, and empty the replay buffer.

2. With probability $\epsilon$, select a random action $a$; otherwise, $a = \text{argmax}_a Q(s, a)$.

3. Execute action $a$ in an emulator and observe the reward, $r$, and the next state, $s'$.

4. Store the transition $(s, a, r, s')$ in the replay buffer.

5. Sample a random mini-batch of transitions from the replay buffer.

6. For every transition in the mini-batch, calculate the target:

$$
y =
\left\{
\begin{array}{l}
 r  \quad,\text{ if episode ended}\;, \\
 r + \gamma \max_{a'} \hat Q(s' ,a') \quad, \text{ otherwise} \;.
\end{array}
\right.
$$

7. Calculate the loss: ${\cal L}=(Q(s,a)-y)^2$.

8. Update $Q(s, a)$ using an appropriate gradient descent based algorithm, to minimize the loss with respect to the $Q$ model parameters.

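9. Decrease $\epsilon$ and, every fixed number of iterations, copy the parameters of $Q$ into $\hat Q$. Repeat from step 2 until convergence (in the sense described before) is achieved.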
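Steps 6 and 7 are usually vectorized over the whole mini-batch. The sketch below assumes the target network outputs for the next states are already available as a NumPy array; the function name `dqn_targets` and the numbers are illustrative only.

```python
import numpy as np

def dqn_targets(rewards, dones, next_q_target, gamma=0.99):
    """y = r if the episode ended, r + gamma * max_a' Q_hat(s', a') otherwise."""
    max_next = next_q_target.max(axis=1)          # best value per next state
    not_done = 1.0 - dones.astype(np.float64)     # 0 for terminal transitions
    return rewards + gamma * max_next * not_done

# Hypothetical mini-batch of three transitions and two actions.
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([True, False, False])
next_q_target = np.array([[0.5, 2.0],
                          [1.0, 3.0],
                          [0.0, -1.0]])

y = dqn_targets(rewards, dones, next_q_target, gamma=0.5)

# Step 7 for hypothetical predictions Q(s, a), averaged over the batch.
q_sa = np.array([0.0, 1.5, 0.0])
loss = float(np.mean((q_sa - y) ** 2))
```

Multiplying by the `not_done` mask handles both branches of the target definition in a single expression.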


We will now present Lapan's implementation of this algorithm, as described in chapter 6 of his book; see also
https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Third-Edition.

Note that Lapan uses PyTorch to implement the DQNs. It is quite an instructive exercise to translate the code to TensorFlow/Keras; the code below is such a translation, with the original PyTorch lines kept as comments for comparison.

**Wrappers:** For efficiency and conceptual reasons, we need to preprocess the environment. For instance:

- We can shrink the game frames by using a monochromatic, lower-resolution version of them. This is "hidden" in the *atari_wrappers.AtariWrapper* class. After applying it, our images will have a shape of $(84,84,1)$; recall that originally they had a shape of $(210,160,3)$. Note that this class takes care of many other relevant preprocessing details (see Chapter 6 of Lapan's book for more details).

- This (height, width, channels) layout isn't the one PyTorch is designed to receive; it expects the form (channels, height, width). Lapan's ImageToPyTorch wrapper, shown commented out below, takes care of this. TensorFlow expects channels last, so the wrapper is not needed in the translated code.

- The agent won't be able to learn how to play Pong if we only provide still images of the game. To learn how to play Pong, we need to learn about dynamics. So we need to pack a given number (n_steps) of consecutive images into a "small video" with n_steps frames. This is taken care of by the BufferWrapper (see below).

Here is the code for the wrappers.  

In [3]:
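import collections

import numpy as np

# Gymnasium returns observations as (H, W, C), which is the layout TensorFlow expects,
# so Lapan's ImageToPyTorch wrapper is kept here only for reference.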
# class ImageToPyTorch(gym.ObservationWrapper):
#     def __init__(self, env):
#         super(ImageToPyTorch, self).__init__(env)
#         obs = self.observation_space
#         assert isinstance(obs, gym.spaces.Box)
#         assert len(obs.shape) == 3
#         new_shape = (obs.shape[-1], obs.shape[0], obs.shape[1])
#         self.observation_space = gym.spaces.Box(
#             low=obs.low.min(), high=obs.high.max(),
#             shape=new_shape, dtype=obs.dtype)
#
#    def observation(self, observation):
#        return np.moveaxis(observation, 2, 0)

import typing as tt

class BufferWrapper(gym.ObservationWrapper): 
    def __init__(self, env, n_steps):
        super(BufferWrapper, self).__init__(env)
        obs = env.observation_space
        assert isinstance(obs, spaces.Box)
        new_obs = gym.spaces.Box(
            # PyTorch version (channels first): obs.low.repeat(n_steps, axis=0), obs.high.repeat(n_steps, axis=0)
            # Channels are now last, so the stacked shape is (H, W, C * n_steps) instead of (C * n_steps, H, W).
            obs.low.repeat(n_steps, axis=-1), obs.high.repeat(n_steps, axis=-1),
            dtype=obs.dtype)

        # old_shape = obs.shape
        # new_shape = (old_shape[0], old_shape[1], old_shape[2] * n_steps)
        # new_obs = gym.spaces.Box(
        #     low=np.repeat(obs.low, n_steps, axis=-1),
        #     high=np.repeat(obs.high, n_steps, axis=-1),
        #     shape=new_shape,
        #     dtype=obs.dtype
        # )

        self.observation_space = new_obs
        self.buffer = collections.deque(maxlen=n_steps)

    def reset(self, *, seed: tt.Optional[int] = None, options: tt.Optional[dict[str, tt.Any]] = None):
        # Pre-fill the buffer with empty frames so the first observation already has n_steps channels.
        for _ in range(self.buffer.maxlen - 1):
            self.buffer.append(self.env.observation_space.low)
        obs, extra = self.env.reset(seed=seed, options=options)  # reset the wrapped Gymnasium env
        return self.observation(obs), extra

    def observation(self, observation: np.ndarray) -> np.ndarray:
        self.buffer.append(observation)
        return np.concatenate(list(self.buffer), axis=-1) # concat along channel (last in the list)
        #return np.concatenate(self.buffer)


def make_env(env_name: str, **kwargs):
    env = gym.make(env_name, **kwargs)
    env = atari_wrappers.AtariWrapper(env, clip_reward=False, noop_max=0)
    #env = ImageToPyTorch(env)
    env = BufferWrapper(env, n_steps=4)
    return env
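To see what the frame stacking does without running the emulator, here is a small standalone sketch that mimics the buffer logic of `BufferWrapper` on dummy frames. It assumes that the empty frames are all zeros, standing in for `observation_space.low`, and uses constant-valued frames so the order of the channels is easy to read; `push` is a hypothetical helper, not part of the wrapper.

```python
import collections
import numpy as np

n_steps = 4
frame_shape = (84, 84, 1)  # (H, W, C) after AtariWrapper
buffer = collections.deque(maxlen=n_steps)

# Start of an episode: pad with n_steps - 1 empty frames.
for _ in range(n_steps - 1):
    buffer.append(np.zeros(frame_shape, dtype=np.uint8))

def push(frame):
    """Append a frame and return the stacked observation (H, W, n_steps)."""
    buffer.append(frame)  # the oldest frame is dropped automatically
    return np.concatenate(list(buffer), axis=-1)

first = push(np.full(frame_shape, 1, dtype=np.uint8))
second = push(np.full(frame_shape, 2, dtype=np.uint8))

print(first.shape)    # stacked observation shape
print(first[0, 0])    # channel values at one pixel after the first frame
print(second[0, 0])   # after the second frame, older frames shift left
```

The newest frame always sits in the last channel, which is what lets the network infer the direction and speed of the ball.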

For the DQN model, we will use a typical convolution network, with a conv base followed by a dense network. Importantly, instead of modeling $Q$ as a function of the pair $(s,a)$ that outputs the corresponding value, i.e.,

$$Q: {\cal S}\times {\cal A} \rightarrow \mathbb{R}\;,$$

where, for our environment, we have

$${\cal S}=\mathbb{R}^{84\times 84\times 4}$$

and

$${\cal A} =\{0,1,2,3,4,5\} \subset \mathbb{R}\;,$$

we will use a dual representation   

$$Q: {\cal S} \rightarrow \mathbb{R}^6\;,$$

that given the state, outputs the value for each possible action.

The Keras code below should be, by now, self-explanatory. Notice nonetheless that the PyTorch version, kept commented out in the following cell, requires a bit more work to set the dimensions of the layers.

In [4]:
import tensorflow as tf
from tensorflow.keras import layers

print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

class DQN(tf.keras.Model):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__() # initialize the tf.keras.Model superclass

        print(f"DQN input shape: {input_shape}")

        self.rescale = layers.Rescaling(1./255) # scale pixel values to [0, 1]
        # No input_shape argument here: a subclassed Keras model infers it on the first call.
        self.conv1 = layers.Conv2D(filters=32, kernel_size=8, strides=4, activation="relu")
        self.conv2 = layers.Conv2D(filters=64, kernel_size=4, strides=2, activation="relu")
        self.conv3 = layers.Conv2D(filters=64, kernel_size=3, strides=1, activation="relu")
        self.flatten  = layers.Flatten()

        # fully connected
        self.fc = layers.Dense(units=512, activation='relu')
        self.out = layers.Dense(units=n_actions, activation = None)


    def call(self, inputs): # required for subclasses of tf.keras.Model
        x = self.rescale(inputs)

        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)

        x = self.fc(x)
        q = self.out(x)
        return q


Num GPUs Available: 1


In [5]:
# # A good exercise is to translate this code from PyTorch to Keras with TensorFlow

# import torch
# import torch.nn as nn

# class DQN(nn.Module):
#     def __init__(self, input_shape, n_actions):
#         super(DQN, self).__init__() # initializes the nn.Module superclass
#         #__super__().__init

#         self.conv = nn.Sequential(
#             nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4), # convolution
#             nn.ReLU(), # ReLU activation function
#             nn.Conv2d(32, 64, kernel_size=4, stride=2), # this 32 must match the 32 above; Keras infers it automatically
#             nn.ReLU(),
#             nn.Conv2d(64, 64, kernel_size=3, stride=1),
#             nn.ReLU(),
#             nn.Flatten(),
#         )
#         size = self.conv(torch.zeros(1, *input_shape)).size()[-1] # applies the conv stack to a tensor of zeros to find its output size (.size() plays the role of .shape)
#         self.fc = nn.Sequential(
#             nn.Linear(size, 512), # dense layer
#             nn.ReLU(), # ReLU
#             nn.Linear(512, n_actions) # another dense layer
#         )

#     def forward(self, x: torch.ByteTensor):
#         # scale on GPU
#         xx = x / 255.0
#         return self.fc(self.conv(xx))
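The flattened size that PyTorch needs explicitly (and that Keras infers) can also be worked out by hand. The sketch below assumes unpadded convolutions, which is the default in both `nn.Conv2d` and Keras `Conv2D`; the helper name `conv_out` is made up for this example.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

size = 84  # input height and width
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three conv layers
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# The last conv layer has 64 filters, so the flattened vector has:
flat_features = size * size * 64

print(sizes)          # spatial size after each conv layer
print(flat_features)  # number of inputs to the first dense layer
```

The spatial size goes 84 → 20 → 9 → 7, so the first dense layer receives $7 \times 7 \times 64 = 3136$ features.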

Next we define several parameters and variables and construct the class for the replay buffer.

In [6]:
from dataclasses import dataclass
from typing import Tuple

DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19

GAMMA = 0.99
BATCH_SIZE = 32
REPLAY_SIZE = 10000
LEARNING_RATE = 1e-4
SYNC_TARGET_FRAMES = 1000
REPLAY_START_SIZE = 10000

EPSILON_DECAY_LAST_FRAME = 150000
EPSILON_START = 1.0 # we start with probability 1 of taking a random action
EPSILON_FINAL = 0.01 # we end with probability 0.01 of taking a random action

State = np.ndarray
Action = int
BatchTensors = tt.Tuple[
    tf.Tensor,           # current state (batch, H, W, C)
    tf.Tensor,           # actions
    tf.Tensor,               # rewards
    tf.Tensor,           # done || trunc
    tf.Tensor           # next state
]
# BatchTensors = Tuple[
#     np.ndarray,   # current state batch, shape (B, H, W, C)
#     np.ndarray,   # actions batch, shape (B,)
#     np.ndarray,   # rewards batch, shape (B,)
#     np.ndarray,   # done flags batch, shape (B,)
#     np.ndarray    # next state batch, shape (B, H, W, C)
# ]

@dataclass
class Experience:
    state: State
    action: Action
    reward: float
    done_trunc: bool
    new_state: State


class ExperienceBuffer:
    def __init__(self, capacity: int):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience: Experience):
        self.buffer.append(experience)

    def sample(self, batch_size: int) -> tt.List[Experience]:
        indices = np.random.choice(len(self), batch_size, replace=False)
        return [self.buffer[idx] for idx in indices]
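The two behaviours that matter in `ExperienceBuffer` are the bounded `deque`, which silently discards the oldest transitions, and sampling without replacement. A tiny standalone sketch follows; integers stand in for `Experience` objects, and the capacity of 5 is chosen only to keep the example readable.

```python
import collections
import numpy as np

capacity = 5
buffer = collections.deque(maxlen=capacity)

# Push 8 "transitions" into a buffer that holds only 5.
for t in range(8):
    buffer.append(t)

print(list(buffer))  # only the most recent transitions survive

# Sample a mini-batch of distinct transitions, as ExperienceBuffer.sample does.
rng = np.random.default_rng(0)
indices = rng.choice(len(buffer), size=3, replace=False)
batch = [buffer[int(i)] for i in indices]
```

Sampling uniformly from many past transitions breaks the strong correlation between consecutive frames, which plain stochastic gradient descent handles poorly.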

We now create the agent class.

In [7]:
class Agent:
    def __init__(self, env: gym.Env, exp_buffer: ExperienceBuffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self.state: tt.Optional[np.ndarray] = None
        self._reset()

    def _reset(self):
        # Use the agent's own environment rather than the global `env`.
        self.state, _ = self.env.reset()
        self.total_reward = 0.0

    # @torch.no_grad()
    def play_step(self, net: DQN,
                  epsilon: float = 0.0) -> tt.Optional[float]:
        done_reward = None

        if np.random.random() < epsilon:  # with probability epsilon, play a random action
            action = self.env.action_space.sample()
        else:  # otherwise, use the information stored in Q

            #state_v = torch.as_tensor(self.state).to(device)
            state_v = tf.convert_to_tensor(self.state, dtype=tf.float32)

            #state_v.unsqueeze_(0)  # adds the batch dimension; forgetting it is a common mistake
            state_v = tf.expand_dims(state_v, axis=0)

            #q_vals_v = net(state_v)  # computes the Q values
            q_values = net(state_v)

            #_, act_v = torch.max(q_vals_v, dim=1)  # we want the maximum along dimension 1
            act_v = tf.argmax(q_values, axis=1)

            #action = int(act_v.item())  # take the action with the best Q
            action = int(act_v[0])
            # action = int(act_idx.numpy()[0])

        # do step in the environment
        new_state, reward, is_done, is_tr, _ = self.env.step(action) # play that action
        self.total_reward += reward

        exp = Experience(
            state=self.state, action=action, reward=float(reward),
            done_trunc=is_done or is_tr, new_state=new_state
        )
        # each transition is stored in the form (s, a, r, done, s'); the buffer is a collection of these
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done or is_tr:
            done_reward = self.total_reward
            self._reset()
        return done_reward

We will also need to transform our samples taken from the buffer into tensors that can be fed to our networks. This is achieved by this simple function.

In [8]:
from typing import List

#def batch_to_tensors(batch: tt.List[Experience], device: torch.device) -> BatchTensors:
def batch_to_tensors(batch: List[Experience]) -> BatchTensors:
    states, actions, rewards, dones, new_state = [], [], [], [], []
    for e in batch:
        states.append(e.state)
        actions.append(e.action)
        rewards.append(e.reward)
        dones.append(e.done_trunc)
        new_state.append(e.new_state)
    
    #states_t = torch.as_tensor(np.asarray(states))
    states_t = tf.convert_to_tensor(states, dtype=tf.float32)

    #actions_t = torch.LongTensor(actions)
    actions_t = tf.convert_to_tensor(actions, dtype=tf.int32)

    #rewards_t = torch.FloatTensor(rewards)
    rewards_t = tf.convert_to_tensor(rewards, dtype=tf.float32)

    #dones_t = torch.BoolTensor(dones)
    dones_t = tf.convert_to_tensor(dones, dtype=tf.bool)
    
    #new_states_t = torch.as_tensor(np.asarray(new_state))
    new_states_t = tf.convert_to_tensor(new_state, dtype=tf.float32)

    #return states_t.to(device), actions_t.to(device), rewards_t.to(device), \
    #       dones_t.to(device),  new_states_t.to(device)
    return states_t, actions_t, rewards_t, dones_t,  new_states_t

The code that computes the loss is short, but it is the least straightforward part of the implementation.

In [9]:
def calc_loss(batch: tt.List[Experience], net: DQN, tgt_net: DQN): # -> torch.Tensor:
    states_t, actions_t, rewards_t, dones_t, new_states_t = batch_to_tensors(batch)

    q_values = net(states_t)

    #indices = tf.stack([tf.range(BATCH_SIZE), actions_t], axis=1)
    # q_values has shape (B, n_actions)
    batch_range = tf.range(tf.shape(q_values)[0], dtype=actions_t.dtype)   # shape (B,)
    indices     = tf.stack([batch_range, actions_t], axis=1)               # shape (B,2)

    #state_action_values = tf.expand_dims(q_values, indices)
    state_action_values = tf.gather_nd(q_values, indices)

    next_q = tf.reduce_max(tgt_net(new_states_t), axis=1)
    next_q = next_q * tf.cast(tf.logical_not(dones_t), tf.float32)

    expected_state_action_values = next_q * GAMMA + rewards_t

    loss_fn = tf.keras.losses.MeanSquaredError()
    loss = loss_fn(expected_state_action_values, state_action_values)

    return loss
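The `tf.stack` plus `tf.gather_nd` combination in `calc_loss` selects, for every row of the batch, the Q value of the action that was actually taken. The same selection in plain NumPy, on made-up numbers, looks like this:

```python
import numpy as np

# Hypothetical network output for a batch of 2 states and 3 actions.
q_values = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])  # shape (B, n_actions)
actions = np.array([2, 0])              # action taken in each transition

# Pair each row index with its action index, as tf.stack does ...
indices = np.stack([np.arange(len(actions)), actions], axis=1)  # shape (B, 2)

# ... and pick one entry per row, as tf.gather_nd does.
state_action_values = q_values[indices[:, 0], indices[:, 1]]

print(indices)
print(state_action_values)
```

Only these selected entries enter the loss, so the gradient for a transition flows through the output of the chosen action alone.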

In [10]:
# def calc_loss(batch: tt.List[Experience], net: DQN, tgt_net: DQN,
#               device: torch.device) -> torch.Tensor:
#     states_t, actions_t, rewards_t, dones_t, new_states_t = batch_to_tensors(batch)

#     state_action_values = net(states_t).gather(
#         1, actions_t.unsqueeze(-1)
#     ).squeeze(-1) # squeeze removes a dimension: [[...]] becomes [...]
#     with torch.no_grad():
#         next_state_values = tgt_net(new_states_t).max(1)[0]
#         next_state_values[dones_t] = 0.0
#         next_state_values = next_state_values.detach() # do not compute gradients

#     expected_state_action_values = next_state_values * GAMMA + rewards_t
#     return nn.MSELoss()(state_action_values, expected_state_action_values)

Finally, we have the code for the training loop. Note that this training procedure is quite expensive; it is infeasible without access to a GPU.

To monitor training in TensorBoard, run `tensorboard --logdir=logs  --port=6006`.
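The training loop decays $\epsilon$ linearly with the frame index and clamps it at its final value. The schedule can be inspected on its own; the constants are copied from the parameter cell above, while the function name `epsilon_at` is only illustrative.

```python
EPSILON_START = 1.0
EPSILON_FINAL = 0.01
EPSILON_DECAY_LAST_FRAME = 150000

def epsilon_at(frame_idx):
    """Linear decay from EPSILON_START, clamped below at EPSILON_FINAL."""
    return max(EPSILON_FINAL,
               EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

# Epsilon at the start, halfway through the decay, at its end, and beyond.
schedule = {f: epsilon_at(f) for f in (0, 75_000, 150_000, 300_000)}
for frame, eps in schedule.items():
    print(f"frame {frame:>7}: epsilon = {eps:.2f}")
```

Halfway through the decay, at frame 75000, the agent explores half of the time; from frame 150000 onwards it takes a random action only 1% of the time.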

In [11]:
import cv2
import argparse
import collections
import time
import numpy as np
import tensorflow as tf


parser = argparse.ArgumentParser()
parser.add_argument("--dev", default="cpu", help="Device name, default=cpu")
parser.add_argument("--env", default=DEFAULT_ENV_NAME,
                    help="Name of the environment, default=" + DEFAULT_ENV_NAME)
args, _ = parser.parse_known_args()
# device = torch.device(args.dev)

#env = wrappers.make_env(args.env)
env = make_env(args.env)
print(f"Environment observation space: {env.observation_space.shape}")
net = DQN(env.observation_space.shape, env.action_space.n)
tgt_net = DQN(env.observation_space.shape, env.action_space.n)

# Actually initialize both models by calling them with dummy input
dummy_input = tf.zeros((1,) + env.observation_space.shape, dtype=tf.float32)
print(f"Dummy input shape: {dummy_input.shape}")
net(dummy_input)  # This creates the weights
tgt_net(dummy_input)  # This creates the weights


log_dir = f"logs/{args.env}_{int(time.time())}"
writer = tf.summary.create_file_writer(log_dir)

#writer = SummaryWriter(comment="-" + args.env)
print("Network architecture:")
#net.build((None,) + env.observation_space.shape)
#net.summary()

buffer = ExperienceBuffer(REPLAY_SIZE)
agent = Agent(env, buffer)
epsilon = EPSILON_START

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
#optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_idx = 0
ts_frame = 0
ts = time.time()
best_m_reward = None


while True:
    frame_idx += 1
    epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

    reward = agent.play_step(net, epsilon)
    if reward is not None:
        total_rewards.append(reward)
        speed = (frame_idx - ts_frame) / (time.time() - ts)
        ts_frame = frame_idx
        ts = time.time()
        m_reward = np.mean(total_rewards[-100:])
        print(f"{frame_idx}: done {len(total_rewards)} games, reward {m_reward:.3f}, "
              f"eps {epsilon:.2f}, speed {speed:.2f} f/s")

        # TensorBoard logging for TensorFlow
        with writer.as_default():
            tf.summary.scalar("epsilon", epsilon, step=frame_idx)
            tf.summary.scalar("speed", speed, step=frame_idx)
            tf.summary.scalar("reward_100", m_reward, step=frame_idx)
            tf.summary.scalar("reward", reward, step=frame_idx)
            writer.flush()
        #writer.add_scalar("epsilon", epsilon, frame_idx)
        #writer.add_scalar("speed", speed, frame_idx)
        #writer.add_scalar("reward_100", m_reward, frame_idx)
        #writer.add_scalar("reward", reward, frame_idx)


        if best_m_reward is None or best_m_reward < m_reward:
            #torch.save(net.state_dict(), args.env + "-best_%.0f.dat" % m_reward)
            net.save_weights(f"{args.env}-best_{m_reward:.0f}.weights.h5")
            if best_m_reward is not None:
                print(f"Best reward updated {best_m_reward:.3f} -> {m_reward:.3f}")
            best_m_reward = m_reward
        if m_reward > MEAN_REWARD_BOUND:
            print("Solved in %d frames!" % frame_idx)
            break
    if len(buffer) < REPLAY_START_SIZE:
        continue

    # copy weights from net to tgt_net
    #if frame_idx % SYNC_TARGET_FRAMES == 0:
    #    tgt_net.load_state_dict(net.state_dict())
    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.set_weights(net.get_weights())

    # optimizer.zero_grad() # in PyTorch gradients must be zeroed manually, otherwise they accumulate; not needed with TensorFlow/Keras
    batch = buffer.sample(BATCH_SIZE)


    #loss_t = calc_loss(batch, net, tgt_net)
    #loss_t.backward()
    #optimizer.step()
    with tf.GradientTape() as tape:
        loss_t = calc_loss(batch, net, tgt_net)
    grads = tape.gradient(loss_t, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))

    if frame_idx % 10 == 0:
        print(f"training in frame {frame_idx}, Loss = {float(loss_t):.4f}")
writer.close()


Environment observation space: (84, 84, 4)
DQN input shape: (84, 84, 4)
DQN input shape: (84, 84, 4)




Dummy input shape: (1, 84, 84, 4)


2025-06-30 04:03:43.780757: W external/local_xla/xla/service/gpu/llvm_gpu_backend/default/nvptx_libdevice_path.cc:40] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.

Network architecture:
1104: done 1 games, reward -19.000, eps 0.99, speed 1222.08 f/s
2104: done 2 games, reward -19.000, eps 0.99, speed 792.40 f/s
2866: done 3 games, reward -19.667, eps 0.98, speed 964.72 f/s
3908: done 4 games, reward -19.500, eps 0.97, speed 997.05 f/s
4749: done 5 games, reward -19.600, eps 0.97, speed 938.76 f/s
5716: done 6 games, reward -19.833, eps 0.96, speed 989.14 f/s
6628: done 7 games, reward -20.000, eps 0.96, speed 910.06 f/s
7469: done 8 games, reward -20.125, eps 0.95, speed 806.46 f/s
8231: done 9 games, reward -20.222, eps 0.95, speed 722.00 f/s
9398: done 10 games, reward -19.900, eps 0.94, speed 737.64 f/s


2025-06-30 04:03:58.238416: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
error: libdevice not found at ./libdevice.10.bc

UnknownError: {{function_node __wrapped__Pow_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Pow] name: 

In [None]:
# import cv2
# import argparse
# import collections
# import time
# import numpy as np
# import torch
# import torch.optim as optim
# from torch.utils.tensorboard import SummaryWriter


# parser = argparse.ArgumentParser()
# parser.add_argument("--dev", default="cpu", help="Device name, default=cpu")
# parser.add_argument("--env", default=DEFAULT_ENV_NAME,
#                     help="Name of the environment, default=" + DEFAULT_ENV_NAME)
# args, _ = parser.parse_known_args()
# device = torch.device(args.dev)

# #env = wrappers.make_env(args.env)
# env = make_env(args.env)
# net = DQN(env.observation_space.shape, env.action_space.n).to(device)
# tgt_net = DQN(env.observation_space.shape, env.action_space.n).to(device)
# writer = SummaryWriter(comment="-" + args.env)
# print(net)

# buffer = ExperienceBuffer(REPLAY_SIZE)
# agent = Agent(env, buffer)
# epsilon = EPSILON_START

# optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
# total_rewards = []
# frame_idx = 0
# ts_frame = 0
# ts = time.time()
# best_m_reward = None


# while True:
#     frame_idx += 1
#     epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

#     reward = agent.play_step(net, device, epsilon)
#     if reward is not None:
#         total_rewards.append(reward)
#         speed = (frame_idx - ts_frame) / (time.time() - ts)
#         ts_frame = frame_idx
#         ts = time.time()
#         m_reward = np.mean(total_rewards[-100:])
#         print(f"{frame_idx}: done {len(total_rewards)} games, reward {m_reward:.3f}, "
#               f"eps {epsilon:.2f}, speed {speed:.2f} f/s")
#         writer.add_scalar("epsilon", epsilon, frame_idx)
#         writer.add_scalar("speed", speed, frame_idx)
#         writer.add_scalar("reward_100", m_reward, frame_idx)
#         writer.add_scalar("reward", reward, frame_idx)
#         if best_m_reward is None or best_m_reward < m_reward:
#             torch.save(net.state_dict(), args.env + "-best_%.0f.dat" % m_reward)
#             if best_m_reward is not None:
#                 print(f"Best reward updated {best_m_reward:.3f} -> {m_reward:.3f}")
#             best_m_reward = m_reward
#         if m_reward > MEAN_REWARD_BOUND:
#             print("Solved in %d frames!" % frame_idx)
#             break
#     if len(buffer) < REPLAY_START_SIZE:
#         continue
#     if frame_idx % SYNC_TARGET_FRAMES == 0:
#         tgt_net.load_state_dict(net.state_dict())

#     optimizer.zero_grad() # in PyTorch gradients must be zeroed manually, otherwise they accumulate; not needed with TensorFlow/Keras
#     batch = buffer.sample(BATCH_SIZE)
#     loss_t = calc_loss(batch, net, tgt_net, device)
#     loss_t.backward()
#     optimizer.step()
#     if frame_idx % 10 == 0:
#         print(f"training in frame {frame_idx}, Loss = {loss_t:.4f}")
# writer.close()


The evolution of our agent's ability to play Pong is recorded in the following collection of videos:
https://www.youtube.com/playlist?list=PLMVwuZENsfJklt4vCltrWq0KV9aEZ3ylu