Backgammon : An Atari Environment https://ale.farama.org/environments/backgammon/

Agent: Dueling DQN (Deep Q-Network)

Goal: To move all pieces off the board for either the RED or WHITE Player


*   Action Space:       Discrete(3) (0: FIRE, 1: RIGHT, 2: LEFT)
*   Observation Space:  Box(0, 255, (210, 160, 3), uint8)
*   Environment Import: gymnasium.make("ALE/Backgammon-v5")
*   Observation Type:   rgb, grayscale, ram
*   Variants:           v5 or ram-v5
*   Difficulty:         3 choices

Version History
*   Version One: Based upon "COMP6008 Reinforcement Learning: Practical_8_Deep_Q_Learning_Solutions"
*   Version Two: Improvements to the Practicum
*   Referencing Daisy's research (team member)






In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [18]:
!wandb login

wandb: Currently logged in as: tristancarlisle. Use `wandb login --relogin` to force relogin


In [None]:
# To be run once per Google Colab session
!apt update
!apt-get install -y xvfb x11-utils
!python -m pip install gymnasium[atari]
!python -m pip install pyvirtualdisplay
!python -m pip install --upgrade swig
!python -m pip install --upgrade pyvirtualdisplay moviepy
!python -m pip install --upgrade gymnasium[accept-rom-license,atari,box2d,classic_control,mujoco,toy_text]
!pip install box2d-py==2.3.5

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Ign:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:4 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,031 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,396 kB]
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,598 kB]
Get:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 Packages [33.9 kB]
Get:13 http:

The following libraries must be imported:

In [20]:
import os
os.environ['XDG_RUNTIME_DIR'] = '/tmp/runtime-tristan'
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import gymnasium as gym
from gymnasium import spaces
from collections import deque
from pyvirtualdisplay import Display
import moviepy.editor as mpy
import wandb
# from torchinfo import summary
# Create random number generator
rng = np.random.default_rng()
# create and start virtual display
display = Display(backend="xvfb")
display.start()

<pyvirtualdisplay.display.Display at 0x7fea74d63590>

In [4]:
import ale_py
gym.register_envs(ale_py)

In [5]:
device = "cpu"
if torch.cuda.is_available():
  device = "cuda"
  torch.set_default_device(torch.device(device))
torch.cuda.is_available()
# When testing this, one wants TRUE as the response

True

In [6]:
torch.get_default_device()

device(type='cuda', index=0)

Creating the CLASS for Dueling Q-Network

In [39]:
# Source: COMP6008 Reinforcement Learning: Practical_8_Deep_Q_Learning_Solutions
# source: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/model.py
class DuelingQNetwork(nn.Module):
  def __init__(self, input_size, hidden_sizes, output_size, learning_rate):
    super().__init__()
    #creation of network layers
    layers = nn.ModuleList()

    #input layer
    layers.append(nn.Linear(input_size, hidden_sizes[0]))
    #Above line altered for using input_size directly
    layers.append(nn.ReLU())

    # hidden layers
    for i in range(len(hidden_sizes)-1):
      layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i+1]))
      layers.append(nn.ReLU())

    # output/special layers (state value and action advantage)
    # outputs a 1D tensor of size 1 with state value
    self.state_value = nn.Linear(hidden_sizes[-1],1)
    # outputs a 1D tensor of size output_size with the advantage value of each action
    self.adv = nn.Linear(hidden_sizes[-1], output_size)
    # combine layers into feed-forward network
    self.net = nn.Sequential(*layers)
    # select loss function and optimizer
    self.criterion = nn.MSELoss()
    self.optimizer = torch.optim.Adam([
        {"params": self.net.parameters()},
        {"params": self.state_value.parameters()},
        {"params": self.adv.parameters()}],
                                      lr=learning_rate)
    # initialise the weights according to dueling network architecture
    self.net.apply(self.init_weights)
    self.state_value.apply(self.init_weights)
    self.adv.apply(self.init_weights)

  def init_weights(self, m):
    # Xavier-uniform initialisation for linear and convolutional layers
    if isinstance(m, (nn.Linear, nn.Conv2d)):
      torch.nn.init.xavier_uniform_(m.weight)

  def forward(self,x):
    # shared feature layers (the input is used as given; no reshaping is done here)
    x1 = self.net(x)
    # state value output
    state_value = self.state_value(x1)
    # advantage output
    adv = self.adv(x1)
    # combine into Q-values: Q = V + A - mean(A)
    return state_value + adv - adv.mean(dim=-1,keepdim=True)

  def update(self, inputs, targets):
    # Update network weights from inputs and targets
    self.optimizer.zero_grad()
    outputs = self.forward(inputs)
    loss = self.criterion(outputs, targets)
    loss.backward()
    self.optimizer.step()

  def copy_from(self, qnetwork):
    # copy weights from another Q-network
    self.net.load_state_dict(qnetwork.net.state_dict())
    self.state_value.load_state_dict(qnetwork.state_value.state_dict())
    self.adv.load_state_dict(qnetwork.adv.state_dict())
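
The way the forward pass combines the two heads can be checked by hand on small numbers. The sketch below is only an illustration in NumPy, not part of the network: the helper name `dueling_q_values` and the input numbers are made up for the example.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean over a of A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean(axis=-1, keepdims=True)

# Hypothetical values for one state with three actions (FIRE, RIGHT, LEFT)
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q)  # [3. 2. 1.]

# Subtracting the mean advantage keeps the average Q-value equal to V(s)
q_shifted = dueling_q_values(0.5, [4.0, 1.0, 1.0])
print(q_shifted.mean())
```

Because the mean advantage is subtracted, adding a constant to every advantage leaves the Q-values unchanged, which is what makes the split into value and advantage identifiable.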


Creation of the Environment for Backgammon in Atari

Practicum 8's code was written for a continuous observation space, but I required a discrete observation space. This requires modifications to the Agent so that it can be set up for discrete observations.

In [40]:
# Creation of the environment
# source: COMP6008 Reinforcement Learning: Practical_8_Deep_Q_Learning_Solutions
# source: https://github.com/chengxi600/RLStuff/blob/master/Q%20Learning/Atari_DQN.ipynb
env = gym.make("ALE/Backgammon-v5", render_mode="rgb_array_list",obs_type="ram")

gamma = 0.99
hidden_sizes = (128, 128)
learning_rate = 0.001
epsilon = 1.0
min_epsilon = 0.01
tau = 0.1
rep_omega = 0.2
replay_size = 50000
minibatch_size = 32
target_update = 1000
epsilon_update = 1500000
max_episodes = 700
max_steps = 18000
criterion_episodes = 5

wandb.init(project='Bron-Dueling-DDQN', config={
    'gamma': gamma,
    'hidden_sizes':hidden_sizes,
    'epsilon':epsilon,
    'min_epsilon':min_epsilon,
    'rep_omega':rep_omega,
    'tau':tau,
    'learning_rate': learning_rate,
    'epsilon_update': epsilon_update,
    'replay_size': replay_size,
    'minibatch_size': minibatch_size,
    'target_update_freq': target_update,
    'num_episodes': max_episodes,
    'max_steps_per_episode': max_steps,
    'criterion_episodes':criterion_episodes
})
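
To get a feel for two of the hyperparameters above, the following sketch reproduces the linear epsilon schedule and the soft target mixing as plain functions. The function names `linear_epsilon` and `soft_update` are hypothetical helpers for illustration; the default values are the ones set in this cell, and the weights passed to `soft_update` are invented scalars standing in for network parameters.

```python
def linear_epsilon(step, init_epsilon=1.0, min_epsilon=0.01, epsilon_update=1_500_000):
    """Exploration rate after `step` behaviour calls under a linear decay."""
    fraction_left = 1.0 - step / epsilon_update
    epsilon = (init_epsilon - min_epsilon) * fraction_left + min_epsilon
    # clip to [min_epsilon, init_epsilon] so the rate never leaves its range
    return min(max(epsilon, min_epsilon), init_epsilon)

def soft_update(target_weight, online_weight, tau=0.1):
    """Move a target weight a fraction tau of the way towards the online weight."""
    return (1.0 - tau) * target_weight + tau * online_weight

# epsilon at the start, halfway, at the end of the schedule, and beyond it
for step in (0, 750_000, 1_500_000, 2_000_000):
    print(step, round(linear_epsilon(step), 4))

# ten soft updates of a scalar weight from 0.0 towards 1.0
weight = 0.0
for _ in range(10):
    weight = soft_update(weight, 1.0)
print(round(weight, 4))
```

With tau = 0.1 the target weight covers about 65% of the gap after ten soft updates, so the target network trails the online network rather than jumping to it.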







Creating the CLASS for Dueling Q-Network with Prioritised Experience Replay

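This code is very different from the previous class: it defines the agent (behaviour policy, replay buffer and training loop) rather than the network itself.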
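
As a rough illustration of prioritised experience replay, the sketch below turns stored TD errors into sampling probabilities and draws a minibatch without replacement. It is a simplified stand-alone version, not the agent's code: the helper name `replay_probabilities` and the TD errors are made up, and the offset is assumed to be added after taking the absolute value so that a transition with zero TD error can still be sampled.

```python
import numpy as np

def replay_probabilities(td_errors, omega=0.2, offset=0.02):
    """Sampling probability of each stored transition from its TD error."""
    # priority = |TD error| + offset (assumption: offset added after the absolute value)
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + offset
    # omega controls how strongly priorities skew sampling (0 would be uniform)
    scaled = priorities ** omega
    return scaled / scaled.sum()

rng = np.random.default_rng(0)
td_errors = [4.3, -39.4, 152.0, 0.0]   # hypothetical TD errors of four transitions
probs = replay_probabilities(td_errors)
print(probs)

# draw a minibatch of two distinct buffer indices according to these probabilities
batch = rng.choice(len(td_errors), size=2, replace=False, p=probs)
print(batch)
```

With rep_omega = 0.2 the skew is mild: the transition with the largest TD error is favoured, but low-error transitions are still replayed regularly.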

In [43]:
# Dueling Double Deep Q-network with Prioritised Experience Replay
# Source: COMP6008 Reinforcement Learning: Practical_8_Deep_Q_Learning_Solutions
class AgentDuelingDDQNREP():
  def __init__(self, env, gamma,
               hidden_sizes=(32,32),
               learning_rate=0.001,
               epsilon=0.1,
               min_epsilon=0.01,
               tau=0.1,
               rep_omega=0.2,
               replay_size=10000,
               minibatch_size=32,
               epsilon_update=50000,
               target_update=20):
    # Checking the state space type:
    #continuous = isinstance(env.observation_space, spaces.Box) and len(env.observation_space.shape) == 1
    #assert continuous, "Observation space must be continuous with shape (n,)"
    self.state_channels = env.observation_space.shape[0]
    print(self.state_channels)
    self.state_dims = env.observation_space.shape
    # Check the action space has the correct type
    assert isinstance(env.action_space, spaces.Discrete), "Action space must be discrete"
    self.num_actions = env.action_space.n

    # create dueling Q-networks for action-value function
    self.qnet = DuelingQNetwork(self.state_channels, hidden_sizes, self.num_actions, learning_rate)
    self.target_qnet = DuelingQNetwork(self.state_channels, hidden_sizes, self.num_actions, learning_rate)

    # Copy weights from Q-network to target Q-network
    self.target_qnet.copy_from(self.qnet)

    # initialise replay buffer
    self.replay_buffer = deque(maxlen=replay_size)
    self.env = env
    self.gamma = gamma
    self.epsilon = epsilon
    self.init_epsilon = epsilon
    self.min_epsilon = min_epsilon
    self.tau = tau
    self.rep_omega = rep_omega
    self.minibatch_size = minibatch_size
    self.target_update = target_update
    self.target_update_idx = 0
    self.epsilon_update = epsilon_update
    self.epsilon_update_idx = 0

  def _linear_decay_epsilon_update(self):
    epsilon = 1 - self.epsilon_update_idx / self.epsilon_update
    epsilon = (self.init_epsilon - self.min_epsilon) * epsilon + self.min_epsilon
    epsilon = np.clip(epsilon, self.min_epsilon, self.init_epsilon)
    self.epsilon_update_idx += 1
    self.epsilon = epsilon

  def behaviour(self, state):
    # Exploratory behaviour policy
    if rng.uniform() >= self.epsilon:
      # convert state to torch format
      if not torch.is_tensor(state):
        state = torch.tensor(state, dtype=torch.float)
      # exploitation with probability 1-epsilon; break ties randomly
      q = self.qnet(state).detach()
      j = rng.permutation(self.num_actions)
      action= j[q[j].argmax().item()]
    else:
      # exploration with probability epsilon
      action=self.env.action_space.sample()
    self._linear_decay_epsilon_update()
    return action

  def policy(self, state):
    # convert state to torch format
    if not torch.is_tensor(state):
      state = torch.tensor(state, dtype=torch.float)
    # greedy policy
    q = self.qnet(state).detach()
    return q.argmax().item()

  def td_error(self, state, action, reward, next_state, terminated):
    # calculate TD error for prioritised experience replay
    with torch.no_grad():
        if terminated:
            td_target = reward
        else:
            # double learning: online network selects the action, target network evaluates it
            next_action = self.qnet(next_state).argmax()
            next_q = self.target_qnet(next_state)
            td_target = reward + self.gamma * next_q[next_action]
        td_error = td_target - self.qnet(state)[action]
    return td_error.item()

  def update(self):
    # Update Q-network if there is enough experience
    if len(self.replay_buffer) >= self.minibatch_size:
      # priority = |TD error| + small offset, so every transition keeps a nonzero probability
      priorities = np.array([np.abs(sample[5]) + 0.02 for sample in self.replay_buffer])
      scaled_priorities = priorities ** self.rep_omega
      pri_sum = np.sum(scaled_priorities)
      probs = scaled_priorities / pri_sum

      batch = rng.choice(len(self.replay_buffer), size=self.minibatch_size, replace=False, p=probs)
      # calculate inputs and targets for the transitions in the minibatch
      inputs = torch.zeros((self.minibatch_size, self.state_dims[0]))
      targets = torch.zeros((self.minibatch_size, self.num_actions))
      for n, index in enumerate(batch):
        state, action, reward, next_state, terminated, _ = self.replay_buffer[index]
        state = state.clone()
        next_state = next_state.clone()
        # inputs are states
        inputs[n, :] = state.view(-1)
        targets_q_values = self.target_qnet(state.unsqueeze(0)).detach().squeeze(0)
        targets[n, :] = targets_q_values
        if terminated:
          targets[n,action] = reward
        else:
          #double learning
          next_action = self.qnet(next_state).detach().argmax()
          next_q = self.target_qnet(next_state).detach()
          targets[n, action] = reward + self.gamma*next_q[next_action]
      self.qnet.update(inputs, targets)
    # copies weights from Q-network to target Q-network
    self.target_update_idx += 1
    if self.target_update_idx % self.target_update == 0:
      self.update_networks()

  # Soft target update (mixing weights by tau); not in the original Practicum
  def update_networks(self):
    for target, online in zip(self.target_qnet.parameters(), self.qnet.parameters()):
      target_ratio = (1.0-self.tau) * target.data
      online_ratio = self.tau * online.data
      mix = target_ratio + online_ratio
      target.data.copy_(mix)

  def train(self, max_episodes, stop_criterion, criterion_episodes, max_steps=18000):
    # train the agent for a number of episodes
    rewards = []
    num_steps = 0
    for episodes in range(max_episodes):
      state, _ = self.env.reset()
      # convert state to torch format
      state = torch.tensor(state, dtype=torch.float)
      terminated = False
      truncated = False
      episode_steps = 0
      rewards.append(0)
      # max_steps caps each episode (logged as max_steps_per_episode), not the whole run
      while not (terminated or truncated) and episode_steps < max_steps:
        # select action by following behaviour policy
        action = self.behaviour(state)
        # send action to the environment
        next_state, reward, terminated, truncated, _ = self.env.step(action)
        # convert next state to torch format
        next_state = torch.tensor(next_state, dtype=torch.float)
        # calculate td error for prioritised experience replay and add experience to replay buffer
        per = self.td_error(state, action, reward, next_state, terminated)
        self.replay_buffer.append((state.clone(), action, reward, next_state.clone(), terminated, per))
        # Update Q-network
        self.update()
        state = next_state
        rewards[-1] += reward
        num_steps += 1
        episode_steps += 1
        print(f'action: {action}, reward: {reward}, per: {per}')
      wandb.log({
                    'episode': episodes,
                    'Episode Reward':rewards[episodes],
                    'total steps': num_steps,
                    'epsilon': self.epsilon,

                })
      print(f"\rEpisode {episodes+1} done: steps: {num_steps}, rewards = {rewards[episodes]}", end="")
      if episodes >= criterion_episodes-1 and stop_criterion(rewards[-criterion_episodes:]):
        print(f"\nStopping criterion satisfied after {episodes+1} episodes")
        break
    # plot rewards received during training
    #plt.figure(dpi=100)
    #plt.plot(range(1, len(rewards)+1), rewards, label=f"Rewards")
    #plt.xlabel("Episodes")
    #plt.ylabel("Rewards per episode")
    #plt.legend(loc="lower right")
    #plt.grid()
    #plt.show()

  def save(self, path):
    # save the online network weights to a file
    # (load() rebuilds the target network from them, so it is not saved separately)
    torch.save(self.qnet.state_dict(), path)

  def load(self, path):
    # load network weights from a file
    self.qnet.load_state_dict(torch.load(path))
    self.target_qnet.copy_from(self.qnet)
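
The double-learning target used in `td_error` and `update` can be illustrated on plain arrays. This is a minimal sketch with invented Q-values, assuming the usual Double DQN rule in which the online network chooses the next action and the target network evaluates it; `double_dqn_target` is a hypothetical helper, not a method of the agent.

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, terminated=False):
    """TD target for one transition under double learning."""
    if terminated:
        # no bootstrapping from a terminal state
        return float(reward)
    # the online network selects the next action ...
    next_action = int(np.argmax(next_q_online))
    # ... and the target network supplies that action's value
    return float(reward + gamma * next_q_target[next_action])

# hypothetical Q-values of the next state for the three actions
next_q_online = np.array([0.2, 1.5, 0.7])
next_q_target = np.array([0.4, 1.0, 2.0])

target = double_dqn_target(0.0, next_q_online, next_q_target)
print(target)

# for comparison: a single-network target would take the target network's own maximum
print(0.0 + 0.99 * next_q_target.max())
```

Here the online network prefers action 1, so the target bootstraps from 1.0 rather than from the target network's own maximum of 2.0, which is how double learning limits overestimation.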

In [44]:
%wandb
agent = AgentDuelingDDQNREP(env,
                            gamma=gamma,
                            hidden_sizes=hidden_sizes,
                            learning_rate=learning_rate,
                            epsilon=epsilon,
                            min_epsilon=min_epsilon,
                            tau=tau,
                            rep_omega=rep_omega,
                            replay_size=replay_size,
                            minibatch_size=minibatch_size,
                            target_update=target_update,
                            epsilon_update=epsilon_update)
agent.train(max_episodes, lambda x : min(x) >= 1000, criterion_episodes)

128
action: 2, reward: 0.0, per: 4.339866638183594
action: 0, reward: 0.0, per: 5.4223785400390625
action: 0, reward: 0.0, per: 2.2196426391601562
action: 1, reward: 0.0, per: 151.99227905273438
action: 1, reward: 0.0, per: 64.82538604736328
action: 0, reward: 0.0, per: -39.43195343017578
action: 1, reward: 0.0, per: 17.881927490234375
action: 1, reward: 0.0, per: 147.66012573242188
action: 1, reward: 0.0, per: 115.36630249023438
action: 2, reward: 0.0, per: 59.505638122558594
action: 2, reward: 0.0, per: -5.968666076660156
action: 2, reward: 0.0, per: 8.541511535644531
action: 0, reward: 0.0, per: -11.441722869873047
action: 1, reward: 0.0, per: 84.29571533203125
action: 2, reward: 0.0, per: -11.28802490234375
action: 1, reward: 0.0, per: 104.20701599121094
action: 0, reward: 0.0, per: -46.177188873291016
action: 2, reward: 0.0, per: 53.396968841552734
action: 0, reward: 0.0, per: -32.069801330566406
action: 0, reward: 0.0, per: -7.766731262207031
action: 1, reward: 0.0, per: 88.08023

AttributeError: module 'matplotlib.pyplot' has no attribute 'figures'
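
The stopping criterion passed to `train` is a function of the most recent `criterion_episodes` episode rewards. The sketch below mirrors that check outside the agent so it can be tried on a list of rewards; `should_stop` is a hypothetical helper and the reward lists are invented.

```python
def should_stop(rewards, stop_criterion, criterion_episodes):
    """Apply the criterion to the last `criterion_episodes` rewards, once enough exist."""
    if len(rewards) < criterion_episodes:
        return False
    return stop_criterion(rewards[-criterion_episodes:])

# same criterion as in the training call: every recent episode must reach 1000
criterion = lambda recent: min(recent) >= 1000

print(should_stop([1000] * 4, criterion, 5))          # too few episodes so far
print(should_stop([0] + [1000] * 5, criterion, 5))    # last five all reach 1000
print(should_stop([1000] * 4 + [999], criterion, 5))  # one recent episode falls short
```

Using `min` makes the criterion strict: a single weak episode inside the window keeps training going.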

In [47]:
import imageio
class VideoRecorderRAM:
    def __init__(self, dir_name, fps=30):
        self.dir_name = dir_name
        self.fps = fps
        self.frames = []

    def reset(self):
        self.frames = []

    def record(self, frame):
        self.frames.append(frame)

    def save(self, file_name):
        path = os.path.join(self.dir_name, file_name)
        imageio.mimsave(path, self.frames, fps=self.fps, macro_block_size = None)

In [49]:
# visualise one episode
state, _ = env.reset()
terminated = False
truncated = False
steps = 0
total_reward = 0
while not (terminated or truncated or steps > max_steps):
  # take action based on policy
  action = agent.policy(state)
  print(action)
  # environment receives the action and returns
  # next observation, reward, terminated, truncated and additional information if applicable
  state, reward, terminated, truncated, info = env.step(action)
  total_reward += reward
  steps += 1

print(f"Reward: {total_reward}")

# store RGB frames for the entire episode
frames = env.render()

#close the environment
env.close()

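# make sure the output folder exists before writing the video
os.makedirs('BackgammonVids', exist_ok=True)
v = VideoRecorderRAM('BackgammonVids')
v.frames = frames
vfilename = 'DDDQN_Backgammon.mp4'
v.save(vfilename)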

# create and play video clips using frames and given fps
#clip = mpy.ImageSequenceClip(frames, fps=50)
#clip.ipython_display(rd_kwargs=dict(logger=None))



2
2
2
[... the same value, 2, is printed on every one of the 500 captured output lines ...]


Save the weights of the Q-Network to file

In [None]:
agent.save("backgammon.128x128.DuelingDDQREP.pt")

Suggestion for changing the agent to handle a DISCRETE observation space:


class AgentDuelingDDQNREP:
    def __init__(self, env, gamma, hidden_sizes, learning_rate, epsilon, rep_omega, replay_size, minibatch_size, target_update):
        # Check if observation space is discrete
        if isinstance(env.observation_space, spaces.Discrete):
            self.state_dims = env.observation_space.n  # For discrete spaces, use the number of states
        else:
            raise ValueError("Observation space must be discrete")

        assert isinstance(env.action_space, spaces.Discrete), "Action space must be discrete"
        # Initialize the rest of your agent...

# Create the agent
agent = AgentDuelingDDQNREP(env, gamma, hidden_sizes, learning_rate, epsilon, rep_omega, replay_size, minibatch_size, target_update)