# Installing the dependencies:

In [6]:
!git clone https://github.com/Near32/comaze-python.git ; cd comaze-python; git checkout develop-rl-template; git pull; git status; pip install -e .

fatal: destination path 'comaze-python' already exists and is not an empty directory.
Already on 'develop-rl-template'
Your branch is up to date with 'origin/develop-rl-template'.
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 17 (delta 8), reused 17 (delta 8), pack-reused 0
Unpacking objects: 100% (17/17), done.
From https://github.com/Near32/comaze-python
   52a41aa..4117798  develop-rl-template -> origin/develop-rl-template
Updating 52a41aa..4117798
Fast-forward
 comaze/agents/abstract_agent.py                    |  14 ++-
 comaze/agents/rl/abstract_on_policy_rl_agent.py    |  54 ++++++----
 comaze/agents/rl/simple_on_policy_rl_agent.py      |  11 +-
 comaze/env/comaze.py                               |  24 +++--
 setup.py                                           |  17 ++--
 .../rl/test_trainin

# Before continuing any further, please restart the kernel (Runtime->restart runtime) in order to make the installed packages available.

# 1) Create a simple (non-communicating, but coordinating) On-Policy RL Agent:

## What does the player know and see about the game?

Taking a look at the observation space of the game CoMaze shows us what our player sees at each step. The observation space comprises the following elements:


*   an arena/game board (giving us the position of the agent, among other things),
*   the list of directional moves available to the current player (among ["LEFT", "RIGHT", "UP", "DOWN", "SKIP"]),
*   the last message coming from the other player (if any),
*   and the secret rule of the current player (specifying whether to reach a given-color goal before that of another color).

In this first example, we will only care about the arena/game board and the list of directional moves that are available to the current player.

### RL-based player's observation space:

To let a (deep) RL-based player make sense of the arena/game board and the available directional moves, both are pre-formatted and delivered to your player at each step in a Dict structure containing the following:

*    the arena/game board is stored under the key "encoded_pov" as a 3D tensor of size 7x7x12. The board is 7 tiles wide and 7 tiles high, and the entries of the tensor at each position describe the nature of the corresponding tile.
*    the list of directional moves available to the current player is stored under the key "available_moves". It is represented by a 1D vector of size 5. Entry i contains a 1 if the i-th directional move (among ["LEFT", "RIGHT", "UP", "DOWN", "SKIP"]) is available to the current player.

As we will see in the example below, "encoded_pov" is processed by a CNN while "available_moves" is processed by a fully-connected layer.
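
To make these shapes concrete, here is a minimal sketch that builds a mock observation with the layout described above and rearranges it into the batched, channels-first layout that convolutional layers in PyTorch expect. The values are random placeholders rather than real game output; only the key names and shapes come from the description above.

```python
import numpy as np

# Mock observation: the values are random placeholders, not real game output.
rng = np.random.default_rng(0)
mock_obs = {
    "encoded_pov": rng.integers(0, 2, size=(7, 7, 12)).astype(np.float32),
    # Order: LEFT, RIGHT, UP, DOWN, SKIP. Here UP is marked as unavailable.
    "available_moves": np.array([1, 1, 0, 1, 1], dtype=np.float32),
}

# Convolutional layers expect (batch, channels, height, width), so the
# last axis moves to the front and a batch axis of size 1 is added.
pov = np.expand_dims(np.transpose(mock_obs["encoded_pov"], (2, 0, 1)), axis=0)
moves = np.expand_dims(mock_obs["available_moves"], axis=0)

print(pov.shape)    # (1, 12, 7, 7)
print(moves.shape)  # (1, 5)
```

Passing each array through `torch.from_numpy` then gives tensors that a network can consume directly.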


## Action Space: What moves can the player do?

At each step, the player can execute a move that consists of choosing the following two elements:

*    a directional move (among ["LEFT", "RIGHT", "UP", "DOWN", "SKIP"]),
*    and a message (among [EMPTY_MESSAGE, "Q", "W", "E", "R", "T", "Y", "U", "I", "O", "P"]).

When the agent chooses the "SKIP" directional move, the game forces the message to be the EMPTY_MESSAGE, i.e. no message will be passed on to the next player.

(Note that, in the current version of the game, we only allow a vocabulary size of V=10 and a maximum sentence length of L=1.)

In this first example, we will only care about the directional move, thus focusing on our players' ability to coordinate with each other without any communication.

As we will see below, the player is expected to output a discrete action/move id in the range [0,4], i.e. one id per directional move.
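
As a rough sketch of how such an id relates to the directional moves, the snippet below masks a probability vector with the available moves and samples an id from what remains. It assumes that id i maps to the i-th entry of the move list above, which the actual mapping inside `discrete_direction_only_format_move_fn` may or may not follow. `sample_available_move` is a hypothetical helper written for illustration, not part of the comaze package.

```python
import random

DIRECTIONS = ["LEFT", "RIGHT", "UP", "DOWN", "SKIP"]  # assumed id order


def sample_available_move(probs, available_moves, rng=random):
    """Sample a move id, giving zero probability to unavailable moves."""
    masked = [p * a for p, a in zip(probs, available_moves)]
    if sum(masked) <= 0:
        raise ValueError("no available move has positive probability")
    # random.choices treats the weights as relative, so no explicit
    # renormalisation is needed after masking.
    return rng.choices(range(len(masked)), weights=masked, k=1)[0]


probs = [0.2, 0.2, 0.2, 0.2, 0.2]
available = [1, 0, 0, 1, 1]  # RIGHT and UP are blocked in this example
move_id = sample_available_move(probs, available, random.Random(0))
print(move_id, DIRECTIONS[move_id])
```

Masking before sampling guarantees that a blocked move is never selected, whatever the network outputs.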

## Template: Have a go at modifying it!

Below is the template of an RL-based player.
Please have a go at modifying the architecture of its neural network, in the method build_agent.
Be careful to accommodate any change made to the network topology in the select_action method.
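
When changing kernel sizes, strides or padding in the convolutional stack of the template, the input size of the first `nn.Linear` layer has to change accordingly. The sketch below works through that arithmetic with the standard output-size formula for a convolution without dilation. `conv_out_size` and `flattened_size` are hypothetical helper names used for illustration only.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_size(size, layers, out_channels):
    """Number of features after the conv stack and a flatten, for a square input."""
    for kernel_size, stride, padding in layers:
        size = conv_out_size(size, kernel_size, stride, padding)
    return out_channels * size * size


# (kernel_size, stride, padding) of the three conv layers in the template.
template_layers = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]
print(flattened_size(7, template_layers, out_channels=32))  # 512
```

The 7x7 board stays 7x7 after the first layer, shrinks to 4x4 after the stride-2 layer, and stays 4x4 after the third, so 32 channels give 32 * 4 * 4 = 512 features, matching `nn.Linear(512, ...)` in the template.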

---



In [4]:
from typing import Any
from typing import Dict
from typing import List
from typing import Callable
from typing import Optional

import numpy as np 

import torch 
import torch.nn as nn
import torch.nn.functional as F
from torch import distributions 

from comaze.agents.rl import AbstractOnPolicyRLAgent
from comaze.agents.utils import dict_encoded_pov_avail_moves_extract_exp_fn, discrete_direction_only_format_move_fn


class SimpleOnPolicyRLAgent(AbstractOnPolicyRLAgent):
  """
  Simple on-policy RL agents using PyTorch.
  
  Call init_rl_algo at the end of the init function.

  The output of select_action must be a dictionary containing:
    - "action": the actual action that needs to be transformed 
                using the format_move_fn function.
    - "log_prob_action": the log likelihood over the action
                          distribution. 
  
  Note the default extract_exp_fn and format_move_fn functions.
  They are the minimum to allow any learning to take place.

  As AbstractAgent requests it, you also need to implement:
    - agent_id: Agent's unique id.
    - select_action: Agent's action selection logic.
  """

  def __init__(
    self, 
    learning_rate: float=1e-4,
    discount_factor: float=0.99,
    num_actions: int=5,
    pov_shape: List[int]=[7,7,12],
    use_cuda: Optional[Any]=False,
    ) -> None:
    """
    Initializes the agent.
    """
    nn.Module.__init__(self=self)
    AbstractOnPolicyRLAgent.__init__(
      self=self,
      extract_exp_fn=dict_encoded_pov_avail_moves_extract_exp_fn, 
      format_move_fn=discrete_direction_only_format_move_fn,
      learning_rate=learning_rate,
      discount_factor=discount_factor,
    )

    self.num_actions = num_actions
    self.pov_shape = pov_shape
    self.use_cuda = use_cuda
    self.build_agent()
    
    self.init_rl_algo()
  
  @property
  def agent_id(self) -> str:
    #################
    ## MODIFY HERE ##
    return "simple_onpolicy_rlagent"
    ## MODIFY HERE ##
    #################
    
  def build_agent(self):
    #################
    ## MODIFY HERE ##
    self.embed_pov_size = 256
    self.embed_pov = nn.Sequential(
      nn.Conv2d(in_channels=self.pov_shape[-1], out_channels=32, kernel_size=3, stride=1, padding=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=2, padding=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1),
      nn.ReLU(),
      nn.Flatten(),
      nn.Linear(512, self.embed_pov_size),
      nn.ReLU(),
    )
    
    self.embed_action_size = 128
    self.embed_action_space = nn.Linear(self.num_actions, self.embed_action_size)
    
    policy_input_size = self.embed_pov_size+self.embed_action_size
    self.policy = nn.Linear(policy_input_size, self.num_actions)
    ## MODIFY HERE ##
    #################

    if self.use_cuda:
      self.cuda()

  def get_formatted_inputs(self, obs):
    nobs = {}
    for k,v in obs.items():
      if 'pov' in k:
        # move channels around:
        assert len(v.shape)==3
        v = np.transpose(v, (2,0,1))
      nv = torch.from_numpy(v).unsqueeze(0).float()
      nobs[k] = nv.cuda() if self.use_cuda else nv
    return nobs

  def select_action(self, observation: Any) -> Dict[str, Any]:
    """
    Returns agent's action given `observation`.
    """

    obs = self.get_formatted_inputs(observation)

    pov_input = obs["encoded_pov"]
    action_space = obs["available_moves"]
    
    #################
    ## MODIFY HERE ##
    pov_emb = self.embed_pov(pov_input)
    action_emb = self.embed_action_space(action_space)
    
    pov_action_emb = torch.cat((pov_emb, action_emb), dim=1)
    action_pred = self.policy(pov_action_emb)
    ## MODIFY HERE ##
    #################

    action_prob = F.softmax(action_pred, dim = -1)  
    avail_action_prob = action_prob * obs["available_moves"]
    dist = distributions.Categorical(avail_action_prob)
    action = dist.sample()
    log_prob_action = dist.log_prob(action)

    action_dict = {
      "action": action.item(),
      "log_prob_action": log_prob_action
    }

    return action_dict
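
The `log_prob_action` returned by `select_action` is the quantity a policy-gradient update needs. The update itself lives in `AbstractOnPolicyRLAgent` and is not shown here, so the following is only a sketch of a REINFORCE-style loss under that assumption, written in plain Python for readability. `discounted_returns` and `reinforce_loss` are hypothetical names.

```python
def discounted_returns(rewards, discount_factor=0.99):
    """Return G_t = r_t + discount_factor * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the following one.
    for reward in reversed(rewards):
        running = reward + discount_factor * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, rewards, discount_factor=0.99):
    """Negative sum of the log-probabilities weighted by the discounted returns."""
    returns = discounted_returns(rewards, discount_factor)
    return -sum(lp * g for lp, g in zip(log_probs, returns))


rewards = [0.0, 0.0, 1.0]
print(discounted_returns(rewards, discount_factor=0.5))  # [0.25, 0.5, 1.0]
```

With PyTorch tensors in place of floats for the log-probabilities, the same expression yields a loss that can be backpropagated through the policy network.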

In [12]:
import random
from typing import Callable
import pandas as pd 

from functools import partial
from tqdm import tqdm 
from tensorboardX import SummaryWriter 

from comaze.env import TwoPlayersCoMazeGym
from comaze.agents import AbstractAgent, SimpleOnPolicyRLAgent


def two_players_environment_loop(
    agent1: AbstractAgent,
    agent2: AbstractAgent,
    environment,
    max_episode_length,
):
  """
  Loop runner for the environment.
  """
  # Setup environment.
  state = environment.reset()

  # Book-keeping.
  t = 0
  done = False
  trajectory = list()
  cum_reward = 0

  ebar = tqdm(total=max_episode_length, position=1)
  while not done and t<=max_episode_length:
    ebar.update(1)
    # Turn-based game.
    if t%2 == 0:
      move = agent1.select_move(state)
    else:
      move = agent2.select_move(state)
  
    # Progress simulation.
    next_state, reward, done, info = environment.step(move)

    if t==max_episode_length:
      done = True
      reward += -1

    for agent in [agent1, agent2]:
      agent.update(move, next_state, reward, done)

    # Book-keeping.
    trajectory.append((t, state, move, reward, next_state, done, info))

    cum_reward += reward
    t = t + 1
    state = next_state
  

  # Dump logs.
  pd.DataFrame(trajectory).to_csv("{}-{}.csv".format(
      agent1.agent_id, agent2.agent_id)
  )

  return cum_reward, trajectory

## Let us test the players:

In [7]:
use_cuda = True # make sure to use a GPU in Runtime->change runtime type... 
sparse_reward = False

agent1 = SimpleOnPolicyRLAgent( 
  learning_rate=1e-4,
  discount_factor=0.99,
  num_actions=5,
  pov_shape=[7,7,12],
  use_cuda=use_cuda,
)

agent2 = SimpleOnPolicyRLAgent( 
  learning_rate=1e-4,
  discount_factor=0.99,
  num_actions=5,
  pov_shape=[7,7,12],
  use_cuda=use_cuda,
)

max_episode_length = 50
verbose = False 

environment_kwargs = {
    "level":"1",
    "verbose":verbose,
}

environment = TwoPlayersCoMazeGym(**environment_kwargs)

two_players_environment_loop(
    agent1=agent1,
    agent2=agent2,
    environment=environment,
    max_episode_length=max_episode_length,
)


  0%|          | 0/50 [00:00<?, ?it/s]
  4%|▍         | 2/50 [00:00<00:02, 16.63it/s]
  6%|▌         | 3/50 [00:00<00:03, 12.66it/s]
  8%|▊         | 4/50 [00:00<00:04, 10.62it/s]
 10%|█         | 5/50 [00:00<00:04,  9.76it/s]
 12%|█▏        | 6/50 [00:00<00:04,  9.23it/s]
 14%|█▍        | 7/50 [00:00<00:04,  8.81it/s]
 16%|█▌        | 8/50 [00:00<00:04,  8.57it/s]
 18%|█▊        | 9/50 [00:00<00:04,  8.38it/s]
 20%|██        | 10/50 [00:01<00:04,  8.28it/s]
 22%|██▏       | 11/50 [00:01<00:04,  8.26it/s]
 24%|██▍       | 12/50 [00:01<00:04,  8.29it/s]
 26%|██▌       | 13/50 [00:01<00:04,  8.32it/s]
 28%|██▊       | 14/50 [00:01<00:04,  8.27it/s]
 30%|███       | 15/50 [00:01<00:04,  8.25it/s]
 32%|███▏      | 16/50 [00:01<00:04,  8.24it/s]
 34%|███▍      | 17/50 [00:01<00:03,  8.26it/s]
 36%|███▌      | 18/50 [00:02<00:03,  8.20it/s]
 38%|███▊      | 19/50 [00:02<00:03,  8.20it/s]
 40%|████      | 20/50 [00:02<00:03,  8.20it/s]

Loss -0.1445927619934082 :: EP reward -1
Loss 0.11999654769897461 :: EP reward -1


# Training a dyad of SimpleOnPolicyRLAgent agents:

In [11]:
use_cuda = True 
sparse_reward = False

agent1 = SimpleOnPolicyRLAgent( 
  learning_rate=1e-4,
  discount_factor=0.99,
  num_actions=5,
  pov_shape=[7,7,12],
  use_cuda=use_cuda,
)

agent2 = SimpleOnPolicyRLAgent( 
  learning_rate=1e-4,
  discount_factor=0.99,
  num_actions=5,
  pov_shape=[7,7,12],
  use_cuda=use_cuda,
)

#logging_path = './test_training.log'
#logger = SummaryWriter(logging_path)

max_episode_length = 50
nbr_training_episodes = 1000
verbose = False 

tbar = tqdm(total=nbr_training_episodes, position=0)
for episode in range(nbr_training_episodes):
  tbar.update(1)
  environment_kwargs = {
      "level":"1",
      "sparse_reward":sparse_reward,
      "verbose":verbose,
  }
  environment = TwoPlayersCoMazeGym(**environment_kwargs)

  episode_cum_reward, trajectory = two_players_environment_loop(
      agent1=agent1,
      agent2=agent2,
      environment=environment,
      max_episode_length=max_episode_length,
  )

  #logger.add_scalar("Training/EpisodeCumulativeReward", episode_cum_reward, episode)
  #logger.add_scalar("Training/NbrSteps", len(trajectory), episode)
  #logger.flush()



  0%|          | 0/1000 [00:00<?, ?it/s]
  0%|          | 0/50 [00:00<?, ?it/s]
  4%|▍         | 2/50 [00:00<00:02, 16.94it/s]
  6%|▌         | 3/50 [00:00<00:03, 13.05it/s]
  8%|▊         | 4/50 [00:00<00:04, 11.16it/s]
 10%|█         | 5/50 [00:00<00:04, 10.03it/s]
 12%|█▏        | 6/50 [00:00<00:04,  9.34it/s]
 14%|█▍        | 7/50 [00:00<00:04,  8.97it/s]
 16%|█▌        | 8/50 [00:00<00:04,  8.70it/s]
 18%|█▊        | 9/50 [00:00<00:04,  8.49it/s]
 20%|██        | 10/50 [00:01<00:04,  8.46it/s]
 22%|██▏       | 11/50 [00:01<00:04,  8.49it/s]
 24%|██▍       | 12/50 [00:01<00:04,  8.40it/s]
 26%|██▌       | 13/50 [00:01<00:04,  8.37it/s]
 28%|██▊       | 14/50 [00:01<00:04,  8.35it/s]
 30%|███       | 15/50 [00:01<00:04,  8.38it/s]
 32%|███▏      | 16/50 [00:01<00:04,  8.36it/s]
 34%|███▍      | 17/50 [00:01<00:03,  8.28it/s]
 36%|███▌      | 18/50 [00:02<00:03,  8.32it/s]
 38%|███▊      | 19/50 [00:02<00:03,  8.28it/s]
 40%|██

KeyboardInterrupt: ignored

# 2) Create a communicating On-Policy RL Agent: