# Algorithm Library - Lab 03

In recent years, many RL libraries have been developed. They are designed to provide all the tools needed to implement and test Reinforcement Learning agents.

Even so, they differ considerably, which is why it is important to choose a library that is fast, reliable, and relevant to your RL task. From a technical standpoint, there are a few things to keep in mind when evaluating an RL library.

- **Support for existing machine learning libraries:** Since RL typically uses gradient-based algorithms to learn and fit policy functions, you will want the library to support your favorite framework (TensorFlow, Keras, PyTorch, etc.).
- **Scalability:** RL is computationally intensive, and the option to run in a distributed fashion becomes important when tackling complex environments.
- **Composability:** RL algorithms typically involve simulations and many other components. You will want a library that lets you reuse components of RL algorithms and that is compatible with multiple deep learning frameworks.

[Here](https://docs.google.com/spreadsheets/d/1ZWhViAwCpRqupA5E_xFHSaBaaBZ1wAjO6PvmmEEpXGI/edit#gid=0) you can find a list of some existing libraries.

<img src="https://i1.wp.com/neptune.ai/wp-content/uploads/RL-tools.png?resize=1024%2C372&ssl=1" width=500>


## Ray RLlib

[Ray](https://docs.ray.io/en/latest/) is a distributed execution platform that provides easy-to-use primitives for parallelism and scalability, allowing Python programs to scale anywhere from a notebook to a large cluster. Built on top of Ray, [RLlib](https://docs.ray.io/en/latest/rllib.html) provides a unified API that can be used across a wide range of applications.

<br>

<img src="https://miro.medium.com/max/1838/1*_bomm09XtiZfQ52Kfz9Ciw.png" width=600>


RLlib is designed to support multiple deep learning frameworks (TensorFlow and PyTorch) and is accessed through a simple Python API. It currently ships with a [number of RL algorithms](https://docs.ray.io/en/latest/rllib-algorithms.html#available-algorithms-overview).

In particular, RLlib enables fast development because it makes it easier to build scalable RL algorithms by reusing and assembling existing implementations. RLlib also lets developers use neural networks created with several deep learning frameworks and integrates easily with third-party simulators.


## (Colab start) Setup

You will need to make a copy of this notebook in your Google Drive before editing. You can do this with **File → Save a copy in Drive**.

In [None]:
import os
from google.colab import drive
drive.mount("/content/gdrive")
isColab = True

In [None]:
# Your work is stored in a folder named `minicurso_rl` by default
# so that a Colab instance timeout does not delete your edits

DRIVE_PATH = "/content/gdrive/MyDrive/minicurso_rl/lab03"
DRIVE_PYTHON_PATH = DRIVE_PATH.replace("\\", "")
if not os.path.exists(DRIVE_PYTHON_PATH):
  %mkdir -p $DRIVE_PATH

In [None]:
! wget http://www.atarimania.com/roms/Roms.rar
! mkdir /content/ROM/
! unrar e /content/Roms.rar /content/ROM/ -y
! python -m atari_py.import_roms /content/ROM/ > /dev/null 2>&1

## (Local start) Setup

In [1]:
import os
isColab = False

In [2]:
import copy

# When running locally, files are stored under `./content`;
# a `./ckpt` folder is also created
CONTENT_PATH = "./content"
if not os.path.exists(CONTENT_PATH):
  %mkdir $CONTENT_PATH

CKPT_PATH = "./ckpt"
if not os.path.exists(CKPT_PATH):
  %mkdir $CKPT_PATH

if not isColab:
  DRIVE_PATH = copy.deepcopy(CONTENT_PATH)

In [3]:
! wget http://www.atarimania.com/roms/Roms.rar
! mkdir ./content/ROM/
! mv ./Roms.rar ./content/
! unrar e ./content/Roms.rar ./content/ROM/ -y
! python -m atari_py.import_roms ./content/ROM/ > /dev/null 2>&1

--2021-11-01 23:53:54--  http://www.atarimania.com/roms/Roms.rar
Resolvendo www.atarimania.com (www.atarimania.com)... 195.154.81.199
Conectando-se a www.atarimania.com (www.atarimania.com)|195.154.81.199|:80... conectado.
A requisição HTTP foi enviada, aguardando resposta... 200 OK
Tamanho: 11128004 (11M) [application/x-rar-compressed]
Salvando em: “Roms.rar”


2021-11-01 23:55:35 (109 KB/s) - “Roms.rar” salvo [11128004/11128004]

mkdir: não foi possível criar o diretório “./content/ROM/”: Arquivo existe

UNRAR 5.61 beta 1 freeware      Copyright (c) 1993-2018 Alexander Roshal


Extracting from ./content/Roms.rar

Extracting  ./content/ROM/HC ROMS.zip                                   36  OK 
Extracting  ./content/ROM/ROMS.zip                                      7 99  OK 
All OK


In [2]:

! pip install aiohttp --force-reinstall > /dev/null 2>&1

In [37]:
! pip install --upgrade pip > /dev/null 2>&1
! pip install tensorflow > /dev/null 2>&1
# ! pip install 'ray[default]==1.4.0' > /dev/null 2>&1
! pip install gputil > /dev/null 2>&1

## (Always) Other settings

In [23]:
# the Ray version compatible with the provided agent implementation is 1.4.0
!pip install 'aioredis==1.3.1' --force-reinstall > /dev/null 2>&1 
!pip install 'ray==1.4.0' --force-reinstall > /dev/null 2>&1 
!pip install 'ray[rllib]==1.4.0' --force-reinstall > /dev/null 2>&1 
!pip install 'ray[tune]==1.4.0' --force-reinstall > /dev/null 2>&1 

# !pip install torch > /dev/null 2>&1 
!pip install lz4 --force-reinstall > /dev/null 2>&1 

# Dependencies needed to record the videos
!apt-get install -y xvfb x11-utils > /dev/null 2>&1 
!pip install 'pyvirtualdisplay==0.2.*' --force-reinstall > /dev/null 2>&1 

# Competition environment
!pip install --upgrade ceia-soccer-twos > /dev/null 2>&1

In [3]:
! pip install aioredis --upgrade > /dev/null 2>&1 
! pip install --upgrade ray > /dev/null 2>&1
! pip install --upgrade 'ray[default]' > /dev/null 2>&1
! pip install --upgrade 'ray[rllib]' > /dev/null 2>&1
! pip install --upgrade 'ray[tune]' > /dev/null 2>&1

In [4]:
# Start a virtual display instance
from pyvirtualdisplay import Display
display = Display(visible=False, size=(1400, 900))
_ = display.start()

In [5]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

## (Always) Environment

OpenAI Gym provides a `VideoRecorder` wrapper that can record a video of the environment in MP4 format. Below we interact with the [CartPole](https://gym.openai.com/envs/CartPole-v0/) environment by taking random actions and record the result.

In [6]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder

environment_id = "CartPole-v0"

In [7]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder

env = gym.make(environment_id)
before_training = os.path.join(
    DRIVE_PATH, "{}_before_training.mp4".format(
        environment_id)
)
print(before_training)

video = VideoRecorder(env, before_training)
env.reset()
for i in range(200):
    env.render()
    video.capture_frame()
    observation, reward, done, info = env.step(env.action_space.sample())
    if done:
        # Start a new episode so we never step a finished environment
        env.reset()

video.close()
env.close()


./content/CartPole-v0_before_training.mp4




The code above saved the video file under `DRIVE_PATH` (your Drive when running on Colab). To display it in the notebook, you need a helper function.

In [8]:
from base64 import b64encode
def render_mp4(videopath: str) -> str:
  # Read the video and embed it as a base64 data URI inside an HTML <video> tag
  with open(videopath, 'rb') as f:
    mp4 = f.read()
  base64_encoded_mp4 = b64encode(mp4).decode()
  return f'<video width=400 controls><source src="data:video/mp4;' \
         f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

The code below renders the result. You should get a video similar to the one below.

In [9]:
from IPython.display import HTML
html = render_mp4(before_training)
HTML(html)

## Training a Reinforcement Learning agent

First, let's start Ray in the background. Running `ray.shutdown()` followed by `ray.init()` should get things going.

In [9]:
import ray

ray.shutdown()
ray.init(ignore_reinit_error=True, include_dashboard=False)



{'node_ip_address': '192.168.1.15',
 'raylet_ip_address': '192.168.1.15',
 'redis_address': '192.168.1.15:6379',
 'object_store_address': '/tmp/ray/session_2021-10-31_12-37-29_455920_41060/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-10-31_12-37-29_455920_41060/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-10-31_12-37-29_455920_41060',
 'metrics_export_port': 44508,
 'node_id': 'd73bee7c4cae317a8e8596851125ed68aa1aabec5de752da633d1e44'}

### Basic Python API

At a high level, RLlib provides a Trainer class that holds a policy for interacting with the environment. Through the Trainer interface, the policy can be trained, evaluated, or used to compute an action.

For each algorithm we want to configure the parameters (learning rate, network size, batch size, etc.) according to our application. For this, Ray provides two levels of parameters that we can change. First, there are the parameters common to all algorithms. You can find a list of the available parameters at this [link](https://docs.ray.io/en/latest/rllib-training.html#common-parameters).

Then, for each [algorithm available in Ray](https://docs.ray.io/en/latest/rllib-algorithms.html#available-algorithms-overview), there are algorithm-specific parameters. The image below shows the specific parameters for the [Policy Gradient](https://docs.ray.io/en/latest/rllib-algorithms.html#policy-gradients) algorithm.


<img src='https://drive.google.com/uc?id=1yKJDJViHE_F9JH7NTQMYtQL3KLBJoJyk' width="500" >
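
The two levels of parameters can be pictured as plain dictionaries layered on top of each other. The sketch below uses only the standard library; `COMMON_DEFAULTS`, `PG_DEFAULTS`, and `build_config` are hypothetical stand-ins with illustrative values, not RLlib's actual defaults or API. It also shows why a nested entry such as `model` deserves care: `dict.copy()` is shallow, so nested dictionaries stay shared with the original unless a deep copy is made.

```python
import copy

# Hypothetical stand-ins for the two levels of parameters (illustrative values only)
COMMON_DEFAULTS = {
    "num_workers": 2,
    "framework": "tf",
    "model": {"fcnet_hiddens": [256, 256]},
}
PG_DEFAULTS = {"lr": 0.0004}


def build_config(overrides: dict) -> dict:
    """Layer algorithm-specific defaults and user overrides over the common defaults."""
    config = copy.deepcopy(COMMON_DEFAULTS)  # deep copy: nested dicts are not shared
    config.update(copy.deepcopy(PG_DEFAULTS))
    config.update(overrides)  # top-level keys in overrides replace whole entries
    return config


config = build_config({"framework": "torch", "lr": 1e-3})
print(config["framework"], config["lr"], config["num_workers"])

# A shallow copy shares the nested "model" dict with the original
shallow = COMMON_DEFAULTS.copy()
print(shallow["model"] is COMMON_DEFAULTS["model"])  # True
print(config["model"] is COMMON_DEFAULTS["model"])   # False
```

Because `update` works on top-level keys only, passing a new `model` dictionary replaces the whole nested entry rather than merging it key by key.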


In [None]:
import ray
import ray.rllib.agents.pg as pg
from ray.tune.logger import pretty_print

config = pg.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1
config["lr"] = 0.0004
config["framework"] = "torch"

trainer = pg.PGTrainer(config=config, env=environment_id)
episodes = 1000

for i in range(episodes):
   # Run one training iteration of the policy with Policy Gradient (PG)
   result = trainer.train()
   print(pretty_print(result))

   if i % 100 == 0:
       checkpoint = trainer.save()
       print("checkpoint saved at", checkpoint)

last_checkpoint = trainer.save()

In [11]:
print("Last checkpoint saved at", last_checkpoint)

Last checkpoint saved at /home/bruno/ray_results/PG_CartPole-v0_2021-10-31_12-37-36las2czzw/checkpoint_001000/checkpoint-1000


Now let's create another video, but this time choose the action recommended by the trained model instead of acting randomly.

In [12]:
trainer = pg.PGTrainer(config=config, env=environment_id)
trainer.restore(last_checkpoint)

after_training = os.path.join(
    DRIVE_PATH, "{}after_training_basic_api.mp4".format(environment_id)
)
after_video = VideoRecorder(env, after_training)
observation = env.reset()
done = False
while not done:
  env.render()
  after_video.capture_frame()
  action = trainer.compute_action(observation)
  observation, reward, done, info = env.step(action)
after_video.close()
env.close()
html = render_mp4(after_training)
HTML(html)

2021-10-31 12:41:34,696	INFO trainable.py:377 -- Restored on 192.168.1.15 from checkpoint: /home/bruno/ray_results/PG_CartPole-v0_2021-10-31_12-37-36las2czzw/checkpoint_001000/checkpoint-1000
2021-10-31 12:41:34,697	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 1000, '_timesteps_total': None, '_time_total': 215.6465482711792, '_episodes_total': 1208}


### Using custom environments or models

The Python API provides the flexibility needed to apply RLlib to new problems. You will need this API if you want to use custom environments or models with RLlib. Below is an example of a custom environment and a custom model.

<br>


For more information, see [Advanced Python APIs](https://docs.ray.io/en/latest/rllib-training.html#advanced-python-apis).

In [13]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import os
import random

import torch
import torch.nn as nn

import ray
from ray import tune
from ray.rllib.agents import pg
from ray.rllib.env.env_context import EnvContext
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.tune.logger import pretty_print

In [14]:
class SimpleCorridor(gym.Env):
    """Example of a custom environment in which you have to walk down a
    corridor. The corridor length can be set through the environment
    config."""

    def __init__(self, config: EnvContext):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(
            0.0, self.end_pos, shape=(1, ), dtype=np.float32)
        # Set the seed. It is only used for the final reward.
        self.seed(config.worker_index * config.num_workers)

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        assert action in [0, 1], action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        # Produce a random reward when we reach the goal.
        return [self.cur_pos], \
            random.random() * 2 if done else -0.1, done, {}

    def seed(self, seed=None):
        random.seed(seed)
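
To see the reward structure of this environment without installing Gym or RLlib, the following dependency-free sketch mirrors the `step` logic above in a hypothetical `CorridorSketch` class (not part of RLlib) and rolls out a policy that always moves right. With a corridor of length 5, the episode takes five steps: four penalties of -0.1 followed by a random final reward in [0, 2).

```python
import random


class CorridorSketch:
    """Hypothetical, Gym-free mirror of the SimpleCorridor dynamics."""

    def __init__(self, corridor_length: int, seed: int = 0):
        self.end_pos = corridor_length
        self.cur_pos = 0
        self.rng = random.Random(seed)  # local RNG, used only for the final reward

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action: int):
        assert action in (0, 1), action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1  # move left, but never past the start
        elif action == 1:
            self.cur_pos += 1  # move right, toward the goal
        done = self.cur_pos >= self.end_pos
        reward = self.rng.random() * 2 if done else -0.1
        return [self.cur_pos], reward, done, {}


env_sketch = CorridorSketch(corridor_length=5)
obs = env_sketch.reset()
finished, steps, total_reward = False, 0, 0.0
while not finished:
    obs, step_reward, finished, _ = env_sketch.step(1)  # always move right
    steps += 1
    total_reward += step_reward

print(steps, obs, round(total_reward, 3))
```

The per-step penalty is what pushes an agent to reach the end of the corridor in as few steps as possible.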

In [15]:
class TorchCustomModel(TorchModelV2, nn.Module):
    """Example of a custom PyTorch model that simply delegates to an
    fc-net."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.torch_sub_model = TorchFC(obs_space, action_space, num_outputs,
                                       model_config, name)

    def forward(self, input_dict, state, seq_lens):
        input_dict["obs"] = input_dict["obs"].float()
        fc_out, _ = self.torch_sub_model(input_dict, state, seq_lens)
        return fc_out, []

    def value_function(self):
        return torch.reshape(self.torch_sub_model.value_function(), [-1])

In [16]:
# You can also register the environment creator function explicitly with:
# register_env("corridor", lambda config: SimpleCorridor(config))

# Register the custom model
ModelCatalog.register_custom_model(
    "my_model", TorchCustomModel
)

config = {
    "env": SimpleCorridor,  # or "corridor" if registered
    "env_config": {
        "corridor_length": 5,
    },
    "model": {
        "custom_model": "my_model",
        "vf_share_layers": True,
    },
    "num_workers": 1,  
    "framework": "torch",
}

stop = {
    "training_iteration": 50,
    "timesteps_total": 100000,
    "episode_reward_mean": 0.1,
}

In [None]:
pg_config = pg.DEFAULT_CONFIG.copy()
pg_config.update(config)
pg_config["lr"] = 1e-3

trainer = pg.PGTrainer(config=pg_config, env=SimpleCorridor)
# run the manual training loop and print the results after each iteration
for _ in range(stop["training_iteration"]):
    result = trainer.train()
    print(pretty_print(result))
    
    # stop training once the desired number of timesteps has been reached
    # or the desired reward has been achieved
    if result["timesteps_total"] >= stop["timesteps_total"] or \
            result["episode_reward_mean"] >= stop["episode_reward_mean"]:
        break
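
The stopping check in the loop above can be factored into a small helper. The sketch below is plain Python; `should_stop` is a hypothetical name, not an RLlib or Tune function, and the `early` and `solved` dictionaries are made-up examples rather than real training output. It returns `True` as soon as any metric listed in the stop criteria reaches its threshold.

```python
def should_stop(result: dict, stop: dict) -> bool:
    """Return True once any metric in `stop` has reached its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop.items()
    )


stop_criteria = {
    "training_iteration": 50,
    "timesteps_total": 100000,
    "episode_reward_mean": 0.1,
}

# Made-up results for illustration only
early = {"training_iteration": 3, "timesteps_total": 600, "episode_reward_mean": -0.5}
solved = {"training_iteration": 12, "timesteps_total": 2400, "episode_reward_mean": 0.4}

print(should_stop(early, stop_criteria))
print(should_stop(solved, stop_criteria))
```

A `stop` dictionary passed to Ray Tune follows the same rule: the trial ends when whichever criterion is reached first.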

### Ray Tune

All RLlib Trainers are compatible with the [Ray Tune](https://docs.ray.io/en/master/tune/index.html) API, which makes them easy to use in Tune experiments. For example, the following code runs the same CartPole training with the PG algorithm.

In [None]:
import ray
config = {
    "env": environment_id,
    "framework": "torch",
}
stop = {"episode_reward_mean": 150, "timesteps_total": 100000}

# Run the training
analysis = ray.tune.run(
    "PG",
    config=config,
    stop=stop,
    checkpoint_freq=10,
    checkpoint_at_end=True,
    local_dir=os.path.join(DRIVE_PATH, "results")
)


Although the analysis object returned by `ray.tune.run` above does not hold any Trainer instance, it has all the information needed to rebuild one from a saved checkpoint.

Ray Tune returns an [ExperimentAnalysis](https://docs.ray.io/en/latest/tune/api_docs/analysis.html?highlight=ExperimentAnalysis#experimentanalysis-tune-experimentanalysis) object, from which you can retrieve the best checkpoint of the training run.
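
Conceptually, picking a checkpoint boils down to two `max` operations: first over trials, by the metric of interest, then over that trial's checkpoints. The sketch below illustrates the idea on hand-written data; the `trials` structure, checkpoint names, and numbers are hypothetical and do not reflect how `ExperimentAnalysis` stores results internally.

```python
# Hypothetical summary of two trials; names and metric values are made up
trials = {
    "trial_a": {
        "episode_reward_mean": 120.0,
        "checkpoints": [
            ("ckpt_10", {"training_iteration": 10}),
            ("ckpt_20", {"training_iteration": 20}),
        ],
    },
    "trial_b": {
        "episode_reward_mean": 155.0,
        "checkpoints": [
            ("ckpt_10", {"training_iteration": 10}),
            ("ckpt_13", {"training_iteration": 13}),
        ],
    },
}

# Step 1: best trial by mean episode reward (mode "max")
best_trial = max(trials, key=lambda name: trials[name]["episode_reward_mean"])

# Step 2: within that trial, the checkpoint with the highest training iteration
best_checkpoint, _ = max(
    trials[best_trial]["checkpoints"],
    key=lambda item: item[1]["training_iteration"],
)

print(best_trial, best_checkpoint)
```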

In [19]:
from ray.rllib.agents.pg import PGTrainer

# restore a Trainer
trial = analysis.get_best_logdir("episode_reward_mean", "max")
checkpoint = analysis.get_best_checkpoint(
  trial,
  "training_iteration",
  "max",
)
trainer = PGTrainer(config=config)
trainer.restore(checkpoint)

2021-10-31 12:42:18,231	INFO trainable.py:377 -- Restored on 192.168.1.15 from checkpoint: /home/bruno/Workspace/ceia-rl-curso/LAB_03/content/results/PG/PG_CartPole-v0_0b6f3_00000_0_2021-10-31_12-41-42/checkpoint_000093/checkpoint-93
2021-10-31 12:42:18,232	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 93, '_timesteps_total': None, '_time_total': 28.425389051437378, '_episodes_total': 199}


Now let's create another video, this time choosing the action recommended by the model trained with the Tune API.

In [20]:
after_training = os.path.join(
    DRIVE_PATH, "{}after_training_tune.mp4".format(environment_id)
)
after_video = VideoRecorder(env, after_training)
observation = env.reset()
done = False
while not done:
  env.render()
  after_video.capture_frame()
  action = trainer.compute_action(observation)
  observation, reward, done, info = env.step(action)
after_video.close()
env.close()
# You should get a video similar to the one below. 
html = render_mp4(after_training)
HTML(html)

Tune automatically generates [TensorBoard](https://www.tensorflow.org/tensorboard) files during `tune.run()`. To visualize learning in TensorBoard, run the cell below:

In [None]:
# %tensorboard --logdir /content/gdrive/MyDrive/minicurso_rl/lab03/results/PG
%tensorboard --logdir /content/results/PG

## Hyperparameter Tuning with Ray Tune

[Ray Tune](https://docs.ray.io/en/latest/tune/index.html) is a library for experiment execution and hyperparameter tuning. Let's now try to find hyperparameters that can solve the [CartPole](https://gym.openai.com/envs/CartPole-v1/) environment in the fewest timesteps. Be prepared for this to take a while to run.

In [None]:
parameter_search_config = {
    "env": environment_id,
    "framework": "torch",
    "num_gpus": 1,  # number of GPUs available for training (may be fractional)
    "num_workers": 1,  # number of workers besides the main process; on Colab it must be 1 because there are only 2 CPUs

    # Hyperparameter tuning
    "model": {
      "fcnet_hiddens": ray.tune.grid_search([[32], [64]]),
      "fcnet_activation": ray.tune.grid_search(["linear", "relu"]),
    },
    "lr": ray.tune.uniform(1e-7, 1e-2)
}

# To explicitly stop or restart Ray, use the shutdown API.
ray.shutdown()

ray.init(
  num_cpus=2,
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

parameter_search_analysis = ray.tune.run(
  "PG",
  config=parameter_search_config,
  stop=stop,
  num_samples=5,
  metric="timesteps_total",
  mode="min",
)

In [None]:
print(
  "Best hyperparameters found:",
  parameter_search_analysis.best_config,
)

Setting `num_samples=5` means the search space is sampled five times. Each time, the grid over the two hidden layer sizes and the two activation functions is expanded in full, with the learning rate drawn at random. Therefore, there will be 5 * 2 * 2 = 20 trials, shown with their statuses in the cell output as the computation runs.

Note that Ray shows the current best configuration as it progresses. This includes all the default values that were set, which is a good place to find other parameters that can be tuned.
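
The trial count (5 * 2 * 2 = 20) can be reproduced with a short standard-library sketch. It does not call Ray Tune; `expand_trials` is a hypothetical helper that mimics the idea of repeating the full grid once per sample, and drawing a separate learning rate for every trial is an assumption made for illustration.

```python
import itertools
import random

grid = {
    "fcnet_hiddens": [[32], [64]],
    "fcnet_activation": ["linear", "relu"],
}
num_samples = 5


def expand_trials(grid: dict, num_samples: int, seed: int = 0) -> list:
    """Repeat the full grid `num_samples` times, drawing a learning rate per trial."""
    rng = random.Random(seed)
    keys = list(grid)
    trials = []
    for _ in range(num_samples):
        for values in itertools.product(*(grid[k] for k in keys)):
            trial = dict(zip(keys, values))
            trial["lr"] = rng.uniform(1e-7, 1e-2)  # same range as the search config
            trials.append(trial)
    return trials


trial_configs = expand_trials(grid, num_samples)
print(len(trial_configs))  # 5 * 2 * 2 = 20
```

Adding a third value to either grid dimension would raise the count to 5 * 3 * 2 = 30, which is why grid dimensions are best kept small when `num_samples` is greater than one.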


# Exercise

Now that you know the basic API of Ray Tune and RLlib, **use the `BreakoutNoFrameskip-v4` environment and train agents with the A3C, PPO, and SAC algorithms**. Remember to also use TensorBoard to monitor and compare the learning curves of your runs.

Descriptions of the algorithms and their respective hyperparameters can be found [here](https://docs.ray.io/en/latest/rllib-algorithms.html#available-algorithms-overview).

#### 0. (Re)Imports + env

In [38]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder
from gym.spaces import Discrete, Box

import ray
import ray.rllib.agents.ppo as pg
from ray.tune.logger import pretty_print
from ray import tune
from ray.rllib.env.env_context import EnvContext
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.agents.ppo import PPOTrainer as PGTrainer

import numpy as np
import os
import random

import torch
import torch.nn as nn

from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

In [50]:
environment_id = "BreakoutNoFrameskip-v4"
env = gym.make(environment_id)

# env = make_vec_env(environment_id, wrapper_class=AtariWrapper)
# env = VecFrameStack(env, 4)

action_size = env.action_space.n
observation_size = env.observation_space.shape[0]
print(f"Action size: {action_size}\nObservation size: {observation_size}")

Action size: 4
Observation size: 84


In [69]:
def one_hot_encode(targets: np.ndarray, nb_classes: int):
    """Get the one-hot encoding of integer targets.
    Thanks to: https://stackoverflow.com/a/42874726/5128626

    Args:
        targets (np.ndarray): Array of integer class indices
        nb_classes (int): Number of classes

    Returns:
        np.ndarray: Array of one-hot encoded targets
    """
    res = np.eye(nb_classes)[np.array(targets).reshape(-1)]
    return res.reshape(list(targets.shape)+[nb_classes])

In [63]:
ray.shutdown()
ray.init(ignore_reinit_error=True, include_dashboard=False)

{'node_ip_address': '192.168.1.15',
 'raylet_ip_address': '192.168.1.15',
 'redis_address': '192.168.1.15:6379',
 'object_store_address': '/tmp/ray/session_2021-11-03_13-38-50_486558_108689/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-11-03_13-38-50_486558_108689/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-11-03_13-38-50_486558_108689',
 'metrics_export_port': 64455,
 'node_id': '6e197c1ece8e2392d78e25d693dd4e90fdc8657154f5bfcfdeb17aaa'}

#### 1. Visualize a random agent

In [78]:
# INSERT HERE THE TRAINING CODE FOR BreakoutNoFrameskip-v4
before_training = os.path.join(
    DRIVE_PATH, "{}_before_training.mp4".format(
        environment_id)
)
print(before_training)

video = VideoRecorder(env, before_training)
env.reset()
for i in range(200):
    env.render()
    video.capture_frame()
    # action = one_hot_encode(np.array([env.action_space.sample()]), action_size)
    # A single (non-vectorized) Gym environment expects a scalar action
    observation, reward, done, info = env.step(env.action_space.sample())

video.close()
env.close()

html = render_mp4(before_training)
HTML(html)


./content/BreakoutNoFrameskip-v4_before_training.mp4


#### 2. Train an agent using Ray RLlib

In [79]:
ray.shutdown()
ray.init(ignore_reinit_error=True, include_dashboard=False)

{'node_ip_address': '192.168.1.15',
 'raylet_ip_address': '192.168.1.15',
 'redis_address': '192.168.1.15:6379',
 'object_store_address': '/tmp/ray/session_2021-11-03_13-43-36_992002_108689/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-11-03_13-43-36_992002_108689/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-11-03_13-43-36_992002_108689',
 'metrics_export_port': 62446,
 'node_id': 'b222134e15fb43b4479b6cd06344fbc74ec66baa5934ccb0c97c2542'}

In [80]:
config = pg.DEFAULT_CONFIG.copy()
# config["num_gpus"] = 0
# config["num_workers"] = 1
# config["lr"] = 0.0004
config["framework"] = "torch"


In [81]:
trainer = pg.PPOTrainer(config=config, env=environment_id)
episodes = 1

for i in range(1, episodes+1):
    # Run one training iteration of the policy with PPO
    result = trainer.train()

    if i % 1 == 0:
        checkpoint = trainer.save()
        print(pretty_print(result))
        print("checkpoint saved at", checkpoint)

last_checkpoint = trainer.save()


agent_timesteps_total: 4000
custom_metrics: {}
date: 2021-11-03_13-59-01
done: false
episode_len_mean: 775.6363636363636
episode_media: {}
episode_reward_max: 5.0
episode_reward_mean: 1.3636363636363635
episode_reward_min: 0.0
episodes_this_iter: 22
episodes_total: 22
experiment_id: 126fc15895af4c1fa38bec0a3c4ceb96
hostname: bruno-odyssey-mint
info:
  learner:
    default_policy:
      learner_stats:
        allreduce_latency: 0.0
        cur_kl_coeff: 0.2
        cur_lr: 5.0e-05
        entropy: 0.05571094514673459
        entropy_coeff: 0.0
        kl: 0.08028511056909338
        policy_loss: -0.014752400573343039
        total_loss: 0.2522818208672106
        vf_explained_var: 0.5763480067253113
        vf_loss: 0.2509771999903023
  num_agent_steps_sampled: 4000
  num_steps_sampled: 4000
  num_steps_trained: 4000
iterations_since_restore: 1
node_ip: 192.168.1.15
num_healthy_workers: 1
off_policy_estimator: {}
perf:
  cpu_util_percent: 59.50866900175131
  gpu_util_percent0: 0.1568826

In [82]:
print("Last checkpoint saved at", last_checkpoint)

Last checkpoint saved at /home/bruno/ray_results/PPO_BreakoutNoFrameskip-v4_2021-11-03_13-43-43rse_5b3r/checkpoint_000001/checkpoint-1


In [83]:
trainer = pg.PPOTrainer(config=config, env=environment_id)
trainer.restore(last_checkpoint)

after_training = os.path.join(
    DRIVE_PATH, "{}after_training_basic_api.mp4".format(environment_id)
)
after_video = VideoRecorder(env, after_training)
observation = env.reset()
done = False
while not done:
    env.render()
    after_video.capture_frame()
    action = trainer.compute_action(observation)
    observation, reward, done, info = env.step(action)
after_video.close()
env.close()


2021-11-03 13:59:08,179	INFO trainable.py:377 -- Restored on 192.168.1.15 from checkpoint: /home/bruno/ray_results/PPO_BreakoutNoFrameskip-v4_2021-11-03_13-43-43rse_5b3r/checkpoint_000001/checkpoint-1
2021-11-03 13:59:08,181	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 912.3262348175049, '_episodes_total': 22}


RuntimeError: number of dims don't match in permute

In [None]:
# Visualizar
html = render_mp4(after_training)
HTML(html)

#### 3. Train an agent using a pre-trained model

In [None]:
class TorchCustomModel(TorchModelV2, nn.Module):
    """Example of a custom PyTorch model that simply delegates to an
    fc-net."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.torch_sub_model = TorchFC(obs_space, action_space, num_outputs,
                                       model_config, name)

    def forward(self, input_dict, state, seq_lens):
        input_dict["obs"] = input_dict["obs"].float()
        fc_out, _ = self.torch_sub_model(input_dict, state, seq_lens)
        return fc_out, []

    def value_function(self):
        return torch.reshape(self.torch_sub_model.value_function(), [-1])


In [None]:
# You can also register the environment creator function explicitly with:
# register_env("corridor", lambda config: SimpleCorridor(config))

# Register the custom model
ModelCatalog.register_custom_model(
    "my_model", TorchCustomModel
)

config = {
    "env": environment_id, 
    "env_config": {},
    "model": {
        "custom_model": "my_model",
        "vf_share_layers": True,
    },
    "num_workers": 1,  
    "framework": "torch",
}

stop = {
    "training_iteration": 50,
    "timesteps_total": 100000,
    "episode_reward_mean": 0.1,
}


In [None]:
pg_config = pg.DEFAULT_CONFIG.copy()
pg_config.update(config)
pg_config["lr"] = 1e-3

# `pg` is the PPO module in this section, so use PPOTrainer on the exercise environment
trainer = pg.PPOTrainer(config=pg_config, env=environment_id)
# run the manual training loop and print the results after each iteration
for _ in range(stop["training_iteration"]):
    result = trainer.train()
    print(pretty_print(result))
    
    # stop training once the desired number of timesteps has been reached
    # or the desired reward has been achieved
    if result["timesteps_total"] >= stop["timesteps_total"] or \
            result["episode_reward_mean"] >= stop["episode_reward_mean"]:
        break


#### 4. Ray Tune

In [None]:
config = {
    "env": environment_id,
    "framework": "torch",
}
stop = {"episode_reward_mean": 150, "timesteps_total": 100000}

# Run the training
analysis = ray.tune.run(
    "PG",
    config=config,
    stop=stop,
    checkpoint_freq=10,
    checkpoint_at_end=True,
    local_dir=os.path.join(DRIVE_PATH, "results")
)


In [None]:
# restore a Trainer
trial = analysis.get_best_logdir("episode_reward_mean", "max")
checkpoint = analysis.get_best_checkpoint(
  trial,
  "training_iteration",
  "max",
)
trainer = PGTrainer(config=config)
trainer.restore(checkpoint)


In [None]:
after_training = os.path.join(
    DRIVE_PATH, "{}after_training_tune.mp4".format(environment_id)
)
after_video = VideoRecorder(env, after_training)
observation = env.reset()
done = False
while not done:
  env.render()
  after_video.capture_frame()
  action = trainer.compute_action(observation)
  observation, reward, done, info = env.step(action)
after_video.close()
env.close()
# You should get a video similar to the one below. 
html = render_mp4(after_training)
HTML(html)


In [None]:
# %tensorboard --logdir /content/gdrive/MyDrive/minicurso_rl/lab03/results/PG
%tensorboard --logdir /content/results/PG

#### 5. Hyperparameter Tune

In [None]:
parameter_search_config = {
    "env": environment_id,
    "framework": "torch",
    "num_gpus": 1,  # number of GPUs available for training (may be fractional)
    "num_workers": 1,  # number of workers besides the main process; on Colab it must be 1 because there are only 2 CPUs

    # Hyperparameter tuning
    "model": {
      "fcnet_hiddens": ray.tune.grid_search([[32], [64]]),
      "fcnet_activation": ray.tune.grid_search(["linear", "relu"]),
    },
    "lr": ray.tune.uniform(1e-7, 1e-2)
}

# To explicitly stop or restart Ray, use the shutdown API.
ray.shutdown()

ray.init(
  num_cpus=2,
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

parameter_search_analysis = ray.tune.run(
  "PG",
  config=parameter_search_config,
  stop=stop,
  num_samples=5,
  metric="timesteps_total",
  mode="min",
)

In [None]:
print(
  "Best hyperparameters found:",
  parameter_search_analysis.best_config,
)

# Bonus

As a bonus task, experiment with the algorithms you have learned on the `soccer_twos` environment, which will be used in the final competition of this course*. To make things easier, use the `team_vs_policy` variation as in the previous lab.

<img src="https://raw.githubusercontent.com/bryanoliveira/soccer-twos-env/master/images/screenshot.png" height="400">

> Environment visualization

This environment is a 2v2 car soccer game, where the goal is to score against the opponent as quickly as possible. In the `team_vs_policy` variation, your agent controls one player of the blue team and plays against a random team. More information about the environment can be found [in the repository](https://github.com/bryanoliveira/soccer-twos-env) and [in the Unity ml-agents documentation](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#soccer-twos).


**Your task is to train an agent with the Ray interface presented above, experimenting with different algorithms and hyperparameters.**


<br>

*The variation used in the competition will be `multiagent_player`, but agents trained for `team_vs_policy` can be easily adapted. In the section "Exporting your trained agent", the agent "MyDqnSoccerAgent" does exactly that.

Use the environment instantiated below to run the training algorithm. By the end of the run, your agent's reward per episode should approach +2.

In [10]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder
from gym.spaces import Discrete, Box

import ray
import ray.rllib.agents.ppo as pg  # note: this is the PPO module, despite the "pg" alias
from ray.tune.logger import pretty_print
from ray import tune
from ray.rllib.env.env_context import EnvContext
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.agents.ppo import PPOTrainer

import numpy as np
import os
import random

import torch
import torch.nn as nn

from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

In [11]:
import soccer_twos

# Close the environment if it was opened earlier
try:
    env.close()
except Exception:
    pass

env = soccer_twos.make(
    variation=soccer_twos.EnvType.team_vs_policy,
    flatten_branched=True,  # converts the action_space from MultiDiscrete to Discrete
    single_player=True,  # control one player while the others stand still
    opponent_policy=lambda *_: 0,  # make the opponents stand still
)

environment_id = "soccer-v0"

# Get the state and action sizes
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print("State size: {}, action size: {}".format(state_size, action_size))
env.close()

[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0

INFO:mlagents_envs.environment:Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0

[INFO] Connected new brain: SoccerTwos?team=1

INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=1

[INFO] Connected new brain: SoccerTwos?team=0

INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=0

State size: 336, action size: 27


In [12]:
def create_rllib_env(env_config: dict = {}):
    # support multiple instances of the environment on the same machine
    if hasattr(env_config, "worker_index"):
        env_config["worker_id"] = (
            env_config.worker_index * env_config.get("num_envs_per_worker", 1)
            + env_config.vector_index
        )
    return soccer_twos.make(**env_config)

# register the environment with Ray
tune.registry.register_env(environment_id, create_rllib_env)
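
The `worker_id` arithmetic in `create_rllib_env` can be checked in isolation. The sketch below reproduces the same formula in a standalone helper (`unity_worker_id` is a hypothetical name, not part of RLlib or `soccer_twos`) and lists the ids produced by two hypothetical rollout workers with two environments each. It assumes that remote RLlib workers are numbered from 1 and that each Unity instance needs a distinct `worker_id` so the instances do not collide on the same communication port.

```python
def unity_worker_id(worker_index, vector_index, num_envs_per_worker):
    """Same formula as in create_rllib_env above."""
    return worker_index * num_envs_per_worker + vector_index


NUM_ENVS_PER_WORKER = 2  # illustrative value for this sketch

ids = [
    unity_worker_id(worker, vector, NUM_ENVS_PER_WORKER)
    for worker in range(1, 3)  # two hypothetical rollout workers
    for vector in range(NUM_ENVS_PER_WORKER)  # sub-environments per worker
]

print(ids)  # [2, 3, 4, 5]
```

Each (worker, sub-environment) pair maps to a different id only if the `num_envs_per_worker` value inside `env_config` matches the one RLlib actually uses, which is why the training configuration passes the same value in both places.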

In [13]:
NUM_ENVS_PER_WORKER = 2

Use the configuration below as a starting point for your experiments.

The most important part is the `env_config` key, which configures the environment to be compatible with the agent definition provided for exporting your agent. At this point in the course you should be able to try the other variations of the environment and use the Ray APIs to train an agent close to (or better than) the [ceia_baseline_agent](https://drive.google.com/file/d/1WEjr48D7QG9uVy1tf4GJAZTpimHtINzE/view). Examples of how to use the other variations can be found [here](https://github.com/dlb-rl/rl-tournament-starter/). When using those variations, you must also use different agent definitions to handle the different observation and action spaces (these are also included in the examples).

In [31]:
ray.shutdown()
ray.init(ignore_reinit_error=True, include_dashboard=False)

{'node_ip_address': '192.168.1.15',
 'raylet_ip_address': '192.168.1.15',
 'redis_address': '192.168.1.15:6379',
 'object_store_address': '/tmp/ray/session_2021-11-04_19-36-48_571507_43046/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-11-04_19-36-48_571507_43046/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-11-04_19-36-48_571507_43046',
 'metrics_export_port': 59372,
 'node_id': '24e48176bf33454187ee0e8b8c19f00115aa1c26e961c0219dbca350'}

In [None]:
analysis = tune.run(
    "PPO",
    config={
        # system settings
        "num_gpus": 1,
        "num_workers": 1,
        "num_envs_per_worker": NUM_ENVS_PER_WORKER,
        "log_level": "INFO",
        "framework": "torch",
        # RL setup
        "env": environment_id,
        "env_config": {
            "num_envs_per_worker": NUM_ENVS_PER_WORKER,
            "variation": soccer_twos.EnvType.team_vs_policy,
            "single_player": True,
            "flatten_branched": True,
        },
    },
    stop={
        # 10000000 (10M) steps may be needed to learn a useful policy
        "timesteps_total": 10000000,
        # you can also limit by wall-clock time, according to the Colab time limit
        # "time_total_s": 14400, # 4h
        "time_total_s": 3600, # 1h
    },
    checkpoint_freq=100,
    checkpoint_at_end=True,
    local_dir=os.path.join(DRIVE_PATH, "results")
)
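
The `stop` dictionary above combines two criteria. As a rough illustration of how such a dictionary is interpreted, the sketch below stops as soon as any listed metric reaches its threshold. This mirrors Tune's documented "whichever is reached first" rule, but `should_stop` is a hypothetical helper written for this example, and the two result dictionaries are made-up values, not output from a real run.

```python
def should_stop(result, stop):
    """Return True once any metric named in `stop` reaches its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


# Same criteria as in the tune.run call above.
stop = {"timesteps_total": 10_000_000, "time_total_s": 3600}

# Hypothetical training results, for illustration only.
early = {"timesteps_total": 4_000, "time_total_s": 31.7}
timed_out = {"timesteps_total": 450_000, "time_total_s": 3605.0}

print(should_stop(early, stop))      # False
print(should_stop(timed_out, stop))  # True
```

In practice this means the one-hour limit will usually end the run long before 10M steps are collected, so raise `time_total_s` if you want the step budget to be the binding constraint.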

### Work with pre-trained model

In [15]:
class TorchCustomModel(TorchModelV2, nn.Module):
    """Example of a custom PyTorch model that simply delegates to a
    fully connected network (fc-net)."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.torch_sub_model = TorchFC(obs_space, action_space, num_outputs,
                                       model_config, name)

    def forward(self, input_dict, state, seq_lens):
        input_dict["obs"] = input_dict["obs"].float()
        fc_out, _ = self.torch_sub_model(input_dict, state, seq_lens)
        return fc_out, []

    def value_function(self):
        return torch.reshape(self.torch_sub_model.value_function(), [-1])

In [16]:
# You can also register the environment creation function explicitly with:
# register_env("corridor", lambda config: SimpleCorridor(config))

# Register the custom model
ModelCatalog.register_custom_model(
    "my_model", TorchCustomModel
)

config = {
    "env": environment_id,
    "env_config": {
        "num_envs_per_worker": NUM_ENVS_PER_WORKER,
        "variation": soccer_twos.EnvType.team_vs_policy,
        "single_player": True,
        "flatten_branched": True,
    },
    "model": {
        "custom_model": "my_model",
        "vf_share_layers": True,
    },
    "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    "num_workers": 1,
    "num_envs_per_worker": NUM_ENVS_PER_WORKER,
    "log_level": "INFO",
    "framework": "torch",
}

stop = {
    "training_iteration": 50,
    "timesteps_total": 100000,
    "episode_reward_mean": 1.6,
}


In [17]:
ppo_config = pg.DEFAULT_CONFIG.copy()
ppo_config.update(config)
ppo_config["lr"] = 1e-3

In [18]:
ray.shutdown()
ray.init(ignore_reinit_error=True, include_dashboard=False)

{'node_ip_address': '192.168.1.15',
 'raylet_ip_address': '192.168.1.15',
 'redis_address': '192.168.1.15:6379',
 'object_store_address': '/tmp/ray/session_2021-11-04_19-51-46_133507_90833/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-11-04_19-51-46_133507_90833/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-11-04_19-51-46_133507_90833',
 'metrics_export_port': 53813,
 'node_id': '593e4759afada773650da53b9f24ff116d39478f1ac5c60c947ef272'}

In [19]:
analysis = tune.run(
    "PPO",
    config=ppo_config,
    # stop={
    #     # 10000000 (10M) steps may be needed to learn a useful policy
    #     "timesteps_total": 10000000,
    #     # you can also limit by wall-clock time, according to the Colab time limit
    #     # "time_total_s": 14400, # 4h
    # },
    stop=stop,
    checkpoint_freq=100,
    checkpoint_at_end=True,
    local_dir=os.path.join(DRIVE_PATH, "results")
)

Trial name,status,loc
PPO_soccer-v0_d165f_00000,PENDING,


[2m[36m(pid=91273)[0m 2021-11-04 19:52:03,138	INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.


[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected new brain: SoccerTwos?team=1
[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected new brain: SoccerTwos?team=0


[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=1
[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=0
[2m[36m(pid=91272)[0m 2021-11-04 19:52:06,915	INFO rollout_worker.py:1542 -- Validating sub-env at vector index=0 ... (ok)
[2m[36m(pid=91272)[0m 2021-11-04 19:52:06,948	INFO ppo_tf_policy.py:329 -- `vf_share_layers=True` in your model. Therefore, remember to tune the value of `vf_loss_coeff`!
[2m[36m(pid=91272)[0m 2021-11-04 19:52:06,949	INFO catalog.py:406 -- Wrapping <class '__main__.TorchCustomModel'> as None
[2m[36m(pid=91272)[0m 2021-11-04 19:52:06,954	INFO torch_policy.py:147 -- TorchPolicy (worker=1) running on CPU.


[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected new brain: SoccerTwos?team=1
[2m[36m(RolloutWorker pid=91272)[0m [INFO] Connected new brain: SoccerTwos?team=0


[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=1
[2m[36m(pid=91272)[0m INFO:mlagents_envs.environment:Connected new brain: SoccerTwos?team=0


Trial name,status,loc
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273


[2m[36m(pid=91273)[0m 2021-11-04 19:52:07,712	INFO worker_set.py:104 -- Inferred observation/action spaces from remote worker (local worker has no env): {'default_policy': (Box([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
[2m[36m(pid=91273)[0m  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

[2m[36m(pid=91272)[0m 2021-11-04 19:52:10,246	INFO simple_list_collector.py:780 -- Trajectory fragment after postprocess_trajectory():
[2m[36m(pid=91272)[0m 
[2m[36m(pid=91272)[0m { 'agent0': { 'action_dist_inputs': np.ndarray((200, 27), dtype=float32, min=-0.011, max=0.009, mean=-0.001),
[2m[36m(pid=91272)[0m               'action_logp': np.ndarray((200,), dtype=float32, min=-3.304, max=-3.29, mean=-3.296),
[2m[36m(pid=91272)[0m               'actions': np.ndarray((200,), dtype=int64, min=0.0, max=26.0, mean=12.685),
[2m[36m(pid=91272)[0m               'advantages': np.ndarray((200,), dtype=float32, min=-0.007, max=0.004, mean=-0.0),
[2m[36m(pid=91272)[0m               'agent_index': np.ndarray((200,), dtype=int64, min=0.0, max=0.0, mean=0.0),
[2m[36m(pid=91272)[0m               'dones': np.ndarray((200,), dtype=float32, min=0.0, max=0.0, mean=0.0),
[2m[36m(pid=91272)[0m               'eps_id': np.ndarray((200,), dtype=int64, min=306262447.0, max=306262447.0

Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-11-04_19-52-39
  done: false
  episode_len_mean: 784.8
  episode_media: {}
  episode_reward_max: 0.30399999022483826
  episode_reward_mean: -0.33920000195503236
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 5
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.0010000000000000005
          entropy: 3.2774983739340176
          entropy_coeff: 0.0
          kl: 0.018714393203324985
          policy_loss: -0.05116372086027617
          total_loss: -0.021187576755721083
          vf_explained_var: -0.9094807331920952
          vf_loss: 0.026233268562138805
        model: {}
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_ste

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,1,31.7223,4000,-0.3392,0.304,-2,784.8


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-11-04_19-53-10
  done: false
  episode_len_mean: 694.1
  episode_media: {}
  episode_reward_max: 0.30399999022483826
  episode_reward_mean: -0.7696000009775161
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 10
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.0010000000000000005
          entropy: 3.2168722614165275
          entropy_coeff: 0.0
          kl: 0.05762385958112435
          policy_loss: -0.10852844776956225
          total_loss: 0.0013040197094381658
          vf_explained_var: -0.7115522699330443
          vf_loss: 0.09830769959772105
        model: {}
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,2,62.7284,8000,-0.7696,0.304,-2,694.1


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-11-04_19-53-40
  done: false
  episode_len_mean: 750.1333333333333
  episode_media: {}
  episode_reward_max: 0.30399999022483826
  episode_reward_mean: -0.7797333339850108
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 15
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.3
          cur_lr: 0.0010000000000000005
          entropy: 3.201761543878945
          entropy_coeff: 0.0
          kl: 0.03860649660941189
          policy_loss: -0.10373870325074482
          total_loss: -0.05636127265570785
          vf_explained_var: -0.681338115789557
          vf_loss: 0.035795482641251215
        model: {}
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sa

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,3,92.8393,12000,-0.779733,0.304,-2,750.133


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-11-04_19-54-11
  done: false
  episode_len_mean: 769.7894736842105
  episode_media: {}
  episode_reward_max: 0.30399999022483826
  episode_reward_mean: -0.7208421057776401
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 19
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.4500000000000001
          cur_lr: 0.0010000000000000005
          entropy: 3.1822870892863118
          entropy_coeff: 0.0
          kl: 0.035887355404406594
          policy_loss: -0.11216268539128284
          total_loss: -0.04845041368677411
          vf_explained_var: -0.7353103531304226
          vf_loss: 0.04756296285695987
        model: {}
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,4,123.705,16000,-0.720842,0.304,-2,769.789


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-11-04_19-54-41
  done: false
  episode_len_mean: 797.5416666666666
  episode_media: {}
  episode_reward_max: 0.43320000171661377
  episode_reward_mean: -0.6359500003357729
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 24
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.6750000000000002
          cur_lr: 0.0010000000000000005
          entropy: 3.1546545159432196
          entropy_coeff: 0.0
          kl: 0.028506481919408664
          policy_loss: -0.12335991881467322
          total_loss: -0.07209047258541149
          vf_explained_var: -0.7713049576487593
          vf_loss: 0.03202757083833398
        model: {}
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,5,153.641,20000,-0.63595,0.4332,-2,797.542


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-11-04_19-55-11
  done: false
  episode_len_mean: 822.8571428571429
  episode_media: {}
  episode_reward_max: 0.43320000171661377
  episode_reward_mean: -0.6165285717163768
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 28
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0124999999999997
          cur_lr: 0.0010000000000000005
          entropy: 3.1217516394071683
          entropy_coeff: 0.0
          kl: 0.0261567070883239
          policy_loss: -0.12266760571529307
          total_loss: -0.06508054672439974
          vf_explained_var: -0.8705085649926175
          vf_loss: 0.03110339180265944
        model: {}
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,6,183.48,24000,-0.616529,0.4332,-2,822.857


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-11-04_19-55-41
  done: false
  episode_len_mean: 826.9375
  episode_media: {}
  episode_reward_max: 0.43320000171661377
  episode_reward_mean: -0.6019625002518296
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 32
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.5187500000000005
          cur_lr: 0.0010000000000000005
          entropy: 3.1153057952081005
          entropy_coeff: 0.0
          kl: 0.01715684425956573
          policy_loss: -0.12316144288727833
          total_loss: -0.07104899450530729
          vf_explained_var: -0.782491187639134
          vf_loss: 0.026055490953337042
        model: {}
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_s

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,7,213.656,28000,-0.601963,0.4332,-2,826.938


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-11-04_19-56-13
  done: false
  episode_len_mean: 793.9487179487179
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.6001230761026725
  episode_reward_min: -2.0
  episodes_this_iter: 7
  episodes_total: 39
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.5187500000000005
          cur_lr: 0.0010000000000000005
          entropy: 3.1391689646628596
          entropy_coeff: 0.0
          kl: 0.019319030750854203
          policy_loss: -0.12313607611082575
          total_loss: -0.006304808028583084
          vf_explained_var: -0.659709003022922
          vf_loss: 0.08749049142483742
        model: {}
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,8,245.371,32000,-0.600123,1.858,-2,793.949


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-11-04_19-56-45
  done: false
  episode_len_mean: 797.75
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.49269999970089307
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 44
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.5187500000000005
          cur_lr: 0.0010000000000000005
          entropy: 3.108403450699263
          entropy_coeff: 0.0
          kl: 0.023026248262140264
          policy_loss: -0.15116457039470313
          total_loss: -0.09563861672355924
          vf_explained_var: -0.629818590482076
          vf_loss: 0.020554841520334845
        model: {}
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_ste

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,9,277.808,36000,-0.4927,1.858,-2,797.75


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-11-04_19-57-18
  done: false
  episode_len_mean: 813.3061224489796
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.4832408160579448
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 49
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.1169783156405213
          entropy_coeff: 0.0
          kl: 0.016927433865149572
          policy_loss: -0.13524794683000574
          total_loss: -0.07564958386143208
          vf_explained_var: -0.7834098528149307
          vf_loss: 0.02103555159397944
        model: {}
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,10,310.067,40000,-0.483241,1.858,-2,813.306


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 44000
  custom_metrics: {}
  date: 2021-11-04_19-57-48
  done: false
  episode_len_mean: 811.3518518518518
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.5125703701266536
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 54
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.114772683061579
          entropy_coeff: 0.0
          kl: 0.014068245691480338
          policy_loss: -0.12721956253317135
          total_loss: -0.04805133506525508
          vf_explained_var: -0.7830765913250626
          vf_loss: 0.04711900494156546
        model: {}
    num_agent_steps_sampled: 44000
    num_agent_steps_trained: 44000
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,11,340.662,44000,-0.51257,1.858,-2,811.352


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 48000
  custom_metrics: {}
  date: 2021-11-04_19-58-21
  done: false
  episode_len_mean: 816.9827586206897
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.5117034480489534
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 58
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0765289163076748
          entropy_coeff: 0.0
          kl: 0.017747529377845562
          policy_loss: -0.15222892789670858
          total_loss: -0.09371733775272244
          vf_explained_var: -0.8482128503502057
          vf_loss: 0.01808049808486655
        model: {}
    num_agent_steps_sampled: 48000
    num_agent_steps_trained: 48000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,12,372.9,48000,-0.511703,1.858,-2,816.983


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 52000
  custom_metrics: {}
  date: 2021-11-04_19-58-53
  done: false
  episode_len_mean: 819.9516129032259
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.4609935475933936
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 62
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.075351850191752
          entropy_coeff: 0.0
          kl: 0.017600964302178796
          policy_loss: -0.16225698313986262
          total_loss: -0.10602313344471997
          vf_explained_var: -0.7767078093944058
          vf_loss: 0.016136652454032854
        model: {}
    num_agent_steps_sampled: 52000
    num_agent_steps_trained: 52000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,13,405.082,52000,-0.460994,1.858,-2,819.952


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-11-04_19-59-24
  done: false
  episode_len_mean: 828.9242424242424
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.46335757501197583
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 66
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0678212263250866
          entropy_coeff: 0.0
          kl: 0.012188194604048288
          policy_loss: -0.11581730937875646
          total_loss: -0.06155114271027106
          vf_explained_var: -0.8852664608468291
          vf_loss: 0.02649993704611896
        model: {}
    num_agent_steps_sampled: 56000
    num_agent_steps_trained: 56000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,14,435.95,56000,-0.463358,1.858,-2,828.924


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 60000
  custom_metrics: {}
  date: 2021-11-04_19-59-54
  done: false
  episode_len_mean: 836.4929577464789
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.42176901370706693
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 71
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0633584883905227
          entropy_coeff: 0.0
          kl: 0.016071576113264464
          policy_loss: -0.18503633062855931
          total_loss: -0.13381682954458218
          vf_explained_var: -0.7012746374453268
          vf_loss: 0.014606442472957556
        model: {}
    num_agent_steps_sampled: 60000
    num_agent_steps_trained: 6000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,15,466.614,60000,-0.421769,1.858,-2,836.493


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-11-04_20-00-25
  done: false
  episode_len_mean: 832.1973684210526
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.4142052629276326
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 76
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0648412025103005
          entropy_coeff: 0.0
          kl: 0.013092950225887995
          policy_loss: -0.13762780876269423
          total_loss: -0.08681083688943056
          vf_explained_var: -0.7920830037004204
          vf_loss: 0.020989593124437718
        model: {}
    num_agent_steps_sampled: 64000
    num_agent_steps_trained: 64000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,16,497.046,64000,-0.414205,1.858,-2,832.197


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 68000
  custom_metrics: {}
  date: 2021-11-04_20-00-56
  done: false
  episode_len_mean: 831.95
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.3762249995023012
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 80
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0656464384448143
          entropy_coeff: 0.0
          kl: 0.015779842939178383
          policy_loss: -0.17899143270927892
          total_loss: -0.12633429434130428
          vf_explained_var: -0.7756437681054557
          vf_loss: 0.0167086865259735
        model: {}
    num_agent_steps_sampled: 68000
    num_agent_steps_trained: 68000
    num_step

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,17,528.387,68000,-0.376225,1.858,-2,831.95


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-11-04_20-01-28
  done: false
  episode_len_mean: 835.5294117647059
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.37762352894334233
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 85
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.060369025507281
          entropy_coeff: 0.0
          kl: 0.013036957895225671
          policy_loss: -0.13254496049466394
          total_loss: -0.07341062287769971
          vf_explained_var: -0.9259326410549943
          vf_loss: 0.029434516614613434
        model: {}
    num_agent_steps_sampled: 72000
    num_agent_steps_trained: 72000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,18,559.722,72000,-0.377624,1.858,-2,835.529


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 76000
  custom_metrics: {}
  date: 2021-11-04_20-01-59
  done: false
  episode_len_mean: 842.9213483146067
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.36065168494588873
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 89
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.015773751658778
          entropy_coeff: 0.0
          kl: 0.018417100450185006
          policy_loss: -0.1802360414537371
          total_loss: -0.13258131825094743
          vf_explained_var: -0.8650187502625168
          vf_loss: 0.005698266095401699
        model: {}
    num_agent_steps_sampled: 76000
    num_agent_steps_trained: 76000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,19,591.259,76000,-0.360652,1.858,-2,842.921


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-11-04_20-02-31
  done: false
  episode_len_mean: 849.6774193548387
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.34513978451810856
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 93
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.041025988260905
          entropy_coeff: 0.0
          kl: 0.016313500679905734
          policy_loss: -0.2021859095479432
          total_loss: -0.162944582692017
          vf_explained_var: -0.8403149522760863
          vf_loss: 0.002077131496550846
        model: {}
    num_agent_steps_sampled: 80000
    num_agent_steps_trained: 80000
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,20,622.597,80000,-0.34514,1.858,-2,849.677


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 84000
  custom_metrics: {}
  date: 2021-11-04_20-03-01
  done: false
  episode_len_mean: 855.8762886597938
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.3309072160843721
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 97
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0378617540482553
          entropy_coeff: 0.0
          kl: 0.018289913742916147
          policy_loss: -0.1966089302083097
          total_loss: -0.15395487193018198
          vf_explained_var: -0.9745406589200419
          vf_loss: 0.0009873464515636756
        model: {}
    num_agent_steps_sampled: 84000
    num_agent_steps_trained: 84000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,21,653.14,84000,-0.330907,1.858,-2,855.876


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 88000
  custom_metrics: {}
  date: 2021-11-04_20-03-32
  done: false
  episode_len_mean: 859.97
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.340979999601841
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 101
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0368969525060345
          entropy_coeff: 0.0
          kl: 0.010919290440978034
          policy_loss: -0.07100443663175708
          total_loss: 0.004104993541684923
          vf_explained_var: -0.9573244269817106
          vf_loss: 0.05023391763944121
        model: {}
    num_agent_steps_sampled: 88000
    num_agent_steps_trained: 88000
    num_ste

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,22,683.763,88000,-0.34098,1.858,-2,859.97


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 92000
  custom_metrics: {}
  date: 2021-11-04_20-04-03
  done: false
  episode_len_mean: 869.9
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.34401999950408935
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 105
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0336523056030273
          entropy_coeff: 0.0
          kl: 0.015891534795264443
          policy_loss: -0.1586960472607164
          total_loss: -0.11232031091486895
          vf_explained_var: -0.7506126476872352
          vf_loss: 0.010172833534838851
        model: {}
    num_agent_steps_sampled: 92000
    num_agent_steps_trained: 92000
    num_st

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,23,714.774,92000,-0.34402,1.858,-2,869.9


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 96000
  custom_metrics: {}
  date: 2021-11-04_20-04-35
  done: false
  episode_len_mean: 885.97
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.30401999950408937
  episode_reward_min: -2.0
  episodes_this_iter: 4
  episodes_total: 109
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0372663272324427
          entropy_coeff: 0.0
          kl: 0.01642084702432833
          policy_loss: -0.19351711486596415
          total_loss: -0.15413850901388032
          vf_explained_var: -0.7804634028224535
          vf_loss: 0.001969862268594224
        model: {}
    num_agent_steps_sampled: 96000
    num_agent_steps_trained: 96000
    num_s

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,RUNNING,192.168.1.15:91273,24,746.657,96000,-0.30402,1.858,-2,885.97


Result for PPO_soccer-v0_d165f_00000:
  agent_timesteps_total: 100000
  custom_metrics: {}
  date: 2021-11-04_20-05-07
  done: true
  episode_len_mean: 891.02
  episode_media: {}
  episode_reward_max: 1.8580000400543213
  episode_reward_mean: -0.26401999950408933
  episode_reward_min: -2.0
  episodes_this_iter: 5
  episodes_total: 114
  experiment_id: 513da40517ea4401b5c456a1b4a29a8a
  hostname: bruno-odyssey-mint
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.2781249999999993
          cur_lr: 0.0010000000000000005
          entropy: 3.0333656792999597
          entropy_coeff: 0.0
          kl: 0.0078036258533763095
          policy_loss: -0.07273034073727866
          total_loss: -0.032057337896267496
          vf_explained_var: -0.9514517092576591
          vf_loss: 0.02289536803308624
        model: {}
    num_agent_steps_sampled: 100000
    num_agent_steps_trained: 100000
    n

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_soccer-v0_d165f_00000,TERMINATED,192.168.1.15:91273,25,778.529,100000,-0.26402,1.858,-2,891.02


[2m[36m(pid=91272)[0m [2021-11-04 20:05:08,388 E 91272 91442] raylet_client.cc:159: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
2021-11-04 20:05:08,489	INFO tune.py:630 -- Total run time: 789.24 seconds (787.88 seconds for the tuning loop).


In [None]:
# pg_config = pg.DEFAULT_CONFIG.copy()
# pg_config.update(config)
# pg_config["lr"] = 1e-3

# trainer = pg.PPOTrainer(config=pg_config, env=environment_id)
# # run the manual training loop and print the results after each iteration
# for _ in range(stop["training_iteration"]):
#     result = trainer.train()
#     print(pretty_print(result))
    
#     # stop training once the desired number of steps has been reached
#     # or once the desired reward has been achieved
#     if result["timesteps_total"] >= stop["timesteps_total"] or \
#             result["episode_reward_mean"] >= stop["episode_reward_mean"]:
#         break
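
The stopping rule in the commented loop above can be read in isolation. The sketch below factors it into a helper and runs it on plain dictionaries instead of a real trainer. The function name `should_stop` and all the numeric values are hypothetical and chosen only for illustration; the dictionary keys are the ones the loop already uses.

```python
def should_stop(result: dict, stop: dict) -> bool:
    """Return True once the timestep budget or the target reward is reached."""
    return (
        result["timesteps_total"] >= stop["timesteps_total"]
        or result["episode_reward_mean"] >= stop["episode_reward_mean"]
    )


# Hypothetical stopping criteria, for illustration only.
stop = {"training_iteration": 25, "timesteps_total": 100_000, "episode_reward_mean": 1.0}

# Illustrative result dictionaries containing only the fields the rule reads.
early = {"timesteps_total": 92_000, "episode_reward_mean": -0.34}
final = {"timesteps_total": 100_000, "episode_reward_mean": -0.26}

print(should_stop(early, stop))
print(should_stop(final, stop))
```

Because the two conditions are joined with `or`, training ends as soon as either limit is hit, even if the reward target was never reached.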


## Exporting your trained agent

As in Lab 02, you can export your trained agent to run as a competitor in the competition environment, or simply to watch it play. To do so, we need to define an agent class that implements the interface and converts observations and actions to the competition format. Below, we choose which experiment and checkpoint to export and store the implementation in a variable so it can be saved to a file later.

In [20]:
ALGORITHM = "PPO"
TRIAL = analysis.get_best_logdir("episode_reward_mean", "max")
CHECKPOINT = analysis.get_best_checkpoint(
  TRIAL,
  "training_iteration",
  "max",
)
TRIAL, CHECKPOINT

('/home/bruno/Workspace/ceia-rl-curso/LAB_03/content/results/PPO/PPO_soccer-v0_d165f_00000_0_2021-11-04_19-51-59',
 '/home/bruno/Workspace/ceia-rl-curso/LAB_03/content/results/PPO/PPO_soccer-v0_d165f_00000_0_2021-11-04_19-51-59/checkpoint_000025/checkpoint-25')
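
The agent template in the next cell stores the checkpoint location relative to the `LAB_03` folder, so that the exported package does not depend on the absolute path of the training machine. A minimal sketch of that path handling, using the checkpoint path printed above. The `agent_dir` value is a hypothetical install location, not something defined in this lab.

```python
import os

# Checkpoint path as printed by the cell above.
CHECKPOINT = (
    "/home/bruno/Workspace/ceia-rl-curso/LAB_03/content/results/PPO/"
    "PPO_soccer-v0_d165f_00000_0_2021-11-04_19-51-59/checkpoint_000025/checkpoint-25"
)

# Keep only the part after "LAB_03/", as the agent template does.
relative_checkpoint = CHECKPOINT.split("LAB_03/")[1]

# Hypothetical folder where the exported agent package ends up.
agent_dir = "/opt/agents/my_ray_soccer_agent"
resolved = os.path.join(agent_dir, relative_checkpoint)

print(relative_checkpoint)
print(resolved)
```

Note that this approach assumes the training results live under a directory named `LAB_03`; with a different folder name the `split` call would raise an `IndexError`.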

In [21]:
agent_file = f"""
import pickle
import os

import gym
from gym_unity.envs import ActionFlattener
import ray
from ray import tune
from ray.tune.registry import get_trainable_cls

from soccer_twos import AgentInterface, DummyEnv


ALGORITHM = "{ALGORITHM}"
CHECKPOINT_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)), 
    "{CHECKPOINT.split("LAB_03/")[1]}"
)


class MyRaySoccerAgent(AgentInterface):
    def __init__(self, env: gym.Env):
        super().__init__()
        ray.init(ignore_reinit_error=True)

        self.flattener = ActionFlattener(env.action_space.nvec)

        # Load configuration from checkpoint file.
        config_path = ""
        if CHECKPOINT_PATH:
            config_dir = os.path.dirname(CHECKPOINT_PATH)
            config_path = os.path.join(config_dir, "params.pkl")
            # Try parent directory.
            if not os.path.exists(config_path):
                config_path = os.path.join(config_dir, "../params.pkl")

        # Load the config from pickled.
        if os.path.exists(config_path):
            with open(config_path, "rb") as f:
                config = pickle.load(f)
        else:
            # If no config in given checkpoint -> Error.
            raise ValueError(
                "Could not find params.pkl in either the checkpoint dir or "
                "its parent directory!"
            )

        # no need for parallelism on evaluation
        config["num_workers"] = 0
        config["num_gpus"] = 0

        # create a dummy env since it's required but we only care about the policy
        obs_space = env.observation_space
        act_space = self.flattener.action_space
        tune.registry.register_env(
            "DummyEnv",
            lambda *_: DummyEnv(obs_space, act_space),
        )
        config["env"] = "DummyEnv"

        # create the Trainer from config
        cls = get_trainable_cls(ALGORITHM)
        agent = cls(env=config["env"], config=config)
        # load state from checkpoint
        agent.restore(CHECKPOINT_PATH)
        # get default policy for evaluation
        self.policy = agent.get_policy()

    def act(self, observation):
        actions = {{}}
        for player_id in observation:
            # compute_single_action returns a tuple of (action, action_info, ...)
            # as we only need the action, we discard the other elements
            actions[player_id] = self.flattener.lookup_action(
                self.policy.compute_single_action(observation[player_id])[0]
            )
        return actions
"""
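
In `act`, the policy outputs a single integer and `lookup_action` turns it back into one value per action branch. The sketch below is a stand-in for that idea, not the `gym_unity` implementation: it assumes the flattener simply enumerates every combination of branch values. The helper name `build_action_lookup` and the branch sizes `[3, 3, 3]` are assumptions made for illustration.

```python
import itertools


def build_action_lookup(nvec):
    """Map each flat action index to one combination of branch values."""
    branches = [range(n) for n in nvec]
    return {
        index: list(combo)
        for index, combo in enumerate(itertools.product(*branches))
    }


# Assumption for illustration: three branches with three options each.
lookup = build_action_lookup([3, 3, 3])

print(len(lookup))  # number of flat actions
print(lookup[5])    # branch values for flat action 5
```

With this layout the policy can be trained on a plain discrete action space, while the environment still receives one value per branch.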

In [22]:
import os
import shutil

agent_name = "my_ray_soccer_agent"
# on Colab the agent folder is nested one level deeper, so the zip keeps the package folder
agent_path = (
    os.path.join(DRIVE_PATH, agent_name, agent_name)
    if isColab
    else os.path.join(DRIVE_PATH, agent_name)
)

# start from an empty folder, discarding any previous export
os.makedirs(agent_path, exist_ok=True)
shutil.rmtree(agent_path)
os.makedirs(agent_path)

# save the agent class
with open(os.path.join(agent_path, "agent.py"), "w") as f:
    f.write(agent_file)

# save an __init__ file to create the Python module
with open(os.path.join(agent_path, "__init__.py"), "w") as f:
    f.write("from .agent import MyRaySoccerAgent")

# copy the whole trial, including the experiment configuration files
shutil.copytree(TRIAL, os.path.join(agent_path, TRIAL.split("LAB_03/")[1]))

# package everything into a .zip file
if isColab:
    shutil.make_archive(os.path.join(DRIVE_PATH, agent_name),
                        "zip", os.path.join(DRIVE_PATH, agent_name))


After all the files needed to run your agent have been packaged, a `minicurso_rl/lab03/my_ray_soccer_agent.zip` file is created in the Colab files and in the corresponding Google Drive folder. Download the file and extract it to a folder on your computer.
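
To see what the extracted archive should look like, the following sketch reproduces the Colab packaging step in a temporary directory and unpacks the result. It uses the same `shutil.make_archive` call shape as the export cell. The `drive_path` folder is a stand-in for `DRIVE_PATH`, and the placeholder `__init__.py` replaces the real agent files.

```python
import os
import shutil
import tempfile

agent_name = "my_ray_soccer_agent"

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for DRIVE_PATH; the real value is defined earlier in the notebook.
    drive_path = os.path.join(tmp, "drive")
    agent_path = os.path.join(drive_path, agent_name, agent_name)
    os.makedirs(agent_path)
    with open(os.path.join(agent_path, "__init__.py"), "w") as f:
        f.write("# placeholder module\n")

    # Same call shape as the export cell: archive the outer agent folder.
    zip_file = shutil.make_archive(
        os.path.join(drive_path, agent_name),
        "zip",
        os.path.join(drive_path, agent_name),
    )

    # Extract somewhere else, as you would on your own machine.
    target = os.path.join(tmp, "extracted")
    shutil.unpack_archive(zip_file, target)
    extracted_files = sorted(
        os.path.relpath(os.path.join(root, name), target)
        for root, _, names in os.walk(target)
        for name in names
    )

print(extracted_files)
```

The archive root contains a single `my_ray_soccer_agent` folder, which is the Python package that the `-m` option of the watch command refers to.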

Assuming your Python environment is already set up (e.g. the packages in [requirements.txt](https://github.com/dlb-rl/rl-tournament-starter/blob/main/requirements.txt) are installed), run `python -m soccer_twos.watch -m my_ray_soccer_agent` to watch your agent play against itself.

You can also test two different agents playing against each other with the following command: `python -m soccer_twos.watch -m1 my_ray_soccer_agent -m2 ceia_baseline_agent`. You can download the *ceia_baseline_agent* agent [here](https://drive.google.com/file/d/1WEjr48D7QG9uVy1tf4GJAZTpimHtINzE/view).