<a href="https://colab.research.google.com/github/apwvt/S2DC-RL-Project/blob/self-play/S2DC%20Reinforcement%20Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pull Project Code**

---
We first get the latest version of the project code.


In [19]:
from os import path

%cd /content/

if path.exists("S2DC-RL-Project"):
  ! cd S2DC-RL-Project && git pull
else:
  ! git clone https://github.com/apwvt/S2DC-RL-Project.git

/content
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/apwvt/S2DC-RL-Project
   06f4e40..5437b5a  self-play  -> origin/self-play
Updating 06f4e40..5437b5a
Fast-forward
 S2DC Reinforcement Learning.ipynb | 180 +++++++++++++++++++++++++++++++-------
 1 file changed, 150 insertions(+), 30 deletions(-)


In [20]:
# We add the repository folder to the module path.
import sys
sys.path.insert(1, "/content/S2DC-RL-Project")

**Mount Drive**

---
We mount a shared drive to have a place to save checkpoints.

In [21]:
from google.colab import drive
drive.mount('/content/Drive')

Drive already mounted at /content/Drive; to attempt to forcibly remount, call drive.mount("/content/Drive", force_remount=True).


In [2]:
%set_env LOGFOLDER=/content/Drive/Shareddrives/CS_ML_ENV/colab_env/logs

env: LOGFOLDER=/content/Drive/Shareddrives/CS_ML_ENV/colab_env/logs
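
Later cells read `LOGFOLDER` from the environment, so it can help to confirm that the variable is set and the folder exists before training starts. The helper below is a minimal sketch; `ensure_log_folder` is a hypothetical name and is not part of the project code.

```python
import os

def ensure_log_folder(env_var="LOGFOLDER"):
    """Return the folder named by env_var, creating it if it does not exist.

    Hypothetical helper: the project itself only reads the variable.
    """
    folder = os.environ.get(env_var)
    if not folder:
        raise RuntimeError(f"{env_var} is not set; run the %set_env cell first")
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(folder, exist_ok=True)
    return folder
```

Calling `ensure_log_folder()` right after the `%set_env` cell should surface a missing or unmounted Drive path early instead of partway through training.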


**Setup Dependencies**

---
We install our repository as a local package.

In [3]:
%%capture
# Suppress the verbose installation output (%%capture must be the first line of the cell)

%cd S2DC-RL-Project
! pip install -e .  # Local module
! pip install ray   # Distributed computing package
! pip install pettingzoo # Environment support
! pip install "pettingzoo[magent]" # Multi-agent environments
! pip install tensorboard

**Run Project Code**

---

Any project files can be loaded from here and used in programs.

In [5]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [6]:
# Switch to the git branch to use
! git checkout self-play

M	muzero_collab/muzero.py
Already on 'self-play'
Your branch is up to date with 'origin/self-play'.


In [8]:
# Set up the MuZero config
import os
import datetime
import torch

class BattleMuZeroConfig:
    def __init__(self):
        # More information is available here: https://github.com/werner-duvaud/muzero-general/wiki/Hyperparameter-Optimization

        self.seed = 0x1BADB007  # Seed for numpy, torch and the game
        self.max_num_gpus = None  # Maximum number of GPUs to use. It's usually faster to use a single GPU (set it to 1) if it has enough memory. None will use every available GPU



        ### Game
        self.observation_shape = (13, 13, 41)  # Dimensions of the game observation, must be 3D (channel, height, width). For a 1D array, please reshape it to (1, 1, length of array)
        self.action_space = list(range(21))  # Fixed list of all possible actions. You should only edit the length
        self.players = list(range(1))  # List of players. You should only edit the length
        self.stacked_observations = 5  # Number of previous observations and previous actions to add to the current observation

        # Evaluate
        self.muzero_player = 0  # Turn Muzero begins to play (0: MuZero plays first, 1: MuZero plays second)
        self.opponent = None  # Hard-coded agent that MuZero faces to assess its progress in multiplayer games. It doesn't influence training. None, "random" or "expert" if implemented in the Game class



        ### Self-Play
        self.num_workers = 1  # Number of simultaneous threads/workers self-playing to feed the replay buffer
        self.selfplay_on_gpu = False
        self.max_moves = 500  # Maximum number of moves if game is not finished before
        self.num_simulations = 50  # Number of future moves self-simulated
        self.discount = 0.997  # Chronological discount of the reward
        self.temperature_threshold = None  # Number of moves before dropping the temperature given by visit_softmax_temperature_fn to 0 (ie selecting the best action). If None, visit_softmax_temperature_fn is used every time

        # Root prior exploration noise
        self.root_dirichlet_alpha = 0.25
        self.root_exploration_fraction = 0.25

        # UCB formula
        self.pb_c_base = 19652
        self.pb_c_init = 1.25



        ### Network
        self.network = "fullyconnected"  # "resnet" / "fullyconnected"
        self.support_size = 10  # Value and reward are scaled (with almost sqrt) and encoded on a vector with a range of -support_size to support_size. Choose it so that support_size <= sqrt(max(abs(discounted reward)))
     
        # Residual Network
        self.downsample = False  # Downsample observations before representation network, False / "CNN" (lighter) / "resnet" (See paper appendix Network Architecture)
        self.blocks = 1  # Number of blocks in the ResNet
        self.channels = 2  # Number of channels in the ResNet
        self.reduced_channels_reward = 2  # Number of channels in reward head
        self.reduced_channels_value = 2  # Number of channels in value head
        self.reduced_channels_policy = 2  # Number of channels in policy head
        self.resnet_fc_reward_layers = []  # Define the hidden layers in the reward head of the dynamic network
        self.resnet_fc_value_layers = []  # Define the hidden layers in the value head of the prediction network
        self.resnet_fc_policy_layers = []  # Define the hidden layers in the policy head of the prediction network

        # Fully Connected Network
        self.encoding_size = 8
        self.fc_representation_layers = []  # Define the hidden layers in the representation network
        self.fc_dynamics_layers = [16]  # Define the hidden layers in the dynamics network
        self.fc_reward_layers = [16]  # Define the hidden layers in the reward network
        self.fc_value_layers = [16]  # Define the hidden layers in the value network
        self.fc_policy_layers = [16]  # Define the hidden layers in the policy network



        ### Training
        self.results_path = os.path.join(os.environ["LOGFOLDER"], "battle", datetime.datetime.now().strftime("%Y-%m-%d--%H-%M-%S"))  # Path to store the model weights and TensorBoard logs (raises KeyError if LOGFOLDER is not set)
        self.save_model = True  # Save the checkpoint in results_path as model.checkpoint
        self.training_steps = 5000  # Total number of training steps (ie weights update according to a batch)
        self.batch_size = 128  # Number of parts of games to train on at each training step
        self.checkpoint_interval = 10  # Number of training steps before using the model for self-playing
        self.value_loss_weight = 1  # Scale the value loss to avoid overfitting of the value function, paper recommends 0.25 (See paper appendix Reanalyze)
        self.train_on_gpu = torch.cuda.is_available()  # Train on GPU if available

        self.optimizer = "Adam"  # "Adam" or "SGD". Paper uses SGD
        self.weight_decay = 1e-4  # L2 weights regularization
        self.momentum = 0.9  # Used only if optimizer is SGD

        # Exponential learning rate schedule
        self.lr_init = 0.02  # Initial learning rate
        self.lr_decay_rate = 0.9  # Set it to 1 to use a constant learning rate
        self.lr_decay_steps = 1000



        ### Replay Buffer
        self.replay_buffer_size = 500  # Number of self-play games to keep in the replay buffer
        self.num_unroll_steps = 10  # Number of game moves to keep for every batch element
        self.td_steps = 50  # Number of steps in the future to take into account for calculating the target value
        self.PER = True  # Prioritized Replay (See paper appendix Training), select in priority the elements in the replay buffer which are unexpected for the network
        self.PER_alpha = 0.5  # How much prioritization is used, 0 corresponding to the uniform case, paper suggests 1

        # Reanalyze (See paper appendix Reanalyze)
        self.use_last_model_value = True  # Use the last model to provide a fresher, stable n-step value (See paper appendix Reanalyze)
        self.reanalyse_on_gpu = False



        ### Adjust the self play / training ratio to avoid over/underfitting
        self.self_play_delay = 0  # Number of seconds to wait after each played game
        self.training_delay = 0  # Number of seconds to wait after each training step
        self.ratio = 1.5  # Desired training steps per self played step ratio. Equivalent to a synchronous version, training can take much longer. Set it to None to disable it

    def visit_softmax_temperature_fn(self, trained_steps):
        """
        Parameter to alter the visit count distribution to ensure that the action selection becomes greedier as training progresses.
        The smaller it is, the more likely the best action (ie with the highest visit count) is chosen.
        Returns:
            Positive float.
        """
        if trained_steps < 0.5 * self.training_steps:
            return 1.0
        elif trained_steps < 0.75 * self.training_steps:
            return 0.5
        else:
            return 0.25
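
To make the effect of `visit_softmax_temperature_fn` concrete, the sketch below turns root visit counts into action probabilities, assuming the usual MuZero rule that each probability is proportional to `visit_count ** (1 / temperature)`. The function name `visit_count_policy` and the example counts are hypothetical and only serve as illustration; the project's self-play code may differ in detail.

```python
def visit_count_policy(visit_counts, temperature):
    """Convert MCTS visit counts into action probabilities.

    Assumes p_i is proportional to N_i ** (1 / T); T == 0 is treated
    as the greedy limit that always picks the most visited action.
    """
    if temperature == 0:
        best = max(range(len(visit_counts)), key=lambda i: visit_counts[i])
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    scaled = [count ** (1.0 / temperature) for count in visit_counts]
    total = sum(scaled)
    return [value / total for value in scaled]

# Hypothetical visit counts for three actions after a search.
counts = [30, 15, 5]
for temperature in (1.0, 0.5, 0.25):
    probs = visit_count_policy(counts, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With these counts, the probability of the most visited action grows as the temperature drops from 1.0 to 0.25, which is the greedier behaviour the schedule aims for in the later stages of training.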

In [12]:
# Begin the model's training
from muzero_collab.muzero import MuZero
import ray

muzero = MuZero("battle", BattleMuZeroConfig())

if ray.is_initialized():
  ray.shutdown()

ray.init()
muzero.train()

2021-03-11 03:15:44,726	INFO worker.py:665 -- Calling ray.init() again after it has already been called.



Shutting down workers...


2021-03-11 03:15:49,798	INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265



Training...
Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.





Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00




Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00
Last test reward: 0.00. Training step: 0/5000. Played games: 0. Loss: 0.00

Shutting down workers...


Persisting replay buffer games to disk...
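
Each run writes to its own timestamped folder under `LOGFOLDER/battle`, following `results_path` in the config. Because the `%Y-%m-%d--%H-%M-%S` names sort chronologically as plain strings, the most recent run can be located with a short helper. This is a sketch only; `latest_run_dir` is a hypothetical name, not a function provided by the project.

```python
import os

def latest_run_dir(log_folder, game="battle"):
    """Return the newest run folder under log_folder/game, or None if there is none.

    Relies on the timestamped folder names sorting chronologically as strings.
    """
    base = os.path.join(log_folder, game)
    if not os.path.isdir(base):
        return None
    runs = sorted(
        name for name in os.listdir(base)
        if os.path.isdir(os.path.join(base, name))
    )
    return os.path.join(base, runs[-1]) if runs else None
```

For example, `latest_run_dir(os.environ["LOGFOLDER"])` should point at the folder where, with `save_model = True`, the config comments say the weights are stored as `model.checkpoint`.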


In [None]:
# Pack the logs folder into a tarball so it can be downloaded
!tar czf results.tar.gz "$LOGFOLDER"

In [None]:
# Start TensorBoard on the folder the config writes its logs to (LOGFOLDER)
%tensorboard --logdir /content/Drive/Shareddrives/CS_ML_ENV/colab_env/logs