In this tutorial we will create a custom gatherer to alter the way in which our agent selects actions. Before we begin, let's briefly review the gathering process of PPO.

During the gathering phase, our agent collects experience via *n* workers. Each of these workers maintains its own copy of the policy as well as its own copy of the environment. The workers are independent: while they share the same version of the policy and environment, their random processes are completely independent. Each worker selects an action at every step of the environment. In the standard gathering setup, this selection is based purely on the parameters the policy predicts for its distribution. Based on these parameters, the gatherer samples the action the agent will execute. We then also determine the action's probability (based on the PDF or PMF of the distribution) so that the optimization can increase or decrease it depending on whether the action turned out to be advantageous or disadvantageous.
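The sample-then-score step described above can be sketched without any framework. The following is a minimal illustration for a discrete action space, assuming the policy outputs one unnormalized score (logit) per action. The names `softmax`, `sample_action` and the example logits are hypothetical and are not part of the library used in this tutorial.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into a probability mass function."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample an action from the categorical policy and report its probability."""
    pmf = softmax(logits)
    action = rng.choices(range(len(pmf)), weights=pmf, k=1)[0]
    return action, pmf[action]


rng = random.Random(0)      # seeded so the sketch is reproducible
logits = [2.0, 0.5, -1.0]   # hypothetical policy output for three actions
action, probability = sample_action(logits, rng)
log_probability = math.log(probability)  # implementations often store the log instead
```

Both the action and its probability are stored with the experience, because the optimization later needs to know how likely the action was at the time it was taken.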

This sampling constitutes a stochastic policy. The advantage of a stochastic policy is its implicit exploration: because the process of choosing an action is inherently stochastic, no additional exploration mechanism is required. This advantage, however, comes with drawbacks, since one still needs to ensure a good balance between exploration and exploitation. Setting schedules for decreasing the stochasticity requires knowledge about the convergence of the model in the environment. Such schedules also lack flexibility when some regions of the space being explored still need more exploration than others. Often, one instead lets the model itself also predict the spread of the policy distribution (e.g. the variance of a Gaussian distribution). To prevent premature convergence, the objective is augmented with an entropy bonus that rewards high-entropy, and therefore more exploratory, policies. This allows us to use a stochastic policy without explicit exploration mechanisms.
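To make the entropy bonus concrete, here is a small sketch using the closed-form entropies of a categorical and a univariate Gaussian distribution. The function names and the coefficient `c_entropy` are illustrative assumptions and are not taken from the library used in this tutorial.

```python
import math


def categorical_entropy(pmf):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)


def gaussian_entropy(std):
    """Differential entropy (in nats) of a univariate Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def objective_with_entropy_bonus(policy_objective, entropy, c_entropy=0.01):
    """Add a weighted entropy term to an objective that is being maximized."""
    return policy_objective + c_entropy * entropy


# A near-deterministic policy receives a smaller bonus than a uniform one.
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])
uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])
```

For the Gaussian case the entropy grows with the predicted standard deviation, so the bonus counteracts a collapse of the variance early in training.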

In some applications, we may want to deviate from this approach and make the sampling of actions depend on factors other than the predicted parameters. Such applications could, for instance, be models of curiosity. Note that conceptually, PPO is an on-policy algorithm. That means that the optimization assumes the samples it optimizes on were generated directly by the current policy. However, this is not entirely true in practice. When optimizing, we usually use *mini batch* stochastic gradient descent methods and run multiple epochs per cycle. Naturally, this means that after the first update to the policy, every subsequent update optimizes on experience that is distributed differently from what the updated policy would yield. This suggests that we have some leeway in the gathering process.
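PPO accounts for this mismatch through the ratio between the probability the updated policy assigns to an action and the probability recorded when the action was gathered. The sketch below shows the clipped surrogate for a single sample in plain Python. The function name and the default clipping range of 0.2 are assumptions made for illustration, not the implementation of the library used here.

```python
def clipped_surrogate(new_prob, old_prob, advantage, clip=0.2):
    """Clipped PPO surrogate for one sample (a value to be maximized)."""
    # How much more (or less) likely the action is under the updated policy
    # than it was under the policy that gathered it.
    ratio = new_prob / old_prob
    # Restrict the ratio to [1 - clip, 1 + clip].
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Take the pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Policy unchanged since gathering: the ratio is 1 and nothing is clipped.
unchanged = clipped_surrogate(0.5, 0.5, advantage=1.0)
# Policy moved far towards an advantageous action: the gain is capped.
capped = clipped_surrogate(0.9, 0.3, advantage=1.0)
```

Because only the recorded probability enters the ratio, the optimization already tolerates samples that no longer match the current policy exactly.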

The following showcases how to do this by means of a simple example: we will add epsilon-greedy exploration as an additional exploration mechanism. First, let us import some basics we will need.

In [120]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import random
import tensorflow as tf

To implement a custom gatherer, we will extend the ```Gatherer``` class. We do not want to alter the general behaviour of the gatherer (which is also not recommended) but only change the action selection, so we override only the method ```select_action(self, predicted_parameters)```.

In [121]:
from dexterity.agent.gather import Gatherer

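class EpsilonGreedyGatherer(Gatherer):
    """Gatherer that replaces 10% of the policy's actions with uniformly random ones."""

    def select_action(self, predicted_parameters: list):
        if random.random() < 0.9:
            # with probability 0.9, sample from the policy distribution as usual
            action, action_probability = super().select_action(predicted_parameters)
        else:
            # otherwise, draw a uniformly random action from the action space ...
            action = self.distribution.action_space.sample()
            # ... and score it under the policy, given the predicted parameters
            action_probability = self.distribution.log_probability(tf.expand_dims(action, 0), *predicted_parameters)

        return action, action_probability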

The signature is simple: we receive the parameters the policy predicted and return an action and its probability *given those parameters*. The latter we can easily calculate using the distribution of the policy (```self.distribution```).
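The selection logic can also be illustrated independently of the library. The following sketch works on a plain list of action probabilities and makes the exploration rate an explicit parameter. `epsilon_greedy_select` and its arguments are hypothetical names. Following the variant used in this tutorial, the non-random branch samples from the policy rather than taking the most probable action.

```python
import random


def epsilon_greedy_select(pmf, epsilon, rng):
    """Return (action, probability of that action under the policy)."""
    if rng.random() < epsilon:
        # explore: ignore the policy and pick uniformly at random
        action = rng.randrange(len(pmf))
    else:
        # otherwise sample from the policy distribution as usual
        action = rng.choices(range(len(pmf)), weights=pmf, k=1)[0]
    # In both branches the action is scored under the policy itself.
    return action, pmf[action]


rng = random.Random(0)
pmf = [0.9, 0.1]  # hypothetical policy over two actions
samples = [epsilon_greedy_select(pmf, 0.1, rng) for _ in range(20000)]
# Share of samples in which the less likely action was chosen.
frequency = sum(action == 1 for action, _ in samples) / len(samples)
```

With these numbers, action 1 is actually chosen with probability 0.9 · 0.1 + 0.1 · 0.5 = 0.14, while the probability reported for it remains 0.1. The gathered experience therefore deviates slightly from what the policy alone would produce.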

Now that we have a new gatherer, we need to incorporate it into an agent and train that agent. First, we create our environment.

In [122]:
from dexterity.common.transformers import StateNormalizationTransformer
from dexterity.common.wrappers import make_env

env = make_env("CartPole-v1", transformers=[StateNormalizationTransformer])

Note that in the above we only use the ```StateNormalizationTransformer``` because CartPole's reward function does not cope well with normalization. Next, we create our model and agent.

In [123]:
from dexterity.models import get_model_builder
from dexterity.agent.ppo_agent import PPOAgent

model_builder = get_model_builder("simple", "ffn")
agent = PPOAgent(
    model_builder,
    env,
    horizon=512,
    workers=12
)

Using [StateNormalizationTransformer] for preprocessing.
An MPI Optimizer with 1 ranks has been created; the following ranks optimize: [0]


For now, this agent would use the default gatherer. Let's change this.

In [124]:
agent.assign_gatherer(EpsilonGreedyGatherer)

That's it. Let's train this thing.

In [125]:
agent.drill(10, 3, 64)



Drill started using 1 processes for 6 workers of which 1 are optimizers. Worker distribution: [6].
IDs over Workers: [[0, 1, 2, 3, 4, 5]]
IDs over Optimizers: [[0, 1, 2, 3, 4, 5]]
Before Training: r:    21.09; len:    21.09; n: 140; loss: [-|-|-]; eps:     0; lr: 1.00e-03; upd:      0; f:    0.000k; y.exp: 0.00000; times:  ; took s [unknown time left]; mem: 1.77/33|0.51/8.36;
Cycle     1/20: r:    33.85; len:    33.85; n:  85; loss: [  0.04|   38.15|  0.68]; eps:   140; lr: 1.00e-03; upd:    144; f:    3.072k; times: [10.5|0.0|2.7] [79|0|21]; took 13.88s [4.4mins left]; mem: 1.8/33|0.51/8.36;
Cycle     2/20: r:    69.21; len:    69.21; n:  38; loss: [ -0.16|   66.15|  0.64]; eps:   225; lr: 1.00e-03; 


KeyboardInterrupt

