<a href="https://colab.research.google.com/github/christianhidber/easyagents/blob/master/jupyter_notebooks/intro_train_args.ipynb" 
   target="_parent">
   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Controlling training & evaluation or 'what do all these agent.train(...) args mean?'

There are quite a few arguments you can pass to an agent's `train(...)` method. Many of them start 
with `num_...`, like `num_iterations` or `num_iterations_between_eval`. 
They are used to control 

* the actual training - how often is the policy updated? - as well as
* the evaluation of the current policy - how well does the currently trained policy actually perform?

The training is performed in units called 'iterations'.
After each iteration the policy is updated.
At the beginning of an iteration, data is collected by playing `num_episodes_per_iteration` games, which
are also called episodes. 
If an episode runs for more than `max_steps_per_episode` steps, the episode is stopped and a new episode
is started (by calling the gym environment's `reset()` method).
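
Since every episode is cut off after `max_steps_per_episode` steps, these arguments put an upper bound on the amount of training data collected. A minimal sketch of that arithmetic follows; `max_training_steps` is a hypothetical helper, not part of easyagents, and the values are illustrative rather than the library's defaults:

```python
def max_training_steps(num_iterations, num_episodes_per_iteration, max_steps_per_episode):
    """Upper bound on the environment steps collected for training.

    Episodes that end early (e.g. the pole falls) contribute fewer steps,
    so the real number is usually lower.
    """
    return num_iterations * num_episodes_per_iteration * max_steps_per_episode


# Illustrative values only.
print(max_training_steps(num_iterations=100,
                         num_episodes_per_iteration=10,
                         max_steps_per_episode=500))
```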

To see how the policy changes during training we evaluate the policy every `num_iterations_between_eval`
iterations. During evaluation the policy is not modified; we just use it to collect statistics
like the average sum of rewards or the average number of steps per episode.

Thus a call to `agent.train(...)` results in a loop like this:

````
    initialize policy (typically with random values)
    for i in range(num_iterations):
        play num_episodes_per_iteration episodes using the current policy and store all steps
            if an episode runs for more than max_steps_per_episode steps stop the episode and start a new one
        update the current policy num_epochs_per_iteration times using the collected steps

        every num_iterations_between_eval iterations:
            play num_episodes_per_eval episodes and collect statistics like avg reward and avg steps
````
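
The loop can be mimicked with a small, self-contained simulation. This is a toy sketch, not easyagents' implementation: `simulate_train_loop` and its counters are hypothetical names, the "environment" simply ends each episode after a random number of steps, and running one evaluation before the first iteration is an assumption.

```python
import random


def simulate_train_loop(num_iterations, num_episodes_per_iteration, max_steps_per_episode,
                        num_epochs_per_iteration, num_iterations_between_eval,
                        num_episodes_per_eval, seed=0):
    """Toy model of the train loop: it counts what happens but learns nothing."""
    rng = random.Random(seed)
    stats = {'train_episodes': 0, 'train_steps': 0, 'policy_updates': 0,
             'evaluations': 0, 'eval_episodes': 0}

    def play_episode():
        # Hypothetical environment: the episode would end on its own after
        # natural_length steps, but is cut off at max_steps_per_episode.
        natural_length = rng.randint(1, 2 * max_steps_per_episode)
        return min(natural_length, max_steps_per_episode)

    def evaluate():
        stats['evaluations'] += 1
        for _ in range(num_episodes_per_eval):
            play_episode()  # the policy is only used here, never modified
            stats['eval_episodes'] += 1

    evaluate()  # assumption: evaluate once before any training
    for i in range(1, num_iterations + 1):
        # collect data with the current policy
        for _ in range(num_episodes_per_iteration):
            stats['train_steps'] += play_episode()
            stats['train_episodes'] += 1
        # update the policy using the collected steps
        stats['policy_updates'] += num_epochs_per_iteration
        if i % num_iterations_between_eval == 0:
            evaluate()
    return stats


stats = simulate_train_loop(num_iterations=10, num_episodes_per_iteration=3,
                            max_steps_per_episode=50, num_epochs_per_iteration=5,
                            num_iterations_between_eval=5, num_episodes_per_eval=2)
print(stats)
```

With these illustrative values the simulation plays 30 training episodes, performs 50 policy updates and runs 3 evaluations (one up front, one after iteration 5 and one after iteration 10).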

We can actually see how the train-loop runs:


### Install packages (gym, tfagents, tensorflow, ...)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import sys
import warnings

warnings.filterwarnings('ignore')
if 'google.colab' in sys.modules:
    !apt-get install xvfb >/dev/null
    !pip install pyvirtualdisplay >/dev/null    
    
    from pyvirtualdisplay import Display
    Display(visible=0, size=(960, 720)).start() 
else:
    #  for local installation
    sys.path.append('..')

#### install easyagents

In [2]:
import sys
if 'google.colab' in sys.modules:
    !pip install easyagents >/dev/null    

## The training loop in action

In [3]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([duration.Fast()])

In [4]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

#ppoAgent = PpoAgent('CartPole-v0', backend='tensorforce')
#ppoAgent.train([log.Agent(), duration.Fast()], default_plots=False)

With the tensorforce backend the agent and its policy are also set up first. Note that, in contrast to tfagents, we do not
build up actor and critic policy networks but instead pass a network specification to the `Agent.create` call.
Moreover, tensorforce already implements the train loop through its `Runner` class. 
Thus we only see one call to `runner.run` instead of the many api calls for tfagents.

## Seeding

To set a seed use:

In [1]:
import easyagents

easyagents.agents.seed = 0

Once set, the seed is applied before each call to train. Let's validate this using our log.Agent callback:

In [2]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([log.Agent(), duration.Fast()], default_plots=False)

Note the calls at the very beginning that set the seeds for tensorflow, numpy and python.
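
Why reseeding before each call to train matters can be seen in a minimal sketch that uses only Python's `random` module as a stand-in for the tensorflow, numpy and python generators. `fake_train` is a hypothetical placeholder for a training run and says nothing about easyagents' internals:

```python
import random


def fake_train(seed=None, num_draws=5):
    """Stand-in for a training run: reseed (if a seed is set), then draw random numbers."""
    if seed is not None:
        random.seed(seed)  # applied before every "training" call
    return [random.random() for _ in range(num_draws)]


first = fake_train(seed=0)
second = fake_train(seed=0)
print(first == second)  # same seed before each run, same sequence of draws
```

Without the reseeding step the second run would continue from wherever the generator was left by the first one, and the two runs would differ.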

## Iteration & Duration logging
Use the log.Iteration() callback to log the training progress:

In [1]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([log.Iteration(), duration.Fast()], default_plots=False)

The first line shows the result of an initial policy evaluation, depicting the policy's performance before any 
training has happened. Policy evaluation happens every `num_iterations_between_eval` iterations and spans
`num_episodes_per_eval` episodes. For every evaluation period the result is logged again. The `steps_done` value is 
the number of training steps (excluding the steps taken during evaluation). 
To see the training duration configuration use the log.Duration() callback:

In [2]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([log.Duration(), duration.Fast()], default_plots=False)

## Gym steps logging
Use the log.Step() callback to investigate how the agent interacts with the gym environment:

In [3]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, log

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([log.Step(), duration.Fast()], default_plots=False)

For each call to the gym environment's `step` method you get a log entry, along with the action taken and the current
observation. Each entry starts with 

[{gym_env_id} {instance_id}:{episode_in_instance}:{step_in_episode}]

followed by the id of the current training iteration as well as the current iteration step count.
During an evaluation period you get the same statistics for the current evaluation episode.
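
If you want to post-process such log lines, the prefix can be split into its fields with a regular expression. The sketch below assumes that the three ids are non-negative integers and that the gym environment id contains no whitespace; `parse_step_prefix` is a hypothetical helper and the sample line is made up, not actual easyagents output:

```python
import re

# Matches [{gym_env_id} {instance_id}:{episode_in_instance}:{step_in_episode}]
PREFIX = re.compile(
    r'\[(?P<gym_env_id>\S+)\s+'
    r'(?P<instance_id>\d+):(?P<episode_in_instance>\d+):(?P<step_in_episode>\d+)\]'
)


def parse_step_prefix(line):
    """Return the prefix fields as a dict, or None if the line has no such prefix."""
    match = PREFIX.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ('instance_id', 'episode_in_instance', 'step_in_episode'):
        fields[key] = int(fields[key])
    return fields


# Made-up sample line; real log entries carry more detail after the prefix.
sample = '[CartPole-v0 0:3:17] ...'
print(parse_step_prefix(sample))
```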

You may easily implement other log callbacks to produce statistics specific to your problem domain.

## Fixing jupyter output cell clearing
It seems that the jupyter / matplotlib backend changes its behaviour of outputting the current figure of an 
evaluated cell (if you can help here, please let us know by 
[creating an issue](https://github.com/christianhidber/easyagents/issues/new/choose)).

Nonetheless you may directly control easyagents' jupyter output cell clearing behaviour through the plot.Clear()
callback:


In [3]:
from easyagents.agents import PpoAgent
from easyagents.callbacks import duration, plot

ppoAgent = PpoAgent('CartPole-v0')
ppoAgent.train([plot.Clear(on_train=False,on_play=False), duration.Fast()])

If your plot gets "doubled" after cell evaluation, set on_train / on_play to True; if it disappears, set them to False. 
Once plot.Clear() is called, the behaviour stays the same across all upcoming plots.