# Introduction

This notebook is a template for training a PPO model in the Webots grid environment. It lets you define your own reward function, set up the environment, and train and save a model.

---
**NOTE**

To use this notebook, first follow `UseGuide.md` to install the necessary software and packages.

---

In [1]:
%%capture output 
# capture ALL output of this cell to hide TensorFlow warnings

import numpy as np
from stable_baselines import PPO1
from stable_baselines.common.callbacks import CheckpointCallback

In [2]:
import sys
sys.path.insert(0, '../backend')

# load our webotsgym
import webotsgym as wg

# Create a Custom Reward Function (optional)

You can define your own reward function and done check in this block. The following variables and methods are available:
* `env.gps_actual`, GPS data for the current position.
* `env.gps_target`, GPS data for the target.
* `env.get_target_distance(normalized=False)`, the Euclidean distance from the current position to the target. If `normalized` (bool) is `True`, the distance is returned as the ratio of the actual distance to the maximum distance.
* `env.gps_visited_count`, how many times the current position has been visited in past steps.
* `env.state.touching`, whether the agent is touching an obstacle.
* `env.steps_in_run`, how many time steps the agent has used in this episode.
* `env.total_reward`, the sum of the rewards in this episode.
* `targetband`, the distance threshold for deciding whether the robot has reached the target. The default value is 0.05.

The reward function and done check in the cell below serve as an example.

In [3]:
class MyEval(wg.WbtRewardGrid):
    def __init__(self, env, config, targetband=0.05):
        super(MyEval, self).__init__(env, config)
        self.targetband = targetband

    def calc_reward(self):
        if self.env.get_target_distance() < self.targetband:
            reward = 10000
        else:
            reward = 0

            # step penalty
            target_distance = self.env.get_target_distance(normalized=True)
            step_penalty = -1
            lambda_ = 5
            reward += step_penalty * (1 - np.exp(-lambda_ * target_distance))

            # visited count penalty
            vc = self.env.gps_visited_count
            if vc > 3:
                reward += -0.2 * (vc - 2)**2

            # touching penalty
            if self.env.state.touching is True:
                reward -= 500

        return reward

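    def check_done(self):
        # stop once the step budget for one episode is used up
        if self.env.steps_in_run >= 200:
            return True
        # stop early if the episode has accumulated too much penalty
        if self.env.total_reward < -1000:
            return True
        # stop when the target has been reached
        if self.env.get_target_distance() < self.targetband:
            return True
        return False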
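
The step penalty in `calc_reward` grows in magnitude with the normalized target distance and saturates near -1. As a rough sketch, the cell below re-implements only that term as a standalone function so you can inspect how `lambda_` shapes it before training. The helper name `step_penalty` is hypothetical and not part of webotsgym.

```python
import math


def step_penalty(normalized_distance, lambda_=5.0, scale=-1.0):
    """Distance-dependent step penalty, mirroring the term used in MyEval.calc_reward.

    Hypothetical standalone helper for inspection only; not part of webotsgym.
    """
    return scale * (1.0 - math.exp(-lambda_ * normalized_distance))


# Print the penalty for a few normalized distances (1.0 = maximum distance).
for d in (0.05, 0.2, 0.5, 1.0):
    print("normalized distance {:.2f} -> penalty {:.3f}".format(d, step_penalty(d)))
```

With `lambda_ = 5`, the penalty is already about -0.92 at half the maximum distance, so this term changes most rapidly when the agent is close to the target. A smaller `lambda_` spreads the change over a larger range of distances.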

# Select Parameters for the Webots World

You can set up the Webots environment parameters for your training:

* `config.world_size`, the size of the Webots world used for training. For example, `config.world_size = 8` creates a square map of size 8x8 in Webots.
* `config.num_obstacles`, the number of obstacles. Each obstacle is a block of size 1x1.
* `config.sim_mode`, the simulation speed of Webots:
    * `config.sim_mode = wg.config.SimSpeedMode.NORMAL`, run the simulation in real-time mode.
    * `config.sim_mode = wg.config.SimSpeedMode.RUN`, run the simulation as fast as possible using all the available CPU power.
    * `config.sim_mode = wg.config.SimSpeedMode.FAST`, run the simulation as fast as possible without graphics rendering, so the 3D window stays black.


In [4]:
config = wg.WbtConfig()
config.world_size = 8
config.num_obstacles = 16
config.sim_mode = wg.config.SimSpeedMode.FAST
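
To judge how crowded a configuration is, it can help to compare the number of obstacles with the number of cells in the map. The sketch below does that arithmetic for the values used above. It assumes every obstacle occupies its own distinct 1x1 cell, which may not match how webotsgym actually places obstacles, and `obstacle_density` is a hypothetical helper name.

```python
def obstacle_density(world_size, num_obstacles):
    """Fraction of grid cells covered by obstacles.

    Assumption: each obstacle occupies one distinct 1x1 cell of a
    world_size x world_size map. Hypothetical helper, not part of webotsgym.
    """
    total_cells = world_size * world_size
    if num_obstacles > total_cells:
        raise ValueError("more obstacles than grid cells")
    return num_obstacles / total_cells


# Values from the config cell above: 16 obstacles on an 8x8 map.
print(obstacle_density(8, 16))  # 0.25
```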

# Start our Webotsgym


The cell below creates the training environment. To use the custom reward class defined above, uncomment the `MyEval` line and comment out the standard one.

In [5]:
# set eval class  (choose custom or standard)
# eval_ = MyEval  # custom
eval_ = wg.WbtRewardGrid  # standard

env = wg.WbtGymGrid(config=config,
                    evaluate_class=eval_)

Accepting on Port:  10201


# Initialize a Model from Stable-Baselines

During training, the model is saved periodically in `UseMe/model/grid/.log/`. To continue training from a previously trained model, comment out the `new model` line and uncomment the `from a trained model` lines.

More information on the PPO model parameters can be found [here](https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html#parameters).

In [6]:
%%capture output 
# capture ALL output of this cell to hide TensorFlow warnings

model_name = "PPO_webots_grid"
checkpoint_callback = CheckpointCallback(save_freq=5000, save_path='model/grid/.log/',
                                         name_prefix=model_name)

# new model
model = PPO1("MlpPolicy", env) 

# # from a trained model
# load_model = "name_of_model"
# model = PPO1.load("model/grid/.log/{}".format(load_model), env)
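
If you want to resume from the most recent checkpoint without typing its name by hand, a small helper can pick it from the log directory. The sketch below assumes the checkpoint files are named `<name_prefix>_<steps>_steps.zip`; check this against the files in your own `model/grid/.log/` directory before relying on it. `latest_checkpoint` is a hypothetical helper, not part of Stable-Baselines or webotsgym.

```python
import os
import re


def latest_checkpoint(log_dir, name_prefix):
    """Return the checkpoint name (without '.zip') with the highest step count.

    Assumption: files are named '<name_prefix>_<steps>_steps.zip'.
    Returns None if the directory is missing or holds no matching file.
    """
    if not os.path.isdir(log_dir):
        return None
    pattern = re.compile(r"^{}_(\d+)_steps\.zip$".format(re.escape(name_prefix)))
    best_steps, best_name = -1, None
    for file_name in os.listdir(log_dir):
        match = pattern.match(file_name)
        if match and int(match.group(1)) > best_steps:
            best_steps = int(match.group(1))
            best_name = file_name[:-len(".zip")]
    return best_name


# Prints None until at least one checkpoint has been written.
load_model = latest_checkpoint("model/grid/.log/", "PPO_webots_grid")
print(load_model)
```

The returned name can be used as `load_model` in the commented-out `from a trained model` lines above.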

# Train a Model on the Webotsgym


Train a PPO model on the Webots grid environment and save it after training. Set the training parameters first:

* `time_steps`, the total number of samples to train on.

More information on the training parameters can be found [here](https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html#parameters).

In [7]:
time_steps = 100000
model.learn(total_timesteps=time_steps, callback=checkpoint_callback)
model.save('model/grid/{}'.format(model_name))
print('Training finished :)')

Training finished :)
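
As a rough check on disk usage, you can estimate how many checkpoints a run will write from `time_steps` and the `save_freq` passed to `CheckpointCallback`. The sketch below assumes the callback is invoked once per environment step and saves on every `save_freq`-th call; the step numbers that appear in the actual file names may not match exactly. `expected_checkpoints` is a hypothetical helper.

```python
def expected_checkpoints(total_timesteps, save_freq):
    """Callback-call counts at which a checkpoint would be written.

    Assumption: one callback call per environment step, with a save on
    every save_freq-th call. Hypothetical helper for estimation only.
    """
    return list(range(save_freq, total_timesteps + 1, save_freq))


# Values used in this notebook: time_steps = 100000, save_freq = 5000.
steps = expected_checkpoints(100000, 5000)
print(len(steps), steps[0], steps[-1])  # 20 5000 100000
```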
