# Introduction

This notebook is a template for training a PPO model in the Webots grid environment. It walks through writing your own reward function, setting up the environment, and training and saving a model.

---
**NOTE**

To use this notebook, first follow `UseGuide.md` to install the necessary software and packages.

---

In [1]:
%%capture output 
# captures ALL output in cell to disable tensorflow warnings

import numpy as np
from stable_baselines import PPO1

In [2]:
import sys
sys.path.insert(0,'../backend')

# load our webotsgym
import webotsgym as wg

# Create a Custom Reward Function (optional)

You can define your own reward function and your own episode-termination check (`check_done`) in this block. The following variables and methods are available:
* `env.gps_actual`, the GPS data for the current position.
* `env.gps_target`, the GPS data for the target.
* `env.get_target_distance(normalized=False)`, the Euclidean distance from the current position to the target. `normalized` (bool): if True, the distance is returned as the ratio of the actual distance to the maximum distance.
* `env.state.touching`, whether the agent is touching an obstacle.
* `env.time_steps`, the number of time steps the agent has used in this episode.



Here is a quick example of `calc_reward()` and `check_done()`:
```python
    def calc_reward(self):
        if self.env.get_target_distance() < 0.05:
            reward = 10000
        else:
            reward = 0

            # step penalty
            target_distance = self.env.get_target_distance(normalized=True)
            reward += -1 * (1 - exponential_decay(target_distance, lambda_=5))

            # touching penalty
            if self.env.state.touching is True:
                reward -= 500

        return reward

    def check_done(self):
        if self.env.time_steps == 200:
            return True
        if self.env.get_target_distance() < 0.05:
            return True
        return False
```
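
The example above calls `exponential_decay`, which is not defined in this notebook. The sketch below reproduces the same reward logic as a standalone function so its shape can be inspected without Webots. It assumes the helper has the form `exp(-lambda_ * x)`; that form and the name `shaped_reward` are assumptions for illustration, not the library's confirmed implementation.

```python
import math


def exponential_decay(x, lambda_=5):
    """Assumed form of the helper: 1.0 at x = 0, decaying toward 0 as x grows."""
    return math.exp(-lambda_ * x)


def shaped_reward(distance, normalized_distance, touching):
    """Standalone copy of the reward logic from the example (hypothetical name)."""
    # Large bonus once the agent is within 0.05 of the target.
    if distance < 0.05:
        return 10000

    # Step penalty: close to 0 near the target, approaching -1 far away.
    reward = -1 * (1 - exponential_decay(normalized_distance, lambda_=5))

    # Touching penalty.
    if touching:
        reward -= 500
    return reward


if __name__ == "__main__":
    # Print the step penalty for a few normalized distances.
    for d in (0.0, 0.25, 0.5, 1.0):
        print(d, shaped_reward(distance=1.0, normalized_distance=d, touching=False))
```

Under this assumption the step penalty stays between -1 and 0, so the touching penalty (-500) and the target bonus (10000) dominate the per-step signal.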

In [3]:
class MyEval(wg.WbtRewardGrid):
    def __init__(self, env, config: wg.WbtConfig = wg.WbtConfig()):
        super(MyEval, self).__init__(env, config)

    def calc_reward(self):
        """Calculate the reward for the current state.

        Returns: (float) reward value
        """
        pass

    def check_done(self):
        """Check whether the current episode should end.

        Returns: (bool) True if the episode is over
        """
        pass

# Select Parameters for the Webots World

You can set the Webots environment parameters for your training:

* `config.world_size`, the size of the Webots world used for training. For example, `config.world_size = 8` creates a square map of size 8x8 in Webots.
* `config.num_obstacles`, the number of obstacles. Each obstacle is a block of size 1x1.
* `config.sim_mode`, the speed of the Webots simulation:
  * `config.sim_mode = wg.config.SimSpeedMode.NORMAL` runs the simulation in real-time mode.
  * `config.sim_mode = wg.config.SimSpeedMode.RUN` runs the simulation as fast as possible using all the available CPU power.
  * `config.sim_mode = wg.config.SimSpeedMode.FAST` runs the simulation as fast as possible without graphics rendering, so the 3D window is black.


In [4]:
config = wg.WbtConfig()
config.world_size = 8
config.num_obstacles = 16
config.sim_mode = wg.config.SimSpeedMode.RUN
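
When choosing `world_size` and `num_obstacles` together, it can help to check how much of the map the obstacles cover. The helper below is a hypothetical convenience, not part of `webotsgym`, and it assumes that obstacles do not overlap and that each one occupies exactly one 1x1 cell.

```python
def obstacle_density(world_size, num_obstacles):
    """Fraction of a world_size x world_size map covered by 1x1 obstacles.

    Hypothetical helper; assumes non-overlapping obstacles.
    """
    total_cells = world_size * world_size
    if num_obstacles < 0 or num_obstacles > total_cells:
        raise ValueError("num_obstacles must be between 0 and world_size ** 2")
    return num_obstacles / total_cells


if __name__ == "__main__":
    # The settings used in this notebook: 16 obstacles on an 8x8 map.
    print(obstacle_density(world_size=8, num_obstacles=16))
```

With the values in the cell above, 16 of the 64 cells are occupied, a quarter of the map.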

# Start our Webotsgym


The next cells create the training environment. To use your own reward class, use `env = wg.WbtGymGrid(config=config, evaluate_class=MyEval)` and comment out `env = wg.WbtGymGrid(config=config)`.

In [5]:
# normal
env = wg.WbtGymGrid(config=config)

Accepting on Port:  10201


In [6]:
# custom reward class
# env = wg.WbtGymGrid(config=config,
#                     evaluate_class=MyEval)

# Initialize a Model from Stable Baselines

More information on the parameters of the PPO model can be found [here](https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html#parameters).

In [7]:
%%capture output 
# captures ALL output in cell to disable tensorflow warnings

model_name = "PPO_webots_grid"
model = PPO1("MlpPolicy", env)
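
The cell above relies on the default hyperparameters. As a sketch of how they could be overridden, the block below collects a few `PPO1` keyword arguments in a dictionary and passes them to the constructor. The values are believed to match the documented defaults, but check them against the linked parameter page. `ppo_params` and `build_model` are hypothetical helper names introduced only for this example.

```python
# Keyword arguments accepted by stable_baselines.PPO1 (values assumed to be the defaults).
DEFAULT_PPO_PARAMS = {
    "gamma": 0.99,                    # discount factor
    "timesteps_per_actorbatch": 256,  # timesteps collected per actor per update
    "clip_param": 0.2,                # PPO clipping parameter
    "entcoeff": 0.01,                 # entropy loss weight
    "optim_epochs": 4,                # optimizer epochs per update
    "optim_stepsize": 1e-3,           # optimizer step size
    "optim_batchsize": 64,            # optimizer batch size
    "lam": 0.95,                      # advantage estimation (GAE) lambda
    "schedule": "linear",             # learning-rate schedule
    "verbose": 0,
}


def ppo_params(**overrides):
    """Return a copy of the defaults with the given overrides applied."""
    params = dict(DEFAULT_PPO_PARAMS)
    params.update(overrides)
    return params


def build_model(env, **overrides):
    """Create a PPO1 model for `env` (requires stable_baselines to be installed)."""
    from stable_baselines import PPO1  # imported lazily so the helpers above work without it
    return PPO1("MlpPolicy", env, **ppo_params(**overrides))
```

For example, `model = build_model(env, gamma=0.95, verbose=1)` would create a model with a lower discount factor and progress logging enabled.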

# Train a Model on the Webotsgym


Train a PPO model on the Webots grid environment and save it after training. Please set the training parameters:

* `time_steps`, the total number of samples to train on.

More information on the parameters for model training can be found [here](https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html#parameters).

In [None]:
time_steps = 100000
model.learn(total_timesteps=time_steps)
model.save("model/grid/{}".format(model_name))
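
After training, the saved model can be loaded again and run in the environment. The following is a minimal sketch of an evaluation loop. It assumes the environment follows the classic Gym interface (`reset()` returns an observation, `step()` returns `(obs, reward, done, info)`), which Stable Baselines expects. `run_episode` is a hypothetical helper name.

```python
def run_episode(model, env, max_steps=200):
    """Run one episode with the model's actions; return (total_reward, steps)."""
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        # predict() returns the action and the (unused here) recurrent state.
        action, _state = model.predict(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# Usage with the model saved above (requires Webots and stable_baselines):
# from stable_baselines import PPO1
# model = PPO1.load("model/grid/PPO_webots_grid")
# total_reward, steps = run_episode(model, env)
```

The `max_steps` default of 200 mirrors the episode limit used in the `check_done()` example and can be changed freely.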

