# IBM Think 2020 Code Cafe - Tic Tac Toe with DeepMind OpenSpiel

![https://nextgrid.ai/wp-content/uploads/2020/05/ibmthink.png](https://nextgrid.ai/wp-content/uploads/2020/05/ibmthink.png)


# PyBullet hopper with stable-baselines

In this notebook, you will use a stable-baselines RL algorithm to train a one-legged robot to move forward.  
The goal is to move forward as fast as possible without falling.  
The task is considered solved when your robot consistently achieves a score of 3000.

## Hopper:
Your robot will solve the [Hopper](https://gym.openai.com/envs/Hopper-v2/) environment. 
We will use the implementation from [PyBullet gym](https://github.com/benelot/pybullet-gym).

The reward signal used to train your model is the following:
* The agent gets a positive reward for distance walked forward; a successful agent can collect 3000+ by the end of the episode.
* If the walker falls, the episode ends.
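To make the two rules above concrete, here is a minimal toy sketch of how an episode score accumulates. It is not the PyBullet implementation: the function name `run_toy_episode` and its arguments are hypothetical, and the real environment's reward may include further terms.

```python
def run_toy_episode(forward_steps, fell_at=None):
    """Sum per-step forward progress until the episode ends or the walker falls.

    forward_steps: distance moved forward at each time step (toy values).
    fell_at: index of the step at which the walker falls, or None if it never falls.
    """
    total_reward = 0.0
    for t, dx in enumerate(forward_steps):
        if fell_at is not None and t == fell_at:
            break  # falling terminates the episode early
        total_reward += dx  # reward grows with distance walked forward
    return total_reward


print(run_toy_episode([1.0, 1.5, 2.0]))             # full episode
print(run_toy_episode([1.0, 1.5, 2.0], fell_at=1))  # fall cuts the episode short
```

An agent that falls early collects only a small score, which is why untrained models score so low.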

## Instructions

Open this notebook on your machine and run all cells one by one.


## Tensorboard

Here we set the path for tensorboard logs.  
[Tensorboard](https://www.tensorflow.org/tensorboard) is a tool for visualizing machine learning experiments.

In [2]:
tensorboard_dir = "tensorboard"

On this machine you can open tensorboard at the following address: https://hub.nextgrid.ai/proxy/6006.

## Training

First, we import everything we will need:

In [3]:
# Create virtual display to render on remote machine
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1, 1))
display.start()

import gym
import pybulletgym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

from notebooks.utils import record_and_show

ModuleNotFoundError: No module named 'pyvirtualdisplay'

In this example we will use the **Soft Actor-Critic (SAC)** algorithm.  \
**Full list** of algorithms available with stable-baselines: https://stable-baselines.readthedocs.io/en/master/guide/algos.html  
You can read more about SAC in **stable-baselines documentation:** https://stable-baselines.readthedocs.io/en/master/modules/sac.html

In [3]:
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

### Initialize environment

Before we start training we need to create an instance of the environment for our model. 

In [4]:
env_name = 'HopperPyBulletEnv-v0'
environment = DummyVecEnv([lambda: gym.make(env_name)])

WalkerBase::__init__




### Initialize model

Now we create an instance of the model.  
Note that this is the step where you can later change the parameters of the algorithm.  
The parameters we use come from [rl-baselines-zoo](https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/sac.yml), where you can also find them for other models and environments.

In [5]:
model = SAC(
    MlpPolicy, 
    environment,
    buffer_size=int(1e6),
    batch_size=256,
    learning_starts=1000,
    verbose=1,
    tensorboard_log=tensorboard_dir,
)
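As a rough intuition for `buffer_size`, `batch_size` and `learning_starts`, here is a toy sketch of an off-policy training loop. `ToyReplayBuffer` is a hypothetical class written only for this illustration; the real SAC implementation differs in its details.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Fixed-size store of past transitions (illustration only)."""

    def __init__(self, buffer_size):
        # A deque with maxlen drops the oldest transition once it is full.
        self.storage = deque(maxlen=buffer_size)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Draw a random mini-batch of stored transitions without replacement.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ToyReplayBuffer(buffer_size=5)
learning_starts, batch_size = 3, 2
n_updates = 0
batch = []

for step in range(8):
    buffer.add(("obs", "action", float(step)))  # placeholder transition
    if step >= learning_starts:
        # Gradient updates only begin after some experience is collected.
        batch = buffer.sample(batch_size)
        n_updates += 1

print(len(buffer), n_updates, len(batch))
```

This is also why the training logs show `n_updates` at 0 for the first several hundred steps of a `learn` call.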











### Record video

We use this function to record the behavior of our model in the environment.  
First, it records an episode and saves it into the `videos` folder, then opens it in the notebook cell.

In [6]:
record_and_show(env_name, model, name="hopper_random")

WalkerBase::__init__




Saving video to  /home/jupyter/notebooks/videos/hopper_random-step-0-to-step-500.mp4


### Calculate score

The function below is the standard stable-baselines function for **model evaluation**. \
The most important argument to keep track of is `n_eval_episodes`, \
which defines how many episodes are used to evaluate your model.

In [7]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])

WalkerBase::__init__
Mean reward is 25.931894
CPU times: user 185 ms, sys: 28.2 ms, total: 213 ms
Wall time: 186 ms
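The cell above prints the first element of the value returned by `evaluate_policy`, which is the mean reward over the evaluated episodes. In recent stable-baselines releases the second element is the standard deviation of the per-episode rewards; treat that as an assumption about your installed version. The sketch below shows the same kind of summary on made-up numbers, using a hypothetical helper `summarize_episode_rewards`.

```python
import statistics


def summarize_episode_rewards(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    mean_reward = statistics.mean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)
    return mean_reward, std_reward


# Made-up rewards for three evaluation episodes.
mean_reward, std_reward = summarize_episode_rewards([10.0, 20.0, 30.0])
print("Mean reward is", mean_reward)
print("Std of reward is", std_reward)
```

A larger `n_eval_episodes` makes the mean less noisy, at the cost of a longer evaluation.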




### Train

Now we train the model. Note that we use the **magic command ```%%time```** to measure how long this cell takes to run.

In [8]:
%%time
model.learn(total_timesteps=20000, log_interval=100)



----------------------------------------
| current_lr              | 0.0003     |
| ent_coef                | 0.9505118  |
| ent_coef_loss           | -0.2534377 |
| entropy                 | 3.850677   |
| episodes                | 100        |
| fps                     | 360        |
| mean 100 episode reward | 20.3       |
| n_updates               | 169        |
| policy_loss             | -4.9298334 |
| qf1_loss                | 0.87219423 |
| qf2_loss                | 0.785406   |
| time_elapsed            | 3          |
| total timesteps         | 1168       |
| value_loss              | 0.29586118 |
----------------------------------------
----------------------------------------
| current_lr              | 0.0003     |
| ent_coef                | 0.77941763 |
| ent_coef_loss           | -1.121196  |
| entropy                 | 3.478249   |
| episodes                | 200        |
| fps                     | 192        |
| mean 100 episode reward | 17.8       |
| n_updates   

<stable_baselines.sac.sac.SAC at 0x7f9cb0d2c2d0>

After the run we can record a video and see how our agent behaves in the simulation:

In [9]:
record_and_show(env_name, model, name="hopper_20k")

WalkerBase::__init__




Saving video to  /home/jupyter/notebooks/videos/hopper_20k-step-0-to-step-500.mp4


At this point your hopper should have learned not to fall, or even to jump.

In [10]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])

WalkerBase::__init__




Mean reward is 908.9998
CPU times: user 14.6 s, sys: 561 ms, total: 15.2 s
Wall time: 13.3 s


As you can see, the reward is getting higher.  
So, let's train it a little longer (this cell will take some time to finish):

In [11]:
%%time
model.learn(total_timesteps=80000, log_interval=50)

--------------------------------------
| current_lr              | 0.0003   |
| episodes                | 50       |
| fps                     | 867      |
| mean 100 episode reward | 20.3     |
| n_updates               | 0        |
| time_elapsed            | 0        |
| total timesteps         | 594      |
--------------------------------------
-----------------------------------------
| current_lr              | 0.0003      |
| ent_coef                | 0.021427257 |
| ent_coef_loss           | 0.030131638 |
| entropy                 | -1.2737393  |
| episodes                | 100         |
| fps                     | 107         |
| mean 100 episode reward | 149         |
| n_updates               | 15002       |
| policy_loss             | -61.292416  |
| qf1_loss                | 0.3338942   |
| qf2_loss                | 0.60347223  |
| time_elapsed            | 149         |
| total timesteps         | 16001       |
| value_loss              | 0.14256272  |
-------------------

<stable_baselines.sac.sac.SAC at 0x7f9cb0d2c2d0>

Let's take a look at what we got:

In [12]:
record_and_show(env_name, model, name="hopper_100k")

WalkerBase::__init__




Saving video to  /home/jupyter/notebooks/videos/hopper_100k-step-0-to-step-500.mp4


In [13]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])

WalkerBase::__init__




Mean reward is 903.7381
CPU times: user 14.4 s, sys: 603 ms, total: 15 s
Wall time: 13.1 s


The results will not be very exciting; however, the agent should make some progress.  
Now, let's run it for 400k steps and see how well our model can learn. \
If you started tensorboard at the beginning of the notebook, you can monitor the training process there.  
(**Warning**: this cell could take some time to execute due to the high number of steps and updates)

In [14]:
%%time
model.learn(total_timesteps=400000, log_interval=30)

--------------------------------------
| current_lr              | 0.0003   |
| episodes                | 30       |
| fps                     | 825      |
| mean 100 episode reward | 19.7     |
| n_updates               | 0        |
| time_elapsed            | 0        |
| total timesteps         | 333      |
--------------------------------------
--------------------------------------
| current_lr              | 0.0003   |
| episodes                | 60       |
| fps                     | 838      |
| mean 100 episode reward | 19.9     |
| n_updates               | 0        |
| time_elapsed            | 0        |
| total timesteps         | 642      |
--------------------------------------
--------------------------------------
| current_lr              | 0.0003   |
| episodes                | 90       |
| fps                     | 835      |
| mean 100 episode reward | 19.7     |
| n_updates               | 0        |
| time_elapsed            | 1        |
| total timesteps        

<stable_baselines.sac.sac.SAC at 0x7f9cb0d2c2d0>
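The three `learn` calls in this notebook train the same model object, so the steps add up. A small sketch of that bookkeeping follows; the list simply restates the `total_timesteps` values used in the cells above. Each call seems to restart its own step counter in the logs, which would explain why `total timesteps` begins near zero again in each output.

```python
# total_timesteps passed to the three model.learn(...) calls in this notebook
learn_calls = [20_000, 80_000, 400_000]

cumulative = []
running_total = 0
for steps in learn_calls:
    running_total += steps
    cumulative.append(running_total)

print(cumulative)  # [20000, 100000, 500000]
```

These totals match the video names used in the notebook: `hopper_20k`, `hopper_100k` and `hopper_500k`.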

Now let's see what our model learned after performing 500k steps in the simulation:

In [15]:
record_and_show(env_name, model, name="hopper_500k")

WalkerBase::__init__




Saving video to  /home/jupyter/notebooks/videos/hopper_500k-step-0-to-step-500.mp4


We can also check the score:

In [16]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=25, deterministic=True)
print("Mean reward is", new_evaluation[0])

WalkerBase::__init__




Mean reward is 950.9963
CPU times: user 34.9 s, sys: 1.46 s, total: 36.3 s
Wall time: 31.7 s


At this point you should expect the agent to start moving forward.  
However, the score will not be perfect, and the model requires further training.

### Saving the model

If you are happy with this model, you can save it to a file and load it later:

In [17]:
model.save("hopper_sac_500k")
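Loading the saved model later could look like the sketch below. It relies on the `SAC.load` class method from stable-baselines; the helper name `load_saved_sac` is hypothetical, and the import is done inside the function so the helper can be defined even where stable-baselines is not installed.

```python
def load_saved_sac(path="hopper_sac_500k"):
    """Load a SAC model previously written with model.save(path)."""
    # Lazy import: stable-baselines is only needed when the function is called.
    from stable_baselines import SAC

    return SAC.load(path)


# Usage inside this notebook (not executed here):
# loaded_model = load_saved_sac("hopper_sac_500k")
# record_and_show(env_name, loaded_model, name="hopper_loaded")
```

The loaded model can then be passed to `evaluate_policy` or `record_and_show` in the same way as the model trained above.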

So, at this point we expect to have a successfully trained policy that can steer the hopper.  
But it is unlikely to be perfect, and you would need to train it more :) \
Here is some advice on what you can try next:
1. Continue the training and see how the agent improves.
2. Train this algorithm in another [environment](https://gym.openai.com/); to do that, change `env_name` in the cell under "Initialize environment". SAC only supports continuous action spaces, so start with simple continuous-control tasks such as `Pendulum` or `LunarLanderContinuous`; discrete-action environments like `CartPole` or `LunarLander` also require switching to a different algorithm.
3. Train the agent with another algorithm (different algorithms perform best on different problems).
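When you swap the environment or the algorithm, the type of action space matters: some algorithms only handle continuous actions, others only discrete ones. The table below reflects our understanding of stable-baselines and is an assumption to verify against the algorithms page linked earlier; `algorithms_for` is a hypothetical helper.

```python
# Action-space support as we understand it for stable-baselines;
# check the official algorithms page before relying on this table.
SUPPORTED_ACTION_SPACES = {
    "SAC": {"continuous"},
    "TD3": {"continuous"},
    "DDPG": {"continuous"},
    "DQN": {"discrete"},
    "PPO2": {"continuous", "discrete"},
    "A2C": {"continuous", "discrete"},
}


def algorithms_for(action_space_kind):
    """Return the algorithm names that accept the given kind of action space."""
    return sorted(
        name
        for name, kinds in SUPPORTED_ACTION_SPACES.items()
        if action_space_kind in kinds
    )


print(algorithms_for("continuous"))  # ['A2C', 'DDPG', 'PPO2', 'SAC', 'TD3']
print(algorithms_for("discrete"))    # ['A2C', 'DQN', 'PPO2']
```

For the hopper, which has continuous joint torques as actions, any entry in the first list is a candidate to try.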