# IBM Think 2020 Code Cafe - BipedalWalker-v3 with OpenAI Gym

![https://nextgrid.ai/wp-content/uploads/2020/05/ibmthink.png](https://nextgrid.ai/wp-content/uploads/2020/05/ibmthink.png)


# BipedalWalker-v3 with stable-baselines

In this notebook, you will use a stable-baselines RL algorithm to train a 2D robot to walk.  
The goal is to cross the field as fast as possible without falling.  
The task is considered solved when your robot consistently achieves a score of 300.

![https://data.dllglobal.com/wp-content/uploads/2020/04/bipedalkwalker-v3-1024x681.png](https://data.dllglobal.com/wp-content/uploads/2020/04/bipedalkwalker-v3-1024x681.png)



## Bipedal walker:
Your robot will solve the [BipedalWalker](https://gym.openai.com/envs/BipedalWalker-v2/) environment.

The reward signal used to train your model is the following:
* The agent gets a positive reward for distance walked forward; a successful agent can collect 300+ by the time it reaches the end.
* If the walker falls, the episode ends and it receives a reward of -100.
* There is a negative reward proportional to the torque applied at the joints, which encourages the agent to walk smoothly with minimal torque.
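
To make the three reward terms concrete, here is a toy sketch of how they might combine into a per-step reward. The function name and the two coefficients are hypothetical placeholders chosen for illustration, not the values used by the actual environment; only the -100 fall penalty comes from the description above.

```python
def step_reward(forward_progress, joint_torques, fell,
                progress_scale=1.0, torque_cost=0.01):
    """Toy per-step reward mirroring the three terms described above.

    progress_scale and torque_cost are hypothetical illustration values,
    not the coefficients used by the real BipedalWalker environment.
    """
    if fell:
        # Falling ends the episode with a fixed penalty.
        return -100.0
    # Reward for the distance covered during this step.
    reward = progress_scale * forward_progress
    # Penalty that grows with the torque applied at each joint.
    reward -= torque_cost * sum(abs(t) for t in joint_torques)
    return reward


# Same forward progress, but the second gait uses ten times more torque.
gentle = step_reward(0.5, [0.1, 0.1, 0.1, 0.1], fell=False)
jerky = step_reward(0.5, [1.0, 1.0, 1.0, 1.0], fell=False)
print(gentle > jerky)
print(step_reward(0.5, [0.0] * 4, fell=True))
```

With equal progress, the gentler gait scores higher, which is the pressure that pushes the agent toward smooth walking.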

You can read more about the action and observation spaces on the [OpenAI Gym wiki](https://github.com/openai/gym/wiki/BipedalWalker-v2).

## Instructions

Open this notebook on your machine and run all cells one by one.


## Tensorboard

Here we set the path for the TensorBoard logs.  
[TensorBoard](https://www.tensorflow.org/tensorboard) is a tool for visualizing machine learning experiments.

tensorboard_dir = "tensorboard"

#### Access tensorboard

Open TensorBoard at https://ibm-test.0x04.net/proxy/6006/.

## Training

First, we import everything we will need:

In [19]:
# Create virtual display to render on remote machine
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1, 1))
display.start()

import gym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

from notebooks.utils import record_and_show

In this example we will use the **Proximal Policy Optimization (PPO)** algorithm, which stable-baselines provides as `PPO2`.  
The **full list** of algorithms available in stable-baselines: https://stable-baselines.readthedocs.io/en/master/guide/algos.html  
You can read more in the **PPO2 documentation:** https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html

In [20]:
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2
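
For intuition about what PPO optimizes, the sketch below computes the clipped surrogate objective for a single sample in plain Python. It is a simplified illustration, not the stable-baselines implementation; the function name is hypothetical, and the default `clip_range=0.2` is assumed to match the default of PPO2's `cliprange` parameter.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective (to be maximized).

    ratio: new_policy_prob / old_policy_prob for the action taken.
    advantage: estimate of how much better the action was than average.
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes the incentive to move the policy
    # far away from the one that collected the data.
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # the gain is capped by the clipped ratio
print(clipped_surrogate(1.5, -1.0))  # the penalty is not capped
```

The asymmetry is deliberate: the objective is pessimistic, so improvements are bounded while mistakes are still fully penalized.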

### Initialize environment

Before we start training we need to create an instance of the environment for our model. 

In [21]:
env_name = 'BipedalWalker-v3'
environment = DummyVecEnv([lambda: gym.make(env_name)])
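
`DummyVecEnv` takes a list of zero-argument callables rather than ready-made environments, which is why the cell above wraps `gym.make` in a `lambda`. The sketch below illustrates this factory pattern in plain Python; `make_env_factory` and `fake_make` are hypothetical names, and `fake_make` merely stands in for `gym.make` so that the snippet runs without Gym installed.

```python
def fake_make(env_name):
    """Hypothetical stand-in for gym.make: returns a dict, not a real env."""
    return {"name": env_name}


def make_env_factory(env_name):
    """Return a zero-argument callable that builds a fresh environment."""
    def _init():
        return fake_make(env_name)
    return _init


# One factory per environment; each name is bound when its factory is created.
env_fns = [make_env_factory(name) for name in ("BipedalWalker-v3", "CartPole-v1")]
envs = [fn() for fn in env_fns]
print([env["name"] for env in envs])
```

A named factory also avoids the late-binding surprise of writing `lambda: gym.make(name)` inside a loop, where every lambda would end up seeing the last value of `name`.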



### Initialize model

Now we create an instance of the model.  
This is the step where you can later change the parameters of the algorithm.  
The parameters used here come from [rl-baselines-zoo](https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/ppo2.yml), which also lists parameters for other algorithms and environments.

In [22]:
%%capture
model = PPO2(
    MlpPolicy, 
    environment, 
    verbose=0,
    gamma=0.99, 
    n_steps=2048,  
    ent_coef=0.001, 
    noptepochs=10, 
    learning_rate=2.5e-4, 
    nminibatches=32,
    tensorboard_log=tensorboard_dir,
)
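
As a rough guide to what these numbers imply, the sketch below works out the batch sizes for one PPO2 update. It assumes a single environment (as created with `DummyVecEnv` above) and reflects our understanding of how PPO2 splits a rollout into minibatches; the helper name is hypothetical.

```python
def ppo2_batch_arithmetic(n_envs, n_steps, nminibatches, noptepochs):
    """Derive per-update sizes from the PPO2 hyperparameters."""
    # Environment steps collected before each policy update.
    n_batch = n_envs * n_steps
    # Samples in each minibatch used for one gradient step.
    minibatch_size = n_batch // nminibatches
    # Gradient steps per update: every epoch visits every minibatch once.
    gradient_steps = noptepochs * nminibatches
    return n_batch, minibatch_size, gradient_steps


n_batch, minibatch_size, gradient_steps = ppo2_batch_arithmetic(
    n_envs=1, n_steps=2048, nminibatches=32, noptepochs=10
)
print(n_batch, minibatch_size, gradient_steps)
```

Under these assumptions each update collects 2048 steps and performs 320 gradient steps on minibatches of 64 samples.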

### Record video

We use this function to record the behavior of our model in the environment.  
It first records an episode and saves it to the `videos` folder, then opens it in the notebook cell.

In [23]:
record_and_show(env_name, model, name="biwalker_random")

Saving video to  /home/jupyter/notebooks/videos/biwalker_random-step-0-to-step-500.mp4


### Calculate score

The function below is the standard stable-baselines function for **model evaluation**. \
The most important argument to keep track of is `n_eval_episodes`, \
which defines how many episodes the model is evaluated over.

In [24]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])



Mean reward is -92.05519
CPU times: user 2.18 s, sys: 162 ms, total: 2.35 s
Wall time: 1.79 s
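
The first element of the tuple returned by `evaluate_policy` is the mean episode reward, which is why the cell above prints `new_evaluation[0]`. The sketch below shows the same kind of summary in plain Python, together with a check against the 300-point target mentioned at the start of the notebook. The helper name is hypothetical and the sample rewards are made up for illustration.

```python
from statistics import mean, pstdev


def summarize_rewards(episode_rewards, target=300.0):
    """Return (mean, standard deviation, reached_target) for a list of returns."""
    avg = mean(episode_rewards)
    # Population standard deviation: how much episodes differ from each other.
    spread = pstdev(episode_rewards)
    return avg, spread, avg >= target


sample_rewards = [-92.0, -110.5, -87.3, -95.1]  # made-up values, not real results
avg, spread, solved = summarize_rewards(sample_rewards)
print(f"mean={avg:.2f} std={spread:.2f} solved={solved}")
```

Tracking the spread alongside the mean helps to tell a consistently mediocre policy from one that sometimes walks and sometimes falls.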


### Train

Now we train the model. Note that we use the **magic command ```%%time```** to measure how long this cell takes to run.

In [34]:
%%time
model.learn(total_timesteps=20000, log_interval=1)

CPU times: user 50.3 s, sys: 4.07 s, total: 54.3 s
Wall time: 36.5 s


<stable_baselines.ppo2.ppo2.PPO2 at 0x7f98bc153c50>

After the run we can record video and see how our agent behaves in the simulation:

In [26]:
record_and_show(env_name, model, name="biwalker_20k")

Saving video to  /home/jupyter/notebooks/videos/biwalker_20k-step-0-to-step-500.mp4


At this point your walker will most likely still behave erratically :)

In [27]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])



Mean reward is -53.53816
CPU times: user 23 s, sys: 1.52 s, total: 24.5 s
Wall time: 18.7 s


As you can see, the reward is not very impressive either.  
So, let's train it a little longer (this cell will take a while to finish):

In [28]:
%%time
model.learn(total_timesteps=80000, log_interval=1)

CPU times: user 3min 22s, sys: 16.8 s, total: 3min 38s
Wall time: 2min 26s


<stable_baselines.ppo2.ppo2.PPO2 at 0x7f98bc153c50>

Let's take a look at what we got:

In [29]:
record_and_show(env_name, model, name="biwalker_100k")

Saving video to  /home/jupyter/notebooks/videos/biwalker_100k-step-0-to-step-500.mp4


In [30]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print("Mean reward is", new_evaluation[0])



Mean reward is -95.27967
CPU times: user 2.66 s, sys: 175 ms, total: 2.84 s
Wall time: 2.16 s


The results will not be very impressive; however, the agent should make some progress.  
Now, let's run it for another 400k steps and see how well our model can learn. \
If you started TensorBoard at the beginning of the notebook, you can monitor the training process there.  
(**Warning**: this cell can take a long time to execute because of the high number of steps and updates.)

In [None]:
%%time
model.learn(total_timesteps=400000, log_interval=1)

Now let's see what our model has learned after a total of 500k steps in the simulation:

In [None]:
record_and_show(env_name, model, name="biwalker_500k")
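
The 500k figure is the running total of the three `learn` calls in this notebook (20k + 80k + 400k). The sketch below tallies them and, assuming PPO2 only performs whole rollouts of `n_steps` steps per environment, estimates how many environment steps each call actually collects. That rounding behaviour is our assumption about the library, and the helper name is hypothetical.

```python
def effective_timesteps(total_timesteps, n_steps=2048, n_envs=1):
    """Steps actually collected if only whole rollouts are performed."""
    n_batch = n_steps * n_envs
    n_updates = total_timesteps // n_batch
    return n_updates * n_batch


requested = [20000, 80000, 400000]
collected = [effective_timesteps(t) for t in requested]
print("requested total:", sum(requested))
print("collected per call:", collected)
print("collected total:", sum(collected))
```

Under this assumption the totals differ by well under one percent, so labels such as `biwalker_500k` remain a fair approximation.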

We can also check the score:

In [None]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=25, deterministic=True)
print("Mean reward is", new_evaluation[0])

At this point you should expect the model to have started walking, at least after a fashion.  
However, the score will still be pretty low because of the high penalty for falling.

### Saving the model

If you are happy with this model, you can save it to a file and load it later:

In [17]:
model.save("biwalker_500k")

At this point we expect to have a trained policy that can steer the bipedal walker.  
But it is unlikely to be perfect, and you will need to train it more :) \
Here are some suggestions for what to try next:
1. Continue the training and see how the agent improves.
2. Train this algorithm in another [environment](https://gym.openai.com/) (start with simple ones like `CartPole` or `LunarLander`). To do that, change `env_name` in the 5th cell.
3. Train the agent with another algorithm (different algorithms perform best on different problems).