<a href="https://colab.research.google.com/github/AI4Finance-LLC/ElegantRL/blob/master/BipedalWalker_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BipedalWalker-v3 Example in ElegantRL**






# **Part 1: Testing Task Description**

[BipedalWalker-v3](https://gym.openai.com/envs/BipedalWalker-v2/) is a classic robotics task because it exercises one of the most fundamental skills: locomotion. The goal is to make a 2D bipedal walker cross rough terrain. BipedalWalker is a difficult task with a continuous action space, and only a few RL implementations reach the target reward.

In [1]:
from IPython.display import HTML
HTML(f"""<video src={"https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/BipedalWalker-v2/original.mp4"} width=500 controls/>""") # the random demonstration of the task from OpenAI Gym

# **Part 2: Install ElegantRL**

In [2]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

# **Part 3: Import Packages**


*   **elegantrl**
*   **OpenAI Gym**: a toolkit for developing and comparing reinforcement learning algorithms.
*   **PyBullet Gym**: an open-source implementation of the OpenAI Gym MuJoCo environments.


In [2]:
from elegantrl.run import *
from elegantrl.agent import AgentPPO
from elegantrl.env import PreprocessEnv
import gym
gym.logger.set_level(40)  # level 40 = ERROR, so warnings are suppressed

# **Part 4: Specify Agent and Environment**

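*   **args.agent**: first choose a DRL algorithm to use; any agent from agent.py can be selected.
*   **args.env**: create and preprocess the environment; users can either customize their own environment or preprocess environments from OpenAI Gym and PyBullet Gym using env.py.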


> Before finishing the initialization of **args**, see Arguments() in run.py for more details about the adjustable hyperparameters.




In [4]:
args = Arguments(if_on_policy=True)  # PPO is an on-policy algorithm
args.agent = AgentPPO()  # off-policy alternatives (with if_on_policy=False): AgentSAC(), AgentTD3(), AgentDDPG()
args.env = PreprocessEnv(env=gym.make('BipedalWalker-v3'))
args.reward_scale = 2 ** -1  # RewardRange: -200 < -150 < 300 < 334
args.gamma = 0.95
args.rollout_num = 2  # the number of rollout workers (larger is not always faster)


| env_name:  BipedalWalker-v3, action space if_discrete: False
| state_dim:   24, action_dim: 4, action_max: 1.0
| max_step:  1600, target_reward: 300
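
To get a feel for the two numeric settings above, `reward_scale` and `gamma`, the sketch below computes the effective horizon implied by the discount factor and the discounted return of a scaled reward sequence. It is a plain-Python illustration and not part of ElegantRL: `discounted_return` is a hypothetical helper and the sample rewards are made up.

```python
def discounted_return(rewards, gamma, reward_scale=1.0):
    """Return sum_t gamma**t * (reward_scale * r_t) for one episode."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = reward_scale * r + gamma * total
    return total


gamma = 0.95
reward_scale = 2 ** -1

# Rule of thumb: rewards more than about 1 / (1 - gamma) steps away contribute little.
effective_horizon = 1.0 / (1.0 - gamma)

sample_rewards = [1.0, 1.0, 1.0]  # made-up rewards, for illustration only
scaled_return = discounted_return(sample_rewards, gamma, reward_scale)
print(round(effective_horizon), scaled_return)
```

Under this rule of thumb, `gamma = 0.95` corresponds to a horizon of roughly 20 steps, which is much shorter than the 1600-step episode limit printed above.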


# **Part 5: Train and Evaluate the Agent**

> Training and evaluation both happen inside the function **train_and_evaluate_mp()**, whose only parameter is **args**. It includes the fundamental objects in DRL:

*   agent,
*   environment.

> It also includes the parameters for training control:

*   batch_size,
*   target_step,
*   reward_scale,
*   gamma, etc.

> and the parameters for evaluation control:

*   break_step,
*   random_seed, etc.
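
How these control parameters interact can be sketched as a stopping rule. The function below is a simplified illustration, assuming training ends once the evaluated reward reaches `target_reward` or the step count exceeds `break_step`. `should_stop` is a hypothetical helper rather than ElegantRL's API, the `break_step` value is made up, and the actual logic in run.py may differ.

```python
def should_stop(total_step, max_reward, target_reward, break_step):
    """Hypothetical stopping rule: stop on success or when the step budget is spent."""
    reached_goal = max_reward >= target_reward
    out_of_budget = total_step > break_step
    return reached_goal or out_of_budget


# Illustrative values; 300 is the target_reward printed for BipedalWalker-v3.
early = should_stop(total_step=10_000, max_reward=-92.10, target_reward=300, break_step=1_000_000)
solved = should_stop(total_step=500_000, max_reward=305.0, target_reward=300, break_step=1_000_000)
timed_out = should_stop(total_step=1_000_001, max_reward=120.0, target_reward=300, break_step=1_000_000)
print(early, solved, timed_out)
```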






In [None]:
train_and_evaluate_mp(args) # the training process will terminate once it reaches the target reward.

| multiprocessing, act_workers: 2
| multiprocessing, None:
| GPU id: 0, cwd: ./AgentPPO/BipedalWalker-v3_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00    -92.10 |


Understanding the above results:
*   **Step**: the total number of training steps.
*   **MaxR**: the maximum reward.
*   **avgR**: the average of the rewards.
*   **stdR**: the standard deviation of the rewards.
*   **objA**: the objective function value of the Actor Network (Policy Network).
*   **objC**: the objective function value (Q-value) of the Critic Network (Value Network).
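
As a rough illustration of how the reward columns relate, the sketch below derives `avgR`, `stdR`, and a running `MaxR` from a few evaluation episodes. It assumes `avgR` and `stdR` are the mean and population standard deviation of the episode returns in one evaluation round, and that `MaxR` tracks the best `avgR` seen so far. The episode returns are made up, `summarize_round` is a hypothetical helper, and ElegantRL's evaluator may compute these values differently.

```python
import statistics


def summarize_round(episode_returns, best_so_far):
    """Return (max_r, avg_r, std_r) for one evaluation round."""
    avg_r = statistics.mean(episode_returns)
    std_r = statistics.pstdev(episode_returns)  # population standard deviation
    max_r = max(best_so_far, avg_r)  # assumed: MaxR is the best average seen so far
    return max_r, avg_r, std_r


best = float("-inf")
rounds = [[-100.0, -90.0, -80.0], [-120.0, -100.0, -110.0]]  # made-up episode returns
for returns in rounds:
    best, avg_r, std_r = summarize_round(returns, best)
    print(f"MaxR {best:8.2f} | avgR {avg_r:8.2f}  stdR {std_r:6.2f}")
```

In this toy run the second round scores worse than the first, so `avgR` drops while `MaxR` keeps the earlier, better value.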