diff --git a/examples/mujoco/reproduction/ppo/README.md b/examples/mujoco/reproduction/ppo/README.md
index 065f8b204..98a99666d 100644
--- a/examples/mujoco/reproduction/ppo/README.md
+++ b/examples/mujoco/reproduction/ppo/README.md
@@ -28,6 +28,7 @@ To view the full list of options, either view the code or run the example with t
 ## Known differences
 
 - While the original paper initialized weights by normal distribution (https://github.com/Breakend/baselines/blob/50ffe01d254221db75cdb5c2ba0ab51a6da06b0a/baselines/ppo1/mlp_policy.py#L28), we use orthogonal initialization as the latest openai/baselines does (https://github.com/openai/baselines/blob/9b68103b737ac46bc201dfb3121cfa5df2127e53/baselines/a2c/utils.py#L61).
+- We used version v2 of the environments, whereas the original results were reported for version v1; however, this doesn't seem to introduce significant differences: https://github.com/openai/gym/pull/834
 
 ## Results
 
@@ -41,12 +42,12 @@ ChainerRL scores are based on 20 trials using different random seeds, using the
 python train_ppo.py --gpu -1 --seed [0-19] --env [env]
 ```
 
-| Environment | ChainerRL Score | Reported Score |
-| -------------- |:---------------:|:--------------:|
-| HalfCheetah-v2 | **2404**+/-185 | 2201+/-323 |
-| Hopper-v2 | 2719+/-67 | **2790**+/-62 |
-| Walker2d-v2 | 2994+/-113 | N/A |
-| Swimmer-v2 | 111+/-4 | N/A |
+| Environment | ChainerRL Score | Reported Score |
+| ----------- |:---------------:|:--------------:|
+| HalfCheetah | **2404**+/-185 | 2201+/-323 |
+| Hopper | 2719+/-67 | **2790**+/-62 |
+| Walker2d | 2994+/-113 | N/A |
+| Swimmer | 111+/-4 | N/A |
 
 ### Training times
 
diff --git a/examples/mujoco/reproduction/trpo/README.md b/examples/mujoco/reproduction/trpo/README.md
index 365682c6e..e398988b6 100644
--- a/examples/mujoco/reproduction/trpo/README.md
+++ b/examples/mujoco/reproduction/trpo/README.md
@@ -25,6 +25,10 @@ python train_trpo.py [options]
 
 To view the full list of options, either view the code or run the example with the `--help` option.
 
+## Known differences
+
+- We used version v2 of the environments, whereas the original results were reported for version v1; however, this doesn't seem to introduce significant differences: https://github.com/openai/gym/pull/834
+
 ## Results
 
 These scores are evaluated by average return +/- standard error of 100 evaluation episodes after 2M training steps.
@@ -32,12 +36,12 @@ These scores are evaluated by average return +/- standard error of 100 evaluatio
 Reported scores are taken from the row Table 1 of [Deep Reinforcement Learning that Matters](https://arxiv.org/abs/1709.06560).
 Here we try to reproduce TRPO (Schulman et al. 2017) of the (64, 64) column, which corresponds to the default settings.
 
-| Environment | ChainerRL Score | Reported Score |
-| -------------- |:---------------:|:--------------:|
-| HalfCheetah-v2 | **1474**+/-112 | 205+/-256 |
-| Hopper-v2 | **3056**+/-44 | 2828+/-70 |
-| Walker2d-v2 | 3073+/-59 | N/A |
-| Swimmer-v2 | 200+/-25 | N/A |
+| Environment | ChainerRL Score | Reported Score |
+| ----------- |:---------------:|:--------------:|
+| HalfCheetah | **1474**+/-112 | 205+/-256 |
+| Hopper | **3056**+/-44 | 2828+/-70 |
+| Walker2d | 3073+/-59 | N/A |
+| Swimmer | 200+/-25 | N/A |
 
 ### Learning Curves
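
Note (illustrative, not part of the diff): the "Known differences" bullet in the PPO README contrasts the original paper's normal-distribution weight initialization with the orthogonal initialization used by the latest openai/baselines. The sketch below is a minimal, hypothetical Chainer example of what that swap looks like for a (64, 64) MLP; the `PolicyMLP` name, layer sizes, and `scale` values are assumptions for illustration and are not taken from the actual `train_ppo.py`/`train_trpo.py` code.

```python
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import initializers


class PolicyMLP(chainer.Chain):
    """Hypothetical (64, 64) MLP illustrating the initialization difference."""

    def __init__(self, obs_size, action_size):
        super().__init__()
        # Orthogonal initialization, as the latest openai/baselines does.
        winit = initializers.Orthogonal(scale=1.0)
        # The original paper's code instead drew weights from a normal
        # distribution, e.g. winit = initializers.Normal(scale=0.01).
        with self.init_scope():
            self.l1 = L.Linear(obs_size, 64, initialW=winit)
            self.l2 = L.Linear(64, 64, initialW=winit)
            self.l3 = L.Linear(64, action_size, initialW=winit)

    def forward(self, x):
        h = F.tanh(self.l1(x))
        h = F.tanh(self.l2(h))
        return self.l3(h)
```

The only change between the two schemes is the `initialW` argument passed to each `L.Linear`; the READMEs note this as a known difference because it can shift reproduced scores relative to the originally reported ones.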