# PPO2 on Solo8 v2 Vanilla w/ Fixed Timestamp
Only use the time-based stopping criterion. This is a rudimentary test more than anything.

## Ensure that TensorFlow is using the GPU

In [1]:
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

Default GPU Device: /device:GPU:0


## Define Experiment Tags

In [2]:
TAGS = ['solov2vanilla', 'gpu', 'standing_task']

## Import required libraries

In [3]:
from gym_solo.envs import solo8v2vanilla
from gym_solo.core import obs
from gym_solo.core import rewards
from gym_solo.core import termination as terms

import gym
import gym_solo



## Parse CLI arguments and register w/ wandb

This experiment uses the auto trainer to manage all of the hyperparameters.

In [4]:
from auto_trainer import params
import auto_trainer

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Give the robot a total of 10 seconds of simulation time per episode to learn how to stand.

In [5]:
episode_length = 10 / solo8v2vanilla.Solo8VanillaConfig.dt
episode_length

10000.0
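
The arithmetic behind this value can be sketched as follows. The timestep of 0.001 s is an assumption inferred from the output above (10 s / dt = 10000), not a value read from `Solo8VanillaConfig`, and `ASSUMED_DT` is a name introduced only for illustration. Converting to `int` is an optional extra for code that expects a whole number of steps.

```python
# Assumed simulation timestep in seconds, inferred from the 10000.0 output
# above; the real value lives in Solo8VanillaConfig.dt.
ASSUMED_DT = 0.001
SIM_SECONDS = 10

# Plain division yields a float, as in the notebook cell.
episode_length = SIM_SECONDS / ASSUMED_DT

# round() before int() guards against floating-point error in the division.
episode_steps = int(round(episode_length))
print(episode_length, episode_steps)
```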

Create a basic config

In [6]:
config = params.WandbParameters().parse()

config.episodes = 800
config.episode_length = episode_length

config.num_workers = 6
config.eval_frequency = 10
config.eval_episodes = 3
config.fps = 15

# Render every Nth step so the evaluation gif lasts about 3 seconds at config.fps
config.eval_render_freq = int(config.episode_length / (3 * config.fps))

config

Namespace(algorithm='PPO2', episode_length=10000.0, episodes=800, eval_episodes=3, eval_frequency=10, eval_render_freq=222, fps=15, num_workers=6, policy='MlpPolicy')
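
As a rough check of the 3-second target, the sketch below reproduces the `eval_render_freq` arithmetic with the values printed above. It assumes a frame is rendered at step 0 and every `eval_render_freq` steps after that; this is our assumption and may not match the auto trainer's behavior exactly.

```python
episode_length = 10000.0  # steps per episode, from the config above
fps = 15                  # playback rate of the gif
target_seconds = 3        # desired gif duration

# Same formula as the notebook cell: render one frame every N steps.
eval_render_freq = int(episode_length / (target_seconds * fps))

# Assumption: frames are captured at steps 0, N, 2N, ... within the episode.
frames = len(range(0, int(episode_length), eval_render_freq))
gif_seconds = frames / fps
print(eval_render_freq, frames, round(gif_seconds, 2))
```

Because the render frequency is truncated to an integer, the resulting gif runs slightly longer than the target.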

In [7]:
config, run = auto_trainer.get_synced_config(config, TAGS)
config

[34m[1mwandb[0m: Currently logged in as: [33magupta231[0m (use `wandb login --relogin` to force relogin)


{'episodes': 800, 'episode_length': 10000.0, 'policy': 'MlpPolicy', 'algorithm': 'PPO2', 'num_workers': 6, 'eval_episodes': 3, 'eval_frequency': 10, 'eval_render_freq': 222, 'fps': 15}

Add the following inputs to the robot / environment:

**Observations**
- TorsoIMU
- Motor encoder current values

**Reward**
- How upright the TorsoIMU is. Valued in $[-1, 1]$

**Termination Criteria**
- Terminate after $n$ timesteps
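
To make the termination criterion concrete, here is a minimal sketch of a step-counting check. `StepCountTermination` and its methods are hypothetical names used for illustration; they are not the `gym_solo` implementation of `TimeBasedTermination`, whose internals may differ.

```python
class StepCountTermination:
    """Hypothetical stand-in: signals termination on the n-th step."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Called at the start of every episode.
        self.step_count = 0

    def is_terminated(self):
        # Assumed to be called exactly once per environment step.
        self.step_count += 1
        return self.step_count >= self.max_steps


term = StepCountTermination(max_steps=3)
results = [term.is_terminated() for _ in range(3)]
print(results)
```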

Note that the auto trainer requires the training environment to be a `VecEnv` (for multiprocessing) and the testing environment to be a standard `gym.Env`.

We find that the easiest way to handle this is to write an environment generator in the style of the Stable Baselines `VecEnv` examples (one can be found [here](https://stable-baselines.readthedocs.io/en/master/guide/examples.html#multiprocessing-unleashing-the-power-of-vectorized-environments)) and use it to build both the training and testing environments.

We also like to link our generator to our W&B config so that we can change the environments dynamically from the web interface.

A full example can be seen below:

In [8]:
def make_env(length):
    def _init():
        env_config = solo8v2vanilla.Solo8VanillaConfig()
        env = gym.make('solo8vanilla-v0', config=env_config)

        env.obs_factory.register_observation(obs.TorsoIMU(env.robot))
        env.obs_factory.register_observation(obs.MotorEncoder(env.robot))

        env.reward_factory.register_reward(1, rewards.UprightReward(env.robot))
        env.termination_factory.register_termination(terms.TimeBasedTermination(length))
        return env
    return _init
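
`make_env` returns a function instead of an environment because `SubprocVecEnv` takes a list of callables and invokes each one inside its own worker process, so every worker builds its own environment. The pattern can be illustrated without any robotics dependencies. `DummyEnv` and `make_dummy_env` below are hypothetical placeholders, not part of `gym_solo`.

```python
class DummyEnv:
    """Hypothetical placeholder for the real Solo8 environment."""

    def __init__(self, length):
        self.length = length


def make_dummy_env(length):
    # Return a zero-argument factory rather than an environment instance.
    def _init():
        return DummyEnv(length)
    return _init


num_workers = 6
env_fns = [make_dummy_env(10000) for _ in range(num_workers)]

# A vectorized env would call each factory in a separate worker process;
# here they are called directly to show that each yields a distinct object.
envs = [fn() for fn in env_fns]

# Calling the factory once gives a plain, non-vectorized environment.
test_env = make_dummy_env(10000)()
print(len({id(e) for e in envs}), test_env.length)
```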

### Create the Envs
Import the desired vectorized env

In [9]:
from stable_baselines.common.vec_env import SubprocVecEnv

Create training & testing environments

In [10]:
train_env = SubprocVecEnv([make_env(config.episode_length) 
                           for _ in range(config.num_workers)])
test_env = make_env(config.episode_length)()



## Learning
And we're off!

In [None]:
model, config, run = auto_trainer.train(train_env, test_env, config, TAGS, 
                                        log_freq=100, full_logging=False, run=run)









-------------------------------------
| approxkl           | 0.004525923  |
| clipfrac           | 0.056640625  |
| explained_variance | 0.546        |
| fps                | 869          |
| n_updates          | 1            |
| policy_entropy     | 17.03208     |
| policy_loss        | -0.013156263 |
| serial_timesteps   | 128          |
| time_elapsed       | 0.000273     |
| total_timesteps    | 768          |
| value_loss         | 0.036706142  |
-------------------------------------
