# PPO2 on Solo8 v2 Vanilla w/ Fixed Timestamp
Only the time-based stopping criterion is used. This is a rudimentary test more than anything.

## Ensure that Tensorflow is using the GPU

In [1]:
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

Default GPU Device: /device:GPU:0


## Define Experiment Tags

In [2]:
TAGS = ['solov2vanilla', 'gpu', 'home_pos_task']

## Get Solo Environment Configuration

Import the relevant libraries + rewards & observations

In [3]:
from gym_solo.envs import solo8v2vanilla
from gym_solo.core import obs
from gym_solo.core import rewards
from gym_solo.core import termination as terms

import gym
import gym_solo

Create the config for the environment

In [4]:
env_config = solo8v2vanilla.Solo8VanillaConfig()

## Parse CLI arguments and register w/ wandb

This experiment uses the auto trainer to handle all of the hyperparameters for the run

In [5]:
from auto_trainer import params
import auto_trainer

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Create a basic config. Each episode gives the robot 10 seconds of simulation time (`10 / env_config.dt` timesteps) to learn how to stand.

In [6]:
config = params.BaseParameters().parse()

config.episodes = 500000
config.episode_length = 10 / env_config.dt

config, run = auto_trainer.get_synced_config(config, TAGS)
config

wandb: Currently logged in as: agupta231 (use `wandb login --relogin` to force relogin)


{'episodes': 500000, 'episode_length': 10000.0, 'policy': 'MlpPolicy', 'algorithm': 'PPO2'}
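The `episode_length` printed above is a duration in seconds divided by the simulation timestep. Below is a minimal sketch of that conversion, assuming `dt = 0.001` s (inferred from the printed value of `10000.0`, not read from `gym_solo`). The helper name `steps_for_duration` is hypothetical and is not part of `auto_trainer`.

```python
def steps_for_duration(seconds, dt):
    """Return the whole number of simulation steps that cover `seconds`."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    # round() guards against floating-point artifacts in the division,
    # and int() gives a step count instead of a float like 10000.0.
    return int(round(seconds / dt))


# Assumed timestep of 1 ms; the notebook reads the real value from env_config.dt.
episode_steps = steps_for_duration(10, 0.001)
print(episode_steps)
```

Casting to an integer is a design choice for the sketch only; the notebook itself keeps the float result of the division.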

## Setup Environment
Add the following inputs to the robot / environment:

**Observations**
- TorsoIMU
- Motor encoder current values

**Reward**
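- How close the robot is to its home position (`rewards.HomePositionReward`). The alternative `rewards.UprightReward`, which measures how upright the TorsoIMU is and is valued in $[-1, 1]$, is left commented out.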

**Termination Criteria**
- Terminate after $n$ timesteps
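As an illustration of the termination criterion listed above, here is a minimal sketch of a step-counting terminator. The class `StepCountTermination` and its method names are hypothetical stand-ins; this is not the implementation of `terms.TimeBasedTermination` in `gym_solo`, only one plausible way such a criterion could work.

```python
class StepCountTermination:
    """Hypothetical step counter; signals termination after `max_steps` checks."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self):
        # Called at the start of every episode.
        self.step_count = 0

    def is_terminated(self):
        # Assumed to be called exactly once per environment step.
        self.step_count += 1
        return self.step_count >= self.max_steps


term = StepCountTermination(max_steps=3)
flags = [term.is_terminated() for _ in range(3)]
print(flags)
```

The `>=` comparison also works when `max_steps` is a float, as `config.episode_length` is in this notebook.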

In [7]:
env = gym.make('solo8vanilla-v0', config=env_config)

env.obs_factory.register_observation(obs.TorsoIMU(env.robot))
env.obs_factory.register_observation(obs.MotorEncoder(env.robot))

# env.reward_factory.register_reward(1, rewards.UprightReward(env.robot))
env.reward_factory.register_reward(1, rewards.HomePositionReward(env.robot))

env.termination_factory.register_termination(terms.TimeBasedTermination(config.episode_length))
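The first argument to `register_reward` above looks like a weight for the registered reward. That reading is an assumption, not something this notebook confirms. Under that assumption, a reward factory could combine its rewards as a weighted sum, sketched here with a hypothetical `WeightedRewardSum` class and plain callables in place of `gym_solo` reward objects.

```python
class WeightedRewardSum:
    """Hypothetical factory that sums weight * reward over registered rewards."""

    def __init__(self):
        self._rewards = []

    def register_reward(self, weight, reward_fn):
        # reward_fn is any zero-argument callable returning a float.
        self._rewards.append((weight, reward_fn))

    def get_reward(self):
        if not self._rewards:
            raise ValueError("register at least one reward first")
        return sum(weight * reward_fn() for weight, reward_fn in self._rewards)


factory = WeightedRewardSum()
factory.register_reward(1, lambda: 0.5)     # stand-in for a home-position term
factory.register_reward(2, lambda: -0.25)   # stand-in for a second, heavier term
total = factory.get_reward()
print(total)
```

With a single reward registered at weight 1, as in the cell above, the combined reward would simply equal that reward.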



## Learning

In [8]:
model, config, run = auto_trainer.train(env, config, TAGS, log_freq=500, 
                                        full_logging=False, run=run)

Wrapping the env in a DummyVecEnv.




Instructions for updating:
Use keras.layers.flatten instead.
Instructions for updating:
Please use `layer.__call__` method instead.




Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where




-------------------------------------
| approxkl           | 0.0037439335 |
| clipfrac           | 0.046875     |
| explained_variance | -0.146       |
| fps                | 209          |
| n_updates          | 1            |
| policy_entropy     | 17.031416    |
| policy_loss        | -0.023764359 |
| serial_timesteps   | 128          |
| time_elapsed       | 0.000361     |
| total_timesteps    | 128          |
| value_loss         | 16.367264    |
-------------------------------------

-------------------------------------
| approxkl           | 0.0033711412 |
| clipfrac           | 0.037109375  |
| explained_variance | 2.8e-05      |
| fps                | 658          |
| n_updates          | 500          |
| po

Run summary:

| Metric | Value |
| --- | --- |
| global_step | 500088.0 |
| _timestamp | 1611162155.99661 |
| loss/entropy_loss | 22.13864 |
| loss/policy_gradient_loss | -0.01128 |
| loss/value_function_loss | 7.52694 |
| loss/approximate_kullback-leibler | 0.00873 |
| loss/clip_factor | 0.09375 |
| loss/loss | 3.5308 |
| input_info/discounted_rewards | 46.11958 |
| input_info/learning_rate | 0.00025 |


Run history:

| Metric | History |
| --- | --- |
| global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| _timestamp | ▁▂▂▂▂▂▂▃▃▃▃▃▃▃▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇██████ |
| loss/entropy_loss | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███ |
| loss/policy_gradient_loss | ▇▇▆▄▄▅▅▅▇▃▄▃▆▃▄▂▅▆▁▅█▆▆▆▃▃▅▆▂▇▃▆▄▃▅▆▂▃▅▆ |
| loss/value_function_loss | ▅▂▁▁▁▁▁▇▁▁▁█▁▁▁▁▁▁▁▁▇▂▁▁▁▂▁▁▂▁▂▁▁▁▂▂▁▂▁▁ |
| loss/approximate_kullback-leibler | ▁▁▃▃▅▅▅▃▂█▄▃▂▅▃▆▆▅▅▃▃▁▅▅▆▄▂▄▅▂▃▄▅▃▄▁▅▅▅▅ |
| loss/clip_factor | ▁▁▄▃▇▇▅▃▂▇▅▃▃▇▅█▇▆▇▅▅▁▇▇█▅▃▅▇▂▃▅▆▃▆▁▅▅▅▅ |
| loss/loss | ▅▂▁▁▁▁▁▇▁▁▁█▁▁▁▁▁▁▁▁▇▂▁▁▁▂▁▁▂▁▂▁▁▁▂▂▁▂▁▁ |
| input_info/discounted_rewards | ▄▅▇▇▅▅▅▅▆▅▅▁▆▄▅▆▇▇▄▆▂▇█▇▅▆▇▅▅▇█▆▆▅▇▆▅▆▆▅ |
| input_info/learning_rate | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
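For readers unfamiliar with the `approxkl` and `clipfrac` entries in the training log, the sketch below shows how these two diagnostics are commonly computed in Baselines-style PPO2 implementations. It assumes the usual definitions (half the mean squared difference of negative log-probabilities, and the fraction of samples whose probability ratio leaves the clip range) and a clip range of 0.2. Neither the definitions nor the clip range is confirmed by this notebook, and the function name `ppo_diagnostics` is hypothetical.

```python
import math


def ppo_diagnostics(old_neglogp, new_neglogp, clip_range=0.2):
    """Return (approx_kl, clip_frac) for paired negative log-probabilities."""
    if len(old_neglogp) != len(new_neglogp) or not old_neglogp:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(old_neglogp)
    # Approximate KL: 0.5 * mean((new_neglogp - old_neglogp)^2)
    approx_kl = 0.5 * sum(
        (new - old) ** 2 for old, new in zip(old_neglogp, new_neglogp)
    ) / n
    # Probability ratio pi_new / pi_old = exp(old_neglogp - new_neglogp)
    ratios = [math.exp(old - new) for old, new in zip(old_neglogp, new_neglogp)]
    # Fraction of samples whose ratio falls outside [1 - clip, 1 + clip]
    clip_frac = sum(abs(r - 1.0) > clip_range for r in ratios) / n
    return approx_kl, clip_frac


# Two samples: one unchanged, one whose probability rose by a factor of e^0.5.
approx_kl, clip_frac = ppo_diagnostics([1.0, 1.0], [1.0, 0.5])
print(approx_kl, clip_frac)
```

Small values of both quantities, like those in the log above, suggest the policy changes only slightly on each update.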
