Based on `tutorials/rllib_exercise`.

Installation requires `tensorflow` and `ray[rllib]`:
`pip install ray[rllib]`

## Using gym

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import numpy as np

`gym` provides an MDP interface to many different simulators. See its documentation for details; the available environments are listed at [gym env](https://gym.openai.com/envs/#classic_control).
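To make the MDP interface concrete, here is a minimal sketch of a toy environment that exposes the same two methods used in the cells that follow: `reset()` returns the initial state, and `step(action)` returns a `(state, reward, done, info)` tuple. `CountdownEnv` is a hypothetical class written for illustration only; it is not part of `gym`.

```python
class CountdownEnv(object):
    """Hypothetical toy environment with a gym-style reset/step interface."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.remaining = horizon

    def reset(self):
        # Return to the initial state of the MDP.
        self.remaining = self.horizon
        return [float(self.remaining)]

    def step(self, action):
        # Advance one step and return (state, reward, done, info).
        # The action is ignored in this toy example.
        self.remaining -= 1
        reward = 1.0
        done = self.remaining <= 0
        return [float(self.remaining)], reward, done, {}


toy_env = CountdownEnv(horizon=3)
state = toy_env.reset()
total = 0.0
done = False
while not done:
    state, reward, done, info = toy_env.step(0)
    total += reward

print(total)  # 3.0
```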

In [2]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


Reset the environment to the initial state of the MDP.

In [3]:
state = env.reset()
print('The starting state is:', state)

The starting state is: [0.04582552 0.00954718 0.02526218 0.01461649]


There are two actions, 0 and 1 (move left and move right).

In [5]:
action = 0
state, reward, done, info = env.step(action)
print(state, reward, done, info)

[ 0.04601647 -0.18592778  0.02555451  0.31516166] 1.0 False {}
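The four numbers in the state printed above are easier to read with labels. The sketch below assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity); the key names are chosen here for illustration. The remaining outputs of `env.step` are the reward (`1.0`), the `done` flag (`False`) and an empty `info` dict.

```python
# Observation copied from the output of the step above.
state = [0.04601647, -0.18592778, 0.02555451, 0.31516166]

# Assumed CartPole observation layout; the key names are illustrative.
names = ['cart_position', 'cart_velocity', 'pole_angle', 'pole_angular_velocity']
labelled = dict(zip(names, state))

for name in names:
    print('{:>22}: {: .4f}'.format(name, labelled[name]))
```

After action 0 (move left) the cart velocity is negative, which matches a push to the left.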


In [9]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    while not done:
        # Choose an action (either 0 or 1) using the policy.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward

    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

The first sample policy got an average reward of 9.58.
The second sample policy got an average reward of 29.25.
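Both sample policies above look only at the cart position, `state[0]`. One variant that may be worth trying is to react to the pole angle instead. The sketch below is a hypothetical `sample_policy3`; it assumes `state[2]` is the pole angle with negative values meaning the pole leans left, and no reward figure is claimed for it.

```python
def sample_policy3(state):
    # Hypothetical variant: push the cart toward the side the pole leans to.
    # Assumes state[2] is the pole angle (negative = leaning left).
    return 0 if state[2] < 0 else 1


print(sample_policy3([0.0, 0.0, -0.05, 0.0]))  # 0
print(sample_policy3([0.0, 0.0, 0.05, 0.0]))   # 1
```

It can be evaluated the same way as the other two, with `rollout_policy(env, sample_policy3)`.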


## Using RLlib

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.agents.ppo import PPOAgent, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [2]:
ray.init(ignore_reinit_error=True)

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-10-29_22-02-36_3386/logs.
Waiting for redis server at 127.0.0.1:51628 to respond...
Waiting for redis server at 127.0.0.1:52034 to respond...
Starting the Plasma object store with 1.61 GB memory.

View the web UI at http://localhost:8891/notebooks/ray_ui.ipynb?token=d05867c0b627172ef30ac650de3cfe5f13cb8170c5063776



{'node_ip_address': '222.195.64.115',
 'redis_address': '222.195.64.115:51628',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/ray/session_2018-10-29_22-02-36_3386/sockets/plasma_store', manager_name=None, manager_port=None)],
 'local_scheduler_socket_names': [],
 'raylet_socket_names': ['/tmp/ray/session_2018-10-29_22-02-36_3386/sockets/raylet'],
 'webui_url': 'http://localhost:8891/notebooks/ray_ui.ipynb?token=d05867c0b627172ef30ac650de3cfe5f13cb8170c5063776'}

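- `num_workers`: the number of actors to create; this determines the degree of parallelism
- `num_sgd_iter`: the number of epochs, i.e. SGD passes over each batch of sampled data
- `sgd_minibatch_size`: the minibatch size used for SGD
- `model`: a `dict` describing the neural network parameters. `fcnet_hiddens` is a list of hidden layer sizes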

In [3]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10
config['sgd_minibatch_size'] = 256
config['model']['fcnet_hiddens'] = [50, 50]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

agent = PPOAgent(config, 'CartPole-v0')

Created LogSyncer for /home/drdh/ray_results/PPO_CartPole-v0_2018-10-29_22-02-39uco23dkt -> None
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-10-29 22:02:45,615	INFO multi_gpu_optimizer.py:59 -- LocalMultiGPUOptimizer devices ['/cpu:0']
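One detail of the configuration cell above: `dict.copy()` makes a shallow copy, so the nested `model` dict is still shared with the original. Assuming `DEFAULT_CONFIG` is a plain nested `dict`, as its use here suggests, assigning to `config['model']['fcnet_hiddens']` would also change the defaults. The sketch below shows the effect on a stand-in dict (`DEFAULTS` and its values are hypothetical, not RLlib's real defaults) and how `copy.deepcopy` avoids it.

```python
import copy

# Stand-in for a nested default config (hypothetical values).
DEFAULTS = {'num_workers': 2, 'model': {'fcnet_hiddens': [256, 256]}}

shallow = DEFAULTS.copy()
shallow['num_workers'] = 3                    # top-level key: DEFAULTS is unaffected
shallow['model']['fcnet_hiddens'] = [50, 50]  # nested dict is shared: DEFAULTS changes too
print(DEFAULTS['num_workers'], DEFAULTS['model']['fcnet_hiddens'])  # 2 [50, 50]

DEFAULTS['model']['fcnet_hiddens'] = [256, 256]  # undo for the second demonstration
deep = copy.deepcopy(DEFAULTS)
deep['model']['fcnet_hiddens'] = [50, 50]        # independent copy: DEFAULTS is unaffected
print(DEFAULTS['model']['fcnet_hiddens'])        # [256, 256]
```

In a notebook where cells are re-executed, a deep copy keeps each run starting from the same defaults.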


In [4]:
for i in range(25):
    result = agent.train()
    print(pretty_print(result))

  if np.issubdtype(value, float):


date: 2018-10-29_22-03-13
done: false
episode_len_mean: 23.116751269035532
episode_reward_max: 64.0
episode_reward_mean: 23.116751269035532
episode_reward_min: 9.0
episodes_this_iter: 197
episodes_total: 197
experiment_id: 2b493b042b7842dbb6238a047536ffdf
hostname: lx
info:
  cur_lr: 4.999999873689376e-05
  entropy: 0.685346782207489
  grad_time_ms: 2184.725
  kl: 0.008550765924155712
  load_time_ms: 126.242
  num_steps_sampled: 4000
  num_steps_trained: 4000
  policy_loss: -0.01899544708430767
  sample_time_ms: 3195.979
  total_loss: 239.58241271972656
  update_time_ms: 552.287
  vf_explained_var: 0.0004984458209946752
  vf_loss: 239.59970092773438
iterations_since_restore: 1
node_ip: 222.195.64.115
pid: 3386
policy_reward_mean: {}
time_since_restore: 6.155289649963379
time_this_iter_s: 6.155289649963379
time_total_s: 6.155289649963379
timestamp: 1540821793
timesteps_since_restore: 4000
timesteps_this_iter: 4000
timesteps_total: 4000
training_iteration: 1

date: 2018-10-29_22-03-18
do

date: 2018-10-29_22-03-48
done: false
episode_len_mean: 192.84
episode_reward_max: 200.0
episode_reward_mean: 192.84
episode_reward_min: 77.0
episodes_this_iter: 24
episodes_total: 673
experiment_id: 2b493b042b7842dbb6238a047536ffdf
hostname: lx
info:
  cur_lr: 4.999999873689376e-05
  entropy: 0.559470534324646
  grad_time_ms: 1207.82
  kl: 0.0008315223967656493
  load_time_ms: 17.21
  num_steps_sampled: 40000
  num_steps_trained: 40000
  policy_loss: -0.001442252891138196
  sample_time_ms: 2814.336
  total_loss: 2533.8740234375
  update_time_ms: 65.945
  vf_explained_var: 0.019507916644215584
  vf_loss: 2533.875732421875
iterations_since_restore: 10
node_ip: 222.195.64.115
pid: 3386
policy_reward_mean: {}
time_since_restore: 41.25542593002319
time_this_iter_s: 3.513920783996582
time_total_s: 41.25542593002319
timestamp: 1540821828
timesteps_since_restore: 40000
timesteps_this_iter: 4000
timesteps_total: 40000
training_iteration: 10

date: 2018-10-29_22-03-52
done: false
episode_len_me

date: 2018-10-29_22-04-21
done: false
episode_len_mean: 192.96
episode_reward_max: 200.0
episode_reward_mean: 192.96
episode_reward_min: 118.0
episodes_this_iter: 25
episodes_total: 886
experiment_id: 2b493b042b7842dbb6238a047536ffdf
hostname: lx
info:
  cur_lr: 4.999999873689376e-05
  entropy: 0.5419458746910095
  grad_time_ms: 1036.222
  kl: 0.0014677790459245443
  load_time_ms: 2.255
  num_steps_sampled: 76000
  num_steps_trained: 76000
  policy_loss: 0.0014283251948654652
  sample_time_ms: 2565.785
  total_loss: 1685.67138671875
  update_time_ms: 11.286
  vf_explained_var: 0.026106519624590874
  vf_loss: 1685.669921875
iterations_since_restore: 19
node_ip: 222.195.64.115
pid: 3386
policy_reward_mean: {}
time_since_restore: 74.01923823356628
time_this_iter_s: 3.84794020652771
time_total_s: 74.01923823356628
timestamp: 1540821861
timesteps_since_restore: 76000
timesteps_this_iter: 4000
timesteps_total: 76000
training_iteration: 19

date: 2018-10-29_22-04-25
done: false
episode_len_me
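The loop above always runs 25 iterations, although `episode_reward_mean` is already above 190 by iteration 10. A possible alternative is to stop once a target is reached. The sketch below is only an illustration: `train_until` is a hypothetical helper, `FakeAgent` is a stand-in with made-up rewards so the code runs without Ray, and the only assumption about the real agent is that `train()` returns a dict containing `episode_reward_mean`, as the printed results show.

```python
def train_until(agent, target_reward, max_iters=25):
    """Call agent.train() until the mean episode reward reaches target_reward."""
    history = []
    for _ in range(max_iters):
        result = agent.train()
        history.append(result['episode_reward_mean'])
        if history[-1] >= target_reward:
            break
    return history


class FakeAgent(object):
    """Stand-in with a train() method; the rewards are made up for illustration."""

    def __init__(self, rewards):
        self._rewards = iter(rewards)

    def train(self):
        return {'episode_reward_mean': next(self._rewards)}


history = train_until(FakeAgent([23.1, 120.0, 192.8, 195.0]), target_reward=190.0)
print(history)  # [23.1, 120.0, 192.8]
```

With the real agent this would be called as `train_until(agent, 190.0)`.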

Save the model to a checkpoint, then restore it into a new agent.

In [5]:
checkpoint_path = agent.save()
print(checkpoint_path)

trained_config = config.copy()

test_agent = PPOAgent(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

Created LogSyncer for /home/drdh/ray_results/PPO_CartPole-v0_2018-10-29_22-04-46i_skv_mr -> None


/home/drdh/ray_results/PPO_CartPole-v0_2018-10-29_22-02-39uco23dkt/checkpoint_25xflwn0j9/checkpoint-25


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-10-29 22:04:48,392	INFO multi_gpu_optimizer.py:59 -- LocalMultiGPUOptimizer devices ['/cpu:0']


Use the restored model to run one episode.

In [6]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

200.0
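A single episode is a noisy estimate of how good the restored policy is, so it can help to average over several episodes. The sketch below is a minimal illustration: `average_reward` is a hypothetical helper, and `FixedLengthEnv` is a stand-in environment so the code runs without `gym` or Ray.

```python
def average_reward(env, policy, num_episodes=10):
    """Mean cumulative reward of `policy` over several episodes."""
    totals = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


class FixedLengthEnv(object):
    """Hypothetical stand-in: every episode lasts `length` steps with reward 1 each."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, self.t >= self.length, {}


print(average_reward(FixedLengthEnv(4), lambda state: 0, num_episodes=3))  # 4.0
```

With the objects from this notebook, the call would be `average_reward(env, test_agent.compute_action, num_episodes=100)`.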
