# RL Exercise 5 - Evolution Strategies

**GOAL:** The goal of this exercise is to demonstrate how to use the evolution strategies (ES) algorithm.

ES is described in detail in https://arxiv.org/abs/1703.03864.

The ES algorithm works as follows.

- It maintains a distribution over policies. In this case, that is a multivariate Gaussian distribution over the weights of a neural network policy, and the current policy is represented by the mean of the Gaussian, $\theta$.
- The mean of the distribution is updated at each iteration, from $\theta_0$ to $\theta_1$ to $\theta_2$ and so on.
- At each iteration, a large number of policies are sampled from the distribution over policies and rollouts are performed using these **perturbed policies**.
- The distribution over policies is updated by moving its mean in the direction of the perturbed policies that achieved higher reward.
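
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not RLlib's implementation: `es_step`, `reward_fn`, and the toy quadratic reward are hypothetical stand-ins for real rollouts, the returns are standardized (one common choice among several), and the values of `noise_stdev` and `stepsize` are picked for the toy problem only.

```python
import numpy as np

def es_step(theta, reward_fn, rng, num_perturbations=50,
            noise_stdev=0.1, stepsize=0.005):
    """One ES iteration: sample perturbed policies, score them, move the mean."""
    # Sample Gaussian perturbations around the current mean theta.
    eps = rng.standard_normal((num_perturbations, theta.size))
    # Score each perturbed policy (a full rollout in the real algorithm).
    returns = np.array([reward_fn(theta + noise_stdev * e) for e in eps])
    # Standardize returns so the step does not depend on the reward scale.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Return-weighted average of the perturbations: a gradient estimate.
    grad = eps.T @ adv / (num_perturbations * noise_stdev)
    # Move the mean toward perturbations that achieved higher reward.
    return theta + stepsize * grad

# Hypothetical toy "environment": reward is highest when theta == target.
target = np.array([1.0, -2.0, 0.5])

def reward_fn(weights):
    return -np.sum((weights - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3)
start = reward_fn(theta)
for _ in range(300):
    theta = es_step(theta, reward_fn, rng)
final = reward_fn(theta)
print(start, final)
```

Only the scalar `returns` feed into the update; no states or actions from the rollouts are used.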

Of the algorithms explored so far, this one is the closest to the Monte Carlo algorithm implemented in one of the earlier exercises.

**NOTE:** One interesting property of this algorithm is that it only cares about the rewards achieved in a given rollout. The algorithm does not need to know which states were visited, so much less data has to be communicated.
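
One way to keep communication this small, described in the ES paper, is a shared noise table: every process holds the same block of random numbers, so a worker can report just an index into the table and the return it observed. The sketch below illustrates the idea only. The table size, seed, and the helper names `make_noise_table`, `worker_rollout`, and `driver_reconstruct` are assumptions for this example and are not taken from RLlib.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
NOISE_SIZE = 10_000
NUM_PARAMS = 100

def make_noise_table(seed=123):
    """Build the noise table; the same seed yields the same table everywhere."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(NOISE_SIZE).astype(np.float32)

def worker_rollout(theta, noise, index, noise_stdev, reward_fn):
    """Evaluate one perturbed policy and send back only (index, return)."""
    eps = noise[index:index + theta.size]
    return index, float(reward_fn(theta + noise_stdev * eps))

def driver_reconstruct(noise, index, num_params):
    """Recover the worker's perturbation from its index alone."""
    return noise[index:index + num_params]

driver_noise = make_noise_table()
worker_noise = make_noise_table()  # built independently by the worker

theta = np.zeros(NUM_PARAMS, dtype=np.float32)
index = 4321
message = worker_rollout(theta, worker_noise, index, 0.02,
                         lambda w: -np.sum(w ** 2))
eps = driver_reconstruct(driver_noise, message[0], NUM_PARAMS)
```

Here `message` holds two numbers regardless of how many weights the policy has or how long the rollout ran.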

In [1]:
import gym
import ray
from ray.rllib.es import ESAgent, DEFAULT_CONFIG


Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the training agent's constructor will create a number of actors.

In [2]:
ray.init(num_workers=0)

Waiting for redis server at 127.0.0.1:42482 to respond...
Waiting for redis server at 127.0.0.1:52355 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.

View the web UI at http://localhost:8898/notebooks/ray_ui44606.ipynb?token=b7d8c131ad525d221aa469187a41cac7b6a262e4f55c8abc



{'local_scheduler_socket_names': ['/tmp/scheduler3128764'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store62765532', manager_name='/tmp/plasma_manager19938805', manager_port=46649)],
 'redis_address': '127.0.0.1:42482',
 'webui_url': 'http://localhost:8898/notebooks/ray_ui44606.ipynb?token=b7d8c131ad525d221aa469187a41cac7b6a262e4f55c8abc'}

Instantiate an ESAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `episodes_per_batch` is the minimum number of rollouts to perform at each iteration.
- `timesteps_per_batch` is the minimum number of steps of the environment to perform at each iteration.
- `noise_stdev` is the standard deviation of the multivariate Gaussian distribution over the neural net policy weights.
- `stepsize` is the size of the update to the distribution over policies to take at each iteration.
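
Because `episodes_per_batch` and `timesteps_per_batch` are both minimums, an iteration can collect more than either value. The sketch below shows one plausible reading of that rule: keep gathering rollouts until both thresholds are met. `collect_batch` and `run_rollouts` are hypothetical names, and the round size of 12 episodes of 200 steps each is an assumption chosen to mirror the training log in this notebook.

```python
def collect_batch(run_rollouts, episodes_per_batch, timesteps_per_batch):
    """Gather rollouts until both the episode and timestep minimums are met."""
    episodes, timesteps = 0, 0
    while episodes < episodes_per_batch or timesteps < timesteps_per_batch:
        lengths = run_rollouts()   # lengths of the episodes in this round
        episodes += len(lengths)
        timesteps += sum(lengths)
    return episodes, timesteps

# Assumed round: 12 episodes that each run for 200 steps.
episodes, timesteps = collect_batch(lambda: [200] * 12,
                                    episodes_per_batch=100,
                                    timesteps_per_batch=1000)
print(episodes, timesteps)  # 108 21600
```

With these assumed numbers the loop stops at 108 episodes, the first multiple of 12 that reaches 100, which matches the `EpisodesThisIter` and `TimestepsThisIter` values reported during training.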

In [16]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['episodes_per_batch'] = 100
config['timesteps_per_batch'] = 1000
config['noise_stdev'] = 0.02
config['stepsize'] = 0.01

agent = ESAgent(config, 'MountainCar-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (2,)
Not using any observation preprocessor.
Constructing fcnet [256, 256] <function tanh at 0x11dc82378>
Creating shared noise table.
Creating actors.


**EXERCISE:** Train the agent for some number of iterations on the environment it was created with above (`MountainCar-v0`). Compare the performance to PPO from the previous exercise.

In [17]:
for _ in range(50):
    result = agent.train()

Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 timesteps so far this iter
Collected 84 episodes 16800 timesteps so far this iter
Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 514      |
| GradNorm            | 4.37     |
| UpdateRatio         | 0.115    |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 108      |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 2.16e+04 |
| TimeElapsedThisIter | 


Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 timesteps so far this iter
Collected 84 episodes 16800 timesteps so far this iter
Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 518      |
| GradNorm            | 4.38     |
| UpdateRatio         | 0.0809   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 216      |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 4.32e+04 |
| TimeElapsedThisIter | 

Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 549      |
| GradNorm            | 4.39     |
| UpdateRatio         | 0.0381   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 958      |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 1.92e+05 |
| TimeElapsedThisIter | 3.08     |
| TimeElapsed         | 38.4     |
----------------------------------
Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 30 episodes 6000 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 42 episodes 8400 timesteps so far this iter
Collected 54 episodes 10800 ti

Collected 18 episodes 3600 timesteps so far this iter
Collected 30 episodes 6000 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 42 episodes 8400 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 54 episodes 10800 timesteps so far this iter
Collected 66 episodes 13200 timesteps so far this iter
Collected 78 episodes 15600 timesteps so far this iter
Collected 90 episodes 18000 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 573      |
| GradNorm            | 5.07     |
| UpdateRatio         | 0.0299   |
| EpisodesThisIter    | 102      |
| EpisodesSoFar       | 1.69e+03 |
| TimestepsThisIter   | 2.04e+04 |
| TimestepsSoFar      | 3.38e+05 |
| TimeElapsedThisIte

Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 timesteps so far this iter
Collected 84 episodes 16800 timesteps so far this iter
Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 588      |
| GradNorm            | 4.41     |
| UpdateRatio         | 0.0263   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 2.55e+03 |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 5.1e+05  |
| TimeElapsedThisIter | 

Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 68 episodes 13600 timesteps so far this iter
Collected 80 episodes 16000 timesteps so far this iter
Collected 86 episodes 17200 timesteps so far this iter
Collected 92 episodes 18400 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 595      |
| GradNorm            | 4.41     |
| UpdateRatio         | 0.0249   |
| EpisodesThisIter    | 104      |
| EpisodesSoFar       | 3.41e+03 |
| TimestepsThisIter   | 2.08e+04 |
| Ti

Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 597      |
| GradNorm            | 4.37     |
| UpdateRatio         | 0.0245   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 4.15e+03 |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 8.3e+05  |
| TimeElapsedThisIter | 2.91     |
| TimeElapsed         | 133      |
----------------------------------
Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 t

Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 timesteps so far this iter
Collected 84 episodes 16800 timesteps so far this iter
Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 598      |
| GradNorm            | 4.37     |
| UpdateRatio         | 0.0239   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 5.01e+03 |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 1e+06    |
| TimeElapsedThisIter | 

In [18]:
result

TrainingResult(timesteps_total=1067200, done=None, info={'weights_norm': 596.88995, 'grad_norm': 4.378503, 'update_ratio': 0.023926882, 'episodes_this_iter': 108, 'episodes_so_far': 5336, 'timesteps_this_iter': 21600, 'timesteps_so_far': 1067200, 'time_elapsed_this_iter': 2.8604001998901367, 'time_elapsed': 165.33793687820435}, episode_reward_mean=nan, episode_len_mean=nan, episodes_total=None, mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None, experiment_id='773acd52fa664b419eaf9e37d350e226', training_iteration=50, timesteps_this_iter=21600, time_this_iter_s=2.865279197692871, time_total_s=164.7182047367096, pid=21775, date='2018-03-28_22-33-29', timestamp=1522301609, hostname='DILBAG-M-X2Y6', config={'l2_coeff': 0.005, 'noise_stdev': 0.02, 'episodes_per_batch': 100, 'timesteps_per_batch': 1000, 'eval_prob': 0.003, 'return_proc_mode': 'centered_rank', 'num_workers': 3, 'stepsize': 0.01, 'observation_filter': 'MeanStdFilter', 'noise_size': 250000000,
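
The config in the result above lists `'return_proc_mode': 'centered_rank'`. In the ES paper, this kind of rank-based shaping replaces raw returns with their ranks, rescaled to the range [-0.5, 0.5], so that a few extreme returns cannot dominate the update. The function below is a minimal sketch of that transformation, not RLlib's code; the name `centered_ranks` and the sample returns are illustrative.

```python
import numpy as np

def centered_ranks(returns):
    """Map returns to evenly spaced values in [-0.5, 0.5] by rank."""
    returns = np.asarray(returns, dtype=np.float64)
    ranks = np.empty(returns.size, dtype=np.float64)
    # The smallest return gets rank 0, the largest gets rank size - 1.
    ranks[np.argsort(returns.ravel())] = np.arange(returns.size)
    # Rescale ranks to [0, 1], then shift so they are centered on zero.
    return (ranks / (returns.size - 1) - 0.5).reshape(returns.shape)

shaped = centered_ranks([10.0, -5.0, 3.0])
print(shaped)  # [ 0.5 -0.5  0. ]
```

The sketch assumes at least two returns. It does not treat ties specially, so equal returns receive different ranks depending on their position.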

**EXERCISE:** Instantiate an `ESAgent` object on the `MountainCar-v0` environment and train it for some number of steps. Compare the performance to PPO and A3C from the previous exercise.

In [19]:
agent.train()

Collected 0 episodes 0 timesteps so far this iter
Collected 12 episodes 2400 timesteps so far this iter
Collected 24 episodes 4800 timesteps so far this iter
Collected 36 episodes 7200 timesteps so far this iter
Collected 48 episodes 9600 timesteps so far this iter
Collected 60 episodes 12000 timesteps so far this iter
Collected 72 episodes 14400 timesteps so far this iter
Collected 84 episodes 16800 timesteps so far this iter
Collected 96 episodes 19200 timesteps so far this iter
----------------------------------
| EvalEpRewMean       | nan      |
| EvalEpRewStd        | nan      |
| EvalEpLenMean       | nan      |
| EpRewMean           | -200     |
| EpRewStd            | 0        |
| EpLenMean           | 200      |
| Norm                | 597      |
| GradNorm            | 4.41     |
| UpdateRatio         | 0.0239   |
| EpisodesThisIter    | 108      |
| EpisodesSoFar       | 5.44e+03 |
| TimestepsThisIter   | 2.16e+04 |
| TimestepsSoFar      | 1.09e+06 |
| TimeElapsedThisIter | 


TrainingResult(timesteps_total=1088800, done=None, info={'weights_norm': 596.59485, 'grad_norm': 4.41005, 'update_ratio': 0.02394891, 'episodes_this_iter': 108, 'episodes_so_far': 5444, 'timesteps_this_iter': 21600, 'timesteps_so_far': 1088800, 'time_elapsed_this_iter': 3.2404000759124756, 'time_elapsed': 4655.842152833939}, episode_reward_mean=nan, episode_len_mean=nan, episodes_total=None, mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None, experiment_id='773acd52fa664b419eaf9e37d350e226', training_iteration=51, timesteps_this_iter=21600, time_this_iter_s=3.248440980911255, time_total_s=167.96664571762085, pid=21775, date='2018-03-28_23-48-20', timestamp=1522306100, hostname='DILBAG-M-X2Y6', config={'l2_coeff': 0.005, 'noise_stdev': 0.02, 'episodes_per_batch': 100, 'timesteps_per_batch': 1000, 'eval_prob': 0.003, 'return_proc_mode': 'centered_rank', 'num_workers': 3, 'stepsize': 0.01, 'observation_filter': 'MeanStdFilter', 'noise_size': 250000000, '