# Ray RLlib Tutorial - Explore RLlib Exercise Solutions

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademyLogo.png)

This notebook contains the solutions for all the exercises in the RLlib tutorial.

First, we have to set up everything needed from the other notebooks.

In [None]:
import gym
import numpy as np
import pandas as pd
import json
import sys
import os

## 01 Introduction to Reinforcement Learning

### Exercise 1

Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

In [None]:
env = gym.make("CartPole-v1")
print("Created env:", env)

In [None]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose an action by applying the policy to the current state.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
        
    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

In [None]:
reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')
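
Because a policy is just a function from state to action, other policies can be dropped into `rollout_policy` without changing it. The sketch below is a hypothetical third policy, `angle_policy`, which is not part of the original exercise. It assumes the standard CartPole observation layout, where `state[2]` is the pole angle, and the standard action encoding, where 0 pushes the cart left and 1 pushes it right.

```python
def angle_policy(state):
    """Hypothetical policy: push the cart toward the side the pole leans to.

    Assumes state[2] is the pole angle (positive means leaning right)
    and that action 1 pushes right while action 0 pushes left.
    """
    return 1 if state[2] > 0 else 0


# Hand-written sample states: [cart position, cart velocity, pole angle, pole angular velocity]
leaning_right = [0.0, 0.0, 0.05, 0.0]
leaning_left = [0.0, 0.0, -0.05, 0.0]

print(angle_policy(leaning_right))  # 1
print(angle_policy(leaning_left))   # 0
```

It can be evaluated the same way as the two sample policies, by passing `angle_policy` as the second argument to `rollout_policy`.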

### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like `CartPole`. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [None]:
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [None]:
ray.init(ignore_reinit_error=True, log_to_driver=False)

Here's one possible set of changes. With them, it takes longer for the max reward to reach 200, so the number of training iterations `N` is increased in the training loop below.

In [None]:
config = DEFAULT_CONFIG.copy()
config["num_workers"] = 3
config["num_sgd_iter"] = 10                       # was 30
config["sgd_minibatch_size"] = 256                # was 128
config["model"]["fcnet_hiddens"] = [20, 20]       # was [100, 100]
config["num_cpus_per_worker"] = 0
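
To get a rough feel for how much lighter this configuration is, the following back-of-the-envelope sketch counts hidden-layer parameters and minibatch updates per training iteration. It assumes a 4-dimensional CartPole observation, a `train_batch_size` of 4000, and that each SGD iteration makes one pass over the train batch in minibatches. The helper names are made up for this illustration, and output layers are ignored.

```python
import math


def mlp_param_count(input_size, hidden_sizes):
    """Count weights and biases in the fully connected hidden layers only."""
    total, fan_in = 0, input_size
    for width in hidden_sizes:
        total += fan_in * width + width  # weight matrix plus bias vector
        fan_in = width
    return total


def minibatch_updates(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of minibatch gradient steps per training iteration."""
    return num_sgd_iter * math.ceil(train_batch_size / sgd_minibatch_size)


OBS_SIZE = 4             # assumption: CartPole observation size
TRAIN_BATCH_SIZE = 4000  # assumption: not set explicitly in this notebook

print(mlp_param_count(OBS_SIZE, [100, 100]))         # 10600
print(mlp_param_count(OBS_SIZE, [20, 20]))           # 520
print(minibatch_updates(TRAIN_BATCH_SIZE, 128, 30))  # 960
print(minibatch_updates(TRAIN_BATCH_SIZE, 256, 10))  # 160
```

Under these assumptions, the smaller network has roughly 20 times fewer hidden-layer parameters, and each training iteration performs about 6 times fewer gradient steps.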

In [None]:
agent = PPOTrainer(config, "CartPole-v1")

In [None]:
N = 20                # was 10
results = []
episode_data = []
episode_json = []

for n in range(N):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'],  
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']
              }
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}')

In [None]:
df = pd.DataFrame(data=episode_data)
df

In [None]:
df.plot(x="n", y=["episode_reward_mean", "episode_reward_min", "episode_reward_max"], secondary_y=True)

Compare your graph with the graph in the lesson, where we used more computing resources:

![](../../../images/rllib/Cart-Pole-Episode-Rewards.png)

Note that we only used 5 training iterations before. If you compare the graphs at n=4, you see that this exercise solution is training more slowly, but after n=10 the mean reward grows quickly.
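
One way to make such comparisons less dependent on eyeballing the plot is to find the first iteration at which the mean reward crosses a threshold. The helper below is a hypothetical sketch, not part of the lesson. It works on a list of dictionaries shaped like `episode_data`, and the sample numbers are invented purely to show the call.

```python
def first_iteration_reaching(episode_data, threshold, key="episode_reward_mean"):
    """Return the first 'n' whose value under `key` is at least `threshold`, else None."""
    for episode in episode_data:
        if episode[key] >= threshold:
            return episode["n"]
    return None


# Invented numbers for illustration only; real values come from agent.train().
sample_data = [
    {"n": 0, "episode_reward_mean": 21.5},
    {"n": 1, "episode_reward_mean": 38.0},
    {"n": 2, "episode_reward_mean": 64.2},
    {"n": 3, "episode_reward_mean": 101.7},
]

print(first_iteration_reaching(sample_data, 50))   # 2
print(first_iteration_reaching(sample_data, 150))  # None
```

Running the same helper on the `episode_data` from two different configurations gives a single number per run to compare.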

Try it again with slightly larger and/or smaller neural network layers.

## 05: Custom Environments and Reward Shaping

### Exercise 1: A Custom Environment with Rewards

Now we'll create an `n-Chain` environment, which represents moves along a linear chain of states, with two actions:

- `0` (**forward**): moves along the chain but returns no reward.
- `1` (**backward**): returns to the beginning and gives a small reward.

The end of the chain, however, provides a large reward, and by moving **forward** at the end of the chain, this large reward can be repeated.

#### Step 1: Implement `ChainEnv._setup_spaces`

Use a `spaces.Discrete` action space and observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper gym spaces.
  
1. The observation space is an integer in the range `0` to `n-1`, inclusive.
2. The action space is an integer that is either `0` or `1`.

For example:

```python
self.action_space = spaces.Discrete(2)
self.observation_space = ...
```

You should see a message indicating tests passing when done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of `(state, reward, done, info)`. Right now, the reward is always 0. Modify `step()` so that the following rewards are returned for the given actions: 

1. `action == 1` will return `self.small_reward`.
2. `action == 0` will return 0 if `self.state < self.n - 1`.
3. `action == 0` will return `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating tests passing when done correctly. 
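
The three reward rules above can also be written as a standalone function, which makes them easy to check by hand before wiring them into `step()`. This is only a sketch: `chain_reward` is a hypothetical helper that takes the state and configuration values as plain arguments instead of reading them from `self`.

```python
def chain_reward(state, action, n, small_reward, large_reward):
    """Reward for taking `action` in `state` on a chain with `n` states."""
    if action == 1:        # backward: always pays the small reward
        return small_reward
    if state < n - 1:      # forward, not yet at the end: no reward
        return 0
    return large_reward    # forward at the end of the chain: large reward


# Values matching the defaults used in the solution: n=20, small=2, large=10.
print(chain_reward(state=0, action=1, n=20, small_reward=2, large_reward=10))   # 2
print(chain_reward(state=5, action=0, n=20, small_reward=2, large_reward=10))   # 0
print(chain_reward(state=19, action=0, n=20, small_reward=2, large_reward=10))  # 10
```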

In [None]:
sys.path.append('..')
from test_exercises import test_chain_env_spaces, test_chain_env_reward, test_chain_env_behavior
from gym import spaces

In [None]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO: Implement this so that it passes tests
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.small_reward
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = 0
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.large_reward
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_chain_env_spaces(ChainEnv)
test_chain_env_reward(ChainEnv)

### Exercise 2: Improve the Policy

Modify `ShapedChainEnv.step()` in the next cell to provide a reward that encourages the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment (the action -> state behavior should be the same).

You can change the reward to be whatever you wish. We'll test it in the next section.
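
To see why the unshaped rewards tempt the policy to stay at state 0, it helps to total the reward of two constant policies over one episode. The sketch below re-implements the `ChainEnv` dynamics as a plain function, so it runs without gym. It assumes the solution's defaults (`n=20`, small reward 2, large reward 10) and an episode length of `n` steps. The function name is invented for this illustration.

```python
def constant_policy_return(action, n=20, small_reward=2, large_reward=10):
    """Total reward when the same action is taken for an episode of n steps."""
    state, total = 0, 0
    for _ in range(n):             # the episode ends after n steps
        if action == 1:            # backward: small reward, return to the start
            total += small_reward
            state = 0
        elif state < n - 1:        # forward: advance, no reward
            state += 1
        else:                      # forward at the end: large reward
            total += large_reward
    return total


print(constant_policy_return(action=1))  # 40: always backward
print(constant_policy_return(action=0))  # 10: always forward
```

With these settings, always going backward earns four times as much as always going forward, because the forward policy spends 19 of its 20 steps just reaching the end of the chain.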

### Evaluate `ShapedChainEnv` by Running the Cell(s) Below

This trains PPO on the new env and counts the number of states seen.

First, we'll set up things we need from the lesson notebook.

In [None]:
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

In [None]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

In [None]:
def do_training(chainEnvClass, config = trainer_config, iterations=20):
    trainer = PPOTrainer(config, chainEnvClass)
    print("Training iterations: ", end="")
    
    for i in range(iterations):
        print(".", end="")
        trainer.train()
        
    print("")
    return trainer

Now here's one solution, where the reward calculations are the only difference from the previous implementation of `step`. This problem is actually difficult to solve, because it's hard to encourage exploration through the reward alone. 

The key is to remove the incentive for action 1 (go back to the beginning). In the original environment that action always pays a small reward, so the policy is tempted to exploit it and keep collecting the small reward instead of moving along the chain. Hence, this solution sets the reward for action 1 to zero and gives a small reward for action 0 in every state except the last, where action 0 still earns the large reward.

It's still difficult to achieve good exploration.
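
The effect of this shaping can be checked by hand for the two constant policies. The sketch below simulates the shaped rewards and the stopping rule described above (stop once half the states have been visited, or after more than `10 * n` steps) as a plain function, assuming the defaults `n=20`, small reward 2, and large reward 10. The function name is hypothetical, and it models a single episode starting with an empty visited set.

```python
def shaped_constant_policy(action, n=20, small_reward=2, large_reward=10,
                           done_fraction=0.5):
    """Return (total reward, steps taken, states visited) for a constant policy."""
    state, steps, total = 0, 0, 0
    visited = set()
    while True:
        visited.add(state)         # record the state before moving
        if action == 1:            # backward: no reward after shaping
            state = 0
        elif state < n - 1:        # forward: small reward after shaping
            total += small_reward
            state += 1
        else:                      # forward at the end: large reward
            total += large_reward
        steps += 1
        if len(visited) >= done_fraction * n or steps > n * 10:
            return total, steps, len(visited)


print(shaped_constant_policy(action=0))  # (20, 10, 10): always forward
print(shaped_constant_policy(action=1))  # (0, 201, 1): always backward
```

Under the shaped rewards, the always-forward policy now comes out ahead, while the always-backward policy earns nothing and only stops when the step limit is hit.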

In [None]:
class ShapedChainEnvVisited(ChainEnv):

    def __init__(self, env_config = None):
        super().__init__(env_config)
        self.visited = set()
        self.done_percentage = 0.5
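        self.done_n = self.done_percentage * self.n

    def reset(self):
        # Clear the visited set so each episode starts fresh; otherwise
        # `done` would fire on the first step of every later episode.
        self.visited = set()
        return super().reset()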
        
    def step(self, action):
        assert self.action_space.contains(action)
        self.visited.add(self.state)
        if action == 1:  # 'backwards': go back to the beginning
            reward = 0   # was self.small_reward
            self.state = 0
        elif self.state < self.n - 1:   # 'forwards': go up along the chain
            reward = self.small_reward  # was zero
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.large_reward
        self._counter += 1
        done = len(self.visited) >= self.done_n
        if not done and self._counter > (self.n*10):
            done = True
            visited_per = (len(self.visited)*100.0)/self.n
            print(f'Stopping after {self.n*10} iterations. Visited {visited_per:6.2f}% of the states.')
        return self.state, reward, done, {}

test_chain_env_behavior(ShapedChainEnvVisited)

In [None]:
trainer = do_training(ShapedChainEnvVisited, config=trainer_config, iterations=20)

In [None]:
env = ShapedChainEnvVisited({})

state = env.reset()
done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print(f'Cumulative reward you received is: {cumulative_reward}!')
print(f'Max state you visited is: {max_state}. (There are {env.n} states.)')

desired = env.done_percentage
actual = (max_state+1)/env.n  # add one because of zero indexing

print(f"This policy traversed {actual*100:4.1f}% of the available states.")
assert actual >= desired, f"{actual*100:4.1f}% is less than the desired percentage of {desired*100:4.1f}%."

Try using a larger percentage (you'll have to modify `ShapedChainEnvVisited` directly).

In [None]:
ray.shutdown()