## Encoding Rewards

In [1]:
# HIDDEN
import ray
import logging
ray.init(log_to_driver=False, ignore_reinit_error=True, logging_level=logging.ERROR); # logging.FATAL

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#### Encoding Rewards

- We've now discussed the importance of encoding the observations.
- We may also have some choice on the action space, though here (and often) it is relatively clear/fixed.
- But what about the rewards? 

#### Current set-up

- Currently, we get a reward of +1 for reaching the goal. 
- This is part of what makes RL so hard (and impressive):
  - We want to learn about actions even though we don't know right away whether the action was "good". 
  - Contrast this with supervised learning, where every prediction we make on the training data can immediately be compared with the known target value.


#### Agents can't just be greedy

- Can agents simply learn to go for the best immediate reward?
- No. For example, in a video recommendation system, showing the user another funny cat video might make them click (high immediate reward) but result in long-term loss of interest in the service (low long-term reward).
- Our Frozen Lake is another example of the problem here: sometimes there is no immediate reward at all to learn from.

In [2]:
# TODO: perhaps this next section on "Learned action probabilities" could be moved much earlier, even as early as Module 1

#### Learned action probabilities

Let's load the trained model with our encoded observations:

In [3]:
from envs import RandomLakeObs
from ray.rllib.algorithms.ppo import PPOConfig

ppo_config = (
    PPOConfig()\
    .framework("torch")\
    .rollouts(create_env_on_local_worker=True, horizon=100)\
    .debugging(seed=0, log_level="ERROR")
)
ppo_RandomLakeObs = ppo_config.build(env=RandomLakeObs)

In [4]:
# # HIDDEN

# for i in range(8):
#     ppo_RandomLakeObs.train()
    
# print(ppo_RandomLakeObs.evaluate()["evaluation"]["episode_reward_mean"])

# ppo_RandomLakeObs.save("models/RandomLakeObs-Ray2")

In [5]:
ppo_RandomLakeObs.restore("models/RandomLakeObs-Ray2/checkpoint_000008")

#### Learned action probabilities

We'll use the `query_policy` function from Module 2:

In [6]:
from utils import query_policy
query_policy(ppo_RandomLakeObs, RandomLakeObs(), [0,0,0,0])

array([0.02463268, 0.45315894, 0.49981508, 0.02239344], dtype=float32)

- Recall the (left, down, right, up) ordering.
- When the observation is `[0 0 0 0]` (no holes or edges in sight), the agent prefers to go down and right.

What if there's a hole below you? We can feed in a different observation to the policy:

In [7]:
query_policy(ppo_RandomLakeObs, RandomLakeObs(), [0,1,0,0])

array([0.03616549, 0.03877434, 0.9037791 , 0.02128115], dtype=float32)

- Now the agent is very unlikely to go down, and very likely to go right!
- Again, all this was learned from trial and error, with a reward earned only when the goal was reached.
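
To make the two probability vectors easier to read, here is a minimal sketch that labels them with action names and samples one action. It only reuses the numbers printed above and the (left, down, right, up) ordering; `ranked` and `sample_action` are hypothetical helpers, not part of `utils` or RLlib.

```python
import random

# Action order as stated in the text: (left, down, right, up).
ACTION_NAMES = ["left", "down", "right", "up"]

# Probabilities copied from the two query_policy outputs above.
probs_clear = [0.02463268, 0.45315894, 0.49981508, 0.02239344]
probs_hole_below = [0.03616549, 0.03877434, 0.9037791, 0.02128115]

def ranked(probs):
    """Pair each action name with its probability, most likely first."""
    return sorted(zip(ACTION_NAMES, probs), key=lambda pair: pair[1], reverse=True)

def sample_action(probs, rng):
    """Draw one action name at random, weighted by the policy's probabilities."""
    return rng.choices(ACTION_NAMES, weights=probs, k=1)[0]

print(ranked(probs_clear))       # "right" and "down" lead, nearly tied
print(ranked(probs_hole_below))  # "right" dominates, "down" has collapsed

rng = random.Random(0)
print(sample_action(probs_hole_below, rng))
```

The policy is stochastic: even with a hole below, "down" keeps a small probability, so the agent will still occasionally try it.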

#### Random Lake rewards

- In the Random Lake example, can't we make life easier for the agent by giving immediate rewards?

This is the current reward code:

In [8]:
def reward(self):
    return int(self.player == self.goal)

- The agent has to learn, through trial and error over _entire episodes_, that moving down and right is generally a good thing. 

#### Redefining rewards

- Let's instead try giving a reward _at every step, that is higher as the agent gets closer to the goal_. 

In [9]:
from envs import RandomLakeObs

class RandomLakeObsRew(RandomLakeObs):
    def reward(self):
        return 6-(abs(self.player[0]-self.goal[0]) + abs(self.player[1]-self.goal[1]))

- The above method uses the [Manhattan Distance](https://en.wikipedia.org/wiki/Taxicab_geometry) between the player and the goal as the reward. 
- When the agent reaches the goal, the maximum reward of 6 is achieved.
- When the agent is furthest from the goal, the minimum reward of 0 is given.
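
As a quick check of the two extremes, the sketch below evaluates the same formula for every cell of a 4x4 lake. It assumes positions are `(row, column)` tuples and that the goal sits at `(3, 3)`; the standalone function `manhattan_reward` is a hypothetical stand-in for the `reward` method.

```python
def manhattan_reward(player, goal, max_distance=6):
    """Shaped reward: max_distance minus the Manhattan distance to the goal."""
    distance = abs(player[0] - goal[0]) + abs(player[1] - goal[1])
    return max_distance - distance

GOAL = (3, 3)  # assumed: bottom-right corner of a 4x4 grid

# Reward the agent would receive in each cell, indexed as [row][column].
reward_grid = [[manhattan_reward((r, c), GOAL) for c in range(4)] for r in range(4)]
for row in reward_grid:
    print(row)
# [0, 1, 2, 3]
# [1, 2, 3, 4]
# [2, 3, 4, 5]
# [3, 4, 5, 6]
```

Every move down or right raises the reward by one, so the agent gets feedback at each step instead of only at the goal.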

#### Redefining rewards

In [10]:
env = RandomLakeObsRew()
env.reset()
env.render()

🧑🧊🧊🧊
🧊🧊🧊🧊
🕳🧊🕳🧊
🧊🧊🧊⛳️


In [11]:
env.reward()

0

⬆️ the reward is 0

#### Redefining rewards

In [12]:
obs, rew, done, _ = env.step(1)
env.render()

🧊🧊🧊🧊
🧑🧊🧊🧊
🕳🧊🕳🧊
🧊🧊🧊⛳️


In [13]:
rew

1

⬆️ the reward is 1 because we moved closer to the goal

#### Redefining rewards

In [14]:
obs, rew, done, _ = env.step(2)
obs, rew, done, _ = env.step(2)
obs, rew, done, _ = env.step(2)
obs, rew, done, _ = env.step(1)
env.render()

🧊🧊🧊🧊
🧊🧊🧊🧊
🕳🧊🕳🧑
🧊🧊🧊⛳️


In [15]:
rew

5

Now, the reward is 5.

#### Redefining rewards

In [16]:
obs, rew, done, _ = env.step(1)
env.render()

🧊🧊🧊🧊
🧊🧊🧊🧊
🕳🧊🕳🧊
🧊🧊🧊🧑


In [17]:
rew

6

Now, the reward is 6.

#### Comparing rewards

- So, we have two possible reward functions. Which one works better? 
- Recall that last time, after training for 8 iterations, we were able to reach the goal around 70% of the time:

In [18]:
ppo_RandomLakeObs.evaluate()['evaluation']['episode_reward_mean']

0.7247706422018348

#### Comparing rewards

Let's train with the new reward function!

In [19]:
ppo_RandomLakeObsRew = ppo_config.build(env=RandomLakeObsRew)

In [20]:
for i in range(8):
    ppo_RandomLakeObsRew.train()

In [21]:
ppo_RandomLakeObsRew.evaluate()['evaluation']['episode_reward_mean']

101.90566037735849

Wait a minute, what's going on here??

#### Comparing rewards?

- We tried to improve our RL system by shaping the reward function.
- This (presumably) affected training, but it also affected our evaluation.
- In supervised learning, this is like changing the scoring metric from squared error to absolute error.
- If the old system got a mean squared error of 20,000 and the new system got a mean absolute error of 40, which is better?
- We're comparing apples and oranges here!
- We want to compare both models on the same metric, for example the original metric. 
- Here, we want to see how frequently the agent reaches the goal.
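
The supervised learning analogy can be made concrete with a small sketch. The targets and predictions below are made-up numbers chosen for illustration, and the two metric functions are plain-Python stand-ins rather than any library's API.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions from two models.
y_true = [100, 200, 300, 400]
model_a = [110, 190, 310, 390]  # off by 10 everywhere
model_b = [100, 200, 300, 440]  # perfect except for one miss of 40

# Apples and oranges: a different metric for each model.
print(mean_squared_error(y_true, model_a))   # 100.0
print(mean_absolute_error(y_true, model_b))  # 10.0

# Same metric for both models.
print(mean_absolute_error(y_true, model_a))  # 10.0
print(mean_squared_error(y_true, model_b))   # 400.0
```

Reading 100 against 10 suggests model B is far better, yet on absolute error the two tie and on squared error model A wins. The comparison only means something once both are scored the same way.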

#### Comparing rewards?

- The code here is a bit more advanced.
- It is included for completeness, but we won't go into detail.

In [22]:
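# In Ray 2.x, DefaultCallbacks lives under ray.rllib.algorithms (ray.rllib.agents is the deprecated path)
from ray.rllib.algorithms.callbacks import DefaultCallbacks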

class MyCallbacks(DefaultCallbacks):
    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        info = episode.last_info_for()
        episode.custom_metrics["goal_reached"] = info["player"] == info["goal"]

In [23]:
ppo_config_callback = (
    PPOConfig()\
    .framework("torch")\
    .rollouts(create_env_on_local_worker=True, horizon=100)\
    .debugging(seed=0, log_level="ERROR")\
    .callbacks(callbacks_class=MyCallbacks)\
    .evaluation(evaluation_config={"callbacks" : MyCallbacks})
)

ppo_RandomLakeObsRew = ppo_config_callback.build(env=RandomLakeObsRew)

The trainer above uses our new reward scheme but also reports/measures the rate of reaching the goal.

#### Comparing rewards?

Let's try it out!

In [24]:
for i in range(8):
    ppo_RandomLakeObsRew.train()

In [25]:
# HIDDEN
ppo_RandomLakeObsRew.evaluate()["evaluation"]["episode_reward_mean"]

101.90566037735849

In [26]:
ppo_RandomLakeObsRew.evaluate()["evaluation"]["custom_metrics"]["goal_reached_mean"]

0.04081632653061224

- Hmm, these results are terrible!
- We used to get a 70%+ win rate, and now we're close to zero.
- What happened? 🤔

#### What is the agent really optimizing?

- The agent is really optimizing _discounted total reward_.
- _Total_: it values all the rewards it collects, not just the final reward.
- _Discounted_: it values earlier rewards more than later ones.
- Our agent is successfully maximizing discounted total reward, but this doesn't correspond to reaching the goal.
- But why? The goal gives a higher reward.
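
For reference, discounted total reward can be written out as follows. The notation is an assumption introduced here, not taken from the course code: $r_t$ is the reward received at step $t$, $T$ is the last step of the episode, and $\gamma \in (0, 1]$ is the discount factor.

$$
G = \sum_{t=0}^{T} \gamma^{t} r_t = r_0 + \gamma r_1 + \gamma^{2} r_2 + \dots + \gamma^{T} r_T
$$

With the original reward, every $r_t$ is 0 except a final 1 when the goal is reached, so $G = \gamma^{T}$. For $\gamma < 1$, reaching the goal sooner gives a larger $G$. Once every step pays a reward, all the terms in the sum matter, not just the last one.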

#### Exploration vs. exploitation

- A fundamental concept in RL is _exploration vs. exploitation_
- When the agent is learning the policy, it can choose to either:

1. Do things that it knows are pretty good ("exploit")
2. Try something totally new and crazy, just in case ("explore")
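
One simple way to see the tradeoff is an epsilon-greedy rule on a toy multi-armed bandit, sketched below. This is a swapped-in illustration, not how PPO explores (PPO samples actions from its stochastic policy); the payout probabilities and the helpers `epsilon_greedy` and `run_bandit` are hypothetical.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng):
    """With probability epsilon pick a random action, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))  # explore
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploit

def run_bandit(true_means, epsilon, steps=2000, seed=0):
    """Play a toy bandit and return the average reward collected per step."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # running mean reward per action
    counts = [0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_means[action] else 0.0
        counts[action] += 1
        # Incremental update of the running mean for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return total / steps

true_means = [0.2, 0.5, 0.8]  # made-up payout probabilities; the last arm is best

print(run_bandit(true_means, epsilon=0.0))  # pure exploitation
print(run_bandit(true_means, epsilon=0.1))  # mostly exploit, sometimes explore
```

With `epsilon=0.0` the agent never leaves the first arm, because ties go to the lowest index and its estimate never drops below the untried arms' zero. A little exploration lets it discover the better arms.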

In [27]:
# TODO 
# diagram for this?

#### Exploration vs. exploitation

- With the old reward structure, the agent gets a reward of 0 unless it reaches the goal.
  - So, it keeps trying to find something better.
- With the new reward structure, the agent is getting lots of reward just for walking around.
  - It isn't very motivated to explore the environment.
- In fact, because it is maximizing discounted **total** reward, finding the goal is a bad thing!
  - This causes the episode to end, limiting the total reward of the agent.
  - The agent actually learns to _avoid_ the goal, especially early on in the episode.
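
A small calculation shows why ending the episode is costly under the distance-based reward. The sketch assumes a discount factor of 0.99 and uses a hypothetical "loitering" trajectory; only the rewards 1 through 6 along the direct path and the 100-step horizon come from the material above.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma**t for its step index t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Direct path from the walkthrough: each step moves one cell closer,
# so the shaped rewards are 1, 2, ..., 6 and the episode ends at the goal.
direct_path = [1, 2, 3, 4, 5, 6]

# Hypothetical loitering path: walk to a cell next to the goal (reward 5),
# then step back and forth between reward-4 and reward-5 cells
# until the 100-step horizon cuts the episode off.
loiter_path = [1, 2, 3, 4, 5] + [4, 5] * 47 + [4]

print(len(direct_path), sum(direct_path))  # 6 21
print(len(loiter_path), sum(loiter_path))  # 100 442
print(discounted_return(direct_path))
print(discounted_return(loiter_path))
```

Even after discounting, hovering near the goal collects far more reward than stepping onto it, which is consistent with the high mean reward and low success rate seen above.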

#### Designing a better reward structure

- Instead, let's try penalizing the agent when it walks into a hole or off the edge.
- It will be easier to implement this directly in `step`:

In [28]:
class RandomLakeObsRew2(RandomLakeObs):
    def step(self, action):
        # (not shown) existing code gets new_loc, where the player is trying to go
        
        reward = 0
        
        if self.is_valid_loc(new_loc):
            self.player = new_loc
        else:
            reward -= 0.1 # small penalty
            
        if self.holes[self.player]:
            reward -= 0.1 # small penalty
            
        if self.player == self.goal:
            reward += 1
        
        # Return observation/reward/done
        return self.observation(), reward, self.done(), {"player" : self.player, "goal" : self.goal}

In [29]:
# HIDDEN
from envs import RandomLakeObsRew2

#### Testing it out, again

In [30]:
# HIDDEN
# redefine ppo_RandomLakeObs to include the new callbacks
# so that you can measure the custom metric instead of the reward
# they will give the same value but this is better for consistency
ppo_RandomLakeObs = ppo_config_callback.build(env=RandomLakeObs)

for i in range(8):
    ppo_RandomLakeObs.train()

In [31]:
ppo_RandomLakeObsRew2 = ppo_config_callback.build(env=RandomLakeObsRew2)

In [32]:
for i in range(8):
    ppo_RandomLakeObsRew2.train()

In [33]:
ppo_RandomLakeObs.evaluate()["evaluation"]["custom_metrics"]["goal_reached_mean"]

0.6853448275862069

In [34]:
ppo_RandomLakeObsRew2.evaluate()["evaluation"]["custom_metrics"]["goal_reached_mean"]

0.734982332155477

It looks like, this time, the two methods perform much more similarly.

#### Episode length

- In addition to the success rate, we can compute other statistics of the agent's behavior.
- One interesting measure is episode length.
- RLlib records this by default, so we can easily access it:

In [35]:
ppo_RandomLakeObs.evaluate()["evaluation"]["episode_len_mean"]

8.585470085470085

In [36]:
ppo_RandomLakeObsRew2.evaluate()["evaluation"]["episode_len_mean"]

7.20216606498195

Although the two agents have similar success rates (about 69% vs. 73%), the new one tends toward shorter episodes.

Notes: 

- This is quite interesting because the agent cannot "see" the difference between holes and edges.
- We could explore this further by adding more custom metrics, e.g. number of bumps into the edge.

In [37]:
# TODO
#### disadvantages - loss of generality

#- now only works if goal is at bottom-right
#give a few real-world examples here -> important

#### Let's apply what we learned!

## Supervised learning analogy: reward shaping
<!-- multiple choice -->

Earlier, we made an analogy between encoding observations in RL and feature preprocessing in supervised learning. What aspect of supervised learning is the best analogy to reward shaping in RL?

- [ ] Feature engineering 
- [ ] Model selection | Not quite. But, as we'll see, there is a place for model selection in RL as well!
- [ ] Hyperparameter tuning | Not quite. But, as we'll see, there is a place for hyperparameter tuning in RL as well!
- [x] Selecting a loss function | Changing the loss function changes the "best" model, just like changing the rewards changes the "best" policy.

## Rewarding every step: small negative rewards
<!-- multiple choice -->

In RL environments like Random Lake where the agent needs to reach a specific goal, imagine we assigned a tiny negative reward for _every_ step taken by the agent. How would this generally/typically affect the amount of time the agent spends until it reaches the goal?

- [x] The agent will try to reach the goal in as few steps as possible.
- [ ] The agent will try to reach the goal in as many steps as possible. | If we're penalizing each step, taking more steps will result in less reward.
- [ ] No change. | If we're penalizing each step, taking more steps will result in less reward.

## Exploration vs. exploitation
<!-- multiple choice -->

Which of the following is a correct statement about the exploration-exploitation tradeoff in RL?

- [ ] If an agent only explores, it will never find a good policy. | It will find good policies in fact, just EXTREMELY slowly.
- [x] If an agent only exploits, it will never find a good policy. | It may just keep trying the same thing over and over again.
- [ ] Agents always find good policies even without exploration/exploitation.

## Unintended consequences
<!-- coding exercise -->

In this exercise, you will try out a bad idea: assigning a large negative reward every time the agent takes a step. We will use -1 per step. The agent still gets a reward of +1 for reaching the goal. Implement this reward, train the agent, and look at the average episode length printed out by the code. Compare this to the average episode length of an agent that just acts randomly. Then, answer the multiple choice question about the agent's behavior. What do you think is going on here? 

(FYI: as discussed previously, this type of change to an environment can also be achieved with gym wrappers.)

In [38]:
# EXERCISE
from utils import lake_default_config
from envs import RandomLakeObs

class RandomLakeBadIdea(RandomLakeObs):
    def reward(self):
        old_reward = int(self.player == self.goal) 
        return ____
    
ppo = lake_default_config.build(env=____)

for i in range(8):
    print(i)
    ppo.train()
    
print("Average episode length for trained agent: %.1f" % 
      ppo.evaluate()["evaluation"][____])

random_agent_config = (
    lake_default_config\
    .exploration(exploration_config={"type": "Random"})\
    .evaluation(evaluation_config={"explore" : True})
)
random_agent = random_agent_config.build(env=RandomLakeBadIdea)

print("Average episode length for random agent: %.1f" % 
      random_agent.evaluate()["evaluation"][____])

ppo.stop()


In [1]:
# SOLUTION
from utils import lake_default_config
from envs import RandomLakeObs

class RandomLakeBadIdea(RandomLakeObs):
    def reward(self):
        old_reward = int(self.player == self.goal) 
        return old_reward - 1

ppo = lake_default_config.build(env=RandomLakeBadIdea)


for i in range(8):
    print(i)
    ppo.train()
    
print("Average episode length for trained agent: %.1f" % 
      ppo.evaluate()["evaluation"]["episode_len_mean"])

random_agent_config = (
    lake_default_config\
    .exploration(exploration_config={"type": "Random"})\
    .evaluation(evaluation_config={"explore" : True})
)
random_agent = random_agent_config.build(env=RandomLakeBadIdea)

print("Average episode length for random agent: %.1f" % 
      random_agent.evaluate()["evaluation"]["episode_len_mean"])

ppo.stop()

Average episode length for trained agent: 4.4
Average episode length for random agent: 13.5



#### Agent's behavior

When trained on an environment with a large negative reward at every step, what do you think this agent is doing that is undesirable?

- [ ] The agent stays still because it is discouraged from moving. | Try again!
- [ ] The agent is not interested in reaching the goal because the reward is comparatively small. | Try again!
- [x] The agent learns to jump into the lake as fast as it can, to avoid the negative reward of moving. | Yikes! 🥶
- [ ] The agent reaches the goal right away. | That would be desirable though!