## Encoding Observations

In [1]:
# HIDDEN
import ray
import logging
ray.init(log_to_driver=False, ignore_reinit_error=True, logging_level=logging.ERROR); # logging.FATAL

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#### Review: what is a policy?

- In RL we're trying to learn a policy. What is that again, exactly?
- A policy maps **observations** to **actions**.
- In other words, the observations are all the policy "sees".

#### Random lake policies

- What are the observations in the random lake?
- They are the player's location, represented as an integer from 0 to 15.  
- As a refresher from module 1, a deterministic policy might look like this:

| Observation | Action |
|------|-------|
| 0   |   0  |
| 1   |   3  |
| 2   |   1  |
| 3   |   1  |
| ... | ... |
| 14 |  2  |
| 15 | 2   |

#### Random lake policies

And a non-deterministic policy might look like this:

| Observation | P(left) | P(down) | P(right) | P(up) |
|-------------|---------|---------|----------|-------|
| 0   | 0.9  | 0.01 | 0.04 | 0.05 |
| 1   | 0.05 | 0.05 | 0.05 | 0.85 |
| ... | ...  | ...  | ...  | ...  |
| 15  | 0.0  | 0.0  | 0.99 | 0.01 |

This does not mean RLlib learns such a table, by the way, but we can think of this table conceptually.
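To make the two kinds of table concrete, here is a minimal sketch in plain Python. It is purely conceptual (as noted above, RLlib does not learn such tables), it only contains the rows shown in the tables, and the helper names `act_deterministic` and `act_stochastic` are made up for illustration.

```python
import random

ACTIONS = ["left", "down", "right", "up"]  # action indices 0, 1, 2, 3

# Only the rows shown in the tables above; the elided rows are left out here too.
deterministic_policy = {0: 0, 1: 3, 2: 1, 3: 1, 14: 2, 15: 2}
stochastic_policy = {
    0: [0.9, 0.01, 0.04, 0.05],
    1: [0.05, 0.05, 0.05, 0.85],
    15: [0.0, 0.0, 0.99, 0.01],
}

def act_deterministic(obs):
    """Look up the single action for this observation."""
    return deterministic_policy[obs]

def act_stochastic(obs, rng):
    """Sample an action index according to the row of probabilities."""
    return rng.choices(range(4), weights=stochastic_policy[obs], k=1)[0]

rng = random.Random(0)
print(ACTIONS[act_deterministic(1)])  # up

# Sampling many times from observation 0 should mostly give "left".
samples = [act_stochastic(0, rng) for _ in range(1000)]
print({ACTIONS[a]: samples.count(a) for a in range(4)})
```

The deterministic policy is just a lookup, while the non-deterministic policy returns different actions on different calls, in proportion to the probabilities in its row.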

#### Random lake policies

- In the random lake, our entire decision must be based on the player's position.
- Sometimes this is sufficient: from position 11, you should go down.

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

- But what about position 5? What should you do from there?
- Answer: _it depends_. If there's a hole at position 9, you don't want to go down. Likewise with 6. 
- How can I decide _without knowing where the holes are_?

#### State vs. observation, a recap

- In Module 1, we defined the state informally as everything about the environment.
- Here, that would include the location of the player and the holes.
- The observation, on the other hand, only encodes part of the state: in this case, the location of the player.

#### Observation = state? Problem 1.

- OK then, why not just set the observation to the state? 
- There are two issues here.
- Problem 1: When the RL system is deployed, you may not know the whole state.
  - Example: in a recommender system, the agent (recommender) does not have access to the user's mood (part of the state that affects outcomes)
  - In supervised learning, we don't want to train on features that we won't have access to in deployment
    - Likewise here, the observation needs to be something we can access in deployment.

#### Observation = state? Problem 2.

- Problem 2: It can be hard to generalize from a really complex observation.
  - There are hundreds of thousands of possible states in just this small 4x4 random lake game.
  - Too much information could confuse the agent, or require unreasonable amounts of data (simulations) to make sense of.
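As a rough sanity check on the "hundreds of thousands" figure, here is a back-of-the-envelope count. It rests on assumptions that are not taken from the `RandomLake` source: every square other than the start and the goal may independently be a hole, and the player may be on any of the 16 squares.

```python
# Rough count under the assumptions stated above (not taken from the RandomLake source).
n_squares = 4 * 4
n_free_squares = n_squares - 2          # every square except the start and the goal
n_hole_layouts = 2 ** n_free_squares    # each free square is either ice or a hole
n_player_positions = n_squares
n_states = n_hole_layouts * n_player_positions

print(n_hole_layouts, n_states)  # 16384 262144
```

The exact number depends on how the lake is generated, but under these assumptions the count is in the hundreds of thousands.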

#### Encoding observations

- Part of our job as the RL practitioner is to pick a representation (or encoding) for the observation.
- From the information the player is allowed to know, find a useful representation of what the player needs to know.
- In our case, we'll try one approach: the player gets to "see" whether the 4 spaces adjacent are holes or not.
- We'll encode this as 4 binary numbers.

#### Encoding observations

```
.OO.
....
O.P.
...G
```

- In this situation, there are no holes around the player, so the player "sees" `[0 0 0 0]`. 
- In other words, the observation here is `[0 0 0 0]`.

#### Encoding observations

```
.OO.
..P.
O.O.
...G
```

- Here, the player "sees" holes up and down, so the observation is `[0 1 0 1]` (left, down, right, up)

#### Encoding observations

What about edges?

```
....
..OP
O.OO
...G
```

- This is our choice as we design the observation space.
- I'll choose to represent "off the grid" as holes, meaning we pretend the lake looks like this:
 
```
OOOOOO
O....O
O..OPO
OO.OOO
O...GO
OOOOOO
```

- Here, the player sees holes left, down and right, so the observation is `[1 1 1 0]` (left, down, right, up)
- There might be better approaches though, because falling into a hole is worse (episode ends) than walking off the edge (nothing happens).
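The encoding in these examples can be sketched as a small standalone function that works directly on the text pictures. This is an illustration only: `encode_neighbors` is a hypothetical helper, not part of the course's `RandomLake` class. It builds the padded picture by surrounding the lake with a border of `O` characters.

```python
def encode_neighbors(grid_rows):
    """Return [left, down, right, up] flags for the player 'P' in a text grid.

    A flag is 1 if that neighbour is a hole 'O' or lies off the grid, else 0.
    """
    width = len(grid_rows[0])
    border = "O" * (width + 2)
    # Surround the lake with "holes", exactly like the padded picture.
    padded = [border] + ["O" + row + "O" for row in grid_rows] + [border]

    # Locate the player in the padded grid.
    for i, row in enumerate(padded):
        j = row.find("P")
        if j != -1:
            break
    else:
        raise ValueError("no player 'P' found in the grid")

    neighbors = [padded[i][j - 1], padded[i + 1][j], padded[i][j + 1], padded[i - 1][j]]
    return [int(cell == "O") for cell in neighbors]

print(encode_neighbors([".OO.", "....", "O.P.", "...G"]))  # [0, 0, 0, 0]
print(encode_neighbors([".OO.", "..P.", "O.O.", "...G"]))  # [0, 1, 0, 1]
print(encode_neighbors(["....", "..OP", "O.OO", "...G"]))  # [1, 1, 1, 0]
```

The three calls reproduce the three observations from the examples, including the edge case.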

#### Coding up our observations

- Now that we have a plan, how do we modify the code?
- Since we structured our class to have an `observation` method, that's all we need to modify:

In [2]:
import numpy as np
from envs_03 import RandomLake

class RandomLakeObs(RandomLake):
    def observation(self):
        i, j = self.player

        obs = []
        obs.append(1 if j==0 else self.holes[i,j-1]) # left
        obs.append(1 if i==3 else self.holes[i+1,j]) # down
        obs.append(1 if j==3 else self.holes[i,j+1]) # right
        obs.append(1 if i==0 else self.holes[i-1,j]) # up
        
        obs = np.array(obs, dtype=int) # cast to numpy array (optional)
        return obs

- The code creates an `obs` variable where each entry is 1 if that direction leads off the edge **or** a hole is present there.

In [3]:
# HIDDEN
import gym

#### Coding up our observations

- One more code change is needed: the constructor, where the observation space is defined.
- Our observations were previously an integer from 0 to 15, so we used

In [4]:
observation_space = gym.spaces.Discrete(16)

And likewise for actions:

In [5]:
action_space = gym.spaces.Discrete(4)      

- However, our observations are now arrays of 4 numbers rather than a single number.
- To indicate this, we use `gym.spaces.MultiDiscrete` instead of `gym.spaces.Discrete`.
- Multi, because we have multiple numbers, but still discrete, because each of the 4 numbers can only take on 2 possible values (0 or 1).
- Here's the code:

In [6]:
class RandomLakeObs(RandomLake):
    def __init__(self, env_config=None):
        self.observation_space = gym.spaces.MultiDiscrete([2,2,2,2])
        self.action_space = gym.spaces.Discrete(4)      

(Note that `gym` also has a `MultiBinary` space type, but this is not currently supported by RLlib.)
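To get a feel for what `MultiDiscrete([2,2,2,2])` allows, the sketch below enumerates every valid observation vector using only the standard library (no `gym` required). The variable names are illustrative.

```python
from itertools import product

nvec = [2, 2, 2, 2]  # the same list passed to MultiDiscrete in the constructor above

# Every combination of one value per entry, each entry ranging over 0..n-1.
all_observations = list(product(*(range(n) for n in nvec)))

print(len(all_observations))  # 16
print(all_observations[:3])   # [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)]
```

There happen to be 16 possible observations here as well, the same count as `Discrete(16)`, but they now describe the player's surroundings rather than the player's position.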

#### Testing out our new env

Let's test it out!

In [7]:
# HIDDEN
import numpy as np
np.random.seed(42)

In [8]:
from envs_03 import RandomLakeObs

env = RandomLakeObs()
env.reset()

array([1, 1, 0, 1])

In [9]:
env.render()

🧑🧊🧊🧊
🕳🕳🕳🧊
🧊🧊🕳🧊
🧊🧊🕳⛳️


Here, we see the expected observation indicating "holes" to the left, down, and up.

Notes: 

The left and up entries are map edges, and the down entry is an actual hole.

#### Testing out our new env

Let's try stepping right:

In [10]:
env.step(2)

(array([0, 1, 0, 1]), 0, False, {'player': (0, 1), 'goal': (3, 3)})

In [11]:
env.render()

🧊🧑🧊🧊
🕳🕳🕳🧊
🧊🧊🕳🧊
🧊🧊🕳⛳️


Now we see holes in the down and up directions, again as expected.

#### Training with our new observations

- Our new observations seem to be working, but do they help the agent learn?
- Recall that with our `Discrete(16)` observation space we were not able to get much more than a 30% success rate.
- Let's try again:

In [12]:
# HIDDEN
from utils_03 import lake_default_config

In [13]:
ppo = lake_default_config.build(env=RandomLakeObs)

for i in range(8):
    ppo.train()

In [14]:
ppo.evaluate()["evaluation"]["episode_reward_mean"]

0.6420454545454546

- This is way better than the ~30% we were getting before!
- Which makes sense... our agent can "see" the holes now, instead of walking blindly.

#### Let's apply what we learned!

## Supervised learning analogy: observation space
<!-- multiple choice -->

In the slides we changed the observation space for our agent and, as a result, achieved higher rewards. What aspect of the supervised learning process is this most analogous to?

- [x] Feature engineering | You got it! Our observation space acts as the feature space for our policy to act on.
- [ ] Model selection | Not quite. But, as we'll see, there is a place for model selection in RL as well!
- [ ] Hyperparameter tuning | Not quite. But, as we'll see, there is a place for hyperparameter tuning in RL as well!
- [ ] Selecting a loss function

## Including the player's location
<!-- multiple choice -->

In our new observation representation we actually _removed_ the player's location from the observation and _only_ include the presence of nearby holes. If we wanted an observation space that included both the nearby holes _and_ the player's location, which of the following gym spaces could we use?

- [ ] `gym.spaces.Discrete(5)` | Try again!
- [x] `gym.spaces.MultiDiscrete([2,2,2,2,16])` | Yes! The first 4 numbers represent the holes, and the last number represents the player's location.
- [ ] `gym.spaces.MultiDiscrete([2,2,2,2]) + gym.spaces.Discrete(16)` | Try again; unfortunately we can't add gym spaces.
- [ ] `gym.spaces.MultiDiscrete([32,32,32,32])` | This could be made to work, but is a confusing/redundant representation.
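As an illustration of the 5-number representation, here is a minimal sketch of how such an observation could be assembled. `combined_observation` is a hypothetical helper, and it assumes the player's location is numbered row by row from 0 to 15, as in the grid picture from the slides.

```python
N_COLS = 4  # width of the lake; positions are assumed to be numbered row by row

def combined_observation(hole_flags, player):
    """Hypothetical helper: append the flattened player position to the 4 hole flags."""
    i, j = player
    return list(hole_flags) + [i * N_COLS + j]

nvec = [2, 2, 2, 2, 16]  # number of possible values for each entry
obs = combined_observation([0, 1, 0, 1], (1, 2))
print(obs)  # [0, 1, 0, 1, 6]

# Every entry must be a valid value for its slot in the space.
assert all(0 <= value < n for value, n in zip(obs, nvec))
```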

## Handling the edges
<!-- multiple choice -->

In the slides we decided to treat edges as holes. Recall this picture:

```
OOOOOO
O....O
O..OPO
OO.OOO
O...GO
OOOOOO
```

However, edges and holes are actually different from each other: walking into an edge does nothing, whereas walking into a hole causes the episode to end. This might be an important distinction, especially in a "slippery" version of the environment where the results of actions are non-deterministic. 

To address this issue, we decide to change the observation space. The agent still only "sees" the four squares around it, but now it sees whether each square is an empty space, hole, or edge. For this representation, which of the following gym observation spaces could we use?

- [ ] `gym.spaces.MultiDiscrete([2,2,2,2,2,2,2,2])` | Try again. Remember, the agent still only "sees" 4 squares.
- [ ] `gym.spaces.MultiDiscrete([3,3,3,3,3,3,3,3])` | Try again!
- [ ] `gym.spaces.MultiDiscrete([2,2,2,2])` | This is the same as the previous space, but we've made a change.
- [x] `gym.spaces.MultiDiscrete([3,3,3,3])` | You got it! There are now 3 possible options for what the agent can "see" at each square.

In [15]:
# TODO / note to self
# query_policy(trainer, RandomLakeObs(), [1,1,1,1])
# shows that it wants to go up. this is because the above "hole" is probably an edge based on its learning. fascinating.

## Implementing the edges
<!-- coding exercise -->

The code below shows the `observation` function for the current observation space. Modify the code so that it uses the new observation space, where 0 represents an empty space, 1 represents a hole, and 2 represents an edge.

In [None]:
# EXERCISE
import numpy as np
from envs_03 import RandomLake, RandomLakeObs

class RandomLakeObs2(RandomLakeObs):
    def observation(self):
        i, j = self.player

        obs = []
        obs.append(1 if j==0 else self.holes[i,j-1]) # left
        obs.append(1 if i==3 else self.holes[i+1,j]) # down
        obs.append(1 if j==3 else self.holes[i,j+1]) # right
        obs.append(1 if i==0 else self.holes[i-1,j]) # up
        
        obs = np.array(obs, dtype=int) # cast to numpy array
        return obs

np.random.seed(42)
env = RandomLakeObs2()
obs = env.reset()
env.render()
print(obs)

In [3]:
# SOLUTION
import numpy as np
from envs_03 import RandomLake, RandomLakeObs

class RandomLakeObs2(RandomLakeObs):
    def observation(self):
        i, j = self.player

        obs = []
        obs.append(2 if j==0 else self.holes[i,j-1]) # left
        obs.append(2 if i==3 else self.holes[i+1,j]) # down
        obs.append(2 if j==3 else self.holes[i,j+1]) # right
        obs.append(2 if i==0 else self.holes[i-1,j]) # up
        
        obs = np.array(obs, dtype=int) # cast to numpy array
        return obs

np.random.seed(42)
env = RandomLakeObs2()
obs = env.reset()
env.render()
print(obs)

🧑🧊🧊🧊
🕳🕳🕳🧊
🧊🧊🕳🧊
🧊🧊🕳⛳️
[2 1 0 2]


## What the agent sees
<!-- coding exercise -->

With our new observation space encoding, the agent only "sees" the 4 spaces around it and only has this information available to make its decisions. The code cell below creates a rendering of what the agent "sees" while navigating the random lake. You can enter actions with the keyboard by typing in the words "left", "down", "right" or "up" (or "l", "d", "r", "u" for short) and the simulation will show you the result. (Type "quit" to exit.) Play the game until you reach the goal. As you go, try to map out the lake (perhaps by drawing on a piece of paper).

In [18]:
# TODO / NOTE:
# THIS EXERCISE DOES NOT HAVE A "solution"
# the code is here ONLY to help them answer the multiple choice

In [None]:
# EXERCISE

import numpy as np
from envs_03 import RandomLakeObs

actions = {"left" : 0, "down" : 1, "right" : 2, "up" : 3, 
           "l" : 0, "d" : 1, "r" : 2, "u" : 3}

np.random.seed(45)
env = RandomLakeObs()
obs = env.reset()

act = "start"
done = False

while not done:
   
    obs_print = [['.']*3 for i in range(3)]
    obs_print[1][1] = "P"
    if obs[0]:
        obs_print[1][0] = "O"
    if obs[1]:
        obs_print[2][1] = "O"
    if obs[2]:
        obs_print[1][2] = "O"
    if obs[3]:
        obs_print[0][1] = "O"
    print("Observation:")
    print("\n".join(list(map(lambda c: "".join(c), obs_print))))
    print()
    
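    act = input()  # gather keyboard input
    while act != "quit" and act not in actions:
        act = input()  # ask again until the input is a known action or "quit"
    
    if act == "quit":
        break
        
    obs, rew, done, _ = env.step(actions[act])  # pass the integer action to the env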
    
if done:
    if rew > 0:
        print("You win! +1 reward 🎉")
    else:
        print("You fell into the lake 😢")

Observation:
.O.
OP.
...



In [None]:
# SOLUTION

import numpy as np
from envs_03 import RandomLakeObs

actions = {"left" : 0, "down" : 1, "right" : 2, "up" : 3, 
           "l" : 0, "d" : 1, "r" : 2, "u" : 3}

np.random.seed(45)
env = RandomLakeObs()
obs = env.reset()

act = "start"
done = False

while not done:
   
    obs_print = [['.']*3 for i in range(3)]
    obs_print[1][1] = "P"
    if obs[0]:
        obs_print[1][0] = "O"
    if obs[1]:
        obs_print[2][1] = "O"
    if obs[2]:
        obs_print[1][2] = "O"
    if obs[3]:
        obs_print[0][1] = "O"
    print("Observation:")
    print("\n".join(list(map(lambda c: "".join(c), obs_print))))
    print()
    
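    act = input()  # gather keyboard input
    while act != "quit" and act not in actions:
        act = input()  # ask again until the input is a known action or "quit"
    
    if act == "quit":
        break
        
    obs, rew, done, _ = env.step(actions[act])  # pass the integer action to the env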
    
if done:
    if rew > 0:
        print("You win! +1 reward 🎉")
    else:
        print("You fell into the lake 😢")

#### What does the lake look like?

Based on your explorations, which is the correct map of the lake from the exercise above?

```
 (A)      (B)      (C)      (D)
P..O     P.OO     P..O     P.OO
..OO     .OOO     ..OO     .OOO
O...     O...     O...     O..O
...G     ...G     ..OG     ...G
```

- [x] (A)
- [ ] (B)
- [ ] (C)
- [ ] (D)

In [None]:
# TODO
# could also consider showing a BAD environment encoding to contrast with this reasonable one, as in the next slide deck!