## Encoding Observations

#### Review: what is a policy?

- In RL we're trying to learn a policy. What is that again, exactly?
- A policy maps **observations** to **actions**.
- In other words, the observations are all the policy "sees".

#### Random lake policies

- What are the observations in the random lake?
- They are the player's location, represented as an integer from 0 to 15.  
- As a refresher from module 1, a deterministic policy might look like this:

| Observation | Action |
|------|-------|
| 0   |   0  |
| 1   |   3  |
| 2   |   1  |
| 3   |   1  |
| ... | ... |
| 14 |  2  |
| 15 | 2   |

#### Random lake policies

And a non-deterministic policy might look like this:

| Observation | P(left) | P(down) | P(right) | P(up) |
|-------------|---------|---------|----------|-------|
| 0   | 0.9  | 0.01 | 0.04 | 0.05 |
| 1   | 0.05 | 0.05 | 0.05 | 0.85 |
| ... | ...  | ...  | ...  | ...  |
| 15  | 0.0  | 0.0  | 0.99 | 0.01 |

This does not mean RLlib learns such a table, by the way, but we can think this way conceptually.
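To make the two kinds of policy concrete, here is a minimal sketch that stores the rows shown in the tables as plain Python dictionaries. The names `deterministic_policy`, `stochastic_policy`, `act_deterministic` and `act_stochastic` are made up for this illustration; this is not how RLlib represents a policy internally.

```python
import random

# Action indices follow the column order of the table: left, down, right, up.
ACTIONS = ["left", "down", "right", "up"]

# Deterministic policy: one action per observation (only the rows shown in the table).
deterministic_policy = {0: 0, 1: 3, 2: 1, 3: 1, 14: 2, 15: 2}

# Stochastic policy: one probability distribution over actions per observation.
stochastic_policy = {
    0: [0.9, 0.01, 0.04, 0.05],
    1: [0.05, 0.05, 0.05, 0.85],
    15: [0.0, 0.0, 0.99, 0.01],
}

def act_deterministic(obs):
    """Look up the single action assigned to this observation."""
    return deterministic_policy[obs]

def act_stochastic(obs, rng):
    """Sample an action index according to this observation's probabilities."""
    probs = stochastic_policy[obs]
    return rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]

rng = random.Random(0)  # seeded only so the sketch is repeatable
print(ACTIONS[act_deterministic(1)])                         # up
print([ACTIONS[act_stochastic(0, rng)] for _ in range(5)])   # mostly "left"
```

In both cases the only input is the observation: the same observation always yields the same action, or the same distribution over actions.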

#### Random lake policies

- In the random lake, our entire decision must be based on the player's position.
- Sometimes this is sufficient: from position 11, you should go down.

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

- But what about position 5, what should you do from there?
- Answer: _it depends_. How can I decide _without knowing where the holes are_?

#### State vs. observation, a recap

- In Module 1, we defined the state informally as everything about the environment.
- Here, that would include the location of the player and the holes.
- The observation, on the other hand, only encodes part of the state.
- In this case, the location of the player.

#### Observation = state? Problem 1.

- OK then, why not just set the observation to the state? 
- There are two separate issues here.
- Problem 1: You may want an agent that _learns where the holes are_.
  - In reality, you may not have access to this information.
  - Example: in a recommender system, the agent (recommender) does not have access to the user's mood, which is part of the state.
 

#### Observation = state? Problem 2.

- Problem 2: It can be hard to generalize from a really complex observation.
  - There are hundreds of thousands of possible states in just this small 4x4 random lake game.
  - Too much information could be confusing to the agent.
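A back-of-the-envelope count shows where a number of that size can come from. The sketch below is only a rough upper bound, and it rests on assumptions that are not stated in the slides: the start and goal squares are never holes, every other square is independently a hole or not, and the player may stand on any square.

```python
# Rough upper bound on the number of states in a 4x4 random lake.
# ASSUMPTIONS (not from the course material): start (0) and goal (15) are
# never holes, and each of the remaining squares may or may not be a hole.
n_cells = 4 * 4
free_cells = n_cells - 2            # squares that could hold a hole
hole_layouts = 2 ** free_cells      # every on/off combination of holes
player_positions = n_cells          # the player can be on any square

upper_bound = hole_layouts * player_positions
print(hole_layouts)   # 16384
print(upper_bound)    # 262144
```

Under these assumptions the full state has a few hundred thousand possible values, compared with only 16 for the player's location alone.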

#### Encoding observations

- Part of our job as the RL practitioner is to pick a representation (or encoding) for the observation.
- What is the player allowed to know, and what does the player need to know?
- In our case, we'll try one approach: the player gets to "see" whether the 4 spaces adjacent are holes or not.
- We'll encode this as 4 binary numbers.

#### Encoding observations

```
.OO.
....
O.P.
...G
```

- In this situation, there are no holes around the player, so the player "sees" `[0 0 0 0]`. 
- In other words, the observation here is `[0 0 0 0]`.

#### Encoding observations

```
.OO.
..P.
O.O.
...G
```

- Here, the player "sees" holes up and down, so the observation is `[0 1 0 1]` (left, down, right, up)

#### Encoding observations

What about edges?

```
....
..OP
O.OO
...G
```

- This is our choice as we design the observation space.
- I'll choose to represent "off the grid" as holes, meaning we pretend the lake looks like this:
 
```
OOOOOO
O....O
O..OPO
OO.OOO
O...GO
OOOOOO
```

- There might be better approaches though, because falling into a hole is worse (episode ends) than walking off the edge (nothing happens).
- Here, the player sees holes left, down and right, so the observation is `[1 1 1 0]` (left, down, right, up)
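The "pretend the lake has a border of holes" idea can be checked on the text diagrams themselves. The following is a small sketch that works on lists of strings like the ones drawn above; `pad_with_holes` and `encode` are hypothetical helper names, separate from the course's `RandomLake` class.

```python
def pad_with_holes(rows):
    """Surround the lake with a border of 'O' so off-grid squares look like holes."""
    width = len(rows[0])
    border = "O" * (width + 2)
    return [border] + ["O" + row + "O" for row in rows] + [border]

def encode(rows):
    """Return [left, down, right, up] hole flags around the player 'P'."""
    padded = pad_with_holes(rows)
    for i, row in enumerate(padded):
        j = row.find("P")
        if j != -1:
            break
    else:
        raise ValueError("no player 'P' found in the lake")
    # Thanks to the border, every neighbor index is valid, even at the edges.
    neighbors = [padded[i][j - 1], padded[i + 1][j], padded[i][j + 1], padded[i - 1][j]]
    return [int(cell == "O") for cell in neighbors]

lake = ["....", "..OP", "O.OO", "...G"]
print("\n".join(pad_with_holes(lake)))
print(encode(lake))  # [1, 1, 1, 0]
```

Padding first means the encoding step needs no special cases for edges, at the cost of making "off the grid" indistinguishable from a real hole.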

#### Coding up our observations

- Now that we have a plan, how do we modify the code?
- Since we structured our class to have an `observation` method, that's all we need to modify:

In [1]:
import numpy as np
from envs import RandomLake

class RandomLakeObs(RandomLake):
    def observation(self):
        i, j = self.player

        obs = []
        obs.append(1 if j==0 else self.holes[i,j-1]) # left
        obs.append(1 if i==3 else self.holes[i+1,j]) # down
        obs.append(1 if j==3 else self.holes[i,j+1]) # right
        obs.append(1 if i==0 else self.holes[i-1,j]) # up
        
        obs = np.array(obs, dtype=int) # cast to numpy array (optional)
        return obs

- The code creates an `obs` variable where each entry is 1 if that direction leads off the edge **or** a hole is present there.

In [2]:
# HIDDEN
import gym

#### Coding up our observations

- One more code change is needed: the constructor, where the observation space is defined.
- Our observations were previously an integer from 0 to 15, so we used:

In [3]:
observation_space = gym.spaces.Discrete(16)

And likewise for actions:

In [4]:
action_space = gym.spaces.Discrete(4)      

- However, our observations are now arrays of 4 numbers rather than a single number.
- To indicate this, we use `gym.spaces.MultiDiscrete` instead of `gym.spaces.Discrete`.
- Multi, because we have multiple numbers, but still discrete, because each of the 4 numbers can only take on 2 possible values (0 or 1).
- Here's the code:

In [5]:
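class RandomLakeObs(RandomLake):
    def __init__(self, env_config=None):
        # Let the parent set up the lake first (assumes RandomLake.__init__ accepts env_config)
        super().__init__(env_config)
        self.observation_space = gym.spaces.MultiDiscrete([2,2,2,2])
        self.action_space = gym.spaces.Discrete(4)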

(Note that `gym` also has a `MultiBinary` space type, but this is not currently supported by RLlib.)
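To see what a space shaped like `[2,2,2,2]` contains, the sketch below enumerates every possible observation in plain Python. It deliberately does not call `gym`; `in_space` is a hypothetical helper that only mimics the idea of checking membership (right length, each entry within its own range).

```python
from itertools import product

nvec = [2, 2, 2, 2]  # same list we pass to MultiDiscrete: 4 entries, 2 values each

# Every combination of one value per entry.
all_observations = [list(obs) for obs in product(*(range(n) for n in nvec))]
print(len(all_observations))   # 16
print(all_observations[:3])    # [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

def in_space(obs, nvec):
    """Plain-Python membership check: right length, each entry in 0..n-1."""
    return len(obs) == len(nvec) and all(0 <= x < n for x, n in zip(obs, nvec))

print(in_space([0, 1, 0, 1], nvec))  # True
print(in_space([0, 2, 0, 1], nvec))  # False
```

There happen to be 16 possible observations here, the same count as `Discrete(16)`, but they mean something entirely different: hole patterns around the player rather than board positions.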

#### Testing out our new env

Let's test it out!

In [26]:
# HIDDEN
import numpy as np
np.random.seed(42)

In [27]:
from envs import RandomLakeObs

env = RandomLakeObs()
env.reset()

array([1, 1, 0, 1])

In [28]:
env.render()

P...
OOO.
..O.
..OG


Here, we see the expected observation indicating "holes" to the left, down, and up. (Left and up are off the grid, which we treat as holes.)

#### Testing out our new env

Let's try stepping right:

In [29]:
env.step(2)

(array([0, 1, 0, 1]), 0, False, {})

In [30]:
env.render()

.P..
OOO.
..O.
..OG


Now we see holes in the down and up directions, again as expected.

#### Training with our new observations

- Our new observations seem to be working, but do they help the agent learn?
- Recall that with our `Discrete(16)` observation space we were not able to get much more than a 30% success rate.
- Let's try again:

In [32]:
# HIDDEN

from ray.rllib.agents.ppo import PPOTrainer

In [33]:
trainer = PPOTrainer({"framework" : "torch", "create_env_on_driver" : True, "seed" : 0}, 
                       env=RandomLakeObs)

for i in range(8):
    trainer.train()

In [34]:
print(np.mean(trainer.evaluate()['evaluation']['hist_stats']['episode_reward']))

0.7383720930232558


- Wow, this is way better!
- Which makes sense... our agent can "see" the holes now, instead of walking blindly.

## Supervised learning analogy

In the slides we changed the observation space for our agent and, as a result, achieved higher rewards. What aspect of the supervised learning process is this most analogous to?

- [x] Feature engineering | You got it! Our observation space acts as the feature space for our policy to act on.
- [ ] Model selection | Not quite. But, as we'll see, there is a place for model selection in RL as well!
- [ ] Hyperparameter tuning | Not quite. But, as we'll see, there is a place for hyperparameter tuning in RL as well!

## Why does this work so well?

An important property of a successful lake-navigator is that it tries to avoid holes. We have tested out two observation encodings:

1. The player's location only, `Discrete(16)`. Let's call this PlayerLoc.
2. The nearby holes, `MultiDiscrete([2,2,2,2])`. Let's call this HolesLoc.

With the PlayerLoc observations, the agent will always perform the same action (or sample from the same probability distribution over actions, in the case of a stochastic policy) when it's at the same location in the lake. With the HolesLoc observations, on the other hand, the agent may behave differently at a given location, depending on where the nearby holes are.
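The difference can be seen by encoding the same player position in two different lakes. This is a minimal sketch on text grids; `player_loc` and `holes_loc` are hypothetical helpers, and both lakes are invented for the illustration.

```python
def player_loc(rows):
    """PlayerLoc encoding: index of the player's square, 0..15 in row-major order."""
    for i, row in enumerate(rows):
        if "P" in row:
            return i * len(row) + row.index("P")
    raise ValueError("no player found")

def holes_loc(rows):
    """HolesLoc encoding: [left, down, right, up], with off-grid counted as a hole."""
    n = len(rows[0])
    i, j = divmod(player_loc(rows), n)

    def is_hole(r, c):
        off_grid = not (0 <= r < len(rows) and 0 <= c < n)
        return int(off_grid or rows[r][c] == "O")

    return [is_hole(i, j - 1), is_hole(i + 1, j), is_hole(i, j + 1), is_hole(i - 1, j)]

# Two made-up lakes with the player on square 5 in both.
lake_a = ["....", ".PO.", "....", "...G"]
lake_b = ["....", ".P..", ".O..", "...G"]

print(player_loc(lake_a), player_loc(lake_b))  # 5 5
print(holes_loc(lake_a))  # [0, 0, 1, 0]
print(holes_loc(lake_b))  # [0, 1, 0, 0]
```

A PlayerLoc policy receives the identical input `5` in both lakes and so cannot react to the hole, whereas a HolesLoc policy receives two different inputs and can learn a different action for each.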

#### Frozen pond

Let's imagine we're back in the Frozen Pond; that is, the holes are not randomized, but rather always in the same places. In this scenario, 

#### Random lake


```
P...
.O.O
...O
O..G
```


## Including the player's location

In our new observation representation we actually _removed_ the player's location from the observation and _only_ include the presence of nearby holes. If we wanted an observation space that included both the nearby holes _and_ the player's location, which of the following gym spaces could we use?

- [ ] `gym.spaces.Discrete(5)` | Try again!
- [x] `gym.spaces.MultiDiscrete([2,2,2,2,16])` | Yes! The first 4 numbers represent the holes, and the last number represents the player's location.
- [ ] `gym.spaces.MultiDiscrete([2,2,2,2]) + gym.spaces.Discrete(16)` | Try again; unfortunately we can't add gym spaces.
- [ ] `gym.spaces.MultiDiscrete([32,32,32,32])` | This could be made to work, but is a confusing/redundant representation.
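For reference, one way such a combined observation could be assembled is sketched below. `combined_observation` is a hypothetical helper, not part of the course's `envs` module, and the range check is written by hand rather than through `gym`.

```python
NVEC = [2, 2, 2, 2, 16]  # 4 hole flags, then the player's square index

def combined_observation(hole_flags, player_index):
    """Concatenate [left, down, right, up] hole flags with the player's index (0..15)."""
    obs = list(hole_flags) + [player_index]
    if not all(0 <= x < n for x, n in zip(obs, NVEC)):
        raise ValueError(f"{obs} is outside the space {NVEC}")
    return obs

print(combined_observation([1, 1, 0, 1], 0))  # [1, 1, 0, 1, 0]
print(combined_observation([0, 1, 0, 1], 1))  # [0, 1, 0, 1, 1]
```

This space has 2 × 2 × 2 × 2 × 16 = 256 possible observations, so it is larger than either encoding on its own.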

coding exercise - have them look at the agents and see how they behave in different scenarios, the animations

exercise - show how the old agent did the same thing only depending on location, whereas the new agent does different things at the same location depending on walls

LOTS of good exercises here.

## What the agent sees

With our new observation space encoding, the agent only "sees" the 4 spaces around it and only has this information available to make its decisions. 

could do an exercise where we show what the agent sees, like a rendering like this

```
 O
.P.
 O
```

and the player uses keyboard inputs to play the game a bit??

In [105]:
import numpy as np
from envs import RandomLakeObs

actions = {"left" : 0, "down" : 1, "right" : 2, "up" : 3, 
           "l" : 0, "d" : 1, "r" : 2, "u" : 3}

np.random.seed(45)
env = RandomLakeObs()
obs = env.reset()

act = "start"
done = False

while act != "quit" and not done:
    # Draw the 3x3 neighborhood: "P" in the center, "O" where a hole is observed
    obs_print = [['.']*3 for _ in range(3)]
    obs_print[1][1] = "P"
    if obs[0]:
        obs_print[1][0] = "O" # left
    if obs[1]:
        obs_print[2][1] = "O" # down
    if obs[2]:
        obs_print[1][2] = "O" # right
    if obs[3]:
        obs_print[0][1] = "O" # up
    print("Observation:")
    print("\n".join("".join(row) for row in obs_print))
    print()
    
    while True: # gather keyboard input until we get a valid action or "quit"
        act = input().strip()
        if act == "quit" or act in actions:
            break
    if act != "quit": # only step the env for real actions
        obs, rew, done, _ = env.step(actions[act])
    
if done:
    if rew > 0:
        print("You win! +1 reward 🎉")
    else:
        print("You fell into the lake 😢")

Observation:
.O.
OP.
...



 left


Observation:
.O.
OP.
...



 down


Observation:
...
OP.
.O.



 down


You fell into the lake 😢
