## RL Policies

#### What is a policy?

Let's return to the "API" of RL:

![](img/RL-API.jpg)

- The policy is the output of RL
- It maps observations to actions
- The policy is like the agent's brain

#### What is a policy? Details.

- A policy is a function mapping observations to actions.
- Let's consider the Frozen Lake environment again:

In [2]:
import gym
env = gym.make("FrozenLake-v1", is_slippery=False)
obs = env.reset()
obs

0

- We received an observation 0. What do we do next? 
- The policy will tell us.

#### Example policy

A policy might look like this:

| Observation | Action |
|------|-------|
| 0   |   0  |
| 1   |   3  |
| 2   |   1  |
| 3   |   1  |
| ... | ... |
| 14 |  2  |
| 15 | 2   |

- On the left, we have all possible observations (16 for Frozen Lake, numbered 0 to 15).
- On the right, we have the corresponding action we will take _if we see that observation_.
- "If I see 0, I will do 0; if I see 1, I will do 3," etc.
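
In code, a table like this is just a lookup. Here is a minimal sketch, assuming we store the policy as a Python dictionary (the name `example_policy` is ours). It copies only the rows shown above; the rows hidden behind `...` are left out rather than guessed.

```python
# Partial copy of the example table: observation -> action.
# Rows hidden behind "..." in the table are intentionally omitted.
example_policy = {
    0: 0,
    1: 3,
    2: 1,
    3: 1,
    14: 2,
    15: 2,
}

obs = 0                       # the observation we received after resetting
action = example_policy[obs]  # "If I see 0, I will do 0"
print("Observation", obs, "-> action", action)
```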

#### Goal of RL

**The goal of RL is to learn a good policy given an environment.**

#### Non-deterministic policies

- Previously we learned about deterministic and non-deterministic environments. 
- Analogously, we have deterministic and non-deterministic _policies_.
- Before we saw a deterministic policy: a given observation elicits a fixed action.
- Here is an example non-deterministic policy:

| Observation | P(left) | P(down) | P(right) | P(up) |
|------------|-------|-----------|---------|-------|
| 0   |  0.9 | 0.01      | 0.04      | 0.05 |
| 1   |  0.05 | 0.05      | 0.05      | 0.85 |
| ... |  ... | ...      | ...      | ... |
| 15 |  0.0 | 0.0      | 0.99      | 0.01 |

"If I see 0, I will move left 90% of the time, down 1% of the time, right 4% of the time, and up 5% of the time."

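Acting with a non-deterministic policy means sampling an action from the probabilities listed for the current observation. Here is a minimal sketch that uses only the three rows shown above and the standard library's `random.choices`. The names `stochastic_policy` and `sample_action` are illustrative, and the action numbering (0 = left, 1 = down, 2 = right, 3 = up) is Frozen Lake's.

```python
import random

# observation -> [P(left), P(down), P(right), P(up)]
stochastic_policy = {
    0: [0.9, 0.01, 0.04, 0.05],
    1: [0.05, 0.05, 0.05, 0.85],
    15: [0.0, 0.0, 0.99, 0.01],
}

def sample_action(policy, obs, rng):
    """Draw one action (0..3) according to the probabilities for obs."""
    return rng.choices(range(4), weights=policy[obs], k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_action(stochastic_policy, 0, rng) for _ in range(1000)]
print("Fraction of 'left' actions from observation 0:", samples.count(0) / 1000)
```

Each row must sum to 1, since the agent always takes exactly one of the four actions.
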
#### Continuous action spaces

What if our action space is continuous? We can still have a policy. For example:

| Observation | Action |
|------|-------|
| 0   |   0.42  |
| 1   |   -3.99  |
| ... | ... |
| 15 | 2.24   |

A non-deterministic policy over a continuous action space cannot list a probability for every possible action; instead, it samples each action from a probability distribution.
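
One common way to do this is to treat the table entry as the mean of a distribution and sample around it. Here is a minimal sketch, assuming a Gaussian with a fixed standard deviation of 0.1. Both the choice of distribution and that value are our assumptions, not part of the table.

```python
import random

# observation -> mean action, copied from the rows shown above
mean_action = {0: 0.42, 1: -3.99, 15: 2.24}
ACTION_STD = 0.1  # assumed spread; a real policy would learn or tune this

def sample_continuous_action(obs, rng):
    """Sample an action from a Gaussian centered on the table entry."""
    return rng.gauss(mean_action[obs], ACTION_STD)

rng = random.Random(0)  # seeded so the run is repeatable
print([round(sample_continuous_action(0, rng), 3) for _ in range(3)])
```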

#### Continuous observation spaces

- What if our _observation_ space is continuous? 
- Well, now we can't draw the policy as a table anymore...
- In this case, our policy is a _function_ of the observation value. 
- E.g. "gas pedal angle with floor (action) = 1.5 x distance to closest obstacle (observation)" 
- This toy example says that if the nearest obstacle is far away, you can speed up the car.
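
The toy rule above can be written directly as a Python function. This is only a sketch; the name `gas_pedal_policy` is ours, and units are left unspecified as in the text.

```python
def gas_pedal_policy(distance_to_obstacle):
    """Toy policy: pedal angle grows linearly with the distance to the closest obstacle."""
    return 1.5 * distance_to_obstacle

# Unlike a table, the function accepts any distance, not just a fixed list of values.
for distance in [0.0, 2.0, 10.0]:
    print(distance, "->", gas_pedal_policy(distance))
```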

In [7]:
# TODO
# too much text around here, need more code and/or images
# I think we can remove some of this stuff (continuous, beyond scalars) and add it back in when it's paired with a concrete example
# it's not very helpful/interesting as just ideas in isolation...

#### Beyond scalars

- So far we've assumed the observation is a single number and the action is a single number.
- However, both of these can be more complex data types: images, vectors, etc. 
- The actual observations of a self-driving car may be dozens of measurements, images, etc.
- The actual actions of a self-driving car may be setting multiple values at each time step.
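
A policy over richer data types is still just a function; only its input and output types change. Here is a minimal sketch with made-up fields. `CarObservation`, `CarAction`, and the numbers in the rule are all hypothetical, chosen only to illustrate the shape of the data.

```python
from dataclasses import dataclass

@dataclass
class CarObservation:
    speed: float                 # current speed
    distance_to_obstacle: float  # distance to the closest obstacle
    lane_offset: float           # signed offset from the lane center

@dataclass
class CarAction:
    gas: float       # gas pedal setting
    steering: float  # steering setting; its sign opposes the lane offset

def car_policy(obs: CarObservation) -> CarAction:
    """Hand-written toy rule mapping several measurements to several controls."""
    gas = 1.5 * obs.distance_to_obstacle  # more room ahead, more gas
    steering = -0.5 * obs.lane_offset     # steer back toward the lane center
    return CarAction(gas=gas, steering=steering)

print(car_policy(CarObservation(speed=10.0, distance_to_obstacle=4.0, lane_offset=0.2)))
```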

#### Thinking about policies as functions

- In general, this is a useful way of thinking: the policy is a function that maps observations to actions.
- In **deep reinforcement learning**, this function is a neural network.
- A bit more on this later!
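
To make the idea concrete, here is a minimal sketch of a policy implemented as a tiny neural network: a single linear layer followed by a softmax, written in plain Python so it runs without any deep learning library. The weights are random, so this policy is untrained. The layer size, the one-hot encoding, and all names are our assumptions.

```python
import math
import random

N_OBS, N_ACTIONS = 16, 4  # Frozen Lake sizes
rng = random.Random(0)    # seeded so the run is repeatable

# One weight per (action, observation) pair; random means "not yet trained".
weights = [[rng.uniform(-1, 1) for _ in range(N_OBS)] for _ in range(N_ACTIONS)]

def policy_network(obs):
    """Map an observation to a probability for each action."""
    one_hot = [1.0 if i == obs else 0.0 for i in range(N_OBS)]
    logits = [sum(w * x for w, x in zip(row, one_hot)) for row in weights]
    exps = [math.exp(z - max(logits)) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

probs = policy_network(0)
action = rng.choices(range(N_ACTIONS), weights=probs, k=1)[0]
print("Action probabilities:", [round(p, 3) for p in probs], "-> sampled action", action)
```

Training would adjust `weights` so that actions leading to higher reward become more probable.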

#### Summary

- The "agent" and the "player" are personifications of the policy.
- There is no additional "intelligence" or decision-making beyond the policy.
- Therefore, we don't technically need the notion of an agent/player.
- The policy is the output of RL.

#### Let's apply what we learned!

## Frozen Lake policy
<!-- multiple choice -->

Recall the frozen lake environment:

```
🧑🧊🧊🧊
🧊🕳🧊🕳
🧊🧊🧊🕳
🕳🧊🧊⛳️
```

with its observation space represented as:

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

and actions represented as

| Action     |  Meaning    |
|------|------|
| 0 | left |
| 1 | down |
| 2 | right |
| 3 | up |
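
The observation numbers and action codes above can be tied together with a little arithmetic. Here is a minimal sketch, assuming the 4x4 layout shown above and ignoring slipperiness and holes. `to_row_col` and `intended_next_obs` are illustrative helpers, not part of gym.

```python
N_ROWS = N_COLS = 4
ACTION_MEANING = {0: "left", 1: "down", 2: "right", 3: "up"}
# (row change, column change) for each action code
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def to_row_col(obs):
    """Convert an observation number (0..15) into (row, column)."""
    return divmod(obs, N_COLS)

def intended_next_obs(obs, action):
    """Cell the agent aims for; a move off the grid leaves it in place."""
    row, col = to_row_col(obs)
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    return new_row * N_COLS + new_col

print(to_row_col(6))            # (1, 2)
print(intended_next_obs(0, 1))  # 4: moving down from the start
print(intended_next_obs(0, 0))  # 0: moving left runs into the edge
```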

#### Question 1

The policy below contains a missing entry represented by a `?` symbol.

| Observation | Action |
|------|-------|
| 0   |   0  |
| 1   |   2  |
| ... | ... |
| 13 |  ? |
| 14 | 2 |
| 15 | 0   |

Select the best choice to fill the `?` entry.

- [ ] 0  | Try again!
- [ ] 1 | Try again!
- [x] 2 | Yes! Moving to the right takes you toward the goal.
- [ ] 3 | Try again!

#### Question 2

_In the slippery version of the Frozen Lake, the agent has a 1/3 probability of moving in the intended direction and 1/3 each in the two perpendicular directions._

Is the above a statement about the environment or the policy?

- [x] Environment | You got it! 
- [ ] Policy | Remember, the policy describes how the agent responds to observations.

#### Question 3

_In the slippery version of the Frozen Lake, it is sometimes better not to walk in the direction you really want to go, because it's more important to avoid the chance of slipping into a hole._

Is the above a statement about the environment or the policy?

- [ ] Environment | The statement above is about the best action to take in a situation; this is determined by the policy.
- [x] Policy | You got it!

## Expected reward
<!-- coding exercise -->

The code below loads the (non-deterministic) slippery Frozen Lake environment. A (deterministic) policy is defined as a Python dictionary that maps from observations to actions. The code loops over 1000 episodes. Within each episode, it iterates through time steps (observations and actions) until the episode is done and a reward is achieved. It then prints the average reward over the 1000 episodes. It usually gets an average reward around 0.05, meaning the goal is reached around 5% of the time. 

**Your task:** modify the policy so that an average reward of at least 0.2 is achieved (i.e., the agent reaches the goal at least 20% of the time).

In [1]:
# EXERCISE

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)

policy = {
    0 : 1,
    1 : 1,
    2 : 1,
    3 : 1,
    4 : 1,
    5 : 1,
    6 : 1,
    7 : 1,
    8 : 1,
    9 : 1,
    10: 1,
    11: 1,
    12: 1,
    13: 1,
    14: 1,
    15: 1
}

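rewards = []
N = 1000
for i in range(N): # loop over episodes

    # classic gym API (before 0.26): reset() returns only the observation
    obs = env.reset()
    done = False
    
    while not done: # loop over time steps within one episode
        action = policy[obs] # look up the action for the current observation
        obs, reward, done, _ = env.step(action)
    
    rewards.append(reward) # final reward: 1 if the goal was reached, else 0
    
print("Average reward:", sum(rewards)/N)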

Average reward: 0.057


In [34]:
# SOLUTION

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)

policy = {
    0 : 2,
    1 : 2,
    2 : 2,
    3 : 2,
    4 : 1,
    5 : 1,
    6 : 1,
    7 : 1,
    8 : 2,
    9 : 2,
    10: 2,
    11: 0,
    12: 2,
    13: 2,
    14: 2,
    15: 2
}

rewards = []
N = 1000
for i in range(N): # loop over episodes

    obs = env.reset()
    done = False
    
    while not done:
        action = policy[obs]
        obs, reward, done, _ = env.step(action)
    
    rewards.append(reward)
    
print("Average reward:", sum(rewards)/N)

Average reward: 0.044


In [35]:
# TODO: I think we need one more step of scaffolding leading up to this exercise.
# I think just the inner loop first, and then add the outer loop later.