## RL Policies

#### What is a policy?

Let's return to the "API" of RL:

![](img/RL-API.png)

- The policy is the output of RL.
- It maps observations to actions.
- The policy is like the agent's brain.

#### What is a policy? Details.

- A policy is a function mapping observations to actions.
- Let's consider the Frozen Lake environment again:

In [2]:
import gym
env = gym.make("FrozenLake-v1", is_slippery=False)
obs = env.reset()
obs

0

- We received the observation 0. What do we do next?
- The policy will tell us.

#### Example policy

A policy might look like this:

| Observation | Action |
|------|-------|
| 0   |   0  |
| 1   |   3  |
| 2   |   1  |
| 3   |   1  |
| ... | ... |
| 14 |  2  |
| 15 | 2   |

- On the left, we have all possible observations (16 for Frozen Lake, numbered 0 to 15).
- On the right, we have the corresponding action we will take _if we see that observation_.
- "If I see 0, I will do 0; if I see 1, I will do 3," etc.
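
The table above can be sketched in code as a plain lookup. This is a minimal sketch rather than a definitive implementation: the names `policy_table` and `act` are hypothetical, and only the rows actually shown in the table are filled in, because the rows behind "..." are not given.

```python
# Deterministic policy as a lookup table: observation -> action.
# Only the rows shown in the table are included; the elided rows are
# not given in the text, so they are left out here as well.
policy_table = {
    0: 0,
    1: 3,
    2: 1,
    3: 1,
    14: 2,
    15: 2,
}


def act(obs):
    """Return the action the policy prescribes for a given observation."""
    return policy_table[obs]


print(act(0))  # the policy's action for observation 0
print(act(1))  # the policy's action for observation 1
```

Because the policy is deterministic, calling `act` twice with the same observation always gives the same action.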

#### Goal of RL

**The goal of RL is to learn a good policy given an environment.**

#### Non-deterministic policies

- Previously we learned about deterministic and non-deterministic environments. 
- Analogously, we have deterministic and non-deterministic _policies_.
- Before we saw a deterministic policy: a given observation elicits a fixed action.
- Here is an example non-deterministic policy:

| Observation | P(left) | P(down) | P(right) | P(up) |
|-------------|---------|---------|----------|-------|
| 0   | 0.9  | 0.01 | 0.04 | 0.05 |
| 1   | 0.05 | 0.05 | 0.05 | 0.85 |
| ... | ...  | ...  | ...  | ...  |
| 15  | 0.0  | 0.0  | 0.99 | 0.01 |

Each row is a probability distribution over the four actions, so it sums to 1.

"If I see 0, I will move left 90% of the time, down 1% of the time, right 4% of the time, and up 5% of the time."

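Acting with a non-deterministic policy means sampling an action from the row for the current observation. The sketch below assumes the four probabilities in each row are P(left), P(down), P(right), P(up), in that order, and uses Frozen Lake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). The names `stochastic_policy` and `sample_action` are hypothetical, and only the rows shown in the table are included.

```python
import random

# Action encoding used by Frozen Lake: 0 = left, 1 = down, 2 = right, 3 = up.
ACTIONS = [0, 1, 2, 3]

# Non-deterministic policy: observation -> [P(left), P(down), P(right), P(up)].
# Only the rows shown in the table are included.
stochastic_policy = {
    0: [0.9, 0.01, 0.04, 0.05],
    1: [0.05, 0.05, 0.05, 0.85],
    15: [0.0, 0.0, 0.99, 0.01],
}


def sample_action(obs, rng=random):
    """Draw one action according to the probabilities for this observation."""
    return rng.choices(ACTIONS, weights=stochastic_policy[obs], k=1)[0]


rng = random.Random(0)  # seeded only so that reruns are reproducible
samples = [sample_action(0, rng) for _ in range(1000)]
print(samples.count(0) / len(samples))  # fraction of "left" actions sampled
```

With enough samples, the fraction of "left" actions for observation 0 should approach the 0.9 listed in the table.
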
#### Continuous action spaces

What if our action space is continuous? We can still have a policy. For example:

| Observation | Action |
|------|-------|
| 0   |   0.42  |
| 1   |   -3.99  |
| ... | ... |
| 15 | 2.24   |

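A non-deterministic policy would instead sample its action from a probability distribution over the continuous action space, since it can no longer list a separate probability for every possible action.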
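
As a rough illustration, the sketch below shows both variants for a continuous action. The Gaussian is an assumption chosen for illustration (it is one common choice, not something the text prescribes), and `ACTION_STD`, `act_deterministic`, and `act_stochastic` are hypothetical names. Only the rows shown in the table are used.

```python
import random

# Deterministic continuous policy: observation -> a single real-valued action.
# Only the rows shown in the table are included.
mean_action = {0: 0.42, 1: -3.99, 15: 2.24}

# Hypothetical spread for the non-deterministic version; not from the text.
ACTION_STD = 0.1


def act_deterministic(obs):
    """Always return the same action for a given observation."""
    return mean_action[obs]


def act_stochastic(obs, rng=random):
    """Sample an action from a Gaussian centred on the tabulated action."""
    return rng.gauss(mean_action[obs], ACTION_STD)


rng = random.Random(0)  # seeded only so that reruns are reproducible
print(act_deterministic(0))
print(act_stochastic(0, rng))  # varies from call to call, centred on 0.42
```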

#### Continuous observation spaces

- What if our _observation_ space is continuous? 
- Well, now we can't draw the policy as a table anymore...
- In this case, our policy is a _function_ of the observation value. 
- E.g. "gas pedal angle with floor (action) = 1.5 x distance to closest obstacle (observation)" 
- This toy example says that if the nearest obstacle is far away, you can speed up the car.
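
The toy rule above can be written directly as a function. This is only a sketch of that one example: the function name `gas_pedal_policy` is hypothetical, and the units are left unspecified, as in the text.

```python
def gas_pedal_policy(distance_to_obstacle):
    """Toy policy from the text: action = 1.5 x observation."""
    return 1.5 * distance_to_obstacle


# The policy is defined for every distance, not just for a finite set of rows.
for distance in [0.0, 2.0, 10.0]:
    print(distance, gas_pedal_policy(distance))
```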


#### Thinking about policies as functions

- In general, this is a useful way of thinking: the policy is a function that maps observations to actions.
- In **deep reinforcement learning**, this function is a neural network.
- A bit more on this later!
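
To make "the policy is a neural network" concrete, here is a very small sketch in plain Python. It is not how a deep RL library implements policies: the layer sizes are hypothetical, the weights are random rather than trained, and the helper names (`random_matrix`, `matvec`, `softmax`, `policy_network`) are made up for this example. It only shows the shape of the idea: an observation goes in, and a probability for each action comes out.

```python
import math
import random

rng = random.Random(0)  # seeded only so that reruns are reproducible

N_INPUTS, N_HIDDEN, N_ACTIONS = 1, 8, 4  # hypothetical sizes


def random_matrix(n_rows, n_cols):
    """Small random weights; a trained policy would have learned values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


W1 = random_matrix(N_HIDDEN, N_INPUTS)   # input -> hidden layer
W2 = random_matrix(N_ACTIONS, N_HIDDEN)  # hidden layer -> one score per action


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def policy_network(obs):
    """Map an observation (list of floats) to action probabilities."""
    hidden = [math.tanh(h) for h in matvec(W1, obs)]
    return softmax(matvec(W2, hidden))


probs = policy_network([0.5])
print(probs)  # four probabilities, one per action
```

Training would adjust `W1` and `W2` so that good actions receive high probability; the function signature stays the same.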

#### Beyond scalars

- So far we've assumed the observation is a single number and the action is a single number.
- However, both of these can be more complex data types: images, vectors, etc. 
- The actual observations of a self-driving car may be dozens of measurements, images, etc.
- The actual actions of a self-driving car may be setting multiple values at each time step.
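
A policy over richer data types is still just a function. In the sketch below, every field name, value, and coefficient is hypothetical except the 1.5 factor, which is reused from the toy gas pedal example; it is meant only to show an observation and an action that each carry several values.

```python
# Hypothetical observation: several measurements taken at one time step.
observation = {
    "speed": 12.0,
    "distance_to_obstacle": 30.0,
    "lane_offset": -0.2,
}


def driving_policy(obs):
    """Map a multi-valued observation to a multi-valued action."""
    steering = -0.5 * obs["lane_offset"]  # steer back toward the lane centre
    gas_pedal_angle = 1.5 * obs["distance_to_obstacle"]  # toy rule from earlier
    return {"steering": steering, "gas_pedal_angle": gas_pedal_angle}


print(driving_policy(observation))
```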

### Summary

- The "agent" and the "player" are personifications of the policy.
- There is no additional "intelligence" or decision-making beyond the policy.
- Therefore, we don't technically need the notion of an agent/player.
- The policy is the output of RL.
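
The point that the "agent" is nothing more than the policy can be seen in a rollout loop. The sketch below does not use gym: `GRID` and `step` are a simplified, hypothetical stand-in for the non-slippery 4x4 Frozen Lake map, and `policy` is a hand-written example rather than a learned one.

```python
# Simplified stand-in for the non-slippery 4x4 Frozen Lake map
# (S = start, F = frozen, H = hole, G = goal). Not the gym implementation.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
SIZE = 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def step(obs, action):
    """Return (next_obs, done) for a deterministic move on the grid."""
    row, col = divmod(obs, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)  # moving into a wall keeps you in place
    col = min(max(col + d_col, 0), SIZE - 1)
    done = GRID[row][col] in "HG"  # the episode ends in a hole or at the goal
    return row * SIZE + col, done


# Hypothetical hand-written policy; observations not listed are never visited.
policy = {0: 1, 4: 1, 8: 2, 9: 1, 13: 2, 14: 2}

obs, done, path = 0, False, [0]
while not done:
    action = policy[obs]  # the only "decision" made anywhere in the loop
    obs, done = step(obs, action)
    path.append(obs)

print(path)  # the sequence of observations visited
```

Everything in the loop other than `policy[obs]` is bookkeeping, which is why the agent and the policy can be treated as the same thing.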

## Ex 1

In [5]:
# HIDDEN
env.seed(1);

## Ex 2

some policy "by hand" questions where they fill in what makes sense

## Ex 3

provide a trained policy and have them explore it, or see how it could be improved


## Ex 4