## RL Environments

#### What is an environment?

- An environment could be:
  - a game, like a video game
  - a simulation of a real-world scenario, like a robot, user behavior, or the stock market
  - any other setup with an _agent_ that takes _actions_, views _observations_, and receives _rewards_
  
TERMINOLOGY NOTICE: we will use _agent_ and _player_ interchangeably. 

#### Running example: frozen lake

As a running example of an environment, we will use the [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) environment from [OpenAI Gym](https://gym.openai.com/). We can visualize the environment like this:

In [1]:
import gym
env = gym.make("FrozenLake-v1")
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


The goal is for the player (red highlight, which shows up in the text output above as the color codes `[41m` and `[0m` around a square) to reach the goal (`G`) by walking on the frozen lake segments (`F`) without falling into the holes (`H`).

#### Movement

The player can move around the frozen lake. For example:

In [2]:
# HIDDEN
env.reset();
env.seed(6);

In [3]:
env.step(1); # 1 -> Down
env.render()

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG


Don't worry about `step(1)` for now; we'll get to that. 

What you can see is that the player (red highlight) moved downward.

#### Goal

Fast-forward a lot of steps, and you've completed the puzzle:

In [55]:
# HIDDEN, OUTPUT SHOWN
def step_to_end(env):
    env.reset();
    env.seed(6);
    env.step(1);
    env.step(3);
    env.step(1);
    env.step(3);
    env.step(2);
    env.step(2);
    env.step(2);
    env.step(2);
    env.step(0);
    env.step(0);
    env.step(2);
    env.step(0);
    env.step(1);
    return env
env = step_to_end(env)
env.render()
# note to self, whoops, this was silly, could just set is_slippery=False...

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m


You've achieved the goal by reaching `G`.

#### What makes an environment?

An environment involves several key components, which we'll go through in the following slides.

#### States

- We'll use the term _state_ to refer to everything about the environment. 

In [5]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


- For example, this is the starting state of the environment.
- The player is at the top-left, there's some frozen ice nearby, etc.
- We'll use the concept of a state to talk about our environment, but it won't appear in the "API".

#### Actions

- Here, the player can choose between 4 possible actions: up, down, left, right
- The space of all possible actions is called the **action space**.
- In supervised learning (SL) we have classification (categorical $y$) and regression (continuous $y$)
- Likewise, in RL the action space can be discrete or continuous
- In this case, it is discrete (4 possibilities)
- The code agrees:

In [6]:
env.action_space

Discrete(4)

#### Observations

- The observations are the _parts of the state that the agent can see_.
- Sometimes, the agent can see everything; we call this _fully observable_.
- Oftentimes, we have _partially observable_ environments. 
- In the Frozen Lake example, the agent can only see its own location out of the 16 squares.
- The agent is not "told" where the holes are via direct observations, so it will need to _learn_ this via trial and error.

#### Observations

- The space of all possible observations is called the **observation space**.
- Here, we have a discrete observation space consisting of the 16 possible player positions.
- You can think of the action space as analogous to the target in supervised learning.
- You can think of the observation space as analogous to the features in supervised learning.
- We might have a mix of discrete and continuous features; likewise, it is possible to have a mix of discrete and continuous components in the observation space (though less common).

In [7]:
env.observation_space

Discrete(16)

#### Rewards

- In supervised learning, the goal is usually to make good predictions.
- You may still try different loss functions depending on your specific goal, but the general concept is the same.
- In RL, the goal could be anything.
- But, as in SL, you still need to be _optimizing_ something.
- In RL, the quantity we maximize is called the **reward**.
- In the Frozen Lake example, the agent receives a reward when it reaches the goal.

#### Representing actions

- To use RL software, we will need a numerical representation of our action space and our observation space.
- In this case, we have 4 possible discrete actions, so we can encode them as {0,1,2,3} for (left, down, right, up).
- This is why, earlier, we did

In [8]:
env.step(1);

instead of 

```python
env.step("down")
```
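If readable names are preferred, a thin lookup can translate them into the integer codes. This is only a sketch: `ACTION_CODES`, `ACTION_NAMES`, and `encode_action` are hypothetical helpers for illustration and are not part of Gym.

```python
# Integer codes for the four Frozen Lake actions, as listed above.
ACTION_CODES = {"left": 0, "down": 1, "right": 2, "up": 3}

# Reverse lookup, handy when printing or debugging.
ACTION_NAMES = {code: name for name, code in ACTION_CODES.items()}


def encode_action(name):
    """Translate a readable action name into its integer code."""
    return ACTION_CODES[name.lower()]


print(encode_action("down"))  # 1
print(ACTION_NAMES[3])        # up
```

With this in place, `env.step(encode_action("down"))` would be equivalent to `env.step(1)`.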

#### Representing observations

- Likewise, we will need a numerical representation of our observations.
- Here, there are 16 possible positions of the player. These are encoded from 0 to 15, row by row, as follows:

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

These details of the Frozen Lake environment are also available in the [documentation](https://www.gymlibrary.ml/pages/environments/toy_text/frozen_lake).
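The position encoding can be sketched in a few lines, assuming the row-by-row numbering that Frozen Lake uses on its default 4 x 4 map (index = row * 4 + column). The helper names `to_index` and `to_row_col` are made up for this example.

```python
N_COLS = 4  # Frozen Lake's default map is 4 x 4


def to_index(row, col, n_cols=N_COLS):
    """Encode a (row, col) grid position as a single integer."""
    return row * n_cols + col


def to_row_col(index, n_cols=N_COLS):
    """Decode an integer observation back into (row, col)."""
    return divmod(index, n_cols)


print(to_index(0, 0))   # 0  (start, top-left)
print(to_index(1, 0))   # 4  (directly below the start)
print(to_row_col(15))   # (3, 3)  (goal, bottom-right)
```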

#### Representing observations

In [13]:
# HIDDEN
env.seed(6);

Initially, we observe "0" because we start at the upper-left:

In [14]:
env.reset()

0

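After choosing action 0 (left), the observation becomes 4. The player did not actually move left; why that can happen is covered in the next section on non-deterministic environments.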

In [16]:
obs, _, _, _ = env.step(0)
obs

4

#### Non-deterministic environments

- We've been keeping a secret from you, which is that the ice is slippery.
- In fact, when you choose action 1 corresponding to "down", the agent doesn't always move down.
- Some environments are _deterministic_, meaning that the same action taken in the same state always leads to the same next state.
- Some environments (like this one) are _non-deterministic_, meaning the outcome of an action can be random.

In [51]:
# HIDDEN
env.seed(4); 

In [52]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


#### Non-deterministic environments

In [53]:
env.step(1) # move down
env.render()

  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG


Moving down did not work as planned.

In [54]:
env.step(1) # move down
env.render()

  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


Moving down worked this time, although it took the player into a hole (`H`).
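The slipping behavior can be imitated with a short simulation. This is a sketch under an assumption taken from the Frozen Lake documentation: on slippery ice the player moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3, and never in the opposite direction. The function `slippery_outcome` is a hypothetical helper, not a Gym call.

```python
import random

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_outcome(action, rng):
    """Sample the direction actually moved on slippery ice.

    The intended direction and the two perpendicular ones are
    equally likely; the opposite direction never happens.
    """
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return rng.choice(candidates)


rng = random.Random(0)
moves = [slippery_outcome(DOWN, rng) for _ in range(10_000)]
for direction, name in [(LEFT, "left"), (DOWN, "down"), (RIGHT, "right")]:
    # Each fraction should come out close to one third.
    print(name, moves.count(direction) / len(moves))
```

Because the action codes go around the compass in order (left, down, right, up), the neighbors of an action modulo 4 are exactly its two perpendicular directions.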

#### Episodes

- An **episode** is one run through an RL environment.
- After an episode, the environment is reset.
- For example, in Frozen Lake an episode ends when you fall in a hole or you reach the goal.
- In some environments (like Frozen Lake), rewards are only received at the end of an episode.
- In other environments, rewards can be received at any time step (i.e., after an action).

#### Putting it all together

- We've now talked about the main components of an environment
  - States
  - Actions
  - Observations
  - Rewards
  - Episodes
  
When you call `.step()` in the code, you'll see these reflected:

In [62]:
env.reset()
env.step(1)

(0, 0.0, False, {'prob': 0.3333333333333333})

- The `1` in `step(1)` is the action we took
- `0` is the observation
- `0.0` is the reward
- `False` tells us the episode is not over yet
- The last part is a dictionary of optional extra info (here, the probability of the transition that just occurred); we can ignore this for now
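To see how the pieces fit together, here is a minimal sketch of an environment with the same `reset()` / `step()` shape as above. `ToyFrozenLake`, `LAKE_MAP`, and `MOVES` are hypothetical names, the ice is assumed not to be slippery, and this is not how Gym implements the environment.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
# (row change, col change) for actions 0=left, 1=down, 2=right, 3=up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


class ToyFrozenLake:
    """Hypothetical, non-slippery stand-in for the Gym environment."""

    def reset(self):
        self.position = 0  # state: the player starts at the top-left
        return self.position  # observation

    def step(self, action):
        row, col = divmod(self.position, 4)
        d_row, d_col = MOVES[action]
        # Walking into a wall leaves the player where it is.
        row = min(max(row + d_row, 0), 3)
        col = min(max(col + d_col, 0), 3)
        self.position = row * 4 + col
        tile = LAKE_MAP[row][col]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "HG"  # episode ends in a hole or at the goal
        return self.position, reward, done, {}


toy_env = ToyFrozenLake()
obs = toy_env.reset()
done = False
total_reward = 0.0
for action in [1, 1, 2, 2, 1, 2]:  # down, down, right, right, down, right
    obs, reward, done, info = toy_env.step(action)
    total_reward += reward
    if done:
        break
print(obs, total_reward, done)  # 15 1.0 True
```

The loop is one episode: it starts with `reset()`, feeds actions to `step()`, and stops as soon as `done` is `True`. The only nonzero reward arrives on the final step, when the player reaches `G`.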

#### SL datasets vs. RL environments

- In supervised learning, you are typically given a dataset.
- In RL, the environment acts as a _data generator_.
  - The more you play through the environment, the more "data" you generate and the more you can learn.
- One can also do RL on a pre-collected dataset (called _offline RL_), but that is out of scope for us.

## Ex 1

## Ex 2

## Ex 3

## Ex 4

Exercise: would this be a reasonable environment? Consider the reward, etc.

Exercise: what is the action space in this example, what is the observation space, how would they be encoded, etc.

Exercise: use `is_slippery=True` and `is_slippery=False`, and have them say whether the environment is deterministic or not.