## RL Environments

#### What is an environment?

- An environment could be:
  - a game, like a video game.
  - a simulation of a real world scenario, like a robot, user behaviour, or the stock market
  - any other setup with an _agent_ who takes _actions_, views _observations_, and receives _rewards_
  
TERMINOLOGY NOTICE: we will use _agent_ and _player_ interchangeably. 
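The agent/action/observation/reward loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Gym's API: `ToyCorridor`, `run_episode`, and `always_right` are hypothetical names, and the environment is a made-up 1-D corridor rather than Frozen Lake.

```python
class ToyCorridor:
    """Hypothetical 1-D environment: the agent walks from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the first observation

    def step(self, action):
        # Assumed encoding: 0 moves left, 1 moves right (clipped at the walls).
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the last cell
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(observation)                   # the agent takes an action
        observation, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(observation):
    """A trivial policy that ignores the observation and always moves right."""
    return 1
```

Calling `run_episode(ToyCorridor(), always_right)` walks the agent to the end of the corridor; a learning agent would replace `always_right` with a policy that improves from the rewards it collects.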

#### Running example: frozen lake

As a running example of an environment, we will use the [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) environment from [OpenAI Gym](https://gym.openai.com/). We can visualize the environment like this:

In [1]:
import gym
env = gym.make("FrozenLake-v1")
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


The goal is for the player (red highlight) to reach the goal (`G`) by walking on the frozen lake segments (`F`) without falling in the holes (`H`).

#### Movement

The player can move around the frozen lake. For example:

In [2]:
# HIDDEN
env.reset();
env.seed(6);

In [3]:
env.step(1); # 1 -> Down
env.render()

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG


Don't worry about `step(1)` for now; we'll get to that. 

What you can see is that the player (red highlight) moved downward.

#### Goal

Fast-forward a lot of steps, and you've completed the puzzle:

In [28]:
# HIDDEN, OUTPUT SHOWN
def step_to_end(env):
    env.reset()
    env.seed(6)
    # Replay a fixed sequence of integer-coded actions (1 -> Down, as above)
    for action in [1, 3, 1, 3, 2, 2, 2, 2, 0, 0, 2, 0, 1]:
        env.step(action)
    return env
env = step_to_end(env)
env.render()

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m


You've achieved the goal by reaching `G`.

#### What makes an environment?

An environment involves several key components, which we'll go through in the following slides.

#### States

- We'll use the term _state_ to refer to everything about the environment. 

In [9]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


- For example, this is the starting state of the environment.
- The player is at the top-left, there's some frozen ice nearby, etc.
- We'll use the concept of a state to talk about our environment, but it won't appear in the "API".

#### Actions

- Here, the player can choose between 4 possible actions: up, down, left, right
- The space of all possible actions is called the **action space**.
- In supervised learning (SL), we have classification (categorical $y$) and regression (continuous $y$)
- Likewise in RL the action space can be discrete or continuous
- In this case, it is discrete (4 possibilities)
- The code agrees:

In [10]:
env.action_space

Discrete(4)
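To make the discrete/continuous distinction concrete, here is a rough sketch of the two kinds of space in plain Python. `DiscreteSpace` and `IntervalSpace` are hypothetical stand-ins written for this illustration; they are not Gym's classes, and the steering-angle example is an assumption chosen only to show a continuous action.

```python
import random


class DiscreteSpace:
    """Hypothetical discrete space with n choices, encoded as 0 .. n-1."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class IntervalSpace:
    """Hypothetical continuous space: any real number in [low, high]."""

    def __init__(self, low, high, seed=None):
        self.low, self.high = low, high
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.uniform(self.low, self.high)

    def contains(self, x):
        return self.low <= x <= self.high


frozen_lake_actions = DiscreteSpace(4)      # up, down, left, right
steering_angle = IntervalSpace(-1.0, 1.0)   # e.g. a continuous control input
```

A discrete space can be enumerated in full, much like the classes in a classification problem, while a continuous space can only be sampled or bounded, much like a regression target.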

#### Observations

- The observations are the _parts of the state that the agent can see_.
- Sometimes, the agent can see everything; we call this _fully observable_.
- Oftentimes, we have _partially observable_ environments. 
- In the Frozen Lake example, the agent can only see its own location out of the 16 squares.
- The agent is not "told" where the holes are via direct observations, so it will need to _learn_ this via trial and error.
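One way to picture partial observability is a function that maps the full state to what the agent sees. The sketch below is illustrative only: the dictionary layout of `state`, the `observe` function, and the second map are all hypothetical, not part of Gym.

```python
def observe(state):
    """Return only what the agent is allowed to see: its own location."""
    return state["player"]


# Two different states: same player location, different hole layouts.
state_a = {
    "map": ["SFFF", "FHFH", "FFFH", "HFFG"],  # the layout rendered earlier
    "player": (0, 0),
}
state_b = {
    "map": ["SFFF", "FFFF", "FHHF", "FFFG"],  # hypothetical alternative layout
    "player": (0, 0),
}

# The agent cannot tell these states apart from a single observation.
same_observation = observe(state_a) == observe(state_b)
```

Because both states produce the same observation, the only way for the agent to find out where the holes are is to act and see what happens.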

#### Observations

- The space of all possible observations is called the **observation space**.


In [40]:
env.observation_space

Discrete(16)

#### Rewards

- In supervised learning, the goal is usually to make good predictions.
- You may still try different loss functions depending on your specific goal, but the general concept is the same.
- In RL, the goal could be anything.
- But, like in SL, you will need to be _optimizing_ something.
- In RL, this quantity we are maximizing is called the **reward**.
- In the Frozen Lake example, the agent receives a reward when it reaches the goal.
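As a rough sketch, a reward rule of this kind can be written as a function of the tile the agent lands on. The value `1.0` for reaching the goal and `0.0` everywhere else is an assumption made for illustration, and `reward_for` and `episode_return` are hypothetical helper names.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]  # layout from the rendering above


def reward_for(row, col, lake_map=LAKE_MAP):
    """Assumed rule: 1.0 for stepping onto the goal tile, 0.0 otherwise."""
    return 1.0 if lake_map[row][col] == "G" else 0.0


def episode_return(path):
    """Sum the rewards collected along a sequence of (row, col) positions."""
    return sum(reward_for(row, col) for row, col in path)


# A hole-free path from the start (top-left) to the goal (bottom-right).
safe_path = [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (3, 3)]
total = episode_return(safe_path)
```

The quantity an RL agent tries to maximize is the total collected over an episode, which is why a single reward at the goal is enough to define the task.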

#### Representing actions

- To use RL software, we will need a numerical representation of our action space and our observation space.
- In this case, we have 4 possible discrete actions, so we can encode them as {0,1,2,3}.
- This is why, earlier, we did

In [38]:
env.step(1);

instead of 

```python
env.step("down")
```
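If you prefer readable names in your own code, you can keep a small lookup between names and integer codes. The sketch below assumes the encoding 0 = Left, 1 = Down, 2 = Right, 3 = Up; only `1 -> Down` is confirmed by the earlier cell, so treat the other three as an assumption to check against the environment's documentation. `Action`, `encode_action`, and `decode_action` are hypothetical helpers.

```python
from enum import IntEnum


class Action(IntEnum):
    # Assumed encoding; only DOWN = 1 is confirmed by the example above.
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3


def encode_action(name):
    """Turn a name like "down" into its integer code."""
    return int(Action[name.upper()])


def decode_action(code):
    """Turn an integer code back into a lowercase name."""
    return Action(code).name.lower()
```

With this in place, `env.step(encode_action("down"))` would be equivalent to `env.step(1)`.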

#### Representing observations

- Likewise, we will need a numerical representation of our observations.
- Here, there are 16 possible positions, so we can encode them as the integers 0 to 15.

In [25]:
# HIDDEN
env.seed(6);

In [26]:
env.reset()

0

In [27]:
env.step(1)[0]

1

The agent moved from position 0 to position 1.
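To relate these integers back to the grid, here is a small sketch that converts between a position index and a (row, column) pair. It assumes the 16 squares are numbered row by row starting from the top-left, which is an assumption worth checking against the environment's documentation; `to_row_col` and `to_index` are hypothetical helper names.

```python
N_COLS = 4  # the Frozen Lake grid shown above is 4 x 4


def to_row_col(index, n_cols=N_COLS):
    """Convert a position index into (row, col), assuming row-by-row numbering."""
    return divmod(index, n_cols)


def to_index(row, col, n_cols=N_COLS):
    """Convert (row, col) back into a position index."""
    return row * n_cols + col
```

Under this assumption, position 0 is the top-left square and position 15 is the bottom-right square, where the goal sits.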

#### RL environments vs. SL datasets

How does this compare?


#### Summary

- States
- Actions
- Rewards
- Observations?


## Ex 1

## Ex 2

## Ex 3

## Ex 4

In [None]:
exercise: would this be a reasonable environment
reward
etc

exercise: what is the action space in this example, what is the observation space, how would it be encoded, etc