## RL Environments

#### What is an environment?

- An environment could be:
  - a game, such as a video game
  - a simulation of a real-world scenario, such as a robot, user behavior, or the stock market
  - any other setup with an _agent_ that takes _actions_, views _observations_, and receives _rewards_
  
Terminology note: we will use _agent_ and _player_ interchangeably. 

#### Running example: frozen lake

As a running example of an environment, we will use the [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) environment from [OpenAI Gym](https://gym.openai.com/), which provides the standard interface for RL problems. We can visualize the environment like this:

In [1]:
import gym
env = gym.make("FrozenLake-v1", is_slippery=False)
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


The goal is for the player (red highlight) to reach the goal (`G`) by walking on the frozen lake segments (`F`) without falling in the holes (`H`).

#### Movement

The player can move around the frozen lake. For example:

In [2]:
env.step(1); # 1 -> Down
env.render()

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG


Don't worry about `step(1)` for now; we'll get to that. 

What you can see is that the player (red highlight) moved downward.

#### Goal

Fast-forward a few more steps, and you've completed the puzzle:

In [3]:
env.step(1)
env.step(2)
env.step(1)
env.step(2)
env.step(1)
env.step(2)
env.render()

  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m


You've achieved the goal by reaching `G`.

#### What makes an environment?

An environment involves several key components, which we'll go through in the following slides.

#### States

- We'll use the term _state_ informally to refer to the complete situation of the environment at a given moment.

In [4]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


- For example, this is the starting state of the environment.
- The player is at the top-left, there's some frozen ice nearby, etc.
- We'll use the concept of a state to talk about our environment, but it won't appear in the "API".

#### Actions

- Here, the player can choose between 4 possible actions: up, down, left, right
- The space of all possible actions is called the **action space**.
- In supervised learning (SL), we distinguish between regression (continuous $y$) and classification (categorical $y$)
- Likewise in RL the action space can be continuous or discrete
- In this case, it is discrete (4 possibilities)
- The code agrees:

In [5]:
env.action_space

Discrete(4)
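To make the discrete/continuous distinction concrete, here is a small sketch using only Python's standard library. It does not use Gym's space classes; the continuous range of -1 to 1 is an arbitrary choice made for illustration.

```python
import random

rng = random.Random(0)  # seeded so the sketch is repeatable

# Discrete action space with 4 options: an action is one of 0, 1, 2, 3.
discrete_action = rng.randrange(4)

# Continuous action space (hypothetical range): an action is any real
# number between -1.0 and 1.0, for example a steering value.
continuous_action = rng.uniform(-1.0, 1.0)

print("discrete action:", discrete_action)
print("continuous action:", continuous_action)
```

A discrete action can be listed exhaustively, whereas a continuous action can only be described by its range.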

#### Observations

- The observations are the _parts of the state that the agent can see_.
- Sometimes, the agent can see everything; we call this _fully observable_.
- Oftentimes, we have _partially observable_ environments. 
- In the Frozen Lake example, the agent can only see its own location out of the 16 squares.
- The agent is not "told" where the holes are via direct observations, so it will need to _learn_ this via trial and error.

#### Observations

- The space of all possible observations is called the **observation space**.
- You can think of the action space as analogous to the target in supervised learning.
- You can think of the observation space as analogous to the features in supervised learning.


Here, we have a discrete observation space consisting of the 16 possible player positions:

In [6]:
env.observation_space

Discrete(16)

#### Rewards

- In supervised learning, the goal is usually to make good predictions.
- You may still try different loss functions depending on your specific goal, but the general concept is the same.
- In RL, the goal could be anything.
- But, like in SL, you will need to be _optimizing_ something.
- In RL, we aim to maximize the **reward**.

#### Rewards 

In the Frozen Lake example, the agent receives a reward when it reaches the goal.

In [18]:
env.reset()
obs, reward, done, _ = env.step(0)
env.render()
print("reward =", reward)

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
reward = 0.0


In [19]:
obs, reward, done, _ = env.step(1)
print("reward =", reward)

reward = 0.0


Still no reward, let's keep going...

#### Rewards

In [20]:
env.step(1)
env.step(2)
env.step(1)
env.step(2)
env.step(1)
obs, reward, done, _ = env.step(2)
env.render()
print("reward =", reward)

  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
reward = 1.0


We got a reward of 1.0 for reaching the goal.
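To make the reward rule concrete, here is a minimal pure-Python sketch of the deterministic 4x4 lake. It is not Gym's implementation: the `LAKE` layout is copied from the rendering above, and `step` is a hypothetical stand-in that returns the position, reward, and done flag we have been looking at.

```python
# Hypothetical stand-in for the deterministic 4x4 Frozen Lake (not Gym's code).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
SIZE = 4
# (row change, column change) for actions 0=left, 1=down, 2=right, 3=up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(pos, action):
    """Apply one action from position `pos` (0-15); return (new_pos, reward, done)."""
    row, col = divmod(pos, SIZE)
    d_row, d_col = MOVES[action]
    # Moving into the edge of the lake leaves the player where it is.
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = LAKE[row][col]
    reward = 1.0 if tile == "G" else 0.0   # reward only at the goal
    done = tile in "GH"                    # episode ends at the goal or in a hole
    return row * SIZE + col, reward, done


pos = 0  # start at S (top-left)
rewards = []
for action in [0, 1, 1, 2, 1, 2, 1, 2]:  # the same actions as in the cells above
    pos, reward, done = step(pos, action)
    rewards.append(reward)
print("final position:", pos, "rewards:", rewards, "done:", done)
```

Every reward is zero until the final step, which lands on `G`.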

#### Representing actions

- To use RL software, we will need a numerical representation of our action space and our observation space.
- In this case, we have 4 possible discrete actions, so we can encode them as {0,1,2,3} for (left, down, right, up).
- This is why, earlier, we did

In [8]:
env.step(1);

to walk downward.
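If the bare integers are hard to remember, one option is to give them names. The sketch below uses Python's `enum.IntEnum`; the `Action` class is our own helper for illustration and is not part of Gym.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical names for the four Frozen Lake action codes."""
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3


# IntEnum members compare equal to plain integers.
print(Action.DOWN == 1)

# Translate a list of action codes into readable names.
path = [1, 2, 1, 2]
print([Action(a).name for a in path])
```

Because `Action.DOWN` behaves like the integer 1, it can be used wherever the integer code is expected.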

#### Representing observations

- Likewise, we will need a numerical representation of our observations.
- Here, there are 16 possible positions of the player. These are encoded from 0-15 as follows:

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

These details of the Frozen Lake environment are also available in the [documentation](https://www.gymlibrary.ml/pages/environments/toy_text/frozen_lake).
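The numbering in the grid above is row-major, so converting between a position number and a (row, column) pair is simple arithmetic. The helper names below are our own and are not part of Gym.

```python
N_COLS = 4  # the default lake is 4x4


def to_row_col(obs):
    """Convert a position number (0-15) to a (row, col) pair, counting from 0."""
    return divmod(obs, N_COLS)


def to_obs(row, col):
    """Convert a (row, col) pair back to a position number."""
    return row * N_COLS + col


print(to_row_col(4))  # (1, 0): second row, first column
print(to_obs(3, 3))   # 15: the bottom-right corner
```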

#### Representing observations

Initially, we observe "0" because we start at the upper-left:

In [39]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


After moving down (action 1), we are at position 4.

In [40]:
obs, reward, done, _ = env.step(1)
obs

4

The observation is returned by the `step()` method.

#### Non-deterministic environments

- So far, taking a particular action from a particular state always resulted in the same new state.
- In other words, our Frozen Lake environment was _deterministic_.
- Some environments are _non-deterministic_, meaning the outcome of an action can be random.
- We can initialize a non-deterministic Frozen Lake like this:

In [44]:
env_slippery = gym.make("FrozenLake-v1", is_slippery=True)

In [45]:
# HIDDEN
env_slippery.seed(4); 

In [46]:
env_slippery.reset()
env_slippery.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


#### Non-deterministic environments

In [47]:
env_slippery.step(1) # move down
env_slippery.render()

  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG


Moving down did not work as planned.

In [48]:
env_slippery.step(1) # move down
env_slippery.render()

  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


Moving down worked this time.

In this "slippery" Frozen Lake environment, movement only works as intended 1/3 of the time.
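The 1/3 figure can be illustrated with a short simulation. This is a sketch under an assumption: that a slip picks uniformly among the intended direction and the two directions perpendicular to it. The function `slippery_direction` is a hypothetical helper, not Gym's code.

```python
import random

ACTION_NAMES = ["left", "down", "right", "up"]  # action codes 0, 1, 2, 3


def slippery_direction(action, rng):
    """Return the direction actually moved: intended, or one of the two perpendicular ones."""
    # With the ordering left, down, right, up, the neighbours of an action
    # in this list (wrapping around) are exactly its perpendicular directions.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return rng.choice(candidates)


rng = random.Random(0)
n_trials = 30_000
outcomes = [slippery_direction(1, rng) for _ in range(n_trials)]  # always try "down"
for code, name in enumerate(ACTION_NAMES):
    print(name, round(outcomes.count(code) / n_trials, 3))
```

Under this assumption, trying to move down sends the player left, down, or right about equally often, and never up.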

#### Episodes

- Playing Frozen Lake has an end: either you fall into a hole or you reach the goal.
- However, one play-through is not enough for an RL algorithm to learn from.
- It will need multiple play-throughs, called **episodes**.
- After an episode, the environment is reset.

#### Episodes

The `step()` method returns a flag telling us whether the episode is over:

In [50]:
obs, reward, done, _ = env_slippery.step(1)
done

True

In [51]:
env_slippery.render()

  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


Here the episode is done because we fell into a hole.

#### Episodes

- In some environments (like Frozen Lake), rewards are only received at the end of an episode.
- In other environments, rewards can be received at any **time step** (i.e., after an action).
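The relationship between episodes and time steps can be sketched as two nested loops: an outer loop over episodes and an inner loop over time steps. The code below uses a hypothetical pure-Python stand-in for the deterministic lake (not Gym's code) and picks actions at random. The cap of 100 steps per episode is an assumption added so that every episode is guaranteed to finish.

```python
import random

LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def step(pos, action):
    """Deterministic move on the 4x4 grid; returns (new_pos, reward, done)."""
    row, col = divmod(pos, 4)
    row = min(max(row + MOVES[action][0], 0), 3)
    col = min(max(col + MOVES[action][1], 0), 3)
    tile = LAKE[row][col]
    return row * 4 + col, float(tile == "G"), tile in "GH"


def run_episode(rng, max_steps=100):
    """Play one episode with random actions; return (total_reward, n_steps)."""
    pos, total_reward = 0, 0.0  # "reset": back to the start tile
    for t in range(1, max_steps + 1):  # one loop iteration = one time step
        pos, reward, done = step(pos, rng.randrange(4))
        total_reward += reward
        if done:
            break
    return total_reward, t


rng = random.Random(0)
results = [run_episode(rng) for _ in range(1000)]  # 1000 episodes
print("episodes played:", len(results))
print("fraction reaching the goal:", sum(r for r, _ in results) / len(results))
```

Each episode's total reward here is either 0.0 or 1.0, since the only reward arrives on the final step of a successful episode.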

#### Putting it all together

- We've now talked about the main components of an RL environment:
  - States
  - Actions
  - Observations
  - Rewards
  - Episodes
  

#### SL datasets vs. RL environments

- In supervised learning, you are typically given a dataset.
- In RL, the environment acts as a _data generator_.
  - The more you play through the environment, the more "data" you generate and the more you can learn.
- One can also do RL on a pre-collected dataset (called _offline RL_), but that is out of scope for us.
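One way to picture the environment as a data generator is to record every interaction as a tuple of (observation, action, reward, next observation, done). The sketch below does this with a hypothetical pure-Python stand-in for the deterministic lake and random actions; none of the names come from Gym.

```python
import random

LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def step(pos, action):
    """Deterministic move on the 4x4 grid; returns (new_pos, reward, done)."""
    row, col = divmod(pos, 4)
    row = min(max(row + MOVES[action][0], 0), 3)
    col = min(max(col + MOVES[action][1], 0), 3)
    tile = LAKE[row][col]
    return row * 4 + col, float(tile == "G"), tile in "GH"


def generate_transitions(n, rng):
    """Yield n (obs, action, reward, next_obs, done) tuples from random play."""
    obs = 0
    for _ in range(n):
        action = rng.randrange(4)
        next_obs, reward, done = step(obs, action)
        yield obs, action, reward, next_obs, done
        obs = 0 if done else next_obs  # reset after each finished episode


dataset = list(generate_transitions(500, random.Random(0)))
print(len(dataset), "transitions; first three:", dataset[:3])
```

Asking for more transitions simply means playing longer, which is the sense in which the environment generates data on demand.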

#### Let's apply what we learned!

## Self-driving car environment
<!-- multiple choice -->

You're using RL to train a self-driving car. The car AI uses various cameras and sensors as its inputs and has to decide the angle of the steering wheel as well as the angle of gas/brake pedals on the floor.

#### Is the observation space continuous or discrete?

- [x] Continuous | Correct! 
- [ ] Discrete | In this case, the observations are the sensor inputs, e.g. depth estimates.

#### Is the action space continuous or discrete?

- [x] Continuous
- [ ] Discrete | The actions are angles; they don't come from a discrete set of options.

#### What would be the most reasonable reward structure for this environment?

- [ ] Reward equals the amount of time the car was able to drive without crashing | What would the reward be if the car never moves?
- [x] Reward equals the distance the car was able to drive without crashing | Yes, that sounds good!
- [ ] +1 reward every time the car crashes | Keep in mind we want to maximize reward, not minimize it.

## Episodes vs. time steps
<!-- multiple choice -->

Fill in the blanks in the following sentence: 

*In a reinforcement learning environment, one takes actions repeatedly until the \_\_\_\_\_ ends. This may involve only one \_\_\_\_\_, or very many.*

- [ ] time step / reward | Check the first blank carefully!
- [ ] reward / time step | Try again!
- [ ] time step / episode | Try again!
- [x] episode / time step | You got it!


## Gym's taxi environment
<!-- coding exercise -->

In this exercise we'll look at one of the text-based environments bundled with OpenAI Gym, called the taxi environment. Documentation [here](https://www.gymlibrary.ml/environments/toy_text/taxi/).

In [53]:
# EXERCISE
import gym

taxi = gym.make("Taxi-v3")
taxi.seed(4)
obs = taxi.reset()

taxi.____

taxi.render()

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|Y| : |[34;1mB[0m: |
+---------+



Here, the taxi is represented by the yellow highlight, currently at row 3, column 3 (counting from 0). The `:` can be crossed but the `|` cannot. The goal is to pick up the passenger and drop them off at the destination.

There are 6 possible actions: down (0), up (1), right (2), left (3), pick up (4), drop off (5). 

In [54]:
taxi.action_space

Discrete(6)

The following code prints out the observation in a more human-readable format:

In [55]:
print("Taxi row: %d\nTaxi col: %d\nPassenger loc: %d\nDestination loc: %d" % tuple(taxi.decode(obs)))

Taxi row: 3
Taxi col: 3
Passenger loc: 3
Destination loc: 1


The possible locations are `R` (0), `G` (1), `Y` (2), `B` (3) and in taxi (4). Thus, the passenger is currently at `B` and is heading to `G`.
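For intuition about what `taxi.decode` does, here is a sketch of how four small numbers can be packed into a single integer observation and unpacked again. It assumes the packing `((row * 5 + col) * 5 + passenger) * 4 + destination`, which is our understanding of how Taxi-v3 numbers its states; treat the formula as an assumption and prefer `taxi.decode` in real code. The functions `encode` and `decode` below are our own stand-ins.

```python
LOCATIONS = ["R", "G", "Y", "B", "in taxi"]


def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer (assumed Taxi-v3 layout)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(obs):
    """Unpack an integer observation into (taxi_row, taxi_col, passenger_loc, destination)."""
    obs, destination = divmod(obs, 4)    # 4 possible destinations
    obs, passenger_loc = divmod(obs, 5)  # 5 possible passenger locations
    taxi_row, taxi_col = divmod(obs, 5)  # 5 columns per row
    return taxi_row, taxi_col, passenger_loc, destination


obs_example = encode(3, 3, 3, 1)  # the values printed by the cell above
row, col, passenger, destination = decode(obs_example)
print("observation:", obs_example)
print("taxi at row", row, "col", col)
print("passenger:", LOCATIONS[passenger], "destination:", LOCATIONS[destination])
```

Under this packing there are 5 x 5 x 5 x 4 = 500 distinct observations, one for each combination of taxi position, passenger location, and destination.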

To answer the following question, you will need to modify the code, run it (possibly multiple times), and print out any relevant output.

In [57]:
# SOLUTION
import gym

taxi = gym.make("Taxi-v3")
taxi.seed(4)
obs = taxi.reset()
taxi.render()
taxi.step(0)
taxi.step(4)
taxi.step(2)
taxi.step(1)
taxi.step(1)
taxi.step(1)
taxi.step(1)
obs, reward, done, _ = taxi.step(5)
print(reward)

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|Y| : |[34;1mB[0m: |
+---------+

20


#### How much reward does the agent get for successfully dropping off the passenger?

- [ ] 0 | Try performing the necessary actions with `taxi.step()` and printing out the rewards.
- [ ] 5 | Try performing the necessary actions with `taxi.step()` and printing out the rewards.
- [ ] 10 | Try performing the necessary actions with `taxi.step()` and printing out the rewards.
- [x] 20 | You got it!

## Observations vs. renderings
<!-- coding exercise -->

The Frozen Lake environment allows us to create a visual rendering of the environment, which we've seen earlier. This is for human/debugging purposes, and is _not_ seen by the agent/algorithm. Your task here is to play a given Frozen Lake environment **without looking at the visual rendering** (no cheating!). Populate the `actions` list with a sequence of actions that gets the agent to the goal. What you're experiencing is what an RL algorithm "sees" when learning!

Note that the given Frozen Lake environment is 3x3 instead of 4x4. Thus, the observation space runs from 0 to 8 instead of 0 to 15.

In [None]:
# EXERCISE

import gym
import numpy as np
np.random.seed(1)
env = gym.make("FrozenLake-v1", 
               desc=gym.envs.toy_text.frozen_lake.generate_random_map(size=3, p=0.3), 
               is_slippery=False)
env.render = None

obs = env.reset()
actions = ____
for action in actions:
    obs, reward, done, _ = env.step(action)
    print("Obs:", obs, "Reward:", reward, "Done:", done)


In [42]:
# SOLUTION

import gym
import numpy as np
np.random.seed(1)
env = gym.make("FrozenLake-v1", desc=gym.envs.toy_text.frozen_lake.generate_random_map(size=3, p=0.3), is_slippery=False)
env.render = None

obs = env.reset()
actions = []
# BEGIN SOLUTION
actions = [1,2,1,2]
# END SOLUTION
for action in actions:
    obs, reward, done, _ = env.step(action)
    print("Obs:", obs, "Reward:", reward, "Done:", done)

Obs: 3 Reward: 0.0 Done: False
Obs: 4 Reward: 0.0 Done: False
Obs: 7 Reward: 0.0 Done: False
Obs: 8 Reward: 1.0 Done: True


#### What does the environment look like?

Recalling that the actions 0, 1, 2, 3 represent left, down, right, up (respectively), which of the following statements correctly describes the best path to the goal?

- [ ] Start by moving down, then down again
- [x] Start by moving down, then right
- [ ] Start by moving right, then right again
- [ ] Start by moving right, then down