## Environments syntax

#### Motivation

- So far we've used pre-defined environments like Frozen Lake and Google RecSim.
- To use RL on our own problem, we can't use any of these environments.
- We'll need to define our own environment with Python.

#### Frozen Lake Review

- Recall the Frozen Lake environment, from Module 1:

In [3]:
import gym
env = gym.make("FrozenLake-v1")
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


#### Frozen Lake Review

- OpenAI Gym is open source, so we could look at the [Frozen Lake source code](https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py).
- However, it's complicated and contains much more than we need.
- Let's make our own environment called Frozen Pond with the basic components of Frozen Lake.

#### Components of an Env

Conceptual decisions:

- Observation space
- Action space

In Python we will need to implement, at least:

- constructor
- `reset()`
- `step()`

In practice, we may also want other methods, such as `render()`.

#### Conceptual decisions

- In this case, since we're mimicking Frozen Lake, the observation space and action space are already decided.

In [4]:
observation_space = gym.spaces.Discrete(16)
action_space = gym.spaces.Discrete(4)

Later in this course we'll dive deeper into these decisions!

#### Coding it up

In [5]:
import gym

class FrozenPond(gym.Env):
    pass

- Notice that we start by subclassing `gym.Env`.
- Optional: You can read about objects, inheritance, and subclasses.
- Punch line: `FrozenPond` is now a basic `gym.Env`, and we can override its methods.

#### Constructor

- The constructor gets called when we make a new `FrozenPond` object.
- Here is where we define the observation space and action space.

In [6]:
import gym

class FrozenPond(gym.Env):
    def __init__(self):
        self.observation_space = gym.spaces.Discrete(16)
        self.action_space = gym.spaces.Discrete(4)        

#### Reset

- The next method we'll need is `reset`.
- The constructor sets permanent parameters like the observation space.
- `reset` sets up each new episode.
- Some choices could go in either place, e.g. setting the exit location.
- If something _could_ change, we'll put it in `reset`.

In [12]:
# HIDDEN
import numpy as np

In [49]:
class FrozenPond(gym.Env):
    def reset(self):
        self.player = (0, 0) # the player starts at the top-left
        self.exit = (3, 3)   # exit is at the bottom-right
        
        self.holes = np.array([
            [0,0,0,0], # FFFF 
            [0,1,0,1], # FHFH
            [0,0,0,1], # FFFH
            [1,0,0,0]  # HFFF
        ])
        
        return 0 # the observation corresponding to (0,0)

#### Reset

Let's test this out:

In [50]:
fp = FrozenPond()

In [51]:
fp.reset()

0

In [52]:
fp.holes

array([[0, 0, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 1],
       [1, 0, 0, 0]])

Looks good!

#### Step

- The last method we need is `step`.
- This is the most complicated method; it contains the core logic.
- Recall that `step` returns 4 things:
  1. Observation
  2. Reward
  3. Done flag
  4. Extra info (we will ignore)
- For clarity, let's write the first three as separate methods:

#### Step — observation

Recall the observation is an index from 0 to 15:

```
 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15
```

We can code this as follows:

In [18]:
class FrozenPond(gym.Env):
    def observation(self):
        return 4*self.player[0] + self.player[1]

For example, if the player is at (2,1), i.e. row 2 and column 1, then we return:

In [20]:
4*2 + 1

9
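
The mapping can also be inverted to recover the position from an observation. A minimal sketch, assuming a grid with 4 columns; the helper names `to_index` and `to_position` are illustrative and are not part of `FrozenPond`:

```python
N_COLS = 4  # assumed grid width, matching the 4x4 pond


def to_index(row, col):
    """Flatten a (row, col) position into an observation index."""
    return N_COLS * row + col


def to_position(index):
    """Recover (row, col) from an observation index."""
    return divmod(index, N_COLS)


print(to_index(2, 1))   # 9
print(to_position(9))   # (2, 1)
```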

#### Step — reward

Following the Frozen Lake example, the reward will be 1 if the agent reaches the goal, and 0 otherwise:

In [21]:
class FrozenPond(gym.Env):
    def reward(self):
        return int(self.player == self.exit)

We will modify this reward function later in the module!

#### Step — done

- Finally, we want to know when an episode is done. 
- Following Frozen Lake, the episode is done when the agent reaches the goal or falls into a hole.

In [53]:
class FrozenPond(gym.Env):
    def done(self):
        return self.player == self.exit or self.holes[self.player] == 1
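
The `done` check relies on NumPy treating a tuple index as one index per axis, so `self.holes[self.player]` reads a single cell. A small sketch of that behaviour, using a standalone copy of the hole layout (the variable names here are just for illustration):

```python
import numpy as np

holes = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

player = (1, 1)             # a tuple, like self.player
cell = holes[player]        # same as holes[1, 1]: a single element
rows = holes[list(player)]  # a list selects whole rows instead (row 1, twice)

print(cell)        # 1
print(rows.shape)  # (2, 4)
```

This is one reason it is convenient to store the player position as a tuple rather than a list.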

#### Step — putting it together

- Using the above pieces, we can now write the `step` method.
- `step` takes in an _action_, updates the _state_, and returns the observation, reward, and done flag.
- Recall how actions are encoded: 0 for left, 1 for down, 2 for right, 3 for up.
- We will implement a **non-slippery** frozen pond; in other words, deterministic rather than stochastic.

In [54]:
class FrozenPond(gym.Env):
    def step(self, action):
        # Compute the new player location
        if action == 0:   # left
            new_loc = (self.player[0], self.player[1]-1)
        elif action == 1: # down
            new_loc = (self.player[0]+1, self.player[1])
        elif action == 2: # right
            new_loc = (self.player[0], self.player[1]+1)
        elif action == 3: # up
            new_loc = (self.player[0]-1, self.player[1])
        else:
            raise ValueError("Action must be in {0,1,2,3}")
        
        # Update the player location only if you stayed in bounds
        # (if you try to move out of bounds, the action does nothing)
        if 0 <= new_loc[0] <= 3 and 0 <= new_loc[1] <= 3:
            self.player = new_loc
        
        # Return observation/reward/done
        return self.observation(), self.reward(), self.done(), None
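
The movement rule can also be written more compactly with a lookup table of row/column offsets. This is an alternative sketch of the same logic, not the code used in the rest of this section; `MOVES` and `move` are hypothetical names:

```python
# Row/column offset for each action, using the encoding above:
# 0 = left, 1 = down, 2 = right, 3 = up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(player, action, size=4):
    """Return the new (row, col), or the old one if the move leaves the grid."""
    if action not in MOVES:
        raise ValueError("Action must be in {0,1,2,3}")
    d_row, d_col = MOVES[action]
    new_loc = (player[0] + d_row, player[1] + d_col)
    in_bounds = 0 <= new_loc[0] < size and 0 <= new_loc[1] < size
    return new_loc if in_bounds else player


print(move((0, 0), 0))  # (0, 0): moving left from the corner does nothing
print(move((0, 0), 2))  # (0, 1)
```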

#### Success!

- That's it! We've implemented the necessary pieces in Frozen Pond: 
  - constructor
  - `reset`
  - `step`
- We'll also add an optional `render` function so that we can draw the state:

In [55]:
class FrozenPond(gym.Env):
    def render(self):
        for i in range(4):
            for j in range(4):
                if (i,j) == self.exit:
                    print("G", end="")
                elif (i,j) == self.player:
                    print("P", end="")
                elif self.holes[i,j]:
                    print("H", end="")
                else:
                    print("F", end="")
            print()

For simplicity, we're using `P` to denote the player, instead of the red highlighting.

#### Testing our implementation

In [61]:
# HIDDEN
class FrozenPond(gym.Env):
    def __init__(self):
        self.observation_space = gym.spaces.Discrete(16)
        self.action_space = gym.spaces.Discrete(4)      
        
    def reset(self):
        self.player = (0, 0) # the player starts at the top-left
        self.exit = (3, 3)   # exit is at the bottom-right
        
        self.holes = np.array([
            [0,0,0,0], # FFFF 
            [0,1,0,1], # FHFH
            [0,0,0,1], # FFFH
            [1,0,0,0]  # HFFF
        ])
        
        return 0 # the observation corresponding to (0,0)
    
    def observation(self):
        return 4*self.player[0] + self.player[1]
    
    def reward(self):
        return int(self.player == self.exit)
    
    def done(self):
        return self.player == self.exit or self.holes[self.player] == 1
    
    def step(self, action):
        # Compute the new player location
        if action == 0:   # left
            new_loc = (self.player[0], self.player[1]-1)
        elif action == 1: # down
            new_loc = (self.player[0]+1, self.player[1])
        elif action == 2: # right
            new_loc = (self.player[0], self.player[1]+1)
        elif action == 3: # up
            new_loc = (self.player[0]-1, self.player[1])
        else:
            raise ValueError("Action must be in {0,1,2,3}")
        
        # Update the player location only if you stayed in bounds
        # (if you try to move out of bounds, the action does nothing)
        if 0 <= new_loc[0] <= 3 and 0 <= new_loc[1] <= 3:
            self.player = new_loc
        
        # Return observation/reward/done
        return self.observation(), self.reward(), self.done(), None
    
    def render(self):
        for i in range(4):
            for j in range(4):
                if (i,j) == self.exit:
                    print("G", end="")
                elif (i,j) == self.player:
                    print("P", end="")
                elif self.holes[i,j]:
                    print("H", end="")
                else:
                    print("F", end="")
            print()

In [62]:
env = FrozenPond()
env.reset()
env.render()

PFFF
FHFH
FFFH
HFFG


In [63]:
env.step(2)

(1, 0, False, None)

In [64]:
env.render()

FPFF
FHFH
FFFH
HFFG


Looks good!

#### Testing our implementation

Let's directly compare the two environments:

In [66]:
pond = FrozenPond()
lake = gym.make("FrozenLake-v1", is_slippery=False)
pond.reset()
lake.reset()

for a in [0, 2, 2, 1, 1, 1, 1, 2]:
    pond_obs, pond_rew, pond_done, _ = pond.step(a)
    lake_obs, lake_rew, lake_done, _ = lake.step(a)
    print("%2d/%2d    %d/%d    %5s/%5s" % \
          (pond_obs, lake_obs, pond_rew, lake_rew, pond_done, lake_done))

 0/ 0    0/0    False/False
 1/ 1    0/0    False/False
 2/ 2    0/0    False/False
 6/ 6    0/0    False/False
10/10    0/0    False/False
14/14    0/0    False/False
14/14    0/0    False/False
15/15    1/1     True/ True


They look the same to me!
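
Note that the comparison loop keeps stepping without checking the done flag; that is harmless here because the episode only ends on the final action. A more defensive sketch stops as soon as the environment reports `done`. It assumes the 4-tuple `step` interface used in this section; `replay` and the tiny `Corridor` stand-in environment are hypothetical and exist only to keep the example self-contained:

```python
def replay(env, actions):
    """Apply actions in order, stopping early once the episode is done.

    Returns the list of (observation, reward, done) triples seen.
    """
    env.reset()
    history = []
    for a in actions:
        obs, rew, done, _ = env.step(a)
        history.append((obs, rew, done))
        if done:
            break
    return history


class Corridor:
    """Hypothetical stand-in env: walk right along 3 cells to reach the goal."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 2 moves right; any other action leaves the position unchanged
        if action == 2:
            self.pos = min(self.pos + 1, 2)
        done = self.pos == 2
        return self.pos, int(done), done, None


print(replay(Corridor(), [0, 2, 2, 2]))
# [(0, 0, False), (1, 0, False), (2, 1, True)]
```

With a helper like this, two environments that follow the same interface could be compared by checking that their `replay` histories are equal for the same list of actions.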

## Exercise

- Exercise: what is the difference between a maze and a pond?
  - The difference is whether the episode ends when you walk into a wall.
- Exercise: why don't we give a negative reward when the agent falls in?
  - Note that we don't explicitly penalize falling into the pond here. Instead, the penalty is indirect: the episode ends with a reward of zero, forfeiting the potential reward of 1 for reaching the goal.