---
# Introduction / setup

This notebook is accompanying material for the 1st tutorial on Topic 2 in the **Machine Learning in Mathematics & Theoretical Physics** Summer School in Oxford in 2023.

This tutorial is on the topic of reinforcement learning. We'll look at some of the basic concepts of environments, agents, and reinforcement learning algorithms. (Most of this will happen on the blackboard.)

In the later parts of the tutorial we will train an agent to solve some simple games, which are given as **OpenAI gym** environments, and we will use the **Stable Baselines** library of reinforcement learning algorithms.

For example, we will train a Lunar Lander agent to land a spaceship, as in the animation below:

![Lunar Lander](https://cdn-images-1.medium.com/max/960/1*f4VZPKOI0PYNWiwt0la0Rg.gif)

We will start the tutorial with some blackboard review on the basics of an environment, using the example of a **GridWorld** maze. While we're doing this, you can run the following code cells to set up all the requisite packages that we'll need in the notebook below (this will take a few minutes).

*Note: This notebook is designed to work in a ***Google Colab*** environment. If you're running things locally, you might have to tweak some parts. Note in particular that when we visualise our agent's performance, we'll be using a virtual display, since the remote machine doesn't have a display.*

In [None]:
# For some environments we will need box2d-py
!apt-get update && apt-get install swig cmake
!pip install box2d-py

# stable-baselines3 is a library of various reinforcement learning algorithms
!pip install "stable-baselines3[extra]>=2.0.0a4"

# For the custom GridWorld environment, we need LaTeX for some fonts
!apt install texlive texlive-latex-extra texlive-fonts-recommended dvipng cm-super
!pip install latex

# For grabbing some files we'll need for GridWorld
!pip install requests

With the above packages installed, you should be able to import everything you need like this:

In [None]:
#import gym
#from gym import spaces
#from gym.utils import seeding
import gymnasium as gym
from stable_baselines3 import DQN, A2C
from stable_baselines3.common.evaluation import evaluate_policy

import copy
import numpy as np
import imageio
import requests
import matplotlib.pyplot as plt
from matplotlib import rc
rc('text', usetex=True)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

---
# A basic environment: GridWorld

Here we'll set up an example environment, called a **GridWorld** environment, in which an agent can move around on a grid. The grid contains a number of pits into which the agent can fall (bad), and there is also an exit (good).

To do this we need to first grab the files ```gridworld.py``` and ```helperFunctions.py```, which we can do as follows:

(You should then be able to see the files (on Google Colab) by opening the files tab on the left, by clicking the folder icon.)

In [None]:
base_url = 'https://raw.githubusercontent.com/callum-ryan-brodie/oxford-ml-physmath-school/main/'
for filename in ['gridworld.py', 'helperFunctions.py']:
    r = requests.get(base_url + filename, allow_redirects=True)
    r.raise_for_status()  # Stop here if the download failed
    with open(filename, 'wb') as f:  # The file is closed automatically
        f.write(r.content)

We can then import them as:

In [None]:
import gridworld
import helperFunctions

Now we can create a **GridWorld** environment:

In [None]:
np.random.seed(4) # Ensures the same maze each time - delete and run the below twice for a random maze
env = gridworld.GameEnv()

Let's take a look at the initial environment by visualising it:

In [None]:
env.reset()
world = env.render_world()
plt.ioff()
plt.axis("off")
plt.imshow(world, interpolation="nearest")
plt.title("Maze layout", fontsize=24)
plt.tight_layout()
plt.show()
plt.close()

In the visualisation of the environment we have the following colour-coding:

- In blue is the agent, which begins in the top-left.
- In red are the pits, which we'll consider as terminal states, meaning the game ends if the agent steps into one.
- In green is the exit of the maze, which is also a terminal state (but one we'll later encode as a 'good one' with a positive reward!)
- (And in white are the spaces where the agent can freely walk.)

The action space for the agent is discrete and consists of four possible actions: up, down, left, and right.

We can step our agent through the environment by running ```env.step(i)```, where $i \in \{0,1,2,3\}$ and where the index encodes the actions in the order up, down, left, right.

**Q: Complete the following code cell to make the agent take a step down, and then visualise the new environment.**

(Note: Make sure you don't reset the environment again before visualising!)

In [None]:
### COMPLETE THIS CODE CELL ###

**Q: Next, use the following code cell to experiment with moving the agent, for example into a pit or against the boundary of the maze, and visualise the environment afterwards to see what happens.**

(And remember you can reset the environment at any point with ```env.reset()```.)

In [None]:
### COMPLETE THIS CODE CELL ###

Finally, note that ```env.step(i)``` returns a tuple containing, in order:
1. The position $(x,y)$ that the agent ends up at after the action.
2. The reward $r(s,a)$ for taking this action from this state.
3. A boolean which is **True** if the agent has ended up in a terminal state.
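
As a rough illustration of how such a tuple is typically consumed, the sketch below unpacks it inside a simple interaction loop. To keep it self-contained it uses a made-up stand-in environment, `CorridorEnv` (hypothetical, not part of ```gridworld.py```), whose `step` returns the same kind of `(state, reward, done)` triple. The reward values and the `rollout` helper are assumptions for illustration only.

```python
class CorridorEnv:
    """Hypothetical stand-in: a 1-D corridor with the exit at position 3."""

    def reset(self):
        self.position = 0

    def step(self, action):
        # Action 1 moves right; any other action stays put (illustrative rule)
        if action == 1:
            self.position += 1
        done = self.position == 3
        reward = 10.0 if done else -1.0  # arbitrary illustrative rewards
        return self.position, reward, done


def rollout(env, actions):
    """Apply the actions in order, stopping at a terminal state; return the history."""
    env.reset()
    history = []
    for action in actions:
        state, reward, done = env.step(action)  # unpack the returned tuple
        history.append((state, reward, done))
        if done:
            break
    return history


history = rollout(CorridorEnv(), [1, 0, 1, 1, 1])
total_reward = sum(reward for _, reward, _ in history)
print(history)
print(total_reward)
```

The loop stops as soon as `done` is `True`, so the last action in the list is never applied.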

**Q: In the following code cell, play around with moving the agent with ```env.step(i)``` and look at the returned tuples, while comparing with visualisations of the environment, to see that it behaves correctly at the various terminal states, and to see what the rewards are for the various state/action possibilities.**

The outputs of ```env.step(i)``` are what we will feed back to the reinforcement learning algorithm during training.

In [None]:
### COMPLETE THIS CODE CELL ###

---
# Code behind an environment

(In the code excerpts below we have omitted some comments, and used ellipses to skip over code that isn't conceptually important.)

We also skip over all the code that is simply for visualising the environment.

The environment is defined as a Python class (object-oriented programming):

```
class GameEnv:
    def __init__(self):
```

## State and action space

The state and action space are simply set up as follows:
```
        self.sizeX = 5
        self.sizeY = 5

        ...

        self.state = ()

        ...

        self.action_space = [0, 1, 2, 3]
```
Here the ```state``` is just the $(x,y)$ coordinate of the 'worker' (the name for the blue block being moved through the maze). Note that there is no other information in the state space - the worker's position is the only aspect of the environment that changes upon an action.

Note that more generally one can have:
- State $\neq$ worker's state
- Agent $\neq$ worker
- State $\neq$ 'observation'


In an OpenAI gym environment (e.g. **FrozenLake**, which is a type of GridWorld), the state and action spaces are instead declared as:
```
        self.observation_space = spaces.Discrete(nS)
        self.action_space = spaces.Discrete(nA)
```
In the above ```nA``` and ```nS``` are the number of possible actions and states respectively.

There is also for example ```spaces.Box``` for continuous state / action spaces. See the gym documentation on the ```spaces``` [superclass](https://www.gymlibrary.dev/api/spaces/) for the full set of options.

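## Rewards

Every move costs a small step penalty, a move that is blocked by the boundary costs a slightly larger penalty, falling into a pit is penalised more heavily, and reaching the exit gives a large positive reward: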
```
        self.step_penalty = -1.
        self.pitfall_penalty = -50.
        self.exit_reward = 100000.
        self.no_move_penalty = -2.
```

## Initialisation

```
class GameOb:
    def __init__(self, name, reward, coordinates, size, rgba):
        self.x = coordinates[0]
        self.y = coordinates[1]
        self.size = size
        self.channel = rgba
        self.reward = reward
        self.name = name
```

```

    self.num_pits = 7

    ...

    self.objects = []

    ...

    self.initial_x = 0
    self.initial_y = 0

    ...

    def initialize_world(self):
        self.objects = []

        maze_exit = GameOb('exit', self.exit_reward, [4, 4], 1, [0, 1, 0, 1])
        self.objects.append(maze_exit)

        worker = GameOb('worker', None, [0, 0], 1, [0, 0, 1, 1])
        self.objects.append(worker)
        for i in range(self.num_pits):  # add pitfalls
            pitfall = GameOb('pitfall', self.pitfall_penalty, self.new_position(), 1, [1, 0, 0, 1])
            self.objects.append(pitfall)

        ...
```

## Resetting the environment

We reset the environment by running ```env.reset()```:
```
    def reset(self):
        self.steps = 0
        self.steps_taken = []
        self.gave_up = False
        self.fell = False
        self.state = (self.initial_x, self.initial_y)

        for obj in self.objects:
            if obj.name == 'worker':
                obj.x = self.initial_x
                obj.y = self.initial_y
                break
```

## Taking an action

We take an action by running ```env.step()```:

```

    self.max_steps = 1000

    ...

    def step(self, action, update_view=True):

        ...

        reward, done = self.move_worker(action)

        self.steps += 1
        self.steps_taken.append(action)

        if self.steps >= self.max_steps and not done:
            done = True
            self.gave_up = True

        if self.fell:
            done = True

        ...

        return self.get_state(), reward, done
```

We see that the actual environment update is handled in the separate function ```move_worker```:

```
    def move_worker(self, direction):

        worker = None
        others = []
        for obj in self.objects:
            if obj.name == 'worker':
                worker = obj
            else:
                others.append(obj)

        worker_x = worker.x
        worker_y = worker.y

        reward = self.step_penalty

        if direction == 0 and worker.y >= 1:
            worker.y -= 1
        if direction == 1 and worker.y <= self.sizeY - 2:
            worker.y += 1
        if direction == 2 and worker.x >= 1:
            worker.x -= 1
        if direction == 3 and worker.x <= self.sizeX - 2:
            worker.x += 1

        if worker.x == worker_x and worker.y == worker_y:
            reward = self.no_move_penalty

        for i in range(len(self.objects)):
            if self.objects[i].name == 'worker':
                self.objects[i] = worker
                break

        is_maze_solved = False
        for other in others:
            if worker.x == other.x and worker.y == other.y:
                if other.name == "exit":
                    is_maze_solved = True
                    reward = other.reward
                    break
                elif other.name == "pitfall":
                    is_maze_solved = False
                    reward = other.reward
                    self.fell = True
                    break

        return reward, is_maze_solved
```
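
To make the reward bookkeeping concrete, here is a minimal stand-alone sketch that mirrors the bounds checks and reward values quoted above, but on a pit-free grid. It does not import ```gridworld.py```. The names `move` and `replay` are hypothetical helpers introduced only for this illustration, and the exit position `(4, 4)` is taken from ```initialize_world```.

```python
# Stand-alone sketch: same bounds checks and reward values as the excerpts above,
# but with no pits and no GameOb objects.
SIZE_X, SIZE_Y = 5, 5
STEP_PENALTY, NO_MOVE_PENALTY, EXIT_REWARD = -1.0, -2.0, 100000.0
EXIT = (4, 4)


def move(state, action):
    """Return (new_state, reward, done) for actions 0=up, 1=down, 2=left, 3=right."""
    x, y = state
    if action == 0 and y >= 1:
        y -= 1
    elif action == 1 and y <= SIZE_Y - 2:
        y += 1
    elif action == 2 and x >= 1:
        x -= 1
    elif action == 3 and x <= SIZE_X - 2:
        x += 1
    if (x, y) == state:  # blocked by the boundary
        return state, NO_MOVE_PENALTY, False
    if (x, y) == EXIT:  # the exit reward replaces the step penalty
        return (x, y), EXIT_REWARD, True
    return (x, y), STEP_PENALTY, False


def replay(actions, start=(0, 0)):
    """Apply a list of actions and accumulate the (undiscounted) total reward."""
    state, total, done = start, 0.0, False
    for action in actions:
        state, reward, done = move(state, action)
        total += reward
        if done:
            break
    return state, total, done


# Bump into the top wall once, then go right four times and down four times
final_state, total_reward, solved = replay([0] + [3] * 4 + [1] * 4)
print(final_state, total_reward, solved)
```

Here the blocked move costs $-2$, the seven ordinary moves cost $-1$ each, and the final move into the exit earns $100000$, so the total is $99991$.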

---
# The LunarLander environment, with a Deep Q-Network

Here we'll look at utilising a standard reinforcement learning algorithm, called DQN, which we'll get from the **Stable Baselines** library (see [here](https://stable-baselines3.readthedocs.io/en/v2.0.0/modules/dqn.html)).

To see the utility of this algorithm we'll look at a more complicated environment than the **GridWorld** environment above, namely the [**Lunar Lander**](https://www.gymlibrary.dev/environments/box2d/lunar_lander/) environment that comes with OpenAI's **gym**.

Since this is one of the standard **gym** environments, we can set up an agent for it simply as:

In [None]:
model = DQN(
    "MlpPolicy",
    "LunarLander-v2",
    verbose=1,
    exploration_final_eps=0.1, # The lowest epsilon value to reach (for explore vs. exploit)
    target_update_interval=250, # How often (in environment steps) to update the target network
)
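
For intuition about what `exploration_final_eps` controls, here is a minimal sketch of epsilon-greedy action selection with a linearly decaying epsilon. It is not Stable Baselines code. The function names are hypothetical, and the starting value of 1.0 and the decay over the first 10% of training are assumptions chosen for illustration (see the DQN documentation for the library's actual settings).

```python
import random


def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.1, fraction=0.1):
    """Decay epsilon linearly from initial_eps to final_eps, then hold it constant."""
    progress = min(step / (fraction * total_timesteps), 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit (argmax)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon at a few points during a 10,000-step training run
for step in (0, 500, 1000, 10000):
    print(step, round(linear_epsilon(step, total_timesteps=10000), 3))

# With epsilon = 0 the choice is purely greedy
greedy_action = epsilon_greedy([0.1, 0.7, 0.2, 0.0], epsilon=0.0)
print(greedy_action)
```

Early in training almost every action is random (exploration). Once epsilon has reached its final value, roughly one action in ten is still random and the rest follow the current Q-value estimates (exploitation).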

Currently our agent is untrained, so if we evaluate it we will see that it doesn't perform well.

Here we create a separate environment in which to evaluate the agent.

(We'll see below that a good performance is something like a $+200$ reward.)

In [None]:
# Separate env for evaluation
eval_env = gym.make("LunarLander-v2")

# Untrained agent (randomly initialised network), before training
mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
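
The two numbers returned by `evaluate_policy` summarise the total reward collected in each evaluation episode. As a rough sketch of that summary, using only the standard library: the episode returns below are made-up numbers, and the use of the population standard deviation is an assumption about how the spread is computed.

```python
import statistics

# Made-up total rewards from four evaluation episodes (illustrative only)
episode_returns = [-150.0, -90.0, -210.0, -120.0]

mean_reward = statistics.fmean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population standard deviation

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

A large spread relative to the mean suggests that more evaluation episodes are needed before comparing two agents.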

The syntax to train this model is also very simple:

(Here I'll train it for 10,000 timesteps, but this won't be enough to get good performance. Try training it for 100,000 timesteps instead.)

In [None]:
model.learn(total_timesteps=int(1e4))

Once training is completed, we can evaluate the agent's typical performance:

In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

Or more interestingly, we can take a look at a video of its performance.

The below code will create a GIF of our **Lunar Lander** agent. When it's done, we can look at the GIF by (in Google Colab) clicking on the files tab on the left-hand side, and double-clicking on **lander.gif**.

In [None]:
images = []
obs = model.env.reset()
img = model.env.render()
for i in range(1000):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)
    img = model.env.render()
    # Stop once both legs report ground contact (the last two observation entries)
    if obs[0][-1] == 1 and obs[0][-2] == 1:
        break

imageio.mimsave("lander.gif", [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)

---
# Other environments and algorithms

There are a number of other environments provided by **gym**. There are also a number of other reinforcement learning algorithms provided by **Stable Baselines**.

For example, let's look at the [**Cart Pole**](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) environment, and train an agent with the [A2C algorithm](https://stable-baselines3.readthedocs.io/en/v2.0.0/modules/a2c.html).



![Cart Pole](https://www.gymlibrary.dev/_images/cart_pole.gif)

In [None]:
model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    verbose=1
)

In [None]:
eval_env = gym.make("CartPole-v1")

mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
model.learn(total_timesteps=int(1e4))

In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
images = []
obs = model.env.reset()
img = model.env.render()
for i in range(1000):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)
    img = model.env.render()

imageio.mimsave("cartpole.gif", [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=20)

If you have time, you can try out some of the other **gym** environments, and some of the other reinforcement learning algorithms from **Stable Baselines**, using the above templates.