# Practical 10: Reinforcement Learning

The aim of this practical is to provide an overview of the main concepts of (epsilon-greedy) Q-learning, a well-known approach to reinforcement learning, through two worked examples.

The practical uses the __[OpenAI Gym](https://gym.openai.com)__ library, an open-source toolkit for developing and comparing reinforcement learning algorithms.

<hr style="border:1px solid black"> </hr>

### Set-Up
Install and import the dependencies before starting the practical.

In [1]:
pip install gym

Collecting gym
  Downloading gym-0.23.1.tar.gz (626 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting gym-notices>=0.0.4
  Downloading gym_notices-0.0.6-py3-none-any.whl (2.7 kB)
Collecting cloudpickle>=1.2.0
  Using cached cloudpickle-2.0.0-py3-none-any.whl (25 kB)
Building wheels for collected packages: gym
  Building wheel for gym (PEP 517) ... done
  Created wheel for gym: filename=gym-0.23.1-py3-none-any.whl size=701355 sha256=8087c6ad18cdb50acf9e469d40e6c80b8b8f7155398d95636ed6ac190da9354d
  Stored in directory: /home/changhyun/.cache/pip/wheels/78/28/77/b0c74e80a2a4faae0161d5c53bc4f8e436e77aedc79136ee13
Successfully built gym
Installing collected packages: gym-notices, cloudpickle, gym
Successfully installed cloudpickle-2.0.0 gym-0.23.1 gym-notices-0.0.6
Note: you m

In [2]:
pip install pygame

Collecting pygame
  Downloading pygame-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.8 MB)
Installing collected packages: pygame
Successfully installed pygame-2.1.2
Note: you may need to restart the kernel to use updated packages.


In [3]:
import gym
from gym import envs

import time
import numpy as np

from time import sleep
from IPython.display import clear_output

The following function lets us replay the game output as a short frame-by-frame animation.

In [4]:
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)

<hr style="border:1px solid black"> </hr>

# Task 1: Self-Driving Cab
Our task is to implement Q-learning to run a simulation of a self-driving cab. In this task, there are 4 locations (labeled by different letters) and our job is to pick up the passenger at one location and drop them off in another.

## Create the environment

First we create the Taxi-v3 environment.
- It is a 5x5 grid world.
- Our taxi is placed randomly in a square.
- The passenger is placed randomly at one of the 4 possible locations.
- They wish to go to one of the 4 possible locations too.

In [5]:
env = gym.make('Taxi-v3')
env.reset()
print(env.render(mode = 'ansi'))

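+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

(Colours are not shown in this text output. Here the empty taxi, drawn in yellow, is in the second row, second column; the passenger is waiting at Y, drawn in blue; the destination is G, drawn in purple.)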




### About the environment: 
- The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
- The pipe ("|") represents a wall which the taxi cannot cross.
- R, G, Y, B are the possible pickup and destination locations. 
- The blue letter represents the current passenger pick-up location.
- The purple letter is the current destination.

### Gameplay: 

**Actions:**
There are exactly six possible actions. The taxi can try to move in any of the four cardinal directions: North, South, East or West. The two remaining actions are pick-up and drop-off, for a grand total of six actions.

The keys to use when playing the game are:
- 0 = South (Down)
- 1 = North (Up)
- 2 = East (Right)
- 3 = West (Left)
- 4 = Pickup
- 5 = Dropoff
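
For readability when inspecting recorded gameplay, the numeric codes listed above can be wrapped in a small lookup. This is only a convenience sketch: `ACTION_NAMES` and `describe_action` are hypothetical helper names and are not part of Gym.

```python
# Hypothetical helper: map the Taxi-v3 action codes listed above to names.
ACTION_NAMES = {
    0: "South (Down)",
    1: "North (Up)",
    2: "East (Right)",
    3: "West (Left)",
    4: "Pickup",
    5: "Dropoff",
}


def describe_action(action):
    """Return a human-readable name for an action code (int or digit string)."""
    return ACTION_NAMES.get(int(action), "Unknown")


print(describe_action(4))
```

Because the helper converts its argument with `int()`, it accepts both the strings typed at the keyboard and the integers returned by `env.action_space.sample()`.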

**Rewards**:
We receive +20 points for a successful drop-off and lose 1 point for every timestep taken.
There is also a 10-point penalty for each illegal pick-up or drop-off action.
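
To see how these rewards add up over a game, here is a small arithmetic sketch. The two episodes are made up for illustration, and it is assumed that the final successful drop-off step earns +20 in place of the usual -1.

```python
def episode_return(step_rewards):
    """Sum the per-step rewards collected during one episode."""
    return sum(step_rewards)


# Hypothetical episode: 10 ordinary steps (-1 each), then a successful drop-off (+20).
clean_run = [-1] * 10 + [20]

# The same episode with one illegal pick-up attempt (-10) along the way.
sloppy_run = [-1] * 10 + [-10] + [20]

print(episode_return(clean_run))   # 10
print(episode_return(sloppy_run))  # 0
```

A single illegal action wipes out the profit of an otherwise efficient game, which is why a random player scores so poorly.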

<hr style="border:1px solid black"> </hr>

## Playing the game manually

The following cell allows you to play the game manually.
- Can you follow the code? 

The code records the points earned, the number of steps taken, and the total time taken to complete the game.

In [None]:
# Start counters at zero
rew_tot = 0
epochs = 0

# Start a new game and print first screen
print('Starting Position')
new_state = env.reset()
print(env.render(mode = 'ansi'))

# The Open AI Gym function env.step() changes this variable to True when the game is solved
done = False

# Start the game timer
start = time.time()

# This is a list of all valid game commands 
valid_commands = ['0', '1', '2', '3', '4', '5', 'Q', 'q']

while not done:  # Keep playing until done is set to True
    
    action = input('Enter your next move:') #prints the text box to enter the game instructions
    
    if action=='Q' or action=='q': #Entering Q or q quits the game 
        break
        
    elif action in valid_commands: # makes sure a valid command is entered
        
        # env.step() takes an action and returns the next state and the reward.
        # It also sets done to True if:
        #   (i) we complete the game, or
        #   (ii) we reach the 200-step limit without completing the game
        new_state, reward, done, info = env.step(int(action)) 
        
        # Update rewards and epochs
        rew_tot += reward  
        epochs += 1
        
        # Print game state and current reward total
        clear_output(wait=True)
        print(env.render(mode = 'ansi'))
        print("Reward: %r" % rew_tot)
          
    else: # Print a warning if the command is not valid
        
        print('Incorrect Command, Please Try Again!')

# The while loop ends when env.step() sets done to True or the player quits

#Stop the timer
end = time.time()

#Print the final game state and scores 
clear_output(wait=True)
print("Game Over:")
print(env.render(mode = 'ansi'))
print("Final Reward: %r" % rew_tot)
print("Epochs taken: {}".format(epochs))
print("Time taken: {:.5f} seconds".format(end - start))

# Reminder of the controls
# 0 = South (Down)
# 1 = North (Up)
# 2 = East (Right)
# 3 = West (Left)
# 4 = Pickup
# 5 = Dropoff

<hr style="border:1px solid black"> </hr>

## Solving the game randomly

The following cell randomly samples the action space to solve the game. 

The code records the points earned, the number of steps taken, and the total time taken to complete the game.

In [None]:
# Start counters at zero
rew_tot = 0
epochs = 0

# Start with an empty list of frames so the game can be animated once completed
frames = []

# Start a new game and print first screen
new_state = env.reset()
print('Starting Position')
print(env.render(mode = 'ansi'))

# The Open AI Gym function env.step() changes this variable to True when the game is solved
done = False

# Start the game timer
start = time.time()

while not done:
    
    # Randomly sample an action from the game
    action = env.action_space.sample()
    
    # play the game and get new variables 
    new_state, reward, done, info = env.step(action)
    
    # Update rewards and epochs
    rew_tot += reward  
    epochs += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': new_state,
        'action': action,
        'reward': reward
        })

# Game Over 

#Stop the timer
end = time.time()

#Print the final game state and scores 
print("Game Over")
print(env.render(mode = 'ansi'))
print("Final Reward: %r" % rew_tot)
print("Epochs taken: {}".format(epochs))
print("Time taken: {:.5f} seconds".format(end - start))

As you can see, the game will most likely have stopped with a very negative points total because it timed out, i.e. `Epochs taken` is `200`.

**Watch the game:**

You can now call the `print_frames` function to watch the gameplay.

In [None]:
print_frames(frames)

<hr style="border:1px solid black"> </hr>

## Solving the game through Reinforcement Learning

In Q-learning we create and fill a table (`Q-Table`) storing state-action pairs. 

During training we fill the `Q-Table` with the maximum expected future reward for each action at each state.

After training, the `Q-Table` serves as a guide to the best action at each state. 
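
As a minimal sketch of how such a table is read, the toy example below uses 3 states and 2 actions. The values are made up and `best_action` is a hypothetical helper name; the real table for this task is built in Step 1.

```python
import numpy as np

# Toy Q-Table: one row per state, one column per action (values are made up).
toy_q = np.array([
    [0.0, 1.5],   # state 0: action 1 looks better
    [2.0, -1.0],  # state 1: action 0 looks better
    [0.3, 0.3],   # state 2: both actions look equally good
])


def best_action(q_table, state):
    """Return the index of the action with the highest Q-value in `state`."""
    return int(np.argmax(q_table[state]))


for state in range(toy_q.shape[0]):
    print(state, best_action(toy_q, state))
```

Note that `np.argmax` returns the first maximum, so ties (as in state 2) are always broken in favour of the lowest action index.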

### Step 1 Create the Q-table and initialize it
Now we need to create our Q-table. Its rows correspond to states and its columns to actions, so we first need the sizes of the observation space (`observation_space`) and the action space (`action_space`).

**Action Space** 

As a reminder, the game has six possible actions: North, South, East, West, Pick-up and Drop-off.

**Observation Space** 

We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. 

**All game states:**

We also need to account for every combination of passenger location and destination to arrive at the total number of states for our taxi environment: there are four (4) destinations {R, G, Y, B} and five (5) passenger locations {R, G, Y, B, in cab}.

So our taxi environment has 5×5×5×4 = 500 possible states in total (taxi row × taxi column × passenger location × destination).
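
The count can be checked with a short sketch that packs the four components into a single integer and unpacks them again. The ordering of the components below is an assumption made for illustration (the environment's own indexing may differ), and `encode` and `decode` are hypothetical helper names.

```python
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCS = 5  # R, G, Y, B, or in the cab
N_DESTINATIONS = 4    # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer (assumed ordering)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_loc
    index = index * N_DESTINATIONS + destination
    return index


def decode(index):
    """Invert encode(): recover the four components from a state index."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_loc = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)            # 500
print(encode(4, 4, 4, 3))  # 499, the largest index
```

Each state index therefore identifies exactly one combination of taxi position, passenger location and destination, which is what lets a single table row stand for a complete game situation.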

OpenAI Gym exposes both sizes directly through `env.action_space.n` and `env.observation_space.n`:

In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")
action_space = env.action_space.n
print("There are ", action_space, " possible actions")

So we need to create a Q-table with `state_space` rows and `action_space` columns (500x6).

We do this using the `zeros` function in NumPy.

In [None]:
Q_Table = np.zeros((state_space, action_space))
print(Q_Table)
print(Q_Table.shape)

### Step 2 Define the hyperparameters
- **Question** What is the role of all of these parameters in Q-learning? 
- **Question** Do these values seem reasonable?  

In [None]:
total_episodes = 500  # Total number of training episodes
alpha = 0.5           # Learning rate
gamma = 0.5           # Discounting rate
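
To get a feel for what the learning rate and discount rate do, here is a sketch of a single tabular Q-learning update applied to made-up numbers. `q_update` is a hypothetical helper name; the training loop in Step 3 performs the same arithmetic directly on the table.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update to a single state-action value.

    alpha scales how far the old estimate moves towards the new target;
    gamma scales how much the best value of the next state counts.
    """
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)


# Made-up situation: current estimate 0.0, step reward -1,
# best Q-value available in the next state 6.0.
print(q_update(0.0, -1, 6.0, alpha=0.5, gamma=0.5))
print(q_update(0.0, -1, 6.0, alpha=0.1, gamma=0.9))
```

With the larger discount rate the target gives more weight to future rewards, while the smaller learning rate moves the estimate only a short way towards that target.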

### Step 3 Train the Q-Table

Make sure you can follow each step in the code.

- **Question** What is the name of the equation we are using to update (train) the Q-Table? 

In [None]:
# Initialise a blank Q-Table
Q_Table = np.zeros((env.observation_space.n, env.action_space.n))

# Set Hyperparameters
total_episodes = 500  # Total number of training episodes
alpha = 0.5           # Learning rate
gamma = 0.5           # Discounting rate

# Training loop
for episode in range(total_episodes+1):
    
    Train_done = False
    state = env.reset()
    
    while not Train_done:

        # Choose the action with the highest Q-value
        action = np.argmax(Q_Table[state])  

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, Train_done, info = env.step(action)
        
        # Update the Q-Table:
        # Q(s,a) := Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q_Table[state, action] = Q_Table[state, action] + alpha * (
            reward + gamma * np.max(Q_Table[new_state]) - Q_Table[state, action])
        # Update state
        state = new_state
        
    # Validate the performance of the Q-Table every 100 training episodes
    if episode % 100 == 0:
        # Reset the average validation reward
        average_reward = 0.
        # Test the performance of the Q-Table 100 times
        for i in range(100):

            state = env.reset()
            Validation_done = False

            while not Validation_done:

                action = np.argmax(Q_Table[state])
                state, reward, Validation_done, info = env.step(action)
                average_reward += reward

        # Validation Loop Finished       
        average_reward = average_reward/100
        clear_output(wait=True)
        print('Episode {} average reward: {}'.format(episode,average_reward)) 
            
# Training finished
print("Training finished.")

### Step 4: Play the game utilising the Q-Table to determine the next step

This next block of code demonstrates how to solve the game using the Q-Table.
- **Question** Do you understand how this is done in the code?

In [None]:
frames = [] # for animation

rew_tot = 0
epochs = 0
done = False

state = env.reset()
print('Starting Position')
print(env.render(mode = 'ansi'))

start = time.time()

while not done:
    
    #Choose action and update state
    action = np.argmax(Q_Table[state])  
    new_state, reward, done, info = env.step(action)
    
    #Track rewards and epochs
    rew_tot = rew_tot + reward
    epochs += 1
    
    # Update state
    state = new_state
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': new_state,
        'action': action,
        'reward': reward
        }
    )

end = time.time()

print("Game Over: Final Reward: %r" % rew_tot)
print(env.render(mode = 'ansi'))
print("Time taken: {:.5f} seconds".format(end - start))
print("Epochs taken: {}".format(epochs))

Use `print_frames` to watch the gameplay.

In [None]:
print_frames(frames)

**Check average performance** To get a better understanding of how good our Q-Table is, determine the average game metrics over 5000 games.

In [None]:
average_reward = 0
average_epochs = 0
average_time = 0

for i in range(5000):
    
    rew_tot = 0
    epochs = 0
    done = False
    
    state = env.reset()

    start = time.time()
    while not done:

        #Choose action and update state
        action = np.argmax(Q_Table[state])  
        new_state, reward, done, info = env.step(action)

        #Track rewards and epochs
        rew_tot = rew_tot + reward
        epochs += 1

        # Update state
        state = new_state
        
    # While loop is over
    end = time.time()
    average_reward += rew_tot
    average_epochs += epochs
    average_time += (end - start)
    
# Loop Finished 
print("Average Reward: {}".format(average_reward/5000))
print("Average time taken: {:.5f} seconds".format(average_time/5000))
print("Average epochs taken: {}".format(average_epochs/5000))

### Exercise: Improve the reinforcement learning algorithm

As you can see, our AI agent is not very intelligent. 

See if you can improve its performance by tuning the hyperparameters.

Additionally, use the validation loop to implement an early stopping criterion which stops training when the average validation reward reaches a value greater than 5. *Hint* This can be done in two additional lines of code.

Use the average performance code to verify the improved performance of your table.
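
The idea behind an early stopping criterion can be sketched independently of the training loop. The validation log below is entirely made up, and `first_stopping_episode` is a hypothetical helper name; in the practical, the same comparison would sit inside the validation block of the training loop.

```python
def first_stopping_episode(validation_log, threshold):
    """Return the first episode whose average validation reward exceeds threshold."""
    for episode, average_reward in validation_log:
        if average_reward > threshold:
            return episode
    return None  # the threshold was never reached


# Made-up validation log: (episode, average validation reward)
validation_log = [(0, -200.0), (100, -152.3), (200, -41.8), (300, 6.4), (400, 7.9)]

print(first_stopping_episode(validation_log, threshold=5))  # 300
```

Stopping at the first validation score above the threshold saves training time, at the cost of possibly missing a better table a few hundred episodes later.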

<hr style="border:1px solid black"> </hr>

# Task 2: Frozen Lake

### About the game

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

In [None]:
env = gym.make('FrozenLake8x8-v1')
env.reset()
print(env.render(mode = 'ansi'))

### About the environment: 
The surface is described using a grid like the following:
- S: starting point, safe
- F: frozen surface, safe
- H: hole, fall to your doom
- G: goal, where the frisbee is located

### Gameplay: 

**Actions:** The commands to navigate the environment are:
- 0 = Left
- 1 = Down
- 2 = Right
- 3 = Up

As the ice is slippery, there is only a 1/3 chance of moving in the chosen direction. There is also a 1/3 chance of moving in each of the two directions perpendicular to the intended one. 
*For example*, if the action is `left`, then:
- P(move left)=1/3
- P(move up)=1/3
- P(move down)=1/3
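
The slipping rule can be simulated in a few lines. This is a sketch under the assumption that the action codes listed above are ordered so that each action's neighbours (modulo 4) are its perpendicular directions; grid edges and holes are ignored, and `slippery_move` is a hypothetical helper name.

```python
import random

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_move(intended, rng=random):
    """Return the direction actually moved: the intended one or either
    perpendicular one, each with probability 1/3."""
    candidates = [(intended - 1) % 4, intended, (intended + 1) % 4]
    return rng.choice(candidates)


rng = random.Random(0)  # seeded so the experiment is repeatable
moves = [slippery_move(LEFT, rng) for _ in range(9000)]

for direction, name in [(LEFT, "left"), (UP, "up"), (DOWN, "down")]:
    print(name, round(moves.count(direction) / len(moves), 2))
```

Each of the three printed fractions should come out close to 1/3, and the agent never slips in the direction opposite to the one it chose.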

**Rewards:** The reward schedule is:
- Reach goal (G): +1
- Reach hole (H): 0

<hr style="border:1px solid black"> </hr>

### Exercise 1: 
Implement Q-learning to run a simulation of the frozen lake game

**Hint** 
Some reasonable hyperparameters for this game are:
- Number of training episodes: 500000
- Discount factor: 0.95
- Learning rate: 0.01
- Perform the validation loop every 10000 episodes 
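
To get a feel for the suggested discount factor, the sketch below computes the discounted return of two made-up episodes in which the only non-zero reward is the +1 at the goal. The episode lengths are illustrative (14 is the minimum number of moves across an 8x8 grid, ignoring slips), and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episodes: the goal (+1) is reached on step 14 or on step 40.
short_path = [0] * 13 + [1]
long_path = [0] * 39 + [1]

print(discounted_return(short_path, gamma=0.95))
print(discounted_return(long_path, gamma=0.95))
```

The shorter path yields the larger discounted return, which is how the discount factor nudges the agent towards reaching the frisbee quickly even though no step carries a penalty.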

In [None]:
# Implement your code here 

**Observe** the performance of the Q-table over one iteration of the game
- What is happening? **Answer** The agent doesn't move much.
- Why do you think this is? **Answer** There is no incentive to move, because there is no punishment for not moving.

In [None]:
frames = [] # for animation

rew_tot = 0
epochs = 0
new_state = env.reset()
print(env.render(mode = 'ansi'))
done = False

start = time.time()
while not done:
    
    action = np.argmax(Q_Table[new_state])  
    new_state, reward, done, info = env.step(action)
    rew_tot = rew_tot + reward  
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': new_state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
end = time.time()

print("Game Over: Final Reward: %r" % rew_tot)
print("Time taken: {:.5f} seconds".format(end - start))
print("Epochs taken: {}".format(epochs))

In [None]:
print_frames(frames)

<hr style="border:1px solid black"> </hr>

### Exercise 2:
To improve the performance of our training you will need to implement an __[epsilon-greedy](https://www.baeldung.com/cs/epsilon-greedy-q-learning)__ policy to encourage more exploration of the environment.
- Hint: use an `if: else:` statement as part of your solution.

Additionally, include an early stopping policy that stops updating the table once the average validation reward reaches 0.8.
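
As a starting point, the action-selection step of an epsilon-greedy policy might look like the sketch below. It is not a full solution: `epsilon_greedy_action`, the value of `epsilon` and the toy row of Q-values are all assumptions made for illustration. Inside the training loop, `env.action_space.sample()` could play the role of the random branch.

```python
import numpy as np


def epsilon_greedy_action(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    else:
        return int(np.argmax(q_row))          # exploit: best known action


rng = np.random.default_rng(seed=0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])  # made-up Q-values for one state

actions = [epsilon_greedy_action(q_row, 0.2, rng) for _ in range(1000)]
print(np.bincount(actions, minlength=4) / 1000)
```

Most choices land on the greedy action (index 1 here), while the remaining ones are spread over all four actions, so every action keeps being tried occasionally.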

In [None]:
# Implement your code here 

In [None]:
frames = [] # for animation

rew_tot = 0
epochs = 0

new_state = env.reset()
print("Game Start:")
print(env.render(mode = 'ansi'))

done = False

start = time.time()
while not done:
    
    action = np.argmax(Q_Table[new_state])  
    new_state, reward, done, info = env.step(action)
    rew_tot = rew_tot + reward  
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': new_state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
end = time.time()

print("Game Over: Final Reward: %r" % rew_tot)
print(env.render(mode = 'ansi'))
print("Time taken: {:.5f} seconds".format(end - start))
print("Epochs taken: {}".format(epochs))

In [None]:
print_frames(frames)

<hr style="border:1px solid black"> </hr>