# Reinforcement Learning project - Q-Learning




## Aim
In this lab we are going to solve two simple [OpenAI Gym](https://gym.openai.com/) environments using [Q-Learning](https://en.wikipedia.org/wiki/Q-learning). Specifically, the [CartPole-v0](https://gym.openai.com/envs/CartPole-v0/) and [MountainCar-v0](https://gym.openai.com/envs/MountainCar-v0/) environments.

We will try to create a table containing the expected reward for each combination of a *state* and *action*. We will use this table to choose the (hopefully) best action given the state the system is in.

While this may not be the most advanced or complicated model there is, it is perfect for this task! Furthermore, it can be trained in a relatively short time!
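
As a rough illustration of the table described above, the sketch below stores made-up expected rewards for three discrete states and two actions, then picks the best action per state. The values and the `best_action` helper are hypothetical and not part of the lab code, which uses a NumPy array rather than nested lists.

```python
# Toy Q-table: 3 discrete states x 2 actions (values are made up for illustration).
q_table = [
    [0.1, 0.5],   # state 0: action 1 looks better
    [0.7, 0.2],   # state 1: action 0 looks better
    [0.0, 0.0],   # state 2: tie, the first action is returned
]


def best_action(q_table, state):
    """Return the index of the action with the highest expected reward."""
    values = q_table[state]
    return max(range(len(values)), key=values.__getitem__)


print([best_action(q_table, state) for state in range(3)])
```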

## Runtime and environment
This [Jupyter Notebook](https://jupyterlab.readthedocs.io/en/latest/) was made to run on Google Colab. For this training, we recommend using the Google Colab environment.

Please read the [instructions on Google Colab](https://medium.com/swlh/the-best-place-to-get-started-with-ai-google-colab-tutorial-for-beginners-715e64bb603b) to get started quickly. It behaves similarly to Jupyter Notebook, Jupyter Hub and Jupyter Lab, so if you have any experience with those, you're good to go!

Some notes on Google Colab:
- **Processes in Google Colab won't run forever**. They may be terminated at any time when the platform is crowded, and *will definitely* terminate after 12 hours. To keep your results, you can attach the session to **Google Drive** and have your models saved to Google Drive periodically.
- You can enable GPU or TPU support! You can find this option under *Runtime* -> *Change runtime type*.
- After installing dependencies, you need to restart the runtime in order to actually use them.
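
The periodic saving mentioned in the notes above could look roughly like the following sketch. The helper names `save_checkpoint` and `load_checkpoint` and the file name are assumptions; the sketch uses `pickle` from the standard library, although `np.save` would work equally well for a NumPy Q-table.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(obj, path):
    """Pickle obj to path, writing to a temporary file first so that an
    interrupted session does not leave a half-written checkpoint behind."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
    tmp_path.replace(path)


def load_checkpoint(path, default=None):
    """Return the stored object, or default if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a temporary directory standing in for the Google Drive folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "part1" / "q_table.pkl"
    save_checkpoint([[0.0, 1.0], [0.5, 0.25]], checkpoint)
    restored = load_checkpoint(checkpoint)
    missing = load_checkpoint(Path(tmp_dir) / "does_not_exist.pkl", default=[])
```

In a training loop, calling something like `save_checkpoint(q_table, data_path / 'q_table.pkl')` every few hundred episodes would probably be enough.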

If you want to run the code on your own platform or system, you need to keep a few things in mind:
- The dependencies you need to install may differ from the ones we installed here. The installed dependencies are suitable for Google Colab, Ubuntu, and Debian.
- Since Google Colab isn't attached to a monitor, we render the output to a video file. On your own machine the built-in render method from OpenAI's Gym may suffice.
- The default paths use Google Drive! Change these.
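
One way to avoid editing paths by hand is to pick the default depending on where the notebook runs. This is only a sketch: the `running_on_colab` and `default_data_path` helpers and the local `HU_RL/part1` folder are assumptions, and only the Google Drive path comes from this notebook.

```python
from pathlib import Path


def running_on_colab():
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
        return True
    except ImportError:
        return False


def default_data_path():
    """Use Google Drive on Colab and a folder next to the notebook elsewhere."""
    if running_on_colab():
        return Path('/content/gdrive/My Drive/Colab Notebooks/HU_RL/part1')
    return Path.cwd() / 'HU_RL' / 'part1'


print(default_data_path())
```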

## Info Support
This assignment was developed by Info Support. Looking for a graduation project or job? Check out their website: https://carriere.infosupport.com/



# Preparation

Some dependencies need to be installed for the code to work. Furthermore, we will define some methods which allow us to show the OpenAI Gym renderings in this (headless) Google Colab environment.

You only have to run these and don't need to change any of the code.

In [1]:
# Install dependencies
"""Note: if you are running this code on your own machine, you probably don't need all of these.
   Start with 'pip install gym' and install more packages if you run into errors."""
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg cmake > /dev/null 2>&1

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install colabgymrender

Requirement already up-to-date: setuptools in /usr/local/lib/python3.7/dist-packages (57.0.0)
Collecting colabgymrender
  Downloading https://files.pythonhosted.org/packages/19/1d/47289e427492af14ced09dfe1531bf3ce8178e7504a8222669b3193d165e/colabgymrender-1.0.9-py3-none-any.whl
Installing collected packages: colabgymrender
Successfully installed colabgymrender-1.0.9


In [2]:
# Imports for helper functions
import base64
import io
import math
from pathlib import Path

import gym
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from colabgymrender.recorder import Recorder
from google.colab import drive
from gym.wrappers import Monitor
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display

Imageio: 'ffmpeg-linux64-v3.3.1' was not found on your computer; downloading it now.
Try 1. Download from https://github.com/imageio/imageio-binaries/raw/master/ffmpeg/ffmpeg-linux64-v3.3.1 (43.8 MB)

In [3]:
# Mount your Google Drive. By doing so, you can store any output, models, videos, and images persistently.
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
# Create a directory to store the data for this lab. Feel free to change this.
data_path = Path('/content/gdrive/My Drive/Colab Notebooks/HU_RL/part1')
data_path.mkdir(parents=True, exist_ok=True)
video_path = data_path / 'video'

In [5]:
# Define helper functions to visually show what the models are doing.
%matplotlib inline

gym.logger.set_level(gym.logger.ERROR)

display = Display(visible=0, size=(1400, 900))
display.start()

def show_video():
    # Display the stored video file
    # Credits: https://star-ai.github.io/Rendering-OpenAi-Gym-in-Colaboratory/
    mp4list = list(data_path.glob('video/*.mp4'))
    if len(mp4list) > 0:
        mp4 = mp4list[-1]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
            </video>'''.format(encoded.decode('ascii'))))
    else: 
        print('Could not find video')


def record_episode(idx):
    # This determines which episodes to record.
    # Since the video rendering in the OpenAI Gym is a bit buggy, we simply override it and decide
    # whether or not to render inside of our training loop.
    return True

    
def video_env(env):
    # Wraps the environment to write its output to a video file
    env = Monitor(env, video_path, video_callable=record_episode, force=True)
    return env


# Test the environment

In [6]:
"""We will use a basic OpenAI Gym example: CartPole-v0.
In this example, we will try to balance a pole on a cart.
This is similar to kids (and grown-ups) trying to balance sticks on their hands.

Check out the OpenAI Gym documentation to learn more: https://gym.openai.com/docs/"""

# Create the desired environment
env = gym.make("CartPole-v0")

# Wrap the environment, to make sure we get to see a fancy video
env = video_env(env)

# Before you can use a Gym environment, it needs to be reset.
state = env.reset()

# Perform random actions until we drop the stick. Just as an example.
done = False
while not done:
    env.render()
    # The action_space contains all possible actions we can take.
    random_action = env.action_space.sample()
    # After each action, we end up in a new state and receive a reward.
    # When we drop the pole (more than 12 degrees), or balance it long enough (200 steps),
    # or drive off the screen, done is set to True.
    state, reward, done, info = env.step(random_action)
    # Stop after a single step so that we can inspect the resulting state below.
    # Remove this break to keep taking random actions until the episode ends.
    break
# Show the results!
env.close()
show_video()

In [7]:
# Neat, it did something (randomly)! 

# In order to train the system, we will try to predict the reward a certain action yields given the state of the system.
# But what is the state anyway?

# In this environment, the state represents the cart's position and velocity, and the pole's angle and velocity.

# Let's check out the current state
print(f'Cart position: {state[0]} (range: [-4.8, 4.8])')
print(f'Cart velocity: {state[1]} (range: [-inf, inf])')
print(f'Pole angle: {state[2]} (range: [-0.418, 0.418])')
print(f'Pole velocity: {state[3]} (range [-inf, inf])')

# You can find out the minimum and maximum possible observation values using:
print(f'Low observation space:', env.observation_space.low)
print(f'High observation space:', env.observation_space.high)

Cart position: 0.005880129578607302 (range: [-4.8, 4.8])
Cart velocity: -0.21530446554717803 (range: [-inf, inf])
Pole angle: -0.014180207275466405 (range: [-0.418, 0.418])
Pole velocity: 0.3266362077717006 (range [-inf, inf])
Low observation space: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
High observation space: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


# Implement Q-Learning

## Task
Implement Q-Learning and find suitable parameters to reach a total reward of 200 per episode.
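
For reference, the standard tabular Q-learning update is shown below. In the code later in this notebook, $\alpha$ corresponds to `learning_rate` and $\gamma$ to `discount`; $s_t$, $a_t$ and $r_{t+1}$ denote the current state, the chosen action and the reward received for it.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```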

In [8]:
# Define parameters - Fill in the dots

num_episodes = 10000
num_episodes_between_status_update = 500
num_episodes_between_videos = 5000

learning_rate = 0.05         # also known as: alpha
discount = 0.95              # also known as: gamma
epsilon = 0.5

# Epsilon decay
# Factor that epsilon is multiplied by after every 1% of the episodes,
# so with 10000 episodes, epsilon becomes epsilon * epsilon_factor every 100 episodes.
epsilon_factor = 0.97
pass    # Optionally, add parameters for epsilon decay here

# Discretization
pass    # You can add parameters to discretize states here

In [9]:
## Q-Table creation

# As seen before, the state consists of 4 floating point values.
# It makes sense to discretize these values (read: place them in buckets), to reduce the state space and therefore the Q-table size
state_shape = [20,20,20,20]      # For instance: [4, 4, 6, 6], or [10] * 4, or [200, 200, 100, 100]

# Define the initial Q table as a random uniform distribution
q_table = np.random.uniform(low=-2, high=0, size=(state_shape + [env.action_space.n]))

#print('Initial Q table:', q_table)
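
To make the bucketing idea concrete, here is a small standalone sketch that maps a continuous value to a bucket index. It uses `bisect_right` from the standard library, which for increasing edges returns the same index as `np.digitize` with default arguments. The `bucket` helper and the five example edges are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Five edges spanning a pole angle range of roughly [-0.4, 0.4] radians.
edges = [-0.4, -0.2, 0.0, 0.2, 0.4]


def bucket(value, edges):
    """Map value to an index in [0, len(edges) - 1].

    Values below the first edge are clipped to 0. Without the clipping they
    would produce -1, which Python and NumPy treat as "last element".
    """
    index = bisect_right(edges, value) - 1
    return min(max(index, 0), len(edges) - 1)


for angle in (-1.0, -0.014, 0.1, 1.0):
    print(angle, "->", bucket(angle, edges))
```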

# Train
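
Each training step combines two ideas: an epsilon-greedy action choice and an update of a single table entry. The standalone sketch below shows both on a tiny table. The `epsilon_greedy` and `q_update` helpers, the nested-list table, and the numbers are assumptions for illustration, not the notebook's own functions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_update(current_q, reward, max_future_q, learning_rate, discount):
    """Move current_q a fraction learning_rate towards the target."""
    target = reward + discount * max_future_q
    return current_q + learning_rate * (target - current_q)


rng = random.Random(0)          # seeded so the example is repeatable
q_table = [[-1.0, -0.5],        # state 0
           [0.5, 0.2]]          # state 1

# With epsilon=0.0 the choice is purely greedy, so action 1 is picked in state 0.
action = epsilon_greedy(q_table[0], epsilon=0.0, rng=rng)
q_table[0][action] = q_update(
    current_q=q_table[0][action],
    reward=1.0,
    max_future_q=max(q_table[1]),   # best value in the next state (state 1)
    learning_rate=0.05,
    discount=0.95,
)
print(action, q_table[0][action])
```

Only the entry for the visited state and chosen action changes; every other entry in the table is left untouched.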


In [10]:
# Functions

# def make_bin(value, min_x, max_x, n_bins):
# 	  bin_size = (max_x - min_x) / n_bins
# 	  value = (value - min_x) // bin_size
# 	  return value

# def discretize_state(state):
#     # A Q-table cannot practically handle infinite states, so limit the state space by
#     # discretizing the state into buckets.
#     bins  = [np.linspace(-3,10,state_shape[0]),
#              np.linspace(-3,10,state_shape[1]),
#              np.linspace(-3,10,state_shape[2]),
#              np.linspace(-3,10,state_shape[3])]
#     discrete_state = []
#     for i in range(len(state)):
#         discrete_state.append(np.digitize(state[i],bins[i]) - 1)
#     discrete_state = np.array(discrete_state)
#     # print("other method",discrete_state)

#     # max_x_list = [4,4,4.1,4]
#     # min_x_list = [-4,-4,-4.1,-4]
#     # discrete_state = np.array([make_bin(state[i],min_x_list[i],max_x_list[i],state_shape[i]) for i in range(len(state_shape))])
#     # print("my method",discrete_state)
#     return tuple(discrete_state.astype(np.int))

bins = [np.linspace(-4.8,4.8,state_shape[0]),
        np.linspace(-4,4,state_shape[1]),
        np.linspace(-0.418,0.418,state_shape[2]),
        np.linspace(-4,4,state_shape[3])]

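def discretize_state(state):
    # A Q-table cannot practically handle infinite states, so limit the state space by
    # discretizing the state into buckets.
    discrete_state = []
    for i in range(len(state)):
        # Clip the index: a value below the lowest bin edge would otherwise give -1,
        # which silently wraps around to the last bucket.
        index = np.digitize(state[i], bins[i]) - 1
        discrete_state.append(int(np.clip(index, 0, state_shape[i] - 1)))
    # np.int was removed from recent NumPy versions, so use plain Python ints.
    return tuple(discrete_state)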

def take_action(discrete_state, epsilon):
    # Take an action to either explore or exploit.
    # epsilon defines ratio exploration/exploitation
    if epsilon>np.random.random():
      # explore
      action = env.action_space.sample()
    else:
      # exploit: get action with highest q value
      #q_values = get_q_values(discrete_state)
      #action = q_values.index(max(q_values))
      action = np.argmax(q_table[discrete_state])
    return action

def estimated_max_for_next_state(discrete_state):
    # What's the best expected Q-value for the next state?
    estimated_max = max(q_table[discrete_state])
    return estimated_max

def new_q_value(discrete_state, action, max_future_q):
    # Calculate the new Q-value
    # Note: reward is read from the global scope; it is set in the training loop.
    current_q = q_table[discrete_state][action]
    new_q = current_q + learning_rate * (reward + discount * max_future_q - current_q)
    return new_q

def decayed_epsilon(epsilon, episode):
    # Decay epsilon once every 1% of the episodes (this includes episode 0).
    if episode % (num_episodes / 100) == 0:
        epsilon = epsilon * epsilon_factor
    return epsilon

def get_q_values(discrete_state):
    # Get q values for all possible actions in the given state
    return q_table[discrete_state]

In [None]:
# Time to train the system

for episode in range(num_episodes):
    state = env.reset() # Don't forget to reset the environment between episodes
    current_state_disc = discretize_state(state)

    reward_sum = 0
    done = False
    while not done:
        if (episode + 1) % num_episodes_between_videos == 0:
            env.render()

        # Take an action by exploration or exploitation
        action = take_action(current_state_disc, epsilon)
        new_state, reward, done, info = env.step(action)
        new_state_disc = discretize_state(new_state)

        # Calculate the total reward
        reward_sum += reward

        if not done:
            # Retrieve the maximum estimated value for the next state
            max_future_q = estimated_max_for_next_state(new_state_disc)

            # Calculate the new value (note: Bellman equation)
            new_q = new_q_value(current_state_disc, action, max_future_q)
            q_table[current_state_disc + (action,)] = new_q
        else:
            # Render the video
            if (episode + 1) % num_episodes_between_status_update == 0:
                env.render()
                print(f'Total reward at episode {episode + 1}: {reward_sum}')

        # Prepare for the next loop
        current_state_disc = new_state_disc
        

    # Decay epsilon
    epsilon = decayed_epsilon(epsilon, episode)
    if episode%1000==0:
      print(epsilon)

env.close()
show_video()

0.485
Total reward at episode 500: 19.0
Total reward at episode 1000: 91.0
0.3576507015440401
Total reward at episode 1500: 66.0
Total reward at episode 2000: 27.0
0.26374025631947234
Total reward at episode 2500: 108.0
Total reward at episode 3000: 66.0
0.19448842824343143
Total reward at episode 3500: 113.0
Total reward at episode 4000: 200.0
0.1434204593885793
Total reward at episode 4500: 130.0
Total reward at episode 5000: 169.0
0.1057617070434926
Total reward at episode 5500: 200.0
Total reward at episode 6000: 118.0
0.0779912344754647
Total reward at episode 6500: 200.0
Total reward at episode 7000: 200.0
0.05751261798852717
Total reward at episode 7500: 154.0
Total reward at episode 8000: 200.0
0.042411192105631185
Total reward at episode 8500: 200.0
Total reward at episode 9000: 200.0
0.031275036309068145
Total reward at episode 9500: 200.0
Total reward at episode 10000: 200.0
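
The epsilon values in the output above can be checked by hand. Because `decayed_epsilon` also fires at episode 0, the value printed at episode `e` has been multiplied by `epsilon_factor` exactly `e // 100 + 1` times. The `epsilon_at` helper below is a hypothetical closed-form version of that schedule, not part of the notebook.

```python
def epsilon_at(episode, start=0.5, factor=0.97, interval=100):
    """Epsilon after the decay step of the given episode.

    The decay is applied at episodes 0, interval, 2 * interval, ...
    so episode // interval + 1 decays have happened by then.
    """
    decays = episode // interval + 1
    return start * factor ** decays


for episode in (0, 1000, 2000):
    print(episode, epsilon_at(episode))
```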


# MountainCar


Now apply the things you've learned to the MountainCar problem. Please note that the observation space differs from the previous problem. Thus, before you start training, you need to learn more about this new environment.
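
Once the bounds of the new observation space are known (they can be read from `env.observation_space.low` and `env.observation_space.high`, as done for CartPole earlier), evenly spaced bucket edges can be derived from them. The sketch below does this in plain Python. The `make_edges` helper is hypothetical, and the MountainCar bounds used (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) are the ones this notebook uses for its bins.

```python
def make_edges(low, high, count):
    """Return count evenly spaced edges from low to high (both included)."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


# One (low, high) pair per observation dimension: position, velocity.
bounds = [(-1.2, 0.6), (-0.07, 0.07)]
edges_per_dimension = [make_edges(low, high, 50) for low, high in bounds]

print(len(edges_per_dimension), len(edges_per_dimension[0]))
print(edges_per_dimension[0][0], edges_per_dimension[1][0])
```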

Here is some code to help you get started.

In [17]:
# Create the desired environment
env = gym.make("MountainCar-v0")

# Wrap the environment, to make sure we get to see a fancy video
env = video_env(env)

# Before you can use a Gym environment, it needs to be reset.
state = env.reset()

# Take actions until the episode ends (or until we stop early). Just as an example.
done = False
while not done:

    # Explore and take actions
    random_action = env.action_space.sample()

    # For this demo we always take action 2 (push right) instead of random_action.
    state, reward, done, info = env.step(2)
    print(state)
    # Remove the two lines below when you have created an implementation you want to test.
    if np.random.random() > 0.95:
        done = True
print(state)
print(done)
print(info)
# Show the results!
env.close()
show_video()

[-0.56403287  0.00131228]
[-0.56141808  0.00261479]
[-0.55752025  0.00389783]
[-0.55236845  0.0051518 ]
[-0.54600114  0.00636731]
[-0.53846594  0.0075352 ]
[-0.52981928  0.00864666]
[-0.52981928  0.00864666]
True
{}


In [18]:
# Define parameters - Fill in the dots

num_episodes = 100000
num_episodes_between_status_update = 500
num_episodes_between_videos = 5000

learning_rate = 0.05         # also known as: alpha
discount = 0.95              # also known as: gamma
epsilon = 0.5

# Epsilon decay
# Factor that epsilon is multiplied by after every 1% of the episodes,
# so with 100000 episodes, epsilon becomes epsilon * epsilon_factor every 1000 episodes.
epsilon_factor = 0.97
pass    # Optionally, add parameters for epsilon decay here

# Discretization
pass    # You can add parameters to discretize states here

In [19]:
## Q-Table creation

# Unlike CartPole, the MountainCar state consists of 2 floating point values (position and velocity).
# It makes sense to discretize these values (read: place them in buckets), to reduce the state space and therefore the Q-table size
state_shape = [50,50]      # For instance: [20, 20], or [50] * 2

# Define the initial Q table as a random uniform distribution
q_table = np.random.uniform(low=-2, high=0, size=(state_shape + [env.action_space.n]))

#print('Initial Q table:', q_table)

# Train


In [20]:
# Functions


bins = [np.linspace(-1.2,0.6,state_shape[0]),
        np.linspace(-0.07,0.07,state_shape[1])]


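def discretize_state(state):
    # A Q-table cannot practically handle infinite states, so limit the state space by
    # discretizing the state into buckets.
    discrete_state = []
    for i in range(len(state)):
        # Clip the index: a value below the lowest bin edge would otherwise give -1,
        # which silently wraps around to the last bucket.
        index = np.digitize(state[i], bins[i]) - 1
        discrete_state.append(int(np.clip(index, 0, state_shape[i] - 1)))
    # np.int was removed from recent NumPy versions, so use plain Python ints.
    return tuple(discrete_state)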

def take_action(discrete_state, epsilon):
    # Take an action to either explore or exploit.
    # epsilon defines ratio exploration/exploitation
    if epsilon>np.random.random():
      # explore
      action = env.action_space.sample()
    else:
      # exploit: get action with highest q value
      #q_values = get_q_values(discrete_state)
      #action = q_values.index(max(q_values))
      action = np.argmax(q_table[discrete_state])
    return action

def estimated_max_for_next_state(discrete_state):
    # What's the best expected Q-value for the next state?
    estimated_max = max(q_table[discrete_state])
    return estimated_max

def new_q_value(discrete_state, action, max_future_q):
    # Calculate the new Q-value
    # Note: reward is read from the global scope; it is set in the training loop.
    current_q = q_table[discrete_state][action]
    new_q = current_q + learning_rate * (reward + discount * max_future_q - current_q)
    return new_q

def decayed_epsilon(epsilon, episode):
    # Optionally, decay the epsilon value
    if episode%(num_episodes/100)==0:
      epsilon = epsilon * epsilon_factor
    return epsilon


def get_q_values(discrete_state):
    # Get q values for all possible actions in the given state
    return q_table[discrete_state]

In [21]:
# Time to train the system

for episode in range(num_episodes):
    state = env.reset() # Don't forget to reset the environment between episodes
    current_state_disc = discretize_state(state)

    reward_sum = 0
    done = False
    while not done:
        if (episode + 1) % num_episodes_between_videos == 0:
            env.render()

        # Take an action by exploration or exploitation
        action = take_action(current_state_disc, epsilon)
        new_state, reward, done, info = env.step(action)
        new_state_disc = discretize_state(new_state)

        # Calculate the total reward
        reward_sum += reward

        if not done:
            # Retrieve the maximum estimated value for the next state
            max_future_q = estimated_max_for_next_state(new_state_disc)

            # Calculate the new value (note: Bellman equation)
            new_q = new_q_value(current_state_disc, action, max_future_q)
            q_table[current_state_disc + (action,)] = new_q
        else:
            # Render the video
            if (episode + 1) % num_episodes_between_status_update == 0:
                env.render()
                print(f'Total reward at episode {episode + 1}: {reward_sum}')

        # Prepare for the next loop
        current_state_disc = new_state_disc
        

    # Decay epsilon
    epsilon = decayed_epsilon(epsilon, episode)
    if episode%1000==0:
      print(epsilon)

env.close()
show_video()

0.485
Total reward at episode 500: -200.0
Total reward at episode 1000: -200.0
0.47045
Total reward at episode 1500: -200.0
Total reward at episode 2000: -200.0
0.4563365
Total reward at episode 2500: -200.0
Total reward at episode 3000: -200.0
0.44264640499999997
Total reward at episode 3500: -200.0
Total reward at episode 4000: -200.0
0.42936701284999995
Total reward at episode 4500: -200.0
Total reward at episode 5000: -200.0
0.41648600246449996
Total reward at episode 5500: -200.0
Total reward at episode 6000: -200.0
0.40399142239056496
Total reward at episode 6500: -200.0
Total reward at episode 7000: -200.0
0.391871679718848
Total reward at episode 7500: -200.0
Total reward at episode 8000: -200.0
0.38011552932728254
Total reward at episode 8500: -200.0
Total reward at episode 9000: -182.0
0.36871206344746404
Total reward at episode 9500: -200.0
Total reward at episode 10000: -200.0
0.3576507015440401
Total reward at episode 10500: -200.0
Total reward at episode 11000: -200.0
0.3