# A Dive into Reinforcement Learning: Artificial Intelligence Project Proposal

By Charles Kornoelje

Updated 05/15/2020

## Vision

The goal of my CS 344 honors final project is to take a deep dive into [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), with the hope of training an artificial intelligence agent to play a video game. The first agent I implemented used a [deep Q-learning network](https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning) (DQN) designed to play the Atari 2600 game [_Breakout_](https://en.wikipedia.org/wiki/Breakout_(video_game)), the classic brick-breaking game. However, I quickly learned that training even a somewhat intelligent agent takes a lot of computational time and energy, neither of which I have. So I left that agent training and moved on to find a game that required less power, which led me to the text-based video game _[FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/)_. I was able to train a smart agent to play the game after following a guide and finding some code.

The purpose of the project is to learn how to use reinforcement learning to train agents. Reinforcement learning is a domain of machine learning where an agent takes actions based on observations of its environment to maximize its reward. The project falls under active reinforcement learning, in which a Q-learning agent learns an action-utility function (Q-function) and, from it, a control policy that tells the agent which action to take in the current state. Learning the control policy helps the agent choose actions that maximize its score in video games. If an agent can be trained to play a game well, the same training can be applied to real-life activities and techniques.

## Background

Both the _Breakout_ and _FrozenLake_ games are environments from the [OpenAI](https://openai.com/) [Gym Python package](https://gym.openai.com/). The Gym versions of the games make it easy to interface with modern machine learning frameworks, such as [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/), which I use to train my artificial intelligence agents. I chose these technologies because I have previous experience with them and because the guides I follow implement the reinforcement learning algorithms with them. The agent is a deep neural network trained with reinforcement learning algorithms.

### Reinforcement Learning

My basic understanding of reinforcement learning came from Chapter 21 of _[Artificial Intelligence: A Modern Approach, Third Edition](http://aima.cs.berkeley.edu/)_ by Russell and Norvig. Reinforcement learning gives an agent feedback on whether its actions are good or bad, and the agent uses that feedback to update its behavior to maximize its reward. The agent begins without knowing which actions lead to desirable outcomes and which do not, and over time it adjusts its behavior accordingly. The setting I consider assumes a fully observable environment, which makes it especially applicable to training video game bots, where the game state and all possible actions are known. It is also assumed that the agent knows nothing about the environment or which actions it should take, only which actions it may take. The problem is modeled as a Markov decision process (MDP): a set of states, a set of actions, a transition model, and a reward function, in which the next state depends only on the current state and action.

### Q-Learning

The agent can be designed in many different ways for reinforcement learning, but I will focus on a Q-learning agent. This agent learns an action-utility function, or Q-function, that gives the expected value of taking an action in the current state (Russell and Norvig, p. 831). Russell and Norvig state, “A Q-learning agent, on the other hand, can compare the expected utilities for its available choices without needing to know their outcomes, so it does not need a model of the environment. On the other hand, because they do not know where their actions lead, Q-learning agents cannot look ahead; this can seriously restrict their ability to learn” (p. 831). Like any agent implementation, there are pros and cons, but letting an agent compare the expected values of its choices without knowing their outcomes suits video games with non-deterministic outcomes. The Q-function satisfies a recursive equation that can be used as an iterative update to calculate exact Q-values when given an estimated model.
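
To make the idea concrete, here is a minimal sketch of the model-free, temporal-difference form of the Q-learning update on a lookup table. The function name `q_update`, the learning rate `alpha`, the discount `gamma`, and the example states are illustrative assumptions, not taken from any code used in this project.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update and return the new Q-value.

    alpha (learning rate) and gamma (discount) are assumed values.
    """
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# The table starts at zero for every (state, action) pair.
q = defaultdict(float)

# Hypothetical transition: stepping from state 14 into a goal state 15.
first = q_update(q, state=14, action=2, reward=1.0, next_state=15,
                 n_actions=4, done=True)

# A later step into state 14 now inherits part of that value.
second = q_update(q, state=13, action=2, reward=0.0, next_state=14, n_actions=4)
```

The second call shows how reward information propagates backward: state 13 gains value only because state 14 already has some.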

### Deep Q-Learning Network (DQN)

According to [Mnih et al. (2013)](https://arxiv.org/pdf/1312.5602.pdf), deep reinforcement learning can be achieved by combining a deep neural network with Q-learning, resulting in a deep Q-learning network (DQN). The DQN consists of a replay memory, which is a set of tuples containing (state, action, reward, next state), and an action-value Q-function initialized with random weights. In the basic sense, a DQN uses the Q-learning update to adjust the weights in the deep neural network that correspond to a state and action.
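
As a rough summary of that update, the training objective in Mnih et al. can be written as follows. The symbols are notation introduced here for illustration: $\theta_i$ for the network weights at iteration $i$, $\gamma$ for the discount factor, and $\mathcal{D}$ for the replay memory.

$$
y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}), \qquad
L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \big( y_i - Q(s, a; \theta_i) \big)^2 \right]
$$

The target $y_i$ is the observed reward plus the discounted best estimated value of the next state, and the weights are adjusted by gradient descent to shrink the squared gap between that target and the current estimate.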

### Double DQN (DDQN)

According to [van Hasselt et al. (2015)](https://arxiv.org/pdf/1509.06461.pdf), “The popular Q-learning algorithm is known to overestimate action values under certain conditions” so two value functions are learned instead of one. The first is used to select the action that maximizes the estimated reward, and the second is used to fairly evaluate the value of that action. This leads to fewer overestimations and provides “more stable and reliable learning” (van Hasselt et al.).
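
The difference between the two targets can be sketched in a few lines of plain Python. The function names and the example Q-values below are hypothetical, and the "online" and "target" lists stand in for the outputs of the two networks for one next state.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: one set of values both picks and scores the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online values pick the action, the target values score it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for a next state with three actions.
next_q_online = [0.2, 0.9, 0.4]
next_q_target = [0.3, 0.5, 1.2]

single = dqn_target(0.0, next_q_target, gamma=0.9)
double = double_dqn_target(0.0, next_q_online, next_q_target, gamma=0.9)
```

Here the single estimator trusts the largest (possibly overestimated) target value, while the double estimator scores the action the online values prefer, which gives a lower target in this example.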

## Implementation


### _Breakout_ Reinforcement Learning

To train the agent to play _Breakout_, I followed the article “_[Beat Atari with Deep Reinforcement Learning! (Part 1: DQN)](https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26)_” by Adrien Lucas Ecoffet, which provided a solid overview of the idea of Q-learning but did not explain the code implementation in enough detail. One of the commenters responded with their [GitHub repo implementation](https://github.com/boyuanf/DeepQLearning) of a DQN to play _Breakout_. After slightly modifying boyuanf’s code, I was able to get it running on my machine.

The DQN is trained on an array representation of the current screen state where each pixel has an RGB value, with the shape of the array being (210, 160, 3). For each state, an integer reward reinforces the action taken, with positive integers being positive reinforcement. I quickly realized that Q-learning involves a lot of math and custom functions that are specific to the problem, and I did not yet know how to build it on my own. I have added my [updated version of the code](https://github.com/charkour/cs344/blob/master/project/research-and-examples/boyuan-dqn-example.py) to my project directory. The major change I made was to reduce the number of remembered past actions and responses tenfold. Previously, boyuanf was storing 20 GB of past decisions and rewards, but I felt that was too much and that a smaller memory would help the agent find better decisions more quickly.

The architecture starts with a layer that normalizes the (210, 160, 3) input, followed by two convolutional layers, whose output is flattened into a dense layer with 256 rectifier units, then another dense layer with one unit per action (3 actions), and finally a filtered output that applies a mask to select one action. The model has over 600,000 trainable parameters. The code depends on the following packages:

*   [gym](https://gym.openai.com/)
*   [numpy](https://numpy.org/)
*   [tensorflow](https://www.tensorflow.org/)
*   [keras](https://keras.io/)
*   [skimage](https://scikit-image.org/) (for preprocessing)
*   [Collections](https://docs.python.org/3.6/library/collections.html#collections.deque) (for deque)
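
The memory of past actions and responses mentioned above can be sketched with a bounded `deque`. This is an illustration of the idea only: the class name `ReplayMemory`, its methods, and the tiny capacity are assumptions, not boyuanf’s implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement, capped at the number of stored items.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


# Hypothetical tiny capacity so the eviction behavior is easy to see.
memory = ReplayMemory(capacity=5)
for step in range(8):
    memory.remember(step, 0, 0.0, step + 1, False)

oldest_state = memory.buffer[0][0]
batch = memory.sample(3)
```

Shrinking `capacity` is the kind of change described above: the agent trains on a smaller, more recent window of experience and uses less RAM.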

I quickly realized that my personal machine (a 2015 MacBook Pro with a 3.1 GHz Dual-Core Intel i7 processor and 16 GB DDR3 RAM) would not have enough CPU power to train the DQN to play _Breakout_ in a reasonable amount of time. I tried Google Colab, but that was not much better. I was able to connect to one of Calvin’s lab machines and start training the model there. After starting the training, I moved on to finding a better article related to deep Q-learning for game playing.

### _FrozenLake_ Reinforcement Learning

The _FrozenLake_ board is 4x4, with a start (S) space in the top left and a goal (G) space in the bottom right. The rest of the spaces are a random assortment of frozen (F) spaces, which are safe to step on, and hole (H) spaces, which are not safe and cause the player to lose the game. Successfully traversing from S to G on F spaces rewards the agent positively. The reward is 0 for every step taken, 0 for falling in a hole, and 1 for reaching the goal. The agent can take four actions, each of which moves it in one of the cardinal directions. At each state, the DQN estimates the reward for each action and takes the best one. Over time, the agent learns that reaching the goal state is ideal because it receives a reward. One episode is one attempt at the game, which ends in either a success (with a reward of 1) or a failure (0).
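
The rules above can be sketched as a small deterministic simulation. This is a simplification under stated assumptions: the layout is taken to be Gym's default 4x4 map, the action numbering (0 left, 1 down, 2 right, 3 up) follows Gym's convention, and the real Gym environment is slippery (stochastic) by default, which this sketch leaves out.

```python
# Assumed layout: S = start, F = frozen, H = hole, G = goal.
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# Action index -> (row change, column change): left, down, right, up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Move deterministically and return (next_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    # Moving into a wall leaves the agent on the edge of the board.
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = LAKE[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "GH"
    return row * 4 + col, reward, done


def run_episode(actions):
    """Play a fixed action sequence from the start and return (final_state, total_reward)."""
    state, total = 0, 0.0
    for action in actions:
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return state, total


safe_path = run_episode([1, 1, 2, 1, 2, 2])  # down, down, right, down, right, right
hole_path = run_episode([2, 1])              # right, then down into a hole
```

The only non-zero reward comes at the goal, so an episode's total reward is 1 for a success and 0 otherwise.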

Determined not to give up on my quest to train a video-game-playing agent with deep Q-learning, I found the article “[Bias-Variance for Deep Reinforcement Learning: How To Build a Bot for Atari with OpenAI Gym](https://www.digitalocean.com/community/tutorials/how-to-build-atari-bot-with-openai-gym#step-6-%E2%80%94-creating-a-deep-q-learning-agent-for-space-invaders)” by Alvin Wan. From his tutorial, I was able to train an agent to play the game well, but I could not see where the DNN was implemented in his code. To me it seemed to train using only a gradient descent optimizer and a graph search to minimize error, which does not technically make it a deep network, although I believe it is still Q-learning.

After Wan’s article, I searched for a Keras implementation of a DQN for _FrozenLake_ and found a [StackOverflow post](https://stackoverflow.com/questions/45869939/something-wrong-with-keras-code-q-learning-openai-gym-frozenlake), which led to a [Jupyter Notebook](https://gist.github.com/weiji14/bab587907681869ec0f70f7496f98a12), which referenced a [Keras DQN implementation for OpenAI’s FrozenLake](https://gist.github.com/ceshine/eeb97564c21a77b8c315179f82b3fc08) by GitHub user CeShine.


We will begin by importing what we need.

In [1]:
"""
@author: CeShine
@author: Charkour
Updated to work for 8x8 frozen ice.
Ability to load weights.
Fix small bugs for new versions.
Reuseability features.
"""

import sys
import tempfile
import gym
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers.core import Reshape
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import Policy
from rl.memory import SequentialMemory

print('python       :', sys.version.split('\n')[0])
print('numpy        :', np.__version__)
print('tensorflow   :', tf.__version__)
print('gym          :', gym.__version__)

Using TensorFlow backend.


python       : 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) 
numpy        : 1.18.1
tensorflow   : 1.14.0
gym          : 0.17.2


In addition, keras-rl version 0.4.1 is installed.

Next we define the policy for picking an action. It is an epsilon-greedy policy with a decaying epsilon.
Epsilon begins high, so the agent explores with more random actions, and it decays over time so the agent makes more greedy choices.
This helps balance exploration and exploitation.

In [2]:

class DecayEpsGreedyQPolicy(Policy):

    def __init__(self, max_eps=.1, min_eps=.05, lamb=0.001):
        super(DecayEpsGreedyQPolicy, self).__init__()
        self.max_eps = max_eps
        self.lambd = lamb
        self._steps = 0
        self.min_eps = min_eps

    def select_action(self, q_values):
        assert q_values.ndim == 1
        nb_actions = q_values.shape[0]
        eps = self.min_eps + (self.max_eps - self.min_eps) * \
            np.exp(-self.lambd * self._steps)
        self._steps += 1
        if self._steps % 1e3 == 0:
            print("Current eps:", eps)
        if np.random.uniform() < eps:
            # randint excludes the upper bound; random_integers is deprecated.
            action = np.random.randint(0, nb_actions)
        else:
            action = np.argmax(q_values)
        return action
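
To get a feel for how quickly the policy shifts from exploring to exploiting, the sketch below evaluates the same decay formula at a few step counts. The function name `decayed_epsilon` is hypothetical; the default values mirror the arguments passed to `DecayEpsGreedyQPolicy` in the next cell (`max_eps=0.9`, `min_eps=0`, `lamb=1e-4`).

```python
import math


def decayed_epsilon(step, max_eps=0.9, min_eps=0.0, lamb=1e-4):
    """Exponentially decay epsilon from max_eps toward min_eps."""
    return min_eps + (max_eps - min_eps) * math.exp(-lamb * step)


# Epsilon at the start, after 10%, 50%, and 100% of a 100,000-step run.
schedule = {step: decayed_epsilon(step) for step in (0, 10_000, 50_000, 100_000)}

for step, eps in schedule.items():
    print(f"step {step:>7}: eps = {eps:.5f}")
```

With these settings epsilon falls to roughly a third of its starting value after 10,000 steps and is close to zero by the end of training, so late actions are almost always greedy.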

Next we will define the model. The model has an embedding layer
that takes the current state as input. Its output is then reshaped into 4
values, one for each action: left, down, right, up.

The DQN agent is also set up. It uses the model and the policy defined above.
It is compiled with the Adam optimizer.

In [3]:
ENV_NAME = 'FrozenLake8x8-v0'

np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=4)

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

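def get_keras_model(nb_states, nb_actions):
    # One embedding row per state and one column per action.
    model = Sequential()
    model.add(Embedding(nb_states, nb_actions, input_length=1))
    model.add(Reshape((nb_actions,)))
    print(model.summary())
    return model

model = get_keras_model(env.observation_space.n, nb_actions)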

memory = SequentialMemory(window_length=1, limit=10000)
policy = DecayEpsGreedyQPolicy(max_eps=0.9, min_eps=0, lamb=1 / (1e4))
dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=memory, nb_steps_warmup=500,
               target_model_update=1e-2, policy=policy,
               enable_double_dqn=False, batch_size=512
               )
dqn.compile(Adam())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 4)              256       
_________________________________________________________________
reshape_1 (Reshape)          (None, 4)                 0         
Total params: 256
Trainable params: 256
Non-trainable params: 0
_________________________________________________________________
None
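
One way to read the summary above: an embedding layer with 64 rows and 4 columns, followed only by a reshape, behaves like a lookup table with one row of Q-values per state. The plain-Python sketch below illustrates that reading; the names and the example value are hypothetical, and it does not use Keras.

```python
N_STATES, N_ACTIONS = 64, 4  # 8x8 board, four moves

# One row of Q-values per state, initialized to zero for illustration.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def q_values(state):
    """Look up the row of Q-values for a state, as an embedding lookup would."""
    return q_table[state]


def greedy_action(state):
    """Return the index of the action with the highest Q-value."""
    values = q_values(state)
    return max(range(N_ACTIONS), key=lambda a: values[a])


# The number of table entries matches the 256 parameters in the summary.
n_params = sum(len(row) for row in q_table)

# Pretend training has raised the value of action 2 in state 10.
q_table[10][2] = 0.7
best = greedy_action(10)
```

Under this reading the model has no hidden layers, so it learns each state's values independently rather than generalizing across states.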


Load any previously saved weights, train the DQN, and then save the weights.

In [None]:
try:
    dqn.load_weights('./research-and-examples/dqn_{}_weights.h5f'.format(ENV_NAME))
except Exception as e:
    print(e)
    pass

temp_folder = tempfile.mkdtemp()

dqn.fit(env, nb_steps=1e5, visualize=False, verbose=1, log_interval=10000)

# After training is done, we save the final weights.
dqn.save_weights('./research-and-examples/dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

Training for 100000.0 steps ...
Interval 1 (0 steps performed)

  866/10000 [=>............................] - ETA: 2:45 - reward: 0.0000e+00



Load the weights and then test the DQN.

In [7]:
dqn.load_weights('./research-and-examples/dqn_{}_weights.h5f'.format(ENV_NAME))

# Finally, evaluate our algorithm for 100 episodes.
dqn.test(env, nb_episodes=100, visualize=False)


Testing for 100 episodes ...
Episode 1: reward: 1.000, steps: 50
Episode 2: reward: 1.000, steps: 107
Episode 3: reward: 0.000, steps: 83
Episode 4: reward: 0.000, steps: 38
Episode 5: reward: 1.000, steps: 29
Episode 6: reward: 1.000, steps: 61
Episode 7: reward: 1.000, steps: 81
Episode 8: reward: 1.000, steps: 40
Episode 9: reward: 0.000, steps: 70
Episode 10: reward: 1.000, steps: 91
Episode 11: reward: 1.000, steps: 65
Episode 12: reward: 1.000, steps: 34
Episode 13: reward: 1.000, steps: 94
Episode 14: reward: 0.000, steps: 81
Episode 15: reward: 1.000, steps: 28
Episode 16: reward: 0.000, steps: 76
Episode 17: reward: 1.000, steps: 53
Episode 18: reward: 0.000, steps: 118
Episode 19: reward: 1.000, steps: 105
Episode 20: reward: 1.000, steps: 74
Episode 21: reward: 0.000, steps: 104
Episode 22: reward: 1.000, steps: 27
Episode 23: reward: 1.000, steps: 63
Episode 24: reward: 1.000, steps: 84
Episode 25: reward: 1.000, steps: 37
Episode 26: reward: 1.000, steps: 69
Episode 27: re

<keras.callbacks.callbacks.History at 0x1243dd0f0>

In its current state, this code extends CeShine’s work to play the 8x8 grid version of _FrozenLake_. My future work will be to update and optimize the DNN to solve the 8x8 version with higher accuracy than it currently does. Some work will need to be done first to calculate the accuracy of a trained model and then to train the model multiple times to compare the performance.
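
As a starting point for that accuracy calculation, the sketch below computes a success rate from per-episode rewards. The rewards are copied from the 26 complete episodes in the test log above; the function name `success_rate` is hypothetical.

```python
# Rewards for episodes 1 through 26 of the test log above.
episode_rewards = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
                   0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]


def success_rate(rewards):
    """Fraction of episodes that ended with a positive reward."""
    return sum(1 for reward in rewards if reward > 0) / len(rewards)


n_success = sum(episode_rewards)
rate = success_rate(episode_rewards)
print(f"{n_success}/{len(episode_rewards)} episodes succeeded ({rate:.1%})")
```

Running the same calculation over all 100 test episodes, and over several independently trained models, would give a number that can be compared across runs.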

## Results


### _FrozenLake_

Currently, the Q-learning agent is able to complete the _FrozenLake_ game 75 times out of 100 attempts after training for 100,000 steps. In the reinforcement learning domain, “solving” the puzzle means succeeding more than 78 times out of 100, but I am not sure why this is considered the standard for assessing the ability of an artificial intelligence agent. Wan’s implementation of Q-learning achieves 82/100, and CeShine’s example gets 80/100. However, Wan’s calculation takes the 100 best episodes while training for 4000 episodes, whereas CeShine’s trains for 100,000 episodes and takes the last 100 episodes. My extension of CeShine’s DQN for the 8x8 version solves the puzzle 75/100 times. I would like to see if the 8x8 performance can be increased by tweaking the network architecture or hyperparameters.


### _Breakout_

Additionally, I am excited to report the progress of the agent training on the campus lab computer. It has been training for 53 hours, completing over 6.35 million iterations through the neural network to estimate the agent’s future reward if an action is taken. I am surprised by the progress it has made. The highest score I have seen recently is 38, achieved on episode 16335, and it can usually score 5 or more around episode 16400. This is exciting progress, and it is performing better than boyuanf’s agents trained for 24 hours and 36 hours, which got 0 and 11 respectively. I think this is because I lowered the memory tenfold from what he had. I do not think I will continue working on this, but will instead focus my work on the _FrozenLake_ text game, which takes relatively less computational time and energy.

Update 1: Now the best is 48 on episode 19784, after 7.7 million steps.

Update 2: Now the best is 58 on episode 37646, after 18,327,920 steps.

I started on Tuesday, May 5 at 3:50pm and stopped training on Tuesday, May 12 at 9:12am for a total of 161 training hours. In that time, 19,502,978 steps were completed. The final model was able to score 16 points, which is better than boyuanf’s model. Because of the way the models are saved, earlier saved models might score higher.

## Implications

Reinforcement learning is an interesting approach to training intelligent agents. This project has shown that with enough time and power, an artificial agent can be trained to perform a simple task just by giving it rewards. Although this approach may not give results as good as, say, supervised learning in the same amount of time, reinforcement learning allows the agent to be trained without massive amounts of meticulously curated and labeled data. One problem with reinforcement learning is having to define an action set and give rewards for certain actions or sequences of actions. If we are able to train computers to complete tasks in video games, it suggests that we can do this in real life too. We already do reinforcement learning in our normal lives, such as when training a pet, and we will continue to do it with machines.

I have learned that it takes a lot of time and energy to perform Q-learning, but as technology advances, it seems like it will become easier to achieve. Some really great work has come from reinforcement learning techniques, and I think they will only get better. I feel that reinforcement learning relates very closely to how humans act and learn, as we get rewards or consequences for our actions. And before every action, we deliberate (sometimes not enough) on what we think our reward will be. However, in the current code setup, I do not believe the Q-learning agents have a concept of long-term rewards in the way humans do. I think it is possible to train machines to account for long-term effects, but coding and training them will take even more time.
