# Harder Than It Seems: Artificial Intelligence Reinforcement Learning Project

By Charles Kornoelje for CS344 at Calvin University

Prof. Vander Linden

Updated 05/21/2020

Please contact for the slide deck.

## Vision

The goal of my CS 344 honors final project is to take a deep dive into [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), with the hope of training an artificial intelligence agent to play a video game. The first agent I was able to implement was with a [deep Q-learning network](https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning) (DQN) designed to play the Atari 2600 game, _[Breakout](https://w.wiki/RQQ)_, the classic brick-breaking game. However, I quickly learned that training a somewhat intelligent agent would take lots of computational time and energy, which I did not have, so I began training an agent and moved on to find a game that took less power, which led me to the text-based video game, _[FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/)_. I was able to train a smart agent to play the game after following a guide and tweaking some code.

The purpose of this project is to learn how to use reinforcement learning to train agents. Reinforcement learning is a domain of machine learning where an agent takes actions based on observations in their environment to maximize their reward. The project falls under the active reinforcement learning realm in which a [Q-learning](https://en.wikipedia.org/wiki/Q-learning) agent is trained with an action-utility function (Q-function) to learn a control policy that tells an agent which actions to take at a current state. Learning the control policy will assist the agent in decision making in order to take proper actions to maximize their score in video games. If an agent is able to be trained to play a game well, the same training can be applied to real life activities and techniques.


## Background

I wanted to train an artificial intelligence to play a video game because I had seen companies like Google’s [DeepMind](https://deepmind.com/) [train agents to play complex games like Starcraft](https://www.theverge.com/2019/10/30/20939147/deepmind-google-alphastar-starcraft-2-research-grandmaster-level) which had good results. My search led me to the machine learning domain of reinforcement learning.


### Reinforcement Learning

Russell and Norvig in Chapter 21 of _[Artificial Intelligence: A Modern Approach, Third Edition](http://aima.cs.berkeley.edu/)_ define reinforcement learning as a system that provides feedback to an agent in the form of a reward for an action, which it will use to make updates to the model in order to maximize its reward. The reward can be either good or bad: a consequence. The agent begins without knowing which actions lead to desirable outcomes, and over time, the agent will begin to adjust its behavior to maximize its reward. Reinforcement learning assumes a fully-observable environment, which makes it especially applicable to training video game bots where the game state and all possible actions are known. It is also assumed that the agent does not know anything about the environment or what actions it should take, only what actions it may take. It decides what actions to take based on the Markov decision process (Russell and Norvig 830).

![RL](https://www.kdnuggets.com/images/mathworks-reinforcement-learning-fig1-543.jpg)
[Click here for the image source](https://www.kdnuggets.com/images/mathworks-reinforcement-learning-fig1-543.jpg)

For those with a psychology background, B. F. Skinner’s work with [reinforcement](https://en.wikipedia.org/wiki/Reinforcement) and behaviorism closely relates to reinforcement learning with artificial agents, and similar principles apply to both. Skinner trained animals to complete tasks through reinforcement. The image above gives a simple picture of reinforcement learning. The dog represents the artificial agent, which takes actions in return for a reward. It receives a response from the environment and uses it to update its mental model of how actions relate to rewards. Because the dog wants to be a “good boy”, it will choose actions that earn that affirmation.


### Q-Learning

Although there are several methods for training an agent using reinforcement learning, I will just be focusing on Q-learning. A Q-learning agent learns an action-utility function, or Q-function, that gives the expected value of taking an action in the current state. Russell and Norvig state, “A Q-learning agent ... can compare the expected utilities for its available choices without needing to know their outcomes, so it does not need a model of the environment. [B]ecause they do not know where their actions lead, Q-learning agents cannot look ahead; this can seriously restrict their ability to learn” (Russell and Norvig 831). Like any agent implementation, there are pros and cons, but allowing agents to compare expected values of actions without knowing their outcomes suits video games, which have non-deterministic outcomes. We can give an agent a Q-function to update itself over an iterative process to calculate exact Q-values when given an estimated model.

![Q-learning](https://miro.medium.com/max/1400/1*FHsbUXsJFg8xt5U2c-6y1A.png)
[Click here for the image source](https://miro.medium.com/max/1400/1*FHsbUXsJFg8xt5U2c-6y1A.png)

In the beginning, a Q-learning agent has no idea which actions lead to rewards, but over time it learns which are best. The equation above is the update rule used to calculate a new Q-value for a given state-action pair. The left-hand side is the updated Q-value for that state and action. A Q-learning agent keeps a large table of values indexed by state-action pairs, and the new value overwrites the corresponding entry. The current Q-value is the value that was in the table when the agent made the move. To it is added a correction term that takes into account the reward received and the maximum predicted reward from the next state. Alpha is the learning rate, like in the Google Crash Course exercises, and decides how much impact the new calculation has on the Q-value. The discount rate (gamma) decides how much the agent values future rewards relative to immediate ones. The shift from exploration to exploitation is handled separately, by an exploration rate (epsilon) that decays over time; this resembles simulated annealing, where the agent takes more random jumps at the start and then settles on what is best.
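As a minimal sketch of the update described above, the tabular rule fits in a few lines of Python. The table size, the `alpha` and `gamma` values, and the sample transition are illustrative assumptions, not values taken from this project.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Best value the agent currently predicts for the next state.
    best_next = np.max(q_table[next_state])
    # Target: immediate reward plus the discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table[state, action]


# Hypothetical problem with 16 states and 4 actions; all Q-values start at zero.
q_table = np.zeros((16, 4))
new_value = q_update(q_table, state=14, action=2, reward=1.0, next_state=15)
print(new_value)
```

With every entry starting at zero, a single rewarded step only moves the estimate by `alpha * reward`, which is why many episodes are needed before the values settle.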


### Deep Q-Learning Network

In 2013, [Mnih et al.](https://arxiv.org/pdf/1312.5602.pdf) of DeepMind described that deep reinforcement learning can be achieved through a combination of a deep neural network and a Q-learning function, resulting in a deep Q-learning network (DQN). In this method, the Q-table is replaced by the deep neural network values. The DQN has some sort of memory that is a set of tuples—resembling (state, action, reward, next state)—and an action-value Q-function initialized to random weights. In the basic sense, a DQN uses a Q-function to update the weights in the deep neural network that correspond to a state and action (Mnih et al. 4-5).
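The memory of (state, action, reward, next state) tuples can be sketched with a bounded `deque`. This is only an illustration of the idea: the capacity, batch size, and placeholder transitions below are assumptions, and `remember` and `sample_batch` are hypothetical helper names.

```python
import random
from collections import deque

# Bounded memory: once full, the oldest transitions are discarded automatically.
memory = deque(maxlen=1000)


def remember(state, action, reward, next_state):
    """Store one transition tuple."""
    memory.append((state, action, reward, next_state))


def sample_batch(batch_size):
    """Draw a uniformly random minibatch of stored transitions."""
    return random.sample(list(memory), batch_size)


# Fill the memory with placeholder transitions (integers stand in for states).
for step in range(1500):
    remember(step, step % 4, 0.0, step + 1)

batch = sample_batch(32)
print(len(memory), len(batch))
```

Training on random minibatches rather than on consecutive steps is the part that matters here; the network update itself is omitted.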


### Double DQN

In 2015, [van Hasselt et al.](https://arxiv.org/pdf/1509.06461.pdf) of DeepMind (acquired by Google by this point) observed that “The popular Q-learning algorithm is known to overestimate action values under certain conditions”, so two value functions are learned instead of one. The first is used to pick the action that maximizes the reward, and the second is used to fairly evaluate the value of that choice. This leads to fewer overestimations and provides “more stable and reliable learning”. Both value functions are estimated using DQNs like before, but there are now two of them, making it a double DQN (DDQN) (van Hasselt et al. 2).
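The difference between the two targets can be sketched with plain arrays standing in for network outputs. The Q-values, reward, and discount below are made-up numbers, and the function names are hypothetical; a real implementation would obtain the arrays from two neural networks.

```python
import numpy as np


def dqn_target(reward, gamma, q_next_target):
    """Standard target: one set of values both selects and evaluates the action."""
    return reward + gamma * np.max(q_next_target)


def double_dqn_target(reward, gamma, q_next_online, q_next_target):
    """Double target: one network selects the action, the other evaluates it."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]


# Made-up Q-values for the next state from the two networks.
q_next_online = np.array([1.0, 2.0, 0.5])
q_next_target = np.array([1.5, 1.0, 3.0])

standard = dqn_target(0.0, 0.9, q_next_target)
double = double_dqn_target(0.0, 0.9, q_next_online, q_next_target)
print(standard, double)
```

In this example the standard target takes the largest value (3.0) even though the other network does not consider that action the best, which is the kind of overestimation the double estimator is meant to reduce.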


### Technologies

Both _Breakout_ and _FrozenLake_ are environments from the [OpenAI](https://openai.com/) [Gym Python package](https://gym.openai.com/). The Gym versions of the games make it easy to interface with modern machine learning frameworks, such as [Keras](https://keras.io/), which I will be using to train my artificial intelligence agent. I chose these technologies because I have previous experience with Keras and the guides I follow implement the reinforcement algorithms with it. The artificial agent will be trained on a deep neural network using reinforcement learning algorithms.


## Implementation


### _Breakout_ Reinforcement Learning

To train the agent to play _Breakout_, I followed the article “_[Beat Atari with Deep Reinforcement Learning! (Part 1: DQN)](https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26)_” by Adrien Lucas Ecoffet, which provided a solid overview of the idea of Q-learning, but failed to provide a detailed enough explanation of code implementation. GitHub-user boyuanf provided their [GitHub repo implementation](https://github.com/boyuanf/DeepQLearning) of a DQN to play _Breakout_ in the article comments. After fixing some bugs, I was able to get it running on my machine. The DQN is trained by an array representation of the current screen state where each pixel has an RGB value, with the shape of the array being (210, 160, 3). For each state, there is an integer value that reinforces each action, with positive integers being positive reinforcement. I quickly realized that Q-learning involves a lot of math and custom functions that are specific to each game, but I did not know how to build it on my own yet. Below are the packages I used in order to train the DQN.



*   [gym](https://gym.openai.com/) | it provides interfaces between games and machine learning libraries
*   [numpy](https://numpy.org/)
*   [tensorflow](https://www.tensorflow.org/)
*   [keras](https://keras.io/)
*   [atari-py](https://github.com/openai/atari-py) | used for the Atari game environments.
*   [skimage](https://scikit-image.org/) | used for preprocessing each state in _Breakout_
*   [collections](https://docs.python.org/3.6/library/collections.html#collections.deque) | used for `deque` in relation to memory

The major change I made was to lower the number of previous transitions remembered tenfold. Previously, boyuanf was storing 400,000 transitions, which takes up to 20 GB of memory; I felt that was too much and that a smaller memory would help the agent find better decisions more quickly. The architecture starts with a normalization layer over the preprocessed (84, 84, 4) input (four stacked grayscale frames), then two convolutional layers, which are flattened into a dense layer with 256 rectifier units, then another dense layer with one unit per action (3), and finally a filtered output that applies a mask to get one action. There are over 600,000 parameters that are estimated in the model.
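The parameter count can be checked by hand. The arithmetic below assumes the layer sizes used in the code that follows (an 84x84x4 input, 16 filters of 8x8 with stride 4, 32 filters of 4x4 with stride 2, a 256-unit dense layer, and 3 outputs) and convolutions without padding; the helper names are my own.

```python
def conv_output_size(size, kernel, stride):
    """Side length of a convolution output when no padding is applied."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


side = conv_output_size(84, 8, 4)       # 20
conv1 = conv_params(8, 4, 16)           # 4,112
side = conv_output_size(side, 4, 2)     # 9
conv2 = conv_params(4, 16, 32)          # 8,224
flat = side * side * 32                 # 2,592 values after flattening
hidden = dense_params(flat, 256)        # 663,808
output = dense_params(256, 3)           # 771

total = conv1 + conv2 + hidden + output
print(total)  # 676915
```

Almost all of the parameters sit in the first dense layer, because every one of the 2,592 flattened values connects to each of the 256 units.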

Below is boyuanf's code; the original can be found in the repository linked above. Aside from fixing bugs to get the program to run, I only changed the replay memory flag from 400,000 to 40,000. Because I did not expand on the code and do not understand all of it, I will limit my explanation:

We begin by importing what we need.

In [None]:
import gym
import random
import numpy as np
import tensorflow as tf
from keras import layers
from skimage.color import rgb2gray
from skimage.transform import resize
from keras.models import Model

from collections import deque
from keras.optimizers import RMSprop
from keras import backend as K
from datetime import datetime
import os.path
import time
from keras.models import load_model
from keras.models import clone_model
from keras.callbacks import TensorBoard

Next, we will set parameters for the training. Note that I changed `replay_memory`.
Please note: if you want to train your own version, you will have to change the flag
`restore_file_path` to be the file on your machine.

In [None]:
FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('train_dir', 'tf_train_breakout',
                           """Directory where to write event logs and checkpoint. """)
tf.app.flags.DEFINE_string('restore_file_path',
                           '/Users/charleskornoelje/Documents/LocalDevelopment/344/cs344/project/research-and-examples/tf_train_breakout/breakout_model_20200512101401.h5',
                           """Path of the restore file """)
# tf.app.flags.DEFINE_integer('num_episode', 100000,
tf.app.flags.DEFINE_integer('num_episode', 100000,
                            """number of epochs of the optimization loop.""")
# tf.app.flags.DEFINE_integer('observe_step_num', 5000,
tf.app.flags.DEFINE_integer('observe_step_num', 50000,
                            """Timesteps to observe before training.""")
# tf.app.flags.DEFINE_integer('epsilon_step_num', 50000,
tf.app.flags.DEFINE_integer('epsilon_step_num', 1000000,
                            """frames over which to anneal epsilon.""")
tf.app.flags.DEFINE_integer('refresh_target_model_num', 10000,  # update the target Q model every refresh_target_model_num
                            """Number of steps between target model updates.""")
# tf.app.flags.DEFINE_integer('replay_memory', 400000,  # takes up to 20 GB to store this amount of history data
tf.app.flags.DEFINE_integer('replay_memory', 40000,
                            """number of previous transitions to remember.""")
tf.app.flags.DEFINE_integer('no_op_steps', 30,
                            """Number of the steps that runs before script begin.""")
tf.app.flags.DEFINE_float('regularizer_scale', 0.01,
                          """L1 regularizer scale.""")
tf.app.flags.DEFINE_integer('batch_size', 32,
                            """Size of minibatch to train.""")
tf.app.flags.DEFINE_float('learning_rate', 0.00025,
                          """Learning rate for the optimizer.""")
tf.app.flags.DEFINE_float('init_epsilon', 1.0,
                          """starting value of epsilon.""")
tf.app.flags.DEFINE_float('final_epsilon', 0.1,
                          """final value of epsilon.""")
tf.app.flags.DEFINE_float('gamma', 0.99,
                          """decay rate of past observations.""")
tf.app.flags.DEFINE_boolean('resume', False,
                            """Whether to resume from previous checkpoint.""")
tf.app.flags.DEFINE_boolean('render', False,
                            """Whether to display the game.""")

Next, we define some constants as well as some helpful functions, including the model itself.

In [None]:
ATARI_SHAPE = (84, 84, 4)  # input image size to model
ACTION_SIZE = 3


# 210*160*3(color) --> 84*84(mono)
# float --> integer (to reduce the size of replay memory)
def pre_processing(observe):
    processed_observe = np.uint8(
        resize(rgb2gray(observe), (84, 84), mode='constant') * 255)
    return processed_observe


def huber_loss(y, q_value):
    error = K.abs(y - q_value)
    quadratic_part = K.clip(error, 0.0, 1.0)
    linear_part = error - quadratic_part
    loss = K.mean(0.5 * K.square(quadratic_part) + linear_part)
    return loss


def atari_model():
    # With the functional API we need to define the inputs.
    frames_input = layers.Input(ATARI_SHAPE, name='frames')
    actions_input = layers.Input((ACTION_SIZE,), name='action_mask')

    # Assuming that the input frames are still encoded from 0 to 255. Transforming to [0, 1].
    normalized = layers.Lambda(lambda x: x / 255.0, name='normalization')(frames_input)

    # "The first hidden layer convolves 16 8×8 filters with stride 4 with the input image and applies a rectifier nonlinearity."
    conv_1 = layers.convolutional.Conv2D(
        16, (8, 8), strides=(4, 4), activation='relu'
    )(normalized)
    # "The second hidden layer convolves 32 4×4 filters with stride 2, again followed by a rectifier nonlinearity."
    conv_2 = layers.convolutional.Conv2D(
        32, (4, 4), strides=(2, 2), activation='relu'
    )(conv_1)
    # Flattening the second convolutional layer.
    conv_flattened = layers.core.Flatten()(conv_2)
    # "The final hidden layer is fully-connected and consists of 256 rectifier units."
    hidden = layers.Dense(256, activation='relu')(conv_flattened)
    # "The output layer is a fully-connected linear layer with a single output for each valid action."
    output = layers.Dense(ACTION_SIZE)(hidden)
    # Finally, we multiply the output by the mask!
    filtered_output = layers.Multiply(name='QValue')([output, actions_input])

    model = Model(inputs=[frames_input, actions_input], outputs=filtered_output)
    model.summary()
    optimizer = RMSprop(lr=FLAGS.learning_rate, rho=0.95, epsilon=0.01)
    # model.compile(optimizer, loss='mse')
    # to change model weights more slowly, uses MSE for low values and MAE(Mean Absolute Error) for large values
    model.compile(optimizer, loss=huber_loss)
    return model


# get action from model using epsilon-greedy policy
def get_action(history, epsilon, step, model):
    if np.random.rand() <= epsilon or step <= FLAGS.observe_step_num:
        return random.randrange(ACTION_SIZE)
    else:
        q_value = model.predict([history, np.ones(ACTION_SIZE).reshape(1, ACTION_SIZE)])
        return np.argmax(q_value[0])


# save sample <s,a,r,s'> to the replay memory
def store_memory(memory, history, action, reward, next_history, dead):
    memory.append((history, action, reward, next_history, dead))


def get_one_hot(targets, nb_classes):
    return np.eye(nb_classes)[np.array(targets).reshape(-1)]


# train model by random batch
def train_memory_batch(memory, model, log_dir):
    mini_batch = random.sample(memory, FLAGS.batch_size)
    history = np.zeros((FLAGS.batch_size, ATARI_SHAPE[0],
                        ATARI_SHAPE[1], ATARI_SHAPE[2]))
    next_history = np.zeros((FLAGS.batch_size, ATARI_SHAPE[0],
                             ATARI_SHAPE[1], ATARI_SHAPE[2]))
    target = np.zeros((FLAGS.batch_size,))
    action, reward, dead = [], [], []

    for idx, val in enumerate(mini_batch):
        history[idx] = val[0]
        next_history[idx] = val[3]
        action.append(val[1])
        reward.append(val[2])
        dead.append(val[4])

    actions_mask = np.ones((FLAGS.batch_size, ACTION_SIZE))
    next_Q_values = model.predict([next_history, actions_mask])

    # like Q Learning, get maximum Q value at s'
    # But from target model
    for i in range(FLAGS.batch_size):
        if dead[i]:
            target[i] = -1
            # target[i] = reward[i]
        else:
            target[i] = reward[i] + FLAGS.gamma * np.amax(next_Q_values[i])

    action_one_hot = get_one_hot(action, ACTION_SIZE)
    target_one_hot = action_one_hot * target[:, None]

    # tb_callback = TensorBoard(log_dir=log_dir, histogram_freq=0,
    #                           write_graph=True, write_images=False)

    h = model.fit(
        [history, action_one_hot], target_one_hot, epochs=1,
        batch_size=FLAGS.batch_size, verbose=0, use_multiprocessing=True)
        #batch_size=FLAGS.batch_size, verbose=0, callbacks=[tb_callback])

    #if h.history['loss'][0] > 10.0:
    #    print('too large')

    return h.history['loss'][0]
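To see what `huber_loss` above computes, here is the same arithmetic in NumPy on two hand-picked errors. The function name `huber_loss_np` and the sample values are my own; this is a sketch for intuition, not part of boyuanf's code.

```python
import numpy as np


def huber_loss_np(y, q_value):
    """NumPy version of the loss: quadratic for errors up to 1, linear beyond."""
    error = np.abs(np.asarray(y, dtype=float) - np.asarray(q_value, dtype=float))
    quadratic_part = np.clip(error, 0.0, 1.0)
    linear_part = error - quadratic_part
    return float(np.mean(0.5 * quadratic_part ** 2 + linear_part))


small = huber_loss_np([0.5], [0.0])  # 0.5 * 0.5**2 = 0.125
large = huber_loss_np([3.0], [0.0])  # 0.5 * 1**2 + (3 - 1) = 2.5
print(small, large)
```

A squared error would give 9.0 for the second case, so the linear tail keeps large errors from dominating the weight updates.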

Next we define a function to train the model to play the game.

In [None]:
def train():
    env = gym.make('BreakoutDeterministic-v4')

    # deque: Once a bounded length deque is full, when new items are added,
    # a corresponding number of items are discarded from the opposite end
    memory = deque(maxlen=FLAGS.replay_memory)
    episode_number = 0
    epsilon = FLAGS.init_epsilon
    epsilon_decay = (FLAGS.init_epsilon - FLAGS.final_epsilon) / FLAGS.epsilon_step_num
    global_step = 0

    if FLAGS.resume:
        model = load_model(FLAGS.restore_file_path)
        # Assume when we restore the model, the epsilon has already decreased to the final value
        epsilon = FLAGS.final_epsilon
    else:
        model = atari_model()

    now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    log_dir = "{}/run-{}-log".format(FLAGS.train_dir, now)
    file_writer = tf.summary.FileWriter(log_dir, tf.get_default_graph())

    model_target = clone_model(model)
    model_target.set_weights(model.get_weights())

    while episode_number < FLAGS.num_episode:

        done = False
        dead = False
        # 1 episode = 5 lives
        step, score, start_life = 0, 0, 5
        loss = 0.0
        observe = env.reset()

        # this is one of DeepMind's idea.
        # just do nothing at the start of episode to avoid sub-optimal
        for _ in range(random.randint(1, FLAGS.no_op_steps)):
            observe, _, _, _ = env.step(1)
        # At start of episode, there is no preceding frame
        # So just copy initial states to make history
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))

        while not done:
            if FLAGS.render:
                env.render()
                time.sleep(0.01)

            # get action for the current history and go one step in environment
            action = get_action(history, epsilon, global_step, model_target)
            # change action to real_action
            real_action = action + 1

            # scale down epsilon, the epsilon only begin to decrease after observe steps
            if epsilon > FLAGS.final_epsilon and global_step > FLAGS.observe_step_num:
                epsilon -= epsilon_decay

            observe, reward, done, info = env.step(real_action)
            # pre-process the observation --> history
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)

            # if the agent missed ball, agent is dead --> episode is not over
            if start_life > info['ale.lives']:
                dead = True
                start_life = info['ale.lives']

            # TODO: may be we should give negative reward if miss ball (dead)
            # reward = np.clip(reward, -1., 1.)  # clip here is not correct

            # save the state to memory, each replay takes 2 * (84*84*4) bytes = 56448 B = 55.125 KB
            store_memory(memory, history, action, reward, next_history, dead)  #

            # check if the memory is ready for training
            if global_step > FLAGS.observe_step_num:
                loss = loss + train_memory_batch(memory, model, log_dir)
                # if loss > 100.0:
                #    print(loss)
                if global_step % FLAGS.refresh_target_model_num == 0:  # update the target model
                    model_target.set_weights(model.get_weights())

            score += reward

            # If agent is dead, set the flag back to false, but keep the history unchanged,
            # to avoid to see the ball up in the sky
            if dead:
                dead = False
            else:
                history = next_history

            #print("step: ", global_step)
            global_step += 1
            step += 1

            if done:
                if global_step <= FLAGS.observe_step_num:
                    state = "observe"
                elif FLAGS.observe_step_num < global_step <= FLAGS.observe_step_num + FLAGS.epsilon_step_num:
                    state = "explore"
                else:
                    state = "train"
                print('state: {}, episode: {}, score: {}, global_step: {}, avg loss: {}, step: {}, memory length: {}'
                      .format(state, episode_number, score, global_step, loss / float(step), step, len(memory)))

                if episode_number % 1000 == 0 or (episode_number + 1) == FLAGS.num_episode:
                #if episode_number % 1 == 0 or (episode_number + 1) == FLAGS.num_episode:  # debug
                    now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
                    file_name = "breakout_model_{}.h5".format(now)
                    model_path = os.path.join(FLAGS.train_dir, file_name)
                    model.save(model_path)


                # Add user custom data to TensorBoard
                loss_summary = tf.Summary(
                    value=[tf.Summary.Value(tag="loss", simple_value=loss / float(step))])
                file_writer.add_summary(loss_summary, global_step=episode_number)

                score_summary = tf.Summary(
                    value=[tf.Summary.Value(tag="score", simple_value=score)])
                file_writer.add_summary(score_summary, global_step=episode_number)

                episode_number += 1

    file_writer.close()
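The epsilon schedule in `train()` can be summarized as a closed-form function of the step count. This sketch hard-codes the flag values shown earlier (`init_epsilon`, `final_epsilon`, `epsilon_step_num`, `observe_step_num`) and only approximates the loop, which subtracts a fixed amount per step; `epsilon_at` is a hypothetical helper.

```python
INIT_EPSILON = 1.0
FINAL_EPSILON = 0.1
EPSILON_STEP_NUM = 1_000_000
OBSERVE_STEP_NUM = 50_000


def epsilon_at(global_step):
    """Approximate epsilon after `global_step` steps of linear annealing."""
    # No decay happens during the initial observation phase.
    decay_steps = max(0, global_step - OBSERVE_STEP_NUM)
    decay_per_step = (INIT_EPSILON - FINAL_EPSILON) / EPSILON_STEP_NUM
    # Epsilon never drops below its final value.
    return max(FINAL_EPSILON, INIT_EPSILON - decay_per_step * decay_steps)


for step in (0, 50_000, 550_000, 5_000_000):
    print(step, epsilon_at(step))
```

With these settings the agent acts fully at random for the first 50,000 steps and still takes a random action 10% of the time after annealing finishes.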

Next we define a method for testing our model.

In [None]:
def test():
    env = gym.make('BreakoutDeterministic-v4')
    env._max_episode_steps = 40000

    episode_number = 0
    epsilon = 0.001
    global_step = FLAGS.observe_step_num+1
    model = load_model(FLAGS.restore_file_path, custom_objects={'huber_loss': huber_loss})  # load model with customized loss func

    while episode_number < FLAGS.num_episode:

        done = False
        dead = False
        # 1 episode = 5 lives
        score, start_life = 0, 5
        observe = env.reset()

        observe, _, _, _ = env.step(1)
        # At start of episode, there is no preceding frame
        # So just copy initial states to make history
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))

        while not done:
            env.render()
            time.sleep(0.01)

            # get action for the current history and go one step in environment
            action = get_action(history, epsilon, global_step, model)
            # change action to real_action
            real_action = action + 1

            observe, reward, done, info = env.step(real_action)
            # pre-process the observation --> history
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)

            # if the agent missed ball, agent is dead --> episode is not over
            if start_life > info['ale.lives']:
                dead = True
                start_life = info['ale.lives']

            # TODO: may be we should give negative reward if miss ball (dead)
            reward = np.clip(reward, -1., 1.)

            score += reward

            # If agent is dead, set the flag back to false, but keep the history unchanged,
            # to avoid to see the ball up in the sky
            if dead:
                dead = False
            else:
                history = next_history

            # print("step: ", global_step)
            global_step += 1

            if done:
                episode_number += 1
                print('episode: {}, score: {}'.format(episode_number, score))
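Both `train()` and `test()` keep a rolling history of the four most recent frames. The sketch below isolates that step with dummy frames so the shapes are easy to follow; `push_frame` is a hypothetical helper wrapping the same `np.append` call used in the loops.

```python
import numpy as np


def push_frame(history, frame):
    """Return a new (1, 84, 84, 4) history with `frame` as the newest channel."""
    frame = np.reshape(frame, (1, 84, 84, 1))
    # Put the new frame first and keep the three most recent old frames.
    return np.append(frame, history[:, :, :, :3], axis=3)


# At the start of an episode the first frame is simply repeated four times.
first = np.zeros((84, 84), dtype=np.uint8)
history = np.stack((first, first, first, first), axis=2).reshape(1, 84, 84, 4)

# A dummy "next frame" filled with the value 7.
new_frame = np.full((84, 84), 7, dtype=np.uint8)
history = push_frame(history, new_frame)
print(history.shape, history[0, 0, 0])
```

Stacking frames this way lets the network infer motion, such as the direction of the ball, which a single frame cannot show.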

From here, we can train the model and then test it. However, I recommend that you use
the [actual file](./research-and-examples/boyuan-dqn-example.py) instead of the notebook for training and testing.

In [None]:
train()

In [None]:
test()

I quickly realized that my personal machine (a 2015 MacBook Pro with a 3.1 GHz Dual-Core Intel i7 processor and 16 GB DDR3 RAM) would not have enough CPU power to train the DQN to play _Breakout_ in a reasonable amount of time. I tried Google Colab, but that was not much better. I was able to connect to one of Calvin’s lab machines (3.6 GHz Quad-Core Intel i7, 16 GB RAM, and NVIDIA GeForce GTX 960) and start training the model there. After starting the training, I moved on to finding a better article related to deep Q-learning for game playing.


### _FrozenLake_ Reinforcement Learning

The _FrozenLake_ game board is 4x4, with a start (S) space in the top left and a goal (G) space in the bottom right. The rest of the spaces are a random assortment of frozen (F) spaces, which are safe to step on, and hole (H) spaces, which are not safe and cause the player to lose the game. Successfully traversing from S to G on F spaces rewards the agent positively. For every step taken the reward is 0, for falling in a hole the reward is 0, and for reaching the goal the reward is 1, which also ends the game. There are four actions the agent can take: up, down, left, right. At each state, the DQN estimates the reward for each action and takes the best one. Over time, the agent will learn that reaching the goal state is ideal because it receives a reward. One episode is one attempt at the game, which will either be a success (with a reward of 1) or a failure (0).
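The reward structure can be sketched without Gym. The layout, the action numbering, and the deterministic movement below are assumptions made for this illustration; the Gym environment can also make moves slippery, which this sketch ignores.

```python
# Illustrative 4x4 layout (an assumption for this sketch).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# Action numbering assumed here: 0 = left, 1 = down, 2 = right, 3 = up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(position, action):
    """Move deterministically and return (new_position, reward, done)."""
    row, col = position
    d_row, d_col = MOVES[action]
    # Clamp the move so the agent stays on the board.
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = LAKE[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "GH"
    return (row, col), reward, done


def play(actions):
    """Run one episode from the start tile; return (total_reward, done)."""
    position, total, done = (0, 0), 0.0, False
    for action in actions:
        position, reward, done = step(position, action)
        total += reward
        if done:
            break
    return total, done


success = play([1, 1, 2, 1, 2, 2])  # walks around the holes to the goal
failure = play([2, 1])              # steps right, then down into a hole
print(success, failure)
```

Because the only non-zero reward comes at the goal, an agent has to stumble onto a full successful path before it receives any learning signal.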

My research led me to an article “[Bias-Variance for Deep Reinforcement Learning: How To Build a Bot for Atari with OpenAI Gym](https://www.digitalocean.com/community/tutorials/how-to-build-atari-bot-with-openai-gym#step-6-%E2%80%94-creating-a-deep-q-learning-agent-for-space-invaders)” by Alvin Wan. From his tutorials, I was able to train an agent to play the game well, but I could not understand where the DNN was implemented in his code. To me it seemed like it was training just through using a gradient descent optimizer and a graph search to minimise error, which is not technically making it a deep network, but is still Q-learning I believe.

After Wan’s article, I searched for a Keras implementation of a DQN for _FrozenLake_, and found a [StackOverflow post](https://stackoverflow.com/questions/45869939/something-wrong-with-keras-code-q-learning-openai-gym-frozenlake) which led to a [Jupyter Notebook](https://gist.github.com/weiji14/bab587907681869ec0f70f7496f98a12), which referenced a [Keras DQN implementation for OpenAI’s FrozenLake](https://gist.github.com/ceshine/eeb97564c21a77b8c315179f82b3fc08), by GitHub user CeShine. Through their implementation, I was able to expand the code to play an 8x8 version of _FrozenLake_ and tried training with different deep architectures. Below I will walk through the code:

We will begin by importing what we need.

In [4]:
# Suppress warning for the notebook.
import warnings
warnings.filterwarnings('ignore')

"""
@author: CeShine
@author: Charkour
Updated to work for 8x8 frozen ice.
Ability to load weights.
Fix small bugs for new versions.
Reuseability features.

Using keras-rl (https://github.com/matthiasplappert/keras-rl) to provide basic framework,
and embedding layer to make it essentially a Q-table lookup algorithm.
"""

import sys
import tempfile
import gym
import numpy as np
import tensorflow as tf  # needed for the version printout below
from keras.models import Sequential
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import Policy
from rl.memory import SequentialMemory

print('python       :', sys.version.split('\n')[0])
print('numpy        :', np.__version__)
print('tensorflow   :', tf.__version__)
print('gym          :', gym.__version__)

python       : 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) 
numpy        : 1.18.1

In addition, [keras-rl](https://github.com/matthiasplappert/keras-rl) version 0.4.1 is installed.
Keras RL provides reinforcement learning classes built on the Keras API.

Next we define the policy for picking an action. It is an epsilon-greedy policy with a decaying epsilon.
Epsilon begins high, so the agent takes more random (exploratory) actions, and as it decays the agent makes more greedy choices.
This helps strike a balance between exploration and exploitation.

In [5]:

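class DecayEpsGreedyQPolicy(Policy):
    """Epsilon-greedy policy whose epsilon decays exponentially with the step count."""

    def __init__(self, max_eps=.1, min_eps=.05, lamb=0.001):
        super(DecayEpsGreedyQPolicy, self).__init__()
        self.max_eps = max_eps  # epsilon at step 0
        self.lambd = lamb       # decay rate
        self._steps = 0
        self.min_eps = min_eps  # floor that epsilon decays toward

    def select_action(self, q_values):
        assert q_values.ndim == 1
        nb_actions = q_values.shape[0]
        # Exponential decay from max_eps toward min_eps.
        eps = self.min_eps + (self.max_eps - self.min_eps) * \
            np.exp(-self.lambd * self._steps)
        self._steps += 1
        if self._steps % 1000 == 0:
            print("Current eps:", eps)
        if np.random.uniform() < eps:
            # Explore: uniform random action (randint excludes the upper bound).
            # Replaces the deprecated np.random.random_integers.
            action = np.random.randint(0, nb_actions)
        else:
            # Exploit: action with the highest estimated Q-value.
            action = np.argmax(q_values)
        return action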
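
To get a feel for how quickly exploration fades, the decay formula from the policy can be evaluated on its own. This is only a sketch: the helper name `decayed_epsilon` is made up for illustration, and the settings are the ones passed to the policy when the agents are built later in this walkthrough (`max_eps=0.9`, `min_eps=0`, `lamb=1e-4`).

```python
import math

def decayed_epsilon(step, max_eps, min_eps, lamb):
    """Epsilon after `step` action selections, using the policy's decay formula.

    Hypothetical helper for illustration; it is not part of keras-rl.
    """
    return min_eps + (max_eps - min_eps) * math.exp(-lamb * step)

# Print the exploration rate at a few points of a 100,000-step training run.
for step in (0, 10_000, 50_000, 100_000):
    eps = decayed_epsilon(step, max_eps=0.9, min_eps=0.0, lamb=1e-4)
    print(f"step {step:>7}: eps = {eps:.4f}")
```

With `min_eps=0`, epsilon keeps shrinking toward zero, so by the end of training the agent acts almost entirely greedily.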

Here we define some constants, adjust the print options, set up the environment,
and fix the random seeds for NumPy and the environment.
Using the same seed for every run helps us compare models to one another.

In [6]:
ENV_NAME = 'FrozenLake8x8-v0'
FILE_PATH = './research-and-examples/dqn_{}_weights_double.h5f'.format(ENV_NAME)

# Some parameters for printing the output.
np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=4)

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

Next we define the model. For now, the model is an embedding layer
that takes the current state as input. Its output is reshaped into four values,
one for each action: left, right, up, down.


In [7]:
def get_keras_model(action_space_shape):
    model = Sequential()
    model.add(Embedding(64, 4, input_length=1))
    model.add(Reshape((4,)))
    print(model.summary())
    return model
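
One way to read this architecture is as a plain Q-table: the embedding holds one row of four numbers per state, and looking up a state returns that row. The NumPy sketch below mimics that lookup under this interpretation. The table values are made up for illustration, and `q_values` is a hypothetical helper, not a Keras or keras-rl function.

```python
import numpy as np

N_STATES, N_ACTIONS = 64, 4  # 8x8 grid, four moves

# Stand-in for the embedding weights: one row of Q-values per state.
q_table = np.zeros((N_STATES, N_ACTIONS))
q_table[10] = [0.1, 0.4, 0.2, 0.0]  # made-up values for illustration

def q_values(state):
    """Embedding lookup plus reshape: return the row of Q-values for this state."""
    return q_table[state]

state = 10
print(q_values(state))                   # the four Q-values for state 10
print(int(np.argmax(q_values(state))))   # index of the greedy action
print(q_table.size)                      # number of trainable values: 256
```

Training the network then amounts to adjusting individual rows of this table, which is why the model has so few parameters.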

This is another architecture that uses densely connected layers after the embedding layer.
It has many more parameters than the one above.

In [None]:
def get_keras_model(action_space_shape):
    model = Sequential()  # the model must be created before layers are added
    model.add(Embedding(64, 32, input_length=1))
    model.add(Reshape((32,)))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(4, activation='linear'))
    print(model.summary())
    return model
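
The parameter counts of the two architectures can be checked by hand. The sketch below uses the standard counting rules (an embedding has `vocabulary x dimension` weights, and a dense layer has `inputs x outputs` weights plus one bias per output). The helper names are made up for illustration.

```python
def embedding_params(vocab_size, dim):
    """An embedding stores one vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def dense_params(n_in, n_out):
    """A dense layer has a weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out

# Embedding-only model: Embedding(64, 4).
small = embedding_params(64, 4)

# Dense model: Embedding(64, 32) followed by Dense(8), Dense(8), Dense(8), Dense(4).
dense = (embedding_params(64, 32)
         + dense_params(32, 8)
         + dense_params(8, 8)
         + dense_params(8, 8)
         + dense_params(8, 4))

print(small, dense)
```

Most of the dense model's parameters sit in its wider embedding layer rather than in the dense layers themselves.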

The DQN agent is also set up here. It uses the model and the policy defined above,
and it is compiled with the Adam optimizer.

In [None]:
model = get_keras_model(nb_actions)

memory = SequentialMemory(window_length=1, limit=10000)
policy = DecayEpsGreedyQPolicy(max_eps=0.9, min_eps=0, lamb=1 / (1e4))
dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=memory, nb_steps_warmup=500,
               target_model_update=1e-2, policy=policy,
               enable_double_dqn=False, batch_size=512
               )
dqn.compile(Adam())

The DQN below is a double DQN (DDQN), which helps prevent overestimation of Q-values.

In [8]:
model = get_keras_model(nb_actions)

memory = SequentialMemory(window_length=1, limit=10000)
policy = DecayEpsGreedyQPolicy(max_eps=0.9, min_eps=0, lamb=1 / (1e4))
dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=memory, nb_steps_warmup=500,
               target_model_update=1e-2, policy=policy,
               enable_double_dqn=True, batch_size=512
               )
dqn.compile(Adam())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 4)              256       
_________________________________________________________________
reshape_1 (Reshape)          (None, 4)                 0         
Total params: 256
Trainable params: 256
Non-trainable params: 0
_________________________________________________________________
None
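
The difference between the two agents comes down to how the training target for a transition is computed. The sketch below shows the usual textbook form of both targets in NumPy. It is not keras-rl's internal code; the function names, the Q-values, and the discount factor of 0.99 are all assumptions chosen for illustration.

```python
import numpy as np

def dqn_target(reward, q_target_next, gamma, terminal):
    """Standard DQN: the target network both picks and scores the next action."""
    if terminal:
        return reward
    return reward + gamma * np.max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next, gamma, terminal):
    """Double DQN: the online network picks the action, the target network scores it."""
    if terminal:
        return reward
    best_action = int(np.argmax(q_online_next))
    return reward + gamma * q_target_next[best_action]

# Made-up Q-values for the next state; the target network has one inflated entry.
q_online_next = np.array([1.0, 2.0, 0.5, 0.1])
q_target_next = np.array([0.8, 1.5, 3.0, 0.2])

print(dqn_target(0.0, q_target_next, gamma=0.99, terminal=False))
print(double_dqn_target(0.0, q_online_next, q_target_next, gamma=0.99, terminal=False))
```

In this example the standard target follows the inflated value of 3.0, while the double DQN target uses the value of the action the online network prefers, which is how the overestimation is dampened.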


Load any previously saved weights, train the DQN, and then save the weights.

In [None]:
try:
    dqn.load_weights(FILE_PATH)
except Exception as e:
    print(e)
    pass

temp_folder = tempfile.mkdtemp()

dqn.fit(env, nb_steps=1e5, visualize=False, verbose=1, log_interval=10000)

# After training is done, we save the final weights.
dqn.save_weights(FILE_PATH, overwrite=True)

Unable to open file (unable to open file: name = './research-and-examples/dqn_FrozenLake8x8-v0_weights_double_notebook.h5f', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Training for 100000.0 steps ...
Interval 1 (0 steps performed)

  973/10000 [=>............................] - ETA: 2:19 - reward: 0.0000e+00 

(Please note the above output has been truncated)


Load the weights and then test the DQN.

In [9]:
dqn.load_weights(FILE_PATH)

# Finally, evaluate our algorithm for 100 episodes.
dqn.test(env, nb_episodes=100, visualize=False)


Testing for 100 episodes ...

Episode 1: reward: 0.000, steps: 200
Episode 2: reward: 0.000, steps: 27
Episode 3: reward: 1.000, steps: 127
Episode 4: reward: 1.000, steps: 65
Episode 5: reward: 1.000, steps: 47
Episode 6: reward: 1.000, steps: 95
Episode 7: reward: 1.000, steps: 37
Episode 8: reward: 1.000, steps: 98
Episode 9: reward: 1.000, steps: 53
Episode 10: reward: 0.000, steps: 171
Episode 11: reward: 0.000, steps: 129
Episode 12: reward: 1.000, steps: 107
Episode 13: reward: 1.000, steps: 99
Episode 14: reward: 1.000, steps: 68
Episode 15: reward: 1.000, steps: 92
Episode 16: reward: 1.000, steps: 65
Episode 17: reward: 1.000, steps: 68
Episode 18: reward: 0.000, steps: 38
Episode 19: reward: 1.000, steps: 33
Episode 20: reward: 1.000, steps: 64
Episode 21: reward: 1.000, steps: 54
Episode 22: reward: 1.000, steps: 85
Episode 23: reward: 0.000, steps: 62
Episode 24: reward: 1.000, steps: 162
Episode 25: reward: 1.000, steps: 52
Episode 26: reward: 1.000, steps: 47
Episode 27:

<keras.callbacks.callbacks.History at 0x1327c78d0>

Again, I prefer to run the code outside the notebook, using a helper script I created
to assess the accuracy of the model. The file can be found [here](./research-and-examples/test.py).
This script runs the _FrozenLake_ code and then counts how many times out of
100 test episodes the agent crosses the lake.
Please note that this script does not work in notebooks.
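
The counting step itself is simple. The sketch below is not the helper script; it is a minimal stand-in that assumes the per-episode rewards have already been collected into a list, and `success_rate` is a made-up name. In _FrozenLake_ an episode earns a reward of 1 only when the agent reaches the goal, so counting positive rewards counts successful crossings. The rewards used here are the 26 complete episodes from the truncated test output above.

```python
def success_rate(episode_rewards):
    """Return (successes, total), counting episodes that earned a positive reward."""
    successes = sum(1 for reward in episode_rewards if reward > 0)
    return successes, len(episode_rewards)

# Rewards of the 26 complete test episodes shown in the output above.
rewards = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
           1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]

successes, total = success_rate(rewards)
print(f"{successes}/{total} episodes reached the goal")
```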

In its current state, this work extends CeShine’s example to play the 8x8 grid version of _FrozenLake_. I have also extended it by training the agent with several different architectures.


## Results


### _FrozenLake_

After training the Q-learning agent with different architectures, the best result I achieved was with a DDQN that solved the puzzle 82/100 times after being trained for 100,000 steps. The architecture was simply an embedding layer that took in the current state (one of 64 possible values), with its output reshaped into four values (one for each action). There were only 256 parameters. The single DQN version achieved only 79/100, which might show that the DDQN model does help prevent overestimation.

I also tried an architecture with four densely connected layers, which more closely resembles a typical deep neural network. Because the game has only 64 states, I wanted to keep the number of parameters small, so each layer was eight nodes wide except the last, which was four nodes wide. This resulted in 2,492 parameters. Both the DQN and DDQN achieved 38/100 after being trained for 100,000 steps. This is likely because the Q-learning agent overfit to the training game setup and then behaved poorly on the test game board, since the state space in the game is so small (64 states x 4 actions). Training the models for 100,000 steps took about twelve minutes on the lab computers and more than twenty on my machine.

According to Wan, an agent that achieves an accuracy of 78% is considered able to “solve” the puzzle. So my first two agents solved the puzzle, but the second two did not. For comparison, CeShine’s DQN example gets 80/100 on the 4x4 _FrozenLake_ game, while mine were trained on the 8x8 version.

While training, I noticed some interesting behavior from the agent. At the start, the agent would just walk into holes because it did not know better. After a bit of training, it would wander around the lake and eventually find the goal. After finding the goal and receiving a reward, the agent would learn from it and perform better from then on. This visualization is from GitHub user [weiji14](https://gist.github.com/weiji14/bab587907681869ec0f70f7496f98a12). The left-hand side shows the current position of the agent on the lake and the right side shows the lake. The agent starts in the top left, moves across the lake while avoiding the brown holes, and reaches the black goal space. This shows the 4x4 version of the game, but it applies to the 8x8 version as well:

![viz1](./research-and-examples/viz1.jpg)
![viz2](./research-and-examples/viz2.jpg)
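
The way a single reward changes the agent's later behavior follows from the Q-learning update rule. The sketch below is a simplified tabular version on a made-up three-state corridor, not the keras-rl agent used in this project. The learning rate, discount factor, and the helper `q_update` are assumptions chosen for illustration.

```python
import numpy as np

ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor, chosen for illustration

def q_update(q, state, action, reward, next_state, terminal):
    """Apply one tabular Q-learning update to `q` in place."""
    future = 0.0 if terminal else GAMMA * np.max(q[next_state])
    q[state, action] += ALPHA * (reward + future - q[state, action])

# Toy corridor: state 0 -> state 1 -> goal (state 2), with a single "forward" action.
q = np.zeros((3, 1))
q_update(q, 0, 0, 0.0, 1, terminal=False)  # nothing is known yet, so this stays 0
q_update(q, 1, 0, 1.0, 2, terminal=True)   # reaching the goal pays a reward of 1
q_update(q, 0, 0, 0.0, 1, terminal=False)  # the value now flows back to state 0
print(q.ravel())
```

Before the goal is found, every update leaves the values at zero, which matches the aimless wandering early in training. Once one rewarded step is recorded, earlier states start to inherit value from it.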

### _Breakout_

Although I had moved on to _FrozenLake_, I decided to leave a DQN agent training on Calvin’s lab machines to learn to play _Breakout_. I started training on Tuesday, May 5 at 3:50pm and stopped training on Tuesday, May 12 at 9:12am, for a total of 161 training hours. 19,502,978 steps were completed, which means that the DQN network was updated that many times. The final model was able to score 16 points, which is better than boyuanf’s model. Because of the way the models are saved, an earlier saved model might score higher.
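
The training time can be double-checked from the two timestamps. This is a small sketch that assumes the year is 2020 (taken from the update date of this write-up) and uses the step count reported above to estimate the average training speed.

```python
from datetime import datetime

# Start and stop times from the paragraph above; the year 2020 is an assumption.
start = datetime(2020, 5, 5, 15, 50)
stop = datetime(2020, 5, 12, 9, 12)

hours = (stop - start).total_seconds() / 3600
steps = 19_502_978  # total steps reported for the run

print(f"{hours:.1f} hours of training")
print(f"about {steps / (hours * 3600):.1f} steps per second on average")
```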

I am surprised by the progress it made. After 53 hours, and over 6.35 million iterations through the neural network, the agent could usually score over 5 by around episode 16,400. The highest score when I first checked was 38, achieved on episode 16,335. I continued to check on the training progress. After 7.7 million steps, on episode 19,784, the agent achieved a score of 48, and after 18.3 million steps, on episode 37,646, it achieved a score of 58. This is exciting progress, and it is better than boyuanf’s agents trained for 24 hours and 36 hours, which got 0 and 11 respectively. I think this is because I lowered the replay memory size tenfold from what he had.

While training the model, I noticed a slow start: on the first day the agent would get mostly 0 points, but would sometimes get to 10. The second day saw a great increase; there were typically no zeros and the score was usually over 5. On the third day, the score was usually in the double digits. On the fourth day I saw a score of 48, and on the fifth day I saw 58, which was the highest score. After three days of training, progress seemed to slow and the agent settled into consistent scores of 15-20, with the occasional record high or low score. Because the model is very large, it was set up to save itself only every 100,000 steps. The final model after 161 hours achieved a score of 16, which is shown below:

![gif](./research-and-examples/breakout.gif)


## Implications

Overall, I consider this project a success. I have learned that it takes a lot of time and energy to perform Q-learning, but as technology advances, it seems like it will become easier to achieve. I was able to get agents to play simple games like _Breakout_ and _FrozenLake_, and large companies have trained agents for more complex games. This project has shown that with enough time and power, an artificial agent can be trained to perform a simple task just by giving it rewards. Although this approach may not give results as good as, say, supervised learning in the same amount of time, reinforcement learning allows an agent to be trained without massive amounts of meticulously curated and labeled data. Instead, reinforcement learning only requires an interface between the agent and environment, a reward system, and definitions of the state, actions, and goal. With those in place, an agent can be trained over and over for hours, days, or years. It should be said that creating an interface and defining actions and rewards is much harder than the PhD-holding professionals at DeepMind and OpenAI make it out to be. Additionally, the computing power needed to train the models is not easily accessible to the everyday computer science student.

If we are able to train computers to complete tasks in video games, then it seems likely that we can do this in real life too. We already use reinforcement learning in our normal lives, such as when training a pet, and we will continue to do so with machines. This project makes me realize the ethical implications of reinforcement learning. If we are able to train machines to play video games well, then the same techniques could be used for cheating, which leads to unethical gains in numerous categories. Additionally, applying reinforcement learning to agents in more complex environments (like _[Grand Theft Auto V](https://www.inverse.com/article/31858-ai-artificial-intelligence-driving-grand-theft-auto-v-twitch)_ or robots in real life) raises ethical questions about the function that action-exploiting agents serve. If their rewards are ill-defined, or they are chasing an unethical goal, these agents could cause serious issues. I do not have answers to those questions, but I know that people will continue to push this technology forward, so I think it wise to continue studying it and asking questions. I feel that reinforcement learning relates very closely to how humans act and learn, as we get rewards or consequences for our actions. And before every action, we deliberate (sometimes not enough) on what we think our reward will be. However, in the current code setup, I do not believe the Q-learning agents have a concept of long-term rewards the way humans do. Still, I think it is possible to train machines to account for long-term effects, but coding and training them will take even more time. It seems this is only the beginning of reinforcement learning.

