# Artificial Intelligence Proposal

By Charles Kornoelje

Updated 05/07/2020

Note: I really appreciate your feedback, Professor Vander Linden. Let me know your thoughts. Thanks.


## Vision

The goal of my CS 344 honors final project is to take a deep dive into [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), with the hope of training an artificial intelligence agent to play a video game. The first agent I implemented was a [deep Q-learning network](https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning) (DQN) designed to play the Atari 2600 game <i>[Breakout](https://en.wikipedia.org/wiki/Breakout_(video_game))</i>, the classic brick-breaking game. However, I quickly learned that training a somewhat intelligent agent takes a lot of computational time and energy, neither of which I have. So I left that agent training and moved on to find a game that requires less power, which led me to the text-based video game _[FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/)_. After following a guide, I was able to train a capable agent to play it.

The purpose of the project is to learn how to use reinforcement learning to train agents. Reinforcement learning is a domain of machine learning in which an agent takes actions based on observations of its environment in order to maximize its reward. The project falls under active reinforcement learning: a Q-learning agent learns an action-utility function (Q-function) that estimates the expected reward of taking each action in each state, without having to learn a transition model of the environment. Learning the Q-function helps the agent choose actions that maximize its score in video games. If an agent can be trained to play a game well, similar training techniques may carry over to real-life tasks.
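As a minimal sketch of what learning a Q-function means in the simplest (tabular) case, the snippet below applies one standard Q-learning update. The function name `q_update` and the table shape are illustrative assumptions; the step size 0.15 and discount 0.99 are borrowed from the hyperparameters in the _FrozenLake_ listing later in this report.

```python
import numpy as np


def q_update(q, state, action, reward, next_state, alpha=0.15, gamma=0.99):
    """Apply one tabular Q-learning update to `q` in place; return the TD error."""
    # Target: immediate reward plus the discounted value of the best next action.
    td_target = reward + gamma * np.max(q[next_state])
    td_error = td_target - q[state, action]
    # Move the current estimate a fraction `alpha` of the way toward the target.
    q[state, action] += alpha * td_error
    return td_error


# Hypothetical table: 16 states and 4 actions, all estimates starting at zero.
q = np.zeros((16, 4))
q_update(q, state=14, action=2, reward=1.0, next_state=15)
print(q[14, 2])  # 0.15
```

Only the entry for the visited state-action pair changes; repeated over many episodes, reward information spreads backward from the goal to earlier states.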


## Background

Both the _Breakout_ and _FrozenLake_ games are environments from [OpenAI](https://openai.com/)’s [Gym](https://gym.openai.com/) Python package. The Gym versions of the games make it easy to interface with modern machine learning frameworks, such as [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/), which I will be using to train my artificial intelligence agent. I chose these technologies because I have previous experience with them and because the guides I follow implement the reinforcement learning algorithms with them. The agent will be trained with a deep neural network and reinforcement learning algorithms. My basic understanding of reinforcement learning came from Chapter 21 of _[Artificial Intelligence: A Modern Approach, Third Edition](http://aima.cs.berkeley.edu/)_ by Russell and Norvig.


## Implementation

To train the agent to play _Breakout_, I followed the article “[Beat Atari with Deep Reinforcement Learning! (Part 1: DQN)](https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26)” by Adrien Lucas Ecoffet, which provided a solid overview of the idea of Q-learning but did not explain the code implementation in enough detail. One of the commenters shared their [GitHub repo implementation](https://github.com/boyuanf/DeepQLearning) of a DQN that plays _Breakout_. After slightly modifying boyuanf’s code, I was able to get it running on my machine. I quickly realized that Q-learning involves a lot of math and custom functions that are specific to the problem, and I did not yet know how to build it on my own. I have added my [updated version of the code](https://github.com/charkour/cs344/blob/master/project/research-and-examples/boyuan-dqn-example.py) to my project directory.

The DQN is trained on an array representation of the current screen state in which each pixel has an RGB value, so the array has shape (210, 160, 3). For each state, an integer reward reinforces each action, with positive integers being positive reinforcement. The major change I made was to reduce the number of previous actions and responses remembered tenfold. Previously, boyuanf was storing 20 GB of past decisions and rewards, but I felt that was too much and that storing less would help the agent find better decisions more quickly.

The architecture starts with a layer that normalizes the (210, 160, 3) input, followed by two convolutional layers whose output is flattened into a dense layer with 256 rectifier units, then another dense layer with one unit per action (3 in total), and finally a filtered output that applies a mask to select one action. The model has over 600,000 learned parameters.

The code relies on the following packages:



*   [gym](https://gym.openai.com/)
*   [numpy](https://numpy.org/)
*   [tensorflow](https://www.tensorflow.org/)
*   [keras](https://keras.io/)
*   [skimage](https://scikit-image.org/) (for preprocessing)
*   [collections](https://docs.python.org/3.6/library/collections.html#collections.deque) (for deque)
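The memory of previous actions and responses mentioned above is the kind of structure a `deque` is typically used for. The following is a minimal sketch of a bounded replay memory, not boyuanf’s actual code: the class name `ReplayMemory`, its methods, and the tiny capacity are assumptions chosen for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch, drawn without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)  # tiny capacity, only for the demonstration
for step in range(5):
    memory.add(step, 0, 0.0, step + 1, False)

print(len(memory))          # 3
print(memory.buffer[0][0])  # 2 (the transitions from steps 0 and 1 were evicted)
```

Lowering `capacity` is the analogue of the tenfold reduction described above: the agent trains on a smaller, more recent window of experience and needs less RAM.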

I quickly realized that my personal machine (a 2015 MacBook Pro with a 3.1 GHz Dual-Core Intel i7 processor and 16 GB DDR3 RAM) wouldn’t have enough CPU power to train the DQN to play _Breakout_ in a reasonable amount of time. I tried Google Colab, but that wasn’t much better. I was able to connect to one of Calvin’s lab machines and start training the model (TODO: give the specs?). After starting the training, I moved on to finding a better article on deep Q-learning for game playing. Determined not to give up on my quest to train a video-game-playing agent with deep Q-learning, my research led me to an article on [DigitalOcean](https://www.digitalocean.com/), “[Bias-Variance for Deep Reinforcement Learning: How To Build a Bot for Atari with OpenAI Gym](https://www.digitalocean.com/community/tutorials/how-to-build-atari-bot-with-openai-gym#step-6-%E2%80%94-creating-a-deep-q-learning-agent-for-space-invaders)” by Alvin Wan. From his tutorial, I was able to get a simple Q-learning network running that trains an agent to play the _FrozenLake_ game and achieves good results. So far I have only worked through this blog post, and I am hoping to expand upon his work.

The code requires `Python 3.6` and the following specific package versions:



*   gym 0.9.5
*   tensorflow 1.5.0
*   numpy 1.14.0

The following code is the result of following Wan’s tutorial exactly. The _FrozenLake_ board is a 4x4 grid with a start (S) space in the top left and a goal (G) space in the bottom right. The remaining spaces are a mix of frozen (F) spaces, which are safe to step on, and hole (H) spaces, which end the game in a loss if the player steps on them. Successfully traversing from S to G on F spaces rewards the agent: the reward is 0 for every ordinary step, 0 for falling in a hole, and 1 for reaching the goal. The agent can take four actions, each of which moves it in one of the cardinal directions. In each state, the network estimates the value of each action and usually takes the best one, occasionally choosing a random action to explore. Over time, the agent learns that reaching the goal state is ideal because it receives a reward. One episode is one attempt at the game, which ends in either success (a reward of 1) or failure (0).
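To make the board and reward rules above concrete, here is a small sketch of the game logic in plain Python. It is a simplification under stated assumptions: the layout is the default 4x4 map that ships with Gym, movement is deterministic (the real `FrozenLake-v0` is “slippery” by default, so moves can slide sideways), and the helpers `step` and `play` are hypothetical stand-ins, not Gym’s API.

```python
# Default 4x4 layout used by Gym's FrozenLake-v0.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # Gym's action numbering
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def step(state, action):
    """Deterministic move on the grid; returns (next_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    # Walking into a wall leaves the player on the edge tile.
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = GRID[row][col]
    # Reward is 1 only on the goal; the episode ends on the goal or in a hole.
    return row * 4 + col, float(tile == "G"), tile in "GH"


def play(actions, state=0):
    """Run a fixed action sequence from the start; returns (final_state, total_reward)."""
    total = 0.0
    for action in actions:
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return state, total


print(play([DOWN, DOWN, RIGHT, DOWN, RIGHT, RIGHT]))  # (15, 1.0): reached the goal
print(play([RIGHT, DOWN]))                            # (5, 0.0): fell in a hole
```

Because the only nonzero reward arrives at the very end of a successful episode, the agent gets no feedback from most of its early, random attempts.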

To the best of my knowledge, the training is done with a gradient descent optimizer. Wan’s guide refers to “the network” without really describing its architecture, so I will have to research this more. As far as I can tell from the code, the model is a single weight matrix `W` multiplied by a one-hot encoding of the state, built as a TensorFlow computation graph and trained to minimize the squared error between the predicted and target Q-values. With no hidden layers it is not technically a deep network, but I believe it is still Q-learning.
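One way to see what the model in the listing below computes: `q_current = tf.matmul(obs_t_ph, W)` multiplies a one-hot state vector by a weight matrix, which simply selects one row of `W`. The NumPy sketch below (an illustration, not part of Wan’s code; the chosen state, action, and target are arbitrary) checks this and shows that one gradient step on the squared error changes a single entry, just like a Q-table update.

```python
import numpy as np

n_obs, n_actions = 16, 4
rng = np.random.RandomState(0)
W = rng.uniform(0, 0.01, size=(n_obs, n_actions))  # same init range as the listing


def one_hot(i, n):
    """Return the ith standard basis vector as a (1, n) row."""
    return np.identity(n)[i].reshape((1, -1))


state, action, target, lr = 5, 2, 1.0, 0.15  # arbitrary values for the demo

# 1. A one-hot row vector times W picks out row `state` of W.
q_values = one_hot(state, n_obs) @ W
assert np.allclose(q_values[0], W[state])

# 2. The gradient of (target - W[state, action])**2 with respect to W
#    is nonzero only at entry [state, action].
grad = np.zeros_like(W)
grad[state, action] = -2.0 * (target - W[state, action])
W_new = W - lr * grad

changed = np.argwhere(~np.isclose(W_new, W))
print(changed)  # [[5 2]]
```

In other words, under these assumptions `W` behaves like a 16x4 Q-table whose effective step size is twice the learning rate.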

```python
"""
Bot 4 -- Use Q-learning network to train bot
"""

from typing import List
import gym
import numpy as np
import random
import tensorflow as tf
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)

num_episodes = 4000
discount_factor = 0.99
learning_rate = 0.15
report_interval = 500
exploration_probability = lambda episode: 50. / (episode + 10)
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def one_hot(i: int, n: int) -> np.array:
    """Implements one-hot encoding by selecting the ith standard basis vector"""
    return np.identity(n)[i].reshape((1, -1))


def print_report(rewards: List, episode: int):
    """Print rewards report for current episode
    - Average for last 100 episodes
    - Best 100-episode average across all time
    - Average for all episodes across time
    """
    print(report % (
        np.mean(rewards[-100:]),
        max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
        np.mean(rewards),
        episode))


def main():
    env = gym.make('FrozenLake-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []

    # 1. Setup placeholders
    n_obs, n_actions = env.observation_space.n, env.action_space.n
    obs_t_ph = tf.placeholder(shape=[1, n_obs], dtype=tf.float32)
    obs_tp1_ph = tf.placeholder(shape=[1, n_obs], dtype=tf.float32)
    act_ph = tf.placeholder(tf.int32, shape=())
    rew_ph = tf.placeholder(shape=(), dtype=tf.float32)
    q_target_ph = tf.placeholder(shape=[1, n_actions], dtype=tf.float32)

    # 2. Setup computation graph
    W = tf.Variable(tf.random_uniform([n_obs, n_actions], 0, 0.01))
    q_current = tf.matmul(obs_t_ph, W)
    q_target = tf.matmul(obs_tp1_ph, W)

    q_target_max = tf.reduce_max(q_target_ph, axis=1)
    q_target_sa = rew_ph + discount_factor * q_target_max
    q_current_sa = q_current[0, act_ph]
    error = tf.reduce_sum(tf.square(q_target_sa - q_current_sa))
    pred_act_ph = tf.argmax(q_current, 1)

    # 3. Setup optimization
    trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update_model = trainer.minimize(error)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        for episode in range(1, num_episodes + 1):
            obs_t = env.reset()
            episode_reward = 0
            while True:
                # env.render()

                # 4. Take step using best action or random action
                obs_t_oh = one_hot(obs_t, n_obs)
                action = session.run(pred_act_ph, feed_dict={obs_t_ph: obs_t_oh})[0]
                if np.random.rand(1) < exploration_probability(episode):
                    action = env.action_space.sample()
                obs_tp1, reward, done, _ = env.step(action)

                # 5. Train model
                obs_tp1_oh = one_hot(obs_tp1, n_obs)
                q_target_val = session.run(q_target, feed_dict={obs_tp1_ph: obs_tp1_oh})
                session.run(update_model, feed_dict={
                    obs_t_ph: obs_t_oh,
                    rew_ph: reward,
                    q_target_ph: q_target_val,
                    act_ph: action
                })
                episode_reward += reward
                obs_t = obs_tp1

                if done:
                    rewards.append(episode_reward)
                    if episode % report_interval == 0:
                        print_report(rewards, episode)
                    break
        print_report(rewards, -1)

main()
```


```
[2020-05-07 21:40:03,308] Making new env: FrozenLake-v0
100-ep Average: 0.39 . Best 100-ep Average: 0.40 . Average: 0.10 (Episode 500)
100-ep Average: 0.28 . Best 100-ep Average: 0.57 . Average: 0.24 (Episode 1000)
100-ep Average: 0.63 . Best 100-ep Average: 0.63 . Average: 0.34 (Episode 1500)
100-ep Average: 0.57 . Best 100-ep Average: 0.72 . Average: 0.39 (Episode 2000)
100-ep Average: 0.63 . Best 100-ep Average: 0.74 . Average: 0.43 (Episode 2500)
100-ep Average: 0.61 . Best 100-ep Average: 0.74 . Average: 0.46 (Episode 3000)
100-ep Average: 0.74 . Best 100-ep Average: 0.74 . Average: 0.48 (Episode 3500)
100-ep Average: 0.68 . Best 100-ep Average: 0.78 . Average: 0.50 (Episode 4000)
100-ep Average: 0.68 . Best 100-ep Average: 0.78 . Average: 0.50 (Episode -1)
```
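One detail of the listing that helps in reading these numbers is its exploration schedule, `50. / (episode + 10)`: the agent takes a random action whenever a uniform random draw falls below that value. The sketch below tabulates the schedule at a few episodes; the helper name `schedule` and the choice of episodes are assumptions for illustration, and the value is capped at 1 only for display.

```python
def exploration_probability(episode):
    """Same formula as the lambda in the listing above."""
    return 50. / (episode + 10)


def schedule(episodes):
    """Return (episode, probability of a random action) pairs, capped at 1."""
    return [(ep, round(min(1.0, exploration_probability(ep)), 3)) for ep in episodes]


for episode, probability in schedule([1, 40, 490, 4000]):
    print(episode, probability)
# 1 1.0
# 40 1.0
# 490 0.1
# 4000 0.012
```

For the first 40 episodes the formula is at least 1, so every action is random. That is consistent with the low overall average early in training and with the 100-episode averages rising as exploration tapers off.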


In its current state, this does not extend the work I have referenced. My future work will be to update and optimize the algorithm to work for an 8x8 grid version of _FrozenLake_ instead of the 4x4 one used in the guide. Perhaps I will also update the TensorFlow code to use Keras instead, but I’ll see if I can get the 8x8 version working first. It would also be interesting to update the _FrozenLake_ environment to give the agent a positive score when it follows the optimal path towards the goal, but that would also take a little bit of math and programming, so I’m not sure if that will be done.
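As a rough sketch of what rewarding progress towards the goal could look like on the 8x8 board, the snippet below adds a small bonus whenever a move reduces the Manhattan distance to the goal. Everything here is a hypothetical illustration rather than a planned implementation: the helper names, the `bonus` size, and the use of Manhattan distance, which ignores holes and so only approximates the optimal path.

```python
def manhattan_to_goal(state, size=8):
    """Steps from `state` to the bottom-right goal, ignoring holes."""
    row, col = divmod(state, size)
    return (size - 1 - row) + (size - 1 - col)


def shaped_reward(state, next_state, env_reward, size=8, bonus=0.01):
    """Environment reward plus a small bonus (or penalty) for progress towards the goal."""
    progress = manhattan_to_goal(state, size) - manhattan_to_goal(next_state, size)
    return env_reward + bonus * progress


# Moving from the start (state 0) one tile to the right earns a small bonus...
print(shaped_reward(0, 1, env_reward=0.0) > 0)  # True
# ...and moving back loses the same amount.
print(shaped_reward(1, 0, env_reward=0.0) < 0)  # True
```

Because the bonus is the difference of a fixed distance measure between consecutive states, wandering in a loop earns nothing in total. That keeps the shaped reward from encouraging the agent to circle instead of finishing.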


## Results

Currently, when trained for 4000 episodes, the Q-learning agent completes the _FrozenLake_ game 78 times out of 100 in its best 100-episode window (it used to be 82; I am not sure what changed). In the reinforcement learning domain, “solving” the puzzle is anything above 72 times, but I am not sure why this is considered the standard for assessing the ability of an artificial intelligence agent. Because I followed Wan’s tutorial exactly, the performance is the same as his. For the 8x8 puzzle, my guess is that it will take much more than four times as many episodes to get comparable results: I tried it with 16,000 episodes and then 800,000, and the agent made no progress on solving the problem. So my guess is that the algorithm, as written, does not generalize to every size of _FrozenLake_ puzzle.

Additionally, I am excited to report the progress of the agent training on the campus lab computer. It has been training for 53 hours, completing over 6.35 million iterations through the neural network to estimate the agent’s future reward if an action is taken. I am surprised by the progress it has made. The highest score I’ve seen recently is 38, achieved on episode 16335, and it can usually score 5 or more around episode 16400. This is exciting progress, and it is performing better than boyuanf’s agents trained for 24 hours and 36 hours, which scored 0 and 11 respectively. I think this is because I lowered the memory tenfold from what he had. I do not think I will continue working on this; instead, I will focus my work on the _FrozenLake_ text game, which takes less computational time and energy.


## Implications

Reinforcement learning is an interesting approach to training intelligent agents. This project has shown that with enough time and power, an artificial agent can be trained to perform a simple task just by giving it rewards. Although this approach may not give results as good as, say, supervised learning in the same amount of time, reinforcement learning allows the agent to be trained without massive amounts of meticulously curated and labeled data. One difficulty with reinforcement learning is having to define an action set and assign rewards for certain actions or sequences of actions. If we are able to train computers to complete tasks in video games, then it seems likely that we can do this for real-life tasks too. We already use reinforcement learning in our normal lives, such as when training a pet, and we will continue to do it with machines.

I’ve learned that it takes a lot of time and energy to perform Q-learning, but as technology advances, it seems like it will become easier to achieve. Some really great work has come from reinforcement learning techniques, and I think they will only get better. I feel that reinforcement learning relates very closely to how humans act and learn, as we get rewards or consequences for our actions. And before every action, we deliberate (sometimes not enough) on what we think our reward will be. However, in the current code setup, the only notion of long-term reward the Q-learning agents have is the discount factor applied to estimated future rewards; I do not believe they plan for the long term the way humans are able to. Yet, I think it is possible to train machines with long-term effects in mind, but coding it and training it will take even more time.

