<h1>CS4619: Artificial Intelligence II</h1>
<h1>Reinforcement Learning, Again</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

<h1>Reinforcement Learning</h1>
<!-- https://twitter.com/xiaxiaoqiang/status/1731094134134276310 -->
<ul>
    <li>There are many algorithms for RL, but the one we studied previously was $Q$-Learning.
    </li>
    <li>The problem with algorithms such as $Q$-Learning is the table, which has an entry for every state
        paired with every action.
        <ul>
            <li>This does not scale well to problems that have many states and/or many actions.</li>
            <li>It fails to generalise: similar states probably have similar $Q$-values, but the table learns each entry independently.</li>
        </ul>
    </li>
    <li>The state-of-the-art solution is to use a deep neural network to represent and learn the $Q$-values.</li>
</ul>

<h1>Deep $Q$-Learning</h1>
<ul>
    <li>In <b>Deep $Q$-Learning</b> we use a deep neural network to represent and learn the $Q$-values.</li>
    <li>We're going to look at one way of doing this: a DQN (<b>Deep $Q$-Network</b>).</li>
    <li>A DQN predicts $Q$-values (regression!).</li>
    <li>One way of doing this:
        <ul>
            <li>Input layer takes in a state and an action.</li>
            <li>Output layer has just one neuron, which outputs the $Q$-value.</li>
            <li>To choose which action to take in a state, we must activate the network repeatedly: once
                for each action.
            </li>       
        </ul>
    </li>
    <li>Another way of doing this:
        <ul>
            <li>Input layer takes in just a state.</li>
            <li>Output layer has one neuron per action to output their $Q$-values.</li>
            <li>To choose which action to take in a state, we activate the network just once and choose the 
                action for the output neuron with the largest activation.
            </li>
        </ul>
        This is what DQNs do. It is regression but, given a state, you're predicting more than one target value (one per action). (This is sometimes known as multivariate regression.)
    </li>
</ul>
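<p>The second design can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the layer sizes, the helper names (<code>make_layer</code>, <code>dense</code>, <code>q_values</code>, <code>greedy_action</code>) and the untrained random weights are all made up here, and a real DQN would be built with a deep learning library.</p>

```python
import random

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(layer, inputs, relu=False):
    """Apply a dense layer; optionally follow it with a ReLU activation."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs

def q_values(network, state):
    """One forward pass: a state goes in, one Q-value per action comes out."""
    hidden_layer, output_layer = network
    # The output layer is linear because predicting Q-values is regression.
    return dense(output_layer, dense(hidden_layer, state, relu=True))

def greedy_action(network, state):
    """Index of the output neuron with the largest activation."""
    qs = q_values(network, state)
    return max(range(len(qs)), key=lambda a: qs[a])

rng = random.Random(0)
STATE_SIZE, HIDDEN_SIZE, N_ACTIONS = 4, 8, 3   # illustrative sizes (assumed)
network = (make_layer(STATE_SIZE, HIDDEN_SIZE, rng),
           make_layer(HIDDEN_SIZE, N_ACTIONS, rng))

state = [0.5, -0.2, 0.1, 0.9]
print(q_values(network, state))       # three Q-values, one per action
print(greedy_action(network, state))  # the action to take when exploiting
```

<p>Note that a single call to <code>q_values</code> is enough to compare all the actions, which is the advantage of this design over the one-output design.</p>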

<h2>DQN</h2>
<ul>
    <li>Define a neural network (the second kind that was discussed above) and initialize its weights randomly.</li>
    <li>Pseudocode (discussed in subsequent slides, below):
<figure style="border: 1px solid black; background-color: #D0D0D0">
    <figcaption style="border-bottom: 1px solid black">
        Deep-Q-Learning($\epsilon$)
    </figcaption>
    <ul>
        <li>Initialize replay memory: $D = [\,]$</li>
        <li>$\v{s} = \mathit{SENSE}()$;</li>
        <li>do forever
            <ul>
                <li>$\mathit{rand} = $ a randomly-generated number in $[0,1)$;</li>
                <li>if $\mathit{rand} < \epsilon$
                    <ul>
                        <li>Choose action $a$ randomly;</li>
                    </ul>
                    else
                    <ul>
                        <li>$a = \argmax_a Q(\v{s}, a)$;</li>
                    </ul>
                </li>
                <li>$r = \mathit{EXECUTE}(a)$;</li>
                <li>$\v{s}' = \mathit{SENSE}()$;</li>
                <li>Store $\langle\v{s}, a, r, \v{s}'\rangle$ in $D$</li>
                <li>Randomly choose a mini-batch $\v{X}$ of examples from $D$</li>
                <li>For each $\langle \v{s}, a, r, \v{s}'\rangle \in \v{X}$, calculate the target value, i.e.
                    $r + \gamma\max_{a'}Q(\v{s}', a')$
                </li>
                <li>Train the network on mini-batch $\v{X}$ (one iteration only) using loss function
                    $\frac{1}{2}\sum_{\langle \v{s}, a, r, \v{s}'\rangle \in \v{X}}[r + \gamma\max_{a'}Q(\v{s}', a') - Q(\v{s}, a)]^2$
                </li>
                <li>$\v{s} = \v{s}'$;</li>
            </ul>
        </li>
    </ul>
</figure>
    </li>
    <li>Keep in mind that the $Q$-values above ($Q(\v{s}, a)$ and $Q(\v{s}', a')$) are predicted by
        the neural network.
    </li>
</ul>

<h2>Exploration vs. Exploitation</h2>
<ul>
    <li>We discussed this previously.</li>
    <li>The $\epsilon$-greedy policy is a simple solution.
    </li>
</ul>
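<p>As a reminder, $\epsilon$-greedy action selection might look like the following sketch. The function name <code>epsilon_greedy</code> and the example $Q$-values are assumptions made for illustration; in a DQN the $Q$-values would come from the network.</p>

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action); otherwise exploit (argmax)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(42)
qs = [0.1, 0.7, 0.3]   # made-up Q-values for three actions
choices = [epsilon_greedy(qs, 0.1, rng) for _ in range(1000)]
# Fraction of the time the greedy action (index 1) was chosen.
print(choices.count(1) / len(choices))
```

<p>With $\epsilon = 0.1$ the greedy action is chosen most of the time, but every action still has a chance of being tried.</p>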

<h2>DQN's loss function</h2>
<ul>
    <li>In supervised learning, the target values are fixed before learning begins.</li>
    <li>In DQNs, the targets change!</li>
    <li>Consider a transition $\langle \v{s}, a, r, \v{s}'\rangle$:
        <ul>
            <li>Suppose the agent is in state $\v{s}$.</li>
            <li>It feeds $\v{s}$ into the network; the output neurons produce $Q$-values for each action.</li>
            <li>These $Q$-values are the <em>predictions</em>.</li>
            <li>It chooses the action $a$ that has highest $Q$-value: $\argmax_a Q(\v{s}, a)$.</li>
            <li>It executes $a$, obtaining reward $r$ and transitioning to state $\v{s}'$.</li>
            <li>It feeds $\v{s}'$ into the network; the output neurons produce $Q$-values for each action.</li>
            <li>For action $a$ (the one we chose), 
                $r + \gamma\max_{a'}Q(\v{s}', a')$ is a better estimate and so this is the 
                <em>target</em>.
            </li>
            <li>For the other actions (the ones we did not choose), the target is the same as the
                prediction (error is 0).
            </li>
        </ul>
    </li>
    <li>We can use mean-squared-error for the loss function:
        <ul>
            <li>The square of the difference between target and prediction.</li>
            <li>In effect, the loss is:
                $$\frac{1}{2}[\underbrace{r + \gamma\max\nolimits_{a'}Q(\v{s}', a')}_{\mathit{target}} - \underbrace{Q(\v{s}, a)}_{\mathit{prediction}}]^2$$
            </li>
        </ul>
    </li>
</ul>
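<p>The target construction described above can be checked with a small worked example. This is a sketch, not the DeepMind implementation: the function names <code>target_vector</code> and <code>loss</code>, the value of $\gamma$ and the $Q$-values are all assumed for illustration.</p>

```python
GAMMA = 0.95  # discount factor; the value is an illustrative assumption

def target_vector(q_s, action, reward, q_s_next, gamma=GAMMA):
    """Targets for every output neuron, given one transition.

    q_s      -- the network's predictions for state s (one per action)
    q_s_next -- the network's predictions for state s'
    """
    targets = list(q_s)   # actions not taken: target = prediction, so error is 0
    targets[action] = reward + gamma * max(q_s_next)
    return targets

def loss(q_s, targets):
    """Half the sum of squared differences between targets and predictions."""
    return 0.5 * sum((t - q) ** 2 for t, q in zip(targets, q_s))

q_s = [1.0, 2.0, 0.5]        # predictions for s
q_s_next = [0.0, 4.0, 1.0]   # predictions for s'
targets = target_vector(q_s, action=1, reward=1.0, q_s_next=q_s_next, gamma=0.5)
print(targets)             # [1.0, 3.0, 0.5]
print(loss(q_s, targets))  # 0.5
```

<p>Only the output neuron for the action that was taken contributes to the loss: here $1 + 0.5 \times 4 = 3$ is the target and $2$ is the prediction, giving $\frac{1}{2}(3 - 2)^2 = 0.5$.</p>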

<h2>Experience replay</h2>
<ul>
    <li>The previous slide implies that, in each sense-plan-act cycle, we train on just one example,
        <ul>
            <li>i.e. for transition $\langle \v{s}, a, r, \v{s}'\rangle$, we described the target value
                and prediction.
            </li>
        </ul>
    </li>
    <li>But training on one example at a time takes a long time to converge, and consecutive examples are highly correlated.</li>
    <li>Instead, DQNs use <b>experience replay</b>:
        <ul>
            <li>All the experiences $\langle \v{s}, a, r, \v{s}'\rangle$ are stored in a 
                so-called 'replay memory'.</li>
            <li>When training, the DQN chooses a random mini-batch from the replay memory.</li>
        </ul>
    </li>
</ul>
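<p>A replay memory can be as simple as a bounded queue with random sampling. The sketch below assumes a fixed capacity (the pseudocode above lets $D$ grow without limit), and the class name <code>ReplayMemory</code> and its methods are hypothetical.</p>

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of transitions (s, a, r, s')."""

    def __init__(self, capacity):
        # A bounded deque discards the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5)
for t in range(8):
    # Dummy transitions: integers stand in for real states.
    memory.store(state=t, action=t % 2, reward=1.0, next_state=t + 1)

print(len(memory))   # 5
batch = memory.sample(3, random.Random(0))
print(batch)         # three transitions drawn from the five most recent
```

<p>Because the mini-batch is sampled at random, the examples in it come from different points in the agent's history rather than from consecutive time steps.</p>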

<h2>Further tweaks</h2>
<ul>
    <li>The presentation above simplifies the algorithm a little.</li>
    <li>It ignores an extra if-statement for handling the situation where the loop terminates (e.g. when a
        game is over).</li>
    <li>More importantly, DQNs include lots of other 'tricks' to improve convergence, including:
        <ul>
            <li>$\epsilon$ decreases, e.g. from 1 to 0.1 over time.</li>
            <li>The use of two neural networks (one for predicting targets and one for everything else).
                <ul>
                    <li>Weights from the second one are copied into the first one periodically.</li>
                </ul>
            </li>
            <li>Error clipping, reward clipping, &hellip;</li>
        </ul>
    </li>
    <li>We won't concern ourselves with these in CS4619!</li>
</ul>
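<p>Two of these tweaks are easy to illustrate: a decreasing $\epsilon$ and the extra if-statement for the end of an episode. The sketch assumes a linear schedule and a <code>done</code> flag supplied by the environment; the function names, the number of decay steps and the value of $\gamma$ are made up for illustration.</p>

```python
def epsilon_at(step, start=1.0, end=0.1, decay_steps=10_000):
    """Linearly anneal epsilon from start to end over decay_steps, then hold it at end."""
    fraction = min(step / decay_steps, 1.0)
    return end + (1.0 - fraction) * (start - end)

def td_target(reward, q_s_next, done, gamma=0.99):
    """Target for the action taken; at the end of an episode there is no future return."""
    if done:
        return reward
    return reward + gamma * max(q_s_next)

for step in (0, 5_000, 10_000, 20_000):
    print(step, epsilon_at(step))   # epsilon falls from start towards end

print(td_target(1.0, [0.5, 2.0], done=True))   # 1.0
```

<p>When the game is over there is no next state to bootstrap from, so the target is just the final reward.</p>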

<h1>DQNs for Atari Games</h1>
<ul>
    <li>The DQN algorithm was developed by a company called DeepMind (now owned by Google).</li>
    <li>In 2013/2014, they trained DQNs to play Atari 2600 video games.</li>
    <li>States of the game:
        <ul>
            <li>You might think that the state of the game is represented using a game-specific
                data structure.
            </li>
            <li>The cool part: instead, the state of the game is an image (array of pixels) &mdash; the same
                as you see on the screen.
                <ul>
                    <li>This is a generic way of representing the state of these games.</li>
                    <li>It is what makes their DQN applicable to so many different games:
                        all you need are images from the game after each user action.
                    </li>
                    <li>It does mean that the lower layers in the neural network need to be convolutional layers
                        &mdash; to do some image processing.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Mnih et al.: <i>Playing Atari with Deep Reinforcement Learning</i> CoRR, abs/1312.5602, 2013
        (<a href="https://arxiv.org/abs/1312.5602">https://arxiv.org/abs/1312.5602</a>)
        <ul>
            <li>describes seven games, outperforming humans on three of them.</li>
        </ul>
    </li>
    <li>Mnih et al.: <i>Human-level control through deep reinforcement learning</i>,
        Nature volume 518, pages 529–533, 2015
        (<a href="https://www.nature.com/articles/nature14236">https://www.nature.com/articles/nature14236</a>)
        <ul>
            <li>describes 49 games, outperforming humans on half of them.</li>
        </ul>
    </li>
</ul>

<h2>States as images</h2>
<ul>
    <li>If states are images then, in principle, we can find out the positions of all the objects
        (e.g. the bricks and paddle in Breakout).
    </li>
    <li>But we cannot find out their speed or direction of travel.</li>
    <li>So DeepMind chose to represent a single state by four images:
        <ul>
            <li>After an action, the new state is the current screen image and the preceding three.</li>
        </ul>
    </li>
    <li>They also do a little preprocessing to reduce image sizes: 
        <ul>
            <li>convert RGB to grayscale;</li>
            <li>scale down from $210 \times 160$ pixels to $84 \times 84$.</li>
        </ul>
    </li>
    <li>So, if $m$ is the mini-batch size, then the input shape is $(m, 84, 84, 4)$. Why?</li>
</ul>
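<p>The following sketch shows how one state of shape $(84, 84, 4)$ could be assembled from four screens. It is not DeepMind's preprocessing code: the helper names, the equal-weight grayscale conversion and the nearest-neighbour resizing are simplifying assumptions, and the screens are dummy data.</p>

```python
from collections import deque

def to_grayscale(frame):
    """Average the R, G, B channels of every pixel (one simple choice of weights)."""
    return [[sum(pixel) / 3 for pixel in row] for row in frame]

def resize_nearest(image, height, width):
    """Nearest-neighbour downscaling; real pipelines usually interpolate."""
    src_h, src_w = len(image), len(image[0])
    return [[image[r * src_h // height][c * src_w // width] for c in range(width)]
            for r in range(height)]

def stack_frames(frames):
    """Turn 4 images of shape (84, 84) into one state of shape (84, 84, 4)."""
    return [[[f[r][c] for f in frames] for c in range(len(frames[0][0]))]
            for r in range(len(frames[0]))]

def shape(nested):
    """Shape of a regular nested list, e.g. (84, 84, 4)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Four dummy 210x160 RGB screens; pixel values encode the frame number.
screens = [[[(t, t, t) for _ in range(160)] for _ in range(210)] for t in range(4)]
recent = deque(maxlen=4)          # keeps only the four most recent frames
for screen in screens:
    recent.append(resize_nearest(to_grayscale(screen), 84, 84))

state = stack_frames(list(recent))
print(shape(state))               # (84, 84, 4)
print(state[0][0])                # [0.0, 1.0, 2.0, 3.0]
```

<p>Each pixel position holds four grayscale values, one per frame, so the four frames play the role that colour channels play in an ordinary image. A mini-batch of $m$ such states adds the leading dimension $m$.</p>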

<h2>Gymnasium (formerly OpenAI Gym)</h2>
<ul>
    <li>We don't want to program Atari games from scratch ourselves.</li>
    <li><i>Gymnasium</i> (<a href="https://github.com/Farama-Foundation/Gymnasium">https://github.com/Farama-Foundation/Gymnasium</a>, <a href="https://gymnasium.farama.org/">Documentation</a>) provides a range of environments for working on RL, e.g. Atari games, board games, etc.
    </li>
</ul>

<h2>TF-Agents</h2>
<ul>
    <li>TF-Agents (<a href="https://www.tensorflow.org/agents">https://www.tensorflow.org/agents</a>) 
        is a TensorFlow library that implements DQNs and other deep learning versions of RL.</li>
</ul>

<h2>Demo code</h2>
<ul>
    <li>Take a look at notebook 18 <a href="https://github.com/ageron/handson-ml2">here</a>.</li>
    <li>You can download it and run it yourself on your own machine (if you have a GPU) or on Google Colab.</li>
    <li>Or if you click on it, you'll see it contains a button ("Run in Google Colab") that you can click on.</li>
    <li>In Google Colab, choose Runtime &gt; Change runtime type, select a GPU and click Save. Then execute as many cells as 
        you want.
        If you only want to execute the DQN for Breakout, then execute the first cell + the one that contains
        the definition of the <code>plot_animation</code> function + the cells in the section called
        <i>Using TF-Agents to Beat Breakout</i>.
    </li>
    <li>Here are animated GIFs of the agent that I trained with <code>max_length=100000</code> and 
        <code>n_iterations=10000-100000</code>:
        <figure style="text-align: center;">
               <figcaption>After 40000 iterations:</figcaption>
               <img src="images/breakout3.gif" />
        </figure>
         <figure style="text-align: center;">
               <figcaption>After 100000 iterations:</figcaption>
               <img src="images/breakout9.gif" />
        </figure>
    </li>
</ul>

<h1>Concluding Remarks</h1>
<ul>
    <li>The lecture gives a flavour of using deep learning for RL.</li>
    <li>There are many more complex variants (e.g. Dueling Deep $Q$-Networks and Double Deep $Q$-Networks) 
        including ones that lift the assumption of a finite set of actions (e.g. Actor-Critic models).
    </li>
    <li>Let's also mention AlphaGo, AlphaGo Zero and AlphaZero:
        <ul>
            <li>March 2016, AlphaGo beat 18-time world champion Lee Sedol 4-1.</li>
            <li>October 2017, AlphaGo Zero (trained on self-play only) beat AlphaGo 100-0.
                <figure>
                    <img src="images/alphago.png" />
                </figure>
            </li>
            <li>December 2017, AlphaZero beat Stockfish at chess and Elmo at shogi (Japanese chess).</li>
        </ul>
        In principle, we can train AlphaZero to play any perfect information game from self-play only. However,
        AlphaZero is trained separately &mdash;from scratch&mdash; on each game.
    </li>
    <li>Even more interestingly, DeepMind has been working on a system that can be trained to play many different
        games. The games are things like Hide &amp; Seek, Tag, Capture the Flag and so on, played in a simulator 
        that allows the creation of environments consisting of players and blocks. The agent uses Deep RL, with a
        reward function that is normalized to reward performance across the wide range of training tasks. It engages
        in a process of continuous learning. The developers claim that the agent learns "interesting emergent
        heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, 
        and co-operation": <a href="https://deepmind.com/blog/article/generally-capable-agents-emerge-from-open-ended-play">Blog post (with link to videos)</a>; <a href="https://arxiv.org/pdf/2107.12808.pdf">Longer manuscript</a>.
    </li>
    <li>In the same vein, Google has shown that a single transformer-based reinforcement learning model, with a single set of weights, can be trained to play up to 46 Atari games simultaneously at close-to-human performance: <a href="https://blog.research.google/2022/07/training-generalist-agents-with-multi.html">Blog post</a>; <a href="https://arxiv.org/abs/2205.15241">Longer manuscript</a>.</li>
    <li>With similar goals but a different approach, and starting from the observation that humans can learn quickly across many tasks, MIT has been working on equipping RL with knowledge of how humans learn and act: how they explore the world, how they model its causal structure, and how they use these models to plan actions that achieve their goals. They illustrate with a video-game-playing agent called EMPA (the Exploring, Modeling, and Planning Agent), tested on about 90 Atari-like games:
        <a href="https://arxiv.org/pdf/2107.12544.pdf">Manuscript</a>.
    </li>
</ul>