<h1>CS4619: Artificial Intelligence II</h1>
<h1>Reinforcement Learning</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<h1>Reinforcement learning</h1>
<ul>
    <li>The agent carries out an action.</li>
    <li>A teacher or the environment provides a <b>reward</b> (or punishment), often delayed, 
        that acts as positive (or negative) reinforcement &hellip;
        <ul>
            <li>&hellip; making it more (or less) likely that the agent will execute that action if it finds itself in the same
                or a similar situation in the future.
            </li>
        </ul>
    </li>
    <li>For simplicity, in this lecture, we assume a fully-observable, deterministic environment.</li>
</ul>

<h1>Reward</h1>
<figure>
    <img src="images/reward.png" />
</figure>
<p style="font-size: 2em">
    $$\v{s}_0 \xrightarrow[\,\,\,\,\,\,\,\,r_0]{a_0\,\,\,\,\,\,\,\,} \v{s}_1 \xrightarrow[\,\,\,\,\,\,\,\,r_1]{a_1\,\,\,\,\,\,\,\,} \v{s}_2 \xrightarrow[\,\,\,\,\,\,\,\,r_2]{a_2\,\,\,\,\,\,\,\,} \cdots$$
</p>

<h1>Cumulative reward</h1>
<ul>
    <li>Cumulative reward: 
        $$r_0 + \gamma r_1 + \gamma^2r_2 + \cdots$$    
        or
        $$\sum_{t=0}^{\infty}\gamma^tr_t$$
        where $\gamma$ is the <b>discount rate</b> ($0 \leq \gamma \leq 1$).
    </li>
    <li>Why do we discount?
        <ul>
            <li>E.g. animal behaviour, including human behaviour, shows a preference for immediate reward.</li>
            <li>E.g. it is mathematically convenient (to overcome infinite cyclic processes; to give a preference for short solutions).</li>
        </ul>
    </li>
    <li>The task of the agent is to learn an action function that maximises cumulative reward.</li>
</ul>
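<p>
    The cumulative reward formula can be sketched in Python for a finite sequence of rewards.
    The function name <code>discounted_return</code> and the example rewards are illustrative assumptions,
    not part of the lecture material.
</p>

```python
def discounted_return(rewards, gamma):
    """Sum gamma**t * r_t over a finite list of rewards r_0, r_1, ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1, 1, 1], 0.5))  # 1.75
```

<p>
    With a smaller $\gamma$, later rewards contribute less to the total, which is the preference for immediate reward described above.
</p>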

<h1>Action-value function</h1>
<ul>
    <li>Assume 2 Boolean sensors and 3 actions.</li>
    <li>Compare
        <div style="display: flex; justify-content: space-between;">
            <table>
                <tr><th style="border: 1px solid black;">Percept</th><th style="border: 1px solid black;">Action</th></tr>
                <tr><td style="border: 1px solid black;">00</td><td style="border: 1px solid black;">MOVE</td></tr>
                <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td></tr>
                <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">MOVE</td></tr>
                <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td></tr>
            </table>
            <table>
                <tr><th style="border: 1px solid black;">Percept</th><th style="border: 1px solid black;">Action</th><th style="border: 1px solid black; width: 3em;">$Q$</th></tr>
                <tr><td style="border: 1px solid black;">00</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">00</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">00</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
                <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">&hellip;</td></tr>
            </table>
        </div>
    </li>
    <li>Class exercise: Suppose the agent has $m$ touch sensors (returning 0 or 1) and $n$ different actions. 
        How many rows will the table contain?
    </li>
</ul>
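<p>
    One possible way to hold the right-hand table in Python is a dictionary keyed by (percept, action) pairs.
    This is only a sketch for the 2-sensor, 3-action example above; the variable names and the choice of a
    dictionary, initialised to zero, are assumptions.
</p>

```python
# The four percepts from two Boolean sensors and the three actions in the example.
percepts = ["00", "01", "10", "11"]
actions = ["MOVE", "TURN(RIGHT, 2)", "TURN(LEFT, 2)"]

# One Q-value per (percept, action) pair, all starting at zero.
Q = {(s, a): 0.0 for s in percepts for a in actions}

print(len(Q))  # 12 rows, as in the table above
```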

<h1>What is $Q$?</h1>
<ul>
    <li>$Q(\v{s}, a)$ is an <em>estimate</em> of the cumulative reward the agent will receive if, 
        having sensed $\v{s}$, it chooses to execute action $a$.
    </li>
    <li>Hence, having sensed $\v{s}$, choose action $a$ for which $Q(\v{s}, a)$ is highest:
        $$\argmax_a Q(\v{s},a)$$
    </li>
</ul>
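<p>
    Choosing the action with the highest $Q$-value might look like the sketch below.
    The $Q$-values are made-up numbers for percept 00, and <code>greedy_action</code> is a hypothetical helper name.
</p>

```python
# Hypothetical Q-values for percept "00" (illustrative numbers only).
Q = {
    ("00", "MOVE"): 0.3,
    ("00", "TURN(RIGHT, 2)"): 0.5,
    ("00", "TURN(LEFT, 2)"): 0.1,
}
ACTIONS = ["MOVE", "TURN(RIGHT, 2)", "TURN(LEFT, 2)"]

def greedy_action(Q, s, actions):
    """Return the action a that maximises Q[(s, a)]; ties go to the first such action."""
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_action(Q, "00", ACTIONS))  # TURN(RIGHT, 2)
```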

<h1>Class exercise</h1>
<ul>
    <li>Given this table
        <table>
            <tr><th style="border: 1px solid black;">Percept</th><th style="border: 1px solid black;">Action</th><th style="border: 1px solid black;">$Q$</th></tr>
            <tr><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td></tr>
            <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">0.2</td></tr>
            <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">0.1</td></tr>
            <tr><td style="border: 1px solid black;">01</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">0.7</td></tr>
            <tr><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td></tr>
        </table>   
    </li>
    <li>Suppose $\v{s}$ is 01</li>
    <li>What is $\argmax_a Q(\v{s},a)$?</li>
</ul>

<h1>$Q$-learning</h1>
<ul>
    <li>Start with random $Q$-values (or all zero).</li>
    <li>Improve by trial-and-error: choose actions, get rewards, update $Q$-values.</li>
</ul>
<figure style="border: 1px solid black; background-color: #D0D0D0">
    <figcaption style="border-bottom: 1px solid black">
        QLearning($\epsilon$)
    </figcaption>
    <ul>
        <li>$\v{s} = \mathit{SENSE}()$;</li>
        <li>do forever
            <ul>
                <li>$\mathit{rand} = $ a randomly-generated number in $[0,1)$;</li>
                <li>if $\mathit{rand} < \epsilon$
                    <ul>
                        <li>Choose action $a$ randomly;</li>
                    </ul>
                    else
                    <ul>
                        <li>$a = \argmax_a Q(\v{s}, a)$;</li>
                    </ul>
                </li>
                <li>$r = \mathit{EXECUTE}(a)$;</li>
                <li>$\v{s}' = \mathit{SENSE}()$;</li>
                <li>$Q(\v{s}, a) = r + \gamma \times \max_{a'} Q(\v{s}', a')$;</li>
                <li>$\v{s} = \v{s}'$;</li>
            </ul>
        </li>
    </ul>
</figure>
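<p>
    The pseudocode above can be sketched in Python on a toy problem. Everything about the environment is an
    assumption made for illustration: a corridor of cells 0 to 3, two actions, a reward of 1 for stepping into
    cell 3, and a restart at cell 0 after any action taken in cell 3. The names <code>execute</code> and
    <code>q_learning</code>, the fixed number of steps (in place of "do forever"), the seed, and the values of
    $\gamma$ and $\epsilon$ are likewise hypothetical.
</p>

```python
import random

ACTIONS = ["LEFT", "RIGHT"]
GOAL = 3  # hypothetical corridor with cells 0..3

def execute(state, action):
    """Deterministic toy environment: return (reward, next_state)."""
    if state == GOAL:                      # any action at the goal restarts the agent
        return 0.0, 0
    nxt = max(state - 1, 0) if action == "LEFT" else state + 1
    return (1.0 if nxt == GOAL else 0.0), nxt

def q_learning(epsilon, gamma=0.9, steps=20000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}  # start all zero
    s = 0                                                        # s = SENSE()
    for _ in range(steps):                                       # stands in for "do forever"
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)                              # explore
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])            # exploit
        r, s_next = execute(s, a)                                # EXECUTE, then SENSE
        Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        s = s_next
    return Q

Q = q_learning(epsilon=0.5)
policy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)}
print(policy)
```

<p>
    In this toy setting, one would expect the learned $Q$-values to favour RIGHT in cells 0 to 2, since that is the shortest route to the reward.
</p>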

<h1>Exploration vs. Exploitation</h1>
<ul>
    <li>Exploration:
        <ul>
            <li>Choose an action which may not be the best action according to the current $Q$-values.
                But it may gain you new experience and improve the $Q$-values.
            </li>
        </ul>
    </li>
    <li>Exploitation:
        <ul>
            <li>Choose the action which is best according to the current $Q$-values.
                It may gain you reward.
            </li>
        </ul>
    </li>
    <li>The so-called <b>$\epsilon$-greedy policy</b> (where $0 \leq \epsilon \leq 1$) is the simplest way to balance exploration and exploitation.</li>
</ul>
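<p>
    As a small worked consequence, assume the random choice is uniform over all $n$ actions (so it can also land on
    the greedy action) and that exactly one action has the highest $Q$-value. Under those assumptions, the
    probability that the $\epsilon$-greedy policy selects the greedy action in a given step is
    $$(1 - \epsilon) + \frac{\epsilon}{n}$$
    For example, with $\epsilon = 0.1$ and $n = 3$, this is $0.9 + 0.1/3 \approx 0.933$. At the extremes,
    $\epsilon = 0$ never explores and $\epsilon = 1$ always chooses at random.
</p>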

<h1>Updating $Q$-values</h1>
<ul>
    <li>From the algorithm:
        $$Q(\v{s}, a) = r + \gamma \times \max_{a'}Q(\v{s}', a')$$
    </li>
    <li>The new value is the reward $r$ for the latest action plus $\gamma$ times the highest current estimate of the
        cumulative reward the agent can receive from the new state $\v{s}'$.
    </li>
    <li>Over the course of repeated actions, the $Q$-values will get better and better:
        <ul>
            <li>When one $Q$-value improves then the $Q$-values of its immediate predecessors will also improve
                next time they get updated.
            </li>
        </ul>
    </li>
</ul>
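<p>
    The way improvements propagate back to predecessors can be illustrated with a deliberately simplified sketch:
    a chain of three states with a single action each (so the $\max$ is over one value), a reward of 1 only for
    the last step, and $\gamma = 0.5$. All of these numbers are assumptions chosen to keep the arithmetic easy.
</p>

```python
gamma = 0.5
rewards = [0.0, 0.0, 1.0]   # reward for the single action taken in states 0, 1, 2
Q = [0.0, 0.0, 0.0, 0.0]    # Q[3] is the end state, whose value stays 0

history = []
for sweep in range(3):
    for s in range(3):      # walk the chain from start to end
        Q[s] = rewards[s] + gamma * Q[s + 1]
    history.append(Q[:3])
    print(sweep + 1, Q[:3])
# 1 [0.0, 0.0, 1.0]
# 2 [0.0, 0.5, 1.0]
# 3 [0.25, 0.5, 1.0]
```

<p>
    Each pass along the chain moves the information about the final reward one state further back.
</p>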

<h1>Class exercise</h1>
<ul>
    <li>Given this table
        <table>
            <tr><th style="border: 1px solid black;">Percept</th><th style="border: 1px solid black;">Action</th><th style="border: 1px solid black; width: 3em;">$Q$</th></tr>
            <tr><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td><td style="border: 1px solid black;">$\vdots$</td></tr>
            <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">5</td></tr>
            <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">4</td></tr>
            <tr><td style="border: 1px solid black;">10</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">1</td></tr>
            <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">MOVE</td><td style="border: 1px solid black;">0</td></tr>
            <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">TURN(RIGHT, 2)</td><td style="border: 1px solid black;">4</td></tr>
            <tr><td style="border: 1px solid black;">11</td><td style="border: 1px solid black;">TURN(LEFT, 2)</td><td style="border: 1px solid black;">6</td></tr>
        </table>   
    </li>
    <li>Suppose the current percept is 10.<br />
        Assuming exploitation, which action will be chosen?</li>
    <li>Suppose the reward is 3, the next percept is 11 and $\gamma$ is 1.<br />
        Using $Q(\v{s}, a) = r + \gamma \times \max_{a'} Q(\v{s}', a')$, update the table.
    </li>
</ul>

<h1>Types of Learning in AI</h1>
<ul>
    <li><b>Reinforcement learning</b>:
        <ul>
            <li>The agent receives rewards (or punishments) after executing actions.</li>
            <li>The rewards (or punishments) act as positive (or negative) reinforcement.</li>
            <li>The agent learns a policy that defines which actions to perform in which situations to
                maximize reward over time.
            </li>
        </ul>
    </li>
    <li><b>Unsupervised learning</b>:
        <ul>
            <li>The agent learns from an unlabeled dataset.</li>
            <li>The goal is to find structure within the dataset.</li>
            <li>Clustering and most forms of dimensionality reduction are examples of unsupervised
                learning. Other examples (not covered) include anomaly detection and
                association rule mining.
            </li>
        </ul>
    </li>
    <li><b>Supervised learning</b>:
        <ul>
            <li>The agent learns from a labeled dataset.</li>
            <li>The goal is to generalise from the labeled dataset to learn how to predict 
                target values/class labels when given feature values.
            </li>
            <li>Learning models for regression and classification are examples of supervised learning.</li>
        </ul>
    </li>
    <li><b>Semisupervised learning</b>:
        <ul>
            <li>The agent learns from a dataset, only a (small) subset of which is labeled.</li>
            <li>The goal is usually the same as in supervised learning but making use of the
                unlabeled data to compensate for the low volume of labeled data.
            </li>
        </ul>
    </li>
</ul>

<h1>Concluding remarks</h1>
<ul>
    <li>Reinforcement learning underpins a lot of current success in game playing.</li>
    <li>It is also seeing real use in other areas, e.g. robot motion control and recommender systems.</li>
    <li>For real use, you need more sophisticated algorithms:
        <ul>
            <li>To handle non-deterministic environments;</li>
            <li>To improve convergence;</li>
            <li>To build a model of the environment;</li>
            <li>To scale up; and</li>
            <li>To represent the policy in a way that allows the agent to generalise from what it learns.</li>
        </ul>
    </li>
    <li>We'll look at the last of these in the next lecture.</li>
</ul>