# MLMI7: Reinforcement Learning and Decision Making

Lecturers: Dr. José Miguel Hernández-Lobato & Prof. Carl Edward Rasmussen

----

# Table of Contents
>## 1. Introduction
* 1.1. What is Reinforcement Learning?
* 1.2. Markov Property
* 1.3. Bellman Equation

# 1. Introduction

## 1.1. What is Reinforcement Learning?

* **Agent-Environment Interface**

><img src='images/image01.png' width=400>

>* **Goal: (1)** For a given state, **(2)** take actions to **(3)** maximise the cumulative numerical reward

* **Tasks & Rewards**

>* **Episodic tasks:** interaction terminates after finite no. of steps

>$$R_t = r_{t+1} + \cdots + r_T$$

>* **Continuing tasks:** interaction has no limit (e.g. trading, driving)
>  * $\gamma$: discount factor / $\gamma=0$: the agent is **myopic** / $\gamma \rightarrow 1$: the agent is **farsighted**

>$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$$
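
Both return definitions above can be computed with one backward pass over a reward sequence. The following is a minimal sketch; the function name `discounted_return` and the reward list are illustrative assumptions, not part of the course material. Setting `gamma=1.0` gives the episodic (undiscounted) return.

```python
def discounted_return(rewards, gamma=1.0):
    """Return R_t = r_{t+1} + gamma * r_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Accumulate backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(rewards, gamma=1.0))  # 3.0 (episodic, undiscounted)
print(discounted_return(rewards, gamma=0.5))  # 1.5
print(discounted_return(rewards, gamma=0.0))  # 1.0 (myopic: only r_{t+1} counts)
```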


* **Main Elements in RL**

>|Element|Explanation|
|-|-|
|**Policy**|defines behaviour of the agent, $\pi_t (a \mid s) = p(a_t=a \mid s_t=s)$|
|**Reward**|defines the goal of the agent (to maximise $E[$cumulative reward$]$)|
|**Value Function**|defines what is good in the long run (prediction of future reward)|
|**Model**|explains how the environment behaves (e.g. state transition prob.)|

* **Exploration and exploitation dilemma**

>* The agent must simultaneously **exploit** current knowledge and **explore** new actions
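
One common way to trade off the two is an $\epsilon$-greedy rule: explore with a small probability, exploit otherwise. The notes above do not name a specific strategy, so this is an illustrative assumption, and the function name and the action-value estimates are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with prob. epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: action with the highest current estimate (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.7, 0.3]     # hypothetical action-value estimates
rng = random.Random(0)  # seeded only to make the demo repeatable
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly the greedy action, but not always
```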

## 1.2. Markov Property

* **Markov Property** of the State

>$$p(s_{t+1}, r_{t+1} | s_{0:t}, a_{0:t}, r_{1:t}) = p(s_{t+1},r_{t+1}|s_t,a_t)$$

>* The best policy for choosing actions as a function of a Markov state is just as good as the best policy for choosing actions as a function of complete history

* **Markov Decision Process**

><img src='images/image02.png' width=200>

>* **State-transition Probability**

>$$p(s_{t+1}|s_t,a_t) = \sum_{r_{t+1} \in \mathbb{R}} p(s_{t+1},r_{t+1}|s_t,a_t)$$

>* **Expected Reward**

>$$E[r_{t+1}|s_{t+1},s_t,a_t] = \frac{\sum_{r_{t+1} \in \mathbb{R}} r_{t+1} \, p(s_{t+1}, r_{t+1} | s_t,a_t)}{p(s_{t+1}|s_t,a_t)}$$
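
Both quantities are marginals or conditionals of the joint $p(s_{t+1}, r_{t+1} | s_t, a_t)$. As a small sketch, assume a made-up joint distribution for a single fixed $(s_t, a_t)$ pair, stored as a dictionary; the state names, rewards and probabilities are hypothetical.

```python
# Hypothetical joint p(s', r | s, a) for one fixed (s, a); keys are (s', r).
joint = {("s1", 0.0): 0.2, ("s1", 1.0): 0.3, ("s2", 5.0): 0.5}


def transition_prob(joint, s_next):
    """p(s' | s, a): sum the joint over all rewards."""
    return sum(p for (s2, _r), p in joint.items() if s2 == s_next)


def expected_reward(joint, s_next):
    """E[r | s', s, a]: reward-weighted joint divided by p(s' | s, a)."""
    numerator = sum(r * p for (s2, r), p in joint.items() if s2 == s_next)
    return numerator / transition_prob(joint, s_next)


for s_next in ("s1", "s2"):
    print(s_next, transition_prob(joint, s_next), expected_reward(joint, s_next))
```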

* **Value Functions**

>* **State-value** fn. for policy $\pi$: $V_\pi (s) = \mathbb{E}_\pi [R_t | s_t=s]$ (expected return, not the immediate reward)
>* **Action-value** fn. for policy $\pi$: $Q_\pi (s,a) = \mathbb{E}_\pi [R_t | s_t=s,a_t=a]$

## 1.3. Bellman Equation

* **Value Functions - Recursive Relations**

>\begin{align}
V_{\pi}(s) &= \mathbb{E}_\pi [R_t |s_t=s] \\
&= \sum_a \pi(a|s) \sum_{s'} \sum_r p(s',r|s,a)(r + \gamma V_{\pi}(s')) \\
\\
Q_{\pi}(s,a) &= \mathbb{E}_\pi [R_t |s_t=s,a_t=a] \\
&= \bar{r}(s,a) + \sum_{s'} p(s'|s,a) \gamma \sum_{a'} \pi(a'|s') Q_\pi(s',a')
\end{align}
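
>* The recursion comes from splitting the return as $R_t = r_{t+1} + \gamma R_{t+1}$ and applying the Markov property. A short sketch of the intermediate steps, using only the definitions above:

>\begin{align}
V_{\pi}(s) &= \mathbb{E}_\pi [r_{t+1} + \gamma R_{t+1} | s_t=s] \\
&= \mathbb{E}_\pi [r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t=s] \\
&= \sum_a \pi(a|s) Q_\pi(s,a)
\end{align}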

* **Bellman Optimality Equations**

>\begin{align}
V_*(s) &= \underset{\pi}{\max} V_\pi (s) \\
&= \underset{a}{\max} \sum_{s',r} p(s',r|s,a) \left[ r+\gamma V_*(s') \right] \\
\\
Q_*(s,a) &= \mathbb{E} \left[ r_{t+1} + \gamma \underset{a' \in \mathcal{A}}{\max} Q_* (s_{t+1},a') \bigg| s_t=s, a_t=a \right] \\
&= \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \underset{a'}{\max} Q_*(s',a') \right]
\end{align}

>* **Optimal policy:** any policy that is greedy wrt the optimal value fn. or optimal Q-fn.

* **Optimality and Approximation**

>* **Tabular case:** value fn. can be stored exactly as a table, one entry per state (or state-action pair)
>* **Non-tabular case:** too many states for a table, so the value fn. must be approximated with a parameterised function

# 2. Dynamic Programming

* **DP algorithms** can solve an MDP RL task given the model of the environment (state-transition prob. and reward fn.)

* **Limitations**

>* Tasks with continuous states and actions $\rightarrow$ quantize and solve **finite-state MDP**
>* Involves operations over the entire set, which can be very large $\rightarrow$ **asynchronous algorithms**
>* **Curse of dimensionality** $\rightarrow$ use **model-free RL** (next chapter)

## 2.1. Policy Iteration

><img src='images/image03.png' width=400>

>* **Eq 1: Policy Evaluation**
>  * **Stop when:** value fn. stops changing, $\max_s |V_{k+1}(s) - V_k(s)| < \theta$

>$$V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left( r+\gamma V_k(s') \right)$$

>* **Eq 2: Policy Improvement** (greedy)
>  * **Stop when:** the new policy is no better than the old policy (the policy is stable, $\pi' = \pi$)

>$$\pi'(s) = \underset{a}{\text{argmax}} \sum_{s',r} p(s',r|s,a) \left( r+\gamma V_\pi(s') \right)$$

>* The two processes **compete** in the short term and **cooperate** in the long term
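
The evaluation and improvement steps can be sketched on a tiny MDP. Everything concrete here is an assumption made for illustration: the 2-state model `P`, the constants `GAMMA` and `THETA`, the function names, and the restriction to deterministic policies (so $\pi(a|s)$ puts all mass on one action). The algorithm in the figure may differ in detail.

```python
# Hypothetical MDP: P[s][a] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA, THETA = 0.9, 1e-8


def q_value(P, V, s, a, gamma):
    """One-step lookahead: sum over (s', r) of p * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, gamma, theta):
    """Eq 1: sweep until the largest change in V falls below theta."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: q_value(P, V, s, policy[s], gamma) for s in P}
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < theta:
            return V


def policy_iteration(P, gamma=GAMMA, theta=THETA):
    policy = {s: 0 for s in P}  # arbitrary deterministic starting policy
    while True:
        V = policy_evaluation(P, policy, gamma, theta)
        # Eq 2: act greedily with respect to the evaluated V.
        new_policy = {
            s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P
        }
        if new_policy == policy:  # stable policy: stop
            return policy, V
        policy = new_policy


policy, V = policy_iteration(P)
print(policy, V)
```

In this toy model, action 1 is greedy in both states, so the loop should stop after two evaluation rounds.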

## 2.2. Value Iteration

><img src='images/image04.png' width=400>

>* **Eq 3: Value Iteration** (combines policy evaluation & improvement)
>  * **Stop when:** value fn. stops changing, $\max_s |V_{k+1}(s) - V_k(s)| < \theta$

>$$V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a) \left( r+\gamma V_k(s') \right)$$
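
Eq 3 can be sketched in a few lines. The 2-state model `P`, the default `gamma` and `theta`, and the function name are hypothetical choices for illustration; the greedy policy is read off once at the end rather than maintained during the sweeps.

```python
# Hypothetical MDP: P[s][a] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def backup(P, V, s, a, gamma):
    """Sum over (s', r) of p * (r + gamma * V(s')) for one action."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        # Eq 3: replace the policy average by a max over actions.
        V_new = {s: max(backup(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < theta:
            break
    # Extract a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P)
print(V, policy)
```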


# 3. Model-free Reinforcement Learning

* Learn from **experience**, which is obtained from **interaction** with a **simulated** or **real** environment

## 3.1. Monte Carlo Prediction ($V$)

><img src='images/image05.png' width=400>

>* Average of sampled returns: unbiased estimate of $V_\pi(s)$, whose standard error $\sigma$ falls as $\frac{1}{\sqrt{n}}$ with $n$ returns
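
A minimal sketch of the averaging step, assuming the first-visit variant (the figure may use every-visit instead). The episode format, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, and the two hand-written episodes are hypothetical.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so g is the return from each step onwards; earlier
        # visits overwrite later ones, leaving the first visit's return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_return[state] = g
        for state, g_first in first_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],  # hypothetical episode 1
    [("B", 1.0), ("A", 1.0)],              # hypothetical episode 2
]
print(first_visit_mc(episodes))  # {'A': 2.0, 'B': 2.0}
```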

## 3.2. On-policy Monte Carlo Control ($Q$)

><img src='images/image06.png' width=400>

>* Evaluate or improve the policy that is used to make decisions

## 3.3. Off-policy Monte Carlo Control

><img src='images/image07.png' width=400>

>* **Target policy:** greedy policy wrt $Q$
>* **Behaviour policy:** soft policy that generates behaviour
>  * **Soft policy:** non-zero prob. for all actions (coverage)
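
One way to combine the two policies is weighted importance sampling, sketched below. This is an assumption about the method: the figure may use a different estimator. The episode format `(state, action, reward)`, the uniform behaviour policy, and all names are hypothetical.

```python
from collections import defaultdict


def off_policy_mc_control(episodes, actions, behaviour_prob, gamma=1.0):
    """Learn Q and a greedy target policy from behaviour-policy episodes."""
    Q = defaultdict(float)  # Q[(state, action)]
    C = defaultdict(float)  # cumulative importance weights per (state, action)
    target = {}             # greedy target policy: state -> action
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            C[(state, action)] += W
            # Weighted-average update of Q towards the observed return G.
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            target[state] = max(actions, key=lambda a: Q[(state, a)])
            if action != target[state]:
                break  # the greedy target would never take this action
            W /= behaviour_prob(action, state)  # target prob. is 1 when greedy
    return dict(Q), target


def uniform(action, state):
    """Hypothetical soft behaviour policy: both actions equally likely."""
    return 0.5


episodes = [
    [("x", 0, 0.0), ("y", 1, 4.0)],
    [("x", 1, 1.0)],
]
Q, target = off_policy_mc_control(episodes, actions=[0, 1], behaviour_prob=uniform)
print(target)  # {'y': 1, 'x': 0}
```

The soft behaviour policy matters here: if it gave zero probability to an action, the division by `behaviour_prob` would be undefined and that action could never be evaluated.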