# MLMI7: Reinforcement Learning and Decision Making

Lecturers: Dr. José Miguel Hernández-Lobato and Prof. Carl Edward Rasmussen

----

# Table of Contents
>## 1. Introduction
* 1.1. What is Reinforcement Learning?
* 1.2. Markov Property
* 1.3. Bellman Equation

# 1. Introduction

## 1.1. What is Reinforcement Learning?

* **Agent-Environment Interface**

><img src='images/image01.png' width=400>

>* **Goal:** **(1)** given the current state, **(2)** take actions that **(3)** maximise the cumulative numerical reward

* **Tasks & Rewards**

>* **Episodic tasks:** interaction terminates after finite no. of steps

>$$R_t = r_{t+1} + \cdots + r_T$$

>* **Continuing tasks:** interaction has no limit (e.g. trading, driving)
>  * $\gamma \in [0, 1)$: discount factor / $\gamma=0$: the agent is **myopic** (only the immediate reward counts) / $\gamma \rightarrow 1$: the agent is **farsighted**

>$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$$
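
The two return definitions above can be checked numerically. The following is a minimal sketch, assuming a finite list of rewards; the function name `discounted_return` and the reward values are hypothetical, chosen only for illustration. Setting `gamma=1.0` recovers the undiscounted episodic return.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

    `rewards` is the finite list [r_{t+1}, r_{t+2}, ...].
    """
    ret = 0.0
    # Accumulate backwards using the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [1.0, 0.0, 2.0]  # hypothetical r_{t+1}, r_{t+2}, r_{t+3}

print(discounted_return(rewards, gamma=1.0))  # episodic (undiscounted): 3.0
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_return(rewards, gamma=0.0))  # myopic: only r_{t+1} = 1.0
```

The backward loop uses the same recursion that later gives rise to the Bellman equation.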


* **Main Elements in RL**

>|Element|Explanation|
|-|-|
|**Policy**|defines the behaviour of the agent, $\pi_t (a \mid s) = p(a_t=a \mid s_t=s)$|
|**Reward**|defines the goal of the agent (to maximise $E[$cumulative reward$]$)|
|**Value Function**|defines what is good in the long run (prediction of future reward)|
|**Model**|explains how the environment behaves (e.g. state transition prob.)|

* **Exploration and exploitation dilemma**

>* The agent must simultaneously **exploit** current knowledge and **explore** new actions
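
One common heuristic for trading off the two is epsilon-greedy action selection. It is not described in the notes above and is shown here only as an illustrative sketch: with probability `epsilon` the agent picks a random action (explore), otherwise it picks the action with the highest current value estimate (exploit). The names `epsilon_greedy` and `q_values` are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)        # seeded generator for reproducibility
q_values = [0.2, 1.5, -0.3]   # hypothetical value estimates for 3 actions

print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # always greedy: 1
print(epsilon_greedy(q_values, epsilon=0.1, rng=rng))  # mostly greedy
```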

## 1.2. Markov Property

* **Markov Property** of the State

>$$p(s_{t+1}, r_{t+1} | s_{0:t}, a_{0:t}, r_{1:t}) = p(s_{t+1},r_{t+1}|s_t,a_t)$$

>* The best policy for choosing actions as a function of a Markov state is just as good as the best policy for choosing actions as a function of complete history

* **Markov Decision Process**

><img src='images/image02.png' width=200>

>* **State-transition Probability**

>$$p(s_{t+1}|s_t,a_t) = \sum_{r \in \mathbb{R}} p(s_{t+1},r_{t+1}=r|s_t,a_t)$$

>* **Expected Reward**

>$$E[r_{t+1}|s_{t+1},s_t,a_t] = \frac{\sum_{r \in \mathbb{R}} r \, p(s_{t+1}, r_{t+1}=r | s_t,a_t)}{p(s_{t+1}|s_t,a_t)}$$
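
Both quantities follow from the joint distribution over next state and reward by marginalising over the reward. Below is a small sketch for a single fixed state-action pair, assuming a finite set of reward values; the table `joint` and the state names `"s1"`, `"s2"` are made up for illustration.

```python
# Hypothetical joint p(s', r | s, a) for one fixed (s, a),
# keyed by (next_state, reward). The probabilities sum to 1.
joint = {
    ("s1", 0.0): 0.1,
    ("s1", 1.0): 0.3,
    ("s2", 0.0): 0.4,
    ("s2", 1.0): 0.2,
}


def transition_prob(joint, next_state):
    """p(s' | s, a): sum the joint over all reward values."""
    return sum(p for (s2, _), p in joint.items() if s2 == next_state)


def expected_reward(joint, next_state):
    """E[r | s', s, a]: reward-weighted sum divided by p(s' | s, a)."""
    weighted = sum(r * p for (s2, r), p in joint.items() if s2 == next_state)
    return weighted / transition_prob(joint, next_state)


print(transition_prob(joint, "s1"))   # 0.1 + 0.3
print(expected_reward(joint, "s1"))   # (0*0.1 + 1*0.3) / 0.4
```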

* **Value Functions**

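>* **State-value** fn. for policy $\pi$: $V_\pi (s) = \mathbb{E}_\pi [R_t | s_t=s]$ (expected return $R_t$ when starting in $s$ and following $\pi$)
>* **Action-value** fn. for policy $\pi$: $Q_\pi (s,a) = \mathbb{E}_\pi [R_t | s_t=s,a_t=a]$ (expected return $R_t$ when taking $a$ in $s$ and following $\pi$ afterwards)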

## 1.3. Bellman Equation

* **Value Function has Recursive Relation**

>$$V_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} \sum_r p(s',r|s,a)(r + \gamma V_{\pi}(s'))$$
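
One standard way to use this recursion is iterative policy evaluation: apply the right-hand side as an update rule until the values stop changing. The notes above do not describe this procedure, so treat the following as a sketch under stated assumptions. The two-state MDP, the uniform random policy, and all names (`dynamics`, `policy_evaluation`) are hypothetical.

```python
def policy_evaluation(states, actions, dynamics, policy, gamma, tol=1e-10):
    """Repeatedly apply the Bellman equation for V_pi until convergence.

    dynamics[(s, a)] is a list of (prob, next_state, reward) triples,
    i.e. the non-zero entries of p(s', r | s, a).
    policy[s][a] is pi(a | s).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                policy[s][a]
                * sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v  # in-place update
        if delta < tol:
            return V


# Hypothetical deterministic two-state MDP.
states = ["A", "B"]
actions = ["stay", "move"]
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "move"): [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
policy = {s: {"stay": 0.5, "move": 0.5} for s in states}  # uniform random

V = policy_evaluation(states, actions, dynamics, policy, gamma=0.9)
print(V)  # approximately {'A': 7.25, 'B': 7.75}
```

For this small example the fixed point can also be found by solving the two linear equations by hand, which gives $V_\pi(A) = 7.25$ and $V_\pi(B) = 7.75$.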

* **Optimal Value Functions**

>\begin{align}
V_*(s) &= \underset{\pi}{\max} V_\pi (s) \\
Q_*(s,a) &= \underset{\pi}{\max} Q_\pi (s,a) \\
&= \mathbb{E} [r_{t+1} + \gamma V_* (s_{t+1}) | s_t=s,a_t=a]
\end{align}

* **Bellman Optimality Equation**

>\begin{align}
V_*(s) &= \underset{a}{\max} \sum_{s',r} p(s',r|s,a)[r+\gamma V_*(s')] \\
Q_*(s,a) &= \sum_{s',r} p(s',r|s,a) [r + \gamma \underset{a'}{\max} Q_* (s',a')]
\end{align}
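
When the dynamics $p(s',r|s,a)$ are known, the optimality equation for $V_*$ can be turned into an update rule, commonly called value iteration. This algorithm is not covered in the notes above; the sketch below only illustrates how the $\max$ over actions enters. The two-state MDP and all names are hypothetical.

```python
def q_from_v(V, s, a, dynamics, gamma):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])


def value_iteration(states, actions, dynamics, gamma, tol=1e-10):
    """Apply the Bellman optimality equation as an update until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(q_from_v(V, s, a, dynamics, gamma) for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # A greedy policy with respect to the converged values.
    greedy = {
        s: max(actions, key=lambda a: q_from_v(V, s, a, dynamics, gamma))
        for s in states
    }
    return V, greedy


# Hypothetical deterministic two-state MDP:
# dynamics[(s, a)] lists (prob, next_state, reward) triples.
states = ["A", "B"]
actions = ["stay", "move"]
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "move"): [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}

V_star, greedy = value_iteration(states, actions, dynamics, gamma=0.9)
print(V_star)  # approximately {'A': 19.0, 'B': 20.0}
print(greedy)  # {'A': 'move', 'B': 'stay'}
```

In this example staying in `B` earns reward 2 forever, so $V_*(B) = 2 / (1 - 0.9) = 20$ and $V_*(A) = 1 + 0.9 \cdot 20 = 19$.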

* **Optimality and Approximation**

>* **Tabular case:** value fn. can be stored exactly as a table, with one entry per state (or state-action pair)
>* **Non-tabular case:** too many states for a table, so the value fn. must be approximated using function approximation