# 1. Reinforcement Learning

## 1.1. Markov Decision Process

><img src = 'images/image1_01.png' width=200>

>* **$R_t$: long-term total reward**

>\begin{align}
\text{Finite Horizon: } R_t &= \sum^{T}_{\tau=0} r_{t+\tau} \;\;\; T \text{: horizon length} \\
\text{Infinite Horizon: } R_t &= \sum^\infty_{\tau=0} \gamma^\tau r_{t+\tau} \;\;\; \gamma \text{: discounting factor}
\end{align}
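A minimal sketch of both return definitions, assuming the rewards are available as a finite Python list (the function name `discounted_return` is hypothetical). Setting `gamma=1.0` gives the finite-horizon sum; `gamma<1` gives a truncated version of the discounted sum.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum over tau of gamma**tau * rewards[tau] for a finite reward list."""
    total = 0.0
    # Accumulate backwards using R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards)           # 1 + 1 + 1
discounted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
```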

## 1.2. Open-Loop vs. Closed-Loop Control

>\begin{align}
\text{Open-Loop: } a^*_{t}(s_{t}) &= \underset{a_{t}}{\text{argmax}} \underset{a_{t+1}}{\max} \left[ \bar{r}(s_t,a_t) + \sum_{s_{t+1}} P(s_{t+1}|s_t,a_t) \bar{r}(s_{t+1},a_{t+1}) \right] \\
\text{Closed-Loop: } a^*_{t}(s_{t}) &= \underset{a_{t}}{\text{argmax}} \left[ \bar{r}(s_t,a_t) + \sum_{s_{t+1}} P(s_{t+1}|s_t,a_t) \underset{a_{t+1}}{\max} \bar{r}(s_{t+1},a_{t+1}) \right]
\end{align}

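>* Open-Loop: commits to all actions in advance (naive optimisation), so $a_{t+1}$ cannot depend on the observed $s_{t+1}$
>* Closed-Loop: chooses $a_{t+1}$ after observing $s_{t+1}$ (the $\max$ sits inside the sum), which gives the correct recursion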
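The difference between the two expressions can be checked on a small example. The sketch below assumes a hypothetical two-step problem (states `s0`, `s1`, `s2` and actions `go`, `L`, `R` are made up for illustration) in which the best second action depends on which state is reached.

```python
# Hypothetical two-step problem: from s0, "go" leads to s1 or s2 with equal probability.
P = {("s0", "go"): {"s1": 0.5, "s2": 0.5}}
# Mean rewards r_bar(s, a): "L" pays off in s1, "R" pays off in s2.
r_bar = {("s0", "go"): 0.0,
         ("s1", "L"): 1.0, ("s1", "R"): 0.0,
         ("s2", "L"): 0.0, ("s2", "R"): 1.0}
second_actions = ["L", "R"]

def open_loop_value(s, a):
    # The max over a_{t+1} is outside the sum: one action is fixed for every s_{t+1}.
    return max(
        r_bar[(s, a)] + sum(p * r_bar[(s1, a1)] for s1, p in P[(s, a)].items())
        for a1 in second_actions
    )

def closed_loop_value(s, a):
    # The max over a_{t+1} is inside the sum: the action may depend on s_{t+1}.
    return r_bar[(s, a)] + sum(
        p * max(r_bar[(s1, a1)] for a1 in second_actions)
        for s1, p in P[(s, a)].items()
    )

open_v = open_loop_value("s0", "go")
closed_v = closed_loop_value("s0", "go")
```

In this example the closed-loop value is twice the open-loop value, because the open-loop plan must pick `L` or `R` before knowing whether it will land in `s1` or `s2`.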

## 1.3. Bellman Equations

* **State Value** (infinite sum $\rightarrow$ finite no. of linear equations using recursion)

>\begin{align}
V^\pi (s) &= \mathbb{E}_\pi [R_t|s_t=s] \\
&= \mathbb{E}_\pi \left[ r_t + \gamma \sum^\infty_{\tau=0} \gamma^\tau r_{t+\tau+1} \Bigg| s_t = s \right] \\
&= \sum_a \pi(s,a) \left[ \bar{r}(s,a) + \sum_{s'} P(s'|s,a) \gamma \mathbb{E}_\pi \left[ \sum^\infty_{\tau=0} \gamma^\tau r_{t+\tau+1} \Bigg| s_{t+1}=s' \right] \right] \\
&= \sum_a \pi(s,a) \left[ \bar{r}(s,a) + \sum_{s'} P(s'|s,a) \gamma V^\pi (s') \right]
\end{align}
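As an illustration of the recursion, the sketch below evaluates a uniform random policy on a hypothetical 2-state, 2-action MDP (the numbers in `P`, `R`, and `PI` are assumptions chosen for illustration). Instead of solving the linear equations directly, it applies the last line of the derivation repeatedly as an update, which converges to the same solution when $\gamma < 1$.

```python
GAMMA = 0.5
# Hypothetical MDP: P[s][a][s2] = P(s2 | s, a), R[s][a] = r_bar(s, a)
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
PI = [[0.5, 0.5],
      [0.5, 0.5]]  # pi(s, a): uniform random policy

def bellman_backup(V):
    """Apply V(s) <- sum_a pi(s,a) [ r_bar(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]."""
    return [
        sum(PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2)))
            for a in range(2))
        for s in range(2)
    ]

V = [0.0, 0.0]
for _ in range(100):
    V = bellman_backup(V)
```

For this MDP the two linear equations can also be solved by hand, giving $V^\pi = (1.25, 1.75)$, which the iteration approaches.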

* **Action value**

>\begin{align}
Q^\pi (s,a) &= \mathbb{E}_\pi [R_t|s_t=s,a_t=a] \\
&= \bar{r}(s,a) + \sum_{s'} P(s'|s,a) \gamma \sum_{a'} \pi(s',a') Q^\pi (s',a')
\end{align}

* **Policy Improvement Theorem**

>$$\alpha'(s) = \alpha(s) \;\;\; \forall s \;\;\; \text{except} \; s^*$$

>$$Q^\pi (s^*, \alpha'(s^*)) \geq V^\pi (s^*) \;\;\;\Rightarrow\;\;\; V^{\pi'}(s) \geq V^{\pi}(s) \;\;\; \forall s$$

>* Or go all the way greedy: $\alpha'(s) = \underset{a}{\text{argmax}}\;Q^\pi(s,a) \;\;\; \forall s$

* **Generalised Policy Iteration**

>* **Policy evaluation** ($V \rightarrow V^\pi$) & **Policy improvement** ($\pi \rightarrow \text{greedy}(V)$)
>* Guaranteed to converge to the global optimum (for a finite MDP)
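The evaluation and improvement steps can be sketched as a loop. This is a minimal illustration on a hypothetical 2-state, 2-action MDP with deterministic policies (the MDP numbers and all function names are assumptions), not a general implementation.

```python
GAMMA = 0.5
# Hypothetical MDP: P[s][a][s2] = P(s2 | s, a), R[s][a] = r_bar(s, a)
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
N_STATES, N_ACTIONS = 2, 2

def q_value(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_STATES))

def evaluate(policy, sweeps=200):
    """Policy evaluation by repeated Bellman backups (policy: one action per state)."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [q_value(V, s, policy[s]) for s in range(N_STATES)]
    return V

def improve(V):
    """Policy improvement: act greedily with respect to Q computed from V."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
            for s in range(N_STATES)]

def policy_iteration(policy):
    history = []  # value function of each policy visited
    while True:
        V = evaluate(policy)
        history.append(V)
        new_policy = improve(V)
        if new_policy == policy:  # greedy policy unchanged: stop
            return policy, V, history
        policy = new_policy

policy, V, history = policy_iteration([0, 1])
```

Each entry in `history` is at least as large, state by state, as the one before it, which is what the policy improvement theorem predicts.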

* **Bellman Optimality Equation** (simpler way to find optimal policy)

>\begin{align}
V^*(s) &= \max_a \left[ \bar{r}(s,a) + \sum_{s'} P(s'|s,a) \gamma V^*(s') \right] \\
Q^*(s,a) &= \bar{r}(s,a) + \sum_{s'} P(s'|s,a) \gamma \max_{a'} Q^* (s',a')
\end{align}

>* **Problem 1:** difficult to solve non-linear function ($\max$)
>* **Problem 2:** usable only when the transition & reward prob. are known
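One common way around the non-linearity is to apply the optimality equation repeatedly as an update (value iteration). The sketch below assumes a hypothetical 2-state, 2-action MDP whose transition and reward tables `P` and `R` are fully known.

```python
GAMMA = 0.5
# Hypothetical MDP: P[s][a][s2] = P(s2 | s, a), R[s][a] = r_bar(s, a)
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

def value_iteration(tol=1e-10):
    V = [0.0, 0.0]
    while True:
        # V(s) <- max_a [ r_bar(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]
        new_V = [
            max(R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))
                for a in range(2))
            for s in range(2)
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V

V_star = value_iteration()
```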

## 1.4. Different Types of Learning

><img src = 'images/image1_02.png' width=400>

* **0. Monte Carlo Learning**

>* Step 1. **Collect experience** following some policy
>* Step 2. **Estimate value fn.** by averaging returns over episodes
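The two steps can be sketched as follows, assuming the experience has already been collected as a list of episodes, each a list of `(state, reward)` pairs. This is the first-visit variant; the episodes and the function name are hypothetical.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma):
    """First-visit Monte Carlo: average the return that follows the first visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G is the return from each step onwards;
        # later overwrites correspond to earlier visits, so the first visit wins.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_return[state] = G
        for state, g in first_visit_return.items():
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Hypothetical experience: three short episodes
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
    [("B", 1.0)],
]
V_hat = mc_value_estimate(episodes, gamma=1.0)
```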

* **1. Caching-based Learning** (online estimation & bootstrapping)

>\begin{align}
\text{TD: } \hat{V}^\pi (s_t) &\leftarrow \hat{V}^\pi (s_t) + \epsilon \left[ r_t + \gamma \hat{V}^\pi(s_{t+1}) - \hat{V}^\pi (s_t) \right] \\
\text{Q-learning: } \hat{Q}^* (s_t,a_t) &\leftarrow \hat{Q}^* (s_t,a_t) + \epsilon \left[ r_t + \gamma \max_{a'} \hat{Q}^*(s_{t+1},a') - \hat{Q}^* (s_t,a_t) \right]
\end{align}

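>* **TD: on-policy,** the target uses the value of the next state reached under the policy being followed
>* **Q-Learning: off-policy,** the target uses the best action at the next state ($\max_{a'}$), whatever action the behaviour policy actually takes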
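A single step of each update can be written directly from the equations. The sketch assumes tabular estimates stored in dictionaries and a non-terminal next state; the function names and example numbers are hypothetical.

```python
def td0_update(V, s, r, s_next, gamma, eps):
    """TD(0): move V[s] towards r + gamma * V[s_next]; returns the prediction error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += eps * delta
    return delta

def q_learning_update(Q, s, a, r, s_next, gamma, eps):
    """Q-learning: move Q[s][a] towards r + gamma * max_a' Q[s_next][a']."""
    delta = r + gamma * max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += eps * delta
    return delta

V = {"A": 0.0, "B": 1.0}
td_error = td0_update(V, "A", r=0.5, s_next="B", gamma=0.5, eps=0.5)

Q = {"A": {"L": 0.0, "R": 0.0}, "B": {"L": 2.0, "R": -1.0}}
q_error = q_learning_update(Q, "A", "L", r=1.0, s_next="B", gamma=0.5, eps=0.5)
```

The Q-learning target takes the maximum over `Q["B"]` even if the agent goes on to choose `R` in `B`, which is what makes it off-policy.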

* **2. Model-based Learning** 

>* Estimate $P(s'|s,a)$ and $\bar{r}(s,a)$ from experience $\rightarrow$ run DP to obtain values and policy
>* Strictly speaking, it is the optimal method (DP on the true model gives the optimal policy), but in practice the model parameters are difficult to estimate
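The estimation step can be sketched with simple counts, assuming experience is stored as `(s, a, r, s_next)` tuples. The transitions and names below are hypothetical; the resulting tables would then be passed to a DP routine.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate P(s'|s,a) by relative frequency and r_bar(s,a) by the sample mean."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    r_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, r_hat

# Hypothetical experience
transitions = [
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 1.0, "s2"),
    ("s1", "go", 2.0, "s0"),
]
P_hat, r_hat = estimate_model(transitions)
```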

* **3. Episodic Memory**

>* Store episode sequences $\rightarrow$ try to follow actions that ended well
>* No computation of average 
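One possible reading of this idea, as a sketch: keep each episode with its total return and, in a given state, replay the action from the best-ending stored episode that passed through that state. The memory layout and function name are assumptions made for illustration.

```python
def best_remembered_action(memory, state):
    """Return the action taken in `state` in the stored episode with the highest return."""
    best_action, best_return = None, float("-inf")
    for episode in memory:
        for s, a in episode["steps"]:
            if s == state and episode["ret"] > best_return:
                best_action, best_return = a, episode["ret"]
    return best_action  # None if the state was never visited

# Hypothetical stored episodes: (state, action) sequences plus the total return
memory = [
    {"steps": [("A", "left"), ("B", "left")], "ret": 0.0},
    {"steps": [("A", "right"), ("C", "left")], "ret": 1.0},
]
choice_in_A = best_remembered_action(memory, "A")
```

No values are averaged here: a single good episode is enough to determine the choice.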
  
* **Learning Curve Comparison**  
  
><img src = 'images/image1_03.png' width=400>

>* Model-based: **clever learning / hard control**
>* Model-free: **dull learning / easy control**

## 1.5. Evidence in the Brain

* **Memory Systems**

>* **Caching-based:** Implicit Memory - Procedural (skills and habits)
>* **Model-based:** Explicit Memory - Facts
>* **Episodic Memory:** Explicit Memory - Events

* **Dopamine = TD Prediction Error**

><img src = 'images/image1_04.png' width=400>

>$$\hat{V}^\pi(s_t) \leftarrow \hat{V}^\pi(s_t) + \epsilon \left[ r_t + \gamma \hat{V}^\pi (s_{t+1}) - \hat{V}^\pi(s_t) \right]$$

>* Striatum: responsible for procedural memory / major target of **dopamine** projections
>* Activity of dopamine neurons in the **ventral tegmental area:** consistent with the **TD prediction error** ($\delta_t = r_t + \gamma \hat{V}^\pi(s_{t+1}) - \hat{V}^\pi(s_t)$)
>* After learning, TD prediction error moves **from $t_{\text{reward}}$ to $t_{\text{stimulus}}$**
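The shift of the prediction error can be reproduced with a small simulation. The sketch below makes several assumptions: a trial is a chain of `N_STEPS` states after the stimulus, a reward of 1 arrives at the last step, there is no discounting, and the value of the inter-trial state is held at 0 because the stimulus arrives unpredictably.

```python
GAMMA = 1.0    # assumption: no discounting within a short trial
EPSILON = 0.5  # learning rate
N_STEPS = 3    # hypothetical number of time steps between stimulus and reward

def run_trials(n_trials):
    V = [0.0] * N_STEPS  # V[t]: value of the state t steps after the stimulus
    log = []
    for _ in range(n_trials):
        # Stimulus onset: transition from the inter-trial state (value 0) into state 0.
        delta_stimulus = 0.0 + GAMMA * V[0] - 0.0
        for t in range(N_STEPS):
            reward = 1.0 if t == N_STEPS - 1 else 0.0
            v_next = V[t + 1] if t + 1 < N_STEPS else 0.0
            delta = reward + GAMMA * v_next - V[t]
            V[t] += EPSILON * delta
        # After the loop, delta is the prediction error at the time of reward.
        log.append((delta_stimulus, delta))
    return log

log = run_trials(200)
first_trial, last_trial = log[0], log[-1]
```

On the first trial the error appears only at the reward; after many trials it appears at the stimulus and is close to zero at the reward.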

* **Devaluation**

>* **Model-based:** sensitive to devaluation
>* **Caching-based:** insensitive to devaluation
>  * *Effect of devaluation $\rightarrow$ indicates whether the behaviour is under the control of model-based or caching-based learning*
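A toy sketch of why the two systems respond differently, under strong simplifying assumptions: each action leads deterministically to one outcome, the model-based system looks up the current outcome value at choice time, and the caching-based system uses values stored during training. The actions, outcomes, and numbers are hypothetical.

```python
# Hypothetical task: each action leads to one outcome.
outcome_of = {"press_lever": "food", "pull_chain": "water"}
outcome_value = {"food": 1.0, "water": 0.5}

# Caching-based system: action values stored during training, before devaluation.
cached_Q = {action: outcome_value[outcome] for action, outcome in outcome_of.items()}

def model_based_choice():
    # Recomputes action values from the model and the current outcome values.
    return max(outcome_of, key=lambda a: outcome_value[outcome_of[a]])

def cached_choice():
    # Uses stored values; they change only through further experience of the action.
    return max(cached_Q, key=cached_Q.get)

before = (model_based_choice(), cached_choice())
outcome_value["food"] = 0.0  # devaluation, e.g. food paired with illness
after = (model_based_choice(), cached_choice())
```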

><img src = 'images/image1_05.png' width=400>

>* Brain listens to the **system with less uncertainty**
>* Complexity of the scenario $\uparrow$ & No. of trials $\uparrow$ $\Rightarrow$ **caching-based learning** becomes better (habitization)