# Reinforcement Learning
Author: Bingchen Wang

Last Updated: 6 Sep, 2022

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a>
</nav>

---

%%html
<link rel='stylesheet' type='text/css' media='screen' href='../styles/custom.css'>

<section class = "section--outline">
    <div class = "outline--header">Outline </div>
    <div class = "outline--content">
        <b>Concepts:</b>
        <ul>
            <li> <a href='#Motivation'>Motivation</a>
                <ul>
                    <li> <a href='#MDP'>Markov Decision Processes (MDPs)</a>
                    <li> <a href='#VF'>Value Function</a>
                    <li> <a href='#BE'>Bellman's Equations</a>   
                </ul>
            <li> <a href='#Solution'>Solution</a>
                <ul>
                    <li> <a href='#LearningPsa'>Learning $P_{sa}$</a>
                    <li> <a href='#VI'>Fitted Value Iteration</a>
                        <ul><li> Deep Q-Network <ul><li> Soft updating</ul></ul>
                </ul>
            <li> <a href = '#FHMDP'>Finite-horizon MDP</a>
                <ul>
                    <li> <a href = '#SAR'>State-action Rewards</a>
                    <li> <a href = '#FHMDP-detail'>Finite Horizon MDP</a>
                    <li> <a href = '#LQR'>Linear Quadratic Regulation (LQR)</a>
                </ul>        
        </ul>
        <b>Implementation:</b>
        <ul>
            <li> <a href='./Tensorflow Implementation.ipynb'>TensorFlow</a>
            <li> PyTorch
            <li> <a href='../../Machine Learning Specialization/Course 3 Unsupervised Learning, Recommenders, Reinforcement Learning/Week 3/Practice Lab/C3_W3_A1_Assignment.ipynb'>Machine Learning Specialization</a>
        </ul>
    </div>
</section>

### Notation

| Concept | Notation |
| :----------| :----------- |
| states | $S$ |
| actions | $A$ |
| state transition probabilities | $P_{sa}$ |
| discount factor| $\gamma$ |
| reward | $R$ |
| return | $Q$ |
| policy| $\pi$|



<a name='Motivation'></a>
### Motivation


<a name='MDP'></a>
#### Markov Decision Process (MDP)
A Markov Decision Process (MDP) is described by <mark>$(S, A, \{P_{sa}\}, \gamma, R)$</mark>.

<a name='VF'></a>
#### Value Function
For a policy $\pi$, the value function $V^{\pi}:S\mapsto \mathbb{R}$ gives the expected total discounted payoff $V^{\pi}(s)$ for starting in state $s$ and executing $\pi$.
$$
    V^{\pi}(s) = \mathrm{E}[R(s_0)+ \gamma R(s_1) + \cdots | \pi, s_0 = s]
$$

<a name='BE'></a>
#### Bellman's equations
Any policy:
$$
    V^{\pi}(s) = R(s)+ \gamma \sum_{s'}P_{s\pi(s)}(s') V^{\pi}(s')
$$
Optimal policy:
$$
    V^{*}(s) = R(s)+ \max_{a} \gamma \sum_{s'}P_{sa}(s') V^{*}(s')
$$
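
For a finite MDP, the first equation is linear in $V^{\pi}$, so it can be solved directly as $(I - \gamma P_{\pi})V^{\pi} = R$. The sketch below assumes a hypothetical 3-state, 2-action MDP whose transition probabilities and rewards are invented purely for illustration; `evaluate_policy` is an illustrative helper name, not part of any library.

```python
import numpy as np

# Hypothetical MDP (numbers are made up for illustration).
# P[a, s, s2] = P_{sa}(s2): probability of reaching s2 after action a in state s.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([0.0, 0.0, 1.0])  # state rewards R(s)
gamma = 0.9

def evaluate_policy(pi, P, R, gamma):
    """Solve V = R + gamma * P_pi V exactly for a deterministic policy pi."""
    n = len(R)
    P_pi = P[pi, np.arange(n)]  # row s holds P_{s, pi(s)}
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

pi = np.array([1, 1, 1])  # always take action 1
V_pi = evaluate_policy(pi, P, R, gamma)
print(V_pi)
```

State 2 is absorbing under action 1 and pays reward 1 at every step, so its value is the geometric sum $1/(1-\gamma) = 10$.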

<div class="alert alert-block alert-info"> <b>Goal of Reinforcement Learning:</b> Learn the optimal action at any given state. (i.e. find the optimal policy $\pi^{*}$) </div>
<div class="alert alert-block alert-warning"> <b>Learning strategy:</b> <ol>
    <li> Model or learn $P_{sa}$
    <li> Find $V^{*}$
    <li> Implicitly find $\pi^{*}$, given by $\pi^{*}(s) = \arg\max_{a} \sum_{s'}P_{sa}(s') V^{*}(s')$
</ol> </div>
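
Steps 2 and 3 of this strategy can be sketched for a small finite MDP as plain value iteration followed by a greedy argmax. This is only an illustration under stated assumptions: the MDP below is hypothetical (numbers invented), $P_{sa}$ is taken as already known, and `value_iteration` is an illustrative name.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s') to convergence."""
    V = np.zeros(len(R))
    for _ in range(max_iter):
        expected_next = P @ V                      # shape (n_actions, n_states)
        V_new = R + gamma * expected_next.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    pi = (P @ V).argmax(axis=0)                    # greedy policy from V
    return V, pi

# Hypothetical MDP: P[a, s, s2] = P_{sa}(s2), rewards only in state 2.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

V_star, pi_star = value_iteration(P, R, gamma)
print(V_star, pi_star)
```

In this toy problem action 1 moves towards (and then stays in) the rewarding state, so the greedy policy picks it everywhere.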



<a name="Solution"></a>
### Solution

<a name="LearningPsa"></a>
#### Learning $P_{sa}$
##### Approach 1: Learn from reality
$$P_{sa}(s') = \frac{\text{# times took action "a" in state s and got to s'}}{\text{# times took action "a" in state s}}$$
(or $1/|S|$ if the denominator is 0.)
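
A minimal sketch of this counting estimate, assuming experience is stored as a list of `(s, a, s_next)` integer triples (a hypothetical data format) and using the uniform fallback $1/|S|$ for unseen state-action pairs:

```python
import numpy as np

def estimate_transitions(transitions, n_states, n_actions):
    """Return P_hat[s, a, s2], the empirical estimate of P_{sa}(s2)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)       # times (s, a) was tried
    uniform = np.full_like(counts, 1.0 / n_states)   # fallback when never tried
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

# Made-up experience: (state, action, next state).
transitions = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
P_hat = estimate_transitions(transitions, n_states=2, n_actions=2)
print(P_hat[0, 0])  # estimated from three visits
print(P_hat[0, 1])  # never visited, so uniform
```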
##### Approach 2: Build a simulator based on domain knowledge, e.g. physics simulator
##### Approach 3: Build a supervised learning model and add noise
<section class = "section--algorithm">
    <div class = "algorithm--header"> Learning the state transition probabilities</div>
    <div class = "algorithm--content">
        <ul>
            <li> Collect data ${\{(s^{(i)}_t, a^{(i)}_t)\}}^{i=1,...,m}_{t=1,...,T}$.
            <li> Train $h: S \times A \mapsto S$.
            <li> Use $s_{t+1} = h(s_t,a_t) + \epsilon_t$ where $\epsilon_t \sim N(0,\sigma^2 I)$.
        </ul>
    </div>
</section>
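
The three steps above can be sketched as follows. Everything specific here is an assumption made for illustration: the scalar "true" system `true_step`, the hand-picked features in `model_features`, the use of least squares as the supervised learner, and the choice to set $\sigma$ from the training residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Hypothetical "real" system, used here only to generate training data.
    return s + 0.1 * np.sin(s) + 0.5 * a

def model_features(s, a):
    # Hand-picked features for h; any supervised learner could be used instead.
    return np.stack([s, np.sin(s), a], axis=-1)

# 1. Collect (s_t, a_t, s_{t+1}) triples.
s = rng.uniform(-3.0, 3.0, size=500)
a = rng.uniform(-1.0, 1.0, size=500)
s_next = true_step(s, a) + 0.01 * rng.normal(size=500)

# 2. Train h: S x A -> S by least squares.
w, *_ = np.linalg.lstsq(model_features(s, a), s_next, rcond=None)

def h(s, a):
    s, a = np.asarray(s, dtype=float), np.asarray(a, dtype=float)
    return model_features(s, a) @ w

# 3. Simulate with s_{t+1} = h(s_t, a_t) + eps_t, eps_t ~ N(0, sigma^2).
sigma = np.std(s_next - h(s, a))  # assumption: noise scale from residuals

def simulate_step(s, a, rng):
    return h(s, a) + sigma * rng.normal(size=np.shape(s))

print(w, sigma)
```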

<a name = 'VI'></a>
#### Fitted Value Iteration
<section class = "section--algorithm">
    <div class = "algorithm--header"> Fitted Value Iteration</div>
    <div class = "algorithm--content">
        <blockquote>
            Sample $\{s^{(1)}, s^{(2)},\dots,s^{(m)}\} \subseteq S$ randomly. <br>
            Initialise $\theta := 0$. <br>
            <b>Repeat:</b>
            <blockquote> For $i = 1, \dots, m$,
                <blockquote> For each action $a \in A$, 
                    <blockquote>
                        Sample $s_1^\prime,s_2^\prime,\dots,s_k^\prime \sim P_{s^{(i)},a}$. <br>
                        Set $ q(a) = \frac{1}{k}\sum^{k}_{j=1}[R(s^{(i)}) + \gamma V(s^\prime_j)]$, where $V(s) = f_\theta(s)$ is the current fit. <br>
                    </blockquote>  
                    Set $y^{(i)} = \max_a q(a)$.
                </blockquote>
            Update $\theta := \arg\min_{\theta} \frac{1}{2}\sum^m_{i=1}(f_\theta(s^{(i)})-y^{(i)})^2$
            <div class="alert alert-block alert-success"> <b>Note:</b> Can choose $f_{\theta}$ to be any suitable supervised learner.</div>    
            </blockquote>    
        </blockquote>
    </div>
</section>
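
A minimal sketch of the loop above on a toy problem. The problem itself is hypothetical: a scalar state in $[-1, 1]$, three actions, noisy additive dynamics, reward $R(s) = -s^2$, and a linear-in-features learner $f_\theta(s) = \theta^T[1, s, s^2]$ fitted by least squares. The number of outer iterations (50) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, k, m = 0.9, 10, 50
actions = np.array([-0.1, 0.0, 0.1])

def reward(s):
    # Hypothetical R(s), largest at s = 0.
    return -s ** 2

def sample_next(s, a, k):
    # k draws from a hypothetical P_{sa}: noisy step, clipped to [-1, 1].
    return np.clip(s + a + 0.05 * rng.normal(size=k), -1.0, 1.0)

def features(s):
    # Feature map for f_theta(s) = theta . [1, s, s^2].
    s = np.atleast_1d(s)
    return np.stack([np.ones_like(s), s, s ** 2], axis=1)

states = rng.uniform(-1.0, 1.0, size=m)   # sampled s^(1), ..., s^(m)
X = features(states)
theta = np.zeros(3)                       # initialise theta := 0

for _ in range(50):                       # "Repeat"
    y = np.empty(m)
    for i, s in enumerate(states):
        q = [np.mean(reward(s) + gamma * features(sample_next(s, a, k)) @ theta)
             for a in actions]
        y[i] = max(q)                     # y^(i) = max_a q(a)
    # Supervised step: least-squares fit of f_theta(s^(i)) to y^(i).
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)

def V_hat(s):
    return (features(s) @ theta).item()

print(theta, V_hat(0.0), V_hat(0.9))
```

The fitted value should be higher near $s = 0$, where the reward is largest, than near the edges of the interval.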

<a name = "FHMDP"></a>
### Finite-horizon MDP

<a name = "SAR"></a>
#### State-action rewards
**Intuition**: Actions may have costs.

**Bellman's equations**:
$$ V^*(s) = \max_a [R(s,a) + \gamma \sum_{s^\prime}P_{sa}(s^\prime)V^*(s^\prime)]$$

**Value iteration**:
$$V(s) := \max_a [R(s,a) + \gamma \sum_{s^\prime}P_{sa}(s^\prime)V(s^\prime)]$$

<a name = "FHMDP-detail"></a>
#### Finite-horizon MDP
**MDP**: $(S,A,\{P_{sa}\}, T, R)$ where $T$ is the horizon time
<div class="alert alert-block alert-danger"> <b>Note:</b> There is no gamma ($\gamma$). </div>

**Notation**:

| Concept | Notation |
| :----------| :----------- |
| state | $s_t$ |
| action | $a_t$ |
| state transition probabilities | $P_{sa}^{(t)}$ |
| horizon time| $T$ |
| reward | $R^{(t)}$ |
| return/value | $V_t^*$ |
| policy| $\pi^*_t$|

**Value iteration**:
$$
V^*_T(s) = \max_a R^{(T)}(s,a)
$$
Repeat for $t = T-1, T-2, \dots, 0$:
$$
V^*_t(s) = \max_a \{R^{(t)}(s,a) + \sum_{s^\prime} P_{sa}^{(t)}(s^\prime) V^*_{t+1}(s^\prime)\} \\
\pi^*_t(s) = \arg\max_a \{R^{(t)}(s,a) + \sum_{s^\prime} P_{sa}^{(t)}(s^\prime) V^*_{t+1}(s^\prime)\}
$$
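
The backward recursion can be sketched for a small tabular problem as below. The 2-state example is hypothetical: action 0 stays put, action 1 switches state at a cost of 0.5, and being in state 1 pays 1 per step. The array layouts `P[t, a, s, s2]` and `R[t, s, a]` are a choice made for this sketch.

```python
import numpy as np

def finite_horizon_vi(P, R, T):
    """Backward induction. P[t, a, s, s2] and R[t, s, a] for t = 0..T."""
    n_states = R.shape[1]
    V = np.zeros((T + 1, n_states))
    pi = np.zeros((T + 1, n_states), dtype=int)
    V[T] = R[T].max(axis=1)                 # base case: V_T(s) = max_a R(s, a)
    pi[T] = R[T].argmax(axis=1)
    for t in range(T - 1, -1, -1):
        Q = R[t] + (P[t] @ V[t + 1]).T      # Q[s, a]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

T = 2
P_single = np.array([np.eye(2), np.eye(2)[::-1]])   # a=0: stay, a=1: switch
R_single = np.array([[0.0, -0.5], [1.0, 0.5]])      # R[s, a]
P = np.stack([P_single] * (T + 1))                  # same dynamics at every t
R = np.stack([R_single] * (T + 1))

V, pi = finite_horizon_vi(P, R, T)
print(V[0])       # [1.5 3. ]
print(pi[:, 0])   # [1 1 0]
```

The optimal policy is non-stationary: from state 0 it pays to switch at $t = 0, 1$, but not at the final step $t = T$, where there is no future reward left to collect.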

<a name = "LQR"></a>
#### Linear Quadratic Regulation (LQR)
<div class="alert alert-block alert-info"> <b>Assumptions:</b> 
    <ol>
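        <li> linear state transitions with Gaussian noise, $P_{sa}: s_{t+1} = As_t+Ba_t + w_t$ where $w_t \sim N(0,\Sigma_w)$
        <li> quadratic state-action reward function $R(s,a)= -(s^TUs + a^TVa)$, where $U$ and $V$ are positive definite cost matrices ($V$ here is not the value function)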
    </ol>        
</div>

##### Learning A and B
<section class = "section--algorithm">
    <div class = "algorithm--header"> Method 1: Learn from it</div>
    <div class = "algorithm--content">
        <ul>
            <li> Collect data ${\{(s^{(i)}_t, a^{(i)}_t)\}}^{i=1,...,m}_{t=0,...,T}$.
            <li> Model $s_{t+1} \approx As_t+ Ba_t$.
            <li> Loss $ \min_{A,B} \sum^m_{i=1} \sum^{T-1}_{t=0}||s_{t+1} - (As_t+Ba_t)||^2$.
        </ul>
    </div>
</section>
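
Method 1 is an ordinary least-squares problem in the stacked matrix $[A \; B]$. The sketch below assumes a hypothetical double-integrator system (`A_true`, `B_true`) to generate $m$ noisy trajectories, then recovers $A$ and $B$ with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth dynamics, used only to generate data.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

m, T = 20, 30
inputs, targets = [], []
for _ in range(m):                         # m trajectories
    s = rng.normal(size=2)
    for _ in range(T):                     # T transitions each
        a = rng.normal(size=1)
        s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=2)
        inputs.append(np.concatenate([s, a]))
        targets.append(s_next)
        s = s_next

X = np.array(inputs)                       # rows are [s_t, a_t]
Y = np.array(targets)                      # rows are s_{t+1}

# Minimise sum ||s_{t+1} - (A s_t + B a_t)||^2, i.e. Y ~ X W with W = [A B]^T.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T
print(A_hat)
print(B_hat)
```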

<section class = "section--algorithm">
    <div class = "algorithm--header"> Method 2: Linearise a non-linear model</div>
    <div class = "algorithm--content">
        <ul>
            <li> Known model $s_{t+1} = f(s_t, a_t)$.
            <li> Pick a pair of typical values $(\bar{s}_t, \bar{a}_t)$ for $s_t$ and $a_t$
            <li> Linearise $s_{t+1} \approx f(\bar{s}_t, \bar{a}_t) + (\nabla_s f(\bar{s}_t, \bar{a}_t))^T (s_t - \bar{s}_t)+ (\nabla_a f(\bar{s}_t, \bar{a}_t))^T (a_t - \bar{a}_t)$
        </ul>
    </div>
</section>
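
When derivatives of $f$ are not available in closed form, the linearisation can be approximated numerically. The sketch below assumes a hypothetical pendulum-like model `f` and uses central finite differences to build the Jacobians; here `A[i, j]` is $\partial f_i / \partial s_j$ and `B[i, j]` is $\partial f_i / \partial a_j$, evaluated at $(\bar{s}, \bar{a})$.

```python
import numpy as np

dt = 0.1

def f(s, a):
    # Hypothetical non-linear model: s = [angle, angular velocity], a = [torque].
    theta, omega = s
    return np.array([theta + dt * omega,
                     omega + dt * (-np.sin(theta) + a[0])])

def jacobian(fun, x, eps=1e-6):
    """Central-difference Jacobian of fun at x (one column per input)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        d = np.zeros_like(x)
        d[j] = eps
        cols.append((fun(x + d) - fun(x - d)) / (2 * eps))
    return np.stack(cols, axis=1)

s_bar, a_bar = np.array([0.0, 0.0]), np.array([0.0])   # typical operating point
A = jacobian(lambda s: f(s, a_bar), s_bar)
B = jacobian(lambda a: f(s_bar, a), a_bar)

def f_linear(s, a):
    # First-order Taylor expansion of f around (s_bar, a_bar).
    return f(s_bar, a_bar) + A @ (s - s_bar) + B @ (a - a_bar)

print(A)
print(B)
```

Near the operating point the linear model tracks `f` closely; the approximation degrades as $(s_t, a_t)$ moves away from $(\bar{s}_t, \bar{a}_t)$.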

##### Base case
$$
V^*_T(s_T) = \max_{a_T} R(s_T,a_T) = \max_{a_T} [-s^T_TUs_T - a^T_TVa_T] = - s^T_TUs_T \\
\pi^*_T(s_T) = \vec{0}
$$
##### Intermediate case
Suppose
$$
V^*_{t+1}(s_{t+1}) = s^T_{t+1} \Phi_{t+1} s_{t+1} + \Psi_{t+1}
$$

where $\Phi_{t+1} \in \mathbb{R}^{n \times n}$, $\Psi_{t+1} \in \mathbb{R}$. Then,
$$
V^*_t(s_t) = \max_{a_t} \{R(s_t, a_t) + \mathrm{E}_{s_{t+1} \sim P_{s_ta_t}}[V^*_{t+1}(s_{t+1})]\} \\
V^*_t(s_t) = \max_{a_t} \{-s^T_tUs_t - a^T_tVa_t + \mathrm{E}_{s_{t+1} \sim N(As_t + Ba_t, \Sigma_w)}[s^T_{t+1} \Phi_{t+1} s_{t+1} + \Psi_{t+1}]\}
$$

Take the derivative with respect to $a_t$ and set it to $0$, i.e. $-2Va_t + 2B^T\Phi_{t+1}(As_t + Ba_t) = 0$, then solve for $a_t$:
$$
a_t = \underbrace{-(B^T\Phi_{t+1}B - V)^{-1}B^{T} \Phi_{t+1}A}_{L_t} s_t \\
\pi^*_t(s_t) = L_ts_t
$$
<div class = "alert alert-block alert-warning">The optimal action is a linear function of the state $s_t$.</div>
Then,
$$
V^*_t = s^T_t\Phi_ts_t + \Psi_t
$$
where
$$
\Phi_t = A^T(\Phi_{t+1}- \Phi_{t+1}B(B^T\Phi_{t+1}B-V)^{-1}B^T\Phi_{t+1})A - U \\
\Psi_t = \mathrm{tr}(\Sigma_w\Phi_{t+1}) + \Psi_{t+1}
$$
<div class = "alert alert-block alert-danger"><b>Note:</b> The covariance matrix of the noise $w_t$ affects only $\Psi_t$, which does not affect the optimal policy.</div>

##### Putting it together
<section class = "section--algorithm">
    <div class = "algorithm--header"> Linear Quadratic Regulation</div>
    <div class = "algorithm--content">
        <blockquote>
            Initialise $\Phi_T = -U$, $\Psi_T = 0$. <br>
            Solve the base case: $V^*_T = -s^T_TUs_T$ and $\pi^*_T(s_T) = 0$. <br>
            Recursively calculate: <blockquote>
            $\Phi_t, \Psi_t$ using $\Phi_{t+1},\Psi_{t+1}$ (for $t=T-1, T-2, \dots, 0$):<blockquote>
            $\Phi_t = A^T(\Phi_{t+1}- \Phi_{t+1}B(B^T\Phi_{t+1}B-V)^{-1}B^T\Phi_{t+1})A - U$<br>
            $\Psi_t = \mathrm{tr}(\Sigma_w\Phi_{t+1}) + \Psi_{t+1}$</blockquote>
            </blockquote>
            Calculate $L_t = -(B^T\Phi_{t+1}B - V)^{-1}B^{T} \Phi_{t+1}A$ at each step $t$.
        </blockquote>
        Solution: $\pi^*_t(s_t) = L_ts_t$ (exact result of LQR).
    </div>
</section>
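
A minimal NumPy sketch of this backward recursion, assuming time-invariant $A, B, U, V, \Sigma_w$ and symmetric $U$, $V$. The gain is obtained from the first-order condition $-2Va_t + 2B^T\Phi_{t+1}(As_t + Ba_t) = 0$, and the constant term uses $\mathrm{E}[s^T\Phi s] = \mu^T\Phi\mu + \mathrm{tr}(\Sigma_w\Phi)$ for $s \sim N(\mu, \Sigma_w)$. The function name `lqr` and the scalar example are illustrative assumptions.

```python
import numpy as np

def lqr(A, B, U, V, Sigma_w, T):
    """Return gains L_0..L_{T-1} and (Phi_0, Psi_0) for reward -(s'Us + a'Va)."""
    Phi, Psi = -U, 0.0                       # base case: Phi_T = -U, Psi_T = 0
    gains = [None] * T
    for t in range(T - 1, -1, -1):
        M = B.T @ Phi @ B - V                # uses Phi_{t+1}
        gains[t] = -np.linalg.solve(M, B.T @ Phi @ A)        # L_t
        Psi = np.trace(Sigma_w @ Phi) + Psi                  # Psi_t
        Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return gains, Phi, Psi

# Scalar check with A = B = U = V = 1 and one step to go (T = 1):
# V_0(s) = max_a { -s^2 - a^2 - E[(s + a + w)^2] }, maximised at a = -s / 2.
one = np.array([[1.0]])
gains, Phi0, Psi0 = lqr(one, one, one, one, 0.1 * one, T=1)
print(gains[0], Phi0, Psi0)
```

For this scalar case the recursion gives $L_0 = -0.5$, $\Phi_0 = -1.5$ and $\Psi_0 = -0.1$, which matches maximising the expression in the comment by hand.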