# Reinforcement Learning course by David Silver notes

## Lecture 1. Introduction

## Lecture 2. Markov Decision Processes

* What is a random process?
* What is the Markov property?
* What is a state transition matrix?
* What is a Markov process or Markov chain?
* Draw an example diagram of a Markov process.
* What is a Markov reward process? (The reward function takes a state s and returns the expected reward; the value function takes a state s and returns the expected return, i.e. the long-term value of being in that state.)
* Draw an example of a Markov reward process.
* What is the return?
* What is the value function in a Markov reward process?
* What is the Bellman equation for MRPs?
* What is a backup diagram in an MRP (with an example)?
* What is an MDP?
* Draw an example of an MDP.
* What is a policy?
* What is the state-value function?
* What is the action-value function?
* What are the Bellman expectation equations?
* What is an optimal policy?
* What are the Bellman optimality equations?

TODO: episode, dynamics of MDP, example diagram of MDP

### Definitions

An MDP is a framework that lets us model real-world problems and formalize them mathematically. To understand MDPs it is crucial to fully understand every term defined below, so this note takes the form of a cheat sheet with lots of definitions.

<img src="imgs/agent-env.png">

_def_. The real-world problem that we want to model is called a **stochastic or random process** in the language of statistics. We can think of a random process as a set of random states. More formally, a stochastic process is a family of random variables describing certain events.

_def_. If the next state of a process depends only on the present state and not on previous states, we say that the process has the **Markov property**. This should be thought of as a restriction on the states: each state should capture all relevant information from the history. More formally, a random process with states $S_1, S_2, \dots$ satisfies the Markov property if:

$$
P(S_{t+1}|S_t) = P(S_{t+1}|S_1, \dots, S_{t})
$$

for every time step $t$.

_def_. **State transition matrix** is a square matrix which tells us the probability of transitioning from one state to another. Entry $p_{ij}$ is the probability of moving from state $i$ to state $j$, so each row sums to 1.

$$
\mathcal{P} = \begin{bmatrix}
p_{11} & \cdots & p_{1n}\\
\vdots & \ddots & \vdots\\
p_{n1} & \cdots & p_{nn}
\end{bmatrix}
$$

_def_. **Markov process** or **Markov chain** is a process with the Markov property, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property. More formally, a Markov process is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ where $\mathcal{S}$ is a set of states and $\mathcal{P}$ is a state transition matrix.

<img src="imgs/mp.png">

<img src="imgs/stochastic-process-not-markov.gif">
<img src="imgs/stochastic-process-is-markov.gif">
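As a rough illustration of the tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, the sketch below samples a trajectory from a small chain. The state names, the probabilities and the helper `sample_chain` are made up for this example and are not taken from the lecture.

```python
import random

# Hypothetical 3-state chain; names and probabilities are illustrative only.
states = ["Class", "Pub", "Sleep"]
P = [
    [0.5, 0.3, 0.2],  # transitions from Class
    [0.6, 0.4, 0.0],  # transitions from Pub
    [0.0, 0.0, 1.0],  # Sleep is absorbing
]

def sample_chain(start, steps, rng):
    """Sample a trajectory; the next state depends only on the current one."""
    path = [start]
    current = states.index(start)
    for _ in range(steps):
        # Row `current` of P is the distribution over next states.
        current = rng.choices(range(len(states)), weights=P[current])[0]
        path.append(states[current])
    return path

# Each row of the transition matrix is a probability distribution.
row_sums = [sum(row) for row in P]
trajectory = sample_chain("Class", 10, random.Random(0))
```

Note that `sample_chain` never looks at `path` when choosing the next state, which is exactly what the Markov property asks for.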

_def_. **Markov reward process** is a Markov process with value judgements: how good it is to be in a certain state. A Markov reward process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ where $\mathcal{S}$ is a set of states, $\mathcal{P}$ is a state transition matrix, $\mathcal{R}$ is a reward function and $\gamma$ is a discount factor.

<img src="imgs/mrp.png">


_def_. **Return** $G_t$ is the total discounted sum of rewards you get after time $t$.

$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}
$$
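For a finite episode the sum above can be computed directly. A minimal sketch, assuming the rewards $R_{t+1}, R_{t+2}, \dots$ are given as a Python list; the function name `discounted_return` and the reward values are illustrative assumptions:

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] of a finite episode."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical reward sequence, chosen only for illustration.
rewards = [-2.0, -2.0, -2.0, 10.0]
g_half = discounted_return(rewards, 0.5)  # -2 - 1 - 0.5 + 1.25 = -2.25
```

With $\gamma = 0$ only the immediate reward counts, and with $\gamma = 1$ all rewards are simply added up.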

_def_. The **goal** of reinforcement learning is to maximize the expected return.

_def_. **Value function** $v(s)$ gives the long-term value of being in state $s$: the expected return starting from that state, $v(s) = E(G_t|S_t=s)$.

<img src="imgs/value_functions.png">

_def_. **Bellman equation for MRPs** is the following equation. It states that the value function can be decomposed into two parts: 1. the immediate reward $R_{t+1}$ and 2. the discounted value of the successor state $\gamma v(S_{t+1})$.

<img src="imgs/bellman_mrps.png">
<img src="imgs/IMG_5525.jpg">

_def_. **Backup diagram** is a one-step lookahead tree which helps us visualize one step of a process.

<img src="imgs/calc_value_function.png">

<img src="imgs/bellman_eq_mat.png">

Because the Bellman equation is linear, it can be solved for $v$ directly:

<img src="imgs/solve_bellman_eq.png">

The computational complexity of this calculation is $O(n^3)$ for $n$ states, so the direct solution is applicable only to small MRPs. For larger MRPs other methods are available: dynamic programming, Monte-Carlo evaluation and temporal-difference learning.
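A minimal sketch of the direct solution, assuming NumPy and a made-up 3-state MRP (the matrix `P`, the reward vector `R` and `gamma` are illustrative, not the lecture's example). It solves the linear system $(I - \gamma \mathcal{P}) v = \mathcal{R}$ rather than forming the inverse explicitly:

```python
import numpy as np

# Hypothetical 3-state MRP; the numbers are illustrative assumptions.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # absorbing terminal state
])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# v = R + gamma * P v  is equivalent to  (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
```

The result can be checked by plugging `v` back into the Bellman equation: `R + gamma * P @ v` should reproduce `v`.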

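_def_. **Markov decision process** is a Markov reward process with decisions (actions). It is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ where $\mathcal{A}$ is a set of actions, and the transition probabilities and the reward function now depend on the chosen action as well as on the state.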

<img src="imgs/mdp.png">

_def_. **Policy** is a mapping from states to probabilities of selecting each possible action.

$$
\pi(a|s) = P(A_t=a|S_t=s)
$$

<img src="imgs/mdp_note.png">

_def_. Previously, in MRPs, we defined a value function $v(s)$ as the long-term value of being in a state. In MDPs we define the **state-value function**, which tells us how good it is to be in a state when following the policy $\pi$. The state-value function is defined as the expected return starting from $s$ and following policy $\pi$.

$$
v_{\pi}(s) = E_{\pi}(G_t|S_t=s)
$$

_def_. **Action-value function** tells us how good it is to take a certain action from a particular state while following the policy $\pi$. We define the action-value function $q_{\pi}(s, a)$ as the expected return when starting from state $s$, taking the action $a$ and then following policy $\pi$:

$$
q_{\pi}(s, a) = E_{\pi}(G_t|S_t=s, A_t=a)
$$


We can now also define what the Bellman equation looks like in MDPs.

_def_. **Bellman expectation equation for state-value function**:

<img src="imgs/bellman_eq_sv.png">

<!--- 
$$
\mathsf{v}_{\pi}(s) = \mathbb{E}_{\pi}(R_{t+1} + \gamma\mathsf{v}_{\pi}(S_{t+1}) | S_t = s) = \sum_{a \in A} \pi(a|s)q_{\pi}(s, a)
$$
 -->
_def_. **Bellman expectation equation for action-value function**:
<img src="imgs/bellman_eq_av.png">
<!-- 
$$
q_{\pi}(s, a) = \mathbb{E}_{\pi}(R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a) = R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a}\mathsf{v}_{\pi}(s')
$$
 -->
 
Therefore,
<img src="imgs/e_bellman_eq1.png">
<img src="imgs/e_bellman_eq2.png">
<img src="imgs/e_bellman_eq3.png">
<img src="imgs/e_bellman_eq4.png">
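The Bellman expectation equations can be checked numerically on a toy problem. The sketch below assumes NumPy and a made-up two-state, two-action MDP with deterministic transitions; the array layout (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`) is a convention chosen for this example only. It averages the MDP over a uniform random policy, solves the resulting MRP for $v_{\pi}$, and then recovers $q_{\pi}$ from $v_{\pi}$.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s2] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([
    [0.0, 1.0],  # state 0: stay -> 0, move -> 1
    [2.0, 0.0],  # state 1: stay -> 2, move -> 0
])
pi = np.full((2, 2), 0.5)  # uniform random policy pi[s, a]
gamma = 0.9

# Averaging over the policy turns the MDP into an MRP.
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = (pi * R).sum(axis=1)

# v_pi solves the linear Bellman expectation equation of that MRP.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s, a) = R[s, a] + gamma * sum_s2 P[a, s, s2] * v_pi(s2)
q = R + gamma * np.einsum("ast,t->sa", P, v)
```

On this toy problem `v` comes out as `[7.25, 7.75]`, and averaging `q` over the policy, `(pi * q).sum(axis=1)`, gives back `v`, which is the first Bellman expectation equation.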


### Optimality

_def_. **Optimal value functions**
<img src="imgs/optimal_value_functions.png">

_def_. **Optimal policy**
<img src="imgs/optimal_policy.png">

_def_. **Bellman optimality equation**
<img src="imgs/o_bellman_eq1.png">
<img src="imgs/o_bellman_eq2.png">
<img src="imgs/o_bellman_eq3.png">
<img src="imgs/o_bellman_eq4.png">

### Example
<img src="imgs/recycling_robot_1.jpg">
<img src="imgs/recycling_robot_2.jpg">


## Lecture 3. Planning by Dynamic Programming

* What is DP?
* What is the planning problem?
* What is prediction and what is control?
* What is policy evaluation?
* Write down the iterative policy evaluation algorithm.
* Formulate and prove the policy improvement theorem.
* Write down the policy iteration algorithm.
* What is modified policy iteration?
* What is generalized policy iteration?
* Formulate the principle of optimality.
* Describe the value iteration algorithm.
* Synchronous vs. asynchronous DP?

**Dynamic programming** is a method for solving complex problems by 1. breaking them down into subproblems, 2. solving the subproblems, and 3. combining the solutions to the subproblems.

Dynamic Programming is a very general solution method for problems which have two properties:
1. Optimal substructure
    * Principle of optimality applies
    * Optimal solution can be decomposed into subproblems
2. Overlapping subproblems
    * Subproblems recur many times
    * Solutions can be cached and reused
    
MDPs satisfy both properties: the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions (it acts as a *cache*).

_def_. **Planning** is the problem of solving the MDP (finding the optimal policy) given full knowledge of the MDP (we know the rewards and the dynamics). DP is used for solving planning problems. Planning covers two problems:
1. **Prediction** is the problem of evaluating the policy $\pi$
2. **Control** is the problem of finding the optimal policy.

<img src="imgs/prediction_and_control.png">

_def_. **Policy evaluation** is the process of evaluating a given policy, i.e. if someone gives us a policy, we need to figure out how much reward we are going to collect when following it. Iterative policy evaluation is an algorithm that does this by repeatedly applying the Bellman expectation equation.

<img src="imgs/policy_eval.png">

Note that the value function $V$ is a one-dimensional array (cache) holding the value of each state.

<img src="imgs/iterative_policy_eval_bkp_diag.png">
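Iterative policy evaluation can be sketched in a few lines. This is a minimal version under stated assumptions, not the exact algorithm from the figure: it uses synchronous sweeps (a new array per sweep), NumPy, a hypothetical array layout (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`) and a made-up two-state MDP.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, theta=1e-10):
    """Apply the Bellman expectation backup until V changes by less than theta."""
    V = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * V[s2]
        q = R + gamma * np.einsum("ast,t->sa", P, V)
        # Average the action values over the policy to get the new state values.
        V_new = (pi * q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

# Hypothetical MDP used only for illustration.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy

V_random = policy_evaluation(P, R, pi, gamma=0.9)
```

For this toy problem the values converge to roughly `[7.25, 7.75]`. The threshold `theta` is an assumed stopping rule; a fixed number of sweeps would work as well.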

_def_. **Policy iteration** is the process of finding the optimal policy. It is an iterative algorithm that repeats two steps in a cycle: policy evaluation (based on the Bellman expectation equation) and greedy policy improvement.


<img src="imgs/policy_iter.png">

_def_. **Policy improvement** is the process of improving the policy. Starting from policy $\pi$, we obtain a new policy $\pi'$ by acting greedily with respect to the current action-value function. More formally,

$$
\pi'(s) = \underset{a \in A}{\operatorname{argmax}}\ q_{\pi}(s, a)
\tag{1}
$$
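Equation (1) translates almost literally into code. A minimal sketch, assuming NumPy, a hypothetical array layout (`P[a, s, s2]`, `R[s, a]`) and a made-up two-state MDP; the vector `V_pi` holds the values of the uniform random policy on that MDP, computed separately:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    # One-step lookahead: q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * V[s2]
    q = R + gamma * np.einsum("ast,t->sa", P, V)
    return np.argmax(q, axis=1)

# Hypothetical MDP used only for illustration.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
V_pi = np.array([7.25, 7.75])  # values of the uniform random policy for gamma = 0.9

improved = greedy_policy(P, R, V_pi, gamma=0.9)
```

Here the greedy policy is "move" in state 0 and "stay" in state 1. `np.argmax` breaks ties by picking the lowest action index, which is an arbitrary but harmless choice.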


Let's explain the following theorem by focusing on the special case of a *deterministic policy* $a = \pi(s)$. The theorem extends easily to stochastic policies $\pi(a|s)$.


_theorem_. (**Policy improvement theorem**) Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in S$ 

$$
q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s)
\tag{2}
$$

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in S$

$$
v_{\pi'}(s) \ge v_{\pi}(s).
\tag{3}
$$

$\Delta.$ TODO
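
One way to fill in the proof, sketched here for the deterministic case. This is the standard argument from Sutton and Barto's textbook rather than something taken from these notes. Start from condition (2), expand $q_{\pi}$ by one step, and apply condition (2) again at the next state:

$$
\begin{aligned}
v_{\pi}(s) &\le q_{\pi}(s, \pi'(s)) \\
&= E(R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_t=s, A_t=\pi'(s)) \\
&= E_{\pi'}(R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_t=s) \\
&\le E_{\pi'}(R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1}))|S_t=s) \\
&= E_{\pi'}(R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2})|S_t=s) \\
&\le \dots \\
&\le E_{\pi'}(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots|S_t=s) \\
&= v_{\pi'}(s)
\end{aligned}
$$

Each inequality replaces $v_{\pi}$ at the next state by $q_{\pi}$ evaluated at the action chosen by $\pi'$, which is allowed by condition (2).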

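This means that if we act greedily (as described by equation (1)), then the condition of the policy improvement theorem (equation (2)) is fulfilled, because $q_{\pi}(s, \pi'(s)) = \max_{a} q_{\pi}(s, a) \ge q_{\pi}(s, \pi(s)) = v_{\pi}(s)$. We can therefore apply the theorem and conclude that updating the policy in the greedy manner never makes it worse. Another conclusion is that if improvement stops, i.e. the greedy policy is no better than the current one, then the Bellman optimality equation is satisfied and we have reached an optimal policy.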


<img src="imgs/policy_iteration_2.png">
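Putting evaluation and greedy improvement together gives policy iteration. The sketch below is one possible arrangement under stated assumptions: deterministic policies stored as an array of action indices, iterative evaluation with an assumed threshold `theta`, NumPy, and a made-up two-state MDP. The helper names `evaluate` and `policy_iteration` are hypothetical.

```python
import numpy as np

def evaluate(P, R, policy, gamma, theta=1e-10):
    """Iterative evaluation of a deterministic policy (array of action indices)."""
    idx = np.arange(R.shape[0])
    P_pi = P[policy, idx]  # row s is P[policy[s], s, :]
    R_pi = R[idx, policy]  # reward for the action the policy picks in s
    V = np.zeros(R.shape[0])
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(R.shape[0], dtype=int)  # arbitrary initial policy
    while True:
        V = evaluate(P, R, policy, gamma)
        q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Hypothetical MDP used only for illustration.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])

best_policy, best_V = policy_iteration(P, R, gamma=0.9)
```

On this toy problem the loop stops after two improvement steps with the policy "move in state 0, stay in state 1" and values close to `[19, 20]`. Because evaluation is only approximate, exact ties between actions could in principle make the stability check flip back and forth; this sketch does not guard against that.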


_def_. **Modified policy iteration** is essentially the same algorithm as policy iteration, except that policy evaluation is not run to convergence. It has several variations:
1. We can introduce a stopping threshold $\epsilon$ and end policy evaluation early, once the value function changes by less than $\epsilon$ between sweeps, instead of waiting for it to converge exactly
2. Or simply stop after $k$ iterations of iterative policy evaluation
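
The second variation can be sketched as follows. This is only an illustration under stated assumptions: NumPy, a hypothetical array layout (`P[a, s, s2]`, `R[s, a]`), a made-up two-state MDP, and a fixed number of outer iterations instead of a proper stopping test. The parameters `k` and `iterations` are names chosen for this example.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, k=3, iterations=200):
    """Greedy improvement followed by only k evaluation sweeps, repeated."""
    idx = np.arange(R.shape[0])
    V = np.zeros(R.shape[0])
    for _ in range(iterations):
        # Improvement: act greedily with respect to the current estimate V.
        q = R + gamma * np.einsum("ast,t->sa", P, V)
        policy = np.argmax(q, axis=1)
        # Partial evaluation: k sweeps of the Bellman expectation backup.
        for _ in range(k):
            V = R[idx, policy] + gamma * P[policy, idx] @ V
    return policy, V

# Hypothetical MDP used only for illustration.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])

policy, V = modified_policy_iteration(P, R, gamma=0.9)
```

With `k=1` each outer iteration performs a single backup under the greedy policy, which is the same update as value iteration.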

So far, we have discussed the *policy iteration* algorithm, which combines *iterative policy evaluation* and *greedy policy improvement*. **Generalized policy iteration** is the same approach except we can combine **any** policy evaluation algorithm and **any** policy improvement algorithm.

<img src="imgs/generalized_policy_iteration.png">

_theorem_. **Principle of optimality**
<img src="imgs/principle_of_optimality.png">

_def_. **Value iteration**

<img src="imgs/deterministic_value_iter.png">
<img src="imgs/value_iter1.png">
<img src="imgs/value_iter2.png">
<img src="imgs/value_iteration.png">
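
Value iteration applies the Bellman optimality backup directly to the value function, without keeping an explicit policy between sweeps. A minimal synchronous sketch, assuming NumPy, a hypothetical array layout (`P[a, s, s2]`, `R[s, a]`), an assumed stopping threshold `theta`, and a made-up two-state MDP:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-10):
    """Repeat v(s) <- max_a [R(s, a) + gamma * sum_s2 P(s2 | s, a) v(s2)]."""
    V = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < theta:
            # A greedy policy is read off from the final action values.
            return V_new, q.argmax(axis=1)
        V = V_new

# Hypothetical MDP used only for illustration.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])

V_star, pi_star = value_iteration(P, R, gamma=0.9)
```

On this toy problem the values approach `[19, 20]`: staying in state 1 forever is worth $2 / (1 - 0.9) = 20$, and moving there from state 0 is worth $1 + 0.9 \cdot 20 = 19$.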

<img src="imgs/sync_dp.png">

_def_. **Asynchronous dynamic programming** TODO

TODO other ideas
