
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# MDPs and Bellman Equations

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you learn:<br>
 - Agent-Environment interactions
 - Markov processes
  * Probability transitions
  * Episodes
 - Markov Reward Processes
  * Reward 
  * Return
  * Discount factors
  * Value functions
  * Bellman equation
 - Markov Decision Processes
  * Policy 
  * State-value function
  * Action-value function
  * Bellman equations revisited
  * Optimal value function
  * Optimal policy 
<br>

### Out of scope
- [POMDPs](https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process)
- Continuous MDPs
- Infinite MDPs
- Undiscounted MDPs
  
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* [David Silver lecture](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)
* Sutton and Barto, *Reinforcement Learning: An Introduction*, Chapters 2 and 3

### Markovian Process

0. A Markov Decision Process formally describes the agent-environment setting
0. Reminder: the environment is fully observable. What does that mean?
0. Almost all RL problems can be formalized as MDPs
0. Markov Property (reminder):
$$ P[ S\_{t+1} \bigm\vert S\_{t} ] = P[ S\_{t+1} \bigm\vert S\_{1},S\_{2},...,S\_{t} ] $$
 * What is the interpretation?

### Transition Matrix

\\(\rho\_{ss'} = P[ S\_{t+1} = s' \bigm\vert S\_{t} = s]\\).

This is the probability of moving to state \\(s'\\) at the next step, given that the current state is \\(s\\). We can collect the transition probabilities for all pairs of states in a matrix:

\\(\rho = \begin{pmatrix} 
\rho\_{11} & \rho\_{12} & \cdots & \rho\_{1n}
\\\ \vdots & \vdots & \ddots & \vdots
\\\ \rho\_{n1} & \rho\_{n2} & \cdots & \rho\_{nn} \end{pmatrix} \\)

**Question:**
<br/>
What can you say about the sum of each row and of each column?
<br/><br/><br/>
A **Markov Process**, also known as a Markov chain:

0. It is a tuple \\(\langle S,\rho \rangle\\)
0. \\(S\\) is a finite set of states
0. \\(\rho\\) is the state transition probability matrix
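
To make the idea of an episode concrete, the following is a minimal sketch of sampling a state sequence from a Markov chain. The three states and their transition probabilities are hypothetical (they are not taken from the exercise graph), and `sample_episode` is an illustrative helper name.

```python
import numpy as np

# Hypothetical 3-state chain; the probabilities are illustrative only
# and are NOT taken from the exercise graph.
states = ["A", "B", "C"]
rho = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0],   # "C" is absorbing in this toy example
])

def sample_episode(rho, start, n_steps, rng):
    """Sample a list of n_steps + 1 state indices, beginning at `start`."""
    episode = [start]
    for _ in range(n_steps):
        current = episode[-1]
        # The next state is drawn using only the current state's row of rho.
        nxt = rng.choice(len(rho), p=rho[current])
        episode.append(int(nxt))
    return episode

rng = np.random.default_rng(seed=0)
episode = sample_episode(rho, start=0, n_steps=10, rng=rng)
print([states[i] for i in episode])
```

Because each step looks only at `rho[current]`, the sampler never needs the earlier part of the episode, which is the Markov property in action.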

### Exercise 1 ###
<br/><br/>
![MDPs](https://files.training.databricks.com/images/rl/mdps.png)
<br>
For the graph above:
0. Write \\(\rho\\), the state transition probability matrix.
0. What is the probability of ending in F given you are at state S?
0. What is the probability of ending in F after two tries given you are at state S? What are the assumptions behind your answer?

### Markov Reward Process ###
A Markov Reward Process (MRP) is a Markov chain with rewards, a tuple \\(\langle S, \rho, R, \gamma \rangle\\) such that:
0. \\(P[ S\_{t+1}, R\_{t+1} \bigm\vert S\_{t} ] = P[ S\_{t+1}, R\_{t+1} \bigm\vert S\_{1},S\_{2},...,S\_{t}]\\)
0. \\(R\\) is a **reward function**, that is, \\(R\_{s} = E[R\_{t+1} \bigm\vert S\_{t} = s]\\)
0. \\(\gamma\\) is a discount factor. \\(\gamma \in [0,1]\\).

The **return** \\(G\_{t}\\) is the total discounted reward from time step \\(t\\) onward:
\\(G\_{t} = R\_{t+1}+ \gamma R\_{t+2} + \gamma^2 R\_{t+3}+ ...\\)
<br/><br/>
**Questions**
0. Why \\(\gamma \in [0,1]\\)? What happens when you have cyclic cases?
0. What does \\(\gamma = 0 \\) imply? In what scenario do you use that?
0. What does \\(\gamma = 1 \\) imply? In what scenario do you use that?

### Value Function ###

0. The long-term value of state \\(s\\), i.e., the return we expect to get given that we are in state \\(s\\):
$$ v(s) = E[G\_{t} \bigm\vert S\_{t} = s] $$

### Exercise 2 ###

Refer to the graph in Exercise 1. Assume the rewards of states **\\(S, R, F\\) are -1, -3, 1**, respectively.
<br/><br/><br/>
For each of the following episodes, calculate \\(G\_{1}\\). Assume we start at \\(S\\) and \\(\gamma = 0.5 \\). You may want to write code to do the computation.
0. \\(S, S, S, S, ...\\)
0. \\(S, R, S, R, S, R, S, R, S, R, ...\\)
0. \\(S, R, F, S, R, F, S, R, F, ...\\)
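
As a starting point for the computation, here is a minimal sketch of a helper that evaluates the discounted return of a finite reward sequence. The function name `discounted_return` and the example rewards are hypothetical; the example deliberately does not use the exercise's episodes.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so that G = r + gamma * G_next holds at every step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative rewards only (not the exercise episodes).
example_rewards = [1.0, 2.0, 3.0]
print(discounted_return(example_rewards, gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```

The episodes in the exercise are infinite, so one option is to truncate them after \\(n\\) rewards. With \\(\gamma < 1\\), the ignored tail is bounded in absolute value by \\(\gamma^n \max|r| / (1-\gamma)\\), which is negligible for \\(\gamma = 0.5\\) after a few dozen terms.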

### Bellman Equations for Markov Reward Processes ###

\\( \begin{aligned} \\\ v(s) &= E[G\_{t} \bigm\vert S\_{t} = s] \\\  &= E[R\_{t+1} + \gamma R\_{t+2} + \gamma^2 R\_{t+3}+ ... \bigm\vert S\_{t} = s]  \\\   &= E[R\_{t+1} + \gamma (R\_{t+2} + \gamma R\_{t+3}+ ...) \bigm\vert S\_{t} = s] \\\
&= E[R\_{t+1} + \gamma G\_{t+1} \bigm\vert S\_{t} = s]\\\
&= E[R\_{t+1} + \gamma v(S\_{t+1}) \bigm\vert S\_{t} = s]\\\
&= R\_{s} + \gamma \sum\_{s'\in S} \rho\_{ss'} v(s')
\end{aligned}\\)

The last line gives one linear equation per state. In matrix form: \\(v = R + \gamma \rho v\\).

### Questions ###

0. What is the computational complexity of solving the above system of linear equations?
0. Find the eigenvalues and eigenvectors of the transition probability matrix in the previous exercise. What do they signify?
0. How would you approach solving large MRPs?
  - Dynamic Programming (more to come later)
  - MC evaluation (more to come)
  - Temporal-difference Learning (if time permits)
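
For a small MRP, the Bellman equation can be solved directly, since \\(v = R + \gamma \rho v\\) rearranges to \\((I - \gamma \rho) v = R\\). The sketch below compares that direct solve with a simple iterative backup on a hypothetical 3-state MRP; the transition matrix, rewards, and \\(\gamma\\) are made-up values, not the exercise graph.

```python
import numpy as np

# Hypothetical 3-state MRP; all numbers are illustrative only.
rho = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0],
])
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward per state
gamma = 0.9

# Direct solution of the linear system (I - gamma * rho) v = R.
v_direct = np.linalg.solve(np.eye(3) - gamma * rho, R)

# Iterative alternative: repeatedly apply the backup v <- R + gamma * rho v.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * rho @ v_iter

print(v_direct)
print(v_iter)
```

Both approaches should agree up to numerical precision here. The iterative form is the one that generalizes to the methods listed above when the state space is too large for a direct solve.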

### Markov Decision Processes (MDPs) & Policies ###

An **MDP** is a tuple \\(\langle S, A, \rho, R, \gamma \rangle\\) such that:
0. \\(P[ S\_{t+1}, R\_{t+1} \bigm\vert S\_{t}, A\_{t} ] = P[ S\_{t+1}, R\_{t+1} \bigm\vert S\_{1},A\_{1},S\_{2},A\_{2},...,S\_{t},A\_{t} ]\\)
0. \\(A\\) is a finite set of actions
0. \\(\rho^a\_{ss'} = P[ S\_{t+1} = s' \bigm\vert S\_{t} = s, A\_{t} = a]\\)
0. \\(R\\) is a reward function, i.e., \\(R^a\_{s} = E[R\_{t+1} \bigm\vert S\_{t} = s, A\_{t} = a]\\)
<br/><br/>

A **Policy** is a distribution over actions given states, i.e., it specifies how likely each action is in a given state. Formally, it can be written as:
$$ \pi(a \bigm\vert s) = P[A\_{t} = a \bigm\vert S\_{t} = s] $$

0. If you know the policy, you know how an agent will behave
0. Remember the definition of MDPs: MDP policies depend only on the current state, not on the history. You can throw away all history!
0. What is another name for the property in item 2?

### Questions ###
0. If we have an MDP, is the sequence of states \\(S\_{1}, S\_{2}, S\_{3}, S\_{4}, ... \\) a Markov process? If so, write down the transition probability formula from \\(s\\) to \\(s'\\) under policy \\(\pi\\), i.e., \\(\rho^\pi\_{s,s'}\\). If not, why not?
0. Is the sequence \\(S\_{1}, R\_{2}, S\_{2}, R\_{3}, S\_{3}, ... \\) an MRP? If so, write down the reward function formula \\(R^\pi\_{s}\\). If not, why not?
0. How does the **value function** change for an MDP? Can you write its formula?

### State-value and action-value functions ###
$$ v\_{\pi}(s) = E\_{\pi}[G\_{t} \bigm\vert S\_{t} = s] $$
$$ q\_{\pi}(s, a) = E\_{\pi}[G\_{t} \bigm\vert S\_{t} = s, A\_{t} = a] $$

### Bellman Expectation Equation ###

\\( \begin{aligned} \\\ q\_{\pi}(s,a) &= E\_{\pi}[G\_{t} \bigm\vert S\_{t} = s, A\_{t} = a] \\\  &= E\_{\pi}[R\_{t+1} + \gamma R\_{t+2} + \gamma^2 R\_{t+3}+ ... \bigm\vert S\_{t} = s, A\_{t} = a]  \\\   &= E\_{\pi}[R\_{t+1} + \gamma (R\_{t+2} + \gamma R\_{t+3}+ ...) \bigm\vert S\_{t} = s, A\_{t} = a] \\\
&= E\_{\pi}[R\_{t+1} + \gamma G\_{t+1} \bigm\vert S\_{t} = s, A\_{t} = a]\\\
&= E\_{\pi}[R\_{t+1} + \gamma q\_{\pi}(S\_{t+1}, A\_{t+1}) \bigm\vert S\_{t} = s, A\_{t} = a]\\\
&= R^a\_{s}+\gamma \sum\_{s'\in S} \rho\_{ss'}^a v\_{\pi}(s')\\\
&= R^a\_{s}+\gamma \sum\_{s'\in S} \rho\_{ss'}^a \sum\_{a'\in A} \pi(a'\bigm\vert s') q\_{\pi}(s',a')
\end{aligned}\\)


\\( \begin{aligned} \\\ v\_{\pi}(s) &= E\_{\pi}[G\_{t} \bigm\vert S\_{t} = s] \\\  &= E\_{\pi}[R\_{t+1} + \gamma R\_{t+2} + \gamma^2 R\_{t+3}+ ... \bigm\vert S\_{t} = s]  \\\   &= E\_{\pi}[R\_{t+1} + \gamma (R\_{t+2} + \gamma R\_{t+3}+ ...) \bigm\vert S\_{t} = s] \\\
&= E\_{\pi}[R\_{t+1} + \gamma G\_{t+1} \bigm\vert S\_{t} = s]\\\
&= E\_{\pi}[R\_{t+1} + \gamma v\_{\pi}(S\_{t+1}) \bigm\vert S\_{t} = s]\\\
&= \sum\_{a\in A} \pi(a\bigm\vert s) q\_{\pi}(s,a)\\\
&= \sum\_{a\in A} \pi(a\bigm\vert s) (R^a\_{s}+\gamma \sum\_{s'\in S} \rho\_{ss'}^a v\_{\pi}(s'))
\end{aligned}\\)
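
The two Bellman expectation equations can be applied repeatedly as update rules to evaluate a fixed policy. The following is a minimal sketch under stated assumptions: the 2-state, 2-action MDP, the policy `pi`, and the helper name `evaluate_policy` are all hypothetical, and the array layouts (`rho[a, s, s2]`, `R[s, a]`, `pi[s, a]`) are one convenient choice among several.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions; all numbers are illustrative.
# rho[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
rho = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.0, 1.0]],   # action 1
])
# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
# pi[s, a] = probability of taking action a in state s (hypothetical policy)
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])
gamma = 0.9

def evaluate_policy(rho, R, pi, gamma, n_sweeps=2000):
    """Iterate the Bellman expectation equations for a fixed policy."""
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        # q[s, a] = R[s, a] + gamma * sum_s2 rho[a, s, s2] * v[s2]
        q = R + gamma * np.einsum("ast,t->sa", rho, v)
        # v[s] = sum_a pi[s, a] * q[s, a]
        v = (pi * q).sum(axis=1)
    q = R + gamma * np.einsum("ast,t->sa", rho, v)
    return v, q

v_pi, q_pi = evaluate_policy(rho, R, pi, gamma)
print(v_pi)
print(q_pi)
```

After enough sweeps, `v_pi` and `q_pi` should satisfy both equations above simultaneously, which is a useful sanity check when experimenting with other policies.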


### Questions ###
0. Can you flatten an MDP into an MRP? If so, how?

### Optimal Value functions ###
$$ v\_{\*}(s) = \max\_{\pi} v\_{\pi}(s) $$

$$ q\_{\*}(s, a) = \max\_{\pi} q\_{\pi}(s,a) $$

The **optimal state-value function** \\(v\_{\*}(s)\\) is the maximum state-value function over all policies. Similarly, the **optimal action-value function** \\(q\_{\*}(s,a)\\) is the maximum action-value function over all policies.

### Optimal Policy ###
A policy \\(\pi\\) is better than or equal to a policy \\(\pi'\\), written \\(\pi \geq \pi' \\), if \\(v\_{\pi}(s) \geq v\_{\pi'}(s), \forall s \\).

### Theorem ###
For any Markov Decision Process:
0. There exists an optimal policy \\(\pi\_{\*}\\) that is better than or equal to all other policies, \\(\pi\_{\*} \geq \pi, \forall \pi \\)
0. All optimal policies achieve the optimal value function, \\(v\_{\pi\_{\*}}(s) = v\_{\*}(s) \\).
0. All optimal policies achieve the optimal action-value function, \\(q\_{\pi\_{\*}}(s,a) = q\_{\*}(s,a) \\).

### How can you find an optimal policy (more on this later)? ###

0. It can be found by maximizing over \\(q\_{\*}(s,a)\\), i.e., pick an action that maximizes \\(q\_{\*}\\) at each state
0. There is always a deterministic optimal policy for any MDP
0. If we know \\(q\_{\*}(s,a)\\), we have the optimal policy, i.e., the agent knows what to do at each state
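
The first point can be sketched in a few lines: given a table of optimal action values, a deterministic policy falls out of a per-state argmax. The `q_star` values below are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Hypothetical optimal action values q_star[s, a] for 3 states and 2 actions.
q_star = np.array([
    [1.0, 2.5],
    [0.3, 0.1],
    [4.0, 4.0],   # tie: argmax picks the first maximizing action
])

# Deterministic greedy policy: one action index per state.
greedy_actions = q_star.argmax(axis=1)
print(greedy_actions)  # [1 0 0]
```

In the tied state either action is optimal, so more than one deterministic optimal policy can exist.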

### Putting everything together, Bellman Optimality equations ###

$$ \begin{aligned} \\\ v\_{\*}(s) &= \max\_{a}q\_{\*}(s,a)\\\ &= \max\_{a} \left( R\_{s}^a + \gamma \sum\_{s'\in S} \rho\_{ss'}^a v\_{\*}(s') \right) \end{aligned} $$

$$ \begin{aligned} \\\ q\_{\*}(s,a) &= R^a\_{s}+\gamma \sum\_{s'\in S} \rho\_{ss'}^a v\_{\*}(s') \\\ &= R^a\_{s}+\gamma \sum\_{s'\in S} \rho\_{ss'}^a \max\_{a'}q\_{\*}(s',a')  \end{aligned} $$

### Solving Bellman Optimality Equations (more on this later) ###

0. The equations are non-linear because of the max operator, so they cannot be solved directly with linear algebra
0. There is no closed-form solution in general, although some special cases admit one
0. Approaches:
 - Value iteration
 - Policy iteration
 - Q-learning
 - Sarsa
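
As a preview of the first approach, here is a minimal value iteration sketch that turns the Bellman optimality equation into an update rule. The MDP below is hypothetical, as are the helper name `value_iteration` and its `tol` and `max_sweeps` parameters; this is an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions; all numbers are illustrative.
# rho[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
rho = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.0, 1.0]],   # action 1
])
# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
gamma = 0.9

def value_iteration(rho, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate v(s) <- max_a (R[s, a] + gamma * sum_s2 rho[a, s, s2] * v[s2])."""
    v = np.zeros(R.shape[0])
    for _ in range(max_sweeps):
        q = R + gamma * np.einsum("ast,t->sa", rho, v)
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Recompute q from the final v and read off a greedy policy.
    q = R + gamma * np.einsum("ast,t->sa", rho, v)
    return v, q, q.argmax(axis=1)

v_star, q_star, policy = value_iteration(rho, R, gamma)
print(v_star)
print(policy)
```

The stopping rule compares successive value estimates; with \\(\gamma < 1\\) the update is a contraction, so the loop is expected to stop well before `max_sweeps` on a problem this small.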

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>