# The Bellman Equation of Optimality

### Deterministic Case
Watch this: https://www.youtube.com/watch?v=14BfO5lMiuk

Suppose the following *deterministic* case, where every action has a 100% guaranteed outcome. Imagine that our agent observes state $s_0$ and has $N$ available actions to take.

Every action leads to another state, $s_1 \dots s_N$, with a respective reward, $r_1 \dots r_N$.

We will also assume that we know the values, $V_i$, of all states connected to the initial state $s_0$. 

*What will be the best course of action that the agent can take in such a state?* 

Say we now choose a concrete action $a_i$ and calculate the value given this action. The value will be: $V_0(a=a_i) = r_i + V_i$.

For the agent to choose the right action, the agent will need to calculate the resulting values for every action and choose the maximum possible outcome: 

$$V_0 = \max_{a \in 1 \dots N}(r_a + V_a)$$

If we are using a discount factor $\gamma$, we will use:

$$V_0 = \max_{a \in 1 \dots N}(r_a + \gamma V_a)$$
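
As a minimal sketch of the deterministic case, the snippet below evaluates every action and keeps the best one. The function name `best_deterministic_action` and the numbers are made up for illustration; they are not from the original notes.

```python
def best_deterministic_action(rewards, next_values, gamma=1.0):
    """Return (V_0, index of the best action) when every action has one certain outcome."""
    # Value of taking action i: immediate reward plus discounted value of the state it leads to.
    action_values = [r + gamma * v for r, v in zip(rewards, next_values)]
    best_action = max(range(len(action_values)), key=action_values.__getitem__)
    return action_values[best_action], best_action


# Hypothetical example with N = 3 actions.
rewards = [1.0, 2.0, 0.5]      # r_1 ... r_N
next_values = [3.0, 1.0, 4.0]  # V_1 ... V_N, assumed to be known
v0, a_star = best_deterministic_action(rewards, next_values, gamma=0.9)
print(v0, a_star)
```

With these numbers the third action wins even though it has the smallest immediate reward, because the state it leads to is the most valuable one.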

### Stochastic Case
Watch this: https://www.youtube.com/watch?v=aNuOLwojyfg 


What we will need to do in a stochastic case is calculate the expected value for every action, instead of just taking the value of the next state.

Consider a single action available from state $s_0$ with three possible outcomes. That is, a single action can lead to different states with different probabilities:

The action leads to $s_1$ with probability $p_1$, to $s_2$ with probability $p_2$, and to $s_3$ with probability $p_3$. All probabilities must sum to one: $p_1 + p_2 + p_3 = 1$.

Every target state has its own reward: $r_1$ for $s_1$, $r_2$ for $s_2$, and $r_3$ for $s_3$.

Here is how we calculate the expected value after issuing this single action (call it action 1):

$$V_0(a=1) = p_1(r_1 + \gamma V_1) + p_2(r_2 + \gamma V_2) + p_3(r_3 + \gamma V_3)$$

This can be expressed as a sum over all possible outcomes, each weighted by its probability: $V_0(a) = \sum_{s \in S}P_{a,0 \rightarrow s}(r_{s,a} + \gamma V_s)$.

By combining the Bellman equation for the deterministic case with the expected value for stochastic actions, we get the **Bellman optimality equation**:

$$V_0=\max _{a \in A}\sum_{s \in S}P_{a,0 \rightarrow s}(r_{s,a} + \gamma V_s)$$
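
A small sketch of this equation for a single state follows. The data layout (a dict mapping each action to a list of `(probability, reward, next_state)` tuples) and all numbers are assumptions chosen for illustration.

```python
def expected_action_value(outcomes, values, gamma):
    """Expected value of one action: sum over outcomes of p * (r + gamma * V_next)."""
    return sum(p * (r + gamma * values[s_next]) for p, r, s_next in outcomes)


def bellman_optimal_value(actions, values, gamma):
    """Optimal value of the state: the best expected value over all available actions."""
    return max(expected_action_value(o, values, gamma) for o in actions.values())


# Hypothetical values of the states reachable from s_0.
values = {"s1": 1.0, "s2": 2.0, "s3": 3.0}
# Hypothetical actions: "a1" is stochastic with three outcomes, "a2" is deterministic.
actions = {
    "a1": [(0.5, 1.0, "s1"), (0.3, 0.0, "s2"), (0.2, 2.0, "s3")],
    "a2": [(1.0, 1.5, "s2")],
}
gamma = 0.9

v_a1 = expected_action_value(actions["a1"], values, gamma)
v_a2 = expected_action_value(actions["a2"], values, gamma)
v0 = bellman_optimal_value(actions, values, gamma)
print(v_a1, v_a2, v0)
```

Note that a deterministic action is just the special case of one outcome with probability 1, so the same code covers both cases.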

So now (as before): the optimal value of the state equals the value of the action that gives us the maximum possible expected immediate reward plus the discounted long-term reward of the next state.

The definition is **recursive**, meaning the value of a state is defined via the values of the immediately reachable states.

These values not only give us the best reward that we can obtain, they also give us the optimal policy to obtain that reward.

If our agent knows the value of every state, then it automatically knows how to gather all this reward. Therefore, at every state the agent ends up in, it needs to select the action with the maximum expected reward: the sum of the immediate reward and the one-step discounted long-term reward.

## Value of action
To make our life slightly easier, we can define another quantity in addition to the value of state $V_s$: the **value of action, $Q_{s,a}$**.

This equals the total reward we can get by executing action $a$ in state $s$, and it can be defined via $V_s$.

This quantity gave its name to a whole family of methods: **Q-Learning**. Learn more here: https://www.youtube.com/watch?v=aCEvtRtNO-M

In these "Q-Learning" methods, our primary objective is to get values $Q$ for every pair of state and action. 

$$Q_{s,a}=\sum_{s' \in S}P_{a,s \rightarrow s'}(r_{s,a} + \gamma V_{s'})$$

Or in other words: $Q$ for this state $s$ and action $a$ equals ***the expected immediate reward and the discounted long-term reward of the destination state.***

We can also define $V_s$ via $Q_{s,a}$:

$$V_s = \max_{a \in A}Q_{s,a}$$

Meaning: the value of a state equals the value of the best action we can execute from this state.

$Q$ values are much more convenient in practice, because it's much simpler for the agent to make decisions about actions based on $Q$ than based on $V$.

In the case of $Q$, to choose an action in the current state, the agent just needs to calculate $Q$ for all available actions and choose the action with the largest value of $Q$.

If the agent were to do the same using values of states, the agent needs to know not only values, but also probabilities for transitions. In practice, we rarely know them in advance, so the agent needs to estimate transition probabilities for every action and state pair. 
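
The contrast between the two approaches can be sketched as follows. The `model` dict of `(probability, reward, next_state)` tuples, the state and action names, and the numbers are all hypothetical. The point is only that `greedy_from_q` needs a table lookup, while `greedy_from_v` also needs the transition model.

```python
def greedy_from_q(q_table, state, actions):
    """Pick the best action using only the Q table: no transition model needed."""
    return max(actions, key=lambda a: q_table[(state, a)])


def greedy_from_v(values, model, state, actions, gamma):
    """Pick the best action from state values: requires transition probabilities."""
    def expected(a):
        return sum(p * (r + gamma * values[s2]) for p, r, s2 in model[(state, a)])
    return max(actions, key=expected)


actions = ["left", "right"]
gamma = 0.9
values = {"A": 2.0, "B": 1.0}  # hypothetical values of the next states
model = {                      # hypothetical transition model for state "s"
    ("s", "left"): [(0.8, 0.0, "A"), (0.2, 1.0, "B")],
    ("s", "right"): [(1.0, 0.5, "B")],
}

# Build Q for state "s" from V and the model, following the definition of Q_{s,a}.
q_table = {
    key: sum(p * (r + gamma * values[s2]) for p, r, s2 in outcomes)
    for key, outcomes in model.items()
}

v_s = max(q_table[("s", a)] for a in actions)  # V_s = max_a Q_{s,a}
print(greedy_from_q(q_table, "s", actions))
print(greedy_from_v(values, model, "s", actions, gamma))
print(v_s)
```

Both functions agree on the action here, but only the first one would still work if `model` were unknown.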

## The Value Iteration Method
Watch: https://www.youtube.com/watch?v=4KGC_3GWuPY

The **Value iteration algorithm** allows us to numerically calculate the values of states and values of actions of MDPs with known transition probabilities and rewards.

The procedure (for values of states) includes the following steps: 
1. Initialize values of all states $V_i$ to some initial value (usually zero)
2. For every state $s$ in the MDP, perform the **Bellman Update**: $V_s \leftarrow \max_a \sum_{s'} P_{a,s \rightarrow s'}(r_{s,a} + \gamma V_{s'})$
3. Repeat step 2 for some large number of steps or until changes become too small
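
The steps above can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not a reference implementation: the `model` layout, the tiny two-state MDP, and the in-place update order are choices made here for brevity.

```python
def value_iteration(states, actions, model, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Compute state values for an MDP with known transitions.

    model[(s, a)] is a list of (probability, reward, next_state) tuples.
    States without any entry in the model are treated as terminal (value 0).
    """
    values = {s: 0.0 for s in states}  # step 1: initialize to zero
    for _ in range(max_iters):         # step 3: repeat until changes are small
        max_change = 0.0
        for s in states:               # step 2: Bellman update for every state
            candidates = [
                sum(p * (r + gamma * values[s2]) for p, r, s2 in model[(s, a)])
                for a in actions
                if (s, a) in model
            ]
            if not candidates:         # terminal state, nothing to update
                continue
            new_value = max(candidates)
            max_change = max(max_change, abs(new_value - values[s]))
            values[s] = new_value      # updated in place within the sweep
        if max_change < tol:
            break
    return values


# Hypothetical MDP: "safe" ends the episode with reward 1, while "risky"
# ends it with reward 3 half of the time and otherwise stays in "start".
states = ["start", "end"]
actions = ["safe", "risky"]
model = {
    ("start", "safe"): [(1.0, 1.0, "end")],
    ("start", "risky"): [(0.5, 3.0, "end"), (0.5, 0.0, "start")],
}
values = value_iteration(states, actions, model, gamma=0.9)
print(values)
```

For this toy problem the result can be checked by hand: if "risky" is the best action, then $V_{start} = 1.5 + 0.45 V_{start}$, which gives $V_{start} = 1.5 / 0.55 \approx 2.73$.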

In the case of **Action Values (Q)**, only minor modifications to the preceding procedure are required:
1. Initialize all $Q_{s,a}$ to zero
2. For every state $s$ and every action $a$ in this state, perform update: $Q_{s,a} \leftarrow \sum_{s'} P_{a,s \rightarrow s'}(r_{s,a} + \gamma \max_{a'} Q_{s',a'})$
3. Repeat step 2
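
The Q variant can be sketched the same way. As before, the `model` layout, the fixed number of sweeps, and the toy MDP are assumptions made for illustration only.

```python
def q_value_iteration(states, actions, model, gamma=0.9, sweeps=200):
    """Compute action values for an MDP with known transitions.

    model[(s, a)] is a list of (probability, reward, next_state) tuples.
    Pairs missing from the model (e.g. terminal states) keep Q = 0.
    """
    q = {(s, a): 0.0 for s in states for a in actions}  # step 1
    for _ in range(sweeps):                              # step 3
        for (s, a), outcomes in model.items():           # step 2
            q[(s, a)] = sum(
                p * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                for p, r, s2 in outcomes
            )
    return q


# Hypothetical MDP: "safe" ends the episode with reward 1, while "risky"
# ends it with reward 3 half of the time and otherwise stays in "start".
states = ["start", "end"]
actions = ["safe", "risky"]
model = {
    ("start", "safe"): [(1.0, 1.0, "end")],
    ("start", "risky"): [(0.5, 3.0, "end"), (0.5, 0.0, "start")],
}
q = q_value_iteration(states, actions, model, gamma=0.9)
best = max(actions, key=lambda a: q[("start", a)])
print(q[("start", "safe")], q[("start", "risky")], best)
```

Once the table has converged, the greedy policy is read directly from it with a `max` over actions, and the state value is recovered as $V_s = \max_a Q_{s,a}$.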