# State and action values
This is a brief overview of state and action values.

The newest edition of the book by Sutton & Barto gives a good explanation, but its beginning is weak.

https://www.deeplearningwizard.com/deep_learning/deep_reinforcement_learning_pytorch/bellman_mdp/#bellman-expectation-equations
***

## Important probabilities

<i>The parts regarding probabilities are fleshed out in more detail in [Meaning_of_Psasr_probability_matrix.ipynb](data_processing/neural_networks/RL_Reinforced_Learning/Meaning_of_Psasr_probability_matrix.ipynb).

Here I change a few things because I now have a better understanding of the topic and notation.</i>

The joint probability of a transition $(s,a,s^\prime,r)$ can be written as
$$p(s,a,s^\prime,r) = p(s) \cdot p(a|s) \cdot p(s^\prime|s,a) \cdot p(r|s,a,s^\prime)$$
The following part is controlled by the system, i.e. the environment (see link for how to 'condition away' $p(s) \cdot p(a|s)$):
$$p(s^\prime, r|s,a) = p(s^\prime|s,a) \cdot p(r|s,a,s^\prime)$$
The following is referred to as the policy, which describes how the 'agent' behaves:
$$\pi(a|s) = p(a|s)$$
***

## Agent's policy (behavior rules) $\pi(a|s)$

The policy may be uniform random (the probability is inversely proportional to the number of actions available at $s$):
$$\pi(a|s) = \frac{1}{|A(s)|}$$
Or it can be deterministic. Say only action $K$ is relevant:
$$ \pi(a_i|s)= \begin{cases}
    1; \ i = K\\
    0; \ i \neq K
    \end{cases}$$

Or anything in-between
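
As a minimal sketch, the three cases above can be stored as tables with one row per state. The sizes `n_states` and `n_actions`, the index `K` and the 50/50 mixture are assumptions made purely for illustration, and every state is assumed to share the same action set.

```python
import numpy as np

n_states, n_actions = 3, 4      # illustrative sizes (assumption)
K = 2                           # index of the only relevant action (assumption)

# Uniform random policy: pi(a|s) = 1 / |A(s)|
pi_uniform = np.full((n_states, n_actions), 1.0 / n_actions)

# Deterministic policy: pi(a_K|s) = 1, every other action gets 0
pi_det = np.zeros((n_states, n_actions))
pi_det[:, K] = 1.0

# "Anything in-between": a convex mixture of two policies is again a policy
pi_mixed = 0.5 * pi_uniform + 0.5 * pi_det

for policy in (pi_uniform, pi_det, pi_mixed):
    # each row must be a probability distribution over actions
    assert np.allclose(policy.sum(axis=1), 1.0)
```
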
***

## Expected immediate reward after reaching state $s,a,s^\prime \rightarrow r(s,a,s^\prime)$

Stochasticity of the system might give rewards randomly drawn from some distribution (for a specific tuple $(s,a,s^\prime)$).

In order to compute the expected reward $\mathbb{E}[R(s,a,s^\prime)]$ we compute a weighted sum:
$$r(s,a,s^\prime) = \mathbb{E}[R(s,a,s^\prime)] = \sum_r  r\cdot p(r|s,a,s^\prime)$$
_Here $R(s,a,s^\prime)$ is just a random variable._

We can express $p(r|s,a,s^\prime)$ by rearranging the chain rule for the system dynamics, $p(s^\prime, r|s,a) = p(s^\prime|s,a) \cdot p(r|s,a,s^\prime)$:
$$p(r|s,a,s^\prime) = \frac{p(s^\prime,r|s,a)}{p(s^\prime|s,a)}$$

So the expected immediate reward for the transition $(s,a) \rightarrow s^\prime$ is:
$$\boxed{r(s,a,s^\prime) = \frac{\sum_r  r\cdot p(s^\prime,r|s,a)}{p(s^\prime|s,a)}}$$
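
The boxed formula can be checked numerically. The sketch below assumes a made-up toy system with 2 states, 2 actions and rewards in $\{0, 1\}$, stored in a hypothetical array `p[s, a, s_next, r_index]` $= p(s^\prime,r|s,a)$; none of the numbers come from a real problem. Pairs with $p(s^\prime|s,a) = 0$ are unreachable, so their expected reward is undefined and is set to 0 here by convention.

```python
import numpy as np

rewards = np.array([0.0, 1.0])            # possible reward values (assumption)

# p[s, a, s_next, r_index] = p(s', r | s, a) for a made-up 2-state, 2-action system
p = np.zeros((2, 2, 2, 2))
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]

p_next = p.sum(axis=3)                    # p(s'|s,a): marginalize over r
weighted = (p * rewards).sum(axis=3)      # sum_r r * p(s', r | s, a)

# r(s,a,s') = weighted / p(s'|s,a); unreachable s' are left at 0
r_sas = np.divide(weighted, p_next, out=np.zeros_like(weighted), where=p_next > 0)
```

For example, `r_sas[0, 0, 1]` is $0.25 / 0.5 = 0.5$: half of the transitions from $(s{=}0, a{=}0)$ into $s^\prime{=}1$ carry reward 1.
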
***

$p(s^\prime|s,a)$ is a marginalization of $p(s^\prime,r|s,a)$ over $r$
$$p(s^\prime|s,a) = \sum_r p(s^\prime,r|s,a)$$
<i>It is so because, by conditioning, $p(s^\prime,r|s,a)$ is normalized:</i>
$$\sum_r \sum_{s^\prime} p(s^\prime,r|s,a) = 1$$
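
A short numerical check of the marginalization and the normalization, assuming a made-up toy system (2 states, 2 actions, 2 reward values) stored in a hypothetical array `p[s, a, s_next, r_index]`:

```python
import numpy as np

# p[s, a, s_next, r_index] = p(s', r | s, a); all numbers are illustrative
p = np.zeros((2, 2, 2, 2))
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]

# Marginalize over r (last axis) to get p(s'|s,a)
p_next = p.sum(axis=3)

# Normalization: for every (s, a) the sum over s' and r equals 1
total = p.sum(axis=(2, 3))
assert np.allclose(total, 1.0)

# Consequently p(s'|s,a) is itself a distribution over s'
assert np.allclose(p_next.sum(axis=2), 1.0)
```
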

***

## Expected immediate reward for using action $a$ in state $s,a \rightarrow \hat q(s,a)$
The system might be stochastic in terms of the target states that can be reached by using action $a$ in state $s$.<br>
We have to calculate the expected reward $\mathbb{E}[r(s,a,s^\prime)]$ based on the probability that the agent transitions to some state $s^\prime$ by using action $a$:

$$\hat q(s,a) = \mathbb{E}[r(s,a,s^\prime)] = \sum_{s^\prime \in S} p(s^\prime | s,a) \cdot r(s,a,s^\prime)$$
We have calculated $r(s,a,s^\prime)$ previously; substituting it cancels $p(s^\prime|s,a)$:
$$\hat q(s,a) = \sum_{s^\prime \in S} p(s^\prime | s,a) \cdot \frac{\sum_{r \in R} r \cdot p(s^\prime, r|s,a)}{p(s^\prime | s,a) } = \sum_{r \in R} \sum_{s^\prime \in S} r \cdot p(s^\prime, r|s,a) $$
$$\boxed{\hat q(s,a) =  \sum_{r \in R} \sum_{s^\prime \in S} r \cdot p(s^\prime, r|s,a) }$$
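
The cancellation of $p(s^\prime|s,a)$ can be verified on numbers. This is only a sketch on a made-up toy system (hypothetical array `p[s, a, s_next, r_index]`, rewards in $\{0,1\}$); it computes $\hat q(s,a)$ both directly from the boxed formula and via the two-step route through $r(s,a,s^\prime)$.

```python
import numpy as np

rewards = np.array([0.0, 1.0])            # possible reward values (assumption)
p = np.zeros((2, 2, 2, 2))                # p[s, a, s_next, r_index], illustrative numbers
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]

# Direct form: q_hat(s,a) = sum_r sum_s' r * p(s', r | s, a)
q_hat = (p * rewards).sum(axis=(2, 3))

# Two-step form: q_hat(s,a) = sum_s' p(s'|s,a) * r(s,a,s')
p_next = p.sum(axis=3)
weighted = (p * rewards).sum(axis=3)
r_sas = np.divide(weighted, p_next, out=np.zeros_like(weighted), where=p_next > 0)
q_hat_two_step = (p_next * r_sas).sum(axis=2)

assert np.allclose(q_hat, q_hat_two_step)
```
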
***

## Expected immediate reward for being in state $s\rightarrow \hat v_{\pi}(s)$
The expected immediate reward $\mathbb{E}_\pi[r(s,a)]$ __depends on the policy__ $\pi(a|s)$ the agent follows.

For example, an agent can follow a policy of always picking an action that yields zero reward, so its expected immediate reward will be 0.

_This is the reason why we specify the subscript $\pi$: to differentiate rewards achieved in the same system under different policies._

$$\boxed{\hat v_\pi(s) = \mathbb{E}_\pi[r(s,a)] = \sum_a \pi(a|s) \cdot \hat q(s,a)}$$

We plug in definition of $\hat q(s,a)$:

$$\boxed{\hat v_\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s^\prime \in S}\sum_{r \in R}r \cdot p(s^\prime, r|s,a)}$$
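
To illustrate the dependence on the policy, the sketch below evaluates $\hat v_\pi(s)$ for two policies on a made-up toy system (hypothetical array `p[s, a, s_next, r_index]`, rewards in $\{0,1\}$). The policy names and all numbers are assumptions for illustration only.

```python
import numpy as np

rewards = np.array([0.0, 1.0])            # possible reward values (assumption)
p = np.zeros((2, 2, 2, 2))                # p[s, a, s_next, r_index], illustrative numbers
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]

q_hat = (p * rewards).sum(axis=(2, 3))    # expected immediate reward per (s, a)

pi_uniform = np.full((2, 2), 0.5)                   # pi[s, a], uniform random
pi_always_a0 = np.array([[1.0, 0.0], [1.0, 0.0]])   # deterministic: always action 0

# v_hat_pi(s) = sum_a pi(a|s) * q_hat(s,a)
v_hat_uniform = (pi_uniform * q_hat).sum(axis=1)
v_hat_always_a0 = (pi_always_a0 * q_hat).sum(axis=1)
```

In state 1 the second policy always picks an action with zero reward, so `v_hat_always_a0[1]` is 0 even though the system itself is unchanged.
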
***

## Bellman equations

### Introduction of episode and time
Bellman equations allow us to estimate all future rewards for a particular state $s$ (or state-action pair).

This implies that we have a trajectory of transitions along which rewards are gathered.

Each transition is assumed to happen at a specific time step $t$. We introduce notation for the event trajectory:

$$S_{t = 0} \rightarrow A_{t = 0} \rightarrow R_{t = 1} \rightarrow S_{t = 1} \rightarrow \dots \rightarrow S_{t = T}= \{S_0,A_0,R_1,S_1,\dots, S_T\}$$

Where $T$ is trajectory termination time, if it exists.
***
### Discounted future (cumulative) reward
Let's examine a single such trajectory/experiment/trial.

For some time step $t$ and state $S_t$ we can calculate cumulative rewards $G_t$.

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T = \sum_{i = t+1}^{T} \gamma^{i-t-1} R_i$$
Where $\gamma \in [0,1]$ is the discount factor, which keeps the sum finite for infinite episodes (for $\gamma < 1$ and bounded rewards).
***


(Optional) Deriving power offsets:
$$\gamma^0 R_{t+1} \rightarrow \gamma^{0 + \alpha} R_{t+1 + \alpha} = \gamma^{\alpha} R_{t+1 + \alpha}$$ 
$$ t+1 + \alpha = T \rightarrow \alpha = T - (t+1)$$
$$\gamma^{\alpha} R_T \rightarrow \gamma^{T- t - 1} R_T$$
***

(Optional) Move sum to 0-start:
$$G_t = \sum_{i = t+1}^{T} \gamma^{i-t-1} R_i$$
$$j  = i - (t+1)\rightarrow j_{min} = 0 \rightarrow i_{min} = t+ 1$$
$$ i_{max} = T\rightarrow j_{max} = T - t - 1$$
$$i = j + t + 1$$
$$G_t= \sum_{j = 0}^{T - t - 1} \gamma^j R_{t + j + 1}$$
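
The 0-start form maps directly onto a loop. In this sketch the helper name `discounted_return` and the reward list are hypothetical; the list `R` stores $R_1, \dots, R_T$, so `R[k]` holds $R_{k+1}$.

```python
def discounted_return(R, t, gamma):
    """G_t = sum_{j=0}^{T-t-1} gamma^j * R_{t+j+1}, where R[k] holds R_{k+1}."""
    T = len(R)
    return sum(gamma**j * R[t + j] for j in range(T - t))


R = [1.0, 0.0, 2.0]     # R_1, R_2, R_3 of one made-up episode, so T = 3
gamma = 0.5

G0 = discounted_return(R, 0, gamma)   # 1.0 + 0.5 * 0.0 + 0.25 * 2.0
G1 = discounted_return(R, 1, gamma)   # 0.0 + 0.5 * 2.0
G2 = discounted_return(R, 2, gamma)   # 2.0
```
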
***

### Expected discounted cumulative reward at state $s$ (and $(s,a)$ pair)
We can define the state value $v_{\pi}(s)$, the expected discounted cumulative reward at state $s$, as:

$$v_{\pi}(s) = \mathbb{E}_\pi [G_t|S_t = s] = \mathbb{E}_\pi \bigg[ \sum_{j = 0}^{\infty} \gamma^j R_{t + j + 1}|S_t = s\bigg]$$
Here we changed the upper bound for convenience, given that all rewards past the terminal state are 0.

Also we can define action value $q_{\pi}(s,a)$ which includes conditioning that we have taken action $a$:

$$q_{\pi}(s,a) = \mathbb{E}_\pi [G_t|S_t = s, A_t = a] = \mathbb{E}_\pi \bigg[ \sum_{j = 0}^{\infty} \gamma^j R_{t + j + 1}|S_t = s, A_t = a\bigg]$$
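The impact of conditioning in these definitions is not obvious. Fixing the action removes the policy's randomness from the first step only: the first reward $R_{t+1}$ is still random through $p(s^\prime, r|s,a)$, and all later actions are still drawn from $\pi$.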
***

### Decoupling immediate rewards and future rewards
We can consider two cases:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$
and $G_{t+1}$:
$$G_{t+1} = R_{t+2} + \gamma R_{t+3} + \dots + \gamma^{T-(t+1)-1} R_T$$
multiply by $\gamma$
$$\gamma G_{t+1}  = \gamma R_{t+2} + \dots + \gamma\gamma^{T-t-2} R_T = \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$
We see that we can rewrite $G_t$ as:
$$G_t = R_{t+1} + \underbrace{\gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T}_{\gamma G_{t+1}} = R_{t+1} + \gamma G_{t+1}$$
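
The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ lets us compute every return of an episode in one backward pass. A minimal sketch, with a hypothetical helper `returns_backward` and a made-up reward list where `R[k]` holds $R_{k+1}$:

```python
def returns_backward(R, gamma):
    """Return [G_0, ..., G_T] using G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    T = len(R)
    G = [0.0] * (T + 1)              # G[T] = 0: no rewards after termination
    for t in reversed(range(T)):
        G[t] = R[t] + gamma * G[t + 1]
    return G


R = [1.0, 0.0, 2.0]                  # R_1, R_2, R_3 of one made-up episode
gamma = 0.5
G = returns_backward(R, gamma)

# Cross-check G_0 against the explicit sum R_1 + gamma R_2 + gamma^2 R_3
explicit_G0 = sum(gamma**j * r for j, r in enumerate(R))
assert abs(G[0] - explicit_G0) < 1e-12
```
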
***

So the state value can be written as:
$$v_{\pi}(s) = \mathbb{E}_\pi [G_t|S_t = s] = \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1}|S_t = s]$$
The expression can be split via linearity of expectation, with the conditioning carried over to both terms.
$$v_{\pi}(s) = \mathbb{E}_\pi [R_{t+1}|S_t = s] + \mathbb{E}_\pi [\gamma G_{t+1}|S_t = s]$$
_NOTE: Interesting aspect is that for a single episode, due to Markov property, conditioning on history longer than 1 step makes no difference._

The left part is the expected immediate reward $\hat v_\pi(s)$ defined above:
$$\mathbb{E}_\pi [R_{t+1}|S_t = s] = \sum_{a \in A} \pi(a|s) \sum_{s^\prime \in S}\sum_{r \in R}r \cdot p(s^\prime, r|s,a)$$
One can rewrite it via
$$r(s,a,s^\prime) = \frac{\sum_r  r\cdot p(s^\prime,r|s,a)}{p(s^\prime|s,a)}$$
as
$$\mathbb{E}_\pi [R_{t+1}|S_t = s] = \sum_{a \in A} \pi(a|s) \sum_{s^\prime \in S} p(s^\prime|s,a) \cdot r(s,a,s^\prime) $$
The right part is weighted over all $(a,s^\prime)$ pairs in the same way, using $\mathbb{E}_\pi [G_{t+1}|S_{t+1} = s^\prime] = v_{\pi}(s^\prime)$:
$$\mathbb{E}_\pi [ \gamma G_{t+1} |S_{t} = s] = \sum_{a \in A} \pi(a|s) \sum_{s^\prime \in S} p(s^\prime|s,a) \cdot  \gamma v_{\pi}(s^\prime)$$

<i>Note: The notation is a bit loose, but it is clear that future cumulative rewards are captured by the weighted, discounted sum of next-state values.</i>

So $$\boxed{v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \sum_{s^\prime \in S} p(s^\prime|s,a) \cdot \bigg[ r(s,a,s^\prime) + \gamma v_{\pi}(s^\prime) \bigg]}$$

By using definition for $r(s,a,s^\prime)$ and

$$p(s^\prime|s,a) = \sum_r p(s^\prime,r|s,a)$$ 

we can express $v_{\pi}(s)$ in terms of $p(s^\prime,r|s,a)$


$$\boxed{v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \sum_{s^\prime, r} p(s^\prime,r|s,a) \cdot \bigg[ r + \gamma v_{\pi}(s^\prime) \bigg]}$$
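
The boxed equation is a fixed-point condition on $v_\pi$, so one way to solve it is to apply the right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this on a made-up toy system (hypothetical array `p[s, a, s_next, r_index]`, rewards in $\{0,1\}$, uniform policy, $\gamma = 0.9$; all assumptions) and cross-checks the result against a direct linear solve of $v = r_\pi + \gamma P_\pi v$.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])            # possible reward values (assumption)
p = np.zeros((2, 2, 2, 2))                # p[s, a, s_next, r_index], illustrative numbers
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]
pi = np.full((2, 2), 0.5)                 # pi[s, a], uniform random policy


def bellman_backup(v):
    """Right-hand side of the Bellman expectation equation for v_pi."""
    # target[s_next, r_index] = r + gamma * v(s')
    target = rewards[None, :] + gamma * v[:, None]
    # sum over a, s' (index n) and r of pi(a|s) * p(s',r|s,a) * target
    return np.einsum("sa,sanr,nr->s", pi, p, target)


v = np.zeros(2)
for _ in range(1000):                     # gamma < 1, so the iteration contracts
    v = bellman_backup(v)

# Cross-check with the closed form v = (I - gamma * P_pi)^(-1) r_pi
P_pi = np.einsum("sa,sanr->sn", pi, p)             # state-to-state transition matrix
r_pi = np.einsum("sa,sanr,r->s", pi, p, rewards)   # expected immediate reward
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

assert np.allclose(v, v_exact)
```
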

The definition of $q_{\pi}(s,a)$ is analogous to that of $\hat q(s,a)$ in that the summation over $a$ is absent:

$$\begin{matrix}
\boxed{q_{\pi}(s,a) = \sum_{s^\prime \in S} p(s^\prime|s,a) \cdot \bigg[ r(s,a,s^\prime) + \gamma v_{\pi}(s^\prime) \bigg]}
&
\boxed{q_{\pi}(s,a) = \sum_{s^\prime, r} p(s^\prime,r|s,a) \cdot \bigg[ r + \gamma v_{\pi}(s^\prime) \bigg]}
\end{matrix}$$

So the relation between $v_{\pi}(s)$ and $q_{\pi}(s,a)$ is elementary:
$$\boxed{v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \cdot q_{\pi}(s,a) }$$
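
The relation between the state value and the action value can be confirmed numerically. This sketch assumes a made-up toy system (hypothetical array `p[s, a, s_next, r_index]`, rewards in $\{0,1\}$) and an arbitrary stochastic policy; it first obtains $v_\pi$ from a linear solve, then builds the action values from it and checks that averaging them over the policy returns $v_\pi$.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])            # possible reward values (assumption)
p = np.zeros((2, 2, 2, 2))                # p[s, a, s_next, r_index], illustrative numbers
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]
pi = np.array([[0.3, 0.7], [0.9, 0.1]])   # pi[s, a], arbitrary policy (assumption)

# State values from the linear form v = r_pi + gamma * P_pi v
P_pi = np.einsum("sa,sanr->sn", pi, p)
r_pi = np.einsum("sa,sanr,r->s", pi, p, rewards)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Action values: sum over s', r of p(s',r|s,a) * [r + gamma * v(s')]
target = rewards[None, :] + gamma * v[:, None]     # target[s_next, r_index]
q = np.einsum("sanr,nr->sa", p, target)

# v_pi(s) = sum_a pi(a|s) * q_pi(s,a)
assert np.allclose((pi * q).sum(axis=1), v)
```
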

## Bellman optimality equations: optimal state and action value

Optimal implies that we can drop the policy's stochasticity and select, with $p=1$, the action that maximizes the expected return (the system's own transitions may remain stochastic).
$$\pi(a|s) \rightarrow \pi_\ast(a|s) \text{ or } a_\ast = \pi_\ast(s)$$
_(depending on the context)_

Optimal state value $v_\ast$
$$v_\ast(s) = \underset{a}{\mathrm{max}} \ q_{\ast}(s,a) = q_{\ast}(s,a = \pi_\ast(s))$$

Optimal action value $q_\ast$

$$q_{\ast}(s,a) = \mathbb{E}\bigg[ R_{t+1} + \gamma \cdot \underset{a^\prime}{\mathrm{max}} \ q_{\ast}(S_{t+1},a^\prime)\bigg|S_t = s, A_t = a \bigg]$$
This reduces to hops from one state-action pair to the next.
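
One common way to solve the optimality equation is to iterate its right-hand side on a table of action values (often called value iteration on $q$). The sketch below is an illustration under assumptions, not a derivation from this text: it uses a made-up toy system (hypothetical array `p[s, a, s_next, r_index]`, rewards in $\{0,1\}$, $\gamma = 0.9$) and writes the expectation as a sum over $p(s^\prime,r|s,a)$.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])            # possible reward values (assumption)
p = np.zeros((2, 2, 2, 2))                # p[s, a, s_next, r_index], illustrative numbers
p[0, 0] = [[0.5, 0.0], [0.25, 0.25]]
p[0, 1] = [[0.0, 0.0], [0.2, 0.8]]
p[1, 0] = [[1.0, 0.0], [0.0, 0.0]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]


def optimality_backup(q):
    """Right-hand side of the Bellman optimality equation for q."""
    # target[s_next, r_index] = r + gamma * max_a' q(s', a')
    target = rewards[None, :] + gamma * q.max(axis=1)[:, None]
    return np.einsum("sanr,nr->sa", p, target)


q_star = np.zeros((2, 2))
for _ in range(2000):                     # gamma < 1, so the iteration contracts
    q_star = optimality_backup(q_star)

v_star = q_star.max(axis=1)               # v_*(s) = max_a q_*(s, a)
pi_star = q_star.argmax(axis=1)           # greedy action a_* = pi_*(s)
```

For this particular toy system the greedy policy picks action 1 in both states.
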