# Probabilities
See [intro_to_stats_for_RL.ipynb](../../../../educational/statistics_sets/intro_to_stats_for_RL.ipynb)

# Discrete-time Markov chain (DTMC)

A DTMC is a sequence of random variables $X_1, X_2, \dots$ with the <i>Markov property</i>.

Let's unpack this:
* A random variable $X$, when sampled (an 'experiment'), returns one value from a sample space $S$.<br>
 The returned value is picked at random according to a given probability distribution.<br><br>
For example, $X$ can be the result of a coin toss.
    * The sample space is the set of two outcomes, heads and tails: $S = \{H,T\}$
    * The probability of each outcome can be arbitrary, but all probabilities must sum to 1. For an unbiased coin toss:
    $$P(X = H) = p_H = P(X = T) = p_T = \frac{1}{2}$$
    $$p_H + p_T = 1$$

* A sequence of random variables $X_1, X_2, \dots, X_n$ describes a series of identical 'experiments' conducted $n$ times.
    * The index $1,2,\dots, n$ distinguishes the individual 'experiments'.
    * For our coin toss example, the sequence $X_1, X_2$ can unfold into any of the outcomes $\{(H,H),(H,T),(T,H),(T,T)\}$
    * If the coin is unbiased, all four outcomes are equally likely, each with probability $\frac{1}{4}$.<br>
        In the extreme case $p_H = 1$ (so $p_T = 1 - p_H = 0$), the only possible outcome is $(H,H)$
    * In general, because coin tosses are independent, the probability of the outcome $(H,H)$ is<br>
        $p_H \times p_H$

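* The Markov property states that the result of the next 'experiment' $X_{n+1}$ depends only on the current result $X_n$, not on earlier ones.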
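
The coin-toss bullets above can be checked numerically. The following is a minimal sketch, assuming independent tosses of the same coin; the function name `sequence_probabilities` and its parameters are made up for illustration.

```python
from itertools import product


def sequence_probabilities(p_heads, n_tosses=2):
    """Map each outcome tuple, e.g. ('H', 'T'), to its probability."""
    p = {"H": p_heads, "T": 1.0 - p_heads}
    result = {}
    for outcome in product("HT", repeat=n_tosses):
        prob = 1.0
        for face in outcome:
            prob *= p[face]  # independence: probabilities multiply
        result[outcome] = prob
    return result


fair = sequence_probabilities(0.5)          # unbiased coin
always_heads = sequence_probabilities(1.0)  # extreme case p_H = 1
```

With `p_heads=0.5` every one of the four outcomes gets the same probability, and with `p_heads=1.0` all probability mass sits on `('H', 'H')`.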



In general, it's a model of a random walk on a graph with states $X_i \in S$.

Example sequence
The randomness (stochastic nature) comes from non-deterministic transitions from one state to another.

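The Markov property imposes restrictions on the model that simplify analysis. Namely:
1. Limited history: the probability of transitioning to the next state depends only on the current state, not on previous states.
$$P(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_1 = x_1) = P(X_{n+1} = x \mid X_n = x_n)$$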


<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Markov_Decision_Process.svg" alt="image info" style="background-color:white;padding:0px;" width="200" height="200" />
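
To make the random walk on a graph from this section concrete, here is a small simulation sketch. The three states `A`, `B`, `C` and their transition probabilities are hypothetical and chosen only for illustration; the point is that `step` looks at the current state and nothing else.

```python
import random

# Hypothetical 3-state chain; each row lists P(next state | current state).
TRANSITIONS = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"A": 0.4, "B": 0.2, "C": 0.4},
    "C": {"A": 0.5, "B": 0.5, "C": 0.0},
}


def step(state, rng):
    """Sample the next state using only the current state (Markov property)."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


def simulate(start, n_steps, seed=0):
    """Return the visited states X_1, ..., X_{n_steps + 1} as a list."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path


path = simulate("A", 10)
```

Each row of `TRANSITIONS` sums to 1, mirroring the requirement that the probabilities of all outcomes sum to 1.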

## Value `V` of state `s` $V(s)$

The value of a state is formally defined as the expected total reward that can be obtained starting from state `s`.

<i> To deal with infinite/looping trajectories, or to enforce short-term memory, we can apply a discount factor $\gamma \in [0,1)$, so that the reward obtained at step $t$ is weighted by $\gamma^t$.</i>

$$V(s) = \mathbb{E} \left[\sum_{t=0}^\infty \gamma^t \ r_t  \right]$$
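
As a rough illustration of the sum inside the expectation, the sketch below computes the discounted return of one finite reward sequence. The helper name `discounted_return` is hypothetical; the value $V(s)$ would be the average of such returns over many trajectories starting in $s$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a single (finite) trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three steps with reward 1 each: 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# A long constant-reward trajectory stays bounded by 1 / (1 - gamma).
long_run = discounted_return([1.0] * 1000, gamma=0.9)
```

The second call shows why $\gamma < 1$ helps with very long or looping trajectories: the total stays finite.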

Let's consider a case with an initial state $s_0$ and two nearby states $S = \{s_1,s_2\}$; upon reaching either of them the episode terminates and the agent receives the corresponding reward from $R = \{r_1,r_2\}$.<br>
The probability of transitioning from state $s_0$ to state $s_1$ is $p_{0 \rightarrow 1}$, and similarly $p_{0 \rightarrow 2}$ for state $s_2$.<br>
We don't use the self-loop $p_{0 \rightarrow 0}$, since it would allow infinitely long episodes.

Expected reward is a probability-weighted sum of rewards:

$$V(s_0) = \mathbb{E} \left[r_{t = 0} \right] = p_{0 \rightarrow 1} \ r_1 + p_{0 \rightarrow 2} \ r_2  = \sum_{s^\prime \in S} p_{0 \rightarrow s^\prime} \ r_{s^\prime}$$
of course given that 
$$\sum_{s^\prime \in S} p_{0 \rightarrow s^\prime} = 1$$
Depending on transition probabilities expected reward may vary.
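
The probability-weighted sum above can be sketched in a few lines. The numbers for $p_{0 \rightarrow 1}$, $p_{0 \rightarrow 2}$, $r_1$ and $r_2$ are invented for illustration, and `expected_reward` is a hypothetical helper name.

```python
def expected_reward(probs, rewards):
    """Probability-weighted sum of rewards over the reachable states."""
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("transition probabilities must sum to 1")
    return sum(p * rewards[s] for s, p in probs.items())


# Assumed example values: p_{0->1} = 0.25, p_{0->2} = 0.75, r_1 = 4, r_2 = 8
p_from_s0 = {"s1": 0.25, "s2": 0.75}
r = {"s1": 4.0, "s2": 8.0}

v_s0 = expected_reward(p_from_s0, r)  # 0.25 * 4 + 0.75 * 8
```

Changing the entries of `p_from_s0` while keeping their sum at 1 changes `v_s0`, which is the dependence on transition probabilities noted above.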

If we add further states $S_2 = \{ s_3, s_4\}$, one-way connected to $s_2$, the agent can 'explore further' and accumulate more reward past $r_2$:
$$V(s_2) = p_{2 \rightarrow 3} \ r_3 + p_{2 \rightarrow 4} \ r_4 = \sum_{s^\prime \in S_2} p_{2 \rightarrow s^\prime} \ r_{s^\prime} $$

$$V(s_0) = \mathbb{E} \left[r_{t=0} + \gamma \ r_{t=1} \right] = p_{0 \rightarrow 1} \ r_1 + p_{0 \rightarrow 2} \ [r_2  + \gamma \ p_{2 \rightarrow 3} \ r_3 + \gamma \ p_{2 \rightarrow 4} \ r_4]
= p_{0 \rightarrow 1} \ r_1 + p_{0 \rightarrow 2} \ [r_2  + \gamma \ V(s_2)]
$$
You can observe a recursion of unfolding time steps. It is heavily used in <i>Bellman equations of optimality</i>.
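
The recursion can be written down directly for the small tree $s_0 \rightarrow \{s_1, s_2\}$, $s_2 \rightarrow \{s_3, s_4\}$. This is only a sketch: all probabilities, rewards and the value of $\gamma$ are assumed for illustration, and the plain recursion terminates only because the graph has no loops.

```python
GAMMA = 0.5  # assumed discount factor

# Assumed transition probabilities p_{s -> s'}; states without an entry are terminal.
P = {
    "s0": {"s1": 0.25, "s2": 0.75},
    "s2": {"s3": 0.5, "s4": 0.5},
}
# Assumed reward received on reaching each state.
R = {"s1": 4.0, "s2": 8.0, "s3": 2.0, "s4": 6.0}


def value(state):
    """V(s) = sum over s' of p(s -> s') * (r_s' + gamma * V(s')); V = 0 at terminal states."""
    if state not in P:
        return 0.0
    return sum(p * (R[nxt] + GAMMA * value(nxt)) for nxt, p in P[state].items())


v_s2 = value("s2")  # 0.5 * 2 + 0.5 * 6
v_s0 = value("s0")  # 0.25 * 4 + 0.75 * (8 + 0.5 * v_s2)
```

Note that `value("s0")` calls `value("s2")` internally, which is the same unfolding of time steps as in the formula for $V(s_0)$.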

## Bellman equations of optimality
Bellman modifies the definition of $V(s)$ by introducing an `action` $a_s = \pi(s)$ and a `policy` $\pi$, which determines the transition probabilities.<br> 
Actions are issued by following the policy, and they bring the agent from one state to another.<br> 
If the agent is guided solely by the policy, its rewards, and thus the value of each state, will depend on how successful this policy is.<br>
The Bellman equation for an arbitrary policy decouples the `now` reward from the `future` reward.<br>
$$ V(s) = 
\mathbb{E} \left[r_{t=0} + \sum_{t=1}^\infty \gamma^t \ r_t  \right] 
\rightarrow 
V^\pi(s) = \mathbb{E}_\pi \left[R(s,a = \pi(s)) + \gamma \ V^\pi (s^\prime) \right]$$
* The `now` reward $R(s,a = \pi(s))$ is only concerned with the reward gained by performing the action $a_s$ chosen by the current policy, which brings the agent from state $s$ to state $s^\prime$<br>
* The discounted `future` reward $\gamma \ V^\pi (s^\prime)$ is defined recursively via the same Bellman equation
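
One common way to solve this recursive equation numerically is iterative policy evaluation: start from $V^\pi = 0$ and apply the right-hand side repeatedly until the values stop changing. The sketch below assumes a tiny made-up MDP (states `s0`, `s1`, a terminal state `end`, a single action `go`) and a deterministic policy; none of these names come from the text above.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical model: expected immediate reward R(s, a) and P(s' | s, a).
REWARD = {("s0", "go"): 0.5, ("s1", "go"): 2.0}
TRANSITIONS = {
    ("s0", "go"): {"s0": 0.5, "s1": 0.5},
    ("s1", "go"): {"end": 1.0},  # "end" is terminal, its value is 0
}
POLICY = {"s0": "go", "s1": "go"}  # deterministic policy a = pi(s)


def evaluate_policy(policy, gamma, tol=1e-12, max_sweeps=10_000):
    """Repeat V(s) <- R(s, pi(s)) + gamma * E[V(s')] until the update is below tol."""
    values = {s: 0.0 for s in policy}
    for _ in range(max_sweeps):
        delta = 0.0
        for s, a in policy.items():
            future = sum(p * values.get(nxt, 0.0) for nxt, p in TRANSITIONS[(s, a)].items())
            new_v = REWARD[(s, a)] + gamma * future
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            break
    return values


v_pi = evaluate_policy(POLICY, GAMMA)
```

Here `s0` has a self-loop, so the recursion never bottoms out on its own; the discount factor is what makes the repeated updates converge.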

Bellman's condition of optimality states (rather obviously) that the maximal reward is reached by following the `optimal` policy $\pi^*$.

The value of a state $V^\pi(s)$ depends on the policy $\pi$, and so does the reward gathered from state $s$: $R(s,a_s) + \gamma \ V^\pi (s^\prime)$.<br>
Actions are required in order to define an `optimal` policy $\pi^*$. An agent following $\pi^*$ selects optimal actions and achieves the most reward; the corresponding value function is $V^{\pi^*}$.
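
To illustrate how choosing actions leads to an optimal policy, the sketch below uses value iteration, a standard technique that is not described in the text above: each state's value is repeatedly replaced by the best action's `now` reward plus discounted `future` value. The MDP (states, actions `safe`, `risky`, `collect`, rewards) is entirely hypothetical.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: available actions, expected reward R(s, a), and P(s' | s, a).
ACTIONS = {"s0": ["safe", "risky"], "s1": ["collect"]}
REWARD = {("s0", "safe"): 1.0, ("s0", "risky"): 0.0, ("s1", "collect"): 5.0}
TRANSITIONS = {
    ("s0", "safe"): {"end": 1.0},   # "end" is terminal, its value is 0
    ("s0", "risky"): {"s1": 1.0},
    ("s1", "collect"): {"end": 1.0},
}


def q_value(state, action, values):
    """Now-reward plus discounted expected value of the next state."""
    future = sum(p * values.get(nxt, 0.0) for nxt, p in TRANSITIONS[(state, action)].items())
    return REWARD[(state, action)] + GAMMA * future


def value_iteration(tol=1e-12, max_sweeps=1000):
    values = {s: 0.0 for s in ACTIONS}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in ACTIONS:
            best = max(q_value(s, a, values) for a in ACTIONS[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The greedy policy picks, in every state, the action with the highest value.
    policy = {s: max(ACTIONS[s], key=lambda a: q_value(s, a, values)) for s in ACTIONS}
    return values, policy


optimal_values, optimal_policy = value_iteration()
```

In this toy example the `safe` action pays 1 immediately, but `risky` leads to a state worth 5, so the greedy policy in `s0` prefers `risky` once future reward is taken into account.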
