# Reinforcement Learning II

<hr>

**Revisiting Markov Decision Processes (MDP)**

$\text{Markov Decision Process, MDP} = \text{<} S, A, T, R \text{>}$

where
- $S$ refers to the set of all possible states
- $A$ refers to the set of all possible actions
- $T$ refers to the transition probabilities between states for a given action
- $R$ refers to the reward expected, starting from state $s$, taking an action $a$ and ending up in the next state, $s'$

The optimal policy, $\pi^* (s)$, maps each state $s$ to the action that maximizes the expected discounted sum of rewards starting from $s$. The value of taking an action $a$ at state $s$ and then acting optimally after that is given by the value function $Q^* (s,a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^* (s', a')]$, where $\gamma$ is the discount factor.

The two are connected by $\pi^* (s) = \arg \max_a Q^* (s,a)$: in each state $s$, the optimal policy picks the action with the highest value $Q^* (s,a)$.
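When $T$ and $R$ are known, the equation for $Q^*$ can be solved by applying it repeatedly as an update until the values stop changing. The sketch below does this for a hypothetical 2-state, 2-action MDP; the arrays `T` and `R`, the function name `q_value_iteration`, and the values of `gamma` and `tol` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def q_value_iteration(T, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeatedly apply the Bellman optimality backup for Q.

    T[s, a, s2] is the transition probability, R[s, a, s2] the reward.
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        V = Q.max(axis=1)  # max over a' of Q(s', a') for every s'
        # Sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
        Q_new = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
# Reward of 1 whenever the next state is state 1.
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

Q_star = q_value_iteration(T, R)
pi_star = Q_star.argmax(axis=1)  # pi*(s) = argmax_a Q*(s, a)
```

In this toy problem the greedy policy picks action 1 in both states, since that is the only way to collect the reward.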

****

**Estimating reinforcement learning parameters**

In RL, we may not know everything that an MDP requires. Realistically, $\text{Reinforcement Learning, RL} = \text{<} S, A \text{>}$: the transition probabilities between states and the rewards are likely to be unknown. The high-level idea is to balance **exploration and exploitation** when no prior knowledge exists. The question is therefore: *to what extent do we explore, what do we explore, and how does it change the policy we recommend?*

The empirical estimation of $T$, $R$ can be as follows:

$\hat{T}(s, a, s') = \frac{\text{count}(s, a, s')}{\sum_{s''}\text{count}(s, a, s'')}$

$\hat{R}(s, a, s') = \frac{\sum_{t=1}^{\text{count} (s, a, s')} R_t(s, a, s')}{\text{count}(s, a, s')}$

In practice, these estimates are only reliable if the agent visits each $(s, a)$ pair many times during the estimation process. This is especially problematic if the state space is large.
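As a minimal sketch of the count-based estimates above, assuming experience is logged as `(s, a, s_next, reward)` tuples (the function name `estimate_model` and the sample data are hypothetical):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T-hat and R-hat from (s, a, s_next, reward) tuples."""
    counts = defaultdict(int)         # count(s, a, s')
    reward_sums = defaultdict(float)  # sum of rewards seen for (s, a, s')
    totals = defaultdict(int)         # count(s, a, s') summed over s'
    for s, a, s_next, r in transitions:
        counts[(s, a, s_next)] += 1
        reward_sums[(s, a, s_next)] += r
        totals[(s, a)] += 1
    T_hat = {k: c / totals[k[:2]] for k, c in counts.items()}
    R_hat = {k: reward_sums[k] / c for k, c in counts.items()}
    return T_hat, R_hat

# Hypothetical experience collected from state "s0" with action "right".
experience = [
    ("s0", "right", "s1", 1.0),
    ("s0", "right", "s1", 3.0),
    ("s0", "right", "s0", 0.0),
    ("s0", "right", "s1", 2.0),
]
T_hat, R_hat = estimate_model(experience)
```

Any $(s, a, s')$ triple that was never observed simply has no entry in `T_hat` or `R_hat`, which is the data-sparsity problem described above.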

****

**Sampling-based approach for Q-learning**

Suppose there is no information about $T$, $R$ and an agent starts out from state $s$ and collects a few samples by taking actions. Each sample is a tuple $((s, a, s'), R(s, a, s'))$, which indicates that the agent received a reward of $R(s, a, s')$ when it reached state $s'$ by taking action $a$ from state $s$.

We can then collect $K$ samples:

$\text{sample}_1 = R(s, a, s_1') + \gamma \max_{a'} Q(s_1', a')$

$\dots$

$\text{sample}_K = R(s, a, s_K') + \gamma \max_{a'} Q(s_K', a')$

The estimate of $Q(s, a)$ is then the average over the $K$ samples:

$Q(s,a) = \frac{1}{K} \sum_{i=1}^{K} \text{sample}_i = \frac{1}{K} \sum_{i=1}^{K} [R(s, a, s_i') + \gamma \max_{a'} Q(s_i', a')]$

Rather than storing all $K$ samples, the estimate can be updated one sample at a time using the exponential running average (see Extras):

$Q_{i+1} (s,a) = \alpha \cdot \text{sample} + (1-\alpha) \cdot Q_i (s, a)$

Procedure for sampling-based Q-learning:

1. Initialize $Q(s, a) = 0$ $\forall s, a$
2. Iterate until convergence:
    - Collect a sample $(s, a, s', R(s, a, s'))$
    - Update $Q_{i+1} (s, a) = \alpha \cdot [R(s, a, s') + \gamma \max_{a'} Q_i (s', a')] + (1 - \alpha) Q_i (s, a)$
    - Rewritten, the update resembles a gradient descent step with step size $\alpha$, where the bracketed term is the error between the sampled target and the current estimate:
    
        $Q_{i+1} (s, a) = Q_i (s,a) + \alpha [R(s, a, s') + \gamma \max_{a'} Q_i (s', a') - Q_i (s, a)]$
    

This algorithm converges to $Q^*$ provided every state-action pair keeps being visited and the learning rate $\alpha$ is decayed appropriately. Each time we take an action in the real world, we collect evidence, update our knowledge and recommend a set of actions based on what we know.
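The procedure can be sketched in a few lines of tabular code. The environment here is a hypothetical deterministic 4-state chain (the `step` function, the reward of 1 at the last state, and all hyperparameter values are assumptions made for illustration). Actions are drawn uniformly at random to keep the focus on the update rule itself.

```python
import numpy as np

def step(s, a, n_states):
    """Hypothetical chain: action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == n_states - 1          # last state is terminal
    reward = 1.0 if done else 0.0
    return s_next, reward, done

def q_learning(n_states=4, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))    # step 1: Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(rng.integers(n_actions))  # uniformly random behaviour
            s_next, r, done = step(s, a, n_states)
            # Sampled target; a terminal state contributes no future value
            target = r + (0.0 if done else gamma * Q[s_next].max())
            # Q <- Q + alpha * (target - Q)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q = q_learning()
greedy = Q.argmax(axis=1)  # recommended action per state
```

Even though the agent behaves randomly, the learned greedy policy moves right in every non-terminal state, because the update always bootstraps from the best next action rather than the action actually taken.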

One way to balance exploration and exploitation is the $\epsilon$-greedy approach: with probability $\epsilon$ the agent samples an action at random, and with probability $1 - \epsilon$ it chooses the best currently available action. The higher the $\epsilon$, the more likely the agent is to explore new or less-visited states and actions. As the agent learns to act well and has sufficiently explored its environment, $\epsilon$ should decay.
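A minimal sketch of $\epsilon$-greedy action selection follows. The decay schedule (exponential with a floor) and its parameters `eps_start`, `eps_min`, and `decay` are assumptions chosen for illustration; the notes do not prescribe a particular schedule.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore
    return int(np.argmax(q_row))              # exploit

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Hypothetical schedule: exponential decay, never below eps_min."""
    return max(eps_min, eps_start * decay ** step)

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.3])   # Q(s, a) for one state s

greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q_row, epsilon=1.0, rng=rng) for _ in range(200)]
```

With `epsilon=0.0` the agent always exploits, and with `epsilon=1.0` every action is drawn at random; the schedule moves the agent from the second regime towards the first as `step` grows.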

<hr>

**Extras**

Objective: Estimate $\mathbb{E}[f(x)]$

1. Model-based estimation of a probability distribution

    - Sample $x_i \sim p(X)$ for $i = 1, \dots, K$
    - $\hat{p}(x) = \frac{\text{count}(x)}{K}$
    - $\mathbb{E}[f(x)] \approx \sum_x \hat{p}(x) \cdot f(x)$

2. Model-free, *does not estimate $\hat{p}(x)$*

    - Sample $x_i \sim p(X)$ for $i = 1, \dots, K$
    - $\mathbb{E}[f(x)] \approx \frac{1}{K} \sum_{i=1}^{K} f(x_i)$
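The two estimators can be compared side by side. In this sketch the distribution `probs`, the support `values`, and the function `f` are hypothetical; the true distribution is used only to generate samples and is never shown to either estimator.

```python
import numpy as np
from collections import Counter

def f(x):
    return x ** 2

rng = np.random.default_rng(0)
values = np.array([1, 2, 3])
probs = np.array([0.2, 0.5, 0.3])  # hypothetical p(x), used only for sampling

K = 10_000
samples = rng.choice(values, size=K, p=probs)

# 1. Model-based: estimate p-hat(x) from counts, then take the expectation
counts = Counter(samples.tolist())
p_hat = {x: c / K for x, c in counts.items()}
model_based = sum(p_hat[x] * f(x) for x in p_hat)

# 2. Model-free: average f over the samples directly
model_free = float(np.mean(f(samples)))
```

On the same set of samples the two estimates coincide up to floating-point error, since grouping the samples by value before averaging does not change the sum. The difference lies in what is stored: the model-based route keeps $\hat{p}(x)$ and can reuse it for other functions, while the model-free route keeps only the running estimate.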
    
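Exponential running average

$\bar{X}_n = \frac{X_n + (1 - \alpha) X_{n-1} + (1 - \alpha)^2 X_{n-2} + \dots}{1 + (1 - \alpha) + (1 - \alpha)^2 + \dots}$

The denominator is a geometric series that sums to $1 / \alpha$ for an infinite history, so the average can be written recursively as:

$\bar{X}_n = \alpha X_n + (1 - \alpha) \bar{X}_{n-1}$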
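To check that the recursive form matches the explicitly weighted sum, the sketch below computes both for a short sequence. It assumes a finite history with an initial estimate `init` (0 here, mirroring the $Q(s, a) = 0$ initialization); the function names are hypothetical.

```python
def running_average(xs, alpha, init=0.0):
    """Recursive form: avg <- alpha * x + (1 - alpha) * avg."""
    avg = init
    for x in xs:
        avg = alpha * x + (1 - alpha) * avg
    return avg

def explicit_weighted_sum(xs, alpha, init=0.0):
    """Unrolled form: the sample k steps back gets weight alpha * (1 - alpha)^k."""
    total = (1 - alpha) ** len(xs) * init
    for k, x in enumerate(reversed(xs)):
        total += alpha * (1 - alpha) ** k * x
    return total

xs = [1.0, 2.0, 3.0, 4.0]
recursive = running_average(xs, alpha=0.5)
explicit = explicit_weighted_sum(xs, alpha=0.5)
```

Both give the same value, with the most recent sample weighted most heavily. The recursive form needs only the previous average, which is why it suits an update applied one sample at a time.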

<hr>

# Basic code
A `minimal, reproducible example`