In [3]:
import numpy as np 

### Learning Agents

- The agent repeatedly takes **action** $a_t \in \mathbb{A}$ in discrete time periods, $t \in \mathbb{N}.$
- When the agent choose action $a_t$ at time $t$, it obtaints an immediate **observable reward**, $r_{t+1}$.
- The **return** $G_t$ at time $t$ is some function $f$ of the future rewards $G_t= f(r_{t+1} , r_{t+2},\cdots)$.
- Often we use $G_t = \sum_{i=t}^{\infty} \gamma^i r_{i+1}$, where $\gamma \ in [0,1]$ is the **time discounting** of the agent.

### Markov Decision Processes
- The reward is a function of the action chosen in the previous state $s_t \in \mathbb{S}$.
- The $s$ state and action $\alpha$ at time $t$ determines the probability of the subsequent state s ′ and reward $r$.
- The probabilities are specified by the function $p(s',r|s,\alpha)$.
- This specifies a finite **Markov Decision-Process**.
- The goal of the agent is to maximise the expected return $\mathbb{E}(G)$.
- The agent follows a **policy** which specifies an action to take in each state: $\pi(s)\in\mathbb{A}$.
- The optimal policy is denoted $\pi^*$ .

### Value Functions

We use the function $v(\alpha)$ to denote the expected return for action a over the entire episode:

$$v(\alpha) = \mathbb{E}(G|a_t = \alpha)$$

$$ = \mathbb{E}(R_{\alpha})$$

- Typically, the function $v$ is unknown to the agent.
- In this scenario, the agent performs *sequential decision making under uncertainty*.

Consider a $n$-armed bandit, where $R_{\alpha}\sim N(\alpha, 1)$

In [51]:
def play_bandit(a, variance=1.0):
    """ Return the reward from taking action a """ 
    return np.random.normal(a, scale=np.sqrt(variance))

r_2 = play_bandit(a=1) 
r_2

1.0635084643199402

### Greedy Action

If we can compute $v(\alpha)$, then the agent's optimal policy is simple. That is the agent simply choose the action with the highest expection.

$$\alpha^* = arg\max_{\alpha} v(\alpha)$$

- we call $\alpha^*$ as the **greedy action**.

### Learning as Sampling

The random variates, $r_t$, are directly observable.
They are samples from the distribution $F_{\alpha}(r)$.

$$r_t \sim^{distribution} F_{\alpha}(r)$$

How can we estimate $v(\alpha)$ given the observed rewards $r_1,r_2, ..., r_t$?

### Value Estimation
We can use the *sampling to estimate* $v(.)$,

$$Q_r(\alpha) = \frac{r_1+r_2+...+r_k}{k}$$

- By the law of large numbers,

$$ \lim_{k\to\infty}Q(\alpha) = v(\alpha) $$

In [41]:
def sample_from_bandit(a, k):
    rewards = []
    for t in range(k):
        r = play_bandit(a)
        rewards.append(r)
    return np.mean(rewards)

In [50]:
q_2 = sample_from_bandit(a=2, k=20)
q_2

2.560533626413457

#### Optimise the Coding

Using `map`, instead of a loop, to speed up the algorithm.

In [52]:
def sample_from_bandit(a,k):
    return np.mean(map(lambda i: play_bandita), range(k))

Or, in a comprehension way,

In [53]:
def sample_from_bandit(a,k):
    return np.mean(map(lambda i: play_bandita), range(k))

Further code optimisation could be implemented, for instance, for large sample sizes we need to allocate memory to hold all the previous samples.

### Incremental update of estimates

$$\begin{equation}
\begin{split}
Q_{k+1} & = \frac{1}{k} \sum_{i=1}^k r_i \\
& = \frac{1}{k}(r_k + \sum_{i=1}^{k-1} r_i) \\
& = \frac{1}{k}(r_k + (k-1) \frac{1}{k-1}\sum_{i=1}^{k-1} r_i) \\
& = \frac{1}{k}[r_k + (k-1)Q_k] \\
& = \frac{1}{k}[r_k + k\cdot Q_k - Q_k] \\
& = Q_k + \frac{1}{k}[r_k - Q_k] 
\end{split}
\end{equation}$$

### Temporal Difference Learning
- We are adjusting/updating an old estimate towards a new estimate based on more recent information.
- We can think of the coefficient $(k)^{−1}$ as a step size parameter.
$$Q_{k+1} = \frac{1}{k} [r_k - Q_k] $$

`new_estimate = old_estimate + step_size * (target - old_estimate)`

In [55]:
def update_q(old_estimate, target_estimate, k): 
    step_size = 1./(k+1) 
    error = target_estimate - old_estimate 
    return old_estimate + step_size * error

def sample_from_bandit(a, k):

    current_estimate = 0.
    for t in range(k):
        current_estimate = update_q(current_estimate, play_bandit(a), t) 
        return current_estimate

In [57]:
q_2 = sample_from_bandit(a=2, k=100000) 
q_2

1.7672853726906987