# Exploitation-Exploration Dilemma
> A notebook that helps us to discover Reinforcement Learning
- toc: true
- branch: master
- badges: true
- comments: true
- image: https://miro.medium.com/max/1400/1*ywOrdJAHgSL5RP-AuxsfJQ.png
- description: First in a series on understanding Reinforcement Learning.

# Preliminaries
* A ``policy`` defines the agent's behaviours
    * Deterministic policy: $A = \pi(S)$
    * Stochastic policy: $\pi(A|S) = p(A|S)$ 
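
To make the two kinds of policy concrete, here is a minimal sketch on a made-up problem with two states and two actions. The state names, action names, and probabilities are illustrative assumptions, not part of any particular environment.

```python
import random

# Hypothetical toy problem: two states, two actions.
ACTIONS = ["left", "right"]

# Deterministic policy: a plain mapping from state to action, A = pi(S).
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: for each state, a probability for each action, pi(A|S).
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Return the single action the deterministic policy prescribes."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the probabilities pi(.|state)."""
    probs = stochastic_policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))
print([act_stochastic("s0", rng) for _ in range(5)])
```

The deterministic policy always returns the same action for a given state, while repeated calls to the stochastic one can return different actions.
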
- The actual value function is the expected return
$$
\begin{split}
v_\pi(s) = \mathbb{E} [G_t | S_t = s, \pi] &= \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \dots | S_t = s, \pi] \\
&=\mathbb{E} [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t \sim \pi(s)]
\end{split}
$$
where $\gamma \in [0, 1]$ is the discount factor.
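
The return inside the expectation can be computed for a single sampled sequence of rewards. The sketch below is only an illustration: the reward list and the value of `gamma` are made up, and the value function itself would be the average of such returns over many trajectories. It computes the return both with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ and with the direct sum.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.5                # hypothetical discount factor

# Direct evaluation of the discounted sum, for comparison.
direct = sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma), direct)
```

Both numbers should agree, which is the same identity that turns the first line of the value function into the second.
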

- Optimal value is the highest possible value for any policy

$$
v_*(s) = \max_{a} \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t=a]
$$


- The true action value for action ``a`` is the expected reward
$$
q(a) = \mathbb{E}\left[R_t | A_t=a \right]
$$


- A simple estimate of the action value is the average of the sampled rewards:
$$
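Q_t(a)=\frac{\sum_{n=1}^{t}R_n\mathbb{I}(A_n=a)}{\sum_{n=1}^{t}\mathbb{I}(A_n=a)}
$$

where $R_n$ is the reward at time $n$ and $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.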

- The update of the action values at time step $n+1$
$$
Q_{n+1} = Q_n + \frac{1}{n} (R_n - Q_n)
$$

where $\alpha=\frac{1}{n}$ is the step size. With this choice, $Q_{n+1}$ is exactly the average of the first $n$ rewards.
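
As a quick check of the incremental update, the following sketch applies it to a stream of rewards for a single action and compares the result with the plain sample average. The Gaussian rewards and the seed are arbitrary assumptions made for illustration.

```python
import random

rng = random.Random(0)
# Hypothetical rewards observed for one action.
rewards = [rng.gauss(1.0, 1.0) for _ in range(1000)]

q = 0.0  # Q_1: initial estimate before any reward is seen
for n, r in enumerate(rewards, start=1):
    # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    q += (r - q) / n

sample_average = sum(rewards) / len(rewards)
print(q, sample_average)
```

The two values should match up to floating-point rounding, and the incremental form needs only the current estimate and a counter rather than the full reward history.
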

- The optimal value is
$$
v_* = \max_{a \in A}  q(a)
$$

- Regret is the opportunity loss for one step

$$
v_* - q(A_t)
$$

- Action regret $\Delta_a$ for a given action is the difference between the optimal value and the true value of $a$
$$
\Delta_a = v_* -q(a)
$$

- The trade-off between exploration and exploitation is handled by minimizing the total regret

$$
\begin{split}
L_t &= \sum_{i=1}^t (v_* - q(a_i)) \\
&= \sum _{a \in \mathcal{A}} N_t(a)(v_* - q(a)) = \sum _{a \in \mathcal{A}} N_t(a) \Delta _a
\end{split}
$$

where $N_t(a)$ is the number of times action $a$ has been selected up to step $t$.
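
The two forms of the total regret, a sum over time steps and a sum over actions weighted by how often each was chosen, can be checked against each other on a small example. The true action values and the action sequence below are invented for illustration.

```python
from collections import Counter

# Hypothetical true action values q(a) and a hypothetical action history.
q_true = {"a": 1.0, "b": 0.5, "c": 0.2}
actions_taken = ["a", "b", "b", "c", "a", "b"]

v_star = max(q_true.values())  # optimal value v_*

# Form 1: accumulate the per-step regret v_* - q(a_i).
regret_per_step = sum(v_star - q_true[a] for a in actions_taken)

# Form 2: count how often each action was taken, N_t(a),
# and weight each action's gap Delta_a by that count.
counts = Counter(actions_taken)
gaps = {a: v_star - q for a, q in q_true.items()}
regret_per_action = sum(counts[a] * gaps[a] for a in q_true)

print(regret_per_step, regret_per_action)
```

The second form makes clear that regret stays low when actions with a large gap $\Delta_a$ are selected rarely.
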

* Categorizing Agents
    * Value Based (Value Function)
    * Policy Based (Policy)
    * Actor Critic (Policy and Value Function)

* Prediction and Control
    * Prediction: evaluate the future for a given policy
    * Control: optimize the future (find the best policy)
$$
\pi_*(s) = \operatorname*{arg\,max}_{\pi} v_\pi (s)
$$

# Introduction
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructing by giving the correct actions. This creates the need for active exploration, that is, an explicit search for good behaviour.

Evaluative feedback indicates how good the action taken was, but not whether it was the best or worst action possible. Instructive feedback, on the other hand, indicates the correct action to take, which is the basis of supervised learning.

The well-known trade-off between exploitation and exploration is essentially the compromise between maximizing performance (exploitation) and increasing knowledge (exploration). It is a typical problem in online decision making, because we collect the information we need while also trying to make the best overall decisions.

## Multi-Armed Bandit

Consider the following learning problem: we repeatedly choose among $k$ different options. After each choice, we receive a numerical reward sampled from a stationary probability distribution that depends on the option we selected. The objective is to maximize the expected total reward $\mathbb{E}[\sum _i R_i]$ over some time period. This is the original form of the $k$-armed bandit problem.
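
A bandit of this kind can be sketched in a few lines. The class below is a hypothetical illustration rather than a standard API: it assumes Gaussian rewards with unit variance, and the arm means, seeds, and the uniformly random agent are all made up.

```python
import random


class GaussianBandit:
    """k-armed bandit with stationary rewards (Gaussian, unit variance: an assumption)."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # true action values q(a), hidden from the agent
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, action):
        """Sample a reward for the chosen arm; the distribution never changes."""
        return self.rng.gauss(self.means[action], 1.0)


bandit = GaussianBandit(means=[0.1, 0.5, 0.9])
agent_rng = random.Random(1)

total_reward = 0.0
for _ in range(100):
    action = agent_rng.randrange(bandit.k)  # baseline agent: pick an arm uniformly at random
    total_reward += bandit.pull(action)

print(total_reward)
```

The random agent here only shows the interaction loop. A learning agent would replace the `randrange` call with a rule based on its reward estimates.
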

When we look at the total regret, we see the differences between the optimal value and the value of the action actually taken, accumulated over time. The objective is to minimise this regret: the faster it grows, the worse the agent is doing.

In principle, the regret can grow without bound, so the interesting question is how fast it grows. For example, the greedy policy has linear regret: it can lock onto a suboptimal action forever, so its regret grows in proportion to the number of steps taken.
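
The behaviour of the greedy policy can be explored with a small simulation. This is a sketch under stated assumptions, not a definitive experiment: rewards are Gaussian with unit variance, estimates start at zero, ties go to the lowest-indexed arm, and the arm means, seed, and run counts are arbitrary.

```python
import random


def greedy_regret(means, steps, rng):
    """Run one greedy agent and return its cumulative regret after each step."""
    k = len(means)
    q_est = [0.0] * k   # sample-average estimates Q_t(a)
    counts = [0] * k    # N_t(a)
    v_star = max(means)
    total, curve = 0.0, []
    for _ in range(steps):
        # Greedy: always take the action with the highest current estimate
        # (ties go to the lowest index).
        a = max(range(k), key=lambda i: q_est[i])
        r = rng.gauss(means[a], 1.0)            # assumed unit-variance Gaussian reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        total += v_star - means[a]              # regret uses the true action values
        curve.append(total)
    return curve


rng = random.Random(0)
means = [0.2, 0.5, 0.8]  # hypothetical true action values; the last arm is optimal
runs, steps = 200, 1000

# Average the regret curves over many independent runs.
avg = [0.0] * steps
for _ in range(runs):
    for t, x in enumerate(greedy_regret(means, steps, rng)):
        avg[t] += x / runs

for t in (250, 500, 1000):
    print(t, round(avg[t - 1], 1))
```

If the greedy agent often settles on a suboptimal arm, as one would expect in this setup, the averaged regret should roughly double each time the number of steps doubles.
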