# Chapter 2: Multi-Armed Bandits
Let's explore the core ideas of Reinforcement Learning through the multi-armed bandit problem.

## k-armed bandit problem
At each time step, we need to choose one of $k$ possible actions. Each action yields a reward drawn from a hidden probability distribution that depends on the action. Let's refer to the expected reward of an action as the **value** of that action. Assuming $A_t$ is the action taken at time step $t$ and $R_t$ the corresponding reward, then 
$$ q_{*}(a) = E[R_t \mid A_t = a]$$
At each time step, we have an estimate of the value of a given action, say $Q_t(a)$. There are 3 points to note:
1. We would like $Q_t(a)$ to be as close as possible to $q_{*}(a)$.
2. As the number of actions is finite, we can always identify the action with the highest estimated value.
3. Choosing the action with the highest estimated value is referred to as **exploitation**. Choosing any other action is called **exploration**.


## Action-value methods
As mentioned above, one of the most important aspects of RL is estimating the values of the different actions. The most straightforward way is to average the rewards received when that exact action was chosen: 
$$ Q_t(a) = \frac{\sum_{i=1}^{t - 1} R_i \cdot 1_{A_i = a}}{\sum_{i=1}^{t - 1}  1_{A_i = a}} $$
That is, $Q_t(a)$ is the sum of the rewards received at the time steps where action $a$ was taken, divided by the number of times action $a$ was taken. When $a$ has not been taken yet, the denominator is zero and $Q_t(a)$ is set to a default value such as $0$.  
There are other ways to estimate action values; we elaborate on them later.  
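
As a rough illustration of the sample-average formula, here is a minimal sketch that computes $Q_t(a)$ for every action from a recorded history. The function name `sample_average`, the `default` argument and the history values are made up for this example.

```python
import numpy as np

def sample_average(actions, rewards, k, default=0.0):
    """Return the sample-average estimate Q_t(a) for each of the k actions."""
    actions = np.asarray(actions, dtype=int)
    rewards = np.asarray(rewards, dtype=float)
    # numerator and denominator of the formula, computed per action
    sums = np.bincount(actions, weights=rewards, minlength=k)
    counts = np.bincount(actions, minlength=k)
    # actions that were never taken keep the default estimate
    estimates = np.full(k, default, dtype=float)
    taken = counts > 0
    estimates[taken] = sums[taken] / counts[taken]
    return estimates

# hypothetical history: action taken and reward received at each time step
history_actions = [0, 2, 0, 1, 2, 0]
history_rewards = [1.0, 0.5, 3.0, -1.0, 1.5, 2.0]
print(sample_average(history_actions, history_rewards, k=4))
```

Action 0 was taken three times with rewards 1, 3 and 2, so its estimate is 2; action 3 was never taken and keeps the default.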
The greedy approach is to choose the action with the highest estimated value:
$$ A_t = \arg\max_a Q_t(a)$$ 
A policy that usually performs better in the long run is $\epsilon$-greedy: choose the greedy action with probability $1 - \epsilon$, and with probability $\epsilon$ choose uniformly at random among all actions.  
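
The selection rule can be sketched in a few lines. This is only an illustration: the function name `epsilon_greedy`, the estimates and the value of $\epsilon$ are assumptions chosen for the example.

```python
import numpy as np

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index given the current value estimates."""
    if rng.random() < epsilon:
        # explore: uniform over all actions, the greedy one included
        return int(rng.integers(len(estimates)))
    # exploit: highest current estimate (ties go to the lowest index)
    return int(np.argmax(estimates))

rng = np.random.default_rng(0)
estimates = [0.2, 1.5, -0.3]  # hypothetical Q_t(a) values
choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(10_000)]
greedy_share = np.mean(np.array(choices) == 1)
print(f"share of greedy choices: {greedy_share:.3f}")
```

Because the random branch can also land on the greedy action, the greedy action is expected to be picked about $1 - \epsilon + \epsilon / k$ of the time, not exactly $1 - \epsilon$.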
A computational remark: it is enough to store, for each action, the number of times it was chosen and the sum of its rewards (or, equivalently, its current estimate). Focusing on a single action and introducing some notation, we have the following:
1. $Q_n = \frac{\sum_{i = 1}^{n - 1} R_i}{n - 1}$, where $R_i$ is the reward received the $i$-th time the action was chosen
2. $Q_{n + 1} = \frac{\sum_{i = 1}^{n} R_i}{n} = \frac{R_n + (n - 1) \cdot \frac{\sum_{i = 1}^{n - 1} R_i}{n-1}}{n}$
3. $Q_{n + 1} = \frac{R_n + Q_n (n - 1)}{n}= Q_n + \frac{R_n - Q_n}{n}$

The last equation lets us update the estimate in constant time and memory per step, without storing past rewards.
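
A quick way to convince ourselves that the incremental rule matches the plain average is to run both on the same rewards. The rewards below are synthetic and only serve the check.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=2.0, size=1000)  # hypothetical rewards of one action

q = 0.0  # Q_1; its value does not matter because the first step size is 1/1
for n, r in enumerate(rewards, start=1):
    # Q_{n+1} = Q_n + (R_n - Q_n) / n
    q += (r - q) / n

print(q, rewards.mean())  # the two values agree up to floating-point error
```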

In [1]:
# sounds like time to code our problems and have some fun with it
from typing import Union
from collections.abc import Sequence
import numpy as np
import random
# set the seeds

random.seed(69)
np.random.seed(69)

class kArmBandit:
    def __init__(self, k: int=5, rewards_means: list[Union[float, int]]=None, rewards_variances: Union[float, list[float]]=None) -> None:
        # number of options
        self.k = k
        # the means parameters must be a sequence of floats or integers of size k
        assert rewards_means is None or (isinstance(rewards_means, Sequence) and len(rewards_means) == k)
        # the default values for the mean values are random

        if rewards_means is None:        
            rewards_means = [np.random.uniform(-5, 5) for _ in range(k)]

        if rewards_variances is None:
            rewards_variances = [1 for _ in range(k)]

        # a single number is used as the variance of every option
        if isinstance(rewards_variances, (float, int)):
            rewards_variances = [rewards_variances for _ in range(k)]

        assert isinstance(rewards_variances, Sequence) and len(rewards_variances) == k

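        # bind mean and variance as default arguments so that each lambda keeps its own pair
        # (a plain closure would make every option use the last pair of the loop)
        # np.random.normal expects the standard deviation, hence the square root
        self.rewards = [lambda m=mean, v=variance: float(np.random.normal(m, np.sqrt(v)))
                        for mean, variance in zip(rewards_means, rewards_variances)]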

    def reward(self, option_number: int):
        assert 0 <= option_number < self.k, 'The option chosen is not available by the arm bandit'
        return self.rewards[option_number]()

    def reward_and_optimal(self, option_number: int):
        # first extract all the possible rewards
        all_rewards = [r() for r in self.rewards]
        # return the reward chosen and the maximum one
        # self.rewards is a list, so index the sampled rewards instead of calling it
        return all_rewards[option_number], max(all_rewards)
    

In [None]:
# let's create an agent class for that

class Agent:
    def __init__(self, k: int, epsilon:float=0, start_estimate:float=0, update_estimate: callable=None) -> None:
        self.k = k
        # make sure epsilon is in the range [0, 1]
        assert 0 <= epsilon <= 1
        self.epsilon = epsilon # probability that the agent makes a random (exploratory) choice
        self.estimates = [start_estimate for _ in range(self.k)]
        self.step_sizes = [1 for _ in range(self.k)]

        # the default update weight is 1 / step_size
        # which corresponds to the sample-average estimate
        if update_estimate is None:
            update_estimate = lambda step_size: 1 / step_size
        
        self.update_estimate =  update_estimate  
        self.total_reward = 0
    
    def choose(self):
        p = random.random()
        if p < self.epsilon:
            # explore: return a uniformly random choice
            return random.randrange(self.k)
        else:
            # exploit: choose the option with the highest estimate
            return int(np.argmax(self.estimates))
    
    def update_estimates(self, option_num:int, option_reward:float):
        # make sure to add the reward to the total_reward
        self.total_reward += option_reward
        # make sure to update the estimated value of the corresponding reward
        e = self.estimates[option_num] # writing the entire expression repeatedly is not the best option
        self.estimates[option_num] = (e + self.update_estimate(self.step_sizes[option_num]) * (option_reward - e)) 
        # increment the step size for the corresponding option
        self.step_sizes[option_num] += 1 


In [None]:
# time to create a separate class for experimenting with the different settings of the Agent and the kArmBandit
class Experiment:
    def __init__(self, agent: Agent, arm_bandit: kArmBandit, steps: int=1000, exps:int=100) -> None:
        # first make sure they operate on the same number of options
        assert agent.k == arm_bandit.k, "The agent and the arm bandit must have the same number of options"
        self.agent = agent
        self.bandit = arm_bandit
        # how many times the agent has to choose an option in each trial
        self.steps = steps
        # the number of trials
        self.exps = exps

    def __run_trial(self):
        # each trial should calculate
        # 1. the average reward at each step
        # 2. the percentage of optimal rewards so far
        pass        
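
The trial logic is left unimplemented above. One possible shape for a single trial is sketched here as a standalone function rather than against the `Agent` and `kArmBandit` classes. Everything in it is an assumption made for illustration: the name `run_trial`, Gaussian rewards with unit variance, and reading "optimal" as "the arm with the highest true mean was chosen".

```python
import numpy as np

def run_trial(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run one epsilon-greedy trial; return running averages over the steps."""
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    k = len(true_means)
    best_arm = int(np.argmax(true_means))
    estimates = np.zeros(k)
    counts = np.zeros(k, dtype=int)
    rewards = np.empty(steps)
    optimal = np.empty(steps, dtype=bool)
    for t in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))        # explore
        else:
            arm = int(np.argmax(estimates))   # exploit
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        # incremental sample-average update
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        rewards[t] = reward
        optimal[t] = arm == best_arm
    step_index = np.arange(1, steps + 1)
    # 1. average reward up to each step, 2. share of optimal choices up to each step
    return np.cumsum(rewards) / step_index, np.cumsum(optimal) / step_index

avg_reward, optimal_share = run_trial([0.0, 1.0, 3.0], epsilon=0.1, steps=2000)
print(avg_reward[-1], optimal_share[-1])
```

Averaging these two curves over many trials with different seeds would give the kind of summary the `exps` parameter suggests.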
        


## Optimistic Initial Values
* One approach to drive the agent to explore more is to set optimistic initial action-value estimates. The main reasoning is that the agent will be "disappointed" by the rewards it actually receives, which drives it to try each of the actions several times. This approach is quite limited: 
1. It only drives exploration in the early steps: the effect of the initial values washes out as the agent keeps updating its estimates.
2. Any approach based on initial values does not generalize to non-stationary problems, since the drive to explore is only temporary.
3. Suitable ***optimistic*** values might not be known beforehand.
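
To see the effect in isolation, here is a small sketch of a purely greedy agent run twice, once with neutral and once with optimistic initial estimates. The function name `greedy_run`, the true means and the noise-free rewards are assumptions that keep the example deterministic.

```python
import numpy as np

def greedy_run(true_means, initial_estimate, steps):
    """Purely greedy agent with sample-average updates and noise-free rewards."""
    k = len(true_means)
    estimates = np.full(k, initial_estimate, dtype=float)
    counts = np.zeros(k, dtype=int)
    chosen = []
    for _ in range(steps):
        arm = int(np.argmax(estimates))
        reward = true_means[arm]  # noise-free, to keep the illustration deterministic
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        chosen.append(arm)
    return chosen

true_means = [0.1, 0.5, 0.9]
print(greedy_run(true_means, initial_estimate=0.0, steps=6))  # → [0, 0, 0, 0, 0, 0]
print(greedy_run(true_means, initial_estimate=5.0, steps=6))  # → [0, 1, 2, 2, 2, 2]
```

With an initial estimate of 0 the agent locks onto the first arm, while the optimistic start makes it try every arm once before settling on the best one. Note that with sample-average updates the initial value is overwritten by the first reward, so the optimism lasts for a single visit per arm; with a constant step size it would fade more gradually.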

## Nonstationary Problems
The sample-average method described earlier works well only for problems where the expected value of each action is constant. It is far from ideal for non-stationary rewards. One possible approach is to use a constant step size $\alpha \in (0, 1]$ to weight the new reward against the previous estimate. In other words:
$$ Q_{n + 1} = Q_n + \alpha \cdot (R_n - Q_n)$$
Unrolling the recursion gives:
$$Q_{n + 1} = (1 - \alpha) ^ n \cdot Q_1 + \sum_{i = 1}^{n} \alpha \cdot (1 - \alpha) ^ {n - i} R_i$$
This means that recent rewards weigh more than older ones in the estimate: the weight of $R_i$ decays exponentially with its age $n - i$.
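
The expansion can be checked numerically. The sketch below compares the recursive update with the closed form on synthetic rewards; the values of `alpha`, `q1` and the rewards are arbitrary choices for the check.

```python
import numpy as np

alpha = 0.1
q1 = 5.0  # hypothetical initial estimate Q_1
rng = np.random.default_rng(0)
rewards = rng.normal(size=50)  # hypothetical rewards R_1, ..., R_n
n = len(rewards)

# recursive form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = q1
for r in rewards:
    q += alpha * (r - q)

# closed form: the weight of R_i is alpha * (1 - alpha)^(n - i) for i = 1..n
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
closed_form = (1 - alpha) ** n * q1 + np.dot(weights, rewards)

print(np.isclose(q, closed_form))                          # → True
print(np.isclose((1 - alpha) ** n + weights.sum(), 1.0))   # → True
```

The second check shows that the weights on $Q_1$ and on all rewards sum to 1, so the estimate is a weighted average in which the latest reward carries the largest weight among the rewards.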

## Upper-confidence Bound action selection
Exploration is always needed mainly because the action-value estimates are always uncertain. The $\epsilon$-greedy approach forces exploration, but indiscriminately: it shows no preference for actions that are nearly greedy or particularly uncertain. Such an exploration approach is far from optimal. One way to improve our exploration process is to use the Upper Confidence Bound (UCB) action selection mechanism:
$$ A_t = \arg\max_a \left[ Q_{t}(a) + c \cdot \sqrt{\frac{\ln(t)}{N_t(a)}} \right] $$
where $N_t(a)$ is the number of times action $a$ has been chosen before time step $t$ and $c > 0$ controls the degree of exploration. The first term is clearly exploitation, while the second is exploration:  