# Multi-armed bandit problem

## Goal
- Reinforcement learning problem
  - Exploitation-Exploration tradeoff
  - Greedy decisions
  - $\epsilon$-greedy exploration
- Thompson Sampling
  - Beta distribution
  - Application 
    - AB Testing
- Code
- References

## Theoretical Aspects
### Reinforcement Learning problem

- A reinforcement learning system has four main elements: a policy, a reward signal, a value function, and, optionally, a model of the environment.


### k-armed Bandit 

$$\begin{equation}
\begin{aligned}
q_{*}(a) = \mathbb E[R_{t} \mid A_{t}=a]
\end{aligned}
\end{equation}$$

$$\begin{equation}
\begin{aligned}
Q_{t}(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1}R_{i} \cdot \mathbb 1_{A_{i}=a}}{\sum_{i=1}^{t-1}\mathbb 1_{A_{i}=a}}
\end{aligned}
\end{equation}$$

$$\begin{equation}
\begin{aligned}
A_{t} = \underset{a}{\arg\max}\, Q_{t}(a)
\end{aligned}
\end{equation}$$
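
The sample-average estimate $Q_{t}(a)$ and the greedy choice $A_{t}$ can be checked on a small hand-made history. This is a minimal sketch assuming NumPy; the function names `sample_average` and `greedy_action`, the toy history, and the choice to give untried actions an estimate of 0 are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sample_average(actions, rewards, k):
    """Return Q_t(a) for each of k actions from the history before step t."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    q = np.zeros(k)  # assumption: untried actions default to an estimate of 0
    for a in range(k):
        taken = actions == a  # indicator 1_{A_i = a}
        if taken.any():
            q[a] = rewards[taken].sum() / taken.sum()
    return q

def greedy_action(q):
    """Return argmax_a Q_t(a); np.argmax breaks ties by lowest index."""
    return int(np.argmax(q))

# Toy history (hypothetical): action 0 taken twice, action 1 once, action 2 never
history_actions = [0, 1, 0]
history_rewards = [1.0, 0.0, 0.5]
q = sample_average(history_actions, history_rewards, k=3)
print(q, greedy_action(q))
```

Action 0 has the highest average reward here, so the greedy rule selects it. Action 2 is never tried, which is the weakness that exploration is meant to address.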

### Exploration-Exploitation

$$\begin{equation}
\begin{aligned}
Q_{n} = \frac{R_{1}+R_{2}+\cdots+R_{n-1}}{n-1}
\end{aligned}
\end{equation}$$

$$\begin{equation}
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum\limits_{i=1}^{n}R_{i}  \\
 &= \frac{1}{n}\Bigg(R_{n} + \sum\limits_{i=1}^{n-1}R_{i}\Bigg )  \\
 &= \frac{1}{n}\Bigg(R_{n} + (n-1)\frac{1}{(n-1)}\sum\limits_{i=1}^{n-1}R_{i}\Bigg )  \\
 &= \frac{1}{n}\Bigg(R_{n} + (n-1)Q_{n} \Bigg )  \\
 &= \frac{1}{n}\Bigg(R_{n} + nQ_{n} - Q_{n} \Bigg )  \\
 &= Q_{n} + \frac{1}{n}\Big[R_{n} - Q_{n} \Big ]
\end{aligned}
\end{equation}$$

$$\begin{equation}
\begin{aligned}
\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \Big[ \text{Target} - \text{OldEstimate} \Big ]
\end{aligned}
\end{equation}$$
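
The incremental rule can be checked numerically: with step size $1/n$ it should reproduce the plain average of all rewards seen so far, without storing them. A minimal sketch assuming NumPy; the helper name `incremental_update`, the seed, and the normally distributed reward stream are made up for illustration.

```python
import numpy as np

def incremental_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
sample_rewards = rng.normal(loc=1.0, scale=1.0, size=100)  # made-up reward stream

q_est = 0.0  # Q_1: arbitrary initial estimate, overwritten by the first update
for n, r in enumerate(sample_rewards, start=1):
    # Step size 1/n turns the update into the running sample average
    q_est = incremental_update(q_est, r, 1.0 / n)

print(q_est, sample_rewards.mean())
```

The two printed values agree up to floating-point error. The same update appears in `calcQEstimate` in the code under Practical Aspects, with `1/self.actionCount[actionId]` as the step size.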

### Greedy decisions
### $\epsilon$-greedy exploration
### Beta distribution
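
A short sketch of the Beta facts that the 0/1-reward code in Practical Aspects relies on. The symbols $\theta$ (an arm's unknown success probability), $s$ (observed successes) and $f$ (observed failures) are notation introduced here, and the uniform $\text{Beta}(1, 1)$ prior is an assumption inferred from the `1 + alpha`, `1 + beta` arguments in `calcBetaAction`.

$$\begin{equation}
\begin{aligned}
p(\theta \mid \alpha, \beta) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, \quad 0 \le \theta \le 1 \\
\theta \mid s, f &\sim \text{Beta}(\alpha + s,\ \beta + f) \\
\mathbb E[\theta \mid s, f] &= \frac{\alpha + s}{\alpha + \beta + s + f}
\end{aligned}
\end{equation}$$

With $\alpha = \beta = 1$, an arm that has paid out $s$ times and failed $f$ times has posterior $\text{Beta}(1 + s, 1 + f)$, which narrows around $s / (s + f)$ as observations accumulate.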
### Thompson Sampling
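
For 0/1 rewards, Thompson sampling keeps a Beta posterior per arm, draws one sample from each posterior, and pulls the arm with the largest sample. The sketch below follows the same idea as `calcBetaAction` and `calcBetaReward` in the code under Practical Aspects, but it is a standalone illustration: the function name `thompson_bernoulli`, the uniform prior, and the seed are assumptions, and the arm probabilities are copied from the `probs` list used there.

```python
import numpy as np

def thompson_bernoulli(probs, steps, rng):
    """Run Beta-Bernoulli Thompson sampling; return pull counts and cumulative regret."""
    probs = np.asarray(probs, dtype=float)
    k = len(probs)
    successes = np.zeros(k)  # number of reward-1 outcomes per arm
    failures = np.zeros(k)   # number of reward-0 outcomes per arm
    pulls = np.zeros(k, dtype=int)
    regret = np.zeros(steps)
    total = 0.0
    for t in range(steps):
        # One draw per arm from its Beta(1 + successes, 1 + failures) posterior
        thetas = rng.beta(1 + successes, 1 + failures)
        arm = int(np.argmax(thetas))
        reward = rng.binomial(1, probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        total += probs.max() - probs[arm]  # expected loss versus the best arm
        regret[t] = total
    return pulls, regret

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
pulls, regret = thompson_bernoulli([0.1, 0.4, 0.45, 0.6, 0.61], steps=1000, rng=rng)
print(pulls, regret[-1])
```

Exploration here comes from posterior uncertainty rather than from a fixed $\epsilon$: arms with few observations have wide posteriors and are still occasionally sampled as the best.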



## Practical Aspects

In [None]:
# %load ./mab/mab_ts_ab.py
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

class MABClass:
    def __init__(self, epsilon, leverCount, probs):
        self.epsilon = epsilon # probability to explore 
        self.leverCount = leverCount # number of choices
        self.probs = probs # array of lever probability
        self.redefineSettings()

    def calcAction(self):
        # Select action/lever as per explore-exploit probability
        if np.random.rand() < self.epsilon: # explore case
            return np.random.choice(self.leverCount)
        else: # exploit case
            return np.argmax(self.valueEstimateQ) # random.choice(self.leverCount)

    def calcBetaAction(self, sum_rewards, sum_penalties):
        # Select action/lever as per beta sampling
        thetas = [np.random.beta(1 + alpha, 1 + beta) for (alpha, beta) in zip(sum_rewards, sum_penalties)]
        return np.argmax(thetas)

    def calcReward(self, actionId):
        val = np.random.randn() + self.valueTrueQStar[actionId]
        return val
#         return 1 if (np.random.random() < self.probs[actionId]) else 0
#         return val if (np.random.random() < self.probs[actionId]) else 0

    def calcBetaReward(self, k):
        # Bernoulli reward for lever k, plus the regret against the best lever
        reward = np.random.binomial(1, self.probs[k])
        regret = max(self.probs) - self.probs[k]
        return reward, regret

    def calcQEstimate(self, actionId, reward):
        self.actionCount[actionId] += 1
        self.valueEstimateQ[actionId] += (1/self.actionCount[actionId]) * (reward - self.valueEstimateQ[actionId])

    def redefineSettings(self):
        # Draw fresh true action values q*(a) and reset estimates and counts (called before each run)
        self.valueTrueQStar = np.random.randn(self.leverCount)
        self.valueEstimateQ = np.zeros(self.leverCount, dtype=float)
        self.actionCount = np.zeros_like(self.valueEstimateQ, dtype=int)

epsilons = [0, 0.01, 0.1] # , 0.5, 0.9] # Define list of epsilons=exploration
# leverCount = 10 # Define number of levers
runs = 2000 # 500 #
steps = 1000
# probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95] # Define lever probability
probs = [0.1, 0.4, 0.45, 0.6, 0.61]
leverCount = len(probs)
max_prob = max(probs)

# One slice per epsilon so that each setting keeps its own results
rewards = np.zeros((len(epsilons), runs, steps))
penalties = np.zeros((len(epsilons), runs, steps))
actions = np.zeros((len(epsilons), runs, steps))
total_rewards = np.zeros((len(epsilons), steps))
regret = np.zeros((len(epsilons), steps))


for e, epsilon in enumerate(epsilons): # loop over all the bandits epsilons
    bandit = MABClass(epsilon, leverCount, probs)
    for run in tqdm(range(runs)): # loop over all the runs
        bandit.redefineSettings()
        for step in range(steps): # loop over all the steps
            actionId = bandit.calcAction()
            reward = bandit.calcReward(actionId)
            # actionId = bandit.calcBetaAction()
            # reward, regret = bandit.calcBetaReward(actionId)
            bandit.calcQEstimate(actionId, reward)
            actions[e, run, step] = actionId # index by epsilon, not a fixed 0
            rewards[e, run, step] = reward
# Average over runs: one curve of length `steps` per epsilon
avgActions, avgRewards = actions.mean(axis=1), rewards.mean(axis=1)

plt.subplot(1, 1, 1)
for eps, rewardsY in zip(epsilons, avgRewards):
    plt.plot(rewardsY, label=r'$\epsilon$ = {}'.format(eps), lw=1)
plt.xlabel('Steps')
plt.ylabel('Average reward')
plt.legend()

# plt.savefig('./images/{}.png'.format("mab_avg_rewards"))
# plt.close()
plt.show()


## References
- Reinforcement Learning - An Introduction - Richard S. Sutton and Andrew G. Barto
- A tutorial on Thompson Sampling - Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen
- A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit - Giuseppe Burtini, Jason Loeppky and Ramon Lawrence