In [None]:
import numpy as np
import pandas as pd
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(world_readable=True, theme="white")

from MAB.bandits import GaussianBanditGame, GaussianBandit, BernoulliBandit

## 1. A Gaussian bandit game

In [None]:
slotA = GaussianBandit(5,3)
slotB = GaussianBandit(6,2)
slotC = GaussianBandit(1,5)

In [None]:
game = GaussianBanditGame([slotA, slotB, slotC])

In [None]:
game.user_play()

## 2. Online Advertising Example

In [None]:
adA = BernoulliBandit(0.004)
adB = BernoulliBandit(0.016)
adC = BernoulliBandit(0.02)
adD = BernoulliBandit(0.028)
adE = BernoulliBandit(0.031)
ads = [adA, adB, adC, adD, adE]

### 2.1 Strategy 1: A/B/n testing

This is an exploration strategy used to determine which action should be taken by directly comparing actions. An experiment is run and at the end of the experiment, the results are compared for each action.

This can be seen as a baseline strategy for solving the problem.

Suppose you select an action $a$ for the $i$th time and receive reward $R_i$. The average reward observed prior to the $n^{th}$ selection is $$Q_n \equiv \frac{R_1 + \dots + R_{n-1}}{n-1}.$$ After the $n^{th}$ reward, the average becomes $Q_{n+1} = \frac{R_1 + \dots + R_n}{n}$. Separating out the $R_n$ term and writing the remaining sum as $(n-1)Q_n$ gives the update rule
$$Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$$

This tells us that to update the expected reward for the $(n+1)^{th}$ selection of an action, we only need to add the deviation of the latest reward from the current estimate, divided by the number of times that action has been selected. As we make more observations, the corrections to the expected reward get smaller and smaller.
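
As a quick sanity check of the update rule, the sketch below compares the incremental estimate with the ordinary sample mean. The reward sequence, its click rate, and the seed are made up for illustration and are not taken from the ads defined above.

```python
import numpy as np

rng = np.random.default_rng(0)               # illustrative seed
rewards = rng.binomial(1, 0.03, size=1_000)  # synthetic 0/1 rewards (assumed click rate)

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n                         # Q_{n+1} = Q_n + (R_n - Q_n) / n

# The incremental estimate should agree with the batch mean up to rounding error
print(np.isclose(Q, rewards.mean()))
```

The incremental form needs only the current estimate and a counter, which is why the simulation loops in this notebook never store the individual rewards.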

One limitation is worth noting: because every observation carries equal weight in this average, the estimate reacts slowly if the environment changes over time. In that case we may want more recent observations to carry more weight than older ones.

In [None]:
n_test = 10_000
n_prod = 90_000
n_ads = len(ads)
Q = np.zeros(n_ads) # Q, action values
N = np.zeros(n_ads) # N, total impressions
total_reward = 0
avg_rewards = [] # save average rewards over time

In [None]:
# Each turn, randomly select an action to take
for i in range(n_test):
    ad_chosen = np.random.randint(n_ads)
    R = ads[ad_chosen].pull_lever() # observe reward
    N[ad_chosen] += 1
    Q[ad_chosen] += (1 / N[ad_chosen]) * (R - Q[ad_chosen])
    total_reward += R
    avg_reward_so_far = total_reward / (i+1)
    avg_rewards.append(avg_reward_so_far)

In [None]:
best_ad_index = np.argmax(Q)
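# Use single quotes inside the f-string; reusing double quotes is a syntax error before Python 3.12
print(f"The best performing ad is ad {['A', 'B', 'C', 'D', 'E'][best_ad_index]}")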

In [None]:
# Make a choice as to which ad was best for the test period and use that in production
ad_chosen = best_ad_index
for i in range(n_prod):
    R = ads[ad_chosen].pull_lever()
    total_reward += R
    avg_reward_so_far = total_reward / (n_test + i + 1)
    avg_rewards.append(avg_reward_so_far)

In [None]:
df_rewards = pd.DataFrame(avg_rewards, columns=["A/B/n"])

df_rewards.iplot(title=f"A/B/n Test Avg. Reward: {avg_reward_so_far:.4f}", xTitle="Impressions", yTitle="Avg. Reward")

Using this strategy, we can see that once the exploration phase ends, the average reward rises steadily and levels off as it approaches the average for the selected campaign (campaign E here).

#### Issues with A/B/n testing

* It is sample-inefficient and does not modify the experiment dynamically by learning from observations. For example, it does not use the information gathered so far to cull unpromising campaigns early.
* It is unable to correct a decision once it's made. If the wrong "best" campaign is selected during the test period, that choice is fixed for the production period.
* It cannot adapt to a non-stationary environment, in which the reward distributions change over time.
* The length of the test period is a hyperparameter that has a significant effect on both performance and cost.



### 2.2 Strategy 2: $\epsilon$-Greedy Actions

The $\epsilon$-greedy approach corrects the static nature of A/B/n testing by allowing for continuous exploration.

In essence, the agent takes the greedy action (the one with the highest estimated reward so far) with probability $1 - \epsilon$, and with probability $\epsilon$ it takes a uniformly random action, which may be sub-optimal. Typically, the value of $\epsilon$ is kept small so that most actions exploit the knowledge gathered so far.

In [None]:
eps = 0.2
n_prod = 100_000
n_ads = len(ads)
Q = np.zeros(n_ads)
N = np.zeros(n_ads)
total_reward = 0
avg_rewards = []

In [None]:
ad_chosen = np.random.randint(n_ads)
for i in range(n_prod):
    R = ads[ad_chosen].pull_lever()
    N[ad_chosen] += 1
    Q[ad_chosen] += (1 / N[ad_chosen]) * (R - Q[ad_chosen])
    total_reward += R
    avg_reward_so_far = total_reward / (i + 1)
    avg_rewards.append(avg_reward_so_far)

    if np.random.uniform() <= eps:
        ad_chosen = np.random.randint(n_ads)
    else:
        ad_chosen = np.argmax(Q)
    
df_rewards[f"e-greedy: {eps}"] = avg_rewards

In [None]:
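# Assumes the two cells above have been re-run once for each eps value listed here,
# so that df_rewards contains one column per value
greedy_list = ['e-greedy: 0.01', 'e-greedy: 0.05', 'e-greedy: 0.1', 'e-greedy: 0.2']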

df_rewards[greedy_list].iplot(title="e-Greedy Actions", dash=["solid", "dash", "dashdot", "dot"])
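
The comparison above needs one column per $\epsilon$ value. A minimal sketch of how those runs could be produced in a single pass is given below. It does not use the `MAB` package: the helper `run_eps_greedy` is hypothetical, and it assumes that `pull_lever()` on a Bernoulli bandit returns 1 with the given probability and 0 otherwise. The step count is reduced to 10,000 to keep the example fast.

```python
import numpy as np

def run_eps_greedy(click_probs, eps, n_steps, rng):
    """Simulate epsilon-greedy on Bernoulli arms; return the running average reward."""
    n_arms = len(click_probs)
    Q = np.zeros(n_arms)            # estimated action values
    N = np.zeros(n_arms)            # impressions per arm
    total_reward = 0.0
    avg_rewards = np.empty(n_steps)
    arm = int(rng.integers(n_arms))  # first action is chosen at random
    for i in range(n_steps):
        R = float(rng.random() < click_probs[arm])  # assumed stand-in for pull_lever()
        N[arm] += 1
        Q[arm] += (R - Q[arm]) / N[arm]
        total_reward += R
        avg_rewards[i] = total_reward / (i + 1)
        # Explore with probability eps, otherwise exploit the current best estimate
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(Q))
    return avg_rewards

click_probs = [0.004, 0.016, 0.02, 0.028, 0.031]  # same rates as ads A to E
rng = np.random.default_rng(42)                   # illustrative seed
results = {
    f"e-greedy: {eps}": run_eps_greedy(click_probs, eps, n_steps=10_000, rng=rng)
    for eps in [0.01, 0.05, 0.1, 0.2]
}
```

Each array in `results` could then be assigned to `df_rewards` under its key, provided `n_steps` matches the length of the existing frame.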

We can see from the above plot that the strategies with the lowest $\epsilon$ values perform the worst early on: they explore very rarely, so they must accumulate many samples before they can switch to a better action. In the long run, however, the higher $\epsilon$ values plateau while the smaller values continue to improve. This suggests that a beneficial strategy would be to start with a large $\epsilon$ for early exploration and then decrease it as more samples are collected.
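
One possible way to realise such a schedule is a hyperbolic decay with a floor. The function name `decayed_eps` and its parameters (`eps_start`, `eps_min`, `decay`) are illustrative choices, not values tuned for this problem.

```python
def decayed_eps(t, eps_start=0.2, eps_min=0.01, decay=1e-3):
    """Hyperbolic decay: starts at eps_start and shrinks toward the floor eps_min."""
    return max(eps_min, eps_start / (1.0 + decay * t))

# Inspect the schedule at a few impression counts
for t in [0, 1_000, 10_000, 100_000]:
    print(t, round(decayed_eps(t), 4))
```

In the simulation loop, the fixed `eps` in the exploration test would be replaced by `decayed_eps(i)`. The floor keeps a small amount of exploration alive, at the cost of two extra hyperparameters.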

#### Disadvantages
* Like A/B/n testing, $\epsilon$-greedy allocates its exploration budget statically and inefficiently: random actions are spread uniformly over all campaigns. In this particular example, you would want to drop the campaigns that are clearly performing poorly and spend the exploration budget on the more promising options.
* Modifying the $\epsilon$-greedy approach introduces more hyperparameters that require tuning.

#### Advantages
* Unlike the A/B/n approach, exploration is continuous and therefore it could feasibly adapt to a dynamic environment.
* The $\epsilon$-greedy approach can be improved by dynamically adjusting the value of $\epsilon$.
* The approach can be made more responsive by increasing the importance of recent observations. In the update equation for the average reward $Q_{n+1}$, we could replace the factor $1/n$ with a constant step size $\alpha$, so that more recent observations contribute more than older ones.
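
The constant step-size variant, $Q_{n+1} = Q_n + \alpha (R_n - Q_n)$, can be sketched next to the sample average as follows. The reward stream is synthetic: the click rates (0.02 then 0.10), the change point, the seed, and $\alpha = 0.01$ are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
# Synthetic non-stationary arm: the click rate jumps from 0.02 to 0.10 halfway through
p = np.r_[np.full(5_000, 0.02), np.full(5_000, 0.10)]
rewards = (rng.random(p.size) < p).astype(float)

alpha = 0.01
Q_avg, Q_const = 0.0, 0.0
for n, R in enumerate(rewards, start=1):
    Q_avg += (R - Q_avg) / n          # sample average: all observations weighted equally
    Q_const += alpha * (R - Q_const)  # constant step: weights decay geometrically with age

print(f"sample average: {Q_avg:.3f}, constant step: {Q_const:.3f}")
```

The sample average ends near the mean over the whole stream, whereas the constant-step estimate would be expected to sit closer to the recent click rate, although it is also noisier because it effectively averages over fewer observations.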


### 2.3 Strategy 3: Upper confidence bounds

