## The Pessimistic Principle

In this assignment, our goal is to implement and evaluate the performance of the lower confidence bound algorithm for the four Bernoulli arms example covered in Lecture 1.

![Four arms](./graphs/fourarms.png)

We will run 200 trials. For each trial, generate an offline dataset that consists of 500 rewards from the first arm, 500 rewards from the second arm, 1 reward from the third arm, and 500 rewards from the last arm. We then apply the LCB algorithm with the constant $c$ set to each of $0$, $0.1$, and $1$. For each choice of $c$, compute the regret (see the definition on Page 20 of Lecture 10) in each trial, then average the regret over the 200 trials. Print and compare the three resulting regrets.
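To make the dataset concrete, here is a minimal sketch of how one offline dataset could be summarized by its per-arm sample means. The arm probabilities and sample sizes are copied from the cells that follow; the function name `generate_offline_dataset` and the local generator seed are our own assumptions, so the numbers will not match the notebook's `np.random.seed(10)` run.

```python
import numpy as np

# Values copied from the notebook cells that follow.
bandit_probabilities = np.array([0.10, 0.40, 0.10, 0.10])
sample_size = np.array([500, 500, 1, 500])


def generate_offline_dataset(rng):
    """Return the per-arm sample means of one simulated offline dataset."""
    # Number of rewards equal to 1 among sample_size[k] Bernoulli(p_k) draws, per arm k.
    successes = rng.binomial(sample_size, bandit_probabilities)
    return successes / sample_size


rng = np.random.default_rng(0)  # hypothetical seed, independent of np.random.seed(10)
sample_means = generate_offline_dataset(rng)
print(sample_means)
```

Because the third arm is observed only once, its sample mean is always exactly 0 or 1, which is what makes this example hard for a greedy rule.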

#### First, we implement the environment

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc

# You might see an error that the module "tkinter" is not installed. On macOS you can install it from the terminal via "brew install python-tk@3.9". General help can, as always, be found on Stack Overflow: "https://stackoverflow.com/questions/25905540/importerror-no-module-named-tkinter"

np.random.seed(10)

bandit_probabilities = [0.10, 0.40, 0.10, 0.10]

number_of_bandits = len(bandit_probabilities) # = n_actions
    
action_space = np.arange(number_of_bandits) # =[0,1,2,3]

number_of_trials = 200

def step(action):
    rand = np.random.random()  # [0.0,1.0)
    reward = 1.0 if (rand < bandit_probabilities[action]) else 0.0
    return reward


#### Second, we review the lower confidence bound algorithm and implement it

<img src="graphs/LCB.png" width=700>
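As a quick numerical illustration, the sketch below evaluates the uncertainty penalty $c \sqrt{\log(N) / n_k}$ for the sample sizes of this assignment, where $N$ is the total number of offline rewards and $n_k$ the number of rewards for arm $k$. This is the form of the penalty used in the implementation further down; the helper name `lcb_penalty` is hypothetical and not part of the notebook.

```python
import numpy as np

sample_size = [500, 500, 1, 500]  # per-arm counts used in this assignment


def lcb_penalty(c, counts):
    """Penalty subtracted from each arm's estimated mean reward."""
    total = sum(counts)
    return [c * np.sqrt(np.log(total) / n_k) for n_k in counts]


for c in [0, 0.1, 1.0]:
    print(c, np.round(lcb_penalty(c, sample_size), 3))
```

For $c = 1$ the penalty is roughly 0.12 for an arm with 500 rewards but roughly 2.7 for the arm with a single reward, so the poorly explored arm is penalized much more heavily.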

In [2]:
lcb_constants = [0, 0.1, 1.0]
sample_size = [500, 500, 1, 500]

In [3]:
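def lower_confidence_bound_policy(c, actions, q_values, num_invocations):
    # Lower bound of each arm: estimated mean minus c * sqrt(log(total samples) / samples of that arm).
    # An arm without data gets -inf, so the pessimistic rule never prefers it over an observed arm.
    total_invocations = sum(num_invocations)
    lower_confidence_bounds = [
        q_values[action] - c * np.sqrt(np.log(total_invocations) / num_invocations[action])
        if num_invocations[action] > 0 else -np.inf
        for action in actions
    ]
    # Choose the arm with the largest lower bound, breaking ties uniformly at random.
    best_bound = np.max(lower_confidence_bounds)
    return np.random.choice([action_ for action_, value_ in enumerate(lower_confidence_bounds) if value_ == best_bound])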


regret = np.zeros((len(lcb_constants), number_of_trials), dtype=float)

for lcb_constant_counter, lcb_constant in enumerate(lcb_constants):
    for trial in range(number_of_trials):
        n = np.zeros(number_of_bandits, dtype=int)
        q = np.zeros(number_of_bandits, dtype=float)

        for j in range(len(sample_size)):
            for t in range(sample_size[j]):
                action = j
                r = step(action)

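                # Update the action counter and the reward estimate q.
                # Note: because n[action] is incremented first, the step size
                # 1 / (n[action] + 1) makes q equal to (sum of rewards) / (n + 1),
                # a slightly shrunk sample mean; an exact sample mean would use
                # a step size of 1.0 / n[action].
                n[action] += 1
                q[action] = q[action] + 1.0 / (n[action] + 1) * (r - q[action])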

        regret[lcb_constant_counter, trial] = np.max(bandit_probabilities) - bandit_probabilities[lower_confidence_bound_policy(lcb_constant, action_space, q, sample_size)]

print(np.mean(regret, axis=1))

[0.027 0.    0.   ]
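To interpret these numbers, the following sketch checks when the third arm can beat the second arm after a "lucky" single reward of 1. It rests on two simplifying assumptions: the second arm's estimate sits exactly at its true mean of 0.40, and the third arm's estimate is 0.5, which is what the update rule above produces after one reward of 1. The names `lcb`, `q_arm3_lucky`, and `arm3_wins` are illustrative only.

```python
import numpy as np

total = 500 + 500 + 1 + 500  # total number of offline rewards


def lcb(q, c, n):
    """Lower confidence bound in the form used by the notebook's policy."""
    return q - c * np.sqrt(np.log(total) / n)


# The notebook's update gives the third arm the estimate 1 / (1 + 1) = 0.5
# when its single reward equals 1.
q_arm3_lucky = 0.5
# Assumption: the second arm's estimate sits exactly at its true mean.
q_arm2 = 0.40

arm3_wins = {c: bool(lcb(q_arm3_lucky, c, 1) > lcb(q_arm2, c, 500)) for c in [0, 0.1, 1.0]}
print(arm3_wins)

# Expected regret when only c = 0 lets the third arm win:
# P(single reward = 1) * (best mean - third arm's mean).
expected_regret_c0 = 0.10 * (0.40 - 0.10)
print(expected_regret_c0)
```

Under these assumptions only $c = 0$ lets the third arm win, which happens with probability 0.1 and costs $0.4 - 0.1 = 0.3$ in regret. The expected regret is therefore about 0.03, in line with the printed 0.027, while for $c = 0.1$ and $c = 1$ the penalty on the single-sample arm is large enough to rule it out.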
