In [None]:
import random

We create *k* actions whose expected rewards are drawn uniformly from [-1, 1]. Each expected reward serves as the mean of a normal distribution with unit variance. When an action is selected, we return a reward sampled from that action's distribution.

In [None]:
k = 10
expected_rewards = [random.uniform(-1, 1) for _ in range(k)]

print(expected_rewards)

def action(a):
    return random.normalvariate(expected_rewards[a], 1)
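
Because each reward is a noisy draw, a single sample says little about an action's expected reward, but the mean of many samples should land close to it. The sketch below illustrates this with one made-up distribution; `true_mean`, `rng`, and `draws` are hypothetical names chosen for this illustration and are not part of the notebook's state.

```python
import random

# Hypothetical stand-in for one action's expected reward (an assumption
# made for this illustration, not a value taken from the notebook).
true_mean = 0.3

# A separately seeded generator keeps this check reproducible and leaves
# the module-level random state untouched.
rng = random.Random(42)

# Draw many rewards from a unit-variance normal centred on true_mean.
draws = [rng.normalvariate(true_mean, 1) for _ in range(10_000)]

# Individual draws vary widely, but their mean should sit near true_mean.
sample_mean = sum(draws) / len(draws)
print(f"first draw: {draws[0]:.3f}, sample mean: {sample_mean:.3f}")
```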

Next we need a way to update the value estimates. We use the sample-average method: the estimate for an action is the mean of all rewards received from that action so far.

In [None]:
estimates = [0] * k
times_selected = [0] * k

def sample_average_estimate(a, r):
    estimates[a] = (estimates[a] * times_selected[a] + r) / (times_selected[a] + 1)
    times_selected[a] += 1
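
The same sample average can also be written as an incremental update, Q_new = Q_old + (r - Q_old) / n, which avoids multiplying the old estimate back up by its count. The following is a minimal sketch of that equivalent form, not the notebook's own function; `incremental_update` and the test rewards are hypothetical and introduced only for illustration.

```python
import random

def incremental_update(estimate, count, reward):
    """Return the updated (estimate, count) after observing one more reward."""
    count += 1
    # Move the estimate toward the new reward by a step of size 1 / count.
    estimate += (reward - estimate) / count
    return estimate, count

# Compare against a plain average on some made-up rewards.
rng = random.Random(0)
sample_rewards = [rng.normalvariate(0.5, 1) for _ in range(100)]

q, n = 0.0, 0
for reward in sample_rewards:
    q, n = incremental_update(q, n, reward)

plain_average = sum(sample_rewards) / len(sample_rewards)
# The two values should agree up to floating-point rounding.
print(abs(q - plain_average) < 1e-9)
```

Returning the new values rather than mutating shared lists keeps the sketch independent of the notebook's `estimates` and `times_selected`.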

Let's try a simple greedy action selection method, which always picks an action with the highest current estimate and breaks ties at random.

In [None]:
def greedy_select():
    # Find the indices of all actions whose estimate equals the current maximum
    best = max(estimates)
    max_actions = [a for a, value in enumerate(estimates) if value == best]
    # Break ties by choosing one of them at random
    return random.choice(max_actions)

In [None]:
for i in range(1000):
    a = greedy_select()
    r = action(a)
    sample_average_estimate(a, r)

print(estimates)

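We can see that several estimates have barely moved from their initial value of 0. Pure greedy selection never explores: once one action's estimate is the highest, that action is typically chosen on every remaining step and the others are never revisited.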

To fix this, let's try epsilon-greedy selection: with probability epsilon we pick an action uniformly at random, and otherwise we act greedily.

In [None]:
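def epsilon_greedy_select(epsilon):
    # Explore: with probability epsilon, pick any action uniformly at random
    if random.random() < epsilon:
        return random.randrange(k)
    # Exploit: otherwise pick an action with the highest current estimate
    return greedy_select()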

In [None]:
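# Reset the estimates so this run does not start from the greedy run's values
estimates = [0] * k
times_selected = [0] * k

for i in range(1000):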
    a = epsilon_greedy_select(0.05)
    r = action(a)
    sample_average_estimate(a, r)

print(estimates)
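
To compare the two selection methods more directly, it can help to look at how often each action was chosen and at the average reward collected, rather than at the estimates alone. The sketch below wraps the whole experiment in one self-contained function so that both runs face the same actions. `run_bandit` and its parameters are hypothetical names introduced for this illustration, and it uses the incremental form of the sample average instead of the notebook's global lists.

```python
import random

def run_bandit(epsilon, steps=1000, arms=10, seed=0):
    """Run one experiment; return (selection counts, true means, average reward)."""
    rng = random.Random(seed)
    # The same seed gives the same true means, so runs are comparable.
    true_means = [rng.uniform(-1, 1) for _ in range(arms)]
    q = [0.0] * arms       # value estimates
    counts = [0] * arms    # times each arm was selected
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(arms)
        else:
            # Exploit: pick among the arms with the highest estimate.
            best = max(q)
            arm = rng.choice([i for i, v in enumerate(q) if v == best])
        reward = rng.normalvariate(true_means[arm], 1)
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]  # sample average
        total += reward
    return counts, true_means, total / steps

for eps in (0.0, 0.05):
    counts, true_means, avg = run_bandit(eps)
    print(f"epsilon={eps}: counts={counts}, average reward={avg:.3f}")
```

Setting `epsilon=0.0` reproduces pure greedy selection, so a single function covers both methods. A single run is noisy, so averaging over several seeds would give a fairer comparison.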