### Problem Description

The k-armed bandit problem is a simple reinforcement learning problem in which the environment has only a single state. The agent chooses among k actions, and each action yields a reward drawn from its own stationary probability distribution. The agent does not know in advance which action is best, so it must explore the actions to estimate the expected reward of each one.
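
In symbols, borrowing notation commonly used in the reinforcement learning literature (the symbols below are assumed here and do not appear in the original text), the value of an action $a$ is the expected reward given that $a$ is selected at time step $t$:

$$
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right], \qquad a \in \{1, \dots, k\}
$$

If these values were known, the problem would be trivial: always pick $a^* = \arg\max_a q_*(a)$. The agent only sees sampled rewards, so it has to estimate $q_*(a)$ from experience.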

The real-life analogy is a gambler who wants to maximize their winnings (reward) at a row of k slot machines ("one-armed bandits"), each with its own probabilistic payoff. The gambler must figure out which slot machine is the most profitable.

In [1]:
# Setup

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial

def reset_seed(seed=42):
    np.random.seed(seed)

### The Environment

There will be two kinds of environments: discrete and continuous. In a discrete bandit environment, each action has a finite set of payoffs, each with its own probability (the probabilities for an action must sum to one). In a continuous bandit environment, each action's payoff is drawn from a continuous probability distribution.

In [2]:
class DiscreteBandit():
    """
    A discrete bandit environment.
    
    Args:
        payoffs: (ndarray of shape [n_arms, n_payoffs]) the possible payoffs of each action
        probabilities: (ndarray of shape [n_arms, n_payoffs]) the probability of each payoff of each action
    """
    def __init__(self, payoffs, probabilities):
        self.payoffs = payoffs
        self.probabilities = probabilities
        
    def step(self, x):
        return np.random.choice(self.payoffs[x], p=self.probabilities[x])
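
Because `np.random.choice` raises an error when the probabilities passed to `p` do not sum to one, it can help to check the inputs before building the environment. The following is a minimal sketch of such a check; `validate_discrete_bandit` and its `atol` parameter are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def validate_discrete_bandit(payoffs, probabilities, atol=1e-8):
    """Hypothetical helper (not from the notebook): sanity-check bandit arrays."""
    payoffs = np.asarray(payoffs, dtype=float)
    probabilities = np.asarray(probabilities, dtype=float)

    # Every payoff needs exactly one matching probability.
    if payoffs.shape != probabilities.shape:
        raise ValueError("payoffs and probabilities must have the same shape")

    # Probabilities cannot be negative.
    if np.any(probabilities < 0):
        raise ValueError("probabilities must be non-negative")

    # Each row (one row per arm) must be a valid distribution.
    row_sums = probabilities.sum(axis=1)
    if not np.allclose(row_sums, 1.0, atol=atol):
        raise ValueError("each arm's probabilities must sum to one, got %s" % row_sums)

    return payoffs, probabilities
```

Calling `validate_discrete_bandit(payoffs, probabilities)` before `DiscreteBandit(payoffs, probabilities)` would surface a malformed row early instead of at the first `step` on that arm.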

In [3]:
reset_seed(42)

payoffs = np.array([[-1, 0, 1],
                    [-2, 0, 5],
                    [0, 4, 8],
                    [-2, -1, 1]])

probabilities = np.array([[0.45, 0.1, 0.45],
                          [0.45, 0.1, 0.45],
                          [0.98, 0.01, 0.01],
                          [0.98, 0.01, 0.01]])

env = DiscreteBandit(payoffs, probabilities)

max_steps = 1000
total_reward = 0

for step in range(max_steps):
    total_reward += env.step(np.random.choice(len(payoffs)))  # choose a random action
    
print("Total reward:", total_reward)

Total reward: -111
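
To put the random policy's total in context, the expected reward of each arm can be computed directly from the payoff and probability tables. The sketch below repeats the arrays from the cell above so that it runs on its own; the names `expected_rewards`, `random_policy_total` and `best_arm_total` are introduced here for illustration.

```python
import numpy as np

payoffs = np.array([[-1, 0, 1],
                    [-2, 0, 5],
                    [0, 4, 8],
                    [-2, -1, 1]])

probabilities = np.array([[0.45, 0.1, 0.45],
                          [0.45, 0.1, 0.45],
                          [0.98, 0.01, 0.01],
                          [0.98, 0.01, 0.01]])

max_steps = 1000

# Expected reward of an arm = sum over payoffs of payoff * probability.
expected_rewards = (payoffs * probabilities).sum(axis=1)

# A uniformly random policy earns the average of the arm values per step.
random_policy_total = max_steps * expected_rewards.mean()

# Always pulling the best arm earns that arm's value per step.
best_arm = int(np.argmax(expected_rewards))
best_arm_total = max_steps * expected_rewards[best_arm]

print("Expected reward per arm:", expected_rewards)
print("Expected total, random policy: %.1f" % random_policy_total)
print("Expected total, best arm (%d): %.1f" % (best_arm, best_arm_total))
```

Under these tables the random policy is expected to lose roughly 120 over 1000 steps, which is consistent with the observed total of -111, while an agent that identified the best arm would expect a clearly positive total.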


In [4]:
class ContinuousBandit():
    """
    A continuous bandit environment.
    
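    Args:
        reward_distributions: (list of callables) one zero-argument function per action; calling it samples a reward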
    """
    def __init__(self, reward_distributions):
        self.reward_distributions = reward_distributions
        
    def step(self, x):
        return self.reward_distributions[x]()

In [5]:
reset_seed(42)

mu = [np.random.uniform(-10, 10) for _ in range(5)]
sigma = [np.random.uniform(0, 5) for _ in range(5)]
reward_distributions = [partial(np.random.normal, m, s) for m, s in zip(mu, sigma)]

for m, s in zip(mu, sigma):
    print("Mean: %.2f \t Sigma: %.2f" % (m, s))

env = ContinuousBandit(reward_distributions)

max_steps = 1000
total_reward = 0

for step in range(max_steps):
    total_reward += env.step(np.random.choice(len(reward_distributions)))  # choose a random action
    
print("Total reward: %.2f" % total_reward)

Mean: -2.51 	 Sigma: 0.78
Mean: 9.01 	 Sigma: 0.29
Mean: 4.64 	 Sigma: 4.33
Mean: 1.97 	 Sigma: 3.01
Mean: -6.88 	 Sigma: 3.54
Total reward: 848.69
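
The same kind of baseline can be sketched for the continuous environment. Assuming the legacy `np.random.seed(42)` stream used above, the code below regenerates `mu` and `sigma` in the same order, then computes the expected total of a uniformly random policy and a rough standard deviation for that total using the law of total variance. The names `random_policy_total`, `best_arm_total` and `total_std` are introduced here for illustration.

```python
import numpy as np

np.random.seed(42)  # same seed and call order as the cell above
mu = np.array([np.random.uniform(-10, 10) for _ in range(5)])
sigma = np.array([np.random.uniform(0, 5) for _ in range(5)])

max_steps = 1000

# Expected total when every arm is equally likely at each step.
random_policy_total = max_steps * mu.mean()

# Expected total when the arm with the highest mean is always pulled.
best_arm = int(np.argmax(mu))
best_arm_total = max_steps * mu[best_arm]

# Per-step variance under a uniform arm choice:
# variance of the arm means plus the average within-arm variance.
per_step_var = mu.var() + (sigma ** 2).mean()

# Steps are independent, so the variance of the total scales with max_steps.
total_std = np.sqrt(max_steps * per_step_var)

print("Expected total, random policy: %.1f" % random_policy_total)
print("Approximate std of that total: %.1f" % total_std)
print("Expected total, best arm (%d): %.1f" % (best_arm, best_arm_total))
```

A single 1000-step random run can land a few hundred away from its expected total, so the observed 848.69 is not surprising. The gap to the best arm's expected total is far larger, which is the reward an agent gives up by never learning which arm is best.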
