# Multi-armed Bandit Algorithms

This notebook explains and reproduces the most popular multi-armed bandit (MAB) algorithms:
- epsilon-greedy
- UCB
- Thompson Sampling

### What?
MAB is a classic reinforcement learning problem that captures the trade-off between exploration and exploitation.

Essentially, MAB is a general formulation of sequential decision-making problems, so it applies to a wide range of scenarios.

### Example
In a casino, a gambler plays a slot machine that has `K` arms to pull (the story is the same if there are `K` slot machines with one arm each).

How can he leave the casino without going bankrupt? Or even with some earnings?

##### Rules
1. Each arm has a reward distribution that is initially unknown to the gambler.
2. Each round, the gambler can only play one arm and only observe the reward of that very arm.

##### Goal
Maximize the cumulative reward before he goes bankrupt or the casino closes.
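
The rules and the goal above can be sketched as a simple interaction loop. The snippet below is only an illustration, assuming win/lose (Bernoulli) rewards and a gambler who picks arms uniformly at random. The function name `play_random_policy` and the win rates are hypothetical and not part of this notebook, and bankruptcy is not modeled: the game simply ends after a fixed number of rounds.

```python
import numpy as np

def play_random_policy(win_rates, num_rounds, seed=0):
    """Play a K-armed Bernoulli bandit with a uniformly random policy.

    win_rates  -- hypothetical per-arm win probabilities (hidden from the gambler)
    num_rounds -- number of rounds played before the casino closes
    Returns the cumulative reward collected over all rounds.
    """
    rng = np.random.default_rng(seed)
    cumulative_reward = 0
    for _ in range(num_rounds):
        # Rule 2: exactly one arm is played per round.
        arm = rng.integers(len(win_rates))
        # Only the reward of the chosen arm is observed (1 = win, 0 = lose).
        reward = int(rng.random() < win_rates[arm])
        cumulative_reward += reward
    return cumulative_reward

# Hypothetical 3-armed machine; the gambler never sees these numbers.
total = play_random_policy(win_rates=[0.2, 0.5, 0.8], num_rounds=1000)
print(total)
```

A random policy ignores everything it observes, so it is a natural baseline: any algorithm that learns from the observed rewards should collect more on average.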

# Algorithms

In [4]:
import numpy as np
import matplotlib.pyplot as plt

# Problem (Bandit Environment)

Don't get confused.

MAB algorithms solve MAB problems.

We need to define a problem first before investigating how each algorithm performs.

### The gambler, Jack
In front of Jack, the gambler, is a fancy slot machine, called the Fruit Machine.
```
  ______          _ _     __  __            _     _            
 |  ____|        (_) |   |  \/  |          | |   (_)           
 | |__ _ __ _   _ _| |_  | \  / | __ _  ___| |__  _ _ __   ___ 
 |  __| '__| | | | | __| | |\/| |/ _` |/ __| '_ \| | '_ \ / _ \
 | |  | |  | |_| | | |_  | |  | | (_| | (__| | | | | | | |  __/
 |_|  |_|   \__,_|_|\__| |_|  |_|\__,_|\___|_| |_|_|_| |_|\___|
```

##### Fruit Machine
There are 3 arms/buttons that correspond to 3 different fruits (cherry, orange, and mango).

Each time Jack pays $1 to play.

This machine only requires one single brain cell to play:
1. choose a fruit & pull its arm -> the wheel starts to spin -> the wheel stops
2. win the prize (get back $2) or lose ($1 is gone)
3. play again (going home is never an option for Jack :)
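
The payout rule above implies that a win nets +1 and a loss nets -1 (in dollars), so the expected profit per play for an arm with win rate `p` is `2p - 1`. The helper below is a hypothetical illustration of that arithmetic (the name `expected_net_profit` and its parameters are assumptions, not part of this notebook).

```python
def expected_net_profit(win_rate, stake=1.0, payout=2.0):
    """Expected profit of one play: payout * win_rate - stake."""
    return payout * win_rate - stake

# Illustrative win rates, not values taken from the Fruit Machine.
for p in (0.25, 0.5, 0.75):
    print(p, expected_net_profit(p))
# 0.25 -0.5
# 0.5 0.0
# 0.75 0.5
```

Under these assumptions, Jack only makes money in the long run on arms whose win rate exceeds 0.5.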

In [6]:
class SlotMachine(object):
    """
    The basics of a generic slot machine
    """
    def __init__(self, name: str):
        self.bandit_name = name

    def get_reward(self, arm_choice: int):
        raise NotImplementedError

In [3]:
class FruitSlotMachine(SlotMachine):
    """
    An instance of Bernoulli Bandit. Reward obeys a Bernoulli distribution
    This machine has 3 arms:
    0 -- cherry
    1 -- orange
    2 -- mango
    """
    def __init__(self, distribution: list = None):
        """
        this machine supports two factory modes:
        1. set a fixed win rate for each arm
        2. randomize the win rate for each arm
        """
        super().__init__("Fruit Machine")
        self.num_arms = 3 # fixed at 3
        if distribution is not None: # mode 1
            assert len(distribution) == self.num_arms
            self.distribution = distribution
        else: # mode 2
            self.distribution = [np.random.random() for _ in range(self.num_arms)]

    def get_reward(self, arm_choice: int):
        """
        return the sampled reward for the given arm choice
        
        Output:
        reward -- 0 or 1
        """
        if np.random.random() < self.distribution[arm_choice]:
            return 1
        else:
            return 0
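
One way to sanity-check a machine like this is to pull every arm many times and compare the observed win rates with the configured ones. The function below is a minimal sketch, assuming only that the machine exposes `num_arms` and `get_reward(arm_choice)` as in the classes above; the name `estimate_win_rates` is hypothetical.

```python
import numpy as np

def estimate_win_rates(machine, pulls_per_arm=1000):
    """Pull every arm `pulls_per_arm` times and return the observed win rates.

    `machine` is any object with a `num_arms` attribute and a
    `get_reward(arm_choice)` method that returns 0 or 1.
    """
    estimates = []
    for arm in range(machine.num_arms):
        rewards = [machine.get_reward(arm) for _ in range(pulls_per_arm)]
        estimates.append(float(np.mean(rewards)))
    return estimates
```

For example, `estimate_win_rates(FruitSlotMachine([0.2, 0.5, 0.8]))` should return three values close to the configured win rates, with the gap shrinking as `pulls_per_arm` grows.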