In [54]:
import gym_bandits
import gym
import numpy as np
import math
import random

Gym_bandits provides several versions of the bandit environment. We can examine the different bandit versions at https://github.com/JKCooper2/gym-bandits.

In [55]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")

Since we created a two-armed bandit, the action space contains two discrete actions (one per arm), as shown here:

In [56]:
print(env.action_space.n)

2


We can also check the win probability of each arm with:

In [57]:
print(env.p_dist)

[0.8, 0.2]


This indicates that arm 1 (index 0) wins the game 80% of the time and arm 2 (index 1) wins 20% of the time. Our goal is to find out which of the two arms wins most often.
Now that we have learned how to create bandit environments in the Gym, in the next section, we will explore different exploration strategies to solve the MAB problem and we will implement them with the Gym.
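To make these win probabilities concrete, here is a minimal sketch that does not use Gym at all. It assumes each pull is a Bernoulli trial with the probabilities printed above; the names `p_dist`, `num_pulls`, and `win_rates` are illustrative only.

```python
import numpy as np

# Assumed win probabilities, copied from the env.p_dist output above.
p_dist = [0.8, 0.2]

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
num_pulls = 10_000

# Model each pull as a Bernoulli trial: reward 1 with probability p, else 0.
win_rates = [rng.binomial(1, p, size=num_pulls).mean() for p in p_dist]

for arm, rate in enumerate(win_rates, start=1):
    print(f"arm {arm}: empirical win rate {rate:.3f}")
```

With enough pulls, the empirical win rates approach the underlying probabilities, which is exactly what the exploration strategies below try to estimate while pulling as few bad arms as possible.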

### Exploration strategies

In the previous classes, we discussed the exploration-exploitation dilemma in the Multi-Armed Bandit (MAB) problem. To tackle this challenge, various exploration strategies are employed to find the best arm among multiple choices. In this tutorial, we will delve into four popular exploration strategies and implement them to determine the best arm.

### Epsilon-greedy:

Overview: Epsilon-greedy is a simple yet effective exploration strategy. It selects the best arm with probability (1-ε) and explores a random arm with probability ε.
Implementation: We will implement epsilon-greedy using Python and simulate its performance in a MAB scenario.

### Softmax Exploration:

Overview: Softmax exploration selects arms probabilistically based on their estimated values. The probability of selecting an arm is proportional to the exponential of its estimated value divided by a temperature parameter, so arms with higher estimates are chosen more often.
Implementation: We will implement the softmax exploration strategy and compare its performance with epsilon-greedy.

### Upper Confidence Bound (UCB):

Overview: UCB balances exploration and exploitation by selecting arms based on their upper confidence bounds, which consider both the estimated value and the uncertainty in the estimate.
Implementation: We will implement the UCB strategy and analyze its performance in different MAB settings.

### Thompson Sampling:

Overview: Thompson sampling is a Bayesian approach that selects each arm according to its posterior probability of being optimal. It inherently accounts for uncertainty in the estimates.
Implementation: We will implement Thompson sampling and evaluate its performance compared to other strategies.
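As a preview of the last two strategies, the following is a rough sketch on a simulated Bernoulli bandit rather than the Gym environment. It assumes the UCB1 bonus `sqrt(2 ln t / n)` and a Beta(1, 1) prior for Thompson sampling; `pull`, `ucb_select`, `thompson_select`, `run_ucb`, and `run_thompson` are hypothetical helper names, not part of `gym_bandits`.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dist = [0.8, 0.2]          # assumed win probabilities of a simulated two-armed bandit
num_arms = len(p_dist)

def pull(arm):
    """Simulated Bernoulli pull (a stand-in for env.step): reward 1 with probability p_dist[arm]."""
    return int(rng.random() < p_dist[arm])

def ucb_select(Q, count, t):
    """UCB1: try every arm once, then pick the largest Q + sqrt(2 ln t / n)."""
    untried = np.flatnonzero(count == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + np.sqrt(2 * np.log(t) / count)))

def thompson_select(wins, losses):
    """Sample a win rate per arm from its Beta posterior and pick the largest sample."""
    return int(np.argmax(rng.beta(wins + 1, losses + 1)))

def run_ucb(num_rounds=1000):
    Q, count = np.zeros(num_arms), np.zeros(num_arms)
    for t in range(1, num_rounds + 1):
        arm = ucb_select(Q, count, t)
        reward = pull(arm)
        count[arm] += 1
        Q[arm] += (reward - Q[arm]) / count[arm]   # incremental average
    return count

def run_thompson(num_rounds=1000):
    wins, losses = np.zeros(num_arms), np.zeros(num_arms)
    for _ in range(num_rounds):
        arm = thompson_select(wins, losses)
        reward = pull(arm)
        wins[arm] += reward
        losses[arm] += 1 - reward
    return wins + losses   # number of pulls per arm

ucb_counts = run_ucb()
ts_counts = run_thompson()
print("UCB pulls per arm:     ", ucb_counts)
print("Thompson pulls per arm:", ts_counts)
```

Both strategies should concentrate most of their pulls on the first arm, since its win probability is much higher in this assumed setup.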


## Implementing epsilon-greedy

Now, let's learn to implement the epsilon-greedy method to find the best arm. First, we reset the environment:

In [58]:
env.reset()

0

Next, let's initialize the variables. Initialize count for storing the number of times each arm is pulled:

In [59]:
count = np.zeros(2)

Initialize sum_rewards for storing the sum of rewards of each arm:

sum_rewards = np.zeros(2)

Initialize Q for storing the average reward of each arm:

In [60]:
Q = np.zeros(2)

Set the number of rounds (iterations):

In [61]:
num_rounds = 100

Now, let's define the epsilon_greedy function.
First, we generate a random number from a uniform distribution. If the random number is less than epsilon, we pull a random arm (exploration); otherwise, we pull the arm with the maximum average reward so far (exploitation), as shown here:

In [62]:
def epsilon_greedy(epsilon):
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q)

Now, let's play the game and try to find the best arm using the epsilon-greedy method.


In [63]:
sum_rewards = np.zeros(2)
for i in range(num_rounds):
    # Select the arm based on the epsilon-greedy method:
    arm = epsilon_greedy(epsilon=0.5)
    # Pull the arm and store the reward and next state information:
    next_state, reward, done, info = env.step(arm)
    # Increment the count of the arm by 1:
    count[arm] += 1
    # Update the sum of rewards of the arm:
    sum_rewards[arm] += reward
    # Update the average reward of the arm:
    Q[arm] = sum_rewards[arm]/count[arm]
print(Q)

[0.80821918 0.22222222]


In [64]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
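The same loop can be sketched without Gym, which makes it easy to rerun with different settings. This is an illustrative version under stated assumptions: rewards are simulated Bernoulli draws, `run_epsilon_greedy` is a hypothetical helper, and the average is updated incrementally instead of storing `sum_rewards`.

```python
import numpy as np

def run_epsilon_greedy(p_dist, epsilon=0.5, num_rounds=100, seed=0):
    """Epsilon-greedy on a simulated Bernoulli bandit (hypothetical helper, not part of gym_bandits)."""
    rng = np.random.default_rng(seed)
    num_arms = len(p_dist)
    count = np.zeros(num_arms)
    Q = np.zeros(num_arms)
    for _ in range(num_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(num_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(Q))             # exploit: pick the best arm so far
        reward = float(rng.random() < p_dist[arm])
        count[arm] += 1
        # Incremental mean: equals sum_rewards[arm] / count[arm] without storing the sum.
        Q[arm] += (reward - Q[arm]) / count[arm]
    return Q, count

Q_est, pulls = run_epsilon_greedy([0.8, 0.2], epsilon=0.5, num_rounds=2000)
print(Q_est, pulls)
```

The incremental update is algebraically the same average as in the loop above; it only avoids keeping a separate running sum.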


### Softmax Exploration

Now, we define the softmax function with the temperature T (called `tau` in the code):


![](SoftMAX.png)  
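In case the image does not render, the rule it presumably shows is the standard softmax (Boltzmann) selection probability, written here with $Q(a)$ for the estimated average reward of arm $a$ and $T$ for the temperature. This is an assumption about the figure's content, chosen to match the code that follows.

$$
P(a) = \frac{\exp\left(Q(a)/T\right)}{\sum_{b} \exp\left(Q(b)/T\right)}
$$

A large $T$ pushes the probabilities toward uniform (more exploration), while a small $T$ concentrates them on the arm with the highest estimate (more exploitation).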

In [65]:
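def softmax(tau):
    # Turn the current Q estimates into selection probabilities
    total = sum([math.exp(val/tau) for val in Q])
    probs = [math.exp(val/tau)/total for val in Q]

    # Sample an arm: return the first index whose cumulative probability exceeds a uniform draw
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if cumulative_prob > threshold:
            return i
    # Fallback in case rounding keeps the cumulative sum just below the threshold
    return np.argmax(probs)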
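To see how the temperature shapes the selection probabilities, here is a small standalone sketch. The helper `softmax_probs` and the example estimates in `Q_example` are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def softmax_probs(Q, tau):
    """Softmax probabilities; subtracting the max leaves the result unchanged but avoids overflow."""
    z = (np.asarray(Q, dtype=float) - np.max(Q)) / tau
    e = np.exp(z)
    return e / e.sum()

Q_example = [0.8, 0.2]   # assumed estimates, close to the true win rates
for tau in (50, 1, 0.1):
    print(tau, softmax_probs(Q_example, tau))
```

With these estimates, a temperature of 50 gives a split very close to 50/50, a temperature of 1 favours the first arm with roughly 65%, and a temperature of 0.1 selects the first arm more than 99% of the time.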
    

In [66]:
# Reset the env for the softmax method
env.reset()
# Number of rounds (iterations)
num_rounds = 20000

# Count of the number of times each arm was pulled (one entry per arm)
count = np.zeros(env.action_space.n)

# Sum of rewards of each arm
sum_rewards = np.zeros(env.action_space.n)

# Q value, which is the average reward of each arm
Q = np.zeros(env.action_space.n)

In [67]:
print("List of all possible actions:", list(range(env.action_space.n)))


List of all possible actions: [0, 1]


In [69]:
for i in range(num_rounds):
    # Select the arm using softmax; tau=50 is a high temperature, so the choice is close to uniform
    arm = softmax(50)
    # Pull the arm and get the reward
    observation, reward, done, info = env.step(arm)
    # Update the count of that arm
    count[arm] += 1
    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward
    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))
