# Implementing Thompson sampling

Now, let's learn how to implement the Thompson sampling method to find the best arm.

First, let's import the necessary libraries:

In [1]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

Let's take the same two-armed bandit we saw in the previous section: 

In [2]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")

Let's check the probability distribution of the arms:

In [3]:
print(env.p_dist)

[0.8, 0.2]


We can observe that with arm 1 we win the game with 80% probability, and with arm 2 we
win the game with 20% probability. Here, the best arm is arm 1, since it gives the
higher probability of winning (80%). Now, let's see how to find this best arm using the Thompson sampling method.

## Initialize the variables

First, let's initialize the variables:

Initialize the count for storing the number of times an arm is pulled:

In [4]:
count = np.zeros(2)

Initialize the sum_rewards for storing the sum of rewards of each arm:

In [5]:
sum_rewards = np.zeros(2)

Initialize the Q for storing the average reward of each arm:

In [6]:
Q = np.zeros(2)

Define the number of rounds (iterations):

In [7]:
num_rounds = 100

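Initialize the alpha value to 1 for both arms. Alpha is the first parameter of each arm's beta distribution, and we increment it every time the arm wins: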

In [9]:
alpha = np.ones(2)

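Initialize the beta value to 1 for both arms. Beta is the second parameter of each arm's beta distribution, and we increment it every time the arm loses: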

In [10]:
beta = np.ones(2)
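
To see what these two parameters encode, it can help to look at the mean of a beta distribution, which is alpha / (alpha + beta). The following is a minimal sketch: `beta_mean` is a hypothetical helper, and the win and loss counts are made up for illustration rather than taken from the environment.

```python
import numpy as np

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution (hypothetical helper)."""
    return a / (a + b)

# Before any pulls, both arms have alpha = beta = 1, so the mean is 0.5
alpha = np.ones(2)
beta = np.ones(2)
prior_mean = beta_mean(alpha, beta)
print(prior_mean)       # [0.5 0.5]

# Made-up outcome: arm 1 wins 8 of 10 pulls, arm 2 wins 2 of 10 pulls
wins = np.array([8, 2])
losses = np.array([2, 8])
posterior_mean = beta_mean(alpha + wins, beta + losses)
print(posterior_mean)   # [0.75 0.25]
```

Each win moves an arm's mean toward 1 and each loss moves it toward 0, so the distribution of a frequently winning arm concentrates on high values.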

## Defining the Thompson sampling function

Now, let's define the thompson_sampling function.

As shown below, we randomly sample a value from the beta distribution of each arm
and return the arm with the maximum sampled value:

In [11]:
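def thompson_sampling(alpha, beta):

    #draw one sample from each arm's beta distribution; since alpha and beta
    #start at 1, the +1 offsets make the effective prior Beta(2, 2)
    samples = [np.random.beta(alpha[i]+1, beta[i]+1) for i in range(len(alpha))]

    #return the index of the arm with the largest sampled value
    return np.argmax(samples)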
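
As a quick sanity check of this function, we can call it many times with fixed parameters and count how often each arm is chosen. This is only a sketch: the alpha and beta values are made up for illustration, the seed is there just to make the check repeatable, and the function is repeated so that the snippet runs on its own.

```python
import numpy as np

np.random.seed(0)  # seeded only to make the check repeatable

def thompson_sampling(alpha, beta):
    # one sample per arm, then pick the arm with the largest sample
    samples = [np.random.beta(alpha[i]+1, beta[i]+1) for i in range(2)]
    return np.argmax(samples)

# Made-up parameters: arm 1 has mostly wins, arm 2 has mostly losses
alpha = np.array([9.0, 3.0])
beta = np.array([3.0, 9.0])

# Index 0 corresponds to arm 1, index 1 to arm 2
picks = [thompson_sampling(alpha, beta) for _ in range(1000)]
fraction_arm_1 = np.mean(np.array(picks) == 0)
print(fraction_arm_1)
```

Arm 1 should be chosen in the large majority of calls, but not in all of them: the occasional pick of arm 2 is how Thompson sampling keeps exploring.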

## Start pulling the arm

Now, let's play the game and try to find the best arm using the Thompson sampling
method.

In [12]:
for i in range(num_rounds):
    
    #select the arm based on the thompson sampling method
    arm = thompson_sampling(alpha,beta)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm) 

    #increment the count of the arm by 1
    count[arm] += 1
    
    #update the sum of rewards of the arm
    sum_rewards[arm]+=reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

    #if we win the game, that is, if the reward is equal to 1, then we update the value of alpha as 
    #alpha = alpha + 1 else we update the value of beta as beta = beta + 1
    if reward==1:
        alpha[arm] = alpha[arm] + 1
    else:
        beta[arm] = beta[arm] + 1
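
If gym_bandits is not installed, the same loop can be tried with a small stand-in for the environment. The sketch below assumes each arm pays a reward of 1 with the probabilities printed earlier (0.8 and 0.2) and 0 otherwise; the inline reward draw is a hypothetical replacement for `env.step`, and the number of rounds is raised to 1000 to make the trend easier to see.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
p_dist = [0.8, 0.2]              # assumed win probability of each arm
num_rounds = 1000

alpha = np.ones(2)
beta = np.ones(2)
count = np.zeros(2)

for _ in range(num_rounds):
    # sample from each arm's beta distribution and pick the largest draw
    # (the +1 offsets mirror the thompson_sampling function above)
    arm = int(np.argmax(rng.beta(alpha + 1, beta + 1)))

    # hypothetical stand-in for env.step(arm): win with probability p_dist[arm]
    reward = int(rng.random() < p_dist[arm])

    count[arm] += 1
    if reward == 1:
        alpha[arm] += 1
    else:
        beta[arm] += 1

print(count)                     # number of pulls per arm
print(alpha / (alpha + beta))    # mean of each arm's beta distribution
```

Under these assumptions, most pulls should go to arm 1, since its beta distribution quickly concentrates on higher values than that of arm 2.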
    

After all the rounds, we look at the average reward obtained from each of the arms:

In [13]:
print(Q)

[0.77659574 0.33333333]


Now, we can select the optimal arm as the one with the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm is
arm 1.

In [14]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
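
Beyond comparing average rewards, the alpha and beta values can indicate how confident we can be in this choice. The sketch below estimates, by Monte Carlo sampling, the probability that the win rate of arm 1 exceeds that of arm 2. The parameter values are an assumption: they are what a 100-round run would end with if arm 1 won 73 of 94 pulls and arm 2 won 2 of 6 pulls, which matches the printed averages but is not shown in the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed final parameters (inferred from the printed averages, not printed
# by the notebook): alpha = 1 + wins, beta = 1 + losses
alpha = np.array([74.0, 3.0])
beta = np.array([22.0, 5.0])

# Draw many samples per arm; column 0 is arm 1, column 1 is arm 2
draws = rng.beta(alpha, beta, size=(100000, 2))

# Fraction of draws in which arm 1's sampled win rate is the larger one
prob_arm_1_best = np.mean(draws[:, 0] > draws[:, 1])
print(prob_arm_1_best)
```

A value close to 1 suggests that arm 1 is very likely the better arm, even though arm 2 was pulled only a handful of times.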
