# Thompson sampling

Thompson sampling is a strategy used in reinforcement learning to find the best actions while exploring the environment. Instead of using fixed rules, Thompson sampling uses probability distributions to represent its belief about the rewards of different actions.

In Thompson sampling, the agent maintains these probability distributions and samples from them to select actions. The higher the probability of an action being the best, the more likely it is to be chosen. This allows for a balance between exploring new actions and exploiting the actions that seem to have higher rewards.

The key idea behind Thompson sampling is to embrace uncertainty. By sampling from the probability distributions and selecting the action with the highest sampled reward, the strategy naturally explores actions that are uncertain but have high potential rewards. It also exploits actions with low uncertainty and high estimated rewards, because their samples are consistently large.
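
To make this concrete, here is a minimal sketch that is not part of the original walkthrough. It assumes two hypothetical arms whose beliefs are Beta distributions: arm 0 has been tried many times, arm 1 only a few times. The win/loss counts, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical beliefs, expressed as Beta(wins + 1, losses + 1):
# arm 0: 60 wins, 40 losses -> well explored, estimated win rate near 0.6
# arm 1: 2 wins, 2 losses   -> barely explored, estimated win rate near 0.5
alpha = np.array([61.0, 3.0])
beta = np.array([41.0, 3.0])

n_draws = 10_000
# Each row is one Thompson sampling step: draw one value per arm
samples = rng.beta(alpha, beta, size=(n_draws, 2))
chosen = samples.argmax(axis=1)

# Fraction of steps on which each arm would be selected
pick_rate = np.bincount(chosen, minlength=2) / n_draws
print(pick_rate)
```

Under these assumptions, the less-explored arm is still selected on a noticeable fraction of steps even though its estimated win rate is lower, because its wide distribution sometimes produces the largest sample.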

Thompson sampling has advantages such as not requiring manual adjustment of exploration parameters. The exploration behavior is automatically determined by the probabilistic sampling. It also has strong theoretical foundations and has been successful in various scenarios.

However, Thompson sampling can be computationally demanding compared to simpler strategies like epsilon-greedy or UCB. It involves maintaining and updating the probability distributions, which can be resource-intensive.

# Implementing Thompson sampling

Now, let's learn how to implement the Thompson sampling method to find the best arm.

First, let's install the `gym-bandits` package and import the necessary libraries:

In [1]:
# If you are using google colab
!pip install git+https://github.com/JKCooper2/gym-bandits.git

# If you are not using google colab
#git clone https://github.com/JKCooper2/gym-bandits.git
#cd gym-bandits
#pip install -e .



In [2]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

Let's take the same two-armed bandit we saw in the previous section:

In [3]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")



Let's check the win probabilities of the two arms:

In [4]:
print(env.p_dist)

[0.8, 0.2]


We can observe that with arm 1 we win the game with 80% probability and with arm 2 we
win the game with 20% probability. Here, the best arm is arm 1, as it wins the
game with 80% probability. Now, let's see how to find this best arm using the Thompson sampling method.

## Initialize the variables

First, let's initialize the variables:

Initialize the `count` for storing the number of times an arm is pulled:

In [5]:
count = np.zeros(2)

Initialize the `sum_rewards` for storing the sum of rewards of each arm:

In [6]:
sum_rewards = np.zeros(2)

Initialize the `Q` for storing the average reward of each arm:

In [7]:
Q = np.zeros(2)

Define `num_rounds` - number of rounds (iterations):

In [8]:
num_rounds = 100

Initialize the `alpha` value with 1 for both the arms:

In [9]:
alpha = np.ones(2)

Initialize the `beta` value with 1 for both the arms:

In [10]:
beta = np.ones(2)
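
One common way to read these two values, assuming each reward is either 0 or 1, is as the parameters of a Beta distribution over an arm's unknown win probability $\theta$. Under that assumption, the standard Beta-Bernoulli relations for the prior, its mean, and the update after observing a reward $r$ are:

```latex
\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad
\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}, \qquad
\theta \mid r \sim \mathrm{Beta}(\alpha + r,\; \beta + 1 - r), \quad r \in \{0, 1\}
```

With $\alpha = \beta = 1$ the distribution is uniform on $[0, 1]$, that is, every win probability is considered equally plausible before any arm is pulled.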

## Defining the Thompson Sampling function

Now, let's define the `thompson_sampling` function.

As shown below, we randomly sample a value from the beta distribution of each arm
and return the arm that has the maximum sampled value:

In [11]:
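def thompson_sampling(alpha, beta):

    #draw one sample per arm from its Beta(alpha, beta) distribution; alpha and beta
    #already start at 1 (a uniform prior), so no extra +1 is added here
    samples = [np.random.beta(alpha[i], beta[i]) for i in range(len(alpha))]

    #return the index of the arm with the maximum sampled value
    return np.argmax(samples)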
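
As a quick check of how the selection behaves, the sketch below measures how often each arm is picked before and after some evidence has been collected. It uses a vectorized stand-in for the sampler rather than the function above; the names `sample_arm` and `selection_rate`, the seed, and the example win/loss counts are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility

def sample_arm(alpha, beta, rng):
    # Hypothetical vectorized stand-in: one Beta draw per arm, pick the largest
    return int(np.argmax(rng.beta(alpha, beta)))

def selection_rate(alpha, beta, n=5000):
    # Fraction of n independent calls that select each arm
    picks = [sample_arm(alpha, beta, rng) for _ in range(n)]
    return np.bincount(picks, minlength=len(alpha)) / n

# No evidence yet: both arms share the same uniform Beta(1, 1) belief
before = selection_rate(np.ones(2), np.ones(2))

# Assumed evidence: arm 0 won 8 of 10 pulls, arm 1 won 2 of 10 pulls
after = selection_rate(np.array([9.0, 3.0]), np.array([3.0, 9.0]))

print("before any pulls:", before)
print("after 10 pulls each:", after)
```

With identical beliefs the two arms should be chosen about equally often. Once the assumed evidence favors arm 0, almost every call selects it, while arm 1 keeps a small chance of being tried again.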

## Start pulling the arm

Now, let's play the game and try to find the best arm using the Thompson sampling
method.

In [12]:
# The gym_bandits environment requires a call to env.reset()
# before we can make the first env.step()
env.reset()

# Now we can start the game
for i in range(num_rounds):

    #select the arm based on the thompson sampling method
    arm = thompson_sampling(alpha,beta)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm)

    #increment the count of the arm by 1
    count[arm] += 1

    #update the sum of rewards of the arm
    sum_rewards[arm]+=reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

    #if we win the game, that is, if the reward is equal to 1, then we update the value of alpha as
    #alpha = alpha + 1 else we update the value of beta as beta = beta + 1
    if reward==1:
        alpha[arm] = alpha[arm] + 1
    else:
        beta[arm] = beta[arm] + 1




After all the rounds, we look at the average reward obtained from each of the arms:

In [13]:
print(Q)

[0.77173913 0.5       ]


Now, we can select the optimal arm as the one that has the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm will be
arm 1.

In [14]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
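
If the `gym-bandits` package is not available, the same experiment can be approximated without it. The following is a sketch, not the environment's actual implementation: it assumes each arm pays a reward of 1 with the probabilities printed by `env.p_dist` and 0 otherwise. The function name `run_thompson`, the seed, and the round count are illustrative choices.

```python
import numpy as np

def run_thompson(p_dist, num_rounds, seed=0):
    """Simulate Thompson sampling on arms that win with probabilities p_dist."""
    rng = np.random.default_rng(seed)
    n_arms = len(p_dist)
    alpha = np.ones(n_arms)  # 1 + number of wins per arm
    beta = np.ones(n_arms)   # 1 + number of losses per arm

    for _ in range(num_rounds):
        # Sample a plausible win rate for every arm and pull the best-looking one
        arm = int(np.argmax(rng.beta(alpha, beta)))
        # Assumed reward model: 1 with probability p_dist[arm], else 0
        reward = int(rng.random() < p_dist[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha, beta

alpha, beta = run_thompson([0.8, 0.2], num_rounds=1000)
pulls = alpha + beta - 2  # pulls per arm (subtract the two prior counts)
posterior_mean = alpha / (alpha + beta)

print("pulls per arm:", pulls)
print("posterior mean win rate:", posterior_mean)
print("The optimal arm is arm {}".format(np.argmax(posterior_mean) + 1))
```

In this sketch the arm is ranked by the mean of its Beta distribution, `alpha / (alpha + beta)`, rather than by a separately tracked average reward. The two estimates stay close once an arm has been pulled many times.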
