# The UCB (Upper Confidence Bound) Algorithm

The UCB (Upper Confidence Bound) algorithm is another exploration strategy commonly used in reinforcement learning. Unlike epsilon-greedy and softmax methods, UCB takes into account not only the estimated values of actions but also the uncertainty or confidence in those estimates.

The UCB algorithm balances exploration and exploitation by selecting actions according to an upper confidence bound: an optimistic estimate of how high an action's true value could plausibly be. The bound is large when an action has a high estimated value, when its estimate is still uncertain, or both.

The UCB algorithm prioritizes actions that have high estimated values but have been selected fewer times, thus promoting exploration of potentially promising actions. As the number of times an action is selected increases, the uncertainty decreases, and the algorithm tends to exploit the actions with higher estimated values.
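
To make the shrinking uncertainty concrete, here is a minimal sketch of the exploration bonus, assuming the common UCB1 form `sqrt(2 * ln(t) / n)`. The helper name `exploration_bonus` and the value of `t` are hypothetical and chosen only for illustration.

```python
import math

def exploration_bonus(t, n):
    """Bonus sqrt(2 * ln(t) / n) for an arm pulled n times after t total rounds.

    Hypothetical helper: assumes the UCB1 form of the bonus and n >= 1.
    """
    return math.sqrt(2 * math.log(t) / n)

t = 1000  # total number of rounds played so far (illustrative value)
for n in (1, 10, 100, 1000):
    # The more often an arm has been pulled, the smaller its bonus becomes.
    print(f"pulled {n:4d} times -> bonus {exploration_bonus(t, n):.3f}")
```

For a fixed `t`, the bonus falls as the pull count `n` grows, which is why a frequently pulled arm ends up being judged mostly on its estimated value.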

The UCB algorithm is known for its theoretical guarantees: with bounded, stationary rewards, its regret grows only logarithmically with the number of rounds, so it often identifies the optimal action with fewer samples than some other exploration methods. However, the confidence term assumes that rewards lie in a known range and that every action has been tried at least once, and the algorithm may not perform well when the underlying environment dynamics change over time.


# Implementing UCB

Now, let's learn how to implement the UCB algorithm to find the best arm.

First, let's install the bandit environments and import the necessary libraries:

In [1]:
# If you are using google colab
!pip install git+https://github.com/JKCooper2/gym-bandits.git

# If you are not using google colab
#git clone https://github.com/JKCooper2/gym-bandits.git
#cd gym-bandits
#pip install -e .



In [2]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

Let's take the same two-armed bandit we saw in the epsilon-greedy section:

In [3]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")



Let's check the probability distribution of the arms:

In [4]:
print(env.p_dist)

[0.8, 0.2]


We can observe that with arm 1 we win the game with 80% probability and with arm 2 we
win the game with 20% probability, so arm 1 is the best arm. (In the code, arm 1 is index 0
and arm 2 is index 1.) Now, let's see how to find this best arm using the UCB method.

## Initialize the variables

First, let's initialize the variables:

Initialize `count`, which stores the number of times each arm is pulled:

In [5]:
count = np.zeros(2)

Initialize `sum_rewards`, which stores the sum of rewards of each arm:

In [6]:
sum_rewards = np.zeros(2)

Initialize `Q` for storing the average reward of each arm:

In [7]:
Q = np.zeros(2)

Define `num_rounds`, the number of rounds (iterations):

In [8]:
num_rounds = 100

## Defining the UCB function

Now, we define the `UCB` function, which returns the arm with the highest upper
confidence bound (UCB). For an arm $a$ with average reward $Q(a)$ that has been pulled
$N(a)$ times after $t$ total rounds, the bound is:

$$ \text{UCB}(a) = Q(a) + \sqrt{\frac{2 \log(t)}{N(a)}} \tag{1} $$

In [9]:
def UCB(i):

    #initialize the numpy array for storing the UCB of all the arms
    ucb = np.zeros(2)

    #before computing the UCB, we explore all the arms at least once, so for the first 2 rounds,
    #we directly select the arm corresponding to the round number
    if i < 2:
        return i

    #from round 2 onward, every arm has been pulled at least once, so we compute the UCB of
    #all the arms as specified in equation (1) and return the arm which has the highest UCB:
    else:
        for arm in range(2):
            ucb[arm] = Q[arm] + np.sqrt((2*np.log(sum(count))) / count[arm])
        return (np.argmax(ucb))
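
To see equation (1) at work on concrete numbers, the following sketch computes the scores from plain lists instead of the global arrays. The helper names `ucb_scores` and `select_arm` and the sample values are hypothetical, chosen only for illustration.

```python
import math

def ucb_scores(q_values, counts):
    """Return the UCB score of every arm; assumes each count is at least 1."""
    t = sum(counts)  # total pulls so far, playing the role of sum(count)
    return [q + math.sqrt(2 * math.log(t) / n) for q, n in zip(q_values, counts)]

def select_arm(q_values, counts):
    """Return the index of the arm with the highest UCB score."""
    scores = ucb_scores(q_values, counts)
    return scores.index(max(scores))

# Index 0 has the higher average, but index 1 has barely been tried,
# so its larger bonus wins and it gets explored.
print(select_arm([0.8, 0.5], [50, 5]))

# Once both arms are equally well sampled, the bonuses match
# and the higher average decides.
print(select_arm([0.8, 0.5], [50, 50]))
```

The first call picks index 1 and the second picks index 0, showing how the same averages can lead to different choices depending on how often each arm has been pulled.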

## Start pulling the arm

Now, let's play the game and try to find the best arm using the UCB method.

In [10]:
# The gym_bandits environment requires a call to env.reset()
# before we can make the first env.step()
env.reset()

# Now we can start the game
for i in range(num_rounds):

    #select the arm based on the UCB method
    arm = UCB(i)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm)

    #increment the count of the arm by 1
    count[arm] += 1

    #update the sum of rewards of the arm
    sum_rewards[arm]+=reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
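
The average can also be maintained without storing a running sum. The sketch below uses the standard incremental-mean identity, which gives the same value as dividing the sum by the count. The function name `update_mean` and the reward sequence are hypothetical, for illustration only.

```python
def update_mean(q, n, reward):
    """Return the new average after the n-th reward (n already counts this reward)."""
    return q + (reward - q) / n

rewards = [1, 0, 1, 1, 0]  # illustrative reward sequence for a single arm
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_mean(q, n, r)

# Both numbers agree up to floating-point rounding.
print(q, sum(rewards) / len(rewards))
```

This form is convenient when only `Q` and `count` need to be kept per arm.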




After all the rounds, we look at the average reward obtained from each of the arms:

In [11]:
print(Q)

[0.85227273 0.25      ]


Now, we can select the optimal arm as the one with the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm is
arm 1.

In [12]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
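
If the `gym_bandits` package is not available, the whole experiment can be approximated with the standard library alone. This is a minimal sketch, not the environment's actual implementation: it assumes each arm pays a reward of 1 with the probability listed in `p_dist` and 0 otherwise, and the function name `run_ucb` and the `seed` parameter are hypothetical.

```python
import math
import random

def run_ucb(p_dist, num_rounds, seed=0):
    """Run UCB on simulated Bernoulli arms and return (average rewards, pull counts)."""
    rng = random.Random(seed)  # local generator so the run is repeatable
    n_arms = len(p_dist)
    counts = [0] * n_arms
    q = [0.0] * n_arms

    for t in range(num_rounds):
        if t < n_arms:
            # Pull every arm once before using the confidence bound.
            arm = t
        else:
            total = sum(counts)
            scores = [q[a] + math.sqrt(2 * math.log(total) / counts[a])
                      for a in range(n_arms)]
            arm = scores.index(max(scores))

        # Assumed reward model: 1 with probability p_dist[arm], otherwise 0.
        reward = 1.0 if rng.random() < p_dist[arm] else 0.0

        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]  # incremental average

    return q, counts

q, counts = run_ucb([0.8, 0.2], num_rounds=2000)
print("average rewards:", q)
print("pull counts:", counts)
print("The optimal arm is arm {}".format(q.index(max(q)) + 1))
```

With these probabilities, the pull counts should end up heavily skewed toward the first arm, since the bonus of the weaker arm only occasionally lifts it above the stronger one.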
