# Softmax Exploration

The softmax exploration method is another approach to balancing exploration and exploitation. Instead of picking an exploratory action uniformly at random, as epsilon-greedy does, softmax assigns a probability distribution over all possible actions based on their estimated values.

In the softmax method, each action's selection probability is calculated using the softmax function, which converts action values into probabilities. The softmax function gives higher probability to actions with higher estimated values, but it still assigns non-zero probabilities to all actions, allowing for exploration.

For example, let's assume we have 4 arms and arm 1 is the best arm. After exploring the non-best arms (arm 2, arm 3, and arm 4) uniformly, we realised that arm 3 is never a good arm: it always gives a reward of 0. In this case, instead of exploring arm 3 again, we would rather spend more time exploring arm 2 and arm 4. The problem with the epsilon-greedy method is that it explores all the non-best arms equally, so arm 3 is tried just as often as arm 2 and arm 4. To avoid this, we can give priority to arm 2 and arm 4 over arm 3 by assigning every arm a selection probability based on its average reward Q. The arm with the maximum average reward gets the highest probability, and each of the other arms gets a probability that increases with its average reward (specifically, it is proportional to the exponential of the average reward divided by the temperature).
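
To make the four-arm example concrete, here is a minimal sketch comparing the selection probabilities of the two methods. The average rewards in `Q_example`, the exploration rate `epsilon`, and the temperature `T_example` are illustrative assumptions, not values from this notebook.

```python
import numpy as np

# Hypothetical average rewards for arms 1..4 (arm 1 is best, arm 3 always pays 0).
# These numbers are assumptions chosen only to illustrate the idea.
Q_example = np.array([0.8, 0.4, 0.0, 0.5])
epsilon = 0.1      # assumed exploration rate for epsilon-greedy
T_example = 0.2    # assumed softmax temperature

# Epsilon-greedy: with probability epsilon pick uniformly among all arms,
# otherwise pick the greedy arm. Every non-best arm gets the same probability.
eps_probs = np.full(len(Q_example), epsilon / len(Q_example))
eps_probs[np.argmax(Q_example)] += 1 - epsilon

# Softmax: probability proportional to exp(Q / T), so better arms are tried more often.
soft_probs = np.exp(Q_example / T_example)
soft_probs /= soft_probs.sum()

print("epsilon-greedy:", np.round(eps_probs, 3))
print("softmax:       ", np.round(soft_probs, 3))
```

With these assumed values, epsilon-greedy gives arms 2, 3, and 4 identical probabilities, while softmax ranks them arm 4, then arm 2, then arm 3.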

The degree of exploration in the softmax method is controlled by a temperature parameter. Higher temperature values make the action probabilities more uniform, leading to increased exploration, while lower temperature values make the action probabilities sharper and closer to deterministic, favoring exploitation.
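
The effect of the temperature can be seen with a small sketch. The value estimates in `q_demo` and the three temperatures are assumptions picked for illustration; the helper `softmax_probs` is a hypothetical name, not part of any library.

```python
import numpy as np

def softmax_probs(q_values, temperature):
    """Return the softmax selection probabilities at the given temperature."""
    weights = np.exp(np.asarray(q_values, dtype=float) / temperature)
    return weights / weights.sum()

q_demo = [0.8, 0.2]  # illustrative value estimates for two arms

# Compute the distribution at a high, a moderate, and a low temperature
p_high = softmax_probs(q_demo, 50.0)
p_mid = softmax_probs(q_demo, 1.0)
p_low = softmax_probs(q_demo, 0.05)

for name, p in (("T=50", p_high), ("T=1", p_mid), ("T=0.05", p_low)):
    print(name, np.round(p, 3))
```

At `T=50` the two probabilities are almost equal, while at `T=0.05` nearly all of the probability mass sits on the arm with the higher estimate.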

Softmax exploration provides a more continuous and controlled exploration behavior compared to epsilon-greedy. It can be beneficial when the agent needs to explore the action space more systematically or when there are multiple actions with similar values that need to be explored further.



# Implementing Softmax Exploration

Now, let's learn how to implement the softmax exploration to find the best arm.

First, let's install the `gym-bandits` package and then import the necessary libraries:

In [17]:
# If you are using google colab
!pip install git+https://github.com/JKCooper2/gym-bandits.git

# If you are not using google colab
#git clone https://github.com/JKCooper2/gym-bandits.git
#cd gym-bandits
#pip install -e .




In [18]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

Let's take the same two-armed bandit we saw in the epsilon-greedy section:

In [19]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")

Let's check the probability distribution of the arms:

In [20]:
print(env.p_dist)

[0.8, 0.2]


We can observe that with arm 1 we win the game with 80% probability and with arm 2 we win the game with 20% probability. Here, the best arm is arm 1, since it wins with 80% probability. Now, let's see how to find this best arm using the softmax exploration method.

## Initialize the variables

First, let's initialize the variables:

Initialize the `count` for storing the number of times an arm is pulled:

In [21]:
count = np.zeros(2)

Initialize the `sum_rewards` for storing the sum of rewards of each arm:

In [22]:
sum_rewards = np.zeros(2)

Initialize the `Q` for storing the average reward of each arm:

In [23]:
Q = np.zeros(2)

Define `num_rounds` - number of rounds (iterations):

In [24]:
num_rounds = 100

## Defining the softmax exploration function

Now, let's define the softmax function with temperature `T`, where $Q_t(a)$ is the current average reward of arm $a$ and $n$ is the number of arms:

$$P_t(a) = \frac{\text{exp}(Q_t(a)/T)} {\sum_{i=1}^n \text{exp}(Q_t(i)/T)} $$

In [25]:
def softmax(T):

    #compute the probability of each arm based on the above equation
    denom = sum([np.exp(i/T) for i in Q])
    probs = [np.exp(i/T)/denom for i in Q]

    #select the arm based on the computed probability distribution of arms
    arm = np.random.choice(env.action_space.n, p=probs)

    return arm
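
The function above reads `Q` and `env` from the notebook's global scope. As an alternative, here is a hedged sketch of a self-contained variant that takes the value estimates as an argument and subtracts the maximum before exponentiating, a common trick to avoid overflow at very low temperatures. The name `softmax_select` and its `rng` parameter are assumptions introduced for this example.

```python
import numpy as np

def softmax_select(q_values, temperature, rng=None):
    """Sample an arm index from the softmax distribution over q_values.

    Returns the sampled arm and the probabilities it was drawn from.
    """
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(q_values, dtype=float) / temperature
    # Subtracting the max leaves the probabilities unchanged but keeps exp() bounded
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    arm = int(rng.choice(len(probs), p=probs))
    return arm, probs

# A very low temperature would overflow exp(0.8 / 0.001) in the naive formula
arm_demo, probs_demo = softmax_select([0.8, 0.2], temperature=0.001,
                                      rng=np.random.default_rng(0))
print(arm_demo, probs_demo)
```

Passing a seeded generator makes the sampling reproducible, which can help when comparing temperature schedules.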

## Start pulling the arm

Now, let's play the game and try to find the best arm using the softmax exploration method.

Let's begin by setting the temperature `T` to a high number, say 50:

In [26]:
T = 50

In [27]:
# The gym_bandits environment requires a call to env.reset()
# before we can make the first env.step()
env.reset()

# Now we can start the game
for i in range(num_rounds):

    #select the arm based on the softmax exploration method
    arm = softmax(T)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm)

    #increment the count of the arm by 1
    count[arm] += 1

    #update the sum of rewards of the arm
    sum_rewards[arm]+=reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

    #reduce the temperature
    T = T*0.99
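
It can help to check how far the temperature actually falls under this schedule. The sketch below uses the same starting temperature, decay factor, and number of rounds as the loop above; the value estimates in `q_final` are an assumption chosen to match the arms' true win probabilities.

```python
import numpy as np

T0, decay, rounds = 50.0, 0.99, 100

# Temperature after all rounds of multiplicative decay
final_T = T0 * decay ** rounds

# Selection probabilities at that temperature for illustrative estimates
q_final = np.array([0.8, 0.2])  # assumed estimates, equal to the true win rates
final_probs = np.exp(q_final / final_T)
final_probs /= final_probs.sum()

print("final temperature:", round(final_T, 2))
print("selection probabilities:", np.round(final_probs, 3))
```

Under these settings the temperature ends near 18.3, which is still much larger than the rewards (between 0 and 1), so the selection probabilities stay close to uniform for the whole run. A smaller starting temperature or a faster decay would shift the agent towards exploitation sooner.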

After all the rounds, we look at the average reward obtained from each of the arms:

In [28]:
print(Q)

[0.75510204 0.17647059]


Now, we can select the optimal arm as the one that has the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm will be arm 1.

In [29]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
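
If `gym_bandits` is not available, the same experiment can be approximated with NumPy alone. This is a sketch under the assumption that each arm pays a reward of 1 with its win probability and 0 otherwise; the names `p_sim`, `count_sim`, `sum_sim`, and `Q_sim` are introduced here and do not come from the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
p_sim = np.array([0.8, 0.2])   # assumed win probabilities, as printed by env.p_dist
num_arms = len(p_sim)

count_sim = np.zeros(num_arms)   # number of pulls per arm
sum_sim = np.zeros(num_arms)     # total reward per arm
Q_sim = np.zeros(num_arms)       # average reward per arm
T_sim = 50.0

for _ in range(100):
    # Softmax selection probabilities at the current temperature
    probs = np.exp(Q_sim / T_sim)
    probs /= probs.sum()
    arm = rng.choice(num_arms, p=probs)

    # Simulated Bernoulli reward: 1 with probability p_sim[arm], else 0
    reward = float(rng.random() < p_sim[arm])

    count_sim[arm] += 1
    sum_sim[arm] += reward
    Q_sim[arm] = sum_sim[arm] / count_sim[arm]

    # Reduce the temperature
    T_sim *= 0.99

print(Q_sim)
print('The optimal arm is arm {}'.format(np.argmax(Q_sim) + 1))
```

The exact averages depend on the random seed, so they will generally differ from the output shown above.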
