# Implementing Softmax Exploration

Now, let's learn how to implement softmax exploration to find the best arm.

First, let's import the necessary libraries:

In [3]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

Let's take the same two-armed bandit we saw in the epsilon-greedy section: 

In [4]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")

Let's check the probability distribution of the arms:

In [5]:
print(env.p_dist)

[0.8, 0.2]


We can observe that with arm 1 we win the game with 80% probability and with arm 2 we
win the game with 20% probability. Here, the best arm is arm 1, since it gives the
higher probability of winning. Now, let's see how to find this best arm using the
softmax exploration method.

## Initialize the variables

First, let's initialize the variables:

Initialize the count for storing the number of times an arm is pulled:

In [6]:
count = np.zeros(2)

Initialize the sum_rewards for storing the sum of rewards of each arm:

In [7]:
sum_rewards = np.zeros(2)

Initialize the Q for storing the average reward of each arm:

In [8]:
Q = np.zeros(2)

Define the number of rounds (iterations):

In [9]:
num_rounds = 100

## Defining the softmax exploration function

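Now, let's define the softmax function. With temperature $T$, the probability of selecting arm $a$ at round $t$, given the average rewards $Q_t$ of the $n$ arms, is: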

$$P_t(a) = \frac{\text{exp}(Q_t(a)/T)} {\sum_{i=1}^n \text{exp}(Q_t(i)/T)} $$

In [10]:
def softmax(T):
    
    #compute the probability of each arm based on the above equation
    denom = sum([np.exp(i/T) for i in Q]) 
    probs = [np.exp(i/T)/denom for i in Q]
    
    #select the arm based on the computed probability distribution of arms
    arm = np.random.choice(env.action_space.n, p=probs)
    
    return arm
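To see how the temperature shapes the distribution, here is a minimal standalone sketch. The helper name `softmax_probs` and the example values in `q_example` are hypothetical and not part of the notebook above. It also subtracts the maximum Q value before exponentiating, a common trick that leaves the probabilities unchanged while avoiding overflow when Q/T is large.

```python
import numpy as np

def softmax_probs(q_values, temperature):
    """Return softmax selection probabilities for the given temperature."""
    q = np.asarray(q_values, dtype=float)
    # Subtracting the max does not change the result of the softmax,
    # but keeps np.exp from overflowing for large Q/T.
    z = (q - q.max()) / temperature
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical average rewards, chosen only for illustration
q_example = [0.8, 0.2]

for temp in (50, 1, 0.1):
    print(temp, softmax_probs(q_example, temp))
```

With these example values, the probability of the first arm is close to 0.5 at T = 50, around 0.65 at T = 1, and nearly 1 at T = 0.1. A high temperature explores almost uniformly, while a low temperature almost always picks the arm with the highest average reward.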

## Start pulling the arm

Now, let's play the game and try to find the best arm using the softmax exploration method.

Let's begin by setting the temperature T to a high number, say 50:

In [11]:
T = 50

In [12]:
for i in range(num_rounds):
    
    #select the arm based on the softmax exploration method
    arm = softmax(T)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm) 

    #increment the count of the arm by 1
    count[arm] += 1
    
    #update the sum of rewards of the arm
    sum_rewards[arm] += reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
    
    #reduce the temperature
    T = T*0.99
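It can help to check how far this schedule actually lowers the temperature. The sketch below recomputes the final temperature in closed form and the resulting selection probability. The values in `Q_example` are illustrative assumptions set to the arms' win probabilities, not output from the loop.

```python
import numpy as np

initial_T = 50.0   # starting temperature used above
decay = 0.99       # multiplicative decay applied after every round
num_rounds = 100

# Temperature left after all rounds: initial_T * decay ** num_rounds
final_T = initial_T * decay ** num_rounds

# Probability of picking arm 1 at that temperature, for illustrative
# average rewards equal to the true win probabilities (an assumption)
Q_example = np.array([0.8, 0.2])
weights = np.exp(Q_example / final_T)
p_arm1 = weights[0] / weights.sum()

print(f"final T = {final_T:.2f}, P(arm 1) = {p_arm1:.3f}")
```

Under these assumptions the temperature ends at about 18.3 and arm 1 is still chosen only about 51% of the time. Because the rewards lie between 0 and 1, this schedule keeps the selection close to uniform for all 100 rounds. A lower starting temperature or a faster decay would shift more pulls toward the better arm.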

After all the rounds, we look at the average reward obtained from each of the arms:

In [13]:
print(Q)

[0.84090909 0.17857143]


Now, we can select the optimal arm as the one with the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm will be arm 1.

In [14]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
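For readers who do not have `gym_bandits` installed, the same procedure can be sketched with NumPy alone. This is a stand-in rather than the environment used above: the function `run_softmax_bandit` and its parameters are hypothetical, and it assumes each pull pays a reward of 1 with the arm's win probability and 0 otherwise.

```python
import numpy as np

def run_softmax_bandit(p_dist, num_rounds=100, initial_T=50.0, decay=0.99, seed=None):
    """Softmax exploration on a simulated Bernoulli bandit (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    p_dist = np.asarray(p_dist, dtype=float)
    n_arms = len(p_dist)

    count = np.zeros(n_arms)        # number of pulls per arm
    sum_rewards = np.zeros(n_arms)  # total reward per arm
    Q = np.zeros(n_arms)            # average reward per arm
    T = initial_T

    for _ in range(num_rounds):
        # softmax probabilities; subtracting the max keeps exp() from overflowing
        z = (Q - Q.max()) / T
        probs = np.exp(z)
        probs /= probs.sum()

        # select an arm according to the softmax distribution
        arm = rng.choice(n_arms, p=probs)

        # simulated pull (assumption): reward 1 with probability p_dist[arm], else 0
        reward = float(rng.random() < p_dist[arm])

        # update the statistics of the pulled arm
        count[arm] += 1
        sum_rewards[arm] += reward
        Q[arm] = sum_rewards[arm] / count[arm]

        # reduce the temperature
        T *= decay

    return Q, count

Q, count = run_softmax_bandit([0.8, 0.2], seed=0)
print(Q, count)
print('The optimal arm is arm {}'.format(np.argmax(Q) + 1))
```

Passing a `seed` makes a run reproducible. The exact averages depend on the random draws, so they will differ from the values printed earlier in this section.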
