# Implementing epsilon-greedy 

Now, let's learn how to implement the epsilon-greedy method to find the best arm.

First, let's import the necessary libraries:

In [2]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

For better understanding, let's create the bandit with only two arms:

In [3]:
env = gym.make("BanditTwoArmedHighLowFixed-v0")

Let's check the probability distribution of the arms:

In [13]:
print(env.p_dist)

[0.8, 0.2]


We can observe that arm 1 (index 0) wins the game with 80% probability and arm 2
(index 1) wins the game with 20% probability. The best arm is therefore arm 1. Now,
let's see how to find this best arm using the epsilon-greedy method.
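
To make the reward model concrete without depending on `gym_bandits`, here is a minimal sketch of a two-armed bandit simulated directly with NumPy. It assumes each arm pays a reward of 1 with its win probability and 0 otherwise, which is what the `[0.8, 0.2]` output suggests. The names `pull`, `rng`, and `num_pulls` are hypothetical and not part of the environment's API:

```python
import numpy as np

# Win probabilities copied from the env.p_dist output above.
p_dist = [0.8, 0.2]

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

def pull(arm):
    """Hypothetical stand-in for env.step: reward 1 with probability p_dist[arm], else 0."""
    return 1 if rng.random() < p_dist[arm] else 0

# Estimate the win rate of each arm by pulling it many times.
num_pulls = 10_000
win_rates = [sum(pull(arm) for _ in range(num_pulls)) / num_pulls
             for arm in range(len(p_dist))]
print(win_rates)
```

With this many pulls, the empirical win rates should land close to the underlying probabilities. The epsilon-greedy method has to discover the same thing while also deciding which arm to pull in each round.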

## Initialize the variables

First, let's initialize the variables:

Initialize the count for storing the number of times an arm is pulled:

In [5]:
count = np.zeros(2)

Initialize the sum_rewards for storing the sum of rewards of each arm:

In [6]:
sum_rewards = np.zeros(2)

Initialize the Q for storing the average reward of each arm:

In [7]:
Q = np.zeros(2)

Define the number of rounds (iterations):

In [8]:
num_rounds = 100

## Defining the epsilon-greedy method

First, we generate a random number from a uniform distribution. If the random number is
less than epsilon, we pull a random arm; otherwise, we pull the best arm, that is, the arm
with the maximum average reward, as shown below:

In [9]:
def epsilon_greedy(epsilon):
    
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q)
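
The function above reads the global `env` and `Q`. As a sketch of the same selection rule that runs on its own, the version below takes the reward estimates as an argument. The names `select_arm`, `rng`, and the sample `q_values` are hypothetical and introduced only for illustration:

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(42)

def select_arm(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return int(rng.integers(len(q_values)))
    # Exploit: pick the arm with the highest average reward so far.
    return int(np.argmax(q_values))

# Made-up estimates in which the second arm (index 1) currently looks best.
q_values = np.array([0.2, 0.7])

choices = [select_arm(q_values, 0.5) for _ in range(10_000)]
greedy_fraction = choices.count(1) / len(choices)
print(greedy_fraction)
```

With epsilon set to 0.5 and two arms, the greedy arm should be chosen in roughly 75% of the rounds: half of the rounds exploit it directly, and half of the exploring rounds land on it by chance. A smaller epsilon shifts this balance further toward exploitation.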

## Start pulling the arm

Now, let's play the game and try to find the best arm using the epsilon-greedy method.

In [10]:
for i in range(num_rounds):
    
    #select the arm based on the epsilon-greedy method
    arm = epsilon_greedy(0.5)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm) 

    #increment the count of the arm by 1
    count[arm] += 1
    
    #update the sum of rewards of the arm
    sum_rewards[arm] += reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
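
Keeping `sum_rewards` is one way to maintain the average. An equivalent form updates the average incrementally, moving the old estimate toward the new reward by a step of 1/n, where n is the number of times the arm has been pulled. The sketch below checks on a made-up reward sequence that both forms agree; `rewards`, `q_incremental`, and `q_batch` are illustrative names and not part of the code above:

```python
# Made-up rewards observed for a single arm.
rewards = [1, 0, 1, 1, 0, 1]

# Incremental form: Q_new = Q_old + (reward - Q_old) / n
q_incremental = 0.0
for n, r in enumerate(rewards, start=1):
    q_incremental += (r - q_incremental) / n

# Batch form used in the loop above: sum of rewards divided by the pull count.
q_batch = sum(rewards) / len(rewards)

print(q_incremental, q_batch)
```

Both values should match up to floating-point rounding. The incremental form only needs the current estimate and the pull count for each arm.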

After all the rounds, we look at the average reward obtained from each of the arms:

In [11]:
print(Q)

[0.77631579 0.20833333]


Now, we can select the optimal arm as the one with the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm is arm 1.

In [12]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
