## Multi-armed bandits with TensorFlow

In [1]:
!pip install tensorflow==1.14



In [7]:
import tensorflow as tf
import numpy as np

### Bandits
The pullBandit function draws a random number from a standard normal distribution (mean 0, variance 1) and returns a reward of 1 if the draw exceeds the bandit's value, and -1 otherwise. The lower a bandit's value, the more likely it is to return a positive reward.

Objective: learn to always choose the bandit that is most likely to give a positive reward.

In [8]:
# Threshold for each bandit: the lower the value, the more likely a pull returns a positive reward
bandits = [0.8, 0.5, -0.1, -0.8, -2.5]
num_bandits = len(bandits)
def pullBandit(bandit):
    # draw one sample from a standard normal distribution
    result = np.random.randn(1)
    if result > bandit:
        # win: the sample exceeds the bandit's threshold
        return 1
    else:
        # loss
        return -1
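
Because the draw is standard normal, the chance that a pull of a bandit with threshold b pays out is 1 - Φ(b), where Φ is the standard normal CDF. As a quick check on which arm is best, here is a minimal sketch that computes this probability and the expected reward per pull using only the standard library. The helper name `win_probability` is an assumption introduced for illustration and is not part of the notebook.

```python
import math

bandits = [0.8, 0.5, -0.1, -0.8, -2.5]

def win_probability(threshold):
    # P(Z > threshold) for Z ~ N(0, 1), via the complementary error function
    return 0.5 * math.erfc(threshold / math.sqrt(2))

win_probs = [win_probability(b) for b in bandits]
# Reward is +1 with probability p and -1 otherwise, so E[reward] = 2p - 1
expected_rewards = [2 * p - 1 for p in win_probs]

for arm, (p, r) in enumerate(zip(win_probs, expected_rewards), start=1):
    print("arm %d: P(win) = %.3f, expected reward = %+.3f" % (arm, p, r))
```

Under these assumptions the fifth arm (threshold -2.5) pays out most often, which is the arm that the final check `np.argmax(-np.array(bandits))` treats as correct.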

### Arm-pulling agent

Initialization of the agent and its behavior.
The agent consists of a set of weights, one per bandit. Each weight is an estimate of how rewarding it is to choose that bandit.

A policy-gradient loss, minimized by gradient descent, updates the weight of the chosen arm based on the reward received.
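
To make the update concrete, here is a short derivation for the loss defined in the next cell. The symbols are introduced here for illustration only: $w_a$ is the weight of the chosen arm $a$, $r$ is the reward, and $\alpha$ is the learning rate.

$$
L = -\log(w_a)\, r, \qquad
\frac{\partial L}{\partial w_a} = -\frac{r}{w_a}, \qquad
w_a \leftarrow w_a - \alpha \frac{\partial L}{\partial w_a} = w_a + \alpha \frac{r}{w_a}
$$

A reward of +1 therefore raises the chosen weight and a reward of -1 lowers it. The weights of the other arms have zero gradient and stay unchanged.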

In [9]:
tf.reset_default_graph()

# choosing the arm
weights = tf.Variable(tf.ones([num_bandits])) # at the beginning all the arms have equal weights
chosen_action = tf.argmax(weights,0) # choose the arm with the max weight at the moment

# placeholders
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1]) # take the weight of the arm that was pulled
loss = -(tf.log(responsible_weight)*reward_holder) # for a positive reward, the loss decreases as the weight grows
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
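
For intuition, one gradient step on this loss can be reproduced without TensorFlow. The sketch below assumes plain gradient descent on `-log(w[action]) * reward`, whose gradient with respect to the chosen weight is `-reward / w[action]`. The function name `policy_gradient_step` is hypothetical and only mirrors what `update` does in the graph above.

```python
import numpy as np

def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    # Return a copy of the weights after one gradient-descent step on the loss
    new_weights = weights.astype(float)  # astype returns a copy by default
    grad = -reward / new_weights[action]  # d/dw of -log(w) * reward
    new_weights[action] -= learning_rate * grad
    return new_weights

weights = np.ones(5)
after_win = policy_gradient_step(weights, action=4, reward=1)
after_loss = policy_gradient_step(weights, action=0, reward=-1)
print(after_win)
print(after_loss)
```

Only the weight of the pulled arm moves: up after a win, down after a loss.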

### Agent training

We train the agent by letting it pull arms in the environment and receive rewards. As training proceeds, the agent becomes more likely to choose the actions that bring the greater reward.
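
The action choice used during training is an epsilon-greedy rule: with a small probability the agent explores a random arm, otherwise it exploits the arm with the largest weight. A minimal NumPy sketch of that rule follows. The function name `epsilon_greedy`, the example weights, and the seeded `np.random.RandomState` are assumptions for illustration; the notebook itself uses `np.random.rand` and `sess.run(chosen_action)`.

```python
import numpy as np

def epsilon_greedy(weights, epsilon, rng):
    # Explore: with probability epsilon pick any arm uniformly at random
    if rng.rand() < epsilon:
        return int(rng.randint(len(weights)))
    # Exploit: otherwise pick the arm with the largest current weight
    return int(np.argmax(weights))

rng = np.random.RandomState(0)
weights = np.array([1.0, 0.9, 1.2, 0.8, 1.1])  # made-up weights for the demo
actions = [epsilon_greedy(weights, 0.05, rng) for _ in range(1000)]
greedy_share = actions.count(2) / len(actions)
print("share of pulls on the highest-weight arm:", greedy_share)
```

With `epsilon = 0.05` most pulls go to the highest-weight arm, while the occasional random pull gives the other arms a chance to have their weights updated.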

In [10]:
total_episodes = 1000 # number of training episodes
total_reward = np.zeros(num_bandits) # cumulative reward per bandit, initially 0
e = 0.05 # probability of taking a random action

init = tf.global_variables_initializer() # replaces the deprecated tf.initialize_all_variables()

# Run the TensorFlow graph in a session
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        
        # Exploration vs. Exploitation. With probability e we take a random action.
        # Choose the arm
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        reward = pullBandit(bandits[action]) # Rewarding
        
        # Update the weights
        _, resp, ww = sess.run([update,responsible_weight,weights], feed_dict={reward_holder:[reward],action_holder:[action]})
        
        # Update the reward sum
        total_reward[action] += reward
        if i % 50 == 0:
            print ("Current reward " + str(total_reward))
        i+=1
print ("\nAgent say that the arm № " + str(np.argmax(ww)+1) + " is the most appropriate!")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print ("...and it was right!")
else:
    print ("...and it was wrong!")

Current reward [-1.  0.  0.  0.  0.]
Current reward [-1. -2.  1.  0. 13.]
Current reward [ 0. -2.  1.  0. 62.]
Current reward [  0.  -2.   0.   1. 106.]
Current reward [  0.  -2.  -1.   1. 155.]
Current reward [  0.  -2.  -2.   1. 204.]
Current reward [  0.  -3.  -2.   1. 253.]
Current reward [  0.  -3.  -2.   2. 302.]
Current reward [ -2.  -4.  -2.   3. 348.]
Current reward [ -2.  -3.  -1.   3. 396.]
Current reward [ -2.  -4.  -1.   3. 445.]
Current reward [ -2.  -4.   1.   3. 493.]
Current reward [ -3.  -4.   2.   3. 539.]
Current reward [ -3.  -4.   2.   4. 586.]
Current reward [ -3.  -5.   2.   4. 635.]
Current reward [ -3.  -5.   2.   4. 685.]
Current reward [ -2.  -5.   2.   4. 734.]
Current reward [ -3.  -5.   3.   4. 780.]
Current reward [ -3.  -6.   3.   4. 827.]
Current reward [ -4.  -7.   4.   4. 872.]

Agent say that the arm № 5 is the most appropriate!
...and it was right!
