# Simple Reinforcement Learning in Tensorflow Part 1:

## The Multi-armed bandit

This tutorial contains a simple example of how to build a policy-gradient-based agent that can solve the multi-armed bandit problem. For more information, see this Medium post.

For more Reinforcement Learning algorithms, including DQN and Model-based learning in TensorFlow, see my GitHub repo, DeepRL-Agents.

In [1]:
import tensorflow as tf
import numpy as np

## The Bandits

Here we define our bandits. For this example we are using a four-armed bandit. The pullBandit function draws a random number from a normal distribution with a mean of 0 and returns a positive reward when that number exceeds the bandit's value. The lower a bandit's value, the more likely it is to return a positive reward. We want our agent to learn to always choose the bandit that most often gives that positive reward.

In [22]:
# List out our bandits. Currently bandit 3 (index 2) has the lowest value, so it most often provides a positive reward.
bandits = [0.2, 0, -0.2, -0.0]
num_bandits = len(bandits)

def pullBandit(bandit):
    # Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1
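
To make the thresholds concrete, the following minimal sketch computes how often each bandit pays out, assuming (as in `pullBandit`) that the reward is +1 whenever a standard normal draw exceeds the bandit's value. The helper name `win_probability` is introduced here purely for illustration and is not part of the original notebook.

```python
import math

# Same thresholds as the `bandits` list in the notebook.
bandits = [0.2, 0, -0.2, -0.0]

def win_probability(threshold):
    # P(Z > threshold) for a standard normal Z, via the complementary error function.
    return 0.5 * math.erfc(threshold / math.sqrt(2))

win_probs = [win_probability(b) for b in bandits]
# Index of the bandit with the highest chance of a +1 reward.
best_bandit = max(range(len(bandits)), key=lambda i: win_probs[i])

for i, p in enumerate(win_probs):
    print("Bandit %d: P(reward = +1) = %.3f" % (i + 1, p))
print("Best bandit index:", best_bandit)
```

Under these assumptions bandit 3 (index 2) has the highest chance of a positive reward, roughly 58%, which agrees with the `np.argmax(-np.array(bandits))` check used at the end of training.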

## The Agent

The code below establishes our simple neural agent. It consists of a set of values, one for each of the bandits. Each value is an estimate of the return from choosing that bandit. We use a policy gradient method to update the agent: the value for the selected action is increased when the received reward is positive and decreased when it is negative.

In [23]:
tf.reset_default_graph()

# These two lines establish the feed-forward part of the network. This does the actual choosing.
weights = tf.Variable(tf.ones((num_bandits, )))
chosen_action = tf.argmax(weights, 0)

# The next six lines establish the training procedure. We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=(1, ), dtype=tf.float32)
action_holder = tf.placeholder(shape=(1, ), dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, (1, ))
loss = (-1.0) * tf.log(responsible_weight) * reward_holder
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
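
The effect of a single training step can be worked out by hand. The sketch below is a plain-Python illustration, not the TensorFlow graph itself: it assumes the loss is `-log(w[a]) * r` for chosen action `a` and reward `r`, so the gradient with respect to `w[a]` is `-r / w[a]` and one gradient descent step adds `learning_rate * r / w[a]` to that weight. The function name `policy_gradient_step` is hypothetical.

```python
def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    """Return a copy of `weights` after one gradient descent step on
    loss = -log(weights[action]) * reward."""
    updated = list(weights)
    # d(loss)/d(w[action]) = -reward / w[action]; all other weights have zero gradient.
    grad = -reward / weights[action]
    updated[action] = weights[action] - learning_rate * grad
    return updated

weights = [1.0, 1.0, 1.0, 1.0]  # same starting point as tf.ones((num_bandits, ))

after_win = policy_gradient_step(weights, action=2, reward=1)
after_loss = policy_gradient_step(weights, action=0, reward=-1)

print(after_win)
print(after_loss)
```

A positive reward nudges the chosen weight up (from 1.0 to about 1.001) and a negative reward nudges it down, so `tf.argmax(weights, 0)` gradually comes to favor the arms that pay out more often.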

## Training the Agent

In [24]:
total_episodes = 1000  # Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits)  # Set scoreboard for bandits to 0.
e = 0.1  # Set the chance of taking a random action.

# tf.initialize_all_variables() is deprecated; tf.global_variables_initializer() replaces it.
init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        reward = pullBandit(bandits[action]) #Get our reward from picking one of the bandits.
        
        #Update the network.
        _, resp, ww = sess.run(
            [update, responsible_weight, weights], 
            feed_dict={reward_holder:[reward], action_holder:[action]})
        
        #Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i += 1
print("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Running reward for the 4 bandits: [-1.  0.  0.  0.]
Running reward for the 4 bandits: [-3. -3. -4. -1.]
Running reward for the 4 bandits: [-4. -3. -1. -3.]
Running reward for the 4 bandits: [-5. -2. 14. -2.]
Running reward for the 4 bandits: [-6. -2. 34. -1.]
Running reward for the 4 bandits: [-5. -1. 42.  1.]
Running reward for the 4 bandits: [-4.  0. 45.  0.]
Running reward for the 4 bandits: [-4.  0. 41.  0.]
Running reward for the 4 bandits: [-5.  2. 54.  0.]
Running reward for the 4 bandits: [-5.  0. 70.  0.]
Running reward for the 4 bandits: [-4.  1. 84.  0.]
Running reward for the 4 bandits: [-4.  0. 92. -1.]
Running reward for the 4 bandits: [-4.  1. 89. -3.]
Running reward for the 4 bandits: [ -4.   0. 105.  -4.]
Running reward for the 4 bandits: [ -5.  -1. 123.  -4.]
Running reward for the 4 bandits: [ -5.  -2. 131.  -3.]
Running reward for the 4 bandits: [ -6.  -4. 148.  -3.]
Running reward for the 4 bandits: [ -5.  -4. 150.  -2.]
Running reward for the 4 bandits: [ -6.  -3.