Attempting to solve the Multi-Armed Bandit problem

In [32]:
# This notebook targets TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.0.
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np

<h3>The Bandits</h3>
Here we define our bandits. For this example we are using a four-armed bandit, where each arm is represented by a randomly drawn threshold value. The pullBandit function draws a random number from a normal distribution with a mean of 0 and returns a reward of 1 if the draw exceeds the bandit's threshold, and -1 otherwise. The lower the bandit's threshold, the more likely a positive reward will be returned. We want our agent to learn to choose the bandit that most often gives a positive reward.

In [28]:
b1 = round(5 * np.random.randn(), 2)
b2 = round(5 * np.random.randn(), 2)
b3 = round(5 * np.random.randn(), 2)
b4 = round(5 * np.random.randn(), 2)

bandits = [b1, b2, b3, b4]
num_bandits = len(bandits)

def pullBandit(bandit):
    # Draw a scalar from N(0, 1); the reward is +1 if it exceeds the bandit's threshold, else -1.
    result = np.random.randn()
    if result > bandit:
        return 1
    else:
        return -1
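
Because pullBandit returns 1 whenever a standard normal draw exceeds the threshold, the chance of a positive reward for a threshold b is 1 - Φ(b), where Φ is the standard normal CDF. The sketch below works this out using only the standard library. The helper names `positive_reward_probability` and `expected_reward` are illustrative and not part of the notebook.

```python
import math

def positive_reward_probability(threshold):
    """P(z > threshold) for z ~ N(0, 1), i.e. the chance pullBandit returns +1."""
    # Standard normal CDF: 0.5 * (1 + erf(x / sqrt(2))).
    cdf = 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))
    return 1.0 - cdf

def expected_reward(threshold):
    """Expected value of a +1/-1 reward: p * (+1) + (1 - p) * (-1)."""
    p = positive_reward_probability(threshold)
    return 2.0 * p - 1.0

# Thresholds from the sample run shown in this notebook.
for b in [14.03, 4.06, -1.53, -0.43]:
    print(b, round(positive_reward_probability(b), 3), round(expected_reward(b), 3))
```

For the sample thresholds, the third bandit (threshold -1.53) pays out roughly 94% of the time, which is consistent with it collecting most of the reward in the training output.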

<h3>The Agent</h3>
The code below establishes our simple neural agent. It consists of a set of values, one for each of the bandits. Each value is an estimate of the return from choosing that bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.

In [29]:
tf.reset_default_graph()

weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
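
To see what a single `update` does, the gradient can be worked out by hand: for loss = -log(w_a) * r, the derivative with respect to the chosen weight is -r / w_a, so one gradient-descent step changes w_a by learning_rate * r / w_a and leaves the other weights untouched. The following is a minimal plain-Python sketch of that step, not the TensorFlow implementation itself; `policy_gradient_step` is a hypothetical helper name.

```python
def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    """Apply one gradient-descent step on loss = -log(w[action]) * reward."""
    # d(loss)/d(w[action]) = -reward / w[action]; all other gradients are zero.
    grad = -reward / weights[action]
    updated = list(weights)
    updated[action] = weights[action] - learning_rate * grad
    return updated

weights = [1.0, 1.0, 1.0, 1.0]   # same starting values as tf.ones([num_bandits])
weights = policy_gradient_step(weights, action=2, reward=1)    # positive reward raises w[2]
weights = policy_gradient_step(weights, action=0, reward=-1)   # negative reward lowers w[0]
print(weights)
```

A positive reward nudges the chosen weight up and a negative reward nudges it down, which is what lets `tf.argmax(weights, 0)` settle on the most rewarding bandit over many episodes.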

<h3>Training the Agent</h3>
We will train our agent by taking actions in our environment and receiving rewards. Using the rewards and actions, we can update our network so that it more often chooses the actions that yield the highest rewards over time.

In [31]:
print("Bandits:", bandits)

total_episodes = 1000
total_reward = np.zeros(num_bandits) #'scoreboard' for bandits
e = 0.1 #chance of taking random action

init = tf.global_variables_initializer()

with tf.Session() as sess:

    sess.run(init)

    for i in range(total_episodes):
        
        if np.random.rand(1) < e: #for random action
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        # Get reward from chosen Bandit
        reward = pullBandit(bandits[action])
        
        _, resp, ww = sess.run([update, responsible_weight, weights],
                              feed_dict={reward_holder:[reward], action_holder:[action]})
        
        total_reward[action] += reward
        
        if i % 50 == 0:
            print ("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        

print ("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print ("...and it was right!")
else:
    print ("...and it was wrong!")

Bandits: [14.03, 4.06, -1.53, -0.43]
Running reward for the 4 bandits: [0. 0. 1. 0.]
Running reward for the 4 bandits: [ 0.  0. 42. -1.]
Running reward for the 4 bandits: [-1.  0. 87. -3.]
Running reward for the 4 bandits: [ -1.   0. 124.  -4.]
Running reward for the 4 bandits: [ -1.  -1. 167.  -2.]
Running reward for the 4 bandits: [ -3.  -1. 203.  -2.]
Running reward for the 4 bandits: [ -5.  -3. 235.  -4.]
Running reward for the 4 bandits: [ -6.  -3. 274.  -2.]
Running reward for the 4 bandits: [ -7.  -4. 311.   1.]
Running reward for the 4 bandits: [ -9.  -7. 349.   2.]
Running reward for the 4 bandits: [-12.  -8. 386.   3.]
Running reward for the 4 bandits: [-12.  -9. 422.   4.]
Running reward for the 4 bandits: [-13. -11. 460.   5.]
Running reward for the 4 bandits: [-13. -13. 498.   7.]
Running reward for the 4 bandits: [-14. -15. 534.   8.]
Running reward for the 4 bandits: [-15. -15. 571.   8.]
Running reward for the 4 bandits: [-15. -16. 604.   8.]
Running reward for the 4 ba
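
The training loop above picks a random bandit with probability `e` and the highest-weighted bandit otherwise, a scheme usually called epsilon-greedy. A small sketch of that selection rule in plain Python follows. The function name `choose_action` and the example weights are assumptions made for illustration.

```python
import random

def choose_action(weights, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))   # explore: uniform over all arms
    # Exploit: index of the largest weight (first one on ties, like argmax).
    return max(range(len(weights)), key=weights.__getitem__)

rng = random.Random(0)                 # seeded only so the sketch is repeatable
example_weights = [0.2, 0.9, 0.1, 0.4]  # hypothetical learned values
picks = [choose_action(example_weights, 0.1, rng) for _ in range(10000)]
print("fraction choosing the greedy arm:", picks.count(1) / len(picks))
```

With four arms and `e = 0.1`, the greedy arm is selected with probability 0.9 + 0.1 / 4 = 0.925, because a random exploratory pick can also land on it.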

<h3>The Contextual Bandits</h3>
Here we define our contextual bandits. In this example, we are using three four-armed bandits. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. The pullArm method generates a random number from a normal distribution with a mean of 0 and returns a reward of 1 if that number exceeds the chosen arm's threshold, and -1 otherwise. The lower the arm's threshold, the more likely a positive reward will be returned. We want our agent to learn to choose the arm that will most often give a positive reward, depending on the bandit presented.

In [33]:
class contextual_bandit():
    
    def __init__(self):
        self.state = 0
        self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],
                                 [-5,5,5,5]])
        self.num_bandits = self.bandits.shape[0]
        self.num_actions = self.bandits.shape[1]
        
    def getBandit(self):
        self.state = np.random.randint(0, len(self.bandits)) #Returns a random state for each episode
        return self.state
    
    def pullArm(self, action):
        bandit = self.bandits[self.state, action]
        result = np.random.randn(1)
        if result > bandit:
            return 1
        else:
            return -1
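
Since a lower threshold means a higher chance of a positive reward, the best arm for each bandit is simply the one with the smallest threshold. The sketch below lists the payout probability of every arm and the best arm per bandit, using only the standard library. `payout_probability` and `best_arms` are illustrative names, and the thresholds are copied from `contextual_bandit.__init__`.

```python
import math

# Thresholds copied from contextual_bandit.__init__ (rows: bandits, columns: arms).
thresholds = [[0.2, 0, -0.0, -5],
              [0.1, -5, 1, 0.25],
              [-5, 5, 5, 5]]

def payout_probability(threshold):
    """P(z > threshold) for z ~ N(0, 1), the chance pullArm returns +1."""
    return 1.0 - 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))

best_arms = []
for state, arms in enumerate(thresholds):
    probs = [round(payout_probability(t), 3) for t in arms]
    best = min(range(len(arms)), key=arms.__getitem__)   # lowest threshold wins
    best_arms.append(best)
    print("bandit", state + 1, "payout probabilities", probs, "best arm", best + 1)
```

The best arms are arm 4 for bandit 1, arm 2 for bandit 2, and arm 1 for bandit 3, so a single fixed action cannot be optimal for every state.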

<h3>The Policy-Based Agent</h3>
The code below establishes our simple neural agent. It takes the current state as input and returns an action. This allows the agent to take actions that are conditioned on the state of the environment, a critical step toward being able to solve full RL problems. The agent uses a single set of weights, within which each value is an estimate of the return from choosing a particular arm given a bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.

In [41]:
class agent():
    
    def __init__(self, lr, s_size, a_size):
        self.state_in = tf.placeholder(shape=[1], dtype=tf.int32)
        state_in_OH = slim.one_hot_encoding(self.state_in, s_size)
        output = slim.fully_connected(state_in_OH, a_size, biases_initializer=None,
                                      activation_fn=tf.nn.sigmoid, 
                                      weights_initializer=tf.ones_initializer())
        self.output = tf.reshape(output, [-1])
        self.chosen_action = tf.argmax(self.output, 0)
        
        # training procedure
        self.reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output, self.action_holder, [1])
        self.loss = -(tf.log(self.responsible_weight) * self.reward_holder)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        self.update = optimizer.minimize(self.loss)
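
Because the input is a one-hot vector and the layer has no bias, the fully connected layer effectively selects the row of the weight matrix that belongs to the current state and passes it through a sigmoid. The following plain-Python sketch of the forward pass illustrates this. It is not the slim implementation, and `one_hot`, `sigmoid`, and `agent_output` are hypothetical helper names.

```python
import math

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def agent_output(state, weights):
    """Forward pass of a bias-free fully connected layer on a one-hot state."""
    s_size, a_size = len(weights), len(weights[0])
    x = one_hot(state, s_size)
    # Vector-matrix product x @ W; with a one-hot x this picks out row `state`.
    logits = [sum(x[s] * weights[s][a] for s in range(s_size)) for a in range(a_size)]
    return [sigmoid(z) for z in logits]

# 3 states x 4 actions, all ones, matching tf.ones_initializer().
weights = [[1.0] * 4 for _ in range(3)]
weights[1][2] = 2.0   # hypothetical value, as if arm 3 of bandit 2 had been reinforced
output = agent_output(1, weights)
chosen_action = max(range(len(output)), key=output.__getitem__)
print(output, chosen_action)
```

Each state therefore has its own independent row of action values, which is why the training code can read the learned preferences directly from the weight matrix.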

<h3>Training the Agent</h3>
We will train our agent by getting a state from the environment, taking an action, and receiving a reward. Using these three things, we can update our network so that, given a state, it more often chooses the actions that yield the highest rewards over time.

In [44]:
tf.reset_default_graph()

cBandit = contextual_bandit()
myAgent = agent(lr=0.001, s_size=cBandit.num_bandits, a_size=cBandit.num_actions)
weights = tf.trainable_variables()[0]

total_episodes = 10000
total_reward = np.zeros([cBandit.num_bandits, cBandit.num_actions])
error = 0.1 # chance of taking a random action

init = tf.global_variables_initializer()

with tf.Session() as sess:
    
    sess.run(init)
    
    for i in range(total_episodes):
        s = cBandit.getBandit()
        if np.random.rand(1) < error:
            action = np.random.randint(cBandit.num_actions)
        else:
            action = sess.run(myAgent.chosen_action, feed_dict={myAgent.state_in:[s]})
        
        reward = cBandit.pullArm(action)
        
        feed_dict = {myAgent.reward_holder:[reward], myAgent.action_holder:[action], myAgent.state_in:[s]}
        _, ww = sess.run([myAgent.update, weights], feed_dict=feed_dict)
        
        total_reward[s, action] += reward
        if i % 500 == 0:
            print ("Mean reward for each of the " + str(cBandit.num_bandits) + " bandits: " + str(np.mean(total_reward,axis=1)))
            
# return results
for a in range (cBandit.num_bandits):
    print ("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print ("...and it was right!")
    else:
        print ("...and it was wrong!")               

Mean reward for each of the 3 bandits: [0.   0.   0.25]
Mean reward for each of the 3 bandits: [36.   33.5  36.25]
Mean reward for each of the 3 bandits: [66.5  71.5  71.75]
Mean reward for each of the 3 bandits: [111.   109.5  103.75]
Mean reward for each of the 3 bandits: [151.5  147.   135.75]
Mean reward for each of the 3 bandits: [193.75 182.25 171.25]
Mean reward for each of the 3 bandits: [233.   221.   205.25]
Mean reward for each of the 3 bandits: [274.75 256.5  240.  ]
Mean reward for each of the 3 bandits: [312.5  297.   271.25]
Mean reward for each of the 3 bandits: [348.75 334.25 306.25]
Mean reward for each of the 3 bandits: [389.   372.75 336.5 ]
Mean reward for each of the 3 bandits: [427.5  410.75 369.5 ]
Mean reward for each of the 3 bandits: [467.5  450.75 402.5 ]
Mean reward for each of the 3 bandits: [505.5  487.   436.25]
Mean reward for each of the 3 bandits: [539.75 529.   472.5 ]
Mean reward for each of the 3 bandits: [582.25 568.25 503.75]
Mean reward for each