# Reinforcement Learning in TensorFlow Tutorial 1
## The two-armed bandit

In [1]:
import numpy as np
import tensorflow as tf

### The Bandits

Here we define our bandits. For this example we are using a two-armed bandit. The pullBandit function draws a random number from a standard normal distribution (mean 0) and returns a positive reward if the draw exceeds the bandit's value, and a negative reward otherwise. The lower the bandit's value, the more likely a positive reward will be returned. We want our agent to learn to choose the bandit that most often gives a positive reward.

In [2]:
#List out our two bandits. Currently bandit 0 is the optimal choice.
bandits = [-0.5,0.5]
def pullBandit(bandit):
    #Draw a sample from a standard normal distribution.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1

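To make the difference between the two bandits concrete, the chance of a positive reward can be computed from the standard normal CDF. The sketch below is an illustration, not part of the original notebook, and the helper name `positive_reward_probability` is hypothetical. It uses only the standard library.

```python
import math

def positive_reward_probability(threshold):
    """P(z > threshold) for z ~ N(0, 1), via the standard normal CDF."""
    cdf = 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))
    return 1.0 - cdf

for index, threshold in enumerate([-0.5, 0.5]):
    p_win = positive_reward_probability(threshold)
    # Rewards are +1 or -1, so the expected reward is p_win - (1 - p_win).
    expected_reward = p_win * 1 + (1 - p_win) * -1
    print("bandit %d: P(+1) = %.3f, expected reward = %+.3f"
          % (index, p_win, expected_reward))
```

With the values above, bandit 0 pays out a positive reward roughly 69% of the time and bandit 1 roughly 31% of the time, which is why bandit 0 is the better choice.
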
### The neural network agent

Here we set up all the parameters that will be used for training our network.

In [3]:
# hyperparameters
learning_rate = 0.1
gamma = 0.99 # discount factor for reward (not used in this single-step task)

Next we define our very simple neural network.

In [4]:
tf.reset_default_graph()
#While there aren't any states in this task, we will still use an x input as a placeholder.
input_x = tf.placeholder(tf.float32, [None,1] , name="input_x")
W = tf.Variable(tf.random_normal([1,1]),name='W') # Our single variable we are training
score = tf.matmul(input_x,W)
probability = tf.nn.sigmoid(score) # This is the likelihood of choosing bandit 1 over bandit 0

#Below we compute and set the gradients to use for adjusting the network towards a successful policy.
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
rewardSig = tf.placeholder(tf.float32,name="reward_sig")

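#grad_ys sets the upstream gradient fed into d(probability)/dW: r*(1-p)/p when action 0 was taken (input_y = 1) and -r when action 1 was taken (input_y = 0), where r is rewardSig and p is probability.
theGrad = tf.gradients(probability,W,grad_ys=((input_y*rewardSig)/probability) - rewardSig)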

adam = tf.train.AdamOptimizer(learning_rate=learning_rate)
updateGrads = adam.apply_gradients([(theGrad[0],W)])

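The effect of the `grad_ys` expression is easier to see with the chain rule written out by hand. The sketch below is a plain-Python approximation of what `tf.gradients` computes for the single weight, assuming the input is always 1 and that `input_y` is 1 when action 0 was taken (the convention used in the training loop). The function names are hypothetical, and Adam's rescaling of the step is ignored.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient(w, actions, rewards):
    """Sum of grad_ys * d(probability)/dW over a batch, with input x = 1."""
    p = sigmoid(w)  # probability of choosing bandit 1
    total = 0.0
    for action, reward in zip(actions, rewards):
        y = 1.0 if action == 0 else 0.0      # the notebook's "fake label"
        grad_y = (y * reward) / p - reward   # same expression as grad_ys
        total += grad_y * p * (1.0 - p)      # chain rule through the sigmoid
    return total

# A rewarded pull of bandit 0 gives a positive gradient, so a descent step
# lowers W and with it the probability of choosing bandit 1.
print(batch_gradient(0.0, actions=[0], rewards=[1.0]))
# A rewarded pull of bandit 1 gives a negative gradient, so W rises.
print(batch_gradient(0.0, actions=[1], rewards=[1.0]))
```

In both cases a gradient-descent step shifts the probability toward the action that was rewarded, and a negative reward flips the sign.
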
### Running the agent and environment

In [5]:
xs,drs,ys = [],[],[]
total_episodes = 500
running_reward = None
reward_sum = 0
episode_number = 1

init = tf.global_variables_initializer() # replaces the deprecated tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    while episode_number <= total_episodes:
        #Generate a placeholder state
        x = np.ones([1,1])
        xs.append(x) 

        # forward the policy network and sample an action from the returned probability
        prob = sess.run(probability,feed_dict={input_x: x})
        action = 1 if np.random.uniform() < prob else 0

        y = 1 if action == 0 else 0 # a "fake label"
        ys.append(y)
        
        # Take our action in the environment, and get a reward
        reward = np.float64(pullBandit(bandits[action]))
        reward_sum += reward
        drs.append(reward) # record reward 

        
        if episode_number % 10 == 0: # Periodically update the network policy
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            xs,drs,ys = [],[],[] # reset array memory

            # Update the network weight towards choosing better actions, given what it has observed.
            probBefore = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})
            sess.run(updateGrads,feed_dict={input_x: epx, input_y: epy, rewardSig: epr})
            probAfter = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})

            # Keep a record of rewards, and give some feedback
            running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
            print('Total reward %f. Action: %f. Prob Before %f. Prob After %f.' % (reward_sum, action, probBefore[0][0], probAfter[0][0]))
            reward_sum = 0
            
        episode_number += 1

Total reward 4.000000. Action: 1.000000. Prob Before 0.143842. Prob After 0.131960.
Total reward 2.000000. Action: 0.000000. Prob Before 0.131960. Prob After 0.122341.
Total reward 0.000000. Action: 0.000000. Prob Before 0.122341. Prob After 0.115324.
Total reward 6.000000. Action: 0.000000. Prob Before 0.115324. Prob After 0.107501.
Total reward 2.000000. Action: 0.000000. Prob Before 0.107501. Prob After 0.099853.
Total reward 0.000000. Action: 0.000000. Prob Before 0.099853. Prob After 0.093116.
Total reward 2.000000. Action: 0.000000. Prob Before 0.093116. Prob After 0.086765.
Total reward 4.000000. Action: 0.000000. Prob Before 0.086765. Prob After 0.080318.
Total reward 0.000000. Action: 0.000000. Prob Before 0.080318. Prob After 0.074615.
Total reward 4.000000. Action: 0.000000. Prob Before 0.074615. Prob After 0.068990.
Total reward 4.000000. Action: 0.000000. Prob Before 0.068990. Prob After 0.063545.
Total reward 2.000000. Action: 0.000000. Prob Before 0.063545. Prob After 0.

You should see that the agent learns to almost always choose action 0, and the probability of choosing action 1 decreases to near zero. Feel free to play with the two bandit values, and see how the agent changes what it learns.
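
For quick experiments with different bandit values, the loop can be approximated without TensorFlow. The sketch below is a simplified stand-in, not the notebook's implementation: it assumes plain gradient descent instead of Adam, starts the weight at 0 rather than at a random value, and uses hypothetical names (`pull_bandit`, `train`). It keeps the same gradient expression and the same update every 10 episodes.

```python
import math
import random

def pull_bandit(threshold, rng):
    """Return +1 if a standard normal draw exceeds the threshold, else -1."""
    return 1 if rng.gauss(0.0, 1.0) > threshold else -1

def train(bandits, episodes=500, batch=10, lr=0.1, seed=0):
    """Return the final probability of choosing bandit 1."""
    rng = random.Random(seed)
    w = 0.0     # assumption: start at probability 0.5
    grad = 0.0  # gradient accumulated over the current batch
    for episode in range(1, episodes + 1):
        p = 1.0 / (1.0 + math.exp(-w))   # probability of choosing bandit 1
        action = 1 if rng.random() < p else 0
        y = 1 if action == 0 else 0      # same "fake label" as the notebook
        r = pull_bandit(bandits[action], rng)
        # Same upstream signal as grad_ys, times d(sigmoid)/dw for input 1.
        grad += (y * r / p - r) * p * (1.0 - p)
        if episode % batch == 0:
            w -= lr * grad               # plain gradient step, not Adam
            grad = 0.0
    return 1.0 / (1.0 + math.exp(-w))

print("P(choose bandit 1) after training: %f" % train([-0.5, 0.5]))
```

Swapping the two values in the list passed to `train` is a quick way to check that the learned preference follows the better bandit.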