# Reinforcement Learning

- Goal: automatically determine the ideal behavior within a specific context, so that the agent maximizes its performance.
- Simple reward feedback is used to reinforce good behavior.
- Main elements:
  - Agent: the model we are programming
  - Observations + Rewards: the input to the agent
  - Environment: the problem we are trying to solve, often a game. Its state is represented as an array. OpenAI Gym offers a good starting point.
  - Action: decisions of the model
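
The way these elements interact can be sketched as a loop. The example below is only an illustration and does not use Gym: `CoinFlipEnv`, `random_agent` and `run_episode` are hypothetical names made up for this sketch, and the `reset`/`step` methods merely mimic the shape of the Gym interface covered in the sections that follow.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a coin flip.

    The observation is the latest coin outcome (0 or 1) and the episode
    ends after a fixed number of steps.
    """

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the initial observation
        self.steps_taken = 0
        return self.rng.randint(0, 1)

    def step(self, action):
        # Flip the coin and reward the agent if its guess was right
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps_taken += 1
        done = self.steps_taken >= self.episode_length
        return coin, reward, done, {}


def random_agent(observation, rng):
    """Agent that ignores the observation and picks a random action."""
    return rng.randint(0, 1)


def run_episode(env, agent, seed=1):
    """Run one episode and return the total reward collected."""
    rng = random.Random(seed)
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent(observation, rng)                     # agent decides
        observation, reward, done, info = env.step(action)  # environment reacts
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnv(), random_agent)
print("Total reward:", total)
```

A learning agent would replace `random_agent` with a function whose choices depend on the observation and improve from the rewards.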

## Useful resources

Wikipedia: https://en.wikipedia.org/wiki/Reinforcement_learning

Keras: https://github.com/matthiasplappert/keras-rl

RL Algorithm Repository: https://github.com/dennybritz/reinforcement-learning

Berkeley course: http://rll.berkeley.edu/deeprlcourse/

Articles: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0


## OpenAI Gym

Official documentation: gym.openai.com

Available environments: http://gym.openai.com/envs/#classic_control

## Set-up

- Use an editor or IDE together with the command line.
- Write the code in `.py` files; do not use notebooks.


In [1]:
import gym
print("It WORKS")

It WORKS


## Gym Elements

The `step` function returns the following values:

- observation: The state of the environment.
- reward: The amount of reward achieved by the previous action. The number may vary depending on the environment. 
- done: Boolean indicating whether the environment needs to be reset.
- info: Dictionary with diagnostic information, used for debugging.


In [21]:
import gym
import time

In [23]:
env = gym.make('CartPole-v0')
print('Initial Observation')
observation = env.reset()
print(observation)
print('\n')
for _ in range(10):
    env.render()
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    print("Performed One Random Action")
    print('observation')
    print(observation)
    print('reward')
    print(reward)
    print('done')
    print(done)
    print('info')
    print(info)
    print('\n')
    time.sleep(0.5)
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Initial Observation
[ 0.02038799  0.01437287  0.02857049  0.01362293]


Performed One Random Action
observation
[ 0.02067544  0.20907368  0.02884295 -0.26991052]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.02485692  0.40377241  0.02344473 -0.5533585 ]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.03293237  0.59855742  0.01237756 -0.8385636 ]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.04490351  0.79350819 -0.00439371 -1.12732843]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.06077368  0.59844407 -0.02694028 -0.83602684]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.07274256  0.79392345 -0.04366081 -1.13705899]
reward
1.0
done
False
info
{}


Performed One Random Action
observation
[ 0.08862103  0.59939893 -0.06640199 -0.85838247

In [27]:
import gym
env = gym.make('CartPole-v0')
# Documentation: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L40
env.action_space

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


Discrete(2)

In [28]:
env.observation_space

Box(4,)

In [31]:
import gym 
env = gym.make('CartPole-v0')
observation = env.reset()

for t in range(1000):
    env.render()
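    # Unpack the four observation values: cart position, cart velocity,
    # pole angle and pole angular velocity
    cart_pos, cart_vel, pole_ang, ang_vel = observation
    # Push right (1) when the pole leans right, otherwise push left (0)
    action = int(pole_ang > 0)
    observation, reward, done, info = env.step(action)
    if done:
        break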
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


## Simple Neural Network Game

- Create a Neural Network that returns the probability of going left, given the four numbers of our observation.
- Using that probability, we randomly choose which action to take.
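
The idea can be sketched without TensorFlow. The snippet below is an illustration only: it replaces the network with a single weighted sum followed by a sigmoid, and `probability_left`, `sample_action` and the weight values are hypothetical names and numbers chosen for this example, not part of the model built in this section.

```python
import math
import random


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_left(observation, weights, bias):
    """Single-layer stand-in for the network: weighted sum, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, observation)) + bias
    return sigmoid(z)


def sample_action(p_left, rng):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


rng = random.Random(0)
observation = [0.02, 0.01, 0.03, 0.01]  # same layout as a CartPole observation
weights = [0.1, -0.2, 0.5, 0.3]         # arbitrary illustrative values
p_left = probability_left(observation, weights, bias=0.0)
action = sample_action(p_left, rng)
print(p_left, action)
```

Sampling instead of always picking the most likely action lets the agent keep exploring both moves.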

In [32]:
import tensorflow as tf

In [33]:
import gym

In [34]:
import numpy as np

In [35]:
num_inputs = 4
num_hidden = 4
num_outputs = 1  # Probability of going left; P(right) = 1 - P(left)

In [36]:
initializer = tf.contrib.layers.variance_scaling_initializer()

In [37]:
X = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden_layer_one = tf.layers.dense(
    X, 
    num_hidden, 
    activation=tf.nn.relu, 
    kernel_initializer=initializer
)
hidden_layer_two = tf.layers.dense(
    hidden_layer_one, 
    num_hidden, 
    activation=tf.nn.relu, 
    kernel_initializer=initializer
)
output_layer = tf.layers.dense(
    hidden_layer_two, 
    num_outputs,
    activation=tf.nn.sigmoid,
    kernel_initializer=initializer
)

In [38]:
probabilities = tf.concat(axis=1, values=[output_layer, 1-output_layer])

In [39]:
# tf.multinomial expects unnormalized log-probabilities (logits), so take the log first
action = tf.multinomial(tf.log(probabilities), num_samples=1)

In [40]:
init = tf.global_variables_initializer()

In [42]:
epi = 250
step_limit = 500
env = gym.make('CartPole-v0')
avg_steps = []

with tf.Session() as sess:
    init.run()
    
    for i_episode in range(epi):
        obs = env.reset()
        
        for step in range(step_limit):
            action_val = action.eval(feed_dict={X: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(action_val[0][0]) # 0 or 1
            
            if done:
                avg_steps.append(step)
                print("Done after {} steps".format(step))
                break
                
print("After {} episodes, average steps per game was {}".format(epi, np.mean(avg_steps)))
env.close()    

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Done after 12 steps
Done after 13 steps
Done after 30 steps
Done after 10 steps
Done after 43 steps
Done after 15 steps
Done after 15 steps
Done after 13 steps
Done after 17 steps
Done after 8 steps
Done after 15 steps
Done after 10 steps
Done after 12 steps
Done after 16 steps
Done after 29 steps
Done after 23 steps
Done after 15 steps
Done after 38 steps
Done after 25 steps
Done after 22 steps
Done after 9 steps
Done after 13 steps
Done after 11 steps
Done after 10 steps
Done after 21 steps
Done after 28 steps
Done after 13 steps
Done after 19 steps
Done after 16 steps
Done after 14 steps
Done after 11 steps
Done after 8 steps
Done after 23 steps
Done after 10 steps
Done after 12 steps
Done after 12 steps
Done after 11 steps
Done after 13 steps
Done after 11 steps
Done after 17 steps
Done after 12 steps
Done after 20 steps
Done after 40 steps
Done after 27 steps
Done after 9 st

## Policy gradients

- The previous algorithm only considered the last action and its consequences.
- Credit assignment problem: which actions should be credited when the agent gets rewarded at time `t`?
- Solution: apply a discount rate, typically chosen between 0.95 and 0.99. A reward received `k` steps after an action is multiplied by the discount rate raised to the power `k`, so later rewards count less toward that action's score.
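
As a worked illustration of the discounting, here is a small sketch that applies the definition directly. `discounted_score` is a name made up for this example, and the rewards and the rate of 0.8 are arbitrary values picked to keep the arithmetic easy to follow: with rewards 10, 0 and -50, the first action scores 10 + 0.8 * 0 + 0.8^2 * (-50) = -22.

```python
def discounted_score(rewards, t, discount_rate):
    """Score of the action taken at step t.

    Sums the rewards from step t onward, each weighted by
    discount_rate ** k, where k is the number of steps after t.
    """
    return sum(
        (discount_rate ** k) * reward
        for k, reward in enumerate(rewards[t:])
    )


rewards = [10.0, 0.0, -50.0]
scores = [discounted_score(rewards, t, 0.8) for t in range(len(rewards))]
print(scores)
```

The first action receives a negative score even though its immediate reward was positive, because it is held partly responsible for the penalty two steps later.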

Basic algorithm:
- Calculate the score of an action based on the rewards that follow it
- Normalize the scores by subtracting the mean and dividing by the standard deviation.

- The neural network plays several episodes
- The optimizer computes the gradients (`compute_gradients`) instead of calling `minimize`
- Compute each action's discounted and normalized score.
- Multiply each gradient vector by the corresponding action's score
- Negative scores produce opposite gradients when multiplied
- Calculate the mean of the resulting gradient vectors and apply it as the Gradient Descent step
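
The normalization and weighting steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: gradients are plain lists of floats rather than TensorFlow tensors, and `normalize` and `weighted_mean_gradient` are hypothetical helper names invented for this example.

```python
from statistics import mean, pstdev


def normalize(scores):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def weighted_mean_gradient(gradients, scores):
    """Multiply each step's gradient vector by its score, then average.

    gradients: one gradient vector (list of floats) per step
    scores: one normalized score per step
    """
    n = len(gradients)
    dim = len(gradients[0])
    return [
        sum(score * grad[i] for grad, score in zip(gradients, scores)) / n
        for i in range(dim)
    ]


raw_scores = [1.0, 2.0, 3.0]
scores = normalize(raw_scores)
gradients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up values
update = weighted_mean_gradient(gradients, scores)
print(scores, update)
```

Steps with below-average scores get negative weights, so their gradients push the parameters in the opposite direction.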


In [69]:
import tensorflow as tf
import gym
import numpy as np

In [70]:
tf.reset_default_graph()

In [71]:
num_inputs = 4
num_hidden = 4
num_outputs = 1

In [72]:
learning_rate = 0.01

In [73]:
initializer = tf.contrib.layers.variance_scaling_initializer()

In [74]:
X = tf.placeholder(tf.float32, shape=[None, num_inputs])

In [75]:
hidden_layer = tf.layers.dense(
    X, 
    num_hidden, 
    activation=tf.nn.elu, 
    kernel_initializer=initializer
)
logits = tf.layers.dense(
    hidden_layer, 
    num_outputs
)
outputs = tf.nn.sigmoid(logits)

In [76]:
probabilities = tf.concat(axis=1, values=[outputs, 1-outputs])
# tf.multinomial expects unnormalized log-probabilities (logits), so take the log first
action = tf.multinomial(tf.log(probabilities), num_samples=1)

In [77]:
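# Target label: 1.0 if the sampled action was 0 (left), else 0.0.
# This treats the sampled action as if it were the correct one.
y = 1.0 - tf.to_float(action)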

In [78]:
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=y, 
    logits=logits
)
optimizer = tf.train.AdamOptimizer(learning_rate)

In [79]:
gradients_and_variables = optimizer.compute_gradients(cross_entropy)

In [80]:
gradients = []
gradient_placeholders = []
grads_and_vars_feed = []

for gradient, variable in gradients_and_variables:
    gradients.append(gradient)
    gradient_placeholder = tf.placeholder(tf.float32, shape=gradient.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))

In [81]:
training_op = optimizer.apply_gradients(grads_and_vars_feed)

In [82]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [83]:
def helper_discount_rewards(rewards, discount_rate):
    '''
    Takes in rewards and applies discount rate
    '''
    discounted_rewards = np.zeros(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    '''
    Takes in all rewards, applies helper_discount function and then normalizes
    using mean and std.
    '''
    all_discounted_rewards = []
    for rewards in all_rewards:
        all_discounted_rewards.append(
            helper_discount_rewards(rewards,discount_rate)
        )

    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean)/reward_std for discounted_rewards in all_discounted_rewards]

In [84]:
env = gym.make('CartPole-v0')
num_game_rounds = 10
max_game_steps = 1000
num_iterations = 750 
discount_rate = 0.9

with tf.Session() as sess:
    sess.run(init)
    for iteration in range(num_iterations):
        print("On iteration {}".format(iteration))
        
        all_rewards = []
        all_gradients = []
        
        for game in range(num_game_rounds):
            
            current_rewards = []
            current_gradients = []
            observation = env.reset()
            
            for step in range(max_game_steps):
                
                action_val, gradients_val = sess.run(
                    [ action, gradients ], 
                    feed_dict={X:observation.reshape(1, num_inputs)}
                )
                observation, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                
                if done:
                    break
                    
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)
            
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        
        for var_index, gradient_placeholder in enumerate(gradient_placeholders):
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                for game_index, rewards in enumerate(all_rewards)
                    for step, reward in enumerate(rewards)], axis=0
            )
            feed_dict[gradient_placeholder] = mean_gradients
            
        sess.run(training_op, feed_dict)
        
        print('SAVING GRAPH AND SESSION')
        meta_graph_def = tf.train.export_meta_graph(filename='./models/davizuku-policy-model.meta')
        saver.save(sess, './models/davizuku-policy-model')
        

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
On iteration 0
SAVING GRAPH AND SESSION
On iteration 1
SAVING GRAPH AND SESSION
On iteration 2
SAVING GRAPH AND SESSION
On iteration 3
SAVING GRAPH AND SESSION
On iteration 4
SAVING GRAPH AND SESSION
On iteration 5
SAVING GRAPH AND SESSION
On iteration 6
SAVING GRAPH AND SESSION
On iteration 7
SAVING GRAPH AND SESSION
On iteration 8
SAVING GRAPH AND SESSION
On iteration 9
SAVING GRAPH AND SESSION
On iteration 10
SAVING GRAPH AND SESSION
On iteration 11
SAVING GRAPH AND SESSION
On iteration 12
SAVING GRAPH AND SESSION
On iteration 13
SAVING GRAPH AND SESSION
On iteration 14
SAVING GRAPH AND SESSION
On iteration 15
SAVING GRAPH AND SESSION
On iteration 16
SAVING GRAPH AND SESSION
On iteration 17
SAVING GRAPH AND SESSION
On iteration 18
SAVING GRAPH AND SESSION
On iteration 19
SAVING GRAPH AND SESSION
On iteration 20
SAVING GRAPH AND SESSION
On iteration 21
SAVING GRAPH AND SESSION


On iteration 196
SAVING GRAPH AND SESSION
On iteration 197
SAVING GRAPH AND SESSION
On iteration 198
SAVING GRAPH AND SESSION
On iteration 199
SAVING GRAPH AND SESSION
On iteration 200
SAVING GRAPH AND SESSION
On iteration 201
SAVING GRAPH AND SESSION
On iteration 202
SAVING GRAPH AND SESSION
On iteration 203
SAVING GRAPH AND SESSION
On iteration 204
SAVING GRAPH AND SESSION
On iteration 205
SAVING GRAPH AND SESSION
On iteration 206
SAVING GRAPH AND SESSION
On iteration 207
SAVING GRAPH AND SESSION
On iteration 208
SAVING GRAPH AND SESSION
On iteration 209
SAVING GRAPH AND SESSION
On iteration 210
SAVING GRAPH AND SESSION
On iteration 211
SAVING GRAPH AND SESSION
On iteration 212
SAVING GRAPH AND SESSION
On iteration 213
SAVING GRAPH AND SESSION
On iteration 214
SAVING GRAPH AND SESSION
On iteration 215
SAVING GRAPH AND SESSION
On iteration 216
SAVING GRAPH AND SESSION
On iteration 217
SAVING GRAPH AND SESSION
On iteration 218
SAVING GRAPH AND SESSION
On iteration 219
SAVING GRAPH AND 

On iteration 392
SAVING GRAPH AND SESSION
On iteration 393
SAVING GRAPH AND SESSION
On iteration 394
SAVING GRAPH AND SESSION
On iteration 395
SAVING GRAPH AND SESSION
On iteration 396
SAVING GRAPH AND SESSION
On iteration 397
SAVING GRAPH AND SESSION
On iteration 398
SAVING GRAPH AND SESSION
On iteration 399
SAVING GRAPH AND SESSION
On iteration 400
SAVING GRAPH AND SESSION
On iteration 401
SAVING GRAPH AND SESSION
On iteration 402
SAVING GRAPH AND SESSION
On iteration 403
SAVING GRAPH AND SESSION
On iteration 404
SAVING GRAPH AND SESSION
On iteration 405
SAVING GRAPH AND SESSION
On iteration 406
SAVING GRAPH AND SESSION
On iteration 407
SAVING GRAPH AND SESSION
On iteration 408
SAVING GRAPH AND SESSION
On iteration 409
SAVING GRAPH AND SESSION
On iteration 410
SAVING GRAPH AND SESSION
On iteration 411
SAVING GRAPH AND SESSION
On iteration 412
SAVING GRAPH AND SESSION
On iteration 413
SAVING GRAPH AND SESSION
On iteration 414
SAVING GRAPH AND SESSION
On iteration 415
SAVING GRAPH AND 

On iteration 588
SAVING GRAPH AND SESSION
On iteration 589
SAVING GRAPH AND SESSION
On iteration 590
SAVING GRAPH AND SESSION
On iteration 591
SAVING GRAPH AND SESSION
On iteration 592
SAVING GRAPH AND SESSION
On iteration 593
SAVING GRAPH AND SESSION
On iteration 594
SAVING GRAPH AND SESSION
On iteration 595
SAVING GRAPH AND SESSION
On iteration 596
SAVING GRAPH AND SESSION
On iteration 597
SAVING GRAPH AND SESSION
On iteration 598
SAVING GRAPH AND SESSION
On iteration 599
SAVING GRAPH AND SESSION
On iteration 600
SAVING GRAPH AND SESSION
On iteration 601
SAVING GRAPH AND SESSION
On iteration 602
SAVING GRAPH AND SESSION
On iteration 603
SAVING GRAPH AND SESSION
On iteration 604
SAVING GRAPH AND SESSION
On iteration 605
SAVING GRAPH AND SESSION
On iteration 606
SAVING GRAPH AND SESSION
On iteration 607
SAVING GRAPH AND SESSION
On iteration 608
SAVING GRAPH AND SESSION
On iteration 609
SAVING GRAPH AND SESSION
On iteration 610
SAVING GRAPH AND SESSION
On iteration 611
SAVING GRAPH AND 

In [88]:
env = gym.make('CartPole-v0')

observations = env.reset()
with tf.Session() as sess:
    # https://www.tensorflow.org/api_guides/python/meta_graph
    saver = tf.train.import_meta_graph('./models/davizuku-policy-model.meta')
    saver.restore(sess,'./models/davizuku-policy-model')

    for x in range(500):
        env.render()
        action_val, gradients_val = sess.run([action, gradients], feed_dict={X: observations.reshape(1, num_inputs)})
        observations, reward, done, info = env.step(action_val[0][0])
        # Stop once the episode ends; calling step() after done is undefined behavior
        if done:
            break

env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from ./models/davizuku-policy-model
