# Reinforcement Learning: Model-Based RL and Model Learning

#### In this notebook the Cart-Pole task is solved with model-based RL and learned dynamics: one network learns the environment dynamics (the Model) and a second network serves as the Actor. Interaction with the real environment is still needed for both networks to converge.

In [9]:
import numpy as np
import pickle
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
import math

In [2]:
import sys
import gym
env = gym.make('CartPole-v0')

In [3]:
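# RL and network hyperparameters

H = 8                # number of hidden neurons in the Actor network
lr = 1e-2            # learning rate (used by both Adam optimizers)
gamma = .99          # discount factor for rewards
decay_rate = .99     # not used in this notebook
resume = False       # not used in this notebook

# Batch sizes (episodes per Actor update / phase switch)
model_bs = 3         # when episodes are drawn from the learned model
real_bs = 3          # when episodes are drawn from the real environment

D = 4                # observation dimensionality (not referenced below)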




The Actor network is a simple NN with a single hidden layer of H neurons. It takes the state of the environment as input and outputs the probability of choosing action 1.
The objective is to maximize the advantage-weighted log-likelihood of the actions taken (the action choice is a two-class classification problem); training is done with the Adam optimizer.
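
The label convention used by this objective can be checked with a small NumPy sketch. It assumes, as the training loop further down does, that the stored label is `y = 1 - action` and that the network output `p` is the probability of action 1. The function name `chosen_action_prob` and the sample numbers are illustrative and not part of the notebook.

```python
import numpy as np

def chosen_action_prob(y, p):
    """Probability the policy assigned to the action actually taken.

    y is the stored label (y = 1 - action) and p = P(action = 1).
    This mirrors the expression passed to tf.log in the Actor cell.
    """
    y = np.asarray(y, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    return y * (y - p) + (1 - y) * (y + p)

p = np.array([0.8, 0.8])          # P(action = 1) at two time steps
actions = np.array([1, 0])        # actions sampled at those steps
y = 1 - actions                   # labels as stored in `ys`
probs = chosen_action_prob(y, p)  # 0.8 for action 1, 0.2 for action 0

# Advantage-weighted negative log-likelihood, as in `loss`
advantages = np.array([1.0, -1.0])
loss = -np.mean(np.log(probs) * advantages)
```

Minimizing this loss raises the probability of actions with positive advantage and lowers it for actions with negative advantage.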

In [5]:
tf.reset_default_graph()

# Actor Network
# Simple 1-Hidden layer network (sigmoid output layer)
observations = tf.placeholder(shape=[None, 4], dtype=tf.float32, name="input_x")
W1 = tf.get_variable("W1", shape=[4, H], initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, W1))
W2 = tf.get_variable("W2", shape=[H, 1], initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, W2)
probability = tf.nn.sigmoid(score)

# Training ops: log-likelihood loss, gradients and batched Adam update
tvars = tf.trainable_variables()
input_y = tf.placeholder(shape=[None, 1], dtype=tf.float32, name="input_y")
advantages = tf.placeholder(dtype=tf.float32, name="reward_signal")
optim = tf.train.AdamOptimizer(learning_rate=lr)
W1Grad = tf.placeholder(tf.float32, name="batch_grad1")
W2Grad = tf.placeholder(tf.float32, name="batch_grad2")
batchGrad = [W1Grad, W2Grad]
# input_y is 1 - action, so the argument of the log is the probability of the action taken
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)
newGrads = tf.gradients(loss, tvars)
updateGrads = optim.apply_gradients(zip(batchGrad, tvars))



The Model Network is a bit more complex: the input is the vector $(state, action)$ and the outputs are:
* Predicted State (4-dimensional)
* Predicted Reward (1-dimensional)
* Predicted End (1-dimensional)

The network has 2 hidden layers shared among all the outputs, while the last layer has separate weights for each output. The objective is the mean of the summed per-output losses: squared error for the state and reward predictions (regression) and negative log-likelihood for the predicted end (two-class classification). Training is done with the Adam optimizer.
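
A NumPy sketch of this combined objective on made-up numbers may help make the shapes concrete. It assumes the three per-sample losses are summed before averaging, as in the Model cell below. Because the observation loss has four columns while the other two have one, the sum broadcasts to shape `[N, 4]`. The function name `model_loss_np` and all sample values are illustrative only.

```python
import numpy as np

def model_loss_np(true_obs, pred_obs, true_r, pred_r, true_d, pred_d):
    """Mean of the summed per-output losses, plus the shape of the sum."""
    obs_loss = np.square(true_obs - pred_obs)            # shape [N, 4]
    reward_loss = np.square(true_r - pred_r)             # shape [N, 1]
    # Likelihood of the true done flag under the predicted probability
    done_lik = pred_d * true_d + (1 - pred_d) * (1 - true_d)
    done_loss = -np.log(done_lik)                        # shape [N, 1]
    total = obs_loss + done_loss + reward_loss           # broadcasts to [N, 4]
    return total.mean(), total.shape

true_obs = np.zeros((2, 4))
pred_obs = np.full((2, 4), 0.1)      # every state entry is off by 0.1
true_r = np.ones((2, 1))
pred_r = np.ones((2, 1))             # reward predicted exactly
true_d = np.array([[0.0], [1.0]])
pred_d = np.array([[0.5], [0.5]])    # maximally uncertain done prediction

loss, summed_shape = model_loss_np(true_obs, pred_obs, true_r, pred_r, true_d, pred_d)
```

Here each state entry contributes 0.01 and each done prediction contributes ln 2, so the loss is 0.01 + ln 2.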

In [6]:
# Model Network
mH = 256

input_data = tf.placeholder(shape=[None, 5], dtype=tf.float32)
  
# Shared Part of the network
previous_state = tf.placeholder(dtype=tf.float32, shape=[None, 5], name="previous_state")
W1M = tf.get_variable("W1M", shape=[5, mH], initializer=tf.contrib.layers.xavier_initializer())
B1M = tf.Variable(tf.zeros([mH]), name="B1M")
layer1M = tf.nn.relu(tf.matmul(previous_state, W1M) + B1M)
W2M = tf.get_variable("W2M", shape=[mH, mH], initializer=tf.contrib.layers.xavier_initializer())
B2M = tf.Variable(tf.zeros([mH]), name="B2M")
layer2M = tf.nn.relu(tf.matmul(layer1M, W2M) + B2M)

# Different final layers for each output
wO = tf.get_variable("wO", shape=[mH, 4], initializer=tf.contrib.layers.xavier_initializer())
wR = tf.get_variable("wR", shape=[mH, 1], initializer=tf.contrib.layers.xavier_initializer())
wD = tf.get_variable("wD", shape=[mH, 1], initializer=tf.contrib.layers.xavier_initializer())

bO = tf.Variable(tf.zeros([4]), name="bO")
bR = tf.Variable(tf.zeros([1]), name="bR")
bD = tf.Variable(tf.ones([1]), name="bD")

predicted_observation = tf.matmul(layer2M, wO, name="predicted_observation") + bO
predicted_reward = tf.matmul(layer2M, wR, name="predicted_reward") + bR
predicted_done = tf.sigmoid(tf.matmul(layer2M, wD, name="predicted_done") + bD)

true_observation = tf.placeholder(dtype=tf.float32, shape=[None, 4], name="true_observation")
true_reward = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="true_reward")
true_done = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="true_done")

predicted_state = tf.concat([predicted_observation, predicted_reward, predicted_done], 1)

# Losses and optimizer
observation_loss = tf.square(true_observation - predicted_observation)
reward_loss = tf.square(true_reward - predicted_reward)
done_loss = tf.multiply(predicted_done, true_done) + tf.multiply(1-predicted_done, 1-true_done)
done_loss = -tf.log(done_loss)

model_loss = tf.reduce_mean(observation_loss + done_loss + reward_loss)

modelOptim = tf.train.AdamOptimizer(learning_rate=lr)
updateModel = modelOptim.minimize(model_loss)

In [7]:
# Helper functions
def resetGradBuffer(gradBuffer):
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    return gradBuffer

def discount_rewards(r):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, len(r))):
        running_add = r[t] + gamma * running_add
        discounted_r[t] = running_add
    return discounted_r

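def stepModel(sess, xs, action):
    # Feed the last observation and the chosen action to the Model network
    toFeed = np.reshape(np.hstack([xs[-1][0], np.array(action)]), [1, 5])
    myPredict = sess.run([predicted_state], feed_dict={previous_state: toFeed})
    # predicted_state columns: 0-3 observation, 4 reward, 5 done probability
    reward = myPredict[0][:, 4]
    observation = myPredict[0][:, :4]
    # Keep the predicted cart position and pole angle within bounds
    observation[:, 0] = np.clip(observation[:, 0], -2.4, 2.4)
    observation[:, 2] = np.clip(observation[:, 2], -2.4, 2.4)
    doneP = np.clip(myPredict[0][:, 5], 0, 1)
    # End the simulated episode if termination is predicted or after 300 steps
    if doneP > 0.1 or len(xs) >= 300:
        done = True
    else:
        done = False
    return observation, reward, done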
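
As a quick check of `discount_rewards`, the sketch below re-implements it as a standalone function and applies the same mean/std normalisation that the training loop uses before computing the Actor gradients. Unlike the notebook version, which reads `gamma` from a global, this one takes it as an argument; the three-step reward sequence and the value `gamma=0.5` are made up for illustration.

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Discounted return at every time step, computed back to front."""
    r = np.asarray(r, dtype=np.float64)
    discounted = np.zeros_like(r)
    running_add = 0.0
    for t in reversed(range(len(r))):
        running_add = r[t] + gamma * running_add
        discounted[t] = running_add
    return discounted

# One reward per step; with gamma = 0.5 the returns are 1.75, 1.5, 1.0
returns = discount_rewards([1.0, 1.0, 1.0], gamma=0.5)

# Normalise to zero mean and unit variance, as done before the Actor update
normalised = (returns - returns.mean()) / returns.std()
```

Earlier steps receive larger returns, so after normalisation early actions in a long episode get positive advantages and the final ones get negative advantages.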

Both networks are trained on whole episodes (on-policy RL, similar to Monte Carlo) as follows:
1. The **Model DNN is trained** by taking actions in the **real environment**, using real observations and rewards
2. The **Actor DNN is trained** by taking actions in the **Model**, using the predicted observations and rewards

These two phases alternate once the Model DNN has been trained on the first 100 episodes.
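
The alternation can be traced without any networks. The sketch below replays only the counters and flags of the training loop, assuming both batch sizes are 3 and the warm-up threshold is 100, as configured in this notebook. The function name `phase_schedule` is illustrative and not part of the notebook.

```python
def phase_schedule(total_episodes, batch_size=3, warmup=100):
    """Return (episode_number, source) pairs, source being 'real' or 'model'."""
    draw_from_model = False
    switch_point = 1
    episode_number = 1
    schedule = []
    while episode_number < total_episodes:
        # One full episode would run here, in the real env or in the model
        source = "model" if draw_from_model else "real"
        schedule.append((episode_number, source))
        episode_number += 1
        # Flags are only toggled at batch boundaries, after the warm-up
        if switch_point + batch_size == episode_number:
            switch_point = episode_number
            if episode_number > warmup:
                draw_from_model = not draw_from_model
    return schedule

schedule = phase_schedule(112)
sources = [source for _, source in schedule]
```

With these settings the flags first flip at the batch boundary after episode 102, so episodes 103 to 105 are the first ones drawn from the Model, followed by three real episodes, and so on.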


In [8]:
xs, drs, ys, ds = [], [], [], []  # reset history
total_episodes = 2000
running_reward = None
reward_sum = 0
episode_number = 1
real_episodes = 1

init = tf.global_variables_initializer()
batch_size = real_bs

drawFromModel = False    # True if DNN model is used to predict observations
trainModel = True        # True if DNN for model must be trained
trainPolicy = False      # True if Actor DNN must be trained

switch_point = 1

with tf.Session() as sess:
    rendering = False
    sess.run(init)
    observation = env.reset()
    x = observation
    gradBuffer = sess.run(tvars)
    gradBuffer = resetGradBuffer(gradBuffer)
    
    while episode_number < total_episodes:
        if (reward_sum/batch_size > 150 and not drawFromModel) or rendering:
            env.render()
            rendering = True
        x = np.reshape(observation, [1, 4])
        
        # Actor DNN action prediction
        tfprob = sess.run(probability, feed_dict={observations: x})
        if np.random.uniform() < tfprob:
            action = 1
            y = 0
        else:
            action = 0
            y = 1
            
        xs.append(x)
        ys.append(y)
        
        # Model/Real environment step
        if not drawFromModel:
            observation, reward, done, info = env.step(action)
        else: 
            observation, reward, done = stepModel(sess, xs, action)
            
        reward_sum += reward
        
        ds.append(done*1)
        drs.append(reward)
        
        # Update DNNs if episode is done
        if done:
            if not drawFromModel:
                real_episodes += 1
            episode_number += 1
            
            # Stacking all obs, rewards, done from episode
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            epd = np.vstack(ds)
            # reset history buffer
            xs, drs, ys, ds = [], [], [], []
            
            if trainModel:
                actions = np.array([np.abs(y-1) for y in epy][:-1])
                state_prevs = epx[:-1, :]
                state_prevs = np.hstack([state_prevs, actions])
                state_nexts = epx[1:, :]
                rewards = np.array(epr[1:, :])
                dones = np.array(epd[1:, :])
                state_nextsAll = np.hstack([state_nexts, rewards, dones])
                
                feed_dict = {previous_state: state_prevs, true_observation: state_nexts, \
                             true_done: dones, true_reward: rewards}
                loss, pStat, _ = sess.run([model_loss, predicted_state, updateModel], feed_dict=feed_dict)
            
            if trainPolicy:
                discounted_epr = discount_rewards(epr).astype('float32')
                discounted_epr -= np.mean(discounted_epr)
                discounted_epr /= np.std(discounted_epr)
                tGrad = sess.run(newGrads, feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
                
                # break if every entry of the W1 gradient is NaN (NaN != NaN)
                if np.sum(tGrad[0] == tGrad[0]) == 0:
                    break
                for ix, grad in enumerate(tGrad):
                    gradBuffer[ix] += grad
            
            # Every batch_size episodes: apply the accumulated Actor gradients (if the Actor
            # is being trained), log progress, and possibly switch phase.
            # Gradients are computed at the end of each episode and summed in gradBuffer.
            if switch_point + batch_size == episode_number:
                switch_point = episode_number
                if trainPolicy:
                    sess.run(updateGrads, feed_dict={W1Grad: gradBuffer[0], W2Grad: gradBuffer[1]})
                    gradBuffer = resetGradBuffer(gradBuffer)
                    
                if running_reward is None:
                    running_reward = reward_sum
                else:
                    running_reward = running_reward * gamma + reward_sum * (1 - gamma)
                
                if not drawFromModel:
                    print('World Perf: Episode %f. Reward %f. action: %f. mean reward %f.' % (real_episodes,reward_sum/real_bs,action, running_reward/real_bs))
                    if reward_sum/batch_size > 200:
                        break
                reward_sum = 0

                # Once the model has been trained on 100 episodes, we start alternating between training the policy
                # from the model and training the model from the real environment.
                if episode_number > 100:
                    drawFromModel = not drawFromModel
                    trainModel = not trainModel
                    trainPolicy = not trainPolicy
            
            if drawFromModel:
                observation = np.random.uniform(-0.1,0.1,[4]) # Generate reasonable starting point
                batch_size = model_bs
            else:
                observation = env.reset()
                batch_size = real_bs
                
print(real_episodes)

World Perf: Episode 4.000000. Reward 22.333333. action: 1.000000. mean reward 22.333333.
World Perf: Episode 7.000000. Reward 17.333333. action: 1.000000. mean reward 22.283333.
World Perf: Episode 10.000000. Reward 15.666667. action: 1.000000. mean reward 22.217167.
World Perf: Episode 13.000000. Reward 17.666667. action: 1.000000. mean reward 22.171662.
World Perf: Episode 16.000000. Reward 18.666667. action: 0.000000. mean reward 22.136612.
World Perf: Episode 19.000000. Reward 20.333333. action: 0.000000. mean reward 22.118579.
World Perf: Episode 22.000000. Reward 26.333333. action: 1.000000. mean reward 22.160726.
World Perf: Episode 25.000000. Reward 16.333333. action: 1.000000. mean reward 22.102453.
World Perf: Episode 28.000000. Reward 19.000000. action: 1.000000. mean reward 22.071428.
World Perf: Episode 31.000000. Reward 22.333333. action: 1.000000. mean reward 22.074047.
World Perf: Episode 34.000000. Reward 15.333333. action: 1.000000. mean reward 22.006640.
World Perf: 

World Perf: Episode 277.000000. Reward 46.666667. action: 1.000000. mean reward 26.325121.
World Perf: Episode 280.000000. Reward 52.000000. action: 1.000000. mean reward 26.547165.
World Perf: Episode 283.000000. Reward 42.333333. action: 0.000000. mean reward 26.679253.
World Perf: Episode 286.000000. Reward 53.666667. action: 0.000000. mean reward 27.034597.
World Perf: Episode 289.000000. Reward 25.333333. action: 1.000000. mean reward 27.496367.
World Perf: Episode 292.000000. Reward 49.333333. action: 1.000000. mean reward 27.518257.
World Perf: Episode 295.000000. Reward 54.666667. action: 1.000000. mean reward 27.574442.
World Perf: Episode 298.000000. Reward 27.000000. action: 1.000000. mean reward 27.402647.
World Perf: Episode 301.000000. Reward 29.333333. action: 0.000000. mean reward 27.236933.
World Perf: Episode 304.000000. Reward 30.333333. action: 0.000000. mean reward 27.060095.
World Perf: Episode 307.000000. Reward 57.666667. action: 0.000000. mean reward 27.865881.

World Perf: Episode 550.000000. Reward 42.000000. action: 1.000000. mean reward 61.328968.
World Perf: Episode 553.000000. Reward 47.000000. action: 0.000000. mean reward 60.714920.
World Perf: Episode 556.000000. Reward 57.666667. action: 1.000000. mean reward 60.166168.
World Perf: Episode 559.000000. Reward 37.666667. action: 1.000000. mean reward 59.692135.
World Perf: Episode 562.000000. Reward 38.333333. action: 1.000000. mean reward 59.007572.
World Perf: Episode 565.000000. Reward 70.333333. action: 0.000000. mean reward 59.112961.
World Perf: Episode 568.000000. Reward 87.333333. action: 0.000000. mean reward 59.443542.
World Perf: Episode 571.000000. Reward 50.333333. action: 1.000000. mean reward 59.878445.
World Perf: Episode 574.000000. Reward 39.000000. action: 1.000000. mean reward 61.954762.
World Perf: Episode 577.000000. Reward 53.000000. action: 1.000000. mean reward 61.512684.
World Perf: Episode 580.000000. Reward 29.666667. action: 1.000000. mean reward 62.910248.

World Perf: Episode 820.000000. Reward 103.333333. action: 0.000000. mean reward 75.424019.
World Perf: Episode 823.000000. Reward 36.333333. action: 0.000000. mean reward 77.336006.
World Perf: Episode 826.000000. Reward 84.333333. action: 0.000000. mean reward 80.044983.
World Perf: Episode 829.000000. Reward 56.333333. action: 0.000000. mean reward 79.350853.
World Perf: Episode 832.000000. Reward 79.666667. action: 1.000000. mean reward 78.764214.
World Perf: Episode 835.000000. Reward 77.333333. action: 0.000000. mean reward 81.713387.
World Perf: Episode 838.000000. Reward 58.333333. action: 0.000000. mean reward 80.911804.
World Perf: Episode 841.000000. Reward 91.666667. action: 0.000000. mean reward 80.378761.
World Perf: Episode 844.000000. Reward 58.000000. action: 0.000000. mean reward 81.349953.
World Perf: Episode 847.000000. Reward 116.666667. action: 0.000000. mean reward 81.286018.
World Perf: Episode 850.000000. Reward 51.000000. action: 0.000000. mean reward 80.37723

World Perf: Episode 1087.000000. Reward 152.000000. action: 0.000000. mean reward 129.280716.
World Perf: Episode 1090.000000. Reward 155.000000. action: 0.000000. mean reward 128.482651.
World Perf: Episode 1093.000000. Reward 158.000000. action: 0.000000. mean reward 127.743248.
World Perf: Episode 1096.000000. Reward 63.333333. action: 0.000000. mean reward 125.944366.
World Perf: Episode 1099.000000. Reward 151.666667. action: 1.000000. mean reward 126.999542.
World Perf: Episode 1102.000000. Reward 118.333333. action: 0.000000. mean reward 126.575714.
World Perf: Episode 1105.000000. Reward 91.333333. action: 1.000000. mean reward 125.214539.
World Perf: Episode 1108.000000. Reward 142.333333. action: 1.000000. mean reward 124.255699.
World Perf: Episode 1111.000000. Reward 170.000000. action: 0.000000. mean reward 124.230011.
World Perf: Episode 1114.000000. Reward 72.000000. action: 0.000000. mean reward 122.755379.
World Perf: Episode 1117.000000. Reward 160.333333. action: 0.0

World Perf: Episode 1351.000000. Reward 121.333333. action: 0.000000. mean reward 161.156174.
World Perf: Episode 1354.000000. Reward 200.000000. action: 0.000000. mean reward 160.188278.
World Perf: Episode 1357.000000. Reward 175.333333. action: 1.000000. mean reward 160.021240.
World Perf: Episode 1360.000000. Reward 194.666667. action: 1.000000. mean reward 161.758118.
World Perf: Episode 1363.000000. Reward 200.000000. action: 1.000000. mean reward 161.673492.
World Perf: Episode 1366.000000. Reward 196.000000. action: 0.000000. mean reward 161.571671.
World Perf: Episode 1369.000000. Reward 154.666667. action: 0.000000. mean reward 161.998718.
World Perf: Episode 1372.000000. Reward 170.000000. action: 0.000000. mean reward 163.783325.
World Perf: Episode 1375.000000. Reward 187.666667. action: 0.000000. mean reward 164.505859.
World Perf: Episode 1378.000000. Reward 200.000000. action: 0.000000. mean reward 164.343155.
World Perf: Episode 1381.000000. Reward 153.333333. action: 

World Perf: Episode 1615.000000. Reward 200.000000. action: 1.000000. mean reward 182.175217.
World Perf: Episode 1618.000000. Reward 196.666667. action: 0.000000. mean reward 181.220047.
World Perf: Episode 1621.000000. Reward 188.666667. action: 1.000000. mean reward 182.555984.
World Perf: Episode 1624.000000. Reward 198.333333. action: 0.000000. mean reward 183.134140.
World Perf: Episode 1627.000000. Reward 200.000000. action: 0.000000. mean reward 181.700012.
World Perf: Episode 1630.000000. Reward 200.000000. action: 1.000000. mean reward 181.302429.
World Perf: Episode 1633.000000. Reward 200.000000. action: 0.000000. mean reward 182.752670.
World Perf: Episode 1636.000000. Reward 200.000000. action: 1.000000. mean reward 184.260330.
World Perf: Episode 1639.000000. Reward 200.000000. action: 1.000000. mean reward 185.619034.
World Perf: Episode 1642.000000. Reward 200.000000. action: 0.000000. mean reward 186.225876.
World Perf: Episode 1645.000000. Reward 200.000000. action: 

KeyboardInterrupt: 

In [10]:
env.close()