# The Habitual DDPG Network

Habitual network

If the generative model were perfect, the selected action would always be the one that maximises the chance of observing the prior preferences. The habitual network can therefore be trained to output the maximally rewarding actions, since these are the actions that minimise free energy.

This also has a nice interpretation, provided the generative model keeps training: over time the generative model becomes less sure about old things, which could be why people eventually revisit old states they were previously certain about.

As far as an agent knows, if observations conform perfectly to its expectations then it has a perfect world model, so why would it change it? It is only when an unexpected observation comes in that the agent needs to reconsider whether or not it has the best model of the world.


I think this network should use a policy gradient method, but instead of maximising the discounted reward sequence it should minimise the discounted sequence of the extrinsic EFE/FEEF component. That way the fast and slow thinking methods should eventually converge as the world model continues to improve.
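
As a rough illustration of the quantity such a policy gradient would work with, the sketch below computes a discounted cumulative sum over a sequence of per-step costs. The name `discounted_cumulative_sum` and the cost values are hypothetical; the notebook imports `compute_discounted_cumulative_reward` from `habitual_action_network`, but its implementation is not shown here, so this is only an assumed stand-in.

```python
import numpy as np

def discounted_cumulative_sum(values, gamma=0.99):
    """Return G_t = sum_k gamma**k * values[t + k] for every time step t."""
    values = np.asarray(values, dtype=np.float64)
    returns = np.zeros_like(values)
    running = 0.0
    # Walk backwards so each step reuses the tail that was already accumulated.
    for t in range(len(values) - 1, -1, -1):
        running = values[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical per-step extrinsic costs (lower is better).
extrinsic_cost = np.array([1.0, 1.0, 0.0])
discounted_cost = discounted_cumulative_sum(extrinsic_cost, gamma=0.5)
print(discounted_cost)
```

The same function works for rewards; only the sign of the objective changes (maximise discounted reward, minimise discounted cost).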


What is this network trying to learn?
- This network is trying to learn the state action mapping that maximises the probability of being in the preferred states
- It is also trying to learn to output actions that minimise the extrinsic part of the EFE/FEEF
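
The extrinsic term is often written as the expected negative log-probability of predicted observations under the prior preferences. The sketch below evaluates that negative log-probability for a single observation, assuming the prior preference is a diagonal Gaussian. The Gaussian form, the function name `extrinsic_cost`, and the preference values are all assumptions made for illustration, not something the notes specify.

```python
import numpy as np

def extrinsic_cost(observation, preferred_mean, preferred_std):
    """Negative log-density of an observation under a diagonal Gaussian preference."""
    observation = np.asarray(observation, dtype=np.float64)
    mean = np.asarray(preferred_mean, dtype=np.float64)
    std = np.asarray(preferred_std, dtype=np.float64)
    z = (observation - mean) / std
    log_prob = -0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)
    # Sum over observation dimensions (independent Gaussians).
    return -np.sum(log_prob, axis=-1)

# Hypothetical preference over (position, velocity).
preferred_mean = np.array([0.45, 0.0])
preferred_std = np.array([0.1, 0.05])

cost_at_preference = extrinsic_cost([0.45, 0.0], preferred_mean, preferred_std)
cost_far_away = extrinsic_cost([-0.5, 0.0], preferred_mean, preferred_std)
print(cost_at_preference, cost_far_away)
```

Observations close to the preferred mean receive a lower cost, so an action that leads there is the one a cost-minimising habitual network would favour.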


What does this network take as input?
- Current state
- Maybe sequence of previous states and actions

What should this network output?
- The action that leads to the next state that maximally achieves the prior preferences

How should this network learn?
- It should learn by outputting


Okay, so new idea! DDPG seems pretty good so far. How about we have the Q function take latent states as input, using the VAE's good latent features? Then we'll have
- Q(o, a)
- p(s|o) and p(o|s)
- p(s'|s, a)
- V(o) or U(o)



In [1]:
from util import random_observation_sequence, transform_observations
import matplotlib.pyplot as plt
import gym

from habitual_action_network import HabitualAction, compute_discounted_cumulative_reward

In [2]:
import tensorflow as tf
import tensorflow_probability as tfp
import keras
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Hide GPU from visible devices
tf.config.set_visible_devices([], 'GPU')

In [10]:
env = gym.make('MountainCarContinuous-v0')

num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))

Size of State Space ->  2
Size of Action Space ->  1
Max Value of Action ->  1.0
Min Value of Action ->  -1.0


In [11]:
class Buffer:
    def __init__(self,
                 buffer_capacity=100000,
                 batch_size=64,
                 gamma=0.99,
                 observation_dim=2,
                 action_dim=1,
                 critic_optimizer="adam",
                 actor_optimizer="adam"):

        # Number of "experiences" to store at max
        self.buffer_capacity = buffer_capacity
        # Num of tuples to train on.
        self.batch_size = batch_size

        # It tells us the number of times record() was called.
        self.buffer_counter = 0

        # Instead of a list of tuples, as in the usual experience replay formulation,
        # we use a separate np.array for each tuple element
        self.state_buffer = np.zeros((self.buffer_capacity, observation_dim))
        self.action_buffer = np.zeros((self.buffer_capacity, action_dim))
        self.reward_buffer = np.zeros((self.buffer_capacity, 1))
        self.next_state_buffer = np.zeros((self.buffer_capacity, observation_dim))

        self.gamma = gamma

        self.critic_optimizer = critic_optimizer
        self.actor_optimizer = actor_optimizer

    # Takes an (s, a, r, s') observation tuple as input
    def record(self, obs_tuple):
        # Set index to zero if buffer_capacity is exceeded,
        # replacing old records
        index = self.buffer_counter % self.buffer_capacity

        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]

        self.buffer_counter += 1

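    # Clears the buffer. The arrays are zeroed rather than replaced with empty
    # lists, so that record() and learn() keep working after a clear.
    def clear(self):
        self.state_buffer = np.zeros_like(self.state_buffer)
        self.action_buffer = np.zeros_like(self.action_buffer)
        self.reward_buffer = np.zeros_like(self.reward_buffer)
        self.next_state_buffer = np.zeros_like(self.next_state_buffer)
        self.buffer_counter = 0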

    # Eager execution is turned on by default in TensorFlow 2. Decorating with tf.function allows
    # TensorFlow to build a static graph out of the logic and computations in our function.
    # This provides a large speed up for blocks of code that contain many small TensorFlow operations such as this one.
    @tf.function
    def update(
            self, state_batch, action_batch, reward_batch, next_state_batch,
    ):
        # Training and updating Actor & Critic networks.
        # See Pseudo Code.
        with tf.GradientTape() as tape:
            target_actions = target_actor(next_state_batch, training=True)
            y = reward_batch + self.gamma * target_critic(
                [next_state_batch, target_actions], training=True
            )
            critic_value = critic_model([state_batch, action_batch], training=True)
            critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))

        critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
        self.critic_optimizer.apply_gradients(
            zip(critic_grad, critic_model.trainable_variables)
        )

        with tf.GradientTape() as tape:
            actions = actor_model(state_batch, training=True)
            critic_value = critic_model([state_batch, actions], training=True)
            # Used `-value` as we want to maximize the value given
            # by the critic for our actions
            actor_loss = -tf.math.reduce_mean(critic_value)

        actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
        self.actor_optimizer.apply_gradients(
            zip(actor_grad, actor_model.trainable_variables)
        )

    # We compute the loss and update parameters
    def learn(self):
        # Get sampling range
        record_range = min(self.buffer_counter, self.buffer_capacity)
        # Randomly sample indices
        batch_indices = np.random.choice(record_range, self.batch_size)

        # Convert to tensors
        state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
        action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
        reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
        reward_batch = tf.cast(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])

        self.update(state_batch, action_batch, reward_batch, next_state_batch)
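
The `update` method above bootstraps from the target critic on every transition, because the buffer stores only `(s, a, r, s')`. A common variant also stores a terminal flag and stops bootstrapping when an episode ends. The sketch below shows that target computation in plain NumPy; the `done` flag and the name `td_target` are assumptions and are not part of the `Buffer` class in this notebook.

```python
import numpy as np

def td_target(reward, next_q, done, gamma=0.99):
    """Compute y = r + gamma * (1 - done) * Q_target(s', mu_target(s'))."""
    reward = np.asarray(reward, dtype=np.float64)
    next_q = np.asarray(next_q, dtype=np.float64)
    done = np.asarray(done, dtype=np.float64)
    # Terminal transitions (done == 1) keep only the immediate reward.
    return reward + gamma * (1.0 - done) * next_q

# Hypothetical batch of two transitions; the second one ends the episode.
rewards = np.array([[-0.1], [100.0]])
next_q_values = np.array([[50.0], [50.0]])
done_flags = np.array([[0.0], [1.0]])

targets = td_target(rewards, next_q_values, done_flags, gamma=0.99)
print(targets)
```

Using this in the notebook would need an extra `done` array in the buffer and a matching argument to `update`.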


In [12]:
def get_actor(observation_dim, action_max):
    # Initialize weights between -3e-3 and 3e-3
    last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)

    inputs = layers.Input(shape=(observation_dim,))
    out = layers.Dense(16, activation="relu")(inputs)
    out = layers.Dense(32, activation="relu")(out)
    out = layers.Dense(16, activation="relu")(out)
    outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)

    # Scale the tanh output to the action range (1.0 for MountainCarContinuous-v0).
    outputs = outputs * action_max
    model = tf.keras.Model(inputs, outputs)
    return model


def get_critic(observation_dim, action_dim):
    # State as input
    state_input = layers.Input(shape=(observation_dim,))
    state_out = layers.Dense(16, activation="relu")(state_input)
    state_out = layers.Dense(32, activation="relu")(state_out)

    # Action as input
    action_input = layers.Input(shape=(action_dim,))
    action_out = layers.Dense(32, activation="relu")(action_input)

    # Both are passed through separate layers before concatenating
    concat = layers.Concatenate()([state_out, action_out])

    # was 256
    out = layers.Dense(128, activation="relu")(concat)
    out = layers.Dense(128, activation="relu")(out)
    outputs = layers.Dense(1)(out)

    # Outputs a single value for a given state-action pair
    model = tf.keras.Model([state_input, action_input], outputs)

    return model

In [13]:
class BasicDDPG:

    def __init__(self, actor, critic, target_actor, target_critic, buffer, tau):

        self.actor_model = actor
        self.critic_model = critic

        self.target_actor = target_actor
        self.target_critic = target_critic

        self.buffer = buffer
        self.tau = tau

    def update_actor_target(self):
        update_target(self.target_actor.variables, self.actor_model.variables, self.tau)

    def update_critic_target(self):
        update_target(self.target_critic.variables, self.critic_model.variables, self.tau)


# This updates the target parameters slowly,
# based on the rate `tau`, which is much less than one.
@tf.function
def update_target(target_weights, weights, tau):
    for (a, b) in zip(target_weights, weights):
        a.assign(b * tau + a * (1 - tau))
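
To see how slowly the targets track the online networks, the sketch below applies the same soft (Polyak) update rule to plain NumPy arrays. The function name `soft_update` and the toy arrays are hypothetical; only the update rule and `tau = 0.005` come from this notebook.

```python
import numpy as np

def soft_update(target, source, tau):
    """Move each target array a fraction tau of the way towards its source, in place."""
    for t, s in zip(target, source):
        t[...] = tau * s + (1.0 - tau) * t

# Toy "weights": the online network sits at 1, the target starts at 0.
source = [np.ones(3)]
target = [np.zeros(3)]

for _ in range(1000):
    soft_update(target, source, tau=0.005)

# The remaining gap shrinks geometrically: gap_n = (1 - tau) ** n.
remaining_gap = 1.0 - target[0]
print(remaining_gap)
```

With a fixed source, the gap after `n` updates is `(1 - tau) ** n`, so a small `tau` means the target needs many updates to catch up.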

In [14]:
actor_model = get_actor(2, 1)
critic_model = get_critic(2, 1)

target_actor = get_actor(2, 1)
target_critic = get_critic(2, 1)

# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.0001
actor_lr = 0.00005

critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)

total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005

buffer = Buffer(50000, 64, 0.99, 2, 1, critic_optimizer, actor_optimizer)

ddpg = BasicDDPG(actor_model, critic_model, target_actor, target_critic, buffer, tau)

In [15]:
t_max = 1000
num_episodes = 50

min_reward_cutoff = -1000
min_reward_set = -0.5

reward_increase = 0

observation_max = np.array([0.6, 0.07])
observation_min = np.array([-1.2, -0.07])

for i in range(num_episodes):

    all_observations = []
    actions = []
    rewards = []

    total_reward = 0

    o, a, r = random_observation_sequence(env, t_max, epsilon=0.2)
    o = transform_observations(o, observation_max, observation_min, [0, 0])


    # Use a separate index so the inner loop does not shadow the episode counter `i`
    for step in range(len(a)):

        prev_state = o[step]
        state = o[step+1]
        action = a[step]
        reward = r[step] + reward_increase

        # if reward < 0:
        #     print("yes")

        total_reward += reward

        ddpg.buffer.record((prev_state, action, reward, state))
        # episodic_reward += reward
        #
        ddpg.buffer.learn()
        ddpg.update_actor_target()
        ddpg.update_critic_target()

        # buffer.record((prev_state, action, reward, state))
        # # episodic_reward += reward
        # #
        # buffer.learn()
        # update_target(target_actor.variables, actor_model.variables, tau)
        # update_target(target_critic.variables, critic_model.variables, tau)

    print(total_reward)

    acts = ddpg.actor_model((np.random.random(size=(10, 2))*2 - 1))
    print(np.max(acts), np.min(acts))



-32.963745824437254
0.0035701806 -0.03100249
86.68404216376675
-0.005363233 -0.043388568
92.1443371592248
0.061502587 0.02080527
-38.34286132328569
0.43083033 0.2759526
86.42506779463984
0.42712528 0.19821759
65.09329994858183
0.48775825 0.32344788
70.08626133059292
0.816972 0.50715816
81.57467205559598
0.85989547 0.41579473
73.3166470604361
0.86640894 0.29630744
69.98795232498819
0.9851574 0.121006206
-32.456244955254434
0.9982692 -0.13641658
-30.08506976348149
0.9999931 0.1497526
-27.42503418034901
0.99999976 -0.120499395
-35.41489023422968
0.99999976 -0.07794662
70.86375380387494
0.99999976 -0.14111122
94.52054392650672
0.99999976 -0.07773594
81.6220225772017
1.0 0.64832085
76.1405439089829
1.0 -0.0047395765
-34.11562576232058
0.99999976 -0.13441344
-30.5486287114235
0.99999976 -0.18020196
-33.23004983689618
1.0 -0.20430213
-29.15379126666621
1.0 -0.31424105
-36.87250747700676
0.99999976 -0.45255142
82.3017664596413
1.0 -0.77813107
91.09502325132146
1.0 -0.7645173
-34.95882308941864

In [10]:
ddpg.critic_model([(np.random.random(size=(10, 2))*2 - 1), (np.random.random(size=(10, 1))*2 - 1)])

<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[5811.5957 ],
       [2591.8093 ],
       [1217.9972 ],
       [2252.3525 ],
       [2650.7405 ],
       [1022.43146],
       [2465.303  ],
       [2555.3264 ],
       [2447.9504 ],
       [ 935.72943]], dtype=float32)>

## Can it solve the environment?

In [11]:
ddpg.actor_model((np.random.random(size=(10, 2))*2 - 1))

<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[ 0.99999976],
       [ 0.99999976],
       [ 0.99999976],
       [-0.9993669 ],
       [ 0.99999976],
       [ 0.99999976],
       [ 0.99999976],
       [ 0.99999976],
       [-0.99695086],
       [ 1.        ]], dtype=float32)>

In [12]:
num_episodes = 10

observation_max = np.array([0.6, 0.07])
observation_min = np.array([-1.2, -0.07])

# obs_stddev = [0.05, 0.05]
obs_stddev = [0, 0]


t_max = 1000

for i in range(num_episodes):

    env = gym.make('MountainCarContinuous-v0')

    obs = env.reset()

    done = False
    rewards = []

    t = 0
    while not done:

        env.render()

        obs = obs.reshape(1, obs.shape[0])
        obs = transform_observations(obs, observation_max, observation_min, obs_stddev)

        # print(obs)

        # action = act_net(obs) * 10
        # action = np.clip(action.numpy(), -1, 1)

        action = actor_model(obs)
        action = action.numpy()

        obs, reward, done, info = env.step(action)

        # print(obs)

        rewards.append(reward)

        t += 1

        if t == t_max:
            done = True

    print(t)
    if t < t_max:
        print("success")
    else:
        print("Failure")
        print("max obs", obs)

    print(np.sum(rewards))
    # print(rewards)




env.close()

68
success
93.36774634396815
68
success
93.3693611066235
68
success
93.36889756146266
69
success
93.29928721775096
68
success
93.39075685396485
68
success
93.38905150359172
68
success
93.37940230180583
68
success
93.36928243179368
68
success
93.39846199351936
69
success
93.30938248436043


In [10]:
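n = 50
both = [[i/n, j/n] for i in range(-1*n, n) for j in range(-1*n, n)]
both = np.array(both)

# `act_net` is not defined in this session (the original cell raised a NameError);
# the trained DDPG actor is assumed to be the network intended here.
both_acts = ddpg.actor_model(both)

both_acts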

In [None]:
n = 50
coords = [[i/n, j/n] for i in range(-1*n, n) for j in range(-1*n, n)]
coords = np.array(coords)
coords


In [None]:
x = np.arange(-5, 5.1, 0.5)
y = np.arange(-5, 5.1, 0.5)
X,Y = np.meshgrid(x,y)

X
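
The grid cells above look like the start of a policy visualisation. A minimal sketch of how a meshgrid can be flattened into a batch of two-dimensional inputs and the outputs reshaped back onto the grid is given below. `evaluate_on_grid` and `placeholder_policy` are hypothetical names, and the placeholder policy is an arbitrary stand-in for a trained actor.

```python
import numpy as np

def evaluate_on_grid(policy, x, y):
    """Evaluate a policy mapping (N, 2) inputs to (N, 1) actions on a 2-D grid."""
    X, Y = np.meshgrid(x, y)
    # Flatten the grid into a batch of (position, velocity) points.
    points = np.stack([X.ravel(), Y.ravel()], axis=1)
    actions = np.asarray(policy(points)).reshape(X.shape)
    return X, Y, actions

def placeholder_policy(points):
    # Hypothetical stand-in for a trained actor: push in the direction of the velocity.
    return np.tanh(10.0 * points[:, 1:2])

x = np.linspace(-1.0, 1.0, 5)
y = np.linspace(-1.0, 1.0, 3)
X, Y, A = evaluate_on_grid(placeholder_policy, x, y)
print(A.shape)
```

With the trained network, `policy` could be something like `lambda p: ddpg.actor_model(p).numpy()`, and the result could be shown with `plt.pcolormesh(X, Y, A)`.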