# The Habitual DDPG Network

Habitual network

Assuming the generative model is perfect, the selected action would always be the one that maximises the chance of observing the prior preferences. Hence the habitual network can be trained to output maximally rewarding actions, since these are the free-energy-minimising actions.

This also has a nice interpretation, provided the generative model keeps training: eventually the model becomes less sure about things it has not seen for a while. That may be why people eventually revisit old states they were previously certain about.

As far as an agent knows, if observations conform perfectly to its expectations then it has a perfect world model. So why would it change it? It's only when an unexpected observation comes in that the agent needs to reconsider whether or not it has the best model of the world.


I think this network should perform a policy gradient method, but instead of maximising the discounted reward sequence it should minimise the discounted sequence of the extrinsic EFE/FEEF component. That way, in the end, the fast and slow thinking methods should converge as the world model continues to improve.
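To make the "discounted extrinsic sequence" concrete, here is a minimal sketch. It assumes the prior preference is a diagonal Gaussian over observations, so the extrinsic term at each step is the negative log-density of the observation under that preference. The function names and the Gaussian form are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def extrinsic_cost(observations, preferred_mean, preferred_std):
    """Negative log-density of each observation under an assumed
    diagonal Gaussian prior preference (lower cost = more preferred)."""
    obs = np.asarray(observations, dtype=float)
    mean = np.asarray(preferred_mean, dtype=float)
    std = np.asarray(preferred_std, dtype=float)
    z = (obs - mean) / std
    log_prob = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi * std**2), axis=-1)
    return -log_prob

def discounted_cost_to_go(costs, gamma=0.99):
    """G_t = c_t + gamma * G_{t+1}, accumulated backwards over one episode."""
    costs = np.asarray(costs, dtype=float)
    out = np.zeros_like(costs)
    running = 0.0
    for t in range(len(costs) - 1, -1, -1):
        running = costs[t] + gamma * running
        out[t] = running
    return out
```

A policy gradient method would then minimise `discounted_cost_to_go(extrinsic_cost(...))` wherever it would normally maximise the discounted reward.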


What is this network trying to learn?
- This network is trying to learn the state-action mapping that maximises the probability of being in the preferred states
- It is also trying to learn to output actions that minimise the extrinsic part of the EFE/FEEF


What does this network take as input?
- Current state
- Maybe sequence of previous states and actions

What should this network output?
- The action that leads to the next state that maximally achieves the prior preferences

How should this network learn?
- It should learn by outputting


Okay, so new idea! DDPG seems pretty good so far. How about we have the Q function take latent states as input, reusing the VAE's good latent features? Then we'll have:
- Q(o, a)
- p(s|o) and p(o|s)
- p(s'|s, a)
- V(o) or U(o)
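A rough sketch of how the critic could sit on top of the VAE latents. Everything here is a stand-in: the "encoder" is a single random linear map playing the role of the mean of p(s|o), the critic is a tiny random MLP, and the sizes are made up. It only shows the wiring, where observations are encoded first and the critic scores the latent state together with the action.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM, HIDDEN = 3, 2, 1, 16  # illustrative sizes

# Hypothetical stand-in for the mean of the VAE encoder p(s|o).
W_enc = rng.normal(scale=0.1, size=(OBS_DIM, LATENT_DIM))

def encode_mean(o):
    """Map a batch of observations to latent states."""
    return np.asarray(o, dtype=float) @ W_enc

# Hypothetical critic weights acting on the concatenated [latent state, action].
W1 = rng.normal(scale=0.1, size=(LATENT_DIM + ACTION_DIM, HIDDEN))
w2 = rng.normal(scale=0.1, size=(HIDDEN, 1))

def q_latent(o, a):
    """Q value computed from the latent encoding of o rather than from o itself."""
    s = encode_mean(o)
    x = np.concatenate([s, np.asarray(a, dtype=float)], axis=-1)
    return np.tanh(x @ W1) @ w2

batch_o = rng.uniform(-1, 1, size=(10, OBS_DIM))
batch_a = rng.uniform(-1, 1, size=(10, ACTION_DIM))
q_values = q_latent(batch_o, batch_a)  # one value per (o, a) pair
```

Whether the encoder should be frozen or trained jointly with the critic is left open here.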



In [24]:
from util import random_observation_sequence, transform_observations, test_policy
import matplotlib.pyplot as plt
import gym

from ddpg import *

In [2]:
import tensorflow as tf
import tensorflow_probability as tfp
import keras
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Hide GPU from visible devices
tf.config.set_visible_devices([], 'GPU')

In [11]:
env = gym.make("Pendulum-v1")

num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))

Size of State Space ->  3
Size of Action Space ->  1
Max Value of Action ->  2.0
Min Value of Action ->  -2.0


In [17]:
actor_model = get_actor(3, 1)
critic_model = get_critic(3, 1)

target_actor = get_actor(3, 1)
target_critic = get_critic(3, 1)

# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.0001
actor_lr = 0.00005

critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)

total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005

# buffer = Buffer(50000, 64, 0.99, 2, 1, critic_optimizer, actor_optimizer)

ddpg = BasicDDPG(actor_model, critic_model, target_actor, target_critic, tau, observation_dim=3,
                 action_dim=1, critic_optimizer=critic_optimizer, actor_optimizer=actor_optimizer)
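For reference, a minimal sketch of the two places where `gamma` and `tau` normally enter DDPG: the critic's TD target and the soft (Polyak) update of the target networks. The internals of `BasicDDPG` are not shown in this notebook, so treating it as doing exactly this is an assumption; the helper names below are illustrative.

```python
import numpy as np

def td_target(reward, next_q, gamma=0.99):
    """y = r + gamma * Q'(s', mu'(s')), where next_q comes from the target networks.
    No terminal mask is applied: Pendulum-v1 episodes end only by time limit."""
    return np.asarray(reward, dtype=float) + gamma * np.asarray(next_q, dtype=float)

def soft_update(target_weights, online_weights, tau=0.005):
    """Polyak averaging per weight array: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]
```

With Keras models this could be applied as `target_actor.set_weights(soft_update(target_actor.get_weights(), actor_model.get_weights(), tau))`, and likewise for the critic.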

In [19]:
t_max = 1000
num_episodes = 50

min_reward_cutoff = -1000
min_reward_set = -0.5

reward_increase = 0

# observation_max = np.array([0.6, 0.07])
# observation_min = np.array([-1.2, -0.07])

for episode in range(num_episodes):

    all_observations = []
    actions = []
    rewards = []

    total_reward = 0

    o, a, r = random_observation_sequence(env, t_max, epsilon=0.2)
    # o = transform_observations(o, observation_max, observation_min, [0, 0, 0])
    # o = transform_observations(o, observation_max, observation_min, [0.05, 0.05])
    # print(o)
    # Replay the collected sequence as (s_t, a_t, r_t, s_{t+1}) transitions.
    for t in range(len(a)):

        prev_state = o[t]
        state = o[t+1]
        action = a[t]
        reward = r[t] + reward_increase

        # if reward < 0:
        #     print("yes")

        total_reward += reward

        # ddpg.buffer.record((prev_state, action, reward, state))
        # # episodic_reward += reward
        #
        # ddpg.buffer.learn()

        ddpg.record((prev_state, action, reward, state))
        # episodic_reward += reward

        ddpg.train([], [], [], [])

    print(total_reward)

    acts = ddpg.actor_model((np.random.random(size=(10, 3))*2 - 1))
    print(np.max(acts), np.min(acts))


-1495.2494026596196
-0.0688459 -0.39660046
-1256.8717758923344
-0.10642697 -0.58487535
-1214.5810506820167
-0.11996522 -0.30506968
-1401.7781165901954
-0.12607127 -0.7675984
-1631.6591951888495
-0.14300813 -0.66963863
-1069.423748113556
-0.21910842 -0.6791527
-1144.2693074016706
-0.1231804 -0.7378944
-1341.6635926206357
-0.29725012 -0.5785227
-1495.441860676611
-0.11625984 -0.7771332
-1109.6947476265611
-0.16452728 -0.7330205
-1537.1589163641968
-0.3928673 -0.696255
-1116.0084564900485
-0.1893647 -0.758295
-1333.7874024890866
-0.11314931 -0.7425609
-1069.3904607859513
-0.2038534 -0.8739721
-984.5857292149116
-0.23613612 -0.7819538
-1300.107918606033
-0.24947184 -0.9355565
-1265.5911715447724
-0.51112354 -0.9468732
-1452.2163229427795
-0.1797495 -0.94677
-1152.9031632642486
-0.33397523 -0.9612894
-1175.3469852472538
-0.20201847 -0.9246071
-1135.0321511129368
-0.21840541 -0.9738228
-1474.2933981944057
-0.35024738 -0.9621876
-934.1147352544053
-0.57081854 -0.9781603
-1120.8764051019275
-0
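The loop above only calls `ddpg.record(...)` and `ddpg.train(...)`, so the storage behind it is not visible here. As a point of comparison, a minimal replay buffer sketch that accepts the same `(prev_state, action, reward, state)` tuples could look like the following. The class and its fields are hypothetical and not taken from `ddpg.py`.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size ring buffer of transitions with uniform random sampling."""

    def __init__(self, capacity, obs_dim, action_dim, seed=0):
        self.capacity = capacity
        self.count = 0  # total number of transitions recorded so far
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rew = np.zeros((capacity, 1), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.rng = np.random.default_rng(seed)

    def record(self, transition):
        prev_state, action, reward, state = transition
        i = self.count % self.capacity  # overwrite the oldest slot once full
        self.obs[i] = prev_state
        self.act[i] = action
        self.rew[i] = reward
        self.next_obs[i] = state
        self.count += 1

    def sample(self, batch_size):
        size = min(self.count, self.capacity)  # only sample filled slots
        idx = self.rng.integers(0, size, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx]
```

The commented-out `Buffer(50000, 64, ...)` call earlier suggests a capacity of 50000 and a batch size of 64 were used at some point, which would correspond to `ReplayBuffer(50000, 3, 1)` and `sample(64)` in this sketch.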

In [26]:
ddpg.critic_model([(np.random.random(size=(10, 3))*2 - 1), (np.random.random(size=(10, 1))*2 - 1)])

<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[-189.68011],
       [-270.22662],
       [-196.07523],
       [-215.2425 ],
       [-279.59113],
       [-172.63388],
       [-197.31567],
       [-207.98175],
       [-267.731  ],
       [-236.1659 ]], dtype=float32)>

In [27]:
ddpg.actor_model((np.random.random(size=(10, 3))*2 - 1))

<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[-0.9985531 ],
       [-0.499748  ],
       [-0.63756365],
       [ 0.7546994 ],
       [-0.99078894],
       [-0.9978237 ],
       [ 0.5274505 ],
       [ 0.31478363],
       [ 0.61984825],
       [ 0.89840513]], dtype=float32)>

In [28]:
obs_pos = np.vstack([np.linspace(-1, 1, 100), np.zeros(100)]).T
actions_pred = ddpg.actor_model(obs_pos)

actions_pred


plt.plot(obs_pos[:, 0], actions_pred)

ValueError: Input 0 of layer "model_16" is incompatible with the layer: expected shape=(None, 3), found shape=(100, 2)

In [23]:
vel_pos = np.vstack([np.zeros(100), np.linspace(-1, 1, 100)]).T
actions_pred = ddpg.actor_model(vel_pos)
print(actions_pred)

plt.plot(vel_pos[:, 1], actions_pred)

ValueError: Input 0 of layer "model_16" is incompatible with the layer: expected shape=(None, 3), found shape=(100, 2)
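Both shape errors come from feeding two-column sweeps to an actor built for three inputs. Pendulum-v1 observations are (cos θ, sin θ, θ̇), with θ̇ bounded to [-8, 8]. A sketch of equivalent sweeps in that observation format follows. The helper name and the sweep ranges are my own choices.

```python
import numpy as np

def pendulum_obs(theta, theta_dot):
    """Build Pendulum-v1 style observations (cos θ, sin θ, θ̇) from angle and angular velocity."""
    theta = np.asarray(theta, dtype=float)
    theta_dot = np.asarray(theta_dot, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta), theta_dot], axis=-1).astype(np.float32)

thetas = np.linspace(-np.pi, np.pi, 100)
speeds = np.linspace(-8.0, 8.0, 100)

# Sweep the angle at zero angular velocity, then the velocity at θ = 0 (upright).
angle_sweep = pendulum_obs(thetas, np.zeros(100))
speed_sweep = pendulum_obs(np.zeros(100), speeds)
```

These have shape `(100, 3)`, so something like `plt.plot(thetas, ddpg.actor_model(angle_sweep).numpy())` should run against the actor defined earlier.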

## Can it solve the environment?

In [29]:
test_policy(env, ddpg.actor_model, None, None, None, 5, 1, show_env=True)

   reward       timesteps  num_actions
0  -911.249023  200        200
1  -982.500122  200        200
2  -879.075562  200        200
3  -991.384033  200        200
4  -901.615967  200        200


In [None]:
test_policy(env, ddpg.actor_model, None, None, None, 5, 1)

In [10]:
n = 50
both = [[i/n, j/n] for i in range(-1*n, n) for j in range(-1*n, n)]
both = np.array(both)
both

both_acts = act_net(both)

both_acts

NameError: name 'act_net' is not defined

In [None]:
n = 50
coords = [[i/n, j/n] for i in range(-1*n, n) for j in range(-1*n, n)]
coords = np.array(coords)
coords


In [None]:
x = np.arange(-5, 5.1, 0.5)
y = np.arange(-5, 5.1, 0.5)
X,Y = np.meshgrid(x,y)

X
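A small sketch of where the grid cells above seem to be heading: flatten the meshgrid into a list of coordinates, evaluate some function on every point, and reshape the result back to the grid for plotting. The `evaluate_on_grid` helper and the `tanh` stand-in are placeholders. For the Pendulum actor the two grid axes would first have to be mapped to three-dimensional observations.

```python
import numpy as np

x = np.arange(-5, 5.1, 0.5)
y = np.arange(-5, 5.1, 0.5)
X, Y = np.meshgrid(x, y)

def evaluate_on_grid(fn, X, Y):
    """Apply fn to every (x, y) grid point and return values shaped like the grid."""
    coords = np.column_stack([X.ravel(), Y.ravel()])  # shape (n_points, 2)
    return np.asarray(fn(coords)).reshape(X.shape)

# Placeholder function standing in for a policy or value network.
Z = evaluate_on_grid(lambda c: np.tanh(c[:, 0] - c[:, 1]), X, Y)
```

`plt.pcolormesh(X, Y, Z)` or `plt.contourf(X, Y, Z)` would then show the function over the whole grid.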