# Advantage Actor-Critic (A2C) using Keras

I was curious whether I could use Keras loss functions to implement an advantage actor-critic (A2C) reinforcement learning algorithm. Most examples I could find use PyTorch and calculate the gradients explicitly for each episode. Keras doesn't really work like that: instead of calculating the gradient yourself, you provide a loss function and Keras calculates the gradients automatically. I used "Asynchronous Methods for Deep Reinforcement Learning" (Mnih et al., 2016) as a basis, and the authors included both the gradient and loss function formulations.
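As a rough sketch of how the two formulations relate (this is my own notation, following the code later in this notebook rather than quoting the paper): with $R_t$ the discounted reward-to-go, $V(s_t)$ the critic's estimate, and $\pi_\theta$ the actor's policy, the advantage and the per-sample actor loss can be written as

$$
A_t = R_t - V(s_t), \qquad L_{\text{actor}}(\theta) = -\log \pi_\theta(a_t \mid s_t)\, A_t
$$

If $A_t$ is treated as a constant, the gradient of this loss is $-A_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so minimizing the loss moves $\theta$ along the policy-gradient direction. That is why handing Keras the loss is enough.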

I used gymnasium's FrozenLake environment as a super simple way to learn. I went with the 4x4 map. I tried the 8x8 map, but the chances of random exploration stumbling on the goal were so slim that the algorithm wasn't able to find a good path. I'll have to keep working on that.

The map (from https://gymnasium.farama.org/environments/toy_text/frozen_lake/) is:

    "4x4":[
        "SFFF",
        "FHFH",
        "FFFH",
        "HFFG"
        ]

Where "S" is the start, "F" is frozen, "H" is a hole, and "G" is the goal.
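The environment reports each state as a single integer, numbered row by row from the top-left corner. A small sketch of how those indices map back onto the grid may help when reading the outputs further down. The `state_to_cell` helper is my own, for illustration only, and is not part of gymnasium:

```python
# The 4x4 map from above, one string per row.
MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]


def state_to_cell(state: int, grid: list = MAP_4X4) -> tuple:
    """Return (row, col, tile) for a flattened state index (hypothetical helper)."""
    ncols = len(grid[0])
    row, col = divmod(state, ncols)  # states are numbered row * ncols + col
    return row, col, grid[row][col]


# Collect the state indices that are holes on the 4x4 map.
holes = [s for s in range(16) if state_to_cell(s)[2] == "H"]
print(holes)
```

On this map the holes are states 5, 7, 11, and 12, and the goal is state 15.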

In [6]:
import keras
import numpy as np
import gymnasium as gym

GAMMA = 0.99  # Reward-to-go discount
LEARNING_RATE = 0.01
EPSILON = 0.3  # Using eps-greedy exploration: this is the probability of choosing an action randomly rather than using the policy function

def create_critic_model(env: gym.Env) -> keras.Model:
    """Create a critic model: this approximates the state-value function V(s)"""
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(1,)))
    model.add(keras.layers.IntegerLookup(vocabulary=list(range(env.observation_space.n)), output_mode='multi_hot'))
    model.add(keras.layers.Dense(16, activation='gelu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.LayerNormalization())
    model.add(keras.layers.Dense(16, activation='gelu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.LayerNormalization())
    model.add(keras.layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(LEARNING_RATE), loss='mse', run_eagerly=True)
    return model


def play_rollout(env: gym.Env, actor: keras.Model | None=None, max_steps: int | None=None) -> list:
    """Run one rollout in the environment and return a list of [state, action, next_state, discounted_reward] values"""
    history = []
    finished = False
    state, _ = env.reset()
    while not finished:
        # Stop once the rollout has reached max_steps steps
        if max_steps is not None and len(history) >= max_steps:
            break
        if actor is None:
            action = env.action_space.sample()
        else:
            if np.random.random() < EPSILON:
                action = env.action_space.sample()
            else:
                action_probs = actor(np.array([state]))
                action = keras.random.categorical(keras.ops.log(action_probs), num_samples=1)[0,0].numpy()
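        # env.step returns (obs, reward, terminated, truncated, info); stop on either flag
        next_state, reward, terminated, truncated, _ = env.step(action)
        finished = bool(terminated or truncated)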
        history.append([state, action, next_state, reward])
        state = next_state
    discounted_reward = 0.0
    for step in reversed(history):
        # Calculate discounted rewards
        # Add a small negative reward for each step to encourage quickly getting to the goal
        discounted_reward = step[3] + GAMMA*discounted_reward - 0.001
        step[3] = discounted_reward
    return history
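
To make the reward-to-go loop at the end of `play_rollout` concrete, here is a standalone sketch of the same recurrence on a three-step rollout that only reaches the goal on its last step. The `reward_to_go` helper and its parameter names are mine, for illustration only:

```python
def reward_to_go(rewards, gamma=0.99, step_penalty=0.001):
    """Discounted return for each step, computed back to front as in play_rollout."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        # Same recurrence as the notebook: reward + discounted future return - step penalty
        running = r + gamma * running - step_penalty
        returns.append(running)
    returns.reverse()  # put the returns back in time order
    return returns


# Two empty steps followed by the goal reward of 1.0.
print(reward_to_go([0.0, 0.0, 1.0]))
```

The last step's return is the goal reward minus the step penalty (0.999), and each earlier step gets a slightly smaller value, which is what nudges the agent toward shorter paths.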

In [7]:
def create_actor_model(env: gym.Env) -> keras.Model:
    """
    Create an actor model
    This is the policy model: given a state, it returns a probability distribution of actions
    """
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(1,)))
    model.add(keras.layers.IntegerLookup(vocabulary=list(range(env.observation_space.n)), output_mode='multi_hot'))
    model.add(keras.layers.Dense(16, activation='relu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.LayerNormalization())
    model.add(keras.layers.Dense(16, activation='relu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.LayerNormalization())
    model.add(keras.layers.Dense(env.action_space.n, activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='sparse_categorical_crossentropy', run_eagerly=True)
    # Warm-up fit on uniformly random (state, action) pairs to even out the model's initial predictions
    x = np.random.randint(0, env.observation_space.n, size=(1000, 1))
    y = np.random.randint(0, env.action_space.n, size=(1000, 1))
    model.fit(x, y, verbose=0)
    return model


def actor_loss_fn(actions_and_advantages: keras.KerasTensor, y_pred: keras.KerasTensor) -> keras.KerasTensor:
    """
    This is the actor loss function. It calculates log(action_probabilities) * Advantages for each sample.
    All of the operations in this function must be implemented by Keras: this is how Keras automatically
    calculates gradients.
    """
    # Takes actions and advantages as the "y_true" input. Separate those out.
    actions = keras.ops.cast(actions_and_advantages[:, 0], dtype=np.int32)
    advantages = actions_and_advantages[:, 1:2]
    # Calculate the losses
    losses = -keras.ops.log(y_pred + 1e-9) * advantages
    # Mask the losses: we only want to return losses for the actions actually taken, not all actions.
    # Building the mask with keras.ops.one_hot keeps it a backend tensor instead of a NumPy array.
    mask = keras.ops.one_hot(actions, y_pred.shape[-1], dtype=losses.dtype)
    losses = keras.ops.multiply(losses, mask)
    # print(f'{actions=}\n{advantages=}\n{y_pred=}\n{losses=}')
    return losses


def run_episode(env: gym.Env,
                actor: keras.Model,
                critic: keras.Model,
                games_per_episode: int,
                max_steps: int | None=None) -> None:
    """
    Run one episode:
    An episode is a given number of game rollouts, followed by updating the actor and critic models
    """
    history = []
    for _ in range(games_per_episode):
        history += play_rollout(env, actor, max_steps)
    history = np.array(history)
    states = history[:, 0:1]
    actions = history[:, 1:2]
    rewards = history[:, 3:4]
    critic.fit(states, rewards, verbose=0)
    advantages = rewards - critic(states)
    actions_and_advantages = np.concatenate((actions, advantages), axis=1)
    actor.fit(states, actions_and_advantages, verbose=0)

In [8]:
GAMES_PER_EPISODE = 100
NUM_EPISODES = 30
MAX_STEPS = 32

env = gym.make('FrozenLake-v1', is_slippery=False, map_name='4x4')
critic = create_critic_model(env)
actor = create_actor_model(env)
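# Recompile the actor with the A2C loss; create_actor_model compiled it with cross-entropy for its initial fit.
# run_eagerly=True is for macOS: model(inputs) and model.predict(inputs) seem to return different values
# if the model is compiled as a network vs. run eagerly.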
actor.compile(optimizer=keras.optimizers.Adam(LEARNING_RATE), loss=actor_loss_fn, run_eagerly=True)

for ep_idx in range(NUM_EPISODES):
    print(f'Episode {ep_idx}')
    run_episode(env, actor, critic, GAMES_PER_EPISODE, MAX_STEPS)
    # Print actor and critic updates as we go in each episode
    print(f'{critic(np.arange(env.observation_space.n))=}, {actor(np.arange(env.observation_space.n))=}')

Episode 0
critic(np.arange(env.observation_space.n))=<tf.Tensor: shape=(16, 1), dtype=float32, numpy=
array([[ 0.07872982],
       [-0.07441281],
       [ 0.03079429],
       [ 0.07078435],
       [ 0.14678158],
       [ 0.13674854],
       [-0.08588161],
       [ 0.15105058],
       [-0.05529999],
       [ 0.25359997],
       [-0.34482044],
       [-0.54141784],
       [-0.26618367],
       [ 0.10897417],
       [ 0.0024041 ],
       [-0.19261594]], dtype=float32)>, actor(np.arange(env.observation_space.n))=<tf.Tensor: shape=(16, 4), dtype=float32, numpy=
array([[0.00865719, 0.96341723, 0.02660933, 0.00131625],
       [0.22889575, 0.73530173, 0.03119173, 0.00461081],
       [0.02553415, 0.7081416 , 0.25159574, 0.0147285 ],
       [0.36505347, 0.37899724, 0.17126094, 0.08468837],
       [0.9789428 , 0.01751637, 0.00146139, 0.00207941],
       [0.19823009, 0.41286072, 0.24495102, 0.14395824],
       [0.0262125 , 0.55299616, 0.3553921 , 0.06539923],
       [0.14617911, 0.06894917, 0.6861

## Observations

The action indices 0 to 3 correspond to the directions LEFT, DOWN, RIGHT, UP. The A2C algorithm pretty quickly converges on a path:

DOWN, DOWN, RIGHT, RIGHT, DOWN, RIGHT

What's a little odd is that it also treats DOWN as a good direction for state 3, which leads straight into a hole (state 7). This could be due to the epsilon-greedy exploration. Something to explore would be decaying epsilon with each episode.
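
One way the epsilon decay could look is sketched below. I haven't tried this yet; the `decayed_epsilon` function and its `end` and `decay` values are assumptions picked for illustration, and `play_rollout` would need to take epsilon as an argument instead of reading the global `EPSILON`:

```python
def decayed_epsilon(episode: int, start: float = 0.3, end: float = 0.02, decay: float = 0.9) -> float:
    """Exponentially decay epsilon per episode, never going below `end` (hypothetical schedule)."""
    return max(end, start * decay ** episode)


# Epsilon at the first, second, eleventh, and last of 30 episodes.
schedule = [round(decayed_epsilon(ep), 4) for ep in (0, 1, 10, 29)]
print(schedule)
```

The idea is to explore as much as before in the early episodes and then mostly follow the learned policy, while the floor keeps a little exploration alive.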

This is a deterministic Frozen Lake environment (is_slippery=False). Something else to try would be setting it to a probabilistic environment (is_slippery=True) and looking at how well A2C learns the environment.
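
As I understand the gymnasium documentation, with `is_slippery=True` and default settings the agent moves in the intended direction only a third of the time and otherwise slips to one of the two perpendicular directions with equal probability. The sketch below imitates that rule in plain Python to show what the agent would be up against. `slippery_action` is my own stand-in, not a gymnasium function:

```python
import random

ACTIONS = ["LEFT", "DOWN", "RIGHT", "UP"]  # action indices 0..3


def slippery_action(intended: int, rng: random.Random) -> int:
    """Pick the direction actually taken: intended or one of its two perpendicular neighbours."""
    candidates = [(intended - 1) % 4, intended, (intended + 1) % 4]
    return rng.choice(candidates)  # each candidate has probability 1/3


rng = random.Random(0)
# Try to go DOWN (index 1) a few times and see where we actually end up heading.
taken = [ACTIONS[slippery_action(1, rng)] for _ in range(10)]
print(taken)
```

Under this rule the agent never moves opposite to the intended direction, but a "safe" path next to a hole is no longer safe, so the learned policy would probably look quite different from the deterministic one.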

This was pretty cool: it shows that A2C converges to a valid solution, and that reinforcement learning can be implemented in Keras by providing a loss function. The loss function in this case requires a couple of values: the action actually taken and the advantage. Those can be passed to the fit function as columns of the target array, and the loss function can separate them back out from those columns.
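
To illustrate the column trick without Keras, here is a plain-Python sketch of the same calculation on two made-up samples. Each target row is `[action, advantage]`, and the loss picks out the probability of the action that was taken. The numbers are invented for the example, and unlike `actor_loss_fn` this version returns one value per sample rather than a masked matrix with one column per action:

```python
import math


def actor_losses(y_true, y_pred, eps=1e-9):
    """Per-sample actor loss: -log(pi(a|s)) * advantage for the action actually taken."""
    losses = []
    for (action, advantage), probs in zip(y_true, y_pred):
        taken_prob = probs[int(action)]  # column 0 holds the action index
        losses.append(-math.log(taken_prob + eps) * advantage)  # column 1 holds the advantage
    return losses


# Targets packed as [action, advantage]; predictions are action probabilities.
y_true = [[1.0, 0.5], [2.0, -0.2]]
y_pred = [[0.1, 0.7, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
print(actor_losses(y_true, y_pred))
```

A positive advantage gives a positive loss that shrinks as the taken action's probability grows, while a negative advantage gives a negative loss that rewards lowering that probability.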