# Basic Reinforcement Learning using DeepMind's RL Framework "Acme"
## Implement SARSA and Q Learning Agents in Acme
### by Andreas Stöffelbauer

>This notebook contains the code that accompanies my Medium blog post, written for Towards Data Science in June 2021.


Acme is a research framework for reinforcement learning, open-sourced by Google's DeepMind in 2020. It was designed to simplify the development of novel RL agents and accelerate RL research. According to DeepMind, Acme is used on a daily basis at the company, which is spearheading research in reinforcement learning and artificial intelligence.

In [1]:
# reinforcement learning
import acme
from acme import types
from acme.wrappers import gym_wrapper
from acme.environment_loop import EnvironmentLoop
from acme.utils.loggers import TerminalLogger, InMemoryLogger

# environments
import gym
import dm_env

# other
import numpy as np

## Blackjack Environment

Acme agents are not designed to interact with Gym environments directly. Instead, DeepMind has its own RL environment API (`dm_env`). The main difference lies in how timesteps are represented.

Fortunately, you can still use Gym environments, since Acme's developers provide a wrapper for them.

In [2]:
env = acme.wrappers.GymWrapper(gym.make('Blackjack-v0'))
# env = acme.wrappers.SinglePrecisionWrapper(env)

# print env specs
env_specs = env.observation_space, env.action_space, env.reward_range # env.observation_spec()
print('Observation Spec:', env.observation_space)
print('Action Spec:', env.action_space)
print('Reward Spec:', env.reward_range)

Observation Spec: Tuple(Discrete(32), Discrete(11), Discrete(2))
Action Spec: Discrete(2)
Reward Spec: (-inf, inf)


In [3]:
# show a timestep
env.reset()

TimeStep(step_type=<StepType.FIRST: 0>, reward=None, discount=None, observation=(10, 9, False))

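As you can see, timesteps are represented somewhat differently than in OpenAI Gym: instead of an `(observation, reward, done, info)` tuple, you get a `TimeStep` with a step type, reward, discount, and observation.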
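To make the mapping concrete, here is a minimal sketch of how a Gym-style step could be translated into the four fields of the timestep printed above. The `TimeStep` and `StepType` classes are simplified stand-ins written for illustration, not the `dm_env` implementations, and `from_gym_reset` and `from_gym_step` are hypothetical helpers.

```python
import enum
from typing import Any, NamedTuple, Optional


class StepType(enum.IntEnum):
    """Position of a timestep within an episode (FIRST = 0 as printed above)."""
    FIRST = 0
    MID = 1
    LAST = 2


class TimeStep(NamedTuple):
    """Stand-in with the same four fields as the timestep printed above."""
    step_type: StepType
    reward: Optional[float]
    discount: Optional[float]
    observation: Any

    def last(self) -> bool:
        return self.step_type == StepType.LAST


def from_gym_reset(observation) -> TimeStep:
    # The first timestep carries no reward and no discount.
    return TimeStep(StepType.FIRST, None, None, observation)


def from_gym_step(observation, reward, done) -> TimeStep:
    # Hypothetical helper: a finished episode becomes a LAST step with
    # discount 0, so nothing is bootstrapped from the terminal state.
    if done:
        return TimeStep(StepType.LAST, reward, 0.0, observation)
    return TimeStep(StepType.MID, reward, 1.0, observation)


first = from_gym_reset((10, 9, False))
final = from_gym_step((22, 9, False), -1.0, True)
print(first)
print(final.last(), final.discount)
```

The discount field is what lets a temporal-difference update treat terminal and non-terminal steps with the same formula.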

## Actors, Learners, and Agents

Acme distinguishes between actors, learners, and agents, and it is crucial to keep them apart. The agents in this notebook all subclass `acme.Actor`; the base classes are defined here: https://github.com/deepmind/acme/blob/master/acme/core.py.
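Roughly speaking, an actor selects actions and records what it observes, a learner updates parameters from data, and an agent combines both. The sketch below lists the four methods that the agents in this notebook implement. `ActorSketch` is a stand-in written for illustration, not Acme's base class, and `ConstantActor` is a hypothetical example.

```python
import abc


class ActorSketch(abc.ABC):
    """Stand-in listing the four methods the agents in this notebook implement."""

    @abc.abstractmethod
    def select_action(self, observation):
        """Return an action for the given observation."""

    @abc.abstractmethod
    def observe_first(self, timestep):
        """Record the first timestep of an episode."""

    @abc.abstractmethod
    def observe(self, action, next_timestep):
        """Record the action taken and the timestep that followed."""

    @abc.abstractmethod
    def update(self):
        """Update internal state, e.g. a policy or a value table."""


class ConstantActor(ActorSketch):
    """Hypothetical actor that always picks the same action and never learns."""

    def __init__(self, action=0):
        self.action = action
        self.seen = []  # observed (action, timestep) pairs

    def select_action(self, observation):
        return self.action

    def observe_first(self, timestep):
        self.seen = [(None, timestep)]

    def observe(self, action, next_timestep):
        self.seen.append((action, next_timestep))

    def update(self):
        pass  # nothing to learn


actor = ConstantActor(action=1)
actor.observe_first("t0")
actor.observe(actor.select_action("t0"), "t1")
print(actor.seen)
```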

### A Random Agent

In [4]:
class RandomAgent(acme.Actor):
    """A random agent for the Blackjack environment."""
    
    def __init__(self):
        
        # init action values, will not be updated by random agent
        self.Q = np.zeros((32,11,2,2))
        
        # specify the behavior policy
        self.behavior_policy = lambda q_values: np.random.choice(2)
        
        # store timestep, action, next_timestep
        self.timestep = None
        self.action = None
        self.next_timestep = None
        
    def select_action(self, observation):
        "Choose an action according to the behavior policy."
        return self.behavior_policy(self.Q[observation])    

    def observe_first(self, timestep):
        "Observe the first timestep." 
        self.timestep = timestep

    def observe(self, action, next_timestep):
        "Observe the next timestep."
        self.action = action
        self.next_timestep = next_timestep
        
    def update(self, wait = False):
        "Update the policy."
        # no updates occur here, it's just a random policy
        self.timestep = self.next_timestep 

## Environment Loop

If you already know a bit about reinforcement learning, and certainly if you have already implemented an RL algorithm, the following loop will look very familiar to you.

In [5]:
agent = RandomAgent()

# make first observation
timestep = env.reset()
agent.observe_first(timestep)

# run an episode
while not timestep.last():
    
    # generate an action from the agent's policy and step the environment
    action = agent.select_action(timestep.observation)
    timestep = env.step(action)

    # have the agent observe the timestep and let the agent update itself
    agent.observe(action, next_timestep=timestep)
    agent.update()

Conveniently, there is a shortcut in Acme: the EnvironmentLoop, which performs pretty much exactly the steps seen above. You just have to pass your environment and agent instances and then you can run either a single episode or as many as you want with a single line of code. There are also various loggers available that track important metrics such as the number of steps taken in each episode and the collected rewards.

In [7]:
# or use Acme training loop
loop = EnvironmentLoop(env, agent, logger=InMemoryLogger())
loop.run_episode()
loop.run(100)
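To make the bookkeeping concrete, the sketch below runs one episode by hand and returns two of the metrics mentioned above: the episode length and the episode return. It is an illustration under simplifying assumptions, not Acme's `EnvironmentLoop`. `ToyTimeStep`, `CountdownEnv`, and `AlwaysZeroAgent` are hypothetical stand-ins so that the example runs without Acme or Gym.

```python
from typing import NamedTuple, Optional


class ToyTimeStep(NamedTuple):
    """Hypothetical stand-in exposing only what the loop needs."""
    reward: Optional[float]
    observation: int
    is_last: bool

    def last(self) -> bool:
        return self.is_last


class CountdownEnv:
    """Hypothetical environment: ends after `n` steps, reward 1.0 per step."""

    def __init__(self, n=3):
        self.n = n
        self.remaining = n

    def reset(self):
        self.remaining = self.n
        return ToyTimeStep(None, self.remaining, False)

    def step(self, action):
        self.remaining -= 1
        return ToyTimeStep(1.0, self.remaining, self.remaining == 0)


class AlwaysZeroAgent:
    """Hypothetical agent with the same four methods as `RandomAgent`."""

    def select_action(self, observation):
        return 0

    def observe_first(self, timestep):
        pass

    def observe(self, action, next_timestep):
        pass

    def update(self):
        pass


def run_episode(env, agent):
    """Run one episode and return its length and undiscounted return."""
    timestep = env.reset()
    agent.observe_first(timestep)
    episode_length, episode_return = 0, 0.0
    while not timestep.last():
        action = agent.select_action(timestep.observation)
        timestep = env.step(action)
        agent.observe(action, next_timestep=timestep)
        agent.update()
        episode_length += 1
        episode_return += timestep.reward
    return {"episode_length": episode_length, "episode_return": episode_return}


print(run_episode(CountdownEnv(n=3), AlwaysZeroAgent()))
```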

## SARSA and Q Learning Agents

### Some Policies

Below are a few simple policies, each of which maps a vector of action values to an action. Only `epsilon_greedy` is used by the agents in this notebook.

In [17]:
# uniform random policy
def random_policy(q_values):
    return np.random.choice(len(q_values))

# greedy policy
def greedy(q_values):
    return np.argmax(q_values)

# epsilon greedy policy
def epsilon_greedy(q_values, epsilon):
    if epsilon < np.random.random():
        return np.argmax(q_values)
    else:
        return np.random.choice(len(q_values))
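As a quick sanity check of the `epsilon_greedy` logic, the sketch below re-implements it with the standard library's `random` module and counts how often the greedy action is chosen. Using `random` instead of NumPy is an assumption made so that the example runs on its own, and the `rng` parameter is added here for reproducibility. With two actions and `epsilon=0.1`, the greedy action should be picked roughly 95% of the time, since half of the exploratory draws also land on it.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, else a uniform one."""
    if epsilon < rng.random():
        # exploit: index of the largest action value (first one on ties)
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # explore: uniform over all actions, which may also hit the greedy one
    return rng.randrange(len(q_values))


rng = random.Random(0)  # fixed seed so the run is reproducible
q_values = [0.2, 0.7]   # action 1 is the greedy action
n_draws = 10_000
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(n_draws)]
greedy_share = picks.count(1) / n_draws
print(f"greedy action chosen in {greedy_share:.1%} of draws")
```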

### SARSA Agent
SARSA is an on-policy algorithm whose updates depend on the state, action, reward, next state, and next action (hence the name).
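Written out, the standard tabular SARSA update takes the following form. The symbols are introduced here for illustration: $\alpha$ corresponds to `step_size` and $\gamma$ to the timestep's `discount` in the agent's code.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \, Q(s', a') - Q(s, a) \big]
```

Here $a'$ is the next action sampled from the same epsilon-greedy behavior policy, which is what makes the method on-policy.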

In [15]:
class SarsaAgent(acme.Actor):
    
    def __init__(self, env_specs=None, epsilon=0.1, step_size=0.1):
        
        
        # setting initial action values
        self.Q = np.zeros((32,11,2,2))
        
        # epsilon for policy and step_size for TD learning
        self.epsilon = epsilon
        self.step_size = step_size
        
        # set behavior policy
        # self.policy = None
        self.behavior = lambda q_values: epsilon_greedy(q_values, self.epsilon)
        
        # store timestep, action, next_timestep
        self.timestep = None
        self.action = None
        self.next_timestep = None

    def transform_state(self, state):
        # this is specifically required for the Blackjack environment:
        # cast each entry (including the boolean) to int so the tuple can index Q
        state = tuple(map(int, state))
        return state
    
    def select_action(self, observation):
        state = self.transform_state(observation)
        return self.behavior(self.Q[state])

    def observe_first(self, timestep):
        self.timestep = timestep

    def observe(self, action, next_timestep):
        self.action = action
        self.next_timestep = next_timestep
        
    def update(self):
        
        # get variables for convenience
        state = self.timestep.observation
        _, reward, discount, next_state = self.next_timestep
        action = self.action
        
        # turn states into indices
        state = self.transform_state(state)
        next_state = self.transform_state(next_state)
        
        # sample a next action
        next_action = self.behavior(self.Q[next_state])

        # compute the TD error and apply the update
        td_error = reward + discount * self.Q[next_state][next_action] - self.Q[state][action]
        self.Q[state][action] += self.step_size * td_error
        
        # finally, set timestep to next_timestep
        self.timestep = self.next_timestep
        
sarsa = SarsaAgent()

To train SARSA on the environment for 50,000 episodes, simply run:

In [20]:
loop = EnvironmentLoop(env, sarsa, logger=InMemoryLogger())
loop.run(50000)

In [21]:
df = loop._logger.to_dataframe()
df.tail()

| | episode_length | episode_return | steps_per_second | episodes | steps |
|---|---|---|---|---|---|
| 49995 | 3 | 1.0 | 3018.209 | 49996 | 77495 |
| 49996 | 1 | 1.0 | 1e+200 | 49997 | 77496 |
| 49997 | 2 | 1.0 | 2e+200 | 49998 | 77498 |
| 49998 | 1 | 0.0 | 1e+200 | 49999 | 77499 |
| 49999 | 1 | 1.0 | 1e+200 | 50000 | 77500 |


### Q Learning Agent
The Q learning agent below is very similar to the SARSA agent. The two differ only in how the Q table is updated, because Q learning is an off-policy algorithm.
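The difference comes down to the bootstrap target: SARSA uses the value of the next action sampled from the behavior policy, while Q learning uses the largest action value in the next state, regardless of which action is taken next. The numbers below are made up for illustration, and `sarsa_target` and `q_learning_target` are hypothetical helper names rather than methods of the agents in this notebook.

```python
def sarsa_target(reward, discount, next_q_values, next_action):
    """On-policy target: bootstrap from the action the policy sampled."""
    return reward + discount * next_q_values[next_action]


def q_learning_target(reward, discount, next_q_values):
    """Off-policy target: bootstrap from the best next action."""
    return reward + discount * max(next_q_values)


# made-up values for a single transition
reward, discount = 0.0, 1.0
next_q_values = [-0.4, 0.3]  # action values in the next state
next_action = 0              # an exploratory (non-greedy) next action

print(sarsa_target(reward, discount, next_q_values, next_action))
print(q_learning_target(reward, discount, next_q_values))
```

When the sampled next action happens to be the greedy one, the two targets coincide.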

In [22]:
class QLearningAgent(acme.Actor):
    
    def __init__(self, env_specs=None, step_size=0.1):
        
        self.Q = np.zeros((32,11,2,2))
        
        # set step size
        self.step_size = step_size
        
        # set behavior policy
        # self.policy = None
        self.behavior_policy = lambda q_values: epsilon_greedy(q_values, epsilon=0.1)
        
        # store timestep, action, next_timestep
        self.timestep = None
        self.action = None
        self.next_timestep = None

    def transform_state(self, state):
        # this is specifically required for the Blackjack environment:
        # cast each entry (including the boolean) to int so the tuple can index Q
        state = tuple(map(int, state))
        return state
    
    def select_action(self, observation):
        state = self.transform_state(observation)
        return self.behavior_policy(self.Q[state])

    def observe_first(self, timestep):
        self.timestep = timestep

    def observe(self, action, next_timestep):
        self.action = action
        self.next_timestep = next_timestep  

    def update(self):
        # get variables for convenience
        state = self.timestep.observation
        _, reward, discount, next_state = self.next_timestep
        action = self.action
        
        # turn states into indices
        state = self.transform_state(state)
        next_state = self.transform_state(next_state)
        
        # Q-value update
        td_error = reward + discount * np.max(self.Q[next_state]) - self.Q[state][action]        
        self.Q[state][action] += self.step_size * td_error
        
        # finally, set timestep to next_timestep
        self.timestep = self.next_timestep
        
qagent = QLearningAgent()

In [23]:
loop = EnvironmentLoop(env, qagent, logger=InMemoryLogger())
loop.run(10000)

In [25]:
df = loop._logger.to_dataframe()
df.tail()

| | episode_length | episode_return | steps_per_second | episodes | steps |
|---|---|---|---|---|---|
| 9995 | 1 | -1.0 | 1003.902 | 9996 | 14970 |
| 9996 | 2 | 1.0 | 2e+200 | 9997 | 14972 |
| 9997 | 2 | 1.0 | 2e+200 | 9998 | 14974 |
| 9998 | 2 | 1.0 | 2e+200 | 9999 | 14976 |
| 9999 | 2 | 0.0 | 2e+200 | 10000 | 14978 |


Thanks for reading. Feel free to get in touch with me on [LinkedIn](https://www.linkedin.com/in/andreas-stoeffelbauer/) if you have any questions.