In [None]:
import os
from write_load import load_env
from agent import DynaAgent

## Dyna agent

The first part of the assignment is to reproduce some of the results from the original Dyna paper (Sutton, 1990). The pdf of the paper is included in the git repository you have cloned. In particular, **your task is to generate and visualise the data plotted in Figure 5** of that paper. You can neglect Dyna-PI and implement only Dyna-Q$-$ and Dyna-Q$+$.

To make your life a little easier, and to let you jump right into the more interesting stuff, you are provided with the environment simulator located in `environment.py`, as well as a blueprint of the main code for the agent. That is, you have access to the file `agent.py` where you will find the agent class. This class has a method called `simulate` with the main simulation loop already implemented. You are of course invited to investigate it.

**Your task is to fill in the missing implementation** by completing the following methods:
- `_policy`. This is the typical $\pi(a\mid s)$, which specifies how the agent chooses actions in any given state
- `_update_qvals`. This is the $Q$-learning update rule
- `_update_experience_buffer`. This updates the agent's experience buffer, from which it then samples planning updates
- `_update_action_count`. This tracks the number of moves elapsed since each action was last attempted
- `_plan`. This performs the planning updates, replaying transitions sampled from the experience buffer
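
To make the update rule and the planning step concrete, here is a minimal stand-alone sketch of a tabular $Q$-learning backup and of uniform replay from an experience buffer. It is not the interface of `agent.py`: the function names `q_update` and `plan`, the list-of-lists $Q$ table, and the `(s, a, r, s1)` buffer layout are assumptions made for illustration. The optional `bonus` argument is only a placeholder for the exploration bonus that Dyna-Q$+$ adds to the reward; take its exact form from the paper.

```python
import random

def q_update(Q, s, a, r, s1, beta=0.5, gamma=0.9, bonus=0.0):
    """One tabular Q-learning backup on Q[s][a] (hypothetical helper).

    Implements Q(s,a) <- Q(s,a) + beta * (r + bonus + gamma * max_a' Q(s1,a') - Q(s,a)).
    """
    target = r + bonus + gamma * max(Q[s1])
    Q[s][a] += beta * (target - Q[s][a])

def plan(Q, buffer, k, rng, beta=0.5, gamma=0.9):
    """Replay k remembered transitions drawn uniformly from the buffer."""
    for _ in range(k):
        s, a, r, s1 = rng.choice(buffer)
        q_update(Q, s, a, r, s1, beta, gamma)

# Toy example: 3 states, 2 actions, a single rewarded transition 0 --a=1--> 2.
Q = [[0.0, 0.0] for _ in range(3)]
q_update(Q, s=0, a=1, r=1.0, s1=2)
q_after_real_step = Q[0][1]          # 0.5 after one real experience

buffer = [(0, 1, 1.0, 2)]            # assumed layout: (s, a, r, s1)
plan(Q, buffer, k=1, rng=random.Random(0))
q_after_planning = Q[0][1]           # 0.75 after one replayed backup
```

Each replayed transition moves the value a further fraction `beta` of the way towards its target, which is why even a few planning updates per real step can speed up learning.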

In [None]:
# load environments
maze_conf_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'envs'))
maze1_conf = load_env(os.path.join(maze_conf_path, 'dyna1.txt')) # maze with the path on the right open 
maze2_conf = load_env(os.path.join(maze_conf_path, 'dyna2.txt')) # maze with the path on the left open

In [None]:
# these are the same as in the paper
beta    = 0.5   # learning rate
gamma   = 0.9   # discount factor
epsilon = 0.001 # exploration parameter
k       = 10    # number of planning updates

# initialise the agent
agent   = DynaAgent(beta, gamma, epsilon)

In [None]:
# run simulations
agent.init_env(**maze1_conf)
agent.simulate(num_trials=1000, reset_agent=True, num_planning_updates=k)
agent.init_env(**maze2_conf)
agent.simulate(num_trials=2000, reset_agent=False, num_planning_updates=k)

In [None]:
# compare performance
# plot figure 5 from Sutton (1990)
perf = agent.get_performace()

Finally, you will also implement prioritized sweeping (Moore & Atkeson, 1993). The pdf of the paper is also in this git repository.

Even though the Dyna-style planning you have already implemented confers significant learning advantages over comparable agents which do not plan, there is still room for improvement. In particular, it seems somewhat wasteful to select planning updates by uniform sampling, since many of the sampled updates will change the $Q$-values very little. Prioritized sweeping addresses this by ordering planning updates by priority, so that the state-action pairs whose values are expected to change the most are updated first.
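
The idea can be sketched with a priority queue keyed on the magnitude of the temporal-difference error. Note that this is a simplified $Q$-value variant for a deterministic world model, not the exact algorithm of Moore & Atkeson (1993). The function name `prioritized_sweeping`, the `model` and `predecessors` dictionaries, and the threshold `theta` are assumptions made for illustration and do not appear in `agent.py`.

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, start, k,
                         beta=0.5, gamma=0.9, theta=1e-4):
    """Run up to k planning backups, largest |TD error| first.

    Assumed (hypothetical) data layout:
      model[(s, a)]   -> (r, s1), a deterministic learned model
      predecessors[s] -> iterable of (s_prev, a_prev) known to lead to s
    Returns the number of backups actually performed.
    """
    def td_error(s, a):
        r, s1 = model[(s, a)]
        return r + gamma * max(Q[s1]) - Q[s][a]

    queue = []                                   # min-heap on negated priority
    p = abs(td_error(*start))
    if p > theta:
        heapq.heappush(queue, (-p, start))

    n_backups = 0
    while queue and n_backups < k:
        _, (s, a) = heapq.heappop(queue)
        Q[s][a] += beta * td_error(s, a)         # error recomputed at pop time
        n_backups += 1
        # The value of s changed, so pairs leading into s may now be wrong.
        for sp, ap in predecessors.get(s, ()):
            p = abs(td_error(sp, ap))
            if p > theta:
                heapq.heappush(queue, (-p, (sp, ap)))
    return n_backups

# Toy chain 0 -> 1 -> 2 with one action; reward 1 on the step into state 2.
Q = [[0.0], [0.0], [0.0]]
model = {(0, 0): (0.0, 1), (1, 0): (1.0, 2)}
predecessors = {1: [(0, 0)], 2: [(1, 0)]}
n = prioritized_sweeping(Q, model, predecessors, start=(1, 0), k=5)
```

In this toy chain the backup at state 1 immediately queues its predecessor, state 0, so the reward information propagates backwards in two targeted updates rather than waiting for uniform sampling to pick the right transitions.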