Motivation | Summary | Results | How to use this repository
This is a Python reimplementation for the Taxi domain of
- An Object-Oriented Representation for Efficient Reinforcement Learning (C. Diuk's Dissertation)
- An Object-Oriented Representation for Efficient Reinforcement Learning (Paper by C. Diuk et al.)
In the Taxi domain, the goal is to navigate the taxi (initially a yellow box) to the passenger (blue letter), take a Pickup action, and then drive the taxi with the passenger inside (green box) to the destination (magenta letter), where a Dropoff action completes the task. Every time step until delivery yields a reward of -1. A successful Dropoff results in a reward of +20, while an unsuccessful Dropoff or Pickup is penalized with -10. This task was introduced by Dietterich.
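The same domain is available as `Taxi-v3` in OpenAI `gym` (this repository builds on and extends the `gym` Taxi environment). A minimal sketch to inspect the reward structure, assuming the classic 4-tuple `gym` step API (newer `gym`/`gymnasium` versions return additional values):

```python
import gym

env = gym.make("Taxi-v3")  # Dietterich's 5x5 Taxi domain
state = env.reset()
# action indices in gym's Taxi: 0=South, 1=North, 2=East, 3=West, 4=Pickup, 5=Dropoff
state, reward, done, info = env.step(4)  # try a Pickup right away
print(reward)  # -10 unless the taxi happens to start at the passenger location (then -1)
```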
It is a well-known empirical fact in reinforcement learning that model-based approaches (e.g., Rmax) are more sample-efficient than model-free algorithms (e.g., Q-learning). One of the main reasons may be that model-based learning tackles the exploration-exploitation dilemma in a smarter way by using the accumulated experience to build an approximate model of the environment. Furthermore, it has been shown that rich state representations, such as in a factored MDP, can make model-based learning even more sample-efficient. Factored MDPs enable an effective parametrization of the transition and reward dynamics by using dynamic Bayesian networks (DBNs) to represent partial dependency relations between state variables; thereby, the environment dynamics can be learned from fewer samples. A major downside of these approaches is that the DBNs need to be provided as prior knowledge, which may not always be possible.
Motivated by human intelligence, Diuk et al. introduce a new framework, propositional object-oriented MDPs (OO-MDPs), to model environments and their dynamics. As it turns out, humans are far more sample-efficient than state-of-the-art algorithms when playing games such as Taxi (Diuk actually performed an experiment). Diuk argues that humans must use some prior knowledge when playing this game; he further speculates that this knowledge might come in the form of object representations, e.g., identifying horizontal lines as walls when observing that the taxi cannot move through them.
Diuk et al. provide a learning algorithm for deterministic OO-MDPs (DOORmax) which outperforms factored Rmax. As prior knowledge, DOORmax needs the objects and the relations to consider, which seems more natural as these may also be reused across different games. Furthermore, this approach may also be used to incorporate human biases.
This part gives an overview of the reimplemented algorithms, which can be divided into model-free and model-based approaches.
In model-free algorithms the agent learns the optimal action-value function (or value function or policy) directly from experience, without an actual model of the environment. Probably the most famous model-free algorithm is Q-learning, which also forms the basis of the (perhaps even more famous) DQN paper.
Q-learning aims to approximate the optimal action-value function from which the optimal policy can be inferred. In the simplest case, a table (Q-table) is used as a function approximator.
The basic idea is to start with a random action-value function and then iteratively update it towards the optimal action-value function. The update is made after each action a with the observed reward r and new state s'; the update rule is very simple and is derived from Bellman's optimality equation:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

where α is the learning rate and γ the discount factor. To allow for exploration, Q-learning commonly uses ε-greedy exploration or the Greedy in the Limit with Infinite Exploration approach (see David Silver, p. 13 ff.).
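As an illustration, a minimal sketch of the tabular update and ε-greedy action selection (generic names; γ = 0.95 is an assumption, and this is not the repository's actual implementation):

```python
import numpy as np

n_states, n_actions = 500, 6             # 5x5 Taxi domain
alpha, gamma, epsilon = 0.1, 0.95, 0.6   # gamma = 0.95 is an assumed discount factor

Q = np.zeros((n_states, n_actions))      # Q-table as function approximator

def epsilon_greedy(state):
    # explore with probability epsilon, otherwise act greedily w.r.t. Q
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```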
Diuk uses two variants of Q-learning:
- Q-learning: standard Q-learning approach with ε-greedy exploration, where the parameters α = 0.1 and ε = 0.6 were found via parameter search.
- Q-learning with optimistic initialization: instead of some random initialization of the Q-table, a smart initialization to an optimistic value (the maximum possible value of any state-action pair) is used. Thereby, unvisited state-action pairs become more likely to be visited. Here, α was set to 1 (deterministic environment) and ε to 0 (exploration is ensured via the initialization); see the sketch below.
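A minimal sketch of the optimistic initialization idea (the exact optimistic value and γ = 0.95 are assumptions; any upper bound on the achievable return, e.g. the Dropoff reward of +20, works):

```python
import numpy as np

n_states, n_actions = 500, 6             # 5x5 Taxi domain
R_MAX, gamma = 20, 0.95                  # +20 is the maximal (Dropoff) reward
q_optimistic = R_MAX / (1 - gamma)       # safe upper bound on any discounted return

# every state-action pair initially looks maximally attractive, so unvisited
# pairs are preferred until their true (lower) values have been learned
Q = np.full((n_states, n_actions), q_optimistic)
alpha, epsilon = 1.0, 0.0                # deterministic domain, no explicit exploration
```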
In model-based approaches the agent learns a model of the environment by accumulating experience. Then, an optimal action-value function (or value function or policy) is obtained through planning. Planning can be done exactly or approximately. In the experiments, Diuk et al. use exact planning, more precisely value iteration. The difference between the following three algorithms lies in the way they learn the environment dynamics.
Rmax is a provably efficient state-of-the-art algorithm that addresses the exploration-exploitation dilemma with an intuitive approach: Rmax divides state-action pairs into known (state-action pairs which have been visited often enough to build an accurate transition/reward function) and unknown. Whenever a state-action pair is known, the algorithm uses the empirical transition and reward function for planning. In case it is unknown, Rmax assumes a transition to a fictitious state from which maximum reward can be obtained consistently (hence the name) and uses that for planning. Therefore, actions which have not been tried out (often enough) in the current state will be preferred, unless the known actions also lead to maximal return. The parameter M defines the number of observations the agent has to make until it considers a transition/reward to be known; in a deterministic case such as the Taxi domain, it can be set to 1. Rmax is guaranteed to find a near-optimal action-value function in polynomial time.
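A minimal sketch of the known/unknown bookkeeping for the deterministic case (generic names, not the repository's actual classes):

```python
from collections import defaultdict

M, R_MAX = 1, 20                   # M = 1 suffices in a deterministic domain such as Taxi

counts = defaultdict(int)          # (s, a) -> number of observations
empirical_next = {}                # (s, a) -> observed next state
empirical_reward = {}              # (s, a) -> observed reward

def observe(s, a, r, s_next):
    counts[(s, a)] += 1
    empirical_next[(s, a)] = s_next
    empirical_reward[(s, a)] = r

def model(s, a, fictitious_state):
    # known pairs use the empirical model; unknown pairs lead to a fictitious
    # state that consistently returns R_MAX, which makes them worth exploring
    if counts[(s, a)] >= M:
        return empirical_next[(s, a)], empirical_reward[(s, a)]
    return fictitious_state, R_MAX
```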
The 5x5 Taxi domain has 500 different states:
- 5 x positions for taxi
- 5 y positions for taxi
- 5 passenger locations (4 designated locations plus in-taxi)
- 4 destinations
In the standard Rmax approach without any domain knowledge (except for the maximum possible reward Rmax, the number of states |S| and the number of actions |A|), the states are simply enumerated and the agent is not able to transfer knowledge across the domain. E.g., assume the agent performs the action North at some location on the grid and learns the state transition (more precisely, it would learn something like: taking action 1 in state 200 results in ending up in state 220). Being at the same location but with a different passenger location or destination, the agent is not able to predict the outcome of the action North. It therefore takes the agent at least 3000 (|S| · |A|) steps to fully learn the 5x5 Taxi transition dynamics. Furthermore, the learned transition and reward dynamics are rather difficult to interpret. To address this shortcoming, the agent needs a different representation and some prior knowledge.
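The following sketch illustrates why flat enumeration prevents generalization (the encoding order is hypothetical; the repository/gym may enumerate the state variables differently):

```python
# flat index over 5 x 5 x 5 x 4 = 500 states
def encode(taxi_x, taxi_y, passenger_loc, destination):
    return ((taxi_x * 5 + taxi_y) * 5 + passenger_loc) * 4 + destination

# physically the same 'North' move, but two unrelated entries in a flat model:
print(encode(2, 1, 0, 3))  # a transition learned for this index ...
print(encode(2, 1, 2, 3))  # ... says nothing about this one
```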
Factored Rmax is an Rmax adaptation that builds on a factored MDP representation of the environment. In a factored MDP, a state is represented as a tuple (hence factored state), e.g., in the Taxi domain a state can be represented as the 4-tuple
[taxi x location, taxi y location, passenger location, passenger destination]
(Note that passenger location actually enumerates the different (x, y) passenger start locations plus whether the passenger is in the taxi.) This representation allows partial dependency relations between state variables in the environment dynamics to be expressed using Dynamic Bayesian Networks (DBNs). E.g., for the action North we know that each state variable at time t+1 only depends on its own value at time t, i.e., the x location at time t+1 under action North is independent of the y location, passenger location and passenger destination at time t. This knowledge is encoded in a DBN (each action may have a different DBN), and it enables much more sample-efficient learning in Factored Rmax. The downside of this approach is that this kind of prior knowledge may not be available and that it lacks some generalization, e.g., although Factored Rmax knows that the x location is independent of all other state variables, it still needs to perform the action North at each x location to learn the outcome.
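A sketch of the kind of DBN prior knowledge Factored Rmax needs (hypothetical format, not the repository's actual data structure): for every action, list which state variables at time t each state variable at time t+1 depends on.

```python
STATE_VARS = ["taxi_x", "taxi_y", "passenger_loc", "destination"]

# action North: every variable at t+1 depends only on its own value at t
DBN_NORTH = {
    "taxi_x": ["taxi_x"],
    "taxi_y": ["taxi_y"],
    "passenger_loc": ["passenger_loc"],
    "destination": ["destination"],
}

# action East: the new x location additionally depends on y, because vertical
# walls block eastward movement only at certain (x, y) positions
DBN_EAST = {
    "taxi_x": ["taxi_x", "taxi_y"],
    "taxi_y": ["taxi_y"],
    "passenger_loc": ["passenger_loc"],
    "destination": ["destination"],
}
```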
DOORmax is an Rmax adaptation that builds on a deterministic (propositional) object-oriented MDP (OO-MDP) representation of the environment. This representation is based on objects and their interactions; a state is represented as the union of all (object) attribute values. Additionally, each state has an associated boolean vector describing which relations are enabled and which are not in that state. During a transition, each attribute of the state may be changed by some kind of effect. There are some limitations on the effects that can occur, which are well explained in Diuk's dissertation. The basic idea of DOORmax is to recover the deterministic OO-MDP using condition-effect learners (in these learners, conditions are basically the relations that need to hold in order for an effect to occur). The paper results show that with DOORmax knowledge transfers much better across the domain than with the other algorithms, indicating that DOORmax offers better generalization. Another feature is that the learned transition dynamics are easy to interpret, e.g., DOORmax will learn that the action North has the effect of incrementing taxi.y by 1 when the relation touch_north(taxi, wall) is False, and that there won't be any change in taxi.y if touch_north(taxi, wall) is True.
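A heavily simplified sketch of a single condition-effect learner in this spirit (hypothetical class; the repository's learners handle multiple effect types, failure conditions and more):

```python
class ConditionEffectLearner:
    """Learns under which relational condition a particular effect occurs."""

    def __init__(self):
        self.condition = None   # per-relation values (None = "don't care")
        self.effect = None      # e.g. "increment taxi.y by 1"

    def update(self, relations, observed_effect):
        if observed_effect is None:            # nothing changed, nothing to learn here
            return
        if self.effect is None:                # first positive example
            self.condition, self.effect = list(relations), observed_effect
        elif observed_effect == self.effect:
            # generalize: keep only relation values shared by all positive examples
            self.condition = [c if c == r else None
                              for c, r in zip(self.condition, relations)]

    def predict(self, relations):
        if self.effect is None:
            return None                        # effect never observed yet
        if all(c is None or c == r for c, r in zip(self.condition, relations)):
            return self.effect
        return None
```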
The experimental setup is described on p. 31 of Diuk's dissertation and p. 7 of the paper. It consists of testing against six probe states and reporting the number of steps the agent had to take until it reached the optimal policy for these six start states. Since there is some randomness in the trials, each algorithm is run 100 times and the results are averaged.
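A sketch of this evaluation protocol with hypothetical helper names (the repository's scripts implement this differently):

```python
import numpy as np

def run_trial(agent, env, probe_states):
    # train until the agent's greedy policy is optimal from all six probe states
    steps = 0
    while not all(agent.acts_optimally_from(s) for s in probe_states):
        steps += agent.train_one_episode(env)   # returns steps taken this episode
    return steps

# average over repeated runs, e.g.:
# steps_per_run = [run_trial(make_agent(), make_env(), PROBE_STATES) for _ in range(100)]
# print(np.mean(steps_per_run))
```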
There are some differences between this reimplementation and Diuk's approach which are listed below:
1. For educational purposes, the reward function is also learned in this reimplementation (always in the simplest possible way). Note that Diuk mainly focused on learning the transition model:

   > I will focus on learning dynamics and assume the reward function is available as a black box function.

2. It is unknown whether in Diuk's setting the passenger start location and the destination could be the same during training. The original definition by Dietterich states:

   > To keep things uniform, the taxi must pick up and drop off the passenger even if he/she is already at the destination.

   Therefore, in this reimplementation this was also possible during training. While the results for Rmax and its adaptations indicate that Diuk used the same setting, there is a discrepancy for Q-learning. When the setting was changed such that passenger start and destination could not be the same (these are the results in brackets), results similar to Diuk's were obtained.

3. Some implementation details differ, such as the update procedure of the empirical transition and reward functions or the condition-effect learners, which were either not documented in enough detail or did not fit into the reimplementation structure.
The dissertation results align with the reimplementation results. Clearly, DOORmax outperforms the other algorithms in terms of sample-efficiency.
For the differences in Q-Learning and the values in brackets, refer to 2) of Differences between Reimplementation and Diuk.
The results were obtained on a Lenovo ThinkPad Yoga 260 (i7-6500 CPU @ 2.50 GHz x 4).
| Domain knowledge | Algorithm | Diuk: # Steps | Diuk: Time/step | Reimplementation: # Steps | Reimplementation: Time/step | Reimplementation: Total Time |
|---|---|---|---|---|---|---|
| \|S\|, \|A\| | Q-learning | 106859 | <1ms | 120148 (119941) | <1ms (<1ms) | 4.9s (4.9s) |
| \|S\|, \|A\|, Rmax | Q-learning with optimistic initialization | 29350 | <1ms | 75289 (28989) | <1ms (<1ms) | 4.2s (1.7s) |
| \|S\|, \|A\|, Rmax | Rmax | 4151 | 74ms | 4080 | 2.9ms | 11.8s |
| Rmax, DBN structure | Factored Rmax | 1676 | 97.7ms | 1686 | 30.5ms | 51.4s |
| Objects, relations to consider, Rmax | DOORmax | 529 | 48.2ms | 498 | 36ms | 17.9s |
| \|A\|, visualization of game | Humans (non-videogamers) | 101 | NA | NA | NA | NA |
| \|A\|, visualization of game | Humans (videogamers) | 48.8 | NA | NA | NA | NA |
The paper results align with the reimplementation results. These results show that DOORmax not only outperforms Factored Rmax in terms of sample-efficiency, but also scales much better to larger problems. Note that the number of states increases by a factor of more than 14.
The results were obtained on a cluster whose CPU specifics I do not know (this is not too important since the focus lies on the comparison). Note that Diuk et al. used a more powerful machine for the paper results: the average step times are notably smaller than in the dissertation results.
| | Diuk: Taxi 5x5 | Diuk: Taxi 10x10 | Ratio | Reimplementation: Taxi 5x5 | Reimplementation: Taxi 10x10 | Ratio |
|---|---|---|---|---|---|---|
| Number of states | 500 | 7200 | 14.40 | 500 | 7200 | 14.40 |
| Factored Rmax: # steps | 1676 | 19866 | 11.85 | 1687 | 21868 | 12.96 |
| Factored Rmax: Time per step | 43.59ms | 306.71ms | 7.03 | 24.61ms | 1.02s | 41.45 |
| DOORmax: # steps | 529 | 821 | 1.55 | 502 | 1086 | 2.16 |
| DOORmax: Time per step | 13.88ms | 293.72ms | 21.16 | 22.12ms | 1.22s | 55.15 |
Install from source:

```bash
git clone --depth 1 https://github.com/borea17/efficient_rl/
cd efficient_rl
python setup.py install
```

or install via pip:

```bash
pip install efficient_rl
```
After successful installation, download `dissertation_script.py` and `paper_script.py` (both are in the folder `efficient_rl`), then run

```bash
python dissertation_script.py
python paper_script.py
```

By default, each agent runs only once. To increase the number of repetitions, change `n_repetitions` in the scripts.
WARNING: It is not recommended to run `paper_script.py` on a standard computer as it may take several hours.
If you want to use this repository for a different environment, you may want to have a look at the `efficient_rl/environment` folder. There is a self-written environment called `TaxiEnvironmentClass.py`, and there are extensions to the `gym` Taxi environment in the corresponding folders.
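As a rough orientation, a generic sketch of a `gym`-style discrete environment skeleton (assumed interface only; check `efficient_rl/environment` for the interface the agents actually expect):

```python
import gym
from gym import spaces

class MyGridEnvironment(gym.Env):
    """Hypothetical custom environment with discrete state and action spaces."""

    def __init__(self):
        self.observation_space = spaces.Discrete(500)
        self.action_space = spaces.Discrete(6)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward, done = -1, False
        # ... transition logic for the custom domain goes here ...
        return self.state, reward, done, {}
```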
Contributions are welcome and, if needed, I will provide more detailed documentation.