# Atari Reinforcement Learning Leaderboard

Any scores out of date? Make a Pull Request.

This is a leaderboard comparing world record human performance to state of the art machine performance in the Arcade Learning Environment (ALE).

| Game | Top Human Score | Top Machine Score | Best | Best Machine | Learning Type | Notes |
|---|---|---|---|---|---|---|
| Alien | 103583 | 9491 | Human | Rainbow | Q-gradient | |
| Amidar | 71529 | 5131 | Human | Rainbow | Q-gradient | |
| Assault | 8647 | 14497 | Machine | A3C | Policy-gradient | |
| Asterix | 1000000 | 428200 | Human | Rainbow | Q-gradient | |
| Asteroids | 57340 | 5093 | Human | A3C | Policy-gradient | \* |
| Atlantis | 10604840 | 2311815 | Human | PPO | Policy-gradient | |
| Bank Heist | 45899 | 1611 | Human | Dueling DDQN | Q-gradient | |
| Battlezone | 98000 | 62010 | Human | Rainbow | Q-gradient | |
| Beamrider | 52866 | 26172 | Human | Prioritized DDQN | Q-gradient | 1B |
| Berzerk | 1057940 | 2545 | Human | Rainbow | Q-gradient | |
| Bowling | 279 | 135 | Human | HyperNEAT | Genetic Policy | J |
| Boxing | 99 | 99 | Draw | Rainbow, ACER | Q-, Policy-gradient | |
| Breakout | 864 | 766 | Human | A3C | Policy-gradient | |
| Centipede | 453916 | 25275 | Human | HyperNEAT | Genetic Policy | |
| Chopper Command | 999999 | 16654 | Human | Rainbow | Q-gradient | |
| Crazy Climber | 219900 | 183135 | Human | Prioritized DDQN | Q-gradient | |
| Defender | 5443150 | 233021 | Human | A3C | Policy-gradient | N |
| Demon Attack | 100100 | 115201 | Machine | A3C | Policy-gradient | + |
| Enduro | 1666 | 2260 | Machine | Distribution DQN | Q-gradient | |
| Fishing Derby | 51 | 46 | Human | Dueling DDQN | Q-gradient | |
| Freeway | 38 | 34 | Human | Rainbow | Q-gradient | 1B |
| Frostbite | 248460 | 9590 | Human | Rainbow | Q-gradient | |
| Gopher | 30240 | 70354 | Machine | Rainbow | Q-gradient | |
| Gravitar | 39100 | 1419 | Human | Rainbow | Q-gradient | |
| HERO | 257310 | 55887 | Human | Rainbow | Q-gradient | J |
| Ice Hockey | 25 | 10 | Human | HyperNEAT | Genetic Policy | |
| Kangaroo | 1424600 | 14854 | Human | Dueling DDQN | Q-gradient | N |
| Krull | 104100 | 12601 | Human | HyperNEAT | Genetic Policy | N |
| Kung Fu Master | 79360 | 52181 | Human | Rainbow | Q-gradient | |
| Montezumas Revenge | 400000 | 384 | Human | Rainbow | Q-gradient | |
| Ms Pacman | 211480 | 6283 | Human | Dueling DDQN | Q-gradient | J |
| Name This Game | 21210 | 13439 | Human | Prioritized DDQN | Q-gradient | |
| Phoenix | 251180 | 108528 | Human | Rainbow | Q-gradient | |
| Pitfall | 114000 | 0 | Human | Several | Q-gradient | |
| Pong | 21 | 21 | Draw | Several | Several | E |
| Private Eye | 101800 | 15172 | Human | Distribution DQN | Q-gradient | \*\* |
| Qbert | 2400000 | 33817 | Human | Rainbow | Q-gradient | N |
| Road Runner | 210200 | 73949 | Human | A3C | Policy-gradient | |
| Robot Tank | 68 | 65 | Human | Dueling DDQN | Q-gradient | |
| Seaquest | 294940 | 50254 | Human | Dueling DDQN | Q-gradient | |
| Skiing | -3272 | -6522 | Human | Vanilla GA | Genetic Policy | |
| Space Invaders | 43710 | 23864 | Human | A3C | Policy-gradient | 1B |
| Star Gunner | 77400 | 164766 | Machine | A3C | Policy-gradient | N |
| Time Pilot | 34400 | 27202 | Human | A3C | Policy-gradient | |
| Tutankham | 2026 | 280 | Human | ACER | Policy-gradient | |
| Venture | 38900 | 1107 | Human | Distribution DQN | Q-gradient | N |
| Video Pinball | 3523988 | 533936 | Human | Rainbow | Q-gradient | 1B |
| Wizard of Wor | 129500 | 18082 | Human | A3C | Policy-gradient | |
| Yars Revenge | 2011099 | 102557 | Human | Rainbow | Q-gradient | ++ |
| Zaxxon | 83700 | 24622 | Human | A3C | Policy-gradient | |
- **N**: NTSC; no emulator results available
- **J**: Score from jvgs.net
- **E**: Game is so easy there's no world record category
- **1B**: Game 1, Difficulty B
- **\***: Game 6, Difficulty B
- **+**: Game 7, Difficulty B
- **\*\***: Game 1, Points
- **++**: Game 2, Difficulty A

## What's the point of this?

I decided to put this together after noticing two trends in reinforcement learning papers:

- Not comparing to the state of the art.
- Comparing an algorithm with thousands of hours of playtime to a human who played for a few hours.

Respectively, these make it hard to see the relative progress of the field from paper to paper, and the absolute progress compared to human level game playing.

Though RL papers routinely quote >100% normalized human performance, the reality is that machine learning algorithms just barely beat humans on only 5 out of 49 games here, and humans have a substantial lead in the rest. We have a long way to go.
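The ">100% normalized human performance" figure usually refers to the human-normalized score used in the DQN line of papers, which measures an agent against a professional tester's baseline (a few hours of play), not a world record. A minimal sketch, with illustrative numbers that are not taken from the table above:

```python
def human_normalized(agent, random_play, human_baseline):
    """Human-normalized score as used in DQN-style papers:
    0.0 = random play, 1.0 (i.e. 100%) = the paper's human baseline."""
    return (agent - random_play) / (human_baseline - random_play)

# Illustrative numbers only: an agent scoring 900 where random play
# scores 100 and the paper's human tester scored 500 reports 200%
# "human performance", even if the world record is far higher.
print(human_normalized(900, 100, 500))  # 2.0
```

This is why a paper can honestly report superhuman performance on a game whose world record is orders of magnitude above the machine score.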

## Performance Among Machines

When we exclude human scores, the per-algorithm win counts are as follows (two-way ties credit both algorithms; ties of three or more credit none):

| Algorithm | Type | Wins |
|---|---|---|
| Rainbow | Q-gradient | 18 |
| A3C (FF and LSTM) | Policy-gradient | 11 |
| Dueling DDQN | Q-gradient | 6 |
| HyperNEAT | Genetic Policy | 4 |
| Distribution DQN | Q-gradient | 3 |
| Prioritized DDQN | Q-gradient | 3 |
| ACER | Policy-gradient | 2 |
| PPO | Policy-gradient | 1 |
| Vanilla GA | Genetic Policy | 1 |
| Noisy DQN | Q-gradient | 0 |
| Vanilla ES | Genetic Policy | 0 |
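The tie rule can be sketched as a small tally. The games and score holders below are a hypothetical miniature, not the full leaderboard:

```python
from collections import Counter

# Hypothetical miniature of the "Best Machine" column:
# game -> list of algorithms tied for the top machine score.
results = {
    "Boxing": ["Rainbow", "ACER"],   # two-way tie: both are credited
    "Pong": ["DQN", "A3C", "PPO"],   # three-way tie: none are credited
    "Assault": ["A3C"],              # outright win
}

wins = Counter()
for game, holders in results.items():
    if len(holders) <= 2:            # outright win or "friendly" two-way tie
        wins.update(holders)

print(dict(wins))  # {'Rainbow': 1, 'ACER': 1, 'A3C': 1}
```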

## Methodology

### Human Scores

Since the ALE is built on the Stella Atari emulator, the Top Human Score is the top human score achieved on an emulator. Atari (and other game) releases tend to vary across regions, so this is the only way to ensure that both human and machine have, for example, equal access to game-breaking bugs.

Where possible, scores are taken from Twin Galaxies, the source Guinness uses for video game world records; otherwise, links are provided to score sources.

### Machine Scores

A valid machine score is one achieved by a reinforcement learning algorithm trained directly on pixels and raw rewards, i.e. one that can be trained against common ALE wrappers/forks such as gym or xitari. This means that algorithms, like this one, which use hand-engineered intermediate rewards do not qualify.
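The qualifying interface can be sketched as follows. `MockALE` is a hypothetical stand-in for an ALE wrapper such as gym, not a real emulator binding; the point is only that the agent's whole world is raw RGB frames and raw score deltas, with no shaped rewards in between:

```python
import numpy as np

class MockALE:
    """Hypothetical stand-in for an ALE wrapper (e.g. gym or xitari)."""
    SCREEN_SHAPE = (210, 160, 3)     # native Atari 2600 frame size in ALE

    def reset(self):
        return np.zeros(self.SCREEN_SHAPE, dtype=np.uint8)  # raw RGB frame

    def step(self, action):
        obs = np.zeros(self.SCREEN_SHAPE, dtype=np.uint8)
        reward = 1.0                 # raw game-score delta, no shaping
        done = False
        return obs, reward, done

env = MockALE()
frame = env.reset()
frame, reward, done = env.step(0)
print(frame.shape, reward)           # (210, 160, 3) 1.0
```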

Reference papers vary in:

- Start type (no-op, random-op, human-op)
- Number of test trials (from 30 to 200)

I take the approach here of favouring no-op starts over random ones (they usually have higher scores anyway), and treating all sample sizes equally.
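The no-op start protocol can be sketched like this; `evaluate`, `ToyEnv`, and the episode length are all illustrative, not taken from any reference paper:

```python
import random

NOOP_ACTION = 0  # by ALE convention, action 0 is the no-op

def evaluate(env, policy, trials=30, max_noops=30):
    """Mean score over test episodes, each beginning with a random number
    of no-op actions before the agent takes control (episodes are assumed
    to outlast the no-op burn-in)."""
    scores = []
    for _ in range(trials):
        obs = env.reset()
        for _ in range(random.randint(1, max_noops)):
            obs, _, _ = env.step(NOOP_ACTION)   # burn-in, reward discarded
        total, done = 0.0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        scores.append(total)
    return sum(scores) / len(scores)

class ToyEnv:
    """Hypothetical 3-step environment, only here to exercise the protocol."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 3   # obs, raw reward, done

# With max_noops=1 the burn-in is always one step, so every episode
# leaves exactly two rewarded agent steps.
print(evaluate(ToyEnv(), policy=lambda obs: 0, trials=5, max_noops=1))  # 2.0
```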

## References
