
deepq

Teaching an AI to play the game peg solitaire using deep Q reinforcement learning.

The game is simple. There is a board with spots arranged in a large cross. At the beginning, every spot but one has a marble in it. The aim is to remove all but one marble from the board. This is done by moving marbles over each other, thereby removing the one that was jumped over. More info here: https://www.wikiwand.com/en/Peg_solitaire.

Here's a sneak peek of a trained AI playing a smaller version of the game, to give an idea of how it looks:

[GIF: a trained AI playing a smaller version of the game]

If you want to play the game yourself, it's implemented in pygame: clone the repository locally and run the file play/ROGame.py.

The name used in the files, Remain One, is a direct translation of the game's Portuguese name. It started as a joke and would now be rather impractical to change, so it stays.

(I'm cleaning up code and running the learning routine on the full-sized version of the game. Expect full updates before the end of 2020, likely around Christmas.)

Reinforcement Learning and Deep Q Learning

Traditional supervised learning is useful for teaching an AI to make decisions in scenarios where there is a correct decision we already know and want to emulate, e.g. classifying images or estimating values. Getting an AI to make decisions in a situation where there is no ready repository of known decisions to learn from requires a whole different learning paradigm. This is the case when learning how to play a game.

We'll depict the game we want to play as a black box that articulates interactions between an environment (the game) and an actor (the player). The environment has a given state (in our case, the distribution of marbles on the board) and the actor chooses an action (a move) based on this state. This action both changes the state of the environment (removes a marble and changes the location of another) and, for the purposes of the algorithm, elicits a reward. We code this reward so that it depends on both the action and the state: the same action can be worth more in some states than in others. In our case, the reward is just inversely proportional to the number of marbles on the board, so that removing a marble is worth more later in the game, when there are fewer remaining.

The game goes on for many such state-action-new state periods until it is either won or lost (no legal moves remain). In chess, for example, the state is the board configuration (positions of the pieces) and an action is a movement of a piece; this movement changes the board configuration (state) and may or may not elicit a reward (points for capturing an opposing piece).
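
To make this loop concrete, here is a minimal sketch of one game played against such a black box. The names `env`, `actor`, `reset`, `step` and `choose` are hypothetical, chosen for illustration rather than taken from this repository.

```python
# Minimal sketch of the state -> action -> reward loop described above.
# `env` and `actor` are hypothetical objects, not names from this repository.

def play_one_game(env, actor):
    state = env.reset()                         # initial board: every spot full but one
    done = False
    total_reward = 0.0
    while not done:
        action = actor.choose(state)            # a legal jump, picked by the player
        state, reward, done = env.step(action)  # jumped marble removed, board updated
        total_reward += reward                  # reward ~ 1 / marbles left, as above
    return total_reward
```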

So at each turn, the actor should choose the action it thinks will elicit the most reward in the long run, considering the future states that result from the current action. To mimic human behaviour, we have the actor value the present more than the future (i.e. discount the future). That is, if r(s, a) is the reward from acting on state s with action a, the actor wants to maximise

    r(s_0, a_0) + γ r(s_1, a_1) + γ^2 r(s_2, a_2) + ... = Σ_{t≥0} γ^t r(s_t, a_t)

where s_0 is the initial state of the game, s_{t+1} is the state that results from acting on state s_t with action a_t, and γ (a number between 0 and 1) is a discount factor.

The actor will have a brain, which will be a neural network mapping states to values, to make this decision. However, instead of having the network predict the entire sequence of discounted rewards for each action in the current state and choosing the action with the highest total, it will instead predict the Q-value of action a on state s, defined as the maximum value you can obtain after acting on state s with a: you get today's reward plus the rewards corresponding to acting optimally in the future, starting with next period's state s'. I.e.:

    Q(s, a) = r(s, a) + γ max_{a'} Q(s', a')

Notice that max_a Q(s_0, a) is exactly the maximised discounted sum above, which is what we wanted to maximise in the first place. We can thus pick actions by running the network on a given state and choosing the action with the highest Q(s, a).
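
In code, picking an action this way could look like the sketch below. The `network.predict` call (returning one Q-value per possible move) and the mask of legal moves are assumptions for illustration, not this repository's actual interface.

```python
import numpy as np

def greedy_action(network, state, legal_mask):
    """Pick the action with the highest predicted Q(s, a) among legal moves.
    `network.predict` is assumed to return one Q-value per possible action."""
    q_values = network.predict(state[None])[0]          # shape: (n_actions,)
    q_values = np.where(legal_mask, q_values, -np.inf)  # never pick an illegal move
    return int(np.argmax(q_values))
```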

So we'll let our player play the game, with random behaviour at first so it can build a 'memory' of diversified past experience to learn from, and train a neural network (the actor's 'brain') on this memory of past states, actions and rewards. The network should predict the Q-value of a given state for each possible action, so the actor can just pick the action with highest predicted Q-value. After each turn, we re-train the network on the new memory example and repeat. Over iterations, the network's estimate of Q(s, a) should approach the true Q(s, a), leading to better and better decisions as the actor plays the game.
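
The 'memory' here is just a buffer of (state, action, reward, new state, done) tuples that gets sampled for training. A minimal sketch, with names chosen for illustration rather than taken from the repo:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) tuples and
    returns random samples to train the network's Q-value estimates on."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```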

Some issues

There are a few aspects of peg solitaire that make it tricky to teach an AI using reinforcement learning.

The first is that to win the game, we have to play a perfect game, not just a good one. The algorithm is trained to maximise the discounted sum of rewards, so if, over the course of learning, the actor doesn't get to 'experience' many perfect games, it may well learn to attain a high number of points and stop there, content. There are a few things I implemented to counteract this (a sketch of the first and third follows the list):

  • I made the rewards increase convexly as the number of marbles left at the end of each turn falls. The idea is that by valuing the removal of the last marble so much more than the removal of a marble near the beginning or middle of the game, the actor will weigh its memories of playing long games more heavily when learning
  • I set the discount rate high (close to 1) so that it values points it will get from future actions in the same game almost as much as current ones
  • I made sure the actor plays a LOT of fully random games at first. Since it won't know how to play the game, if we let the actor choose its own actions from the start, it will never actually play a full game and so won't have any full games in its memory to learn from. Slowly, as it learns, the actor plays more and more games non-randomly, and hopefully this gradual shift helps get more full games into its memory
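
Here is a sketch of the first and third points. The exact constants and the exponent are illustrative guesses, not the values used in this repository.

```python
def reward(marbles_left, total_marbles=32):
    """Convex reward: removing a marble near the end of the game is worth
    far more than removing one near the beginning (exponent is illustrative)."""
    return ((total_marbles - marbles_left) / total_marbles) ** 3

def exploration_rate(game_index, start=1.0, end=0.05, decay_games=5000):
    """Probability of acting randomly: starts at 1 (fully random games) and
    slowly falls, so the actor gradually relies on its own network instead."""
    frac = min(game_index / decay_games, 1.0)
    return start + frac * (end - start)
```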

A second issue is that the game is fully deterministic, i.e. there's a risk the actor will just memorise one specific sequence of moves that beats the game and repeat it, without actually learning anything about the game. Luckily, we can play this game in different ways, depending on which spot we leave empty before the game starts (in the gif above it's the centre spot, but there are other spots that work just as well). To counteract this, I make sure to train the actor only on games that start with the centre spot empty, but when I track its progress I make it play games that start with different spots empty. That way, if we see progress, it's because the actor is learning what sorts of moves are helpful, not just memorising one sequence of moves that works!
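
Concretely, the progress check can be as simple as playing one fully greedy game from each alternative starting position; everything named below (`make_game`, `best_action`, `marbles_left`) is hypothetical and only meant to illustrate the idea.

```python
def evaluate(actor, start_spots, make_game):
    """Play one fully non-random game from each starting position and record
    how many marbles were left, so measured progress reflects general skill
    rather than one memorised sequence of moves."""
    results = {}
    for spot in start_spots:
        env = make_game(empty_spot=spot)       # hypothetical game constructor
        state, done = env.reset(), False
        while not done:
            action = actor.best_action(state)  # greedy: no random exploration
            state, _, done = env.step(action)
        results[spot] = env.marbles_left()
    return results
```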

The Algorithm

  1. Initialise the class instances:

    a) The game (a class which provides a state, takes in an action and returns a reward);

    b) An actor with a neural network that can interact with the above game;

  2. Run game a few times. For each game, repeat turns until the game is won or lost. In each turn:

    a) Run a lottery to see if action will be random or if we'll allow the actor to pick the best action based on its network. This lottery will be initially biased towards random action and progressively loosened.

    b) If random, pick a random action. If we'll let the actor pick the move,

    • run network on current state;

    • pick action with highest predicted Q-value;

    c) Store the old state, the chosen action, the reward and the new state, along with an indicator c for whether the game ended, in memory: (s, a, r, s', c);

  3. At the end of a set of games, train the network on a sample of the memory. In batches made from this sample:

    a) Run network on the batch of s and s', to get sets of Q-values for all actions in each state: {Q(s, a)} and {Q(s', a)};

    b) Update the network's prediction of {Q(s, a)} using its prediction on s' and the observed reward of acting a on s: set Q(s, a) = r + γ max_{a'} Q(s', a'), or just Q(s, a) = r if the game ended on that turn (a sketch of this update follows the list);

    c) Train network using s as the input and the above updated prediction for {Q(s, a)} as the output, forcing the network to continuously update its own predictions;

  4. Go back to step 2. Every few repetitions of steps 2 and 3, track the actor's progress by making it play a game fully non-randomly and storing the number of turns it lasted. This is just step 2 with non-random actions.

  5. Repeat 2, 3 and 4 for as many epochs as it takes for the actor to learn the game.
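
Step 3 amounts to building Q-learning targets from the sampled memory and fitting the network on them. A minimal sketch, assuming a Keras-style `predict`/`fit` interface and integer action indices (assumptions for illustration, not necessarily what this repository does):

```python
import numpy as np

GAMMA = 0.95  # discount factor; the value here is illustrative

def train_on_batch(network, batch):
    """batch is a list of (s, a, r, s', c) tuples sampled from memory."""
    states      = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])

    q      = network.predict(states)       # current {Q(s, a)} for every action (step 3a)
    q_next = network.predict(next_states)  # {Q(s', a)} used to build the targets (step 3a)

    for i, (_, action, reward, _, ended) in enumerate(batch):
        if ended:
            q[i, action] = reward                              # no future after the last move
        else:
            q[i, action] = reward + GAMMA * np.max(q_next[i])  # Bellman target (step 3b)

    network.fit(states, q, verbose=0)       # fit on the updated targets (step 3c)
```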

The result should be an actor whose network properly predicts Q-values given states as inputs, and which can thus play the game by picking the action with the highest predicted Q-value at each turn.

Here's a plot of the actor learning to play the game shown in the gif above:

[Plot: MRO4 progress]

Tests

Tests to run: n-step Q-learning; varying the learning rate, momentum rate and regularisation rate; etc.
