# Tabular Q Learning, a Tic Tac Toe player that gets better and better #

Time for our first computer player that actually uses a machine learning technique to play. The machine learning approach we will use is called *Reinforcement Learning*, and the particular variant is called *Tabular Q Learning*. In the following we will introduce all 3 concepts, *Reinforcement Learning*, *Q function*, and *Tabular*, and then put them together to create a Tabular Q-Learning Tic Tac Toe player.

## What is Reinforcement Learning?

The basic idea behind reinforcement learning is quite simple: Learn by doing and observing what happens. This concept is motivated by how humans seem to learn in many cases: We do something and if it had positive consequences we are likely to do it again, whereas if it had negative consequences we are generally less likely to do it again.

At each step the player (also called the agent) takes a good look at the current environment and makes an "educated guess" about what it thinks is the best action. After executing an action, it learns how that action has changed the environment and also receives feedback on how good or bad the action was in that particular situation: if it chose a good action it gets a positive reward, if it chose a bad action it gets a negative reward. Based on that feedback, it can then adjust its "educated guessing process". When it encounters the same situation again in the future, it should hopefully be more likely to choose the same action again if the feedback was positive, and more likely to choose a different action if it was negative.

The following graphic illustrates this process:
![Title](./Images/Reinforcement.PNG)
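
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the player we will build later: the environment `CoinGuessEnvironment`, the agent `GuessingAgent`, and all method names are hypothetical, and the toy environment deliberately has only a single situation, so the agent merely has to learn which of two actions is rewarded.

```python
import random


class CoinGuessEnvironment:
    """Hypothetical toy environment: one fixed side of a coin is the 'good' action."""

    def __init__(self, hidden_side="heads"):
        self.hidden_side = hidden_side

    def step(self, action):
        # Feedback: +1 for a good action, -1 for a bad one.
        return 1.0 if action == self.hidden_side else -1.0


class GuessingAgent:
    """Hypothetical agent that keeps a running score per action."""

    def __init__(self, actions, seed=0):
        self.scores = {action: 0.0 for action in actions}
        self.rng = random.Random(seed)

    def choose_action(self):
        # "Educated guess": pick the action with the best score so far,
        # breaking ties at random.
        best = max(self.scores.values())
        candidates = [a for a, s in self.scores.items() if s == best]
        return self.rng.choice(candidates)

    def learn(self, action, reward):
        # Adjust the guessing process based on the feedback received.
        self.scores[action] += reward


def run(rounds=20):
    env = CoinGuessEnvironment()
    agent = GuessingAgent(["heads", "tails"])
    history = []
    for _ in range(rounds):
        action = agent.choose_action()   # observe and guess
        reward = env.step(action)        # act and receive feedback
        agent.learn(action, reward)      # adjust
        history.append(action)
    return agent, history
```

After at most two rounds this agent settles on the rewarded action: a wrong first guess is scored down, and the first correct guess is scored up and then preferred from there on.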

In many situations, the feedback the agent gets after choosing one action will not be very strong or conclusive. E.g. when making a move at the beginning of a game of Tic Tac Toe, it is unclear whether the chosen action was good or bad, so the reward is neutral and the feedback is of very limited use. At the end of the game, however, the agent gets very strong feedback: it won, it lost, or the game was a draw. One fundamental concept of *Reinforcement Learning* is to attribute that final reward back to the previous actions as well, to reflect the fact that they contributed to the ultimate success or failure in some way. Based on the assumption that the further in the past an action happened, the less likely it is to have been instrumental in the final outcome, we discount the final reward at each step back in time.

Time for an example: Let's assume we have played a game of Tic Tac Toe, choosing moves based on our current best guesses, and won. None of our moves got a positive or negative reward apart from the very last one, which got a 1 as it won the game. Now, it is more than likely that the victory was not only due to the quality of the last move, but also because at least some of our previous moves were "good" moves. The further in the past a move was made, the less likely it is that it contributed significantly to the outcome. So, we might give the second-to-last move a discounted reward of 0.8, the one before that a reward of 0.5, the one before that a reward of 0.1, etc.
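
The numbers in this example are purely illustrative. One common way to make the idea concrete, and an assumption on my part at this point in the tutorial, is to multiply the reward by a constant discount factor for every step back in time. The function name `discounted_rewards` and its parameters are hypothetical:

```python
def discounted_rewards(num_moves, final_reward, discount):
    """Return the reward credited to each move, first move first.

    Assumes a constant discount factor per step back in time.
    """
    rewards = []
    credit = final_reward
    for _ in range(num_moves):
        rewards.append(credit)   # credit for the move currently considered
        credit *= discount       # one step further back gets less credit
    rewards.reverse()            # we walked backwards, so flip the order
    return rewards


# A won game (final reward 1) with four of our own moves, discount 0.5:
print(discounted_rewards(num_moves=4, final_reward=1.0, discount=0.5))
# prints [0.125, 0.25, 0.5, 1.0]
```

The last move keeps the full reward, and each earlier move receives half the credit of the move that followed it.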

The theory goes that, over many games, moves that are particularly strong and important at the beginning of the game will still end up with a very high likelihood of being chosen: they receive positive rewards over and over again, and these add up over time even though each individual reward is heavily discounted. If I remember correctly, this can be proven to be true given certain constraints on the set-up and the number of iterations.

A more detailed introduction to Reinforcement Learning can be found on [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning), and the ultimate, authoritative introduction is probably the book [Reinforcement Learning - An Introduction](https://mitpress.mit.edu/books/reinforcement-learning) by Richard S. Sutton and Andrew G. Barto. Draft versions of this book seem to be easy to find and download on the internet, but as I'm not sure about the legality of those I will not provide links here. In the January 2018 draft version, the tabular Q-learning approach from this tutorial can be found in part 1, chapter 6.5 ("Part 1: Tabular Solution Methods -> 6 Temporal Difference Learning -> 6.5 Q-Learning: Off-policy TD Control").

## What is the Q function?

Moving on to the next concept we need to understand: the Q function. The Q stands for *Quality*, and the Q function is the function that assigns a quality score to a state-action pair. I.e., given a state $S$ and an action $A$, the function returns a real number $Q(S,A) \in \mathbb{R}$ reflecting the quality of doing action $A$ in state $S$.

We will assume that this function exists and is deterministic, but its concrete mapping of states and actions to quality values is not necessarily known to us. Q-learning is all about learning this mapping and thus the function Q.

If you remember our previous tutorial about Min-Max, you might have noticed that the values we computed for each state there seem very similar to the values the Q function returns. And indeed they are. In Min-Max we assigned quality values to states, whereas here we assign quality values to actions that are potentially chosen in a given state. In a deterministic environment like Tic Tac Toe, where performing a certain action in a given state will always result in the same next state, the value of that action in that state is indeed identical to the value of the following state.
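
To make this relationship concrete, here is a minimal sketch. Everything in it is hypothetical: the state labels stand in for real board positions, and the state values are made-up numbers of the kind a Min-Max-style evaluation might produce.

```python
# Deterministic transitions: (state, action) -> next state.
# The labels are placeholders, not real Tic Tac Toe boards.
NEXT_STATE = {
    ("start", "left"): "winning_position",
    ("start", "right"): "losing_position",
}

# Made-up state values, as a state evaluation might assign them.
STATE_VALUE = {
    "winning_position": 1.0,
    "losing_position": -1.0,
}


def q_value(state, action):
    """Value of an action = value of the single state it always leads to."""
    return STATE_VALUE[NEXT_STATE[(state, action)]]


def best_action(state, actions):
    """Pick the action with the highest quality value in this state."""
    return max(actions, key=lambda action: q_value(state, action))


print(q_value("start", "left"))                  # prints 1.0
print(best_action("start", ["left", "right"]))   # prints left
```

Because each action leads to exactly one next state, looking up the quality of an action is just a lookup of the value of that next state.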

Side Note: In other environments, this relationship can be a bit more complicated. In particular, in environments where an action can probabilistically result in any one of a number of different next states, the value of the action is not tied that directly to a particular next state. E.g. you have a pudding which, for all intents and purposes, looks fantastic. However, as the saying goes, you can't tell how good it is before eating it. So the action of eating the pudding can result in different outcomes: delight, meh, disgust, ..., each with its own probability. The quality of the action of eating the pudding is thus a much more complicated function of the possible outcomes and their probabilities.
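
One common way to combine outcomes and probabilities into a single quality value is the probability-weighted average (the expected value). The sketch below assumes that choice; the outcome values and probabilities are made up for illustration, and the names `PUDDING_OUTCOMES` and `expected_quality` are hypothetical.

```python
# Hypothetical outcomes of eating the pudding: name -> (probability, value).
# All numbers are made up for illustration.
PUDDING_OUTCOMES = {
    "delight": (0.5, 1.0),
    "meh": (0.25, 0.0),
    "disgust": (0.25, -1.0),
}


def expected_quality(outcomes):
    """Probability-weighted average of the outcome values."""
    total_probability = sum(p for p, _ in outcomes.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * value for p, value in outcomes.values())


print(expected_quality(PUDDING_OUTCOMES))  # prints 0.25
```

With these numbers, eating the pudding comes out slightly positive on average, even though a single bite may still end in disgust.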

