# Part 8 - Tic Tac Toe with Policy Gradient Descent

Do you really need values if you have good policies?

The aim of this part is to try a slightly different Reinforcement Learning approach. Instead of learning the Q function and then basing our action policy on those Q values, we will try to learn a good policy directly.

## Value based learning vs policy based learning

### Value based learning recap

The general approach we have used in previous parts is a classical case of value based learning:

* We know that there is a Q function which can tell us exactly how good a certain move is in a certain state.
* We train a Neural Network to learn this Q function.
* When we have to make a move, we do so according to the following policy: We look at the Q values of all currently possible moves and choose the one with the highest Q value.
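
The last step above can be sketched in a few lines. This is only an illustration: the function name `greedy_move` and the Q values are made up for the example and are not taken from the earlier parts.

```python
def greedy_move(q_values, legal_moves):
    """Return the legal move with the highest Q value."""
    return max(legal_moves, key=lambda move: q_values[move])


# Hypothetical Q values for the 9 board cells; cells 0 and 4 are already taken.
q_values = [0.9, 0.1, 0.3, -0.2, 0.8, 0.5, 0.0, 0.2, -0.5]
legal_moves = [1, 2, 3, 5, 6, 7, 8]

print(greedy_move(q_values, legal_moves))  # 5
```

Note that the high Q values of the occupied cells 0 and 4 are ignored, since only currently possible moves are considered.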

This, in the end, seemed to work reasonably well. However, there are a few disadvantages to this approach:

1. We do not know the Q function when we start training. This means we are actually trying to do 2 things at the same time: 

    a. Learn the Q function
    
    b. Train the Neural Network to emulate the Q function
    
    This means we are not actually training our NN against the real Q function, but rather against our current best guess of what the Q function might be. If this *best guess* is not very accurate, the NN will not perform very well, no matter how well it can emulate it.
    
    
2. There are circumstances where, figuratively speaking, it is quite hard to determine how good a given situation is, but comparatively easy to determine what the best action is. In such cases, learning the Q function just to determine the action might be more complicated than necessary.

### Direct Policy Learning

The idea behind Direct Policy Learning based methods is to learn the best action policy directly. In our case of playing Tic Tac Toe, this could e.g. be done by repeatedly playing the game and *rewarding* moves that led to a positive outcome by increasing the probability that they are chosen, while *punishing* moves that led to a negative outcome by decreasing their probability.

In contrast to the Q learning approach, we do not care how the NN comes up with the policy. There may be a network layer which by chance computes something very similar to the Q function, but then again there may not. All we care about is that, in the end, the NN outputs action probabilities that reflect which of the currently available moves have the highest probability of a positive final outcome.


# Policy Gradient Descent

Let's start by establishing some fundamentals:

## Deterministic policy vs stochastic policy

A policy can either be deterministic or stochastic. A deterministic policy, $\pi(s)$, will tell us explicitly which action to take in a given situation. A stochastic policy, $\pi(s,a)$, in contrast will tell us for a given state and action with what probability we should take this action.
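
To make the difference concrete, here is a minimal sketch of both kinds of policy operating on the same action probabilities. The function names and the probabilities are hypothetical and only serve the illustration.

```python
import random


def deterministic_policy(action_probs):
    """pi(s): always return the single most probable action."""
    return max(range(len(action_probs)), key=lambda a: action_probs[a])


def stochastic_policy(action_probs, rng=random):
    """pi(s, a): sample an action according to its probability."""
    actions = list(range(len(action_probs)))
    return rng.choices(actions, weights=action_probs, k=1)[0]


probs = [0.1, 0.7, 0.2]  # made-up probabilities for three actions
print(deterministic_policy(probs))  # always 1

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [stochastic_policy(probs, rng) for _ in range(1000)]
print(samples.count(1) / len(samples))  # roughly 0.7
```

The stochastic version still prefers action 1, but it also tries the other two actions from time to time.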

Why would we ever need a stochastic policy? Shouldn't a deterministic policy be just fine, as it simply tells us what the best move is? Why would we ever take an action which is not the best?

It turns out that, in some scenarios, being able to choose based on probabilities is essential. One extreme case in this regard is the game Rock - Paper - Scissors. There is no good deterministic strategy for this game; it is absolutely essential that you choose your action stochastically to be successful.
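
The Rock - Paper - Scissors claim can be checked with a small calculation. The sketch below assumes a payoff of +1 for a win, 0 for a draw, and -1 for a loss, and an opponent who knows our strategy and always picks the best counter; both are assumptions made for this example.

```python
# PAYOFF[a][b]: our reward when we play a and the opponent plays b.
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def worst_case_payoff(strategy):
    """Expected payoff against an opponent who picks the best counter move."""
    return min(
        sum(strategy[a] * PAYOFF[a][b] for a in range(3))
        for b in range(3)
    )


always_rock = [1.0, 0.0, 0.0]   # a deterministic strategy
uniform = [1 / 3, 1 / 3, 1 / 3]  # a stochastic strategy

print(worst_case_payoff(always_rock))  # -1.0
print(worst_case_payoff(uniform) > worst_case_payoff(always_rock))  # True
```

Any deterministic strategy loses every game against such an opponent, while the uniform stochastic strategy breaks even on average.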

In our case, being stochastic has the benefit of allowing our NN to explore options while it learns the best policy. A deterministic policy is much more likely to get stuck in a local optimum.

For the rest of this part, we will focus on stochastic policies.

## Parameterized policies

Instead of talking about finding the best policy, we will actually talk about finding the best parameters $\phi$ for our policy $\pi$. That is, the policy is implemented by the NN and the parameters that drive the policy are the learnable weights in the network. We write this parameterized policy as $\pi_\phi$.
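
As a toy stand-in for the NN, the sketch below uses a single linear layer followed by a softmax. The names `policy`, `phi_1`, `phi_2` and the 2-feature state encoding are assumptions for this example only; the real network is larger, but the idea is the same: different parameters $\phi$ give different action probabilities for the same state.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(phi, state):
    """pi_phi(s, .): one probability per action, driven by the weights phi."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in phi]
    return softmax(scores)


state = [1.0, -1.0]  # hypothetical 2-feature state encoding
phi_1 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # all weights zero
phi_2 = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]

print(policy(phi_1, state))  # uniform over the three actions
print(policy(phi_2, state))  # action 0 is now the most probable
```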

## Quality of policies

We previously mentioned that we want to find a policy that is *good* without defining what exactly we mean by *good*. This is obviously a bit vague and we need to tie this down: We define the quality of a policy as the reward we can expect to receive if we take actions according to the policy. The *reward* is a single number: the higher, the better.

For a deterministic policy this would simply be the expected reward when taking the action recommended by the policy. In the stochastic case we need to multiply the expected reward of each currently possible action by the probability of taking this action according to the policy, and add all these products up.

We can then define a policy $\pi_{\phi_1}$ to be better than a policy $\pi_{\phi_2}$ if it has a higher expected reward. 
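
A small worked example of this comparison, using made-up numbers: three possible actions with hypothetical expected rewards, and two hypothetical policies that weight these actions differently.

```python
def policy_quality(action_probs, expected_rewards):
    """Expected reward: sum over actions of probability times expected reward."""
    return sum(p * r for p, r in zip(action_probs, expected_rewards))


# Hypothetical expected rewards of the three currently possible actions.
expected_rewards = [1.0, 0.5, 0.0]

policy_1 = [0.6, 0.3, 0.1]  # puts most weight on the best action
policy_2 = [0.2, 0.5, 0.3]

quality_1 = policy_quality(policy_1, expected_rewards)
quality_2 = policy_quality(policy_2, expected_rewards)

print(quality_1 > quality_2)  # True: policy_1 is the better policy
```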

## Finding the best policy

Having defined the quality of a policy and a way of comparing two policies, finding the best policy is now reduced to an optimization problem. All we need to do is find the policy gradient and follow it to its maximum.

So, how do we compute the policy gradient?

If you really want to know, the answer is rather mathematical and involved. You can find the answer [here](http://www.1-4-5.net/~dmm/ml/log_derivative_trick.pdf) or various other places on the internet.

Instead, let's just pretend that's where we fell asleep in the lecture and wake up again when the final result is presented. It turns out that the gradient of the expected reward can be estimated by the gradient of the log probability of each action taken, weighted by the reward that followed it. In practice this means we maximize:

$$ \sum_t{r_t \log(p(s_t,a_t))} $$

All of these things we have available: The probability $p(s_t,a_t)$ is what our NN will output and we can sample the average reward $r$ by repeated play and defining the reward at step $t$ as a discounted version of the final game outcome / reward similar to how we did this in Q learning.
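
As a rough sketch of these two ingredients in plain Python: the discount factor of 0.95 and the exact discounting scheme below are assumptions for illustration, not necessarily what the agent uses.

```python
import math


def discounted_rewards(final_reward, num_moves, discount=0.95):
    """Reward for move t: the final outcome, discounted once per move
    that still lies between move t and the end of the game."""
    return [final_reward * discount ** (num_moves - 1 - t) for t in range(num_moves)]


def objective(rewards, chosen_probs):
    """Sum over t of r_t * log(p(s_t, a_t)) for the moves actually played."""
    return sum(r * math.log(p) for r, p in zip(rewards, chosen_probs))


rewards = discounted_rewards(final_reward=1.0, num_moves=3, discount=0.5)
print(rewards)  # [0.25, 0.5, 1.0]

# After a win, giving the played moves a higher probability raises the objective.
print(objective(rewards, [0.9, 0.9, 0.9]) > objective(rewards, [0.5, 0.5, 0.5]))  # True
```

With a negative final reward the effect is reversed: the objective increases when the probabilities of the played moves go down.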

This means we have all we need to implement this.

# Policy Gradient in TensorFlow

We can recycle most of the code we used to build the NN from our Q-learning efforts. You can find the complete code in the class [DirectPolicyAgent.py](https://github.com/fcarsten/tic-tac-toe/blob/master/tic_tac_toe/DirectPolicyAgent.py). The only parts we need to change are:

* The loss function.
* What we feed into the loss function. In particular we will feed sample rewards instead of Q value estimates.
* A small change to the reward function.

## The new loss function

Let's start with the loss function. We now use

```Python
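    # Discounted reward observed for each sampled move, and the index of the move played
    self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)
    self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)

    # Position of each chosen action's probability in the flattened output
    self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
    self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)

    # Negative reward-weighted log probability; 1e-9 guards against log(0)
    self.loss = - tf.reduce_mean(tf.log(self.responsible_outputs + 1e-9) * self.reward_holder)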
```

Apart from the output of the NN in `self.output`, we input which action was chosen, `self.action_holder`, and the discounted reward we observed for that action, `self.reward_holder`.

We also add a small constant, `1e-9`, to the output probabilities to keep the logarithm from diverging towards negative infinity when a probability is close to 0.

Finally, we multiply the loss by -1 to account for the fact that the optimizer tries to minimize the function, whereas the previous section requires us to maximize it.
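
To see what the index arithmetic in the loss does, here is a plain Python re-statement of the same computation. It is only meant as a reading aid: the function name `policy_gradient_loss` and the sample batch are made up, and the lists stand in for the tensors.

```python
import math


def policy_gradient_loss(output, actions, rewards, eps=1e-9):
    """Plain Python version of the loss for a batch of sampled moves."""
    num_actions = len(output[0])
    # Equivalent of tf.reshape(output, [-1]): one long list of probabilities.
    flat = [p for row in output for p in row]
    # Row i starts at i * num_actions; add the chosen action to reach its entry.
    indexes = [i * num_actions + a for i, a in enumerate(actions)]
    # Equivalent of tf.gather: probability the network gave to each chosen move.
    responsible = [flat[i] for i in indexes]
    weighted = [math.log(p + eps) * r for p, r in zip(responsible, rewards)]
    return -sum(weighted) / len(weighted)


# Hypothetical batch: two states with two possible actions each.
output = [[0.2, 0.8], [0.6, 0.4]]
actions = [1, 0]       # the moves that were actually played
rewards = [1.0, 1.0]   # their discounted rewards

loss = policy_gradient_loss(output, actions, rewards)
print(loss > 0)  # True: rewarded moves with probability below 1 leave room to improve
```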


