# 1. REINFORCE algorithm

### What is REINFORCE?

The REINFORCE algorithm belongs to a class of RL algorithms called **policy gradient** algorithms. In short, a policy can be thought of as a set of guidelines that tells the agent which action to take in each situation.

By implementing the REINFORCE algorithm, we learn such a policy. After each episode played by the agent, the policy is adjusted slightly, and this repeats until the environment is solved.

### Policy training

The policy is usually a neural network that takes a state as input and outputs a probability distribution over all possible actions.

To better understand this, let's look at the following figure.

![image](https://miro.medium.com/max/690/1*TmJQYg9v0Z3EWIJpBF61Ow.png)

In this case, within the **environment** (the Pacman game), we have a **state** that corresponds to the current situation of the agent (its position on the board, etc.). This state becomes the input to our **policy neural network**, which outputs probabilities across the **action space** (the possible movements).
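To make the state-to-probabilities mapping concrete, here is a minimal sketch in plain Python. It is not the network from the figure: the action names, the three-feature state, and the single linear layer are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action space


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state, weights):
    """Toy linear 'network': one weight row per action, dotted with the state."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return softmax(scores)


def sample_action(probs, rng):
    """Sample an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
state = [0.5, -1.0, 2.0]  # hypothetical 3-feature description of the board
weights = [[rng.uniform(-1, 1) for _ in state] for _ in ACTIONS]  # untrained

probs = policy(state, weights)
action = ACTIONS[sample_action(probs, rng)]
print(dict(zip(ACTIONS, probs)), "->", action)
```

The agent samples from the distribution instead of always picking the most likely action, which is what lets it keep exploring during training.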

As with all neural networks, the aim of training is to minimize or maximize a function that characterizes the performance of the model. In the case of REINFORCE, we are trying to maximize the **expected reward**, formalized below as the expected return.

### Mathematics

We have just discussed the basic idea behind training with the REINFORCE algorithm. Now let's look at the mathematical formulation of the problem.

#### Trajectory

In short, the trajectory $\tau$ can be thought of as a sequence of states, actions, and rewards - basically, a history of what the agent did in each state and what it received for it. To better visualize it, let's look at the following figure.

![image_1](https://miro.medium.com/max/754/1*wvJxhJbqPR4JmcjHFRCZrg.png)

Here we can see a representation of two different policies. For each policy, we have a sequence of states $s_0, s_1, s_2, s_3$ (positions, in this case), the actions that the agent takes at each state $a_0, a_1, a_2, a_3$, and the rewards the agent gets at the next step $r_1, r_2, r_3$. The trajectory can therefore be expressed in the following way.

$$
\tau=(s_0, a_0, r_1, s_1, a_1,...,a_T, r_{T+1}, s_{T+1})
$$

$T$ here marks the last timestep of the trajectory (for instance, the end of the game).
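In code, a trajectory is often stored as a list of per-step records. The sketch below is one possible layout, not a prescribed one: the `Step` type, the integer states and actions, and the reward values are all made up for illustration.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """One transition: (s_t, a_t, r_{t+1}, s_{t+1})."""
    state: int
    action: int
    reward: float  # r_{t+1}, received after taking `action` in `state`
    next_state: int


# Hypothetical trajectory with T = 2 (actions a_0, a_1, a_2)
trajectory = [
    Step(state=0, action=1, reward=0.0, next_state=1),
    Step(state=1, action=1, reward=0.0, next_state=2),
    Step(state=2, action=0, reward=1.0, next_state=3),
]

T = len(trajectory) - 1                         # index of the last action
states = [s.state for s in trajectory] + [trajectory[-1].next_state]
actions = [s.action for s in trajectory]        # a_0 ... a_T
rewards = [s.reward for s in trajectory]        # r_1 ... r_{T+1}
print(T, states, actions, rewards)
```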


#### Return

Quite intuitively, the return $R(\tau)$ summarizes the rewards that the agent receives along a specific trajectory. Here we keep one value per time step:
$$
R(\tau) = (G_0,G_1, ..., G_T)
$$
Here $G_k$ is called the discounted future return: the return we collect from time step $k$ to the end of the trajectory. To account for the decaying importance of rewards further away from the present step, we also introduce a discount factor $\gamma$ (a number between 0 and 1). At time step $k$, the discounted future return can be expressed as follows. The sum runs up to $T+1$ because the last reward in the trajectory above is $r_{T+1}$.
$$
G_k = \sum_{t=k+1}^{T+1}\gamma ^ {t-k-1}r_t
$$
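The discounted future returns for every time step can be computed in one backward pass over the rewards, using the recursion $G_k = r_{k+1} + \gamma G_{k+1}$ that follows from the definition. The sketch below assumes that `rewards[t]` stores $r_{t+1}$ (the reward received after the action at step $t$); the reward values and $\gamma = 0.5$ are arbitrary.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T], where rewards[t] holds r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_k reuses the already computed G_{k+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [0.0, 0.0, 1.0]  # hypothetical: only the final step is rewarded
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [0.25, 0.5, 1.0]
```

The further a step is from the final reward, the smaller its return, which is exactly the decay that $\gamma$ is meant to introduce.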

#### Expected return

Finally, we can define the expected return, which connects all of the previously described parts.
$$
J(\theta) = \sum_\tau P(\tau; \theta)R(\tau)
$$
- $P(\tau; \theta)$ is the probability of trajectory $\tau$ occurring under the policy with parameters $\theta$
- $R(\tau)$ is the return expressed as a function of the trajectory (in this sum, a single number per trajectory, such as $G_0$)

**Throughout the training process of the REINFORCE model, we aim to maximize the function $J(\theta)$.**
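For a problem small enough to enumerate, $J(\theta)$ can be computed directly from its definition. The sketch below is a toy setup and not part of the original tutorial: it assumes a two-action softmax policy that ignores the state, a deterministic environment (so the trajectory probability is just the product of the action probabilities), and a made-up return of +1 for each action 0 and -1 for each action 1.

```python
import itertools
import math


def action_probs(theta):
    """Softmax over one logit per action (state-independent toy policy)."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_return(actions):
    """Hypothetical R(tau): +1 for each action 0, -1 for each action 1."""
    return sum(1.0 if a == 0 else -1.0 for a in actions)


def expected_return(theta, horizon=2):
    """J(theta) = sum over all trajectories of P(tau; theta) * R(tau)."""
    probs = action_probs(theta)
    total = 0.0
    for actions in itertools.product(range(2), repeat=horizon):
        p_tau = math.prod(probs[a] for a in actions)  # P(tau; theta)
        total += p_tau * trajectory_return(actions)
    return total


J_uniform = expected_return([0.0, 0.0])  # both actions equally likely
J_better = expected_return([2.0, 0.0])   # policy that prefers action 0
print(J_uniform, J_better)
```

Shifting probability toward the action with the higher return raises $J(\theta)$. In realistic environments the sum over all trajectories is intractable, which is why it is estimated from sampled trajectories instead.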

But how is this done?

As mentioned at the start of the notebook, the agent in the environment is guided by a policy that tells it which action to take in each state.

Since a policy in this abstract form cannot be trained directly, we introduce a neural network that represents it. Consequently, the parameter $\theta$ stands for the weights of this neural network. These weights shape the policy, which in turn produces the trajectory and influences the returns. As we want to change the weights ($\theta$) to maximize the expected return ($J(\theta)$), we will use **gradient ascent**.


#### Gradient

The main difference between gradient descent and gradient ascent is that ascent moves in the direction of the gradient to maximize the function, while descent moves in the negative direction of the gradient to find a minimum.

In mathematical terms, the update of the weights is guided by the following equation.
$$
\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\theta)
$$

$\alpha$ here is the step size or the learning rate.
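The update rule is easiest to see on a one-dimensional toy objective. The function $f(\theta) = -(\theta - 3)^2$ below is an assumption picked only because its maximum, at $\theta = 3$, is known in advance; it stands in for $J(\theta)$.

```python
def grad_f(theta):
    """Gradient of the toy objective f(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


theta = 0.0   # initial parameter
alpha = 0.1   # step size (learning rate)

for _ in range(100):
    # Gradient ascent: step *along* the gradient to increase f
    theta = theta + alpha * grad_f(theta)

print(theta)  # very close to 3.0, the maximizer of f
```

Flipping the sign of the step (`theta - alpha * grad_f(theta)`) would turn this into gradient descent and move away from the maximum.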

After some mathematical derivations (which won't be discussed in this tutorial), we arrive at the following expression for the gradient of the expected return, estimated from a single sampled trajectory.
$$
\nabla_{\theta}J(\theta)\approx\sum_{t = 0}^T\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)G_t
$$

$\pi_\theta(a_t|s_t)$ represents the parametrized policy: the probability of taking action $a_t$ in state $s_t$.
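With a neural network, an automatic differentiation library would compute $\nabla_{\theta}\log\pi_{\theta}$. To keep the sum visible, the sketch below uses a simpler stand-in: a softmax policy with one logit per action that ignores the state. For that policy the gradient has a closed form, $\partial \log\pi(a) / \partial\theta_j = \mathbb{1}[j = a] - \pi(j)$. The logits, actions, and returns are made-up values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. each logit of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def policy_gradient_estimate(logits, actions, returns):
    """Sum over t of grad log pi(a_t) * G_t for one sampled trajectory."""
    grad = [0.0] * len(logits)
    for a_t, g_t in zip(actions, returns):
        for j, d in enumerate(grad_log_prob(logits, a_t)):
            grad[j] += d * g_t
    return grad


logits = [0.2, -0.1, 0.4]    # theta: one logit per action (hypothetical)
actions = [2, 0, 2]          # a_0, a_1, a_2 taken in the trajectory
returns = [0.25, 0.5, 1.0]   # G_0, G_1, G_2 for the same trajectory

grad = policy_gradient_estimate(logits, actions, returns)
print(grad)
```

Actions followed by larger returns contribute more to the gradient, so a gradient ascent step raises their logits the most.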

Let's say we are playing a game in which the agent receives a return of $G_t = +1$ when it wins and $G_t = -1$ when it loses. In this game, there is an optimal policy $\pi_\theta(a_t|s_t)$ that tells us which action $a_t$ should be taken in each state $s_t$. Since the policy outputs probabilities, we want to maximize the probability that the agent chooses this specific action. Therefore, we take the gradient of the $\log$ of that policy:

$$
\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)
$$

To take the rewards into account, we multiply this gradient by $G_t$, which gives the previously described formula. If the agent wins, the return is +1, so the network further increases the probability of the actions it took. If, on the other hand, the agent loses and the return is -1, the update moves in the opposite direction and decreases the probability of those actions.
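The win/lose example can be turned into a tiny end-to-end training loop. This is a sketch under several assumptions, not a full REINFORCE implementation: the game is a made-up one-step game (so each episode has a single action and $G_0 = \pm 1$), the policy is a two-logit softmax instead of a neural network, and the win probabilities, learning rate, and episode count are arbitrary.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def play(action, rng):
    """Hypothetical game: action 0 wins 80% of the time, action 1 only 20%."""
    win_prob = 0.8 if action == 0 else 0.2
    return 1.0 if rng.random() < win_prob else -1.0  # G = +1 win, -1 loss


def train(episodes=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action, both equally likely at start
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        g = play(action, rng)
        for j in range(len(theta)):
            # grad of log pi(action) w.r.t. theta_j for a softmax policy
            grad_log = (1.0 if j == action else 0.0) - probs[j]
            # Gradient ascent step, scaled by the return
            theta[j] += alpha * grad_log * g
    return softmax(theta)


final_probs = train()
print(final_probs)
```

A win pushes the chosen action's probability up and a loss pushes it down, so over many episodes the policy should come to favor action 0, the one that wins more often.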