**Learning parametric policies**

Instead of focusing on learning $Q^*$, can we directly learn $\pi^*$?

Suppose we have a policy $\pi_\theta$ parameterized by a vector $\theta$. Our goal is to find the parameter(s) $\theta^*$ corresponding to $\pi^*$.

Remark: $\pi_\theta$ might not be able to represent $\pi^*$. As a shortcut, we will still write $\pi^*$ for the best policy among the representable policies $\pi_\theta$.

Remark: for discrete state and action space, the tabular policy representation is a special case of policy parameterization.

For stochastic policies, we shall write $\pi_\theta(a|s)$.
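
As a minimal sketch of the two points above, a tabular softmax policy stores one preference per state-action pair in $\theta$ and turns each row into the distribution $\pi_\theta(\cdot|s)$. The softmax form, the function name `softmax_policy`, and the sizes used here are illustrative assumptions, not something fixed by the text.

```python
import numpy as np

def softmax_policy(theta, s):
    """Return pi_theta(.|s) for a tabular policy; theta has shape (n_states, n_actions)."""
    prefs = theta[s] - np.max(theta[s])  # shift by the max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

n_states, n_actions = 3, 2               # hypothetical problem sizes
theta = np.zeros((n_states, n_actions))  # one parameter per (state, action) pair

probs = softmax_policy(theta, 0)         # equal preferences give a uniform distribution
rng = np.random.default_rng(0)
a = rng.choice(n_actions, p=probs)       # draw an action from pi_theta(.|s=0)
```

With one free parameter per state-action pair, this parameterization can place any strictly positive probability on each action in each state, which is the sense in which the tabular case is a special case of policy parameterization.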

Remark: for problems with significant policy approximation, the best approximate policy (among the representable $\pi_\theta$) may very well be stochastic.

Even though a value function might be used during the learning phase, what we are really after is $\pi^*$. During execution, action selection is therefore a call to $\pi^*$ rather than a $Q$-greedy choice.

**Policy gradient methods**

Suppose now we define some performance metric $J(\pi_\theta) = J(\theta)$. If $J$ is differentiable and a stochastic estimate of its gradient $\nabla_\theta J(\theta)$ is available, then we can define the gradient ascent update procedure with step size $\alpha > 0$ (using the estimate in place of the exact gradient):
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
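
The update can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the objective $J(\theta) = -\|\theta - \theta_{\text{target}}\|^2$ is a hypothetical stand-in (not an RL return), and the stochastic estimate is simply the exact gradient plus Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])  # hypothetical maximizer of the toy objective

def noisy_grad(theta):
    """Stochastic estimate of the gradient of J(theta) = -||theta - target||^2."""
    exact = -2.0 * (theta - target)
    return exact + 0.1 * rng.standard_normal(theta.shape)  # additive noise

theta = np.zeros(2)
alpha = 0.05  # step size
for _ in range(500):
    theta = theta + alpha * noisy_grad(theta)  # theta <- theta + alpha * gradient estimate
```

With a constant step size, the iterates do not converge exactly but keep fluctuating in a small neighborhood of the maximizer; a decreasing step size is the usual way to remove that residual noise.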

We will call *policy gradient methods* all methods that follow such a procedure (whether or not they also learn a value function).

Remark: note that $J$ is a more general criterion that might differ from $Q$ in the definition above (even though it seems reasonable to assume both should be related). For example, $J$ could be defined as the value of a starting state (or a distribution of starting states) in episodic cases, or as the undiscounted reward over a certain horizon, or as the average reward.
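
One possible way to write these three examples, with notation assumed here for illustration ($\rho_0$ for the distribution of starting states, $r_t$ for the reward at step $t$, $T$ for the horizon, and $V^{\pi_\theta}$ for the value function of $\pi_\theta$):

$$
\begin{aligned}
J_{\text{start}}(\theta) &= \mathbb{E}_{s_0 \sim \rho_0}\left[ V^{\pi_\theta}(s_0) \right], \\
J_{\text{horizon}}(\theta) &= \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T-1} r_t \right], \\
J_{\text{avg}}(\theta) &= \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T-1} r_t \right].
\end{aligned}
$$

The single-starting-state case corresponds to $\rho_0$ putting all its mass on one state.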

Remark: Why is it interesting to look at policy gradient methods? Because action selection requires no maximization step ($\max_a Q(s,a)$), which is itself a hard optimization problem when actions are continuous, but only a call to $\pi_\theta(s)$ (or a draw from $\pi_\theta(a|s)$). This makes policy gradient a method of choice for continuous action domains (especially common in robotics).
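
To make the point concrete, here is a minimal sketch of action selection with a Gaussian policy whose mean is linear in the state. The linear-Gaussian form, the names `gaussian_policy_sample` and `log_std`, and the dimensions are all assumptions chosen for illustration. Note that selecting an action involves no search over actions.

```python
import numpy as np

def gaussian_policy_sample(theta, log_std, s, rng):
    """Draw a ~ N(theta @ s, exp(log_std)^2 I); theta has shape (action_dim, state_dim)."""
    mean = theta @ s
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
theta = np.array([[0.5, -1.0, 0.0],
                  [0.0,  2.0, 1.0]])  # hypothetical parameters: 2-D action, 3-D state
s = np.array([1.0, 0.5, -0.5])

a = gaussian_policy_sample(theta, log_std=-1.0, s=s, rng=rng)  # stochastic draw
a_det = theta @ s  # deterministic variant pi_theta(s): just the mean
```

A value-based agent would instead have to solve $\max_a Q(s,a)$ over a continuous set at every step, typically through discretization or an inner optimization loop.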

Remark: When do policy gradient approaches outperform value-based ones? It's hard to give a precise criterion; it really depends on the problem. One thing that comes into play is how easy it is to approximate the optimal policy or the optimal value function. If one is simpler than the other (by "simpler", we mean "it is easier to find a parameterization whose spanned function space almost includes the function to approximate"), then a good heuristic is to approximate the simpler one. But this criterion might itself be hard to assess.

Remark: Policy parameterization is an easy way to reduce the policy space and thus to include prior knowledge about the policy shape (in particular the correlation between $\pi(s)$ and $\pi(s')$).

Methods that jointly learn a policy and a value function are often called *actor-critic methods*.

**Assumptions and notations**

We write $\theta$ for the policy's parameters and $w$ for the parameters of the value function ($V$ or $Q$).

We assume $\pi_\theta(s)$ (or $\pi_\theta(a|s)$) is differentiable with respect to $\theta$ for all $s$.

Remark: smooth convergence. A major advantage of PG methods stems from the continuity of $\pi$ with respect to $\theta$. Thanks to this assumption, a small change in $\theta$ (as in the update presented earlier) results in a small change in $\pi_\theta$. That is, action probabilities change smoothly during learning. Conversely, with value-based methods, a small value change during learning can yield a drastic policy change. The direct consequence is that gradient ascent is indeed possible!
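
A small numerical sketch of this contrast, under assumptions made for illustration: a single state with two actions, and the same two-component vector read once as softmax policy parameters and once as action values for a greedy policy.

```python
import numpy as np

def softmax(x):
    """Softmax distribution over actions from a vector of preferences."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def greedy(q):
    """Deterministic greedy policy written as a one-hot distribution."""
    p = np.zeros_like(q)
    p[np.argmax(q)] = 1.0
    return p

eps = 1e-3
before = np.array([1.0, 1.0 - eps])  # action 0 is slightly ahead
after = np.array([1.0, 1.0 + eps])   # a tiny change puts action 1 ahead

# Largest change in any action probability caused by the tiny perturbation
softmax_change = np.abs(softmax(after) - softmax(before)).max()
greedy_change = np.abs(greedy(after) - greedy(before)).max()
```

Here `softmax_change` is of the same order as `eps`, whereas `greedy_change` equals 1: the greedy policy switches entirely from one action to the other.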

