# Approximate Solution Methods
To avoid the combinatorial explosion of tabular methods when the number of states and/or actions is large, we must resort to approximate methods.
Approximate methods represent the value function (or, as in the policy gradient methods below, the policy itself) as a parameterized function, and update the parameters to reduce the error between the true function and its approximation.

## Policy Gradient Methods
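Policy gradient methods are a class of reinforcement learning algorithms that learn a parameterized policy directly: they estimate the gradient of a performance measure with respect to the policy parameters, and then update the parameters in the direction of that gradient. A value function may still be used to learn the policy parameters, but actions can be selected **without consulting the value function**.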

Let's denote the policy parameters as $\theta$, and the policy as $\pi(a\mid s,\theta)= \Pr\{A_t=a\mid S_t=s, \theta_t=\theta\}$. We will attempt to learn $\theta$ such that some *performance measure* $J(\theta)$ is maximized.
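As a concrete illustration of a parameterized policy, here is a minimal sketch of a softmax policy over tabular action preferences. The softmax form and the layout `theta[s][a]` are assumptions made for this example, not something the text above prescribes; any differentiable parameterization would do.

```python
import math

def softmax_policy(theta, s):
    """Return the action probabilities pi(.|s, theta) for state s.

    theta[s][a] is a (hypothetical) tabular preference for action a in state s.
    """
    prefs = theta[s]
    m = max(prefs)  # subtract the max preference for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Two states, two actions: state 0 has equal preferences,
# state 1 prefers action 0.
theta = [[0.0, 0.0], [1.0, -1.0]]
probs_s0 = softmax_policy(theta, 0)
probs_s1 = softmax_policy(theta, 1)
```

With equal preferences the policy is uniform, and raising one preference shifts probability toward that action without ever making the policy fully deterministic.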

\begin{equation}
\theta_{t+1} = \theta_t + \alpha \widehat{\nabla_\theta J(\theta)}
\end{equation}

where $\alpha$ is the step size.

In policy gradient methods, the performance measure $J(\theta)$ is defined in terms of the value function. In the episodic case it is usually the value of the start state under the policy, $J(\theta) = v_{\pi_\theta}(s_0)$; in the continuing case it is usually the average reward per time step.

### Policy Gradient Theorem
The policy gradient theorem states that the gradient of the performance measure $J(\theta)$ is given by
\begin{equation}
\nabla_\theta J(\theta) \propto \sum_s \mu_\pi(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a\mid s,\theta)
\end{equation}

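where $\mu_\pi(s)$ is the on-policy state distribution under $\pi$ (in the continuing case, the stationary distribution of the Markov chain induced by the policy $\pi$). The constant of proportionality is the average episode length in the episodic case and 1 in the continuing case; in practice it is absorbed into the step size $\alpha$. Note that the expression involves the gradient of the policy only, not the gradient of the state distribution.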
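The theorem can be checked numerically in the simplest possible setting. The sketch below assumes a single-state, one-step task (a bandit), so that $\mu_\pi$ is trivial, $q_\pi(s,a)$ is just the expected reward of action $a$, and $J(\theta) = \sum_a \pi(a\mid\theta)\, q(a)$. The softmax policy and the values in `q` are illustrative assumptions. It compares $\sum_a q(a) \nabla_\theta \pi(a\mid\theta)$ against a finite-difference gradient of $J$.

```python
import math

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def performance(theta, q):
    """J(theta) = sum_a pi(a|theta) q(a) for a single-state, one-step task."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))

def theorem_gradient(theta, q):
    """sum_a q(a) * d pi(a) / d theta_b, for each parameter theta_b.

    Uses the softmax derivative d pi(a)/d theta_b = pi(a) * (1[a == b] - pi(b)).
    """
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(q[a] * pi[a] * ((1.0 if a == b else 0.0) - pi[b]) for a in range(n))
        for b in range(n)
    ]

def finite_difference_gradient(theta, q, eps=1e-6):
    """Central-difference approximation of the gradient of J."""
    grad = []
    for b in range(len(theta)):
        up = list(theta)
        up[b] += eps
        down = list(theta)
        down[b] -= eps
        grad.append((performance(up, q) - performance(down, q)) / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.5]   # illustrative action preferences
q = [1.0, 0.0, 2.0]        # illustrative expected rewards per action
analytic = theorem_gradient(theta, q)
numeric = finite_difference_gradient(theta, q)
```

The two gradients should agree up to finite-difference error. In multi-state problems the same check would also need $\mu_\pi$ and $q_\pi$, which is exactly what the theorem spares us from differentiating.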

### REINFORCE

The policy gradient theorem can be rewritten as an expectation over the states $S_t$ visited while following $\pi$:
\begin{equation}
\nabla_\theta J(\theta) \propto \mathbb{E}_\pi \left[ \sum_a q_\pi(S_t,a) \nabla_\theta \pi(a\mid S_t,\theta) \right]
\end{equation}

The REINFORCE algorithm is a Monte Carlo policy gradient algorithm that uses the policy gradient theorem to update the policy parameters. Sampling the state $S_t$ in this expectation gives the stochastic gradient ascent update
\begin{equation}
\theta_{t+1} = \theta_t + \alpha \sum_a \hat{q}(S_t,a) \nabla_\theta \pi(a\mid S_t,\theta_t)
\end{equation}

where $\hat{q}$ is some learned approximation of $q_\pi$. This algorithm is called an "*all-actions*" method, because its update sums over all possible actions. The classical REINFORCE algorithm is a "*one-action*" method: its update involves only $A_t$, the action actually taken at time $t$, together with the return $G_t$ obtained from time $t$ onward. Because it needs the complete return, each update can only be made once the episode has ended. The update rule for the classical REINFORCE algorithm is
\begin{equation}
\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t\mid S_t,\theta_t)}{\pi(A_t\mid S_t,\theta_t)} 
\end{equation}

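That is, each update is the return multiplied by the gradient of the probability of the action actually taken, $A_t$, divided by that probability. Since $\nabla_\theta \pi / \pi = \nabla_\theta \ln \pi$, the update is often written with $\nabla_\theta \ln \pi(A_t\mid S_t,\theta_t)$ instead. The rule follows from the expectation form of the policy gradient theorem by multiplying and dividing each term of the sum by $\pi(a\mid S_t,\theta)$, replacing the resulting $\pi$-weighted sum over $a$ by the sampled action $A_t$, and using $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$.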
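The one-action update can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the policy is a tabular softmax over preferences `theta[s][a]`, returns are undiscounted, and an episode is given as a list of `(state, action, reward)` tuples, where the reward is the one received after taking the action. All of these names are hypothetical.

```python
import math

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, s, a):
    """Gradient of ln pi(a|s, theta) w.r.t. theta[s] for a tabular softmax.

    This equals grad pi / pi and works out to one_hot(a) - pi(.|s).
    """
    pi = softmax(theta[s])
    return [(1.0 if b == a else 0.0) - pi[b] for b in range(len(pi))]

def returns(rewards):
    """Undiscounted returns G_t = R_{t+1} + ... + R_T, one per time step."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g += r
        out.append(g)
    return out[::-1]

def reinforce_update(theta, episode, alpha):
    """Apply theta <- theta + alpha * G_t * grad ln pi(A_t|S_t, theta) per step."""
    gs = returns([r for _, _, r in episode])
    for (s, a, _), g in zip(episode, gs):
        grad = grad_log_pi(theta, s, a)
        for b in range(len(grad)):
            theta[s][b] += alpha * g * grad[b]
    return theta

# A hand-made two-step episode: action 1 in state 0, then action 0 in state 1,
# with a reward of 1 arriving only at the end.
theta = [[0.0, 0.0], [0.0, 0.0]]
episode = [(0, 1, 0.0), (1, 0, 1.0)]
theta = reinforce_update(theta, episode, alpha=0.1)
```

Both actions in the episode were followed by a positive return, so both have their preferences raised and the alternatives lowered by the same amount.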

#### REINFORCE with Baseline
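The REINFORCE algorithm can be improved by subtracting a baseline from the return. The baseline can be any function of the state, as long as it does not depend on the action. Subtracting it leaves the expected update unchanged, because $\sum_a b(s) \nabla_\theta \pi(a\mid s,\theta) = b(s) \nabla_\theta \sum_a \pi(a\mid s,\theta) = b(s) \nabla_\theta 1 = 0$, but it can greatly reduce the variance of the update. A natural choice is the state value function $v_\pi(s)$, which in practice is replaced by a learned estimate. The update rule for the REINFORCE algorithm with baseline is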
\begin{equation}
\theta_{t+1} = \theta_t + \alpha (G_t - v_\pi(S_t)) \frac{\nabla_\theta \pi(A_t\mid S_t,\theta_t)}{\pi(A_t\mid S_t,\theta_t)}
\end{equation}
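A minimal sketch of this update, assuming a tabular softmax policy with preferences `theta[s][a]` and a tabular baseline `w[s]` that stands in for $v_\pi(S_t)$ and is itself learned from the observed returns. The names, the undiscounted returns and the separate step sizes `alpha_theta` and `alpha_w` are assumptions made for illustration.

```python
import math

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, s, a):
    """grad pi / pi for a tabular softmax policy: one_hot(a) - pi(.|s)."""
    pi = softmax(theta[s])
    return [(1.0 if b == a else 0.0) - pi[b] for b in range(len(pi))]

def returns(rewards):
    """Undiscounted returns G_t, one per time step."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g += r
        out.append(g)
    return out[::-1]

def reinforce_baseline_update(theta, w, episode, alpha_theta, alpha_w):
    """One pass of REINFORCE with a learned state-value baseline w[s]."""
    gs = returns([r for _, _, r in episode])
    for (s, a, _), g in zip(episode, gs):
        delta = g - w[s]                 # return minus baseline
        grad = grad_log_pi(theta, s, a)
        w[s] += alpha_w * delta          # move the baseline toward the return
        for b in range(len(grad)):
            theta[s][b] += alpha_theta * delta * grad[b]
    return theta, w

# Repeating the same one-step episode: the baseline catches up with the
# return, so later policy updates become smaller.
theta = [[0.0, 0.0]]
w = [0.0]
episode = [(0, 1, 1.0)]
for _ in range(3):
    theta, w = reinforce_baseline_update(theta, w, episode, 0.1, 0.5)
```

Once the baseline matches the typical return from a state, only returns that are better or worse than expected move the policy, which is where the variance reduction comes from.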






### Actor-Critic Methods
Actor-critic methods are a class of reinforcement learning algorithms that combine the policy gradient theorem with a learned value function. The policy is called the *actor*, and the value function is called the *critic*. The critic evaluates the actions chosen by the actor, and the actor uses that evaluation to improve the policy. The critic is usually updated using temporal difference learning, and its TD error takes the place of the Monte Carlo return (minus baseline) in the actor's policy gradient update. Because the critic bootstraps from its own estimates, the updates can be made online after every step rather than at the end of the episode, at the cost of introducing some bias.
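A one-step version of this idea can be sketched as follows. The sketch assumes a tabular softmax actor with preferences `theta[s][a]`, a tabular critic `w[s]`, and a discount factor `gamma`; these names and the tabular setting are illustrative assumptions. It also omits the accumulated discount factor that some formulations apply to the actor update.

```python
import math

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, s, a):
    """grad pi / pi for a tabular softmax policy: one_hot(a) - pi(.|s)."""
    pi = softmax(theta[s])
    return [(1.0 if b == a else 0.0) - pi[b] for b in range(len(pi))]

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha_theta, alpha_w, gamma):
    """Update critic w and actor theta from a single transition.

    Returns the TD error delta = r + gamma * v(s_next) - v(s),
    where the value of a terminal state is taken to be 0.
    """
    v_next = 0.0 if done else w[s_next]
    delta = r + gamma * v_next - w[s]
    grad = grad_log_pi(theta, s, a)
    w[s] += alpha_w * delta                      # critic: TD(0) update
    for b in range(len(grad)):
        theta[s][b] += alpha_theta * delta * grad[b]  # actor: policy gradient step
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0]
# Transition 1: in state 0, take action 1, receive reward 1, land in state 1.
d1 = actor_critic_step(theta, w, 0, 1, 1.0, 1, False, 0.1, 0.5, 0.9)
# Transition 2: in state 1, take action 0, receive reward 0, land back in state 0.
d2 = actor_critic_step(theta, w, 1, 0, 0.0, 0, False, 0.1, 0.5, 0.9)
```

In the second transition the reward is zero, yet the TD error is positive because the critic already values state 0; this is the bootstrapping that lets the actor learn before the episode ends.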