# Approximate Solution Methods
To avoid the combinatorial explosion of tabular methods when the number of states and/or actions is large, we must resort to approximate methods.
Approximate methods represent the value function as a parametrized function and update the parameters to minimize the error between the true value function and its approximation.
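
As a minimal sketch of this idea, the snippet below fits a linear approximation $\hat v(s, w) = w^\top x(s)$ to known target values by stochastic gradient descent on the squared error. The feature map `features`, the weight vector `w`, the five-state setup, and the target values are illustrative assumptions rather than part of any specific algorithm discussed here.

```python
import numpy as np

def features(s, n_states):
    """Hypothetical feature vector x(s): a bias term plus the scaled state index."""
    return np.array([1.0, s / (n_states - 1)])

def fit_linear_value(v_true, alpha=0.1, sweeps=500):
    """SGD on 0.5 * (v_true[s] - w @ x(s))**2, visiting one state at a time."""
    n_states = len(v_true)
    w = np.zeros(2)
    for _ in range(sweeps):
        for s in range(n_states):
            x = features(s, n_states)
            error = v_true[s] - w @ x
            w += alpha * error * x  # descent step on 0.5 * error**2
    return w

# Made-up "true" values, chosen to be exactly linear in the features.
v_true = np.array([2.0, 2.75, 3.5, 4.25, 5.0])
w = fit_linear_value(v_true)
v_hat = np.array([w @ features(s, len(v_true)) for s in range(len(v_true))])
```

Only two weights describe all five states here. In a real problem the true values are unknown and are replaced by sampled targets, but the form of the parameter update stays the same.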

## Policy Gradient Methods
Policy gradient methods are a class of reinforcement learning algorithms that estimate the gradient of a performance measure with respect to the policy parameters and then update the parameters in the direction of that gradient. Let $\theta$ denote the policy parameters and $\pi(a\mid s,\theta)= \Pr\{A_t=a\mid S_t=s, \theta_t=\theta\}$ the policy. We seek the $\theta$ that maximizes some performance measure $J(\theta)$ by gradient ascent:
\begin{equation}
\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)
\end{equation}

where $\alpha$ is the learning rate.
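
The update can be sketched on a problem small enough that $\nabla_\theta J$ is available in closed form. The example below assumes a hypothetical single-state problem with three actions, a softmax policy, and fixed rewards, so that $J(\theta) = \sum_a \pi(a\mid\theta)\, r(a)$. The reward values, step size, and function names are illustrative choices.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over actions; subtracting the max avoids overflow."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def J(theta, rewards):
    """Expected reward under the softmax policy."""
    return float(softmax(theta) @ rewards)

def grad_J(theta, rewards):
    """Exact gradient: dJ/dtheta_b = pi_b * (r_b - sum_a pi_a r_a)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def gradient_ascent(rewards, alpha=0.1, steps=2000):
    """Repeat theta <- theta + alpha * grad J(theta)."""
    theta = np.zeros_like(rewards, dtype=float)
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta, rewards)
    return theta

rewards = np.array([1.0, 2.0, 3.0])  # made-up rewards, one per action
theta = gradient_ascent(rewards)
print(softmax(theta), J(theta, rewards))
```

The policy concentrates on the highest-reward action as the iterations proceed. In a full reinforcement learning problem the exact gradient is not available, which is what the policy gradient theorem addresses.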

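In policy gradient methods, the performance measure $J(\theta)$ is a scalar function of $\theta$ that is usually defined in terms of the value function. In episodic tasks a common choice is the value of the start state under the current policy, $J(\theta) = v_{\pi_\theta}(s_0)$, and in continuing tasks it is the average reward per time step.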

### Policy Gradient Theorem
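The policy gradient theorem states that the gradient of the performance measure $J(\theta)$ satisfies
\begin{equation}
\nabla_\theta J(\theta) \propto \sum_s \mu_\pi(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a\mid s,\theta)
\end{equation}

where $\mu_\pi(s)$ is the on-policy state distribution under $\pi$, which in the continuing case is the stationary distribution of the Markov chain induced by $\pi$. The constant of proportionality is the average episode length in the episodic case and 1 in the continuing case, where the relation holds with equality. The key point is that the right-hand side contains no derivative of the state distribution, even though $\mu_\pi$ depends on $\theta$.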
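
The theorem can be checked numerically on a small example. The sketch below assumes a hypothetical undiscounted episodic MDP with two non-terminal states, two actions, and a tabular softmax policy; every number in `P`, `R`, and `d0` is made up. In this setting the state weighting is the expected number of visits $\eta(s)$ per episode, and $\mu_\pi(s) = \eta(s) / \sum_{s'} \eta(s')$, so weighting by $\eta$ gives the gradient itself, while weighting by $\mu_\pi$ gives the same direction scaled by a positive constant.

```python
import numpy as np

# P[s, a, s2]: probability of moving to non-terminal state s2.
# The missing probability mass in each row is the chance of terminating.
P = np.array([
    [[0.0, 0.8], [0.5, 0.0]],
    [[0.3, 0.0], [0.0, 0.6]],
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])  # expected reward R[s, a]
d0 = np.array([1.0, 0.0])               # every episode starts in state 0

def policy(theta):
    """Tabular softmax policy pi[s, a] with one preference per (s, a)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Exact v, q and expected visit counts eta under the policy."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)    # state-to-state transitions
    r_pi = (pi * R).sum(axis=1)              # expected one-step reward
    I = np.eye(len(d0))
    v = np.linalg.solve(I - P_pi, r_pi)      # v = r_pi + P_pi v
    q = R + P @ v                            # q(s,a) = R + sum_s2 P v(s2)
    eta = np.linalg.solve(I - P_pi.T, d0)    # eta = d0 + P_pi^T eta
    return pi, v, q, eta

def J(theta):
    """Performance measure: value of the start-state distribution."""
    return float(d0 @ evaluate(theta)[1])

def theorem_gradient(theta):
    """sum_s eta(s) sum_a q(s,a) grad pi(a|s), written out for softmax.

    For a tabular softmax the inner sum with respect to theta[s, b]
    reduces to pi(b|s) * (q(s,b) - v(s)).
    """
    pi, v, q, eta = evaluate(theta)
    return eta[:, None] * pi * (q - v[:, None])

def finite_difference_gradient(theta, eps=1e-6):
    """Central differences on J, used as an independent reference."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

theta = np.zeros((2, 2))
pi, v, q, eta = evaluate(theta)
mu = eta / eta.sum()  # on-policy state distribution
print(np.allclose(theorem_gradient(theta), finite_difference_gradient(theta), atol=1e-5))
```

Note that `theorem_gradient` never differentiates `eta`, although `eta` changes with `theta`. That is the practical content of the theorem.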

### REINFORCE
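REINFORCE is a Monte Carlo policy gradient algorithm. It replaces the sums in the policy gradient theorem with samples collected while following $\pi$: the state $S_t$ is drawn from $\mu_\pi$, the action $A_t$ from $\pi(\cdot\mid S_t,\theta)$, and $q_\pi(S_t,A_t)$ is estimated by the observed return. Sampling the action introduces a factor $1/\pi(A_t\mid S_t,\theta)$, and since $\nabla_\theta \pi / \pi = \nabla_\theta \ln \pi$, the resulting update is
\begin{equation}
\theta_{t+1} = \theta_t + \alpha G_t \nabla_\theta \ln \pi(A_t\mid S_t,\theta_t)
\end{equation}

where $G_t$ is the return from time $t$ onward, which can only be computed once the episode has ended.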
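
A minimal sketch of REINFORCE, using the sampled update $\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi(A_t\mid S_t,\theta)$ with undiscounted returns and a tabular softmax policy. The environment is a made-up three-state corridor: moving right earns a reward of 1 and advances one state, moving left ends the episode with a reward of 0, and the episode also ends after the last state. The environment, step size, and episode count are assumptions chosen for illustration.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 3, 0, 1

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def run_episode(theta, rng):
    """Follow the current policy until termination; record the trajectory."""
    states, actions, rewards = [], [], []
    s = 0
    while s < N_STATES:
        a = rng.choice(2, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        if a == RIGHT:
            rewards.append(1.0)
            s += 1
        else:
            rewards.append(0.0)
            break  # moving left terminates the episode
    return states, actions, rewards

def returns(rewards):
    """Undiscounted returns: G_t is the sum of rewards from step t onward."""
    return np.cumsum(rewards[::-1])[::-1]

def reinforce(episodes=500, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # one preference per (state, action)
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        for s, a, g in zip(states, actions, returns(rewards)):
            # For a softmax policy, grad of ln pi(a|s) with respect to
            # theta[s] is one_hot(a) - pi(.|s).
            grad_log_pi = -softmax(theta[s])
            grad_log_pi[a] += 1.0
            theta[s] += alpha * g * grad_log_pi
    return theta

theta = reinforce()
print(np.array([softmax(t) for t in theta]))  # rows: states, columns: LEFT, RIGHT
```

Because the update uses the full return $G_t$, the parameters are only changed after an episode has finished. The estimate is unbiased but can have high variance, which is why a baseline is often subtracted from $G_t$ in practice.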




