# **Introduction to Policy Gradient Algorithms**

### **Motivation**

In reinforcement learning (RL), **policy-based methods** directly parameterize the policy (i.e., the mapping from states to actions) and optimize these parameters by maximizing the expected reward. This is in contrast to value-based methods, which learn a value function and derive a policy from it indirectly. Policy gradient methods are particularly useful in continuous or high-dimensional action spaces, or when you want a stochastic policy.

### **High-Level Overview**

**Pros of Policy-Based Methods:**
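- **Direct Optimization:** They optimize the policy itself, so a small change in the parameters produces a small change in the action probabilities, whereas a greedy policy derived from a value function can change abruptly.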
- **Stochastic Policies:** Naturally handle stochastic policies, which can be beneficial in environments with uncertainty.
- **Continuous Actions:** They work well in continuous action spaces where value-based methods might struggle.
- **Convergence to Local Optima:** With a suitably small step size they typically converge to at least a locally optimal policy, a guarantee that value-based methods with function approximation often lack.

**Cons of Policy-Based Methods:**
- **High Variance:** Gradient estimates often have high variance, making learning unstable without careful techniques.
- **Sample Inefficiency:** They may require many samples to learn effectively.
- **Local Optima:** Direct optimization might get stuck in suboptimal policies.

**Policy Representation**

A policy $\pi_{\theta}(a|s)$ is often represented using a neural network (for instance, a multi-layer perceptron) parameterized by $\theta$. For discrete actions, the network may output a probability distribution over actions (e.g., via a softmax), while for continuous actions it might output the parameters (like mean and variance) of a distribution (e.g., a Gaussian).
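
As a minimal sketch of both cases, the snippet below uses a single linear layer in place of a full neural network to keep things short. The function names, shapes, and the log-standard-deviation parameterization are assumptions made for illustration, not part of the text above.

```python
import numpy as np

def softmax_policy(theta, state):
    """Discrete actions: theta has shape (state_dim, n_actions)."""
    logits = state @ theta
    logits = logits - logits.max()   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def gaussian_policy(theta_mean, log_std, state):
    """Continuous actions: return mean and std of a diagonal Gaussian."""
    mean = state @ theta_mean
    std = np.exp(log_std)            # storing log-std keeps std positive
    return mean, std

rng = np.random.default_rng(0)
state = np.array([1.0, -0.5, 0.25])

# Discrete case: sample one of 4 actions from the softmax distribution.
theta = rng.normal(size=(3, 4))
probs = softmax_policy(theta, state)
discrete_action = rng.choice(len(probs), p=probs)

# Continuous case: sample a 2-dimensional action from the Gaussian.
theta_mean = rng.normal(size=(3, 2))
log_std = np.zeros(2)
mean, std = gaussian_policy(theta_mean, log_std, state)
continuous_action = rng.normal(mean, std)
```

In both cases the policy is a distribution that we sample from, which is what makes it stochastic.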

# **Policy Gradient Derivation**

### **1. Setting Up the Problem**

#### **What Are We Trying to Do?**

In reinforcement learning, our goal is to adjust the parameters $ \theta $ of our policy (which is a function that tells the agent which action to take in a given state) so that the agent gets as much reward as possible. We denote our policy as:
$$
\pi_\theta(a|s)
$$
which means "the probability of taking action $a$ when in state $s$" given parameters $ \theta $.

#### **The Objective Function**

The performance of our policy is measured by the **expected return**: the total reward we collect, on average, by following the policy. We can write this as:
$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
$$
Here:
- $ \tau $ represents a **trajectory**, which is a sequence of states and actions: $ (s_0, a_0, s_1, a_1, \dots) $.
- $ R(\tau) $ is the total reward accumulated along the trajectory $ \tau $.
- The notation $ \tau \sim \pi_\theta $ means that the trajectory is generated by following the policy $ \pi_\theta $.

The goal is to find the parameters $ \theta $ that maximize $ J(\theta) $.
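
Because $ J(\theta) $ is an expectation over trajectories, it can be approximated by sampling episodes and averaging their returns. The sketch below does this on a made-up three-state corridor: the environment, its 0.9 success probability, the horizon, and all function names are assumptions chosen only to make the example runnable.

```python
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 3, 2, 5   # hypothetical toy problem

def policy_probs(theta, s):
    """Tabular softmax policy: one row of logits per state."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a, rng):
    """Toy dynamics (assumed): the intended move succeeds with probability 0.9.
    Action 1 tries to move right, action 0 tries to move left.
    Reward is 1 whenever the next state is the right-most state."""
    move = 1 if a == 1 else -1
    if rng.random() < 0.1:
        move = -move
    s_next = int(np.clip(s + move, 0, N_STATES - 1))
    return s_next, float(s_next == N_STATES - 1)

def sample_return(theta, rng):
    """Roll out one trajectory from state 0 and return R(tau)."""
    s, total = 0, 0.0
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        s, r = step(s, a, rng)
        total += r
    return total

def estimate_J(theta, n_episodes, rng):
    """Monte Carlo estimate of J(theta): the average return over episodes."""
    return np.mean([sample_return(theta, rng) for _ in range(n_episodes)])

rng = np.random.default_rng(0)
theta_uniform = np.zeros((N_STATES, N_ACTIONS))   # both actions equally likely
theta_right = np.zeros((N_STATES, N_ACTIONS))
theta_right[:, 1] = 3.0                           # strongly prefers moving right

J_uniform = estimate_J(theta_uniform, 2000, rng)
J_right = estimate_J(theta_right, 2000, rng)
```

Comparing `J_uniform` with `J_right` shows how different parameter settings yield different expected returns, which is exactly the quantity we want to climb.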

---

### **2. Taking the Gradient**

Since we want to maximize $ J(\theta) $, we can use **gradient ascent**. This means we want to compute the gradient (or derivative) of $ J(\theta) $ with respect to $ \theta $, denoted as $ \nabla_\theta J(\theta) $, and then adjust $ \theta $ in the direction of this gradient.
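
Written as an update rule, one step of gradient ascent looks like the following, where the step size $ \alpha > 0 $ and the iteration index $ k $ are notation introduced here for illustration:
$$
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta) \big|_{\theta = \theta_k}
$$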

#### **Step-by-Step Differentiation**

1. **Express the Objective as an Integral:**

   The expectation $ \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] $ can be written as an integral over all possible trajectories:
   $$
   J(\theta) = \int \pi_\theta(\tau) R(\tau) \, d\tau
   $$
   Here, $ \pi_\theta(\tau) $ is the probability of the trajectory $ \tau $ under the policy $ \pi_\theta $.

2. **Differentiate Under the Integral:**

   To find the gradient, differentiate $ J(\theta) $ with respect to $ \theta $:
   $$
   \nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau
   $$
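   Assuming $ \pi_\theta(\tau) $ is differentiable in $ \theta $ and well-behaved enough to swap the gradient and the integral (a standard assumption in this setting), we can interchange the order of integration and differentiation: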
   $$
   \nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau
   $$

3. **Introduce the Log-Derivative Trick:**

   Now, differentiating $ \pi_\theta(\tau) $ directly can be hard. But we can use a very useful trick from calculus:
   $$
   \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau)
   $$
   This follows from the chain rule: $ \nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \pi_\theta(\tau) / \pi_\theta(\tau) $, and multiplying both sides by $ \pi_\theta(\tau) $ gives the identity above. Plug this back into our expression:
   $$
   \nabla_\theta J(\theta) = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau
   $$

4. **Express as an Expectation:**

   Notice that the integral
   $$
   \int \pi_\theta(\tau) \, [\cdot] \, d\tau
   $$
   is just the expectation with respect to trajectories $ \tau $ sampled from $ \pi_\theta $. So, we can write:
   $$
   \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]
   $$
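   This result is often referred to as the **policy gradient theorem**; more precisely, it is the likelihood-ratio (score-function) form of the policy gradient, the expression that underlies the REINFORCE algorithm.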
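
The identity can be checked numerically in the simplest possible setting: a one-step problem where the "trajectory" is a single action. The sketch below assumes a softmax policy over three actions with made-up rewards, computes the exact gradient of $ J(\theta) $ in closed form, and compares it with the sample average of $ \nabla_\theta \log \pi_\theta \cdot R $.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rewards = np.array([1.0, 2.0, 5.0])   # hypothetical reward for each action
theta = np.array([0.2, -0.1, 0.4])    # one logit per action
pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) * r(a) for a softmax policy:
# dJ/dtheta_k = pi_k * (r_k - J).
J = pi @ rewards
exact_grad = pi * (rewards - J)

# Score-function estimate: average grad log pi(a) * r(a) over sampled actions.
rng = np.random.default_rng(0)
n_samples = 200_000
actions = rng.choice(3, size=n_samples, p=pi)
one_hot = np.eye(3)[actions]
score = one_hot - pi                  # grad_theta log softmax(theta)[a]
estimated_grad = (score * rewards[actions][:, None]).mean(axis=0)

max_abs_error = np.abs(exact_grad - estimated_grad).max()
```

With enough samples the two gradients agree up to Monte Carlo noise, even though the estimate never differentiates the rewards or the sampling process itself.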

---

### **3. Breaking Down $ \log \pi_\theta(\tau) $**

#### **Understanding Trajectories**

A trajectory $ \tau $ is a sequence of states and actions:
$$
\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T, s_{T+1})
$$
The probability of this trajectory under the policy can be written as:
$$
\pi_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) \, p(s_{t+1}|s_t, a_t)
$$
- $ p(s_0) $ is the probability of the starting state.
- $ \pi_\theta(a_t|s_t) $ is our policy.
- $ p(s_{t+1}|s_t, a_t) $ represents the dynamics of the environment (which we assume does not depend on $ \theta $).

#### **Taking the Logarithm**

We take the logarithm of $ \pi_\theta(\tau) $:
$$
\log \pi_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t) \right]
$$
Since the environment dynamics $ p(s_{t+1}|s_t, a_t) $ and the initial state distribution $ p(s_0) $ do not depend on $ \theta $, their gradients will be zero. Hence, when we compute:
$$
\nabla_\theta \log \pi_\theta(\tau)
$$
only the terms involving $ \pi_\theta(a_t|s_t) $ contribute:
$$
\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
$$
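
A finite-difference check makes this concrete. The sketch below assumes a tabular softmax policy and a short hand-written trajectory; the dynamics terms are represented by an arbitrary constant, since their actual values do not matter for the gradient. All names and numbers are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_prob_trajectory(theta, states, actions, log_dynamics):
    """log pi_theta(tau): policy terms plus theta-independent dynamics terms."""
    policy_terms = sum(np.log(softmax(theta[s])[a])
                       for s, a in zip(states, actions))
    return policy_terms + log_dynamics

def grad_log_prob_trajectory(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t | s_t) for a tabular softmax."""
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        grad[s] -= softmax(theta[s])   # grad of log softmax: one_hot(a) - probs
        grad[s, a] += 1.0
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))        # 3 states, 2 actions
states, actions = [0, 1, 2, 1], [1, 1, 0, 0]
# Made-up constant standing in for log p(s_0) + sum_t log p(s_{t+1}|s_t, a_t).
log_dynamics = np.log(0.5) + 4 * np.log(0.9)

analytic = grad_log_prob_trajectory(theta, states, actions)

# Central finite differences on the full log-probability, dynamics included.
eps = 1e-6
numeric = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    bump = np.zeros_like(theta)
    bump[idx] = eps
    plus = log_prob_trajectory(theta + bump, states, actions, log_dynamics)
    minus = log_prob_trajectory(theta - bump, states, actions, log_dynamics)
    numeric[idx] = (plus - minus) / (2 * eps)
```

The analytic gradient never looks at `log_dynamics`, yet it matches the numerical gradient of the full trajectory log-probability. This is why the method does not need a model of the environment.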

---

### **4. Putting It All Together**

Substitute the above result back into our expression for the gradient:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau) \right]
$$
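
One way to use this expression is to replace the expectation with an average over sampled episodes and take gradient ascent steps on the result, which is the basic idea behind REINFORCE. The sketch below is a minimal version under stated assumptions: a hypothetical three-state corridor environment, a tabular softmax policy, and an arbitrary batch size and step size. It uses the plain total return $ R(\tau) $ as the weight, with no variance-reduction tricks.

```python
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 3, 2, 5   # hypothetical toy problem

def policy_probs(theta, s):
    """Tabular softmax policy: one row of logits per state."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a, rng):
    """Toy dynamics (assumed): the intended move succeeds with probability 0.9.
    Reward is 1 whenever the next state is the right-most state."""
    move = 1 if a == 1 else -1
    if rng.random() < 0.1:
        move = -move
    s_next = int(np.clip(s + move, 0, N_STATES - 1))
    return s_next, float(s_next == N_STATES - 1)

def sample_episode(theta, rng):
    """Return sum_t grad log pi(a_t|s_t) and the total reward R(tau)."""
    s, score_sum, total_reward = 0, np.zeros_like(theta), 0.0
    for _ in range(HORIZON):
        p = policy_probs(theta, s)
        a = rng.choice(N_ACTIONS, p=p)
        score_sum[s] -= p            # grad of log softmax: one_hot(a) - p
        score_sum[s, a] += 1.0
        s, r = step(s, a, rng)
        total_reward += r
    return score_sum, total_reward

def estimate_gradient(theta, n_episodes, rng):
    """Average (sum_t grad log pi) * R(tau) over sampled episodes."""
    episodes = [sample_episode(theta, rng) for _ in range(n_episodes)]
    grad = np.mean([score * R for score, R in episodes], axis=0)
    avg_return = np.mean([R for _, R in episodes])
    return grad, avg_return

rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))
history = []
for _ in range(100):
    grad, avg_return = estimate_gradient(theta, 50, rng)
    theta += 0.1 * grad              # gradient ascent step
    history.append(avg_return)
```

Tracking `history` over iterations gives a rough picture of learning progress. Individual estimates are noisy, which is the high-variance issue mentioned earlier.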

#### **Intuitive Explanation**

- **Intuition Behind the Log-Derivative:**  
  Instead of directly differentiating the probability of the entire trajectory, we differentiate its logarithm, which conveniently breaks into a sum over time steps. This makes the mathematics much more manageable.

- **Why Multiply by $ R(\tau) $?**  
  If a trajectory $ \tau $ results in a high reward $ R(\tau) $, then we want to **increase** the likelihood of the actions taken in that trajectory. Multiplying by $ R(\tau) $ ensures that the update reinforces actions that lead to high rewards.

- **Summing Over Time Steps:**  
  The sum $ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) $ means that every decision in the trajectory gets credit (or blame) weighted by the same overall outcome $ R(\tau) $, regardless of when it was taken. If $ R(\tau) $ is positive, the log probability of each action in that trajectory is pushed up; if it is negative, it is pushed down.

#### **A Simple Analogy**

Imagine you have a recipe (the policy) and you try it out (generating a trajectory). If the dish (the outcome) is delicious (high reward), you note down what you did at each step (the actions) and try to remember to do those things again (increase their probability). If the dish is bad, you change the recipe in the opposite direction.

---

### **5. Summary of the Derivation**

1. **Objective:**  
   Maximize the expected return:
   $$
   J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
   $$

2. **Gradient of the Objective:**  
   Differentiate $ J(\theta) $ to get:
   $$
   \nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau
   $$

3. **Log-Derivative Trick:**  
   Replace $ \nabla_\theta \pi_\theta(\tau) $ with:
   $$
   \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau)
   $$
   leading to:
   $$
   \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]
   $$

4. **Breaking Down the Trajectory:**  
   Recognize that only the action probabilities depend on $ \theta $, so:
   $$
   \nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
   $$

5. **Final Expression:**  
   Thus, the gradient becomes:
   $$
   \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau) \right]
   $$

This final result tells us how to adjust our policy parameters $ \theta $ by taking into account the contribution of each action (through its log probability) weighted by how good the overall outcome was.
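
In practice the expectation is rarely available in closed form and is usually approximated by averaging over sampled trajectories. A sketch of that estimate, where the number of trajectories $ N $ and the superscript $ (i) $ indexing them are notation introduced here:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \right) R(\tau^{(i)})
$$
Each $ \tau^{(i)} $ is generated by running the current policy $ \pi_\theta $, so the average is an unbiased estimate of the gradient.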

---

### **6. Intuitive Recap**

- **Imagine you're learning to play a game:**  
  Each time you play, you try out a series of moves (a trajectory). If you win (high reward), you want to remember which moves were good. The formula tells you to “nudge” your strategy to make those moves more likely in the future.

- **Logarithm Simplifies Things:**  
  Taking the log of the probability lets you add up contributions from each move rather than multiply small probabilities together, which is much easier to work with mathematically.

- **Weighting by Reward:**  
  Every move's adjustment is scaled by the episode's total return, so larger returns produce larger updates.