# Policy gradient methods in PyTorch

## Table of contents

1. [Understanding policy gradient methods](#understanding-policy-gradient-methods)
2. [Setting up the environment](#setting-up-the-environment)
3. [Defining the environment for training](#defining-the-environment-for-training)
4. [Building the policy network](#building-the-policy-network)
5. [Implementing the action selection policy](#implementing-the-action-selection-policy)
6. [Computing the policy gradient](#computing-the-policy-gradient)
7. [Implementing the REINFORCE algorithm](#implementing-the-reinforce-algorithm)
8. [Training the policy gradient agent](#training-the-policy-gradient-agent)
9. [Evaluating the policy gradient agent](#evaluating-the-policy-gradient-agent)
10. [Experimenting with hyperparameters](#experimenting-with-hyperparameters)

## Understanding policy gradient methods

Policy gradient methods are a family of reinforcement learning algorithms that optimize the policy directly. Instead of estimating value functions, as Q-learning does, they learn a policy that maps states to actions by adjusting the parameters of a function, typically a neural network. These methods are especially effective in environments with continuous action spaces or in tasks where the policy needs to be stochastic, such as robotics and control systems.

In reinforcement learning, the agent's goal is to maximize the expected cumulative reward by taking actions in an environment. The policy gradient methods achieve this by parameterizing the policy and adjusting those parameters to improve the expected reward. These methods rely on computing the gradient of the expected reward with respect to the policy's parameters and updating the parameters in the direction that maximizes the expected reward.

### **Key concepts in policy gradient methods**

There are several key concepts that form the foundation of policy gradient methods:

- **Policy**: A policy defines the probability of taking a specific action in a given state. In policy gradient methods, the policy is often stochastic, meaning it outputs a distribution over actions rather than a single deterministic action. This allows the agent to explore different actions in a given state.
  
- **Trajectory**: A trajectory is a sequence of states, actions, and rewards that the agent experiences over time as it interacts with the environment. It represents a full episode of experience from start to end.

- **Return**: The return is the total accumulated reward the agent receives over a trajectory. It represents the agent's performance in that episode.

- **Objective**: The objective of policy gradient methods is to maximize the expected return over all trajectories generated by the policy. The higher the return, the better the policy performs in the environment.

### **The policy gradient theorem**

The policy gradient theorem provides a way to compute how the parameters of the policy should be updated to maximize the expected return. The key insight is that the agent can adjust the probability of taking certain actions in certain states in a way that increases the expected reward. The method uses the gradient of the policy function to update the policy parameters in the direction that increases the likelihood of better outcomes.

By doing this, the agent learns to take actions that lead to higher cumulative rewards over time, gradually improving its performance as it interacts with the environment.

### **REINFORCE algorithm**

One of the simplest policy gradient methods is the REINFORCE algorithm. REINFORCE updates the policy parameters based on the return that follows each action taken during an episode. The key idea is to increase the probability of actions that lead to higher returns and decrease the probability of actions that lead to lower returns.

REINFORCE works by sampling trajectories from the environment, calculating the total return for each trajectory, and using that return to adjust the policy parameters. Actions that lead to higher returns are reinforced, while actions that result in lower returns are penalized. This simple mechanism allows the agent to learn a better policy over time, but it can suffer from high variance in the gradient estimates.

### **Challenges with REINFORCE**

While REINFORCE is conceptually simple, it has a few challenges:
- **High variance**: The gradient estimates in REINFORCE can have high variance, which makes the learning process unstable and slower. The variance arises because the algorithm updates the policy based on entire episodes, meaning feedback is delayed and noisy.
- **Delayed reward signal**: REINFORCE updates the policy only after an episode finishes, which can delay learning for earlier actions that influenced the final outcome. This makes it less efficient in complex environments with long trajectories.

### **Reducing variance with baseline subtraction**

To reduce the variance in policy gradient estimates, it’s common to introduce a **baseline**. The baseline stabilizes learning by providing a reference point, so that the algorithm responds to whether an action performed above or below expectations rather than to the raw size of the return. One commonly used baseline is the value function, which estimates the expected return from a particular state. By subtracting this baseline from the return, the agent focuses on actions that are better or worse than the expected outcome.

### **Actor-Critic methods**

Actor-Critic methods combine value-based learning with policy gradient methods to improve learning efficiency. The **actor** is responsible for selecting actions based on the policy, while the **critic** estimates the value function, which helps guide the actor’s updates. The advantage of this approach is that the critic provides immediate feedback to the actor, allowing it to learn from individual time steps rather than waiting for an entire episode to finish.

In Actor-Critic methods, the actor updates the policy using the gradient, while the critic evaluates the value of states to provide a more refined signal for the policy update. This helps reduce the variance of the gradient estimates and makes the learning process more stable.

### **Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C)**

Advantage Actor-Critic (A2C) is an Actor-Critic method that weights policy updates by the **advantage function**, which measures how much better or worse an action is than the policy's average behavior in a given state. This focuses the updates on actions that outperform that average, which typically makes learning more efficient.

A3C, the asynchronous variant, runs multiple workers in parallel, each interacting with its own copy of the environment and asynchronously updating a shared global policy. Historically, A3C was introduced first, and A2C is its synchronous counterpart, in which the workers' experience is gathered and applied in a single batched update. Running workers in parallel speeds up data collection and decorrelates the experience used for updates.

### **Proximal Policy Optimization (PPO)**

Proximal Policy Optimization (PPO) is one of the most popular and stable policy gradient methods. It addresses some of the issues with traditional policy gradient methods by preventing large updates to the policy. In PPO, the policy is constrained to change only by a small amount in each update, ensuring that the learning process remains stable. This makes PPO more robust and easier to tune compared to earlier policy gradient methods.

PPO uses a clipped objective function that limits how much the policy can change at each step, which helps prevent destructive updates that could degrade the policy’s performance. This balance between making progress and staying close to the previous policy leads to stable learning and has made PPO widely used in various reinforcement learning tasks.

### **Applications of policy gradient methods**

Policy gradient methods have broad applications, particularly in tasks that involve continuous actions or require a stochastic policy. Some key areas where policy gradient methods are used include:
- **Robotics**: Policy gradient methods are commonly applied in robotic control tasks, where the agent needs to perform continuous actions like walking, grasping, or manipulation.
- **Autonomous systems**: These methods are used in self-driving cars, drones, and other systems that require dynamic decision-making in complex environments.
- **Game playing**: Policy gradient algorithms are used in games where the action space is large, such as Go, chess, and complex video games, often in combination with search or value learning.

### **Maths**

#### **Objective in reinforcement learning**

In policy gradient methods, the goal is to learn a policy $ \pi_\theta(a | s) $ that maximizes the expected cumulative reward over all trajectories an agent experiences while interacting with the environment. The objective function $ J(\theta) $ is defined as the expected return across all possible trajectories generated by the policy $ \pi_\theta $:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} r_t \right]
$$

Where:
- $ \theta $ represents the parameters of the policy,
- $ \tau $ is a trajectory, which is a sequence of states, actions, and rewards,
- $ r_t $ is the reward at time step $ t $,
- $ T $ is the length of the trajectory.

The agent's goal is to optimize the policy parameters $ \theta $ to maximize the expected cumulative reward $ J(\theta) $.

#### **Policy gradient theorem**

The policy gradient theorem provides a formula for computing the gradient of the objective function $ J(\theta) $ with respect to the policy parameters $ \theta $. This gradient is used to update the policy in the direction that increases the expected return.

The gradient of the expected return can be expressed as:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) G_t \right]
$$

Where:
- $ \nabla_\theta J(\theta) $ is the gradient of the objective function with respect to the policy parameters $ \theta $,
- $ \nabla_\theta \log \pi_\theta(a_t | s_t) $ is the gradient of the log-probability of the action $ a_t $ given state $ s_t $,
- $ G_t $ is the return from time step $ t $, representing the total cumulative reward from that point onward.

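This equation shows that each log-probability gradient is weighted by the return $ G_t $ that followed the action, so actions followed by high returns are made more likely. The derivation relies on what is often called the **log-derivative trick** (also known as the likelihood-ratio or score-function estimator), which allows the gradient of an expectation to be estimated from samples of a stochastic policy.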
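
The identity behind this estimator can be sketched in two steps. The notation $ p_\theta(\tau) $ for the probability of a trajectory under $ \pi_\theta $ and $ R(\tau) $ for its total reward is introduced here for illustration only. Using $ \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) $:

$$
\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) R(\tau) \, d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p_\theta(\tau) R(\tau) \right]
$$

The trajectory probability factors into the initial-state distribution, the policy terms, and the environment dynamics. Only the policy terms depend on $ \theta $, so:

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t)
$$

Replacing $ R(\tau) $ with the return $ G_t $ for each term uses the fact that an action at time $ t $ cannot influence rewards received before $ t $.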

#### **Return and discounting**

In many reinforcement learning tasks, future rewards are discounted by a factor $ \gamma $ to prioritize immediate rewards over distant ones. The **discounted return** $ G_t $ at time step $ t $ is defined as:

$$
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
$$

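Where $ \gamma \in [0, 1] $ is the discount factor that determines how much weight is placed on future rewards. A discount factor closer to 0 emphasizes short-term rewards, while a value closer to 1 places more importance on long-term rewards. For an episode that ends at step $ T $, rewards after $ T $ are taken to be zero, so the sum is finite; in the infinite-horizon case, $ \gamma < 1 $ keeps the sum bounded when rewards are bounded.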

The goal of the policy gradient method is to adjust the policy parameters such that the expected discounted return $ \mathbb{E}[G_t] $ is maximized.
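
As a minimal sketch, the discounted return for every step of a finished episode can be computed in one backward pass using the recursion $ G_t = r_t + \gamma G_{t+1} $. The function name `discounted_returns` is an illustrative choice, not part of any library:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] for one finished episode of T rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Working backwards keeps the computation linear in the episode length, whereas summing the series separately for every $ t $ would be quadratic.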

#### **REINFORCE algorithm**

The **REINFORCE algorithm** is a basic policy gradient method that uses the return $ G_t $ from each time step of a sampled trajectory to update the policy parameters. The update rule for the parameters $ \theta $ is:

$$
\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) G_t
$$

Where:
- $ \alpha $ is the learning rate,
- $ \nabla_\theta \log \pi_\theta(a_t | s_t) $ is the gradient of the log-probability of the action,
- $ G_t $ is the cumulative return from time step $ t $.

The REINFORCE algorithm updates the policy parameters based on the rewards obtained from each episode, reinforcing actions that lead to higher rewards and penalizing those that lead to lower rewards. However, REINFORCE tends to suffer from high variance, which makes learning slower and more unstable.
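
In PyTorch, this update is commonly expressed as a surrogate loss whose gradient is the negative of the sum above, so that a standard optimizer step performs gradient ascent on the return. The sketch below assumes a two-action policy parameterised directly by its logits; `reinforce_loss` and the toy numbers are illustrative:

```python
import torch


def reinforce_loss(log_probs, returns):
    """Surrogate loss: its gradient is minus the REINFORCE gradient estimate."""
    log_probs = torch.stack(log_probs)  # shape (T,)
    returns = torch.as_tensor(returns, dtype=log_probs.dtype)
    # Returns act as constant weights; only log_probs carry gradients.
    return -(log_probs * returns).sum()


# Toy check: uniform two-action policy, three recorded steps.
logits = torch.zeros(2, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)
actions = torch.tensor([0, 1, 0])
log_probs = list(dist.log_prob(actions))  # one scalar tensor per step
loss = reinforce_loss(log_probs, [1.0, 0.0, 2.0])
loss.backward()
# The gradient is [-1.5, 1.5], so a descent step raises the logit of
# action 0, the action that was followed by positive returns.
print(logits.grad)
```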

#### **Baseline subtraction for variance reduction**

To reduce the variance of the gradient estimates, policy gradient methods often use a **baseline**. The baseline does not change the expected value of the gradient but helps to reduce the variance in updates. The most commonly used baseline is the **value function** $ V(s_t) $, which estimates the expected return from state $ s_t $.

The updated policy gradient with baseline subtraction is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) (G_t - V(s_t)) \right]
$$

By subtracting the baseline $ V(s_t) $, the algorithm focuses on actions that lead to returns better or worse than expected, reducing the variance in gradient estimates.
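
The claim that a baseline leaves the expected gradient unchanged can be checked directly. The following sketch assumes a discrete action space and writes $ b(s_t) $ for any baseline that depends on the state but not on the action (with $ b(s_t) = V(s_t) $ as the special case used above):

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, b(s_t) \right] = b(s_t) \sum_{a} \pi_\theta(a | s_t) \nabla_\theta \log \pi_\theta(a | s_t) = b(s_t) \nabla_\theta \sum_{a} \pi_\theta(a | s_t) = b(s_t) \nabla_\theta 1 = 0
$$

The subtracted term therefore contributes nothing in expectation, and only the variance of the estimate changes.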

#### **Actor-Critic methods**

**Actor-Critic methods** combine the advantages of policy gradient methods (actor) and value-based methods (critic). The actor represents the policy, and the critic estimates the value function $ V(s_t) $, which is used to guide the actor’s policy updates. Instead of using the full return $ G_t $, the actor uses the **advantage function** $ A(s_t, a_t) = Q(s_t, a_t) - V(s_t) $, where $ Q(s_t, a_t) $ is the expected return after taking action $ a_t $ in state $ s_t $. The advantage measures how much better or worse an action is compared to the expected outcome in that state. Because $ Q $ is not known exactly, the advantage is estimated in practice, for example from the sampled return:

$$
A(s_t, a_t) \approx G_t - V(s_t)
$$

The actor is updated using the advantage, which reduces the variance of the updates while still improving the policy. The critic, on the other hand, is trained to minimize the error between the predicted value $ V(s_t) $ and the actual return $ G_t $.

#### **Proximal Policy Optimization (PPO)**

**Proximal Policy Optimization (PPO)** is an advanced policy gradient method that introduces a constraint on how much the policy can change in each update, ensuring more stable learning. PPO uses a clipped objective function to prevent large updates that could destabilize the policy. The PPO update rule prevents the policy from deviating too far from the previous policy:

$$
L^{\text{PPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( r_t(\theta) A(s_t, a_t), \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A(s_t, a_t) \right) \right]
$$

Where:
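- $ r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} $ is the probability ratio between the new policy and the old policy that collected the data (not to be confused with the reward $ r_t $),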
- $ A(s_t, a_t) $ is the advantage function,
- $ \epsilon $ is a small threshold that limits how much the policy is allowed to change.

By constraining the updates, PPO ensures stable and reliable policy improvement, making it one of the most popular and widely used policy gradient methods in reinforcement learning.
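
A minimal sketch of the clipped objective in PyTorch follows. It assumes log-probabilities under the new and old policies and advantage estimates have already been computed for a batch of steps; the function name `ppo_clip_loss` and the default `clip_eps=0.2` are illustrative choices:

```python
import torch


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    # Probability ratio, computed in log space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to push the ratio outside the clip range.
    return -torch.min(unclipped, clipped).mean()


# Toy batch with ratios 1.5, 0.5 and 1.0.
new_lp = torch.log(torch.tensor([1.5, 0.5, 1.0]))
old_lp = torch.zeros(3)
adv = torch.tensor([1.0, -1.0, 2.0])
example_loss = ppo_clip_loss(new_lp, old_lp, adv)
print(example_loss)  # about -0.8
```

In the toy batch, the first two ratios fall outside $ [0.8, 1.2] $ and are clipped, while the third passes through unchanged.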

## Setting up the environment


##### **Q1: How do you install the necessary libraries for implementing policy gradient methods in PyTorch?**


##### **Q2: How do you import the required modules for building the policy network and interacting with the environment in PyTorch?**


##### **Q3: How do you configure the environment to utilize a GPU for training policy gradient agents in PyTorch?**
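
One common pattern is to pick the device once and move both the model and its inputs to it. The sketch below falls back to the CPU when no CUDA device is available; the `nn.Linear` layer is only a stand-in for a real policy network:

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy = torch.nn.Linear(4, 2).to(device)  # stand-in for a policy network
state = torch.zeros(1, 4, device=device)   # inputs must live on the same device
logits = policy(state)
print(logits.device)
```

For small networks and environments that run on the CPU, the cost of moving data between devices can outweigh the benefit of the GPU, so it is worth measuring both.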

## Defining the environment for training


##### **Q4: How do you load a reinforcement learning environment using OpenAI Gym for training a policy gradient agent?**


##### **Q5: How do you retrieve the state and action space of the Gym environment for defining the policy network input and output?**


##### **Q6: How do you reset the environment and retrieve the initial state for the training episodes?**

## Building the policy network


##### **Q7: How do you define the architecture of the policy network using `torch.nn.Module`?**
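
A minimal sketch for a discrete action space is a small multilayer perceptron whose forward pass returns action probabilities. The class name `PolicyNetwork`, the single hidden layer, and `hidden_dim=128` are assumptions made for illustration:

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # Softmax over the last dimension so each row sums to 1.
        return torch.softmax(logits, dim=-1)


policy = PolicyNetwork(state_dim=4, n_actions=2)
probs = policy(torch.zeros(3, 4))  # batch of 3 states
print(probs.shape)  # torch.Size([3, 2])
```

Returning raw logits and passing them to `Categorical(logits=...)` is an equally valid design and avoids taking the logarithm of very small probabilities.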


##### **Q8: How do you implement the forward pass of the policy network to output action probabilities for a given state?**


##### **Q9: How do you initialize the weights of the policy network to ensure stable training and faster convergence?**
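
One scheme often used for policy networks is orthogonal initialization with a small gain on the output layer, so that the initial policy is close to uniform. This is a sketch of that choice rather than a guaranteed improvement; whether it helps depends on the task. The helper name `init_weights` is illustrative:

```python
import torch
import torch.nn as nn


def init_weights(module):
    """Orthogonal weights and zero biases for every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=nn.init.calculate_gain("relu"))
        nn.init.zeros_(module.bias)


net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
net.apply(init_weights)  # visits every submodule

# Re-initialize the output layer with a small gain so the initial logits
# are near zero and the starting policy is close to uniform.
nn.init.orthogonal_(net[-1].weight, gain=0.01)
```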

## Implementing the action selection policy


##### **Q10: How do you implement an action selection mechanism by sampling from the policy network’s output?**
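
A minimal sketch using `torch.distributions.Categorical`: the policy's probabilities define a distribution, an action is sampled from it, and the log-probability of that action is kept for the later gradient computation. The helper `select_action`, the list `saved_log_probs`, and the stand-in policy are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


def select_action(policy, state):
    """Sample an action and return it together with its log-probability."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)  # add batch dim
    probs = policy(state_t)          # shape (1, n_actions)
    dist = Categorical(probs=probs)
    action = dist.sample()           # shape (1,)
    return action.item(), dist.log_prob(action).squeeze(0)


# Stand-in policy: any module that outputs action probabilities works here.
policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
saved_log_probs = []
action, log_prob = select_action(policy, [0.1, -0.2, 0.0, 0.3])
saved_log_probs.append(log_prob)     # keeps the graph needed for backpropagation
```

The log-probability must stay attached to the computation graph, so it is stored as a tensor rather than converted to a Python float.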


##### **Q11: How do you store the log-probabilities of the selected actions to be used later in computing the policy gradient?**


##### **Q12: How do you modify the action selection process for continuous action spaces using different probability distributions?**
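
For continuous actions, a common choice is a diagonal Gaussian: the network outputs the mean and a separate learnable parameter holds the log standard deviation. The sketch below assumes this parameterisation; `GaussianPolicy` and its layer sizes are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy with a state-independent standard deviation."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mean = nn.Linear(hidden_dim, action_dim)
        # Learn log(std) so the standard deviation stays positive.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mean(self.body(state))
        return Normal(mean, self.log_std.exp())


policy = GaussianPolicy(state_dim=3, action_dim=2)
dist = policy(torch.zeros(1, 3))
action = dist.sample()                         # shape (1, 2)
# Dimensions are independent, so the joint log-probability is the sum.
log_prob = dist.log_prob(action).sum(dim=-1)   # shape (1,)
```

If the environment bounds its actions, the sampled action usually needs to be clipped or squashed before it is passed to the environment.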

## Computing the policy gradient


##### **Q13: How do you compute the reward-to-go (discounted sum of future rewards) for each action in an episode?**


##### **Q14: How do you calculate the policy gradient using the log-probabilities of actions and the corresponding rewards?**


##### **Q15: How do you implement a method to compute the loss function for policy gradient updates using the log-probabilities and rewards?**

## Implementing the REINFORCE algorithm


##### **Q16: How do you collect experiences (state, action, reward) for each episode and store them for gradient computation?**


##### **Q17: How do you compute the policy gradient for an entire episode using the REINFORCE algorithm?**


##### **Q18: How do you apply backpropagation and perform gradient descent to update the weights of the policy network based on the computed gradients?**

## Training the policy gradient agent


##### **Q19: How do you implement the training loop for a policy gradient agent, including resetting the environment and selecting actions based on the policy?**
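
The sketch below shows one way to structure the loop: reset the environment, roll out an episode while recording log-probabilities and rewards, then apply a single REINFORCE update. To keep it self-contained, it uses `SignEnv`, a hypothetical toy environment written for this example; it is not a Gym environment, and its `reset`/`step` signatures are simplified.

```python
import random
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SignEnv:
    """Hypothetical toy task: reward 1 when the action matches the sign of the observation."""

    def __init__(self, horizon=5):
        self.horizon = horizon

    def _observe(self):
        self.sign = random.choice([-1.0, 1.0])
        return [self.sign]

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, action):
        reward = 1.0 if (action == 1) == (self.sign > 0) else 0.0
        self.t += 1
        return self._observe(), reward, self.t >= self.horizon


def train(num_episodes=200, gamma=0.99, lr=0.05, seed=0):
    random.seed(seed)
    torch.manual_seed(seed)
    env = SignEnv()
    policy = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    episode_rewards = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            dist = Categorical(logits=policy(torch.tensor(state)))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())
            rewards.append(reward)
        # Discounted reward-to-go for every step, computed backwards.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.insert(0, running)
        loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        episode_rewards.append(sum(rewards))  # tracked for monitoring
    return policy, episode_rewards


policy, episode_rewards = train()
print(sum(episode_rewards[-20:]) / 20)  # average reward over the last 20 episodes
```

With a real Gym environment, the loop keeps the same shape; only the `reset` and `step` calls need to follow that library's return values.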


##### **Q20: How do you track and store the rewards for each episode to monitor the agent’s performance over time?**


##### **Q21: How do you update the policy network at the end of each episode using the policy gradient computed with the REINFORCE algorithm?**

## Evaluating the policy gradient agent


##### **Q22: How do you evaluate the policy gradient agent by running it on the environment without exploration (i.e., using deterministic behavior)?**
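
For a discrete policy, deterministic evaluation can be sketched as taking the most probable action instead of sampling, with gradient tracking switched off. The helper `greedy_action` and the hand-set stand-in policy are illustrative:

```python
import torch
import torch.nn as nn


def greedy_action(policy, state):
    """Pick the most probable action instead of sampling (evaluation only)."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():  # no gradients are needed at evaluation time
        probs = policy(state_t)
    return int(probs.argmax(dim=-1).item())


# Stand-in policy with hand-set weights so the result is easy to verify.
policy = nn.Sequential(nn.Linear(2, 2), nn.Softmax(dim=-1))
with torch.no_grad():
    policy[0].weight.copy_(torch.eye(2))
    policy[0].bias.zero_()
policy.eval()
print(greedy_action(policy, [0.2, 0.9]))  # 1
```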


##### **Q23: How do you visualize the cumulative reward over episodes to assess the performance of the trained policy gradient agent?**


##### **Q24: How do you compare the agent’s performance when using a policy gradient method to other reinforcement learning methods?**

## Experimenting with hyperparameters


##### **Q25: How do you experiment with different learning rates to observe their impact on the agent’s learning speed and convergence?**


##### **Q26: How do you adjust the discount factor (gamma) to analyze how it affects the agent’s long-term planning and reward optimization?**


##### **Q27: How do you experiment with different policy network architectures, such as adding more layers or units, and analyze the effect on performance?**


##### **Q28: How do you experiment with different batch sizes for updating the policy and observe their impact on training stability?**


##### **Q29: How do you experiment with different methods of reward normalization to improve the stability of the policy gradient updates?**
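
One simple method to try is standardizing the returns of each episode to zero mean and unit variance before they weight the log-probabilities. The sketch below assumes per-episode standardization; `normalize_returns` and `eps` are illustrative names:

```python
import torch


def normalize_returns(returns, eps=1e-8):
    """Standardize returns to zero mean and (approximately) unit variance."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    if returns.numel() < 2:
        # The standard deviation of a single value is undefined; only center it.
        return returns - returns.mean()
    return (returns - returns.mean()) / (returns.std() + eps)


print(normalize_returns([1.0, 2.0, 3.0]))  # approximately [-1, 0, 1]
```

Subtracting the mean acts like a constant baseline, while dividing by the standard deviation rescales the gradient, so results with and without each step are worth comparing separately.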


##### **Q30: How do you implement and evaluate an entropy bonus to encourage exploration during training and avoid premature convergence?**
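
A minimal sketch: compute the entropy of the action distribution at each step and subtract a small multiple of it from the policy loss, so that minimizing the loss also favors higher-entropy (more exploratory) policies. The function name and the coefficient `entropy_coef=0.01` are illustrative and would need tuning:

```python
import torch
from torch.distributions import Categorical


def loss_with_entropy_bonus(logits, actions, returns, entropy_coef=0.01):
    """Policy gradient loss minus an entropy bonus; also returns the mean entropy."""
    dist = Categorical(logits=logits)
    policy_loss = -(dist.log_prob(actions) * returns).mean()
    entropy = dist.entropy().mean()
    # Subtracting the entropy rewards the optimizer for keeping the policy stochastic.
    return policy_loss - entropy_coef * entropy, entropy


logits = torch.zeros(3, 2, requires_grad=True)  # uniform policy over 2 actions
actions = torch.tensor([0, 1, 0])
returns = torch.ones(3)
total_loss, entropy = loss_with_entropy_bonus(logits, actions, returns)
print(entropy)  # ln(2), about 0.693, the maximum for two actions
```

Logging the mean entropy during training is a simple way to evaluate the effect: a value that collapses toward zero early suggests premature convergence.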

## Conclusion