# **REINFORCE Algorithm**

#### **The code in this notebook is based on [Nimish Sanghi's book on Deep RL](https://link.springer.com/book/10.1007/979-8-8688-0273-7).**

Feel free to check out this notebook on Kaggle: https://www.kaggle.com/code/aryamanbansal/reinforce

### **1. Overview of REINFORCE**

**REINFORCE** is one of the earliest policy gradient methods in reinforcement learning. It is a **Monte Carlo** algorithm that estimates the gradient of the expected reward by sampling complete episodes. The key idea is to update the policy parameters in the direction that increases the likelihood of actions that produced high rewards.

#### **Core Objective**

We want to maximize the expected total reward (or return) given by the objective function:
$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
$$
where:
- $ \pi_\theta(a|s) $ is the policy parameterized by $ \theta $,
- $ \tau $ is a trajectory $ (s_0, a_0, s_1, a_1, \dots, s_T, a_T) $,
- $ R(\tau) $ is the cumulative reward (possibly discounted) received along trajectory $ \tau $.

---

### **2. Deriving the REINFORCE Gradient**

#### **Step 1: Write the Objective as an Expectation**

We express the expected return as an integral (or sum in the discrete case):
$$
J(\theta) = \int \pi_\theta(\tau) R(\tau) \, d\tau
$$
Here, $ \pi_\theta(\tau) $ is the probability of observing trajectory $ \tau $ under policy $ \pi_\theta $.

#### **Step 2: Differentiate the Objective**

To maximize $ J(\theta) $, we need its gradient:
$$
\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau
$$
Assuming we can interchange the gradient and the integral (which is standard under regularity conditions), we get:
$$
\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau
$$

#### **Step 3: Introduce the Log-Derivative Trick**

Directly differentiating $ \pi_\theta(\tau) $ can be tricky. The log-derivative trick (also known as the likelihood ratio method) simplifies this:
$$
\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau)
$$
Substitute this back into the gradient:
$$
\nabla_\theta J(\theta) = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau
$$
Since the integral weighted by $ \pi_\theta(\tau) $ is an expectation, we write:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]
$$
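
As a quick sanity check of this identity, the sketch below compares the exact gradient with a Monte Carlo score-function estimate on a one-step problem. The three-action softmax policy, the reward values, and the helper names (`softmax`, `exact_gradient`, `score_function_estimate`) are illustrative assumptions and not part of the notebook's implementation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    """Exact dJ/dtheta_k = pi_k * (R_k - J) for J = sum_a pi_a * R_a."""
    probs = softmax(logits)
    J = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - J) for p, r in zip(probs, rewards)]

def score_function_estimate(logits, rewards, n_samples, rng):
    """Monte Carlo average of grad log pi(a) * R(a) with a sampled from pi."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            # For a softmax policy: d log pi(a) / d theta_k = 1[a == k] - pi_k
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += grad_log_pi * rewards[a] / n_samples
    return grad

rng = random.Random(0)
logits = [0.2, -0.1, 0.5]   # hypothetical policy parameters
rewards = [1.0, 3.0, 0.5]   # hypothetical reward for each action
exact = exact_gradient(logits, rewards)
estimate = score_function_estimate(logits, rewards, 50_000, rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

With enough samples the two vectors should agree to within sampling noise, which is the point of the log-derivative trick: the gradient can be estimated from samples alone, without differentiating through the sampling step.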

#### **Step 4: Decomposing the Trajectory Probability**

A trajectory $ \tau $ is a sequence of states and actions:
$$
\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)
$$
Under the Markov property, the probability of a trajectory can be factorized as:
$$
\pi_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) \, p(s_{t+1}|s_t, a_t)
$$
Since the environment dynamics $ p(s_{t+1}|s_t, a_t) $ and the initial state distribution $ p(s_0) $ do not depend on $ \theta $, only the policy terms matter when we differentiate with respect to $ \theta $.

Taking the logarithm, we have:
$$
\log \pi_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t) \right]
$$
Because $ \log p(s_0) $ and $ \log p(s_{t+1}|s_t, a_t) $ are independent of $ \theta $, their gradients vanish. Therefore:
$$
\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
$$

#### **Step 5: Final Form of the Gradient**

Substitute the decomposition back into our gradient expression:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau) \right]
$$

#### **Step 6: Variance Reduction via Rewards-to-Go**

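Weighting every time step by the total return $ R(\tau) $ introduces unnecessary variance: an action taken at time $ t $ cannot influence rewards received before $ t $, and those earlier rewards contribute zero to the gradient in expectation. We therefore replace $ R(\tau) $ with the **reward-to-go** $ G_t $, defined as: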
$$
G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
$$
Now the gradient becomes:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right]
$$
This tells us to update the parameters based on how each action contributed to the future rewards (with discounting $ \gamma $).
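
To make the definition of $ G_t $ concrete, here is a small worked example that evaluates the sum directly. The function name `rewards_to_go_direct` and the reward sequence are assumptions chosen for illustration.

```python
def rewards_to_go_direct(rewards, gamma):
    """Evaluate G_t = sum_{k >= t} gamma^(k - t) * r_k straight from the definition."""
    n = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, n))
        for t in range(n)
    ]

# Three steps with reward 1 each and gamma = 0.5:
#   G_0 = 1 + 0.5 + 0.25 = 1.75,  G_1 = 1 + 0.5 = 1.5,  G_2 = 1
print(rewards_to_go_direct([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

This direct form costs $ O(T^2) $ operations; a backward recursion $ G_t = r_t + \gamma G_{t+1} $ gives the same values in $ O(T) $.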

---

### **3. Intuitive Explanation**

Imagine you are learning a recipe:
- **Trajectory:** Each complete cooking session is a trajectory - a series of steps (actions) you take (mixing, frying, etc.).
- **Reward:** At the end, you taste the dish (the overall reward).
- **Adjustment:** If the dish is great (high reward), you want to remember which steps helped make it good. The algorithm "nudges" your recipe to favor those steps.

The key idea is to **increase the probability of actions** that led to high rewards. By computing the gradient of the log probability for each action and weighting it by the future rewards (or rewards-to-go), you tell the algorithm, "These actions were good; do them more often!"

### **4. Summary**

- **Derivation:**  
  Starting from the expected return, we used the log-derivative trick to derive the gradient estimator:
  $$
  \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right]
  $$
- **Intuition:**  
  Actions that lead to higher rewards are reinforced by increasing their log probabilities, while poor actions are discouraged.

### **REINFORCE with Rewards-to-Go Algorithm**

**Given:**
- A parameterized policy $ \pi_\theta(a|s) $
- Discount factor $ \gamma \in [0,1] $
- Learning rate $ \alpha $
- Environment $ \mathcal{E} $

**Algorithm:**

1. **Initialize** policy parameters $ \theta $.

2. **For** episode $ i = 1, 2, \dots, N $:
   1. Initialize state $ s_0 $.
   2. **For** $ t = 0, 1, \dots, T $ (until terminal state):
      - Sample action $ a_t \sim \pi_\theta(\cdot|s_t) $.
      - Execute $ a_t $, observe reward $ r_t $ and next state $ s_{t+1} $.
   3. **Compute rewards-to-go:**  
      $$
      G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \quad \text{for } t=0,\ldots,T.
      $$
   4. **Policy Update:**  
      Update parameters using:
      $$
      \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
      $$
      where
      $$
      \nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t.
      $$

3. **Repeat** until convergence.
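
The steps above can be sketched on the smallest possible problem: a two-armed bandit, where every episode has a single step and so $ G_0 = r_0 $. The reward values, learning rate, and the function name `run_bandit_reinforce` are assumptions made for this illustration only.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit_reinforce(n_episodes=2000, alpha=0.1, seed=0):
    """REINFORCE with a softmax policy on a two-armed bandit (one-step episodes)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # one logit per arm
    arm_rewards = [0.0, 1.0]    # hypothetical: arm 1 is the better arm
    for _ in range(n_episodes):
        probs = softmax(theta)
        a = rng.choices([0, 1], weights=probs)[0]   # sample a ~ pi_theta
        G = arm_rewards[a]                          # reward-to-go of the only step
        for k in range(2):
            # d log pi(a) / d theta_k = 1[a == k] - pi_k for a softmax policy
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * grad_log_pi * G     # gradient ascent step
    return softmax(theta)

final_probs = run_bandit_reinforce()
print(final_probs)  # probability mass should concentrate on arm 1
```

The multi-step case differs only in that each time step contributes its own $ \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t $ term to the update.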

### **Final Remarks**


#### **Rewards-to-Go Trick**

- **Basic Idea:**  
  When you take an action at time *t*, you cannot affect the rewards that occurred **before** time *t*. Therefore, only the rewards from time *t* onward should be considered. This is why we sum the rewards starting at time *t* (instead of from the beginning) to compute the “reward-to-go.”
  
- **Mathematical Expression:**  
  Instead of summing rewards from time 1 to *T* for every action, we modify the gradient expression so that for each time *t*, we sum only over rewards from *t* to *T*. Formally:
  
  $$
  \nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \left( \nabla_{\theta} \log \pi_\theta(a_t^i|s_t^i) \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}^i, a_{t'}^i) \right) \right]
  $$
  
  This means that the effect of an action at time *t* is weighted by the sum of future (discounted) rewards starting from that time.
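
The variance benefit can be checked exactly on a toy problem. The sketch below assumes a stateless two-step episode with two actions, reward equal to the action index, $ \gamma = 1 $, and a gradient taken with respect to the logit of action 1; all of these choices, and the name `estimator_mean_and_variance`, are illustrative.

```python
from itertools import product

def estimator_mean_and_variance(p=0.5):
    """Enumerate all two-step trajectories and compare both gradient estimators.

    p is the probability of action 1 under a two-action softmax policy.
    """
    def prob(a):
        return p if a == 1 else 1.0 - p

    def score(a):
        # d log pi(a) / d(logit of action 1) = 1[a == 1] - p
        return (1.0 if a == 1 else 0.0) - p

    def reward(a):
        return float(a)

    weights, full, rtg = [], [], []
    for a0, a1 in product([0, 1], repeat=2):
        weights.append(prob(a0) * prob(a1))
        # Every step weighted by the total return R(tau).
        full.append((score(a0) + score(a1)) * (reward(a0) + reward(a1)))
        # Each step weighted only by rewards from that step onward.
        rtg.append(score(a0) * (reward(a0) + reward(a1)) + score(a1) * reward(a1))

    def mean_var(values):
        mean = sum(w * v for w, v in zip(weights, values))
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
        return mean, var

    return mean_var(full), mean_var(rtg)

(full_mean, full_var), (rtg_mean, rtg_var) = estimator_mean_and_variance()
print(full_mean, full_var)  # 0.5 0.75
print(rtg_mean, rtg_var)    # 0.5 0.375
```

Both estimators have the same mean, so neither is biased relative to the other, but in this example the reward-to-go version has half the variance.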

#### **Implementing Loss and Gradient Step in PyTorch**

- **Pseudo Loss Function:**  
  In practice, we want to maximize the expected reward. However, most deep learning libraries (like PyTorch) minimize a loss function. Therefore, we define a pseudo loss as the negative of our reward objective.
  
- **Loss Function for REINFORCE with Rewards-to-Go:**  
  The loss is formulated as:
  
  $$
  L_{\text{Cross-entropy}}(\theta) = - J(\theta) = - \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \left( \log \pi_\theta(a_t^i|s_t^i) \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}^i, a_{t'}^i) \right) \right]
  $$
  
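  This has the same form as a cross-entropy loss: for every action taken, we take the negative log probability and weight it by the reward-to-go. The rewards-to-go are treated as constants, so differentiating this loss yields the negative of the gradient estimator above.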

- **Entropy Regularization:**  
  To prevent the policy from becoming too certain too quickly (which would limit exploration), an entropy term is added. The entropy of a probability distribution is defined as:
  
  $$
  H(X) = - \sum_x p(x) \log p(x)
  $$
  
  A higher entropy means a more spread-out (uncertain) distribution. We add a regularization term (scaled by a coefficient $\beta$) to encourage exploration.

- **Combined Loss:**  
  The total loss (which is minimized) becomes:
  
  $$
  \text{Loss}(\theta) = - J(\theta) - \beta \, H(\pi_\theta) = - \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \left( \log \pi_\theta(a_t^i|s_t^i) \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}^i, a_{t'}^i) - \beta \sum_{a} \pi_\theta(a|s_t^i) \log \pi_\theta(a|s_t^i) \right) \right]
  $$
  
  In simple terms:
  - We compute the (negative) weighted log probabilities of the actions.
  - We weight each by its corresponding future rewards (reward-to-go).
  - We add an entropy term to keep the policy "spread out" (promoting exploration).
  - Finally, we perform a gradient descent step on this loss to update our policy parameters.
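
As a worked example of the combined loss, the sketch below evaluates it by hand for a single two-step episode using plain Python. It averages over time steps rather than summing, which only rescales the loss; the function names and the toy probabilities, actions, and rewards are assumptions for illustration.

```python
import math

def entropy(probs):
    """H(p) = -sum_x p(x) log p(x), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pseudo_loss(step_probs, actions, rewards, gamma, beta):
    """Negative of (mean log-prob weighted by reward-to-go + beta * mean entropy)."""
    T = len(rewards)
    # Rewards-to-go via the backward recursion G_t = r_t + gamma * G_{t+1}.
    G = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    J = sum(math.log(step_probs[t][actions[t]]) * G[t] for t in range(T)) / T
    H = sum(entropy(p) for p in step_probs) / T
    return -(J + beta * H)

step_probs = [[0.5, 0.5], [0.5, 0.5]]   # hypothetical policy output at each step
actions = [0, 1]
rewards = [1.0, 1.0]

loss_plain = pseudo_loss(step_probs, actions, rewards, gamma=1.0, beta=0.0)
loss_reg = pseudo_loss(step_probs, actions, rewards, gamma=1.0, beta=0.01)
print(loss_plain)  # 1.5 * ln 2, since G = [2, 1] and every log-prob is ln 0.5
print(loss_reg)    # slightly lower: the entropy bonus reduces the loss
```

Because the entropy enters with a negative sign, a more spread-out policy lowers the loss, which is how the regularizer rewards exploration.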

In [None]:
import gymnasium as gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import matplotlib.pyplot as plt
import random
from scipy.signal import convolve
from scipy.signal.windows import gaussian

from base64 import b64encode
from IPython.display import HTML, clear_output

from tqdm import trange

print("imports done!")

imports done!


In [7]:
# set a seed
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x2a8911765f0>

In [8]:
# Assuming a global device setting (CPU or CUDA)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [9]:
class Agent(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=192):
        super(Agent, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    

In [10]:
def predict_probs(agent, states):
    """
    Compute action probabilities for a given batch of states using the model.
    
    Parameters:
        agent (nn.Module): The neural network model of the agent.
        states (list or array): A batch of states with shape [batch, state_dim].
    
    Returns:
        numpy.ndarray: An array of probabilities with shape [batch, n_actions].
    """
    # Convert the states to a PyTorch tensor and send to device
    states = torch.tensor(np.array(states), device=device, dtype=torch.float32)
    
    # Compute the logits from the model without tracking gradients
    with torch.no_grad():
        logits = agent(states)
    
    # Convert logits to probabilities using softmax and return them as numpy array
    probs = nn.functional.softmax(logits, dim=-1).detach().cpu().numpy()
    return probs


In [11]:
def generate_trajectory(env, agent, n_steps=1000):
    """
    Generate a trajectory (sequence of states, actions, rewards) by running the policy.
    
    Parameters:
        env: The environment instance.
        agent: The agent model.
        n_steps (int): Maximum number of steps to run in the episode.
    
    Returns:
        tuple: Arrays of states, actions, and rewards collected during the episode.
    """
    states, actions, rewards = [], [], []

    # Start a new episode. The environment is seeded once outside this function;
    # re-seeding on every reset would replay the same initial state each episode.
    s, _ = env.reset()
    
    # Generate the trajectory for a maximum of n_steps or until termination.
    for t in range(n_steps):
        # Get action probabilities for the current state.
        action_probs = predict_probs(agent, np.array([s]))[0]
        
        # Sample an action from the probability distribution.
        n_actions = env.action_space.n
        a = np.random.choice(n_actions, p=action_probs)
        
        # Execute the action; Gymnasium returns separate terminated/truncated flags.
        next_state, r, terminated, truncated, _ = env.step(a)
        
        # Store the current state, chosen action, and obtained reward.
        states.append(s)
        actions.append(a)
        rewards.append(r)
        
        s = next_state  # Update state to the next state.
        if terminated or truncated:
            break  # End the trajectory on termination or time-limit truncation.
    
    return np.array(states), np.array(actions), np.array(rewards)


In [12]:
def evaluate(env, agent, n_games=3, t_max=10000):
    """
    Evaluate the current policy by averaging returns over several games.
    
    Parameters:
        env: The environment instance for evaluation.
        agent: The policy model used to compute action probabilities.
        n_games (int): Number of games to run for evaluation.
        t_max (int): Maximum number of steps per game.
    
    Returns:
        float: Mean return (total reward) over the evaluation games.
    """
    rewards = []
    for i in range(n_games):
        # Reset the environment with a different seed for each evaluation game.
        s, _ = env.reset(seed=seed + i)
        reward = 0
        for _ in range(t_max):
            # Get action probabilities for the current state.
            action_probs = predict_probs(agent, np.array([s]))[0]
            # Sample an action based on the probabilities.
            n_actions = env.action_space.n
            a = np.random.choice(n_actions, p=action_probs)
            # Execute the action and obtain the next state and reward.
            next_state, r, terminated, truncated, _ = env.step(a)
            reward += r
            s = next_state
            if terminated or truncated:
                break  # End the game on termination or time-limit truncation.
        rewards.append(reward)
    return np.mean(rewards)


In [13]:
def get_rewards_to_go(rewards, gamma=0.99):
    """
    Compute the rewards-to-go for a given sequence of rewards.
    
    The rewards-to-go at time t is the discounted sum of rewards from time t onward.
    
    Parameters:
        rewards (list): List of rewards for each time step in the trajectory.
        gamma (float): Discount factor.
    
    Returns:
        list: Rewards-to-go for each time step.
    """
    T = len(rewards)  # Total number of rewards.
    rewards_to_go = [0] * T  # Initialize the rewards-to-go list.
    rewards_to_go[T - 1] = rewards[T - 1]  # The last time step's reward-to-go is just its reward.
    
    # Iterate backwards from the second last reward to the first.
    for i in range(T - 2, -1, -1):
        rewards_to_go[i] = gamma * rewards_to_go[i + 1] + rewards[i]
    
    return rewards_to_go


In [None]:
def train_one_episode(states, actions, rewards, agent, optimizer,
                      gamma=0.99, entropy_coef=1e-2):
    """
    Train the policy network on one trajectory (episode) using the REINFORCE algorithm
    with rewards-to-go and an entropy regularization term.
    
    The loss is computed as the negative of:
        J(θ) = (1/T) * sum_t [ log π(a_t|s_t) * (reward-to-go)_t ]
    plus an entropy bonus to encourage exploration.
    
    Parameters:
        states (np.array): Array of states encountered in the episode.
        actions (np.array): Array of actions taken in the episode.
        rewards (np.array): Array of rewards received in the episode.
        gamma (float): Discount factor for computing rewards-to-go.
        entropy_coef (float): Coefficient for the entropy regularization term.
    
    Returns:
        torch.Tensor: The loss value (detached) for monitoring training progress.
    """
    # Compute rewards-to-go for the trajectory.
    rewards_to_go = get_rewards_to_go(rewards, gamma)

    # Convert numpy arrays to PyTorch tensors and move them to the proper device.
    states = torch.tensor(states, device=device, dtype=torch.float)
    actions = torch.tensor(actions, device=device, dtype=torch.long)
    rewards_to_go = torch.tensor(rewards_to_go, device=device, dtype=torch.float)

    # Pass the states through the model to obtain logits.
    logits = agent(states)
    # Convert logits to probabilities.
    probs = nn.functional.softmax(logits, dim=-1)
    # Compute the log probabilities.
    log_probs = nn.functional.log_softmax(logits, dim=-1)
    
    # Select the log probabilities corresponding to the actions taken.
    log_probs_for_actions = log_probs[range(len(actions)), actions]
    
    # Compute the objective J(θ) (average over the episode).
    J = torch.mean(log_probs_for_actions * rewards_to_go)
    # Compute the entropy of the policy to encourage exploration.
    H = -(probs * log_probs).sum(dim=-1).mean()
    
    # The loss to minimize is the negative of (J + entropy bonus).
    loss = -(J + entropy_coef * H)

    # Backpropagation: clear gradients, compute new gradients, and update parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Return the loss value (detached from the computation graph).
    return loss.detach().cpu()


In [None]:
def smoothen(values):
    """
    Smooths out the given values using a Gaussian filter.

    Args:
        values (list or np.array): The sequence of values to smooth.

    Returns:
        np.array: The smoothed values.
    """
    kernel = gaussian(100, std=100)
    kernel = kernel / np.sum(kernel)
    return convolve(values, kernel, 'valid')

In [None]:
def main_train_loop(env, env_name, agent, optimizer, n_iterations=1000, 
                    eval_freq=10, gamma=0.99, entropy_coef=1e-2):
    """
    Main training loop for the REINFORCE algorithm.

    This function iteratively generates trajectories from the environment,
    trains the policy network on each trajectory using the REINFORCE update
    with rewards-to-go and an entropy regularization term, and periodically
    evaluates the policy performance.

    Parameters:
        env (gym.Env): The training environment.
        env_name (str): Name of the environment (used for evaluation environment creation).
        agent (torch.nn.Module): The policy network to be trained.
        optimizer (torch.optim.Optimizer): The optimizer used to update the policy network.
        n_iterations (int): Total number of training iterations (episodes).
        eval_freq (int): Frequency (in iterations) at which to evaluate the policy.
        gamma (float): Discount factor for computing rewards-to-go.
        entropy_coef (float): Coefficient for the entropy regularization term.

    Returns:
        tuple: A tuple containing two lists:
            - loss_history (list): Loss values recorded at each training iteration.
            - return_history (list): Mean returns recorded at evaluation steps.
    """
    # Initialize histories for monitoring progress
    loss_history = []
    return_history = []
    
    # Loop over the specified number of iterations (episodes)
    for i in trange(n_iterations):
        # Generate one trajectory (episode) from the environment
        states, actions, rewards = generate_trajectory(env, agent)
        
        # Train the model on the collected trajectory
        loss = train_one_episode(states, actions, rewards, agent, optimizer,
                                 gamma=gamma, entropy_coef=entropy_coef)
        loss_history.append(loss)
        
        # Periodically evaluate the policy
        if i != 0 and i % eval_freq == 0:
            # Create a fresh evaluation environment instance
            eval_env = gym.make(env_name, render_mode="rgb_array")
            mean_return = evaluate(eval_env, agent)
            return_history.append(mean_return)
            eval_env.close()
            
            # Clear the previous output and plot progress (when running in a notebook)
            clear_output(wait=True)
            plt.figure(figsize=[16, 5])
            
            # Plot mean return per episode
            plt.subplot(1, 2, 1)
            plt.title("Mean return per episode")
            plt.plot(return_history)
            plt.grid()
            
            # Plot loss history (smoothened)
            plt.subplot(1, 2, 2)
            plt.title("Loss history (smoothened)")
            plt.plot(smoothen(loss_history))
            plt.grid()
            plt.show()
    
    return loss_history, return_history


### **Setting up the training parameters**



In [None]:
# Setup the environment and agent networks
env_name = 'CartPole-v1'
env = gym.make(env_name, render_mode="rgb_array", max_episode_steps=4000)    # Create the environment
state_dim = env.observation_space.shape[0]  # 4 for CartPole; nn.Linear expects an int
n_actions = env.action_space.n              # e.g., 2 for CartPole

In [None]:
# Reset environment and set seed for reproducibility
state, _ = env.reset(seed=seed)

# Initialize the policy network (REINFORCE samples actions from its softmax output)
agent = Agent(state_dim, n_actions).to(device)

# Initialize the optimizer (Adam) for updating the agent's parameters
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

# Define the evaluation frequency and the training length
eval_freq = 10                    # Evaluate the agent every 10 episodes
n_episodes = 5000                 # Total number of episodes to train the agent

loss_history = []                 # Initialize a list to store the loss values
return_history = []               # Initialize a list to store the return values

### **Applying REINFORCE on CartPole**



In [None]:
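# Pass the settings by keyword so eval_freq is not mistaken for gamma.
loss_history, return_history = main_train_loop(
    env, env_name, agent, opt, n_iterations=n_episodes, eval_freq=eval_freq
)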

In [None]:
final_score = evaluate(
  gym.make(env_name, render_mode="rgb_array", max_episode_steps=4000),
  agent, n_games=30, t_max=1000
)
print('final score:', final_score)