# 🧠 Unit 5.5: Policy Gradient Methods (REINFORCE)

**Course:** Advanced Machine Learning (AICC 303)  
**Topic:** 5.5 Policy Gradient Methods (REINFORCE)

So far (Q-Learning, SARSA, DQN), we have used **Value-Based** methods: estimate $Q(s,a)$ and then pick the action $\arg\max_a Q(s,a)$.
**Policy-Based** methods instead learn the probability distribution over actions, $\pi(a|s; \theta)$, directly.

**Advantages:**
1.  They can handle continuous action spaces, where taking an argmax over all actions is impractical.
2.  They can learn stochastic policies, which some games require (e.g., Rock-Paper-Scissors).
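To illustrate why a stochastic policy can matter, here is a minimal sketch (not part of the course code) that compares a deterministic policy with a uniform random one in Rock-Paper-Scissors, against an opponent who knows the policy and plays the move that hurts it most. The `PAYOFF` table and the helper `worst_case_payoff` are assumptions introduced only for this illustration.

```python
# Payoff for our move (row) against the opponent's move (column):
# +1 win, 0 draw, -1 loss. Order: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def worst_case_payoff(policy):
    """Expected payoff when the opponent picks the move that hurts us most."""
    expected = []
    for opp in range(3):
        expected.append(sum(policy[a] * PAYOFF[a][opp] for a in range(3)))
    return min(expected)

always_rock = [1.0, 0.0, 0.0]   # deterministic policy
uniform = [1/3, 1/3, 1/3]       # stochastic policy

print("always rock:", worst_case_payoff(always_rock))
print("uniform:    ", worst_case_payoff(uniform))
```

The deterministic policy loses every round once the opponent answers with paper, while the uniform policy breaks even in expectation against any opponent. A value-based agent that always takes the argmax action cannot represent the uniform policy.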

**REINFORCE Algorithm (Monte Carlo Policy Gradient):**
1.  Run an entire episode with the current policy.
2.  Calculate the discounted return $G_t$ from each time step $t$ to the end of the episode.
3.  Increase the probability of actions that were followed by a high $G_t$.
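
In symbols, these three steps correspond to the standard REINFORCE estimator. The notation beyond $\pi(a|s;\theta)$ and $G_t$ is introduced here as an assumption: $r_k$ is the reward received at step $k$, $T$ is the last step of the episode, $\gamma$ is the discount factor, and $\alpha$ is the learning rate.

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi(a_t|s_t; \theta)$$

$$\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$

Because optimizers minimize, implementations usually minimize the loss $-\sum_t G_t \log \pi(a_t|s_t; \theta)$, which performs the same gradient ascent step on $J(\theta)$.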

In [4]:
!pip install gymnasium


In [5]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical

env = gym.make('CartPole-v1')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)  # Outputs Probabilities

model = PolicyNetwork(env.observation_space.shape[0], env.action_space.n).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.002)

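def compute_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every time step, then standardize."""
    returns = []
    G = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    # Standardize to zero mean and unit variance (acts as a simple baseline
    # that reduces the variance of the gradient estimate)
    returns = torch.tensor(returns, dtype=torch.float32, device=device)
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)
    return returns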
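
As a quick sanity check of the backward recursion $G_t = r_t + \gamma G_{t+1}$ used in `compute_returns`, here is a small standalone sketch without the normalization step and without PyTorch. The function name `discounted_returns` and the choice $\gamma = 0.5$ are assumptions made only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)   # collected back to front
    returns.reverse()       # restore chronological order
    return returns

# Three steps with reward 1 each, as in a very short CartPole episode
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Working backwards: $G_2 = 1$, $G_1 = 1 + 0.5 \cdot 1 = 1.5$, $G_0 = 1 + 0.5 \cdot 1.5 = 1.75$. Earlier actions get larger returns because more rewards follow them.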

## 1. Training Loop

In [6]:
episodes = 200  # Usually too few to converge; increase for better results

for episode in range(episodes):
    state, _ = env.reset()
    
    log_probs = []
    rewards = []
    done = False
    trunc = False
    
    while not (done or trunc):
        # Convert state to a float tensor and add a batch dimension
        state_tensor = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        
        # Get Action Probabilities
        probs = model(state_tensor)
        
        # Sample action from distribution
        m = Categorical(probs)
        action = m.sample()
        
        next_state, reward, done, trunc, _ = env.step(action.item())
        
        log_probs.append(m.log_prob(action))
        rewards.append(reward)
        
        state = next_state
    
    # Compute returns
    returns = compute_returns(rewards)
    
    # Compute loss
    policy_loss = []
    for log_prob, G in zip(log_probs, returns):
        policy_loss.append(-log_prob * G)
    
    # Update Policy after full episode (Monte Carlo)
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
    
    if episode % 10 == 0:
        print(f"Episode {episode}, Total Reward: {sum(rewards)}, Loss: {loss.item():.4f}")

Episode 0, Total Reward: 21.0, Loss: -0.0970
Episode 10, Total Reward: 15.0, Loss: 0.1465
Episode 20, Total Reward: 15.0, Loss: 0.1363
Episode 30, Total Reward: 20.0, Loss: -0.4313
Episode 40, Total Reward: 40.0, Loss: -0.4784
Episode 50, Total Reward: 24.0, Loss: -0.2114
Episode 60, Total Reward: 47.0, Loss: -1.4042
Episode 70, Total Reward: 63.0, Loss: 0.1203
Episode 80, Total Reward: 29.0, Loss: -0.2223
Episode 90, Total Reward: 27.0, Loss: 0.3090
Episode 100, Total Reward: 30.0, Loss: -0.2519
Episode 110, Total Reward: 48.0, Loss: 0.1980
Episode 120, Total Reward: 59.0, Loss: -1.9546
Episode 130, Total Reward: 57.0, Loss: -1.2323
Episode 140, Total Reward: 39.0, Loss: -0.4346
Episode 150, Total Reward: 72.0, Loss: -2.2292
Episode 160, Total Reward: 26.0, Loss: 1.1136
Episode 170, Total Reward: 57.0, Loss: -1.5818
Episode 180, Total Reward: 59.0, Loss: -0.4601
Episode 190, Total Reward: 35.0, Loss: 0.8967
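
The printed values are single-episode rewards, so they fluctuate strongly (for example 72.0 at episode 150 and 26.0 at episode 160). A moving average over recent episodes can give a steadier view of progress. The sketch below is one possible way to compute it; the helper `moving_average` and the window size are hypothetical and not part of the original notebook.

```python
from collections import deque

def moving_average(values, window):
    """Return the mean of the last `window` values after each new value."""
    recent = deque(maxlen=window)   # automatically drops the oldest entry
    averages = []
    for v in values:
        recent.append(v)
        averages.append(sum(recent) / len(recent))
    return averages

# Rewards logged above at episodes 150, 160, 170, 180, 190
logged = [72.0, 26.0, 57.0, 59.0, 35.0]
print(moving_average(logged, window=3))
```

In the training loop, one could append `sum(rewards)` to a list after every episode and pass that list to this helper, rather than using only the values printed every 10 episodes.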
