# Actor-Critic Methods in Reinforcement Learning

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand the key concepts of actor-critic methods
- Apply actor-critic methods using Python code examples
- Practice on a small, realistic scenario (the CartPole-v1 environment)

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 09, Unit 3** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---


## AIAT 123 - Reinforcement Learning

## Learning Objectives

- Understand Actor-Critic architecture
- Implement Actor-Critic algorithm
- Apply to continuous control tasks
- Compare with value-based methods

## Real-World Context

Actor-critic methods are a common choice for robotics control, autonomous systems, and other problems with continuous action spaces.

**Industry Impact**: Used in robotics, autonomous vehicles, and game AI.

In [1]:
%pip install gymnasium torch numpy matplotlib -q
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym  # maintained drop-in replacement for the unmaintained gym package
import numpy as np
import matplotlib.pyplot as plt
print('✅ Setup complete!')

Note: you may need to restart the kernel to use updated packages.


✅ Setup complete!


## Part 1: Actor-Critic Architecture


In [2]:
class ActorCritic(nn.Module):
    """
    Actor-Critic network.
    Actor: Policy network (outputs action probabilities)
    Critic: Value network (estimates state values)
    """
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor (policy)
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        
        # Critic (value)
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, state):
        features = self.shared(state)
        action_probs = self.actor(features)
        value = self.critic(features)
        return action_probs, value

print('✅ Actor-Critic architecture defined')

✅ Actor-Critic architecture defined
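
For intuition about the actor head, the following framework-free sketch shows how a softmax turns raw scores (logits) into action probabilities and how an action is then sampled. This is the role `nn.Softmax` and `torch.multinomial` play in the notebook. The logit values and the helper names `softmax` and `sample_action` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical actor logits for a 2-action task such as CartPole (left, right)
logits = [0.2, 1.1]
probs = softmax(logits)

rng = random.Random(0)  # seeded so the draw is reproducible
action = sample_action(probs, rng)
print(probs, action)
```

Sampling from the probabilities, rather than always taking the most likely action, is what lets the policy keep exploring during training.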


## Part 2: Training Loop
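
Before reading the loop, it may help to see the arithmetic of a single update in isolation. The sketch below assumes a common one-step formulation in which the advantage is the temporal-difference (TD) error: the reward plus the discounted value of the next state, minus the value of the current state. The function name `one_step_losses`, the discount `gamma=0.99`, and all numbers are illustrative assumptions, not values taken from the notebook.

```python
import math


def one_step_losses(reward, value, next_value, action_prob, terminated, gamma=0.99):
    """Return (advantage, actor_loss, critic_loss) for a single transition."""
    # A true terminal state has no future value to bootstrap from
    bootstrap = 0.0 if terminated else gamma * next_value
    td_target = reward + bootstrap
    advantage = td_target - value  # one-step TD error
    actor_loss = -math.log(action_prob) * advantage  # policy-gradient term
    critic_loss = advantage ** 2  # squared TD error
    return advantage, actor_loss, critic_loss


# Hypothetical numbers for one CartPole-style transition
adv, actor_loss, critic_loss = one_step_losses(
    reward=1.0, value=10.0, next_value=10.5, action_prob=0.6, terminated=False
)
print(f"advantage={adv:.3f} actor_loss={actor_loss:.3f} critic_loss={critic_loss:.3f}")
```

A positive advantage means the action turned out better than the critic expected, so minimizing the actor loss raises that action's probability. In an autograd framework the advantage is usually treated as a constant inside the actor loss, so that only the critic loss trains the value estimate.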


In [3]:
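def train_actor_critic(env_name='CartPole-v1', episodes=500, gamma=0.99):
    """
    Train an Actor-Critic agent with one-step TD updates.

    gamma: discount factor for bootstrapped targets (assumed default 0.99)

    Real-world: Training robots or game agents
    """
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    model = ActorCritic(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    rewards_history = []

    for episode in range(episodes):
        state, _ = env.reset()
        episode_reward = 0

        while True:
            # Get action probabilities and value
            state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            action_probs, value = model(state_tensor)

            # Sample action
            action = torch.multinomial(action_probs, 1).item()

            # Take step
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward

            # Estimate the next state's value without tracking gradients;
            # a true terminal state has no future value
            with torch.no_grad():
                next_tensor = torch.as_tensor(next_state, dtype=torch.float32).unsqueeze(0)
                next_value = 0.0 if terminated else model(next_tensor)[1].item()

            # Calculate advantage as the one-step TD error (kept as a tensor
            # so the critic receives gradients)
            td_target = float(reward) + gamma * next_value
            advantage = td_target - value.squeeze()

            # Actor (policy) loss: the advantage is detached so it acts as a fixed weight
            actor_loss = -torch.log(action_probs[0, action]) * advantage.detach()

            # Critic (value) loss: squared TD error
            critic_loss = advantage.pow(2)

            # Total loss, updated on every step rather than only at episode end
            loss = actor_loss + critic_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if terminated or truncated:
                break

            state = next_state

        rewards_history.append(episode_reward)

        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(rewards_history[-50:])
            print(f'Episode {episode+1}, Avg Reward: {avg_reward:.2f}')

    env.close()
    return rewards_history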

print('✅ Training function ready!')
print('\nNote: Full training takes time. This demonstrates the concept.')

✅ Training function ready!

Note: Full training takes time. This demonstrates the concept.
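
Per-episode rewards are noisy, so the list returned by `train_actor_critic` is usually easier to read after smoothing. The sketch below computes a trailing moving average in plain Python. The helper name `moving_average` and the sample values in `example_rewards` are hypothetical stand-ins for a real `rewards_history`.

```python
def moving_average(values, window=50):
    """Trailing mean over at most the last `window` entries at each position."""
    averages = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the entry that left the window
        averages.append(running_sum / min(i + 1, window))
    return averages


# Hypothetical episode rewards standing in for rewards_history
example_rewards = [10, 20, 30, 40, 50]
smoothed = moving_average(example_rewards, window=2)
print(smoothed)
```

The smoothed list has the same length as the input, so it can be passed directly to `plt.plot` alongside the raw rewards.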


## Real-World Applications

- **Robotics**: Continuous control (robot arm, walking)
- **Autonomous Vehicles**: Steering, acceleration control
- **Game AI**: Real-time strategy games
- **Finance**: Portfolio optimization
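
The continuous-control applications above need a policy over real-valued actions, which the softmax actor in this notebook does not provide. One common option, sketched here as an assumption rather than as part of the notebook's implementation, is a Gaussian policy: the actor outputs a mean and standard deviation, an action is sampled from that distribution, and its log-probability replaces `log(action_probs[action])` in the actor loss. The function name `gaussian_log_prob` and the numbers are hypothetical.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log density of a one-dimensional Gaussian policy evaluated at `action`."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (action - mean) ** 2 / (2 * var)


rng = random.Random(0)  # seeded for reproducibility
mean, std = 0.3, 0.5  # hypothetical actor outputs, e.g. for a steering command
action = rng.gauss(mean, std)  # sample a continuous action
log_prob = gaussian_log_prob(action, mean, std)
print(f"action={action:.3f} log_prob={log_prob:.3f}")
```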

---

**End of Notebook**