### Code Description for First-Time Users

This script implements a **Dueling Deep Q-Network (Dueling DQN)** to train an agent to play the Atari game **Breakout** using reinforcement learning. Below is an overview of the key components and their functions:

#### 1. **Core Model Architecture (Dueling DQN)**
   - The `DuelingDQN` class defines the neural network architecture with:
     - **Convolutional Layers** to process image input.
     - **Value and Advantage Streams**: These streams separate the estimation of the overall value of a state and the advantages of individual actions, combining them to output action Q-values.
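
The way the two streams are combined can be illustrated with a small sketch in plain Python (no PyTorch needed). It assumes the same combination that `DuelingDQN.forward` uses in this script: the mean advantage is subtracted before the state value is added. The function name `combine_streams` and the numbers are illustrative only.

```python
def combine_streams(value, advantages):
    """Return Q-values for one state as V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]

# One state with value 2.0 and three hypothetical action advantages
q_values = combine_streams(2.0, [1.0, 0.0, -1.0])
print(q_values)
```

Subtracting the mean forces the average Q-value to equal the state value, so the two streams cannot trade a constant offset back and forth.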

#### 2. **Replay Buffer with Prioritization**
   - The `PrioritizedReplayBuffer` stores past experiences (state, action, reward, next state, and done flag).
   - It samples experiences in proportion to their priority (derived from the absolute TD error), which can speed up learning by replaying surprising transitions more often.
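
The sampling rule can be sketched without NumPy or PyTorch. The two helpers below are hypothetical names, not part of the script. They mirror the arithmetic in `PrioritizedReplayBuffer.sample` under the script's default `alpha=0.6` and `beta=0.4`.

```python
def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probabilities, indices, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalised by the largest weight in the batch."""
    n = len(probabilities)
    raw = [(n * probabilities[i]) ** (-beta) for i in indices]
    largest = max(raw)
    return [w / largest for w in raw]

priorities = [1.0, 1.0, 4.0, 0.5]          # hypothetical |TD error| values
probs = sampling_probabilities(priorities)
weights = importance_weights(probs, indices=[0, 2, 3])
print(probs, weights)
```

Transitions that are sampled more often receive smaller importance weights, which is meant to offset the bias that non-uniform sampling introduces.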

#### 3. **BreakoutAgent**
   - Handles the agent's:
     - **Model**: Policy and target Dueling DQN networks.
     - **Experience Replay**: Storing and sampling experiences for training.
     - **Action Selection**: Epsilon-greedy strategy balances exploration (random actions) and exploitation (choosing the best-known action).
     - **Training**: Uses the sampled experiences to compute and minimize the temporal-difference loss.
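
As a rough illustration of the temporal-difference quantity that `train_step` minimizes, here is a plain-Python sketch for a single transition. The helper names `td_target` and `td_error` are assumptions made for this example. The script computes the same expression on batched tensors.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'); no bootstrap term once the episode has ended."""
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap

def td_error(current_q, target):
    """Absolute TD error, which the script also reuses as the replay priority."""
    return abs(current_q - target)

running = td_target(1.0, [0.5, 2.0], done=False)   # 1.0 + 0.99 * 2.0
terminal = td_target(1.0, [0.5, 2.0], done=True)   # reward only
print(running, terminal, td_error(2.5, running))
```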

#### 4. **Game Interaction**
   - Uses the `ALEInterface` (Arcade Learning Environment) to interact with the Breakout game:
     - Captures screen images, processes them into grayscale and resized tensors.
     - Executes actions using the minimal action set provided by the ALE.
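
The grayscale and scaling steps can be sketched on a tiny hand-made frame. This example assumes the standard BT.601 luminance weights, which are the ones OpenCV documents for RGB-to-gray conversion. It leaves out the resize to 84×84 and uses hypothetical helper names.

```python
def rgb_to_gray(pixel):
    """Luminance of one (R, G, B) pixel using the BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def normalise(gray_value):
    """Scale an 8-bit intensity into [0, 1], as preprocess_state does with / 255.0."""
    return gray_value / 255.0

# A 2x2 RGB "screen": white, black, pure red, pure blue
frame = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 0, 255)]]
processed = [[normalise(rgb_to_gray(p)) for p in row] for row in frame]
print(processed)
```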

#### 5. **Training Process**
   - The `train_agent` function:
     - Initializes the environment and agent.
     - Runs for a specified number of episodes, during which:
       - The agent interacts with the environment, storing experiences in the replay buffer.
       - Periodically trains the policy network using sampled experiences.
       - Adjusts exploration (epsilon) based on recent performance trends.
       - Tracks rewards and updates the target network at intervals.
     - Optionally renders the game at regular intervals for visualization.
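
The epsilon adjustment in the loop above can be sketched in isolation. The functions below are a simplified, stateless restatement of `BreakoutAgent.adjust_exploration` using the script's default constants. The names `performance_trend` and `adjust_epsilon` are assumptions made for this example.

```python
from collections import deque

def performance_trend(window):
    """Mean of consecutive reward differences over the window."""
    rewards = list(window)
    diffs = [later - earlier for earlier, later in zip(rewards, rewards[1:])]
    return sum(diffs) / len(diffs)

def adjust_epsilon(epsilon, trend, threshold=-10, decay=0.995,
                   eps_min=0.05, eps_max=1.0):
    if trend < threshold:
        return min(eps_max, epsilon + 0.1)   # sharp decline: explore more
    if trend > 0:
        return max(eps_min, epsilon * decay)  # improving: explore a little less
    return epsilon                            # flat or mild decline: unchanged

window = deque([10, 12, 9, 15, 18], maxlen=5)
trend = performance_trend(window)
new_eps = adjust_epsilon(0.5, trend)
print(trend, new_eps)
```

Because the consecutive differences telescope, the trend depends only on the first and last reward in the window: here (18 - 10) / 4.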

#### 6. **Evaluation**
   - The `evaluate_agent` function:
     - Runs the trained agent in the game environment for a few episodes.
     - Measures and displays the performance of the trained agent.

#### 7. **Visualization**
   - Plots the training rewards over episodes to monitor learning progress.

#### 8. **Execution Flow**
   - The script trains the agent for 1,000 episodes, plots training progress, and evaluates the agent's performance over 10 evaluation episodes.

### Key Notes for First-Time Use
- Ensure **Python dependencies** (`numpy`, `torch`, `ale-py`, `opencv-python`, `Pillow`, `matplotlib`, `tqdm`, etc.) are installed.
- Verify that the Breakout ROM is correctly loaded by `ale-py`.
- Training and evaluation open a game window (the script sets `display_screen` and calls `cv2.imshow`), so a graphical environment is required.
- Training can take considerable time depending on hardware (using a GPU is recommended).

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque, namedtuple
import random
from ale_py import ALEInterface, roms
import cv2
from PIL import Image
import matplotlib.pyplot as plt
from tqdm import tqdm

# Dueling DQN architecture
class DuelingDQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DuelingDQN, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)

        self.fc_value = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_advantage = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        value = self.fc_value(conv_out)
        advantage = self.fc_advantage(conv_out)
        return value + (advantage - advantage.mean(dim=1, keepdim=True))

Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.alpha = alpha
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        max_priority = max(self.priorities, default=1.0)
        self.buffer.append(Experience(state, action, reward, next_state, done))
        self.priorities.append(float(max_priority))

    def sample(self, batch_size, beta=0.4):
        priorities = np.array(list(self.priorities), dtype=np.float32)
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()

        indices = random.choices(range(len(self.buffer)), k=batch_size, weights=probabilities)
        experiences = [self.buffer[idx] for idx in indices]

        weights = (len(self.buffer) * probabilities[indices]) ** (-beta)
        weights /= weights.max()

        states = torch.stack([exp.state for exp in experiences])
        actions = torch.tensor([exp.action for exp in experiences], dtype=torch.long)
        rewards = torch.tensor([exp.reward for exp in experiences], dtype=torch.float)
        next_states = torch.stack([exp.next_state for exp in experiences])
        dones = torch.tensor([exp.done for exp in experiences], dtype=torch.float)
        weights = torch.tensor(weights, dtype=torch.float)

        return states, actions, rewards, next_states, dones, indices, weights

    def update_priorities(self, indices, priorities):
        for idx, priority in zip(indices, priorities):
            self.priorities[idx] = float(priority.item())

    def __len__(self):
        return len(self.buffer)

class BreakoutAgent:
    def __init__(self, state_shape, n_actions, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        self.state_shape = state_shape
        self.n_actions = n_actions

        self.policy_net = DuelingDQN(state_shape, n_actions).to(device)
        self.target_net = DuelingDQN(state_shape, n_actions).to(device)
        self.target_net.load_state_dict(self.policy_net.state_dict())

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=0.0001)
        self.memory = PrioritizedReplayBuffer(100000)
        self.batch_size = 32
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.05
        self.epsilon_max = 1.0
        self.epsilon_decay = 0.995
        self.target_update = 1000
        self.steps = 0

        self.frame_skip = 4

        # Performance tracking
        self.performance_window = deque(maxlen=5)
        self.exploration_increase_threshold = -10

    def adjust_exploration(self, recent_reward):
        """Dynamically adjust exploration based on performance trends"""
        self.performance_window.append(recent_reward)

        if len(self.performance_window) == self.performance_window.maxlen:
            # Calculate the trend in recent performance
            performance_trend = sum(y - x for x, y in zip(
                self.performance_window,
                list(self.performance_window)[1:]
            )) / (len(self.performance_window) - 1)

            if performance_trend < self.exploration_increase_threshold:
                # Bad performance trend - increase exploration
                self.epsilon = min(
                    self.epsilon_max,
                    self.epsilon + 0.1
                )
                return "increased"
            elif performance_trend > 0:
                # Good performance trend - decrease exploration gradually
                self.epsilon = max(
                    self.epsilon_min,
                    self.epsilon * self.epsilon_decay
                )
                return "decreased"
            else:
                # Neutral trend - maintain current exploration
                return "maintained"
        return "initializing"

    def preprocess_state(self, state):
        gray = cv2.cvtColor(state, cv2.COLOR_RGB2GRAY)
        resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
        processed = torch.FloatTensor(resized).unsqueeze(0) / 255.0
        return processed

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)

        with torch.no_grad():
            state = state.unsqueeze(0).to(self.device)
            q_values = self.policy_net(state)
            return q_values.max(1)[1].item()

    def train_step(self):
        if len(self.memory) < self.batch_size:
            return

        states, actions, rewards, next_states, dones, indices, weights = self.memory.sample(self.batch_size)

        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)
        weights = weights.to(self.device)

        current_q_values = self.policy_net(states).gather(1, actions.unsqueeze(1))
        next_q_values = self.target_net(next_states).max(1)[0].detach()
        expected_q_values = rewards + (self.gamma * next_q_values * (1.0 - dones))

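        # Weight each squared TD error by its importance-sampling weight from the prioritized buffer
        td_diff = current_q_values - expected_q_values.unsqueeze(1)
        loss = (weights.unsqueeze(1) * td_diff.pow(2)).mean()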

        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1.0)
        self.optimizer.step()

        td_errors = torch.abs(current_q_values - expected_q_values.unsqueeze(1)).detach().cpu().numpy()
        self.memory.update_priorities(indices, td_errors)

        if self.steps % self.target_update == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())

        self.steps += 1
        return loss.item()

def train_agent(episodes=1000, render_frequency=100):
    ale = ALEInterface()
    ale.setInt('random_seed', 123)
    ale.setBool('sound', False)
    ale.setBool('display_screen', True)
    ale.setFloat('repeat_action_probability', 0.0)
    ale.loadROM(roms.get_rom_path("breakout"))

    actions = ale.getMinimalActionSet()
    state_shape = (1, 84, 84)
    agent = BreakoutAgent(state_shape, len(actions))

    episode_rewards = []

    for episode in tqdm(range(episodes)):
        ale.reset_game()
        total_reward = 0
        done = False

        state = agent.preprocess_state(ale.getScreenRGB())

        while not done:
            action_idx = agent.select_action(state)
            reward = 0

            for _ in range(agent.frame_skip):
                reward += ale.act(actions[action_idx])
                if ale.game_over():
                    done = True
                    break

            if not done:
                next_state = agent.preprocess_state(ale.getScreenRGB())
            else:
                next_state = state.clone()

            agent.memory.push(state, action_idx, float(reward), next_state, float(done))
            loss = agent.train_step()

            state = next_state
            total_reward += reward

            if episode % render_frequency == 0:
                screen = ale.getScreenRGB()
                cv2.imshow('Breakout', cv2.cvtColor(screen, cv2.COLOR_RGB2BGR))
                cv2.waitKey(1)

        episode_rewards.append(total_reward)
        exploration_status = agent.adjust_exploration(total_reward)

        if episode % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:])
            print(f"Episode {episode}, Average Reward: {avg_reward:.2f}, "
                  f"Epsilon: {agent.epsilon:.2f}, Exploration: {exploration_status}")

    cv2.destroyAllWindows()
    return agent, episode_rewards

def evaluate_agent(agent, n_episodes=10):
    ale = ALEInterface()
    ale.setBool('sound', False)
    ale.setBool('display_screen', True)
    ale.loadROM(roms.get_rom_path("breakout"))
    actions = ale.getMinimalActionSet()

    total_rewards = []

    for episode in range(n_episodes):
        ale.reset_game()
        total_reward = 0
        done = False

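        while not done:
            state = agent.preprocess_state(ale.getScreenRGB())
            # select_action is epsilon-greedy, so evaluation still takes random
            # actions with probability agent.epsilon left over from training
            action_idx = agent.select_action(state)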

            reward = 0
            for _ in range(agent.frame_skip):
                reward += ale.act(actions[action_idx])
                if ale.game_over():
                    done = True
                    break

            total_reward += reward

            screen = ale.getScreenRGB()
            cv2.imshow('Breakout', cv2.cvtColor(screen, cv2.COLOR_RGB2BGR))
            cv2.waitKey(1)

        total_rewards.append(total_reward)
        print(f"Episode {episode + 1} Reward: {total_reward}")

    cv2.destroyAllWindows()
    return total_rewards

if __name__ == "__main__":
    trained_agent, training_rewards = train_agent(episodes=1000)

    plt.figure(figsize=(10, 5))
    plt.plot(training_rewards)
    plt.title("Training Progress")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.show()

    evaluation_rewards = evaluate_agent(trained_agent)
    print(f"Average Evaluation Reward: {np.mean(evaluation_rewards):.2f}")

### Code Description for First-Time Users: Contrasting Simple vs Complex RL Implementations

This script implements a **more complex reinforcement learning (RL) system** using a **swarm-based DQN approach**, applied to the Atari game **Q*bert**. Compared with the simpler script above, it adds several agents that share one prioritized replay buffer, noise-injecting network layers, and personal/global best tracking.

---

### **1. Simple RL Implementation Overview**
- **Goal**: Train a single agent using a basic Dueling DQN to play a game.
- **Key Features**:
  - **Dueling DQN**: Splits Q-value computation into value and advantage streams for better performance.
  - **Prioritized Replay Buffer**: Focuses training on important experiences.
  - **Action Selection**: Uses an epsilon-greedy strategy for exploration and exploitation.
- **Limitations**:
  - Single-agent setup lacks collaborative behavior.
  - Exploration strategies are limited to basic epsilon-greedy methods.
  - Model focuses purely on individual performance with no inter-agent learning.

---

### **2. Complex RL Implementation (This Script)**
- **Goal**: Train a **swarm of agents** collaboratively to solve the same task, leveraging advanced exploration and probabilistic enhancements.
- **Key Features**:

#### **Advanced Neural Network Enhancements**
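- **Gaussian Probability Layer (GPL)**: Adds zero-mean Gaussian noise (standard deviation 0.1 or 0.05) to activations during training only, which the author describes as probabilistic "twistronics." In evaluation mode the layer passes its input through unchanged.
- **NoisyLinear Layer**: Replaces standard linear layers with layers whose weights and biases carry learnable noise scales (`weight_sigma`, `bias_sigma`), so the amount of exploration noise is learned. The script samples the noise once at construction; `reset_noise()` is not called again during training.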
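
The factorised noise that `NoisyLinear` builds can be sketched with the standard library only. The helper names below are illustrative, and the sketch uses nested lists in place of tensors. It follows the same steps as `_scale_noise` and `reset_noise`: transform Gaussian samples with sign(x)·sqrt(|x|), then take an outer product.

```python
import math
import random

def scale_noise(size, rng):
    """Apply f(x) = sign(x) * sqrt(|x|) to standard normal samples."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return [math.copysign(math.sqrt(abs(x)), x) for x in samples]

def factorised_weight_noise(in_features, out_features, rng):
    """Outer product of output noise and input noise: one row per output unit."""
    eps_in = scale_noise(in_features, rng)
    eps_out = scale_noise(out_features, rng)
    return [[o * i for i in eps_in] for o in eps_out]

def noisy_weight(mu, sigma, eps):
    """Effective weight used in training mode: mu + sigma * eps."""
    return mu + sigma * eps

rng = random.Random(0)
noise = factorised_weight_noise(in_features=3, out_features=2, rng=rng)
print(noise)
```

Only `in_features + out_features` random numbers are drawn per layer instead of one per weight, which is the point of the factorised form.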

#### **Swarm Collaboration**
- **SwarmDQN**: Holds several agents (swarm members, 5 by default), each with its own policy and target network.
- Each agent tracks its **personal best reward and weights**, saving a copy of its policy weights whenever it beats its previous best episode.
- A **global best reward and weights** are recorded from the highest-scoring member. The script stores these weights but does not copy them back into the other members.
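
The best-tracking bookkeeping can be sketched with plain numbers in place of network weights. The function `update_bests` is a hypothetical helper. It follows the order used in the training loop, where the global best is only checked after a member improves its own personal best.

```python
def update_bests(personal_best, global_best, member_idx, episode_reward):
    """Update per-member and swarm-wide best rewards; return which ones improved."""
    improved_personal = episode_reward > personal_best[member_idx]
    if improved_personal:
        personal_best[member_idx] = episode_reward
    improved_global = improved_personal and episode_reward > global_best["reward"]
    if improved_global:
        global_best["reward"] = episode_reward
        global_best["member"] = member_idx
    return improved_personal, improved_global

personal = [float("-inf")] * 3
best = {"reward": float("-inf"), "member": None}
history = [
    update_bests(personal, best, 0, 50.0),
    update_bests(personal, best, 1, 30.0),
    update_bests(personal, best, 0, 40.0),
    update_bests(personal, best, 2, 75.0),
]
print(personal, best, history)
```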

#### **Environment Interaction**
- Uses four **stacked frames** (each an 84×84 grayscale image) so the network can see motion between frames.
- Maintains a **Prioritized Replay Buffer**, shared among all swarm members, to store and sample experiences for training.
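
Frame stacking relies on a fixed-length `deque`, as in `train_swarm_agent`. The sketch below uses strings as stand-ins for frames and a hypothetical helper `make_stack`. In the script each entry is a `(1, 84, 84)` tensor and the stack is concatenated into a `(4, 84, 84)` state.

```python
from collections import deque

def make_stack(first_frame, size=4):
    """Start an episode by repeating the first frame until the stack is full."""
    return deque([first_frame] * size, maxlen=size)

stack = make_stack("f0")
stack.append("f1")   # the oldest frame is dropped automatically
stack.append("f2")
print(list(stack))
```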

#### **Exploration and Reward Adjustment**
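- **Action Suggestion**: The acting member first produces a "suggested" action from its own epsilon-greedy policy. When exploring, the suggestion is followed with probability 0.7 and a uniformly random action is taken otherwise. When exploiting, the suggested action receives a 0.1 bonus on its Q-value before the argmax.
- **Reward Calculation**: `calculate_reward` adds 0.1 to the raw reward when the taken action matches the suggested action. The function is defined but the training loop does not call it, so the replay buffer stores raw game rewards.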
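
A standalone sketch of the selection rule, assuming the constants from `SwarmDQN.select_action` (follow the suggestion with probability 0.7 when exploring, add 0.1 to its Q-value when exploiting). The function below works on a plain list of Q-values and takes the random generator as a parameter. Both choices were made for this example only.

```python
import random

def select_action(q_values, suggested_action, epsilon, rng,
                  follow_prob=0.7, bonus=0.1):
    if rng.random() < epsilon:
        # Exploration branch: usually follow the suggestion, otherwise act randomly
        if rng.random() < follow_prob:
            return suggested_action
        return rng.randrange(len(q_values))
    # Exploitation branch: small bonus for the suggested action, then argmax
    boosted = list(q_values)
    boosted[suggested_action] += bonus
    return max(range(len(boosted)), key=boosted.__getitem__)

rng = random.Random(0)
q = [1.0, 0.95, 0.2]
greedy_near_tie = select_action(q, suggested_action=1, epsilon=0.0, rng=rng)
greedy_far = select_action(q, suggested_action=2, epsilon=0.0, rng=rng)
always_follow = select_action(q, 2, epsilon=1.0, rng=rng, follow_prob=1.0)
print(greedy_near_tie, greedy_far, always_follow)
```

The bonus only changes the outcome when the suggested action is within 0.1 of the best Q-value, so it acts as a tie-breaker and not as an override.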

#### **Dynamic Member Updates**
- Each swarm member independently trains on minibatches, using:
  - **Temporal Difference Learning**: Updates Q-values based on predictions and actual rewards.
  - **Shared Replay Buffer**: Ensures consistent experiences across the swarm.
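
The target that `update_member` computes follows the Double DQN pattern: the policy network chooses the next action and the target network scores it. The sketch below shows this for one transition, with a hypothetical helper name and made-up Q-values.

```python
def double_dqn_target(reward, done, policy_next_q, target_next_q, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a Q_policy(s', a)), or just r at episode end."""
    if done:
        return reward
    best_action = max(range(len(policy_next_q)), key=policy_next_q.__getitem__)
    return reward + gamma * target_next_q[best_action]

policy_next_q = [0.2, 0.9, 0.1]   # policy network prefers action 1
target_next_q = [1.5, 1.0, 3.0]   # target network would prefer action 2
target = double_dqn_target(0.0, False, policy_next_q, target_next_q)
print(target)
```

A plain DQN target would take the maximum of `target_next_q` (3.0 here). Decoupling selection from evaluation is intended to reduce that kind of overestimation.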

#### **Visualization and Metrics**
- Tracks:
  - **Overall training progress** for the swarm.
  - **Performance of individual swarm members**.

---

### **Execution Flow**
1. **Swarm Initialization**:
   - A specified number of agents (`swarm_size`) is initialized with identical architectures.
   - Agents share access to a centralized replay buffer but operate semi-independently.

2. **Training Loop**:
   - For each episode, a random swarm member interacts with the environment:
     - Selects actions using its own policy together with its own suggested action.
     - Updates the replay buffer with experiences.
     - Trains on minibatches from the shared replay buffer.
   - Updates personal and global best policies based on performance.

3. **Evaluation and Visualization**:
   - After training, plots are generated to compare:
     - **Overall swarm progress**.
     - **Individual performance** of each swarm member.

---

### **Key Points of Contrast**
| Feature                            | **Simple RL Implementation**          | **Complex RL Implementation**        |
|------------------------------------|---------------------------------------|---------------------------------------|
| **Model**                          | Single Dueling DQN                    | Swarm of Dueling DQNs                |
| **Exploration**                    | Epsilon-greedy                        | Epsilon-greedy plus NoisyLinear layers and GPL noise |
| **Collaboration**                  | None                                  | Personal and global best weights tracked across the swarm |
| **Replay Buffer**                  | Individual, prioritized               | Centralized, prioritized              |
| **Learning Strategy**              | Independent learning                  | Collaborative learning                |
| **Target Environment**             | Single agent in Breakout              | Swarm members taking turns in Q*bert, one per episode |

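This script shows how noise-based exploration, a shared prioritized replay buffer, and per-member and global best tracking can be combined in a single training loop. Whether this improves learning efficiency over the simpler script has to be checked empirically, for example by comparing the reward curves that both scripts plot.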

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque, namedtuple
import random
from ale_py import ALEInterface, roms
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy

# Define Gaussian Probability Layer (GPL) for probabilistic "twistronics" effect
class GaussianProbabilityLayer(nn.Module):
    def __init__(self, std_dev=0.1):
        super(GaussianProbabilityLayer, self).__init__()
        self.std_dev = std_dev

    def forward(self, x):
        if self.training:
            noise = torch.randn_like(x) * self.std_dev
            return x + noise
        return x

# Define NoisyLinear for exploration-exploitation tradeoff
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, std_init=0.5):
        super(NoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight_mu = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.register_buffer('weight_epsilon', torch.FloatTensor(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.FloatTensor(out_features))
        self.bias_sigma = nn.Parameter(torch.FloatTensor(out_features))
        self.register_buffer('bias_epsilon', torch.FloatTensor(out_features))
        self.std_init = std_init
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        mu_range = 1 / np.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.std_init / np.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.std_init / np.sqrt(self.out_features))

    def reset_noise(self):
        epsilon_in = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        self.weight_epsilon.copy_(epsilon_out.ger(epsilon_in))
        self.bias_epsilon.copy_(self._scale_noise(self.out_features))

    def forward(self, input):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_epsilon
            bias = self.bias_mu + self.bias_sigma * self.bias_epsilon
        else:
            weight = self.weight_mu
            bias = self.bias_mu
        return nn.functional.linear(input, weight, bias)

    @staticmethod
    def _scale_noise(size):
        x = torch.randn(size)
        return x.sign().mul_(x.abs().sqrt_())

class DuelingDQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DuelingDQN, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        # Calculate the correct conv output size
        conv_out_size = self._get_conv_out(input_shape)

        # Apply GaussianProbabilityLayer between convolutional and fully connected layers
        self.gpl1 = GaussianProbabilityLayer(std_dev=0.1)

        # Value stream
        self.fc_value = nn.Sequential(
            NoisyLinear(conv_out_size, 512),
            nn.ReLU(),
            self.gpl1,
            NoisyLinear(512, 1)
        )

        # Advantage stream
        self.fc_advantage = nn.Sequential(
            NoisyLinear(conv_out_size, 512),
            nn.ReLU(),
            GaussianProbabilityLayer(std_dev=0.05),
            NoisyLinear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        # Generate a dummy input to pass through conv layers and calculate output size
        dummy_input = torch.zeros(1, *shape)
        o = self.conv(dummy_input)
        return int(np.prod(o.size()[1:]))  # Only multiply the feature dimensions, not batch

    def forward(self, x):
        # Add batch dimension if input is a single sample
        if len(x.size()) == 3:
            x = x.unsqueeze(0)

        # Pass through convolutional layers
        conv_out = self.conv(x)

        # Flatten while preserving batch dimension
        batch_size = conv_out.size(0)
        conv_out = conv_out.view(batch_size, -1)

        # Apply GPL and compute value and advantage streams
        conv_out = self.gpl1(conv_out)
        value = self.fc_value(conv_out)
        advantage = self.fc_advantage(conv_out)

        # Combine value and advantage using dueling architecture formula
        return value + (advantage - advantage.mean(dim=1, keepdim=True))

# Define the Experience tuple for replay buffer
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

# Define the Prioritized Replay Buffer
class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.alpha = alpha
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        max_priority = max(self.priorities, default=1.0)
        self.buffer.append(Experience(state, action, reward, next_state, done))
        self.priorities.append(float(max_priority))

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) < batch_size:
            return None

        priorities = np.array(list(self.priorities), dtype=np.float32)
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()

        indices = random.choices(range(len(self.buffer)), k=batch_size, weights=probabilities)
        experiences = [self.buffer[idx] for idx in indices]

        weights = (len(self.buffer) * probabilities[indices]) ** (-beta)
        weights /= weights.max()

        states = torch.stack([exp.state for exp in experiences])
        actions = torch.tensor([exp.action for exp in experiences], dtype=torch.long)
        rewards = torch.tensor([exp.reward for exp in experiences], dtype=torch.float)
        next_states = torch.stack([exp.next_state for exp in experiences])
        dones = torch.tensor([exp.done for exp in experiences], dtype=torch.float)
        weights = torch.tensor(weights, dtype=torch.float)

        return states, actions, rewards, next_states, dones, indices, weights

    def update_priorities(self, indices, priorities):
        for idx, priority in zip(indices, priorities):
            self.priorities[idx] = float(priority.item())

    def __len__(self):
        return len(self.buffer)

# Preprocess the state from the environment
def preprocess_state(state):
    gray = cv2.cvtColor(state, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    processed = torch.tensor(resized, dtype=torch.float32).unsqueeze(0) / 255.0
    return processed

# Define the Swarm Member for individual DQN agents
class SwarmMember:
    def __init__(self, state_shape, n_actions, device, id):
        self.id = id
        self.device = device
        self.state_shape = state_shape
        self.n_actions = n_actions

        self.policy_net = DuelingDQN(state_shape, n_actions).to(device)
        self.target_net = DuelingDQN(state_shape, n_actions).to(device)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=0.0001)
        self.epsilon = 1.0
        self.epsilon_min = 0.1
        self.epsilon_decay = 0.9995

        self.personal_best_reward = float('-inf')
        self.personal_best_weights = copy.deepcopy(self.policy_net.state_dict())

    def update_personal_best(self, episode_reward):
        if episode_reward > self.personal_best_reward:
            self.personal_best_reward = episode_reward
            self.personal_best_weights = copy.deepcopy(self.policy_net.state_dict())
            return True
        return False

    def get_suggested_action(self, state):
        # Epsilon-greedy policy for exploration
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        else:
            with torch.no_grad():
                return self.policy_net(state).argmax(dim=1).item()

# Define the Swarm DQN with Qbert environment
class SwarmDQN:
    def __init__(self, state_shape, n_actions, swarm_size=5, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        self.swarm_size = swarm_size
        self.state_shape = state_shape
        self.n_actions = n_actions

        self.swarm = [SwarmMember(state_shape, n_actions, device, i) for i in range(swarm_size)]

        self.memory = PrioritizedReplayBuffer(100000)
        self.batch_size = 32
        self.gamma = 0.99

        self.global_best_reward = float('-inf')
        self.global_best_weights = copy.deepcopy(self.swarm[0].policy_net.state_dict())

    def select_action(self, state, member_idx):
        member = self.swarm[member_idx]

        # Keep the state on the network's device; DuelingDQN.forward accepts the batched 4-D input
        suggested_action = member.get_suggested_action(state)
        if random.random() < member.epsilon:
            if suggested_action is not None and random.random() < 0.7:
                return suggested_action
            return random.randrange(self.n_actions)
        else:
            with torch.no_grad():
                q_values = member.policy_net(state)
                if suggested_action is not None:
                    q_values[0][suggested_action] += 0.1
                return torch.argmax(q_values).item()

    def calculate_reward(self, raw_reward, suggested_action, taken_action):
        reward = raw_reward
        if suggested_action is not None and suggested_action == taken_action:
            reward += 0.1
        return reward

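    def update_member(self, member_idx, batch):
        member = self.swarm[member_idx]
        states, actions, rewards, next_states, dones, indices, weights = batch

        # Stored states have shape (1, 4, 84, 84); drop the extra dimension after stacking
        states = states.view(self.batch_size, -1, 84, 84).to(self.device)
        next_states = next_states.view(self.batch_size, -1, 84, 84).to(self.device)

        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        dones = dones.to(self.device)
        weights = weights.to(self.device)

        # Q(s, a) for the actions actually taken (gradients flow through the policy network)
        current_q_values = member.policy_net(states).gather(1, actions.unsqueeze(1))

        # Double DQN target: the policy network picks the next action, the target network scores it
        with torch.no_grad():
            next_actions = member.policy_net(next_states).max(1)[1]
            next_q_values = member.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target_q_values = rewards + (self.gamma * next_q_values * (1.0 - dones))

        # Importance-weighted squared TD error, then one optimizer step
        td_diff = current_q_values - target_q_values.unsqueeze(1)
        loss = (weights.unsqueeze(1) * td_diff.pow(2)).mean()
        member.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(member.policy_net.parameters(), 1.0)
        member.optimizer.step()

        # Sync the target network every 1000 updates (interval assumed, matching the
        # Breakout script's target_update; the counter is created on first use)
        member.train_steps = getattr(member, "train_steps", 0) + 1
        if member.train_steps % 1000 == 0:
            member.target_net.load_state_dict(member.policy_net.state_dict())

        # Refresh replay priorities with the new absolute TD errors
        td_errors = td_diff.abs().detach().cpu().numpy()
        self.memory.update_priorities(indices, td_errors.squeeze())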

    def update_global_best(self, member_idx, episode_reward):
        if episode_reward > self.global_best_reward:
            self.global_best_reward = episode_reward
            self.global_best_weights = copy.deepcopy(self.swarm[member_idx].policy_net.state_dict())
            return True
        return False

# Train the Swarm Agent for Qbert
def train_swarm_agent(episodes=1000, render_frequency=100, score_threshold=50, hit_threshold=20):
    ale = ALEInterface()
    ale.setInt('random_seed', 123)
    ale.setBool('sound', False)
    ale.setBool('display_screen', True)
    ale.setFloat('repeat_action_probability', 0.0)
    ale.loadROM(roms.get_rom_path("qbert"))

    actions = ale.getMinimalActionSet()
    state_shape = (4, 84, 84)
    swarm = SwarmDQN(state_shape, len(actions))

    episode_rewards = []
    swarm_rewards = [[] for _ in range(swarm.swarm_size)]

    for episode in tqdm(range(1, episodes + 1)):
        member_idx = random.randrange(swarm.swarm_size)
        member = swarm.swarm[member_idx]

        ale.reset_game()
        total_reward = 0
        done = False

        state_stack = deque([preprocess_state(ale.getScreenRGB()) for _ in range(4)], maxlen=4)

        while not done:
            stacked_state = torch.cat(list(state_stack), dim=0).unsqueeze(0).to(swarm.device)

            action_idx = swarm.select_action(stacked_state, member_idx)
            reward = 0

            for _ in range(4):
                reward += ale.act(actions[action_idx])
                if ale.game_over():
                    done = True
                    break

            next_state = preprocess_state(ale.getScreenRGB()) if not done else state_stack[-1].clone()
            state_stack.append(next_state)

            stacked_next_state = torch.cat(list(state_stack), dim=0).unsqueeze(0).to(swarm.device)

            swarm.memory.push(stacked_state, action_idx, float(reward), stacked_next_state, float(done))

            if len(swarm.memory) >= swarm.batch_size:
                batch = swarm.memory.sample(swarm.batch_size)
                if batch is not None:
                    swarm.update_member(member_idx, batch)

            total_reward += reward

            if episode % render_frequency == 0:
                screen = ale.getScreenRGB()
                cv2.imshow('Qbert', cv2.cvtColor(screen, cv2.COLOR_RGB2BGR))
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        episode_rewards.append(total_reward)
        swarm_rewards[member_idx].append(total_reward)

        personal_best_updated = member.update_personal_best(total_reward)
        if personal_best_updated:
            swarm.update_global_best(member_idx, total_reward)

        member.epsilon = max(member.epsilon_min, member.epsilon * member.epsilon_decay)

        if episode % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:])
            print(f"Episode {episode}, Average Reward: {avg_reward:.2f}, Member {member_idx} Epsilon: {member.epsilon:.2f}")

    cv2.destroyAllWindows()
    return swarm, episode_rewards, swarm_rewards

if __name__ == "__main__":
    try:
        swarm, episode_rewards, swarm_rewards = train_swarm_agent()
        plt.figure(figsize=(15, 5))

        plt.subplot(1, 2, 1)
        plt.plot(episode_rewards)
        plt.title("Overall Training Progress")
        plt.xlabel("Episode")
        plt.ylabel("Reward")

        plt.subplot(1, 2, 2)
        for i, rewards in enumerate(swarm_rewards):
            plt.plot(rewards, label=f"Member {i}")
        plt.title("Individual Swarm Member Performance")
        plt.xlabel("Episode")
        plt.ylabel("Reward")
        plt.legend()
        plt.show()

    except Exception as e:
        print("An error occurred:", e)