# Multi-Agent Reinforcement Learning: When AI Agents Interact

What happens when multiple learning agents share the same world?

## What You'll Learn

By the end of this notebook, you'll understand:
- The dance troupe analogy: agents learning to coordinate
- Game theory foundations: Nash equilibrium and strategic thinking
- Cooperative, competitive, and mixed-motive scenarios
- The non-stationarity challenge: learning in a changing world
- Key algorithms: Independent Q-Learning, QMIX, MADDPG
- Centralized Training with Decentralized Execution (CTDE)
- Emergent behaviors and communication

**Prerequisites:** All previous RL notebooks

**Time:** ~40 minutes

---
## The Big Picture: The Dance Troupe Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE DANCE TROUPE ANALOGY                              │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Imagine a dance troupe learning a new routine...             │
    │                                                                │
    │  EACH DANCER (Agent):                                         │
    │    Has their own moves to learn                              │
    │    Watches what others are doing                             │
    │    Adjusts based on the group                                │
    │                                                                │
    │  THE CHALLENGE:                                               │
    │    Everyone is learning at the same time!                    │
    │    Your partner's moves keep changing                        │
    │    What worked yesterday might not work today                │
    │                                                                │
    │  COOPERATIVE DANCE:                                          │
    │    All dancers want the show to succeed                      │
    │    Shared reward: Audience applause                          │
    │                                                                │
    │  COMPETITIVE DANCE:                                          │
    │    Dance battle! One winner                                  │
    │    Your success = opponent's failure                         │
    │                                                                │
    │  MIXED:                                                       │
    │    Team competition                                          │
    │    Cooperate with team, compete with rivals                  │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
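
The three settings differ only in how the agents' rewards relate to each other. Here is a minimal sketch of that idea, assuming a two-player game described by one payoff matrix per player. The `classify_game` helper and the first two example matrices are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np


def classify_game(payoff_1, payoff_2):
    """Label a two-player matrix game by how the players' payoffs relate.

    Hypothetical helper, for illustration only.
    """
    payoff_1 = np.asarray(payoff_1, dtype=float)
    payoff_2 = np.asarray(payoff_2, dtype=float)
    if np.allclose(payoff_1, payoff_2):
        return "cooperative"   # identical rewards: a shared goal
    if np.allclose(payoff_1 + payoff_2, 0.0):
        return "competitive"   # zero-sum: one player's gain is the other's loss
    return "mixed"             # some shared interest, some conflict


# Illustrative payoffs (rows: player 1's action, columns: player 2's action)
shared_show = [[1, 0], [0, 1]]        # both dancers receive the same applause
dance_battle = [[1, -1], [-1, 1]]     # player 1's payoffs; player 2 gets the negative
prisoners_dilemma_p1 = [[3, 0], [5, 1]]
prisoners_dilemma_p2 = [[3, 5], [0, 1]]

print(classify_game(shared_show, shared_show))
print(classify_game(dance_battle, -np.array(dance_battle)))
print(classify_game(prisoners_dilemma_p1, prisoners_dilemma_p2))
```

Real environments are rarely this clean, but the same question (do the rewards align, oppose, or partly overlap?) decides which setting you are in.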

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Circle, Rectangle, FancyArrowPatch
from matplotlib.colors import LinearSegmentedColormap
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

In [None]:
# Visualize multi-agent settings

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Cooperative
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('COOPERATIVE', fontsize=14, fontweight='bold', color='#388e3c')

# Agents working together
positions = [(3, 5), (5, 7), (7, 5), (5, 3)]
for i, (x, y) in enumerate(positions):
    circle = Circle((x, y), 0.8, facecolor='#4caf50', edgecolor='black', linewidth=2)
    ax1.add_patch(circle)
    ax1.text(x, y, f'A{i+1}', ha='center', va='center', fontsize=10, fontweight='bold', color='white')

# Connections (cooperation)
for i, (x1, y1) in enumerate(positions):
    for j, (x2, y2) in enumerate(positions):
        if i < j:
            ax1.plot([x1, x2], [y1, y2], 'g-', linewidth=1, alpha=0.5)

# Shared goal
goal = FancyBboxPatch((3.5, 0.5), 3, 1.2, boxstyle="round,pad=0.1",
                       facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=2)
ax1.add_patch(goal)
ax1.text(5, 1.1, 'Shared Goal', ha='center', fontsize=10, fontweight='bold', color='#388e3c')

# Competitive
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('COMPETITIVE', fontsize=14, fontweight='bold', color='#d32f2f')

# Two opposing agents
circle1 = Circle((3, 5), 1, facecolor='#f44336', edgecolor='black', linewidth=2)
circle2 = Circle((7, 5), 1, facecolor='#2196f3', edgecolor='black', linewidth=2)
ax2.add_patch(circle1)
ax2.add_patch(circle2)
ax2.text(3, 5, 'A1', ha='center', va='center', fontsize=12, fontweight='bold', color='white')
ax2.text(7, 5, 'A2', ha='center', va='center', fontsize=12, fontweight='bold', color='white')

# Conflict
ax2.annotate('', xy=(4.2, 5), xytext=(5.8, 5),
             arrowprops=dict(arrowstyle='<->', lw=3, color='#d32f2f'))
ax2.text(5, 3, 'Zero-Sum', ha='center', fontsize=10, style='italic', color='#666')
ax2.text(5, 2.3, 'Your win = My loss', ha='center', fontsize=9, color='#666')

# Mixed
ax3 = axes[2]
ax3.set_xlim(0, 10)
ax3.set_ylim(0, 10)
ax3.axis('off')
ax3.set_title('MIXED', fontsize=14, fontweight='bold', color='#7b1fa2')

# Team 1
for i, (x, y) in enumerate([(2, 6), (2, 4)]):
    circle = Circle((x, y), 0.6, facecolor='#f44336', edgecolor='black', linewidth=2)
    ax3.add_patch(circle)
ax3.plot([2, 2], [6, 4], 'g-', linewidth=2)  # Cooperation
ax3.text(2, 2.5, 'Team A', ha='center', fontsize=9, color='#f44336', fontweight='bold')

# Team 2
for i, (x, y) in enumerate([(8, 6), (8, 4)]):
    circle = Circle((x, y), 0.6, facecolor='#2196f3', edgecolor='black', linewidth=2)
    ax3.add_patch(circle)
ax3.plot([8, 8], [6, 4], 'g-', linewidth=2)  # Cooperation
ax3.text(8, 2.5, 'Team B', ha='center', fontsize=9, color='#2196f3', fontweight='bold')

# Competition between teams
ax3.annotate('', xy=(3, 5), xytext=(7, 5),
             arrowprops=dict(arrowstyle='<->', lw=2, color='#d32f2f'))
ax3.text(5, 5.5, 'Compete', ha='center', fontsize=9, color='#d32f2f')

plt.tight_layout()
plt.show()

print("\nMULTI-AGENT SETTINGS:")
print("  Cooperative: Agents share rewards, work toward common goal")
print("  Competitive: One agent's gain is another's loss")
print("  Mixed: Both cooperation and competition (teams, alliances)")

---
## Game Theory Foundations: Strategic Thinking

```
    ┌────────────────────────────────────────────────────────────────┐
    │              GAME THEORY BASICS                                │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  NORMAL FORM GAME:                                            │
    │    Players choose actions simultaneously                      │
    │    Payoffs depend on ALL players' actions                    │
    │                                                                │
    │  PRISONER'S DILEMMA:                                          │
    │                              Player 2                          │
    │                         Cooperate   Defect                     │
    │    Player 1  Cooperate    (3,3)     (0,5)                      │
    │              Defect       (5,0)     (1,1)                      │
    │                                                                │
    │    Dilemma: Both defecting is worse than both cooperating!   │
    │                                                                │
    │  NASH EQUILIBRIUM:                                            │
    │    No player can improve by unilaterally changing strategy   │
    │    In Prisoner's Dilemma: (Defect, Defect) is Nash!         │
    │                                                                │
    │  PARETO OPTIMALITY:                                           │
    │    No way to make someone better off without hurting others  │
    │    (Cooperate, Cooperate) is Pareto optimal but not Nash!   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
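
Why is (Defect, Defect) the equilibrium even though both players prefer (Cooperate, Cooperate)? One way to see it is to check that Defect pays Player 1 more than Cooperate in every column of the table above. A small sketch of that check follows; the helper name `strictly_dominant_action` is made up for this example.

```python
import numpy as np

# Player 1's payoffs from the Prisoner's Dilemma table above
# (rows: own action, columns: opponent's action; 0 = Cooperate, 1 = Defect)
payoff_p1 = np.array([[3, 0],
                      [5, 1]])


def strictly_dominant_action(payoff):
    """Return the row that beats every other row in every column, or None."""
    n_actions = payoff.shape[0]
    for a in range(n_actions):
        others = [b for b in range(n_actions) if b != a]
        if all(np.all(payoff[a] > payoff[b]) for b in others):
            return a
    return None


dominant = strictly_dominant_action(payoff_p1)
print(f"Strictly dominant action for Player 1: {dominant}")
```

The game is symmetric, so the same holds for Player 2. When every player has a strictly dominant action, playing those actions is the only Nash equilibrium.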

In [None]:
# Implement classic games

class MatrixGame:
    """
    Two-player normal-form game.
    
    Players simultaneously choose actions.
    Payoffs determined by the payoff matrices.
    """
    
    def __init__(self, payoff_matrix_1, payoff_matrix_2, action_names=None):
        """
        Args:
            payoff_matrix_1: Payoffs for player 1 (rows=P1, cols=P2)
            payoff_matrix_2: Payoffs for player 2
            action_names: Names for actions (optional)
        """
        self.payoff_1 = np.array(payoff_matrix_1)
        self.payoff_2 = np.array(payoff_matrix_2)
        self.n_actions_1 = self.payoff_1.shape[0]
        self.n_actions_2 = self.payoff_1.shape[1]
        self.action_names = action_names or [f'Action {i}' for i in range(self.n_actions_1)]
        
    def play(self, action_1, action_2):
        """Execute a round of the game."""
        return self.payoff_1[action_1, action_2], self.payoff_2[action_1, action_2]
    
    def find_nash_equilibria(self):
        """
        Find pure strategy Nash equilibria.
        
        Nash equilibrium: Neither player can improve by deviating.
        """
        nash = []
        
        for i in range(self.n_actions_1):
            for j in range(self.n_actions_2):
                # Check if P1 can improve by changing action
                p1_can_improve = any(
                    self.payoff_1[k, j] > self.payoff_1[i, j] 
                    for k in range(self.n_actions_1) if k != i
                )
                
                # Check if P2 can improve by changing action
                p2_can_improve = any(
                    self.payoff_2[i, k] > self.payoff_2[i, j] 
                    for k in range(self.n_actions_2) if k != j
                )
                
                if not p1_can_improve and not p2_can_improve:
                    nash.append((i, j))
        
        return nash
    
    def is_pareto_optimal(self, action_1, action_2):
        """
        Check if outcome is Pareto optimal.
        
        Pareto optimal: Can't make anyone better off without hurting someone.
        """
        p1, p2 = self.play(action_1, action_2)
        
        for i in range(self.n_actions_1):
            for j in range(self.n_actions_2):
                other_p1, other_p2 = self.play(i, j)
                # Check if (i,j) Pareto dominates (action_1, action_2)
                if (other_p1 >= p1 and other_p2 >= p2 and 
                    (other_p1 > p1 or other_p2 > p2)):
                    return False
        return True


# Create classic games
print("CLASSIC GAMES")
print("="*60)

# Prisoner's Dilemma
# Actions: 0 = Cooperate, 1 = Defect
prisoners_dilemma = MatrixGame(
    payoff_matrix_1=[[3, 0], [5, 1]],
    payoff_matrix_2=[[3, 5], [0, 1]],
    action_names=['Cooperate', 'Defect']
)

print("\nPRISONER'S DILEMMA:")
print("                        Player 2")
print("                    Cooperate  Defect")
print("Player 1  Cooperate   (3,3)    (0,5)")
print("          Defect      (5,0)    (1,1)")

nash = prisoners_dilemma.find_nash_equilibria()
print(f"\nNash Equilibria: {[(prisoners_dilemma.action_names[i], prisoners_dilemma.action_names[j]) for i, j in nash]}")

print("\nPareto Optimal Outcomes:")
for i in range(2):
    for j in range(2):
        if prisoners_dilemma.is_pareto_optimal(i, j):
            p1, p2 = prisoners_dilemma.play(i, j)
            print(f"  ({prisoners_dilemma.action_names[i]}, {prisoners_dilemma.action_names[j]}): ({p1}, {p2})")

print("\nTHE DILEMMA:")
print("  Nash equilibrium (Defect, Defect) gives (1,1)")
print("  But (Cooperate, Cooperate) gives (3,3) - better for both!")
print("  Individual rationality leads to collective irrationality!")

In [None]:
# Visualize Prisoner's Dilemma

fig, ax = plt.subplots(figsize=(10, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title("Prisoner's Dilemma: Payoff Matrix", fontsize=16, fontweight='bold')

# Draw payoff matrix
cell_width = 2.5
cell_height = 2
start_x = 3
start_y = 3

outcomes = [
    [(3, 3), (0, 5)],
    [(5, 0), (1, 1)]
]
colors = [
    ['#c8e6c9', '#ffcdd2'],  # (C,C) green, (C,D) red for P1
    ['#ffcdd2', '#fff3e0']   # (D,C) red for P2, (D,D) orange (Nash)
]

for i in range(2):
    for j in range(2):
        x = start_x + j * cell_width
        y = start_y + (1-i) * cell_height
        
        # Cell
        rect = Rectangle((x, y), cell_width, cell_height, 
                         facecolor=colors[i][j], edgecolor='black', linewidth=2)
        ax.add_patch(rect)
        
        # Payoffs
        p1, p2 = outcomes[i][j]
        ax.text(x + cell_width/2, y + cell_height/2, f'({p1}, {p2})',
               ha='center', va='center', fontsize=14, fontweight='bold')

# Labels
ax.text(start_x + cell_width, start_y + 2*cell_height + 0.5, 'Player 2',
       ha='center', fontsize=12, fontweight='bold')
ax.text(start_x + cell_width/2, start_y + 2*cell_height + 0.2, 'Cooperate',
       ha='center', fontsize=10)
ax.text(start_x + 1.5*cell_width, start_y + 2*cell_height + 0.2, 'Defect',
       ha='center', fontsize=10)

ax.text(start_x - 0.8, start_y + cell_height, 'Player 1',
       ha='center', va='center', fontsize=12, fontweight='bold', rotation=90)
ax.text(start_x - 0.3, start_y + 1.5*cell_height, 'Cooperate',
       ha='right', va='center', fontsize=10)
ax.text(start_x - 0.3, start_y + 0.5*cell_height, 'Defect',
       ha='right', va='center', fontsize=10)

# Annotations
# Nash equilibrium
ax.annotate('Nash\nEquilibrium', xy=(start_x + 2*cell_width - 0.3, start_y + 0.3),
           xytext=(start_x + 2*cell_width + 1.5, start_y - 0.5),
           fontsize=10, color='#d32f2f', fontweight='bold',
           arrowprops=dict(arrowstyle='->', color='#d32f2f', lw=2))

# Pareto optimal
ax.annotate('Pareto\nOptimal!', xy=(start_x + 0.3, start_y + 2*cell_height - 0.3),
           xytext=(start_x - 1.5, start_y + 2*cell_height + 1),
           fontsize=10, color='#388e3c', fontweight='bold',
           arrowprops=dict(arrowstyle='->', color='#388e3c', lw=2))

# Legend
ax.text(5, 1, 'The Dilemma: Nash ≠ Pareto Optimal', ha='center',
       fontsize=12, style='italic', color='#666')

plt.tight_layout()
plt.show()

---
## The Non-Stationarity Challenge

```
    ┌────────────────────────────────────────────────────────────────┐
    │              NON-STATIONARITY                                  │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  THE PROBLEM:                                                 │
    │    In single-agent RL, the environment's dynamics are fixed    │
    │    In multi-agent RL, other agents ARE the environment!      │
    │    And they're learning too - the world keeps changing!      │
    │                                                                │
    │  EXAMPLE:                                                     │
    │    Agent 1 learns: "Attack works great!"                     │
    │    Agent 2 learns: "I should defend more"                    │
    │    Agent 1: "Attack doesn't work anymore!"                   │
    │    Agent 2: "They stopped attacking, I can be aggressive!"   │
    │    ... and the cycle continues                               │
    │                                                                │
    │  CONSEQUENCES:                                                │
    │    • Q-values can oscillate forever                          │
    │    • Convergence guarantees break down                       │
    │    • Training can be unstable                                │
    │                                                                │
    │  SOLUTIONS:                                                   │
    │    • Centralized training (CTDE)                             │
    │    • Opponent modeling                                       │
    │    • Self-play curriculum                                    │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
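
To make the problem concrete, here is a rough sketch using the Prisoner's Dilemma payoffs from earlier. It assumes the other agent cooperates with some probability `p` and computes the expected reward of each of our actions. The numbers a Q-learner is trying to estimate move whenever `p` moves, even though nothing about the game itself has changed. (`expected_rewards` is an illustrative helper, not part of the agents defined in this notebook.)

```python
import numpy as np

# Player 1's Prisoner's Dilemma payoffs (rows: own action, columns: opponent's action)
payoff_p1 = np.array([[3.0, 0.0],
                      [5.0, 1.0]])


def expected_rewards(p_opponent_cooperates):
    """Expected reward of [Cooperate, Defect] against a fixed opponent policy."""
    opponent_policy = np.array([p_opponent_cooperates, 1.0 - p_opponent_cooperates])
    return payoff_p1 @ opponent_policy


for p in (1.0, 0.5, 0.0):
    coop, defect = expected_rewards(p)
    print(f"p(opponent cooperates)={p:.1f}  E[Cooperate]={coop:.1f}  E[Defect]={defect:.1f}")
```

From a single agent's point of view, the "true" value of Cooperate drifts between 3 and 0 as the opponent learns, so any running estimate is chasing a moving target.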

In [None]:
# Demonstrate non-stationarity in iterated games

class IteratedPrisonersDilemma:
    """
    Iterated Prisoner's Dilemma with learning agents.
    
    Both agents learn simultaneously, creating non-stationarity.
    """
    
    def __init__(self):
        self.payoff = np.array([
            [(3, 3), (0, 5)],
            [(5, 0), (1, 1)]
        ])
        
    def step(self, action_1, action_2):
        """Play one round."""
        return self.payoff[action_1, action_2]


class LearningAgent:
    """
    Simple stateless Q-learning agent.
    
    Keeps one value per action (no state, no memory of past rounds)
    and learns those values through trial and error.
    """
    
    def __init__(self, n_actions=2, learning_rate=0.1, epsilon=0.1):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.epsilon = epsilon
        self.q_values = np.zeros(n_actions)  # Q(action)
        
    def select_action(self):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_values)
    
    def update(self, action, reward):
        """Update Q-value for action."""
        self.q_values[action] += self.lr * (reward - self.q_values[action])


def simulate_iterated_game(n_rounds=1000, track_interval=10):
    """
    Simulate iterated game with two learning agents.
    
    This demonstrates non-stationarity:
    - Both agents learn simultaneously
    - Each agent's rewards depend on the other's current policy
    - The value of each action keeps shifting as the other agent learns!
    """
    game = IteratedPrisonersDilemma()
    agent_1 = LearningAgent(learning_rate=0.1, epsilon=0.15)
    agent_2 = LearningAgent(learning_rate=0.1, epsilon=0.15)
    
    history = {
        'q_cooperate_1': [],
        'q_defect_1': [],
        'q_cooperate_2': [],
        'q_defect_2': [],
        'cooperation_rate': [],
    }
    
    cooperations = 0
    
    for t in range(n_rounds):
        # Both agents choose simultaneously
        a1 = agent_1.select_action()
        a2 = agent_2.select_action()
        
        # Get rewards
        r1, r2 = game.step(a1, a2)
        
        # Both agents update
        agent_1.update(a1, r1)
        agent_2.update(a2, r2)
        
        # Track cooperation
        if a1 == 0 and a2 == 0:
            cooperations += 1
        
        # Record history
        if t % track_interval == 0:
            history['q_cooperate_1'].append(agent_1.q_values[0])
            history['q_defect_1'].append(agent_1.q_values[1])
            history['q_cooperate_2'].append(agent_2.q_values[0])
            history['q_defect_2'].append(agent_2.q_values[1])
            history['cooperation_rate'].append(cooperations / (t + 1))
    
    return history


# Run simulation
print("NON-STATIONARITY DEMONSTRATION")
print("="*60)

print("\nSimulating two learning agents in Iterated Prisoner's Dilemma...")
history = simulate_iterated_game(n_rounds=2000)

print("Done! Next, watch how Q-values oscillate as agents adapt to each other!")

In [None]:
# Visualize non-stationarity

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# History is recorded every 10 rounds (the default track_interval above)
steps = np.arange(len(history['q_cooperate_1'])) * 10

# Top-left: Agent 1 Q-values
ax1 = axes[0, 0]
ax1.plot(steps, history['q_cooperate_1'], 'g-', label='Q(Cooperate)', linewidth=2)
ax1.plot(steps, history['q_defect_1'], 'r-', label='Q(Defect)', linewidth=2)
ax1.set_xlabel('Round', fontsize=11)
ax1.set_ylabel('Q-Value', fontsize=11)
ax1.set_title('Agent 1: Q-Values Over Time', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Top-right: Agent 2 Q-values
ax2 = axes[0, 1]
ax2.plot(steps, history['q_cooperate_2'], 'g-', label='Q(Cooperate)', linewidth=2)
ax2.plot(steps, history['q_defect_2'], 'r-', label='Q(Defect)', linewidth=2)
ax2.set_xlabel('Round', fontsize=11)
ax2.set_ylabel('Q-Value', fontsize=11)
ax2.set_title('Agent 2: Q-Values Over Time', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Bottom-left: Cooperation rate
ax3 = axes[1, 0]
ax3.plot(steps, history['cooperation_rate'], 'b-', linewidth=2)
ax3.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax3.set_xlabel('Round', fontsize=11)
ax3.set_ylabel('Cooperation Rate', fontsize=11)
ax3.set_title('Mutual Cooperation Rate', fontsize=12, fontweight='bold')
ax3.set_ylim(0, 1)
ax3.grid(True, alpha=0.3)

# Bottom-right: Q-value difference
ax4 = axes[1, 1]
q_diff_1 = np.array(history['q_defect_1']) - np.array(history['q_cooperate_1'])
q_diff_2 = np.array(history['q_defect_2']) - np.array(history['q_cooperate_2'])
ax4.plot(steps, q_diff_1, 'r-', label='Agent 1: Q(D)-Q(C)', linewidth=2, alpha=0.7)
ax4.plot(steps, q_diff_2, 'b-', label='Agent 2: Q(D)-Q(C)', linewidth=2, alpha=0.7)
ax4.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax4.set_xlabel('Round', fontsize=11)
ax4.set_ylabel('Q-Value Difference', fontsize=11)
ax4.set_title('Defection Preference (Q(Defect) - Q(Cooperate))', fontsize=12, fontweight='bold')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nOBSERVATIONS:")
print("  • Q-values oscillate - they don't converge to fixed values!")
print("  • When one agent cooperates more, the other learns to defect")
print("  • This creates cycles of cooperation and defection")
print("  • This is NON-STATIONARITY: each action's payoff keeps changing as the other agent learns")

---
## Multi-Agent Algorithms

```
    ┌────────────────────────────────────────────────────────────────┐
    │              MULTI-AGENT ALGORITHMS                            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  1. INDEPENDENT Q-LEARNING (IQL)                              │
    │     Each agent learns independently                          │
    │     Treats other agents as part of environment               │
    │     Simple but suffers from non-stationarity                 │
    │                                                                │
    │  2. CENTRALIZED TRAINING, DECENTRALIZED EXECUTION (CTDE)     │
    │     Training: Agents share information                       │
    │     Execution: Each agent acts independently                 │
    │     A paradigm, not one algorithm: QMIX and MADDPG use it      │
    │                                                                │
    │  3. QMIX (for cooperative)                                    │
    │     Learns individual Q-values                               │
    │     Mixes them into a global Q-value                         │
    │     Constraint: Monotonic mixing (argmax preserved)          │
    │                                                                │
    │  4. MADDPG (Multi-Agent DDPG)                                 │
    │     Actor-Critic for continuous control                      │
    │     Centralized critic, decentralized actors                 │
    │     Works for cooperative, competitive, mixed                │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
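
The "argmax preserved" property of QMIX can be checked on a toy example. The sketch below is not QMIX itself: the real algorithm produces its non-negative mixing weights with a state-conditioned hypernetwork, whereas here a fixed positive-weight linear mixer and random per-agent Q-values stand in as assumptions. The point is only that when the global value increases monotonically with each agent's Q-value, each agent picking its own best action also maximizes the global value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-agent Q-values for one state (random numbers, purely illustrative)
q_agent_1 = rng.normal(size=3)   # agent 1 has 3 actions
q_agent_2 = rng.normal(size=4)   # agent 2 has 4 actions

# Fixed mixer with strictly positive weights (assumption: QMIX instead learns
# state-dependent weights that are constrained to be non-negative)
weights = np.abs(rng.normal(size=2)) + 0.1
bias = rng.normal()

# Global value for every joint action (a1, a2), shape (3, 4)
q_total = weights[0] * q_agent_1[:, None] + weights[1] * q_agent_2[None, :] + bias

# Centralized choice: search over all joint actions
joint_best = tuple(int(i) for i in np.unravel_index(np.argmax(q_total), q_total.shape))

# Decentralized choice: each agent maximizes only its own Q-values
local_best = (int(np.argmax(q_agent_1)), int(np.argmax(q_agent_2)))

print(f"Joint argmax: {joint_best}, per-agent argmax: {local_best}")
print(f"Match: {joint_best == local_best}")
```

This is what makes decentralized execution possible: at run time no agent needs to evaluate the mixer or know the other agents' Q-values.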

In [None]:
# Visualize CTDE

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Training (centralized)
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('TRAINING: Centralized', fontsize=14, fontweight='bold', color='#1976d2')

# Central coordinator
coord = FancyBboxPatch((3.5, 6.5), 3, 2, boxstyle="round,pad=0.1",
                        facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=3)
ax1.add_patch(coord)
ax1.text(5, 7.8, 'Central', ha='center', fontsize=10, fontweight='bold')
ax1.text(5, 7.2, 'Coordinator', ha='center', fontsize=10)

# Agents
for i, (x, y) in enumerate([(2, 3), (5, 3), (8, 3)]):
    circle = Circle((x, y), 0.8, facecolor='#4caf50', edgecolor='black', linewidth=2)
    ax1.add_patch(circle)
    ax1.text(x, y, f'A{i+1}', ha='center', va='center', fontweight='bold', color='white')
    # Connection to coordinator
    ax1.plot([x, 5], [y+0.8, 6.5], 'b-', linewidth=2, alpha=0.5)

# Labels
ax1.text(5, 5, 'Share observations,\nactions, rewards', ha='center', fontsize=9, color='#666')
ax1.text(5, 1.5, 'Full information access', ha='center', fontsize=10, style='italic')

# Right: Execution (decentralized)
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('EXECUTION: Decentralized', fontsize=14, fontweight='bold', color='#388e3c')

# Environment
env = FancyBboxPatch((2, 6.5), 6, 2, boxstyle="round,pad=0.1",
                      facecolor='#fff3e0', edgecolor='#f57c00', linewidth=2)
ax2.add_patch(env)
ax2.text(5, 7.5, 'Environment', ha='center', fontsize=10, fontweight='bold')

# Agents (independent)
for i, (x, y) in enumerate([(2, 3), (5, 3), (8, 3)]):
    circle = Circle((x, y), 0.8, facecolor='#4caf50', edgecolor='black', linewidth=2)
    ax2.add_patch(circle)
    ax2.text(x, y, f'A{i+1}', ha='center', va='center', fontweight='bold', color='white')
    # Individual connection to environment
    ax2.annotate('', xy=(x, 6.5), xytext=(x, 3.8),
                arrowprops=dict(arrowstyle='<->', lw=1.5, color='#666'))
    ax2.text(x, 5, f'obs{i+1}\n↓\nact{i+1}', ha='center', fontsize=7, color='#666')

ax2.text(5, 1.5, 'Each agent acts on local observation only', ha='center', fontsize=10, style='italic')

plt.tight_layout()
plt.show()

print("\nCTDE (Centralized Training, Decentralized Execution):")
print("  Training: Share all information (states, actions, rewards)")
print("  Execution: Each agent uses only its local observation")
print("  Benefit: More stable training, and deployment needs only local observations!")

In [None]:
# Simple cooperative multi-agent environment

class CooperativeGridWorld:
    """
    Cooperative multi-agent grid world.
    
    Two agents must meet at a goal location.
    Reward only given when BOTH agents reach the goal.
    """
    
    def __init__(self, size=5):
        self.size = size
        self.goal = (size-1, size-1)
        self.reset()
        
    def reset(self):
        """Reset agents to starting positions."""
        self.agent_1_pos = (0, 0)
        self.agent_2_pos = (self.size-1, 0)
        return self._get_obs()
    
    def _get_obs(self):
        """Return observations for both agents."""
        return {
            'agent_1': (*self.agent_1_pos, *self.goal),
            'agent_2': (*self.agent_2_pos, *self.goal),
        }
    
    def step(self, action_1, action_2):
        """
        Execute actions for both agents.
        
        Actions: 0=up, 1=right, 2=down, 3=left, 4=stay
        """
        # Move agent 1
        self.agent_1_pos = self._move(self.agent_1_pos, action_1)
        # Move agent 2
        self.agent_2_pos = self._move(self.agent_2_pos, action_2)
        
        # Check if both at goal
        done = (self.agent_1_pos == self.goal and self.agent_2_pos == self.goal)
        
        # Shared reward
        reward = 10.0 if done else -0.1
        
        return self._get_obs(), reward, done
    
    def _move(self, pos, action):
        """Move agent based on action."""
        x, y = pos
        if action == 0:  # up
            y = min(y + 1, self.size - 1)
        elif action == 1:  # right
            x = min(x + 1, self.size - 1)
        elif action == 2:  # down
            y = max(y - 1, 0)
        elif action == 3:  # left
            x = max(x - 1, 0)
        # action 4 = stay
        return (x, y)


class CooperativeAgent:
    """
    Q-learning agent for cooperative task.
    """
    
    def __init__(self, state_dims=4, n_actions=5, lr=0.1, gamma=0.99, epsilon=0.2):
        self.n_actions = n_actions
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        # Tabular Q-values keyed by (state, action); unseen pairs default to 0.0
        self.q_table = {}
        
    def get_q(self, state, action):
        """Get Q-value."""
        return self.q_table.get((state, action), 0.0)
    
    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        
        q_values = [self.get_q(state, a) for a in range(self.n_actions)]
        return np.argmax(q_values)
    
    def update(self, state, action, reward, next_state, done):
        """Q-learning update."""
        if done:
            target = reward
        else:
            next_q = max(self.get_q(next_state, a) for a in range(self.n_actions))
            target = reward + self.gamma * next_q
        
        old_q = self.get_q(state, action)
        self.q_table[(state, action)] = old_q + self.lr * (target - old_q)


# Train cooperative agents
# (each is an independent Q-learner on its own observation; only the reward is shared)
print("COOPERATIVE MULTI-AGENT TRAINING")
print("="*60)

env = CooperativeGridWorld(size=5)
agent_1 = CooperativeAgent()
agent_2 = CooperativeAgent()

n_episodes = 1000
episode_rewards = []
success_count = 0

for episode in range(n_episodes):
    obs = env.reset()
    total_reward = 0
    
    for step in range(50):  # Max steps per episode
        # Get actions
        action_1 = agent_1.select_action(obs['agent_1'])
        action_2 = agent_2.select_action(obs['agent_2'])
        
        # Step environment
        next_obs, reward, done = env.step(action_1, action_2)
        total_reward += reward
        
        # Update both agents with shared reward
        agent_1.update(obs['agent_1'], action_1, reward, next_obs['agent_1'], done)
        agent_2.update(obs['agent_2'], action_2, reward, next_obs['agent_2'], done)
        
        obs = next_obs
        
        if done:
            success_count += 1
            break
    
    episode_rewards.append(total_reward)

print(f"\nTraining completed!")
print(f"Success rate: {success_count/n_episodes:.1%}")
print(f"Average reward (last 100): {np.mean(episode_rewards[-100:]):.2f}")

In [None]:
# Visualize cooperative training

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Learning curve
ax1 = axes[0]
window = 50
smoothed = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
ax1.plot(smoothed, 'b-', linewidth=2)
ax1.set_xlabel('Episode', fontsize=11)
ax1.set_ylabel('Total Reward', fontsize=11)
ax1.set_title('Cooperative Learning Curve', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Right: Visualize final policy
ax2 = axes[1]
ax2.set_xlim(-0.5, env.size - 0.5)
ax2.set_ylim(-0.5, env.size - 0.5)
ax2.set_aspect('equal')
ax2.set_title('Final Learned Paths', fontsize=12, fontweight='bold')

# Draw grid
for i in range(env.size + 1):
    ax2.axhline(y=i - 0.5, color='gray', linewidth=0.5)
    ax2.axvline(x=i - 0.5, color='gray', linewidth=0.5)

# Draw goal
goal = Rectangle((env.goal[0]-0.4, env.goal[1]-0.4), 0.8, 0.8,
                  facecolor='#4caf50', alpha=0.5)
ax2.add_patch(goal)
ax2.text(env.goal[0], env.goal[1], 'GOAL', ha='center', va='center', fontsize=9, fontweight='bold')

# Simulate one episode with trained agents
obs = env.reset()
path_1 = [env.agent_1_pos]
path_2 = [env.agent_2_pos]

agent_1.epsilon = 0  # Greedy
agent_2.epsilon = 0

for _ in range(20):
    action_1 = agent_1.select_action(obs['agent_1'])
    action_2 = agent_2.select_action(obs['agent_2'])
    obs, _, done = env.step(action_1, action_2)
    path_1.append(env.agent_1_pos)
    path_2.append(env.agent_2_pos)
    if done:
        break

# Plot paths
path_1 = np.array(path_1)
path_2 = np.array(path_2)
ax2.plot(path_1[:, 0], path_1[:, 1], 'r-o', linewidth=2, markersize=8, label='Agent 1')
ax2.plot(path_2[:, 0], path_2[:, 1], 'b-s', linewidth=2, markersize=8, label='Agent 2')

# Mark start positions
ax2.plot(0, 0, 'r*', markersize=15, label='A1 Start')
ax2.plot(env.size-1, 0, 'b*', markersize=15, label='A2 Start')

ax2.legend(loc='upper left')
ax2.set_xlabel('X', fontsize=11)
ax2.set_ylabel('Y', fontsize=11)

plt.tight_layout()
plt.show()

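print("\nCOOPERATIVE LEARNING:")
if done:
    print("  Both agents reached the shared goal in the greedy rollout above!")
else:
    print("  The greedy rollout did not reach the goal within 20 steps - try training longer.")
print("  Success requires COORDINATION - not just individual skill.")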

---
## Emergent Behaviors

```
    ┌────────────────────────────────────────────────────────────────┐
    │              EMERGENT BEHAVIORS                                │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  WHAT IS EMERGENCE?                                           │
    │    Complex behaviors that arise from simple rules             │
    │    Not explicitly programmed - learned through interaction    │
    │                                                                │
    │  EXAMPLES IN MULTI-AGENT RL:                                  │
    │                                                                │
    │  • COOPERATION                                                │
    │    Agents learn to help each other                           │
    │    Division of labor emerges                                 │
    │                                                                │
    │  • COMMUNICATION                                              │
    │    Agents develop shared signals                             │
    │    "Language" emerges from scratch                           │
    │                                                                │
    │  • DECEPTION                                                  │
    │    In competitive settings, agents learn to mislead          │
    │    Feints and bluffs emerge naturally                        │
    │                                                                │
    │  • SPECIALIZATION                                             │
    │    Different agents develop different roles                  │
    │    "Scout", "Fighter", "Gatherer"                            │
    │                                                                │
    │  FAMOUS EXAMPLES:                                             │
    │    • OpenAI Hide and Seek: Tool use emerged                  │
    │    • AlphaStar: Novel StarCraft strategies                   │
    │    • Emergent communication in cooperative games             │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
# Simple emergence demonstration: Predator-Prey

class PredatorPreyWorld:
    """
    Simple predator-prey environment.
    
    Used below with hand-coded heuristics to illustrate chase/flee dynamics.
    """
    
    def __init__(self, size=10):
        self.size = size
        self.reset()
        
    def reset(self):
        """Reset to random positions."""
        self.predator = np.array([np.random.randint(self.size), np.random.randint(self.size)], dtype=float)
        self.prey = np.array([np.random.randint(self.size), np.random.randint(self.size)], dtype=float)
        # Ensure they don't start at same position
        while np.array_equal(self.predator.astype(int), self.prey.astype(int)):
            self.prey = np.array([np.random.randint(self.size), np.random.randint(self.size)], dtype=float)
        return self._get_obs()
    
    def _get_obs(self):
        """Return observations."""
        return {
            'predator': np.concatenate([self.predator, self.prey - self.predator]),
            'prey': np.concatenate([self.prey, self.predator - self.prey]),
        }
    
    def step(self, predator_action, prey_action):
        """
        Execute actions.
        
        Actions are direction vectors (dx, dy). Each component is clipped to
        [-1, 1] and then scaled by the agent's speed.
        """
        predator_speed = 0.8
        prey_speed = 0.6  # Prey is slower
        
        # Move predator
        self.predator += predator_speed * np.clip(predator_action, -1, 1)
        self.predator = np.clip(self.predator, 0, self.size - 1)
        
        # Move prey
        self.prey += prey_speed * np.clip(prey_action, -1, 1)
        self.prey = np.clip(self.prey, 0, self.size - 1)
        
        # Check catch
        distance = np.linalg.norm(self.predator - self.prey)
        caught = distance < 0.5
        
        # Rewards
        predator_reward = 10.0 if caught else -distance / self.size
        prey_reward = -10.0 if caught else distance / self.size
        
        return self._get_obs(), predator_reward, prey_reward, caught


# Simulate with simple policies
print("EMERGENT BEHAVIOR: PREDATOR-PREY")
print("="*60)

env = PredatorPreyWorld(size=10)

# Record one episode
obs = env.reset()
predator_path = [env.predator.copy()]
prey_path = [env.prey.copy()]

for step in range(100):
    # Simple heuristic policies (chase/flee)
    # Predator: Move toward prey
    direction_to_prey = obs['predator'][2:4]  # (prey - predator)
    if np.linalg.norm(direction_to_prey) > 0:
        predator_action = direction_to_prey / np.linalg.norm(direction_to_prey)
    else:
        predator_action = np.array([0, 0])
    
    # Prey: Move away from predator + some randomness
    direction_from_predator = obs['prey'][2:4]  # (predator - prey)
    if np.linalg.norm(direction_from_predator) > 0:
        prey_action = -direction_from_predator / np.linalg.norm(direction_from_predator)
    else:
        prey_action = np.random.randn(2)
    prey_action += 0.2 * np.random.randn(2)  # Some randomness
    
    obs, _, _, caught = env.step(predator_action, prey_action)
    predator_path.append(env.predator.copy())
    prey_path.append(env.prey.copy())
    
    if caught:
        print(f"Prey caught at step {step}!")
        break

predator_path = np.array(predator_path)
prey_path = np.array(prey_path)

In [None]:
# Visualize predator-prey dynamics

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Trajectories
ax1 = axes[0]
ax1.set_xlim(-0.5, env.size - 0.5)
ax1.set_ylim(-0.5, env.size - 0.5)
ax1.set_aspect('equal')
ax1.set_title('Predator-Prey Chase', fontsize=14, fontweight='bold')

# Plot trajectories
ax1.plot(predator_path[:, 0], predator_path[:, 1], 'r-', linewidth=2, alpha=0.7, label='Predator')
ax1.plot(prey_path[:, 0], prey_path[:, 1], 'b-', linewidth=2, alpha=0.7, label='Prey')

# Start and end points
ax1.plot(predator_path[0, 0], predator_path[0, 1], 'rs', markersize=15, label='Predator Start')
ax1.plot(prey_path[0, 0], prey_path[0, 1], 'bs', markersize=15, label='Prey Start')
ax1.plot(predator_path[-1, 0], predator_path[-1, 1], 'r*', markersize=20)
ax1.plot(prey_path[-1, 0], prey_path[-1, 1], 'b*', markersize=20)

ax1.set_xlabel('X', fontsize=11)
ax1.set_ylabel('Y', fontsize=11)
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# Right: Distance over time
ax2 = axes[1]
distances = np.linalg.norm(predator_path - prey_path, axis=1)
ax2.plot(distances, 'purple', linewidth=2)
ax2.axhline(y=0.5, color='red', linestyle='--', label='Catch threshold')
ax2.set_xlabel('Step', fontsize=11)
ax2.set_ylabel('Distance', fontsize=11)
ax2.set_title('Distance Between Agents', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nCHASE/FLEE BEHAVIORS OBSERVED:")
print("  • Predator: Pursuit strategy - direct chase toward prey")
print("  • Prey: Evasion strategy - flee away from predator")
print("  • Here both policies are hand-coded heuristics, not learned")
print("  • With RL, similar and more complex strategies can emerge from rewards alone")

---
## Real-World Applications

```
    ┌────────────────────────────────────────────────────────────────┐
    │              MULTI-AGENT RL APPLICATIONS                       │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  GAMES:                                                       │
    │    • AlphaGo/AlphaZero: Self-play for board games            │
    │    • AlphaStar: Grandmaster StarCraft II                     │
    │    • OpenAI Five: Dota 2                                     │
    │    • Pluribus: Poker (6-player)                              │
    │                                                                │
    │  ROBOTICS:                                                    │
    │    • Multi-robot coordination                                │
    │    • Swarm robotics                                          │
    │    • Warehouse automation                                    │
    │                                                                │
    │  AUTONOMOUS VEHICLES:                                         │
    │    • Traffic coordination                                    │
    │    • Intersection management                                 │
    │    • Fleet management                                        │
    │                                                                │
    │  ECONOMICS:                                                   │
    │    • Market simulation                                       │
    │    • Auction design                                          │
    │    • Resource allocation                                     │
    │                                                                │
    │  LLM AGENTS:                                                  │
    │    • Multi-agent debate for reasoning                        │
    │    • Collaborative problem-solving                           │
    │    • Agent-to-agent negotiation                              │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
# Visualize applications

fig, ax = plt.subplots(figsize=(14, 10))
ax.set_xlim(0, 14)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('Multi-Agent RL: Real-World Applications', fontsize=16, fontweight='bold')

applications = [
    (2, 9, 'Games', 'AlphaStar (StarCraft II)\nOpenAI Five (Dota 2)', '#e91e63'),
    (7, 9, 'Robotics', 'Swarm coordination\nWarehouse automation', '#4caf50'),
    (12, 9, 'Autonomous\nVehicles', 'Traffic coordination\nFleet management', '#2196f3'),
    (4.5, 5, 'Economics', 'Market simulation\nAuction design', '#ff9800'),
    (9.5, 5, 'LLM Agents', 'Multi-agent debate\nCollaborative AI', '#9c27b0'),
]

for x, y, title, desc, color in applications:
    box = FancyBboxPatch((x-1.5, y-1.5), 3, 3, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor='black', linewidth=2, alpha=0.8)
    ax.add_patch(box)
    ax.text(x, y+0.8, title, ha='center', fontsize=11, fontweight='bold', color='white')
    ax.text(x, y-0.4, desc, ha='center', fontsize=8, color='white')

# Key challenges
ax.text(7, 2, 'Key Challenges: Non-stationarity | Credit Assignment | Scalability | Communication',
       ha='center', fontsize=11, style='italic', color='#666')

plt.tight_layout()
plt.show()

---
## Summary: Multi-Agent RL

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Cooperative** | Agents share rewards, work toward common goal |
| **Competitive** | Zero-sum: one agent's gain is another's loss |
| **Mixed** | Both cooperation and competition (e.g., competing teams) |
| **Non-stationarity** | From each agent's view, the environment keeps changing because the other agents are learning too |
| **CTDE** | Train together, execute independently |
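
To make the non-stationarity row concrete, consider a two-action coordination game in which agent 1 is rewarded only when both agents pick the same action (a payoff chosen purely for illustration). The value of each of agent 1's actions depends entirely on agent 2's current policy, so as that policy drifts during learning, agent 1's best response flips even though the game itself never changed. The function name `action_values` and the policy snapshots are hypothetical.

```python
import numpy as np

# Reward to agent 1, indexed as [agent 1 action, agent 2 action].
# Illustrative coordination game: reward 1 only when the actions match.
reward = np.array([[1.0, 0.0],
                   [0.0, 1.0]])


def action_values(p_other_plays_0):
    """Expected reward of each agent-1 action given agent 2's mixed policy."""
    other_policy = np.array([p_other_plays_0, 1.0 - p_other_plays_0])
    return reward @ other_policy


# Agent 2's policy drifts as it learns (hypothetical snapshots).
snapshots = [0.9, 0.6, 0.4, 0.1]
best_responses = []
for p in snapshots:
    q = action_values(p)
    best_responses.append(int(np.argmax(q)))
    print(f"P(agent 2 plays 0) = {p:.1f} -> Q = {q}, best response = {best_responses[-1]}")
```

A Q-learner that averaged over all four snapshots would be chasing a target that keeps moving, which is why single-agent convergence arguments break down here.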

### Game Theory Concepts

| Concept | Definition |
|---------|------------|
| **Nash Equilibrium** | No agent can improve by changing alone |
| **Pareto Optimal** | Can't improve one without hurting another |
| **Best Response** | Optimal action given others' strategies |
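
The definitions in this table can be checked by brute force on a small matrix game. The sketch below uses the Prisoner's Dilemma with commonly used illustrative payoffs (3 each for mutual cooperation, 1 each for mutual defection, 5 and 0 for unilateral defection). The payoff values and the helper names `is_nash` and `is_pareto_optimal` are assumptions made for this example.

```python
import itertools
import numpy as np

# Payoffs indexed as [row action, column action]; action 0 = Cooperate, 1 = Defect.
# These numbers are illustrative assumptions (a standard Prisoner's Dilemma).
payoff_row = np.array([[3, 0],
                       [5, 1]])
payoff_col = np.array([[3, 5],
                       [0, 1]])
ACTIONS = ["Cooperate", "Defect"]


def is_nash(a_row, a_col):
    """True if each action is a best response to the other (no solo deviation helps)."""
    row_is_best = payoff_row[a_row, a_col] >= payoff_row[:, a_col].max()
    col_is_best = payoff_col[a_row, a_col] >= payoff_col[a_row, :].max()
    return bool(row_is_best and col_is_best)


def is_pareto_optimal(a_row, a_col):
    """True if no other outcome is at least as good for both and better for one."""
    u = (payoff_row[a_row, a_col], payoff_col[a_row, a_col])
    for b_row, b_col in itertools.product(range(2), repeat=2):
        v = (payoff_row[b_row, b_col], payoff_col[b_row, b_col])
        if v[0] >= u[0] and v[1] >= u[1] and v != u:
            return False
    return True


outcomes = list(itertools.product(range(2), repeat=2))
nash_outcomes = [o for o in outcomes if is_nash(*o)]
pareto_outcomes = [o for o in outcomes if is_pareto_optimal(*o)]

for a_row, a_col in outcomes:
    print(f"{ACTIONS[a_row]:>9} / {ACTIONS[a_col]:<9} "
          f"Nash={is_nash(a_row, a_col)}  Pareto={is_pareto_optimal(a_row, a_col)}")
```

With these payoffs, Defect/Defect is the only Nash equilibrium and also the only outcome that is not Pareto optimal, which is exactly the tension the Prisoner's Dilemma is known for.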

### Key Algorithms

| Algorithm | Type | Key Idea |
|-----------|------|----------|
| IQL | Independent | Each agent learns separately |
| QMIX | Cooperative | Mix individual Q-values |
| MADDPG | General | Centralized critic, decentralized actors |
| COMA | Cooperative | Counterfactual baselines |
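
The QMIX entry ("mix individual Q-values") refers to combining per-agent Q-values into a joint value through a mixing network whose weights are kept non-negative, so that raising any single agent's Q-value never lowers the joint value. Below is a minimal NumPy sketch of that monotonic mixing step only. The class name `MonotonicMixer`, the single random linear hypernetwork per weight, and all dimensions are illustrative assumptions; this is not a full QMIX implementation and nothing is trained.

```python
import numpy as np


def elu(x):
    """ELU activation; it is monotonically increasing, so monotonicity is preserved."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


class MonotonicMixer:
    """QMIX-style mixer: Q_tot = w2 . elu(q @ W1 + b1) + b2 with W1, w2 >= 0."""

    def __init__(self, n_agents, state_dim, hidden_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n_agents = n_agents
        self.hidden_dim = hidden_dim
        # Hypernetwork parameters: map the global state to mixing weights and biases.
        self.hyper_w1 = rng.normal(size=(state_dim, n_agents * hidden_dim))
        self.hyper_b1 = rng.normal(size=(state_dim, hidden_dim))
        self.hyper_w2 = rng.normal(size=(state_dim, hidden_dim))
        self.hyper_b2 = rng.normal(size=(state_dim,))

    def mix(self, agent_qs, state):
        """Combine per-agent Q-values into one joint value for this state."""
        # Absolute values keep the mixing weights non-negative (the monotonicity constraint).
        w1 = np.abs(state @ self.hyper_w1).reshape(self.n_agents, self.hidden_dim)
        b1 = state @ self.hyper_b1
        w2 = np.abs(state @ self.hyper_w2)
        b2 = state @ self.hyper_b2
        hidden = elu(agent_qs @ w1 + b1)
        return float(hidden @ w2 + b2)


mixer = MonotonicMixer(n_agents=2, state_dim=4)
state = np.array([0.5, -1.0, 0.3, 2.0])
q_low = mixer.mix(np.array([1.0, 2.0]), state)
q_high = mixer.mix(np.array([1.5, 2.0]), state)  # only agent 0's Q increased
print(f"Q_tot before: {q_low:.3f}, after raising agent 0's Q: {q_high:.3f}")
```

Because the joint value is non-decreasing in every agent's Q-value, each agent greedily maximizing its own Q also maximizes the joint value, which is what makes decentralized execution possible.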

### Quick Reference

```python
# Multi-agent training loop
for episode in range(n_episodes):
    obs = env.reset()
    for step in range(max_steps):
        # Each agent selects an action based on its own observation
        actions = [agent.select_action(obs[i]) for i, agent in enumerate(agents)]

        # Environment steps with the joint action
        next_obs, rewards, done = env.step(actions)

        # Each agent updates from its own transition
        for i, agent in enumerate(agents):
            agent.update(obs[i], actions[i], rewards[i], next_obs[i])

        # Advance to the next observation and stop when the episode ends
        obs = next_obs
        if done:
            break
```
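
For a version of this loop that runs end to end, the sketch below plugs in two independent Q-learners and a one-step coordination game. The `MatchingGame` environment, the `IndependentQLearner` class, and all hyperparameters are hypothetical stand-ins invented for this example; they only mirror the `reset` / `step` / `select_action` / `update` interface used in the loop.

```python
import numpy as np


class MatchingGame:
    """One-step cooperative game: both agents get reward 1 if their actions match."""

    def reset(self):
        # A single dummy state (0) observed by both agents.
        return [0, 0]

    def step(self, actions):
        reward = 1.0 if actions[0] == actions[1] else 0.0
        return [0, 0], [reward, reward], True  # shared reward, episode ends


class IndependentQLearner:
    """Tabular Q-learner that treats the other agent as part of the environment."""

    def __init__(self, n_states=1, n_actions=2, alpha=0.1, epsilon=0.1, seed=0):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def select_action(self, obs):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q.shape[1]))  # explore
        return int(np.argmax(self.q[obs]))  # exploit

    def update(self, obs, action, reward, next_obs):
        # Episodes last one step, so the target is just the immediate reward.
        self.q[obs, action] += self.alpha * (reward - self.q[obs, action])


env = MatchingGame()
agents = [IndependentQLearner(seed=1), IndependentQLearner(seed=2)]
n_episodes, max_steps = 2000, 1

for episode in range(n_episodes):
    obs = env.reset()
    for step in range(max_steps):
        actions = [agent.select_action(obs[i]) for i, agent in enumerate(agents)]
        next_obs, rewards, done = env.step(actions)
        for i, agent in enumerate(agents):
            agent.update(obs[i], actions[i], rewards[i], next_obs[i])
        obs = next_obs
        if done:
            break

greedy_actions = [int(np.argmax(agent.q[0])) for agent in agents]
print("Greedy actions after training:", greedy_actions)
```

Neither learner ever sees the other's Q-table; in this easy game they still tend to settle on the same action, since matching is the only way either of them is rewarded.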

---
## Test Your Understanding

**1. What is non-stationarity and why is it a problem?**
<details>
<summary>Click to reveal answer</summary>
Non-stationarity means the dynamics and rewards an agent experiences change over time. In multi-agent RL, the other agents ARE part of each agent's environment, and they're learning too. This means:
- The optimal action keeps changing
- Q-values may never converge
- Standard single-agent convergence guarantees, which assume a stationary environment, no longer apply

Common mitigations: CTDE, opponent modeling, or self-play.
</details>

**2. What's the difference between Nash equilibrium and Pareto optimality?**
<details>
<summary>Click to reveal answer</summary>
Nash Equilibrium: No player can improve by changing their strategy alone (stability concept).

Pareto Optimality: No way to make someone better off without making someone else worse off (efficiency concept).

Key insight: Nash equilibria are NOT always Pareto optimal! (Prisoner's Dilemma: Defect-Defect is the Nash equilibrium, but Cooperate-Cooperate leaves both players better off, so it Pareto-dominates Defect-Defect.)
</details>

**3. What is CTDE and why is it useful?**
<details>
<summary>Click to reveal answer</summary>
CTDE = Centralized Training, Decentralized Execution

Training: All agents share information (observations, actions, rewards)
Execution: Each agent acts only on its local observation

Benefits:
- More stable training (the centralized critic sees every agent's actions, which mitigates non-stationarity)
- Practical deployment (no communication needed at runtime)
- Best of both worlds!
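
A minimal sketch of this information flow, in the style of a MADDPG-like actor-critic: each actor maps only its own observation to an action, while the critic used during training scores the concatenation of every agent's observation and action. The linear maps, the dimensions, the function names, and the use of one shared critic (MADDPG itself keeps one critic per agent) are assumptions for illustration; no learning is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 4, 2

# One actor per agent: local observation -> action (decentralized execution).
actor_weights = [rng.normal(size=(obs_dim, act_dim)) for _ in range(n_agents)]

# A centralized critic: all observations + all actions -> scalar value.
critic_input_dim = n_agents * (obs_dim + act_dim)
critic_weights = rng.normal(size=critic_input_dim)


def act(agent_id, local_obs):
    """Execution-time policy: uses only this agent's own observation."""
    return np.tanh(local_obs @ actor_weights[agent_id])


def centralized_q(all_obs, all_actions):
    """Training-time critic: sees every agent's observation and action."""
    joint = np.concatenate([*all_obs, *all_actions])
    return float(joint @ critic_weights)


all_obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
all_actions = [act(i, all_obs[i]) for i in range(n_agents)]
q_value = centralized_q(all_obs, all_actions)

print("Actor input size:", obs_dim, "| critic input size:", critic_input_dim)
print("Joint Q estimate (training only):", round(q_value, 3))
```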
</details>

**4. What are emergent behaviors in multi-agent systems?**
<details>
<summary>Click to reveal answer</summary>
Emergent behaviors are complex strategies that arise from simple rules without being explicitly programmed. Examples:

- Cooperation: Agents learn to help each other
- Communication: Agents develop shared signals
- Deception: Agents learn to mislead opponents
- Specialization: Different roles emerge (scout, fighter, etc.)

Famous example: OpenAI Hide and Seek - agents learned to use tools!
</details>

**5. When would you use QMIX vs MADDPG?**
<details>
<summary>Click to reveal answer</summary>
QMIX:
- Cooperative settings only
- Discrete actions
- Factored Q-function with monotonicity constraint

MADDPG:
- Any setting (cooperative, competitive, mixed)
- Continuous actions
- Actor-Critic framework

Rule of thumb: QMIX for cooperative tasks with discrete actions; MADDPG when actions are continuous or the setting is competitive or mixed.
</details>

---
## Congratulations!

You've completed the entire **Reinforcement Learning** curriculum!

### Your Journey:

```
Fundamentals → Classic Algorithms → Deep RL → Policy Gradients
     ↓                                              ↓
  MDPs, Value Functions              REINFORCE, Actor-Critic
     ↓                                              ↓
     └──────────────────────────────────────────────┘
                           ↓
                   Advanced Algorithms
                   (PPO, SAC, TRPO)
                           ↓
                         RLHF
               (Reward Models, DPO, TRL)
                           ↓
                     Applications
          (Games, Robotics, Recommendations,
           LLM Alignment, Multi-Agent)
```

### What You've Learned:

- ✅ **Fundamentals**: MDPs, policies, value functions, Bellman equations
- ✅ **Classic Algorithms**: Q-learning, SARSA, Monte Carlo, TD learning
- ✅ **Deep RL**: DQN, experience replay, target networks
- ✅ **Policy Gradients**: REINFORCE, Actor-Critic, A2C/A3C
- ✅ **Advanced**: PPO, SAC, TRPO
- ✅ **RLHF**: Reward modeling, PPO for LLMs, DPO
- ✅ **Applications**: Games, robotics, recommendations, alignment, multi-agent

### Where to Go Next:

1. **Practice**: Implement algorithms from scratch
2. **Explore**: Try PettingZoo, RLlib for multi-agent
3. **Apply**: Use RLHF to fine-tune your own models
4. **Research**: Read papers on emergent communication, social dilemmas

---

*"The future of AI is not just individual intelligence, but the collective intelligence of many agents working together!"*