# Robotics Simulation: Teaching Robots to Move

From balancing pendulums to walking humanoids, RL makes robots move!

## What You'll Learn

By the end of this notebook, you'll understand:
- The dance instructor analogy: how RL teaches robots to move
- Continuous vs discrete control
- Popular robotics environments (Pendulum, MuJoCo)
- Why SAC excels at continuous control
- Reward shaping for locomotion
- Sim-to-real transfer challenges

**Prerequisites:** Game playing notebook, PPO basics

**Time:** ~35 minutes

---
## The Big Picture: The Dance Instructor Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE DANCE INSTRUCTOR ANALOGY                          │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Teaching a robot to walk is like teaching someone to dance...│
    │                                                                │
    │  THE BODY (Robot State):                                      │
    │    Joint angles: How bent is each joint?                     │
    │    Joint velocities: How fast is each joint moving?          │
    │    Body orientation: Is the robot upright?                   │
    │    Contact sensors: Are feet touching ground?                │
    │                                                                │
    │  THE MUSCLES (Continuous Actions):                            │
    │    Not just "move left" or "move right"                      │
    │    But "apply 0.73 torque to hip, 0.21 to knee"              │
    │    Smooth, continuous control is harder!                     │
    │                                                                │
    │  THE INSTRUCTOR (Reward Signal):                              │
    │    "Good! You moved forward" (+1)                            │
    │    "Oops, you fell" (-10)                                    │
    │    "Using too much energy" (-0.1)                            │
    │    "Nice and smooth!" (+0.5)                                 │
    │                                                                │
    │  THE LEARNING:                                                │
    │    Practice → Feedback → Adjust → Master the dance!         │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
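
To make the "muscles" part of the analogy concrete, here is a minimal sketch of the difference between picking one button and choosing a torque. It uses only the Python standard library. The function names, the three-button action set, the six hypothetical joints, and the Gaussian-plus-tanh squashing are illustrative assumptions, not the API of any library used in this notebook.

```python
import math
import random

def sample_discrete_action(probabilities, rng):
    """Pick ONE index from a finite set, e.g. 0=LEFT, 1=NONE, 2=RIGHT."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def sample_continuous_action(mean, std, max_torque, rng):
    """Draw a real-valued torque: Gaussian sample, squashed by tanh, then scaled."""
    raw = rng.gauss(mean, std)           # unbounded sample
    return max_torque * math.tanh(raw)   # bounded to [-max_torque, +max_torque]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Discrete: one choice out of three buttons
button = sample_discrete_action([0.2, 0.5, 0.3], rng)

# Continuous: one real number per joint (6 joints is an assumption for illustration)
torques = [sample_continuous_action(0.0, 1.0, 2.0, rng) for _ in range(6)]

print("discrete action:", button)
print("continuous action:", [round(t, 2) for t in torques])
```

The discrete policy only has to rank three options, while the continuous policy has to output a full vector of real numbers, one per joint, at every step.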

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Circle, Rectangle, Wedge
import warnings
warnings.filterwarnings('ignore')

# Check gymnasium availability
try:
    import gymnasium as gym
    GYM_AVAILABLE = True
    print("✓ Gymnasium is installed!")
except ImportError:
    GYM_AVAILABLE = False
    print("✗ Gymnasium not installed.")

# Check stable-baselines3
try:
    from stable_baselines3 import PPO, SAC, TD3
    from stable_baselines3.common.evaluation import evaluate_policy
    SB3_AVAILABLE = True
    print("✓ Stable-Baselines3 is installed!")
except ImportError:
    SB3_AVAILABLE = False
    print("✗ Stable-Baselines3 not installed.")

# Check for MuJoCo
MUJOCO_AVAILABLE = False
try:
    env = gym.make('HalfCheetah-v4')
    env.close()
    MUJOCO_AVAILABLE = True
    print("✓ MuJoCo environments available!")
except Exception:  # MuJoCo missing, or `gym` undefined because the import above failed
    print("✗ MuJoCo not available (optional).")
    print("  Install with: pip install gymnasium[mujoco]")

In [None]:
# Visualize discrete vs continuous control

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Discrete actions (games)
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('Discrete Control (Games)', fontsize=14, fontweight='bold', color='#1976d2')

# Draw discrete action choices
discrete_actions = [
    (2, 7, 'LEFT', '←'),
    (5, 7, 'NONE', '•'),
    (8, 7, 'RIGHT', '→'),
]

for x, y, label, symbol in discrete_actions:
    box = FancyBboxPatch((x-1, y-0.5), 2, 1.5, boxstyle="round,pad=0.1",
                          facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=2)
    ax1.add_patch(box)
    ax1.text(x, y+0.3, symbol, ha='center', fontsize=20)
    ax1.text(x, y-0.2, label, ha='center', fontsize=9)

ax1.text(5, 5, 'Choose ONE action', ha='center', fontsize=12, style='italic')
ax1.text(5, 4, 'Like pressing a button', ha='center', fontsize=10, color='#666')

# Example environments
ax1.text(5, 2.5, 'Examples:', ha='center', fontsize=11, fontweight='bold')
ax1.text(5, 1.8, 'CartPole, LunarLander, Atari', ha='center', fontsize=10, color='#666')

# Right: Continuous actions (robotics)
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('Continuous Control (Robotics)', fontsize=14, fontweight='bold', color='#388e3c')

# Draw continuous slider
ax2.plot([2, 8], [7, 7], 'k-', linewidth=4)
ax2.text(2, 6.3, '-1.0', ha='center', fontsize=10)
ax2.text(5, 6.3, '0.0', ha='center', fontsize=10)
ax2.text(8, 6.3, '+1.0', ha='center', fontsize=10)

# Current value marker
current_val = 6.3  # x-position on the 2..8 slider: (6.3 - 5) / 3 ≈ 0.43
circle = Circle((current_val, 7), 0.3, facecolor='#388e3c', edgecolor='black', linewidth=2)
ax2.add_patch(circle)
ax2.text(current_val, 7.8, '0.43', ha='center', fontsize=10, fontweight='bold', color='#388e3c')

ax2.text(5, 5, 'Choose ANY value in range', ha='center', fontsize=12, style='italic')
ax2.text(5, 4, 'Like turning a dial smoothly', ha='center', fontsize=10, color='#666')

# Example environments
ax2.text(5, 2.5, 'Examples:', ha='center', fontsize=11, fontweight='bold')
ax2.text(5, 1.8, 'Pendulum, HalfCheetah, Humanoid', ha='center', fontsize=10, color='#666')

plt.tight_layout()
plt.show()

print("\nKEY DIFFERENCE:")
print("  Discrete: action ∈ {0, 1, 2, 3} (finite choices)")
print("  Continuous: action ∈ [-1, +1] (infinite choices!)")
print("  Continuous control is usually harder: the agent can't enumerate its actions!")

---
## Robotics Environments: A Tour

```
    ┌────────────────────────────────────────────────────────────────┐
    │              ROBOTICS ENVIRONMENTS                             │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  PENDULUM (Difficulty: ⭐)                                     │
    │    Task: Swing up and balance a pendulum                      │
    │    State: [cos(θ), sin(θ), angular_velocity]                  │
    │    Action: Torque ∈ [-2, +2]                                  │
    │    Great for learning continuous control!                    │
    │                                                                │
    │  HALFCHEETAH (Difficulty: ⭐⭐)                                │
    │    Task: Make a 2D cheetah run fast                          │
    │    State: 17D (joint angles, velocities, body orientation)   │
    │    Action: 6 joint torques                                   │
    │    Classic locomotion benchmark                              │
    │                                                                │
    │  ANT (Difficulty: ⭐⭐⭐)                                      │
    │    Task: 4-legged robot walking                              │
    │    State: 27D in v4 (111D with contact forces)               │
    │    Action: 8 joint torques                                   │
    │    Coordination challenge                                    │
    │                                                                │
    │  HUMANOID (Difficulty: ⭐⭐⭐⭐)                               │
    │    Task: Bipedal walking                                     │
    │    State: 376D (huge state space!)                           │
    │    Action: 17 joint torques                                  │
    │    Very hard - balancing + coordination                      │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
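
As a small worked example for the Pendulum entry above, the sketch below shows how the `[cos(θ), sin(θ), angular_velocity]` observation maps back to an angle, and where the per-step reward bounds come from. The reward formula and limits follow the Gymnasium documentation for `Pendulum-v1`; the helper names are made up for this illustration.

```python
import math

MAX_SPEED = 8.0    # angular velocity limit in Pendulum-v1 (rad/s)
MAX_TORQUE = 2.0   # action limit in Pendulum-v1

def encode_observation(theta, omega):
    """Represent the angle as (cos, sin) so the observation has no jump at ±π."""
    return (math.cos(theta), math.sin(theta), omega)

def decode_angle(obs):
    """Recover θ in [-π, π] from the (cos, sin) pair; θ = 0 means upright."""
    cos_t, sin_t, _ = obs
    return math.atan2(sin_t, cos_t)

def pendulum_reward(theta, omega, torque):
    """Per-step reward: penalize angle error, speed, and effort."""
    return -(theta ** 2 + 0.1 * omega ** 2 + 0.001 * torque ** 2)

obs = encode_observation(theta=math.pi / 4, omega=1.0)
recovered = decode_angle(obs)

# Worst case: hanging straight down, spinning at full speed, pushing at full torque
worst = pendulum_reward(math.pi, MAX_SPEED, MAX_TORQUE)

print(f"recovered angle: {recovered:.4f} rad")
print(f"worst per-step reward: {worst:.2f}")
```

The best per-step reward is 0 (upright, motionless, no torque), so a 200-step episode can never score above 0.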

In [None]:
# Explore robotics environments

print("ROBOTICS ENVIRONMENTS")
print("="*70)

if GYM_AVAILABLE:
    environments = {
        'Pendulum-v1': {
            'description': 'Swing up and balance pendulum',
            'difficulty': '⭐',
            'state_dim': 3,
            'action_dim': 1,
            'best_algorithm': 'SAC, TD3, PPO',
            'requires_mujoco': False,
        },
        'HalfCheetah-v4': {
            'description': '2D cheetah locomotion',
            'difficulty': '⭐⭐',
            'state_dim': 17,
            'action_dim': 6,
            'best_algorithm': 'SAC, TD3',
            'requires_mujoco': True,
        },
        'Ant-v4': {
            'description': '4-legged robot locomotion',
            'difficulty': '⭐⭐⭐',
            'state_dim': 27,  # Ant-v4 default; 111 only with use_contact_forces=True
            'action_dim': 8,
            'best_algorithm': 'SAC, TD3',
            'requires_mujoco': True,
        },
        'Humanoid-v4': {
            'description': 'Bipedal walking',
            'difficulty': '⭐⭐⭐⭐',
            'state_dim': 376,
            'action_dim': 17,
            'best_algorithm': 'SAC + careful tuning',
            'requires_mujoco': True,
        },
    }
    
    for env_name, info in environments.items():
        print(f"\n{env_name}")
        print(f"  Description: {info['description']}")
        print(f"  Difficulty: {info['difficulty']}")
        print(f"  State dim: {info['state_dim']}, Action dim: {info['action_dim']}")
        print(f"  Best algorithms: {info['best_algorithm']}")
        
        if info['requires_mujoco'] and not MUJOCO_AVAILABLE:
            print(f"  (Requires MuJoCo - not installed)")
        else:
            try:
                env = gym.make(env_name)
                print(f"  Observation: {env.observation_space}")
                print(f"  Action: {env.action_space}")
                env.close()
            except Exception as e:
                print(f"  (Environment not available: {type(e).__name__})")

print("\n" + "="*70)

In [None]:
# Visualize complexity progression

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('Robotics Environment Complexity', fontsize=16, fontweight='bold')

# Robots as boxes with increasing complexity
robots = [
    (1.5, 5, 'Pendulum', '1 joint', '#4caf50', 3, 1),
    (5, 5, 'HalfCheetah', '6 joints', '#ff9800', 17, 6),
    (8.5, 5, 'Ant', '8 joints', '#f44336', 27, 8),
    (12, 5, 'Humanoid', '17 joints', '#9c27b0', 376, 17),
]

for i, (x, y, name, joints, color, state_dim, action_dim) in enumerate(robots):
    # Robot body (size increases with complexity)
    size = 1.5 + i * 0.3
    box = FancyBboxPatch((x - size/2, y - size/2), size, size, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor='black', linewidth=2, alpha=0.8)
    ax.add_patch(box)
    
    ax.text(x, y + size/2 + 0.8, name, ha='center', fontsize=12, fontweight='bold')
    ax.text(x, y + size/2 + 0.3, joints, ha='center', fontsize=10, color='#666')
    
    # State/Action dimensions
    ax.text(x, y - size/2 - 0.5, f'State: {state_dim}D', ha='center', fontsize=9)
    ax.text(x, y - size/2 - 1, f'Action: {action_dim}D', ha='center', fontsize=9)

# Difficulty arrow
ax.annotate('', xy=(13, 2), xytext=(1, 2),
            arrowprops=dict(arrowstyle='->', lw=3, color='#333'))
ax.text(7, 1.3, 'Increasing Complexity →', ha='center', fontsize=12, style='italic')

# Training time indicators
times = ['~50K steps', '~1M steps', '~3M steps', '~10M+ steps']
for i, ((x, y, _, _, _, _, _), time) in enumerate(zip(robots, times)):
    ax.text(x, 3.3, time, ha='center', fontsize=9, color='#666')

ax.text(7, 3.8, 'Rough Training Budget (environment steps):', ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

---
## Why SAC Excels at Continuous Control

```
    ┌────────────────────────────────────────────────────────────────┐
    │              SAC: SOFT ACTOR-CRITIC                            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  WHY SAC IS GREAT FOR ROBOTICS:                               │
    │                                                                │
    │  1. SAMPLE EFFICIENCY (Off-Policy)                            │
    │     Reuses past experiences from replay buffer               │
    │     Important: each simulated step costs compute             │
    │                                                                │
    │  2. EXPLORATION (Entropy Bonus)                               │
    │     Objective: Maximize reward + entropy                     │
    │     J = E[R + α × H(π)]                                      │
    │     Prevents premature convergence to bad gaits              │
    │                                                                │
    │  3. STABILITY                                                 │
    │     Twin Q-networks reduce overestimation                    │
    │     Automatic temperature tuning                             │
    │     Soft updates for smooth learning                         │
    │                                                                │
    │  COMPARISON:                                                  │
    │    • PPO: Stable but sample inefficient                      │
    │    • TD3: Sample efficient but less exploration              │
    │    • SAC: Best of both worlds for robotics!                  │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
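
The three ingredients in the box (entropy bonus, twin Q-networks, soft updates) can be sketched with plain Python numbers. This is a simplified illustration, not the Stable-Baselines3 implementation: the function names are hypothetical, the critic outputs are hard-coded scalars, and `alpha=0.2` is an assumed fixed temperature (SAC normally tunes it automatically).

```python
def soft_q_target(reward, done, q1_next, q2_next, log_prob_next,
                  alpha=0.2, gamma=0.99):
    """Entropy-regularized TD target built from two critic estimates."""
    min_q = min(q1_next, q2_next)                # twin critics: keep the pessimistic value
    soft_value = min_q - alpha * log_prob_next   # entropy bonus: -alpha * log pi(a'|s')
    not_done = 0.0 if done else 1.0              # no bootstrapping past a terminal state
    return reward + gamma * not_done * soft_value

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]

# One transition with made-up numbers
y = soft_q_target(reward=-1.0, done=False,
                  q1_next=-10.0, q2_next=-12.0, log_prob_next=-1.5)

# Target network barely moves after one update (tau is small)
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0])

print(f"TD target: {y:.3f}")
print("updated target weights:", new_target)
```

Taking the minimum of the two critics counters overestimation, the `-alpha * log_prob` term rewards keeping the policy random, and the small `tau` keeps the target values from shifting abruptly between updates.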

In [None]:
# SAC vs PPO on Pendulum

print("TRAINING ON PENDULUM: SAC vs PPO")
print("="*60)

if SB3_AVAILABLE and GYM_AVAILABLE:
    results = {}
    
    for name, AlgoClass in [('SAC', SAC), ('PPO', PPO)]:
        print(f"\nTraining {name}...")
        
        env = gym.make('Pendulum-v1')
        model = AlgoClass('MlpPolicy', env, verbose=0)
        
        # Train
        model.learn(total_timesteps=20000)
        
        # Evaluate
        mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
        results[name] = (mean_reward, std_reward)
        print(f"  Result: {mean_reward:.2f} ± {std_reward:.2f}")
        
        env.close()
    
    # Visualize
    fig, ax = plt.subplots(figsize=(10, 6))
    
    names = list(results.keys())
    means = [results[n][0] for n in names]
    stds = [results[n][1] for n in names]
    colors = ['#388e3c', '#1976d2']
    
    bars = ax.bar(names, means, yerr=stds, capsize=5, color=colors,
                  edgecolor='black', linewidth=2)
    
    # Per-step reward on Pendulum-v1 lies in roughly [-16.27, 0], so a 200-step
    # episode return ranges from about -3255 (worst) to 0 (best)
    ax.axhline(y=-1200, color='red', linestyle='--', label='Random baseline (~-1200)')
    ax.legend()
    
    ax.set_ylabel('Mean Reward (higher = better)', fontsize=12)
    ax.set_title('Pendulum: SAC vs PPO\n(20K training steps)', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    for bar, mean in zip(bars, means):
        # Bars extend downward (returns are negative), so center each label inside its bar
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() / 2,
                f'{mean:.0f}', ha='center', va='center', fontsize=12,
                fontweight='bold', color='white')
    
    plt.tight_layout()
    plt.show()
    
    print("\nKEY INSIGHT:")
    print("  SAC often learns faster on continuous control tasks!")
    print("  Entropy bonus helps explore the action space smoothly.")
else:
    print("Install required libraries to train agents")

---
## Reward Shaping for Locomotion

```
    ┌────────────────────────────────────────────────────────────────┐
    │              REWARD SHAPING                                    │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  BAD REWARD (too simple):                                     │
    │    reward = +1 if reached_goal else 0                        │
    │    Problem: Sparse, hard to learn                            │
    │                                                                │
    │  GOOD REWARD (shaped):                                        │
    │    reward = velocity_forward                # Move fast!     │
    │           - 0.1 * |control|^2               # Save energy    │
    │           - 10 * (z < threshold)            # Don't fall     │
    │           + 0.1 * still_alive               # Stay up!       │
    │                                                                │
    │  COMMON REWARD COMPONENTS:                                    │
    │    • Forward velocity: Encourage movement                    │
    │    • Control cost: Penalize large torques                    │
    │    • Alive bonus: Reward staying upright                     │
    │    • Impact cost: Penalize hard contacts                     │
    │    • Smoothness: Reward consistent motion                    │
    │                                                                │
    │  TIP: Good reward shaping is an ART!                         │
    │       Too complex → hard to optimize                         │
    │       Too simple → wrong behavior                            │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
# Demonstrate reward components

print("LOCOMOTION REWARD BREAKDOWN")
print("="*60)

def compute_locomotion_reward(
    velocity_forward,
    control_magnitude,
    height,
    alive,
    ctrl_cost_weight=0.1,
    alive_bonus=1.0,
    height_threshold=0.5,
    fall_penalty=10.0
):
    """
    Compute locomotion reward with multiple components.
    
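    Loosely modeled on the Gymnasium MuJoCo locomotion rewards (forward
    velocity, control cost, alive bonus). The explicit fall penalty is an
    addition for illustration; those environments end the episode instead.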
    """
    # Forward progress
    velocity_reward = velocity_forward
    
    # Control cost (penalize large actions)
    ctrl_cost = ctrl_cost_weight * np.sum(control_magnitude ** 2)
    
    # Alive bonus
    alive_reward = alive_bonus if alive else 0
    
    # Fall penalty
    fall_cost = fall_penalty if height < height_threshold else 0
    
    total = velocity_reward - ctrl_cost + alive_reward - fall_cost
    
    return {
        'velocity': velocity_reward,
        'ctrl_cost': -ctrl_cost,
        'alive': alive_reward,
        'fall': -fall_cost,
        'total': total,
    }


# Example scenarios
scenarios = [
    ("Walking nicely", 2.0, np.array([0.3, 0.2, 0.1]), 1.2, True),
    ("Running hard", 5.0, np.array([0.9, 0.8, 0.7]), 1.0, True),
    ("Standing still", 0.0, np.array([0.1, 0.1, 0.1]), 1.2, True),
    ("Falling down", 1.0, np.array([0.5, 0.5, 0.5]), 0.3, False),
]

print(f"\n{'Scenario':<18} {'Velocity':>10} {'Ctrl Cost':>10} {'Alive':>8} {'Fall':>8} {'TOTAL':>10}")
print("-" * 70)

for name, vel, ctrl, height, alive in scenarios:
    r = compute_locomotion_reward(vel, ctrl, height, alive)
    print(f"{name:<18} {r['velocity']:>10.2f} {r['ctrl_cost']:>10.2f} {r['alive']:>8.2f} {r['fall']:>8.2f} {r['total']:>10.2f}")

print("\n" + "="*60)
print("\nKEY INSIGHTS:")
print("  • Running hard pays ~14x the control cost of walking, yet still scores highest:")
print("    with ctrl_cost_weight=0.1, forward velocity dominates the total")
print("  • Standing still earns only the alive bonus")
print("  • Falling incurs a large penalty!")

In [None]:
# Visualize reward components

fig, ax = plt.subplots(figsize=(12, 7))

scenarios_names = ["Walking nicely", "Running hard", "Standing still", "Falling down"]
scenarios_data = [
    (2.0, np.array([0.3, 0.2, 0.1]), 1.2, True),
    (5.0, np.array([0.9, 0.8, 0.7]), 1.0, True),
    (0.0, np.array([0.1, 0.1, 0.1]), 1.2, True),
    (1.0, np.array([0.5, 0.5, 0.5]), 0.3, False),
]

# Compute rewards
all_rewards = []
for vel, ctrl, height, alive in scenarios_data:
    r = compute_locomotion_reward(vel, ctrl, height, alive)
    all_rewards.append(r)

# Stacked bar chart
x = np.arange(len(scenarios_names))
width = 0.6

velocity_vals = [r['velocity'] for r in all_rewards]
ctrl_vals = [r['ctrl_cost'] for r in all_rewards]
alive_vals = [r['alive'] for r in all_rewards]
fall_vals = [r['fall'] for r in all_rewards]

# Plot positive components
ax.bar(x, velocity_vals, width, label='Velocity Reward', color='#4caf50')
ax.bar(x, alive_vals, width, bottom=velocity_vals, label='Alive Bonus', color='#2196f3')

# Plot negative components (below zero)
ax.bar(x, ctrl_vals, width, label='Control Cost', color='#ff9800')
ax.bar(x, fall_vals, width, bottom=ctrl_vals, label='Fall Penalty', color='#f44336')

# Total line
totals = [r['total'] for r in all_rewards]
ax.plot(x, totals, 'ko-', markersize=10, linewidth=2, label='Total Reward')

ax.axhline(y=0, color='black', linewidth=1)
ax.set_xlabel('Scenario', fontsize=12)
ax.set_ylabel('Reward', fontsize=12)
ax.set_title('Locomotion Reward Components', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(scenarios_names, rotation=15)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

---
## Sim-to-Real Transfer

```
    ┌────────────────────────────────────────────────────────────────┐
    │              SIM-TO-REAL TRANSFER                              │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  THE PROBLEM:                                                 │
    │    Training in simulation: ✓ Fast, cheap, safe               │
    │    Running on real robot: ? Does it still work?              │
    │                                                                │
    │  THE REALITY GAP:                                             │
    │    • Friction is different                                   │
    │    • Sensors are noisy                                       │
    │    • Motors have delays                                      │
    │    • Physics isn't perfect                                   │
    │                                                                │
    │  SOLUTIONS:                                                   │
    │                                                                │
    │  1. DOMAIN RANDOMIZATION                                      │
    │     Train with random friction, mass, etc.                   │
    │     Robot learns to be robust!                               │
    │                                                                │
    │  2. SYSTEM IDENTIFICATION                                     │
    │     Measure real robot parameters                            │
    │     Make simulation match reality                            │
    │                                                                │
    │  3. SIM-TO-REAL FINE-TUNING                                   │
    │     Train mostly in sim                                      │
    │     Fine-tune on real robot (carefully!)                     │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
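
Domain randomization (solution 1 above) can be sketched in a few lines: draw fresh physics parameters at the start of every episode and corrupt the sensor readings. Everything here is an assumption for illustration: the parameter names, their ranges, and the helper functions are hypothetical, and writing the sampled values into an actual simulator is left out because that step is simulator-specific.

```python
import random

def sample_sim_params(rng):
    """Draw one set of physics parameters; the ranges are illustrative guesses."""
    return {
        "friction": rng.uniform(0.5, 1.5),           # scale on nominal friction
        "mass_scale": rng.uniform(0.8, 1.2),         # scale on nominal link masses
        "sensor_noise_std": rng.uniform(0.0, 0.05),  # std of additive sensor noise
        "action_delay_steps": rng.randint(0, 2),     # motor latency in control steps
    }

def noisy_observation(true_obs, params, rng):
    """Corrupt a clean simulator reading with Gaussian sensor noise."""
    return [x + rng.gauss(0.0, params["sensor_noise_std"]) for x in true_obs]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

for episode in range(3):
    params = sample_sim_params(rng)  # new "world" every episode
    obs = noisy_observation([1.0, 0.0, 0.0], params, rng)
    print(f"episode {episode}: friction={params['friction']:.2f}, "
          f"mass_scale={params['mass_scale']:.2f}, "
          f"delay={params['action_delay_steps']} steps, "
          f"obs={[round(o, 3) for o in obs]}")
```

Because the policy never sees the same physics twice, it cannot rely on one exact friction or mass value, which is the property that helps when the real robot's parameters are unknown.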

In [None]:
# Visualize sim-to-real challenge

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('The Sim-to-Real Challenge', fontsize=16, fontweight='bold')

# Simulation box
sim_box = FancyBboxPatch((0.5, 5.5), 5, 4, boxstyle="round,pad=0.1",
                          facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=3)
ax.add_patch(sim_box)
ax.text(3, 9, 'SIMULATION', ha='center', fontsize=12, fontweight='bold', color='#388e3c')
ax.text(3, 8.3, '✓ Fast (often faster than real time)', ha='center', fontsize=10)
ax.text(3, 7.7, '✓ Cheap (no hardware)', ha='center', fontsize=10)
ax.text(3, 7.1, '✓ Safe (can crash)', ha='center', fontsize=10)
ax.text(3, 6.3, '✗ Not perfect physics', ha='center', fontsize=10, color='#d32f2f')

# Real world box
real_box = FancyBboxPatch((8.5, 5.5), 5, 4, boxstyle="round,pad=0.1",
                           facecolor='#fff3e0', edgecolor='#f57c00', linewidth=3)
ax.add_patch(real_box)
ax.text(11, 9, 'REAL WORLD', ha='center', fontsize=12, fontweight='bold', color='#f57c00')
ax.text(11, 8.3, '✗ Slow (real-time)', ha='center', fontsize=10, color='#d32f2f')
ax.text(11, 7.7, '✗ Expensive (hardware)', ha='center', fontsize=10, color='#d32f2f')
ax.text(11, 7.1, '✗ Dangerous (can break)', ha='center', fontsize=10, color='#d32f2f')
ax.text(11, 6.3, '✓ True physics', ha='center', fontsize=10)

# Gap in the middle
ax.annotate('', xy=(8.4, 7.5), xytext=(5.6, 7.5),
            arrowprops=dict(arrowstyle='->', lw=3, color='#666'))
ax.text(7, 8.2, 'THE GAP', ha='center', fontsize=11, fontweight='bold', color='#d32f2f')
ax.text(7, 7.8, '(Reality is different!)', ha='center', fontsize=9, color='#666')

# Solutions
solutions = [
    (2, 3.5, 'Domain\nRandomization', 'Train with varied\nparameters', '#4caf50'),
    (7, 3.5, 'System\nIdentification', 'Match sim to\nreal robot', '#2196f3'),
    (12, 3.5, 'Fine-Tuning', 'Train in sim,\ntune on real', '#9c27b0'),
]

ax.text(7, 4.8, 'SOLUTIONS:', ha='center', fontsize=12, fontweight='bold')

for x, y, title, desc, color in solutions:
    box = FancyBboxPatch((x-1.5, y-1.5), 3, 2.5, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor='black', linewidth=2, alpha=0.3)
    ax.add_patch(box)
    ax.text(x, y+0.5, title, ha='center', fontsize=10, fontweight='bold')
    ax.text(x, y-0.5, desc, ha='center', fontsize=9)

plt.tight_layout()
plt.show()

print("\nSIM-TO-REAL SUCCESS STORIES:")
print("  • OpenAI: Robot hand solving Rubik's cube (domain randomization)")
print("  • ETH Zurich: ANYmal quadruped locomotion (actuator model identified from real data)")
print("  • Google: Robot manipulation (sim + real fine-tuning)")

---
## Complete Example: Training a Continuous-Control Agent

Let's train SAC on Pendulum, a small continuous control task, and inspect what it learns!

In [None]:
# Complete training example on Pendulum

print("TRAINING SAC ON PENDULUM")
print("="*60)

if SB3_AVAILABLE and GYM_AVAILABLE:
    # Create environment
    env = gym.make('Pendulum-v1')
    
    print("\nEnvironment Details:")
    print(f"  Observation space: {env.observation_space}")
    print(f"  Action space: {env.action_space}")
    
    # Create SAC model with good defaults for robotics
    model = SAC(
        'MlpPolicy',
        env,
        learning_rate=3e-4,
        buffer_size=100000,
        batch_size=256,
        tau=0.005,               # Soft update coefficient
        gamma=0.99,              # Discount factor
        learning_starts=1000,    # Random exploration first
        verbose=0,
    )
    
    # Evaluate before training
    print("\nEvaluating before training...")
    mean_before, std_before = evaluate_policy(model, env, n_eval_episodes=5)
    print(f"  Mean reward: {mean_before:.2f} ± {std_before:.2f}")
    
    # Train
    print("\nTraining for 30K timesteps...")
    # progress_bar=True needs the optional tqdm and rich packages; set it to False if they are missing
    model.learn(total_timesteps=30000, progress_bar=True)
    
    # Evaluate after training
    print("\nEvaluating after training...")
    mean_after, std_after = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"  Mean reward: {mean_after:.2f} ± {std_after:.2f}")
    
    improvement = mean_after - mean_before
    print(f"\nImprovement: {improvement:+.2f}")
    
    # Run an episode and record actions
    print("\nRecording trained policy...")
    obs, _ = env.reset()
    states = [obs]
    actions = []
    rewards = []
    
    for _ in range(200):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        actions.append(action[0])
        rewards.append(reward)
        if terminated or truncated:
            break
    
    env.close()
    
    # Visualize learned behavior
    fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=True)
    
    # Actions
    ax1 = axes[0]
    ax1.plot(actions, 'b-', linewidth=1.5)
    ax1.set_ylabel('Action (Torque)', fontsize=11)
    ax1.set_title('Learned Pendulum Control', fontsize=14, fontweight='bold')
    ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-2.5, 2.5)
    
    # Angle (from cos and sin)
    angles = [np.arctan2(s[1], s[0]) for s in states[:-1]]
    ax2 = axes[1]
    ax2.plot(angles, 'g-', linewidth=1.5)
    ax2.set_ylabel('Angle (rad)', fontsize=11)
    ax2.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Target (upright)')
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    
    # Rewards
    ax3 = axes[2]
    ax3.plot(rewards, 'orange', linewidth=1.5)
    ax3.set_xlabel('Timestep', fontsize=11)
    ax3.set_ylabel('Reward', fontsize=11)
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nLEARNED BEHAVIOR:")
    print("  The agent learns to swing up and balance!")
    print("  Notice how actions become small once balanced.")
else:
    print("Install required libraries to train")

---
## Summary: Robotics Simulation with RL

### Key Concepts

| Concept | Description |
|---------|-------------|
| Continuous Control | Actions are real numbers, not discrete choices |
| Reward Shaping | Combining multiple objectives (speed, energy, stability) |
| Sim-to-Real | Training in simulation, deploying on real robots |

### Algorithm Comparison

| Algorithm | Sample Efficiency | Stability | Best For |
|-----------|------------------|-----------|----------|
| **SAC** | High | High | Default for robotics |
| **TD3** | High | Medium | When SAC is unstable |
| **PPO** | Low | Very High | When stability is critical |

### Quick Reference

```python
# SAC for robotics
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make('Pendulum-v1')  # or HalfCheetah-v4, etc.
model = SAC('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)
```

---
## Test Your Understanding

**1. Why is continuous control harder than discrete control?**
<details>
<summary>Click to reveal answer</summary>
Continuous control has infinitely many possible actions (any value in a range), while discrete control has a finite set of choices. This makes:
- Exploration harder (can't try all actions)
- Learning slower (larger search space)
- Coordination more complex (multiple continuous values)
</details>

**2. Why does SAC work well for robotics?**
<details>
<summary>Click to reveal answer</summary>
SAC has three key advantages:
1. Sample efficiency: Uses replay buffer to reuse experiences
2. Entropy bonus: Encourages exploration of continuous action space
3. Stability: Twin Q-networks and soft updates
</details>

**3. What is the "reality gap" in sim-to-real transfer?**
<details>
<summary>Click to reveal answer</summary>
The reality gap is the difference between simulation and the real world:
- Different friction, mass, dynamics
- Sensor noise and delays
- Imperfect physics modeling

A policy trained in simulation may fail on a real robot because it learned to exploit unrealistic simulation quirks.
</details>

**4. What's domain randomization and why does it help?**
<details>
<summary>Click to reveal answer</summary>
Domain randomization trains with randomly varied simulation parameters (friction, mass, etc.). This forces the policy to be robust to variations, making it more likely to work in the real world where exact parameters are unknown.
</details>

**5. Why include control cost in the reward?**
<details>
<summary>Click to reveal answer</summary>
Without control cost, robots learn jerky, energy-wasting motions. Penalizing large control signals:
- Encourages smooth, efficient movement
- Reduces wear on real motors
- Produces more natural-looking gaits
</details>

---
## What's Next?

Robotics shows RL in physical systems. In the next notebook, we'll explore a very different application: **recommendation systems** where RL learns user preferences!

**Continue to:** [Notebook 3: Recommendation Systems](03_recommendation_systems.ipynb)

---

*"Teaching robots to walk is like teaching children - patience, practice, and lots of falling down!"*