# Satellite Sensor Tasking with Reinforcement Learning

## Learning Objectives

1. Understand reinforcement learning fundamentals (states, actions, rewards, Q-learning)
2. Apply RL to the satellite sensor tasking problem
3. Visualize agent learning and convergence
4. Understand the impact of sparse rewards on learning
5. Connect grid-based learning to physical satellite gimbal pointing

## Problem Overview

Satellites with sensors (cameras, radar) need to decide where to point to observe ground targets.
This notebook models the problem as a grid where the agent learns to navigate to high-value target locations.

**Setup**: Ensure you're using the satellite_rl kernel (see README.md)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path('../src').resolve()))

from environment.grid_env import SatelliteSensorGridEnv
from agents.q_learning import QLearningAgent
from visualization.grid_viz import plot_learning_curve, plot_value_heatmap, plot_policy_arrows
from visualization.satellite_viz import create_satellite_gimbal_visualization, grid_to_satellite_coordinates

# Set random seed for reproducibility
np.random.seed(42)

# Matplotlib settings
%matplotlib inline

print("All imports successful!")

## Step 1: Create the Environment

We'll create an 11x11 grid environment where:
- The satellite sensor can "point" to any of 121 positions
- The goal (high-value target) is at the center position (state 60)
- Rewards are given ONLY for reaching positions directly adjacent to the goal
- Actions: 0=up, 1=down, 2=left, 3=right

**Important**: This is a **sparse reward** environment - only 4 out of 121 states give rewards!
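
The state numbering can be made concrete with a short sketch. It assumes row-major indexing (state = row × 11 + col), which matches how the next cell prints the goal position. The helper names `state_to_pos`, `pos_to_state`, and `rewarded_states` are hypothetical and are not part of the project's `environment` module.

```python
GRID_SIZE = 11  # 11x11 grid, 121 states


def state_to_pos(state, grid_size=GRID_SIZE):
    """Convert a flat state index to (row, col), assuming row-major order."""
    return divmod(state, grid_size)


def pos_to_state(row, col, grid_size=GRID_SIZE):
    """Convert (row, col) back to a flat state index."""
    return row * grid_size + col


def rewarded_states(goal_state, grid_size=GRID_SIZE):
    """States directly above, below, left, and right of the goal."""
    row, col = state_to_pos(goal_state, grid_size)
    neighbors = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return sorted(
        pos_to_state(r, c, grid_size)
        for r, c in neighbors
        if 0 <= r < grid_size and 0 <= c < grid_size
    )


print(state_to_pos(60))     # (5, 5)
print(rewarded_states(60))  # [49, 59, 61, 71]
```

Under this assumption, the four rewarded states are the cells at (4,5), (5,4), (5,6), and (6,5).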

In [None]:
env = SatelliteSensorGridEnv(grid_x=11, grid_y=11)

print("Environment created!")
print(f"  Grid size: {env.grid_x}x{env.grid_y}")
print(f"  Total states: {env.observation_space.n}")
print(f"  Actions: {env.action_space.n} (0=up, 1=down, 2=left, 3=right)")
print(f"  Goal state: {env.goal_state}")
print(f"  Goal position: (row={env.goal_state // 11}, col={env.goal_state % 11})")

# Reset the environment to get an initial random state
state, _ = env.reset()
print(f"\nInitial random state: {state}")
print("\nNote: Only 4 states (adjacent to goal) give +100 reward.")
print("      All other 117 states learn through value propagation!")

## Step 2: Train Q-Learning Agent

We'll train a Q-learning agent with **optimized parameters** for sparse rewards.

### Hyperparameters (OPTIMIZED for sparse rewards):
- **Learning rate (α): 0.15** - Higher than typical (0.1) for faster learning
- **Discount factor (γ): 0.95** - Higher than typical (0.9) for better value propagation
- **Epsilon (ε): 0.02** - Very low exploration rate, so the agent almost always follows its current greedy policy
- **Episodes: 300,000** - Far more than typical, because sparse rewards propagate slowly
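
To show where these hyperparameters enter, here is a minimal sketch of a tabular Q-learning step with epsilon-greedy action selection. It is a generic textbook formulation, not the project's `QLearningAgent`, whose internals may differ. The function names, the 121×4 table shape, and the example transition (state 38 moving down into state 49 for +100) are assumptions made for illustration.

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.15, 0.95, 0.02  # values from the list above


def choose_action(q_table, state, rng, epsilon=EPSILON):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))


def q_update(q_table, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """One temporal-difference update of Q(state, action), applied in place."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])


q = np.zeros((121, 4))  # hypothetical table: 121 states x 4 actions
q_update(q, state=38, action=1, reward=100.0, next_state=49)
print(round(q[38, 1], 6))  # 15.0, i.e. 0.15 * (100 + 0.95 * 0 - 0)
```

A single rewarded transition moves the estimate only 15% of the way to its target, which is one reason many visits are needed before values settle.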

### Why so many episodes?
With sparse rewards, positions far from the goal must learn through **value propagation**:
- Position (4,5) learns quickly: it's adjacent to goal, gets direct +100 reward
- Position (3,5) learns next: values propagate from (4,5)
- Position (0,5) learns slowly: values must first propagate back through (3,5), (2,5), and (1,5)

This process requires many episodes. You'll see positions near the goal learn correctly first,
while edge positions take much longer.
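
The propagation effect can be reproduced on a toy problem. The sketch below is a deliberately simplified stand-in for one column of the grid. It assumes a 5-cell chain, a single "move forward" action, a fixed start at cell 0, and +100 for entering the last cell. None of this is the notebook's actual environment, where random starts and four actions make propagation much slower.

```python
import numpy as np

ALPHA, GAMMA = 0.15, 0.95
N_CELLS = 5  # cells 0..4; entering the last cell pays +100 and ends the episode


def first_learned_episode(n_cells=N_CELLS, n_episodes=10):
    """Return, per non-terminal cell, the first episode in which its value became nonzero."""
    values = np.zeros(n_cells)           # one action only, so Q reduces to V
    first_seen = [None] * (n_cells - 1)  # the terminal cell is never updated
    for episode in range(1, n_episodes + 1):
        for cell in range(n_cells - 1):  # always start at cell 0 and walk forward
            nxt = cell + 1
            reward = 100.0 if nxt == n_cells - 1 else 0.0
            target = reward + GAMMA * values[nxt]
            values[cell] += ALPHA * (target - values[cell])
            if first_seen[cell] is None and values[cell] > 0:
                first_seen[cell] = episode
    return first_seen


print(first_learned_episode())  # [4, 3, 2, 1]
```

Even in this best case, information moves back only one cell per episode, so the cell farthest from the reward is the last to receive any signal.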

**Estimated training time: 3-5 minutes**

In [None]:
agent = QLearningAgent(
    env=env,
    learning_rate=0.15,      # Higher for faster learning with sparse rewards
    discount_factor=0.95,    # Higher for better value propagation
    epsilon=0.02             # Low for deterministic optimal policy
)

print("Training Q-learning agent with OPTIMIZED parameters...")
print(f"  Learning rate: {agent.lr}")
print(f"  Discount factor: {agent.gamma}")
print(f"  Epsilon: {agent.epsilon}")
print("  Episodes: 300,000")
print("\nThis will take a few minutes. Watch the scores increase!\n")

scores = agent.train(num_episodes=300000, max_steps=50, verbose=True)

print("\n" + "="*60)
print("Training complete!")
print("="*60)
print("Note: Positions near the goal should have learned perfectly.")
print("      Some edge positions may still be improving.")
print("      For perfect convergence at ALL positions, try 500k+ episodes.")

## Step 3: Analyze Learning Convergence

The learning curve shows how the agent's Q-values (summed across all states) increase over episodes.
A rising curve indicates the agent is learning to estimate future rewards more accurately.

You should see:
- Initial rapid increase as positions adjacent to goal learn
- Slower continued growth as values propagate to distant states
- An eventual plateau once most positions have converged
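
Per-episode scores are noisy, so learning curves are usually smoothed before plotting. The sketch below shows one common way to do that, a simple moving average. Whether `plot_learning_curve` uses exactly this method for its `window` argument is an assumption, and `moving_average` is a hypothetical helper name.

```python
import numpy as np


def moving_average(scores, window=1000):
    """Average each run of `window` consecutive scores (no padding at the ends)."""
    scores = np.asarray(scores, dtype=float)
    if len(scores) < window:
        raise ValueError("need at least `window` scores")
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="valid")


demo = moving_average([0, 0, 100, 100, 100], window=2)
print(demo.tolist())  # [0.0, 50.0, 100.0, 100.0]
```

A larger window gives a smoother curve but hides short-lived changes, so it is worth trying a few window sizes when reading the plot.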

In [None]:
plot_learning_curve(scores, window=1000, title="Q-Learning Convergence (11x11 Grid, 300k episodes)")

## Step 4: Value Function Heatmap

The value function V(s) = max_a Q(s,a) shows the expected cumulative reward from each state.

**What to look for**:
- Brightest colors at the goal (highest value)
- Values decrease as you move away from the goal
- Radial pattern centered on goal
- Corners have lowest values (furthest from goal)
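
As a rough illustration of the formula above, the following sketch derives a value grid from a Q-table with NumPy. It assumes the Q-table is a 121×4 array indexed by row-major state. How `agent.get_value_grid()` is actually implemented is not shown in this notebook, so `value_grid_from_q` is a hypothetical stand-in.

```python
import numpy as np


def value_grid_from_q(q_table, grid_size=11):
    """V(s) = max_a Q(s, a), reshaped so entry [row, col] is state row*grid_size + col."""
    return q_table.max(axis=1).reshape(grid_size, grid_size)


# Stand-in Q-table with a single nonzero entry, for illustration only.
q_demo = np.zeros((121, 4))
q_demo[49, 1] = 100.0  # state 49 is (row 4, col 5) under row-major indexing
values = value_grid_from_q(q_demo)
print(values.shape, values[4, 5])  # (11, 11) 100.0
```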

In [None]:
value_grid = agent.get_value_grid()
plot_value_heatmap(value_grid, title="Learned State Values (11x11 Grid)")

## Step 5: Learned Policy Visualization

**THIS IS THE KEY VISUALIZATION!**

Arrows show the best action to take from each grid position.
**All arrows should point TOWARD the center goal at (5,5)**.

### What You Should See:
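- **Top half (rows 0-4)**: Arrows point DOWN ↓
- **Bottom half (rows 6-10)**: Arrows point UP ↑
- **Left side (cols 0-4)**: Arrows point RIGHT →
- **Right side (cols 6-10)**: Arrows point LEFT ←
- **Center row/column**: Arrows point directly to (5,5)

Off the center row and column, two of these rules apply at once, and either direction is an equally short path to the goal. For example, a top-left position may show DOWN ↓ or RIGHT →.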

### Visualization Details:
- Plot uses `origin='upper'` (row 0 at top, like a normal matrix)
- Action 0 (up) → arrow points upward
- Action 1 (down) → arrow points downward
- Red arrows show the learned policy

### If Some Arrows Look Wrong:
Those positions haven't fully converged yet. This is normal with sparse rewards!
The positions closest to the goal should all be correct.
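
To check arrows anywhere on the grid, not just on the center column, one option is to compute the set of goal-ward actions for each cell. This is a sketch under two assumptions: the action encoding from Step 1, and that action 1 (down) increases the row index. The helper `goalward_actions` is hypothetical.

```python
GOAL = (5, 5)
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3  # action encoding from Step 1


def goalward_actions(row, col, goal=GOAL):
    """Return the set of actions that reduce the Manhattan distance to the goal."""
    actions = set()
    if row < goal[0]:
        actions.add(DOWN)  # row 0 is at the top, so larger rows are further down
    elif row > goal[0]:
        actions.add(UP)
    if col < goal[1]:
        actions.add(RIGHT)
    elif col > goal[1]:
        actions.add(LEFT)
    return actions


print(goalward_actions(0, 5))  # {1}
print(goalward_actions(2, 8))  # {1, 2}
print(goalward_actions(5, 5))  # set()
```

A learned action can then be counted as correct when it belongs to this set, which avoids flagging off-axis cells where two directions are equally good.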

In [None]:
policy_grid = agent.get_policy_grid()
plot_policy_arrows(policy_grid, value_grid, 
                  title="Learned Sensor Tasking Policy (11x11 Grid)\nAll arrows should point toward goal at (5,5)")

# Print verification for key positions
print("\nPolicy Verification (Middle Column):")
print("="*50)
action_names = {0: 'up ↑', 1: 'down ↓', 2: 'left ←', 3: 'right →'}
for row in range(11):
    action = policy_grid[row, 5]
    if row < 5:
        expected = "down ↓ (to reach goal)"
        correct = "✓" if action == 1 else "✗"
    elif row > 5:
        expected = "up ↑ (to reach goal)"
        correct = "✓" if action == 0 else "✗"
    else:
        expected = "any (at goal)"
        correct = "-"
    
    print(f"{correct} Row {row:2d}, Col 5: {action_names[action]:8s} | Expected: {expected}")

print("\nIf you see ✗ marks, those positions need more training.")
print("Positions next to the goal (rows 4 and 6) should always be ✓.")

## Step 6: Satellite Gimbal Visualization

Now let's visualize how the learned grid policy translates to actual satellite gimbal pointing.

We'll:
1. Map grid positions to geographic coordinates (latitude/longitude)
2. Convert to 3D Earth-Centered Earth-Fixed (ECEF) coordinates
3. Show satellite gimbal pointing from orbit to ground targets
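
The coordinate steps above can be sketched as follows. This assumes a spherical Earth of radius 6371 km and treats latitude/longitude as angles on that sphere. The project's `grid_to_satellite_coordinates` may use a different Earth model or grid-to-geography mapping, and `latlon_to_ecef` and `pointing_vector` are hypothetical helper names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is assumed for simplicity


def latlon_to_ecef(lat_deg, lon_deg, radius_km=EARTH_RADIUS_KM):
    """Convert latitude/longitude on a sphere to ECEF x, y, z in km."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return radius_km * np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])


def pointing_vector(satellite_pos, target_pos):
    """Unit vector from the satellite to the ground target."""
    line_of_sight = np.asarray(target_pos, dtype=float) - np.asarray(satellite_pos, dtype=float)
    return line_of_sight / np.linalg.norm(line_of_sight)


satellite = np.array([EARTH_RADIUS_KM + 500.0, 0.0, 0.0])  # 500 km above lat 0, lon 0
target = latlon_to_ecef(0.0, 0.0)                          # sub-satellite point
print(pointing_vector(satellite, target))                  # points along -x, straight down
```

Targets away from the sub-satellite point produce pointing vectors tilted off the straight-down direction, which is the off-nadir angle a gimbal would need to slew through.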

In [None]:
# Satellite position in km (Earth radius 6371 km + 500 km altitude)
satellite_pos = np.array([6371 + 500, 0, 0])

# Create multiple target positions from grid (using 11x11 grid)
target_positions = []
for gx in range(3, 9):  # Sample targets near the center (grid indices 3-8)
    for gy in range(3, 9):
        tpos, _ = grid_to_satellite_coordinates(gx, gy, 11, 500.0)
        target_positions.append(tpos)
target_positions = np.array(target_positions)

# Get pointing vector to center target
_, gimbal_vec = grid_to_satellite_coordinates(5, 5, 11, 500.0)

# Visualize
create_satellite_gimbal_visualization(
    satellite_position=satellite_pos,
    target_positions=target_positions,
    gimbal_pointing=gimbal_vec,
    title="Satellite Gimbal Pointing to Learned Targets"
)

## Summary

In this notebook, we:
1. ✅ Created a Gymnasium environment for satellite sensor tasking
2. ✅ Trained a Q-learning agent with optimized parameters for sparse rewards
3. ✅ Visualized learning convergence and value functions
4. ✅ Analyzed the learned policy with arrow visualizations
5. ✅ Connected grid-based RL to physical satellite gimbal visualization

## Key Insights

### Sparse Rewards are Challenging!
- Only 4 out of 121 states give direct rewards
- This notebook trains for 300,000 episodes, and edge positions may need 500,000+ to converge fully
- Positions far from the goal learn through value propagation
- This is realistic for many real-world RL problems!

### Optimal Hyperparameters Matter
- Higher learning rate (0.15) → faster learning
- Higher discount (0.95) → better value propagation
- Lower epsilon (0.02) → more deterministic policy

## Next Steps for Students

1. **Experiment with hyperparameters**: Try different learning rates, discount factors, epsilon values
2. **Reduce episodes**: Train with only 50,000 episodes and see which positions fail to learn
3. **Modify reward structure**: Add distance-based rewards to speed up learning
4. **Multi-target scenarios**: Extend to multiple targets with different priorities
5. **Different grid sizes**: Try 7×7 (easier) or 15×15 (harder)
6. **Deep Q-Learning**: Replace the Q-table with a neural network for continuous states
7. **Multi-satellite coordination**: Extend to multi-agent scenarios
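
As a starting point for the reward-structure experiment, here is one possible form of distance-based reward: potential-based shaping, where the potential is the negative Manhattan distance to the goal. This technique is not used anywhere in this notebook, and `potential` and `shaped_reward` are hypothetical names. Wiring it into `SatelliteSensorGridEnv` is left as part of the exercise.

```python
GOAL = (5, 5)
GAMMA = 0.95


def potential(pos, goal=GOAL):
    """Negative Manhattan distance: closer to the goal means higher potential."""
    return -(abs(pos[0] - goal[0]) + abs(pos[1] - goal[1]))


def shaped_reward(env_reward, pos, next_pos, gamma=GAMMA):
    """Add the shaping term gamma * potential(next) - potential(current) to the reward."""
    return env_reward + gamma * potential(next_pos) - potential(pos)


print(shaped_reward(0.0, (0, 5), (1, 5)))  # step toward the goal gives a positive bonus
print(shaped_reward(0.0, (1, 5), (0, 5)))  # step away from the goal gives a negative bonus
```

With a signal on every step, distant positions no longer have to wait for values to propagate from the four rewarded states, so far fewer episodes may be needed.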

## References

- Sutton & Barto, "Reinforcement Learning: An Introduction" (2018)
- Gymnasium Documentation: https://gymnasium.farama.org/
- Satellite RL Research: https://arxiv.org/html/2409.02270v1

---

🎉 **Notebook complete! You've successfully trained an RL agent for satellite sensor tasking!**

**Note**: If the policy arrows don't all point toward the goal, try training for 500,000 episodes, or check the test images in the `tests/` folder to see what correct arrows should look like.