# Satellite Sensor Tasking with Reinforcement Learning

## Learning Objectives

1. Understand reinforcement learning fundamentals (states, actions, rewards, Q-learning)
2. Apply RL to a satellite sensor tasking problem
3. Visualize agent learning and convergence
4. Connect grid-based learning to physical satellite gimbal pointing

## Problem Overview

Satellites carrying sensors (cameras, radar) must decide where to point in order to observe ground targets.
This notebook models the problem as a grid of candidate pointing positions in which an agent learns to navigate to high-value target locations.

**Setup**: Ensure you're using the `satellite_rl` kernel (see README.md).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path('../src').resolve()))

from environment.grid_env import SatelliteSensorGridEnv
from agents.q_learning import QLearningAgent
from visualization.grid_viz import plot_learning_curve, plot_value_heatmap, plot_policy_arrows
from visualization.satellite_viz import create_satellite_gimbal_visualization, grid_to_satellite_coordinates

# Set random seed for reproducibility
np.random.seed(42)

# Matplotlib settings
%matplotlib inline

print("All imports successful!")

## Step 1: Create the Environment

We'll create an 11x11 grid environment where:
- The satellite sensor can "point" to any of 121 positions
- The goal (high-value target) is at the center position (state 60)
- Rewards are given for reaching positions adjacent to the goal
- Actions: 0=up, 1=down, 2=left, 3=right
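
The relationship between the flat state index and the grid position can be sketched as follows. This is an illustration only, not the code inside `SatelliteSensorGridEnv`: it assumes row-major indexing, that "up" decreases the row index, and that moves are clipped at the grid edges. The helper names are hypothetical.

```python
GRID_X, GRID_Y = 11, 11  # grid dimensions from the text

# Action encoding from the list above: 0=up, 1=down, 2=left, 3=right.
# Assumption: "up" decreases the row index (image-style coordinates).
ACTION_DELTAS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def state_to_cell(state, n_cols=GRID_X):
    """Convert a flat state index to (row, col), assuming row-major order."""
    return divmod(state, n_cols)


def cell_to_state(row, col, n_cols=GRID_X):
    """Convert (row, col) back to a flat state index."""
    return row * n_cols + col


def step_cell(state, action, n_rows=GRID_Y, n_cols=GRID_X):
    """Apply an action, clipping at the grid edges (an assumed boundary rule)."""
    row, col = state_to_cell(state, n_cols)
    d_row, d_col = ACTION_DELTAS[action]
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    return cell_to_state(row, col, n_cols)


print(state_to_cell(60))  # (5, 5): state 60 is the center of an 11x11 grid
print(step_cell(60, 0))   # 49: one row above the center
```

Under these assumptions, state 60 corresponds to row 5, column 5, which is consistent with the goal being the center of the grid.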

In [None]:
env = SatelliteSensorGridEnv(grid_x=11, grid_y=11)

print("Environment created!")
print(f"  Grid size: {env.grid_x}x{env.grid_y}")
print(f"  Total states: {env.observation_space.n}")
print(f"  Actions: {env.action_space.n} (up, down, left, right)")
print(f"  Goal state: {env.goal_state}")

# Reset the environment and print the initial random state
state, _ = env.reset()
print(f"\nInitial random state: {state}")

## Step 2: Train Q-Learning Agent

We'll train a Q-learning agent for 30,000 episodes to learn the optimal sensor tasking policy.

**Hyperparameters:**
- Learning rate (α): 0.1 - controls how strongly each new estimate overrides the old one
- Discount factor (γ): 0.9 - weights future rewards relative to immediate ones
- Epsilon (ε): 0.1 - exploration rate (a random action is taken 10% of the time)
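
To make the role of each hyperparameter concrete, here is a minimal sketch of the standard tabular Q-learning update with epsilon-greedy action selection. It is not the code inside `QLearningAgent`; the function names, the table shape, and the example transition (including its reward of 1.0) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 121, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # values from the list above

q_table = np.zeros((n_states, n_actions))


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))


def q_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Terminal transitions have no bootstrapped future value.
    target = reward if done else reward + gamma * np.max(q_table[next_state])
    td_error = target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error


# Hypothetical transition: moving down from state 49 reaches the goal (60).
q_update(q_table, state=49, action=1, reward=1.0, next_state=60,
         done=True, alpha=alpha, gamma=gamma)
print(q_table[49, 1])  # 0.1, i.e. alpha * (reward - 0)
```

With α = 0.1, a single update moves the estimate only 10% of the way toward the target, which is one reason many episodes are needed before the values settle.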

In [None]:
agent = QLearningAgent(
    env=env,
    learning_rate=0.1,
    discount_factor=0.9,
    epsilon=0.1
)

print("Training Q-learning agent...")
scores = agent.train(num_episodes=30000, max_steps=50, verbose=True)
print("Training complete!")

## Step 3: Analyze Learning Convergence

The learning curve shows how the agent's performance improves over episodes. 
A rising curve indicates the agent is learning to reach the goal more efficiently.
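
Raw per-episode scores are usually noisy, so learning curves are typically smoothed before plotting. The sketch below shows one common way to do this, a sliding-window mean. Whether `plot_learning_curve` smooths in exactly this way is an assumption, and the scores used here are synthetic, not real training output.

```python
import numpy as np


def moving_average(values, window=500):
    """Mean over a sliding window; the result has len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > values.size:
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


# Synthetic scores for illustration only: a rising trend plus noise.
rng = np.random.default_rng(0)
synthetic_scores = np.linspace(0.0, 1.0, 3000) + rng.normal(0.0, 0.3, 3000)

smoothed = moving_average(synthetic_scores, window=500)
print(smoothed.shape)  # (2501,)
```

A larger window gives a smoother curve but hides short-lived changes in performance.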

In [None]:
plot_learning_curve(scores, window=500, title="Q-Learning Convergence (11x11 Grid)")

## Step 4: Value Function Heatmap

The value function V(s) shows the expected cumulative discounted reward obtainable from each state.
Brighter colors indicate higher values, which typically occur at or near the goal.
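
For a tabular agent, state values are commonly derived from the Q-table by taking the best action value in each state, V(s) = max over a of Q(s, a). The sketch below illustrates that relationship on a toy table. It assumes a Q-table of shape (states, actions) in row-major grid order; whether `get_value_grid` is implemented this way is not confirmed by this notebook.

```python
import numpy as np


def value_grid_from_q(q_table, n_rows, n_cols):
    """Greedy state values V(s) = max_a Q(s, a), reshaped to the grid."""
    q_table = np.asarray(q_table)
    return q_table.max(axis=1).reshape(n_rows, n_cols)


# Toy Q-table for a 2x2 grid with 4 actions (made-up numbers).
toy_q = np.array([
    [0.0, 0.5, 0.1, 0.2],
    [0.3, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.0, 1.0],
    [0.2, 0.1, 0.4, 0.3],
])

toy_values = value_grid_from_q(toy_q, 2, 2)
print(toy_values)
```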

In [None]:
value_grid = agent.get_value_grid()
plot_value_heatmap(value_grid, title="Learned State Values (11x11 Grid)")

## Step 5: Learned Policy Visualization

Arrows show the best action to take from each grid position.
If training has converged, the arrows should point toward the center (goal position); stray arrows usually mark states the agent visited too rarely.
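
One way to check the arrows numerically rather than by eye is to count how many cells have a greedy action that moves closer to the goal. The sketch below is a hypothetical helper, not part of the notebook's `visualization` module. It assumes the action encoding from Step 1 with "up" decreasing the row index, and it uses a hand-written 3x3 policy instead of a trained one.

```python
import numpy as np

# Assumed encoding: 0=up, 1=down, 2=left, 3=right; "up" decreases the row index.
ACTION_DELTAS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def greedy_policy_grid(q_table, n_rows, n_cols):
    """Best action per state, reshaped to the grid (row-major order assumed)."""
    return np.argmax(np.asarray(q_table), axis=1).reshape(n_rows, n_cols)


def fraction_toward_goal(policy_grid, goal_cell):
    """Share of non-goal cells whose action reduces Manhattan distance to the goal."""
    goal_row, goal_col = goal_cell
    good, total = 0, 0
    for (row, col), action in np.ndenumerate(policy_grid):
        if (row, col) == (goal_row, goal_col):
            continue  # the action chosen at the goal itself is not meaningful
        d_row, d_col = ACTION_DELTAS[int(action)]
        before = abs(row - goal_row) + abs(col - goal_col)
        after = abs(row + d_row - goal_row) + abs(col + d_col - goal_col)
        good += int(after < before)
        total += 1
    return good / total


# Hand-written 3x3 policy with the goal at the center cell (1, 1).
toy_policy = np.array([
    [1, 1, 1],
    [3, 0, 2],
    [0, 0, 0],
])
print(fraction_toward_goal(toy_policy, (1, 1)))  # 1.0
```

A value noticeably below 1.0 on a trained policy would suggest that some states need more exploration or more episodes.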

In [None]:
policy_grid = agent.get_policy_grid()
plot_policy_arrows(policy_grid, value_grid, title="Learned Sensor Tasking Policy (11x11 Grid)")

## Step 6: Satellite Gimbal Visualization

Now let's visualize how the learned grid policy translates to actual satellite gimbal pointing.

We'll:
1. Map grid positions to geographic coordinates (latitude/longitude)
2. Convert to 3D Earth-Centered Earth-Fixed (ECEF) coordinates
3. Show satellite gimbal pointing from orbit to ground targets
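
The three steps above can be sketched with a simplified spherical-Earth model. This is not the implementation of `grid_to_satellite_coordinates`: the grid-to-latitude/longitude mapping (`grid_to_latlon` and its `span_deg` parameter) is hypothetical, and the conversion ignores the flattening of the WGS84 ellipsoid.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same value used in this notebook


def grid_to_latlon(gx, gy, grid_size, span_deg=10.0):
    """Hypothetical mapping: spread the grid over span_deg degrees centered on (0, 0)."""
    center = (grid_size - 1) / 2
    step = span_deg / (grid_size - 1)
    return (gy - center) * step, (gx - center) * step  # (lat, lon) in degrees


def latlon_to_ecef_spherical(lat_deg, lon_deg, radius_km=EARTH_RADIUS_KM):
    """Latitude/longitude to ECEF on a spherical Earth (no WGS84 flattening)."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return radius_km * np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])


def pointing_vector(satellite_pos, target_pos):
    """Unit vector from the satellite to the target."""
    line_of_sight = np.asarray(target_pos) - np.asarray(satellite_pos)
    return line_of_sight / np.linalg.norm(line_of_sight)


# Satellite at 500 km altitude above (lat 0, lon 0), in km.
sat_pos = np.array([EARTH_RADIUS_KM + 500.0, 0.0, 0.0])

lat, lon = grid_to_latlon(5, 5, 11)           # center cell of the 11x11 grid
target = latlon_to_ecef_spherical(lat, lon)   # ground point directly below
center_pointing = pointing_vector(sat_pos, target)
print(center_pointing)  # points along -x, i.e. straight down (nadir)
```

In this simplified setup the center cell sits directly beneath the satellite, so the gimbal points at nadir; cells away from the center produce off-nadir pointing vectors.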

In [None]:
# Satellite position in km: mean Earth radius (6371 km) + 500 km altitude, on the x-axis
satellite_pos = np.array([6371 + 500, 0, 0])

# Create multiple target positions from grid (using 11x11 grid)
target_positions = []
for gx in range(3, 9):  # Grid indices 3 through 8, around the center cell (5, 5)
    for gy in range(3, 9):
        tpos, _ = grid_to_satellite_coordinates(gx, gy, 11, 500.0)
        target_positions.append(tpos)
target_positions = np.array(target_positions)

# Get pointing vector to center target
_, gimbal_vec = grid_to_satellite_coordinates(5, 5, 11, 500.0)

# Visualize
create_satellite_gimbal_visualization(
    satellite_position=satellite_pos,
    target_positions=target_positions,
    gimbal_pointing=gimbal_vec,
    title="Satellite Gimbal Pointing to Learned Targets"
)

## Summary

In this notebook, we:
1. ✅ Created a Gymnasium environment for satellite sensor tasking
2. ✅ Trained a Q-learning agent to learn optimal sensor pointing
3. ✅ Visualized learning convergence and value functions
4. ✅ Connected grid-based RL to physical satellite gimbal visualization

## Next Steps for Students

1. **Experiment with hyperparameters**: Try different learning rates, discount factors
2. **Modify reward structure**: Add penalties for distance traveled, rewards for coverage
3. **Multi-target scenarios**: Extend to multiple targets with different priorities
4. **Deep Q-Learning**: Implement neural network-based agent
5. **Realistic orbits**: Integrate with poliastro for true orbital mechanics
6. **Multi-satellite coordination**: Extend to multi-agent scenarios
7. **Transfer learning**: Experiment with transferring knowledge to larger grids

## References

- Sutton & Barto, "Reinforcement Learning: An Introduction"
- Gymnasium Documentation: https://gymnasium.farama.org/
- Satellite RL Research: [arxiv.org/html/2409.02270v1](https://arxiv.org/html/2409.02270v1)

🎉 **Notebook complete! Great work!**