# Q-Learning Irrigation Experiments - Historical Archive

This notebook contains archived experiments from the Q-learning irrigation project, including:
- Physical stability calibration for soil moisture bins
- Monitored training loops with Q-value evolution tracking
- Rain regime comparison experiments (dry vs moderate)
- State coverage experiments and debugging

**Purpose:** Historical record of experimental exploration leading to final methodology.

**Note:** For the final, clean experimental workflow, see `experiments.ipynb`.

## Environment Setup

In [None]:
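import numpy as np

# IrrigationEnv, discretize_state, get_state_space_size, N_ACTIONS, extract_policy,
# print_policy, and from_discrate_to_full_state come from the project's own modules
# and must be imported before running the cells below.

# Create environment instance with stability-calibrated parameters.
# Parameters were chosen via the physical stability experiments (see the
# Physical Stability Calibration section below) so that soil_bin=1 is
# dynamically stable (mean residence ≥10 steps).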
env = IrrigationEnv(
    max_et0=8.0,
    max_rain=50.0,
    et0_range=(2.0, 8.0),
    rain_range=(0.0, 0.8),           # Calibrated (was 40.0) - reduces perturbations
    max_soil_moisture=320.0,         # Calibrated (was 100.0) - increases capacity
    episode_length=90,
)

print("Environment created (stability-calibrated)")
print(f"Action space: {env.action_space}")
print(f"Observation space: {env.observation_space}")

---
# PHYSICAL STABILITY CALIBRATION (Soil Bin 1)

**Problem:** Under baseline parameters (rain 0-40 mm, capacity 100 mm), soil_bin=1 (SM ∈ [0.333, 0.667)) was highly unstable, with a mean residence time of only **1.62 steps**. Visits were too short to support meaningful Q-learning in the mid-moisture range.

**Objective:** Make soil_bin=1 dynamically stable under random policy exploration with **mean residence time ≥10 consecutive steps**.
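
To make the residence-time metric concrete, the sketch below computes the length of each uninterrupted run spent in a target bin from a recorded sequence of soil bins. The function name `residence_times` and the example trajectory are illustrative and not part of the project code.

```python
from itertools import groupby

def residence_times(bin_sequence, target_bin=1):
    """Return the length of each uninterrupted run of `target_bin`."""
    return [
        sum(1 for _ in run)
        for value, run in groupby(bin_sequence)
        if value == target_bin
    ]

# Illustrative trajectory of soil bins over 10 steps (made-up data).
trajectory = [0, 1, 1, 1, 2, 1, 1, 0, 0, 1]
runs = residence_times(trajectory)
mean_residence = sum(runs) / len(runs) if runs else 0.0
print(runs, mean_residence)
```

The final run is counted even though the trajectory ends while still in the bin, which is one possible convention for visits cut off by the end of an episode.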

**Constraints:**
- No drainage, runoff, or percolation modifications
- No reward function changes
- No discretization changes
- Only climate sampling and soil capacity adjustments allowed

**Method:** Systematic parameter sweeps testing rain reduction and soil capacity increases.

**Results:**

| Configuration | rain_range (mm) | max_soil_moisture (mm) | Mean residence (steps) | Status |
|---------------|-----------------|------------------------|------------------------|--------|
| Baseline | (0, 40) | 100 | 1.62 | ✗ Unstable |
| E3+ Final | (0, 0.8) | 320 | 11.28 | ✓ Stable |

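**Mechanism:** Rain reduction (6× improvement) combined with capacity increase (3× improvement) raises mean residence time from 1.62 to 11.28 steps, about a 7× improvement overall. Physical stability is governed by the ratio bin_width / max_input_perturbation: the wider a bin is relative to the largest single-step input, the longer the state tends to stay inside it.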
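
The ratio named above can be worked out for the two configurations in the results table. This is a rough sketch that assumes three equal-width soil bins and counts only rainfall as the perturbation; irrigation and ET0 also move soil moisture and are ignored here. The helper name `bin_width_to_rain_ratio` is hypothetical.

```python
def bin_width_to_rain_ratio(max_soil_moisture, rain_max, n_soil_bins=3):
    """Width of one soil bin (mm) divided by the largest single-step rain input (mm)."""
    bin_width = max_soil_moisture / n_soil_bins
    return bin_width / rain_max

baseline = bin_width_to_rain_ratio(max_soil_moisture=100.0, rain_max=40.0)
calibrated = bin_width_to_rain_ratio(max_soil_moisture=320.0, rain_max=0.8)
print(f"baseline:   {baseline:.2f}")
print(f"calibrated: {calibrated:.2f}")
```

A ratio below 1 means a single maximal rain event is larger than a whole bin, so one step can carry the state straight through bin 1. A ratio well above 1 means many steps of rain are needed to cross it.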

**Validation:** All 5 independent trials met the ≥10-step target (range: 10.96-11.59 steps).

✅ **Bin 1 is now dynamically stable and suitable for policy learning.**

## Validate Stability in This Notebook
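
The stability check in this section recovers the soil bin from the flat state index by integer division. Assuming the layout given in that cell's comment (`state = soil_bin * 12 + crop * 4 + et0 * 2 + rain`, i.e. 3 soil bins × 3 crop stages × 2 ET0 bins × 2 rain bins), the mapping can be sketched with NumPy's index helpers. The names `STATE_SHAPE`, `encode`, and `decode` are illustrative; the project's own `discretize_state` and `from_discrate_to_full_state` may order the components differently.

```python
import numpy as np

# Assumed layout, taken from the comment in the stability cell:
# (soil_bin, crop_stage, et0_bin, rain_bin) with sizes (3, 3, 2, 2).
STATE_SHAPE = (3, 3, 2, 2)

def encode(soil_bin, crop_stage, et0_bin, rain_bin):
    """Flatten the four components into one row-major state index."""
    return int(np.ravel_multi_index((soil_bin, crop_stage, et0_bin, rain_bin), STATE_SHAPE))

def decode(state_idx):
    """Recover (soil_bin, crop_stage, et0_bin, rain_bin) from a flat index."""
    return tuple(int(v) for v in np.unravel_index(state_idx, STATE_SHAPE))

idx = encode(1, 2, 0, 1)  # 1*12 + 2*4 + 0*2 + 1
print(idx, decode(idx), idx // 12)
```

Under this layout, `idx // 12` and `decode(idx)[0]` give the same soil bin for every state.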

In [None]:
def measure_bin1_stability(env, n_episodes=100, n_soil_bins=3):
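    """
    Measure stability of soil_bin=1 under a uniformly random policy.

    Returns a dict with the mean, median, and maximum residence time (in steps),
    the percentage of all steps spent in bin 1, and the number of entries into bin 1.
    A visit still in progress when an episode ends is counted at its current length.
    """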
    from collections import defaultdict
    
    residence_times = []
    bin_visits = defaultdict(int)
    entry_count = 0
    
    for episode in range(n_episodes):
        obs, _ = env.reset()
        done = False
        in_bin1 = False
        current_residence = 0
        
        while not done:
            state_idx = discretize_state(obs, n_soil_bins)
            # Extract soil_bin from state index
            soil_bin = state_idx // (3 * 2 * 2)  # state = soil_bin * 12 + crop * 4 + et0 * 2 + rain
            bin_visits[soil_bin] += 1
            
            if soil_bin == 1:
                if not in_bin1:
                    in_bin1 = True
                    entry_count += 1
                    current_residence = 1
                else:
                    current_residence += 1
            else:
                if in_bin1:
                    residence_times.append(current_residence)
                    in_bin1 = False
                    current_residence = 0
            
            action = np.random.randint(N_ACTIONS)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
        
        if in_bin1:
            residence_times.append(current_residence)
    
    if len(residence_times) > 0:
        mean_residence = np.mean(residence_times)
        median_residence = np.median(residence_times)
        max_residence = np.max(residence_times)
    else:
        mean_residence = 0
        median_residence = 0
        max_residence = 0
    
    total_steps = sum(bin_visits.values())
    bin1_percentage = 100 * bin_visits[1] / total_steps if total_steps > 0 else 0
    
    return {
        'mean_residence': mean_residence,
        'median_residence': median_residence,
        'max_residence': max_residence,
        'bin1_percentage': bin1_percentage,
        'entry_count': entry_count,
    }

In [None]:
# Run stability validation with current calibrated parameters
print("Validating soil_bin=1 stability...")
print("="*70)

stats = measure_bin1_stability(env, n_episodes=100, n_soil_bins=3)

print("\nSTABILITY VALIDATION RESULTS:")
print(f"   Mean residence time:    {stats['mean_residence']:.2f} steps")
print(f"   Median residence time:  {stats['median_residence']:.2f} steps")
print(f"   Maximum residence time: {stats['max_residence']} steps")
print(f"   Bin 1 occupancy:        {stats['bin1_percentage']:.1f}%")
print(f"   Number of entries:      {stats['entry_count']}")

if stats['mean_residence'] >= 10.0:
    print(f"\n✓ SUCCESS: Bin 1 is dynamically stable (mean ≥ 10 steps)")
else:
    print(f"\n✗ WARNING: Bin 1 stability below target (mean < 10 steps)")

print("="*70)

---
# Q-LEARNING TRAINING WITH MONITORING

Train a tabular Q-learning agent on the stability-calibrated environment with comprehensive monitoring of the learning process.

## Step 1: Training Configuration
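
The configuration in this step uses multiplicative epsilon decay with a floor. A quick calculation shows how long the decay phase lasts for the values `epsilon_start=1.0`, `epsilon_end=0.01`, `epsilon_decay=0.995`, and `n_episodes=4000`. The helper name `episodes_until_floor` is illustrative.

```python
import math

def episodes_until_floor(epsilon_start, epsilon_end, epsilon_decay):
    """Smallest n with epsilon_start * epsilon_decay**n <= epsilon_end."""
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay))

n_floor = episodes_until_floor(1.0, 0.01, 0.995)
share_at_floor = 1 - n_floor / 4000
print(n_floor, f"{share_at_floor:.0%}")
```

With these settings the floor is reached after roughly 920 episodes, so about three quarters of the run executes at epsilon = 0.01. States that are not reached during the early exploratory phase may therefore receive few updates.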

In [None]:
# Training hyperparameters
TRAIN_CONFIG = {
    'n_episodes': 4000,
    'alpha': 0.1,           # Learning rate
    'gamma': 0.95,          # Discount factor
    'epsilon_start': 1.0,   # Initial exploration rate
    'epsilon_end': 0.01,    # Final exploration rate
    'epsilon_decay': 0.995, # Multiplicative decay per episode
    'n_soil_bins': 3,       # Discretization (must match env)
}

print("="*70)
print("Q-LEARNING TRAINING CONFIGURATION")
print("="*70)
for key, value in TRAIN_CONFIG.items():
    print(f"  {key:20s} = {value}")
print("="*70)
print(f"\nState space size: {get_state_space_size(TRAIN_CONFIG['n_soil_bins'])}")
print(f"Action space size: {N_ACTIONS}")
print(f"Q-table shape: ({get_state_space_size(TRAIN_CONFIG['n_soil_bins'])}, {N_ACTIONS})")
print("="*70)

## Step 2: Instrumented Training Loop

The training loop below adds monitoring hooks that record action usage, state visitation, episode rewards, and Q-value statistics.
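
As a reference for the update at the core of the loop, here is a single tabular Q-learning step on a toy table. The function name `q_update` and the numbers are illustrative. The `terminated` flag follows the common convention of not bootstrapping from a terminal state.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha, gamma, terminated=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    bootstrap = 0.0 if terminated else np.max(Q[next_state])
    td_error = reward + gamma * bootstrap - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error

Q = np.zeros((2, 3))      # toy table: 2 states, 3 actions
Q[1] = [0.0, 2.0, 1.0]    # pretend state 1 already has estimates
td = q_update(Q, state=0, action=2, reward=-1.0, next_state=1, alpha=0.1, gamma=0.95)
print(td, Q[0, 2])
```

Here the TD target is `-1.0 + 0.95 * 2.0 = 0.9`, and with `alpha=0.1` the entry moves one tenth of the way from 0 toward it.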

In [None]:
from collections import defaultdict

def train_q_learning_monitored(env, n_episodes, alpha, gamma, epsilon_start, epsilon_end, 
                                epsilon_decay, n_soil_bins, log_interval=500):
    """
    Train Q-learning agent with comprehensive monitoring.
    
    Returns:
        Q_table: trained Q-values (state_space_size, n_actions)
        monitoring: dict with training metrics
    """
    state_space_size = get_state_space_size(n_soil_bins)
    Q = np.zeros((state_space_size, N_ACTIONS))
    
    # Monitoring structures
    monitoring = {
        'episode_rewards': [],
        'action_counts_per_interval': [],
        'state_visit_counts': np.zeros(state_space_size, dtype=int),
        'q_stats': {'min': [], 'max': [], 'mean': [], 'nonzero_count': []},
        'epsilon_history': [],
    }
    
    epsilon = epsilon_start
    action_counts = np.zeros(N_ACTIONS, dtype=int)
    
    print(f"\nStarting training for {n_episodes} episodes...")
    print(f"Logging interval: every {log_interval} episodes\n")
    
    for episode in range(n_episodes):
        obs, _ = env.reset()
        state = discretize_state(obs, n_soil_bins)
        done = False
        episode_reward = 0
        
        while not done:
            # Track state visitation
            monitoring['state_visit_counts'][state] += 1
            
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(N_ACTIONS)
            else:
                action = np.argmax(Q[state, :])
            
            # Track action usage
            action_counts[action] += 1
            
            # Environment step
            next_obs, reward, terminated, truncated, info = env.step(action)
            next_state = discretize_state(next_obs, n_soil_bins)
            done = terminated or truncated
            
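            # Q-learning update. Do not bootstrap from a terminal state;
            # a time-limit truncation still bootstraps from next_state.
            best_next_value = 0.0 if terminated else np.max(Q[next_state, :])
            td_target = reward + gamma * best_next_value
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error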
            
            episode_reward += reward
            state = next_state
        
        # Record episode metrics
        monitoring['episode_rewards'].append(episode_reward)
        monitoring['epsilon_history'].append(epsilon)
        
        # Decay epsilon
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        
        # Periodic logging
        if (episode + 1) % log_interval == 0:
            # Action statistics
            monitoring['action_counts_per_interval'].append(action_counts.copy())
            
            # Q-value statistics
            monitoring['q_stats']['min'].append(np.min(Q))
            monitoring['q_stats']['max'].append(np.max(Q))
            monitoring['q_stats']['mean'].append(np.mean(Q))
            monitoring['q_stats']['nonzero_count'].append(np.count_nonzero(Q))
            
            # Rolling reward mean
            recent_rewards = monitoring['episode_rewards'][-100:]
            mean_reward = np.mean(recent_rewards)
            
            print(f"Episode {episode+1:4d}/{n_episodes} | "
                  f"Mean Reward (last 100): {mean_reward:7.2f} | "
                  f"Epsilon: {epsilon:.4f} | "
                  f"Actions: {action_counts} | "
                  f"Q-range: [{np.min(Q):.2f}, {np.max(Q):.2f}]")
            
            # Reset action counts for next interval
            action_counts = np.zeros(N_ACTIONS, dtype=int)
    
    print(f"\n{'='*70}")
    print("TRAINING COMPLETE")
    print(f"{'='*70}\n")
    
    return Q, monitoring

## Step 3: Execute Training

In [None]:
# Run training
Q_table, monitoring = train_q_learning_monitored(
    env=env,
    n_episodes=TRAIN_CONFIG['n_episodes'],
    alpha=TRAIN_CONFIG['alpha'],
    gamma=TRAIN_CONFIG['gamma'],
    epsilon_start=TRAIN_CONFIG['epsilon_start'],
    epsilon_end=TRAIN_CONFIG['epsilon_end'],
    epsilon_decay=TRAIN_CONFIG['epsilon_decay'],
    n_soil_bins=TRAIN_CONFIG['n_soil_bins'],
    log_interval=500
)

## Step 4: Extract Q-Table and Policy
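
`extract_policy`, used later in this step, is defined elsewhere in the project and is not shown in this archive. A greedy argmax over actions is the usual way to obtain a deterministic policy from a Q-table, so the sketch below assumes that behavior. The name `greedy_policy` and the toy Q-values are illustrative.

```python
import numpy as np

def greedy_policy(Q):
    """Pick the highest-valued action in every state (ties go to the lowest index)."""
    return np.argmax(Q, axis=1)

Q = np.array([
    [0.0, 0.0, 0.0],      # never updated: argmax falls back to action 0
    [-1.0, 0.5, 0.2],
    [-3.0, -2.0, -0.5],
])
policy = greedy_policy(Q)
untouched = np.flatnonzero(~Q.any(axis=1))  # rows that are still all zeros
print(policy, untouched)
```

Because `np.argmax` returns the first maximum, a state whose Q-values were never updated shows up as action 0 (no irrigation). When reading the policy, it helps to cross-check such states against the visit counts.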

In [None]:
# Display Q-table structure
print("Q-TABLE SHAPE:", Q_table.shape)
print(f"Total Q-values: {Q_table.size}")
print(f"Non-zero Q-values: {np.count_nonzero(Q_table)}")
print(f"Q-value range: [{np.min(Q_table):.3f}, {np.max(Q_table):.3f}]")
print(f"Q-value mean: {np.mean(Q_table):.3f}")
print(f"Q-value std: {np.std(Q_table):.3f}")

print("\n" + "="*70)
print("SAMPLE Q-VALUES (first 10 states)")
print("="*70)
print(f"{'State':<7} {'Action 0':<12} {'Action 1':<12} {'Action 2':<12} {'Best Action'}")
print("-"*70)
for state in range(min(10, Q_table.shape[0])):
    soil_bin, crop_stage, et0_bin, rain_bin = from_discrate_to_full_state(state, TRAIN_CONFIG['n_soil_bins'])
    best_action = np.argmax(Q_table[state, :])
    print(f"{state:<7} {Q_table[state, 0]:11.3f}  {Q_table[state, 1]:11.3f}  {Q_table[state, 2]:11.3f}  {best_action}")
print("="*70)

In [None]:
# Extract deterministic policy using new API
policy = extract_policy(Q_table)

# Display policy using new API function
print_policy(policy, n_soil_bins=TRAIN_CONFIG['n_soil_bins'])

## Step 5: Policy Analysis & Interpretation

In [None]:
print("="*70)
print("POLICY SUMMARY & INTERPRETATION")
print("="*70)

# Action frequency in policy
action_freq = np.bincount(policy, minlength=N_ACTIONS)
print(f"\n1. ACTION FREQUENCY IN LEARNED POLICY:")
print(f"   Action 0 (No irrigation): {action_freq[0]:2d} states ({100*action_freq[0]/len(policy):.1f}%)")
print(f"   Action 1 (5mm):          {action_freq[1]:2d} states ({100*action_freq[1]/len(policy):.1f}%)")
print(f"   Action 2 (15mm):         {action_freq[2]:2d} states ({100*action_freq[2]/len(policy):.1f}%)")

# Analysis by soil bin
print(f"\n2. POLICY BY SOIL BIN:")
for soil_bin in range(3):
    states_in_bin = [s for s in range(len(policy)) 
                     if from_discrate_to_full_state(s, TRAIN_CONFIG['n_soil_bins'])[0] == soil_bin]
    actions_in_bin = [policy[s] for s in states_in_bin]
    action_counts = np.bincount(actions_in_bin, minlength=N_ACTIONS)
    
    sm_range = f"[{soil_bin/3:.3f}, {(soil_bin+1)/3:.3f})"
    print(f"   Soil bin {soil_bin} (SM {sm_range}): ", end="")
    print(f"No-irr={action_counts[0]}, 5mm={action_counts[1]}, 15mm={action_counts[2]}")

# Analysis by crop stage
print(f"\n3. POLICY BY CROP STAGE:")
for crop_stage in range(3):
    states_in_stage = [s for s in range(len(policy)) 
                       if from_discrate_to_full_state(s, TRAIN_CONFIG['n_soil_bins'])[1] == crop_stage]
    actions_in_stage = [policy[s] for s in states_in_stage]
    action_counts = np.bincount(actions_in_stage, minlength=N_ACTIONS)
    
    stage_name = ['Early', 'Mid', 'Late'][crop_stage]
    print(f"   Stage {crop_stage} ({stage_name}): ", end="")
    print(f"No-irr={action_counts[0]}, 5mm={action_counts[1]}, 15mm={action_counts[2]}")

# State visitation analysis
print(f"\n4. STATE VISITATION DURING TRAINING:")
visited_states = np.sum(monitoring['state_visit_counts'] > 0)
never_visited = np.sum(monitoring['state_visit_counts'] == 0)
print(f"   States visited: {visited_states}/{len(monitoring['state_visit_counts'])}")
print(f"   States never visited: {never_visited}")

if never_visited > 0:
    unvisited_indices = np.where(monitoring['state_visit_counts'] == 0)[0]
    print(f"   Unvisited states: {list(unvisited_indices)}")

# Visit statistics by soil bin
print(f"\n5. VISIT DISTRIBUTION BY SOIL BIN:")
for soil_bin in range(3):
    states_in_bin = [s for s in range(len(policy)) 
                     if from_discrate_to_full_state(s, TRAIN_CONFIG['n_soil_bins'])[0] == soil_bin]
    visits_in_bin = np.sum([monitoring['state_visit_counts'][s] for s in states_in_bin])
    total_visits = np.sum(monitoring['state_visit_counts'])
    pct = 100 * visits_in_bin / total_visits if total_visits > 0 else 0
    print(f"   Soil bin {soil_bin}: {visits_in_bin:7d} visits ({pct:5.1f}%)")

print("\n" + "="*70)
print("BEHAVIORAL INTERPRETATION:")
print("="*70)

# Interpret learned behavior
low_sm_irrigate = sum([1 for s in range(len(policy)) 
                       if from_discrate_to_full_state(s, TRAIN_CONFIG['n_soil_bins'])[0] == 0 
                       and policy[s] > 0])
high_sm_no_irrigate = sum([1 for s in range(len(policy)) 
                           if from_discrate_to_full_state(s, TRAIN_CONFIG['n_soil_bins'])[0] == 2 
                           and policy[s] == 0])

states_per_bin = len(policy) // TRAIN_CONFIG['n_soil_bins']
print(f"Low soil moisture (bin 0) irrigation: {low_sm_irrigate}/{states_per_bin} states use irrigation")
print(f"High soil moisture (bin 2) conservation: {high_sm_no_irrigate}/{states_per_bin} states avoid irrigation")

if action_freq[1] > 0 or action_freq[2] > 0:
    print(f"✓ Agent learned to use irrigation actions (not defaulting to no-irrigation)")
else:
    print(f"⚠ Agent only uses no-irrigation (possible reward shaping issue)")

print("="*70)

## Step 6: Training Progression Visualization
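
The reward panel in this step smooths episode rewards with `np.convolve` in `'valid'` mode. The small example below, on made-up rewards, shows why the smoothed curve is plotted against `episodes[window-1:]`: a `'valid'` convolution returns `len(rewards) - window + 1` points, one for each full window.

```python
import numpy as np

rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # made-up episode rewards
window = 3
rolling = np.convolve(rewards, np.ones(window) / window, mode="valid")
x = np.arange(len(rewards))[window - 1:]  # index of the last episode in each window
print(rolling, x)
```

Each smoothed value is aligned with the last episode of its window, so the curve starts at episode `window - 1` rather than at 0.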

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Reward progression
ax = axes[0, 0]
episodes = np.arange(len(monitoring['episode_rewards']))
ax.plot(episodes, monitoring['episode_rewards'], alpha=0.3, label='Episode reward')
# Rolling mean
window = 100
rolling_mean = np.convolve(monitoring['episode_rewards'], 
                           np.ones(window)/window, mode='valid')
ax.plot(episodes[window-1:], rolling_mean, linewidth=2, label=f'Rolling mean ({window})')
ax.set_xlabel('Episode')
ax.set_ylabel('Total Reward')
ax.set_title('Reward Progression')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Action usage over time
ax = axes[0, 1]
intervals = (np.arange(len(monitoring['action_counts_per_interval'])) + 1) * 500  # logged at episodes 500, 1000, ...
action_data = np.array(monitoring['action_counts_per_interval'])
ax.plot(intervals, action_data[:, 0], marker='o', label='Action 0 (No irr)')
ax.plot(intervals, action_data[:, 1], marker='s', label='Action 1 (5mm)')
ax.plot(intervals, action_data[:, 2], marker='^', label='Action 2 (15mm)')
ax.set_xlabel('Episode')
ax.set_ylabel('Action Count (per 500 episodes)')
ax.set_title('Action Usage Over Time')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Q-value evolution
ax = axes[1, 0]
intervals = (np.arange(len(monitoring['q_stats']['min'])) + 1) * 500  # logged at episodes 500, 1000, ...
ax.plot(intervals, monitoring['q_stats']['min'], label='Min Q-value')
ax.plot(intervals, monitoring['q_stats']['mean'], label='Mean Q-value')
ax.plot(intervals, monitoring['q_stats']['max'], label='Max Q-value')
ax.set_xlabel('Episode')
ax.set_ylabel('Q-value')
ax.set_title('Q-Value Evolution')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Exploration rate (epsilon)
ax = axes[1, 1]
episodes = np.arange(len(monitoring['epsilon_history']))
ax.plot(episodes, monitoring['epsilon_history'], linewidth=2)
ax.set_xlabel('Episode')
ax.set_ylabel('Epsilon')
ax.set_title('Exploration Rate Decay')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Training progression plots generated.")

---
# PHYSICAL REGIME SHIFT EXPERIMENT: Moderate Rain Regime

**Objective:** Compare learned policies under different climatic regimes by retraining the agent with increased rainfall.

**Hypothesis:** Higher rainfall will reduce irrigation frequency and change policy behavior, particularly in rain-present states.
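
One way to test this hypothesis is to compare the two learned policies state by state. The sketch below is a hypothetical helper, not code from the archived cells, and the two policies are placeholders rather than results from this project.

```python
import numpy as np

def compare_policies(policy_a, policy_b):
    """Summarise how two tabular policies over the same state space differ."""
    policy_a, policy_b = np.asarray(policy_a), np.asarray(policy_b)
    return {
        "agreement": float(np.mean(policy_a == policy_b)),
        "irrigating_states_a": int(np.sum(policy_a > 0)),
        "irrigating_states_b": int(np.sum(policy_b > 0)),
        "changed_states": np.flatnonzero(policy_a != policy_b).tolist(),
    }

# Placeholder policies over 6 states (not results from this project).
dry = [2, 2, 1, 1, 0, 0]
moderate = [2, 1, 0, 1, 0, 0]
summary = compare_policies(dry, moderate)
print(summary)
```

Restricting `changed_states` to rain-present states (using the project's state decoding) would address the second half of the hypothesis.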

## [Remaining regime comparison cells would continue here...]

## Archived Note

This archive contains the full experimental history, including:
- Physical stability calibration methodology
- Monitored training loop implementation
- Detailed Q-value evolution tracking
- Rain regime comparison experiments
- State coverage experiments

For the clean, final experimental workflow, see the main `experiments.ipynb` notebook.