# Training of RL Agent in Durotaxis Simulation
This file records the training history of the proposed RL agent for the durotaxis simulation.

## Workflow for Training

### 1. Initialize the durotaxis environment
The `Durotaxis` class accepts the following parameters:
* `substrate_size` : Dimensions of the substrate environment (width, height) in pixels.
* `substrate_type` : Type of substrate gradient ('linear', 'exponential').
* `substrate_params` : Parameters for substrate generation: {'m': slope, 'b': intercept}.
* `init_num_nodes` : Initial number of nodes when environment resets.
* `max_critical_nodes` : Maximum allowed nodes before applying growth penalties ($N_c$).
* `threshold_critical_nodes` : Critical node count; the episode terminates if the number of nodes exceeds it (fail condition).
* `max_steps` : Maximum steps per episode before timeout termination.
* `embedding_dim` : Dimension of node embeddings for graph neural network processing.
* `hidden_dim` : Hidden layer dimension for graph transformer networks.
* `delta_time` : Time window ($\Delta t$) over which topology history is compared for the node reduction rule, which activates deletion if $I_{t+\Delta t} \leq \epsilon$.
* `delta_intensity` : Minimum intensity difference ($\delta$) required for successful durotaxis spawning, such that $I(x',y') - I(x,y) \geq \delta$.
    
* `graph_rewards` : Graph-level reward components:
    - `connectivity_penalty`: Penalty when nodes < 2 (loss of connectivity)
    - `growth_penalty`: Penalty when nodes > `max_critical_nodes` (excessive growth: $N > N_c$)
    - `survival_reward`: Base reward for maintaining valid topology
    - `action_reward`: Reward multiplier per action taken (encourages exploration)

* `node_rewards` : Node-level reward components:
    - `movement_reward`: Reward multiplier for rightward movement 
    - `intensity_penalty`: Penalty for nodes below average substrate intensity
    - `intensity_bonus`: Bonus for nodes at/above average substrate intensity
    - `substrate_reward`: Reward multiplier for substrate intensity values
    
* `edge_reward` : Edge direction rewards:
    - `rightward_bonus`: Reward for edges pointing rightward (positive x-direction)
    - `leftward_penalty`: Penalty for edges pointing leftward (negative x-direction)
    
* `spawn_rewards` : Spawning behavior rewards:
    - `spawn_success_reward`: Reward for successful durotaxis-based spawning
    - `spawn_failure_penalty`: Penalty for spawning without sufficient intensity gradient

* `delete_reward` : Deletion compliance rewards: 
    - `proper_deletion`: Reward for deleting nodes marked with to_delete flag
    - `persistence_penalty`: Penalty for keeping nodes marked for deletion
    
* `position_rewards` : Positional behavior rewards:
    - `boundary_bonus`: Bonus for nodes on topology boundary (frontier exploration)
    - `left_edge_penalty`: Penalty for nodes near left substrate edge
    - `edge_position_penalty`: Penalty for nodes near top/bottom substrate edges
    
* `termination_rewards` : Episode termination rewards:
    - `success_reward`: Large reward for reaching rightmost substrate boundary
    - `out_of_bounds_penalty`: Penalty for nodes moving outside substrate bounds
    - `no_nodes_penalty`: Penalty for losing all nodes (topology collapse)
    - `leftward_drift_penalty`: Penalty for consistent leftward centroid movement
    - `timeout_penalty`: Small penalty for reaching maximum time steps
    - `critical_nodes_penalty`: Penalty for exceeding critical node threshold
    
* `render_mode` : Visualization mode ('human' for real-time rendering, None for headless).
* `flush_delay` : Delay between visualization updates (seconds) for rendering control.
* `enable_visualization` : Enable/disable automatic topology visualization during episodes.
* `model_path` : Base directory for saving models with automatic run organization.
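
The spawning rule and the two node-count parameters above can be read as simple checks. The sketch below is illustrative only and is not taken from `durotaxis_sim`: it assumes a linear substrate of the form $I(x) = m x + b$ along the x-axis, and the helper names `linear_intensity`, `spawn_allowed`, and `node_count_status` are hypothetical.

```python
def linear_intensity(x, m=0.01, b=1.0):
    """Assumed linear substrate: intensity depends on x only, I(x) = m*x + b."""
    return m * x + b


def spawn_allowed(parent_x, child_x, delta_intensity, m=0.01, b=1.0):
    """Spawn rule from the parameter list: I(x', y') - I(x, y) >= delta."""
    gain = linear_intensity(child_x, m, b) - linear_intensity(parent_x, m, b)
    return gain >= delta_intensity


def node_count_status(num_nodes, max_critical_nodes=50, threshold_critical_nodes=200):
    """Classify a node count N against N_c and the termination threshold."""
    if num_nodes > threshold_critical_nodes:
        return "terminate"        # fail condition, episode ends
    if num_nodes > max_critical_nodes:
        return "growth_penalty"   # N > N_c, growth penalty applies
    return "ok"


if __name__ == "__main__":
    # Illustrative values only; delta_intensity here is not the configured one.
    print(spawn_allowed(parent_x=0, child_x=100, delta_intensity=0.5))
    print(node_count_status(51))
```

Under this assumed substrate, the intensity gain of a spawn depends only on how far to the right the child lands, so `delta_intensity` effectively sets a minimum rightward distance of `delta_intensity / m` pixels.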

In [1]:
from durotaxis_sim import Durotaxis

env = Durotaxis(
        substrate_size=(200, 200),
        substrate_type='linear',
        substrate_params={'m': 0.01, 'b': 1.0},
        init_num_nodes=1,
        max_critical_nodes=50,
        threshold_critical_nodes=200,
        max_steps=100,
        embedding_dim=64,
        hidden_dim=128,  
        delta_time=3,
        delta_intensity=2.50,
        graph_rewards={
            'connectivity_penalty': 10.0,  
            'growth_penalty': 10.0,  
            'survival_reward': 0.01,  
            'action_reward': 0.005,  
        },
        node_rewards={
            'movement_reward': 0.01,  
            'intensity_penalty': 5.0,  
            'intensity_bonus': 0.01,  
            'substrate_reward': 0.05, 
        },
        edge_reward={
            'rightward_bonus': 0.1,
            'leftward_penalty': 0.1,
        },
        spawn_rewards={
            'spawn_success_reward': 1.0, 
            'spawn_failure_penalty': 1.0,  
        },
        delete_reward={
            'proper_deletion': 2.0,
            'persistence_penalty': 2.0,
        },
        position_rewards={
            'boundary_bonus': 0.1,  
            'left_edge_penalty': 0.2,  
            'edge_position_penalty': 0.1, 
        },
        termination_rewards={
            'success_reward': 100.0, 
            'out_of_bounds_penalty': -30.0,  
            'no_nodes_penalty': -30.0,  
            'leftward_drift_penalty': -30.0,  
            'timeout_penalty': -10.0,  
            'critical_nodes_penalty': -25.0,  
        },
        enable_visualization=False,
        render_mode=None,  
        model_path="./saved_models"  
    )

📁 Created run directory: ./saved_models/run0005 (Run #5)
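
The run directory reported above suggests that each environment instance gets the next free `runNNNN` folder under `model_path`. How `durotaxis_sim` does this internally is not shown here; the following is only a sketch of one way such numbering could be implemented, with `next_run_dir` as a hypothetical helper.

```python
import re
from pathlib import Path

RUN_DIR_RE = re.compile(r"run(\d+)")


def next_run_dir(base_dir):
    """Create and return the next numbered run directory under base_dir.

    Assumed naming scheme: run0001, run0002, ... (four-digit, zero-padded).
    """
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)

    # Collect the numbers of the run directories that already exist.
    existing = []
    for path in base.iterdir():
        match = RUN_DIR_RE.fullmatch(path.name)
        if path.is_dir() and match:
            existing.append(int(match.group(1)))

    run_number = max(existing, default=0) + 1
    run_dir = base / f"run{run_number:04d}"
    run_dir.mkdir()
    return run_dir, run_number
```

With `run0001` to `run0004` already present, this helper would create `run0005`, which is consistent with the message printed above.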


### 2. Define a model and register it with the environment

In the current version of the code, the model is a policy agent built on a custom Graph Transformer Policy, which performs the action selection. The model is therefore created internally by the environment. You can, however, rename the policy for progress tracking.

In [2]:
env.set_algorithm_name("GraphTransformerPolicy")

🔧 Algorithm name set to: GraphTransformerPolicy
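
The algorithm name appears to be used as a prefix for saved artifacts: the training output in the next step lists a file named `GraphTransformerPolicy_ep00001_metadata.json`. The sketch below reproduces that naming pattern. It is inferred from this single file name and not from the library source, and `metadata_filename` and `episode_from_filename` are hypothetical helpers.

```python
import re


def metadata_filename(algorithm_name, episode):
    """Build a name like '<algorithm>_ep00001_metadata.json' (assumed pattern)."""
    return f"{algorithm_name}_ep{episode:05d}_metadata.json"


def episode_from_filename(filename):
    """Recover the episode number from such a file name, or None if it does not match."""
    match = re.search(r"_ep(\d+)_metadata\.json$", filename)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    name = metadata_filename("GraphTransformerPolicy", 1)
    print(name, episode_from_filename(name))
```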


### 3. Train the Topology Policy Agent
During training, `enable_visualization` is set to `False`, so the environment only prints progress reports to the terminal and does not draw the agent on the substrate. This speeds up training, because visualization pauses for `flush_delay` on every update to let matplotlib generate the plot without rendering issues.
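
The per-step progress lines printed during training can be parsed for later analysis. The parser below is a minimal sketch written against the line format visible in the output of this section. The field meanings are guesses, not documented behavior: `N` and `E` are taken to be node and edge counts, `R` the step reward with three components in parentheses, `C` a value followed by a trend symbol, `A` a count, and `T` two boolean flags. `parse_step_line` is a hypothetical helper.

```python
import re

NUM = r"[-+]?\d+(?:\.\d+)?"

STEP_RE = re.compile(
    rf"Ep\s*(?P<episode>\d+)\s+Step\s*(?P<step>\d+):\s*"
    rf"N=\s*(?P<nodes>\d+)\s+E=\s*(?P<edges>\d+)\s*\|\s*"
    rf"R=\s*(?P<reward>{NUM})\s*"
    rf"\(S:(?P<comp_s>{NUM})\s+N:(?P<comp_n>{NUM})\s+E:(?P<comp_e>{NUM})\)\s*\|\s*"
    rf"C=\s*(?P<c_value>{NUM})(?P<c_trend>\S)\s*\|\s*"
    rf"A=\s*(?P<a_value>\d+)\s*\|\s*"
    rf"T=(?P<flag_1>True|False)\s+(?P<flag_2>True|False)"
)


def parse_step_line(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the captured strings to numeric and boolean types.
    for key in ("episode", "step", "nodes", "edges", "a_value"):
        fields[key] = int(fields[key])
    for key in ("reward", "comp_s", "comp_n", "comp_e", "c_value"):
        fields[key] = float(fields[key])
    for key in ("flag_1", "flag_2"):
        fields[key] = fields[key] == "True"
    return fields


if __name__ == "__main__":
    sample = ("Ep 2 Step  1: N= 2 E= 1 | R=-1.176 (S:-1.0 N:-0.3 E:+0.1) "
              "| C=  9.2= | A= 1 | T=False False")
    print(parse_step_line(sample))
```

Lines that do not follow this format, such as the episode summary and file-save messages, return `None` and can be skipped.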

In [3]:
env.train()

⚪ Episode terminated: No nodes remaining
📊 Ep 1 Step  1: N= 0 E= 0 | R=-39.995 (S:+0.0 N:+0.0 E:+0.0) | C=  0.0= | A= 1 | T=True False
  📊 Visualization disabled (terminal output only)
✅ Episode 1 completed in 1 steps with reward: -39.995
💾 Saved model files for episode 1 (Run #0005):
   📄 GraphTransformerPolicy_ep00001_metadata.json
📊 Ep 2 Step  1: N= 2 E= 1 | R=-1.176 (S:-1.0 N:-0.3 E:+0.1) | C=  9.2= | A= 1 | T=False False
  📊 Visualization disabled (terminal output only)
📊 Ep 2 Step  2: N= 4 E= 3 | R=-2.260 (S:-2.0 N:-0.6 E:+0.3) | C= 10.2→ | A= 2 | T=False False
  📊 Visualization disabled (terminal output only)
📊 Ep 2 Step  3: N= 2 E= 1 | R=-1.119 (S:-1.0 N:-0.2 E:+0.1) | C= 11.2→ | A= 4 | T=False False
  📊 Visualization disabled (terminal output only)
⚪ Episode terminated: No nodes remaining
📊 Ep 2 Step  4: N= 0 E= 0 | R=-39.990 (S:+0.0 N:+0.0 E:+0.0) | C=  0.0← | A= 2 | T=True False
  📊 Visualization disabled (terminal output only)
✅ Episode 2 completed in 4 steps with reward: -