A physics-based extension of the VFH-QL algorithm (Abdalmanan et al., 2023) from 2D pixel simulation to Gazebo/ROS 2, with sim-to-real deployment on a physical DAGU Wild Thumper 4WD robot.
Imperial College London — MSc Applied Machine Learning / AML Devices Module Project
Authors: Ali Haidar, Andria Kyriacou, Mahmoud El Etreby
Reference: Abdalmanan et al., IEEE Access (2023). DOI: 10.1109/ACCESS.2023.3265207
Robotron trains a tabular Q-learning agent to navigate unknown environments using only LiDAR and a target bearing, then deploys the learned policy onto real hardware with no fine-tuning. The project reproduces the core VFH-QL formulation from the reference paper and introduces a set of targeted modifications that unlock hard-spawn performance and enable sim-to-real transfer.
Relative to Abdalmanan et al. (2023), this work introduces:
- Forward-facing sector reduction — 8 omnidirectional LiDAR sectors reduced to 5 forward-facing sectors, eliminating decision-irrelevant rear observations and compressing the state space to 512 states.
- Distance-to-target zone encoding — 4 discrete distance zones added to the state to resolve positional aliasing, expanding the space to 2,048 states.
- Composite reward function — zone-transition bonuses, anti-oscillation penalty, and a visibility reward term, designed to shape exploration under sparse goal feedback.
- Curriculum learning — 11 spawn positions across 4 difficulty levels, progressing from easy to full-maze starts.
- Sim-to-real deployment — physical hardware rollout on a DAGU Wild Thumper 4WD platform (the reference paper used only a Python grid simulation).
- Custom URDF digital twin — DAGU Wild Thumper model replacing the standard TurtleBot3 for physics-accurate training.
| Configuration | Overall Success | Hard / Full-Maze Spawns |
|---|---|---|
| 2,048-state + curriculum (final) | ~91% | 87–91% |
| 512-state baseline | ~55% | <7% |
| Fixed-spawn (no curriculum) baseline | 0% | 0% |
The jump from 512 → 2,048 states (distance-zone encoding) and the use of curriculum learning were each individually necessary; removing either collapses hard-spawn performance.
- Physics-based Gazebo environment with Gaussian sensor noise
- `rl_agent.py` — tabular Q-learning loop with curriculum scheduler
- Custom Wild Thumper URDF with calibrated inertial and friction parameters
- Q-table exported as a static C++ header (`q_table.h`) for firmware use
Modular architecture:
- `lidar_sensor.h/.cpp` — RPLidar A1 driver and sector aggregation
- `chassis.h/.cpp` — skid-steer motor control with feedforward + PI
- `config.h` — pin mappings, tuning constants, thresholds
- `q_table.h` — exported policy (static)
- `main.ino` — control loop and state machine
- Chassis: DAGU Wild Thumper 4WD
- MCU: Arduino Portenta H7 + Hat Carrier (mbed pin naming: `PD_4`, `PG_10`, `PJ_7`, etc.)
- LiDAR: RPLidar A1 (on `Serial1`)
- IMU: BNO055 (onboard Kalman filter handles short-episode drift)
- Motor driver: Cytron MDD20A
- Motors: Pololu 34:1 HP 6V 25D gearmotors
```
Robotron/
├── arduino/                          # Portenta H7 firmware
│   ├── main.ino                      # Entry point + Ethernet control state machine
│   ├── chassis.{h,cpp}               # Skid-steer control, encoder odom, IMU fusion, PI + FF
│   ├── lidar_sensor.{h,cpp}          # RPLidar A1 driver, sector aggregation, state index
│   ├── config.h                      # Target, spawn, logging constants
│   ├── q_table.h                     # Exported policy (2048 × 3 floats)
│   └── tests/
│       ├── test_motors/              # Basic PWM sweep
│       ├── test_motors_bt/           # HC-05 Bluetooth feedforward calibration
│       ├── test_motors_eth/          # Ethernet-controlled motor test
│       ├── test_encoders/            # Encoder tick + odometry sanity
│       ├── test_imu/                 # BNO055 sanity
│       ├── test_lidar/               # RPLidar A1 sanity
│       └── test_eth_connection/      # Portenta Ethernet link check
├── robot_simulation/                 # ROS 2 (ament_python) package
│   ├── package.xml
│   ├── setup.py                      # Entry points: rl_agent, lidar_debug
│   ├── launch/
│   │   └── simulation.launch.py      # Cleans Gazebo, spawns URDF, loads world
│   ├── urdf/
│   │   ├── dagu.urdf                 # Wild Thumper digital twin (active)
│   │   └── my_robot.urdf             # Legacy TurtleBot-derivative
│   ├── worlds/
│   │   ├── maze.world                # Full 4-splitter curriculum maze
│   │   ├── maze_zig_zag.world        # Default world in the launch file
│   │   ├── maze_simple.world
│   │   ├── maze_easy.world
│   │   ├── u_shape.world
│   │   └── my_world.world
│   ├── meshes/                       # URDF visual / collision meshes
│   └── robot_simulation/
│       ├── rl_agent.py               # Q-learning trainer (main)
│       └── lidar.py                  # Terminal LiDAR sector dashboard
└── utils/
    ├── export_qtable.py              # .npy → q_table.h
    ├── plot_training_fixed_spawn.py
    ├── plot_training_curriculum_spawn.py
    └── plot_training_curriculum_spawn_classify.py
```
The main training script. Runs a ROS 2 node (rl_agent) that consumes /scan and /odom, publishes /cmd_vel, and resets the robot each episode via either /reset_simulation (fixed spawn) or /set_entity_state (curriculum teleport).
CLI:

```bash
# Fresh run
ros2 run robot_simulation rl_agent --fresh

# Resume from a checkpoint (relative to ./q_tables/)
ros2 run robot_simulation rl_agent --checkpoint run_20250128_143022/q_table_ep500.npy
```

Outputs:

- `q_tables/run_<timestamp>/q_table_ep<N>.npy` — checkpoint every 100 episodes
- `q_tables/run_<timestamp>/q_table_latest.npy` — rolling latest
- `training_logs/training_log_<timestamp>.txt` — CSV with reward, steps, success/collision/timeout, ε, α, action distribution, spawn index
A debugging node that clears the terminal at 5 Hz and prints the robot pose, the 5-sector occupancy grid (Front/Left/Right/FarLeft/FarRight), target sector, target visibility, and the derived state index. Useful when tuning safe_distance_threshold or verifying sector geometry.
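The 5-sector occupancy the dashboard prints can be derived from a raw scan roughly as in this sketch. The 36°-wide forward sectors and the aggregation details are assumptions for illustration, not the actual `lidar.py` code; only the 0.6 m `safe_distance_threshold` default comes from the configuration tables.

```python
import math

# Sketch of 5-sector occupancy aggregation. Sector boundaries are
# ASSUMED (36-degree forward sectors); the real lidar.py may differ.
SAFE_DISTANCE = 0.6  # metres; mirrors safe_distance_threshold

# Forward half-plane split into 5 sectors (degrees, robot frame, CCW):
SECTORS = {
    "FarRight": (-90, -54),
    "Right":    (-54, -18),
    "Front":    (-18,  18),
    "Left":     ( 18,  54),
    "FarLeft":  ( 54,  90),
}

def occupancy(ranges, angle_min, angle_increment):
    """Return {sector: 1 if the sector's minimum range < SAFE_DISTANCE}."""
    mins = {name: math.inf for name in SECTORS}
    for i, r in enumerate(ranges):
        angle = math.degrees(angle_min + i * angle_increment)
        for name, (lo, hi) in SECTORS.items():
            if lo <= angle < hi and r < mins[name]:
                mins[name] = r
    return {name: int(mins[name] < SAFE_DISTANCE) for name in SECTORS}
```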
```bash
ros2 run robot_simulation lidar_debug
```

Kills any stale Gazebo processes, sets `GAZEBO_MODEL_PATH`, launches Gazebo, publishes the robot description, and spawns the robot. To switch maze or robot, edit the variables at the top:

- `world_file` — path in `worlds/`; currently `maze_zig_zag.world`
- `urdf_file` — path in `urdf/`; currently `dagu.urdf`
- `robot_pose_maze` — `[x, y, z, yaw]` spawn for the spawner (only used when `rl_agent` is in fixed-spawn mode)
Converts a trained `.npy` Q-table into a C++ header for the firmware. Edit the two hardcoded paths at the bottom (`input_file`, `output_file`) before running:

```bash
python3 utils/export_qtable.py
```

The generated `q_table.h` declares `const float Q_TABLE[2048][3]`; copy it into `arduino/` before compiling the firmware.
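For reference, the conversion the script performs can be sketched as follows. The function name and the exact formatting are illustrative, not the script's actual code; only the `Q_TABLE[2048][3]` declaration matches the generated header.

```python
import numpy as np

# Minimal sketch of the .npy -> C header conversion done by
# utils/export_qtable.py (names and formatting are illustrative).
def export_header(npy_path, header_path):
    q = np.load(npy_path)
    assert q.shape == (2048, 3), "expected the 2048-state, 3-action table"
    with open(header_path, "w") as f:
        f.write("// Auto-generated from %s\n" % npy_path)
        f.write("const float Q_TABLE[2048][3] = {\n")
        for row in q:
            f.write("  {%ff, %ff, %ff},\n" % tuple(row))
        f.write("};\n")
```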
- `plot_training_fixed_spawn.py` — reward/steps/success curves for a fixed-spawn run
- `plot_training_curriculum_spawn.py` — curriculum run with per-spawn success breakdown
- `plot_training_curriculum_spawn_classify.py` — same as above with spawn-difficulty classification

All three read a `training_logs/*.txt` CSV — point them at your run.
State machine: WAIT_CLIENT → WAIT_TRIGGER → INITIALIZING → RUNNING.
The Portenta serves a Telnet-style session on `192.168.1.177:23`. Connect from a host on the same subnet and send any character to arm; the RL loop then starts inferring from `Q_TABLE` at 5.5 Hz. After an episode ends (success or collision), sending another character starts the next one.
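A minimal host-side client for this protocol might look like the following. This helper is hypothetical (not part of the repo); the firmware only requires "any character", so a single byte suffices to arm.

```python
import socket

# Hypothetical arming client for the Telnet-style control session.
def arm_robot(host="192.168.1.177", port=23, timeout=5.0):
    """Open the control session, send one byte to arm, stream telemetry."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"\n")                # any character arms the RL loop
        while True:
            data = sock.recv(1024)
            if not data:
                break                      # firmware closed the session
            line = data.decode(errors="replace")
            print(line, end="")
            if "SUCCESS" in line or "COLLISION" in line:
                break                      # episode ended; send again to restart
```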
| Sketch | Purpose |
|---|---|
| `test_motors` | PWM sweep to verify direction and deadzone |
| `test_motors_bt` | HC-05 Bluetooth PWM calibration (feedforward constants) |
| `test_motors_eth` | Ethernet-commanded motor test |
| `test_encoders` | Encoder ticks + derived odometry |
| `test_imu` | BNO055 orientation |
| `test_lidar` | RPLidar A1 scan |
| `test_eth_connection` | Portenta Ethernet link-up |
All parameters live in robot_simulation/robot_simulation/rl_agent.py. File:line references point to the default values.
| Parameter | Default | Meaning |
|---|---|---|
| `target_coords` | `(0.0, 1.8)` | Goal position in world frame |
| `action_space` | `[0, 1, 2]` | Forward / Left / Right |
| `random_spawn_enabled` | `True` | True = curriculum (random from spawns), False = fixed spawn |
| `fixed_spawn` | `(-0.5, -2.1, 0.0)` | Used only when curriculum is off |
| `spawns[]` | 11 entries | Curriculum spawns across 4 difficulty tiers (see below) |
| Parameter | Default | Meaning |
|---|---|---|
| `num_lidar_bits` | `5` | Sectors: FarLeft, Left, Front, Right, FarRight |
| `num_vis_bits` | `1` | Binary target-visibility flag |
| `num_target_sectors` | `8` | Relative bearing buckets (45° each) |
| `num_distance_zones` | `4` | Distance-to-target buckets |
| `distance_zone_boundaries` | `[1.0, 2.0, 3.0]` m | Zone thresholds; zone 3 is anything ≥ 3 m |
Total states: 2⁵ × 2 × 4 × 8 = 2048.
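The packing of those four fields into a single table index can be sketched as follows. The field order here is an assumption for illustration, not necessarily the layout used in `rl_agent.py`; only the field sizes come from the table above.

```python
# Hypothetical sketch of the 2,048-state index composition; the
# actual bit layout in rl_agent.py may differ.
def encode_state(lidar_bits, target_visible, target_sector, distance_zone):
    """Pack the observation into a single table index in [0, 2048)."""
    assert 0 <= lidar_bits < 32       # 5 binary sector-occupancy bits
    assert target_visible in (0, 1)   # binary visibility flag
    assert 0 <= target_sector < 8     # 45-degree relative-bearing bucket
    assert 0 <= distance_zone < 4     # distance-to-target zone
    index = lidar_bits
    index = index * 2 + target_visible
    index = index * 8 + target_sector
    index = index * 4 + distance_zone
    return index

# All sectors clear, target visible dead ahead, closest zone:
encode_state(0b00000, 1, 0, 0)  # -> 32
```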
| Parameter | Default | Meaning |
|---|---|---|
| `alpha` | `0.1` | Initial learning rate |
| `alpha_decay` | `0.9998` | Per-episode α decay (α > 0.05 until ≈ ep 3500) |
| `alpha_min` | `0.02` | Floor |
| `gamma` | `0.95` | Discount factor |
| `epsilon` | `1.0` | Initial exploration rate |
| `epsilon_decay` | `0.9995` | Per-episode ε decay (ε > 0.1 until ≈ ep 4600) |
| `epsilon_min` | `0.05` | Floor |
| Parameter | Default | Meaning |
|---|---|---|
| `safe_distance_threshold` | `0.6` m | Sector flips to BLOCKED below this minimum range |
| `collision_threshold` | `0.20` m | Any ray below this ends the episode |
| `target_radius` | `0.30` m | Success radius around `target_coords` |
| Parameter | Default | Meaning |
|---|---|---|
| `reward_success` | `+2500.0` | Terminal success (must exceed `max_steps` × per-step penalty) |
| `reward_collision` | `-200.0` | Terminal collision |
| `reward_mode` | `'hybrid'` | Enables zone-transition + visibility + oscillation shaping |
| Zone transition | ±15.0 | Per-episode reward for moving to a closer/farther distance zone |
| Target-visible bonus | +1.0 | Added when `target_vis == 1` |
| Oscillation penalty | -3.0 | Triggered on L–R–L or R–L–R patterns (`max_action_history = 3`) |
| VFH penalty | -1.0 to -5.0 | Scales with angular deviation from the optimal open action; -5.0 if the chosen action is blocked |
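Taken together, the shaping terms might compose as in this sketch. The gating and exact weighting in `rl_agent.py` may differ; in particular, applying the zone bonus per transition is an assumption here.

```python
# Illustrative composition of the 'hybrid' shaping terms tabulated above.
def hybrid_reward(prev_zone, zone, target_visible, last_actions, blocked):
    reward = 0.0
    if zone < prev_zone:          # moved into a closer distance zone
        reward += 15.0
    elif zone > prev_zone:        # drifted into a farther zone
        reward -= 15.0
    if target_visible:
        reward += 1.0             # visibility bonus
    # Oscillation penalty on L-R-L / R-L-R over the last 3 actions
    if last_actions in ([1, 2, 1], [2, 1, 2]):
        reward -= 3.0
    if blocked:                   # chosen action heads into a blocked sector
        reward -= 5.0
    return reward
```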
| Parameter | Default | Meaning |
|---|---|---|
| `max_steps` | `1000` | Episode-length cap → timeout |
| `max_episodes` | `10000` | Soft cap; training keeps running while `rclpy.spin` is alive |
| Control timer | `0.18` s | ~5.5 Hz — matched to firmware `CONTROL_LOOP_INTERVAL_MS` |
| Action | `linear.x` | `angular.z` |
|---|---|---|
| 0 — Forward | `0.18` | `0.0` |
| 1 — Left (in-place) | `0.03` | `+0.5` |
| 2 — Right (in-place) | `0.03` | `-0.5` |
Sampled uniformly at random every episode. Easy spawns produce quick successes whose Q-values propagate backward to bootstrap harder starts.
| Level | Description | Count |
|---|---|---|
| 1 — Easy | Past Splitter-3, short straight path | 3 |
| 2 — Medium | Between Splitter-2 and Splitter-3 (one turn) | 2 |
| 3 — Hard | Between Splitter-1 and Splitter-2 (two turns) | 3 |
| 4 — Full maze | Start zone (three turns) | 3 |
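The uniform sampling over the 11-entry spawn list can be sketched as follows. The coordinates here are invented placeholders (only the fixed spawn `(-0.5, -2.1, 0.0)` appears in the configuration above); only the four-tier structure mirrors the table.

```python
import random

# Hypothetical spawn list mirroring the 4-tier curriculum above.
# Coordinates are INVENTED for illustration; see spawns[] in rl_agent.py.
SPAWNS = [
    # (x, y, yaw, difficulty)
    (0.0,  1.0, 0.0,  1), (0.2,  0.8, 1.57, 1), (-0.2,  0.9, 0.0, 1),   # easy
    (0.5,  0.0, 0.0,  2), (0.6, -0.2, 3.14, 2),                         # medium
    (0.0, -1.0, 0.0,  3), (0.3, -1.2, 1.57, 3), (-0.3, -1.1, 0.0, 3),   # hard
    (-0.5, -2.1, 0.0, 4), (-0.4, -2.0, 1.57, 4), (-0.6, -2.2, 0.0, 4),  # full maze
]

def sample_spawn():
    """Uniform sampling across all 11 spawns, as described above."""
    x, y, yaw, _difficulty = random.choice(SPAWNS)
    return x, y, yaw
```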
| Constant | Default | Meaning |
|---|---|---|
| `LOG_LEVEL` | `2` | 0 silent / 1 one-line-per-step / 2 detailed per-module |
| `TARGET_X`, `TARGET_Y` | `0.0`, `1.8` | Goal in world frame (must match training) |
| `TARGET_RADIUS` | `0.30` | Success radius (must match training) |
| `SPAWN_X`, `SPAWN_Y`, `SPAWN_YAW` | `-0.5`, `-2.1`, `0.0` | Physical robot placement; offsets odometry so it shares the Q-table's frame |
| Constant | Default | Meaning |
|---|---|---|
| `mac[]` | `DE:AD:BE:EF:FE:ED` | Portenta MAC |
| `ip` | `192.168.1.177` | Static IP (host PC must be on the same subnet) |
| `ETH_PORT` | `23` | Telnet-style control port |
| `CONTROL_LOOP_INTERVAL_MS` | `180` | RL step period — keep synced with Python `self.timer` |
| `COLLISION_COOLDOWN_MS` | `1000` | Motor quiescence after each episode reset |
- Simulation host: Ubuntu 22.04, ROS 2 Humble (or newer), Gazebo 11, `ros-<distro>-gazebo-ros-pkgs`, Python 3 with `numpy` and `matplotlib`.
- Firmware host: Arduino IDE or `arduino-cli` with the `arduino:mbed_portenta` core; the Portenta-specific Ethernet library plus the LiDAR/IMU drivers referenced in `chassis.cpp` / `lidar_sensor.cpp`.
From your colcon workspace root (with `Robotron/robot_simulation` inside `src/`):

```bash
colcon build --symlink-install --packages-select robot_simulation
source install/setup.bash
ros2 launch robot_simulation simulation.launch.py
```

This boots Gazebo with `maze_zig_zag.world` and spawns the DAGU URDF. To change either, edit `robot_simulation/launch/simulation.launch.py` and rebuild.
In a second terminal (with the workspace sourced):
```bash
# Fresh training run
ros2 run robot_simulation rl_agent --fresh

# Resume — path is relative to ./q_tables/
ros2 run robot_simulation rl_agent --checkpoint run_<timestamp>/q_table_ep500.npy
```

ε and α are auto-restored from the episode number encoded in the checkpoint filename.
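The restore step can be sketched as follows with a hypothetical helper; the actual parsing in `rl_agent.py` may differ, but the decay constants are the documented defaults.

```python
import re

# Sketch of restoring the decay schedules from a checkpoint filename
# such as "q_table_ep500.npy" (helper name is illustrative).
def restore_schedules(checkpoint_name):
    """Recover the episode count, then replay the per-episode decays."""
    match = re.search(r"ep(\d+)", checkpoint_name)
    episode = int(match.group(1)) if match else 0
    alpha = max(0.02, 0.1 * 0.9998 ** episode)
    epsilon = max(0.05, 1.0 * 0.9995 ** episode)
    return episode, alpha, epsilon
```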
With the simulation running:

```bash
ros2 run robot_simulation lidar_debug
```

```bash
# Edit the log path inside the script to point at training_logs/*.txt
python3 utils/plot_training_curriculum_spawn.py   # or the fixed-spawn variant
```

```bash
# Edit input_file and output_file at the bottom of utils/export_qtable.py first
python3 utils/export_qtable.py
cp <output_path>/q_table.h arduino/q_table.h

arduino-cli compile --fqbn arduino:mbed_portenta:envie_m7 arduino/
arduino-cli upload -p /dev/ttyACM0 --fqbn arduino:mbed_portenta:envie_m7 arduino/
```

- Place the robot physically at `(SPAWN_X, SPAWN_Y, SPAWN_YAW)` from `config.h`.
- Connect your PC to the Portenta over Ethernet on the same subnet as `192.168.1.177`.
- Open a TCP session: `telnet 192.168.1.177 23` (or `nc 192.168.1.177 23`).
- Send any character to arm. Telemetry streams back at `LOG_LEVEL` verbosity. After a `SUCCESS` or `COLLISION`, send another character to start the next episode.
- Distance-zone state encoding unlocked hard-spawn performance. A pure 5-sector + visibility + target-bearing state collapsed spatially distinct regions of the maze onto identical observations. Adding 4 distance zones expanded the space from 512 → 2,048 states, broke the aliasing, and lifted hard-spawn success from single digits to ~90%.
- Explicit coordinate-frame handling across the sim-to-real boundary. The RPLidar reports angles clockwise, while the Q-table is indexed in Gazebo's counter-clockwise frame. Keeping `lidar_target_angle` and `target_sector_angle` as separate variables (rather than a single shared bearing) made the frame conversion trivial and kept left/right target sectors consistent between simulation and hardware.
- Feedforward + PI outperformed pure PID for discrete-action control. Per-action feedforward PWM constants deliver the bulk of each command on cycle one; a low-gain PI loop handles the residual; integrator state resets on every action change. The result is crisp, jitter-free transitions that match the 5.5 Hz RL cadence.
- Ordered motor-command pipeline for clean transitions. Deadzone compensation runs before the slew limiter so the two stages compose correctly — the deadzone boost never gets clipped back out, and current draw stays within the PMIC's headroom across sharp action changes.
- Curriculum learning was the key to generalisation. Sampling uniformly across 11 spawns spanning four difficulty tiers let easy-spawn successes seed the Q-table early; those values then propagated backward to bootstrap the hard and full-maze starts, producing a single policy that generalises across the whole maze.
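The two motor-control points above (feedforward + PI with integrator reset, and deadzone compensation before the slew limiter) can be sketched together as follows. All gains and PWM constants here are illustrative placeholders, not the tuned firmware values.

```python
# ILLUSTRATIVE constants; the tuned values live in the firmware's config.h.
FEEDFORWARD_PWM = {0: 120, 1: 90, 2: 90}   # per-action PWM guesses
KP, KI = 0.8, 0.2
DEADZONE_PWM = 20    # minimum PWM that actually moves the wheels
MAX_SLEW = 15        # max PWM change per control cycle

class FeedforwardPI:
    """Feedforward + low-gain PI, with integrator reset on action change."""
    def __init__(self):
        self.integral = 0.0
        self.last_action = None

    def command(self, action, target_speed, measured_speed):
        if action != self.last_action:
            self.integral = 0.0            # reset integrator on action change
            self.last_action = action
        error = target_speed - measured_speed
        self.integral += error
        # Feedforward delivers the bulk of the command on cycle one;
        # the PI term trims the residual.
        return FEEDFORWARD_PWM[action] + KP * error + KI * self.integral

def apply_deadzone(pwm):
    """Boost small non-zero commands past the motor deadzone."""
    if pwm == 0:
        return 0
    sign = 1 if pwm > 0 else -1
    return sign * max(abs(pwm), DEADZONE_PWM)

def apply_slew(pwm, prev_pwm):
    """Rate-limit the per-cycle change; runs AFTER deadzone compensation."""
    delta = max(-MAX_SLEW, min(MAX_SLEW, pwm - prev_pwm))
    return prev_pwm + delta
```

Running deadzone compensation first means the slew limiter ramps toward the already-boosted target, so the boost is never clipped back out of the final command.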
If this work is useful, please cite the original VFH-QL paper:
Abdalmanan et al., "VFH-QL: A Hybrid Approach for Autonomous Navigation," IEEE Access, 2023. DOI: 10.1109/ACCESS.2023.3265207