Add `airstack test` by andrewjong · Pull Request #344 · castacks/AirStack

andrewjong · 2026-04-23T21:53:46Z

What does this pull request do?

Adds an automated testing framework, built on pytest, that runs system tests. It currently tests image builds, Docker container liveness and connectivity, and a takeoff, hover, and landing flight.

Which issue number does this address? #16

Video: https://www.youtube.com/watch?v=EzgGHnYDI_k

How did you implement it?

Omar added stuff

Testing

System Testing

AirStack's system tests bring up the full Docker-based stack — simulator, robot containers, and GCS — and verify end-to-end behavior: container health, ROS 2 node presence, sensor publishing rates, and compute resource usage. Tests are written in Python with pytest and live under tests/ at the repo root.

Test Suite Structure

| Module | Mark | What it tests | Hardware required |
|---|---|---|---|
| `test_build_docker.py` | `build_docker` | Docker image builds (robot-desktop, gcs, isaac-sim, ms-airsim); records image sizes | Docker daemon |
| `test_build_packages.py` | `build_packages` | `colcon build` inside each container (robot, GCS, ms-airsim ROS workspace) | Docker daemon |
| `test_liveliness.py` | `liveliness` | Full stack up: container health, tmux process liveness, sentinel ROS 2 nodes, sim topic publishing rates, compute usage, sustained stability | Docker daemon, GPU, sim license |
| `test_takeoff_hover_land.py` | `autonomy` | End-to-end flight: PX4 readiness gate, takeoff to 10 m, hover stability, land; one chain per (sim, num_robots, iteration, velocity) | Docker daemon, GPU, sim license |

Marks can be selected individually or combined using pytest's -m expressions:
-m "build_docker or build_packages", -m liveliness, -m autonomy.

Test Infrastructure

All shared fixtures, helpers, and configuration live in tests/conftest.py.

`airstack_env` fixture

Parametrized over (sim, num_robots, iteration) tuples derived from CLI flags. For each combination it:

1. Calls airstack up with the appropriate COMPOSE_PROFILES, NUM_ROBOTS, and headless flags
2. Records airstack_up_duration_s to metrics.json
3. Yields an env dict used by every TestLiveliness test
4. Tears down with airstack down and records airstack_down_duration_s
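
The parametrization over (sim, num_robots, iteration) can be sketched as follows. This is a minimal illustration, not the code in tests/conftest.py: the helper names `parse_csv` and `build_env_params` are hypothetical, and the id format is inferred from the log file name `test_stable[msairsim-1-iter0]` that appears in this description.

```python
from itertools import product


def parse_csv(value, cast=str):
    """Split a comma-separated CLI value such as "1,3" into a list."""
    return [cast(item.strip()) for item in value.split(",") if item.strip()]


def build_env_params(sim_opt, num_robots_opt, stress_iterations):
    """Return (params, ids) for every (sim, num_robots, iteration) combination.

    Hypothetical helper: the real fixture may order or name things differently.
    """
    sims = parse_csv(sim_opt)
    robot_counts = parse_csv(num_robots_opt, int)
    params = list(product(sims, robot_counts, range(stress_iterations)))
    ids = [f"{sim}-{n}-iter{i}" for sim, n, i in params]
    return params, ids


# Using the documented defaults: --sim msairsim,isaacsim --num-robots 1,3
# --stress-iterations 3
params, ids = build_env_params("msairsim,isaacsim", "1,3", 3)
print(len(params))  # 2 sims x 2 robot counts x 3 iterations = 12
print(ids[0])       # msairsim-1-iter0
```

With the default flags this yields 12 up/down cycles, which is why narrowing `--sim`, `--num-robots`, and `--stress-iterations` shortens a local run considerably.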

`MetricsRecorder`

Writes custom metrics to tests/results/<timestamp>/metrics.json after each record() call. Keys follow the pattern test_node_id → metric_key → {value, unit, direction}. Time-series data (Hz samples, compute snapshots) are stored as {key}_samples lists and expanded into scalar aggregates (mean, min, max, start_mean, end_mean) by parse_metrics.py.
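
A minimal sketch of the write-on-every-record behavior and the nested key layout described above. The class body, the `record()` signature, the test node id, and the `"lower_is_better"` direction label are all assumptions made for illustration; only the test_node_id → metric_key → {value, unit, direction} layout comes from this description.

```python
import json
import tempfile
from pathlib import Path


class MetricsRecorder:
    """Hypothetical recorder that rewrites metrics.json after each record()."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = {}

    def record(self, test_node_id, key, value, unit, direction):
        # Layout: test_node_id -> metric_key -> {value, unit, direction}
        self.data.setdefault(test_node_id, {})[key] = {
            "value": value,
            "unit": unit,
            "direction": direction,
        }
        # Flush on every call so a crashed run still leaves partial metrics.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    metrics_file = Path(tmp) / "results" / "metrics.json"
    recorder = MetricsRecorder(metrics_file)
    # Illustrative value only, not a measured result.
    recorder.record("test_liveliness::test_stable", "airstack_up_duration_s",
                    42.0, "s", "lower_is_better")
    saved = json.loads(metrics_file.read_text())

print(saved["test_liveliness::test_stable"]["airstack_up_duration_s"]["unit"])
```

Flushing after each call trades a little I/O for robustness: if a later test hangs or the container is killed, the metrics recorded so far are already on disk.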

Output files

Every test run produces a timestamped directory:

tests/results/
└── 2025-04-21_14-30-00/
    ├── results.xml        # JUnit XML — test durations and pass/fail status
    ├── metrics.json       # Custom metrics (image sizes, Hz, compute, timing)
    └── logs/
        ├── test_build_docker.TestDockerBuilds.test_build_robot_desktop.log
        ├── test_liveliness.TestLiveliness.test_stable[msairsim-1-iter0].log
        └── ...            # One log file per test execution

Running Tests

`airstack test` (primary interface)

airstack test is the standard way to run tests. It builds the containerized
test runner from tests/docker/, mounts the repo read-only, and forwards all
arguments directly to pytest. No local Python environment needed.

# From the repo root (AirStack must be set up: airstack setup):

# Build tests only — fast, no GPU needed
airstack test -m "build_docker or build_packages" -v

# Liveliness run — ms-airsim, 1 robot, 1 iteration, 60 s stability window
airstack test -m liveliness \
  --sim msairsim \
  --num-robots 1 \
  --stress-iterations 1 \
  --stable-duration 60 \
  -v

# Autonomy run — takeoff/hover/land at three velocities
airstack test -m autonomy \
  --sim msairsim \
  --num-robots 1 \
  --stress-iterations 1 \
  --takeoff-velocities 0.5,1,2 \
  -v

# Show GUI windows (for local visual inspection)
airstack test -m liveliness --gui -v

airstack test calls xhost + automatically so GUI-mode sim containers
can reach the host X server; it is a no-op when DISPLAY is not set.

Prerequisites

- Docker daemon running with your user in the docker group
- NVIDIA drivers + nvidia-container-toolkit for liveliness/autonomy tests
- airstack setup completed (adds airstack to PATH)

Direct pytest (for development / debugging)

Run pytest directly when you need faster iteration (no container rebuild) or
want to attach a debugger. Requires a local Python environment.

export AIRSTACK_ROOT=$(pwd)
pip install -r tests/requirements.txt

# Build tests only
pytest tests/ -m "build_docker or build_packages" -v

# Liveliness run
pytest tests/ -m liveliness \
  --sim msairsim \
  --num-robots 1 \
  --stress-iterations 1 \
  --stable-duration 60 \
  -v

CLI option reference

| Option | Default | Description |
|---|---|---|
| `--sim` | `msairsim,isaacsim` | Comma-separated sim targets |
| `--num-robots` | `1,3` | Comma-separated robot counts |
| `--stress-iterations` | `3` | Up/down cycles per (sim, num_robots) config |
| `--stable-duration` | `120` | Seconds `test_stable` polls for |
| `--stable-interval` | `10` | Seconds between polls in `test_stable` |
| `--gui` | off | Show simulator GUI (disables headless mode) |
| `--takeoff-velocities` | `0.5,1,2` | Takeoff/land speeds in m/s |
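
To illustrate how `--stable-duration` and `--stable-interval` interact, here is a sketch of a fixed-interval polling loop. It is an assumption about how such a loop could look, not the actual `test_stable` implementation: `poll_stable`, its parameters, and the `FakeClock` helper are hypothetical names, and the clock is injectable only so the example runs instantly.

```python
import time


def poll_stable(check, duration_s=120, interval_s=10,
                clock=time.monotonic, sleep=time.sleep):
    """Call check() every interval_s seconds until duration_s has elapsed.

    Returns the number of polls made; raises AssertionError as soon as a
    poll reports an unhealthy stack.
    """
    deadline = clock() + duration_s
    polls = 0
    while True:
        assert check(), f"stack became unhealthy at poll {polls}"
        polls += 1
        if clock() + interval_s > deadline:
            return polls
        sleep(interval_s)


class FakeClock:
    """Deterministic stand-in for time.monotonic/time.sleep (example only)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
polls = poll_stable(lambda: True, duration_s=120, interval_s=10,
                    clock=fake.time, sleep=fake.sleep)
print(polls)  # 13: one poll at t=0, then every 10 s up to and including t=120
```

Under these assumptions the defaults give 13 health checks per `test_stable` run; a shorter `--stable-duration 60` would give 7.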

Autonomy Tests (`test_takeoff_hover_land.py`)

TestTakeoffHoverLand runs a 4-phase flight chain for every combination of
(sim, num_robots, iteration, velocity). The drone returns to the ground after
each velocity so the next velocity starts from a clean state.

Phase order

| Phase | Test | What happens |
|---|---|---|
| 1 | `test_px4_ready` | Waits for MAVROS + PX4 EKF ready; once per env |
| 2 | `test_takeoff` | Sends TakeoffTask; asserts altitude within 10 % |
| 3 | `test_hover` | Captures odom for 10 s; asserts altitude drift < 0.5 m |
| 4 | `test_landing` | Sends LandTask; asserts final altitude < 0.5 m |

If any phase other than test_hover fails, the remaining phases for that env
are skipped (the chain guard prevents a stuck-in-air drone from blocking later
velocity sweeps). A hover failure does not skip landing, so the drone always
returns to the ground.
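
The skip rule above can be sketched in plain Python. This models only the decision logic; the `ChainGuard` class and `run_chain` function are hypothetical names, and in a real pytest suite the "skipped" branch would presumably call `pytest.skip(...)` rather than return a string.

```python
PHASES = ["test_px4_ready", "test_takeoff", "test_hover", "test_landing"]
NON_BLOCKING = {"test_hover"}  # a hover failure must not skip landing


class ChainGuard:
    """Tracks the first blocking failure per env (hypothetical helper)."""

    def __init__(self):
        self._blocked = {}

    def blocker(self, env_key):
        """Return the phase that blocked this env, or None if unblocked."""
        return self._blocked.get(env_key)

    def report(self, env_key, phase, passed):
        if not passed and phase not in NON_BLOCKING:
            self._blocked.setdefault(env_key, phase)


def run_chain(guard, env_key, outcomes):
    """Simulate one chain; outcomes maps phase name -> True/False."""
    results = {}
    for phase in PHASES:
        blocked_by = guard.blocker(env_key)
        if blocked_by is not None:
            results[phase] = f"skipped ({blocked_by} failed)"
            continue
        passed = outcomes.get(phase, True)
        guard.report(env_key, phase, passed)
        results[phase] = "passed" if passed else "failed"
    return results


hover_fail = run_chain(ChainGuard(), "msairsim-1-iter0", {"test_hover": False})
takeoff_fail = run_chain(ChainGuard(), "msairsim-1-iter0", {"test_takeoff": False})
print(hover_fail["test_landing"])    # passed
print(takeoff_fail["test_landing"])  # skipped (test_takeoff failed)
```

Keying the guard by env rather than by velocity means one blocking failure also skips the later velocity sweeps for that env, which matches the behavior described above.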

Recorded metrics

| Metric key | Unit | Description |
|---|---|---|
| `ready_duration_sys_s` | s | Wall-clock time from test start until PX4 ready |
| `takeoff_duration_sim_s` | s | Sim-time from first motion to 95 % of target |
| `land_duration_sim_s` | s | Sim time from 80 % peak descent to < 0.5 m |
| `velocity_rmse_m_sim_s` | m/s | RMSE of dz/dt vs commanded velocity during climb/descent |
| `altitude_error_m` | m | Signed steady-state error at takeoff success (+ = high) |
| `overshoot_m` | m | Unsigned transient overshoot above target |
| `hover_altitude_mean_error_m` | m | Mean altitude drift during hover |
| `hover_position_stddev_m` | m | 3-D position jitter (sqrt of summed axis variances) |
| `final_altitude_m` | m | Altitude at landing action completion |
| `odometry_error_mean_m` | m | Mean 3-D position error vs ground-truth odom |
| `odometry_error_max_m` | m | Peak 3-D error vs ground-truth odom |
| `odometry_altitude_bias_m` | m | Signed z-axis bias vs ground-truth odom |

Metrics are recorded per robot as robot_N.<key> and written to
tests/results/<timestamp>/metrics.json.
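
A few of the metric definitions above, written out as a sketch. The function names are hypothetical, the use of population variance is an assumption (the description does not say which variance estimator is used), and the sample values are illustrative rather than measured.

```python
import math
from statistics import pvariance


def hover_position_stddev(positions):
    """3-D jitter: sqrt of the summed per-axis variances of (x, y, z)."""
    xs, ys, zs = zip(*positions)
    return math.sqrt(pvariance(xs) + pvariance(ys) + pvariance(zs))


def overshoot(altitudes, target):
    """Unsigned transient overshoot above the target altitude."""
    return max(0.0, max(altitudes) - target)


def altitude_error(altitude, target):
    """Signed steady-state error; positive means the drone is high."""
    return altitude - target


def velocity_rmse(times, altitudes, commanded):
    """RMSE of finite-difference dz/dt against the commanded velocity."""
    rates = [(z1 - z0) / (t1 - t0)
             for t0, t1, z0, z1 in zip(times, times[1:],
                                       altitudes, altitudes[1:])]
    return math.sqrt(sum((r - commanded) ** 2 for r in rates) / len(rates))


# Illustrative samples only.
jitter = hover_position_stddev([(1.0, 0.0, 10.0), (-1.0, 0.0, 10.0)])
peak = overshoot([0.0, 5.0, 10.5, 10.0], target=10.0)
error = altitude_error(9.75, target=10.0)
rmse = velocity_rmse([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], commanded=1.0)
print(jitter, peak, error, rmse)  # 1.0 0.5 -0.25 0.0
```

Summing the per-axis variances before taking the square root gives a single scalar that does not depend on how the jitter is split across axes.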

Running autonomy tests

# Sweep velocities 0.5, 1, 2 m/s; 1 robot; ms-airsim
airstack test -m autonomy \
  --sim msairsim \
  --num-robots 1 \
  --stress-iterations 1 \
  --takeoff-velocities 0.5,1,2 \
  -v

# Single velocity, Isaac Sim, 3 robots
airstack test -m autonomy \
  --sim isaacsim \
  --num-robots 3 \
  --stress-iterations 1 \
  --takeoff-velocities 1 \
  -v

Metrics Reporting (`parse_metrics.py`)

tests/parse_metrics.py reads results.xml and metrics.json from a run directory and produces a markdown report. It has two modes:

Single-run report

python tests/parse_metrics.py \
  --current tests/results/2025-04-21_14-30-00/

Prints a markdown table of all recorded metrics. Always exits 0.

Diff / regression check

# --threshold is optional: regression if change% exceeds this (default 20)
# --output is optional: also write the report to a file
python tests/parse_metrics.py \
  --current  tests/results/2025-04-21_14-30-00/ \
  --baseline tests/results/2025-04-20_09-00-00/ \
  --threshold 20 \
  --output   report.md

Prints a side-by-side comparison. Exits 1 if any metric regresses beyond the threshold; exits 0 otherwise.
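
The regression rule can be sketched as a percentage change interpreted through each metric's direction field. This is an illustration under assumptions, not the logic in parse_metrics.py: the function names, the `"lower_is_better"` / `"higher_is_better"` labels, and the handling of a zero baseline are all hypothetical.

```python
def change_percent(current, baseline):
    """Relative change of current vs baseline, in percent."""
    return (current - baseline) / abs(baseline) * 100.0


def is_regression(current, baseline, direction, threshold=20.0):
    """True if the metric moved the wrong way by more than threshold percent.

    direction is assumed to be "lower_is_better" or "higher_is_better".
    """
    if baseline == 0:
        return False  # change% is undefined; assumption: do not flag
    change = change_percent(current, baseline)
    if direction == "lower_is_better":
        return change > threshold
    return change < -threshold


def exit_code(rows, threshold=20.0):
    """rows: iterable of (current, baseline, direction). 1 on any regression."""
    return 1 if any(is_regression(c, b, d, threshold) for c, b, d in rows) else 0


# Illustrative numbers only.
slower_startup = [(130.0, 100.0, "lower_is_better")]   # +30 %: regression
small_drift = [(110.0, 100.0, "lower_is_better")]      # +10 %: within threshold
dropped_rate = [(70.0, 100.0, "higher_is_better")]     # -30 %: regression
print(exit_code(slower_startup), exit_code(small_drift), exit_code(dropped_rate))
# 1 0 1
```

A non-zero exit code lets a CI job fail on regressions without parsing the markdown report.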

The report has three sections per test module:

- Metrics: flat table of scalar metrics (test name, metric key, value/baseline, change%)
- Sim publishing rates: pivot table of topic Hz aggregates (mean, start_mean, end_mean, min, max)
- Compute usage: pivot table of CPU/memory/GPU metrics per container

Regressions are flagged with 🔴, improvements with 🟢.
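
The expansion of `{key}_samples` lists into scalar aggregates might look like the sketch below. The function name, the output key naming, and in particular the definition of start_mean and end_mean as the mean of the first and last 20 % of samples are assumptions; this description does not state how those windows are chosen.

```python
from statistics import mean


def expand_samples(key, samples, window_fraction=0.2):
    """Expand a time series into mean/min/max/start_mean/end_mean scalars.

    window_fraction (assumed) sets how much of the series counts as the
    "start" and "end" windows.
    """
    n = max(1, int(len(samples) * window_fraction))
    return {
        f"{key}_mean": mean(samples),
        f"{key}_min": min(samples),
        f"{key}_max": max(samples),
        f"{key}_start_mean": mean(samples[:n]),
        f"{key}_end_mean": mean(samples[-n:]),
    }


# Illustrative Hz samples for a topic whose rate decays over the run.
hz = [10.0, 10.0, 9.0, 8.0, 8.0, 7.0, 6.0, 6.0, 5.0, 5.0]
aggregates = expand_samples("odom_hz", hz)
print(aggregates["odom_hz_start_mean"], aggregates["odom_hz_end_mean"])  # 10.0 5.0
```

Comparing the start and end windows exposes a rate that degrades during the stability window even when the overall mean still looks acceptable.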

Did you update the docs (and where)?

Yes: the docs are in tests/README.md, which was also added to mkdocs.yml.

OasisArtisan and others added 25 commits April 16, 2026 17:52

98900a3 Add airstack image-delete. Also change URDF check to consider env variables not just .env
a07950c Add delay before starting PX4 for airsim. This is allowing PX4 to converge to correct location at startup. More elegent solution is needed. Also added tmux plugin to airsim tmux windows
9842876 Initial docker build and package build tests
fb79a89 Compute docker image sizes and add compare_metrics.py to output clean markdown marking regressions
0db708c Auto fetch airsim scene if user didn't specify path.
1ec1945 Upgrade pegasus example from two to arbitrary number of drones equally spaced.
514b589 Initial liveliness checks. Compare metrics still unchecked. And modelling cross test dependencies not done yet
22e7fb8 Liveliness with properly grouped output table
21a8b09 Add compute usage logging to liveliness
307aac8 Parse metrics to support parsing single results file into markdown.
18b1212 Add docs, add serivces and scripts for github ci/cd
dbf425f Tag test docker + add measuring realtime factor to liveliness
95b52ab Initial working autonomy (Takeoff, hover, land) tests. Still need to work on making metrics more informative and informative logging messages. Also have not tested state calculating state estimation errors as we will need ground truth from airsim and isaacsim
dacaeb4 Add airsim GT publishing and compute odom vs GT metrics
26392b6 Hover measures drift from hover start not to target. Pass rates are rendered from results.xml using parse_metrics.py
3b2a827 Fix broken px4_ready now uses MAVROS connected and odom publication for better reliability
2922a7e Standardize displayed parameterization order in results and tables
acf59b1 Enforce module ordering
175891f Set version to 0.18.0-alpha.5
2b39a2b Rename test_autonomy to test_takeoff_hover_land
3dd3bd9 Update docs
ba3c7a1 Add help for airstack test
70651fd Add video
3174535 Merge branch 'main' into omar/ci
29faca8

andrewjong merged commit 1104628 into main Apr 23, 2026
1 of 3 checks passed

andrewjong deleted the omar/ci branch April 23, 2026 21:56
