modem.py #37811

Draft
greatgitsby wants to merge 63 commits into commaai:master from greatgitsby:modem

@greatgitsby
Contributor

@greatgitsby greatgitsby commented Apr 12, 2026

resolves: #37277
moved from: commaai/agnos-builder#556

Testing

Tested on mici (comma four, EG916Q-GL) and tizi (comma 3X, EG25-G) with Connect and Webbing eSIM profiles.

Boot & connect

$ ssh comma@10.0.0.22 "cat /dev/shm/modem | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d[\"state\"], d[\"iccid\"][-6:], d[\"operator\"], d[\"ip_address\"])'"
connected 463550 T-Mobile 10.26.216.207
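
The one-liner above reads the JSON state file that modem.py publishes. A minimal Python sketch of the same read, assuming the file holds a single JSON object with the keys used in that command (`state`, `iccid`, `operator`, `ip_address`). The helper name `summarize_modem_state` is hypothetical and not part of this PR:

```python
import json

STATE_PATH = "/dev/shm/modem"  # state file path used throughout this PR


def summarize_modem_state(path=STATE_PATH):
    """Return (state, last 6 ICCID digits, operator, IP), or None if unreadable."""
    try:
        with open(path) as f:
            d = json.load(f)
    except (OSError, ValueError):
        # Missing file or unparsable contents; the caller decides what to do.
        return None
    if not isinstance(d, dict):
        return None
    # Missing keys come back as None (or "" for the ICCID suffix).
    return (
        d.get("state"),
        (d.get("iccid") or "")[-6:],
        d.get("operator"),
        d.get("ip_address"),
    )
```

Returning `None` for a missing or garbage file keeps the reader tolerant of the window before modem.py first writes its state.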

Data connectivity through ppp0

$ ssh comma@10.0.0.22 "curl --interface ppp0 -s --connect-timeout 10 http://ifconfig.me"
38.86.196.156

Route management (wifi preferred, ppp0 fallback)

$ ssh comma@10.0.0.22 "ip route show default; echo ---; ip rule show | grep 1000"
default via 10.0.0.1 dev wlan0 proto dhcp src 10.0.0.22 metric 600
default via 10.64.64.64 dev ppp0 metric 1000
---
32765:  from 10.26.216.207 lookup 1000

APN change detection

$ ssh comma@10.0.0.22 "echo -n 'globaldata' > /data/params/d/GsmApn"
# detected in ~1s; full reconnect ~19s

Roaming toggle

$ ssh comma@10.0.0.22 "echo -n '0' > /data/params/d/GsmRoaming"
# pppd killed, state → registering with error=roaming_disabled within ~15s
$ ssh comma@10.0.0.22 "echo -n '1' > /data/params/d/GsmRoaming"
# reconnects in ~12s

Profile switching

$ ssh comma@10.0.0.22 "python3 system/hardware/esim.py --switch <iccid>"
# +QUSIM URC detected, pppd killed, full re-init on new SIM

Carrier error surfacing

{"state": "registering", "registration": "denied", "carrier_error": "EMM_CAUSE_PLMN_NOT_ALLOWED (check GsmApn)"}

Stress testing

See stress test comment and additional stress comment for the full matrix: pppd kills (single + 10× back-to-back + 4× rapid), AT-lock contention, APN/roaming thrash, USB re-enumeration mid-connection, sustained traffic with intermittent kills, SIGTERM cleanup, orphan pppd cleanup, 10-concurrent-reader churn (59,951 reads / 0 errors), garbage state file recovery, and 5-min idle soak.

Bonus: full-reboot → ping <60s

See bonus comment. Mean 49.1s across 7 trials (min 45.3s, max 52.0s), all under the 60s target.

greatgitsby and others added 13 commits April 11, 2026 21:51
The modem detects eSIM profile changes automatically via refreshFlag=0
in _enable_profile. CFUN=0/1 was clearing the CGDCONT table, causing
modem.py to lose APN context and need hw_reset to recover. Without
CFUN, CGDCONT persists and modem.py recovers in ~9s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When modem.py is running, read modem state from /dev/shm/modem instead
of querying ModemManager over dbus. Falls back to MM when the state
file doesn't exist, allowing both to coexist on the same system.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pppd's replacedefaultroute was stealing the default route from wifi.
Now pppd uses nodefaultroute, and modem.py adds a metric 1000 default
route for ppp0. NM manages wifi at ~600, so wifi wins when available
and ppp0 takes over as fallback when wifi is down.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The route was added with 'scope link' which doesn't forward to internet.
Now we parse the remote IP from pppd output and add the route via the peer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Packets bound to ppp0's source IP need a policy rule to route through
ppp0, otherwise the kernel sends them via wifi (lower metric).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
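
The routing commits above combine three pieces: a metric-1000 default route via the PPP peer, a dedicated table, and a source-address policy rule. The sketch below only builds the iproute2 command lines. The exact commands modem.py runs are not shown in this conversation, so the function name and argument order are assumptions; the table number and metric come from the `ip route` / `ip rule` output in the PR description.

```python
TABLE = "1000"   # routing table number seen in the `ip rule show` output
METRIC = "1000"  # worse than NetworkManager's ~600 for wifi, so wifi wins


def ppp_route_commands(local_ip, peer_ip, dev="ppp0"):
    """Build (but do not run) the iproute2 commands for the ppp0 fallback."""
    return [
        # Main-table default route via the PPP peer, at a worse metric than wifi.
        ["ip", "route", "replace", "default", "via", peer_ip, "dev", dev, "metric", METRIC],
        # Dedicated table holding a default route that always leaves via ppp0.
        ["ip", "route", "replace", "default", "via", peer_ip, "dev", dev, "table", TABLE],
        # Policy rule: packets sourced from the PPP address consult that table.
        ["ip", "rule", "add", "from", local_ip, "lookup", TABLE],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` as root, for example with `ppp_route_commands("10.26.216.207", "10.64.64.64")` for the addresses shown earlier.
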
…anup on stop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 12, 2026

Process replay diff report

Replays driving segments through this PR and compares the behavior to master.
Please review any changes carefully to ensure they are expected.

✅ 0 changed, 66 passed, 0 errors

greatgitsby and others added 16 commits April 12, 2026 07:55
Stale ip rules from previous PPP sessions were accumulating and
breaking routing. Now we flush all table 1000 rules and routes
before adding new ones on each connect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After a profile switch, the modem USB re-enumerates and ports
temporarily disappear. Now we wait up to 60s for the port to
come back before attempting to reconnect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
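
A minimal sketch of the bounded wait described in the commit above, assuming the port shows up as a device node on disk. The function name, the polling interval, and the use of `os.path.exists` are illustrative choices, not taken from modem.py:

```python
import os
import time


def wait_for_port(path, timeout=60.0, interval=0.5):
    """Poll until `path` exists; return True if it appeared before the timeout."""
    deadline = time.monotonic() + timeout  # monotonic: immune to clock sync jumps
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```
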
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
States: WAITING_PORT -> INIT -> REGISTERING -> CONNECTING -> CONNECTED
Any state can transition to RECONNECTING on error/SIM change.
No more PPP thread — pppd runs as a subprocess monitored in the main loop.
No more threading.Event — state transitions drive all control flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
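
The state flow in the commit above can be sketched as an enum plus a successor table. The lowercase values match the strings seen elsewhere in this PR; the `RECONNECTING -> WAITING_PORT` edge is inferred from the USB re-enumeration stress test, and the `step` helper is hypothetical:

```python
from enum import Enum


class State(Enum):
    WAITING_PORT = "waiting_port"
    INIT = "init"
    REGISTERING = "registering"
    CONNECTING = "connecting"
    CONNECTED = "connected"
    RECONNECTING = "reconnecting"


# Happy-path successor of each state; CONNECTED stays put until something fails.
NEXT = {
    State.WAITING_PORT: State.INIT,
    State.INIT: State.REGISTERING,
    State.REGISTERING: State.CONNECTING,
    State.CONNECTING: State.CONNECTED,
    State.CONNECTED: State.CONNECTED,
    State.RECONNECTING: State.WAITING_PORT,
}


def step(state, ok=True):
    """Advance one transition: follow the happy path, or drop to RECONNECTING."""
    return NEXT[state] if ok else State.RECONNECTING
```

Keeping the transitions in one table means the main loop is the only place that decides the next state, which is the property the commit is after.
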
The pppd stdout readline was blocking the entire state machine loop,
preventing poll from running. Use select() for non-blocking reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
select() on buffered pipes was unreliable — pppd output was missed,
routes never got set up. Use a small reader thread that feeds a queue,
drained non-blocking in the main loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
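
A sketch of the reader-thread-plus-queue pattern the commit above describes, using only the standard library. The helper names (`spawn`, `start_line_reader`, `drain`) and the `None` end-of-stream sentinel are assumptions for illustration:

```python
import queue
import subprocess
import threading


def spawn(cmd):
    """Start a child process with line-oriented text output on one pipe."""
    return subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)


def start_line_reader(proc):
    """Start a daemon thread that pushes each output line of `proc` onto a queue."""
    lines = queue.Queue()

    def pump():
        for line in proc.stdout:  # blocks here, not in the main loop
            lines.put(line.rstrip("\n"))
        lines.put(None)  # sentinel: the process closed its output

    threading.Thread(target=pump, daemon=True).start()
    return lines


def drain(lines):
    """Return every line currently queued, without blocking."""
    out = []
    while True:
        try:
            out.append(lines.get_nowait())
        except queue.Empty:
            return out
```

The main loop calls `drain()` once per iteration and parses whatever arrived, so a quiet child process never stalls the state machine.
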
The continue statement was hiding all the pppd line parsing (IP detection,
route setup, disconnect detection). Fixed indentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _pdp() APN selection logic. Always configure CID 1 with either
the user's custom APN (GsmApn param) or empty string to let the network
assign. Inline chat script into pppd args, removing temp file.
Reconnect automatically when GsmApn param changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Read /data/params/d/GsmApn instead of importing openpilot Params class,
reducing dependency on openpilot internals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When GsmRoaming is "0", reject roaming registration and only connect
on home network. Default allows roaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reconnect when roaming setting changes while connected, same as APN.
Store roaming state in _do_init, check in _do_connected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect GsmRoaming changes while stuck in registering state so
re-enabling roaming takes effect immediately. Set error field
to "roaming_disabled" when roaming is blocked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
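
The three roaming commits above amount to a small decision: read the param file, then accept or reject a registration status. In the sketch below the `<stat>` numbers are the AT+CREG values from 3GPP TS 27.007; the `"home"` label, the helper names, and the `(ok, error)` return shape are assumptions rather than modem.py's actual interface:

```python
# AT+CREG <stat> values from 3GPP TS 27.007.
REG_STAT = {
    0: "not_registered",
    1: "home",
    2: "searching",
    3: "denied",
    4: "unknown",
    5: "roaming",
}


def read_flag(path, default=True):
    """Read a '0'/'1' param file; a missing or empty file yields the default."""
    try:
        with open(path) as f:
            raw = f.read().strip()
    except OSError:
        return default
    return default if raw == "" else raw != "0"


def may_connect(stat, roaming_allowed):
    """Return (ok, error) for a registration status under the roaming setting."""
    reg = REG_STAT.get(stat, "unknown")
    if reg == "home":
        return True, None
    if reg == "roaming":
        return (True, None) if roaming_allowed else (False, "roaming_disabled")
    return False, None  # still searching, denied, or unknown
```

Re-reading the flag on every pass through the registering state is what lets a re-enabled roaming setting take effect without a restart.
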
All state mutations now go through _update(**kwargs) which atomically
writes to disk. Removed all manual _ws() calls and direct S[key] writes
(except main loop state tracking). _poll() batches all fields into one
_update() call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
greatgitsby and others added 9 commits April 21, 2026 20:57
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
modem.py stops ModemManager as part of startup, but hardwared's
restart-MM-when-version-is-None path keeps reviving it, causing MM
to race modem.py for AT ports. Gate on /dev/shm/modem (modem.py's
state file) so the restart only runs in the stock flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_network_type hits MM DBus in the cell branch, which auto-activates
ModemManager via its systemd DBus service file. When modem.py is running
(state file present), resolve cell type from the state file instead so
we never trigger MM's activation and fight modem.py for AT ports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The state file was only written after the first successful _do_init,
leaving a ~60s window on boot where get_modem_state() returned None
and callers fell through to the MM DBus path, auto-activating
ModemManager and racing modem.py for AT ports.

Publish initial state at run() start so callers short-circuit MM from
the first moment. Also drop the systemctl mask call — /etc is
read-only on AGNOS, so mask fails silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…istration

The main loop set self.S['state'] in-memory but didn't write the file,
and _do_registering only wrote registration on the happy path. Result:
observers saw state='init' and registration='unknown' while modem.py
was actively searching/denied/not_registered for long periods.

- Main loop now _update(state=state.value) on every transition.
- _do_registering writes registration field whenever CREG changes,
  not just when registered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When registration transitions to denied, query AT+CEER and put the
reason string in the state file's error field. If PLMN_NOT_ALLOWED
and no APN is set, append a hint — this case often clears once the
SIM's required APN is configured (e.g. globaldata for some MVNOs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Split carrier/network errors from modem.py-internal errors:
- error: internal state (e.g. roaming_disabled)
- carrier_error: reason string from AT+CEER when registration is
  denied or not_registered, with APN hint for known-clearable cases
  (PLMN_NOT_ALLOWED, EPS_SERVICES_NOT_ALLOWED).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EPS_SERVICES_NOT_ALLOWED is a subscription/plan issue and per the LTE
spec the UE treats the USIM as invalid for EPS until power-cycle, so
an APN change cannot clear it. Drop the misleading hint for that case
and reword the remaining hint to 'check GsmApn' (applies whether APN
is set or not).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
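
The hint logic after the last of the three commits above can be sketched as follows. It assumes the AT+CEER reason has already been extracted as a plain string; the function name and the substring match are illustrative, not the PR's code:

```python
# Causes for which an APN change may help, per the commit above.
# EPS_SERVICES_NOT_ALLOWED is deliberately absent.
APN_HINT_CAUSES = ("PLMN_NOT_ALLOWED",)


def carrier_error(ceer_reason):
    """Return the carrier_error string for a denied registration."""
    reason = ceer_reason.strip()
    if any(cause in reason for cause in APN_HINT_CAUSES):
        return f"{reason} (check GsmApn)"
    return reason
```

With `"EMM_CAUSE_PLMN_NOT_ALLOWED"` as input this yields the same string shown under "Carrier error surfacing" in the PR description.
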
@greatgitsby
Contributor Author

Stress test results

Ran on mici (comma four, EG916Q-GL) with a Connect eSIM and APN=globaldata.

| # | Test | Result | Metric |
|---|------|--------|--------|
| 1 | Baseline ping + HTTPS via ppp0 | | 5/5 ping, 138ms avg RTT; HTTPS 706ms |
| 2 | sudo killall -9 pppd (1×) | | Reconnected in ~6.5s |
| 3 | External flock /dev/shm/modem_lpa.lock for 30s | | No crash, stayed connected, AT polls degraded gracefully |
| 5 | GsmApn param change at runtime | | Detected in 1.3s; full reconnect ~19s |
| 6 | GsmRoaming off→on | | Off: pppd killed + error=roaming_disabled in ~15s. On: reconnect in ~12s |
| 8 | sudo killall -9 pppd ×10 back-to-back | | 3-fails→RECONNECTING cycle triggers cleanly; no state file corruption |
| 9 | 10 concurrent json.load readers of /dev/shm/modem for 30s | | 43,374 reads, 0 errors |
| 10 | Download 1MB via ppp0, verify counters | | rx_bytes delta 1,045,070 (~4.5% TCP/TLS overhead); tx_bytes plausible |
| 11 | 5-minute idle | | Stayed connected, counters kept updating |
| 12 | carrier_error field surfaces AT+CEER reason on denied registration | | E.g. EMM_CAUSE_PLMN_NOT_ALLOWED (check GsmApn) |

Skipped: Test 4 (SIM-change URC injection: there is no safe way to synthesize it externally, and the path is exercised organically during eSIM profile switches) and Test 7 (GPIO reset mid-connection, skipped for safety).

@greatgitsby
Contributor Author

greatgitsby commented Apr 22, 2026

Bonus target ($500): full reboot → ping < 60s

Result: 49.1s mean (n=7, min 45.3s, max 52.0s) on comma four with a Connect eSIM (APN=globaldata). All trials comfortably under 60s.

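| trial | uptime→ping | elapsed (script→ping) |
|-------|-------------|-----------------------|
| 1 | 50.02s | 38.51s |
| 2 | 48.38s | 36.62s |
| 3 | 45.25s | 33.42s |
| 4 | 49.59s | 35.43s |
| 5 | 48.95s | 36.59s |
| 6 | 51.97s | 36.97s |
| 7 | 49.25s | 37.72s |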
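
As a quick arithmetic check, the summary figures can be recomputed from the seven trials above. The variable names are illustrative; the numbers are copied from the trial data:

```python
from statistics import mean

# (uptime at first successful ping, seconds since the script started), per trial
trials = [
    (50.02, 38.51), (48.38, 36.62), (45.25, 33.42), (49.59, 35.43),
    (48.95, 36.59), (51.97, 36.97), (49.25, 37.72),
]

uptimes = [u for u, _ in trials]
print(f"mean={mean(uptimes):.2f}s min={min(uptimes)}s max={max(uptimes)}s")
# prints: mean=49.06s min=45.25s max=51.97s

# Uptime at which the timing script itself started, per trial.
script_starts = [round(u - e, 2) for u, e in trials]
print(script_starts)
```

The mean of 49.06s agrees with the reported 49.1s, and the difference between the two columns shows the script starting at roughly 11.5s to 15.0s of uptime.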

Reproduction

/var is tmpfs on AGNOS (cron @reboot doesn't persist), and /etc is read-only, so the cleanest harness is a script in /data/ invoked from launch_chffrplus.sh at boot. Uses /proc/uptime (monotonic, unaffected by wall-clock sync).

# Install timing script on device
ssh comma@DEVICE "cat > /data/ping_timing.sh << 'SH'
#!/bin/bash
exec > /data/ping_timing.log 2>&1
t_script_start=\$(awk '{print \$1}' /proc/uptime)
echo \"script_start_uptime=\${t_script_start}s\"
while true; do
  up=\$(awk '{print \$1}' /proc/uptime)
  if sudo -n ping -I ppp0 -c 1 -W 2 -q 8.8.8.8 > /dev/null 2>&1; then
    echo \"PING_OK uptime=\${up}s elapsed=\$(awk -v a=\${up} -v b=\${t_script_start} 'BEGIN{print a-b}')s\"
    break
  fi
  [ \"\$(awk -v u=\${up} 'BEGIN{print (u>300)?1:0}')\" = 1 ] && { echo TIMEOUT; break; }
  sleep 1
done
cat /dev/shm/modem 2>/dev/null >> /data/ping_timing.log
SH
chmod +x /data/ping_timing.sh"

# Hook into openpilot launch (temporary — revert after measurement)
ssh comma@DEVICE "sed -i '/^    agnos_init\$/a\\    /data/ping_timing.sh &' /data/openpilot/launch_chffrplus.sh"

# Reboot, then read log
ssh comma@DEVICE "sudo reboot"
# ...wait for device, then:
ssh comma@DEVICE "cat /data/ping_timing.log"

Sample output

script_start_uptime=11.53s
PING_OK uptime=49.25s elapsed=37.72s
{"state": "connected", ..., "operator": "T-Mobile", "network_type": "lte", "registration": "roaming", ...}

@greatgitsby
Contributor Author

Additional stress testing

More chaos tests, all on mici (comma four, EG916Q-GL).

| # | Test | Result | Notes |
|---|------|--------|-------|
| 13 | APN thrash (10 changes @ 5s) | | State machine collapses rapid flips; only latest APN read in final _do_init. Recovers to connected. |
| 14 | Roaming thrash (10 flips @ 3s) | | Same collapsing behavior; recovers to connected with error cleared. |
| 15 | 4 pppd kills in 8s (within one retry window) | | Only PPP fail 1/3 logged; rapid kills hit an already-dead process. No re-entrancy issues. |
| 16 | USB re-enumeration mid-connection (lte.sh stop_blocking + start) | | State → reconnecting when port disappears, → waiting_port → port found → full re-init → connected. ~32s end-to-end. |
| 17 | 60s sustained ppp0 traffic + pppd kill every 15s (×4) | | Traffic drops during reconnect window; counters persist; state stays consistent. |
| 18 | SIGTERM to modem.py | ✅ (shutdown) | stop() runs clean: pppd killed, state file removed. Note: under openpilot manager there's no auto-respawn after SIGTERM; requires systemctl restart comma. |
| 19 | Orphan pppd from previous run | | _kill_ppp in _do_init clears it; fresh modem.py reaches connected cleanly. |
| 20 | 10 concurrent readers for 40s while hammering APN param (20 flips @ 2s) | | 59,951 successful reads, 0 errors. os.replace holds up under real concurrent churn. |
| 21 | Edge cases: empty GsmApn file + garbage bytes written to /dev/shm/modem | | Empty file treated as auto-APN. Garbage state file overwritten by next _update() within seconds. |

greatgitsby and others added 6 commits April 21, 2026 23:46
- Default log level back to INFO (DEBUG was a dev leftover).
- Drop os.fsync() in _update() — state file lives on tmpfs.
- Stop writing phantom state strings from _do_connecting/_do_connected/
  _poll (disconnected, registered, connecting) — main loop is the sole
  writer of the state field.
- Gate _poll's _update(**s) on 'if s:' to skip tmpfs writes when
  nothing changed.
- Extract _parse_reg() and _query_ceer() helpers; consolidate the
  registering path so all non-success branches clear error alongside
  setting carrier_error (prevents stale 'roaming_disabled' lingering
  after roaming is re-enabled while still not_registered).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ppp0's sysfs byte counters reset each time pppd recreates the
interface, so they'd flip to 0 on every 3-strike retry — meaning
downstream wwanTx/wwanRx counters regressed several times per hour.
Match the old ModemManager semantics: counter only resets on a full
RECONNECTING cycle (SIM change, prolonged disconnect, USB re-enum),
not on transient pppd restarts.

Also drop the ad-hoc state='reconnecting' write in _do_reconnecting
— main loop already writes the enum value on transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
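
One way to get the counter behavior described in the commit above is to carry a base offset across resets of the raw value. This is a sketch of that idea, not the PR's implementation; the class name is hypothetical, and the raw reading would come from something like `/sys/class/net/ppp0/statistics/rx_bytes`:

```python
class ByteCounter:
    """Accumulate an interface byte counter across resets of the raw value."""

    def __init__(self):
        self.base = 0  # bytes carried over from earlier pppd sessions
        self.last = 0  # most recent raw reading

    def update(self, raw):
        """Feed a raw reading and return the accumulated total."""
        if raw < self.last:
            # The interface was recreated and its counter restarted from zero.
            self.base += self.last
        self.last = raw
        return self.base + raw

    def reset(self):
        """Start over, as on a full RECONNECTING cycle."""
        self.base = self.last = 0
```

A transient pppd restart then only bumps `base`, while a full reconnect calls `reset()` explicitly.
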
Three discrepancies caught by auditing modem.py's state file against
the old hardware.py MM DBus calls:

1. mcc_mnc was never populated. Read AT+CIMI (IMSI), take first 6
   digits (MCC+MNC) and publish in state file + get_sim_info().

2. AT+COPS Access Technology only mapped 3 of 14 spec values (others
   fell to 'unknown'). Expand to cover 2G/EDGE (0,1,3), 3G HSPA
   variants (2,4,5,6), LTE/LTE-A (7,9,10), NR 5G (11-13). Also add
   cell5G branch in hardware.py's get_network_type.

3. hardware.py's state_map in get_network_info expected MM state
   names (searching/registered) that modem.py never emits, and
   missed waiting_port/registering/reconnecting. Realign to the
   actual State enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
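
Items 1 and 2 in the commit above can be sketched as two small helpers. The AcT groupings follow the commit message; the returned labels other than `"lte"` (which appears in the sample state file) are assumptions, as are the function names:

```python
def act_to_network_type(act):
    """Map an AT+COPS <AcT> value to a coarse network generation."""
    if act in (0, 1, 3):
        return "2g"
    if act in (2, 4, 5, 6):
        return "3g"
    if act in (7, 9, 10):
        return "lte"
    if act in (11, 12, 13):
        return "5g"
    return "unknown"  # includes 8, which the commit message does not list


def mcc_mnc_from_imsi(imsi):
    """Return the first six IMSI digits, as the commit message describes."""
    digits = imsi.strip()
    if len(digits) < 6 or not digits.isdigit():
        return None
    return digits[:6]
```
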
get_modem_state() now raises if /dev/shm/modem isn't present; all
hardware.py accessors read directly from modem.py's state file with no
ModemManager fallback. Also drops the hardwared MM-restart watchdog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
State file's "state" key now holds INITIALIZING/SEARCHING/CONNECTING/
CONNECTED/DISCONNECTING directly via State.label, dropping the
translation table in hardware.get_network_info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract _configure_device in modem.py holding the full EG25/EG916 AT
commands (EG25 now clears the initial EPS bearer APN via AT+CGDCONT=0
in place of the old mmcli call; EG916 commands gated on empty sim_id).
Delete the NetworkManager-driven eSIM prime path and the now-empty
configure_modem / reboot_modem overrides and base-class stubs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
greatgitsby and others added 6 commits April 25, 2026 08:54
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace error/carrier_error string fields with a single error dict
({type, description}) keyed off ERR_* constants so consumers can
discriminate by type. Move per-state pacing sleeps into a STATE_WAIT
map read by the main loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested on mici: pyserial's Serial() constructor returns with the port
ready, so the 1s sleep after open is unnecessary. Identity AT commands
(CGSN, QCCID, CIMI, GMR) succeeded ~330ms after port open with no
timeouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
greatgitsby and others added 3 commits April 25, 2026 12:56
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
modem.py owns AT_PORT and runtime-masks ModemManager, so the
DBus path can never succeed. Remove _use_dbus, _get_modem,
_dbus_query, MM/MM_MODEM constants, and the unused math import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Replace ModemManager with modem.py

2 participants