lme_changes for rpc‐go v3
Documentation of the LME-path (HECI + APF tunnel) activation fixes that landed on top of the main branch of rpc-go, plus the related WSMAN dependency changes.
Scope: legacy ws/wss remote activation flow driven over LME (the in-band APF tunnel on the HECI MEI device) instead of the user-mode LMS daemon. Files touched live under internal/lm/, internal/rps/executor.go, pkg/heci/, internal/commands/diagnostics/wsman.go, and go.mod.
In topological order (oldest → newest):
| # | SHA | Subject |
|---|---|---|
| 1 | 9c80f39 | fix: keep LME HECI handle open between RPS requests |
| 2 | 54f0c81 | fix(lme): make AMT 18.x LME activation reliable through TLS port switch |
| 3 | 17cbebb | fix(lme): fall back to LME on TLS-enforced AMT when LMS is unavailable |
| 4 | 7dbce6b | fix(lme): stabilize persistent APF TLS tunnel and channel-close recovery |
| 5 | 79135e6 | build(deps): replace go-wsman-messages with local sibling clone |
| 6 | a59d38e | refactor: update wsman.go to use dedicated typed functions for all AMT classes |
Together they take rpc-go from "LMS-only with brittle LME fallback" to "LME is a first-class transport that survives AMT 18.x port-switch and TLS-handshake quirks".
flowchart LR
subgraph host[Host running rpc-go]
RPC[rpc-go executor]
LMS[LMS daemon<br/>16992/16993]
HECI[/dev/mei0/]
end
subgraph fw[AMT firmware]
AMTHTTP[AMT WS-MAN<br/>16992/16993]
APF[APF server<br/>over MEI]
end
RPS[RPS<br/>over WebSocket]
RPS <-->|ws/wss| RPC
RPC -->|TCP localhost| LMS --> AMTHTTP
RPC -->|ioctl| HECI --> APF
APF -. tcpip-forward .-> AMTHTTP
- LMS path: rpc-go opens localhost:16992 (or :16993 after TLS port-switch). LMS proxies to AMT.
- LME path: rpc-go talks APF directly to the MEI client; AMT publishes tcpip-forward entries for ports 16992 / 16993 / 623 / 664, and rpc-go opens a CHANNEL per WSMAN request.
The transport is chosen in NewExecutor: LMS is probed first; if it does not
answer, the executor initialises an LMEConnection and sets isLME = true.
HandleDataFromRPS used to call localManagement.Close() after every
single RPS request whenever TLS-tunnel mode was off. For the LMS path
that is fine (each WSMAN round is a fresh TCP connection). For LME it
tore down:
- the MEI device handle, and
- the four
tcpip-forwardregistrations thatLMEConnection.Initializeset up exactly once at startup.
The next RPS message (typically the Digest-auth retry) entered
prepareLME → Connect → SendMessage and the kernel returned ENODEV
because there was no open handle. The user-visible symptom was
heci device not initialized.
Skip the per-request close when running over LME. End-of-flow teardown
already happens through MakeItSo's deferred Close.
// internal/rps/executor.go (post-response cleanup)
if e.isLME {
	e.waitGroup.Wait()
}
if !e.tlsTunnelActive {
	if !e.isLME { // <-- effective gate added by 9c80f39
		e.localManagement.Close()
		e.lmConnected = false
	}
}
stateDiagram-v2
[*] --> Initialise: rpc-go starts
Initialise --> Idle: HECI open, 4 tcpip-forwards
Idle --> Channel1: RPS msg #1 → CHANNEL_OPEN
Channel1 --> Idle: response sent
state old_buggy {
Idle2 --> ClosedHECI: per-request Close()
ClosedHECI --> ENODEV: RPS msg #2 dial
}
state new_fixed {
Idle3 --> Channel2: RPS msg #2 → CHANNEL_OPEN (reuses HECI)
Channel2 --> Idle3
}
This is the most substantial of the LME commits. It addresses four independent race conditions that all happened to surface on AMT 18.x:
pkg/heci/linux.go now reinitialises the device when the kernel
returns ENODEV, and surfaces a new sentinel ErrDeviceReinitialized
from pkg/heci/types.go so
callers can take action instead of pretending nothing happened.
LMEConnection.Connect catches the sentinel and re-runs Initialize
before retrying CHANNEL_OPEN.
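The recovery described above follows the usual Go sentinel-error pattern. The sketch below is illustrative only: the `transport` interface, `connectWithReinit`, and `fakeDevice` are hypothetical names, and the local `ErrDeviceReinitialized` is a stand-in for the one in pkg/heci/types.go, not a copy of it.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDeviceReinitialized is a local stand-in for the sentinel described in the
// text; the real definition lives in pkg/heci/types.go.
var ErrDeviceReinitialized = errors.New("heci: device reinitialized")

// transport is a hypothetical interface covering the two calls this sketch needs.
type transport interface {
	Initialize() error
	OpenChannel() error
}

// connectWithReinit tries to open a channel. When the HECI layer reports that
// it had to reopen the device, it re-runs Initialize and tries again.
func connectWithReinit(t transport, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = t.OpenChannel()
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrDeviceReinitialized) {
			return err // unrelated failure: do not retry
		}
		if initErr := t.Initialize(); initErr != nil {
			return fmt.Errorf("re-initialize after device reset: %w", initErr)
		}
	}
	return fmt.Errorf("channel open failed after %d attempts: %w", maxAttempts, err)
}

// fakeDevice fails the first `failures` opens with the sentinel, then succeeds.
type fakeDevice struct {
	failures int
	inits    int
}

func (f *fakeDevice) Initialize() error { f.inits++; return nil }

func (f *fakeDevice) OpenChannel() error {
	if f.failures > 0 {
		f.failures--
		return fmt.Errorf("send: %w", ErrDeviceReinitialized)
	}
	return nil
}

func main() {
	dev := &fakeDevice{failures: 1}
	err := connectWithReinit(dev, 3)
	fmt.Println("err:", err, "re-initializations:", dev.inits)
}
```

Because the sentinel is matched with errors.Is, callers still recognise it after intermediate layers wrap it with extra context.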
The init handler now waits for all four expected tcpip-forward
ports (16992, 16993, 623, 664) before declaring the device
ready, rather than firing on the first one. This stops the activation
loop from racing the trailing forwards and then choosing the wrong
target port.
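One way to express "ready only after all four forwards" is a small set that is drained as tcpip-forward requests arrive. This is a minimal sketch, not the code in internal/lm/: `forwardTracker` and its methods are hypothetical, and a real handler running on a listener goroutine would also need a mutex or a channel for synchronisation.

```go
package main

import "fmt"

// expectedForwards lists the ports named in the text.
var expectedForwards = []uint32{16992, 16993, 623, 664}

// forwardTracker records which tcpip-forward requests are still outstanding.
type forwardTracker struct {
	pending map[uint32]bool
}

func newForwardTracker(ports []uint32) *forwardTracker {
	pending := make(map[uint32]bool, len(ports))
	for _, port := range ports {
		pending[port] = true
	}
	return &forwardTracker{pending: pending}
}

// Observe marks a port as forwarded and reports whether every expected port
// has now been seen. Duplicate or unexpected ports are ignored.
func (t *forwardTracker) Observe(port uint32) bool {
	delete(t.pending, port)
	return len(t.pending) == 0
}

func main() {
	tracker := newForwardTracker(expectedForwards)
	for _, port := range []uint32{16992, 623, 16992, 664, 16993} {
		fmt.Printf("tcpip-forward %d -> ready=%v\n", port, tracker.Observe(port))
	}
}
```

Only the final call reports ready, so a handler gated on this value cannot fire on the first forward.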
After AMT persists the new key, the MEI client for LME briefly
disappears on 18.x, costing ~6 s of re-handshake on the next channel
open. After forwarding the GenerateKeyPair response the executor now
inserts a short pause (default 750 ms) to ride that out.
The legacy code path always dialled 127.0.0.1:16993 on port_switch,
which fails whenever LMS is not running. In LME mode the executor
now calls LMEConnection.SetPort(16993), flips tlsTunnelActive,
and replies to RPS with port_switch_ack, reusing the existing APF
session instead of dialling LMS.
| Variable | Default | Purpose |
|---|---|---|
| RPC_PORT_SWITCH_MAX_DELAY | 60s | upper clamp on RPS-supplied delay |
| RPC_PORT_SWITCH_EXTRA_DELAY | 0s | added on top of server delay |
| RPC_LME_POST_KEYGEN_PAUSE_MS | 750 | pause after GenerateKeyPair |
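The sketch below shows one plausible way to read these variables and combine them. The helper names are hypothetical, and the order of operations (clamp the server-supplied delay first, then add the extra delay) is an assumption read off the table, not confirmed against the executor source.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// durationFromEnv parses a Go duration string such as "60s" and falls back to
// def when the variable is unset, malformed, or negative.
func durationFromEnv(name string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(name); ok {
		if d, err := time.ParseDuration(v); err == nil && d >= 0 {
			return d
		}
	}
	return def
}

// millisFromEnv parses a plain integer number of milliseconds.
func millisFromEnv(name string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(name); ok {
		if n, err := strconv.Atoi(v); err == nil && n >= 0 {
			return time.Duration(n) * time.Millisecond
		}
	}
	return def
}

// portSwitchDelay clamps the server-supplied delay to [0, max] and then adds
// the operator-supplied extra delay (assumed ordering).
func portSwitchDelay(server, extra, max time.Duration) time.Duration {
	d := server
	if d < 0 {
		d = 0
	}
	if d > max {
		d = max
	}
	return d + extra
}

func main() {
	max := durationFromEnv("RPC_PORT_SWITCH_MAX_DELAY", 60*time.Second)
	extra := durationFromEnv("RPC_PORT_SWITCH_EXTRA_DELAY", 0)
	pause := millisFromEnv("RPC_LME_POST_KEYGEN_PAUSE_MS", 750*time.Millisecond)

	// A server asking for 90s is clamped to the configured maximum.
	fmt.Println("port switch delay:", portSwitchDelay(90*time.Second, extra, max))
	fmt.Println("post-keygen pause:", pause)
}
```

Falling back to the default on malformed input keeps a typo in the environment from turning into a zero-length or unbounded sleep.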
sequenceDiagram
autonumber
participant RPS
participant RPC as rpc-go executor
participant LME as LMEConnection
participant HECI as /dev/mei0
participant AMT
RPS->>RPC: wsman request #N (Digest, etc.)
RPC->>LME: prepareLME / Connect
LME->>HECI: CHANNEL_OPEN
HECI->>AMT: APF
AMT-->>HECI: CHANNEL_OPEN_CONFIRMATION
RPC->>AMT: GenerateKeyPair (via channel)
AMT-->>RPC: response
Note over RPC: pause RPC_LME_POST_KEYGEN_PAUSE_MS<br/>(MEI client about to vanish)
RPS->>RPC: port_switch{port:16993, delay}
RPC->>RPC: clamp delay ≤ 60s, sleep
RPC->>LME: SetPort(16993)
RPC-->>RPS: port_switch_ack
Note over RPC,AMT: tlsTunnelActive = true
RPS->>RPC: tls_data
RPC->>LME: CHANNEL_OPEN on port 16993
sequenceDiagram
autonumber
participant RPC
participant HECI
participant LME
RPC->>HECI: SendMessage
HECI-->>RPC: ENODEV
HECI->>HECI: re-open MEI client
HECI-->>RPC: ErrDeviceReinitialized
RPC->>LME: Initialize (re-handshake, wait 4 tcpip-forwards)
LME-->>RPC: ready
RPC->>HECI: SendMessage (retry)
HECI-->>RPC: OK
When the operator runs against an AMT with TLS-on-LMS enforced
(LocalTlsEnforced=true) but the LMS service is not installed on
the host (e.g. headless Linux box), the executor used to return
utils.LMSConnectionFailed and abort. There is no reason to: LME can
reach AMT in-band.
// internal/rps/executor.go (NewExecutor, abbreviated)
err := client.localManagement.Connect()
if err != nil {
	log.Tracef("LMS dial failed (%v); falling back to LME (in-band HECI/APF)", err)
	lme := lm.NewLMEConnection(lmDataChannel, lmErrorChannel, client.waitGroup)
	if config.LocalTlsEnforced {
		lme.SetPort(16993) // start on TLS port, no port_switch needed
	}
	client.localManagement = lme
	client.isLME = true
	client.localManagement.Initialize()
}
When TLS is enforced, the first CHANNEL_OPEN already targets 16993,
so RPS does not need to issue a port_switch mid-flow.
flowchart TD
A[NewExecutor] --> B{LMS dial OK?}
B -- yes --> C[Use LMSConnection]
B -- no --> D{LocalTlsEnforced?}
D -- yes --> E[LMEConnection.SetPort 16993]
D -- no --> F[LMEConnection on 16992]
E --> G[Initialize HECI + APF]
F --> G
G --> H[isLME = true]
This is the largest single diff in the series (~1.2k lines across
internal/lm/apf_conn.go (new),
internal/lm/engine.go,
internal/local/amt/localTransport.go,
internal/rps/executor.go,
pkg/heci/linux.go, and
pkg/utils/constants.go).
It hardens the persistent APF channel that the executor relies on
in tunnel mode (phase 2 of the activation, where RPS just streams
opaque TLS records as tls_data):
- A dedicated apf_conn.go module owns per-channel state (sender id, TX window, handshake confirmation flag) so Connect/Send/Listen stop sharing mutable bookkeeping with the global apf.Session.
- Connect now resets per-request session state explicitly (Tempdata, SenderChannel, TXWindow, HandshakeConfirmed) and retries CHANNEL_OPEN up to four times when the device is briefly unavailable.
- Send switches to apf.BuildChannelDataBytes and debits the TX window by payload length rather than framed length, eliminating a slow drift that could starve the window on long TLS handshakes.
- The executor recognises a TLS ClientHello (0x16 .. 0x01) on an already-connected channel and closes the channel before forwarding it, so AMT can begin a fresh handshake on a new APF channel rather than receiving a ClientHello mid-session.
- HandleDataFromRPS gained a 90 s response timeout in tunnel mode and graceful handling of lm.ErrLMSReadTimeoutNoData (TLS 1.3 quiet rounds, e.g. after client Finished), so the persistent channel is not torn down on legitimately silent rounds.
- HECI read-timeouts are reclassified as Warn/Debug instead of Error, matching their benign nature in the polling loop.
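The ClientHello check in the list above can be done on the first six bytes of a TLS record: content type 0x16 (handshake), a two-byte version, a two-byte length, then handshake type 0x01. The function below is a sketch of that test under hypothetical names; the additional major-version check is a choice made here, and the executor's actual matching logic may differ.

```go
package main

import "fmt"

const (
	tlsRecordHandshake      = 0x16 // record content type: handshake
	tlsVersionMajor         = 0x03 // major version byte for SSL 3.0 through TLS 1.3
	tlsHandshakeClientHello = 0x01 // handshake message type: ClientHello
	tlsRecordHeaderLen      = 5    // type(1) + version(2) + length(2)
)

// looksLikeClientHello reports whether buf starts with a TLS handshake record
// whose first handshake message is a ClientHello.
func looksLikeClientHello(buf []byte) bool {
	if len(buf) <= tlsRecordHeaderLen {
		return false // too short to contain a handshake type byte
	}
	return buf[0] == tlsRecordHandshake &&
		buf[1] == tlsVersionMajor &&
		buf[tlsRecordHeaderLen] == tlsHandshakeClientHello
}

func main() {
	clientHello := []byte{0x16, 0x03, 0x01, 0x00, 0x05, 0x01, 0x00, 0x00, 0x01, 0x00}
	appData := []byte{0x17, 0x03, 0x03, 0x00, 0x02, 0xaa, 0xbb}

	fmt.Println(looksLikeClientHello(clientHello)) // true
	fmt.Println(looksLikeClientHello(appData))     // false
}
```

Encrypted records carry content type 0x17 on the wire, so once a handshake has completed this check should only match when the peer really starts over.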
stateDiagram-v2
[*] --> Ready: Initialize (4 tcpip-forwards)
Ready --> ChanOpen: first tls_data from RPS
ChanOpen --> Active: CHANNEL_OPEN_CONFIRMATION
Active --> Active: TLS records both ways
Active --> Quiet: ErrLMSReadTimeoutNoData<br/>(TLS 1.3 quiet round)
Quiet --> Active: next tls_data from RPS
Active --> Reopen: ClientHello on live channel
Reopen --> ChanOpen: close old, CHANNEL_OPEN new
Active --> Closed: RPS terminal msg / EOF
Closed --> [*]
flowchart LR
LM[Data from LM] --> X{tlsTunnelActive?}
X -- no --> R[method = response]
X -- yes --> T[method = tls_data]
R --> WS[WebSocket → RPS]
T --> WS
Implemented in HandleDataFromLM.
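The routing decision in the flowchart reduces to a single branch. The sketch below uses a hypothetical `outboundMessage` envelope with only two fields and assumes the payload is base64-encoded; the real RPS message format and the body of HandleDataFromLM are not reproduced here.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// outboundMessage is a hypothetical, reduced envelope for data sent to RPS.
type outboundMessage struct {
	Method  string `json:"method"`
	Payload string `json:"payload"`
}

// methodFor picks the message method from the tunnel state, as in the flowchart.
func methodFor(tlsTunnelActive bool) string {
	if tlsTunnelActive {
		return "tls_data"
	}
	return "response"
}

// wrapForRPS wraps bytes received from the local management side for the
// WebSocket. Base64 encoding of the payload is an assumption of this sketch.
func wrapForRPS(data []byte, tlsTunnelActive bool) ([]byte, error) {
	return json.Marshal(outboundMessage{
		Method:  methodFor(tlsTunnelActive),
		Payload: base64.StdEncoding.EncodeToString(data),
	})
}

func main() {
	for _, tunnel := range []bool{false, true} {
		out, err := wrapForRPS([]byte("ok"), tunnel)
		if err != nil {
			fmt.Println("marshal failed:", err)
			return
		}
		fmt.Println(string(out))
	}
}
```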
Pure build plumbing, but required for the LME work:
// go.mod
// uncomment when developing locally
// replace github.com/device-management-toolkit/go-wsman-messages/v2 => ../go-wsman-messages
The apf package lives in go-wsman-messages, and several of the LME
fixes above (new apf.BuildChannelDataBytes, apf.ProtocolVersion,
apf.Session.HandshakeConfirmed field, etc.) depend on changes that
must be iterated in lockstep with rpc-go. The replace directive lets
that happen against a sibling checkout without cutting a release of
go-wsman-messages or carrying a vendor tree.
Operationally:
LME_dev/
├── rpc_go_tls_lme/ # this repo
└── go-wsman-messages/ # uncomment replace line and `go build` works
Two independent changes touched the WSMAN surface in the same window.
Before: internal/commands/diagnostics/wsman.go called the generic
WSMAN body-builder for every class and parsed the XML by hand,
duplicating the schema definitions in
go-wsman-messages.
After: each AMT class gets its own typed fetcher that delegates to the
go-wsman-messages typed helpers
(e.g. AMTGeneralSettings, AMTEthernetPortSettings,
CIM_BootSourceSetting, IPS_HostBasedSetupService, ...). 338 lines
of wsman.go (-124 / +214) move from "raw XML wrangling" to "call typed
function, structure response".
flowchart LR
subgraph before[Before a59d38e]
A1[diagnostics.fetchClass<br/>name string] --> A2[wsman.Get raw XML]
A2 --> A3[hand-parse]
end
subgraph after[After a59d38e]
B1[diagnostics.fetchAMTGeneralSettings] --> B2[go-wsman-messages typed call]
B1b[diagnostics.fetchAMTEthernetPortSettings] --> B2
B1c[...] --> B2
B2 --> B3[typed Go struct]
end
Why it matters for LME work: the diagnostics path is what operators run
to confirm an LME-activated device looks healthy. Keeping it on the
typed API avoids drift when the WSMAN schema evolves (the AMT 18.x and
21.x firmware variants exercised in the
remote_logs/ summaries return
slightly different element sets).
This is the same commit covered above as commit 5 (79135e6). It is repeated
here because it is also a WSMAN-side change: the apf package and the typed
fetchers above all live in go-wsman-messages, and the replace directive in
go.mod makes the rpc-go LME branch buildable against an in-progress
go-wsman-messages branch.
Dependabot has been bumping github.com/device-management-toolkit/go-wsman-messages/v2 through the same window:
| Commit | From → To |
|---|---|
| 894cf18 | 2.46.1 → 2.47.2 |
| 5755aa8 | 2.46.0 → 2.46.1 |
| e773100 | 2.45.x → 2.46.0 |
| d8cd4b3 | earlier |
| (… more in `git log --oneline \| grep go-wsman-messages`) | |
These are mechanical bumps but they bracket the LME work: any
rebuilding of rpc-go from main against pinned versions should use
≥ v2.47.2 for the typed-fetcher and APF features to be present.
| Commit | Layer | One-line outcome |
|---|---|---|
| 9c80f39 | internal/rps/executor.go | Stops tearing down the LME HECI handle between RPS requests. |
| 54f0c81 | internal/lm/engine.go, internal/rps/executor.go, pkg/heci/ | AMT 18.x port-switch works in LME mode; ENODEV is recoverable; post-keygen pause; tunable via env vars. |
| 17cbebb | internal/rps/executor.go | Falls back to LME (on port 16993) when LMS is absent on TLS-enforced AMT. |
| 7dbce6b | internal/lm/, internal/rps/executor.go, pkg/heci/, pkg/utils/constants.go | Persistent APF TLS tunnel survives TLS 1.3 quiet rounds, ClientHello renegotiation, and channel-close races. |
| 79135e6 | go.mod | Allows the LME work to track a sibling go-wsman-messages checkout. |
| a59d38e | internal/commands/diagnostics/wsman.go | Typed AMT-class fetchers replace hand-rolled XML. |
Local LME activation against a TLS-enforced AMT 18.x (no LMS on host):
cd LME_dev/rpc_go_tls_lme
go build -o rpc ./cmd/rpc/main.go
# Tune the AMT-18 specific timing if needed
export RPC_LME_POST_KEYGEN_PAUSE_MS=750
export RPC_PORT_SWITCH_MAX_DELAY=60s
sudo ./rpc activate -u wss://<rps-host>/activate \
    --profile <profile-name> --skip-cert-check
Expected log markers (in order):
- LMS dial failed (...); falling back to LME (in-band HECI/APF)
- Initialize: tcpip-forward ready for 16992 / 16993 / 623 / 664
- LME: Opening new APF channel for this request (× N WSMAN rounds)
- Port switch: successfully switched to port 16993
- port_switch_ack and subsequent tls_data rounds with occasional No LMS data before read timeout for this TLS round-trip; continuing without connection_reset
- RPS terminal success message.
Failures to watch for:
| Log line | Likely cause |
|---|---|
| heci device not initialized after request #2 | regression of 9c80f39 |
| Port switch failed: dial tcp4 127.0.0.1:16993: i/o timeout in LME mode | regression of 54f0c81 (port_switch dialing LMS instead of calling SetPort) |
| Empty response from LMS → connection_reset every round | LME channel torn down mid-handshake; check 7dbce6b ClientHello / quiet-round handling. |