lme_changes for rpc‐go v3

Nabendu M edited this page Jun 10, 2026 · 1 revision

LME / APF activation changes — rpc-go (recent commits)

Documentation of the LME-path (HECI + APF tunnel) activation fixes that landed on top of the main branch of rpc-go, plus the related WSMAN dependency changes.

Scope: legacy ws/wss remote activation flow driven over LME (the in-band APF tunnel on the HECI MEI device) instead of the user-mode LMS daemon. Files touched live under internal/lm/, internal/rps/executor.go, pkg/heci/, internal/commands/diagnostics/wsman.go, and go.mod.


Commit set covered

In topological order (oldest → newest):

| # | SHA | Subject |
|---|---|---|
| 1 | 9c80f39 | fix: keep LME HECI handle open between RPS requests |
| 2 | 54f0c81 | fix(lme): make AMT 18.x LME activation reliable through TLS port switch |
| 3 | 17cbebb | fix(lme): fall back to LME on TLS-enforced AMT when LMS is unavailable |
| 4 | 7dbce6b | fix(lme): stabilize persistent APF TLS tunnel and channel-close recovery |
| 5 | 79135e6 | build(deps): replace go-wsman-messages with local sibling clone |
| 6 | a59d38e | refactor: update wsman.go to use dedicated typed functions for all AMT classes |

Together they take rpc-go from "LMS-only with brittle LME fallback" to "LME is a first-class transport that survives AMT 18.x port-switch and TLS-handshake quirks".


High-level transport picture

flowchart LR
    subgraph host[Host running rpc-go]
        RPC[rpc-go executor]
        LMS[LMS daemon<br/>16992/16993]
        HECI[/dev/mei0/]
    end

    subgraph fw[AMT firmware]
        AMTHTTP[AMT WS-MAN<br/>16992/16993]
        APF[APF server<br/>over MEI]
    end

    RPS[RPS<br/>over WebSocket]

    RPS <-->|ws/wss| RPC
    RPC -->|TCP localhost| LMS --> AMTHTTP
    RPC -->|ioctl| HECI --> APF
    APF -. tcpip-forward .-> AMTHTTP
  • LMS path: rpc-go opens localhost:16992 (or :16993 after TLS port-switch). LMS proxies to AMT.
  • LME path: rpc-go speaks APF directly to the MEI client. AMT publishes tcpip-forward entries for ports 16992, 16993, 623, and 664, and rpc-go opens a new APF channel (CHANNEL_OPEN) per WSMAN request.

NewExecutor selects the transport: it probes LMS first and, if LMS does not answer, initialises an LMEConnection and sets isLME = true.


Commit 1 — 9c80f39 keep LME HECI handle open between RPS requests

Problem

HandleDataFromRPS used to call localManagement.Close() after every single RPS request whenever TLS-tunnel mode was off. For the LMS path that is fine (each WSMAN round is a fresh TCP connection). For LME it tore down:

  • the MEI device handle, and
  • the four tcpip-forward registrations that LMEConnection.Initialize set up exactly once at startup.

The next RPS message (typically the Digest-auth retry) entered prepareLME → Connect → SendMessage and the kernel returned ENODEV because there was no open handle. The user-visible symptom was heci device not initialized.

Fix

Skip the per-request close when running over LME. End-of-flow teardown already happens through MakeItSo's deferred Close.

// internal/rps/executor.go (post-response cleanup)
if e.isLME {
    e.waitGroup.Wait()
}
if !e.tlsTunnelActive {
    if !e.isLME {              // <-- effective gate added by 9c80f39
        e.localManagement.Close()
        e.lmConnected = false
    }
}

State machine before/after

stateDiagram-v2
    [*] --> Initialise: rpc-go starts
    Initialise --> Idle: HECI open, 4 tcpip-forwards
    Idle --> Channel1: RPS msg #1 → CHANNEL_OPEN
    Channel1 --> Idle: response sent

    state old_buggy {
        Idle2 --> ClosedHECI: per-request Close()
        ClosedHECI --> ENODEV: RPS msg #2 dial
    }
    state new_fixed {
        Idle3 --> Channel2: RPS msg #2 → CHANNEL_OPEN (reuses HECI)
        Channel2 --> Idle3
    }

Commit 2 — 54f0c81 reliable AMT 18.x LME activation through TLS port switch

This is the most substantial of the LME commits. It addresses four independent failure modes that all happened to surface on AMT 18.x:

2.1 HECI ENODEV during SendMessage was swallowed

pkg/heci/linux.go now reinitialises the device when the kernel returns ENODEV, and surfaces a new sentinel ErrDeviceReinitialized from pkg/heci/types.go so callers can take action instead of pretending nothing happened. LMEConnection.Connect catches the sentinel and re-runs Initialize before retrying CHANNEL_OPEN.
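
The recovery pattern described above can be sketched as follows. This is a minimal illustration rather than the rpc-go source: `errDeviceReinitialized` stands in for the real `ErrDeviceReinitialized` sentinel, and `sender`, `fakeDevice`, `sendWithRecovery`, and the `reinit` callback are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeviceReinitialized is a stand-in for the sentinel described above.
var errDeviceReinitialized = errors.New("heci: device reinitialized")

// sender abstracts the one call this sketch needs from the HECI layer (hypothetical).
type sender interface {
	SendMessage(payload []byte) error
}

// fakeDevice returns the sentinel for its first `failures` sends, then succeeds.
type fakeDevice struct {
	failures int
	sent     int
}

func (d *fakeDevice) SendMessage(payload []byte) error {
	if d.failures > 0 {
		d.failures--
		// The sentinel is wrapped, so callers must match it with errors.Is.
		return fmt.Errorf("send: %w", errDeviceReinitialized)
	}
	d.sent++
	return nil
}

// sendWithRecovery sends once. If the device reports that it was
// reinitialized, it re-runs the handshake via reinit and retries once.
func sendWithRecovery(dev sender, reinit func() error, payload []byte) error {
	err := dev.SendMessage(payload)
	if !errors.Is(err, errDeviceReinitialized) {
		return err // nil, or an unrelated error the caller must handle
	}
	if err := reinit(); err != nil {
		return fmt.Errorf("reinitialize after ENODEV: %w", err)
	}
	return dev.SendMessage(payload)
}

func main() {
	dev := &fakeDevice{failures: 1}
	reinits := 0
	err := sendWithRecovery(dev, func() error { reinits++; return nil }, []byte("CHANNEL_OPEN"))
	fmt.Println(err, reinits, dev.sent)
}
```

Returning a distinct sentinel, instead of silently retrying inside the HECI layer, leaves the decision to re-run the APF handshake with the caller that owns the session state.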

2.2 Initialize returned too early

The init handler now waits for all four expected tcpip-forward ports (16992, 16993, 623, 664) before declaring the device ready, rather than firing on the first one. This stops the activation loop from racing the trailing forwards and then choosing the wrong target port.
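
A minimal sketch of the "wait for all four forwards" idea, assuming the init handler is notified once per confirmed port. The `forwardTracker` type and its methods are hypothetical and do not appear in rpc-go.

```go
package main

import "fmt"

// forwardTracker records which tcpip-forward ports are still outstanding.
// Hypothetical helper for illustration only.
type forwardTracker struct {
	pending map[uint32]struct{}
}

func newForwardTracker(ports ...uint32) *forwardTracker {
	t := &forwardTracker{pending: make(map[uint32]struct{}, len(ports))}
	for _, p := range ports {
		t.pending[p] = struct{}{}
	}
	return t
}

// Confirm marks one port as forwarded and reports whether every expected
// port has now been seen. Unknown or repeated ports leave the state unchanged.
func (t *forwardTracker) Confirm(port uint32) bool {
	delete(t.pending, port)
	return len(t.pending) == 0
}

func main() {
	t := newForwardTracker(16992, 16993, 623, 664)
	for _, p := range []uint32{16992, 16993, 623, 664} {
		fmt.Printf("port %d confirmed, ready=%v\n", p, t.Confirm(p))
	}
}
```

Only the last confirmation flips the tracker to ready, which is the behaviour the fix relies on to avoid racing the trailing forwards.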

2.3 Post-GenerateKeyPair ENODEV blackout

After AMT persists the new key, the MEI client for LME briefly disappears on 18.x, costing ~6 s of re-handshake on the next channel open. After forwarding the GenerateKeyPair response the executor now inserts a short pause (default 750 ms) to ride that out.

2.4 port_switch was treated as LMS reconnect

The legacy code path always dialled 127.0.0.1:16993 on port_switch, which obviously fails when LMS is not running. In LME mode the executor now just calls LMEConnection.SetPort(16993), flips tlsTunnelActive, and sends port_switch_ack over the existing APF channel.

Tunable knobs (env vars)

| Variable | Default | Purpose |
|---|---|---|
| RPC_PORT_SWITCH_MAX_DELAY | 60s | upper clamp on RPS-supplied delay |
| RPC_PORT_SWITCH_EXTRA_DELAY | 0s | added on top of server delay |
| RPC_LME_POST_KEYGEN_PAUSE_MS | 750 | pause after GenerateKeyPair |
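
One way the three knobs could be read and combined is sketched below. Several points are assumptions of this sketch: that the two delay variables accept Go duration strings (as the `60s` and `0s` defaults suggest), that the pause variable is an integer number of milliseconds, and that the server delay is clamped first and the extra delay added afterwards. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// durationEnv reads a Go duration string such as "60s". It falls back to def
// when the variable is unset, malformed, or negative.
func durationEnv(name string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(os.Getenv(name))
	if err != nil || d < 0 {
		return def
	}
	return d
}

// millisEnv reads an integer number of milliseconds with the same fallback rule.
func millisEnv(name string, def time.Duration) time.Duration {
	ms, err := strconv.Atoi(os.Getenv(name))
	if err != nil || ms < 0 {
		return def
	}
	return time.Duration(ms) * time.Millisecond
}

// portSwitchDelay clamps the RPS-supplied delay to maxDelay, then adds extra.
// The clamp-then-add order is an assumption of this sketch.
func portSwitchDelay(server, maxDelay, extra time.Duration) time.Duration {
	if server < 0 {
		server = 0
	}
	if server > maxDelay {
		server = maxDelay
	}
	return server + extra
}

func main() {
	maxDelay := durationEnv("RPC_PORT_SWITCH_MAX_DELAY", 60*time.Second)
	extra := durationEnv("RPC_PORT_SWITCH_EXTRA_DELAY", 0)
	pause := millisEnv("RPC_LME_POST_KEYGEN_PAUSE_MS", 750*time.Millisecond)
	fmt.Println(portSwitchDelay(90*time.Second, maxDelay, extra), pause)
}
```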

Sequence: AMT 18.x port-switch in LME mode

sequenceDiagram
    autonumber
    participant RPS
    participant RPC as rpc-go executor
    participant LME as LMEConnection
    participant HECI as /dev/mei0
    participant AMT

    RPS->>RPC: wsman request #N (Digest, etc.)
    RPC->>LME: prepareLME / Connect
    LME->>HECI: CHANNEL_OPEN
    HECI->>AMT: APF
    AMT-->>HECI: CHANNEL_OPEN_CONFIRMATION
    RPC->>AMT: GenerateKeyPair (via channel)
    AMT-->>RPC: response
    Note over RPC: pause RPC_LME_POST_KEYGEN_PAUSE_MS<br/>(MEI client about to vanish)
    RPS->>RPC: port_switch{port:16993, delay}
    RPC->>RPC: clamp delay ≤ 60s, sleep
    RPC->>LME: SetPort(16993)
    RPC-->>RPS: port_switch_ack
    Note over RPC,AMT: tlsTunnelActive = true
    RPS->>RPC: tls_data
    RPC->>LME: CHANNEL_OPEN on port 16993

Recovery on ENODEV during send

sequenceDiagram
    autonumber
    participant RPC
    participant HECI
    participant LME

    RPC->>HECI: SendMessage
    HECI-->>RPC: ENODEV
    HECI->>HECI: re-open MEI client
    HECI-->>RPC: ErrDeviceReinitialized
    RPC->>LME: Initialize (re-handshake, wait 4 tcpip-forwards)
    LME-->>RPC: ready
    RPC->>HECI: SendMessage (retry)
    HECI-->>RPC: OK

Commit 3 — 17cbebb LME fallback on TLS-enforced AMT when LMS is missing

Problem

When the operator runs against an AMT with TLS-on-LMS enforced (LocalTlsEnforced=true) but the LMS service is not installed on the host (e.g. a headless Linux box), the executor used to return utils.LMSConnectionFailed and abort. Aborting is unnecessary, because LME can reach AMT in-band.

Fix

// internal/rps/executor.go (NewExecutor, abbreviated)
err := client.localManagement.Connect()
if err != nil {
    log.Tracef("LMS dial failed (%v); falling back to LME (in-band HECI/APF)", err)

    lme := lm.NewLMEConnection(lmDataChannel, lmErrorChannel, client.waitGroup)
    if config.LocalTlsEnforced {
        lme.SetPort(16993) // start on TLS port, no port_switch needed
    }
    client.localManagement = lme
    client.isLME = true
    client.localManagement.Initialize()
}

When TLS is enforced, the first CHANNEL_OPEN already targets 16993, so RPS does not need to issue a port_switch mid-flow.

Decision flow

flowchart TD
    A[NewExecutor] --> B{LMS dial OK?}
    B -- yes --> C[Use LMSConnection]
    B -- no --> D{LocalTlsEnforced?}
    D -- yes --> E[LMEConnection.SetPort 16993]
    D -- no  --> F[LMEConnection on 16992]
    E --> G[Initialize HECI + APF]
    F --> G
    G --> H[isLME = true]

Commit 4 — 7dbce6b stabilise persistent APF TLS tunnel and channel-close recovery

This is the largest single diff in the series: roughly 1.2k lines across internal/lm/apf_conn.go (new), internal/lm/engine.go, internal/local/amt/localTransport.go, internal/rps/executor.go, pkg/heci/linux.go, and pkg/utils/constants.go.

It hardens the persistent APF channel that the executor relies on in tunnel mode (phase 2 of the activation, where RPS just streams opaque TLS records as tls_data):

  • A dedicated apf_conn.go module owns per-channel state (sender id, TX window, handshake confirmation flag) so Connect/Send/Listen stop sharing mutable bookkeeping with the global apf.Session.
  • Connect now resets per-request session state explicitly (Tempdata, SenderChannel, TXWindow, HandshakeConfirmed) and retries CHANNEL_OPEN up to four times when the device is briefly unavailable.
  • Send switches to apf.BuildChannelDataBytes and debits the TX window by payload length rather than framed length, eliminating a slow drift that could starve the window on long TLS handshakes.
  • The executor recognises a TLS ClientHello (0x16 .. 0x01) on an already-connected channel and closes the channel before forwarding it, so AMT can begin a fresh handshake on a new APF channel rather than receiving a ClientHello mid-session.
  • HandleDataFromRPS gained a 90 s response timeout in tunnel mode and graceful handling of lm.ErrLMSReadTimeoutNoData (TLS 1.3 quiet rounds, e.g. after client Finished), so the persistent channel is not torn down on legitimately silent rounds.
  • HECI read-timeouts are reclassified as Warn/Debug instead of Error, matching their benign nature in the polling loop.
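
The ClientHello check from the list above can be sketched as a small predicate. It relies only on the TLS record layout (one content-type byte, two version bytes, two length bytes, then the handshake message type); the function name and the exact set of bytes the executor inspects are assumptions of this sketch.

```go
package main

import "fmt"

const (
	tlsRecordHandshake   = 0x16 // TLS record content type: handshake
	tlsHandshakeHello    = 0x01 // handshake message type: ClientHello
	tlsRecordHeaderBytes = 5    // content type (1) + version (2) + length (2)
)

// isClientHello reports whether buf begins with a TLS handshake record whose
// first handshake message is a ClientHello.
func isClientHello(buf []byte) bool {
	return len(buf) > tlsRecordHeaderBytes &&
		buf[0] == tlsRecordHandshake &&
		buf[tlsRecordHeaderBytes] == tlsHandshakeHello
}

func main() {
	hello := []byte{0x16, 0x03, 0x01, 0x00, 0x05, 0x01, 0x00, 0x00, 0x01, 0x00}
	appData := []byte{0x17, 0x03, 0x03, 0x00, 0x01, 0xff}
	fmt.Println(isClientHello(hello), isClientHello(appData))
}
```

A caller holding an already-open channel would close it when this predicate returns true and open a fresh one before forwarding the bytes.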

Persistent tunnel lifecycle (post-fix)

stateDiagram-v2
    [*] --> Ready: Initialize (4 tcpip-forwards)
    Ready --> ChanOpen: first tls_data from RPS
    ChanOpen --> Active: CHANNEL_OPEN_CONFIRMATION
    Active --> Active: TLS records both ways
    Active --> Quiet: ErrLMSReadTimeoutNoData<br/>(TLS 1.3 quiet round)
    Quiet --> Active: next tls_data from RPS
    Active --> Reopen: ClientHello on live channel
    Reopen --> ChanOpen: close old, CHANNEL_OPEN new
    Active --> Closed: RPS terminal msg / EOF
    Closed --> [*]

Response framing back to RPS

flowchart LR
    LM[Data from LM] --> X{tlsTunnelActive?}
    X -- no --> R[method = response]
    X -- yes --> T[method = tls_data]
    R --> WS[WebSocket → RPS]
    T --> WS

Implemented in HandleDataFromLM.
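
The framing decision can be sketched as below. Only the two method names, `response` and `tls_data`, come from this page. The `rpsMessage` envelope, its field names, and the base64 payload encoding are hypothetical simplifications and not the actual RPS message schema.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// rpsMessage is a hypothetical, reduced envelope for illustration only.
type rpsMessage struct {
	Method  string `json:"method"`
	Payload string `json:"payload"`
}

// methodFor picks the RPS method name based on whether tunnel mode is active.
func methodFor(tlsTunnelActive bool) string {
	if tlsTunnelActive {
		return "tls_data"
	}
	return "response"
}

// frameForRPS wraps raw bytes from local management into the sketch envelope.
func frameForRPS(data []byte, tlsTunnelActive bool) ([]byte, error) {
	return json.Marshal(rpsMessage{
		Method:  methodFor(tlsTunnelActive),
		Payload: base64.StdEncoding.EncodeToString(data),
	})
}

func main() {
	out, err := frameForRPS([]byte{0x17, 0x03, 0x03}, true)
	fmt.Println(string(out), err)
}
```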


Commit 5 — 79135e6 local clone of go-wsman-messages

Pure build plumbing, but required for the LME work:

// go.mod
// uncomment when developing locally
// replace github.com/device-management-toolkit/go-wsman-messages/v2 => ../go-wsman-messages

The apf package lives in go-wsman-messages, and several of the LME fixes above (new apf.BuildChannelDataBytes, apf.ProtocolVersion, apf.Session.HandshakeConfirmed field, etc.) depend on changes that must be iterated in lockstep with rpc-go. The replace directive lets that happen against a sibling checkout without cutting a release of go-wsman-messages or carrying a vendor tree.

Operationally:

LME_dev/
├── rpc_go_tls_lme/           # this repo
└── go-wsman-messages/        # uncomment replace line and `go build` works

WSMAN changes

Two commits touched the WSMAN surface in the same window, alongside routine dependency bumps.

a) a59d38e — typed AMT-class fetchers in diagnostics/wsman.go

Before: internal/commands/diagnostics/wsman.go called the generic WSMAN body-builder for every class and parsed the XML by hand, duplicating the schema definitions in go-wsman-messages.

After: each AMT class gets its own typed fetcher that delegates to the go-wsman-messages typed helpers (e.g. AMTGeneralSettings, AMTEthernetPortSettings, CIM_BootSourceSetting, IPS_HostBasedSetupService, ...). 338 lines of wsman.go (-124 / +214) move from "raw XML wrangling" to "call typed function, structure response".
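
To illustrate why a typed path is less brittle than hand-parsing, here is a small sketch that uses only the standard `encoding/xml` package. The `generalSettings` struct, the `parseGeneralSettings` helper, and the sample body are hypothetical simplifications (no SOAP envelope, no namespaces). They are not the go-wsman-messages types or API.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// generalSettings is a hypothetical, trimmed-down response type.
type generalSettings struct {
	HostName   string `xml:"HostName"`
	DomainName string `xml:"DomainName"`
}

// parseGeneralSettings decodes the body into a typed struct, so callers read
// fields by name instead of searching the XML text themselves.
func parseGeneralSettings(body []byte) (generalSettings, error) {
	var gs generalSettings
	err := xml.Unmarshal(body, &gs)
	return gs, err
}

func main() {
	// Simplified sample body, invented for this example.
	body := []byte(`<AMT_GeneralSettings><HostName>node-1</HostName><DomainName>example.test</DomainName></AMT_GeneralSettings>`)
	gs, err := parseGeneralSettings(body)
	fmt.Println(gs.HostName, gs.DomainName, err)
}
```

Elements the struct does not declare are ignored by the decoder, so a firmware variant that returns extra elements does not break the caller.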

flowchart LR
    subgraph before[Before a59d38e]
        A1[diagnostics.fetchClass<br/>name string] --> A2[wsman.Get raw XML]
        A2 --> A3[hand-parse]
    end
    subgraph after[After a59d38e]
        B1[diagnostics.fetchAMTGeneralSettings] --> B2[go-wsman-messages typed call]
        B1b[diagnostics.fetchAMTEthernetPortSettings] --> B2
        B1c[...] --> B2
        B2 --> B3[typed Go struct]
    end

Why it matters for LME work: the diagnostics path is what operators run to confirm an LME-activated device looks healthy. Keeping it on the typed API avoids drift when the WSMAN schema evolves (the AMT 18.x and 21.x firmware variants exercised in the remote_logs/ summaries return slightly different element sets).

b) 79135e6 (above) — local sibling clone

This is the same commit covered under Commit 5. It is repeated here because it is also a WSMAN-side change: the apf package and the typed fetchers above all live in go-wsman-messages, and the replace directive in go.mod makes the rpc-go LME branch buildable against an in-progress go-wsman-messages branch.

c) Continuous dependency bumps

Dependabot has been bumping github.com/device-management-toolkit/go-wsman-messages/v2 through the same window:

| Commit | From → To |
|---|---|
| 894cf18 | 2.46.1 → 2.47.2 |
| 5755aa8 | 2.46.0 → 2.46.1 |
| e773100 | 2.45.x → 2.46.0 |
| d8cd4b3 | earlier |

(… more in `git log --oneline | grep go-wsman-messages`)

These are mechanical bumps, but they bracket the LME work: any rebuild of rpc-go from main against a pinned version should use v2.47.2 or later so that the typed-fetcher and APF features are present.


Summary table

| Commit | Layer | One-line outcome |
|---|---|---|
| 9c80f39 | internal/rps/executor.go | Stops tearing down the LME HECI handle between RPS requests. |
| 54f0c81 | internal/lm/engine.go, internal/rps/executor.go, pkg/heci/ | AMT 18.x port-switch works in LME mode; ENODEV is recoverable; post-keygen pause; tunable via env vars. |
| 17cbebb | internal/rps/executor.go | Falls back to LME (on port 16993) when LMS is absent on TLS-enforced AMT. |
| 7dbce6b | internal/lm/, internal/rps/executor.go, pkg/heci/, pkg/utils/constants.go | Persistent APF TLS tunnel survives TLS 1.3 quiet rounds, ClientHello renegotiation, and channel-close races. |
| 79135e6 | go.mod | Allows the LME work to track a sibling go-wsman-messages checkout. |
| a59d38e | internal/commands/diagnostics/wsman.go | Typed AMT-class fetchers replace hand-rolled XML. |

Quick verification recipes

Local LME activation against a TLS-enforced AMT 18.x (no LMS on host):

cd LME_dev/rpc_go_tls_lme
go build -o rpc ./cmd/rpc/main.go

# Tune the AMT-18 specific timing if needed
export RPC_LME_POST_KEYGEN_PAUSE_MS=750
export RPC_PORT_SWITCH_MAX_DELAY=60s

sudo ./rpc activate -u wss://<rps-host>/activate \
    --profile <profile-name> --skip-cert-check

Expected log markers (in order):

  1. LMS dial failed (...); falling back to LME (in-band HECI/APF)
  2. Initialize: tcpip-forward ready for 16992 / 16993 / 623 / 664
  3. LME: Opening new APF channel for this request (× N WSMAN rounds)
  4. Port switch: successfully switched to port 16993
  5. port_switch_ack and subsequent tls_data rounds with occasional No LMS data before read timeout for this TLS round-trip; continuing without connection_reset
  6. RPS terminal success message.
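
Checking that the markers appear in the expected order can be automated with a short helper such as the one below. It is a sketch: the marker substrings are taken from the list above, while `firstMissing` and the sample log lines are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// firstMissing walks the log once and returns the index of the first marker
// that was not found in order, or -1 when every marker appeared in sequence.
// Each log line can satisfy at most one marker.
func firstMissing(logLines, markers []string) int {
	next := 0
	for _, line := range logLines {
		if next < len(markers) && strings.Contains(line, markers[next]) {
			next++
		}
	}
	if next == len(markers) {
		return -1
	}
	return next
}

func main() {
	markers := []string{
		"falling back to LME (in-band HECI/APF)",
		"Initialize: tcpip-forward ready for",
		"LME: Opening new APF channel for this request",
		"Port switch: successfully switched to port 16993",
	}
	// Illustrative log excerpt, not captured from a real run.
	logLines := []string{
		"LMS dial failed (connection refused); falling back to LME (in-band HECI/APF)",
		"Initialize: tcpip-forward ready for 16992 / 16993 / 623 / 664",
		"LME: Opening new APF channel for this request",
		"Port switch: successfully switched to port 16993",
	}
	fmt.Println(firstMissing(logLines, markers))
}
```

A non-negative result points at the first stage that did not log as expected, which narrows down which commit's behaviour to inspect.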

Failures to watch for:

| Log line | Likely cause |
|---|---|
| heci device not initialized after request #2 | regression of 9c80f39 |
| Port switch failed: dial tcp4 127.0.0.1:16993: i/o timeout in LME mode | regression of 54f0c81 (port_switch dialing LMS instead of calling SetPort) |
| Empty response from LMS → connection_reset every round | LME channel torn down mid-handshake; check 7dbce6b ClientHello / quiet-round handling. |
