Improve hardware context selection with register validation #18

Merged
hackwa merged 3 commits into Xilinx:main from snigdha-gupta:AIESW-32969
May 21, 2026

@snigdha-gupta
Contributor

Summary

Refactor check_hw_context() so MLDebugger can pick the right XRT hardware context when xrt-smi lists multiple hardware contexts.

  • Adds _validate_contexts_with_read() to probe all contexts with a read-only CORE_STATUS register access before attaching.
  • Keeps selection logic in check_hw_context(); validation only returns None or a list of (context_id, pid) pairs that succeeded.

Motivation

On Telluride (and similar setups), xrt-smi often reports several contexts for the same PID. Previously, selection could block indefinitely on input() with no timeout. With this PR, MLDebugger identifies a valid context through a register probe: if it can read CORE_STATUS on a context, that handle is at least usable for debug attach.

Behavior

| Situation | Action |
| --- | --- |
| Exactly one context from xrt-smi | Auto-select (unchanged) |
| Multiple contexts, one passes probe | Auto-select validated context |
| Multiple contexts, several pass probe | Print table of validated contexts only; prompt user (60s timeout) |
| Multiple contexts, none pass probe | Print full xrt-smi table; prompt user (60s timeout) |
| xrt-smi failure | Fall back to manual PID/CTX entry (60s timeout each) |
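
The situations listed above can be read as a small decision function. The sketch below is an assumption-laden illustration, not the code in check_hw_context(): `plan_selection` and the action names are hypothetical, `listed` stands for the (context_id, pid) pairs parsed from xrt-smi (or `None` when xrt-smi fails), and an empty listing is treated like a failure, which the PR does not state.

```python
from typing import List, Optional, Tuple

Context = Tuple[int, int]  # (context_id, pid)


def plan_selection(
    listed: Optional[List[Context]],
    validated: Optional[List[Context]],
) -> Tuple[str, List[Context]]:
    """Map each situation to an action name plus the contexts it applies to."""
    if not listed:
        # xrt-smi failed (or, by assumption, listed nothing): manual PID/CTX entry.
        return ("manual_entry", [])
    if len(listed) == 1:
        # A single listed context is selected without probing.
        return ("auto_select", listed)
    if not validated:
        # Nothing passed the probe: show everything xrt-smi reported.
        return ("prompt_full_table", listed)
    if len(validated) == 1:
        return ("auto_select", validated)
    # Several contexts passed: show only those and let the user choose.
    return ("prompt_validated_table", validated)


print(plan_selection([(1, 10), (2, 10)], [(2, 10)]))
```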

On timeout or invalid context ID, call cleanup_and_exit(args, 1).
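
One way to bound a prompt is to run the blocking read in a daemon thread and wait on a queue. The PR text does not show how the timeout is implemented, so the following is a sketch under that assumption: `prompt_with_timeout`, `parse_context_id`, and the `reader` parameter are hypothetical, and the constant name is borrowed from the docstring quoted in the review, with the 60-second value taken from the Behavior section.

```python
import queue
import threading
from typing import Callable, Iterable, Optional

HW_CONTEXT_INPUT_TIMEOUT_S = 60  # value assumed from the Behavior section


def prompt_with_timeout(
    prompt: str,
    timeout_s: float = HW_CONTEXT_INPUT_TIMEOUT_S,
    reader: Callable[[str], str] = input,
) -> Optional[str]:
    """Return the user's reply, or None if nothing arrives within timeout_s."""
    replies: "queue.Queue[Optional[str]]" = queue.Queue(maxsize=1)

    def _worker() -> None:
        try:
            replies.put(reader(prompt))
        except EOFError:
            replies.put(None)  # closed stdin counts as no answer

    # Daemon thread: it may stay blocked on input() after a timeout,
    # which is acceptable when the caller exits right afterwards.
    threading.Thread(target=_worker, daemon=True).start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        return None


def parse_context_id(reply: Optional[str], known_ids: Iterable[int]) -> Optional[int]:
    """Return the chosen context ID, or None on timeout or an invalid entry."""
    if reply is None:
        return None
    try:
        ctx = int(reply.strip())
    except ValueError:
        return None
    return ctx if ctx in set(known_ids) else None
```

A caller would treat a `None` result from either function as the cue to call cleanup_and_exit(args, 1).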

Signed-off-by: snigupta <snigupta@amd.com>
Signed-off-by: snigupta <snigupta@amd.com>
Comment thread src/mldebug/input_parser.py Outdated

1. If only one context exists, auto-select it.
2. If multiple exist, validate all (Active and Idle) with register/program-memory read.
3. If no context passes validation, prompt the user (which times out after ``HW_CONTEXT_INPUT_TIMEOUT_S`` seconds and calls ``cleanup_and_exit(args, 1)`` on failure / timeout).
Member

This line is too long. Please limit each line to 120 characters.

Comment thread src/mldebug/input_parser.py Outdated
"""
# Load AIE interface if not provided
if aie_iface is None:
    aie_iface = loader.load_aie_arch(device)
Member

Please use args.aie_iface instead of defining a new one.

Comment thread src/mldebug/input_parser.py Outdated
# CORE_STATUS register - safe read-only register
# Device-specific addresses: Telluride=0x38004, PHX/STX=0x32004
if "CORE_STATUS" not in aie_iface.Core_registers:
    raise RuntimeError(f"CORE_STATUS register not defined for device {device}")
Member

This is unnecessary. We always have CORE_STATUS.

    print(f"[INFO] Context {ctx} validated successfully (CORE_STATUS=0x{reg_value:08x})")
    valid_contexts.append((ctx, pid))

except Exception as e:
Member

In the future, it would be good to catch the specific exception.

Comment thread src/mldebug/input_parser.py Outdated
reg_value = backend.read_register(test_col, test_row, test_reg)

# This context passed validation
print(f"[INFO] Context {ctx} validated successfully (CORE_STATUS=0x{reg_value:08x})")
Member

Can we comment out these prints? They are good for debugging but don't give the end user any information.

Comment thread src/mldebug/input_parser.py Outdated
    ctx = int(list(current_contexts.keys())[0])
    pid = int(list(current_contexts.values())[0]["pid"])
else:
    print(f"[INFO] Auto-selected single context: {ctx}")
Member

Can we comment out these prints? They are good for debugging but don't give the end user any information.

Signed-off-by: snigupta <snigupta@amd.com>
@hackwa hackwa merged commit e0b9230 into Xilinx:main May 21, 2026
1 check passed
@snigdha-gupta snigdha-gupta deleted the AIESW-32969 branch May 28, 2026 23:30
2 participants