Fix crash in xdna_aie_array initialization#990
Conversation
Signed-off-by: Bikash Singha <bisingha@xcobisingha50x.amd.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical crash in xdna_aie_array initialization that occurred during multi-process stress tests. The crash was caused by file descriptor leaks from using a debug API, incorrect partition column counts in multi-process scenarios, and improper cleanup order during context destruction.
Changes:
- Replaced debug API
XAie_GetPartitionFdList()with a new kernel IOCTLDRM_AMDXDNA_HWCTX_AIE_PART_FDto obtain a single correct FD without leaks - Added PID-based filtering when retrieving partition info to ensure correct column counts in multi-process tests
- Moved xdna_aie_array destruction to occur before hardware context destruction for proper cleanup order
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/shim_ve2/xdna_hwctx.cpp | Added PID check for partition info retrieval and explicit aie_array destruction before hwctx cleanup |
| src/shim_ve2/xdna_aie_array.h | Added private method get_aie_partition_fd for FD retrieval |
| src/shim_ve2/xdna_aie_array.cpp | Replaced debug API with new helper method using kernel IOCTL to get partition FD |
| src/include/uapi/drm_local/amdxdna_accel.h | Added new parameter DRM_AMDXDNA_HWCTX_AIE_PART_FD for IOCTL interface |
| src/driver/amdxdna/ve2_debug.c | Implemented kernel-side handler for new AIE partition FD retrieval IOCTL |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static int ve2_get_aie_part_fd(struct amdxdna_client *client, | ||
| struct amdxdna_drm_get_array *args) |
There was a problem hiding this comment.
The function uses args->num_element to pass the hwctx handle, which is semantically unclear. Consider using a more descriptive field name or adding a comment explaining this non-obvious usage of num_element as a handle parameter.
There was a problem hiding this comment.
Making use of the exiting framework.
|
|
||
| if (copy_to_user(u64_to_user_ptr(args->buffer), &aie_fd, sizeof(aie_fd))) { | ||
| XDNA_ERR(xdna, "Failed to copy AIE partition FD to user"); | ||
| close_fd(aie_fd); |
There was a problem hiding this comment.
The file descriptor is closed on copy_to_user failure, but if copy_to_user succeeds, the FD is transferred to userspace without being tracked. This could lead to FD leaks if the userspace process terminates unexpectedly before using or closing the FD. Consider using get_unused_fd_flags with fd_install pattern for proper FD lifecycle management.
There was a problem hiding this comment.
FD is getting closed from the XAie_Finish() which is being called in ~xdna_aie_array()
|
What is partition FD? Is it an ID for a single partition? Why would application care about this ID? How does it map to aie2/aie4 platforms? |
Yes Max. This FD represents single partition. This is required for XDP (Profiling use case) in AIE2PS. This is not required in aie2/aie4 platforms. |
Why we can't use ctx ID for this? Each ctx can only be created on one partition for its entire life cycle, right? So, one ctx maps to one partition, no? |
This FD is created by AIE driver, We get this FD by calling aie_partition_get_fd function |
Problem and fix for https://jira.xilinx.com/browse/AIESW-21965:
While running multi process stress tests severe crash occurred. Earlier while initializing xdna_aie_array, a debug API XAie_GetPartitionFdList() was being used and ve2 userspace used to leak FDs and resources were not getting freed properly which led to crash in stress tests.
Rootcauses:
Fixes:
Tested: