Proposed fix for issue 48. #49
Conversation
The primary culprit for the elusive "magic stopped" problem is a boundary condition: the starting adjacent descriptor count written to the DMA engine is wrong, which causes the engine to attempt to read across a page boundary.
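For illustration, the boundary condition can be sketched as follows. XDMA descriptors are 32 bytes each, and the engine's adjacent fetch must stay within a 4 KiB region, so the maximum adjacent count depends on the offset of the next descriptor within its page. The function name and constants below are illustrative, not the driver's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define XDMA_DESC_SIZE 32u      /* each XDMA descriptor is 32 bytes   */
#define REGION_4K      0x1000u  /* adjacent fetches must not cross 4K */

/* Hypothetical helper: how many further descriptors the engine may
 * fetch adjacently after the descriptor at bus address next_lo,
 * without the fetch crossing a 4 KiB boundary (mirrors the intent of
 * the driver's max_adj_4k computation). */
static uint32_t max_adjacent(uint32_t next_lo)
{
    uint32_t room = REGION_4K - (next_lo & (REGION_4K - 1u));
    return room / XDMA_DESC_SIZE - 1u;
}
```

For a descriptor at the start of a page this yields 127 (128 slots in the page minus the descriptor itself); for the last 32-byte slot in a page it yields 0, i.e. no adjacent fetch is allowed.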
It would be good if Xilinx actually provided sample FPGA designs to test their drivers. Did you have to make your own block diagram (logic) to do these tests?
@hmaarrfk Yes, a fellow teammate did the design, and by luck one of our simple large-transfer tests caught this bug very reliably, because the unaligned user-space address caused what you might call a striping pattern in the descriptor bus addresses. This kind of work is really difficult to do alone.
I totally agree. Xilinx is really bad at leading this collaborative software design, though. Quite a few PRs and issues have been raised with little attention from any developers or product managers at Xilinx. You know, I think I had some bugs with unaligned memory (using QDMA), but I started to use an aligned allocator since I figured there was likely a bug lurking in there somewhere. I guess I'll circle back to this later with what you found in mind. Thanks for sharing!
Xilinx has made assumptions about the underlying architecture regarding the adjacent descriptor fetch scheme. So I've added some constants that make the code work (at least on aarch64). Another thing I noticed is that XDMA_TRANSFER_MAX_DESC is too large for all the hardware the driver pretends to support. I've lowered it to 512, because that's what matches my hardware. The danger is FIFO overflow, and I'm not sure what kind of error that would result in.
(0x1000 - ((le32_to_cpu(desc->next_lo)) & 0xFFF)) / 32 - 1;
max_adj_4k = (PAGE_SIZE_X86 - ((le32_to_cpu(desc->next_lo)) & PAGE_MASK_X86)) / sizeof(struct xdma_desc) - 1;
This macro is hardcore. I guess you all have moved on to aarch ;)
I'm trying to make sure the driver works on both x86 and aarch architectures, which is why I made this change. The adjacency accesses bake in x86 page-size assumptions, so I propose adding these macros instead of just using the standard PAGE_SIZE and PAGE_MASK. Either way it's much better than the magic numbers in the current code.
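A minimal sketch of the proposed macros (names taken from the diff above). Note that as used here, PAGE_MASK_X86 masks the low offset bits (0xFFF), which is the opposite of the kernel's usual PAGE_MASK convention:

```c
#include <assert.h>

/* The XDMA engine's adjacent-descriptor fetch is bounded by 4 KiB
 * regions regardless of the host page size (which can be 16K or 64K
 * on aarch64 configs), so the driver should not use the kernel's
 * PAGE_SIZE/PAGE_MASK here. These definitions are a sketch of the
 * macros used in the diff, not verbatim driver code. */
#define PAGE_SIZE_X86 4096u
#define PAGE_MASK_X86 (PAGE_SIZE_X86 - 1u)  /* 0xFFF: low offset bits */
```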
@woshixiaohuya I found the change necessary on both x86 and aarch architectures, because it's a bug in the boundary condition. The macros are there to help it work on aarch.
Okay, thank you!
@@ -66,7 +66,7 @@
 #define XDMA_OFS_CONFIG (0x3000UL)

 /* maximum number of desc per transfer request */
-#define XDMA_TRANSFER_MAX_DESC (2048)
+#define XDMA_TRANSFER_MAX_DESC (512)
Was this a necessary change for the default configuration? Can you comment on why this is necessary?
I think this change isn't necessary. It didn't solve my problem (see #54).
The documentation from Xilinx (https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf) notes that Gen3x16 has a descriptor FIFO depth of 1024 and Gen3x8 (which is what we're using) has a depth of 512. I'm having trouble interpreting the language on pages 24 and 25, but it sounds like the engine might be smart about not overflowing this FIFO. What's certain is that the engine can't possibly fetch more than the FIFO allows at once, so I believe the value I've shown is safest for all hardware.
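As an illustration of that reasoning, one could clamp the per-transfer descriptor count to the FIFO depth of the configuration in use. The depths come from PG195 as quoted above; the clamp function itself is a sketch, not the driver's actual code:

```c
#include <assert.h>

/* Descriptor FIFO depths per PG195 for two common XDMA configs. */
#define DESC_FIFO_DEPTH_GEN3X8  512
#define DESC_FIFO_DEPTH_GEN3X16 1024

/* Hypothetical clamp: never submit more descriptors per transfer
 * request than the engine's descriptor FIFO can hold at once. */
static int clamp_max_desc(int requested, int fifo_depth)
{
    return requested > fifo_depth ? fifo_depth : requested;
}
```

With the driver's old default of 2048, a Gen3x8 core would be clamped to 512, matching the value proposed in this PR.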
I would be genuinely curious if anyone has benchmarked the performance at different values for this constant.
There is a report of a race condition here:
https://forums.aws.amazon.com/thread.jspa?threadID=295643
It refers to timeout_ms. Maybe your review and dev knowledge will yield some ideas...
I would open a new issue for that. It doesn't seem related to this. You need details like which commit hash you're using and the sample program that caused the error. For what it's worth, it seems odd to me that the engine apparently times out after only one descriptor completed (see completed_desc_count). Also, FYI, error 512 is ERESTARTSYS and is explicitly returned by xdma_xfer_submit on timeout.
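For reference, ERESTARTSYS lives in the kernel's private errno range and is normally consumed by the signal-handling machinery before user space ever sees it; a driver returning it directly on a timeout path with no pending signal is what lets the raw value 512 leak out. The value below is from the kernel's include/linux/errno.h; the decode helper is purely illustrative:

```c
#include <assert.h>
#include <string.h>

/* Kernel-internal errno (include/linux/errno.h); values >= 512 are
 * never meant to reach user space. */
#define ERESTARTSYS 512

/* Hypothetical helper for making sense of the mystery "error 512"
 * reported in this thread. */
static const char *decode_xdma_err(int err)
{
    return (err == ERESTARTSYS) ? "ERESTARTSYS (kernel-internal)"
                                : "other error";
}
```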
Hello,
I have opened an issue (#54) because I got error 512 when transferring 1 MB C2H in stream mode.
I tried the changes by bhathaway. I also experimented with various values down to XDMA_TRANSFER_MAX_DESC (16). The job works for a few minutes and eventually hangs.
Ubuntu 18.04.02
I would be genuinely curious if anyone has benchmarked the performance at different values for this constant.
Quite an old topic, but even without benchmarking I can say that, testing with a loopback channel (3x16), I see the following:
- If I queue 1024 descriptors on the C2H channel and then queue write descriptors, the transfers are not done. If I queue 1023 read descriptors, it works. I think this means the choice of this number depends on the application: if there is no way reads could block writes, there is no problem; if, on the contrary, reads cannot happen before a write is done (as in loopback mode), this number should not exceed 1023 (on the 3x16 IP).
- To test this, fetch my adjacent-descriptors fix (without it, loopback transfers of more than one page mostly don't work, or crash the machine because of crossing page boundaries). Use the dma_streaming script and test with sizes up to 1023 pages (4190208 bytes). They work.
- Now test with more than 4190208 bytes: the transfer locks up.
- Reducing the max number of descriptors to 1023 makes the transfers work in all cases.
I cannot say for sure that the IP works the way I think it does, but I'm putting these observations here for others to test/confirm, or just to help.
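A quick sanity check on the numbers above, assuming 4 KiB pages (which is what the observed limit implies):

```c
#include <assert.h>

/* 1023 descriptors of one 4 KiB page each: the largest loopback
 * streaming transfer reported to work on the 3x16 IP in this test. */
#define LOOPBACK_PAGE_SIZE 4096L
#define LOOPBACK_MAX_DESCS 1023L
#define LOOPBACK_MAX_BYTES (LOOPBACK_MAX_DESCS * LOOPBACK_PAGE_SIZE)
```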
Maybe we should just fork and point people to our own repo. Keeping a few PRs in flight like this is really annoying for downstream users (i.e. ourselves).
I've hit the same issue locally, again only when the copy descriptors exceed a single page in size (i.e. more than 128 32-byte descriptors). This fix looks legitimate, but I haven't had the chance to test it yet. Thanks for taking the time to actually debug this.
Hi, I realized I messed up and didn't install all your changes correctly. I re-ran them and now I get the following errors: Makefile:10: XVC_FLAGS: . The mmiowb(); was a regression to an earlier error that was fixed. The others are new.
This is a minimal implementation based on [1] that matches the existing code style for handling the page crossing edge case. Previously only one place in the code correctly handled adjacent descriptors crossing a page boundary, but now that code is copied to the other relevant places too. We hit this case when attempting to re-load ELF files after a few attempts, at least with my CheriBSD image. [1] Xilinx/dma_ip_drivers#49
Thanks for finding and fixing this, I just ran into this today. Some comments on your pull request, however:
I made my own minimal commit that follows these principles [2] (but, as it's in a local vendored copy, it's surrounded by noise to clearly indicate that it is a local change). Something like that seems to me to have a much greater chance of being reviewed and accepted. Ultimately you're the one who found the bug and worked out what needed fixing, so this is not me trying to take credit for that; I'm just trying to push this issue forward and tidy it up into a state where it can actually be fixed. [1] https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf
This commit: Thanks!
Thanks, that does indeed also appear to fix the issue, at least for my use case.
Thanks!
After extensive testing at Zendar, we've found it.
The primary culprit for the elusive "magic stopped" problem is a
boundary condition where the starting adjacent descriptor count
written to the dma engine count is wrong and causes the dma engine
to attempt to read across a page boundary.