Proposed fix for issue 48. #49
Conversation
The primary culprit for the elusive "magic stopped" problem is a boundary condition: the starting adjacent descriptor count written to the DMA engine is wrong, which causes the engine to attempt to read across a page boundary.
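For illustration, the boundary condition can be sketched as follows. XDMA descriptors are 32 bytes each, and the engine's adjacent fetch must stay within a 4 KiB region, so the maximum adjacent count depends on the offset of the next descriptor within its page. The function name and constants below are illustrative, not the driver's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define XDMA_DESC_SIZE 32u      /* each XDMA descriptor is 32 bytes   */
#define REGION_4K      0x1000u  /* adjacent fetches must not cross 4K */

/* Hypothetical helper: how many further descriptors the engine may
 * fetch adjacently after the descriptor at bus address next_lo,
 * without the fetch crossing a 4 KiB boundary (mirrors the intent of
 * the driver's max_adj_4k computation). */
static uint32_t max_adjacent(uint32_t next_lo)
{
    uint32_t room = REGION_4K - (next_lo & (REGION_4K - 1u));
    return room / XDMA_DESC_SIZE - 1u;
}
```

For a descriptor at the start of a page this yields 127 (128 slots in the page minus the descriptor itself); for the last 32-byte slot in a page it yields 0, i.e. no adjacent fetch is allowed.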
It would be good if Xilinx actually provided sample FPGA designs to test their drivers. Did you have to make your own block diagram (logic) to do these tests?
@hmaarrfk Yes, a fellow teammate did the design, and by luck one of our simple large-transfer tests caught this bug very reliably, because the unaligned user-space address caused what you might call a striping pattern in the descriptor bus addresses. This kind of work is really difficult to do alone.
I totally agree. Xilinx is really bad at leading this collaborative software design, though. Quite a few PRs and issues have been raised with little attention from any developers or product managers at Xilinx. You know, I think I had some bugs with unaligned memory (using QDMA), but I started to use an aligned allocator since I figured there was likely a bug lurking in there somewhere. I guess I'll circle back to this later with what you found in mind. Thanks for sharing!
Xilinx has made assumptions about the underlying architecture regarding the adjacent descriptor fetch scheme. So I've added some constants that make the code work (at least on aarch64). Another thing I noticed is that XDMA_TRANSFER_MAX_DESC is too large for all the hardware the driver pretends to support. I've lowered it to 512, because that's what matches my hardware. The danger is FIFO overflow, and I'm not sure what kind of error that would result in.
(0x1000 - ((le32_to_cpu(desc->next_lo)) & 0xFFF)) / 32 - 1;
max_adj_4k = (PAGE_SIZE_X86 - ((le32_to_cpu(desc->next_lo)) & PAGE_MASK_X86)) / sizeof(struct xdma_desc) - 1;
This macro is hardcore. I guess you all have moved on to aarch ;)
I'm trying to make sure the driver works on both x86 and aarch architectures, which is why I made this change. The adjacency accesses bake in x86 page-size assumptions, so I propose adding these macros instead of just using the standard PAGE_SIZE and PAGE_MASK. Either way it's much better than the magic numbers in the current code.
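A minimal sketch of the proposed macros (names taken from the diff above). Note that as used here, PAGE_MASK_X86 masks the low offset bits (0xFFF), which is the opposite of the kernel's usual PAGE_MASK convention:

```c
#include <assert.h>

/* The XDMA engine's adjacent-descriptor fetch is bounded by 4 KiB
 * regions regardless of the host page size (which can be 16K or 64K
 * on aarch64 configs), so the driver should not use the kernel's
 * PAGE_SIZE/PAGE_MASK here. These definitions are a sketch of the
 * macros used in the diff, not verbatim driver code. */
#define PAGE_SIZE_X86 4096u
#define PAGE_MASK_X86 (PAGE_SIZE_X86 - 1u)  /* 0xFFF: low offset bits */
```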
@woshixiaohuya I found the change necessary on both x86 and aarch architectures, because it's a bug in the boundary condition. The macros are there to help it work on aarch.
Okay, thank you!
@@ -66,7 +66,7 @@
 #define XDMA_OFS_CONFIG (0x3000UL)

 /* maximum number of desc per transfer request */
-#define XDMA_TRANSFER_MAX_DESC (2048)
+#define XDMA_TRANSFER_MAX_DESC (512)
Was this a necessary change for the default configuration? Can you comment on why this is necessary?
I think this change isn't necessary. It didn't solve my problem (see #54).
The documentation from Xilinx (https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf) notes that Gen3x16 has a descriptor FIFO depth of 1024 and Gen3x8 (which is what we're using) has a depth of 512. I'm having trouble interpreting the language on pages 24 and 25, but it sounds like the engine might be smart about not overflowing this FIFO. What's certain is that the engine can't possibly fetch more than the FIFO allows at once, so I believe the value I've shown is safest for all hardware.
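As an illustration of that reasoning, one could clamp the per-transfer descriptor count to the FIFO depth of the configuration in use. The depths come from PG195 as quoted above; the clamp function itself is a sketch, not the driver's actual code:

```c
#include <assert.h>

/* Descriptor FIFO depths per PG195 for two common XDMA configs. */
#define DESC_FIFO_DEPTH_GEN3X8  512
#define DESC_FIFO_DEPTH_GEN3X16 1024

/* Hypothetical clamp: never submit more descriptors per transfer
 * request than the engine's descriptor FIFO can hold at once. */
static int clamp_max_desc(int requested, int fifo_depth)
{
    return requested > fifo_depth ? fifo_depth : requested;
}
```

With the driver's old default of 2048, a Gen3x8 core would be clamped to 512, matching the value proposed in this PR.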
I would be genuinely curious if anyone has benchmarked the performance at different values for this constant.
There is a report of a race condition here:
https://forums.aws.amazon.com/thread.jspa?threadID=295643
It refers to timeout_ms. Maybe your review and dev knowledge will yield some ideas...
I would open a new issue for that. It doesn't seem related to this. You need details like which commit hash you're using and the sample program that caused the error. For what it's worth, it seems odd to me that the engine apparently times out after only one descriptor completed (see completed_desc_count). Also, FYI, error 512 is ERESTARTSYS and is explicitly returned by xdma_xfer_submit on timeout.
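For reference, ERESTARTSYS lives in the kernel's private errno range and is normally consumed by the signal-handling machinery before user space ever sees it; a driver returning it directly on a timeout path with no pending signal is what lets the raw value 512 leak out. The value below is from the kernel's include/linux/errno.h; the decode helper is purely illustrative:

```c
#include <assert.h>
#include <string.h>

/* Kernel-internal errno (include/linux/errno.h); values >= 512 are
 * never meant to reach user space. */
#define ERESTARTSYS 512

/* Hypothetical helper for making sense of the mystery "error 512"
 * reported in this thread. */
static const char *decode_xdma_err(int err)
{
    return (err == ERESTARTSYS) ? "ERESTARTSYS (kernel-internal)"
                                : "other error";
}
```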
Hello,
I have opened an issue (#54) because I got error 512 when transferring 1 MB C2H in stream mode.
I tried the changes by bhathaway. I also experimented with various values down to XDMA_TRANSFER_MAX_DESC (16). The job works for a few minutes and eventually hangs.
Ubuntu 18.04.02
I would be genuinely curious if anyone has benchmarked the performance at different values for this constant.
Quite an old topic, but even without benchmarking I can say that, testing with a loopback channel (3x16), I see the following:
- If I queue 1024 descriptors on the C2H channel and then queue write descriptors, the transfers are not done. If I queue 1023 read descriptors, it works. I think this means the choice of this number depends on the application: if there is no way reads could block writes, there is no problem; if, on the contrary, reads cannot happen before a write is done (as in loopback mode), this number should not exceed 1023 (on the 3x16 IP).
- To test this, fetch my adjacent-descriptors fix (without it, loopback transfers of more than one page mostly don't work, or crash the machine because of crossing page boundaries). Use the dma_streaming script and test with sizes up to 1023 pages (4190208 bytes). They work.
- Now test with more than 4190208 bytes: the transfer locks up.
- Reducing the max number of descriptors to 1023 makes the transfers work in all cases.
I cannot say for sure that the IP works the way I think it does, but I'm putting these observations here for others to test/confirm, or just to help.
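A quick sanity check on the numbers above, assuming 4 KiB pages (which is what the observed limit implies):

```c
#include <assert.h>

/* 1023 descriptors of one 4 KiB page each: the largest loopback
 * streaming transfer reported to work on the 3x16 IP in this test. */
#define LOOPBACK_PAGE_SIZE 4096L
#define LOOPBACK_MAX_DESCS 1023L
#define LOOPBACK_MAX_BYTES (LOOPBACK_MAX_DESCS * LOOPBACK_PAGE_SIZE)
```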
Maybe we should just fork and point people to our own repo. Keeping a few PRs in flight like this is really annoying for downstream users (i.e. ourselves).
I've hit the same issue locally, again only when the copy descriptors exceed a single page in size (i.e. more than 128 32-byte descriptors). This fix looks legitimate, but I haven't had the chance to test it yet. Thanks for taking the time to actually debug this.
Hi, I realized I messed up and didn't install all your changes correctly. I re-ran them and now I get the following errors: Makefile:10: XVC_FLAGS: . The mmiowb(); was a regression to an earlier error that was fixed. The others are new.
This is a minimal implementation based on [1] that matches the existing code style for handling the page crossing edge case. Previously only one place in the code correctly handled adjacent descriptors crossing a page boundary, but now that code is copied to the other relevant places too. We hit this case when attempting to re-load ELF files after a few attempts, at least with my CheriBSD image. [1] Xilinx/dma_ip_drivers#49
Thanks for finding and fixing this, I just ran into this today. Some comments on your pull request, however:
I made my own minimal commit that follows these principles [2] (but, as it's in a local vendored copy, it's surrounded by noise to clearly indicate that it is a local change). Something like that seems to me to have a much greater chance of being reviewed and accepted. Ultimately you're the one who found the bug and worked out what needed fixing, so this is not me trying to take credit for that; I'm just trying to push this issue forward and tidy it up into a state where it can actually be fixed. [1] https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf
This commit: Thanks!
Thanks, that does indeed also appear to fix the issue, at least for my use case.
Thanks!
After extensive testing at Zendar, we've found it.
The primary culprit for the elusive "magic stopped" problem is a
boundary condition where the starting adjacent descriptor count
written to the dma engine count is wrong and causes the dma engine
to attempt to read across a page boundary.