possible issue after having upgraded to v2.6 #1151
Data corruption occurs every 2-5 minutes on a ring managing one low-volume TCP connection, so a ring buffer wraparound issue is a plausible cause. Also worth mentioning: it happens only on a specific ring, the one configured with IORING_SETUP_ATTACH_WQ.
I doubt this is the issue, but a particularity of my setup is that my 6.8.9 kernel has been patched with the NAPI busy polling code... there is a very small probability of trouble if the NAPI register command value changed between what is in the patch and the final official value. IMHO this would only make the registration fail, but maybe not...
Here is an update... I have downgraded liburing from 2.6 to 2.5 and my inbound TCP stream corruption is gone... My intuition tells me that there is something wrong with the IORING_SETUP_NO_SQARRAY feature, because I am not seeing what else could cause the issue... I'll try to look into the IORING_SETUP_NO_SQARRAY code to see if I can find the problem, and I will report back here if I find something... I am not reading the io_uring list as much as I did in the past, so I missed the discussion around IORING_SETUP_NO_SQARRAY, but I guess there is no reason not to use it, and removing this level of indirection slightly improves io_uring performance, right? I guess this will become clearer once I look into its code...
I can take a look if we can get a reproducer. If you can't share your code directly, could you please share what io_uring flags you're using and what you're sending over TCP so I can build one?
And in lieu of a reproducer, can you simply try and disable IORING_SETUP_NO_SQARRAY just to rule that out? Just removing the IORING_SETUP_NO_SQARRAY line in src/setup.c:io_uring_queue_init_try_nosqarr() should do it. I don't see how this can be related at all, or how the upgrade in general could break this for you, but let's see if we can collect some data points. Another option is to try and bisect between 2.5 and 2.6, if it really is the liburing side that broke it.
I agree with Jens on this one: if you suspect IORING_SETUP_NO_SQARRAY, disabling it is the quickest way to rule it out.
I have followed Jens's suggestion to comment out the IORING_SETUP_NO_SQARRAY line in setup.c... With that, my program works flawlessly with liburing 2.6... You seem very confident that the problem cannot come from the IORING_SETUP_NO_SQARRAY kernel code; I'll trust you on that... I may have another investigation lead... I use IORING_SETUP_NO_MMAP and I assign remaining huge page memory to ring buffers if there is enough free memory. If not, my code falls back to posix_memalign(). Maybe by modifying the amount of memory used by the io_uring, the IORING_SETUP_NO_SQARRAY option somehow makes my code mismanage the memory...
Not sure who you're replying to, but you never know, anything can happen. I gather you're not using the sqarray, so it might be some mishap in the offset calculation, especially in some less common cases like NO_MMAP.
If you make it always use posix_memalign(), does it still reproduce, or does it only happen with huge pages?
Pavel, I am going to give it a shot and use posix_memalign() memory for the allocated ring buffers... FYI, I allocate 3 io_urings in a single huge page. I use the io_uring_queue_init_mem() return value to update the starting address for the next ring structure... This might be another function to review... I might not be able to continue the investigation today, but I will for sure tomorrow, and I'll update you on what I find... Here are the log traces generated by the code setting up the rings... I am not expecting them to make sense to anyone... I wrote the code many months ago; it is in front of me, and without some review effort I cannot even figure out the traces... For instance, I don't get why the huge page allocation trace appears after creating the first ring...
Here is something interesting... the io_uring giving trouble is the one sandwiched in the middle, the tsi 1 ring... The more we discuss the issue, the more it looks like some buffer overlap problem... The only remaining thing to determine is the source of the overlap... Update: my program makes 2 mmap() calls: one of 2MB to store the 3 io_uring structures plus the provided buffers of the 2 smallest rings, and 1 extra mmap() of huge pages for the 142MB of provided buffers of ring 0. I am going to improve my traces... Printing the start and end of each section in the first huge page should make it obvious if there is an overlap somewhere...
here are improved logs:
I suspect that some alignment/rounding requirements are not met anymore with IORING_SETUP_NO_SQARRAY. One addition to the logs: I have replicated the code in io_uring_alloc_huge() to find out the breakdown of the memory distributed among the io_uring structures. I was particularly interested in the safety space created by rounding mem_used up to the next page boundary. It turns out that with IORING_SETUP_NO_SQARRAY there is 0 slack. So if there is a buffer overflow somewhere, it was previously masked by the rounding up that is absent with IORING_SETUP_NO_SQARRAY. Maybe the tsi 0 cq ring is overflowing into the tsi 1 ring memory. I have looked a little at the cq management code, but it has become very complex... It does overflow management, it caches several entries, and so on... It is very hard for the non-initiated to review. If it helps: with tsi 0, multishot recvmsg is used, and multishot appears to have special cq management...
With the provided buffers memory out of the way and only the rings adjacent to each other... it seems like we have a winner... no more ring corruption... My apologies, it seems the problem is on my side... I'll keep the issue open so that once I find out what is causing the problem, it is documented...
But I am still perplexed... my program only reads from the provided buffers... it is io_uring that does the writing... I suspect an underflow in the provided buffers... I was doing an ugly hack to support kTLS by writing into the io_uring_recvmsg_out (16 bytes) and moving the payload buffer 5 bytes ahead, but I am not using this anymore since software kTLS does not provide any performance benefit (last time I checked)... This could have been my issue if the io_uring_recvmsg_out layout had changed...
Here is what I will do next: I'll put the provided buffers back into my huge page, but with an empty 4KB space between the end of the rings and the provided buffers. With no under/overflow this space should remain zeroed. If there is non-zero data in it, it will be proof that something is writing where it should not...
True, but the provided buffers' start and length are controlled solely by you, so it'd be very possible to run into problems if you have an overlap. Maybe add some red zoning or similar as a debug thing to detect if anything is stepping over spots it should not.
This could be either a kernel side or application issue, so if you can debug a bit and figure out where it's coming from, that would certainly be helpful in figuring out who the culprit is. |
The io_uring_recvmsg_out layout did not change; it's UABI, so it cannot change. But that doesn't mean the kernel-side handling couldn't be broken, though that seems unlikely if you are confident this happens with liburing 2.6 and not with 2.5 (which still seems suspect; more likely the upgrade causes a layout change for you that in turn makes the bug more likely to trigger). Have you tried running your app under valgrind, for example?
IIUC, 2.6 is the first version switching to IORING_SETUP_NO_SQARRAY. Sounds like rings might be overlapping with provided buffers; I'll look through the offset calculation. What does the setup address calculation look like? Similar to below, or do you do the offsets by hand?
You are correct about that... I don't know for sure what is going on, or else I would tell you... All I have are suspicions about where the problem could be. Adding an empty page between the end of a ring and its provided buffers is the next step. Checking that the provided buffers initialization was good is the first thing I checked. AFAIK, it seems to be good:
Agree, it sounds like overlap, and that could of course be a liburing bug due to the combination of NO_SQARRAY and using NO_MMAP. |
There is something that valgrind does not like about programs using io_uring... I have never been able to run valgrind on my program with great success... that might have changed; I have not tried recently... What I have used successfully in the past, with very good results, is -fsanitize=address... If you read my updates since yesterday, I have narrowed down the issue to:
This is what I am led to think. There are logRingMemory traces showing the memory layout with and without NO_SQARRAY. When the option is not set, there is a 3KB safety zone. With the NO_SQARRAY option set, everything is flush to the byte and immediately adjacent to the next structure...
Check the logs: 2.6 without the NO_SQARRAY option, and 2.6 with NO_SQARRAY with the provided buffers away from the huge page (allocated with posix_memalign()). I think I did a good job improving the usefulness of the traces. As far as buffer initialization goes, I have shared the whole function doing it... I am not sure that "provided buffers" is the right terminology for what I use... maybe "user buffer ring" would be more accurate...
I have added the following code:
in the context class destructor:
the resulting traces:
So it seems the kernel code managing the ring is overflowing it by 1 byte... just enough to cause an issue if something is adjacent...
Could I get you to turn this into a separate thing that shows the corruption? Doesn't look like it'd be a lot of work given the above simple trace? I'm still not quite sure how you're managing the memory, otherwise I'd just do it myself. |
You mean that you would like to see the memory byte values? If that can hint at what it might be, then yes, of course it is easy to add... As to what is in that memory space:
it is a 2.6 io_uring with 256 entries with NO_SQARRAY set (so no safety buffer at its end, unlike when NO_SQARRAY is missing). The trace
is created by mimicking the code found in io_uring_alloc_huge().
Normally I assign the user buffer ring immediately after the io_uring memory, but as suggested, I left a 4KB safety space which should remain zeroed (it comes from mmap). The very first byte after the tsi 1 io_uring area is non-zero... I will change my code to print out that value...
Jens is likely asking for a reproducer we can run ourselves. |
A quick sum-up:
Can you clarify a little bit about the provided buffers and the rings? Where do you store the ring itself, and where do you allocate the buffers you put into the provided buffer ring? I assume there are 32 buffers; I read the region as only containing the buffers (32 * 17408 == 557056).
Pavel, you understand 100% correctly. This perfectly sums up what I am experiencing... Maybe the first byte of the ring memory is less sensitive than it seems for a user buffer ring...
Yes, you read correctly... only 32 buffers... It is a low-volume data stream; from experience I figured that 32 buffers were sufficient for that thread/ring. Update: a 2MB huge page is mmap'ed; the provided logs clearly show the memory range of each component... I advance the current offset in the huge page, where the next component will be placed, by the io_uring_queue_init_mem() return value. Concerning where and how the buffers are allocated, I shared the whole function code in an earlier comment.
Here is a full hex dump of the 4KB safety zone. It was a good thing to ask, as it turns out it is not only the first byte that is corrupted... Hopefully the values will be recognizable as some sort of cqe record...
You mentioned you use provided buffer rings; can you show how and where you register them, e.g. how you call io_uring_setup_buf_ring()? Apart from allocating the buffers that you put into the ring and read/recv data into, there is also memory for the ring itself; I just wonder where that one is placed.
I use io_uring_setup_buf_ring(); the whole code is in the comment above. The buf ring itself is stored in the ev loop structure; I fetch it in my initGroupBuffer() function. As for the ring creation, nothing special to say about it... it is created by calling io_uring_queue_init_mem().
Something else I thought I could do to help is modify my completion handler to check whether the returned cqe buffer is outside the expected memory range, and perhaps log its details if this can help direct the investigation.
Correct me if I am wrong, but checking for cq ring overflow is not something that is currently widely exercised in the library unit tests. I suspect the issue is not that esoteric, but since it is harmless with the most common single-ring setup and never bit anyone so far, it is not something that has been checked... but I could be wrong... This is maybe a cool task for someone such as @spikeh to look at. I can see how valuable it would be to have a general teardown check testing invariants such as the cq ring respecting its configured boundaries... With minimal parameter tweaking, such as making the number of cq entries small enough to ensure at least 1 wraparound during the test, plus that general teardown check, I suspect the currently existing unit tests would make this issue show up... If it does not, yes, I will continue to help nail down the bug... but otherwise, my "very interesting problems" investigation budget has been fully spent over the last few days...
Good idea: check that the buffer addresses you pass back into the ring are what you expect. One more thing I'm curious about: when you increase the red zone from 4KB to 8KB, where does the non-zero pattern start?
I have not increased the red zone (what I call the safety zone) to 8KB yet... so I guess you would like me to try that... If that is the case, yes, I can do that, and I will report back the result... My prediction is that the corruption will remain in the first 64 bytes... but the best bug-fishing catches I have experienced came from putting aside all my preconceived ideas and assumptions, so I am willing to give it a try... Did you attempt to map the corruption values onto an io_uring_cqe record? I am not familiar enough with the structure and common constants to look at the hex dump and be sure I was looking at io_uring_cqe entries... but I thought it could have jumped out at you immediately if they were io_uring_cqe...
I have been thinking about what you want to validate by asking me to check the buffer base addresses before reinserting them into the ring... 2 days ago, when you asked me to place them in memory allocated with posix_memalign(), it was before I had code that checks for corruption in the red zone... Would redoing the posix_memalign() buffer allocation and checking for corruption past the cq ring region not rule them out?
I tried writing my own reproducer, and it does seem to indicate that we're writing past the end of the memory region that we said we'd use based on the return value of io_uring_queue_init_mem(). I'll poke a bit and report what I find. |
That is fascinating! This may have been there for a very long time, and no one noticed until now... We have the best job; it keeps giving us puzzles to solve! I am always grateful for whatever life throws at us... This whole issue made me review my own code, which led to a few minor improvements... One of these might be of interest to you: I have replaced the io_uring_for_each_cqe() macro with its body minus io_uring_cqe_index(), to eliminate io_uring_cqe_shift() and remove several instructions and memory accesses per call, since I am not using IORING_SETUP_CQE32. This is the kind of assumption you, of course, cannot make, but maybe moving the io_uring_cqe_shift() call out of the loop could improve the generated code... Or perhaps provide 2 versions of the io_uring_for_each_cqe() macro, to avoid the bit shift at each iteration when the required shift is known at compile time. This is a minor improvement, but considering that it is in the hot path, it might not be negligible...
Yeah, it looks like bad math in liburing for the size. It seems to account for the arrays just fine, but not the indexes. I'll be back with a patch!
Can you try the below? Very lightly tested...
Yes, I will do that; count on my feedback later today. But just from reviewing it, I can already conclude that it will fix the issue... It is just sad to reserve a full page for only 64 bytes of extra data...
Thanks! If it's easier for you, I can put it in a liburing branch for you to just pull + checkout rather than need to fiddle with a pasted patch. |
Heh, no, the patch is a much easier route for me... I am by far NOT a git power user. This way I just have to modify the source fetched by my system packaging tools and rebuild...
It doesn't factor in the indexes themselves, so depending on the ring sizes, we may end up undershooting slightly. This can cause the returned value to be 64 bytes less than what is required. If an application is packing things in tight, that in turn can cause the assumed ring memory to spill into the next part of that memory, corrupting anything stored there. We may overshoot with the current code depending on ring sizes, but at least that's better than undershooting and causing corruption. Link: #1151 Signed-off-by: Jens Axboe <axboe@kernel.dk>
That's fine. I did push a commit, and it should pick cleanly into the liburing 2.6 branch or the current git branch. test/init-mem.c is the test case. Pretty basic: it just sets up a bunch of differently sized rings and checks that nobody stomps on memory outside of what io_uring_queue_init_mem() told us it used.
The patch reserves a larger amount of memory. With it, elements placed adjacent to the ring stop being corrupted, and the initial TCP inbound data corruption is not observed anymore.
It will overshoot a bit, but that's better than undershooting... I have merged it into master, thanks! |
Since I upgraded to v2.6, I experience periodic data corruption in the inbound TCP stream:
Not sure if this could have anything to do with the new IORING_SETUP_NO_SQARRAY feature...
I'll try to downgrade to v2.5 to see if the problem goes away...
I'm not 100% sure where the problem comes from, because I changed a few things all at the same time:
- liburing
- my own code
- my instance type, from a 4-core machine to an 8-core system
I'll do some more experimentation and report back if downgrading liburing makes the problem go away...
Important detail: this is with a 6.8.9 kernel.