crash in android in disk_io_thread::try_flush_hashed #1225
do you know what line that access is on, and which variable it is for? Or, I forget, do you get dumps? |
This is all I get from the google play developer console. It happens with a frequency that I can't ignore or blame the device. |
I see in the code four calls to
file::iovec_t* iov = TORRENT_ALLOCA(file::iovec_t, p->blocks_in_piece * cont_pieces);
int* flushing = TORRENT_ALLOCA(int, p->blocks_in_piece * cont_pieces);
// this is the offset into iov and flushing for each piece
int* iovec_offset = TORRENT_ALLOCA(int, cont_pieces + 1);
int iov_len = 0;
// this is the block index each piece starts at
int block_start = 0;
// keep track of the pieces that have had their refcount incremented
// so we know to decrement them later
int* refcount_pieces = TORRENT_ALLOCA(int, cont_pieces); |
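For a rough sense of scale, the stack space requested by the four TORRENT_ALLOCA calls quoted above can be estimated with a small helper. This is only a sketch: `iovec_like` and `flush_alloca_bytes` are hypothetical names, the layout of `file::iovec_t` is assumed to be a pointer plus a length (like a POSIX `iovec`), and alignment padding is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for file::iovec_t. Assumption: a pointer plus a length,
// like a POSIX iovec. The real type may differ.
struct iovec_like { void* base; std::size_t len; };

// Estimate the bytes requested from the stack by the four allocations
// (iov, flushing, iovec_offset, refcount_pieces), ignoring padding.
inline std::size_t flush_alloca_bytes(int blocks_in_piece, int cont_pieces)
{
    assert(blocks_in_piece >= 0 && cont_pieces >= 0);
    std::size_t const pieces = static_cast<std::size_t>(cont_pieces);
    std::size_t const blocks = static_cast<std::size_t>(blocks_in_piece) * pieces;
    return blocks * sizeof(iovec_like)   // iov
        + blocks * sizeof(int)           // flushing
        + (pieces + 1) * sizeof(int)     // iovec_offset
        + pieces * sizeof(int);          // refcount_pieces
}
```

With illustrative values of 256 blocks per piece and 4 contiguous pieces, and assuming 4-byte pointers, `size_t` and `int` as on 32-bit ARM, this comes to 8192 + 4096 + 20 + 16 = 12324 bytes. Whether that is significant depends on the stack size of the disk threads, which this thread does not state.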
you think you may overflow the stack? I would think you'd get a different error for that, but maybe not. does android use segmented stacks? |
gcc calls it split stack. anyway, a short term fix to mitigate this may be to run with a single disk thread. If you try that, and it goes away, that would suggest it's a race condition. |
you know, I can't find any information related to this (segmented stacks) and android. Not clear to me if this is something provided by the compiler alone, or it needs some help from the architecture (assembly instructions) or even the standard library. Anyway, I added the flag
Do you have any idea of how to properly test if it's a stack problem? Do you know if it's possible to add a unit test to detect if there is a race condition problem? |
you could write a test program that triggers a stack overflow and run it on android to see how the error is reported. If it looks different from the report in this ticket, we could rule it out. as for triggering races in tests, the only way I can think of would be to write a stress test that puts heavy, sustained load on the disk subsystem. The actual disk I/O would probably have to be disabled too |
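A minimal sketch of such a stack-overflow test program might look like the following. The names `burn_stack` and `shallow_check` are hypothetical, and the 4 KiB frame size is an arbitrary choice. Each call touches a stack buffer and recurses, so a large enough depth exhausts the stack of whichever thread calls it.

```cpp
#include <cassert>
#include <cstddef>

// Recurse `depth` times, touching a 4 KiB stack buffer in every frame.
// The buffer is volatile so the compiler cannot drop the stores, and the
// result is used after the recursive call so it is not a tail call.
inline std::size_t burn_stack(std::size_t depth)
{
    volatile char frame[4096];
    frame[0] = 1;
    frame[sizeof(frame) - 1] = 1;
    if (depth == 0) return frame[0];
    return burn_stack(depth - 1) + frame[sizeof(frame) - 1];
}

// Sanity check: a shallow recursion returns normally with depth + 1.
inline void shallow_check()
{
    assert(burn_stack(3) == 4);
}
```

To provoke the overflow, call `burn_stack` with a depth whose total frame size exceeds the stack of the thread under test, then compare the resulting crash report with the one in this ticket.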
(in case you have 5 mins to tell me) how I can force the call to |
off the top of my head, if you're interested in a stress test, you can use |
This crash is happening more and more :( Hi @ssiloti, I notice this commit (aldenml@9c5bb25) and I wonder if it's entirely right. Previously the call to |
It's possible there's something subtle going on there. Since you can reproduce it easily try reverting that commit and seeing what happens. IIRC that commit was a fix for a deadlock when the write cache is disabled so as long as you have that enabled everything should still work. |
Hmm, I just noticed the comment in do_flush_hashed saying that there can be multiple flush jobs active anyways so it seems we need to handle this case. What we really need is a line accurate backtrace so that we can pinpoint which line is causing the crash. |
@aldenml if you still have the binary+debug information for that build (or can rebuild it), you can load it up in gdb and issue: "list *address", where address is one of the addresses in the stack-trace (after "pc"). That will tell you exactly where it is (within the accuracy you get with optimizations enabled) |
@aldenml is there any pattern to the fault address you see in the crash reports? The one in the report you posted is above the 3GB mark which is odd because that puts it in the kernel's address space. That suggests this isn't just a simple dangling pointer access or stack/buffer overflow. |
Here's the value of the PC register in the last crashes received on
They all seem to be around there. Except a few* that seem to be below |
@ssiloti could this mean that the crash occurs during a system call from |
I meant the fault address, not the program counter, in the report posted the fault was at 0xdaa00000. |
The error is not easy to replicate; in fact, it never happened in our internal tests. I see no pattern in the error reports, only that it seems to happen more on Android 6. (Interesting point about the address.) With your comments and suggestions, I will try harder to get the necessary information. Thanks again. |
@arvidn @ssiloti after doing some workout with
// count number of blocks that would be flushed
int num_blocks = 0;
for (int i = end-1; i >= 0; --i)
====>>>> num_blocks += (p->blocks[i].dirty && !p->blocks[i].pending);
// we did not satisfy the block_limit requirement
// i.e. too few blocks would be flushed at this point, put it off
if (block_limit > num_blocks) return 0;
Any ideas of how |
Or is it the |
Still not sure what's going on here. The fact that the upper 16 bits of the fault address appear to be garbage and the lower 16 bits are all zeros suggests something, I'm just not sure what. The obvious suspect would be a torn read, but those shouldn't happen with 32-bit loads on ARM. |
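The observations about the fault address can be checked mechanically. The sketch below uses hypothetical names (`fault_info`, `inspect_fault`, `check_reported_address`) and assumes a 32-bit process with the user/kernel boundary at 3 GB (0xC0000000), as the earlier comment implies. It splits an address into its upper and lower 16 bits and flags whether it lies above that boundary.

```cpp
#include <cassert>
#include <cstdint>

struct fault_info {
    std::uint16_t upper;  // bits 31..16 of the fault address
    std::uint16_t lower;  // bits 15..0 of the fault address
    bool above_3gb;       // at or above 0xC0000000 (assumed kernel range)
};

inline fault_info inspect_fault(std::uint32_t addr)
{
    fault_info r;
    r.upper = static_cast<std::uint16_t>(addr >> 16);
    r.lower = static_cast<std::uint16_t>(addr & 0xffffu);
    r.above_3gb = addr >= 0xC0000000u;
    return r;
}

// Self-check against the fault address quoted in this thread.
inline void check_reported_address()
{
    fault_info const f = inspect_fault(0xdaa00000u);
    assert(f.above_3gb);
    assert(f.upper == 0xdaa0 && f.lower == 0);
}
```

Running the same function over the fault addresses of several crash reports would show whether the "zero lower half" pattern holds in general or only in the one report posted here.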
I wish I could say something relevant (reading your PR) |
unfortunately, I can confirm that the PR #1255 does not fix the error (I ported it to master and I got the crash) |
Good news, the crash is gone! We were lucky to have a device on which the crash manifested repeatedly. PR #1255 was definitely an improvement, but only after the last comment did I start thinking about the possibility of bad things going on with It's a monumental task to review all the assembly involved to determine exactly what is happening; maybe the ARM clang code generation needs more work. Well, that is all, and thanks for all the help. |
it might be worth turning the bitfields into normal members, to see if that fixes it too. I wonder if there's a tearing issue because of it. Perhaps some of those fields must live in their own byte to be thread safe |
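To illustrate the concern, here is a sketch with two hypothetical structs (`packed_block` and `unpacked_block` are illustrative names, not the actual libtorrent types). Under the C++11 memory model, adjacent non-zero-width bitfields form a single memory location, so unsynchronized writes to two of them from different threads are a data race. Ordinary members are separate memory locations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical block entry with bitfields: all three fields share one
// 32-bit storage unit and therefore one memory location. Updating one
// field is a read-modify-write of the whole unit.
struct packed_block {
    std::uint32_t refcount : 30;
    std::uint32_t dirty : 1;
    std::uint32_t pending : 1;
};

// The same fields as normal members: each is its own memory location,
// so writing one never rewrites the bytes of another.
struct unpacked_block {
    std::uint32_t refcount;
    bool dirty;
    bool pending;
};

// Layout check: the bitfields pack into one word, the plain members
// live at distinct offsets.
inline void check_layout()
{
    assert(sizeof(packed_block) == sizeof(std::uint32_t));
    assert(offsetof(unpacked_block, dirty) != offsetof(unpacked_block, pending));
    assert(offsetof(unpacked_block, refcount) != offsetof(unpacked_block, dirty));
}
```

Whether this explains the crash depends on whether the fields are ever written without a common lock, which this thread does not settle. The plain-member layout trades a few extra bytes per block for removing that particular hazard.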
I think it is important to understand this. My suspicion is that the code is wrong, it just doesn't manifest itself on x86 |
actually, all of the fields in |
I'm reopening the issue as a reminder that this needs further investigation. |
I've found that some older ARM chips have buggy atomic operations. You may want to try making |
you may be onto something there. the define i found was i tried configuring libtorrent: i also tried to use pthreads instead of atomics: i seem to be at a dead end. |
I'm not sure if boost::shared_ptr has any compiled components in boost thread, but you may want to make sure that boost itself is built with the same configuration. (if you build with boost build, this property would be propagated to dependencies) |
@joycepg did you try to rebuild boost to exclude this failure cause? |
finally got back to trying this. rebuilt boost with
added to ~/user-config.jam rebuilt libtorrent with
then rebuilt my .so with these in Android.mk
the message variable didn't include anything useful. just the libtorrent version number 1.1.1.0-1229491 note that when i link my .so, which uses libtorrent, i only need to link with boost_system, boost_chrono and boost_random.
but i didn't care to solve that because i don't need to link with it anyway. i'm happy to receive suggestions. it's pretty quick for me to recompile and reproduce. |
Thanks a lot for taking this further, and nice to hear that at least the compilation failure is easy to reproduce. I hope it's not too hard to check the atomic error once we find a correct way to compile boost. I did try to find the correct combination of build flags by looking through the code, but I didn't find much in boost thread https://github.com/boostorg/thread/blob/develop/include/boost/thread/future.hpp |
Thanks for your reply
The compilation of boost.thread is not the issue.
So now I do not even attempt to build boost.thread. I just use
regarding the actual error i am facing
The issue is that libtorrent is crashing at pretty random locations and times (as shown by screenshots). If you have some other compile options you want me to try, it is quick to recompile boost.{system,chrono,random}, libtorrent and my .so, and reproduce the crash. It takes about 5 minutes. I will try to raise an issue on the boost repo that you found. they say there that armv6 is a supported platform with gcc 4.x. our device is armv7 and we are using gcc 4.9, so it should be ok. |
well that didn't take long for the boost atomic people to bounce it :-/ |
@joycepg i didn't bounce it, but pointed you in the right direction: according to the stack trace, it is related to the smart pointer, there is no boost.atomic code involved. |
@joycepg I'm not sure what I don't know exactly what boost.shared_ptr used to do, I don't see any specific arm code in there now, presumably it's always tried to use intrinsics on gcc, like If that doesn't help, it's most likely not a bug in boost, but in libtorrent. |
i had a read of the boost code and doco. in the boost asio doco it says
i saw this in the boost code too. so i made sure i passed no change; still crashed. i then added |
It sounds like there might be a bug in libtorrent's disk cache then |
ok. i'm happy to work with you to try to isolate/fix |
Hi @arvidn, the code by now is so different and the error (if any) is not happening in this way anymore. Do you think it's reasonable to close this? |
yeah, probably. |
Please provide the following information
libtorrent version (or branch): master
platform/architecture: android
compiler and compiler version: clang NDK 13
Any hint of where to look? I now have debug builds, but they are useless, since this error manifests only in release builds running in the wild.