Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash in android in disk_io_thread::try_flush_hashed #1225

Closed
aldenml opened this issue Oct 18, 2016 · 50 comments
Closed

crash in android in disk_io_thread::try_flush_hashed #1225

aldenml opened this issue Oct 18, 2016 · 50 comments
Assignees

Comments

@aldenml
Copy link
Contributor

aldenml commented Oct 18, 2016

Please provide the following information

libtorrent version (or branch): master

platform/architecture: android

compiler and compiler version: clang NDK 13

pid: 21826, tid: 21936, name: Thread-7015  >>> com.frostwire.android <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xdaa00000
    r0 00000000  r1 00000010  r2 00000010  r3 da9fffe4
    r4 00000010  r5 0000000e  r6 00000080  r7 00000010
    r8 040000d6  r9 00400100  sl 00000010  fp dd8fa7f8
    ip 00000010  sp dd8fa798  lr 00000001  pc dfdeefd8  cpsr 800f0010

backtrace:
    #00 pc 00221fd8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread16try_flush_hashedEPNS_18cached_piece_entryEiRNS_9tailqueueINS_11disk_io_jobEEERNSt6__ndk111unique_lockINS7_5mutexEEE+340)
    #01 pc 00228a18  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread15do_flush_hashedEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+252)
    #02 pc 00223d6c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread11perform_jobEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+144)
    #03 pc 0022947c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread10thread_funENS0_13thread_type_tEN5boost4asio10io_service4workE+436)
    #04 pc 0022a6a8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread9job_queue10thread_funEN5boost4asio10io_service4workE+56)
    #05 pc 0022f0b8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZNSt6__ndk114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN10libtorrent21pool_thread_interfaceEFvN5boost4asio10io_service4workEEPS8_SC_EEEEEPvSH_+156)
    #06 pc 0003fa3b  /system/lib/libc.so (_ZL15__pthread_startPv+30)
    #07 pc 0001a085  /system/lib/libc.so (__start_thread+6)

Any hint of where to look? Now I have debug versions but are useless, since this error manifest only in release while running in the wild.

@arvidn
Copy link
Owner

arvidn commented Oct 19, 2016

do you know what line that access is on, and which variable it is for? Or, I forget, do you get dumps?

@aldenml
Copy link
Contributor Author

aldenml commented Oct 19, 2016

This is all I get from the google play developer console. It happens with a frequency that I can't ignore or blame the device.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 19, 2016

I see in the code four calls to alloca, do you think this could be a problem?

file::iovec_t* iov = TORRENT_ALLOCA(file::iovec_t, p->blocks_in_piece * cont_pieces);
int* flushing = TORRENT_ALLOCA(int, p->blocks_in_piece * cont_pieces);
// this is the offset into iov and flushing for each piece
int* iovec_offset = TORRENT_ALLOCA(int, cont_pieces + 1);
int iov_len = 0;
// this is the block index each piece starts at
int block_start = 0;
// keep track of the pieces that have had their refcount incremented
// so we know to decrement them later
int* refcount_pieces = TORRENT_ALLOCA(int, cont_pieces);

@arvidn
Copy link
Owner

arvidn commented Oct 19, 2016

you think you may overflow the stack? I would think you'd get a different error for that, but maybe not. does android use segmented stacks?

@arvidn
Copy link
Owner

arvidn commented Oct 19, 2016

gcc calls it split stack. anyway, a short term fix to mitigate this may be to run with a single disk thread. If you try that, and it goes away, that would suggest it's a race condition.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 20, 2016

you know, I can't find any information related to this (segmented stacks) and android. Not clear to me if this is something provided by the compiler alone, or it needs some help from the architecture (assembly instructions) or even the standard library.

Anyway, I added the flag -fsplit-stack to the compiler only to get

fatal error: error in backend: Segmented stacks do not support vararg functions.
clang38++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Android clang version 3.8.256229  (based on LLVM 3.8.256229)
Target: armv7a-none-linux-android
Thread model: posix

Do you have any idea of how to properly test if it's a stack problem? Do you know if it's possible to add a unit test to detect if there is a race condition problem?

@arvidn
Copy link
Owner

arvidn commented Oct 20, 2016

you could write a test program that triggers a stack overflow and run it on android to see how the error is reported. If it looks different than the report in this ticket we could rule it out.

as for triggering races in tests, the only way I can think of would be to write a stress test that puts heavy , sustained load on the disk subsystem. The actual disk I/O would probably have to be disabled too

@aldenml
Copy link
Contributor Author

aldenml commented Oct 20, 2016

(in case you have 5 mins to tell me) how I can force the call to disk_io_thread::try_flush_hashed in a transfer? any settings?

@arvidn
Copy link
Owner

arvidn commented Oct 21, 2016

of the top of my head, try_flush_hashed is called for a piece when the cache is under pressure and it flushes blocks in the piece that have already been hashed. i.e. the SHA-1 hash cursor has passed over them and their data has been incorporated in the SHA-1 state (which is held by the hash pointer in the chached_piece_entry object).

If you're interested in a stress test, you can use connection_tester, which I believe is under tools or test. With it you can generate data, generate torrents, connect to a peer using a specific torrent and upload, download fast, or both. connection_tester can make multiple parallel connections, so you may want to turn off the one-peer-per-ip setting in libtorrent under test too.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 22, 2016

This crash is happening more and more :(

Hi @ssiloti, I notice this commit (aldenml@9c5bb25) and I wonder if it's entirely right. Previously the call to pe->outstanding_flush = 0; was after the try_flush_hashed. Now it's right at the beginning, but if pe->num_dirty != 0 and pe->hashing_done, then I fear that multiple "flush" tasks are scheduled beacouse try_flush_hashed unlocks for a moment. Does this makes sense? Thanks

@ssiloti
Copy link
Collaborator

ssiloti commented Oct 22, 2016

It's possible there's something subtle going on there. Since you can reproduce it easily try reverting that commit and seeing what happens. IIRC that commit was a fix for a deadlock when the write cache is disabled so as long as you have that enabled everything should still work.

@ssiloti
Copy link
Collaborator

ssiloti commented Oct 22, 2016

Hmm, I just noticed the comment in do_flush_hashed saying that there can be multiple flush jobs active anyways so it seems we need to handle this case. What we really need is a line accurate backtrace so that we can pinpoint which line is causing the crash.

@arvidn
Copy link
Owner

arvidn commented Oct 22, 2016

@aldenml if you still have the binary+debug information for that build (or can rebuild it), you can load it up in gdb and issue: "list *address", where address is one of the addresses in the stack-trace (after "pc"). That will tell you exactly where it is (within the accuracy you get with optimizations enabled)

@ssiloti
Copy link
Collaborator

ssiloti commented Oct 22, 2016

@aldenml is there any pattern to the fault address you see in the crash reports? The one in the report you posted is above the 3GB mark which is odd because that puts it in the kernel's address space. That suggests this isn't just a simple dangling pointer access or stack/buffer overflow.

@frostwire
Copy link

frostwire commented Oct 23, 2016

Here's the value of the PC register value on the last crashes received on try_flush_cached

pc e006cfd8 -> 3758542808 -> 3.75 GB
pc e01f3fd8 -> 3760144344 -> 3.76 GB
pc dfcfafd8 -> 3754930136 -> 3.75 GB
pc dedd7fd8 -> 3739058136 -> 3.73 GB
pc 9f319fd8 -> 2670829528 -> 2.67 GB*
pc dfeb9fd8 -> 3756761048 -> 3.75 GB
pc e05e5fd8 -> 3764281304 -> 3.76 GB
pc 9e4a7fd8 -> 2655682520 -> 2.65 GB
pc e0452fd8 -> 3762630616 -> 3.76 GB
pc e040ffd8 -> 3762356184 -> 3.76 GB
pc a1146fd8 -> 2702471128 -> 2.7 GB*
pc 9e4a9fd8 -> 2655690712 -> 2.65 GB*

They all seem to be around there. Except a few* that seem to be below 0xc0000000

@frostwire
Copy link

@ssiloti could this mean that the crash occurs during a system call from try_flush_cached? meaning the system call points the PC register at some routine referenced in kernel address space

@ssiloti
Copy link
Collaborator

ssiloti commented Oct 23, 2016

I meant the fault address, not the program counter, in the report posted the fault was at 0xdaa00000.

@frostwire
Copy link

fault addr 0xd88c0000 -> 3633053696
fault addr 0xd3d00000 -> 3553624064
fault addr 0xd2c80000 -> 3536322560
fault addr 0xdd900000 -> 3717201920
fault addr 0x99d40000 -> 2580807680*
fault addr 0xd8b40000 -> 3635675136
fault addr 0xd4f80000 -> 3573022720
fault addr 0x98580000 -> 2555904000*
fault addr 0xd8840000 -> 3632529408
fault addr 0xd5d40000 -> 3587440640
fault addr 0xd8140000 -> 3625189376
fault addr 0x98c00000 -> 2562719744*
fault addr 0x91080000 -> 2433220608*

* address below 0xc0000000, in user space.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 24, 2016

The error is not easy to replicate, in fact, it never happened in our internal tests. I have no pattern in the error reporting, only that it seems to happens more in Android 6. (interesting the point about the address)

With your comments and suggestions, I will try harder to get the necessary information. Thanks again.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 24, 2016

@arvidn @ssiloti after doing some workout with gdb :) I finally narrowed the crash to this line:

        // count number of blocks that would be flushed
        int num_blocks = 0;
        for (int i = end-1; i >= 0; --i)
====>>>>    num_blocks += (p->blocks[i].dirty && !p->blocks[i].pending);

        // we did not satisfy the block_limit requirement
        // i.e. too few blocks would be flushed at this point, put it off
        if (block_limit > num_blocks) return 0;

Any ideas of how cached_piece_entry::blocks can become invalid/corrupt?

@aldenml
Copy link
Contributor Author

aldenml commented Oct 24, 2016

Or is it the end calculation wrong?

@ssiloti
Copy link
Collaborator

ssiloti commented Oct 25, 2016

Still not sure what's going on here. The fact the the upper 16-bits of the fault address appears to be garbage and the lower 16-bits all zeros suggests something, I'm just not sure what. The obvious suspect would be a torn read, but those shouldn't happen with 32-bit loads on ARM.

@aldenml
Copy link
Contributor Author

aldenml commented Oct 25, 2016

I wish I could say something relevant (reading your PR)

aldenml added a commit to aldenml/libtorrent that referenced this issue Oct 26, 2016
aldenml added a commit to aldenml/libtorrent that referenced this issue Oct 26, 2016
@aldenml
Copy link
Contributor Author

aldenml commented Oct 27, 2016

unfortunately, I can confirm that the PR #1255 does not fix the error (I ported it to master and I got the crash)

@aldenml
Copy link
Contributor Author

aldenml commented Oct 31, 2016

Good news, the crash is gone!

We were lucky to have a device in which the crash manifested repeatedly. The PR #1255 was definitively an improvement but only after the last comment I was thinking about the possibility of bad things going on with -O3 and struct's bitfields and mutex-locks. I just compiled with -Os and the crash disappears.

It's a monumental task to review all the assembly involved to determine what exactly is happening, maybe the ARM clang code generation needs more working. Well, this is all and thanks for all the help.

@aldenml aldenml closed this as completed Oct 31, 2016
@arvidn
Copy link
Owner

arvidn commented Oct 31, 2016

it might be worth turning the bitfields into normal members, to see if that fixes it too. I wonder if there's a tearing issue because of it. Perhaps some of those fields must live in their own byte to be thread safe

@arvidn
Copy link
Owner

arvidn commented Oct 31, 2016

I think it is important to understand this. My suspicion is that the code is wrong, it just doesn't manifest itself on x86

@arvidn
Copy link
Owner

arvidn commented Oct 31, 2016

actually, all of the fields in cached_block_entry should only be accessed when holding the mutex, so tearing shouldn't be an issue. as far as I can tell.

@aldenml
Copy link
Contributor Author

aldenml commented Nov 1, 2016

I'm reopening the issue as a remainder that this needs further investigation.

@aldenml aldenml reopened this Nov 1, 2016
@joycepg
Copy link

joycepg commented Nov 24, 2016

we've been getting this error too, but only on certain devices. the nexus 7 has consistently crashed, whereas some samsung 10" tablets don't.
i managed to get our ndk project loaded inside android studio, and finally got a traceback.
android-libtorrent-crash
not really sure what to do next though!
this is using libtorrent 1.1.1 release

@joycepg
Copy link

joycepg commented Nov 24, 2016

i noticed no line numbers in the stacktrace above for libtorrent code. so i re-ran configure with --enable-debug. when i ran it, it did not crash.

then i modified configure to output -O0 instead of -Os when not debugging.

this resulted in a crash in a different spot, but during the same operation (downloading 6 torrents simultaneously)
android-libtorrent-crash-o0

@joycepg
Copy link

joycepg commented Nov 24, 2016

another. same settings, same time (downloading 6 torrents), but different spot. smells like a rogue pointer somewhere, stomping over the heap earlier.
android-libtorrent-crash-o0-again

@arvidn
Copy link
Owner

arvidn commented Nov 24, 2016

I've found that some older ARM chips have buggy atomic operations. You may want to try making boost::shared_ptr use a mutex instead of atomic increment/decrement. There's a define for that.

@joycepg
Copy link

joycepg commented Dec 5, 2016

you may be onto something there.

the define i found was BOOST_SP_NO_ATOMIC_ACCESS

i tried configuring libtorrent:
CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...
but still experienced crashes.

i also tried to use pthreads instead of atomics:
CPPFLAGS="-DBOOST_AC_USE_PTHREADS=1 -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...
but the app didn't even run. it hung during initialization

i seem to be at a dead end.

@arvidn
Copy link
Owner

arvidn commented Dec 5, 2016

I'm not sure if boost::shared_ptr has any compiled components in boost thread, but you may want to make sure that boost itself is built with the same configuration. (if you build with boost build, this property would be propagated to dependencies)

@Falcosc
Copy link
Contributor

Falcosc commented Dec 29, 2016

@joycepg did you try to rebuild boost to exclude this failure cause?

@joycepg
Copy link

joycepg commented Jan 17, 2017

finally got back to trying this.

rebuilt boost with

<compileflags>-DBOOST_ASIO_DISABLE_STD_ATOMIC=1
<compileflags>-DBOOST_SP_NO_ATOMIC_ACCESS=1

added to ~/user-config.jam

rebuilt libtorrent with

CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...

then rebuilt my .so with these in Android.mk

LOCAL_CFLAGS := -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 \
-DBOOST_SP_NO_ATOMIC_ACCESS=1 \
-O0 -g

no luck.
image

the message variable didn't include anything useful. just the libtorrent version number 1.1.1.0-1229491

note that when i link my .so, which uses libtorrent, i only need to link with boost_system, boost_chrono and boost_random.
the boost_thread library does not even compile. The error is

gcc.compile.c++ bin.v2/libs/thread/build/gcc-android4.9/release/link-static/runtime-link-static/target-os-linux/threading-multi/pthread/thread.o
In file included from libs/thread/src/pthread/thread.cpp:19:0:
./boost/thread/future.hpp: In member function 'void boost::promise<void>::lazy_init()':
./boost/thread/future.hpp:2549:37: error: 'atomic_load' was not declared in this scope
             if(!atomic_load(&future_))
                                     ^

but i didn't care to solve that because i don't need to link with it anyway.

i'm happy to receive suggestions. it's pretty quick for me to recompile and reproduce.

@Falcosc
Copy link
Contributor

Falcosc commented Jan 17, 2017

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce. I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

I did try to find the correct combination of build flags by looking through the code. But I didn’t find much at boost thread https://github.com/boostorg/thread/blob/develop/include/boost/thread/future.hpp
But they use macros to overwrite the atomic_load definition. I guess there is option somewhere which enables ARM compatible atomic_load macros.
Would be very nice if you could ask the atomic module team how to build boost for ARM Chips with non-reliable atomic implementations. I hope they could tell us how to disable atomic for boost or how to enable something which does apply a workaround for non-reliable atomic implementations of some ARM chips. I guess you should create an issue here https://github.com/boostorg/atomic
Would be very nice if you could link it to our ticket, because as a result of it I hope we could find an ARM build suggestion for libtorrent documentation.

@joycepg
Copy link

joycepg commented Jan 18, 2017

Thanks for your reply

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce.

The compilation of boost.thread is not the issue.
boost.thread compiled library is not required in the linking. The libtorrent configure output shows this:

Boost libraries:
  version:              
  CPPFlags:             -I/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//include
  LDFlags:              -L/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//lib
  boost.system:         -lboost_system
  boost.chrono:         -lboost_chrono
  boost.random:         -lboost_random

So now I do not even attempt to build boost.thread. I just use

./b2 --with-system --with-chrono --with-random toolset=gcc-android4.9 link=static runtime-link=static target-os=linux --stagedir=android

regarding the actual error i am facing

I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

The issue is that libtorrent is crashing at pretty random locations and times (as shown by screenshots).

If you have some other compile options you want me to try, it is quick to recompile boost.{system,chrono,random}, libtorrent and my .so, and reproduce the crash. It takes about 5 minutes.

I will try to raise an issue on the boost repo that you found. they say there that armv6 is a supported platform with gcc 4.x. our device is armv7 and we are using gcc 4.9, so it should be ok.

@joycepg
Copy link

joycepg commented Jan 18, 2017

well that didn't take long for the boost atomic people to bounce it :-/

@timblechmann
Copy link

@joycepg i didn't bounce it, but pointed you to the right direction: according to the stack trace, is related to the smart pointer, there is no boost.atomic code involved.

@arvidn
Copy link
Owner

arvidn commented Jan 18, 2017

@joycepg I'm not sure what BOOST_SP_NO_ATOMIC_ACCESS does. The issue I've seen in the past (granted, this was almost 10 years ago now) on some arm chips was that the atomic operations did not have correct semantics and would break boost.shared_ptr. The work-around, then, was to enable BOOST_AC_USE_PTHREADS, which uses pthread_mutex instead of atomic add.

I don't know exactly what boost.shared_ptr used to do, I don't see any specific arm code in there now, presumably it's always tried to use intrinsics on gcc, like __sync_add_and_fetch.

If that doesn't help, it's most likely not a bug in boost, but in libtorrent.

@joycepg
Copy link

joycepg commented Jan 31, 2017

i had a read of the boost code and doco. in the boost asio doco it says

Support for the standard atomic integer template is automatically enabled for g++ 4.5 and later, when the -std=c++0x or -std=gnu++0x compiler options are used.

i saw this in the boost code too. so i made sure i passed -std=gnu++0x to the boost and libtorrent builds. this was nice symmetry, since i use that flag to build our app anyway.

no change; still crashed.

i then added -DBOOST_AC_USE_PTHREADS=1 -DBOOST_SP_USE_PTHREADS=1 to the flags, and rebuilt everything. when i ran it several times, it seemed to hang randomly, in the libtorrent/boost code. possibly a deadlock occurring.

@arvidn
Copy link
Owner

arvidn commented Feb 1, 2017

It sounds like there might be a bug in libtorrent's disk cache then

@joycepg
Copy link

joycepg commented Feb 3, 2017

ok. i'm happy to work with you to try to isolate/fix

@arvidn arvidn self-assigned this Feb 4, 2017
@aldenml
Copy link
Contributor Author

aldenml commented Jul 11, 2017

Hi @arvidn, the code by now is so different and the error (if any) is not happening in this way anymore. Do you think it's reasonable to close this?

@arvidn
Copy link
Owner

arvidn commented Jul 11, 2017

yeah, probably.

@arvidn arvidn closed this as completed Jul 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants