crash in android in disk_io_thread::try_flush_hashed #1225

aldenml · 2016-10-18T14:17:40Z

Please provide the following information

libtorrent version (or branch): master

platform/architecture: android

compiler and compiler version: clang NDK 13

pid: 21826, tid: 21936, name: Thread-7015  >>> com.frostwire.android <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xdaa00000
    r0 00000000  r1 00000010  r2 00000010  r3 da9fffe4
    r4 00000010  r5 0000000e  r6 00000080  r7 00000010
    r8 040000d6  r9 00400100  sl 00000010  fp dd8fa7f8
    ip 00000010  sp dd8fa798  lr 00000001  pc dfdeefd8  cpsr 800f0010

backtrace:
    #00 pc 00221fd8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread16try_flush_hashedEPNS_18cached_piece_entryEiRNS_9tailqueueINS_11disk_io_jobEEERNSt6__ndk111unique_lockINS7_5mutexEEE+340)
    #01 pc 00228a18  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread15do_flush_hashedEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+252)
    #02 pc 00223d6c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread11perform_jobEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+144)
    #03 pc 0022947c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread10thread_funENS0_13thread_type_tEN5boost4asio10io_service4workE+436)
    #04 pc 0022a6a8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread9job_queue10thread_funEN5boost4asio10io_service4workE+56)
    #05 pc 0022f0b8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZNSt6__ndk114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN10libtorrent21pool_thread_interfaceEFvN5boost4asio10io_service4workEEPS8_SC_EEEEEPvSH_+156)
    #06 pc 0003fa3b  /system/lib/libc.so (_ZL15__pthread_startPv+30)
    #07 pc 0001a085  /system/lib/libc.so (__start_thread+6)

Any hint of where to look? Now I have debug versions but are useless, since this error manifest only in release while running in the wild.

The text was updated successfully, but these errors were encountered:

arvidn · 2016-10-19T05:36:38Z

do you know what line that access is on, and which variable it is for? Or, I forget, do you get dumps?

aldenml · 2016-10-19T12:56:12Z

This is all I get from the google play developer console. It happens with a frequency that I can't ignore or blame the device.

aldenml · 2016-10-19T20:56:14Z

I see in the code four calls to alloca, do you think this could be a problem?

file::iovec_t* iov = TORRENT_ALLOCA(file::iovec_t, p->blocks_in_piece * cont_pieces);
int* flushing = TORRENT_ALLOCA(int, p->blocks_in_piece * cont_pieces);
// this is the offset into iov and flushing for each piece
int* iovec_offset = TORRENT_ALLOCA(int, cont_pieces + 1);
int iov_len = 0;
// this is the block index each piece starts at
int block_start = 0;
// keep track of the pieces that have had their refcount incremented
// so we know to decrement them later
int* refcount_pieces = TORRENT_ALLOCA(int, cont_pieces);

arvidn · 2016-10-19T21:19:56Z

you think you may overflow the stack? I would think you'd get a different error for that, but maybe not. does android use segmented stacks?

arvidn · 2016-10-19T21:24:31Z

gcc calls it split stack. anyway, a short term fix to mitigate this may be to run with a single disk thread. If you try that, and it goes away, that would suggest it's a race condition.

aldenml · 2016-10-20T19:32:28Z

you know, I can't find any information related to this (segmented stacks) and android. Not clear to me if this is something provided by the compiler alone, or it needs some help from the architecture (assembly instructions) or even the standard library.

Anyway, I added the flag -fsplit-stack to the compiler only to get

fatal error: error in backend: Segmented stacks do not support vararg functions.
clang38++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Android clang version 3.8.256229  (based on LLVM 3.8.256229)
Target: armv7a-none-linux-android
Thread model: posix

Do you have any idea of how to properly test if it's a stack problem? Do you know if it's possible to add a unit test to detect if there is a race condition problem?

arvidn · 2016-10-20T20:23:26Z

you could write a test program that triggers a stack overflow and run it on android to see how the error is reported. If it looks different than the report in this ticket we could rule it out.

as for triggering races in tests, the only way I can think of would be to write a stress test that puts heavy , sustained load on the disk subsystem. The actual disk I/O would probably have to be disabled too

aldenml · 2016-10-20T23:38:12Z

(in case you have 5 mins to tell me) how I can force the call to disk_io_thread::try_flush_hashed in a transfer? any settings?

arvidn · 2016-10-21T01:13:57Z

of the top of my head, try_flush_hashed is called for a piece when the cache is under pressure and it flushes blocks in the piece that have already been hashed. i.e. the SHA-1 hash cursor has passed over them and their data has been incorporated in the SHA-1 state (which is held by the hash pointer in the chached_piece_entry object).

If you're interested in a stress test, you can use connection_tester, which I believe is under tools or test. With it you can generate data, generate torrents, connect to a peer using a specific torrent and upload, download fast, or both. connection_tester can make multiple parallel connections, so you may want to turn off the one-peer-per-ip setting in libtorrent under test too.

aldenml · 2016-10-22T15:21:29Z

This crash is happening more and more :(

Hi @ssiloti, I notice this commit (aldenml@9c5bb25) and I wonder if it's entirely right. Previously the call to pe->outstanding_flush = 0; was after the try_flush_hashed. Now it's right at the beginning, but if pe->num_dirty != 0 and pe->hashing_done, then I fear that multiple "flush" tasks are scheduled beacouse try_flush_hashed unlocks for a moment. Does this makes sense? Thanks

ssiloti · 2016-10-22T19:36:35Z

It's possible there's something subtle going on there. Since you can reproduce it easily try reverting that commit and seeing what happens. IIRC that commit was a fix for a deadlock when the write cache is disabled so as long as you have that enabled everything should still work.

ssiloti · 2016-10-22T20:16:36Z

Hmm, I just noticed the comment in do_flush_hashed saying that there can be multiple flush jobs active anyways so it seems we need to handle this case. What we really need is a line accurate backtrace so that we can pinpoint which line is causing the crash.

arvidn · 2016-10-22T21:12:18Z

@aldenml if you still have the binary+debug information for that build (or can rebuild it), you can load it up in gdb and issue: "list *address", where address is one of the addresses in the stack-trace (after "pc"). That will tell you exactly where it is (within the accuracy you get with optimizations enabled)

ssiloti · 2016-10-22T22:07:57Z

@aldenml is there any pattern to the fault address you see in the crash reports? The one in the report you posted is above the 3GB mark which is odd because that puts it in the kernel's address space. That suggests this isn't just a simple dangling pointer access or stack/buffer overflow.

frostwire · 2016-10-23T22:30:32Z

Here's the value of the PC register value on the last crashes received on try_flush_cached

pc e006cfd8 -> 3758542808 -> 3.75 GB
pc e01f3fd8 -> 3760144344 -> 3.76 GB
pc dfcfafd8 -> 3754930136 -> 3.75 GB
pc dedd7fd8 -> 3739058136 -> 3.73 GB
pc 9f319fd8 -> 2670829528 -> 2.67 GB*
pc dfeb9fd8 -> 3756761048 -> 3.75 GB
pc e05e5fd8 -> 3764281304 -> 3.76 GB
pc 9e4a7fd8 -> 2655682520 -> 2.65 GB
pc e0452fd8 -> 3762630616 -> 3.76 GB
pc e040ffd8 -> 3762356184 -> 3.76 GB
pc a1146fd8 -> 2702471128 -> 2.7 GB*
pc 9e4a9fd8 -> 2655690712 -> 2.65 GB*

They all seem to be around there. Except a few* that seem to be below 0xc0000000

frostwire · 2016-10-23T22:34:55Z

@ssiloti could this mean that the crash occurs during a system call from try_flush_cached? meaning the system call points the PC register at some routine referenced in kernel address space

ssiloti · 2016-10-23T22:43:17Z

I meant the fault address, not the program counter, in the report posted the fault was at 0xdaa00000.

frostwire · 2016-10-23T22:53:33Z

fault addr 0xd88c0000 -> 3633053696
fault addr 0xd3d00000 -> 3553624064
fault addr 0xd2c80000 -> 3536322560
fault addr 0xdd900000 -> 3717201920
fault addr 0x99d40000 -> 2580807680*
fault addr 0xd8b40000 -> 3635675136
fault addr 0xd4f80000 -> 3573022720
fault addr 0x98580000 -> 2555904000*
fault addr 0xd8840000 -> 3632529408
fault addr 0xd5d40000 -> 3587440640
fault addr 0xd8140000 -> 3625189376
fault addr 0x98c00000 -> 2562719744*
fault addr 0x91080000 -> 2433220608*

* address below 0xc0000000, in user space.

aldenml · 2016-10-24T02:22:25Z

The error is not easy to replicate, in fact, it never happened in our internal tests. I have no pattern in the error reporting, only that it seems to happens more in Android 6. (interesting the point about the address)

With your comments and suggestions, I will try harder to get the necessary information. Thanks again.

aldenml · 2016-10-24T03:34:00Z

@arvidn @ssiloti after doing some workout with gdb :) I finally narrowed the crash to this line:

        // count number of blocks that would be flushed
        int num_blocks = 0;
        for (int i = end-1; i >= 0; --i)
====>>>>    num_blocks += (p->blocks[i].dirty && !p->blocks[i].pending);

        // we did not satisfy the block_limit requirement
        // i.e. too few blocks would be flushed at this point, put it off
        if (block_limit > num_blocks) return 0;

Any ideas of how cached_piece_entry::blocks can become invalid/corrupt?

aldenml · 2016-10-24T03:38:23Z

Or is it the end calculation wrong?

ssiloti · 2016-10-25T04:39:01Z

Still not sure what's going on here. The fact the the upper 16-bits of the fault address appears to be garbage and the lower 16-bits all zeros suggests something, I'm just not sure what. The obvious suspect would be a torn read, but those shouldn't happen with 32-bit loads on ARM.

aldenml · 2016-10-25T04:51:44Z

I wish I could say something relevant (reading your PR)

aldenml · 2016-10-27T01:26:37Z

unfortunately, I can confirm that the PR #1255 does not fix the error (I ported it to master and I got the crash)

aldenml · 2016-10-31T14:20:17Z

Good news, the crash is gone!

We were lucky to have a device in which the crash manifested repeatedly. The PR #1255 was definitively an improvement but only after the last comment I was thinking about the possibility of bad things going on with -O3 and struct's bitfields and mutex-locks. I just compiled with -Os and the crash disappears.

It's a monumental task to review all the assembly involved to determine what exactly is happening, maybe the ARM clang code generation needs more working. Well, this is all and thanks for all the help.

arvidn · 2016-10-31T21:00:29Z

it might be worth turning the bitfields into normal members, to see if that fixes it too. I wonder if there's a tearing issue because of it. Perhaps some of those fields must live in their own byte to be thread safe

arvidn · 2016-10-31T21:01:18Z

I think it is important to understand this. My suspicion is that the code is wrong, it just doesn't manifest itself on x86

arvidn · 2016-10-31T21:14:01Z

actually, all of the fields in cached_block_entry should only be accessed when holding the mutex, so tearing shouldn't be an issue. as far as I can tell.

aldenml · 2016-11-01T02:42:07Z

I'm reopening the issue as a remainder that this needs further investigation.

joycepg · 2016-11-24T03:56:46Z

we've been getting this error too, but only on certain devices. the nexus 7 has consistently crashed, whereas some samsung 10" tablets don't.
i managed to get our ndk project loaded inside android studio, and finally got a traceback.

not really sure what to do next though!
this is using libtorrent 1.1.1 release

joycepg · 2016-11-24T05:22:19Z

i noticed no line numbers in the stacktrace above for libtorrent code. so i re-ran configure with --enable-debug. when i ran it, it did not crash.

then i modified configure to output -O0 instead of -Os when not debugging.

this resulted in a crash in a different spot, but during the same operation (downloading 6 torrents simultaneously)

joycepg · 2016-11-24T05:29:41Z

another. same settings, same time (downloading 6 torrents), but different spot. smells like a rogue pointer somewhere, stomping over the heap earlier.

arvidn · 2016-11-24T12:22:11Z

I've found that some older ARM chips have buggy atomic operations. You may want to try making boost::shared_ptr use a mutex instead of atomic increment/decrement. There's a define for that.

joycepg · 2016-12-05T03:54:02Z

you may be onto something there.

the define i found was BOOST_SP_NO_ATOMIC_ACCESS

i tried configuring libtorrent:
CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...
but still experienced crashes.

i also tried to use pthreads instead of atomics:
CPPFLAGS="-DBOOST_AC_USE_PTHREADS=1 -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...
but the app didn't even run. it hung during initialization

i seem to be at a dead end.

arvidn · 2016-12-05T04:32:30Z

I'm not sure if boost::shared_ptr has any compiled components in boost thread, but you may want to make sure that boost itself is built with the same configuration. (if you build with boost build, this property would be propagated to dependencies)

Falcosc · 2016-12-29T09:27:25Z

@joycepg did you try to rebuild boost to exclude this failure cause?

joycepg · 2017-01-17T02:26:28Z

finally got back to trying this.

rebuilt boost with

<compileflags>-DBOOST_ASIO_DISABLE_STD_ATOMIC=1
<compileflags>-DBOOST_SP_NO_ATOMIC_ACCESS=1

added to ~/user-config.jam

rebuilt libtorrent with

CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...

then rebuilt my .so with these in Android.mk

LOCAL_CFLAGS := -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 \
-DBOOST_SP_NO_ATOMIC_ACCESS=1 \
-O0 -g

no luck.

the message variable didn't include anything useful. just the libtorrent version number 1.1.1.0-1229491

note that when i link my .so, which uses libtorrent, i only need to link with boost_system, boost_chrono and boost_random.
the boost_thread library does not even compile. The error is

gcc.compile.c++ bin.v2/libs/thread/build/gcc-android4.9/release/link-static/runtime-link-static/target-os-linux/threading-multi/pthread/thread.o
In file included from libs/thread/src/pthread/thread.cpp:19:0:
./boost/thread/future.hpp: In member function 'void boost::promise<void>::lazy_init()':
./boost/thread/future.hpp:2549:37: error: 'atomic_load' was not declared in this scope
             if(!atomic_load(&future_))
                                     ^

but i didn't care to solve that because i don't need to link with it anyway.

i'm happy to receive suggestions. it's pretty quick for me to recompile and reproduce.

Falcosc · 2017-01-17T07:41:29Z

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce. I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

I did try to find the correct combination of build flags by looking through the code. But I didn’t find much at boost thread https://github.com/boostorg/thread/blob/develop/include/boost/thread/future.hpp
But they use macros to overwrite the atomic_load definition. I guess there is option somewhere which enables ARM compatible atomic_load macros.
Would be very nice if you could ask the atomic module team how to build boost for ARM Chips with non-reliable atomic implementations. I hope they could tell us how to disable atomic for boost or how to enable something which does apply a workaround for non-reliable atomic implementations of some ARM chips. I guess you should create an issue here https://github.com/boostorg/atomic
Would be very nice if you could link it to our ticket, because as a result of it I hope we could find an ARM build suggestion for libtorrent documentation.

joycepg · 2017-01-18T00:13:34Z

Thanks for your reply

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce.

The compilation of boost.thread is not the issue.
boost.thread compiled library is not required in the linking. The libtorrent configure output shows this:

Boost libraries:
  version:              
  CPPFlags:             -I/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//include
  LDFlags:              -L/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//lib
  boost.system:         -lboost_system
  boost.chrono:         -lboost_chrono
  boost.random:         -lboost_random

So now I do not even attempt to build boost.thread. I just use

./b2 --with-system --with-chrono --with-random toolset=gcc-android4.9 link=static runtime-link=static target-os=linux --stagedir=android

regarding the actual error i am facing

I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

The issue is that libtorrent is crashing at pretty random locations and times (as shown by screenshots).

If you have some other compile options you want me to try, it is quick to recompile boost.{system,chrono,random}, libtorrent and my .so, and reproduce the crash. It takes about 5 minutes.

I will try to raise an issue on the boost repo that you found. they say there that armv6 is a supported platform with gcc 4.x. our device is armv7 and we are using gcc 4.9, so it should be ok.

joycepg · 2017-01-18T03:38:54Z

well that didn't take long for the boost atomic people to bounce it :-/

timblechmann · 2017-01-18T06:10:44Z

@joycepg i didn't bounce it, but pointed you to the right direction: according to the stack trace, is related to the smart pointer, there is no boost.atomic code involved.

arvidn · 2017-01-18T07:50:46Z

@joycepg I'm not sure what BOOST_SP_NO_ATOMIC_ACCESS does. The issue I've seen in the past (granted, this was almost 10 years ago now) on some arm chips was that the atomic operations did not have correct semantics and would break boost.shared_ptr. The work-around, then, was to enable BOOST_AC_USE_PTHREADS, which uses pthread_mutex instead of atomic add.

I don't know exactly what boost.shared_ptr used to do, I don't see any specific arm code in there now, presumably it's always tried to use intrinsics on gcc, like __sync_add_and_fetch.

If that doesn't help, it's most likely not a bug in boost, but in libtorrent.

joycepg · 2017-01-31T22:13:42Z

i had a read of the boost code and doco. in the boost asio doco it says

Support for the standard atomic integer template is automatically enabled for g++ 4.5 and later, when the -std=c++0x or -std=gnu++0x compiler options are used.

i saw this in the boost code too. so i made sure i passed -std=gnu++0x to the boost and libtorrent builds. this was nice symmetry, since i use that flag to build our app anyway.

no change; still crashed.

i then added -DBOOST_AC_USE_PTHREADS=1 -DBOOST_SP_USE_PTHREADS=1 to the flags, and rebuilt everything. when i ran it several times, it seemed to hang randomly, in the libtorrent/boost code. possibly a deadlock occurring.

arvidn · 2017-02-01T00:07:39Z

It sounds like there might be a bug in libtorrent's disk cache then

joycepg · 2017-02-03T00:28:50Z

ok. i'm happy to work with you to try to isolate/fix

aldenml · 2017-07-11T16:37:04Z

Hi @arvidn, the code by now is so different and the error (if any) is not happening in this way anymore. Do you think it's reasonable to close this?

arvidn · 2017-07-11T17:50:26Z

yeah, probably.

This was referenced Oct 21, 2016

use of piece_refcount_holder in disk_io_thread::do_flush_piece #1239

Closed

use of piece_refcount_holder in disk_io_thread::do_flush_piece #1240

Closed

aldenml mentioned this issue Oct 23, 2016

[android] crash on disk_io_thread::try_flush_hashed, build 311 frostwire/frostwire#218

Closed

ssiloti mentioned this issue Oct 25, 2016

fishing for #1225 #1254

Closed

aldenml added a commit to aldenml/libtorrent that referenced this issue Oct 26, 2016

port of RC_1_1 fix for issue arvidn#1225

412dc91

aldenml added a commit to aldenml/libtorrent that referenced this issue Oct 26, 2016

port of RC_1_1 fix for issue arvidn#1225

1de97c1

This was referenced Oct 27, 2016

Fatal SIGSEGV on Android frostwire/frostwire-jlibtorrent#116

Closed

using custom equal function in block_cache::cache_t #1264

Closed

aldenml closed this as completed Oct 31, 2016

aldenml reopened this Nov 1, 2016

joycepg mentioned this issue Jan 18, 2017

crashes on android boostorg/atomic#9

Closed

arvidn self-assigned this Feb 4, 2017

arvidn closed this as completed Jul 11, 2017

crash in android in disk_io_thread::try_flush_hashed #1225

crash in android in disk_io_thread::try_flush_hashed #1225

Comments

aldenml commented Oct 18, 2016

arvidn commented Oct 19, 2016

aldenml commented Oct 19, 2016

aldenml commented Oct 19, 2016

arvidn commented Oct 19, 2016

arvidn commented Oct 19, 2016

aldenml commented Oct 20, 2016

arvidn commented Oct 20, 2016

aldenml commented Oct 20, 2016

arvidn commented Oct 21, 2016

aldenml commented Oct 22, 2016 • edited Loading

ssiloti commented Oct 22, 2016

ssiloti commented Oct 22, 2016

arvidn commented Oct 22, 2016 • edited Loading

ssiloti commented Oct 22, 2016

frostwire commented Oct 23, 2016 • edited Loading

frostwire commented Oct 23, 2016

ssiloti commented Oct 23, 2016

frostwire commented Oct 23, 2016

aldenml commented Oct 24, 2016

aldenml commented Oct 24, 2016

aldenml commented Oct 24, 2016

ssiloti commented Oct 25, 2016

aldenml commented Oct 25, 2016

aldenml commented Oct 27, 2016

aldenml commented Oct 31, 2016

arvidn commented Oct 31, 2016

arvidn commented Oct 31, 2016

arvidn commented Oct 31, 2016

aldenml commented Nov 1, 2016

joycepg commented Nov 24, 2016

joycepg commented Nov 24, 2016

joycepg commented Nov 24, 2016

arvidn commented Nov 24, 2016

joycepg commented Dec 5, 2016

arvidn commented Dec 5, 2016

Falcosc commented Dec 29, 2016

joycepg commented Jan 17, 2017

Falcosc commented Jan 17, 2017

joycepg commented Jan 18, 2017 • edited Loading

joycepg commented Jan 18, 2017

timblechmann commented Jan 18, 2017

arvidn commented Jan 18, 2017

joycepg commented Jan 31, 2017

arvidn commented Feb 1, 2017

joycepg commented Feb 3, 2017

aldenml commented Jul 11, 2017

arvidn commented Jul 11, 2017

aldenml commented Oct 22, 2016 •

edited

Loading

arvidn commented Oct 22, 2016 •

edited

Loading

frostwire commented Oct 23, 2016 •

edited

Loading

joycepg commented Jan 18, 2017 •

edited

Loading