Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBWorker crashes in Fleece Encoder (writePointer) #792

Closed
fedal-mas opened this issue May 29, 2019 · 15 comments

Comments

@fedal-mas
Copy link

commented May 29, 2019

Hi,

My app is crashing inside the Couchbase SDK. Here is what is happening:

  • I have an DB on Couchbase Server with around 80k documents.
  • I run a script that adds 30k docs (through SG) in around 500s
  • The app is open while the docs are added to the server. In some point it crashes.
  • If I kill the application and try to open again, same problem
  • Only way to solve it was deleting the app and reinstalling it and all the DB, which is not practical at all.

This is an urgent problem to be solved, since I can't update the app with this new version, because on prod this script runs every night and add or update around 30k docs.

Any idea?

Crash log


  • Version: 2.5.1
  • Client OS: iOS 12.3.1
  • Server: Couchbase Server 6.0 / Sync Gateway 2.5.0
@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

Looks like a crash in Fleece, i.e. in LiteCore.

Crash is KERN_INVALID_ADDRESS at 0x0000007902c30a38. Backtrace is:

0 fleece::impl::Encoder::writePointer(long) + 2946456 (Encoder.cc:167)
1 fleece::impl::Encoder::endCollection(fleece::impl::internal::tags) + 2954536 (Encoder.cc:187)
2 fleece::impl::Encoder::endCollection(fleece::impl::internal::tags) + 2954536 (Encoder.cc:187)
3 fleece::impl::Encoder::writeValue(fleece::impl::Value const*, fleece::impl::SharedKeys const*&, 4 fleece::function_ref<bool (fleece::impl::Value const*, fleece::impl::Value const*)> const*) + 2953552 (Encoder.cc:490)
4 FLEncoder_WriteValue + 2849596 (Fleece.cc:533)
5 litecore::repl::DBWorker::_insertRevisionsNow() + 1185712 (DBWorker+Pull.cc:442)

The offsets are huge which makes me suspicious, but the symbol names are all plausible and they do have line numbers, so I'm inclined to trust it.

@snej snej transferred this issue from couchbase/couchbase-lite-ios May 29, 2019

@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

Frame 0 is actually in Encoder::placeItem(), which is called (inlined) by writePointer(). The exact line is

            if (_items->tag == kDictTag)

implying _items is a bad pointer. It certainly seems to point way off into the weeds:

Exception Subtype: KERN_INVALID_ADDRESS at 0x0000007902c30a38
VM Region Info: 0x7902c30a38 is not in any region.  Bytes after previous region: 508463090233  
      REGION TYPE                      START - END             [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      MALLOC_NANO            0000000280000000-00000002a0000000 [512.0M] rw-/rwx SM=PRV  
--->  
      UNUSED SPACE AT END
@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

Going by the line numbers, and looking at what functions call what, the actual stack seems to be

0 Encoder::placeItem() (Encoder.cc:167)
1 Encoder::writePointer(long) (Encoder.cc:523)
2 Encoder::placeValue(size_t size) (Encoder.cc:186)
3 Encoder::writeValue(.....) (Encoder.cc:452) ???
4 Encoder::writeValue(const Value*, const WriteValueFunc*) (Encoder.cc:502)
5 FLEncoder_WriteValue(FLEncoder,FLValue) (Fleece.cc:533)
6 DBWorker::_insertRevisionsNow() (DBWorker+Pull.cc:441)

although I'm unsure about frame 3, as the crash log says line 490 but only line 502 makes sense.

@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

_items points to an item of a smallVector, and there was a bug-fix in smallVector after 2.5 (Fleece commit 20db52bf.) But the fix is in the move constructor, and I don't think that's relevant here.
Update: Yeah, there are no calls to the move constructor at all in 2.5, so the bug is irrelevant.

@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

Ah, I think I have it: _items always points to _stack[_stackDepth-1] or to NULL. So it wouldn't just point to nowhere.
...unless _stackDepth were zero. Then the index into _stack is 0xFFFFFFFF.
Multiplying by sizeof(valueArray) (120) gives a byte offset of 0x77ffffff88, which would put the address of _stack at 0x102c30ab0 ... which is a pretty sane looking address. (The crash log says the previous region is from 280000000 to 2a0000000, which is after that address, but that region may not be the app heap. I doubt the app has 512MB allocated.)

@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

On the other hand I can't find any way that _stackDepth would drop to zero. I put an assert(_stackDepth > 0) at the end of the pop() method, and it never aborts when running tests. The only places that adjust _stackDepth are push() and pop(), and the latter is only called by endCollection(), which is only called by endArray() and endDict(), and those are always balanced.

@snej

This comment has been minimized.

Copy link
Member

commented May 29, 2019

@fedal-mas — is there anything that might be unusual about the documents you're adding? Like extremely deep nesting?

@snej

This comment has been minimized.

Copy link
Member

commented May 30, 2019

Also: Is anything noteworthy logged right before the crash, like an error/warning?

Since the crash goes away if you delete the database (deleting the app shouldn't be necessary), it may be related to delta compression. At the LiteCore level there is a replicator option to turn off this feature (kC4ReplicatorOptionDisableDeltas) but it doesn't seem to be exposed in the Couchbase Lite API, unfortunately...
Oh, if there's an option on Sync Gateway to disable deltas, you could use that and it would have the same effect. That might work around the bug.

@snej snej changed the title 2.5.1: Crashing with large sync DBWorker crashes in Fleece Encoder (writePointer) May 30, 2019

@fedal-mas

This comment has been minimized.

Copy link
Author

commented May 30, 2019

@snej What would you considerer an "extremely deep nesting"?

What I have that is has the deepest nesting is something like:

{
    "a": {
        "b": {
            "c": {
                "d": {
                    "e": "value",
                    "f": "value",
                    "g": "value"
                },
                "i": [
                    {
                        "e": "value",
                        "f": "value",
                        "g": "value",
                        "h": "value"
                    },
                    {
                        "e": "value",
                        "f": "value",
                        "g": "value",
                        "h": "value"
                    },
                    ...
                ]
            }
        }
    }
}

Also: Is anything noteworthy logged right before the crash, like an error/warning?

Nope, it just says that got an EXC_BAD_ACCESS and CouchbaseLiteSwift was compiled with optimization - stepping may behave oddly; variables may not be available..

Oh, if there's an option on Sync Gateway to disable deltas, you could use that and it would have the same effect. That might work around the bug.

I could do that, but the delta sync was one of the features that made my team and me happy, since it would make the sync in general faster. As I said, around 30k docs (and some of them are big) would be updated daily and synced with all devices, so the delta sync would save a lot of bits.

@fedal-mas

This comment has been minimized.

Copy link
Author

commented May 30, 2019

it may be related to delta compression

@snej I've tested to disable the Delta Sync on SG, it's working now, your guess was correct.

@snej

This comment has been minimized.

Copy link
Member

commented May 30, 2019

Thanks! It helps to know for sure that this is related to delta sync. I'm still struggling to figure out how this could happen, but I feel like we're getting closer.

That nesting isn't terribly deep, but it may be deeper than in our replication tests, so I'll try adding a test with data like that.

@snej

This comment has been minimized.

Copy link
Member

commented May 30, 2019

YES! Found the bug. Applying a delta during c4doc_put uses the shared encoder, and calls finish but then not reset. The reset wouldn't be necessary if the encoder weren't going to be used again immediately, but _insertRevisionsNow is using the encoder. Without the reset call, the encoder's stackDepth gets left at 0, so ending the next encoding will underflow it. I wrote a Fleece unit test for this and it crashes exactly the same way:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x790542b168)
  * frame #0: 0x00000001006f5c4d Test`fleece::impl::Encoder::placeItem(this=0x000000010542b020) at Encoder.cc:167:17
    frame #1: 0x00000001006f9e93 Test`fleece::impl::Encoder::writePointer(this=0x000000010542b020, p=4) at Encoder.cc:523:14
    frame #2: 0x000000010070a6e2 Test`unsigned char* fleece::impl::Encoder::placeValue<false>(this=0x000000010542b020, size=2) at Encoder.cc:186:13
    frame #3: 0x00000001006f6655 Test`unsigned char* fleece::impl::Encoder::placeValue<false>(this=0x000000010542b020, tag=kDictTag, param='\0', size=2) at Encoder.cc:198:21
    frame #4: 0x000000010070314b Test`fleece::impl::Encoder::endCollection(this=0x000000010542b020, tag=kDictTag) at Encoder.cc:712:25
    frame #5: 0x00000001006ff581 Test`fleece::impl::Encoder::endDictionary(this=0x000000010542b020) at Encoder.cc:683:9
    frame #6: 0x0000000100259ba4 Test`fleece::impl::(anonymous namespace)::____C_A_T_C_H____T_E_S_T____787::test(this=0x000000010542b020) at EncoderTests.cc:797:13
...

Technically LiteCore is misusing the Encoder by reusing it without calling reset, but this is a really subtle mistake (I'd forgotten about this requirement) so I'm going to treat it as a bug in Fleece and fix it there. couchbaselabs/fleece#47

@djpongh djpongh added this to the 2.5.2 milestone May 30, 2019

@snej snej closed this in 0a13079 May 30, 2019

@fedal-mas

This comment has been minimized.

Copy link
Author

commented May 30, 2019

@snej thanks for looking into this!

Any idea when this fix will be released?

@snej

This comment has been minimized.

Copy link
Member

commented May 30, 2019

Looks like it'll be in 2.5.2 ... not sure when, though.

@snej

This comment has been minimized.

Copy link
Member

commented May 31, 2019

Your options look like:

  • Disable delta sync on the server and use 2.5.1
  • Wait for 2.5.2
  • Build your own Couchbase Lite from the release/iridium branch of couchbase-lite-ios, but making sure to update the couchbase-lite-core submodule to commit 06c438d (or later)
  • Pay for Couchbase support and have our support engineers send you a build with the fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.