Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237
Describe your environment
Describe the problem
An app on a certain device crashed with
I'm not sure why the data was stored like that, although I suppose there might be cases that require resetting nextBatchID with
Steps to reproduce:
I couldn't find a way to recreate the same situation.
Stack trace for the crash:
So unfortunately, the stack trace for this crash isn't informative: if this assertion were to fail it can only fail during startup exactly in the way it has.
Do you still have a device that's failing this way? If so we may be able to discover what's happening by examining the stored persistent data.
The failure here indicates that there's a mismatch between the mutation queue contents and the metadata about the queue. These should always be in sync (they're modified together transactionally) and if they're not in sync it points to a serious error somewhere.
Have you changed Firestore SDK versions recently?
I'm sorry, I have already recovered the device and I don't know of a way to reproduce the problem.
We fixed the version of the Firestore SDK to
OK. Looking through the changes around those versions I think I've figured out how you could get into this state.
In #2029 (released with Firebase 5.13.0) we changed the way we store writes and documents that have been acknowledged by the server but not yet observed via an active listen. Previously we would hold these in our outbound queue and keep track in memory of these held write acknowledgements. On restart these held writes would be dropped because the on-disk structures were insufficient to reconstruct the in-memory tracking structures. As a result, prior to Firebase 5.13.0, it was possible to observe flicker across restarts, where a value would fleetingly revert to a state prior to a local write if the app happened to stop in this in-between state.
#2029 fixed this flicker (and made a bunch of other internal improvements possible) by storing the SDK's estimate of the resulting document in the cache and removing acknowledged writes immediately. A one-time schema migration was included that performed a final delete of any existing held write acks on restart.
What I think is happening in your case is that while the schema migration deletes acknowledged writes it doesn't reset the highest acknowledged batch id if all writes were deleted. This in turn causes the violation of the invariant you're observing.
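As a rough illustration of that failure mode, here is a toy model in Python. It is not the SDK's code: `ToyMutationQueue`, `UNKNOWN_BATCH_ID`, both `migrate_*` functions, and the exact form of the invariant (the highest acknowledged batch id must be lower than the next batch id derived from the stored batches) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field

UNKNOWN_BATCH_ID = -1  # hypothetical sentinel for "nothing acknowledged yet"


@dataclass
class ToyMutationQueue:
    """Toy stand-in for a persisted mutation queue (not the SDK's real schema)."""
    batches: dict = field(default_factory=dict)      # batch id -> {"acked": bool}
    highest_acked_batch_id: int = UNKNOWN_BATCH_ID   # metadata stored separately

    def next_batch_id(self) -> int:
        # Assumed rule: the next id is derived from the batches still stored.
        return max(self.batches, default=0) + 1

    def is_consistent(self) -> bool:
        # Assumed invariant: every acknowledged id precedes the next id.
        return self.highest_acked_batch_id < self.next_batch_id()


def migrate_without_reset(queue: ToyMutationQueue) -> None:
    """Delete held (acknowledged) writes but leave the metadata untouched."""
    queue.batches = {i: b for i, b in queue.batches.items() if not b["acked"]}


def migrate_with_reset(queue: ToyMutationQueue) -> None:
    """Same deletion, plus a metadata reset when the queue ends up empty."""
    migrate_without_reset(queue)
    if not queue.batches:
        queue.highest_acked_batch_id = UNKNOWN_BATCH_ID


def held_acks_only() -> ToyMutationQueue:
    # State at shutdown: every write was acknowledged but is still held.
    return ToyMutationQueue(
        batches={1: {"acked": True}, 2: {"acked": True}},
        highest_acked_batch_id=2,
    )


broken = held_acks_only()
migrate_without_reset(broken)
print(broken.is_consistent())  # False: metadata still points at batch 2

fixed = held_acks_only()
migrate_with_reset(fixed)
print(fixed.is_consistent())   # True: metadata was reset with the queue
```

In this model the mismatch only appears when the migration empties the queue. If any unacknowledged write remains, the next batch id stays above the highest acknowledged id and the check still passes.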
The good news is that this in-between state does not exist for very long, and to get into it a user would have to shut down your app just as the last write in the queue was acknowledged but not yet observed. That should make this bug rare.
I need to get back with the team to evaluate possible fixes. Specifically I'm not yet convinced that it's safe to just remove the assertion that your device tripped over.