
Firestore.terminate hangs, and never calls callback #12522

Open · sergiocampama opened this issue Mar 12, 2024 · 28 comments

@sergiocampama (Contributor)

Description

There are certain scenarios where I'll call Firestore's terminate function, and it will hang, never calling the callback. I haven't been able to create a reproducer, but can consistently get my product's code to get into this state (not shareable for obvious reasons).

This appears to happen when there's an overload of Firestore document changes. I can see that the com.google.firebase.firestore queue's target queue is com.apple.root.default-qos.overcommit, though I'm not sure what that means. The queue seems to start with that target queue already set, so it could be a red herring.

Here's a screenshot of the backtrace of where firestore hangs... It looks like it's unable to close the gRPC watch stream for the RemoteStore, stuck on WaitUntilOffQueue.

[Screenshot (2024-03-11): backtrace where Firestore hangs during terminate]

I'm basically stuck trying to debug this to see where things go wrong and what I can do to avoid this. I'm very unfamiliar with this codebase, so I don't even know what to look for to find where the code is getting stuck. Any help with pointers on where to look or what to check would be very appreciated.
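
For illustration, the hang looks roughly like this from the app's side (a minimal sketch; the watchdog is only how we notice the missing callback, not part of the SDK):

import Foundation
import FirebaseFirestore

// Minimal sketch of how the hang manifests; names are illustrative.
// In the bad state, the completion handler below is never invoked.
func shutDownFirestore(_ db: Firestore) {
    var completed = false

    db.terminate { error in
        completed = true
        print("terminate finished, error: \(String(describing: error))")
    }

    // Watchdog only to observe the hang; not part of the SDK.
    DispatchQueue.global().asyncAfter(deadline: .now() + 10) {
        if !completed {
            print("terminate callback still not called after 10s")
        }
    }
}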

Reproducing the issue

Can't share reproduction steps, as I can't share the relevant code.

Firebase SDK Version

10.22

Xcode Version

15.3

Installation Method

CocoaPods

Firebase Product(s)

Firestore

Targeted Platforms

iOS

Relevant Log Output

There are no relevant logs.

If using Swift Package Manager, the project's Package.resolved

None

If using CocoaPods, the project's Podfile.lock

Can't share

@google-oss-bot

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

@sergiocampama (Contributor, Author)

Not sure if this is relevant, but I noticed that the TCP connection stops receiving any bytes after this condition is triggered... wondering if something here is causing some TCP blockage because of the amount of network traffic.

@sergiocampama (Contributor, Author)

I think I have an idea of why it starts happening, but can't yet explain it in code. After some event, we create changelog documents in a collection like "/changelog/user_id", which can mean hundreds or thousands of documents created in a very short span of time, outside of any database transaction, so the listener receives hundreds of serial updates of one document each. For some reason this seems to lock up Firestore; if I remove this changelog subscription, I can't reproduce the hang any more.

milaGGL self-assigned this Mar 12, 2024
@milaGGL (Contributor) commented Mar 12, 2024

Hi @sergiocampama, thank you for filing this issue. Quick question: what were the steps when you reproduced this? Terminating Firestore hangs, but how is that related to "an overload of firestore document changes"? Are you terminating Firestore right after mutating hundreds of documents?

@sergiocampama (Contributor, Author)

Through some mechanism, we're creating/modifying 1000s of changelog documents. We have a listener for that collection of changelogs, which is updated 1 at a time, since the changelog documents are created from different cloud functions (they're onModified triggers for another set of documents).
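
For context, the listener is an ordinary snapshot listener on that collection, shaped roughly like this (the path and fields are made up for illustration):

import FirebaseFirestore

// Illustrative sketch of the changelog listener; the real path and
// query are different, but the shape is the same.
let db = Firestore.firestore()
let registration = db.collection("changelog")
    .document("user_id")
    .collection("entries")
    .addSnapshotListener { snapshot, error in
        guard let snapshot = snapshot else {
            print("listener error: \(String(describing: error))")
            return
        }
        // Each cloud-function write arrives as its own tiny update, so
        // this fires hundreds of times in quick succession.
        for change in snapshot.documentChanges {
            print("\(change.type) \(change.document.documentID)")
        }
    }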

At some point during this listening, it seems firestore locks up and stops receiving any kind of updates. We notice this lack of updates because the UI does not reflect that a document which should have been deleted, is not removed from the screen.

At this point, if you call Firestore.terminate, the callback never gets called because Firestore is somehow locked, and it gets stuck on the backtrace shown above.

@milaGGL (Contributor) commented Mar 12, 2024

My colleague has pointed out that it might be because LevelDbRemoteDocumentCache is creating concurrent tasks. It might have created 1000 tasks in your case, which would overflow the default queue as well as any new threads created afterwards.

Could you please try changing the LevelDbRemoteDocumentCache::LevelDbRemoteDocumentCache constructor to use CreateSerial instead, and see if it resolves the issue:

 executor_ = Executor::CreateSerial("com.google.firebase.firestore.query");

@sergiocampama (Contributor, Author)

Made the change locally, but it didn't help, still getting stuck. I also replaced the cacheSettings with MemoryCacheSettings(garbageCollectorSettings: MemoryLRUGCSettings()), which bypasses the LevelDB layer, and it's also getting stuck on the same events.
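
For reference, the memory-cache configuration I tried looks roughly like this (a sketch; the settings have to be applied before the instance is otherwise used):

import FirebaseFirestore

// Sketch: bypass the LevelDB layer with the in-memory cache.
let settings = FirestoreSettings()
settings.cacheSettings = MemoryCacheSettings(
    garbageCollectorSettings: MemoryLRUGCSettings()
)
let db = Firestore.firestore()
db.settings = settings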

@sergiocampama (Contributor, Author)

it's not looking like a dispatch issue, more like one of the GrpcStreams is getting blocked

@milaGGL (Contributor) commented Mar 12, 2024

With persistence ruled out, the next suspect is gRPC. In the next iOS SDK release we are doing a significant upgrade of the gRPC dependency, which may fix the issue. Would you like to try again after the new release? Or you can try applying #12398 locally to upgrade gRPC in advance.

Could you please provide debug logs as well? They might give us some clues.
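
For reference, verbose logging can usually be turned on like this before Firebase is configured (a minimal sketch):

import FirebaseCore

// Enable verbose Firebase logging; Firestore's stream activity then
// shows up as [FirebaseFirestore][I-FST000001] lines in the console.
FirebaseConfiguration.shared.setLoggerLevel(.debug)
FirebaseApp.configure()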

@sergiocampama (Contributor, Author)

No dice on updated gRPC.

I've enabled logging, but the output is massive and contains information that's too sensitive for a public setting, and I don't think it provides much value anyway, since there are no errors, just a long list of watch/unwatch/response logs... I could share it if you start an email thread with me at sergiocampama at gmail dot com...

but I did find this: when the connection seems to hang because of the massive number of document updates, I can still send requests to the backend to listen to new documents (so it's not a dispatch queue issue), but the backend does not reply with a response for the initial document read nor for the registered target_id. So it's starting to look like the WatchStream from Firestore's Datastore is getting hung on the backend side (could be head-of-line blocking on that h2 stream)... if this is indeed a possibility, it seems harder to debug from your side... not sure if I can give you any identifier for the connection so that someone could look more into it...

it would be great if the Firestore gRPC channel had some sort of ping mechanism to detect that the connection is hanging and to try to force-recover it by re-registering the target_ids, so that we don't lose uptime just because that single connection is blocked... if not, force-closing the watch_stream and write_stream from remote_store would also help, so that we can at least recover by letting the user perform the action that involves Firestore's termination

for now we're mitigating by trying to reduce the number of document reads to not trigger this hang, but given we don't have a root cause yet, we fear it might still be triggered in the future.

@sergiocampama (Contributor, Author)

Actually, looking at the logs for the last response log, I see this just after it:

10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): operation of type 2 failed
10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): finishing the underlying call
10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): fast finishing 1 completion(s)

(we're using 10.22 Firestore but 10.17 FirebaseCore for some reason)

@sergiocampama (Contributor, Author)

I set a breakpoint in datastore.cc's PollGrpcQueue for when ok == false, then tried tracing the WriteStream's write call to see if I could find where the error is parsed or produced in the C++ code, but it's just so difficult to traverse... so I'm unsure whether we're getting the error from the backend or from the network stack, and there are no other error logs...

is there a way to get more information on why the grpc completion queue failed to write to the stream?

@sergiocampama (Contributor, Author)

I'm fairly confident that when we get GrpcStream('168259010'): operation of type 2 failed, the stream effectively dies and firestore gets into a corrupted state where the only recourse is to restart the app

@sergiocampama (Contributor, Author)

just tested that I can continue to use the write stream by creating new Firestore documents after getting the operation of type 2 failed error. So this is specifically about Datastore's WatchStream getting completely stuck when it fails to write a message, and Firestore failing to recover from this state even after attempting to terminate the Firestore instance. Hopefully this is enough information for you to create a test that simulates this condition (just the failure to write on the watch stream) and reproduces it...
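
the check was roughly this (a sketch; the collection name is made up):

import FirebaseFirestore

// Sketch: verify the write stream is still alive after the watch stream
// error by doing an ordinary write that must round-trip to the backend.
let db = Firestore.firestore()
db.collection("debugChecks").document().setData([
    "createdAt": FieldValue.serverTimestamp()
]) { error in
    print("write completed, error: \(String(describing: error))")
}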

ideally, we should be able to find the root cause of the write failure, but failing that, allowing Firestore to recover from this state without requiring an app restart should be the higher priority IMO

@sergiocampama (Contributor, Author) commented Mar 13, 2024

One final update: I was able to create a small repro case that results in the same behavior. This overloads the WatchStream until there's a read error (GrpcStream('6000032595f0'): operation of type 1 failed), and then when you tap on the Run stuff button on the simulator, it attempts to terminate Firestore but it can't and gets stuck

not sure if this is the same failure mode as what I'm seeing above, but it looks similar in that if the watch or write streams are overloaded, firestore gets corrupted with no way of recovering

https://github.com/sergiocampama/ListenerError

@milaGGL (Contributor) commented Mar 13, 2024

Hi @sergiocampama, thank you for the repro app. I am able to reproduce the "operation of type 2 failed" error.

@sergiocampama (Contributor, Author)

awesome, thanks for confirming... let me know when you get a chance to review with the team, to see if there are any quick workarounds that we can use

@sergiocampama (Contributor, Author)

ok, found a workaround that works for us for now...

if I replace GrpcStream::NewCompletion with the following in grpc_stream.cc:

std::shared_ptr<GrpcCompletion> GrpcStream::NewCompletion(
    Type tag, const OnSuccess& on_success) {
  // Can't move into lambda until C++14.
  GrpcCompletion::Callback decorated =
      [this, on_success](bool ok,
                         const std::shared_ptr<GrpcCompletion>& completion) {
        RemoveCompletion(completion);

        if (ok) {
          if (on_success) {
            on_success(completion);
          }
        } else {
          // Workaround: log the failure, still invoke on_success, and skip
          // OnOperationFailed() so the stream is not torn down.
          LOG_DEBUG("GrpcStream('%s'): operation of type %s failed", this,
                    completion->type());
          if (on_success) {
            on_success(completion);
          }
          // Use the same error-handling for all operations; all errors are
          // unrecoverable.
          //  OnOperationFailed();
        }
      };

  // For lifetime details, see `GrpcCompletion` class comment.
  auto completion =
      GrpcCompletion::Create(tag, worker_queue_, std::move(decorated));
  completions_.push_back(completion);
  return completion;
}

I can avoid the stream crapping out... it won't get the listen updates, but I can pair this with a polling read every n seconds, and if that fails, I can call terminate, which no longer locks up.
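
the polling side looks roughly like this (a sketch; the collection/document names are made up and the interval is arbitrary):

import Foundation
import FirebaseFirestore

// Sketch: periodically force a server read as a health check for the
// listen channel; if it errors out, terminate and let the app rebuild
// the Firestore instance and its listeners. Call from the main thread
// so the timer has a run loop.
func startWatchStreamHealthCheck(db: Firestore, interval: TimeInterval = 30) {
    Timer.scheduledTimer(withTimeInterval: interval, repeats: true) { _ in
        db.collection("healthChecks").document("ping")
            .getDocument(source: .server) { _, error in
                guard let error = error else { return }
                print("health check failed: \(error), terminating Firestore")
                db.terminate { _ in
                    // Recreate the instance and re-register listeners here.
                }
            }
    }
}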

@sergiocampama (Contributor, Author)

In fact, with the code above, which looks terrible since it calls on_success for the error case, I'm seeing that the connection keeps working normally: it never locks up, it never locks up the terminate call, and it sort of recovers on its own... sure, I'll lose some events, but in those cases we can just call terminate and restart the listeners... I disagree with "all errors are unrecoverable".

@sergiocampama (Contributor, Author)

[Screenshot (2024-03-13): debugger view of the failed gRPC completion]

more information about the "failure" message... it looks like the request itself succeeded? it's unclear why the "ok" boolean is set to false

@sergiocampama (Contributor, Author)

Tracing where the ok status comes from, I was able to find this error:

(lldb) p grpc_error_std_string(error)
(std::string) "UNKNOWN:Attempt to send message after stream was closed. {file:\"/Users/kaipi/Developer/cpl/copilot-ios/main/Pods/gRPC-Core/src/core/lib/surface/call.cc\", file_line:1083, created_time:\"2024-03-13T19:01:29.963325-03:00\"}"

after that, only a single bit is propagated to signal the failure of the completion.

@sergiocampama (Contributor, Author)

Figured out how to enable gRPC tracing and was able to see that we are indeed getting a RST_STREAM from the backend with code 0.

here are the trace logs: https://gist.githubusercontent.com/sergiocampama/c7ebecd21f42ca948358cd45bc1d5a77/raw/0fd3b4e2e8ff8c5639c9c3bca48be1be9a7d8f1e/gistfile1.txt

What I was able to find is that the watch stream does indeed get a RST_STREAM message and then we restart that stream; as you can see, there are multiple SEND_INITIAL_METADATA{:path: /google.firestore.v1.Firestore/Listen events.

What I think is happening is that, on overload, the grpc++ batch processor keeps sending messages to the backend after the stream has closed. It doesn't detect that event and cancel the subsequent messages, which generates a discardable error, but an error nonetheless, and that is taken as a (wrong) signal that the stream is broken and needs to be closed, which in turn puts Firestore into a broken state that it cannot recover from.
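
For reference, this is roughly how I enabled the tracing above; I'm not sure it's the only way, but gRPC-Core reads these environment variables if they're set before the first gRPC call (the tracer names come from gRPC-Core's documentation):

import Foundation
import FirebaseCore

// Sketch: turn on gRPC-Core tracing via environment variables before
// Firebase (and therefore gRPC) is initialized.
setenv("GRPC_VERBOSITY", "DEBUG", 1)
setenv("GRPC_TRACE", "http,channel", 1)
FirebaseApp.configure()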

@sergiocampama (Contributor, Author)

Created a gRPC++ issue here: grpc/grpc#36115

@sergiocampama (Contributor, Author)

Created a fork with this change for now [1]. Hopefully you can all reproduce and provide a more robust fix over time, but this is sufficient for our needs for now.

1: copilotmoney@b4d785d

@milaGGL (Contributor) commented Mar 14, 2024

This is amazing. Thank you for sharing the debugging process, it is very informative! I will bring this to our team discussion and keep you updated.

@ehsannas (Contributor)

@sergiocampama you deserve an award for this investigation.

milaGGL assigned tom-andersen and unassigned milaGGL Mar 15, 2024
@yashykt commented Mar 19, 2024

Hi, I'm from the gRPC team. I don't completely understand the issue here, but from a gRPC perspective: if a write on a stream/RPC is failing, there is no point in writing to that stream again. Whether an RPC is done or not is decided by what's happening on reads and whether a status is available for that RPC. In this case, a RST_STREAM has been received on the associated HTTP2 stream, so the stream is indeed done and a status should be available for the RPC.

Please feel free to reach out to me (or schedule a meeting) in case there are any questions as to how gRPC works.

@sergiocampama (Contributor, Author)

Thanks @yashykt, I was basing my batching assumption on the error being generated in the FilterStackCall::BatchControl::PostCompletion() method in call.cc. I could not find any batching being done in Firestore, and it seems to be using the grpc_completion_queue mechanism, so I'm not sure whether that's where the issue lies or whether it's a red herring.

I'll let the Firestore experts figure it out then.
