Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC: fix cases where gRPC call could be finished twice #2146

Merged
merged 11 commits into from Dec 5, 2018

Conversation

@var-const
Copy link
Contributor

@var-const var-const commented Dec 4, 2018

gRPC C++ wrappers assert that performing any operations on a call results in an "ok" error code. In certain cases, it was possible for Firestore to call grpc::ClientAsyncReaderWriter::Finish twice, which causes that assertion to fail, crashing the app. I haven't succeeded in reproducing the issue, but I presume the following would be consistent with bug reports:

  • in quick succession, an operation fails, and then Stream stops due to idleness. Normally, a failed operation would cause the stream to stop, and also when an operation fails, Stream doesn't call FinishImmediately on the GrpcStream. However, handling operation failure is split into two async steps: the first one is to finish the underlying gRPC call with a completion, and the second is to invoke the completion which will notify the parent Stream. Because the second step gets added to the back of the worker queue, theoretically the following configuration is possible:
[GrpcStream::OnOperationFailed] [...] [...] [Stream::Stop() /* idleness */] [...] [GprcStream::FinishAndNotify]
                        v -----------------------------------------------------------------------^

Stop due to idleness is considered a graceful stop, and so will call GrpcStream::FinishImmediately. GrpcStream::Shutdown relies on completions_.empty() check to avoid double finish. However, in this case, the check is insufficient, because GrpcStream still has one completion -- the "finish" operation enqueued in OnOperationFailed that won't be removed until FinishAndNotify.

  • if GrpcStream::OnOperationFailed is called twice in quick succession, so that the second OnOperationFailed is executed before FinishAndNotify enqueued by the first OnOperationFailed is. The failure is similar to case 1.

These sound like pretty rare circumstances, which is consistent with how difficult the issue is to reproduce. There might be more cases; the best fix seems to be ensuring that gRPC call is never finished twice.

This PR:

  • adds a flag, is_grpc_call_finished_, to track whether the underlying call has been finished without trying to interpret the current state of GrpcStream to infer that information;
  • adds logging to the shutdown sequence of GrpcStream, in case more related crashes are discovered;
  • adds tests for the two situations described above. Manually verified that none of the tests pass on master.
Copy link
Contributor

@wilhuff wilhuff left a comment

Basically LGTM with a question.

return;

if (!completions_.empty() && !is_grpc_call_finished_) {
// Important: since the stream always has a pending read operation,
Copy link
Contributor

@wilhuff wilhuff Dec 4, 2018

This comment is hard to parse because:

  • it conflates pre- and post-conditions
  • doesn't clearly indicate normal and error operation
  • uses ambiguous phrasing (e.g. "cancellation" could seemingly apply to either the context or the call)

Consider phrasing something like:

Important: during normal operation, the stream always has a pending read operation so Shutdown would hang indefinitely if we didn't cancel the context_. However, if the stream has already failed, avoid canceling the context to avoid overwriting the status captured during the OnOperationFailed.

Copy link
Contributor Author

@var-const var-const Dec 5, 2018

Thanks, I like this wording.

// never been started.
return;

if (!completions_.empty() && !is_grpc_call_finished_) {
Copy link
Contributor

@wilhuff wilhuff Dec 4, 2018

I don't entirely understand why both conditions must hold here. It seems as if is_grpc_call_finished_ should be sufficient, no?

In particular this suggests that it's possible for e.g. completions_ to be empty but is_grpc_call_finished_ to be false and we wouldn't want to call FinishGrpcCall, but that seems wrong.

Copy link
Contributor Author

@var-const var-const Dec 5, 2018

this suggests that it's possible for e.g. completions_ to be empty but is_grpc_call_finished_ to be false and we wouldn't want to call FinishGrpcCall, but that seems wrong.

The stream is in this state before it is started. Finishing a non-started stream would also fail a gRPC assertion. I don't think we should ever run into the case when Stream tries to Finish a GrpcStream before it is started. I'm mainly doing this because it's an easy check to make.

If you think protecting against this case isn't worthwhile, I'll simplify the check. Otherwise, I'll add a comment.

Copy link
Contributor

@wilhuff wilhuff Dec 5, 2018

OK. Looking at the prior state more closely there was a comment there that explained this and it made sense. It's possible you may want to merge something of that comment with what you have below.

@var-const var-const assigned wilhuff and unassigned wilhuff Dec 5, 2018
wilhuff
wilhuff approved these changes Dec 5, 2018
Copy link
Contributor

@wilhuff wilhuff left a comment

LGTM

// never been started.
return;

if (!completions_.empty() && !is_grpc_call_finished_) {
Copy link
Contributor

@wilhuff wilhuff Dec 5, 2018

OK. Looking at the prior state more closely there was a comment there that explained this and it made sense. It's possible you may want to merge something of that comment with what you have below.

@var-const var-const merged commit 48b57a3 into master Dec 5, 2018
2 checks passed
bstpierr added a commit that referenced this issue Dec 6, 2018
* master: (26 commits)
  Functions Interop (#2113)
  Add a travis cron job for CocoaPod symbol collision testing (#2154)
  Save schema version on downgrade, add test to verify (#2153)
  Silence Storage Unit Test `nil` warning. (#2150)
  Update versions for Release 5.14.0 (#2145)
  gRPC: fix cases where gRPC call could be finished twice (#2146)
  Fix Swizzler test warnings (#2144)
  Update Auth CHANGELOG.md (#2128)
  Make fuzz tests optional until they pass (#2143)
  Add support of Game Center sign in (#2127)
  Add test for deprecated FDLURLComponents init API. (#2133)
  fix a typo in integration test (#2131)
  Make fuzzing less verbose to avoid exceeding Travis log limit (#2126)
  Move to `domainURIPrefix` for FIRDynamicLinkComponents (#2119)
  Carthage instructions for new gRPCCertificates.bundle location (#2132)
  Fix pod lib lint GoogleUtilities.podspec --use-libraries regression (#2130)
  Avoid using default FIROptions directly. (#2124)
  Changelog entry for LRU GC (#2122)
  Revert "Add Firebase Source to Header Search Path" (#2123)
  Custom fdl domain (#2121)
  ...
@firebase firebase locked and limited conversation to collaborators Oct 24, 2019
@paulb777 paulb777 deleted the varconst/fix-double-finish branch Jul 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants