
grpc.StartGRPCServer takes 5 seconds to start #14429

Closed
Wondertan opened this issue Dec 27, 2022 · 9 comments · Fixed by #15041
@Wondertan
Contributor

Wondertan commented Dec 27, 2022

Summary of Bug

grpc.StartGRPCServer takes 5 seconds to start. It's not really clear why this approach was taken, and git blame didn't give me any definitive answer. Looking at the Serve implementation, I don't see why one would wait 5 seconds to conclude the server started correctly.
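For context, the current implementation looks roughly like this (paraphrased from memory, so treat it as a sketch rather than a verbatim copy of `server/grpc/server.go`):

```go
package server

import (
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
)

// Sketch of the current pattern: Serve already runs in a goroutine, yet the
// caller still blocks for 5 seconds, treating "no error within 5s" as a
// successful start.
func startGRPCServer(grpcSrv *grpc.Server, listener net.Listener) (*grpc.Server, error) {
	errCh := make(chan error, 1)
	go func() {
		if err := grpcSrv.Serve(listener); err != nil {
			errCh <- fmt.Errorf("failed to serve: %w", err)
		}
	}()

	select {
	case err := <-errCh:
		return nil, err
	case <-time.After(5 * time.Second): // assume success if nothing failed yet
		return grpcSrv, nil
	}
}
```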

Why is this a problem? Besides solving a problem that does not exist, it creates additional problems for end users. E.g., we in celestia-node run dozens of integration tests locally and on CI, and each test that runs an App has to wait an additional 5 seconds for no reason. A developer from Secret also shared that they had to engineer a similar workaround.

Running this in a goroutine does not help and is racy, as the func sets GRPCClient on the context. Additionally, the func does not respect Go's native context.Context... Context police 🚓 is discontented.

I don't see any other workaround besides copying the code and waiting until the fix lands downstream in our fork.

Possible solutions

  • At least respect Go's context in the channel wait for the timeout; then there would be a workaround: set a context deadline to exit earlier.
  • But really, there should not be any timeout at all; it's totally fine to run Serve in the goroutine without handling an error for the first 5 seconds. Additionally, you could pass a logger (through the Context?) to log errors, if any. Also, Go's native context should still be respected, and the server goroutine should be stopped once it is canceled; otherwise, the server runs forever and the goroutine is leaked. (See the sketch after this list.)
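
A rough sketch of what I mean (the names and the logger parameter are illustrative, not existing SDK API):

```go
package server

import (
	"context"
	"net"

	"google.golang.org/grpc"
)

// logger is a minimal placeholder for whatever logger the caller passes in.
type logger interface {
	Error(msg string, keyvals ...interface{})
}

// startGRPCServerAsync returns immediately: Serve runs in its own goroutine,
// late errors are logged instead of returned, and the server is shut down
// once the caller's context is canceled, so the goroutine never leaks.
func startGRPCServerAsync(ctx context.Context, log logger, grpcSrv *grpc.Server, listener net.Listener) *grpc.Server {
	go func() {
		if err := grpcSrv.Serve(listener); err != nil {
			log.Error("gRPC server stopped serving", "err", err)
		}
	}()

	go func() {
		<-ctx.Done()           // respect the caller's context
		grpcSrv.GracefulStop() // unblocks Serve and closes the listener
	}()

	return grpcSrv
}
```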

Version

Latest master

Steps to Reproduce

Just run grpc.StartGRPCServer

@Wondertan
Contributor Author

Happy to submit a patch once we settle on the solution.

@alexanderbez
Contributor

IIRC, Serve is a blocking call. So how can StartGRPCServer return the server object without blocking the caller's process?

@tac0turtle tac0turtle added the C: gRPC Issues and PRs related to the gRPC service and HTTP gateway. label Dec 28, 2022
@tac0turtle
Member

does it have to be blocking @alexanderbez ?

@Wondertan
Contributor Author

> IIRC, Serve is a blocking call. So how can StartGRPCServer return the server object without blocking the caller's process?

@alexanderbez, same as it is right now: start the goroutine internally. If the user needs to handle the error (unlikely), she can call Serve herself. Handling errors for only 5 seconds is pointless.
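
A caller-side sketch of that split (illustrative only; the address and service registration are placeholders):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// runGRPC shows the alternative for a caller that actually wants the error:
// skip the helper and drive Serve directly, handling failures for the whole
// lifetime of the server rather than just its first 5 seconds.
func runGRPC(addr string) (*grpc.Server, error) {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}

	srv := grpc.NewServer()
	// ... register services on srv here ...

	go func() {
		if err := srv.Serve(lis); err != nil {
			log.Printf("gRPC server exited: %v", err)
		}
	}()

	return srv, nil
}
```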

@alexanderbez
Contributor

alexanderbez commented Dec 28, 2022

> does it have to be blocking @alexanderbez ?

I'm referring to Serve on the gRPC server (which is defined by the open-source gRPC project), here: https://github.com/cosmos/cosmos-sdk/blob/6dfa0c98062d5d8b38d85ca1d2807937f47da4a3/server/grpc/server.go#LL70C19-L70C19

So yes, it's starting a server, of course it has to block.


@Wondertan I'm not really following your proposal, but if you can open a draft PR I would be happy to take a look at it 😄

Wondertan added a commit to celestiaorg/celestia-node that referenced this issue Jan 5, 2023 (#1551)

This PR is one of the cleanup PRs that I am doing. It removes the usage of KVStore. Besides simplifying and unifying tests, it also fixes all the existing and potential "address already in use" bind errors from the app and tendermint tandem. I tested this by running all the tests with `-parallel 12`, which runs them in parallel per pkg. With this flag, we can speed up the time it takes to run all the tests in CI and locally. Going further, we could also consider running all the tests in parallel with only one instance of the app running rather than an instance per test.

Would love this diff to be reddish, but unfortunately, I had to copy a bunch of code from Cosmos SDK. The rationale is in the comments. Also, see cosmos/cosmos-sdk#14429
tzdybal pushed a commit to celestiaorg/go-header that referenced this issue Jan 19, 2023 (#1551)
@ValarDragon
Contributor

ValarDragon commented Feb 15, 2023

cc @Wondertan did this get fixed in celestia / anything that can be upstreamed?

Would appreciate this 5-second delay being eliminated as well. It's not clear to me either why this is a blocking 5-second wait. As I read the godoc, it seems it's explicitly not intended to be blocking: https://pkg.go.dev/google.golang.org/grpc#Server.Serve

> Serve returns when lis.Accept fails with fatal errors. lis will be closed when this method returns. Serve will return a non-nil error unless Stop or GracefulStop is called.

So the currently implemented semantics seem incorrect: they assume no errors / successful continual execution after 5 seconds. What I think you want instead is for this error to be handled within that same goroutine, with {some error recovery handling} for the main application to deal with. (And no blocking behavior.)

We should define how we want gRPC errors to be dealt with, to infer the error-handling strategy. If we want gRPC to be isolated from the main application, this error recovery plan could just be a function that runs after grpcServer.Serve() in that goroutine. Other than that, the function immediately returns with the server.

If we wanted gRPC failures within 5 seconds to terminate the entire application, we could have the main app pass in a channel on which the error is communicated, and then run the error-handling logic from within the app. (But entirely non-blocking.)
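
Roughly this shape, purely as an illustration (the signature is not a concrete API proposal):

```go
package server

import (
	"net"

	"google.golang.org/grpc"
)

// startGRPCServerNonBlocking returns immediately and reports any Serve error
// on errCh, which the main application owns; the app decides whether such an
// error should terminate the process. Nothing here blocks the caller.
func startGRPCServerNonBlocking(grpcSrv *grpc.Server, listener net.Listener, errCh chan<- error) *grpc.Server {
	go func() {
		// Per the godoc, Serve returns a non-nil error unless Stop or
		// GracefulStop was called.
		if err := grpcSrv.Serve(listener); err != nil {
			errCh <- err
		}
	}()
	return grpcSrv
}
```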

@yihuang
Collaborator

yihuang commented Feb 15, 2023

I think we can share the err channel between all services, and only wait once at the end.
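
Sketching that idea (illustrative, not existing SDK code): every service writes its error into one shared buffered channel, and the node waits on it exactly once.

```go
package server

import "context"

// waitForServices is the single wait point: all services (gRPC, API,
// gRPC-web, ...) report Serve errors into the same shared channel, and the
// node blocks here once instead of sleeping per service at startup.
func waitForServices(ctx context.Context, errCh <-chan error) error {
	select {
	case err := <-errCh:
		return err // the first service failure shuts the node down
	case <-ctx.Done():
		return ctx.Err() // normal shutdown requested by the caller
	}
}
```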

@alexanderbez alexanderbez self-assigned this Feb 15, 2023
@alexanderbez
Contributor

I think I can drastically clean this stuff up. Let me take a stab at it...

@Wondertan
Contributor Author

> cc @Wondertan did this get fixed in celestia / anything that can be upstreamed?

@ValarDragon, we have a quick solution for the test runner that just avoids the timeout. We don't have any clean and "correct" solution.
