restart an HTTP server on schema updates #5
Conversation
That pull request introduced a perf regression, which should now be fixed with 02a51d3, and more than fixed: performance actually improves compared to the main branch. I have been testing locally with the Rust based subservices, limiting the router to 2 threads so I can saturate it more easily:

```diff
-#[tokio::main]
-async fn main() -> Result<()> {
+//#[tokio::main]
+fn main() -> Result<()> {
+    let runtime = tokio::runtime::Builder::new_multi_thread()
+        .enable_all()
+        .worker_threads(2)
+        .build()
+        .unwrap();
+    runtime.block_on(rt_main())
+}
+
+async fn rt_main() -> Result<()> {
```

I was benchmarking using hey:
benchmark results (hey output for main vs. PR not captured here):
- 200000 requests, 200 concurrent clients
- 200000 requests, 400 concurrent clients
- 200000 requests, 800 concurrent clients
Those results are interesting: with 200 clients, the PR has slightly higher latency but much higher throughput, while with 400 clients, the PR has lower throughput but a tighter latency distribution. This is good, because for a highly scalable server we want to optimize for a large number of clients each doing few queries rather than a small number of clients doing a lot of queries. (These are 20 second benchmarks, so not much data, but from what I have seen the results do not change when running them longer; in any case, a benchmark on localhost on a laptop is not conclusive.)
Next: running this in the gatling benchmarks
Current benchmarks with 50 fields, breadth 1, depth 2, backend response time 40ms: for this PR at 02a51d3, the breaking point is at 10659 rps. Both branches keep a very stable p99 latency until the breaking point. With this I declare that this PR can be merged 🥳
LGTM
Fixes #235
In a previous PR, the schema that was shared behind a lock between the state machine and the HTTP server was removed, which broke schema updates. We want to fix this without reintroducing a lock in the middle of the hot path for queries.
Our solution is to launch a new HTTP server with the new schema configuration on every schema update, as launching a server is very cheap. To do that, we need to replace warp's HTTP server with our own accept loop, to gain the ability to reuse the TCP listener socket from one server to the next and avoid losing TCP connections.
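A minimal sketch of that restart flow, assuming tokio; the names here (`HttpServerHandle`, `start_server`, the `String` standing in for the schema type) are illustrative, not the router's actual API:

```rust
use tokio::net::TcpListener;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// illustrative stand-in for the server handle described in this PR
struct HttpServerHandle {
    shutdown_sender: oneshot::Sender<()>,
    // the accept loop resolves with its TcpListener so it can be reused
    server_task: JoinHandle<TcpListener>,
}

fn start_server(listener: TcpListener, schema: String) -> HttpServerHandle {
    let (shutdown_sender, mut shutdown_receiver) = oneshot::channel::<()>();
    let server_task = tokio::spawn(async move {
        loop {
            tokio::select! {
                // stop accepting new connections when asked to shut down
                _ = &mut shutdown_receiver => break,
                accepted = listener.accept() => {
                    if let Ok((stream, _peer)) = accepted {
                        let schema = schema.clone();
                        tokio::spawn(async move {
                            // serve the HTTP session with the captured schema
                            let _ = (stream, schema);
                        });
                    }
                }
            }
        }
        // hand the listener back instead of dropping it, so connections
        // queued in the kernel's accept backlog are not lost
        listener
    });
    HttpServerHandle { shutdown_sender, server_task }
}

async fn restart_on_schema_update(
    handle: HttpServerHandle,
    new_schema: String,
) -> HttpServerHandle {
    let _ = handle.shutdown_sender.send(());
    // take over the TcpListener from the old server...
    let listener = handle.server_task.await.expect("server task panicked");
    // ...and start a new, cheap server with the updated schema
    start_server(listener, new_schema)
}
```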
Since we do not have access to the private hyper structs and traits used to implement it (Graceful, Watch, Exec, ConnStreamExec...), it is challenging to write a struct wrapping a Connection, especially if we want to satisfy the bounds of https://docs.rs/hyper/0.14.13/hyper/server/conn/struct.Connection.html#method.graceful_shutdown
What we can do, though, is select over the shutdown watcher and the connection, as sketched below:
- if the connection finishes first, exit there
- if the shutdown watcher fires first, call graceful_shutdown() on the connection, then await the connection
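A minimal sketch of that select, assuming hyper 0.14 with tokio; the trivial service and the oneshot shutdown channel are placeholders for this PR's real service and shutdown watcher:

```rust
use hyper::server::conn::Http;
use hyper::service::service_fn;
use hyper::{Body, Request, Response};
use tokio::net::TcpStream;
use tokio::sync::oneshot;

async fn serve_connection(
    stream: TcpStream,
    mut shutdown_receiver: oneshot::Receiver<()>,
) -> Result<(), hyper::Error> {
    // placeholder service: the router's real service goes here
    let service = service_fn(|_req: Request<Body>| async {
        Ok::<_, hyper::Error>(Response::new(Body::from("ok")))
    });
    let connection = Http::new().serve_connection(stream, service);
    // graceful_shutdown() takes Pin<&mut Self>, so pin the connection first
    tokio::pin!(connection);
    tokio::select! {
        // the connection finished first: exit there
        result = &mut connection => result,
        // the shutdown watcher fired first: stop the connection gracefully,
        // then await it so in-flight requests can complete
        _ = &mut shutdown_receiver => {
            connection.as_mut().graceful_shutdown();
            connection.await
        }
    }
}
```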
this will help in isolating the TcpListener dance
this more closely matches the initial API, with only a oneshot::Sender<()> in HttpServerHandle. Everything else is handled internally in the implementation of WarpHttpServerFactory
move the socket unwrap to the spawned session task: that way, if the accept() call fails, it only affects the current session and not future ones
Co-authored-by: Cecile Tonglet <cecile.tonglet@cecton.com>
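An illustrative sketch of that change (not the PR's exact code): the io::Result from accept() travels into the spawned session task, so a failed accept only ends that one session:

```rust
use tokio::net::TcpListener;

async fn accept_loop(listener: TcpListener) {
    loop {
        let accepted = listener.accept().await;
        tokio::spawn(async move {
            // the unwrap/match now lives inside the session task: an error
            // here kills this one task, never the accept loop itself
            let (stream, _peer) = match accepted {
                Ok(pair) => pair,
                Err(e) => {
                    eprintln!("failed to accept connection: {e}");
                    return;
                }
            };
            // ... serve the HTTP session on `stream` ...
            let _ = stream;
        });
    }
}
```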
Example case: the new configuration sets up override addresses for backend services, so the HttpServiceRegistry used by the graph fetcher must be recreated
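A hedged sketch of that example case; the Configuration type and its field are invented for illustration, only the HttpServiceRegistry name comes from the comment above:

```rust
use std::collections::HashMap;

// invented stand-in for the router's configuration type
struct Configuration {
    // per-subgraph override addresses, e.g. "accounts" -> "http://127.0.0.1:4001"
    subgraph_urls: HashMap<String, String>,
}

struct HttpServiceRegistry {
    urls: HashMap<String, String>,
}

impl HttpServiceRegistry {
    // rebuilt from scratch on every configuration update, so new override
    // addresses take effect instead of patching clients in place
    fn new(configuration: &Configuration) -> Self {
        HttpServiceRegistry {
            urls: configuration.subgraph_urls.clone(),
        }
    }
}
```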
maybe those are causing a perf regression
it is much faster than watch
force-pushed from 02a51d3 to 612bfa9
it is still missing the graceful shutdown feature and some options on the HTTP session
Checklist for merge