Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a reactor_resolver method that doesn't use reactor_pool/pthread #7

Closed
talawahtech opened this issue May 30, 2020 · 8 comments
Closed

Comments

@talawahtech
Copy link
Contributor

talawahtech commented May 30, 2020

In my performance tests I have observed that after a pthread gets created, it uses a small, but noticeable amount of overhead, even when it is not doing any work.

Profiling shows __pthread_enable_asynccancel and __pthread_disable_asynccancel as the functions being called. On the techempower JSON test, performance increased by 3-4% when this overhead was removed by using a custom version of reactor_net_bind that doesn't use reactor_pool.

I assume the thread pool is used by reactor_resolver_request because getaddrinfo is a blocking call. My suggestion is to add something like reactor_resolver_request_ip that takes a struct in_addr * builds an addrinfo struct without calling getaddrinfo.

This could then be used by reactor_net_bind which in most cases is being bound to an IP address. We can call inet_pton to see if the address is a valid IP and get the in_addr struct. If it isn't, we simply fall back to using the existing reactor_resolver_request function.

Let me know if I should go ahead and create a PR to demonstrate this approach.

@fredrikwidlund
Copy link
Owner

Interesting as well. How you do measure performance increase, exactly? Yes, resolving without blocking is possible for example with AI_NUMERICHOST, we can do this transparently (and this has been done in other revisions). However that the mere presence of pthread has such a clear impact is interesting. I would like to try reproduce this.

Also you should be aware that libreactor and libdynamic are being refactored. The code as it is is likely to undergo fairly large changes, but regardless these potential overheads are definitely things I want to avoid.

@fredrikwidlund
Copy link
Owner

As a side note, I do appreciate the contribution in the TechEmpower benchmark. However I do actually consider the use of preamble and such in the handler itself not to be a realistic approach. Of course this is up to the benchmark, and libreactor is just a framework, it is up to you how to use it. If the gain was clear enough it could maybe be done transparently as a "fast path" in the library, but the gain would need to be substantial.

@talawahtech
Copy link
Contributor Author

Yes, resolving without blocking is possible for example with AI_NUMERICHOST, we can do this transparently (and this has been done in other revisions).

Great, that's an even better idea!

Interesting as well. How you do measure performance increase, exactly?

I am running the code from the techempower json test on AWS in a cluster group using an m5.large for the server and a c5.2xlarge for the client.

I measure the performance using the same wrk command used by the techempower. Verifiying improvements can be tricky because you can definitely see performance variances on EC2 instances without changing anything. So in addition to doing multiple comparative runs (I build a docker image for each approach), I also generate flamegraphs and look for changes there as confirmation.

sudo perf record -F 99 --call-graph dwarf --pid $(pgrep -d ',' libreactor)

sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --cp > FlameGraphOutput/result.svg

Below is an example of a graph that shows the __pthread functions in the recv section of the graph. They only show up on the graph if pthread_create gets called. The reactor_pool_construct can still be called by default without issue, the impact is only seen if reactor_pool_dispatch gets called. Note that I had to build using -static to get those __pthread functions to show in the graph

image

However that the mere presence of pthread has such a clear impact is interesting. I would like to try reproduce this.

Well it wouldn't have such a clear impact for most use cases, this test is definitely pushing the boundaries :). I just figure if we can get a little extra performance for free without impacting other use cases negatively then it is a win.

Also you should be aware that libreactor and libdynamic are being refactored. The code as it is is likely to undergo fairly large changes, but regardless these potential overheads are definitely things I want to avoid.

Great, sounds like the perfect time for feedback. I would be more than happy to test out some of the new code before the final release.

@talawahtech
Copy link
Contributor Author

As a side note, I do appreciate the contribution in the TechEmpower benchmark. However I do actually consider the use of preamble and such in the handler itself not to be a realistic approach. Of course this is up to the benchmark, and libreactor is just a framework, it is up to you how to use it. If the gain was clear enough it could maybe be done transparently as a "fast path" in the library, but the gain would need to be substantial.

Yea I deliberately marked the default approach as a "platform" implementation and the reactor-server one as a "framework" implementation. I am not super happy with "preamble" either but I was in a bit of a hurry to squeeze out extra performance before the round 19 deadline and I saw that aspnetcore was using that approach. They even go a step further and precompute the response size, which I felt was probably too far.

I have it on my to-do list to revisit and measure the performance benefits more closely. From what I recall most of the benefit came from using a timer to only update the date once per second, but my implementation is definitely way too ugly. I will definitely let you know the results.

@talawahtech
Copy link
Contributor Author

Yes, resolving without blocking is possible for example with AI_NUMERICHOST, we can do this transparently (and this has been done in other revisions).

Oh, I should also mention I came across two resolver related issues while playing with the code. Let me know if you want me to create separate tickets for them.

  1. I don't think the != 0 checks for the ai pointer in reactor_net.c and test/reactor_resolver.c work (or at the very least, they don't always work). The ubuntu container that I used for the tests doesn't have an /etc/services file so when I tried to specify the service as "http" instead of "80" the getaddrinfo lookup failed, but the ai pointer still had a non-zero value. Instead I think you need to check the return value:
    e = getaddrinfo(node, service, &hints, &ai);
    if (e){
      err(1, "unable to resolve %s:%s %s\n", node, service, gai_strerror(e));
    }
  1. There are a couple reactor_assert_int_not_equal calls in reactor_net.c e.g. after calling socket and listen. The program immediately aborts if the check fails without printing any kind of error message to indicate what went wrong where. This made it a little more difficult to troubleshoot the first issue.

@fredrikwidlund
Copy link
Owner

Yea I deliberately marked the default approach as a "platform" implementation and the reactor-server one as a "framework" implementation. I am not super happy with "preamble" either but I was in a bit of a hurry to squeeze out extra performance before the round 19 deadline and I saw that aspnetcore was using that approach. They even go a step further and precompute the response size, which I felt was probably too far.

Sigh, yes, I don't agree very well with this benchmark. Looking at the json serialization in aspnetcore I'm not sure if it's actually done dynamically for each request, and so forth. A decent way would perhaps be to have different candidates including a "low level" one that makes use of these kinds of optimizations.

@fredrikwidlund
Copy link
Owner

  1. I don't think the != 0 checks for the ai pointer in reactor_net.c and test/reactor_resolver.c work (or at the very least, they don't always work). The ubuntu container that I used for the tests doesn't have an /etc/services file so when I tried to specify the service as "http" instead of "80" the getaddrinfo lookup failed, but the ai pointer still had a non-zero value. Instead I think you need to check the return value:
    e = getaddrinfo(node, service, &hints, &ai);
    if (e){
      err(1, "unable to resolve %s:%s %s\n", node, service, gai_strerror(e));
    }
  1. There are a couple reactor_assert_int_not_equal calls in reactor_net.c e.g. after calling socket and listen. The program immediately aborts if the check fails without printing any kind of error message to indicate what went wrong where. This made it a little more difficult to troubleshoot the first issue.

I'll try to find some time to refactor this. Unfortunately time is something I have little of currently.

@fredrikwidlund
Copy link
Owner

Removing pthread is very specific optimization but I will consider removing this dependency in v2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants