
Add Keep-Alive HTTP header to all CouchDB requests #11233

Closed
wants to merge 1 commit

Conversation

vkuznet
Contributor

@vkuznet vkuznet commented Jul 28, 2022

Fixes #11231

Status

in development

Description

Since ReqMgr2 may request large chunks of data from CouchDB, we should ask the upstream server to keep the HTTP connection alive; otherwise it may be closed by (I think) the CherryPy server, see details in #11231. Therefore, I propose to set a Keep-Alive HTTP header, with some reasonably large interval, for all CouchDB requests. I do not know how this may affect overall performance, but it should be fine as long as all connections are eventually closed. If we leak connections, this could be a problem, since open connections may accumulate under high load. This should be verified during integration tests and by watching the connection plots in the ReqMgr2 monitoring dashboard.
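As a minimal sketch of the idea, the headers below show what such a request would carry; the helper name `couch_headers` and the parameter values are my own illustration, not WMCore's actual API:

```python
# Hedged sketch: attach a Keep-Alive header to every CouchDB request.
# `couch_headers` is a hypothetical helper, not part of WMCore.

def couch_headers(timeout=300, max_requests=1000):
    """Build HTTP headers asking the server to keep the TCP connection
    open for `timeout` seconds or up to `max_requests` requests."""
    return {
        "Connection": "Keep-Alive",
        # Common "timeout=..., max=..." parameter syntax for Keep-Alive
        "Keep-Alive": "timeout=%d, max=%d" % (timeout, max_requests),
        "Accept": "application/json",
    }

headers = couch_headers(timeout=300)
```

Whether the intermediate nginx and CherryPy layers honor these parameters is exactly what the discussion below tries to establish.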

Is it backward compatible (if not, which system it affects?)

MAYBE

Related PRs

External dependencies / deployment changes

@vkuznet
Contributor Author

vkuznet commented Jul 28, 2022

Alan, I do not know which labels should be assigned to this PR. To move forward we should test this change in testbed to see if it helps solve the problem and, if so, how it affects the ReqMgr2 services. Since I do not know how connections are handled by the WMCore+CherryPy chain, I cannot say whether it will have an impact or not. But if connections are always properly closed (which is an open question for me), everything should be fine. My suggestion is that you deploy this to any VM or k8s cluster where ReqMgr2 can be deployed, point it to the production CouchDB, and try HTTP requests asking for large data. I am not aware of any local setup where such a change can be tested, since the test requires: (1) an FE setup, (2) a CouchDB setup, (3) putting large data into CouchDB, (4) performing API calls to that FE/CouchDB.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 18 warnings and errors that must be fixed
    • 4 warnings
    • 171 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 35 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13465/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet self-assigned this Aug 17, 2022
@vkuznet
Contributor Author

vkuznet commented Sep 20, 2022

@amaltaro , almost two months have gone by; do you have any input on this PR? Did you ever try it?

@amaltaro
Contributor

@vkuznet thanks for the reminder. I've been planning to give it a go, but I am not sure we can have a meaningful test given the different architectures we have for dev/test/production.

It isn't clear in this documentation:
https://cms-http-group.docs.cern.ch/k8s_cluster/architecture/

but the last time I saw a schematic of the production cluster, I remember seeing 2 layers of nginx service, one for the frontends and one for the backends.

What I am considering is to point a ReqMgr2 read-only instance to the production CouchDB; this would have to be done either in:

  • a private VM setup
  • or in one of the testX clusters

I am still not convinced it would be a valid check. Would you have any suggestions?

@vkuznet
Contributor Author

vkuznet commented Sep 21, 2022

@amaltaro , the current setup indeed uses one FE cluster (which runs apache daemonsets, i.e. it does not have nginx) and redirects requests to the BE cluster, which has nginx.

That said, I think the meaningful test we can do is to use the httpgo service, which is deployed on the production BE cluster. Requests to it come from the FE cluster, and the service simply dumps the HTTP headers; we always use this service to test such things. Since a request to it goes through FE (apache) -> BE (nginx) -> BE (service), we can easily generate load against this service to see whether the issue is indeed in nginx or not. Since we have separate FE clusters for cmsweb-prod and cmsweb, we can use the cmsweb FE to perform the test, i.e. the test will not affect requests coming to cmsweb-prod.

If the test shows that the load is fine, we may conclude that it is not an issue with the FE, since httpgo is a Go-based HTTP server (totally different from ReqMgr2/CherryPy), and we can then rule nginx in or out. In other words, if the test is ok, nginx plays no role in such a load and the actual issue is indeed with the BE server. If you agree, I think @muhammadimranfarooqi can easily organize this test (we need 30K distributed requests sent to http://cmsweb.cern.ch/httpgo).

If you want to complement this test, you may write a simple Python CherryPy HTTP server which only returns "Hello world" and package it into a docker image. We can then deploy it to the production BE and adjust the FE rules to redirect to it. This would be a setup similar to the httpgo service, but using CherryPy. That way you can measure the performance of the two with the same test, and we can use the Keep-Alive HTTP header in these tests.

On the contrary, what you suggest has nothing to do with the current setup, since a private VM does not have nginx, and a testX cluster would have to be accessed from the production FE (i.e. we would need a 2-cluster setup).

@amaltaro
Contributor

@vkuznet Valentin, I fail to see why we should involve httpgo in this test setup. Isn't the httpgo service supposed to be a replacement for the Apache frontend, such that it runs together with the backend service and performs the authentication step?

Can you please also clarify where the 30k requests come from?

The workflow that we need to test is:
client makes a request to ReqMgr2
--> client request gets authenticated in the cmsweb FEs
--> post-authentication, client request is passed to Nginx
--> client request reaches one of the ReqMgr2 backends
--> ReqMgr2 server adds the Keep-Alive header and makes a reqmgr2 request to CouchDB
--> reqmgr2 request is authenticated in the cmsweb FEs
--> post-authentication, reqmgr2 request is passed to the CouchDB backend
--> CouchDB data is served back to ReqMgr2 server through cmsweb FEs and Nginx
--> once ReqMgr2 has the data - potentially from multiple CouchDB requests - it starts serving the end client with this data through Nginx and cmsweb frontends

If we want to test only whether ReqMgr2 and CouchDB behave properly and do not leave TCP connections alive/stale, then we can either write a script to test it or point any ReqMgr2 instance (dev/test/private) to CouchDB (it could even be the testbed CouchDB). Still, that would only be the first test.

@vkuznet
Contributor Author

vkuznet commented Sep 21, 2022

To make your life easier I did the following:

You may adjust server.py as you wish, but it is basically a simple "Hello" CherryPy HTTP server, and it is now available on testbed. Therefore, this is the bare minimum CherryPy HTTP server we need to run all the tests.

@vkuznet
Contributor Author

vkuznet commented Sep 21, 2022

Alan, you don't need ReqMgr2 per se to test the FE/BE interaction; you need any HTTP server. To test the effect of nginx and the FE you don't need ReqMgr2 and CouchDB. That is why I suggested using httpgo, and I have now deployed httppy, a basic HTTP server. We can generate a distributed load against this server, which will exercise the full HTTP request flow you outlined. So this is the flow of the test:

  • the client sends a request to the FE
  • the FE authenticates the request and forwards it to the BE cluster
  • on the BE cluster the request goes through nginx to the specific server, e.g. httppy or httpgo

That's all you need for the test. A basic HTTP server is sufficient, and it will either demonstrate the effect of nginx or rule it out.
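A distributed load of this kind can be sketched in a few lines. The sketch below tallies status codes from concurrent GET requests; for self-containment it targets a throwaway local server standing in for the real endpoint (in the actual test the requests would go to https://cmsweb.cern.ch/httpgo through the FE, with x509 authentication and far more than 20 requests):

```python
# Sketch of the load test: fan concurrent GET requests at a URL and
# tally the status codes. The local server below is a stand-in target.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import threading
import urllib.request

def fetch(url):
    """Return the HTTP status of one GET, or 'error' on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except Exception:
        return "error"

def load_test(url, nrequests=100, workers=20):
    """Send `nrequests` GETs using `workers` concurrent threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(fetch, [url] * nrequests))

class _Ok(BaseHTTPRequestHandler):
    """Minimal 'hello'-style handler playing the role of httppy/httpgo."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), _Ok)  # port 0: free port
threading.Thread(target=server.serve_forever, daemon=True).start()
counts = load_test("http://127.0.0.1:%d/" % server.server_port,
                   nrequests=20, workers=5)
server.shutdown()
```

If nginx throttled or dropped connections under load, it would show up here as non-200 or "error" entries in the tally.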

@amaltaro
Contributor

@vkuznet I think we are not really converging on what exactly needs to be tested and how. Let's discuss it over Zoom at some point during the day, please.

@vkuznet
Contributor Author

vkuznet commented Sep 21, 2022

Alan, I'll be happy to discuss, but I think you are missing the entire point. In the real setup we have:

client -> FE -> BE (nginx) -> service

Now, in the case of the client->ReqMgr2->CouchDB interaction we have a double loop of the flow shown above: the client needs to get data from ReqMgr2, and ReqMgr2 goes through the same loop to get data from CouchDB. So we have two loops (requests) going through our FE. If we want to know whether Keep-Alive has an effect on nginx performance, we do not need the full setup per se; we only need one request flow. It can therefore be tested with a simple HTTP server (any implementation). And this is what I propose: test the single loop using either httpgo or httppy. If nginx has any effect on the request flow, we'll see it with this simple setup.

@amaltaro
Contributor

Valentin, as we discussed over slack, we can start testing this scenario that you mentioned above.

That means deploying an already existing application in the CMSWEB production k8s cluster that accepts client requests specifying how big the served payload has to be (ideally a json object). Sizes like 100, 200, 300, 400, 500MB are likely all that we need.

If we don't manage to pinpoint the issue this way, then we need to set up a scenario closer to reality, meaning going through Nginx twice for any given client request (CouchDB to ReqMgr2, ReqMgr2 to client).

@vkuznet
Contributor Author

vkuznet commented Sep 22, 2022

@amaltaro , I deployed a new version of httpgo to the production cluster which now provides a new API, payload, that takes several arguments: size=XXXKB, latency=N (in seconds), and format=json or ndjson. Therefore, we are ready for the test. Here is how to call the API to create a 10KB payload of json data (here scurl is my alias for curl with x509 certs):

scurl -s "https://cmsweb.cern.ch/httpgo/payload?size=10KB&format=json"

and if you want ndjson:

scurl -s "https://cmsweb.cern.ch/httpgo/payload?size=10KB&format=ndjson"

and if you want to introduce a latency of 5 seconds you can do it as follows:

scurl -s "https://cmsweb.cern.ch/httpgo/payload?size=10KB&format=json&latency=5"

The supported data size units are KB, MB and GB, e.g. 10KB, 10MB or 10GB. The json records are dicts with two keys, {"id":N, "data":xxx}, where N is the current index and data is always a 1KB string.
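As a hedged illustration of that record layout (the function and variable names are my own, not taken from the httpgo source), a generator for such a payload could look like:

```python
# Sketch of a payload generator mirroring the record layout described
# above: a json list of records {"id": N, "data": <1KB string>} whose
# total data volume approximates the requested size.
import json

UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

def payload(size="10KB"):
    """Build records carrying roughly `size` bytes of data in 1KB chunks.
    Note: the json framing adds some overhead per record."""
    unit = size[-2:].upper()
    nbytes = int(size[:-2]) * UNITS[unit]
    nrecords = max(1, nbytes // 1024)  # each record carries 1KB of data
    return [{"id": i, "data": "x" * 1024} for i in range(nrecords)]

records = payload("10KB")
blob = json.dumps(records)
```

For the ndjson variant, each record would be emitted on its own line (`"\n".join(json.dumps(r) for r in records)`) instead of one json array.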

I'll be busy today with lots of meetings and may not get to the tests, but if you want to give it a try feel free to do so. Once again, you should call it from cmsweb.cern.ch, e.g. https://cmsweb.cern.ch/httpgo/payload?size=10KB&format=json

@vkuznet
Contributor Author

vkuznet commented Oct 28, 2022

@amaltaro , do we still need this PR? If you consider that we no longer need the Keep-Alive header since gzip will be enforced, maybe we should close this PR.

@amaltaro
Contributor

amaltaro commented Nov 3, 2022

@vkuznet I was never too confident about this change, and given that the payload is substantially smaller now, I would suggest leaving things as they are. If we see a need for it in the future, we can always reference back to this. Thank you for proposing it, though.

@amaltaro amaltaro closed this Nov 3, 2022

Successfully merging this pull request may close these issues.

ReqMgr2 unable to serve too many (30k?) requests to the client - Nginx limitation(?)
3 participants