Add Keep-Alive HTTP header to all CouchDB requests #11233
Conversation
Alan, I do not know which labels should be assigned to this PR, and to move forward we should test this change in testbed to see if it helps solve the problem and, if so, how it may affect the ReqMgr2 services. Since I do not know how connections are handled by the WMCore+CherryPy chain, I cannot comment on whether it will have an impact or not. But if connections are always properly closed (which is an open question for me), everything should be fine. My suggestion is that you deploy this to any VM or k8s cluster where ReqMgr2 can be deployed, point it to the production CouchDB, and try it out with HTTP requests asking for large data. I am not aware of any local setup where such a change can be tested, since such a test requires: (1) a FE setup, (2) a CouchDB setup, (3) putting large data into CouchDB, (4) performing API calls to that FE/CouchDB.
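For steps (3) and (4), a minimal sketch could look like the following. This assumes a local test CouchDB at `localhost:5984`; the URL, database name, and helper names are placeholders, not the production endpoints.

```python
# Hypothetical helper for steps (3) and (4): push a large document into a
# test CouchDB instance and time the API call that reads it back.
# COUCH_URL and DB_NAME are placeholders, not the production endpoints.
import json
import time
import urllib.request

COUCH_URL = "http://localhost:5984"   # assumed local test CouchDB
DB_NAME = "keepalive_test"            # hypothetical database name


def make_payload(size_mb):
    """Build a JSON-serializable document of roughly size_mb megabytes."""
    return {"data": "x" * (size_mb * 1024 * 1024)}


def put_large_doc(doc_id, size_mb):
    """Step (3): upload a large document into CouchDB."""
    req = urllib.request.Request(
        "%s/%s/%s" % (COUCH_URL, DB_NAME, doc_id),
        data=json.dumps(make_payload(size_mb)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def time_fetch(doc_id):
    """Step (4): fetch the document back and measure the elapsed time."""
    start = time.time()
    with urllib.request.urlopen("%s/%s/%s" % (COUCH_URL, DB_NAME, doc_id)) as resp:
        nbytes = len(resp.read())
    return nbytes, time.time() - start
```

Running `put_large_doc("big1", 100)` followed by `time_fetch("big1")` would then exercise a 100MB round trip against whatever FE/CouchDB the URL points to.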
Jenkins results:
@amaltaro , almost two months have passed, do you have any input on this PR? Did you ever try it?
@vkuznet thanks for the reminder. I've been planning to give it a go, but I am not sure we can have a meaningful test with the different architectures we have for dev/test/production. It isn't clear in this documentation: but the last time I saw a schematic for the production cluster, I remember seeing 2 layers of nginx service, one for the frontends and one for the backends. What I am considering is to point a read-only ReqMgr2 instance to the production CouchDB; this would have to be done either in:
I am still not convinced it would be a valid check. Would you have any suggestions?
@amaltaro , the current setup indeed uses one FE cluster (which has apache daemonsets, i.e. it does not have nginx) and redirects requests to the BE cluster, which has nginx. That said, I think the meaningful test we can do is to use On the contrary, what you suggest has nothing to do with the current setup, since a private VM does not have nginx, and the testX cluster should be accessed from the production FE (i.e. we need a 2-cluster setup).
@vkuznet Valentin, I fail to see why we should involve Can you please also clarify where the 30k requests come from? The workflow that we need to test is: If we want to test only whether ReqMgr2 and CouchDB behave properly and do not leave TCP connections alive/stale, then we can either write a script to test it or point any ReqMgr2 instance (dev/test/private) to CouchDB (this could even be done against the testbed CouchDB). Still, it would only be the first test.
To make your life easier I did the following:
You may adjust |
Alan, you don't need ReqMgr2 per se to test the FE/BE interaction, you need any HTTP server. In order to test how nginx and the FE are affected, you don't need ReqMgr2 and couchdb. Therefore, I suggested to use
That's all you need for the test. The basic HTTP server is sufficient for this test and will either prove or disprove the effect of nginx.
@vkuznet I think we are not really converging on what exactly needs to be tested and how. Let's discuss it over Zoom at some point during the day please. |
Alan, I'll be happy to discuss it, but I think you are missing the entire point. In the real setup we have:
Now, in the case of
Valentin, as we discussed over slack, we can start testing the scenario that you mentioned above. That is, deploying an already existing application in the CMSWEB production k8s cluster that will accept client requests specifying how big the served payload has to be (ideally a json object). Sizes like 100, 200, 300, 400, 500MB are likely all that we need. If we don't manage to pinpoint this issue, then we need to set up a scenario closer to reality, meaning going through Nginx twice for any given client request (couchdb to reqmgr2, reqmgr2 to client).
@amaltaro , I deployed a new version of
and if you want ndjson:
and if you want to introduce a latency of 5 seconds you can do it as follows:
The supported data size metrics are KB, MB or GB, e.g. 10KB, 10MB or 10GB. I provided the json records as a dict with two keys: I'll be busy today with lots of meetings and may not perform the tests, but if you want to give it a try, feel free to do so. Once again, you should call it from cmsweb.cern.ch, e.g.
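As a small client-side sketch, the size metrics the service accepts (KB, MB, GB) could be converted to byte counts like this; the helper name is an assumption for illustration, not part of the deployed service:

```python
# Hypothetical client-side helper: turn the size metrics the test service
# accepts ("10KB", "10MB", "10GB") into byte counts, e.g. to verify that
# a response body actually has the requested size.
import re

_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a string like '10MB' into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", text.strip())
    if not match:
        raise ValueError("unsupported data size metric: %r" % text)
    return int(match.group(1)) * _UNITS[match.group(2)]
```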
@amaltaro , do we need this PR? If you consider that we no longer need the Keep-Alive header, since gzip will be enforced, maybe we should close this PR then.
@vkuznet I was never too confident about this change, and given that the payload is substantially smaller now, I would suggest leaving it as is. If we see a need for it in the future, we can always reference back to this. Thank you for proposing this, though.
Fixes #11231
Status
in development
Description
Since ReqMgr2 may request large chunks of data from CouchDB, we should tell the upstream server to keep the HTTP connection alive, otherwise it may be closed by (I think) the CherryPy server, see details in #11231. Therefore, I propose to set the
Keep-Alive
HTTP header for all CouchDB requests to some large, meaningful interval. I do not know how this may affect overall performance, but it should be fine as long as all connections are closed. If we have a connection leak, this may be a problem, since open connections may accumulate under high load. This should be verified during integration tests and by watching the connection plots in the ReqMgr2 monitoring dashboard.
Is it backward compatible (if not, which systems does it affect?)
MAYBE
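The proposed change can be sketched as follows. This is a minimal illustration using the standard-library `http.client`; the actual WMCore request layer differs, and the timeout/max values shown are assumptions, not the values from this PR:

```python
# Sketch: attach a Keep-Alive header to every CouchDB request so the
# upstream server keeps the TCP connection open between requests.
# The timeout/max values are illustrative assumptions.
import http.client

KEEP_ALIVE_HEADERS = {
    "Connection": "keep-alive",
    # ask the server to keep the connection open for up to 300 seconds
    # and to allow up to 100 requests on it before closing
    "Keep-Alive": "timeout=300, max=100",
}


def couch_request(host, uri, method="GET", port=5984):
    """Issue a CouchDB request with the Keep-Alive headers attached."""
    conn = http.client.HTTPConnection(host, port)
    conn.request(method, uri, headers=KEEP_ALIVE_HEADERS)
    return conn.getresponse()
```

Whether the header actually helps depends on the proxy chain honoring it, which is exactly what the testing discussed above is meant to establish.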
Related PRs
External dependencies / deployment changes