
ReqMgr2 unable to serve too many (30k?) requests to the client - Nginx limitation(?) #11231

Closed
amaltaro opened this issue Jul 24, 2022 · 41 comments

Comments

@amaltaro
Contributor

Impact of the bug
ReqMgr2 (but likely affecting any CMSWEB service)

Describe the bug
Dima reported this issue a couple of weeks ago: the inability to retrieve active requests from ReqMgr2 through this API:

https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE

This problem is tightly coupled to the overload of the system, with around 30k active requests at the moment.
Investigation with the curl command suggests a problem/limitation in the CMSWEB infrastructure [1], with requests failing with a 502 Bad Gateway.

The acquired status is the current worst offender, matching around 19k requests, which take more than a minute to be served to the client and transfer a total of 270MB [2].

How to reproduce it
Query the API above

Expected behavior
These requests and amounts of data will become more common, and we need to ensure that our applications and infrastructure are capable of dealing with them.
We should start discussing this with the HTTP experts and understand the limitations of Nginx; AFAIK there is a limit of 80MB or so. We should try to fix and/or increase these limits in the CMSWEB cluster such that they meet the CMS user use cases.
In the longer term, we need to start serving compressed data to the users, especially for these heavy data APIs.
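As an illustration of the client side of such heavy transfers, here is a minimal sketch (assuming the python requests library, which is not necessarily what current clients use) that streams the response to disk instead of holding it in memory; once the service serves gzipped content, requests would negotiate it transparently via Accept-Encoding:

#!/usr/bin/env python
# Hedged sketch: stream a large ReqMgr2 response to disk in chunks.
# Assumes the requests library and a valid grid certificate/key pair.
import os
import requests

URL = "https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE"
CERT = (os.environ["X509_USER_CERT"], os.environ["X509_USER_KEY"])

with requests.get(URL, cert=CERT, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    with open("active_requests.json", "wb") as fd:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1MB chunks
            fd.write(chunk)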

FYI: @drkovalskyi

Additional context and error message
[1]

$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE" > active_requests.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   150    0   150    0     0      0      0 --:--:--  0:02:48 --:--:--    33

real	2m48.514s
user	0m0.086s
sys	0m0.111s
$ cat active_requests.json 
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

[2]

$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired" > acquired_requests.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  270M    0  270M    0     0  3693k      0 --:--:--  0:01:15 --:--:-- 37.5M

real	1m15.023s
user	0m1.230s
sys	0m1.614s
@amaltaro
Contributor Author

@muhammadimranfarooqi Imran, could you please clarify what those limits are at the Nginx layer and try to check any logs that could confirm this suspicion?

@vkuznet
Contributor

vkuznet commented Jul 24, 2022

Alan, I doubt that your conclusion is correct. The 502 Bad Gateway error in fact says that the backend, rather than the FE, is down, and therefore the FE can't reach it. I'm confident that nginx is capable of a much higher load; since nginx stays on par with Apache, it can sustain around 500 req/sec in our setup. I also do not understand what the MB in your example represents: is it a single response size, or the total size across all requests? Finally, your curl example does not say anything about the BE performance. As such, I suggest the following:

  • measure the throughput of the backend service, i.e. reqmgr2, without any FE. This will give you a clear idea of how much throughput the Python server can deliver, but I bet that what you find will really disappoint you. I performed these studies many years ago; see my talk and numbers here
  • measure the throughput of apache/APS/nginx using the simplest backend, e.g. httpgo. This will give you a clear idea of the FE performance. We already did that for apache vs APS, and you can find the results over here.

Bottom line, I doubt anything is wrong with our FE throughput; rather, the actual issue here is the poor performance of the Python back-end servers. I see only two ways to improve it:

  • change the Python web services to Go
  • or put a WSGI server gateway in front of the Python servers (a minimal sketch follows this list), see here. For example, I performed DBS improvement studies using a uWSGI server and showed that the performance can be improved by about 30%, but not more.
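As a rough illustration of the second option (a sketch only, not the setup used in the DBS studies), a WSGI callable can be served by uWSGI instead of the built-in Python HTTP server:

# Minimal WSGI application sketch (hypothetical app.py, not ReqMgr2/DBS code).
def application(environ, start_response):
    body = b'{"status": "ok"}'
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

# Hypothetical invocation, assuming uWSGI is installed:
#   uwsgi --http :8080 --wsgi-file app.py --processes 4 --threads 2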

@amaltaro
Contributor Author

Valentin, I don't think the problem here is throughput, but rather the ability to serve large data to the end users (it does not matter if it's 1 or 100 requests!).
I remember discussing similar issues with Imran in the past, and he mentioned certain limitations of the Nginx service. I have also had similar issues with eoscmsweb in the past, and the problem was with Nginx as well (502 Bad Gateway).

Anyway, I haven't reached any conclusions yet and I am actually trying to identify where the problem is. Let's see what Imran can contribute here.

@vkuznet
Contributor

vkuznet commented Jul 24, 2022

My suggestion remains: please clearly define what "large" data means, and measure the performance of the BE first. And, as a first action, please start using gzipped content, as whatever you refer to as large can be compressed many fold, and therefore may not require any action on the infrastructure and/or FEs.

@muhammadimranfarooqi

@amaltaro
Contributor Author

@muhammadimranfarooqi thanks for pointing us to that dashboard. Unfortunately, I fail to see how to map any of that information onto the issue to be investigated here.

@vkuznet since you have already joined this discussion, can you please follow this up then? Thanks!

@amaltaro amaltaro moved this from To do to In progress in Planned for Q3 - 2022 Jul 24, 2022
@vkuznet
Contributor

vkuznet commented Jul 25, 2022

Alan, I have little knowledge of the system, and in order even to understand it I need more input from you. For instance:

  • what is the "large" data you are referring to? What is the range, is it O(100MB)? What should we project and support?
  • where does the data come from?
    • I suspect it comes from CouchDB, but does this API fetch it and process it, i.e. do internal work or not? In other words, is there any latency between the source of the data and the reqmgr API?
  • does this reqmgr2 API support gzipped format?
  • who are the clients and how do they retrieve the data? Can clients consume gzipped format?
  • what is the pattern of requests to this API? What is its rate?

That said, I also mentioned that I have little time in Q3 due to my schedule, and I may not have time to look into this issue until late August.

@amaltaro
Contributor Author

Those are all valid questions, but I think you do not need to know any of these in order to evaluate whether the CMSWEB frontends/Nginx are the limiting factor here (or whether the problem is with the backend, as you previously suggested).

I agree that we need to look into this issue taking longer-term design and implementation into consideration, but the main goal of this one issue is to identify why we get the 502 Bad Gateway and see if this can be fixed in O(days) instead of weeks/months.

@muhammadimranfarooqi

Hi @vkuznet @amaltaro

In the nginx settings, we have put these restrictions:

config:
    client-body-timeout: "300"
    client-header-timeout: "300"
    proxy-body-size: 200m
    proxy-connect-timeout: "300"
    proxy-read-timeout: "300"
    proxy-send-timeout: "300"
    upstream-keepalive-timeout: "0"

That corresponds to the timeout in one of the t0_reqmon plots in the dashboard:
https://monit-grafana.cern.ch/d/k8s-be-prod-nginx/cmsweb-k8s-be-production-nginx-ingress?orgId=11&from=now-24h&to=now&viewPanel=92

Should we increase this 200MB upstream limit?

@amaltaro
Contributor Author

proxy-body-size: 200m

Thank you for providing these details, @muhammadimranfarooqi. I'd suggest increasing it to 400MB then. Once it's deployed, we can test this specific call once again.

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

@muhammadimranfarooqi please do not change this setting on the production cluster yet; I'll provide details in a bit.

@muhammadimranfarooqi

Yes, @vkuznet, I was waiting for your comment. I have not changed anything yet.

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

@amaltaro , I'm sorry, but I completely disagree with your request. The FEs are designed for high throughput for ALL CMSWEB services, and we simply can't increase resources because a BE misbehaves. In this case, increasing the body size implies that the FE will be stuck with the request and will not serve other clients. As I wrote, since you did not provide any description of the API patterns, I am against any changes on the FE. Let me clarify:

  • sending O(100MB+) of data takes time, i.e. that specific FE will be stuck with this request
  • during that time it can't serve other clients, e.g. DBS
  • if this API is used often, or in parallel by different clients, the FE will allocate more and more resources to that client, and all HTTP calls (due to network transfer) will be stuck on that client's API call. It will limit FE throughput for other clients
  • etc.

The solution is simple, and I already pointed it out: the BE and clients should start using gzipped content. Your API, which ships 270MB, can easily fit within the FE requirements if it uses gzipped content, which will reduce this size by a factor of 10-100; no changes to the FE will be required, and nothing will be blocked on the FE or BE. Please do not address a problem by simply requiring more RAM or resources. We battled this before with DBS, and I'm against changing FE settings without a clear understanding of how resources will be used. Moreover, by doubling the size we simply postpone the same issue on the FE. Are you confident that ReqMgr2 will not ship a 400MB body? I doubt it, since it is only a matter of time and load on our production systems. As such, any threshold increase should be accompanied by a clear description of how this happens, what the patterns are, who the clients are, what the rate is, etc.

That said, I also understand that we are talking about a production service. As such, we may accept a change on the FE as a temporary solution to allow the current service to provide data to the client. But the increase we will make is only from 200MB to 300MB (temporary), which will allow the ReqMgr2 API to behave now and force us to change the ReqMgr2 behavior asap to use gzipped content with the highest priority. Does that sound like an acceptable compromise?

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

@muhammadimranfarooqi , we had a discussion during today's WMCore meeting, and the priority of this issue has been raised. Please proceed and increase the threshold on the nginx service to what Alan suggested, i.e. 400MB. This will be a temporary fix until gzipped content is put in place.

@amaltaro , please review the memory usage of ReqMgr2, as it has spikes according to this monitoring dashboard. The spikes go up to 3GB of RAM (I don't know if they are related to this specific API, but it may be the case). From previous experience we know that Python memory management is not ideal, and quite often the RAM usage of a server can increase significantly when serving large chunks of data. For example, the DBS spikes went up to 10GB and we were unable to run it reliably on the k8s infrastructure.

@muhammadimranfarooqi

Hi @vkuznet @amaltaro

I have increased the size from 200MB to 400MB.

@amaltaro
Contributor Author

@muhammadimranfarooqi @vkuznet thanks for updating it.

I tried the same query once again and it was successful:

$ time curl -v --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired" > acquired_requests.json
...
< HTTP/1.1 200 OK
< Date: Mon, 25 Jul 2022 16:11:11 GMT
< Server: Apache
< Content-Type: text/plain;charset=utf-8
< Vary: Accept,Accept-Encoding
< Cache-Control: max-age=3600
< X-Rest-Status: 100
< X-Rest-Time: 35687958.241 us
< CMS-Server-Time: D=35697288 t=1658765435763543
< Transfer-Encoding: chunked
< 
  0     0    0     0    0     0      0      0 --:--:--  0:00:35 --:--:--     0{ [data not shown]
100  271M    0  271M    0     0  3609k      0 --:--:--  0:01:16 --:--:-- 24.8M
* Connection #0 to host cmsweb.cern.ch left intact

real	1m17.004s
user	0m1.274s
sys	0m1.742s

which confirms my suspicion of a frontend/Nginx limitation.

@drkovalskyi if you want to re-enable the original implementation you have on your monitoring setup, could you please report here how it goes?

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

And, for the record, the existing JSON size is 272MB, while gzip compression provides a 17x reduction, bringing it down to 16MB. As such, the transfer time will be smaller too.
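As a quick cross-check, a stdlib-only sketch (file names are just examples from the commands above) to measure the gzip ratio of such a payload:

#!/usr/bin/env python
# Compress a previously downloaded payload and report the gzip ratio.
import gzip
import os
import shutil

src = "acquired_requests.json"  # example file from the curl commands above
dst = src + ".gz"
with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=6) as fout:
    shutil.copyfileobj(fin, fout)

raw, packed = os.path.getsize(src), os.path.getsize(dst)
print("raw=%.1fMB gzip=%.1fMB ratio=%.1fx" % (raw / 1e6, packed / 1e6, raw / packed))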

@drkovalskyi

Thanks guys for looking into it. Unfortunately, it doesn't work for me. I get:

    curl.perform()
error: (18, 'transfer closed with outstanding read data remaining')

I use the https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE query, which is more data than you would expect from acquired alone.

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

@amaltaro , once again, could you please clearly specify where the data comes from and what data size we should expect? It seems to me that doubling the payload size only accounts for the acquired workflows, but Dima's query specifies different conditions. Since we do not know the pattern of the query, nor have any idea about the amount of data we should expect, I hope you can provide these numbers so that we can understand the scope and requirements of this API's behavior. Moreover, we need to know (without an approximate guess) what range we should expect from this API, i.e. do we need 400MB, 1GB, or what is the upper limit here?

@amaltaro
Contributor Author

Sorry, I mixed up acquired and ACTIVE in my query.

Let me try to expand a bit on the numbers and what to expect:

  • Nginx was configured to 200MB and we were able to fetch workflows sitting in acquired status (18.5k ATM).
  • If we try to fetch all of the ACTIVE requests, that should be ~29k ATM. So the acquired workflows account for 2/3 of that load, which should be more than enough to accommodate within the new limit of 400MB.
  • each workflow description is different and it can take more or less volume (it really depends on the workflow construction). This [*] one seems to be well representative, and its data volume in the filesystem corresponds to ~17kB (see the back-of-envelope estimate after this list).
    • So the total amount of ACTIVE workflows should be between 450 and 500MB
    • While the total amount of acquired workflows gives me around 300MB (how did it work with the previous 200MB limit?)
  • according to the CMSWEB timber k8s dashboard, we have been having around 800k HTTP requests per day over the last week (but I've seen it go up to 3 or 4 million in the past)
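Back-of-envelope estimate of the above (approximate counts, ~17kB per workflow document):

# Rough size estimate based on the counts quoted in this comment.
per_doc_kb = 17
acquired, active = 18500, 29000
print("acquired ~ %.0f MB" % (acquired * per_doc_kb / 1024.0))  # ~307 MB
print("active   ~ %.0f MB" % (active * per_doc_kb / 1024.0))    # ~481 MB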

Regarding the query Dima runs, it's one of the heaviest ReqMgr2 queries because it's basically fetching all the active workflows in the system and their full documents from CouchDB. I could be wrong, but I think we do not have any other system executing such a call. Dima can clarify as well, but this call shouldn't be executed more than a handful of times per day.

I notice that we no longer get the 502 Bad Gateway; instead, the server responds with HTTP/1.1 200 OK, but then the connection gets closed before the transfer completes. HTH.

[*] https://cmsweb.cern.ch:8443/reqmgr2/data/request?name=cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586

@vkuznet
Contributor

vkuznet commented Jul 25, 2022

Alan, thanks for this info. Now we need to translate it into load. If we have 19K acquired workflows, which translates into roughly 300MB, and according to your previous post takes 1m17s, then 30K workflows will give almost 500-600MB. But I have no idea how much time the service will take to produce this data, nor how much time it will take to serialize and transfer it. If the service completes the request within the FE interval (5 min), then the client will get 200 OK, but the client still needs to fetch the full data, which accounts for network time (reflected in the time from the BE to the FE and nginx). Since I do not know how the data is produced and shipped to the client, I can't answer this question. Is it possible that the time to acquire this data plus the time to deliver it to the client will exceed 5 min? I do not know. All of these questions depend on how the service is implemented, on the latency of the database, on the time spent serializing the data, etc.

In order to answer this question, you should go to your service and perform this call without any FE involvement, e.g. log in to a pod and fetch data from your service with a local network curl call, e.g. curl http://reqmgr2.dmwm:port/api . This call goes through the k8s network (i.e. bypasses the FE) and you can measure how much time it takes. Once we know this number, we can understand where the issue is and where it should be addressed.
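For illustration, a rough Python equivalent of such a local timing call (the port is a placeholder, and real calls also need valid CMS auth headers, generated e.g. with the header script later in this thread):

#!/usr/bin/env python
# Hedged sketch: time a ReqMgr2 call on localhost, bypassing the frontends.
# REQMGR2_PORT is a placeholder environment variable.
import os
import time
import urllib.request

port = os.environ["REQMGR2_PORT"]  # set to the service port inside the pod
url = "http://localhost:%s/reqmgr2/data/request?status=acquired" % port
req = urllib.request.Request(url, headers={"Accept": "application/json"})

start = time.monotonic()
with urllib.request.urlopen(req, timeout=600) as resp:
    nbytes = len(resp.read())
print("fetched %d bytes in %.1f sec" % (nbytes, time.monotonic() - start))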

@drkovalskyi

BTW, I cannot get https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired with pycurl, while it works if I use curl directly. I can switch to curl, but it's somewhat ugly.

@vkuznet
Contributor

vkuznet commented Jul 26, 2022

@amaltaro I think we have a complex problem here (sorry for the long post, but it outlines all the technical details needed to understand the issue).

I finally was able to figure out how to get a proper set of CMS HTTP headers and set up a proper curl call to localhost on the ReqMgr2 host, i.e. bypassing the FE entirely. And I am unable to query the ReqMgr2 request API with the ACTIVE value. Below you can see the full series of steps which you (or anyone with access to the dmwm namespace) can repeat. For security reasons I strip off the port number and do not provide the location of the hmac file (but you should know this information).

For transparency, here are the steps I did:

  • log in to lxplus8, in order to have access to the k8s infrastructure
  • set up access to the cmsweb k8s clusters by setting the proper KUBECONFIG
  • look up the reqmgr2 pods in the dmwm namespace
  • log into one of the reqmgr2 pods
  • create an rcurl shell script using the CMS headers:
    • for that I borrowed the code to create CMS headers from ServerMonitor (see my code below) and ran it to get the CMS headers
    • then I put all of them into the rcurl script
  • run rcurl with the http://localhost:<port>/reqmgr2/data/info URL to check that I can access ReqMgr2 data; this was successful
  • finally, run the following command
time ./rcurl http://localhost:<port>/reqmgr2/data/request?status=ACTIVE > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13    0    13    0     0      0      0 --:--:--  0:01:06 --:--:--     3
curl: (18) transfer closed with outstanding read data remaining

real    1m6.421s
user    0m0.011s
sys     0m0.023s

But I can successfully run the following command:

time ./rcurl "http://localhost:8246/reqmgr2/data/request?status=acquired" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  271M    0  271M    0     0  7055k      0 --:--:--  0:00:39 --:--:-- 62.7M

real    0m39.390s
user    0m0.065s
sys     0m0.123s

It tells me the following:

  • there are two calls to the FE: one from a client (e.g. Dima's script) to the reqmgr2 API, and then another call from the reqmgr2 server to CouchDB
  • therefore, we have two network transfers: one from CouchDB to the ReqMgr2 API, and then from the ReqMgr2 API back to the client
  • the network time is not negligible for a 271MB payload; the ReqMgr2 server alone requires 40 sec for the acquired data. This means it will almost double when this data is shipped back to the client
  • since I still do not know the size of the payload for the ACTIVE data, and the call fails between the ReqMgr2 server and CouchDB, I can't say much about either the size of the data or the time required to fetch it. To do that, someone should go to CouchDB directly and execute the exact query which ReqMgr2 is doing. I do not know what the query is.
  • the amount of data we transfer is doubled (since there is no streaming in the ReqMgr2 API): first the ReqMgr2 service fetches the data from CouchDB (one data transfer), then it passes it to the client (a second data transfer). As such we have two transfer calls going through our FEs: one from the ReqMgr2 server to CouchDB and another from ReqMgr2 to the client.

From the above observations, I still conclude that the best solution is to enable gzip encoding. But, since it is not yet in ReqMgr2, in order to move forward we need the following:

  • you should look up the CouchDB query and execute it directly on Couch, without the FE, in order to figure out the payload size of the ACTIVE workflows.
  • once we know that, we may adjust the FE nginx payload size to allow such data to pass through it
  • after that, we may measure the time spent per single call to transfer such data and estimate whether it fits into the FE HTTP window (5 min).
  • we also need to check the monitoring dashboard for the ReqMgr2 RAM behavior when fetching this set of data (as I mentioned above, we already have spikes of 3GB and I fear that they will get higher once we get all the data from CouchDB)
  • only after that may we discuss whether we can change the FE settings and make them semi-permanent

For the record, here is the code to get CMS HTTP headers using hmac file:

#!/usr/bin/env python

import hmac, hashlib

def authz_headers(hmac_key):
  """Create fake authentication and authorisation headers compatible
  with the CMSWEB front-ends. Assumes you have the HMAC signing key
  the back-end will use to validate the headers.

  :arg bytes hmac_key: binary key data for signing headers.
  :returns: list of header name, value tuples to add to a HTTP request."""
  headers = { "cms-auth-status": "OK",
              "cms-authn-method": "ServerMonitor",
              "cms-authn-login": "server-monitor",
              "cms-authn-name": "Server Monitor" }
  prefix = suffix = ""
  for hk in sorted(headers.keys()):
    if hk != "cms-auth-status":
      prefix += "h%xv%x" % (len(hk), len(headers[hk]))
      suffix += "%s%s" % (hk, headers[hk])

  # hmac.new() requires bytes for both the key and the message (Python 3)
  msg = (prefix + "#" + suffix).encode("utf-8")
  cksum = hmac.new(hmac_key, msg, hashlib.sha1).hexdigest()
  headers["cms-authn-hmac"] = cksum
  return list(headers.items())

fname = ""  # NOTE: SHOULD BE SET using location of our hmac file
with open(fname, "rb") as fd:  # read the signing key as bytes
  authz = authz_headers(fd.read())
print(authz)
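As a follow-up sketch (not the actual rcurl content), the printed (header, value) tuples can be turned into the -H options of an rcurl-style wrapper:

# Sketch: build curl -H options from the (header, value) tuples printed above.
def curl_header_args(header_items):
  return " ".join('-H "%s: %s"' % (name, value) for name, value in header_items)

# Example template; URL/port still need to be filled in by hand:
# print('curl -s %s "$1"' % curl_header_args(authz))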

@vkuznet
Contributor

vkuznet commented Jul 26, 2022

@drkovalskyi I think an intermediate solution here is to get the data in chunks, i.e. you can make individual calls to the ReqMgr2 service using different states, e.g. status=new, status=acquired, etc. Since the data for individual states is smaller, I would expect that you can get all the info by doing multiple calls with different statuses (a minimal sketch follows below). Once we figure out the size of the data for the ACTIVE status, we can do something about it on the k8s infrastructure using the recipe I outlined in my previous comment.
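A minimal sketch of that chunked approach (assuming the requests library; the status list below is just an example subset taken from this thread):

#!/usr/bin/env python
# Hedged sketch: fetch ReqMgr2 requests one status at a time instead of
# asking for status=ACTIVE in a single, very large call.
import os
import requests

CERT = (os.environ["X509_USER_CERT"], os.environ["X509_USER_KEY"])
BASE = "https://cmsweb.cern.ch/reqmgr2/data/request"
STATUSES = ["new", "assigned", "staging", "staged", "acquired",
            "running-open", "running-closed", "completed"]

requests_by_status = {}
for status in STATUSES:
    resp = requests.get(BASE, params={"status": status}, cert=CERT, timeout=300)
    resp.raise_for_status()
    requests_by_status[status] = resp.json()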

@amaltaro
Contributor Author

Thank you for checking this in the reqmgr2 backends, Valentin.
Your comment about a double transfer is valid and indeed a concern.

I had a look at the WMCore code, and it looks like ReqMgr2 is actually triggering the following POST request to CouchDB (converted to curl):

curl -H 'Content-Type: application/json' --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=true" -d '{"keys": ["new","assignment-approved","assigned","staging","staged","acquired","running-open","running-closed","failed","force-complete","completed","closed-out","announced","aborted","aborted-completed","rejected"]}' > include_docs.json

where it queries a CouchDB view and filters requests against a large list of keys (in addition to asking for the whole document to be transferred). I just executed this call a few times and it takes on average 25 secs to finish; the JSON size - in the file system - is around 315MB.

I logged in to vocms0742 - which hosts this database - and executed the same query against localhost. The time to get it served is very similar to the one going through the frontends (25 secs as well).

I googled for the possibility of requesting compressed data from a CouchDB view, but it doesn't seem to be supported.

I also googled for that "transfer closed ..." error, and Stack Overflow suggests a possible tweak/configuration that could be evaluated for Nginx:
https://stackoverflow.com/questions/33522615/curl-transfer-closed-with-outstanding-read-data-remaining

Lastly, concerning the spikes in memory footprint for ReqMgr2, I think this query could easily give ReqMgr2 an extra GB of memory footprint, since it has to load all this CouchDB data in memory, and then an extra data structure is created after applying some filters to the output of CouchDB. See this method formatting the CouchDB output before it's transferred to the user:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/RequestDB/RequestDBReader.py#L62

@drkovalskyi

@drkovalskyi I think an intermediate solution here is to get the data in chunks, i.e. you can make individual calls to the ReqMgr2 service using different states, e.g. status=new, status=acquired, etc. Since the data for individual states is smaller, I would expect that you can get all the info by doing multiple calls with different statuses. Once we figure out the size of the data for the ACTIVE status, we can do something about it on the k8s infrastructure using the recipe I outlined in my previous comment.

I already do that. If I exclude acquired I can get everything else in one call. I cannot get acquired though. It fails with pycurl.

@drkovalskyi

Actually, I now can get the acquired request to finish successfully with pycurl. I haven't changed anything on my side. So operationally the pressure is reduced, since we have a working solution even if it's not completely reliable.

@vkuznet
Contributor

vkuznet commented Jul 27, 2022

@amaltaro , thanks for giving more feedback about the actual CouchDB query, the transfer time, and the additional links. I think the issue with the large payload is related to the connection being closed within the ReqMgr2 API itself. From your Stack Overflow link I see two solutions:

  • either disable proxy buffering within nginx
  • or, in fact, change your client (in this case the ReqMgr2 server) to use the keep-alive HTTP header, see the docs here

I would rather adjust the HTTP header of the client to re-use the HTTP connection than change the nginx settings. Nginx provides access to all services, and any change to it should be carefully evaluated so as not to break other sub-systems or open a door for various attacks from clients, which may happen unintentionally and without the client's knowledge.

I'm not sure where exactly in the WMCore codebase the HTTP headers are set for the queries to CouchDB, but it would be useful if you check this first and see whether the keep-alive HTTP header is set in the calls ReqMgr2 makes to CouchDB, e.g. Keep-Alive: timeout=60, max=1000 (here timeout should cover the time required to fetch the CouchDB data, and max specifies the maximum number of requests allowed on the connection). A minimal sketch follows below.
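Sketch of what that could look like (this is not the actual WMCore/ReqMgr2 code path; the key list is an example subset and the requests library is assumed):

# Sketch only: add keep-alive headers to the server-side CouchDB view call.
import json
import os
import requests

headers = {
    "Content-Type": "application/json",
    "Connection": "keep-alive",
    "Keep-Alive": "timeout=60, max=1000",  # timeout in seconds, max = requests per connection
}
payload = {"keys": ["acquired", "running-open", "running-closed"]}  # example subset
url = ("https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/"
       "_design/ReqMgr/_view/bystatus?include_docs=true")
resp = requests.post(url, headers=headers, data=json.dumps(payload),
                     cert=(os.environ["X509_USER_CERT"], os.environ["X509_USER_KEY"]),
                     timeout=300)
resp.raise_for_status()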

@vkuznet
Contributor

vkuznet commented Jul 27, 2022

@drkovalskyi I also suggest that you try to set the Keep-Alive: timeout=200, max=1000 option with your pycurl client, even though I'm not sure it will help in this case (a pycurl sketch follows below).

The problem here is the double connection to our FEs:

  • one comes from your client to reqmgr2, and this connection should not be closed
  • another goes from within ReqMgr2 to CouchDB and should be kept open by the server until all data is fetched

I do not remember whether our FE (and nginx in this case too) propagates the client's options through the entire HTTP request chain, and it most likely depends on whether ReqMgr2 (and WMCore) acknowledge the client's HTTP headers. If so, this option would be applied across the entire chain of requests, so it is worth trying out.
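For the pycurl client, a sketch of sending those headers explicitly (certificate paths are taken from the environment; this is an illustration, not Dima's actual script):

#!/usr/bin/env python
# Hedged sketch: pycurl call with an explicit Keep-Alive header, writing the
# body straight to a file instead of keeping it in memory.
import os
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired")
c.setopt(pycurl.SSLCERT, os.environ["X509_USER_CERT"])
c.setopt(pycurl.SSLKEY, os.environ["X509_USER_KEY"])
c.setopt(pycurl.HTTPHEADER, ["Connection: keep-alive",
                             "Keep-Alive: timeout=200, max=1000"])
with open("acquired_requests.json", "wb") as fd:
    c.setopt(pycurl.WRITEDATA, fd)
    c.perform()
c.close()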

@vkuznet
Contributor

vkuznet commented Aug 13, 2022

@amaltaro, @drkovalskyi where do we stand on this issue? Did you try to use the Keep-Alive HTTP header in a client?

@vkuznet
Contributor

vkuznet commented Aug 23, 2022

@amaltaro , @drkovalskyi did you try the Keep-Alive patch I provided? If so, does it affect anything?

@amaltaro
Contributor Author

@vkuznet not yet, Valentin. I have a few other things in the pipeline before I can get to it. I should be able to look into it by the end of this week (this test isn't that simple, btw).

@vkuznet
Contributor

vkuznet commented Sep 22, 2022

@amaltaro , I got the numbers. All tests passed with the following timing:

  • 100MB, 3.1 sec
  • 200MB, 7.5 sec
  • 300MB, 10.4 sec
  • 400MB, 13.3 sec
  • 500MB, 16.7 sec
For the record, here is the last query I ran:
lxplus# time scurl -s "https://cmsweb.cern.ch/httpgo/payload?size=500MB&format=json" > /tmp/data.json

real    0m16.655s
user    0m3.042s
sys     0m4.472s

lxplus#  ls -alh /tmp/data.json
-rw-r--r--. 1 valya zh 554M Sep 22 16:50 /tmp/data.json

Therefore, from this single test I would say that the FE+nginx+BE data flow works just fine. If you want, we can now perform a test with some number of concurrent clients; just specify how many clients we should generate. Please note that there is no Keep-Alive header here, as it is not set by default in curl.

@amaltaro
Contributor Author

Thanks Valentin. I think we can test it with up to 15 concurrent requests (steps of 5).
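A sketch of such a concurrency test (assuming the requests library and the httpgo payload endpoint used in the measurements above):

#!/usr/bin/env python
# Hedged sketch: fire N parallel clients at the httpgo payload endpoint and
# report status code, size and elapsed time per call.
import os
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://cmsweb.cern.ch/httpgo/payload?size=500MB&format=json"
CERT = (os.environ["X509_USER_CERT"], os.environ["X509_USER_KEY"])

def fetch(i):
    start = time.monotonic()
    resp = requests.get(URL, cert=CERT, timeout=600)
    return i, resp.status_code, len(resp.content), time.monotonic() - start

for nclients in (5, 10, 15):
    with ThreadPoolExecutor(max_workers=nclients) as pool:
        for i, code, size, dt in pool.map(fetch, range(nclients)):
            print("clients=%2d call=%2d status=%d size=%dB time=%.1fs"
                  % (nclients, i, code, size, dt))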

@vkuznet
Contributor

vkuznet commented Sep 22, 2022

I created this ticket https://its.cern.ch/jira/browse/CMSKUBERNETES-185 to perform the tests and asked @muhammadimranfarooqi to provide a breakdown.

@vkuznet
Contributor

vkuznet commented Oct 5, 2022

According to the results published in https://its.cern.ch/jira/browse/CMSKUBERNETES-185 you can see the following:
for 10 concurrent clients with a 500MB payload we have 0 failures with keep-alive, while we have 6 failures without keep-alive. This is exactly what I was expecting. Once we start hitting with more concurrent clients, the results (with or without keep-alive) are the same. It means that this PR may help to fix the problem for a low concurrent load on our infrastructure.

That said, the actual fix is to reduce the payload size by simply using gzip encoding.
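For reference, a sketch of what enabling compressed responses could look like, assuming the CherryPy-based REST server used by these services (the keys below belong to CherryPy's built-in gzip tool; this is not WMCore-specific code):

# Sketch: enable CherryPy's built-in gzip tool for JSON/text responses.
import cherrypy

cherrypy.config.update({
    "tools.gzip.on": True,
    "tools.gzip.mime_types": ["application/json", "text/plain"],
    "tools.gzip.compress_level": 6,
})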

@amaltaro amaltaro added this to To do in Planned for Q4 - 2022 via automation Oct 10, 2022
@amaltaro amaltaro moved this from To do to In progress in Planned for Q4 - 2022 Oct 10, 2022
@vkuznet
Contributor

vkuznet commented Oct 28, 2022

@amaltaro , do we need to keep this open? My understanding is that the actual issue is related to the large payload, which we addressed with gzip encoding. Please decide whether we need to keep this issue open.

@amaltaro
Contributor Author

amaltaro commented Nov 3, 2022

We have been running beyond 30k requests for the past 2 months, and things are looking much better. Next week we deploy support for gzipped response content as well, so I would suggest closing it. Thank you for your contribution on this, Valentin.

@amaltaro amaltaro closed this as completed Nov 3, 2022
Planned for Q3 - 2022 automation moved this from In progress to Done Nov 3, 2022
Planned for Q4 - 2022 automation moved this from In progress to Done Nov 3, 2022
@amaltaro
Contributor Author

amaltaro commented Jan 10, 2023

Just an FYI.
Given that I needed to use the instructions provided in this comment:
#11231 (comment)

to test t0_reqmon, and that it required further changes to the script generating the HTTP headers, I decided to consolidate them in this wiki page:
https://github.com/dmwm/WMCore/wiki/REST-localhost-tests

@todor-ivanov
Contributor

todor-ivanov commented Jan 11, 2023

Thanks @amaltaro and @vkuznet for putting those instructions in one place. These are indeed useful. Alan, we may actually link them from the troubleshooting wiki page as well.

@amaltaro
Contributor Author

@todor-ivanov good idea. Can you take care of it? Thanks
