Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New DBSConcurrency module for concurrent execution of HTTP queries to DBS via pycurl manager #11913

Merged
merged 2 commits into from
Feb 27, 2024

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Feb 26, 2024

Fixes #11899

Status

ready

Description

New DBSConcurrency module to hold helper functions to DBS via concurrent execution of HTTP calls.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11884

External dependencies / deployment changes

It requires pycurl module

@vkuznet vkuznet self-assigned this Feb 26, 2024
@vkuznet vkuznet changed the title Fix issue 11899 New DBSConcurrency module for concurrent execution of HTTP queries to DBS via pycurl manager Feb 26, 2024
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 3 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14915/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 3 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14916/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 26, 2024

Alan, this is a first draft, feel free to provide your feedback. If you satisfied with this code I can import this module in #11884 and use it there after the merge of the latter.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, I have two general comments for this PR.

  1. Ideally, we should make this pycurl-based API as generic as possible - hence accepting a kwargs as input parameter, matching the actual files DBS API. The problem with that is that we have to keep all the data returned by DBS in memory, pass it upstream and let the upstream code to parse and use the data as it needs.
    Another approach would be to create a specific function for the pileup use case, similar to what is written here. But then we better clarify it in the code not to mislead people.

  2. For the unit test, we have two options. a) Either we mark it as integration, not to run it in Jenkins all the time; or b) we use a different dataset that has only a handful of blocks. That Neutrino dataset is huge and we should not add relevant load into the production system by running unit tests. In addition, let us please use a dataset from cmsweb-testbed instead, not to make any requests to the production system.

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 27, 2024

Alan, the second issue is addressed now. Regarding the first one. I think you mix two different concepts here:

  1. API calls where you want generalization
  2. optimization (e.g. memory reduction) of the function

They have nothing to do with each other. For the first one, I can put kwargs into function parameters but do we need it here? First, we must have details since only with this parameter we get file list and nevents. Without it DBS will only return list of files. Second, we need to use validFileOnly because I doubt we ever need invalid files in WMA. Therefore, my question is: is it worth to delegate to upstream code this? In other words upstream code must always specify these parameters.

For the second, (memory) optimization part. The code generalization can be done via converting it to generator which will yield individual records, like {block: {'FilesList':..., 'NumberOfEvents':...}} but it will not solve memory concern since upstream code uses entire dictionary in its operations. Therefore, without re-factoring upstream code (which we just finished) I doubt any optimization here will make any good. The entire problem is dictionary usage in many parts of WMCore, instead of list of records. The dictionary forces entire structure to be loaded into memory and causing large memory footprint. Until the usage of these dict structures will be changed literally little you can do in context of memory optimization.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 2 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14922/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 7 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14923/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 7 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14924/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 27, 2024

Alan, I added memory measurement into my gist, and measured memory allocation before, after multi_getdata and after iteration over generator results. Here are results:

https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader/blocks?dataset=/Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX
before: usertime=0.055698 systime=0.018106999999999998 mem=24.84375 (MB)
after: usertime=0.055705 systime=0.018109 mem=24.84375 (MB)
for /Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX get 463
after gen iteration: usertime=0.071628 systime=0.022612999999999998 mem=30.25 (MB)
before getBlockInfo: usertime=0.071657 systime=0.022626 mem=30.25 (MB)
before getBlockInfo:multi_getdata: usertime=0.07818399999999999 systime=0.022737 mem=30.375 (MB)
before getBlockInfo:multi_getdata: usertime=0.078222 systime=0.022764 mem=30.375 (MB)
after getBlockInfo: usertime=2.442248 systime=1.257647 mem=97.234375 (MB)
Elapsed time 12.484441995620728

Here are few observations:

  • usertime, systeim and memory are the same before and after multi_getdata generator call
  • for multi_getdata calls within getBlockInfo function I see the same pattern
  • but once we iterate over generator we see jump from 24MB to 97MB (for 463 blocks info we get from DBS)

This confirms my observation that multi_getdata (generator) function by itself does not lead to any memory increase, and only once we start iterating we get appropriate bump based on number of results we get from generator function. Therefore, usage of multi_getdata ONLY lead to speed up of HTTP calls and total time is equal to average single HTTP request.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 7 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14925/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 7 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14926/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 27, 2024

Now, it is ready for another round of review.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for confirming the behavior of multi_getdata() Valentin.
These changes look good to me and I think this closes the loop of partial pileup support in the agent.

@amaltaro amaltaro merged commit 9a703d6 into dmwm:master Feb 27, 2024
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Efficiently obtain file list and number of events information for a given list of block names
3 participants