ad-hoc fix for failed workflow while creating WQEs #11810

Merged
merged 2 commits into from
Dec 11, 2023

@vkuznet vkuznet commented Dec 5, 2023

Fixes #11784

Status

ready

Description

This is an ad-hoc patch rather than an actual fix for issue #11784. So far it only disables the exception raised before a WQE is created when no input data placement is present in a given spec (which may be the case when the data is on tape). Instead, we would like to log warnings so that we can monitor and better understand the situation.

Is it backward compatible (if not, which system it affects?)

MAYBE

Related PRs

External dependencies / deployment changes

@vkuznet vkuznet self-assigned this Dec 5, 2023
@vkuznet vkuznet changed the title from "Provide ad-hoc fix for issue 11784" to "ad-hoc fix for failed workflow while creating WQEs" Dec 5, 2023
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 31 new failures
    • 1 tests no longer failing
    • 10 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14691/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Dec 5, 2023

@amaltaro, @todor-ivanov, @khurtado please review this ad-hoc patch. I added an extra call to Rucio via the self.getDatasetLocation API, which should tell us at run-time whether there are available replicas for the dataset in question. I also added logger warning messages to trace these use cases. If you agree, we can merge this PR, create a patch release and deploy it to cmsweb.
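The behaviour described in the comment above (check for replicas at run-time and warn instead of raising) could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `has_input_replicas` and the `lookup_locations` callable are hypothetical names standing in for the Rucio lookup, and the real WorkQueue logic may differ.

```python
import logging

logger = logging.getLogger("workqueue_sketch")


def has_input_replicas(dataset, lookup_locations):
    """Return True if at least one location is reported for the dataset.

    lookup_locations is a hypothetical callable standing in for the Rucio
    lookup: it takes a dataset name and returns a list of site names.
    """
    sites = lookup_locations(dataset)
    if not sites:
        # Assumption: log a warning rather than raise, so the situation
        # (e.g. data only on tape) can be monitored without failing.
        logger.warning("No replicas currently available for dataset %s", dataset)
        return False
    return True
```

A caller could then continue creating work even when the function returns False, relying on the warning for later investigation.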

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 31 new failures
    • 1 tests no longer failing
    • 10 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14692/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Dec 5, 2023

Apparently I broke the unit tests with a missing import; I will fix it shortly.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14693/artifact/artifacts/PullRequestReport.html

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @vkuznet, it looks good.

Contributor

@amaltaro amaltaro left a comment

@vkuznet please see my inline comments on the code.

src/python/WMCore/WorkQueue/WorkQueue.py (outdated, resolved)
@amaltaro
Contributor

amaltaro commented Dec 5, 2023

When the time comes to merge/backport/deploy it, the instructions are very similar to what was mentioned in this comment:
#11781 (review)

Tagging and how to tag is described in this wiki: https://github.com/dmwm/WMCore/wiki/TaggingAndReleasing#new-tagging-convention

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 250 new failures
    • 40 tests deleted
    • 1 tests added
    • 12 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14695/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14696/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 101 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14700/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@vkuznet Valentin, as mentioned in the inline code comments, the getDatasetLocations() function is not used properly in this PR. It looks like only one src/ location and one test/ location call that function, so I would be fully in favor of refactoring it to get a flat list of datasets (removing the DBS dependency, given that it is not used AT ALL during the location resolution).
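One possible reading of the refactoring suggested above is that callers would pass a plain list of dataset names instead of a structure keyed by DBS URL. The sketch below only illustrates that idea; it assumes (this is not confirmed anywhere in the thread) that the existing input is a mapping of DBS URL to dataset names, and `flatten_datasets` is a hypothetical helper, not a WMCore function.

```python
def flatten_datasets(datasets_by_dbs):
    """Collapse a {dbs_url: [dataset, ...]} mapping into a flat list.

    Assumption: the DBS URL keys play no role in location resolution,
    so they are dropped. Duplicates are removed and the result is sorted
    to make it deterministic.
    """
    flat = set()
    for names in datasets_by_dbs.values():
        flat.update(names)
    return sorted(flat)
```

With such a flat list, a location lookup would only need the dataset names and would not depend on DBS at all.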

src/python/WMCore/WorkQueue/WorkQueue.py (outdated, resolved)
@vkuznet
Contributor Author

vkuznet commented Dec 11, 2023

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14707/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14708/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14709/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Dec 11, 2023

Alan, please review again. The unit test that fails (WMCore_t.Services_t.Rucio_t.RucioUtils_t.RucioUtilsTest:testWeightedChoice) is not related to this PR.

Contributor

@amaltaro amaltaro left a comment

Valentin, these changes look good to me. Can you please squash those commits accordingly? Thanks

@vkuznet
Contributor Author

vkuznet commented Dec 11, 2023

Alan, I squashed the commits.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14714/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Dec 11, 2023

Alan, please let me know whether you will merge this PR or want me to proceed. So far every PR I made was merged by you, and I would like to avoid a deadlock where each of us waits for the other if you assume otherwise for this specific PR.

@amaltaro amaltaro merged commit e560431 into dmwm:master Dec 11, 2023
3 of 4 checks passed
@amaltaro
Contributor

@vkuznet Valentin, please go ahead and backport this fix, tag and upgrade global workqueue in cmsweb-testbed. See this comment for further details and/or let me know if you have any questions.

For now, I am going to upgrade testbed with the latest 2.2.6rc5 (being built at this very moment), but once you have a dedicated hot-fix for global workqueue available, just upgrade global workqueue again in testbed.

Development

Successfully merging this pull request may close these issues.

Failed Workflows due to Failure while creating WQE's
4 participants