Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap sendAlert() in a try/except block. #10467

Conversation

todor-ivanov
Copy link
Contributor

@todor-ivanov todor-ivanov commented Apr 22, 2021

Fixes #10464

Status

Ready

Description

While using the old mechanism for sending alerts from MSTransferor we were not trying to handle exceptions at place, when it comes to sending alerts. But this was breaking the creation of the Rucio rules for a given data block. Or more precisely it was calling Rucio successfully and creating all the necessary rules, and only when it was about some large blocks(which were exceeding the so configured threshold), it was trying to send alerts which from times to times may break, due to a plethora of reasons [1]. In result the function call self.makeTransferRucio() was never able to succeed and the call for the given DID was retried every MSTransferor polling cycle with one and the same RSE list, while the workflow was sitting in assigned. In general Rucio is able to protect us from situation like this because it does not recreate already existing rules. A side effect of such behavior, though, is that in case any of the RSEs which are to be used for these rule submissions gets out of quota (i.e. for a reason not related to our broken workflow) MSTransferor drops it from the list of RSEs for this DID and in result the new rule to be created is not considered as duplicate and then Rucio creates a brand new rule, which leads to the over replication.

The current PR aims to enclose the the call to the alertManager in order to avoid situation as the one explained above.

[1]

self.notifyLargeData(aboveWarningThreshold, transferId, wflow.getName(), dataSize, dataIn)

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/83
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/84
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/85
and configuration for MSOutput
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/86
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/87
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/88

External dependencies / deployment changes

In order to fix the actual issue we also need to provide the proper PR in the gitlab K8 configuration repository for adding the correct AlertManagerUrl to the MSTransfferor configuration

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing this fix, Todor. It looks good to me.
However, I'm only approving it once we also have the relevant configuration changes linked in this PR :-D (can you also please update the initial description? There is one field with the placeholder content).

@todor-ivanov
Copy link
Contributor Author

todor-ivanov commented Apr 22, 2021

Hi @amaltaro I am getting there (to the PR with the proper configuration changes) really soon ... The problem is I cannot manually generate an alert with the AlertManagerAPI and the correct url (which is now known). That's why I've put status "Need more testing". I suspect some more changes to the code may be needed here and there.

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: failed
    • 1 new failures
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11666/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Copy link
Contributor Author

Hi @amaltaro,
Now we have the full picture. I just made the three PRs in the K8 service-config repository and please take a look at my second commit to the current PR which reflects a change that we need to make in the current alertManagerAPI class in order to use the proper api version, as it is explained in the prometheus documentation [1]. I decided to split the Url from the Uri and keep the former in the service config, while the later to be in the code, but if you or @goughes think otherwise, just let me know.

[1]
https://www.prometheus.io/docs/alerting/latest/clients/

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: failed
    • 4 new failures
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 34 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11672/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor

@todor-ivanov these changes look good to me.

I have two questions though:
*) did you manage to run a basic test with these changes? If so, then I think we can run a final test in the production POD, and if things go through, then we make a new release and ask Imran to push that in as soon as images are available.
*) MSOutput also relies on AlertManager. Perhaps you could update those 3 merge requests in gitlab for ms-output as well? Creating new MR is fine as well.

@todor-ivanov todor-ivanov force-pushed the bug_MSTransferor_FailsToSendAlerts_fix-10464 branch from 01ffb2a to 8720f89 Compare April 23, 2021 08:18
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 tests no longer failing
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11674/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Todor, these changes look good to me.
I added the MSOutput configuration merge requests to the PR initial description.

Are you done testing these changes? Please also update it in the PR description.

@todor-ivanov
Copy link
Contributor Author

Thanks @amaltaro ! Yes I am done with all the tests needed. Sorry for not reflecting this in the PR description.

@amaltaro
Copy link
Contributor

Thanks Todor.

@amaltaro amaltaro merged commit ad4e143 into dmwm:master Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MSTransferor fails to send alerts through Alertmanager
3 participants