Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to use DBSError reason/srvCode instead of if/else exception block #11375

Merged
merged 1 commit into from
Jan 5, 2023

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Nov 28, 2022

Fixes #11365

Status

not-tested

Description

Replace if/else exception block with DBSError reason and server code

Is it backward compatible (if not, which system it affects?)

MAYBE

Related PRs

it is continuation of effort started in #11173 and #10962

External dependencies / deployment changes

yes, it requires new dbs3-client, version 4.0.12

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13757/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, I have a general comment on this implementation.
We should be replacing all those checks of string errors by the actual server error code.

In other words, we will still have the if/elif/else logic, but it will be replaced by server code errors. Example:

srvCode = dbsError.getServerCode()
if srvCode == xxx:  # block already exists
    mark call as success
elif srvCode == yyy:  # missing parent
    mark call as failure
else:
   mark call as failure, unexpected error

@vkuznet
Copy link
Contributor Author

vkuznet commented Dec 1, 2022

@amaltaro , ok, how about now? I added all existing codes to the code and made appropriate message. Please review.

@vkuznet vkuznet requested a review from amaltaro December 1, 2022 13:10
@amaltaro
Copy link
Contributor

amaltaro commented Dec 1, 2022

I think we should check only for the expected error codes, which AFAICT they are:

  1. block already exists
  2. dataset_id not found (missing parent data)
  3. (and it's likely a good idea) datatier does not exist

we should ditch all the others. From what I understand, the transaction on the server side will never yield many of those errors that you implemented. If we see that we are getting hit by another one of those, then we keep adding add. But I would start it with a very small set and stick to that if possible.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

@vkuznet
Copy link
Contributor Author

vkuznet commented Dec 2, 2022

@amaltaro , I'm sorry but I disagree with your assessment. The codes I added are possible to hit in DBSUploadPoller, e.g. in order to insert block/dataset an acquisition or processing era should be registered, etc. In other words it is possible due to existing data racing HTTP conditions #11106 that two HTTP requests can overlap and one of the errors I mentioned and implemented may occur in request. In my view it is better to have full coverage now rather than later add yet another code to cover rare case. Please re-consider.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, I see two better alternatives then, such that we can remove all this DBS error handling here and rely more on the DBSError class that you recently provided. They are:
a) either rely on the DBSError.getMessage() to get a very short and specific error message (hopefully it would be similar to what you defined here). Of course, we would have to enrich the error message with something like "Temporary failure for block {block}. Error: {dbsError.getMessage()}"...
b) we define a new method in DBSError that would map the "getServerCode" output to such very short error messages.

Of course, we still need to check if the error is meant to be considered a success (block already exists) or error (other failures). What do you think?

@vkuznet
Copy link
Contributor Author

vkuznet commented Dec 15, 2022

@amaltaro , from good design principle the DBSError class only parses errors and provides details. It has no logic about DMWM workflows or data injection, therefore we should not enrich it with such logic. If block exists in DBS and someone tries to insert it again, it is an error from DBS side, period. But from DMWM point of view it is not since the block exists and we can proceed safely. Therefore, both of your proposals do not appeal to me, and I still think the proper logic is correctly defined in DBSUploadPoller which knows details about data injection logic.

@amaltaro
Copy link
Contributor

I have to argue that that long if/elif statement is doing exactly what you mentioned, parsing the error details provided by the DBS server, hence a good design practice would be to encapsulate that same logic into its own class (DBSError).

If you don't want to rewrite the error message, as currently provided in this PR, then I think we can use the output of getMessage method. Code becomes much smaller and mainteinable. This would be my option a) above by the way.

@vkuznet
Copy link
Contributor Author

vkuznet commented Dec 15, 2022

Alan, you should not be upset with if/else because it walks through all possible codes. If you want we can put this into separate function for code clarify. What I'm saying that this function or if/else block does not belong to DBSError generic class because they represent logic of data processing and not code themselves. It is WMCore code which needs this logic and not another way around. The purpose of DBSError is to extract error codes, and get message, but it is not about placing calls to DBS. If we'll put this if/else block into DBSError then we'll need to parse URLs of the calls, while here we know the name (the parameter passed to DBS API) because it belongs here. Think how you'll construct such messages if you'll need to parse all possible DBS URLs along with their parameters. In other words, DBSError class does not hold input parameters of DBS URLs, i.e. in this specific case name of the block.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 4, 2023

@amaltaro , I do not want to put this PR aside, please re-read my reply and let me know how would you like to proceed. If you still think that if/else code should be put in DBSError class I'll move it there, but it will not provide details of input API parameters (as I explained in my reply this will require parsing URLs rather taking input parameters from the code as it is right now).

@amaltaro
Copy link
Contributor

amaltaro commented Jan 4, 2023

The purpose of DBSError is to extract error codes, and get message, but it is not about placing calls to DBS. If we'll put this if/else block into DBSError then we'll need to parse URLs of the calls...

Valentin, it looks like you misunderstood my previous message. I am not saying that the DBSError class should be in charge of making calls to DBS and/or parsing any URLs.
My point is that the DBSError class is meant to be an object containing a generic DBS error. As such, this is the correct place to define any generic error messages based on the server exit code.

Can you please remind me if it's the message or the reason attribute that will have a short version of the error message (instead of the nested error chain)? My goal here is to replace all those error messages that you defined in the msg variable, according to the server exit code, by something like:

errMsg = f"Failed to insert block: {blockName}, with server error code: {srvCode} and error message: {srvMsg}"

In my opinion, the current proposal should be refactored to something like (very short version...):

if srvCode == 128:
                # block already exist
                logging.warning("Block %s already exists. Marking it as uploaded.", name)
                results.put({'name': name, 'success': "uploaded"})
else:
                errMsg = f"Failed to insert block: {blockName}, with server error code: {srvCode} and error message: {srvMsg}"
                logging.error(errMsg)
                results.put({'name': name, 'success': "error", 'error': errMsg})

In other words, if the DBS server is already providing a clear and meaningful error message - as those that you extensively defined in the if/elif/else statement - then we should simply use that instead of redefining them in this component.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 4, 2023

ok, here is what DBS server returns, I only show you relevant parts:

scurl -s "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatiers?foo=1" | jq
[
  {
    "error": {
      "reason": "invalid parameter(s)",
      "message": "parameter 'foo' is not accepted by 'datatiers' API",
      "code": 118,
    }.
    "message": "DBSError Code:118 Description:DBS invalid parameter for the DBS API Function:dbs.parameters.CheckQueryParameters Message:parameter 'foo' is not accepted by 'datatiers' API Error: invalid parameter(s)"

Therefore

  • getMessage method of DBSError class will extract error.message, i.e. "parameter 'foo' is not accepted by 'datatiers' API"
  • getReason method of DBSError class will extract error.reason, i.e. "invalid parameter(s)"
  • none of methods of DBSError class will use nested message of DBS error

In case of failed block the getMessage will return Data already exist in DBS, see here and getReason will return fmt.Sprintf("Block %s already exists", bName) see here. In other words, the reason returns actual issue with failure, the message returns generic message. The current PR uses getReason and lists all possible error codes which DBS server can produces during block injection. If you want to replace actual msg lines then we should use bothgetReason and getMessage DBSError methods and server code, since in most cases getMessage is what I added to the code and getReason is actual error returned from DBS database.

@amaltaro
Copy link
Contributor

amaltaro commented Jan 4, 2023

If you want to replace actual msg lines then we should use both getReason and getMessage DBSError methods and server code, since in most cases getMessage is what I added to the code and getReason is actual error returned from DBS database.

Thanks for providing this extra information. Yes, then I agree with your statement above.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 5, 2023

@amaltaro I made necessary changes to address your suggestion, could you please review my last commit bbdb136 and you'll be satisfied I'll squash the changes and we can proceed with this PR.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13861/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's much better now, thanks!
I took the opportunity to make a few extra suggestions along the code. Please have a look.

src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated Show resolved Hide resolved
src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated Show resolved Hide resolved
src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated Show resolved Hide resolved
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13862/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro January 5, 2023 14:47
@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 5, 2023

@amaltaro I made necessary changes, please review. I also saw one unit test failure but I doubt it is related to these changes and I think it is rather unstable one. Please check and advise further.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13863/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks good to me.

@amaltaro
Copy link
Contributor

amaltaro commented Jan 5, 2023

test this please

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13864/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit 908707a into dmwm:master Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Start using DBSError codes in WMCore
3 participants