DBS3Upload component keeps running during the injection of not valid files coming from a race condition on DBS server side #11107

Closed · jhonatanamado opened this issue Apr 21, 2022 · 9 comments

@jhonatanamado (Contributor)

Impact of the bug
T0 Agent, WMAgent, Agents/Services using DBS3Upload

Describe the bug
Testing a replay against the dbs2go server, we found that the DBS3Upload component of the Tier0 agent keeps running when the following error appears in the component log:

2022-04-15 23:07:41,192:140096476325632:ERROR:DBSUploadPoller:Error trying to process block /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac through DBS. Error: [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.GetRecID Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.processeddatasets.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_PSDS_PROCESSED_DS_NAME) violated\n","message":"2bb8ee3b62a4bef5db1fcf52a593f924ec05db159720e2a24195f29f57292dc6 unable to find processed_ds_id for procDS='Tier0_REPLAY_2022-Express-v37'","function":"dbs.bulkblocks.getProcessedDSName","code":109},"http":{"method":"POST","code":400,"timestamp":"2022-04-15 21:07:41.187425862 +0000 UTC m=+114594.770691755","path":"/dbs/int/global/DBSWriter/bulkblocks","user_agent":"DBSClient/Unknown/","x_forwarded_host":"cmsweb-testbed.cern.ch:8443","x_forwarded_for":"137.138.31.57","remote_addr":"10.100.6.64:21420"},"exception":400,"type":"HTTPError","message":"DBSError Code:109 Description:DBS DB ID error for provided entity, e.g. there is no record in DB for provided value Function:dbs.bulkblocks.getProcessedDSName Message:2bb8ee3b62a4bef5db1fcf52a593f924ec05db159720e2a24195f29f57292dc6 unable to find processed_ds_id for procDS='Tier0_REPLAY_2022-Express-v37' Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.GetRecID Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.processeddatasets.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_PSDS_PROCESSED_DS_NAME) violated\n"}]
Traceback (most recent call last):
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/t0/3.0.4/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 92, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 606, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 448, in __callServer
    self.__parseForException(http_error)
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 492, in __parseForException
    raise http_error
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 445, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/tier0/srv/wmagent/3.0.4/sw.jamadova/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request
2022-04-15 23:07:41,258:140096476325632:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac
2022-04-15 23:07:41,258:140096476325632:ERROR:DBSUploadPoller:Error trying to process block /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac through DBS. Error: [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.GetRecID Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.processeddatasets.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_PSDS_PROCESSED_DS_NAME) violated\n","message":"2bb8ee3b62a4bef5db1fcf52a593f924ec05db159720e2a24195f29f57292dc6 unable to find processed_ds_id for procDS='Tier0_REPLAY_2022-Express-v37'","function":"dbs.bulkblocks.getProcessedDSName","code":109},"http":{"method":"POST","code":400,"timestamp":"2022-04-15 21:07:41.187425862 +0000 UTC m=+114594.770691755","path":"/dbs/int/global/DBSWriter/bulkblocks","user_agent":"DBSClient/Unknown/","x_forwarded_host":"cmsweb-testbed.cern.ch:8443","x_forwarded_for":"137.138.31.57","remote_addr":"10.100.6.64:21420"},"exception":400,"type":"HTTPError","message":"DBSError Code:109 Description:DBS DB ID error for provided entity, e.g. there is no record in DB for provided value Function:dbs.bulkblocks.getProcessedDSName Message:2bb8ee3b62a4bef5db1fcf52a593f924ec05db159720e2a24195f29f57292dc6 unable to find processed_ds_id for procDS='Tier0_REPLAY_2022-Express-v37' Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.GetRecID Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.processeddatasets.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_PSDS_PROCESSED_DS_NAME) violated\n"}]
2022-04-15 23:07:41,284:140096476325632:INFO:BaseWorkerThread:DBSUploadPoller took 0.825 secs to execute
2022-04-15 23:09:21,420:140096476325632:INFO:DBSUploadPoller:Starting the DBSUpload Polling Cycle
2022-04-15 23:09:21,426:140096476325632:INFO:DBSUploadPoller:Found 53 open blocks.
2022-04-15 23:09:21,426:140096476325632:INFO:DBSUploadPoller:Loaded 0 blocks.
2022-04-15 23:09:22,135:140096476325632:INFO:DBSUploadPoller:Queueing block for insertion: /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac
2022-04-15 23:09:22,137:140096476325632:INFO:DBSUploadPoller:Queueing block for insertion: /StreamExpressCosmics/Tier0_REPLAY_2022-PromptCalibProdSiStrip-Express-v37/ALCAPROMPT#4ed37e20-5374-44a9-9013-b6bc440f1855
2022-04-15 23:09:22,138:140096476325632:INFO:DBSUploadPoller:Queueing block for insertion: /StreamExpressCosmics/Tier0_REPLAY_2022-SiPixelCalZeroBias-Express-v37/ALCARECO#1b72d7f6-cd10-40f7-a4f3-531e2787efd1
2022-04-15 23:09:22,138:140096476325632:INFO:DBSUploadPoller:About to call insert block for: /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac
2022-04-15 23:09:22,139:140096476325632:INFO:DBSUploadPoller:Queueing block for insertion: /ExpressPhysics/Tier0_REPLAY_2022-Express-v37/FEVT#22c306c6-e47d-4786-9652-00e13d7ab34f
2022-04-15 23:09:22,139:140096476325632:INFO:DBSUploadPoller:About to call insert block for: /StreamExpressCosmics/Tier0_REPLAY_2022-PromptCalibProdSiStrip-Express-v37/ALCAPROMPT#4ed37e20-5374-44a9-9013-b6bc440f1855

Later, all other calls injecting data into DBS succeeded and the replay finished without issues (all workflows marked as archived). However, for the block /ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac, the files appear as not valid in dasgoclient.

"https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSReader/files/?block_name=/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT%237555befd-7860-4940-b23b-a1ad71d30eac&detail=True&validFileOnly=0"
[
{"adler32":"c04c503b","auto_cross_section":0,"block_id":23911802,"block_name":"/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac","branch_hash_id":null,"check_sum":"4288972905","create_by":"WMAgent","creation_date":1650056962,"dataset":"/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT","dataset_id":14352048,"event_count":1964,"file_id":657086557,"file_size":460031022,"file_type":"EDM","file_type_id":1,"is_file_valid":0,"last_modification_date":1650056962,"last_modified_by":"WMAgent","logical_file_name":"/store/backfill/1/express/Tier0_REPLAY_2022/ExpressCosmics/FEVT/Express-v37/000/345/595/00000/28908492-bb93-42f1-b371-15f841e740dd.root","md5":null}
,{"adler32":"71838cce","auto_cross_section":0,"block_id":23911802,"block_name":"/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT#7555befd-7860-4940-b23b-a1ad71d30eac","branch_hash_id":null,"check_sum":"2177888195","create_by":"WMAgent","creation_date":1650056962,"dataset":"/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/FEVT","dataset_id":14352048,"event_count":200,"file_id":657086597,"file_size":48096552,"file_type":"EDM","file_type_id":1,"is_file_valid":0,"last_modification_date":1650056962,"last_modified_by":"WMAgent","logical_file_name":"/store/backfill/1/express/Tier0_REPLAY_2022/ExpressCosmics/FEVT/Express-v37/000/345/595/00000/2de1561e-fc99-46d4-a8f0-789bce4c1d4f.root","md5":null}
]
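For reference, the same listing can be reproduced through the dbs3-client API instead of the raw URL (a minimal sketch, reusing the testbed DBSReader instance and block name quoted above):

```python
from dbs.apis.dbsClient import DbsApi

# Testbed DBSReader instance and block name taken from the URL above.
url = "https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSReader"
block = ("/ExpressCosmics/Tier0_REPLAY_2022-Express-v37/"
         "FEVT#7555befd-7860-4940-b23b-a1ad71d30eac")

api = DbsApi(url=url)
# validFileOnly=0 also returns invalid files; they show is_file_valid == 0.
for f in api.listFiles(block_name=block, detail=True, validFileOnly=0):
    print(f["logical_file_name"], f["is_file_valid"])
```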

More information about this race condition is available in #11106.

How to reproduce it
This error appears only when a new replay version is set. For datasets/blocks already created in DBS, the injection of new files succeeds.

Expected behavior
The component should raise an exception if the injection is not possible.
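A minimal sketch of that expected behavior (hypothetical helper name, not the actual WMCore code), assuming the same dbs3-client call shown in the traceback above:

```python
from dbs.apis.dbsClient import DbsApi

def upload_block_or_fail(dbs_url, block):
    """Hypothetical: insert a block dump and surface any server error."""
    dbsApi = DbsApi(url=dbs_url)
    try:
        dbsApi.insertBulkBlock(blockDump=block)
    except Exception as exc:
        # Instead of only logging and re-queueing the block on the next
        # polling cycle, re-raise so the component stops and alerts us.
        raise RuntimeError("DBS injection failed for block %s: %s"
                           % (block["block"]["block_name"], exc)) from exc
```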

Additional context and error message
Full component log can be found in /afs/cern.ch/user/c/cmst0/public/WMCoreIssues/DBS2GO

@vkuznet (Contributor) commented Apr 21, 2022

@jhonatanamado you outlined three different issues here:

  • WMCore should raise an exception when such an error occurs and stop the injection process
  • the issue with invalid files, which is fixed in the new DBS server by commit dmwm/dbs2go@d40973d
  • the ORA-00001 error, which (I hope) is fixed by commit dmwm/dbs2go@a0dd13f

The latter two issues should therefore disappear in new workflow submissions, since the fixes are included in the new DBS server version already deployed on testbed. The original issue, that WMCore should raise an exception, is what this ticket should address.

@amaltaro (Contributor)

@jhonatanamado Jhonatan, the database constraint error is likely something that shouldn't even be reported to the end user. It's also likely not the error we should be looking at (there are several errors nested in the error message, as far as I can follow).

Anyhow, I see that Valentin has made some changes to the DBSWriter code and we have a new dbs2go release in testbed. Before doing any further investigation, could you please trigger a new replay (ensuring the output datasets are unique, which I believe your workflow already does)? Please let us know the outcome. Thanks!

@jhonatanamado (Contributor, Author)

As requested, I deployed a second replay in testbed, ensuring that the output datasets would be unique, with some runs that put considerable pressure on the server. Checking the ComponentLog, I saw some Oracle errors, including Error: ORA-00001: unique constraint, but in this particular case, cross-checking the output against the information in DBS shows that it was properly uploaded.
The ComponentLog can be found in /afs/cern.ch/user/c/cmst0/public/WMCoreIssues/DBS2GO/ComponentLogv2
This leads to the following questions:

  1. When should the component crash?
  2. The errors are not in the list of the isPassiveError function, so why was the component still running?

@amaltaro (Contributor)

@jhonatanamado thanks for running another test.

From the log you posted, this error looks strange to me (even though it is not critical):

[{"error":{"reason":"ORA-02396: exceeded maximum idle time, please connect again\n"

Regarding this error message:

2022-04-26 07:12:33,418:140557585323776:ERROR:DBSUploadPoller:Error trying to process block /ZeroBias1/Tier0_REPLAY_2022-v42/RAW#4afc9904-31e9-493d-92a5-8da583f9c470 through DBS. Error: [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.outputconfigs.InsertOutputConfigs Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.releaseversions.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_RV_RELEASE_VERSION) violated\n","message":"9a61d11e7d976d3ecde76517df4d1eb264ccf148536bcebca835887a285eb72f","function":"dbs.bulkblocks.insertDatasetConfigurations","code":110},"http":{"method":"POST","code":400,"timestamp":"2022-04-26 05:12:33.414400699 +0000 UTC m=+299481.922399975","path":"/dbs/int/global/DBSWriter/bulkblocks","user_agent":"DBSClient/Unknown/","x_forwarded_host":"cmsweb-testbed.cern.ch:8443","x_forwarded_for":"137.138.31.57","remote_addr":"10.100.121.64:55736"},"exception":400,"type":"HTTPError","message":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertDatasetConfigurations Message:9a61d11e7d976d3ecde76517df4d1eb264ccf148536bcebca835887a285eb72f Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.outputconfigs.InsertOutputConfigs Message: Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.releaseversions.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_RV_RELEASE_VERSION) violated\n"}]

I have two comments to make:

  1. there is some duplication in the error reported back to the client (coming from the reason and message keys). Is it coming from dbs3-client?
  2. it looks to me like the DBS server runs a set of insert calls in parallel, or disregards the outcome of the previous insert; otherwise we would see a single failure reported (and the server transaction would stop as soon as an error is hit). @vkuznet should clarify this.

Last but not least, isPassiveError is meant to list WMAgent errors, not errors raised by external services/calls (like the DBS ones). In short, I'd say we are in good shape here!
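To illustrate the distinction (a hypothetical sketch; the real isPassiveError list in WMCore is different, and all names below are made up for illustration):

```python
import logging

# Hypothetical markers for transient, agent-internal problems.
PASSIVE_ERROR_MARKERS = ("Connection refused",
                         "Temporary failure in name resolution")

def is_passive_error(exc):
    """Return True for transient agent-internal errors we tolerate."""
    return any(marker in str(exc) for marker in PASSIVE_ERROR_MARKERS)

def handle_upload_error(exc):
    if is_passive_error(exc):
        logging.warning("Transient agent error, will retry: %s", exc)
    else:
        # External service errors (e.g. a DBS HTTP 400) do not match the
        # passive list; the block is simply retried on the next cycle.
        logging.error("DBS error, block left for next polling cycle: %s", exc)
```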

@vkuznet (Contributor) commented Apr 27, 2022

Regarding oracle errors:

  • ORA-02396 seems to be related to the fact that a single transaction timed out for some reason. The log shows nothing specific about it, so we can only guess here.
  • ORA-00001 is again related to the issue I outlined in Eliminate racing conditions in DBS data injection procedure #11106. To understand it better, I need to know the input details and how many concurrent threads were used to inject the different block data. Can you provide this info? Meanwhile, I changed the logic of inserting dataset output configs (which your JSON supplies) to avoid a single transaction and instead use a separate transaction for each config. The new code is deployed on testbed and I suggest you retry your tests.

@amaltaro (Contributor)

> Meanwhile, I changed the logic of inserting dataset output configs (which your JSON supplies) to avoid a single transaction and instead use a separate transaction for each config.

I'm not sure you answered my questions raised above. But to add to this: given that we are inserting one unit, the block and its related data/metadata, I think it should all be done within a single transaction. If you split it into multiple transactions, how can you roll back a transaction that has already completed and committed?
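For illustration, a minimal single-transaction sketch (sqlite3 standing in for Oracle, with made-up table names): if any insert of the block's entities fails, the whole unit rolls back and no partial block is left behind.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY)")
con.execute("CREATE TABLE files (lfn TEXT PRIMARY KEY, dataset TEXT)")

block = {"dataset": "/Primary/Processed/TIER",
         "files": ["/store/f1.root", "/store/f1.root"]}  # duplicate LFN
try:
    with con:  # one transaction for the whole block
        con.execute("INSERT INTO datasets VALUES (?)", (block["dataset"],))
        for lfn in block["files"]:
            con.execute("INSERT INTO files VALUES (?, ?)",
                        (lfn, block["dataset"]))
except sqlite3.IntegrityError as err:
    # The 'with con' context manager rolled everything back on error,
    # so not even the dataset row was committed.
    print("block rejected, no partial insert:", err)
```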

In addition, I do not think database-layer information should be returned to the end user (the end user does not need to know anything about the database: tables, columns, constraints, and so on). What we care about is whether a block, a file, or a processing_string failed, succeeded, is duplicated, and so on.

If I understand correctly what is done on the server side, you:

  1. get a block dump to be inserted
  2. break this dump down into many entities/binds
  3. pass these binds to many goroutines, i.e., all inserts running in parallel
  4. add all the possible errors in the message to be sent back to the end user

Is that correct? If so, it means the client could get 10 error messages back because the very first insertion (e.g. the dataset id) failed, cascading this error to all the other inserts running concurrently.
This makes things a bit harder to debug, since we do not know what the root cause of the problem is, but identifying it on the server side might be too much and I think it's okay as it is.
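A toy illustration of that cascading pattern (pure Python, not the actual dbs2go code): one root failure in the dataset insert makes every dependent insert report its own derived error.

```python
from concurrent.futures import ThreadPoolExecutor

IDS = {}  # shared lookup that successful inserts would populate

def insert(entity, parent=None):
    if parent is not None and parent not in IDS:
        raise LookupError("no record in DB for parent '%s'" % parent)
    if entity == "dataset":
        raise RuntimeError("ORA-00001: unique constraint violated")
    IDS[entity] = len(IDS) + 1

with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(insert, name, parent)
               for name, parent in [("dataset", None),
                                    ("block", "dataset"),
                                    ("file", "block")]}

errors = []
for name, fut in futures.items():
    try:
        fut.result()
    except Exception as err:
        errors.append("%s: %s" % (name, err))
print(errors)  # three errors reported back, one root cause
```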

@vkuznet (Contributor) commented Apr 27, 2022

We had a chat with Alan and agreed that dataset output configurations will be injected within a single transaction; I rolled back my changes on testbed. I also provided a temporary solution for the race conditions, please see my comments here: #11106 (comment)

@amaltaro (Contributor)

Regarding the reason and message keys: as discussed with Valentin, message is supposed to be shorter and more user-friendly, while reason keeps accumulating errors so that we can see where everything started and spot what caused the nested error.
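For example, a client can separate the two fields when handling a payload like the ones quoted above (a minimal sketch with a trimmed stand-in payload):

```python
import json

# Trimmed stand-in for the DBS error payloads quoted earlier in this thread.
error_body = ('[{"error": {"reason": "DBSError Code:110 ... ORA-00001 ...",'
              ' "message": "unable to find processed_ds_id", "code": 109}}]')

err = json.loads(error_body)[0]["error"]
print("short, user-facing :", err["message"])
print("accumulated chain  :", err["reason"])
```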

@jhonatanamado as there is no real problem here, besides the future code refactoring to be addressed in #11106, shall we close this ticket?

@jhonatanamado (Contributor, Author)

Yes @amaltaro, I'm closing this ticket.
