
Eliminate racing conditions in DBS data injection procedure #11106

Open
vkuznet opened this issue Apr 20, 2022 · 2 comments

vkuznet commented Apr 20, 2022

Impact of the bug
The current bulkblocks API for DBS data injection contains internal race conditions, independent of the DBS server implementation (Python or Go). We should revisit how we insert data into DBS and replace the bulkblocks API with a new data-injection procedure.

Describe the bug
The data input (JSON) for the bulkblocks DBS API contains both common and independent parts. For instance, if we want to inject two different blocks via parallel HTTP calls, the JSON provided to the bulkblocks API contains static data that is identical in both payloads (e.g. physics groups, data tiers, dataset access types) as well as data that differs between them (the blocks/files parts). Since each HTTP call runs its own internal database transactions, these transactions can compete to insert, or obtain the ID of, the static data, which leads to a race condition.
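The check-then-insert collision can be illustrated with a minimal, self-contained sketch (SQLite stands in for ORACLE, and the `physics_groups` table and column names are hypothetical; this is not DBS code):

```python
import sqlite3

# Shared table standing in for an ORACLE table with a UNIQUE constraint.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE physics_groups (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

cur1, cur2 = db.cursor(), db.cursor()   # two concurrent "transactions"
name = "Tracker"

# Both requests first look the row up and find nothing...
row1 = cur1.execute("SELECT id FROM physics_groups WHERE name = ?", (name,)).fetchone()
row2 = cur2.execute("SELECT id FROM physics_groups WHERE name = ?", (name,)).fetchone()
assert row1 is None and row2 is None

# ...so both decide to insert it; the slower one violates the UNIQUE constraint.
cur1.execute("INSERT INTO physics_groups (name) VALUES (?)", (name,))
err = None
try:
    cur2.execute("INSERT INTO physics_groups (name) VALUES (?)", (name,))
except sqlite3.IntegrityError as exc:    # SQLite's analogue of ORA-00001
    err = exc
print("second insert failed:", err)
```

The same interleaving happens inside the DBS server whenever two bulkblocks transactions both find a common attribute missing and both try to create it.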

How to reproduce it
Send multiple requests to the bulkblocks API of the Python or Go server with blocks/files from the same dataset. At a certain rate we will observe the ORA-00001 error in the DBS server code, which represents a transaction error on injection of identical data, i.e. we need to either insert or obtain the ID of common fields such as data tiers, physics groups, processed datasets, etc. The probability of the error grows with the size of the internal ORACLE table where we must insert, or check for the existence of, a common attribute. For instance, the probability is low for sparsely populated tables such as physics groups and data tiers, but much higher for dense tables such as processed datasets.
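A reproduction driver could be sketched as follows. The endpoint URL and payload shape are placeholders, and `post` is pluggable, so the snippet runs here with a stub; against a real DBS server (e.g. with `requests.post`), blocks from the same brand-new dataset would make some of the parallel calls fail with ORA-00001:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def inject_blocks(post, url, payloads, workers=4):
    """Send each bulkblocks JSON payload concurrently; collect results or errors."""
    def one(payload):
        try:
            return ("ok", post(url, json.dumps(payload)))
        except Exception as exc:          # a real server may surface ORA-00001 here
            return ("error", str(exc))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, payloads))

# Stubbed usage: every call "succeeds" with the fake transport.
fake_post = lambda url, body: 200
payloads = [{"block": {"block_name": f"/a/b/c#{i}"}} for i in range(8)]
results = inject_blocks(fake_post, "https://dbs.example.com/bulkblocks", payloads)
print(results[0])   # → ('ok', 200)
```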

Currently, both the Python and Go servers have a way to retry on the ORA-00001 ORACLE unique-constraint violation.
For instance, the Python codebase has the following code, see lines 467-562:

...
                    #Another job inserted it just 1/100000 second earlier than
                    #you!!  YG 11/17/2010
                    if str(ex).find("ORA-00001") != -1 or str(ex).lower().find("duplicate") !=-1:
                        if str(ex).find("TUC_OMC_1") != -1:
                            #the config is already in db, get the ID later
                            pass
                        else:
                            #reinsert it if one, two or three of the three attributes (version, hash and app) were inserted
                            #just 1/100000 second earlier.
                            try:
                                self.otptModCfgin.execute(conn, configObj, tran)
                                tran.commit()
                                tran = None
                            except exceptions.IntegrityError as ex:
                                if (str(ex).find("ORA-00001") != -1 and str(ex).find("TUC_OMC_1"))\
                                        or str(ex).lower().find("duplicate") != -1:
                                    pass

And the same pattern (handling the ORA-00001 error) repeats on the following lines:
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L139,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L316,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L382,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L409,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L510,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L522,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L616,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L683,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L736,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L812,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L816,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L918,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L964,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L988.
In total, there are 14 occurrences, each handling the race condition in a slightly different way.
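The repeated pattern could, in principle, be factored into a single helper; the following is a sketch of the idea (not DBS code — the callables and the in-memory "table" are illustrative stand-ins):

```python
# Generalized "insert or get" with the duplicate-key race treated as benign:
# a unique-constraint violation means someone else inserted the row first,
# so we fall back to re-reading its ID.
def insert_or_get(insert, get_id, is_duplicate_error):
    """insert() writes the row; get_id() reads its ID; is_duplicate_error()
    recognizes ORA-00001 / 'duplicate' in the raised exception."""
    try:
        insert()
    except Exception as exc:
        if not is_duplicate_error(exc):
            raise                      # a real failure, not a race
    return get_id()                    # winner or loser, the row now exists

# Usage with an in-memory stand-in for a database table:
table = {}
def is_dup(exc): return "duplicate" in str(exc).lower()
def ins():
    if "PhysGroup" in table:
        raise RuntimeError("duplicate key")   # stands in for ORA-00001
    table["PhysGroup"] = 42
def get(): return table["PhysGroup"]

print(insert_or_get(ins, get, is_dup))   # first call inserts → 42
print(insert_or_get(ins, get, is_dup))   # second call hits "duplicate", still 42
```

This masks the symptom, but as argued below it cannot remove the race itself, which is inherent in the API design.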

For some calls this error is ignored, since the code assumes the data is already in the DB; for others (as shown above) there are repeated attempts to reinsert the data; and for yet others there are even comments about race conditions (worded differently), e.g.

            if str(ei).find("ORA-00001") != -1 or str(ei).lower().find("duplicate")!=-1:
                # For now, we assume in most cases the same dataset was inserted by a different thread. If not,
                # one has to call the dataset insertion again. But we think this is a rare case and let the second
                # DBSBlockInsert call fix it if it happens.

In the Go code, I added the following construct to handle the same issue: https://github.com/dmwm/dbs2go/blob/master/dbs/dbs.go#L610-617

Both code bases are vulnerable to the ORA-00001 issue, which arises from the design of the bulkblocks API. To work around it, we either need to repeat the data insertion in the server code (as is currently done in the Python and Go servers) or on the client side, by re-injecting the data. But fundamentally it is a flaw in the design of this API.

In short, the design of the bulkblocks API internally contains race conditions which cannot be fully resolved on the server side. Even though we have some protection, the code remains vulnerable to this issue, which may occur again if we move to more concurrent data injection and larger data volumes in the future.

Finally, please note that the issue occurs when there is no entry in the DBS database for the common data used by concurrent blocks/files injection. For instance, when we create a new campaign, or create a new processed dataset and try to inject all of its blocks/files via the bulkblocks API, these independent HTTP calls (used by upstream workflows, e.g. WMCore, T0, CRAB) can race against each other.

Expected behavior
The bulkblocks API used by the WMCore/T0/CRAB teams should be decomposed into several pieces which fall into two main categories:

  • insert the common data which remains the same for all blocks/files, such as data tiers, physics groups, etc. This injection should be done separately from the blocks/files injection, since the latter requires this common data
  • insert the blocks/files separately from the common data; this can be done via parallel injection. Since these data are mutually independent, they can be safely injected through parallel HTTP calls, provided the common data has already been injected into DBS.

This will require major changes to the WMCore data injection procedure and we should design it carefully. For instance, it implies changes to the data structures used by the bulkblocks API and decomposition of the data injection into separate calls, e.g.

  • the JSON for the primary dataset followed by an injection call for it
  • the JSON for the physics group followed by an injection call for it
  • ....
  • the JSON for the processed dataset followed by an injection call for it
  • the JSON for the blocks/files followed by an injection call for them
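Assuming hypothetical per-entity endpoints, the decomposed flow could look like this client-side sketch (the endpoint names and payloads are illustrative, not actual DBS REST paths):

```python
from concurrent.futures import ThreadPoolExecutor

COMMON_STEPS = [          # order matters: later entities reference earlier IDs
    ("primarydatasets", {"primary_ds_name": "TestPrimary"}),
    ("physicsgroups",   {"physics_group_name": "Tracker"}),
    ("datatiers",       {"data_tier_name": "RAW"}),
    ("datasets",        {"dataset": "/TestPrimary/Processed-v1/RAW"}),
]

def inject(post, base_url, common_steps, block_payloads, workers=4):
    # Step 1: serial injection of shared entities (no race possible).
    for endpoint, payload in common_steps:
        post(f"{base_url}/{endpoint}", payload)
    # Step 2: blocks/files are mutually independent -> safe to parallelize.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: post(f"{base_url}/blocks", p),
                             block_payloads))

# Stubbed usage recording the call order:
calls = []
fake_post = lambda url, payload: calls.append(url) or "ok"
blocks = [{"block_name": f"/TestPrimary/Processed-v1/RAW#{i}"} for i in range(3)]
inject(fake_post, "https://dbs.example.com", COMMON_STEPS, blocks)
print(calls[:4])   # the four common-data calls always come first
```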

This may require revisiting the data structures of the injected data as well as the WMCore/T0/CRAB workflows. As such, further discussion among all stakeholders will be required.

Additional context and error message
@amaltaro , @todor-ivanov , @klannon , @dciangot , @belforte , @mapellidario , @d-ylee
This issue requires careful design and implementation for the WMCore/T0/CRAB workflows, and further discussion is required. I suggest revisiting this issue once we discuss new data injection policies, changes to the DBS databases, etc.


vkuznet commented Apr 27, 2022

While talking with Alan and explaining the issue with #11107, I made a proposal for how to temporarily solve this problem. The proposal is simple: to avoid race conditions between independent HTTP requests when we inject multiple blocks, we can use a two-step procedure:

  • inject single block (based on bulkblocks JSON)
  • inject all other blocks in parallel

The first step will ensure that all common parts associated with the new blocks are either injected or already present in the DB. Therefore, after this step we can inject as many blocks in parallel as we wish, since all of their common meta-data will be present in the DB. This will be easier to implement in WMCore/WMAgent and can be used as a temporary solution to the problem.
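A client-side sketch of this two-step procedure (the function names are hypothetical, and `post_bulkblocks` stands in for a real bulkblocks HTTP call):

```python
from concurrent.futures import ThreadPoolExecutor

def inject_all(post_bulkblocks, payloads, workers=4):
    """Two-step bulkblocks injection: seed common metadata with one serial
    call, then inject the remaining blocks in parallel."""
    if not payloads:
        return []
    first = post_bulkblocks(payloads[0])   # seeds tiers, groups, dataset, ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rest = list(pool.map(post_bulkblocks, payloads[1:]))
    return [first] + rest

# Stubbed usage: a fake API that records the arrival order of blocks.
order = []
def fake_api(payload):
    order.append(payload["block"])
    return "ok"

payloads = [{"block": f"#{i}"} for i in range(5)]
results = inject_all(fake_api, payloads)
print(len(results), order[0])   # block #0 is always injected first
```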

@amaltaro

Thanks for this summary, Valentin.
I just noticed that it won't protect against multiple agents injecting the same data (different blocks for the same dataset) at the same time; but even the redesign that you initially suggested wouldn't be protected from that. I think the alternative that you mention above is a pretty good compromise.
