
Eliminate racing conditions in DBS data injection procedure #11106

Open
vkuznet opened this issue Apr 20, 2022 · 2 comments

vkuznet commented Apr 20, 2022

Impact of the bug
The current bulkblocks API for DBS data injection contains internal race conditions, independent of the DBS server implementation (Python or Go). We should revisit how we insert data into DBS and replace the bulkblocks API with a new data-injection procedure.

Describe the bug
The data input (JSON) for the bulkblocks DBS API contains both common and independent parts. For instance, if we want to inject two different blocks via parallel HTTP calls, the JSON provided to the bulkblocks API contains static data that is identical in both payloads (e.g. physics groups, data tiers, dataset access types) as well as data that differs between them (the blocks/files parts). Since each HTTP call runs its own internal database transactions, these transactions can compete to insert, or obtain the ID of, the static data, which leads to a race condition.
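The check-then-insert collision can be illustrated with a minimal, self-contained sketch (SQLite stands in for ORACLE, and the `physics_groups` table and column names are hypothetical; this is not DBS code):

```python
import sqlite3

# Shared table standing in for an ORACLE table with a UNIQUE constraint.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE physics_groups (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

cur1, cur2 = db.cursor(), db.cursor()   # two concurrent "transactions"
name = "Tracker"

# Both requests first look the row up and find nothing...
row1 = cur1.execute("SELECT id FROM physics_groups WHERE name = ?", (name,)).fetchone()
row2 = cur2.execute("SELECT id FROM physics_groups WHERE name = ?", (name,)).fetchone()
assert row1 is None and row2 is None

# ...so both decide to insert it; the slower one violates the UNIQUE constraint.
cur1.execute("INSERT INTO physics_groups (name) VALUES (?)", (name,))
err = None
try:
    cur2.execute("INSERT INTO physics_groups (name) VALUES (?)", (name,))
except sqlite3.IntegrityError as exc:    # SQLite's analogue of ORA-00001
    err = exc
print("second insert failed:", err)
```

The same interleaving happens inside the DBS server whenever two bulkblocks transactions both find a common attribute missing and both try to create it.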

How to reproduce it
Send multiple requests to the bulkblocks API of the Python or Go server with blocks/files from the same dataset. At a certain rate we will observe the ORA-00001 error in the DBS server code, which represents a transaction error on injection of identical data, i.e. we need to either insert or obtain the ID of common fields such as data tiers, physics groups, processed datasets, etc. The probability of the error grows with the size of the internal ORACLE table where we must insert, or check for the existence of, a common attribute. For instance, the probability is low for sparsely populated tables such as physics groups and data tiers, but much higher for dense tables such as processed datasets.
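A reproduction driver could be sketched as follows. The endpoint URL and payload shape are placeholders, and `post` is pluggable, so the snippet runs here with a stub; against a real DBS server (e.g. with `requests.post`), blocks from the same brand-new dataset would make some of the parallel calls fail with ORA-00001:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def inject_blocks(post, url, payloads, workers=4):
    """Send each bulkblocks JSON payload concurrently; collect results or errors."""
    def one(payload):
        try:
            return ("ok", post(url, json.dumps(payload)))
        except Exception as exc:          # a real server may surface ORA-00001 here
            return ("error", str(exc))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, payloads))

# Stubbed usage: every call "succeeds" with the fake transport.
fake_post = lambda url, body: 200
payloads = [{"block": {"block_name": f"/a/b/c#{i}"}} for i in range(8)]
results = inject_blocks(fake_post, "https://dbs.example.com/bulkblocks", payloads)
print(results[0])   # → ('ok', 200)
```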

Currently, both the Python and Go servers have a way to retry on the ORA-00001 ORACLE unique-constraint violation.
For instance, the Python codebase has the following code, see lines 467-562:

...
                    #Another job inserted it just 1/100000 second earlier than
                    #you!!  YG 11/17/2010
                    if str(ex).find("ORA-00001") != -1 or str(ex).lower().find("duplicate") !=-1:
                        if str(ex).find("TUC_OMC_1") != -1:
                            #the config is already in db, get the ID later
                            pass
                        else:
                            #reinsert it if one, two or three of the three attributes (version, hash and app) were inserted
                            #just 1/100000 second earlier.
                            try:
                                self.otptModCfgin.execute(conn, configObj, tran)
                                tran.commit()
                                tran = None
                            except exceptions.IntegrityError as ex:
                                if (str(ex).find("ORA-00001") != -1 and str(ex).find("TUC_OMC_1"))\
                                        or str(ex).lower().find("duplicate") != -1:
                                    pass

And the same pattern (handling the ORA-00001 error) repeats on the following lines:
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L139,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L316,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L382,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L409,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L510,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L522,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L616,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L683,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L736,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L812,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L816,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L918,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L964,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L988.
In total, there are 14 occurrences, each handling the race condition in a slightly different way.
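The repeated pattern could, in principle, be factored into a single helper; the following is a sketch of the idea (not DBS code — the callables and the in-memory "table" are illustrative stand-ins):

```python
# Generalized "insert or get" with the duplicate-key race treated as benign:
# a unique-constraint violation means someone else inserted the row first,
# so we fall back to re-reading its ID.
def insert_or_get(insert, get_id, is_duplicate_error):
    """insert() writes the row; get_id() reads its ID; is_duplicate_error()
    recognizes ORA-00001 / 'duplicate' in the raised exception."""
    try:
        insert()
    except Exception as exc:
        if not is_duplicate_error(exc):
            raise                      # a real failure, not a race
    return get_id()                    # winner or loser, the row now exists

# Usage with an in-memory stand-in for a database table:
table = {}
def is_dup(exc): return "duplicate" in str(exc).lower()
def ins():
    if "PhysGroup" in table:
        raise RuntimeError("duplicate key")   # stands in for ORA-00001
    table["PhysGroup"] = 42
def get(): return table["PhysGroup"]

print(insert_or_get(ins, get, is_dup))   # first call inserts → 42
print(insert_or_get(ins, get, is_dup))   # second call hits "duplicate", still 42
```

This masks the symptom, but as argued below it cannot remove the race itself, which is inherent in the API design.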

For some calls this error is ignored, since the code assumes the data is already in the DB; for others (as shown above) there are repeated attempts to reinsert the data; and for yet others there are even comments about race conditions (worded differently), e.g.

            if str(ei).find("ORA-00001") != -1 or str(ei).lower().find("duplicate")!=-1:
                # For now, we assume in most cases the same dataset was inserted by a different thread. If not,
                # one has to call the dataset insertion again. But we think this is a rare case and let the second
                # DBSBlockInsert call fix it if it happens.

In the Go code, I added the following construct to handle the same issue: https://github.com/dmwm/dbs2go/blob/master/dbs/dbs.go#L610-617

Both code bases are vulnerable to the ORA-00001 issue, which arises from the design of the bulkblocks API. To work around it, we either need to repeat the data insertion in the server code (as is currently done in the Python and Go servers) or on the client side, by re-injecting the data. But fundamentally it is a flaw in the design of this API.

In short, the design of the bulkblocks API internally contains race conditions which cannot be fully resolved on the server side. Even though we have some protection, the code remains vulnerable to this issue, which may occur again if we move to more concurrent data injection and larger data volumes in the future.

Finally, please note that the issue occurs when there is no entry in the DBS database for the common data used by concurrent blocks/files injection. For instance, when we create a new campaign, or create a new processed dataset and try to inject all of its blocks/files via the bulkblocks API, these independent HTTP calls (used by upstream workflows, e.g. WMCore, T0, CRAB) can race against each other.

Expected behavior
The bulkblocks API used by the WMCore/T0/CRAB teams should be decomposed into several pieces which fall into two main categories:

  • insert the common data which remains the same for all blocks/files, such as data tiers, physics groups, etc. This injection should be done separately from the blocks/files injection, since the latter requires this common data
  • insert the blocks/files separately from the common data; this can be done via parallel injection. Since these data are mutually independent, they can be safely injected through parallel HTTP calls, provided the common data has already been injected into DBS.

This will require major changes to the WMCore data injection procedure and we should design it carefully. For instance, it implies changes to the data structures used by the bulkblocks API and decomposition of the data injection into separate calls, e.g.

  • the JSON for the primary dataset followed by an injection call for it
  • the JSON for the physics group followed by an injection call for it
  • ....
  • the JSON for the processed dataset followed by an injection call for it
  • the JSON for the blocks/files followed by an injection call for them
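Assuming hypothetical per-entity endpoints, the decomposed flow could look like this client-side sketch (the endpoint names and payloads are illustrative, not actual DBS REST paths):

```python
from concurrent.futures import ThreadPoolExecutor

COMMON_STEPS = [          # order matters: later entities reference earlier IDs
    ("primarydatasets", {"primary_ds_name": "TestPrimary"}),
    ("physicsgroups",   {"physics_group_name": "Tracker"}),
    ("datatiers",       {"data_tier_name": "RAW"}),
    ("datasets",        {"dataset": "/TestPrimary/Processed-v1/RAW"}),
]

def inject(post, base_url, common_steps, block_payloads, workers=4):
    # Step 1: serial injection of shared entities (no race possible).
    for endpoint, payload in common_steps:
        post(f"{base_url}/{endpoint}", payload)
    # Step 2: blocks/files are mutually independent -> safe to parallelize.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: post(f"{base_url}/blocks", p),
                             block_payloads))

# Stubbed usage recording the call order:
calls = []
fake_post = lambda url, payload: calls.append(url) or "ok"
blocks = [{"block_name": f"/TestPrimary/Processed-v1/RAW#{i}"} for i in range(3)]
inject(fake_post, "https://dbs.example.com", COMMON_STEPS, blocks)
print(calls[:4])   # the four common-data calls always come first
```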

This may require revisiting the data structures of the injected data as well as the WMCore/T0/CRAB workflows. As such, further discussion among all stakeholders will be required.

Additional context and error message
@amaltaro , @todor-ivanov , @klannon , @dciangot , @belforte , @mapellidario , @d-ylee
This issue requires careful design and implementation for the WMCore/T0/CRAB workflows, and further discussion is required. I suggest revisiting this issue once we discuss new data injection policies, changes to the DBS databases, etc.


vkuznet commented Apr 27, 2022

While talking with Alan and explaining the issue with #11107, I made a proposal for how to temporarily solve this problem. The proposal is simple: to avoid race conditions between independent HTTP requests when we inject multiple blocks, we can use a two-step procedure:

  • inject single block (based on bulkblocks JSON)
  • inject all other blocks in parallel

The first step will ensure that all common parts associated with the new blocks are either injected or already present in the DB. Therefore, after this step we can inject as many blocks in parallel as we wish, since all of their common meta-data will be present in the DB. This will be easier to implement in WMCore/WMAgent and can be used as a temporary solution to the problem.
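A client-side sketch of this two-step procedure (the function names are hypothetical, and `post_bulkblocks` stands in for a real bulkblocks HTTP call):

```python
from concurrent.futures import ThreadPoolExecutor

def inject_all(post_bulkblocks, payloads, workers=4):
    """Two-step bulkblocks injection: seed common metadata with one serial
    call, then inject the remaining blocks in parallel."""
    if not payloads:
        return []
    first = post_bulkblocks(payloads[0])   # seeds tiers, groups, dataset, ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rest = list(pool.map(post_bulkblocks, payloads[1:]))
    return [first] + rest

# Stubbed usage: a fake API that records the arrival order of blocks.
order = []
def fake_api(payload):
    order.append(payload["block"])
    return "ok"

payloads = [{"block": f"#{i}"} for i in range(5)]
results = inject_all(fake_api, payloads)
print(len(results), order[0])   # block #0 is always injected first
```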

@amaltaro

Thanks for this summary, Valentin.
I just noticed that it won't protect against multiple agents injecting the same data (different blocks for the same dataset) at the same time; but even the redesign that you initially suggested wouldn't be protected from that. I think the alternative that you mention above is a pretty good compromise.
