FilePump "child record exists" errors / BlockDelete bug #651

ericvaandering · 2013-10-23T18:54:58Z

Original Savannah ticket 65239 reported by magini on Thu Apr 1 09:10:26 2010.

Deletions and transfers are not supposed to interfere with each other:

1. Deletions are not to be created while transfer tasks exist.
2. Transfer tasks are not to be created while a file is being deleted.

The first point does not hold for T1 sites that have a (Buffer, MSS) node pair. A file-level deletion should not be created for the MSS node while the Buffer node has incoming or outgoing transfers, but at present nothing prevents it.
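To make the intended rule concrete, here is a minimal sketch of the check a deletion would need to pass. It is not the BlockDelete code: `Task`, `nodes_to_check`, `can_create_deletion`, the in-memory data layout, and the `T1_EXAMPLE_*` / `T2_EXAMPLE` node names are all hypothetical, chosen only to illustrate that the Buffer node has to be checked together with its MSS node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for one transfer task."""
    fileid: int
    from_node: str
    to_node: str


def nodes_to_check(node, local_links):
    """Return the node itself plus any Buffer node locally linked to it.

    local_links is a set of (mss_node, buffer_node) pairs (an assumption).
    """
    return {node} | {buf for (mss, buf) in local_links if mss == node}


def can_create_deletion(fileid, node, tasks, local_links):
    """True only if no transfer task for fileid touches node or its Buffer."""
    guarded = nodes_to_check(node, local_links)
    return not any(
        t.fileid == fileid and (t.from_node in guarded or t.to_node in guarded)
        for t in tasks
    )


local_links = {("T1_EXAMPLE_MSS", "T1_EXAMPLE_Buffer")}
tasks = [Task(fileid=1, from_node="T1_EXAMPLE_Buffer", to_node="T2_EXAMPLE")]

# File 1 is being transferred out of the Buffer, so deleting it at MSS must wait.
print(can_create_deletion(1, "T1_EXAMPLE_MSS", tasks, local_links))
# File 2 has no tasks, so its deletion can be created.
print(can_create_deletion(2, "T1_EXAMPLE_MSS", tasks, local_links))
```

Checking only the MSS node itself would let the deletion for file 1 through, which is the situation described in this ticket.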

The result of this bug is that deletions triggered on actively transferring data at a T1 cause the FilePump agent to fail with a database error when it tries to remove the deleted replica. Because of where the exception is raised, task statistics get stuck, which could also affect the FileRouter.

The only workaround is either to wait for the outgoing tasks to complete or expire, or to intervene in the DB and remove the tasks.

ericvaandering · 2013-10-23T18:54:58Z

Comment by magini on Thu Aug 19 09:52:05 2010

Another less serious occurrence of the same bug is documented here:

https://savannah.cern.ch/support/?116138
https://savannah.cern.ch/bugs/?56012

If the file replicas are not yet on the MSS node, triggering a deletion of the replicas while they are actively transferring from Buffer does not cause the FilePump agent to get stuck. However, the deletion is marked as completed without the Buffer replicas being removed, which leaves the storage and the DB inconsistent.

N.

ericvaandering · 2013-10-23T18:54:58Z

Comment by magini on Mon Mar 14 05:34:15 2011

A fix for this at MSS nodes was added in 4.0; however, in Testbed this issue was also observed for deletions at a T2. No deletions at T1s were in progress when these alerts appeared. The alerts disappeared on their own (probably because the transfer tasks expired), so it is difficult to reconstruct a posteriori what happened:

+verbatim+
2011-03-09 22:59:26: FilePump[3402]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT_TESTBED.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_xfer_replica xr on xr.node = xd.node and xr.fileid = xd.fileid where xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/phedex/magini/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-
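For readers unfamiliar with this class of error, the sketch below reproduces the same kind of failure in miniature: a parent row cannot be deleted while a child row still references it. It uses an in-memory SQLite database rather than Oracle, and the two tables are heavily simplified stand-ins; the column lists and the `try_delete_replica` helper are assumptions for illustration, not the PhEDEx schema. SQLite reports the failure as `sqlite3.IntegrityError` instead of ORA-02292.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table t_xfer_replica (
  node   integer not null,
  fileid integer not null,
  primary key (node, fileid)
);
-- Simplified child table: each task references the source replica.
create table t_xfer_task (
  id        integer primary key,
  from_node integer not null,
  fileid    integer not null,
  foreign key (from_node, fileid) references t_xfer_replica (node, fileid)
);
insert into t_xfer_replica values (1, 100);
insert into t_xfer_task values (1, 1, 100);
""")


def try_delete_replica(conn, node, fileid):
    """Attempt to delete one replica row and report what happened."""
    try:
        conn.execute(
            "delete from t_xfer_replica where node = ? and fileid = ?",
            (node, fileid),
        )
        conn.commit()
        return "deleted"
    except sqlite3.IntegrityError as exc:
        conn.rollback()
        return f"blocked: {exc}"


# While the task row exists, the replica delete is rejected.
first = try_delete_replica(conn, 1, 100)
print(first)

# Once the task is gone (completed, expired, or removed by hand), it succeeds.
conn.execute("delete from t_xfer_task where fileid = 100")
conn.commit()
second = try_delete_replica(conn, 1, 100)
print(second)
```

This mirrors why the alerts go away once the transfer tasks expire: the child rows disappear and the same delete statement then succeeds.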

ericvaandering · 2013-10-23T18:54:58Z

Closed by magini on Thu Apr 7 12:54:18 2011

ericvaandering · 2013-10-23T18:54:59Z

Comment by magini on Thu Apr 7 12:54:18 2011

Hi,

not seen since the final PHEDEX_4_0_0 upgrade, closing ticket

N.

ericvaandering · 2013-10-23T18:54:59Z

Open by magini on Mon Oct 10 10:45:57 2011

ericvaandering · 2013-10-23T18:54:59Z

Comment by magini on Mon Oct 10 10:45:57 2011

Hi,

reopening, since this happened again on 2011-10-08:

+verbatim+
2011-10-08 14:38:10: FilePump[8968]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_adm_link l on l.from_node = xd.node join t_adm_node fn on fn.id = l.from_node join t_adm_node tn on tn.id = l.to_node join t_xfer_replica xr on xr.node = tn.id and xr.fileid = xd.fileid where l.is_local = 'y' and fn.kind = 'MSS' and tn.kind = 'Buffer' and xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-

Probably related to the deletion of these blocks from T1_US_FNAL_MSS:

https://cmsweb.cern.ch/phedex/prod/Request::View?request=330226

which were in transfer to T1_FR_CCIN2P3_MSS at the same time.

Cheers
N.

ericvaandering · 2013-10-23T18:54:59Z

Comment by magini on Tue Oct 11 11:20:08 2011

Hi,

reproduced in Testbed. This race condition can also occur in the following case:

1. Deletion is started on T1_SOURCE_MSS
2. Transfer is requested to T1_DESTINATION_MSS
3. FileIssue issues the tasks on the WAN link T1_SOURCE_Buffer-->T1_DESTINATION_Buffer
4. The deletion tasks at T1_SOURCE_MSS are completed
5. FilePump tries to remove the replicas from T1_SOURCE_MSS and the connected T1_SOURCE_Buffer node, resulting in the integrity constraint alert.

+verbatim+
2011-10-11 16:19:57: FilePump[18683]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT_TESTBED.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_adm_link l on l.from_node = xd.node join t_adm_node fn on fn.id = l.from_node join t_adm_node tn on tn.id = l.to_node join t_xfer_replica xr on xr.node = tn.id and xr.fileid = xd.fileid where l.is_local = 'y' and fn.kind = 'MSS' and tn.kind = 'Buffer' and xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/phedex/magini/410dev2/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-

The problem is in point 3: FileIssue should not issue transfer tasks to/from Buffer nodes when there are pending deletion tasks on the connected MSS nodes.

The fix included in BlockDelete in PHEDEX_4_0_0 only addressed the opposite case (starting a deletion when the transfer was already in progress).
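A rough sketch of the missing guard on the FileIssue side, using the node names from the scenario above. This is not the actual FileIssue logic: `connected_mss`, `may_issue_task`, and the representation of pending deletions as a set of `(node, fileid)` pairs are hypothetical and serve only to show the check that would stop the task from being issued.

```python
def connected_mss(buffer_node, local_links):
    """Return the MSS nodes locally linked to buffer_node.

    local_links is a set of (mss_node, buffer_node) pairs (an assumption).
    """
    return {mss for (mss, buf) in local_links if buf == buffer_node}


def may_issue_task(fileid, from_node, to_node, pending_deletions, local_links):
    """Refuse a task if either endpoint, or its connected MSS, has a pending deletion.

    pending_deletions is a set of (node, fileid) pairs for deletions that
    have been created but not yet completed (an assumption).
    """
    for node in (from_node, to_node):
        for guarded in {node} | connected_mss(node, local_links):
            if (guarded, fileid) in pending_deletions:
                return False
    return True


local_links = {
    ("T1_SOURCE_MSS", "T1_SOURCE_Buffer"),
    ("T1_DESTINATION_MSS", "T1_DESTINATION_Buffer"),
}
# A deletion of file 7 has been started on T1_SOURCE_MSS.
pending_deletions = {("T1_SOURCE_MSS", 7)}

# File 7 must not be issued on the WAN link between the two Buffer nodes.
print(may_issue_task(7, "T1_SOURCE_Buffer", "T1_DESTINATION_Buffer",
                     pending_deletions, local_links))
# File 8 has no pending deletion, so its task can be issued.
print(may_issue_task(8, "T1_SOURCE_Buffer", "T1_DESTINATION_Buffer",
                     pending_deletions, local_links))
```

With a check of this shape, the task for file 7 is never created, so no child record exists when FilePump later removes the replicas.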

N.

ericvaandering · 2013-10-23T18:55:00Z

Closed by magini on Wed Nov 9 08:33:21 2011

ericvaandering · 2013-10-23T18:55:00Z

Comment by magini on Wed Nov 9 08:33:21 2011

Hi,

additional fix for this included in PHEDEX_4_0_1, validated in Dev and deployed in Production on Wed 9th - closing ticket.

Cheers
Nicolo'

ghost assigned ericvaandering Oct 23, 2013

ericvaandering closed this as completed Oct 23, 2013

nikmagini mentioned this issue Nov 14, 2016

FilePump: alert from replica deletion #1067

Open
