FilePump "child record exists" errors / BlockDelete bug #651

Closed
ericvaandering opened this issue Oct 23, 2013 · 9 comments

@ericvaandering

Original Savannah ticket 65239 reported by magini on Thu Apr 1 09:10:26 2010.

Deletions and transfers are not supposed to interfere with each other.

  • Deletions are not to be created while transfer tasks exist
  • Transfer tasks are not to be created while a file is being deleted

The first rule does not hold for T1s that have a (Buffer, MSS) node pair. The file-level deletion should not be created for the MSS node while the Buffer node has incoming or outgoing transfers, but currently it is.
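
To make the missing check concrete, here is a minimal Python sketch of the guard described above. It is not PhEDEx code: `Task`, `can_create_deletion`, and the `local_buffer` map (MSS node name to its paired Buffer node) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A simplified transfer task (hypothetical, not the real t_xfer_task schema)."""
    fileid: int
    from_node: str
    to_node: str


def can_create_deletion(fileid, node, tasks, local_buffer):
    """Return True only if no transfer task touches the file at `node`
    or at the Buffer node paired with it."""
    guarded = {node}
    if node in local_buffer:
        # Also guard the Buffer node connected to this MSS node.
        guarded.add(local_buffer[node])
    return not any(
        t.fileid == fileid and (t.from_node in guarded or t.to_node in guarded)
        for t in tasks
    )
```

With an empty `local_buffer` map the function only looks at the MSS node itself, which is roughly the behaviour this ticket reports as a bug.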

As a result, a deletion triggered on data that is actively transferring at a T1 causes the FilePump agent to fail when it tries to remove the deleted replica. Because of where the exception is raised, task statistics get stuck, which could also affect the FileRouter.

The only workarounds are to wait for the outgoing tasks to complete or expire, or to intervene in the DB and remove the tasks.

@ericvaandering

Comment by magini on Thu Aug 19 09:52:05 2010

Another less serious occurrence of the same bug is documented here:

https://savannah.cern.ch/support/?116138
https://savannah.cern.ch/bugs/?56012

If the file replicas are not yet on the MSS node, triggering a deletion of the replicas while they are actively transferring from Buffer will not cause the FilePump agent to get stuck. However, the deletion will be marked as completed without removing the Buffer replicas, leading to an inconsistency between the storage and the DB.

N.

@ericvaandering

Comment by magini on Mon Mar 14 05:34:15 2011

A fix for this at MSS nodes was added in 4.0; however, in Testbed this issue was also observed for deletions at a T2. No deletions at T1s were in progress when these alerts appeared. The alerts disappeared on their own (probably because the transfer tasks expired), so it is difficult to reconstruct what happened a posteriori:

+verbatim+
2011-03-09 22:59:26: FilePump[3402]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT_TESTBED.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_xfer_replica xr on xr.node = xd.node and xr.fileid = xd.fileid where xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/phedex/magini/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-
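
For readers unfamiliar with ORA-02292, the sketch below reproduces the same class of failure with Python's built-in `sqlite3` module: a parent row cannot be deleted while a child row still references it. The two-table schema is a drastic simplification; the column names (`from_replica` in particular) and the helper functions are assumptions for illustration, not the real PhEDEx schema, and SQLite's error text differs from Oracle's.

```python
import sqlite3


def make_db():
    """Build an in-memory DB with one replica and one task that references it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")  # SQLite does not enforce FKs by default
    conn.execute(
        "create table t_xfer_replica ("
        " id integer primary key, node text, fileid integer)"
    )
    conn.execute(
        "create table t_xfer_task ("
        " id integer primary key,"
        " from_replica integer not null references t_xfer_replica(id))"
    )
    conn.execute("insert into t_xfer_replica values (1, 'T1_SOURCE_Buffer', 42)")
    conn.execute("insert into t_xfer_task values (100, 1)")
    conn.commit()
    return conn


def delete_replica(conn, node, fileid):
    """Try to delete a replica; report whether the FK constraint blocked it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "delete from t_xfer_replica where node = ? and fileid = ?",
                (node, fileid),
            )
        return "deleted"
    except sqlite3.IntegrityError as exc:
        return f"blocked: {exc}"
```

Calling `delete_replica(conn, 'T1_SOURCE_Buffer', 42)` on a fresh `make_db()` connection is blocked while the task row exists, and succeeds once the task row has been removed, which mirrors the alerts clearing after the transfer tasks expired.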

@ericvaandering

Closed by magini on Thu Apr 7 12:54:18 2011

@ghost ghost assigned ericvaandering Oct 23, 2013
@ericvaandering

Comment by magini on Thu Apr 7 12:54:18 2011

Hi,

not seen since the final PHEDEX_4_0_0 upgrade, closing ticket

N.

@ericvaandering

Open by magini on Mon Oct 10 10:45:57 2011

@ericvaandering

Comment by magini on Mon Oct 10 10:45:57 2011

Hi,

reopening, since this happened again on 2011-10-08:

+verbatim+
2011-10-08 14:38:10: FilePump[8968]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_adm_link l on l.from_node = xd.node join t_adm_node fn on fn.id = l.from_node join t_adm_node tn on tn.id = l.to_node join t_xfer_replica xr on xr.node = tn.id and xr.fileid = xd.fileid where l.is_local = 'y' and fn.kind = 'MSS' and tn.kind = 'Buffer' and xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-

Probably related to the deletion of these blocks from T1_US_FNAL_MSS:

https://cmsweb.cern.ch/phedex/prod/Request::View?request=330226

which were in transfer to T1_FR_CCIN2P3_MSS at the same time.

Cheers
N.

@ericvaandering

Comment by magini on Tue Oct 11 11:20:08 2011

Hi,

reproduced in Testbed. This race condition can also occur in the following case:

  1. Deletion is started on T1_SOURCE_MSS
  2. Transfer is requested to T1_DESTINATION_MSS
  3. FileIssue issues the tasks on the WAN link T1_SOURCE_Buffer-->T1_DESTINATION_Buffer
  4. The deletion tasks at T1_SOURCE_MSS are completed
  5. FilePump tries to remove the replicas from T1_SOURCE_MSS and the connected T1_SOURCE_Buffer node, resulting in the integrity constraint alert.

+verbatim+
2011-10-11 16:19:57: FilePump[18683]: alert: database error: DBD::Oracle::st execute failed: ORA-02292: integrity constraint (CMS_TRANSFERMGMT_TESTBED.FK_XFER_TASK_REPLICA) violated - child record found (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_replica where (node, fileid) in (select xr.node, xr.fileid from t_xfer_delete xd join t_adm_link l on l.from_node = xd.node join t_adm_node fn on fn.id = l.from_node join t_adm_node tn on tn.id = l.to_node join t_xfer_replica xr on xr.node = tn.id and xr.fileid = xd.fileid where l.is_local = 'y' and fn.kind = 'MSS' and tn.kind = 'Buffer' and xd.time_complete is not null and xd.time_complete >= xr.time_create)"] at /data/phedex/magini/410dev2/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322.
-verbatim-

The problem is in point 3: FileIssue should not issue transfer tasks to/from Buffer nodes when there are pending deletion tasks on the connected MSS nodes.
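
A rough Python sketch of the check that point 3 is missing follows. This is only an illustration of the rule stated above, not the FileIssue implementation: `Deletion`, `may_issue_task`, and the `buffer_to_mss` map (Buffer node name to its connected MSS node) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deletion:
    """A simplified deletion record (hypothetical, not the real t_xfer_delete schema)."""
    fileid: int
    node: str
    time_complete: Optional[float] = None  # None means the deletion is still pending


def may_issue_task(fileid, from_node, to_node, deletions, buffer_to_mss):
    """Return False if a deletion of the file is pending at either endpoint
    of the link or at the MSS node connected to a Buffer endpoint."""
    blocked = set()
    for node in (from_node, to_node):
        blocked.add(node)
        if node in buffer_to_mss:
            blocked.add(buffer_to_mss[node])
    return not any(
        d.fileid == fileid and d.time_complete is None and d.node in blocked
        for d in deletions
    )
```

In the scenario above, a pending deletion at T1_SOURCE_MSS would make `may_issue_task` return False for the link T1_SOURCE_Buffer-->T1_DESTINATION_Buffer, so step 3 would not create the task that later blocks FilePump in step 5.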

The fix included in BlockDelete in PHEDEX_4_0_0 only addressed the opposite case (starting a deletion when the transfer was already in progress).

N.

@ericvaandering

Closed by magini on Wed Nov 9 08:33:21 2011

@ericvaandering

Comment by magini on Wed Nov 9 08:33:21 2011

Hi,

additional fix for this included in PHEDEX_4_0_1, validated in Dev and deployed in Production on Wed 9th - closing ticket.

Cheers
Nicolo'
