FilePump "child record exists" errors / BlockDelete bug #651
Comments
Comment by magini on Thu Aug 19 09:52:05 2010 Another, less serious occurrence of the same bug is documented here: https://savannah.cern.ch/support/?116138 If the file replicas are not yet on the MSS node, triggering a deletion of the replicas while they are actively transferring from Buffer does not cause the FilePump agent to get stuck. However, the deletion is marked as completed without removing the Buffer replicas, leading to an inconsistency between the storage and the DB. N.
Comment by magini on Mon Mar 14 05:34:15 2011 A fix for this at MSS nodes was added in 4.0; however, in Testbed this issue was also observed for deletions at a T2. No deletions at T1s were in progress when these alerts appeared. The alerts disappeared on their own (probably because the transfer tasks expired), so it is difficult to reconstruct what happened a posteriori: +verbatim+
Closed by magini on Thu Apr 7 12:54:18 2011
Comment by magini on Thu Apr 7 12:54:18 2011 Hi, not seen since the final PHEDEX_4_0_0 upgrade, closing ticket. N.
Open by magini on Mon Oct 10 10:45:57 2011
Comment by magini on Mon Oct 10 10:45:57 2011 Hi, reopening, since this happened again on 2011-10-08: +verbatim+ Probably related to the deletion of these blocks from T1_US_FNAL_MSS: https://cmsweb.cern.ch/phedex/prod/Request::View?request=330226 which were in transfer to T1_FR_CCIN2P3_MSS at the same time. Cheers
Comment by magini on Tue Oct 11 11:20:08 2011 Hi, reproduced in Testbed. This race condition can also occur in the following case:
+verbatim+ The problem is in point 3: FileIssue should not issue transfer tasks to/from Buffer nodes when there are pending deletion tasks on the connected MSS nodes. The fix included in BlockDelete in PHEDEX_4_0_0 only addressed the opposite case (starting a deletion when the transfer was already in progress). N.
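The guard described in the comment above could be sketched roughly as follows. This is only an illustration in Python, not the PhEDEx code: the `Candidate` type, the `BUFFER_TO_MSS` mapping, the `issuable` function and the `T1_XX` / `T2_YY` node names are all hypothetical, and pending deletions are assumed to be available as `(file_id, node)` pairs.

```python
from dataclasses import dataclass

# Hypothetical mapping from a Buffer node to its connected MSS node.
BUFFER_TO_MSS = {"T1_XX_Buffer": "T1_XX_MSS"}


@dataclass(frozen=True)
class Candidate:
    """A transfer task that the issuing step is considering (hypothetical)."""
    file_id: int
    from_node: str
    to_node: str


def issuable(candidates, pending_deletions):
    """Keep only candidates that do not touch a Buffer node whose connected
    MSS node has a pending deletion of the same file.

    pending_deletions: set of (file_id, node) pairs (assumed representation).
    """
    def blocked(c):
        for node in (c.from_node, c.to_node):
            mss = BUFFER_TO_MSS.get(node)
            if mss is not None and (c.file_id, mss) in pending_deletions:
                return True
        return False

    return [c for c in candidates if not blocked(c)]
```

In this sketch a transfer of file 1 out of `T1_XX_Buffer` would be held back while `(1, "T1_XX_MSS")` is in the pending set, whereas transfers of other files, or between nodes with no Buffer involved, would still be issued.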
Closed by magini on Wed Nov 9 08:33:21 2011
Comment by magini on Wed Nov 9 08:33:21 2011 Hi, additional fix for this included in PHEDEX_4_0_1, validated in Dev and deployed in Production on Wed 9th - closing ticket. Cheers
Original Savannah ticket 65239 reported by magini on Thu Apr 1 09:10:26 2010.
Deletions and transfers are not supposed to interfere with each other.
The first point is not true for T1s that have a (Buffer, MSS) node pair: the file-level deletion should not be created for the MSS node while the Buffer node has incoming or outgoing transfers.
The result of this bug is that deletions triggered on actively transferring data at a T1 cause the FilePump agent to raise an error when it tries to remove the deleted replica. Because of where the exception is raised, task statistics get stuck, which could also affect the FileRouter.
The only workaround is either to wait for the outgoing tasks to complete or expire, or to intervene in the DB and remove the tasks.
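The deletion-side rule described in this ticket (no file-level deletion on the MSS node while the paired Buffer node still has transfers of that file) might be checked along these lines. This is a minimal Python sketch under stated assumptions, not the BlockDelete implementation: the function name, the `mss_to_buffer` mapping and the tuple layout of `active_transfers` are hypothetical.

```python
def can_create_mss_deletion(file_id, mss_node, mss_to_buffer, active_transfers):
    """Return True only if the Buffer node paired with mss_node has no
    incoming or outgoing transfer of file_id.

    mss_to_buffer: dict mapping an MSS node name to its Buffer node name
                   (assumed representation).
    active_transfers: iterable of (file_id, from_node, to_node) tuples
                      (assumed representation).
    """
    buffer_node = mss_to_buffer.get(mss_node)
    if buffer_node is None:
        # Not part of a (Buffer, MSS) pair, so this rule does not apply.
        return True
    return not any(
        fid == file_id and buffer_node in (src, dst)
        for fid, src, dst in active_transfers
    )
```

A caller would defer the deletion and retry later whenever this returns False, rather than creating the deletion and letting it race with the transfer.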