Transfers failing because the FTS backend cannot parse submit command output #915
Comments
Comment by wildish on Fri Nov 2 03:37:35 2012
Hi,
I can think of two possible sources of this:
Comment by magini on Fri Nov 2 09:59:45 2012
Hi Tony,
maybe it could be related to this ancient bug report?
https://savannah.cern.ch/bugs/?func=detailitem&item_id=45798
The symptoms today are similar: a bunch of tasks fail during submission with "Could not submit to FTS" because the agent cannot parse the glite-transfer-submit output, and at almost the same time we see alerts about jobs with "No "DESTINATION" key!", which means that the agent cannot parse the output of glite-transfer-status.
This strengthens the hypothesis that PoCo::Child is mixing the output of two different child commands. Three years ago you put a protection in the agent to stop it from crashing, but it seems that the underlying bug that was causing the mixup was never fixed.
Let's wait until we can reproduce this issue at CERN. In this case the issue is not critical, because the transfers will simply be resubmitted later, but I suppose that in other cases it could cause more serious problems (for example, if the agent mixes up the output of two different FileDownloadVerify commands, it could mark the result of a 'bad' transfer as 'good').
N.
Comment by magini on Mon Nov 5 12:55:06 2012
Hi,
another transfer failure like this:
+verbatim+
POE debugging information for this command, and for a previous command with the same PID:
+verbatim+
> POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 7338cb18-26a4-11e2-bf5c-9605fd599dc3", wheel=97282, pid=4494
2012-11-04 18:41:35: FileDownload[11366]: debug: start copy job job.1351790993.1594
N.
Comment by magini on Wed Nov 7 09:45:19 2012
Hi,
and another one. This time I was able to identify exactly which two processes got mixed up:
glite-transfer-submit process:
+verbatim+
+verbatim+
glite-transfer-status process:
+verbatim+
The processes have different wheel IDs and PIDs, but they were running simultaneously.
Comment by magini on Tue Dec 11 11:10:20 2012
Hi,
another case caught live, this time printing out
+verbatim+
The interesting line is this one:
+verbatim+
From this, we see that the output of the two commands (glite-transfer-submit and glite-transfer-status) is already mixed up when POE::Component::Child passes it to PHEDEX::Core::JobManager::_child_stdout.
N.
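For comparison, here is the invariant that seems to be violated, sketched in Python (this is not POE and not the agent's code). Each child gets its own pipe and its own buffer, so output from commands running at the same time cannot end up in the same string. The function names and the stand-in commands are hypothetical and only mimic the two glite commands.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once and return {label: stdout}.

    Each child has a dedicated stdout pipe, so the buffers stay separate
    even though the processes overlap in time.
    """
    procs = {
        label: subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
        for label, argv in commands.items()
    }
    outputs = {}
    for label, proc in procs.items():
        out, _ = proc.communicate()  # reads only this child's pipe
        outputs[label] = out
    return outputs


def demo():
    # Stand-ins (assumption): small Python one-liners that print the kind of
    # text the submit and status commands produced in the logs in this ticket.
    return run_concurrently({
        "submit": [sys.executable, "-c",
                   "print('c2ba17d2-22ef-11e2-bf5c-9605fd599dc3')"],
        "status": [sys.executable, "-c",
                   "print('Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srm')"],
    })
```

If the per-child buffers were keyed correctly in the same way, the submit output would contain only the job ID, whatever the status command printed in the meantime.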
Original Savannah ticket 98594 reported by None on Thu Nov 1 12:43:26 2012.
Hello,
here is a rare transfer failure mode seen with the FileDownload agent using the FTS backend:
the FTS backend expects the output of the glite-transfer-submit command to be an FTS job ID, but sometimes the output is something else, so the transfer fails with "Could not submit to FTS" and an undefined job ID.
An example Transfer Log is reported below:
+verbatim+
---------- JOB-LOG ----------
1351642390 created...
backend: PHEDEX::Transfer::FTS
glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1347966267.13180/copyjob
JOBID=undefined, cannot monitor this job: [...]
---------- RAWOUTPUT ----------
Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srmc2ba17d2-22ef-11e2-bf5c-9605fd599dc3
-verbatim-
Looking at the RAWOUTPUT, it seems that the output of glite-transfer-submit (c2ba17d2-22ef-11e2-bf5c-9605fd599dc3) is mixed with part of the output of a glite-transfer-status command executed for a different transfer! (Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srm)
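A minimal sketch of a strict check that would flag this case, written in Python rather than the agent's Perl. It assumes that job IDs always have the UUID-like shape seen in the RAWOUTPUT above; the function names are hypothetical and are not part of PHEDEX::Transfer::FTS.

```python
import re

# Assumption: an FTS job ID is a lowercase hex string in 8-4-4-4-12 groups,
# like the one in the RAWOUTPUT above.
JOB_ID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def parse_submit_output(raw):
    """Return the job ID only if the whole output is exactly one job ID."""
    text = raw.strip()
    if JOB_ID_RE.fullmatch(text):
        return text
    return None  # anything else is treated as a failed or polluted submit


def find_embedded_job_ids(raw):
    """List job-ID-shaped substrings, useful when logging polluted output."""
    return JOB_ID_RE.findall(raw)
```

With the RAWOUTPUT line from this ticket, `parse_submit_output` returns `None`, while `find_embedded_job_ids` still recovers the ID that was glued onto the `Source:` text. Whether the agent should trust a recovered ID or simply resubmit is a separate decision.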
Switched on POCO_DEBUG on CERN agents to try to figure out what's going on.
Cheers
Nicolo'