Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfers failing because the FTS backend cannot parse submit command output #915

Open
ericvaandering opened this issue Oct 23, 2013 · 5 comments

Comments

@ericvaandering
Copy link
Member

Original Savannah ticket 98594 reported by None on Thu Nov 1 12:43:26 2012.

Hello,

a rare transfer failure mode seen with the FileDownload agent with the FTS backend:

the FTS backend expects an FTS job ID as output of the glite-transfer-submit command, but sometimes the output is not what is expected, so the transfer fails with "Could not submit to FTS" and an undefined job ID.

An example Transfer Log is reported below:

+verbatim+
---------- JOB-LOG ----------
1351642390 created...
backend: PHEDEX::Transfer::FTS
glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1347966267.13180/copyjob
JOBID=undefined, cannot monitor this job: [...]
---------- RAWOUTPUT ----------
Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srmc2ba17d2-22ef-11e2-bf5c-9605fd599dc3
-verbatim-

Looking at the RAWOUTPUT, it seems that the output of glite-transfer-submit (c2ba17d2-22ef-11e2-bf5c-9605fd599dc3) is mixed with part of the output of a glite-transfer-status command executed for a different transfer! (Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srm)

Switched on POCO_DEBUG on CERN agents to try to figure out what's going on.

Cheers
Nicolo'

@ericvaandering
Copy link
Member Author

Comment by wildish on Fri Nov 2 03:37:35 2012

Hi,

I can think of two possible sources of this:

  1. the agent fails to parse the jobID correctly, but doesn't then abandon the task correctly, thereby mixing things up somehow. That part of the code is perhaps less well tested than the rest.

  2. I recall there was a bug in PoCo::Child that Ricky and I took ages to track down about 3-4 years ago. IIRC the wheelID was an unsigned int that was not protected against wrap-around, so when it did it managed to mix things up. I thought we had swatted that one, but maybe it has come back somehow. I'll see if I can find more details about that bug.

@ericvaandering
Copy link
Member Author

Comment by magini on Fri Nov 2 09:59:45 2012

Hi Tony,

maybe it could be related to this ancient bug report?

https://savannah.cern.ch/bugs/?func=detailitem&item_id=45798

The symptoms today are similar: a bunch of tasks failing during submission with "Could not submit to FTS" because the agent cannot parse the glite-transfer-submit output, and at almost the same time we see alerts about jobs with "No "DESTINATION" key!", which means that the agent cannot parse the output of glite-transfer-status.

This strengthens the hypothesis that PoCo::Child is mixing the output of two different child commands.

Three years ago you put a protection in the agent to stop it from crashing, but it seems that the underlying bug that was causing the mixup was never fixed.

Let's wait until we can reproduce this issue at CERN.

In this case the issue is not critical, because the transfers will simply be resubmitted later, but I suppose that in other cases it could cause more serious issues (for example, if the agent mixes up the output of two different FileDownloadVerify commands, it could mark as 'good' the result of a 'bad' transfer).

N.

@ghost ghost assigned ericvaandering Oct 23, 2013
@ericvaandering
Copy link
Member Author

Comment by magini on Mon Nov 5 12:55:06 2012

Hi,

another transfer failure like this:

+verbatim+
---------- JOB-LOG ----------
1352054495 created...
backend: PHEDEX::Transfer::FTS
glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1351790993.1594/copyjob
JOBID=undefined, cannot monitor this job: job = [...]
---------- RAWOUTPUT ----------
Source: 435e45f9-26af-11e2-b29d-ffcd381ede4c
-verbatim-

POE debugging information for this command, and for a previous command with the same PID:

+verbatim+
2012-11-04 17:24:15: QMon[11366]: debug: dequeue JOBID=7338cb18-26a4-11e2-bf5c-9605fd599dc3

> POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 7338cb18-26a4-11e2-bf5c-9605fd599dc3", wheel=97282, pid=4494

2012-11-04 18:41:35: FileDownload[11366]: debug: start copy job job.1351790993.1594
> POE::Component::Child - run(): "glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1351790993.1594/copyjob", wheel=99314, pid=4494
-verbatim-

N.

@ericvaandering
Copy link
Member Author

Comment by magini on Wed Nov 7 09:45:19 2012

Hi,

and another one. This time I was able to identify exactly which two processes got mixed up:

glite-transfer-submit process:

+verbatim+
glite-transfer-submit -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t2/work/job.1351790899.4341/copyjob
JOBID=undefined, cannot monitor this job: job = [...]
---------- RAWOUTPUT ----------
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_Estonia/CERN/411/LoadTest07_Estdab2abb5-28da-11e2-a03a-96f506f1f878
-verbatim-

+verbatim+
2012-11-07 12:58:40: FileDownload[22322]: debug: copy job job.1351790899.4341 status: pend=0 ready=1 done=0 lost=0
2012-11-07 12:58:40: FileDownload[22322]: debug: start copy job job.1351790899.4341
[...]
> POE::Component::Child - run(): "glite-transfer-submit -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t2/work/job.1351790899.4341/copyjob", wheel=103163, pid=24705
-verbatim-

glite-transfer-status process:

+verbatim+
2012-11-07 12:58:40: QMon[22322]: debug: dequeue JOBID=d133a3e8-28da-11e2-a03a-96f506f1f878
> POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer d133a3e8-28da-11e2-a03a-96f506f1f878", wheel=103165, pid=24709
[...]
2012-11-07 12:58:41: QMon[22322]: alert: ListJob for d133a3e8-28da-11e2-a03a-96f506f1f878 returned error: No "DESTINATION" key! : DURATION=0 , REASON=(null) , RETRIES=0 , SOURCE=srm://ganymede.hep.kbfi.ee:8888/srm/v2/server?SFN=/hdfs/cms/LoadTest07/LoadTest07_Estonia_3F onia_3F_lDaaTKuxVLWPJ2qo_411 , STATE=Active
2012-11-07 12:58:41: QMon[22322]: debug: WorkStats: class=JOBS key=d133a3e8-28da-11e2-a03a-96f506f1f878 value=undefined
-verbatim-

The processes have different wheel IDs and PIDs, but they were running simultaneously.

@ericvaandering
Copy link
Member Author

Comment by magini on Tue Dec 11 11:10:20 2012

Hi,

another case caught live, this time printing out
+verbatim+
$args->{out}
-verbatim-
as it is in PHEDEX::Core:JobManager::_child_stdout

+verbatim+
> POE::Component::Child - run(): "glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t1/workdir/job.1355159361.372/copyjob", wheel=14492, pid=21542
2012-12-11 07:59:00: QMon[9713]: debug: dequeue JOBID=96289d8e-4368-11e2-ae15-cf1c1ceec317
> POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 96289d8e-4368-11e2-ae15-cf1c1ceec3
17", wheel=14493, pid=21544
[...]
Request ID: 96289d8e-4368-11e2-ae15-cf1c1ceec317
Status: Active
Channel: PIC-CERN
Client DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cassel/CN=696497/CN=Rapolas Kaselis
Reason: <None>
Submit time: 2012-12-11 07:58:45.492
Files: 5
Priority: 3
VOName: cms
Done: 0
Active: 5
Pending: 0
Ready: 0
Canceled: 0
Failed: 0
Finishing: 0
Finished: 0
Submitted: 0
Hold: 0
Waiting: 0
Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_8D
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_8D_XCx2xKk0SMU3cXKO_1719
State: Active
Retries: 0
Reason: (null)
Duration: 0
Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_09
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_09_znpj99N5wMoO4AE2_1719
State: Active
Retries: 0
Reason: (null)
Duration: 0
Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_0D
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_0D_FFnKKpGtUppaDBbc_1719
State: Active
Retries: 0
Reason: (null)
Duration: 0
Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_8D
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_8D_4Lx3YgczCBewyaSu_1719
State: Active
Retries: 0
Reason: (null)
Duration: 0
Source: srm://srmcms.pic.e9f2b7071-4368-11e2-ae15-cf1c1ceec317
s:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_85
Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_85_1pOJamjZ6tYFrcWd_1719
State: Active
Retries: 0
Reason: (null)
Duration: 0
> POE::Component::Child - close()
> POE::Component::Child - close()
-verbatim-

The interesting line is this one:

+verbatim+
Source: srm://srmcms.pic.e9f2b7071-4368-11e2-ae15-cf1c1ceec317
s:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_85
-verbatim-

From this, we see that the output of the two commands (glite-transfer-submit and glite-transfer-status) is already mixed up when POE::Component::Child passes it to PHEDEX::Core:JobManager::_child_stdout

N.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants