GridFTP incompatibilities with Globus Online #3545

Open
ahaupt opened this issue Sep 25, 2017 · 14 comments

@ahaupt

ahaupt commented Sep 25, 2017

Hi,

dCache version: 2.16.47

We are suffering from a rather long-standing incompatibility with Globus Online's GridFTP implementation. In our case, IceCube is affected by this problem. GO transfers files in parallel, but as soon as one file is transferred successfully, it cancels all other transfers that are still in progress.

Here is an example of a finished transfer:

09.25 15:14:52 [door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain:request] ["/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=jade/jade-lta.icecube.wisc.edu":16892:248:184.73.189.163] [00009B888CD56F6447778A58209BCDFF1237,176851709905] [/pnfs/ifh.de/acs/icecube/archive/data/exp/IceCube/2015/unbiased/PFDST/0318/9e20f17c-b429-44d3-bcba-1f8e8c0480dd.zip] icecube:pfdst@osm 1810144 0 {0:""}

And here is one transfer that was cancelled at the same moment:

09.25 15:14:52 [door:GFTP-plum15-AAVaAuwI6ng@gridftp-plum15Domain:request] ["/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=jade/jade-lta.icecube.wisc.edu":16892:248:184.73.189.163] [000013047092D1644B3CB9C32F368DC4CCDD,0] [/pnfs/ifh.de/acs/icecube/archive/data/exp/IceCube/2015/unbiased/PFDST/1211/feaab171-9c23-46ee-8a6b-f1a53bf173f3.zip] icecube:pfdst@osm 1810481 0 {451:"Aborting transfer due to session termination"}

I guess GO uses a GridFTP feature that dCache doesn't support (pipelining?). Any idea how to interoperate smoothly with GO?
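The two records quoted above differ mainly in their trailing `{code:"message"}` field: `{0:""}` for the finished transfer and `{451:"..."}` for the aborted one. A minimal sketch for pulling that field out of such records, assuming the field always sits at the end of the line as it does in these two examples (the function name `transfer_status` is made up for illustration):

```python
import re

# Matches a trailing {code:"message"} field, as seen in the two records above.
STATUS = re.compile(r'\{(\d+):"([^"]*)"\}\s*$')


def transfer_status(record):
    """Return (code, message) from the record's trailing status field, or None."""
    match = STATUS.search(record)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Abbreviated copies of the two records; only the tail matters here.
finished = 'icecube:pfdst@osm 1810144 0 {0:""}'
aborted = 'icecube:pfdst@osm 1810481 0 {451:"Aborting transfer due to session termination"}'

print(transfer_status(finished))
print(transfer_status(aborted))
```

Filtering a larger log for records whose code is neither 0 nor 451 may help to spot a transfer that failed for a different reason.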

@paulmillar
Member

Could you find the FTP client (GO) operations for the successful transfer? The access log file should contain all operations and dCache's response. A simple grep using the session (door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain) should provide the commands.
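For anyone who prefers a script over grep, the same filtering can be sketched in a few lines of Python. This assumes only that every relevant log line contains the session identifier as a plain substring; the sample lines are abbreviated placeholders, not real access-log entries, and `filter_session` is a hypothetical helper name.

```python
def filter_session(lines, session):
    """Return the log lines that mention the given door session identifier."""
    return [line for line in lines if session in line]


session = "door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain"

# Placeholder lines standing in for the contents of an access log file.
sample_lines = [
    "[door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain] placeholder entry A",
    "[door:GFTP-plum15-AAVaAuwI6ng@gridftp-plum15Domain] placeholder entry B",
]

for line in filter_session(sample_lines, session):
    print(line)
```

In practice the list would come from reading the door's access log file line by line.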

@ahaupt
Author

ahaupt commented Sep 25, 2017

Hi Paul,

I'm not so familiar with GO ... I cannot find the failed transfer I mentioned above, but this one looks identical and failed at the same time:

Error (transfer)

Endpoint: IceCube Gridftp Scratch (aec5c658-f77d-11e6-ba7f-22000b9a448b)
Server: gridftp-scratch.icecube.wisc.edu:2811
File: /mnt/tank/jade/bundles1/data/exp/IceCube/2015/unbiased/PFDST/0404/29288523-aad1-40a6-9737-e22824041da5.zip
Command: RETR /mnt/tank/jade/bundles1/data/exp/IceCube/2015/unbiased/PFDST/0404/29288523-aad1-40a6-9737-e22824041da5.zip
Message: Fatal FTP response

Details: 500-Command failed. : callback failed.\r\n500-globus_xio: System error in writev: Broken pipe\r\n500-globus_xio: A system call failed: Broken pipe\r\n500 End.\r\n

The successful attempt only mentions this:

{ "files_succeeded": 1 }

Here is the "grepped" session log from our gridftp door:

gridftp-plum15Domain.access.txt

@paulmillar
Member

Thanks for the information.

The Broken pipe message is from the Globus FTP server, not from dCache. My guess is that the FTP client (GO) disconnected from dCache, which aborted the transfer. This resulted in dCache tearing down the data connections, triggering the error message you see.

Unfortunately, the access log you found is almost certainly not from the connection that experienced the problem. Certainly, there is no indication of a problem in that access log file.

Instead, it shows the FTP client (GO) disconnecting shortly after starting a new transfer, apparently unprovoked.

I have seen this behaviour before. It comes from the recovery procedure GO uses, where it disconnects all FTP connections when there is a problem with any connection. It is therefore quite likely that the problem was with some other FTP connection, either on the same GridFTP door or on another GridFTP door.

You could try restricting GO to making a single transfer at any time and try to recreate the problem there. This should make it easier to discover why GO is aborting.

@paulmillar
Member

Would it be possible to try the latest dCache version (3.2)? It's not yet released, but we're just putting together the release notes.

This has a couple of features that GO requires (dynamic checksum calculation; command pipelining).

Perhaps you could set up a small test system just to demonstrate whether GO works better with this version of dCache.

@ahaupt
Author

ahaupt commented Sep 26, 2017

Hi Paul,

Upgrading our test system once version 3.2 is released should not be much of a problem. But getting a firewall exception for that system is one ... Any idea how to simulate the GO client with e.g. globus-url-copy? GO looks like a "black-box client" to me ...

Is prometheus.desy.de public so that we could use it for compatibility tests?

Thanks!
Andreas

@paulmillar
Member

If you like, you can take one of the latest 3.2 pre-release builds and try that:

https://ci.dcache.org/view/dCache%203.2/job/dCache-v3.2/

Unfortunately, I'm not sure how to emulate GO with globus-url-copy. In my experiments, I created a virtual machine and ran the GO packaged server there. However, mostly it was a case of observing what GO does when interacting with dCache, instrumenting error cases, and the occasional inspired detective work to understand what was going wrong and get it to work with dCache.

Yes, you can certainly use prometheus for testing -- that's one of its major reasons for existing. Various VOs are already authorised, but I can also create an account specifically for you (tied to your DN). Just drop me an email if that would be useful.

@gonzalomerino

Hello, this is Gonzalo from IceCube @ UW-Madison.

We have a data archive service here that issues Globus Online transfers to archive data from endpoint A to endpoint B. If you think it might be useful for your testing, we could quite easily direct some arbitrary transfer load to a test endpoint of your choice.

Gonzalo

@paulmillar
Member

Hi Gonzalo,

Sorry for the delay in getting back in touch -- what I propose is giving you an account on our test system called 'prometheus'. This would allow a much faster turn-around for getting to the bottom of any problems with GO.

Could you send me the output of

htpasswd -n -m <username> | sed 's/\$/\\\$/g'

(replacing <username> with your preferred username on the system) along with the DN of your X.509 certificate -- preferably via email.
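As a rough illustration of what the `sed` step in that pipeline does: it puts a backslash in front of every `$` in the `htpasswd` output. Entries produced with `-m` normally contain several `$` characters, because the MD5-based hash starts with `$apr1$`. The sketch below mirrors the substitution in Python; the sample entry is made up and is not a real hash, and `escape_dollars` is a hypothetical name.

```python
def escape_dollars(entry: str) -> str:
    """Mirror sed 's/\\$/\\\\\\$/g': prefix every '$' with a backslash."""
    return entry.replace("$", "\\$")


# Made-up entry in the usual <username>:$apr1$<salt>$<hash> shape.
sample = "alice:$apr1$abcdefgh$0123456789abcdefghijkl"

print(escape_dollars(sample))
# prints: alice:\$apr1\$abcdefgh\$0123456789abcdefghijkl
```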

Cheers,

Paul.

@gonzalomerino

Hello Paul,

Are there any updates on the debugging of this issue?

thanks!
Gonzalo

@paulmillar
Member

My apologies for the delay in replying.

There was an unresolved issue with the update that prevented me from updating dCache so it supports the Globus transfer-service. It turns out the problem was not with the patch, but with the existing dCache code. That problem is now fixed, so I've deployed the patch.

Currently the patch is in our 'master' branch. This means it is already deployed on the prometheus test system, so you should be able to verify that it works there.

We will back-port the fix to our stable branches, going back to dCache v3.2. It's too late to do that for this release cycle (due out tomorrow), but it should be available as part of the next release cycle (due next Tuesday: 2018-03-06).

@gonzalomerino

gonzalomerino commented Mar 9, 2018 via email

@gonzalomerino

gonzalomerino commented Mar 15, 2018 via email

@paulmillar
Member

Thanks for doing this testing, Gonzalo.

Every day, at 06:00 CET/CEST, prometheus is wiped clean and reinstalled from scratch. This isn't normal dCache behaviour -- it's something special to prometheus, as it always has the latest dCache version.

I believe that (currently) this time corresponds to midnight in Central US time. Looking at the logs, I see the Globus transfer service connections (acting on your behalf), starting 2018-03-15T05:13:00.669+0100, with the last one connecting 2018-03-15T08:07:40.427+0100.

So, I believe this explains the "few errors" you described.

Could you retry the transfers, starting them somewhat earlier, to try and avoid midnight?
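The timing argument above can be checked with a short calculation. This is only a sketch: it uses fixed UTC offsets rather than a time-zone database, taking +01:00 from the quoted timestamps and assuming -05:00 (US Central daylight time) for mid-March 2018.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1))   # offset used in the quoted timestamps
CDT = timezone(timedelta(hours=-5))  # assumption: US Central daylight time

FMT = "%Y-%m-%dT%H:%M:%S.%f%z"
first = datetime.strptime("2018-03-15T05:13:00.669+0100", FMT)
last = datetime.strptime("2018-03-15T08:07:40.427+0100", FMT)

# Daily reinstall of prometheus at 06:00 local (CET) time.
wipe = datetime(2018, 3, 15, 6, 0, tzinfo=CET)

# Did the reinstall fall between the first and last observed connection?
overlaps = first <= wipe <= last
wipe_central = wipe.astimezone(CDT)

print(overlaps, wipe_central.isoformat())
# prints: True 2018-03-15T00:00:00-05:00
```

The reinstall falls inside the window in which the connections were observed, and it corresponds to midnight in US Central time on that date.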

@paulmillar paulmillar self-assigned this Nov 17, 2020
@paulmillar
Member

There doesn't seem to have been much progress on this ticket.

To be clear, I believe this problem is solved. I am able to transfer many files between two dCache instances using Globus.
