Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GFTP: Aborted uploads leave 0 length files behind #6176

Open
DmitryLitvintsev opened this issue Oct 26, 2021 · 25 comments
Open

GFTP: Aborted uploads leave 0 length files behind #6176

DmitryLitvintsev opened this issue Oct 26, 2021 · 25 comments
Labels
bug FTP Problems related to any of the FTP family of doors (e.g., GridFTP/gsiftp)

Comments

@DmitryLitvintsev
Copy link
Member

DmitryLitvintsev commented Oct 26, 2021

dCache 7.2

Aborted GFTP uploads leave 0 length files behind. Additionally these files are "P"recious on pools and perpetually tried to
be written to tape:

billing=# select datestamp, connectiontime, errorcode, errormessage from  doorinfo where pnfsid='0000FC5BEB62A1854D33B987CD236ED17189';
         datestamp          | connectiontime | errorcode |                 errormessage                 
----------------------------+----------------+-----------+----------------------------------------------
 2021-10-21 14:57:04.854-05 |         176078 |       451 | Aborting transfer due to session termination
(1 row)

176 seconds, data not moved:

billing=# select datestamp, transfersize, fullsize, connectiontime, errorcode, errormessage from  billinginfo where pnfsid='0000FC5BEB62A1854D33B987CD236ED17189';
         datestamp          | transfersize | fullsize | connectiontime | errorcode |                                                    errormessa
ge                                                     
----------------------------+--------------+----------+----------------+-----------+--------------------------------------------------------------
-------------------------------------------------------
 2021-10-21 14:57:04.892-05 |            0 |        0 |              0 |       666 | Transfer forcefully killed: Active transfer cancelled: killed
 by door: Aborting transfer due to session termination
(1 row)
@paulmillar
Copy link
Member

There's a fair bit of information missing in order to understand the problem, reproduce and fix it.

Ideally, can you provide command-line options for globus-url-copy that reproduces the issue?

How is data reaching the pool: is the door proxying? is the client using MODE S or MODE E?

Could you provide the access log file (for the FTP session containing the failed transfer described above)?

I'm assuming the client is aborting the transfer because it's taking too long. Could you describe at what point the client is getting stuck? (is the mover started? does the client connect to the pool?)

@paulmillar paulmillar added bug FTP Problems related to any of the FTP family of doors (e.g., GridFTP/gsiftp) labels Oct 26, 2021
@paulmillar
Copy link
Member

You mention the aborted transfer left a zero-length file that is precious. Would a successful uploaded file in this directory also be precious?

In other words, is the file being precious a symptom of a problem, or just contextual information?

@DmitryLitvintsev
Copy link
Member Author

I do not know exact options passed to globus-url-copy in the quoted case. It is likely mode E, data is not proxied by door.
Coincidentally I am using TPC to transfer data from one dCache instances to another. Before upgrade to 7.2 things worked as expected, no 0 length file left behind. Now I observe a significant amount of 0 length files left behind. They seem to all fall into the same category - door connection made, but client never connects to the pool. I assume mover started.
IOW all fails are 451 Aborting transfer due to session termination.

The command I am using:

 globus-url-copy -dbg -nodcau -p 4 -fast -cd -vb 

@DmitryLitvintsev
Copy link
Member Author

DmitryLitvintsev commented Oct 26, 2021

I attach here door access log and billing log for one of the tranfer,.

Client logic is :

rc = 1 
tries = 0 
while rc != 0 and tries < 3:
   rc = system("globus-url-copy ...")
   tries++ 

billing:

[root@fndca5b 10]# grep 'CSskim-NP-20101123-010613-6113518_p21.20.00' billing-2021.10.21
2021.10.21 18:22:48:857 [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [0000A44CDD5E64D44A4F8E0C109F3774431A,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] d0.prd@enstore 900034 0 {451:"Aborting transfer due to session termination"}
2021.10.21 18:22:48:901 [pool:rw-stkendca1904-2@rw-stkendca1904-2Domain:transfer] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507] [0000A44CDD5E64D44A4F8E0C109F3774431A,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] d0.prd@enstore 0 0 true {GFtp-2.0 131.225.69.61 41419} [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain:1634857668824000] [p2p=false] {666:"Transfer forcefully killed: Active transfer cancelled: killed by door: Aborting transfer due to session termination"}
2021.10.21 18:37:49:223 [door:GFTP-stkendca58a-AAXO5SxBXLg@gridftp-stkendca58aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [0000F5BB7A0EC033452BAC8C911C066014B2,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] d0.prd@enstore 900076 0 {451:"Aborting transfer due to session termination"}
2021.10.21 18:37:49:230 [pool:rw-stkendca1905-2@rw-stkendca1905-2Domain:transfer] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507] [0000F5BB7A0EC033452BAC8C911C066014B2,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] d0.prd@enstore 0 0 true {GFtp-2.0 131.225.69.61 46067} [door:GFTP-stkendca58a-AAXO5SxBXLg@gridftp-stkendca58aDomain:1634858569147000] [p2p=false] {666:"Transfer forcefully killed: Active transfer cancelled: killed by door: Aborting transfer due to session termination"}
2021.10.21 18:42:35:904 [door:GFTP-stkendca54a-AAXO5XL6X4g@gridftp-stkendca54aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:36:218 [door:GFTP-stkendca54a-AAXO5XMCgZA@gridftp-stkendca54aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 11 0 {550:"File exists"}
2021.10.21 18:42:36:698 [door:GFTP-stkendca1804-AAXO5XMHzQg@gridftp-stkendca1804Domain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:37:516 [door:GFTP-stkendca1804-AAXO5XMOmzg@gridftp-stkendca1804Domain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 14 0 {550:"File exists"}
2021.10.21 18:42:38:069 [door:GFTP-stkendca56a-AAXO5XMbwmg@gridftp-stkendca56aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 210 0 {550:"File exists"}
2021.10.21 18:42:38:618 [door:GFTP-stkendca56a-AAXO5XMjjoA@gridftp-stkendca56aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 14 0 {550:"File exists"}
2021.10.21 18:42:38:983 [door:GFTP-stkendca58a-AAXO5XMsd8A@gridftp-stkendca58aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:39:291 [door:GFTP-stkendca58a-AAXO5XMxfOg@gridftp-stkendca58aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:39:613 [door:GFTP-stkendca1804-AAXO5XM2oVA@gridftp-stkendca1804Domain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:40:150 [door:GFTP-stkendca1804-AAXO5XM7Gdg@gridftp-stkendca1804Domain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 12 0 {550:"File exists"}
2021.10.21 18:42:40:556 [door:GFTP-stkendca63a-AAXO5XNDtPg@gridftp-stkendca63aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 13 0 {550:"File exists"}
2021.10.21 18:42:40:878 [door:GFTP-stkendca63a-AAXO5XNJfXA@gridftp-stkendca63aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 11 0 {550:"File exists"}
2021.10.21 18:42:41:285 [door:GFTP-stkendca65a-AAXO5XNPJqg@gridftp-stkendca65aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 18 0 {550:"File exists"}
2021.10.21 18:42:42:031 [door:GFTP-stkendca65a-AAXO5XNUpOg@gridftp-stkendca65aDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 208 0 {550:"File exists"}
2021.10.21 18:42:43:328 [door:GFTP1-fndca4b-AAXO5XNgzjA@gridftp1-fndca4bDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 14 0 {550:"File exists"}
2021.10.21 18:42:43:696 [door:GFTP1-fndca4b-AAXO5XN0C_g@gridftp1-fndca4bDomain:request] ["/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov":dzero::1841:1507:131.225.13.45] [,0] [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00] <Unknown> 13 0 {550:"File exists"}
[root@fndca5b 10]#

door access:

[root@stkendca58a dcache-log]# grep 'GFTP-stkendca58a-AAXO5PaVQ-A' gridftp-stkendca58aDomain.log
21 Oct 2021 18:22:48 (GFTP-stkendca58a-AAXO5PaVQ-A) [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain] Transfer error: 451 Aborting transfer due to session termination
[root@stkendca58a dcache-log]# grep 'GFTP-stkendca58a-AAXO5PaVQ-A' gridftp-stkendca58aDomain.access
level=INFO ts=2021-10-21T18:07:48.398-0500 event=org.dcache.ftp.hello session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain socket.remote=131.225.13.45:53284 socket.local=131.225.69.61:2811 reply="220 GSI FTP door ready"
level=INFO ts=2021-10-21T18:07:48.399-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="AUTH GSSAPI" reply="334 ADAT must follow"
level=INFO ts=2021-10-21T18:07:48.407-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-21T18:07:48.424-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-21T18:07:48.431-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-21T18:07:48.443-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ADAT ..." reply="235 OK"
level=INFO ts=2021-10-21T18:07:48.460-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{USER :globus-mapping:}" user.dn="/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=dcatest01.fnal.gov" user.mapped=1841:1507 reply="ENC{200 User :globus-mapping: logged in}"
level=INFO ts=2021-10-21T18:07:48.461-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PASS}" reply="ENC{200 PASS command successful}"
level=INFO ts=2021-10-21T18:07:48.462-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{SITE HELP}" reply="ENC{214-The following site-specific commands are supported:\r\nSITE <SP> HELP - Information about SITE commands\r\nSITE <SP> BUFSIZE <SP> <size> - Set network buffer to <size>\r\nSITE <SP> CHKSUM <SP> <value> - Fail upload if ADLER32 checksum isn't <value>\r\nSITE <SP> CHGRP <SP> <group> <SP> <path> - Change group-owner of <path> to group <group>\r\nSITE <SP> CHMOD <SP> <perm> <SP> <path> - Change permission of <path> to octal value <perm>\r\nSITE <SP> CLIENTINFO <SP> <id> - Provide server with information about the client\r\nSITE <SP> SYMLINK <SP> <dest> <SP> <path> - Register symlink at <path>, pointing to <dest>\r\nSITE <SP> SYMLINKFROM <SP> <path> - Register symlink location; SYMLINKTO must follow\r\nSITE <SP> SYMLINKTO <SP> <path> - Create symlink to <path>; SYMLINKFROM must be earlier command\r\nSITE <SP> TASKID <SP> <id> - Provide server with an identifier\r\nSITE <SP> USAGE <SP> [TOKEN <SP> <token> <SP> ] <path>\r\n214 SITE <SP> WHOAMI - Provides the username or uid of the user}"
level=INFO ts=2021-10-21T18:07:48.462-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{FEAT}" reply="ENC{211-OK\r\n EOF\r\n PARALLEL\r\n MODE-E-PERF\r\n SIZE\r\n SBUF\r\n ERET\r\n ESTO\r\n GETPUT\r\n MDTM\r\n CKSUM ADLER32,MD5,MD4,SHA-1,SHA-256,SHA-512\r\n MODEX\r\n TVFS\r\n MFMT\r\n MFCT\r\n MFF Modify;Create;UNIX.mode;\r\n PASV AllowDelayed\r\n MLSC\r\n MLST Size*;Modify*;Create;Type*;Unique*;Perm*;UNIX.owner*;UNIX.group*;UNIX.uid*;UNIX.gid*;UNIX.mode*;UNIX.ctime;UNIX.atime;\r\n211 End}"
level=INFO ts=2021-10-21T18:07:48.463-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{SITE CLIENTINFO scheme=gsiftp;appname=\"globus-url-copy\";appver=\"10.10 (gcc64, 1629915172-0) [Grid Community Toolkit 6.2]\";}" reply="ENC{250 OK}"
level=INFO ts=2021-10-21T18:07:48.464-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{TYPE I}" reply="ENC{200 Type set to I}"
level=INFO ts=2021-10-21T18:07:48.465-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MODE E}" reply="ENC{200 Will use Extended Block mode}"
level=INFO ts=2021-10-21T18:07:48.479-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.501-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //}" reply="ENC{550 Permission denied}"
level=INFO ts=2021-10-21T18:07:48.511-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/}" reply="ENC{550 Permission denied}"
level=INFO ts=2021-10-21T18:07:48.520-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/}" reply="ENC{550 Permission denied}"
level=INFO ts=2021-10-21T18:07:48.531-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/}" reply="ENC{550 Permission denied}"
level=INFO ts=2021-10-21T18:07:48.543-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/}" reply="ENC{550 Permission denied}"
level=INFO ts=2021-10-21T18:07:48.558-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.575-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.586-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.598-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.611-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.626-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.641-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.655-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{MKD //pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/}" reply="ENC{550 \"/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000\" directory already exists}"
level=INFO ts=2021-10-21T18:07:48.823-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{ALLO 936499002}" reply="ENC{200 ALLO command successful}"
level=INFO ts=2021-10-21T18:07:48.871-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{127 PORT (131,225,69,106,78,157)}"
level=INFO ts=2021-10-21T18:07:48.871-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{150 Opening BINARY data connection for /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00}"
level=INFO ts=2021-10-21T18:07:48.872-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:08:58.874-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:10:08.874-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:11:18.874-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:12:28.875-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:13:38.875-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:14:48.876-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:15:58.876-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:17:08.884-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:18:18.879-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:19:28.879-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:20:38.879-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:21:48.879-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{112-Perf Marker\r\n Timestamp:  1634857668.8\r\n Stripe Index: 0\r\n Stripe Bytes Transferred: 0\r\n Total Stripe Count: 1\r\n112 End.}"
level=INFO ts=2021-10-21T18:22:48.858-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain command="ENC{PUT path=//pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00;pasv;}" reply="ENC{451 Aborting transfer due to session termination}"
level=INFO ts=2021-10-21T18:22:48.858-0500 event=org.dcache.ftp.disconnect host.remote=131.225.13.45:53284 session=door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain

@DmitryLitvintsev
Copy link
Member Author

The source file has normal size:

[root@dcatest01 ~]# ls -alt /pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00 
-rw-r--r-- 1 1841 1507 936499002 Nov 23  2010 /pnfs/fnal.gov/usr/d0/data/db5/derivedDetector/csg-p21.20.00-p20.18.02b/dzero/thumbnail/lep/0000/CSskim-NP-20101123-010613-6113518_p21.20.00
[root@dcatest01 ~]# 

@DmitryLitvintsev
Copy link
Member Author

Pool side log excerpt for the above transfer:

21 Oct 2021 18:22:48 (rw-stkendca1904-2) [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain GFTP-stkendca58a-AAXO5PaVQ-A PoolAcceptFile 0000A44CDD5E64D44A4F8E0C109F3774431A] Transfer forcefully killed: Active transfer cancelled: killed by door: Aborting transfer due to session termination
21 Oct 2021 18:22:48 (rw-stkendca1904-2) [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain GFTP-stkendca58a-AAXO5PaVQ-A PoolAcceptFile 0000A44CDD5E64D44A4F8E0C109F3774431A] Failed to read file size: java.nio.file.NoSuchFileException: /storage/data2/rw-pool-2/data/0000A44CDD5E64D44A4F8E0C109F3774431A

@DmitryLitvintsev
Copy link
Member Author

(I think in this case actually the file did not end up with 0 size replica on pools, It existed in some "ether", cacheinfoof that file showed empty)

@DmitryLitvintsev
Copy link
Member Author

The extend of the problem. Overnight, looking got HSM flush errors:

[root@fndca2b ~]# for id in $ids; do  ssh -p 24223 enstore@fndca3b "\sl $id rep ls $id" | grep "> 0 "; done
    0000055B005B2A5E4ABF9056C59099B96EBD <-P---------L(0)[1]> 0 si={minerva.mc_minos}
    0000139B8DE1FBA545CAA1F9C5C4DA586EB4 <-P---------L(0)[1]> 0 si={uboone.data_raw}
    0000156CF3F4C56349AD802FFDEC317AD08E <-P---------L(0)[0]> 0 si={e1039.raw_coda}
    00001EFF0D9341C34A5BA4655CE78A4F27B8 <-P---------L(0)[0]> 0 si={mu2e.usr-sim}
    000026B4D53F3AB44849A64D4308A52BD53C <-P---------L(0)[0]> 0 si={nova.nova_users}
    000026B4D53F3AB44849A64D4308A52BD53C <-P---------L(0)[0]> 0 si={nova.nova_users}
    00002B4C5882AB184D22A31EB780B19AB113 <-P---------L(0)[0]> 0 si={nova.nova_production}
    00003910C6A5BD90487FABE5AB3D57697066 <-P---------L(0)[0]> 0 si={GM2.GM2}
    000073429EAB666A4EFCB738100013F64E32 <-P---------L(0)[0]> 0 si={egp.egp}
    000074D0A861DEE243FC94E9DAE97B907396 <-P---------L(0)[0]> 0 si={nova.nova_production}
    00008803C13EEF38402AB24B39874351D59F <-P---------L(0)[0]> 0 si={dune.dune}
    00009D35E5ABADAE4B57A4446728C8CC2FEF <-P---------L(0)[0]> 0 si={minerva.mc_minos}
    0000A692391CDD314652893316D00163D438 <-P---------L(0)[0]> 0 si={uboone.overlay_reco}
    0000AFF01EEDF50447C99D3E6FFFE5906D15 <-P---------L(0)[1]> 0 si={uboone.data_raw}
    0000BC9DCB0231C849EB9B1524CAA78630A8 <-P---------L(0)[0]> 0 si={nova.nova_users}
    0000D9E1091D27294285B4E6FC0037BF7406 <-P---------L(0)[1]> 0 si={minerva.mc_minos}
    0000E37309DFB46D4D76B0680FF86AE8069B <-P---------L(0)[0]> 0 si={nova.nova_users}
    0000F5CA239F2C5C44F88F99A5CA7D238063 <-P---------L(0)[0]> 0 si={minerva.mc_generation}

Precious 0 size file that fail to go to tape (due to different problem).

@paulmillar
Copy link
Member

Does the door's log file (the log file of the domain hosting the door) contain any entries regarding these failing transfers?

@DmitryLitvintsev
Copy link
Member Author

single line:

21 Oct 2021 18:22:48 (GFTP-stkendca58a-AAXO5PaVQ-A) [door:GFTP-stkendca58a-AAXO5PaVQ-A@gridftp-stkendca58aDomain] Transfer error: 451 Aborting transfer due to session termination

(this corresponds to the very first aborted transfer that I started this Issue with)

@paulmillar
Copy link
Member

Two more questions:

Do you see any SRM upload activity (in particular that failed) on this dCache instance?

Do you see any xrootd uploads with the "Persist On Successful Close" (POSC) flag set (again, in particular that failed) ?

@DmitryLitvintsev
Copy link
Member Author

SRM activity is rather minimal. By looking in SRM access file I see:

   41018 request.method=srmStatusOfBringOnlineRequest
  29930 request.method=srmLs
  20698 request.method=srmReleaseFiles
  14746 request.method=srmPing
  14527 request.method=srmPrepareToGet
    696 request.method=srmBringOnline
    660 request.method=srmAbortFiles
    248 request.method=srmPutDone
    248 request.method=srmPrepareToPut
    233 request.method=srmRm
    113 request.method=srmStatusOfGetRequest
      3 request.method=srmStatusOfLsRequest

Actually I do not seem to see failing gets or puts. Fails I see have to do with SrmRm or SrmLs fails due to SRM_INVALID_PATH.

How do I know if POSC was set? ("grep -i 'pocs' xroot.access file did not bring up anything)

@paulmillar
Copy link
Member

Do any of the srmAbortFiles requests match the files with the problematic FTP transfers ... or do they happen at about the same time as the transfers?

@paulmillar
Copy link
Member

Detecting xroot POSC isn't completely trivial. You're looking for a xroot kXR_open request in the access log file with the options flag 0x1000 set.

Here's a sed one-liner that pulls out the options flag from the access log file:

sed -n 's/.*xrootd.request.*request=open.*options=\([^ ]*\).*/\1/p' my-door.access

Here's some sample output from running the sed script:

0x468
0x1468

The first entry (0x468) is the options for a kXR_open request without POSC set.
The second entry (0x1468) is the options for a kXR_open request with POSC set.

@DmitryLitvintsev
Copy link
Member Author

SRM is rarely used. Mostly globus-url-copy.

Today I scanned all 0 lenth files accumulated so far:

psql -t -F ' ' -A -U enstore billing -c "select pnfsid from billinginfo where fullsize=0 and errorcode=666 and errormessage like 'Transfer forcefully killed: Active transfer cancelled: killed%' and datestamp> current_date" -o zero_size_files.txt 
[root@fndca5b ~]# wc -l zero_size_files.txt 
1168 zero_size_files.txt

out of 1168 only 5 actually exist in namespace and on pools. Luckily all on scratch (aka volatile pools).

Taking just one file ....

door access:

level=INFO ts=2021-10-26T03:17:09.690-0500 event=org.dcache.ftp.hello session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain socket.remote=131.225.208.106:58664 socket.local=131.225.69.172:2811 reply="220 GSI FTP door ready"
level=INFO ts=2021-10-26T03:17:09.848-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="AUTH GSSAPI" reply="334 ADAT must follow"
level=INFO ts=2021-10-26T03:17:10.316-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-26T03:17:12.161-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-26T03:17:12.509-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-26T03:17:13.616-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ADAT ..." reply="235 OK"
level=INFO ts=2021-10-26T03:17:31.553-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{USER :globus-mapping:}" user.dn="/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Jeremy Hewes/CN=UID:jhewes15" user.mapped=48604:9553,8674,8678,9767,9960,8844,8623,9553,9937,9010,8828 reply="ENC{200 User :globus-mapping: logged in}"
level=INFO ts=2021-10-26T03:17:31.864-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{PASS}" reply="ENC{200 PASS command successful}"
level=INFO ts=2021-10-26T03:17:32.036-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{SITE HELP}" reply="ENC{214-The following site-specific commands are supported:\r\nSITE <SP> HELP - Information about SITE commands\r\nSITE <SP> BUFSIZE <SP> <size> - Set network buffer to <size>\r\nSITE <SP> CHKSUM <SP> <value> - Fail upload if ADLER32 checksum isn't <value>\r\nSITE <SP> CHGRP <SP> <group> <SP> <path> - Change group-owner of <path> to group <group>\r\nSITE <SP> CHMOD <SP> <perm> <SP> <path> - Change permission of <path> to octal value <perm>\r\nSITE <SP> CLIENTINFO <SP> <id> - Provide server with information about the client\r\nSITE <SP> SYMLINK <SP> <dest> <SP> <path> - Register symlink at <path>, pointing to <dest>\r\nSITE <SP> SYMLINKFROM <SP> <path> - Register symlink location; SYMLINKTO must follow\r\nSITE <SP> SYMLINKTO <SP> <path> - Create symlink to <path>; SYMLINKFROM must be earlier command\r\nSITE <SP> TASKID <SP> <id> - Provide server with an identifier\r\nSITE <SP> USAGE <SP> [TOKEN <SP> <token> <SP> ] <path>\r\n214 SITE <SP> WHOAMI - Provides the username or uid of the user}"
level=INFO ts=2021-10-26T03:17:32.038-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{FEAT}" reply="ENC{211-OK\r\n EOF\r\n PARALLEL\r\n MODE-E-PERF\r\n SIZE\r\n SBUF\r\n ERET\r\n ESTO\r\n GETPUT\r\n MDTM\r\n CKSUM ADLER32,MD5,MD4,SHA-1,SHA-256,SHA-512\r\n MODEX\r\n TVFS\r\n MFMT\r\n MFCT\r\n MFF Modify;Create;UNIX.mode;\r\n PASV AllowDelayed\r\n MLSC\r\n MLST Size*;Modify*;Create;Type*;Unique*;Perm*;UNIX.owner*;UNIX.group*;UNIX.uid*;UNIX.gid*;UNIX.mode*;UNIX.ctime;UNIX.atime;\r\n211 End}"
level=INFO ts=2021-10-26T03:17:32.038-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{SITE CLIENTINFO scheme=gsiftp;appname=\"globus-url-copy\";appver=\"10.10 (gcc64, 1629915172-0) [Grid Community Toolkit 6.2]\";}" reply="ENC{250 OK}"
level=INFO ts=2021-10-26T03:17:32.038-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{TYPE I}" reply="ENC{200 Type set to I}"
level=INFO ts=2021-10-26T03:17:32.426-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca2043-AAXPPRqYMpA@gridftp-stkendca2043Domain command="ENC{ALLO 1757712}" reply="ENC{200 ALLO command successful}"

door log:

can't find anything for that string "

0

but the door crashed about the same time:

26 Oct 2021 03:18:13 (GFTP-stkendca2043-AAXPPRw20xg) [door:GFTP-stkendca2043-AAXPPRw20xg@gridftp-stkendca2043Domain] Login failed for CN=1569213987,CN=1121042410,CN=1773309327,CN=3869595295,CN=UID:bzli,CN=Bingzhi Li,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /home/enstore/dcache-log/gridftp-stkendca2043Domain-oom.hprof ...
Unable to create /home/enstore/dcache-log/gridftp-stkendca2043Domain-oom.hprof: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space
2021-10-26 03:18:22 Process terminated unexpectedly
2021-10-26 03:18:22 Sleeping 10 seconds

2021-10-26 03:18:32 Restarting /usr/bin/java -server -Xmx1024m -XX:MaxDirectMemorySize=1024m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=25001:35000 -Djava.security.krb5.realm=FNAL.GOV -Djava.security.krb5.kdc=krb-fnal-enstore.fnal.gov:krb-fnal-2.fnal.gov:krb-fnal-3.fnal.gov:krb-fnal-4.fnal.gov:krb-fnal-5.fnal.gov -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss_host.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/enstore/dcache-log/gridftp-stkendca2043Domain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Djava.security.properties=/etc/dcache/max.java.security -Djdk.tls.ephemeralDHKeySize=2048 -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start gridftp-stkendca2043Domain
WARN  - dcache.conf:21: Property dcache.broker.host: please remove this assignment; See dcache.zookeeper.connection

The pool : no records.

GFTP crashing with OOM is something new. Happening after upgrade to 7.2 often

[gridftp-${host.name}Domain]
dcache.java.memory.heap=1024m
dcache.java.memory.direct=1024m
[gridftp-${host.name}Domain/ftp]
ftp.authn.protocol=gsi

1GB not enough?

@paulmillar
Copy link
Member

Could look at the pool hosting one of these zero-length precious files. Does the pool's log file contain anything significant?

@paulmillar
Copy link
Member

Could you pull out the billing log entries for the failed upload for the file with PNFS-ID 000026B4D53F3AB44849A64D4308A52BD53C ?

Also, could you confirm which pool(s) have a zero-length replica of this file?

@DmitryLitvintsev
Copy link
Member Author

what makes this file special?

@paulmillar
Copy link
Member

In a previous comment, you ran rep ls on all pools with problematic PNFS-IDs. This particular file appears to be stored on two pools.

This gives me a hint on what might be going on here.

I haven't figured out all the details, but I'd like to (try to) double-check my ideas, by looking at the billing entries for this PNFS-ID and which pools have a (zero-length) replica of this file.

@DmitryLitvintsev
Copy link
Member Author

DmitryLitvintsev commented Oct 27, 2021

Ah, I see. But I should not have. I scanned the pool log foe "HSM flush" errors. Some of these files are not GFTP files and some files are old. This particular file is actually NFS.

I will be more careful and filter on protocol and creation date.

I have a theory that these 0 length file may correlate with GFTP dooes running out memory.

I will keep posting here once I found out more.

In the migration program that invoked globus-url-copy I added:

if rc != 0:
if os.path.exists(destination) : os.unlink(destination)

I will remove that to see if I start accumulating 0 file size again on fails.

@paulmillar
Copy link
Member

As a suggestion: perhaps we should focus on one specific transfer that resulted in a precious zero-length file and try to understand what went wrong.

Currently, it seems like this issue is drifting and attracting many problems in the process, which might or might not be related.

Therefore, in order to make any progress, I think we should focus on one specific transfer. Any other issues (e.g., running out of memory) should be documented in separate issues.

@DmitryLitvintsev
Copy link
Member Author

DmitryLitvintsev commented Oct 27, 2021

I started with very specific GFTP transfer. You can ignore comment starting with "The extend of the problem. Overnight, looking got HSM flush errors:". I think that one caught unrelated cases.

@DmitryLitvintsev
Copy link
Member Author

psql -t -F ' ' -A -U enstore billing -c "select pnfsid from billinginfo where fullsize=0 and protocol in ('GFtp-2.0', 'GFtp-1.0') and errorcode=666 and isnew is true errormessage like 'Transfer forcefully killed: Active transfer cancelled: killed%' and datestamp> current_date" -o zero_size_files.txt

wc -l zero_size_files.txt 
2843 zero_size_files.txt

out of these 2843 25 exist in namespace and all 25 have 0 size replica. none of these files on tape-backed pools. Lets look at one

00001B7C7918FA2848B89A684E4377310C65
 p-ldrd-stkendca70a-21

billing


billing=# select * from doorinfo where pnfsid = '00001B7C7918FA2848B89A684E4377310C65';
 client | mappedgid | mappeduid | owner | path | queuedtime | connectiontime | action | cellname | datestamp | errorcode | errormessage | pnfsid | transaction | fqan 
--------+-----------+-----------+-------+------+------------+----------------+--------+----------+-----------+-----------+--------------+--------+-------------+------
(0 rows)



billing=# select datestamp, errorcode, errormessage, connectiontime, initiator   from billinginfo where pnfsid = '00001B7C7918FA2848B89A684E4377310C65';
         datestamp          | errorcode |                             errormessage                             | connectiontime |                                  initiator                                   
----------------------------+-----------+----------------------------------------------------------------------+----------------+------------------------------------------------------------------------------
 2021-10-27 01:08:27.526-05 |       666 | Transfer forcefully killed: Active transfer cancelled: killed by JTM |              0 | door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain:1635311284924000
(1 row)

door access:

[root@stkendca71a ~]# grep 'GFTP-stkendca71a-AAXPTpQyz_A' ~enstore/dcache-log/gridftp-stkendca71aDomain.access
level=INFO ts=2021-10-27T00:08:04.310-0500 event=org.dcache.ftp.hello session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain socket.remote=128.230.11.9:62143 socket.local=131.225.69.79:2811 reply="220 GSI FTP door ready"
level=INFO ts=2021-10-27T00:08:04.338-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="AUTH GSSAPI" reply="334 ADAT must follow"
level=INFO ts=2021-10-27T00:08:04.373-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-27T00:08:04.421-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-27T00:08:04.452-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-10-27T00:08:04.487-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ADAT ..." reply="235 OK"
level=INFO ts=2021-10-27T00:08:04.537-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{USER :globus-mapping:}" user.dn="/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Jacob Zettlemoyer/CN=UID:jzettle" user.mapped=56911:8644,9314,8644,9767,8844,8623,9429,5468,8636,9470 reply="ENC{200 User :globus-mapping: logged in}"
level=INFO ts=2021-10-27T00:08:04.566-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{PASS}" reply="ENC{200 PASS command successful}"
level=INFO ts=2021-10-27T00:08:04.595-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{SITE HELP}" reply="ENC{214-The following site-specific commands are supported:\r\nSITE <SP> HELP - Information about SITE commands\r\nSITE <SP> BUFSIZE <SP> <size> - Set network buffer to <size>\r\nSITE <SP> CHKSUM <SP> <value> - Fail upload if ADLER32 checksum isn't <value>\r\nSITE <SP> CHGRP <SP> <group> <SP> <path> - Change group-owner of <path> to group <group>\r\nSITE <SP> CHMOD <SP> <perm> <SP> <path> - Change permission of <path> to octal value <perm>\r\nSITE <SP> CLIENTINFO <SP> <id> - Provide server with information about the client\r\nSITE <SP> SYMLINK <SP> <dest> <SP> <path> - Register symlink at <path>, pointing to <dest>\r\nSITE <SP> SYMLINKFROM <SP> <path> - Register symlink location; SYMLINKTO must follow\r\nSITE <SP> SYMLINKTO <SP> <path> - Create symlink to <path>; SYMLINKFROM must be earlier command\r\nSITE <SP> TASKID <SP> <id> - Provide server with an identifier\r\nSITE <SP> USAGE <SP> [TOKEN <SP> <token> <SP> ] <path>\r\n214 SITE <SP> WHOAMI - Provides the username or uid of the user}"
level=INFO ts=2021-10-27T00:08:04.625-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{FEAT}" reply="ENC{211-OK\r\n EOF\r\n PARALLEL\r\n MODE-E-PERF\r\n SIZE\r\n SBUF\r\n ERET\r\n ESTO\r\n GETPUT\r\n MDTM\r\n CKSUM ADLER32,MD5,MD4,SHA-1,SHA-256,SHA-512\r\n MODEX\r\n TVFS\r\n MFMT\r\n MFCT\r\n MFF Modify;Create;UNIX.mode;\r\n PASV AllowDelayed\r\n MLSC\r\n MLST Size*;Modify*;Create;Type*;Unique*;Perm*;UNIX.owner*;UNIX.group*;UNIX.uid*;UNIX.gid*;UNIX.mode*;UNIX.ctime;UNIX.atime;\r\n211 End}"
level=INFO ts=2021-10-27T00:08:04.654-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{SITE CLIENTINFO scheme=gsiftp;appname=\"globus-url-copy\";appver=\"10.10 (gcc64, 1629915172-0) [Grid Community Toolkit 6.2]\";}" reply="ENC{250 OK}"
level=INFO ts=2021-10-27T00:08:04.866-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{TYPE I}" reply="ENC{200 Type set to I}"
level=INFO ts=2021-10-27T00:08:04.894-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{ALLO 2540543}" reply="ENC{200 ALLO command successful}"
level=INFO ts=2021-10-27T00:08:04.968-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{PUT path=/pnfs/fnal.gov/usr/ldrd/darksectorldrd/persistent/users/jzettle/darksectorsim/DM_815//darksectorsim_run_124.root;pasv;}" reply="ENC{127 PORT (131,225,69,78,85,215)}"
level=INFO ts=2021-10-27T00:08:04.968-0500 event=org.dcache.ftp.response session=door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain command="ENC{PUT path=/pnfs/fnal.gov/usr/ldrd/darksectorldrd/persistent/users/jzettle/darksectorsim/DM_815//darksectorsim_run_124.root;pasv;}" reply="ENC{150 Opening BINARY data connection for /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/ldrd/darksectorldrd/persistent/users/jzettle/darksectorsim/DM_815/darksectorsim_run_124.root}"

door log:

Nothing....

BUT... door failed 40 minutes later:

27 Oct 2021 00:49:51 (GFTP-stkendca71a-AAXPTydRWVA) [door:GFTP-stkendca71a-AAXPTydRWVA@gridftp-stkendca71aDomain] Login failed for CN=1441949455,CN=404089824,CN=131724431,CN=3450858237,CN=UID:yinrui,CN=Yinrui Liu,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:52 (GFTP-stkendca71a-AAXPTydW06g) [door:GFTP-stkendca71a-AAXPTydW06g@gridftp-stkendca71aDomain] Login failed for CN=665513507,CN=1834188273,CN=1539352787,CN=943063367,CN=uboonepro-uboonegpvm01.fnal.gov,OU=Fermilab,O=Fermi Research Alliance,L=Batavia,ST=Illinois,C=US,DC=incommon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:53 (GFTP-stkendca71a-AAXPTydhpTA) [door:GFTP-stkendca71a-AAXPTydhpTA@gridftp-stkendca71aDomain] Login failed for CN=1047975828,CN=1783680616,CN=1296741990,CN=1499981063,CN=UID:reitzner,CN=Diane Reitzner,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:57 (GFTP-stkendca71a-AAXPTyfi_ng) [door:GFTP-stkendca71a-AAXPTyfi_ng@gridftp-stkendca71aDomain] Login failed for CN=2020185082,CN=404089824,CN=131724431,CN=3450858237,CN=UID:yinrui,CN=Yinrui Liu,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:55 (GFTP-stkendca71a-AAXPTyfGGAg) [door:GFTP-stkendca71a-AAXPTyfGGAg@gridftp-stkendca71aDomain] Login failed for CN=114956643,CN=404089824,CN=131724431,CN=3450858237,CN=UID:yinrui,CN=Yinrui Liu,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:55 (GFTP-stkendca71a-AAXPTyfGGAk) [door:GFTP-stkendca71a-AAXPTyfGGAk@gridftp-stkendca71aDomain] Login failed for CN=599196195,CN=404089824,CN=131724431,CN=3450858237,CN=UID:yinrui,CN=Yinrui Liu,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:55 (GFTP-stkendca71a-AAXPTyfGGAo) [door:GFTP-stkendca71a-AAXPTyfGGAo@gridftp-stkendca71aDomain] Login failed for CN=422559779,CN=2032225097,CN=849638223,CN=1499981063,CN=UID:reitzner,CN=Diane Reitzner,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:49:59 (GFTP-stkendca71a-AAXPTyfXD8g) [door:GFTP-stkendca71a-AAXPTyfXD8g@gridftp-stkendca71aDomain] Login failed for CN=590566002,CN=404089824,CN=131724431,CN=3450858237,CN=UID:yinrui,CN=Yinrui Liu,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
27 Oct 2021 00:50:04 (GFTP-stkendca71a-AAXPTygozZA) [door:GFTP-stkendca71a-AAXPTygozZA@gridftp-stkendca71aDomain] Login failed for CN=1170340006,CN=1546171595,CN=1359815235,CN=770238401,CN=UID:jzettle,CN=Jacob Zettlemoyer,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org: Request to [>gPlazma@local] timed out.
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /home/enstore/dcache-log/gridftp-stkendca71aDomain-oom.hprof ...
Heap dump file created [1402871611 bytes in 15.171 secs]
Terminating due to java.lang.OutOfMemoryError: Java heap space
2021-10-27 00:50:31 Process terminated unexpectedly
2021-10-27 00:50:31 Sleeping 10 seconds

pool log :

 27 Oct 2021 01:08:27 (p-ldrd-stkendca70a-21) [door:GFTP-stkendca71a-AAXPTpQyz_A@gridftp-stkendca71aDomain GFTP-stkendca71a-AAXPTpQyz_A PoolAcceptFile 00001B7C7918FA2848B89A684E4377310C65] Transfer forcefully killed: Active transfer cancelled: killed by JTM

so, I will try to gather some stat. But does it look like this is correlated to whatever events lead up to
door running out of memory?

@DmitryLitvintsev
Copy link
Member Author

DmitryLitvintsev commented Oct 29, 2021

before dying GFTP door spat this:


29 Oct 2021 13:27:32 (GFTP-stkendca58a-AAXPgLDNbpA) [door:GFTP-stkendca58a-AAXPgLDNbpA@gridftp-stkendca58aDomain] Transfer error: 451 Failed to deliver PnfsGetFileAttributes message <-527409
367631836793:-7271213560554978864> to [>PnfsManager@local]: c-dCacheDomain-AAXPfxnKo2A-AAXPfxnKswA@gridftp-stkendca58aDomain is busy (its estimated response time of 310654 ms is longer than 
the message TTL of 300000 ms).
29 Oct 2021 13:29:06 (GFTP-stkendca58a-AAXPgbYq_jg) [door:GFTP-stkendca58a-AAXPgbYq_jg@gridftp-stkendca58aDomain] Failed to delete file after failed upload: /pnfs/fnal.gov/usr/pnfs/fnal.gov/
usr/GM2/scratch/users/gm2pro/preprod/QDatabase/5302P/Run4U-v9_70/2021-10-29-12-05-45/json/gm2offline_ana_49742599_43996.00021.json (0000D590F73856A34C3493CD52C488729C86): Request to [>PnfsMa
nager@local] timed out.
29 Oct 2021 13:29:53 (GFTP-stkendca58a-AAXPgbYq_jg) [door:GFTP-stkendca58a-AAXPgbYq_jg@gridftp-stkendca58aDomain] Transfer error: 451 Request to [>v-stkendca2012-2@v-stkendca2012-2Domain] ti
med out.
29 Oct 2021 13:30:31 (GFTP-stkendca58a-AAXPf-ptGYA) [door:GFTP-stkendca58a-AAXPf-ptGYA@gridftp-stkendca58aDomain v-stkendca2011-2 DoorTransferFinished 00002E266C62F2144E34A9631F80FBF72A06] F
ailed to delete file after failed upload: /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/GM2/scratch/daq/2021-10-29-10-06-13/data/gm2preproduction_full_25360075_44088.00302.root (00002E266C62F2144E34A
9631F80FBF72A06): Request to [>PnfsManager@local] timed out.

Guess this is how the zero size file get left behind.

Ha, the file above not 0 size:

[root@stkendca58a ~]# ls -al /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/GM2/scratch/daq/2021-10-29-10-06-13/data/gm2preproduction_full_25360075_44088.00302.root
-rw-rw-r-- 1 45651 9950 23779020 Oct 29 13:11 /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/GM2/scratch/daq/2021-10-29-10-06-13/data/gm2preproduction_full_25360075_44088.00302.root
[root@stkendca58a ~]# cat  /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/GM2/scratch/daq/2021-10-29-10-06-13/data/".(id)(gm2preproduction_full_25360075_44088.00302.root)"
00002E266C62F2144E34A9631F80FBF72A06

Billing:

billing=# select datestamp, protocol, isnew, errorcode, errormessage from billinginfo where pnfsid='00002E266C62F2144E34A9631F80FBF72A06';
         datestamp          | protocol | isnew | errorcode |                                     errormessage                                      
----------------------------+----------+-------+-----------+---------------------------------------------------------------------------------------
 2021-10-29 13:25:18.929-05 | NFS4-4.1 | f     |         0 | 
 2021-10-29 13:11:56.059-05 | GFtp-2.0 | t     |       666 | General problem: Problem while connected to 137.99.174.15:41558: Connection timed out
(2 rows)

We may be having incomplete, and thus corrupted files left behind!

@DmitryLitvintsev
Copy link
Member Author

This issue could be the effect of doors crashing with OOM. Issue #6195
Will need to verify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug FTP Problems related to any of the FTP family of doors (e.g., GridFTP/gsiftp)
Projects
None yet
Development

No branches or pull requests

2 participants