GFTP: Aborted uploads leave 0 length files behind #6176
Comments
There's a fair bit of information missing in order to understand the problem, reproduce and fix it. Ideally, can you provide command-line options for How is data reaching the pool: is the door proxying? is the client using MODE S or MODE E? Could you provide the access log file (for the FTP session containing the failed transfer described above)? I'm assuming the client is aborting the transfer because it's taking too long. Could you describe at what point the client is getting stuck? (is the mover started? does the client connect to the pool?) |
You mention the aborted transfer left a zero-length file that is precious. Would a successful uploaded file in this directory also be precious? In other words, is the file being precious a symptom of a problem, or just contextual information? |
I do not know the exact options passed to globus-url-copy in the quoted case. It is likely MODE E, and data is not proxied by the door. The command I am using:
|
I attach here the door access log and billing log for one of the transfers. Client logic is:
billing:
door access:
|
The source file has normal size:
|
Pool side log excerpt for the above transfer:
|
(I think in this case the file did not actually end up with a 0 size replica on pools. It existed in some "ether"; cacheinfoof for that file showed empty) |
The extent of the problem. Overnight, looking for HSM flush errors:
Precious 0 size files that fail to go to tape (due to a different problem). |
Does the door's log file (the log file of the domain hosting the door) contain any entries regarding these failing transfers? |
single line:
(this corresponds to the very first aborted transfer that I started this Issue with) |
Two more questions: Do you see any SRM upload activity (in particular that failed) on this dCache instance? Do you see any xrootd uploads with the "Persist On Successful Close" (POSC) flag set (again, in particular that failed) ? |
SRM activity is rather minimal. By looking in SRM access file I see:
Actually, I do not seem to see failing gets or puts. The failures I see are SrmRm or SrmLs failing with SRM_INVALID_PATH. How do I know if POSC was set? (Running "grep -i 'pocs'" on the xroot.access file did not bring up anything.) |
Do any of the |
Detecting xroot POSC isn't completely trivial. You're looking for a xroot Here's a sed one-liner that pulls out the options flag from the access log file:
Here's some sample output from running the sed script:
The first entry ( |
SRM is rarely used. Mostly globus-url-copy. Today I scanned all 0 length files accumulated so far:
Out of 1168, only 5 actually exist in the namespace and on pools. Luckily, all are on scratch (aka volatile) pools. Taking just one file... door access:
door log: can't find anything for that string "
but the door crashed about the same time:
The pool: no records. The GFTP door crashing with OOM is something new. It has been happening often since the upgrade to 7.2. Is 1GB not enough for [gridftp-${host.name}Domain]? |
Could you look at the pool hosting one of these zero-length precious files? Does the pool's log file contain anything significant? |
Could you pull out the billing log entries for the failed upload for the file with PNFS-ID Also, could you confirm which pool(s) have a zero-length replica of this file? |
what makes this file special? |
In a previous comment, you ran This gives me a hint on what might be going on here. I haven't figured out all the details, but I'd like to (try to) double-check my ideas, by looking at the billing entries for this PNFS-ID and which pools have a (zero-length) replica of this file. |
Ah, I see. But I should not have. I scanned the pool log for "HSM flush" errors. Some of these files are not GFTP files and some files are old. This particular file is actually NFS. I will be more careful and filter on protocol and creation date. I have a theory that these 0 length files may correlate with GFTP doors running out of memory. I will keep posting here once I find out more. In the migration program that invokes globus-url-copy I added: if rc != 0: I will remove that to see if I start accumulating 0 size files again on failures. |
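For readers following along, the cleanup-on-failure logic mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: the thread does not show the real migration program, so `copy_with_cleanup`, `remove_destination` and `run_command` are hypothetical names, and how the destination is actually deleted is left to the caller. Only the plain `globus-url-copy <source> <destination>` invocation is used, with no extra options.

```python
import subprocess
from typing import Callable, Sequence


def _run(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def copy_with_cleanup(
    src_url: str,
    dst_url: str,
    remove_destination: Callable[[str], None],
    run_command: Callable[[Sequence[str]], int] = _run,
) -> int:
    """Copy src_url to dst_url; on failure, ask the caller to remove dst_url.

    remove_destination is a hypothetical callback (not from the thread) that
    deletes whatever the aborted upload left behind in the namespace.
    """
    rc = run_command(["globus-url-copy", src_url, dst_url])
    if rc != 0:
        # The transfer failed or was aborted: clean up the destination entry.
        remove_destination(dst_url)
    return rc
```

Note that a cleanup step like this hides the symptom discussed in this issue, which is why disabling it makes the leftover 0 length files visible again.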
As a suggestion: perhaps we should focus on one specific transfer that resulted in a precious zero-length file and try to understand what went wrong. Currently, it seems like this issue is drifting and attracting many problems in the process, which might or might not be related. Therefore, in order to make any progress, I think we should focus on one specific transfer. Any other issues (e.g., running out of memory) should be documented in separate issues. |
I started with a very specific GFTP transfer. You can ignore the comment starting with "The extent of the problem. Overnight, looking for HSM flush errors:". I think that one caught unrelated cases. |
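To keep unrelated cases out of the scan, the candidates could be filtered on protocol and creation date, as suggested earlier in the thread. The sketch below is a rough illustration only: the `Candidate` record, its field names, and the assumption that GridFTP entries carry a protocol string starting with "GFTP" are all hypothetical and would need to be adapted to whatever the billing or namespace query actually returns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class Candidate:
    """Hypothetical record for one suspicious file (field names are assumptions)."""
    pnfsid: str
    protocol: str      # protocol that created the file, e.g. "GFTP-2.0" or "NFS4-4.1"
    created: datetime  # creation time of the namespace entry
    size: int          # file size in bytes


def gftp_zero_length(candidates: Iterable[Candidate], since: datetime) -> List[Candidate]:
    """Keep only zero-length files created via GridFTP on or after `since`."""
    return [
        c for c in candidates
        if c.size == 0
        and c.protocol.upper().startswith("GFTP")
        and c.created >= since
    ]
```

With a filter like this, an old file or one written over NFS would be dropped before anyone spends time on its billing entries.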
Out of these 2843, 25 exist in the namespace and all 25 have a 0 size replica. None of these files are on tape-backed pools. Let's look at one:
billing
door access:
door log: Nothing.... BUT... door failed 40 minutes later:
pool log :
so, I will try to gather some stat. But does it look like this is correlated to whatever events lead up to |
Before dying, the GFTP door spat this:
I guess this is how the zero size files get left behind. Ha, the file above is not 0 size:
Billing:
We may have incomplete, and thus corrupted, files left behind! |
This issue could be the effect of doors crashing with OOM. Issue #6195 |
dCache 7.2
Aborted GFTP uploads leave 0 length files behind. Additionally, these files are "P"recious on pools and are perpetually retried for writing to tape:
176 seconds, data not moved: