transfers failing with rejected PUT: 500 Server Error #6557
Hi @cfgamboa,

So, at a high level, what's happening here is ATLAS wants to copy a file from T0D1 (i.e., disk) storage to T1D0 (i.e., tape) storage. To do this, they are trying to use HTTP-TPC, with both endpoints being at BNL.

There are two attempts. The first attempt is a pull request, so the COPY request is to the T1D0 url, with a

The second attempt is a push request, so the COPY request targets the T0D1 url, with a

In both cases, we see a

I would imagine the next step would be to understand why the migration module task is being cancelled.
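To make the pull/push distinction concrete, here is a minimal sketch of how the two COPY requests differ. It only describes the requests and sends nothing. The `CopyRequest` type, the `build_tpc_copy` helper and any URLs passed to it are hypothetical and for illustration only; what comes from HTTP-TPC itself is the `COPY` method and the `Source` (pull) and `Destination` (push) request headers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRequest:
    """Description of an HTTP-TPC COPY request (hypothetical type; nothing is sent)."""
    method: str
    target_url: str  # the endpoint that receives the COPY request
    headers: dict


def build_tpc_copy(source_url: str, dest_url: str, mode: str) -> CopyRequest:
    """Describe the COPY request for a 'pull' or 'push' third-party copy."""
    if mode == "pull":
        # Pull: the destination is contacted and told where to fetch the data from.
        return CopyRequest("COPY", dest_url, {"Source": source_url})
    if mode == "push":
        # Push: the source is contacted and told where to send the data to.
        return CopyRequest("COPY", source_url, {"Destination": dest_url})
    raise ValueError(f"unknown mode: {mode!r}")
```

In the case discussed here, the pull attempt would target the T1D0 (tape) URL and the push attempt would target the T0D1 (disk) URL.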
Hello @paulmillar,

Thank you for your feedback. I have updated the FTS transfer URL with the one related to this transfer.

For more context: the day before, we had a hardware failure affecting the SRM component. I understand that the DAV doors were also restarted later on. However, many transfers kept failing like this one afterwards. I had to restart the DAV doors, and this apparently helped alleviate this type of problem.

What is the expected behavior of the RemoteTransferManager when a failure occurs in SRM?
Hi @cfgamboa,

A couple of quick questions: Which version of dCache is this? Is the problem ongoing, or are transfers succeeding again?

SRM and HTTP-TPC are unrelated; however, both the SRM and WebDAV doors use the

The WebDAV door should recover from

I've created a patch that should update the message

That said, I think the migration task being cancelled is likely a result of some underlying problem, rather than the cause of the transfers failing.

HTH,
Hello @paulmillar,

Which version of dCache is this?

Is the problem ongoing, or are transfers succeeding again?

Indeed, the transfermanagers cell is part of the SRM domain.

Thank you for your feedback and for your work on the patch.

All the best,
Motivation:

We have had reports (see #6557) where a migration job was cancelled; however, the reason the job was cancelled is not clear. Currently, the pool logs only `Task was cancelled`.

Modification:

Update PoolMigrationCancelMessage to include a reason (as a simple String), explaining the motivation behind cancelling the migration job.

The controlling job is updated to populate this explanation. Note that this information is already available if the task is explicitly cancelled (i.e., outside of the FSM).

The target pool is updated to log the explanation from the PoolMigrationCancelMessage if one is provided.

Result:

The pool now provides more information when a migration job was cancelled.

Target: master
Request: 8.0
Request: 7.2
Requires-notes: yes
Requires-book: no
Patch: https://rb.dcache.org/r/13501/
Acked-by: Lea Morschel
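The idea behind the patch can be illustrated with a small sketch. This is not the dCache code (which is Java): `CancelMessage` and `describe_cancellation` are hypothetical stand-ins, and the exact wording of the extended log line is an assumption. Only the `Task was cancelled` text and the notion of an optional String reason come from the commit message above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CancelMessage:
    """Hypothetical stand-in for a cancel message carrying an optional reason."""
    task_id: int
    reason: Optional[str] = None  # None when the sender gave no explanation


def describe_cancellation(msg: CancelMessage) -> str:
    """Build the text a target pool might log on receiving the cancel message."""
    if msg.reason:
        # Assumed format: append the explanation when one is provided.
        return f"Task was cancelled: {msg.reason}"
    # Fall back to the bare message when no reason was supplied.
    return "Task was cancelled"
```

Keeping the reason optional means a sender that does not populate it still produces the original log line.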
Hello all,
we have seen many transfers failing with
rejected PUT: 500 Server Error
https://fts304.usatlas.bnl.gov:8449/var/log/fts3/transfers/2022-03-21/dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov/2022-03-21-1708__dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov__1282623060__d7b80cb2-a905-11ec-8f8e-b49691292ed8
In the billing log the file appears to be available for copy; here the source is PNFSID 0000EEC369BA76464902A124C6842E7DE7CA, file name:
Looking at the billing logs for that transfer
A similar request appears to succeed with
If we take a look at part 1, there is an error logged as {10011:"Task was cancelled."}
Looking at the time when the file is made available in the pool
The PNFSID of the file that failed to be copied in 1. is
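For searching many log lines for this kind of error, a minimal sketch of pulling the code and message out of a `{code:"message"}` fragment is shown here. The `parse_error` helper is hypothetical, and the pattern is an assumption based only on the single fragment quoted above; it does not handle messages that themselves contain double quotes.

```python
import re
from typing import Optional, Tuple

# Matches fragments such as {10011:"Task was cancelled."}
_ERROR_RE = re.compile(r'\{(\d+):"([^"]*)"\}')


def parse_error(fragment: str) -> Optional[Tuple[int, str]]:
    """Return (code, message) for the first error fragment found, else None."""
    match = _ERROR_RE.search(fragment)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```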
0000D1ACF85F8E9148A7875B57F87F66344B
Looking at the billing logs for this pnfsid I see this:
I looked at the logs in the pool dcdoor11_1
And the ones in the pool dc209_11
It seems that, since the file failed to be copied to the pool dcdoor11_1, the P2P copy failed.
We are rather puzzled by this; therefore, we would like to know if you could please take a look at this ticket and provide advice. Is there any other information from other components that could be useful and that we are missing?
Below please also find the log for the door involved in the first failed attempt.
All the best,
Carlos
The log for the door involved in the failing transfer is