Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transfers failing with rejected PUT: 500 Server Error #6557

Closed
cfgamboa opened this issue Mar 21, 2022 · 4 comments
Closed

transfers failing with rejected PUT: 500 Server Error #6557

cfgamboa opened this issue Mar 21, 2022 · 4 comments

Comments

@cfgamboa
Copy link

cfgamboa commented Mar 21, 2022

Hello all,

we have seen many transfers failing with rejected PUT: 500 Server Error
https://fts304.usatlas.bnl.gov:8449/var/log/fts3/transfers/2022-03-21/dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov/2022-03-21-1708__dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov__1282623060__d7b80cb2-a905-11ec-8f8e-b49691292ed8

At the billing log the file appears to be available for copy, here the source is PNSFID 0000EEC369BA76464902A124C6842E7DE7CA, file name:

/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data

Looking at the Billinglogs for that transfer

  1. Not working transfer
grep data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data billing-2022.03.21

03.21 17:08:42 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 5 0 {10011:"Task was cancelled."}

03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"}

03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"}

03.21 17:08:55 [door:WebDAV2-dcdoor08-2@webdav2-dcdoor08_httpsDomain:request] ["usatlas1":6435:31152:10.42.38.52] [0000F1646266F48845FE8BBB730D97F75E69,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 15 0 {10011:"Task was cancelled."}

03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}

03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}

03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000EEC369BA76464902A124C6842E7DE7CA,0] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 12275 12216 {10027:"rejected PUT: 500 Server Error"}
  1. working transfer
    Similar request appears to succeed with
03.21 18:00:34 [door:WebDAV-dcdoor18-2@webdav-dcdoor18_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 17846 0 {0:""}
03.21 18:00:36 [pool:dcdoor11_1:transfer] [0000B8E98835C401469584C9A75D2D38F865,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 2621829888 17902 true {RemoteHttpsDataTransfer-1.1:https://dcdoor18.usatlas.bnl.gov:2881/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647900017018-168986] {0:""}
03.21 18:00:17 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000B8E98835C401469584C9A75D2D38F865,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 20522 38 {0:""}

If we take a look at part 1. There is an error logged as {10011:"Task was cancelled."}
Looking at the time when the file is made available in. the pool

[root@dc209 dcache]# ls -l /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA
-rw-r--r-- 1 root root 2621829888 Mar 21 17:05 /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA

The PNFSID of the file tha failed to be copy in 1. is 0000D1ACF85F8E9148A7875B57F87F66344B

Looking at the billing logs for this pnfsid I see this:

[root@dcmaint02 03]# grep 0000D1ACF85F8E9148A7875B57F87F66344B billing-2022.03.21
03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"}
03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"}
03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm 0 166 false {Http-1.1:10.42.38.13:0:dc209_11:dc209elevenDomain:/0000D1ACF85F8E9148A7875B57F87F66344B} [pool:dc209_11@dc209elevenDomain] {0:""}
03.21 17:08:42 [pool:dc209_11@dc209elevenDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"Transfer failed and replica is FROM_POOL"}
03.21 17:08:43 [pool:dcdoor11_1@dcdoor11oneDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"cleaner@cleanerDomain [PoolRemoveFiles]"}

I looked at the logs in the, pool: dcdoor11_1

root@dcdoor11 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B dcdoor11oneDomain.log
21 Mar 2022 17:08:42 (dcdoor11_1) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA RemoteTransferManager PoolAcceptFile 0000D1ACF85F8E9148A7875B57F87F66344B] Transfer failed: rejected GET: 500 Server Error

And the ones in the pool dc209_11

[root@dc209 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B *
grep: check_pool_logs: Is a directory
dc209elevenDomain.log:21 Mar 2022 17:08:43 (dc209_11) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA dcdoor11_1 HttpDoorUrlInfo 0000D1ACF85F8E9148A7875B57F87F66344B] P2P for 0000D1ACF85F8E9148A7875B57F87F66344B failed: CacheException(rc=10001;msg=No such file or directory: 0000D1ACF85F8E9148A7875B57F87F66344B)

It seems that since the file failed to be copied to the pool dcdoor11_1 the P2P copy failed.

We are rather puzzled about therefore we would like to know if you could please take a look at this ticket and provide advice as possible. Is there any other information from components that could be use I we are missing?

Below please also find the log for the door involve in the first failure attempt.

All the best,
Carlos

The log for the door involved in the failing transfer is

[root@dcdoor12 dcache]# zgrep data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data webdav2-dcdoor12_httpsDomain.*
webdav2-dcdoor12_httpsDomain.access:level=INFO ts=2022-03-21T17:08:42.617-0400 event=org.dcache.webdav.request request.method=POST request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=200 socket.remote=[2620:0:210:1::1c]:54658 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907 request.macaroon-request="{\"caveats\": [\"activity:LIST,DOWNLOAD\"], \"validity\": \"PT196M\"}" response.macaroon-id=61xQF3bx:yJhMgeIQ duration=0
webdav2-dcdoor12_httpsDomain.access:level=ERROR ts=2022-03-21T17:08:42.714-0400 event=org.dcache.webdav.request request.method=GET request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=500 socket.remote=192.12.15.232:38288 user-agent=dCache/7.2.7 user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] transaction=door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXawOa0RJg:1647896922709000 duration=11
webdav2-dcdoor12_httpsDomain.access:level=WARN ts=2022-03-21T17:08:55.129-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=202 socket.remote=[2620:0:210:1::1c]:54668 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] tpc.credential=none tpc.error="rejected PUT: 500 Server Error" tpc.require-checksum=false tpc.destination=https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data duration=12287
@paulmillar
Copy link
Member

Hi @cfgamboa ,

So, at a high level, what's happening here is ATLAS wants to copy a file from T0D1 (i.e., disk) storage to T1D0 (i.e., tape) storage. To do this, they are trying to use HTTP-TPC, with both endpoints being at BNL.

There are two attempts.

The first attempt is a pull-request, so the COPY request is to the T1D0 url, with a Source header that points to the T0D1 url. This HTTP-TPC request fails with the message rejected GET: 500 Server Error.

The second attempt is a push request, so the COPY request targets the T0D1 url, with a Destination header that points to the T1D0 url. This HTTP-TPC request fails with the message rejected PUT: 500 Server Error.

In both cases, we see a Task was cancelled message, but from different pools. This message comes from the migration module on a pool, but unfortunately we don't log why this migration is being cancelled.

I would imagine the next step would be to understand why the migration module task is being cancelled.

@cfgamboa
Copy link
Author

Hello @paulmillar

Thank you for your feedback. I have updated the FTS transfer URL with the one related to this transfer.

More contextual information is that the day before we had a hardware failure affecting the SRM component. I understand that the DAV doors were restarted as well later on. However, there were many transfers failing like this one afterwards. I had to restart the Dav doors and apparently this helped on alleviating this type of problem.

What is the expected behavior of the RemoteTransferManager per failure occurrence in SRM?

@paulmillar
Copy link
Member

Hi @cfgamboa ,

A couple of quick questions:

Which version of dCache is this?

Is the problem ongoing, or are transfers succeeding again?

SRM and HTTP-TPC are unrelated; however, both the SRM and WebDAV doors use the transfermanagers service to orchestrate third-party transfers (SRM third-party transfers are no longer used by anyone). I believe sites often run the transfermanagers service close to the srmmanager service (e.g., in the same domain), so a hardware failure may have affected both.

The WebDAV door should recover from transfermanagers service being restarted, but this has been problematic in the past.

I've created a patch that should update the message Task was cancelled to include the reason why the task was cancelled. Once this patch passes code review and is part of a bug-fix release, you would need to update the pools to benefit from the change.

That said, I think the migration task being cancelled is likely a result of some underlying problem, rather than the cause of the transfers failing.

HTH,
Paul.

@cfgamboa
Copy link
Author

Hello @paulmillar

Which version of dCache is this?
dcache-7.2.7 for doors and SRM, other components are 7.2.3

Is the problem ongoing, or are transfers succeeding again?
Yes after restarting the doors the transfers succeeding again.

Indeed, the transfermanagers is part cell resource in the SRM domain.

Thank you for your feedback and work for in the patch.

All the best,
Carlos

paulmillar added a commit that referenced this issue Mar 31, 2022
Motivation:

We have had reports (see #6557) where a migration job was cancelled;
however, the reason the job was cancelled is not clear.  Currently, the
pool logs only `Task was cancelled`.

Modification:

Update PoolMigrationCancelMessage to include a reason (as a simple
String), explaining the motivation behind cancelling the migration job.

The controlling job is updated to populate this explanation.  Note that
this information is already available if the task is explicitly
cancelled (i.e., outside of the FSM).

The target pool is updated to log the explanation from the
PoolMigrationCancelMessage if one is provided.

Result:

The pool now provides more information when a migration job was
cancelled.

Target: master
Request: 8.0
Request: 7.2
Requires-notes: yes
Requires-book: no
Patch: https://rb.dcache.org/r/13501/
Acked-by: Lea Morschel
paulmillar added a commit to paulmillar/dcache that referenced this issue Mar 31, 2022
Motivation:

We have had reports (see dCache#6557) where a migration job was cancelled;
however, the reason the job was cancelled is not clear.  Currently, the
pool logs only `Task was cancelled`.

Modification:

Update PoolMigrationCancelMessage to include a reason (as a simple
String), explaining the motivation behind cancelling the migration job.

The controlling job is updated to populate this explanation.  Note that
this information is already available if the task is explicitly
cancelled (i.e., outside of the FSM).

The target pool is updated to log the explanation from the
PoolMigrationCancelMessage if one is provided.

Result:

The pool now provides more information when a migration job was
cancelled.

Target: master
Request: 8.0
Request: 7.2
Requires-notes: yes
Requires-book: no
Patch: https://rb.dcache.org/r/13501/
Acked-by: Lea Morschel
paulmillar added a commit to paulmillar/dcache that referenced this issue Mar 31, 2022
Motivation:

We have had reports (see dCache#6557) where a migration job was cancelled;
however, the reason the job was cancelled is not clear.  Currently, the
pool logs only `Task was cancelled`.

Modification:

Update PoolMigrationCancelMessage to include a reason (as a simple
String), explaining the motivation behind cancelling the migration job.

The controlling job is updated to populate this explanation.  Note that
this information is already available if the task is explicitly
cancelled (i.e., outside of the FSM).

The target pool is updated to log the explanation from the
PoolMigrationCancelMessage if one is provided.

Result:

The pool now provides more information when a migration job was
cancelled.

Target: master
Request: 8.0
Request: 7.2
Requires-notes: yes
Requires-book: no
Patch: https://rb.dcache.org/r/13501/
Acked-by: Lea Morschel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants