
gfal-copy failing but returning status code 0 - leaving a bunch of file mismatch in the system #11556

Closed
vkuznet opened this issue Apr 17, 2023 · 70 comments · Fixed by #11636

@vkuznet
Contributor

vkuznet commented Apr 17, 2023

Impact of the bug
Identify issue with stuck PnR workflows and allow them to progress.

Describe the bug
As was described in PnR#41, we have a certain number of workflows with a file mismatch between DBS and Rucio, and they do not get announced for months. The file mismatch usually happens when there are output files in DBS that are not recorded by Rucio; here is one example:

file: /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root

for WF: pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545. A recent list of workflows with this issue can be found here.

How to reproduce it
Most likely we need to debug the agent logs and the agent's local databases to identify the issue with stuck workflows.

Expected behavior
We should have a file match between DBS and Rucio to allow the workflow to progress.

Additional context and error message

@vkuznet
Contributor Author

vkuznet commented Apr 17, 2023

FYI: @amaltaro , @haozturk

@amaltaro
Contributor

@vkuznet thanks for creating this issue. @haozturk FYI

Here is my suggestion on how to debug it:

  1. pick a workflow reported by Unified (if possible, zoom in on the relevant missing block/file)
  2. using wmstats (or wmstatsserver), check which agents worked on that workflow
  3. connect to that/those agents
  4. grep for the file name in the relevant component log. If the file is missing in DBS, check the DBS3Upload component log; if it's missing in Rucio, check the RucioInjector component log (see the example right after this list).
  5. Note that we do not record the filename in RucioInjector, only the block and how many files are/were attached to it.

Another option is to query the relational database with your preferred SQL client (sqlplus, SQL Developer, etc.). If you use sqlplus, you can type this command in the agent:

$manage db-prompt wmagent

and I think the table that we want to look at is dbsbuffer_file.
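For illustration, a hedged example of such a lookup, assuming an Oracle-backed agent with sqlplus; the column names (lfn, status) are assumptions about the dbsbuffer_file schema and may need adjusting:

$manage db-prompt wmagent
SQL> SELECT lfn, status FROM dbsbuffer_file
     WHERE lfn LIKE '%025d55a6-8e48-4695-9087-80bf9681229b%';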

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

Here is a very simple procedure to check DBS vs Rucio files:

  1. obtain the (CMS) block in question (please note I escape # with %23 in the block name), e.g.
blk=/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271
  2. run the DBS files API to obtain the list of files; here I use the jq tool to parse the JSON output and extract the LFN names:
scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name'
  3. run the Rucio client to obtain the list of files; here is my rucio_curl script:
#!/bin/bash
# rucio_curl: fetch a Rucio URL using an X509 proxy (or user certificate) for authentication

url=$1
account=das
cert=/etc/proxy/proxy
ckey=/etc/proxy/proxy
if [ ! -f $ckey ]; then
    ckey=$HOME/.globus/userkey.pem
    cert=$HOME/.globus/usercert.pem
fi
aurl=https://cms-rucio-auth.cern.ch/auth/x509
if [ -n "`echo $url | grep rucio-int`" ]; then
    aurl=https://cms-rucio-auth-int.cern.ch/auth/x509
fi

opt="-s -L -k --key $ckey --cert $cert"
token=`curl $opt -v -H "X-Rucio-Account: $account" $aurl 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g"`
curl $opt -H "X-Rucio-Auth-Token: $token" -H "X-Rucio-Account: $account" "$url"

and here is how I run it:

rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name'

So, if you run both and count the number of files, you will see the difference, e.g.

# create dbs output
scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name' | sort -u > /tmp/dbs.blocks

# create rucio output
rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name' | sort -u > /tmp/rucio.blocks

# perform comparison:
diff -u /tmp/dbs.blocks /tmp/rucio.blocks
--- /tmp/dbs.blocks	2023-04-18 09:18:34.000000000 -0400
+++ /tmp/rucio.blocks	2023-04-18 09:18:26.000000000 -0400
@@ -1,4 +1,3 @@
-"/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0c1860fd-a94a-4ec7-843b-453c3a4bec3e.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/10f22a56-6424-4559-95d0-c505ddefc7b0.root"

The missing LFN with hash 025d55a6-8e48-4695-9087-80bf9681229b does not exist in Rucio.

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

I communicated with @ericvaandering who gave me the clue. According to Rucio, the file which appeared to be missing in Rucio does not have an associated block (i.e. it is not attached to the block).

# here is an LFN which is attached to the CMS block (Rucio dataset)
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root
+------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                          | [DID TYPE]   |
|------------------------------------------------------------------------------------+--------------|
| cms:/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO#a63aafe6-480d-45d8-9d83-668e6172c271   | DATASET      |
+------------------------------------------------------------------------------------+--------------+

# while here is an LFN which has no attachment (the command returns nothing)
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root

Then I checked the LFN status in DBS and found the same confirmation: the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root has a valid status, while the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root is invalid.

My conclusion is that invalidated files do not appear in Rucio.
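For reference, a hedged way to check that validity flag directly via the DBS files API (detail=1 should expose the is_file_valid field; scurl is the curl alias used above):

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?logical_file_name=/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root&detail=1" | jq '.[].is_file_valid'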

@vkuznet vkuznet self-assigned this Apr 18, 2023
@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

I further investigated the Unified code base and found a particular flaw in their function usage, which depends on an internal cache. Please see the full details here: https://gitlab.cern.ch/CMSToolsIntegration/operations/-/issues/41#note_6636304. In summary, it seems to me that Unified stores DBS results in a cache used in multiple places, and when it finally compares the DBS vs Rucio file lists, some files may have been invalidated by the underlying workflow or data-ops, leading to a file mismatch.

I'll be happy to check another block when PnR provides one, but at the moment I do not see any issues with WMCore/agent/RucioInjector.

@amaltaro
Contributor

@vkuznet Valentin, regardless of a stale cache or not in Unified, didn't you confirm that there actually is a file in Rucio which has not been attached to its parent block?
If this is true, then there must be a flaw as well, either in:

  • WMCore (the RucioInjector component), or
  • the Rucio server itself.

BTW, here is our Rucio implementation for adding a replica to Rucio server plus attaching it to a block:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L381

We might have to chase this file/block in the agent logs to see if any error/exception was raised when dealing with it in RucioInjector.

@ericvaandering
Member

Unless "invalidation" (who/where does that) is not deleting the Rucio DID but only detaching it.

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

Alan, how can I find which agent was used for an archived workflow? I looked up the one provided in the description and found it here: https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545, but I can't find it in wmstats using this WF name.

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

Meanwhile, according to the WMCore RucioInjector code base https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L404-L414, and using logs from one of the CERN agents (vocms0255), I do see a few errors like this:

2023-02-28 17:28:25,353:140522088707840:ERROR:Rucio:Failed to add replicas for: [{'name': '/store/mc/RunIISummer20UL16RECO/WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/2550000/A9995941-D344-2447-AEC2-A4B0FF2F1810.root', 'scope': 'cms', 'bytes': 1744683405, 'state': 'A', 'adler32': '37f4804d'}] and block: /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a. Error: An unknown exception occurred

which suggests that RucioInjector can fail and the reason is unknown (or not propagated well from the Rucio client). That said, we may perform an inverse look-up and check whether PnR has a file mismatch in this block /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a, or in any other block we can extract from the agent logs (see the sketch below).
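A minimal sketch for extracting the affected block names from such errors, assuming the log line format shown above and the component log path used elsewhere in this thread:

# list blocks that failed replica injection, based on the "Failed to add replicas" error format
grep 'Failed to add replicas' install/wmagentpy3/RucioInjector/ComponentLog \
  | sed -e 's/.*and block: //' -e 's/\. Error:.*//' \
  | sort -u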

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

@amaltaro , another suggestion: by inspecting the reqmgr2 logDB, I found that for a given WF we do have pretty extensive JSON meta-data, but it lacks one particularly important element: the name of the agent which processed this WF. Should we open an issue for that to make debugging easier?

@amaltaro
Contributor

The only way to know which agents worked on a given request is through wmstats; at the same time, wmstats only contains data for active workflows. The workflow you are looking at has already been archived (normal-archived), so we cannot easily know which agents processed it. I think Hasan mentioned many workflows in the same situation, so I'd suggest picking one or two that ran in the past weeks (and that are not yet archived).

This message that you pasted, "Error: An unknown exception occurred", suggests that no error details were returned from the Rucio server. Still, WMAgent is stateful and it will keep retrying that replica injection + attachment in the coming component cycles.

Regarding the reqmgr2 LogDB, I feel the way it is used is sub-optimal, and from time to time the database gets slower and queries might time out. I believe it is also meant to be a temporary database only for active workflows. We might have to revisit its usage and how data is organized in it before expanding it with further use-cases. But I agree that the agent information is very much needed!

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2023

@amaltaro , while awaiting active workflows from PnR, I inspected the list they provided in their ticket and found none of them available in wmstats; either they no longer exist or they are completed. That said, I also inspected the WMCore Rucio.py code base and would like to provide general feedback on this issue:

  • the debugging process seems quite complicated, since we do not record the agent in the WF JSON files; therefore, for completed or archived workflows we have no record of which agent was used
  • the function createReplicas in Rucio.py does not return a concrete status, only success or failure, and only raises an exception. Therefore, it is not feasible to trace down what went wrong with the Rucio injection. I suggest modifying this function to return a JSON record, instead of propagating a traceback to the upstream code, such that the error record can be added to the WF JSON (FWJR) and represent the actual error, which will ease debugging of such issues. A concrete example is the DBS server, which returns the full error along with its error code; the error includes the function, the database error, and auxiliary information. We need a similar concept here. To make it workable it will require:
    • adjusting createReplicas to return such an error, including the Rucio error info, the actual traceback, and the input parameters
    • adjusting the code which uses this API, e.g. MSPileup or MSTransferor
    • capturing such error records and adding them to the FWJR, so that we can have the full picture without going through a debugging process.

These are suggestions independent of this issue; you may consider filing them as a GH feature issue for future development.

@amaltaro
Contributor

Thanks Valentin.
Then we need to wait for P&R to provide a workflow that has this issue and which is still active.

Regarding the Rucio.py implementation, we need to check what is actually returned from the server - through the Rucio client - to see which information we can pass upstream.

Another option would be to actually separate the "rucio replica injection" from the "rucio replica attachment", making them two different methods such that we can properly see what the outcome of each one is.

Note that createReplicas is only used by RucioInjector, given that it is the only component registering DID in Rucio.

@z4027163

z4027163 commented May 2, 2023

Since the P&R ticket has restricted access, I am pasting what we have from the other ticket.

Valentin checked this workflow: pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115, which completed on the FNAL submit8 agent.

An example of a file that is in DBS but not in rucio: /store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/aede32b8-d95a-44f0-aa06-2f21dbda371b.root

The corresponding block is: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

@z4027163

z4027163 commented May 2, 2023

Alan did check the workflow which produced the MINIAODSIM above, which was: pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115

Looking into the output MINIAODSIM dataset, here is what one of the WMAgent utilitarian tools says:

cmst1@vocms0255:/data/srv/wmagent/current $ python3 apps/wmagentpy3/bin/adhoc-scripts/checkDsetFileCount.py /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
...
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: *** Dataset: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Rucio file count : 1974
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: DBS file count   : 2016
2023-04-25 21:11:27,899:INFO:checkDsetFileCount:  - valid files   : 2006
2023-04-25 21:11:27,899:INFO:checkDsetFileCount:  - invalid files : 10
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in Rucio but not in DBS: set()
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in DBS but not in Rucio: set()
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: 	Rucio: 0		DBS: 41
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: 	Rucio: 7		DBS: 8

so one of the blocks is said not to have any files in Rucio, while DBS has 41 files, confirmed with: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filesummaries?block_name=/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM%232ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

If I grep the submit8 RucioInjector logs, I can find that this block was reported as being successfully injected into Rucio back in January (and it reports the very same 41 files):

[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/DBS3Upload/ComponentLog 
2023-01-08 07:10:07,090:140090107418368:INFO:DBSUploadPoller:Queueing block for insertion: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-08 07:10:07,110:140090107418368:INFO:DBSUploadPoller:About to call insert block for: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
[cmsdataops@cmsgwms-submit8 current]$ 
[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/RucioInjector/ComponentLog 
2023-01-06 07:11:17,939:140090104010496:INFO:RucioInjectorPoller:Block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf inserted into Rucio
2023-01-06 07:11:40,684:140090104010496:INFO:RucioInjectorPoller:Successfully inserted 41 files on block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-06 07:12:17,958:140090104010496:INFO:RucioInjectorPoller:Block rule created for block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf, at: T2_US_Florida, with rule id: d6326e6f122540b9b8f92e07b5912b03
2023-01-08 07:11:53,813:140090104010496:INFO:RucioInjectorPoller:Closing block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

Alan still needs to look into the Rucio server to see if it actually knows any of the DIDs (files) in this block. To be done next.
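For reference, a hedged sketch of how that check could be done with the Rucio client (using commands already shown earlier in this thread; not necessarily the exact check Alan has in mind):

# list the file DIDs currently attached to the block
rucio list-content "cms:/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf"
# check whether a given file DID exists and what (if anything) it is attached to
rucio list-parent-dids "cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/aede32b8-d95a-44f0-aa06-2f21dbda371b.root"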

@z4027163

z4027163 commented May 2, 2023

From Alan:

FYI. I asked for feedback in the #DM Ops Mattermost channel and Jhonatan seemed to have found all those files in Rucio: https://mattermost.web.cern.ch/cms-o-and-c/pl/4mg45jykbbn8jdbqxydn64f58y

I still have follow up questions and we still need to understand what that means and why that happened. Anyhow, feel free to follow this up over there as well.

@amaltaro
Contributor

amaltaro commented May 2, 2023

@z4027163 thank you for updating this ticket! Sorry that I had to miss the DM meeting; is there anything that we (WM) are expected to do next? Or what is the next step to properly understand what happened here?

@z4027163

z4027163 commented May 2, 2023

@amaltaro They are aware of the issue, but we couldn't figure out the exact cause in the meeting. Eric/Diago will check the ticket for more details. We are waiting for their feedback; maybe there will be follow-ups to do from the WMCore side after that.

@ericvaandering
Member

I think we said we'd monitor the ticket, not that there was anything I could look into. Jhonathan's posts in MM show that these file DIDs are getting created in Rucio, right? And they are being attached to the block DIDs. So the mystery is how they end up detached.

@z4027163

z4027163 commented May 2, 2023

@ericvaandering Ah yes, I misunderstood. I thought you guys would be the right people to check it. Who should we ask to look into it? I had the impression it was outside wmcore's reach.

@ericvaandering
Member

Well, I did just take a look at the Rucio code to see WHEN files are detached from blocks. It looks like it can happen if a file is declared bad AND it is the last copy of that file: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932

@ivmfnal would it be possible to search the consistency checking logs for some of these files (is there a sample list somewhere?) to see if this is what's happening?

@vkuznet
Contributor Author

vkuznet commented May 2, 2023

@ericvaandering , I certainly do not know the details of Rucio, but looking at the code I find the following block suspicious:

def update_rules_for_lost_replica(scope, name, rse_id, nowait=False, session=None, logger=logging.log):
...
    for dts in datasets:
        logger(logging.INFO, 'File %s:%s bad at site %s is completely lost from dataset %s:%s. Will be marked as LOST and detached', scope, name, rse, dts['scope'], dts['name'])
        rucio.core.did.detach_dids(scope=dts['scope'], name=dts['name'], dids=[{'scope': scope, 'name': name}], session=session)

Here we have two parameters: name, which is passed to the API and is, I think, the LFN name, and dts['name'], which I would assume is the name of the dataset/block. So, is it possible that the LFN belongs to different datasets/blocks and can be discarded from the wrong dataset/block? Or is it possible that we pass an LFN from a CMS dataset but it is present in multiple CMS blocks, and therefore, if it is discarded from a single CMS block, it will be LOST for the entire dataset?

@vkuznet
Contributor Author

vkuznet commented May 24, 2023

Alan, I confirm that I too see such errors:

gfal-copy error: 5 (Input/output error) - DESTINATION OVERWRITE   Result HTTP 500 : Unexpected server error: 500  after 1 attempts
/srv/startup_environment.sh: line 9: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 16: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 35: EUID: readonly variable
/srv/startup_environment.sh: line 156: PPID: readonly variable
/srv/startup_environment.sh: line 163: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 181: UID: readonly variable
gfal-rm error: 5 (Input/output error) - Result HTTP 500 : Unexpected server error: 500  after 1 attempts

in the agents for the cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294 workflow. It is a very common error which appears in many logs across all agents. For instance, on vocms0283:/data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294/Job_294596.tar.bz2

@vkuznet
Contributor Author

vkuznet commented May 24, 2023

I scanned a few agents and I do see HTTP 500 errors in all of them. I copied a few log files to /afs/cern.ch/user/v/valya/public/WMCore if anyone would like to look, but I observed many stage out failures (not only related to the HTTP 500 error) and they are quite frequent. The errors related to the 500 message come from gfal calls. Based on this observation, it may be worth re-thinking how we handle such errors so that they are easier to access. For instance, we could introduce code which sends error messages to MONIT via AMQ brokers, such that later someone can view them in ES/OpenSearch. Such an error message may contain the block names, LFN, agent, its log, the error message, etc. Then the look-up of errors can be simplified by searching ES/OpenSearch, and moreover we can provide a corresponding dashboard for easy navigation.

@amaltaro
Contributor

amaltaro commented May 24, 2023

@vkuznet Valentin, having stage out errors is normal and expected. What is not expected is to get conflicting status codes out of the stage out command. From the example I provided above, this:

gfal-copy exit status: 0
gfal-copy error: 256

is bad and needs to be understood.

I would suggest you look into the command constructed for the gfal-based stage out and try to understand how this can happen. We might even have to pull the relevant singularity image and try to reproduce it, but I hope we don't need to go that far.

@vkuznet
Contributor Author

vkuznet commented May 24, 2023

Well, according to https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/Backends/GFAL2Impl.py#L122, the "gfal-copy exit status" message is added by the shell script which executes the gfal-copy command, with extra clean-up on a non-zero exit. Therefore, the two messages come from different places:

  • the "gfal-copy exit status" line is generated by the shell script which executes gfal-copy, but the exit code is likely not properly captured by that script, and
  • the "gfal-copy error" line comes from the gfal tool itself.

In other words, both come from a single shell script but, in my view, the first exit code is not properly captured by the shell script (it is simply echo output), while the second one is the real failure code of gfal-copy.

@vkuznet
Contributor Author

vkuznet commented May 30, 2023

I reconstructed the shell script around the gfal commands and it looks like this:

#!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -K adler32  sourcePFN targetPFN'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
   echo "ERROR: gfal-copy exited with $EXIT_STATUS"
   echo "Cleaning up failed file:"
   env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 targetPFN '
fi
exit $EXIT_STATUS

where sourcePFN and targetPFN represent actual PFNs, like file:///..... If you look closely at the script logic, EXIT_STATUS captures the status of the gfal-copy command. We may try to test this shell script with actual PFNs at one of the failed sites and see what output we get.

@vkuznet
Contributor Author

vkuznet commented May 31, 2023

Does anyone know how we can emulate the file transfer on a site using the script I posted earlier:

  • I think we need to identify and use one site, e.g. T2_IT_Rome
  • we need to contact the local site admins to try out the gfal script
    • I do not know how to find the proper people at a given site
    • should we try to transfer one of the real PFNs, or can any file work to test the gfal script?

Finally, if we can't reproduce the issue with the aforementioned script, I suggest adding to it a call that verifies the transfer, e.g. gfal-ls (if it exists) or similar, to get the file meta-data (size, checksum, etc.); see the sketch below.
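A minimal sketch of such a verification step, assuming the same environment setup as the stage out script above (gfal-ls does exist in gfal2-util):

# verify the target PFN is visible after gfal-copy, independent of the gfal-copy exit code
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; gfal-ls -l targetPFN'
LS_STATUS=$?
echo "gfal-ls exit status: $LS_STATUS"
if [[ $LS_STATUS -ne 0 ]]; then
   echo "ERROR: targetPFN not found after gfal-copy reported success"
   exit 1
fi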

@vkuznet
Contributor Author

vkuznet commented May 31, 2023

In PR #11601 I provided an additional check of the source and target PFN checksums via the gfal-sum tool. I think it is worth having it in our codebase to simplify debugging of such use-cases.
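For illustration, a hedged sketch of such a checksum comparison (not the exact PR #11601 implementation; gfal-sum prints the path followed by the checksum):

# compare ADLER32 checksums of source and target PFNs
src_sum=$(gfal-sum sourcePFN ADLER32 | awk '{print $2}')
dst_sum=$(gfal-sum targetPFN ADLER32 | awk '{print $2}')
if [[ "$src_sum" != "$dst_sum" ]]; then
   echo "ERROR: adler32 mismatch: source=$src_sum target=$dst_sum"
   exit 1
fi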

@amaltaro
Contributor

Valentin, from this gfal-copy manpage:
https://www.mankier.com/1/gfal-copy

I understand that the option -K adler32 already calculates the adler32 checksum of the local and remote file and compares them at the end of the data transfer, confirming whether the data transfer was successful. This can be disabled with an extra option, --just-copy, but the T2_IT_Rome SITECONF does not pass this argument in.

I am also starting to consider that it might be helpful to make a bug report in their (CERN) repository:
https://gitlab.cern.ch/dmc/gfal2-util

On how we could test and/or try to reproduce it, I think we would need to:

  1. use the same singularity image used by the job itself (apparently this: GWMS_SINGULARITY_IMAGE_HUMAN=/cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8)
  2. using the bash script above, test writing a dummy file to the storage
  3. ideally this should be executed in Rome, but I guess the only way to do that would be to reach out to the site admin.

@todor-ivanov @khurtado perhaps you have another suggestion on how we can test it?

@amaltaro
Contributor

BTW, we should definitely add a gfal-copy --version to the beginning of the stage out bash script.

@vkuznet
Contributor Author

vkuznet commented Jun 1, 2023

I created the GGUS ticket https://ggus.eu/?mode=ticket_info&ticket_id=162181&come_from=submit and requested a manual file transfer using our script at the T2_IT_Rome site.

@dynamic-entropy

Hi @vkuznet @amaltaro
Can you guys let me know if you are going ahead with creating a service for DBS invalidation of lost files in Rucio?

@amaltaro
Contributor

amaltaro commented Jun 9, 2023

@dynamic-entropy Hi Rahul, Valentin is on vacation until the middle of next week, so let me reply to the best of my knowledge.

We are still trying to understand why the gfal-copy command returns an exit code 0 while internally raising an error. I believe this is the main root cause of these file mismatches, and ideally we should resolve it there.

Nonetheless, I still think a service that keeps data invalidation in sync between Rucio and DBS will be important, but right now it's unclear whose scope that falls under.

@ericvaandering
Member

Right. I believe Kevin's statement was more or less "We have limited development effort, so the first focus should be on fixing the problem if we understand it"

@amaltaro
Contributor

Just for the record, the current list of T1/T2 sites using gfal2 grid utils for stage out is:

['T1_DE_KIT', 'T1_ES_PIC', 'T1_FR_CCIN2P3', 'T1_IT_CNAF', 'T1_RU_JINR', 'T1_UK_RAL', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_BR_UERJ', 'T2_CH_CSCS', 'T2_CN_Beijing', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_ES_IFCA', 'T2_FI_HIP', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 'T2_FR_IPHC', 'T2_HU_Budapest', 'T2_IN_TIFR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PK_NCP', 'T2_PL_Cyfronet', 'T2_PL_Swierk', 'T2_PT_NCG_Lisbon', 'T2_RU_IHEP', 'T2_RU_INR', 'T2_RU_ITEP', 'T2_RU_JINR', 'T2_TR_METU', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_Brunel', 'T2_UK_London_IC', 'T2_UK_SGrid_Bristol', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 'T2_US_Florida', 'T2_US_MIT', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Vanderbilt', 'T2_US_Wisconsin']

@amaltaro
Contributor

I also took the opportunity and contacted the CERN gfal-util experts with this GitLab issue:
https://gitlab.cern.ch/dmc/gfal2-util/-/issues/2

@belforte
Member

belforte commented Nov 30, 2023

@amaltaro we also just got a report from a CRAB user: gfal-copy returned exit code 0, but the file was not copied, and stderr says

gfal-copy error: 256 (Unknown error 256) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 500

i.e. what you discussed above in #11556 (comment) and below.

The issue in the gfal repo, https://gitlab.cern.ch/dmc/gfal2-util/-/issues/2, is dormant (no reply from Mihai in 4 months).

Do you still see this error? Is the code catching it or doing something about it? I looked at the PRs mentioned above, but they do not look like a solution.
We share WMCore's stageout plugins in CRAB.

I'd like to add Stephan Lammel as FYI, but I cannot find his GH handle.

@stlammel

Alan pointed me to the gfal issue a while ago and I then investigated/found the issue.
"stlammel" subscribed.

  • Stephan

@amaltaro
Contributor

@belforte Hi Stefano, I haven't heard any reports from the P&R team on this for at least a couple of months. Nonetheless, that does not mean it no longer happens in production; it could be that they are simply going for data invalidation without reporting it. @haozturk (and Hassan, I don't remember his GH user) are you still seeing Rucio vs DBS inconsistencies? Is there any automation to "overcome" those on the Unified side?

@stlammel I had to re-read the GFAL2 ticket and my understanding is that the suspected change is on the gfal2-util side. If so, do you think you (or Stefano?) could get a hold of Mihai?

@stlammel

Hallo Alan,
I can talk with Mihai and push this / make sure a fix is included in the next version.

  • Stephan

@belforte
Member

This error is difficult to catch unless you grep for it in the logs. Indeed, P&R could spot it as "rules to tape stay stuck/suspended", but if they look and find the file missing on disk, they have no way to know why. And tracking it back to the WMA logs may not be easy.

Even if Mihai fixes the code, I propose to stay on the safe side and add a gfal-ls -l to make sure that the file is there with the correct size. Computing the checksum takes time and may be overkill.

@amaltaro
Contributor

@belforte checksum calculation has already been enabled by default, but it looks like enabling it in the gfal-copy command is not enough to catch these unreported "under the hood" failures.
Perhaps a gfal-ls after a successful gfal-copy command would be enough to catch such cases and fail the stage out in place.

@belforte
Member

I guess that gfal-copy still fails with HTTP 500 and return code 256, and we are back in the same situation.
If we can't trust the gfal-ls exit code, we need a check after it completes. I'd rather not parse its stdout/stderr; that would still make me feel uneasy.

@amaltaro amaltaro changed the title File mismatch investigation gfal-copy failing but returning status code 0 - leaving a bunch of file mismatch in the system Feb 6, 2024