Allow numbers in the data_tier_name regular expression #108

Open
todor-ivanov opened this issue Mar 20, 2024 · 5 comments · May be fixed by #109
@todor-ivanov

This issue is a followup of dmwm/WMCore#11931 tracking the change on the DBS side.

@todor-ivanov todor-ivanov changed the title Change the data_tier_name regular expression in the DBS lexicon Allow numbers in the data_tier_name regular expression Mar 20, 2024
@todor-ivanov todor-ivanov self-assigned this Apr 5, 2024
@todor-ivanov
Author

todor-ivanov commented Apr 5, 2024

Ok now,

Since @germanfgv and I have already worked through several steps on this issue in iterations, in order to provide this functionality for the T0 replays, I'd like to record the activity so far:

  • The change from PR Change the data_tier_name relgexp in the lexicon to allow numbers. #109 is not going to fix the issue in the server itself; it only affects the regular expressions used in the server's local tests
  • The DBS servers in Kubernetes usually do not start with the static lexicon configuration; they instead take it from services_config - here is the startup log for one of them:
$ kubectl  -n dbs logs dbs2go-global-w-79fd47d745-jdfrm
Defaulted container "dbs2go-global-w" out of: dbs2go-global-w, dbs2go-global-w-filebeat
TNS_ADMIN=/etc/tnsnames.ora
start with /etc/secrets/dbsconfig.json
$ grep lexicon /etc/secrets/dbsconfig.json
    "lexicon_file": "/etc/secrets/lexicon.json"
  • I have reconfigured all the DBS instances in preprod, changing the respective regexp for the data_tier fields: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/246/diffs#8cf13424fa3ae3173e9252fa46377dbc7bd6cd91
  • Once this was done, @germanfgv injected a replay campaign in one T0 agent, and we stumbled on a concurrency issue which was previously reported by @vkuznet here: Eliminate racing conditions in DBS data injection procedure WMCore#11106
  • The origin of this is the fact that record fields exist at two different levels: those common to all records under the same block, like data_tier, PD, etc. (as Valentin explains in the WMCore issue), and those specific to the file level. When a bulk insertion happens, this leads to breaking the Oracle database unique constraint and throwing the ORA-00001 error.
  • Due to the concurrent nature of the issue on the client side, a separate error handler was created on the DBS side which throws a concurrency error to the clients; this is exactly what the clients see in the end, but at the bottom still sits a unique-constraint violation from Oracle.
  • The errors the T0 team was reporting were:
2024-04-04 13:05:58,020:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-04 13:05:58,094:140283221800704:ERROR:DBSUploadPoller:Error trying to process block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac through DBS. Details: DBSError code: 110, message: 6d0f69d1208bcc4d2f65cdc5abafe8c5196e64d1e1d695a5348eb382403935a2 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
  • The logs on the DBS side for a similar record were showing the ORA-00001 error:
[2024-04-03 00:01:17.850781226 +0000 UTC m=+456728.830418024] files.go:288: Insert Files file_id=3084044237 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/MinimumBias/DQMIO/PromptReco-v28184557/000/377/880/00000/477FFB68-727F-45BF-81F2-1D60C8B3CC54.root
[2024-04-03 00:01:17.851609624 +0000 UTC m=+456728.831246425] blocks.go:181: Insert Blocks
INSERT INTO cms_dbs3_k8s_global_owner.BLOCKS
    (block_id,block_name,dataset_id,
     open_for_writing,origin_site_name,
     block_size,file_count,
     creation_date,create_by,
     last_modification_date,last_modified_by)
    VALUES
    (:block_id,:block_name,:dataset_id,
     :open_for_writing,:origin_site_name,
     :block_size,:file_count,
     :creation_date,:create_by,
     :last_modification_date,:last_modified_by)

&{BLOCK_ID:32917742 BLOCK_NAME:/HLTPhysics/Tier0_REPLAY_2024-PromptReco-v28184557/AOD#aaecec0b-d16d-4c6e-9027-5920685dc53e DATASET_ID:14839867 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:2079405 FILE_COUNT:1 CREATION_DATE:1711650349 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1711650349 LAST_MODIFIED_BY:T0Replay}
[2024-04-03 00:01:17.853299409 +0000 UTC m=+456728.832936205] files.go:312: unable to insert files, error ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_FL_LOGICAL_FILE_NAME) violated
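The mechanism behind the ORA-00001 error above can be reproduced in miniature. The sketch below is hypothetical (not DBS code, and SQLite instead of Oracle): two racing bulk inserts submit an overlapping logical_file_name, and the second one trips the unique constraint, the same failure mode Oracle reports as ORA-00001:

```python
# Hypothetical sketch: racing bulk inserts against a UNIQUE column.
# All table/column names here are illustrative, not the real DBS schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id INTEGER PRIMARY KEY, "
    "logical_file_name TEXT UNIQUE)"
)

# First worker's chunk inserts cleanly.
chunk_a = [(1, "/store/data/A.root"), (2, "/store/data/B.root")]
conn.executemany("INSERT INTO files VALUES (?, ?)", chunk_a)

# A racing worker re-submits an LFN that is already present.
chunk_b = [(3, "/store/data/B.root")]
try:
    conn.executemany("INSERT INTO files VALUES (?, ?)", chunk_b)
except sqlite3.IntegrityError as exc:
    # SQLite reports "UNIQUE constraint failed: files.logical_file_name";
    # Oracle raises ORA-00001 for the analogous situation.
    print("constraint violated:", exc)
```

This is why the server-side handler wraps the failure as a generic concurrency error: from the database's point of view it is simply a duplicate key, whichever worker happens to arrive second.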
  • Strangely enough, after 24 hours all dataset records had been injected except one, and the only error we were left with was related to one particular block: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac. Log records from the agent side were:
2024-04-04 12:30:50,017:140283221800704:INFO:DBSUploadPoller:Executing retrieveBlocks method...
2024-04-04 12:30:50,018:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /Cosmics/Tier0_REPLAY_2024-PromptReco-v3182029/DQMIO#38d97107-0bd6-4a61-881c-9bd08afb101b
2024-04-04 12:30:50,018:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /Cosmics/Tier0_REPLAY_2024-TkAlCosmics0T-PromptReco-v3182029/ALCARECO#1bbe16a4-bcad-4f92-928b-a4fdad226bfa
2024-04-04 12:30:50,020:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /MinimumBias/Tier0_REPLAY_2024-SiStripCalZeroBias-PromptReco-v3182029/ALCARECO#219c8a0f-3208-4314-86b6-092e8289c16e
2024-04-04 12:30:50,179:140283221800704:ERROR:DBSUploadPoller:Error trying to process block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac through DBS. Details: DBSError code: 110, message: 6d0f69d1208bcc4d2f65cdc5abafe8c5196e64d1e1d695a5348eb382403935a2 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
2024-04-04 12:30:50,180:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-PromptReco-v3182029/NANOAOD#88aaca0d-5912-4a6d-a7bf-174374986b43
2024-04-04 12:30:50,339:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-LogErrorMonitor-PromptReco-v3182029/USER#c2a880cb-5d18-4839-ae85-5c197501bf66
...

2024-04-04 12:30:51,584:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-PromptReco-v3182029/DQMIO#66eb027f-bfe9-4a8e-b37b-cdbb07823043
2024-04-04 12:30:51,658:140283221800704:INFO:DBSUploadPoller:Updating database record with files injected into DBS
2024-04-04 12:30:51,663:140283221800704:INFO:DBSUploadPoller:Updating 40 blocks successfully injected into DBS
[2024-04-05 00:03:18.702591251 +0000 UTC m=+629646.970417431] dataset_output_mod_configs.go:68: Insert DatasetOutputModConfigs
INSERT INTO cms_dbs3_k8s_global_owner.DATASET_OUTPUT_MOD_CONFIGS
    (ds_output_mod_conf_id,dataset_id,output_mod_config_id)
    VALUES
    (:ds_output_mod_conf_id,:dataset_id,:output_mod_config_id)

&{DS_OUTPUT_MOD_CONF_ID:38366926 DATASET_ID:14839992 OUTPUT_MOD_CONFIG_ID:25163930}
[2024-04-05 00:03:18.704100146 +0000 UTC m=+629646.971926341] dataset_output_mod_configs.go:72: unable to insert DatasetOutputModConfigs <nil>
[2024-04-05 00:03:18.706504728 +0000 UTC m=+629646.974331024] blocks.go:181: Insert Blocks
INSERT INTO cms_dbs3_k8s_global_owner.BLOCKS
    (block_id,block_name,dataset_id,
     open_for_writing,origin_site_name,
     block_size,file_count,
     creation_date,create_by,
     last_modification_date,last_modified_by)
    VALUES
    (:block_id,:block_name,:dataset_id,
     :open_for_writing,:origin_site_name,
     :block_size,:file_count,
     :creation_date,:create_by,
     :last_modification_date,:last_modified_by)

&{BLOCK_ID:32919159 BLOCK_NAME:/L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac DATASET_ID:14839992 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:691037 FILE_COUNT:2 CREATION_DATE:1712162019 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1712162019 LAST_MODIFIED_BY:T0Replay}
[2024-04-05 00:03:18.709809489 +0000 UTC m=+629646.977635686] bulkblocks2.go:995: insertFilesViaChunks processed 1 goroutines with ids [3084164837 3084164877], elapsed time 733.305µs
[2024-04-05 00:03:18.710679899 +0000 UTC m=+629646.978506084] files.go:282: unable to validate record DBSError Code:114 Description:DBS validation error when wrong pattern is provided Function:dbs.files.Validate Message: Error: nested DBSError Code:114 Description:DBS validation error when wrong pattern is provided Function:dbs.CheckPattern Message:invalid pattern for key=logical_file_name Error: invalid parameter(s)
[2024-04-05 00:03:18.711000322 +0000 UTC m=+629646.978826502] bulkblocks2.go:1002: fail to insert files chunks, trec &{IsFileValid:1 DatasetID:14839992 BlockID:32919159 CreationDate:1712275398 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{v:<nil>} dirty:map[] misses:0} NErrors:1}
[2024-04-05 00:03:18.711312571 +0000 UTC m=+629646.979138766] bulkblocks2.go:743: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
[2024-04-05 00:03:18.713443893 +0000 UTC m=+629646.981270092] handlers.go:89: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Stacktrace:

goroutine 4083874 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xae4c20?, 0xc0009e2b90?}, 0x6e, {0xc00097c000, 0xe6}, {0xa1ae29, 0x2b})
        /go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00035e000)
        /go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:744 +0x16ef
github.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, 0xc00060c090}, 0xc0006ee200, {0x9feab6, 0xa})
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:542 +0x1274
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xae81b0?, 0xc00060c090?}, 0x454134?)
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:957 +0x45
net/http.HandlerFunc.ServeHTTP(0xe10000c00043ca01?, {0xae81b0?, 0xc00060c090?}, 0x0?)
        /usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, 0xc00060c090?}, 0x0?)
        /go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 +0x38
net/http.HandlerFunc.ServeHTTP(0x94c100?, {0xae81b0?, 0xc00060c090?}, 0x11?)
        /usr/local/go/src/n


  • It turns out that the concurrency error handler created this way can mask not only the ORA-00001 error from the database, but also validation errors coming from the DBS server code base.

  • I am now about to change and reconfigure the logical_file_name field in the lexicon, and we will retry the injections.
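The lexicon change this issue asks for can be illustrated with simplified stand-in patterns. These are not the actual lexicon.json entries, just a minimal sketch of the effect of adding digits to the data_tier_name character class (the replay's L1SCOUT tier is rejected by a letters-only pattern):

```python
# Illustrative only: simplified stand-ins for the lexicon's
# data_tier_name regexp, before and after allowing numbers.
import re

old_tier = re.compile(r"^[A-Z\-_]+$")     # letters only: rejects "L1SCOUT"
new_tier = re.compile(r"^[A-Z0-9\-_]+$")  # digits allowed: accepts "L1SCOUT"

for tier in ("AOD", "DQMIO", "L1SCOUT"):
    print(tier, bool(old_tier.match(tier)), bool(new_tier.match(tier)))
```

A failed match on a field like data_tier_name or logical_file_name is what surfaces as the "invalid pattern for key=..." validation error in the server log above, before the handler rewraps it as a concurrency error.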

@todor-ivanov
Author

todor-ivanov commented Apr 5, 2024
And the results are more than positive:

2024-04-05 18:31:38,997:140175970297600:DEBUG:DBSUploadPoller:Creating dbsAPI with address https://cmsweb-testbed.cern.ch/dbs/int/global/DBSWriter
2024-04-05 18:31:38,997:140175970297600:DEBUG:DBSUploadPoller:Found block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac in blocks
2024-04-05 18:31:38,999:140175970297600:INFO:DBSUploadPoller:Queueing block for insertion: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-05 18:31:38,999:140175970297600:DEBUG:DBSUploadPoller:Creating dbsAPI with address https://cmsweb-testbed.cern.ch/dbs/int/global/DBSWriter
2024-04-05 18:31:38,999:140175970297600:INFO:DBSUploadPoller:Executing retrieveBlocks method...
2024-04-05 18:31:39,000:140175970297600:INFO:DBSUploadPoller:About to call insert block for: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-05 18:31:39,306:140175970297600:DEBUG:DBSUploadPoller:Got a block to close
2024-04-05 18:31:39,306:140175970297600:INFO:DBSUploadPoller:Updating database record with files injected into DBS
2024-04-05 18:31:39,330:140175970297600:INFO:DBSUploadPoller:Updating 1 blocks successfully injected into DBS
  • From the DBS writer logs:
[2024-04-05 16:31:39.064563417 +0000 UTC m=+61.069350939] dataset_output_mod_configs.go:72: unable to insert DatasetOutputModConfigs <nil>
[2024-04-05 16:31:39.254616254 +0000 UTC m=+61.259403765] blocks.go:181: Insert Blocks
INSERT INTO cms_dbs3_k8s_global_owner.BLOCKS
    (block_id,block_name,dataset_id,
     open_for_writing,origin_site_name,
     block_size,file_count,
     creation_date,create_by,
     last_modification_date,last_modified_by)
    VALUES
    (:block_id,:block_name,:dataset_id,
     :open_for_writing,:origin_site_name,
     :block_size,:file_count,
     :creation_date,:create_by,
     :last_modification_date,:last_modified_by)

&{BLOCK_ID:32919750 BLOCK_NAME:/L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac DATASET_ID:14839992 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:691037 FILE_COUNT:2 CREATION_DATE:1712162019 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1712162019 LAST_MODIFIED_BY:T0Replay}
[2024-04-05 16:31:39.260464202 +0000 UTC m=+61.265251713] bulkblocks2.go:995: insertFilesViaChunks processed 1 goroutines with ids [3084212117 3084212157], elapsed time 828.803µs
[2024-04-05 16:31:39.267165923 +0000 UTC m=+61.271953803] files.go:288: Insert Files file_id=3084212117 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/L1Accept/L1SCOUT/v3182029/000/378/788/00000/7fa128e6-a29f-4c2d-8c25-2ebb81e54774.root
[2024-04-05 16:31:39.277333904 +0000 UTC m=+61.282121413] files.go:288: Insert Files file_id=3084212157 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/L1Accept/L1SCOUT/v3182029/000/378/788/00000/bdebda18-d654-4731-8a9e-18dd0b47403f.root
[2024-04-05 16:31:39.279887456 +0000 UTC m=+61.284674964] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
[2024-04-05 16:31:39.283216813 +0000 UTC m=+61.288004337] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
  • And DAS now shows all the file details.

@vkuznet
Contributor
vkuznet commented Apr 5, 2024

Todor, what worries me is this line:
[2024-04-05 16:31:39.279887456 +0000 UTC m=+61.284674964] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
which claims that the FileLumi list was inserted sequentially. We should not degrade the DBS code back to sequential processing, yet somehow that is the case here. It may be that the FileLumi list is small enough, i.e. it is the last chunk, and I would appreciate it if you could check that.
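The "last chunk" hypothesis can be sketched with a toy chunking function. This is a sketch under assumed values (CHUNK_SIZE is illustrative, not the real dbs2go setting), not the actual insertFilesViaChunks logic:

```python
# Hypothetical sketch of chunked insertion, not dbs2go code.
CHUNK_SIZE = 500  # assumed illustrative value

def chunks(records, size=CHUNK_SIZE):
    """Split a list of file records into insertion chunks."""
    return [records[i:i + size] for i in range(0, len(records), size)]

# The replay block above contained only two files, so the whole block
# fits into a single (and therefore final) small chunk; a server could
# reasonably process such a chunk sequentially rather than fanning it
# out to parallel workers.
block_files = ["file-1.root", "file-2.root"]
parts = chunks(block_files)
print(len(parts), len(parts[0]))  # one chunk holding both files
```

Under this reading, the sequential log line would be expected for any block smaller than the chunk size, rather than a sign of a general regression to sequential processing.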

@todor-ivanov
Author
Hi @vkuznet, it was a test replay from T0 and we know for sure it was a short one. The whole block consists of only two fairly small files.
