StatisticsSenderService working in "recent" releases #41975

Closed
davidlange6 opened this issue Jun 15, 2023 · 25 comments

@davidlange6
Contributor

Thanks to a question in Mattermost, I noticed that the cmssw file-open monitoring for recent upgrade samples did not see reads that were found by xrootd or classad based monitoring.

Unfortunately the json kept by the monitoring does not include the CMSSW release used. Instead, I looked at what file usages are being captured by the cmssw monitoring, and there are no run3 files (which is an approximation of the release used to open them). There are some files whose lfn contains CMSSW_12_0_1, but nothing newer in 2023 (well, unless there is a CMSSW_12_0_111, but I assume that's a bug in the lfn somehow).

@makortel summarized recent changes as being
#35362 (in CMSSW_12_1_0)
#35505 (in CMSSW_12_1_0)
#36570 (in CMSSW_12_3_0)

Is there an easy way to see the json being produced by cmssw? I can also try to trigger a file open in various releases to see what shows up in the monitoring (not sure how foolproof that is; maybe that's an interesting test too).

@vkuznet @Dr15Jones @makortel

@cmsbuild
Contributor

A new Issue was created by @davidlange6 David Lange.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@davidlange6
Contributor Author

and second, could we add release name or some other unique identifier that can trace back the release being used to open the file to make understanding future problems easier?

@makortel
Contributor

Is there an easy way to see the json being produced by cmssw?

#35505 (since 12_1_X) added a debug parameter to the StatisticsSenderService which prints the JSON document (with LogSystem).

@makortel
Contributor

and second, could we add release name or some other unique identifier that can trace back the release being used to open the file to make understanding future problems easier?

The CMSSW version was actually added in #37220 (12_4_X, backported to earlier release cycles down to 8_0_X)

@davidlange6
Contributor Author

#37220 seems to do what I ask above...

@davidlange6
Contributor Author

I did some checking with an example release, 12_3_0: the json seems sane to me, and the socket interactions appear to go OK according to some printouts. It seems functional on the cmssw side. I opened some files with various releases, so I can check the monitoring tomorrow to see if they show up or not.

{"site_name":"T2_CH_CERN", "cmssw_version":"CMSSW_12_3_0", "fallback": false, "type": "primary", "user_dn":"unknown", "client_host":"vocms010", "client_domain":"cern.ch", "server_host":"st-096-ff80923e", "server_domain":"cern.ch", "unique_id":"400f7f28-47bd-4640-a3f6-fa6b694c4236-0", "file_lfn":"/store/data/Run2018D/AlCaLumiPixels/RAW/v1/000/321/012/00000/E027D367-1E9B-E811-8C33-FA163ED2A186.root", "file_size":5680421895, "read_single_sigma":59938.8, "read_single_average":17037.8, "read_vector_average":1.70981e+07, "read_vector_sigma":7.9425e+06, "read_vector_count_average":11.8649, "read_vector_count_sigma":43.2186, "read_bytes":632935608, "read_bytes_at_close":632935608, "read_single_operations":18, "read_single_bytes":306680, "read_vector_operations":37, "read_vector_bytes":632628928, "start_time":1686843293, "end_time":1686843306}

vs something in 10_6_0

{"site_name":"T2_CH_CERN", "user_dn":"unknown", "client_host":"vocms010", "client_domain":"cern.ch", "server_host":"st-096-gg5618jy", "server_domain":"cern.ch", "unique_id":"C98CEAF4-FADA-E946-8199-936C2D4D4A4F-0", "file_lfn":"/store/data/Run2018D/AlCaLumiPixels/RAW/v1/000/321/012/00000/D819947A-159B-E811-9A32-FA163E595669.root", "file_size":6314797261, "read_single_sigma":71125.8, "read_single_average":19762.8, "read_vector_average":8.47139e+06, "read_vector_sigma":9.37531e+06, "read_vector_count_average":38.6, "read_vector_count_sigma":94.0661, "read_bytes":85069678, "read_bytes_at_close":85069678, "read_single_operations":18, "read_single_bytes":355731, "read_vector_operations":10, "read_vector_bytes":84713947, "start_time":1686843594, "end_time":1686843604}
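
To make the difference between the two documents explicit, here is a minimal Python sketch that compares their key sets. The records are abbreviated copies of the two above (most of the shared read statistics are dropped), and `key_diff` is a helper name made up for this illustration, not something in CMSSW.

```python
import json

# Abbreviated copies of the two records above; only a few shared fields are kept.
record_12_3_0 = json.loads(
    '{"site_name":"T2_CH_CERN", "cmssw_version":"CMSSW_12_3_0", "fallback": false, '
    '"type": "primary", "user_dn":"unknown", "read_bytes":632935608}'
)
record_10_6_0 = json.loads(
    '{"site_name":"T2_CH_CERN", "user_dn":"unknown", "read_bytes":85069678}'
)


def key_diff(new, old):
    """Return (keys only in new, keys only in old) as sorted lists."""
    return sorted(set(new) - set(old)), sorted(set(old) - set(new))


added, removed = key_diff(record_12_3_0, record_10_6_0)
print("only in 12_3_0:", added)
print("only in 10_6_0:", removed)
```

For the full records above, the keys present only in the 12_3_0 document are `cmssw_version`, `fallback` and `type`.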

@davidlange6
Contributor Author

I also checked that the number of bytes successfully sent on the socket makes sense (e.g., it matches the size of the json string).
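
As an illustration of that kind of check (not the actual CMSSW code), the sketch below sends a JSON string as a single UDP datagram and compares the byte count returned by the send call with the payload length. It assumes the transport is UDP; the loopback receiver and the `send_and_check` helper are hypothetical stand-ins for the real collector and sender.

```python
import json
import socket


def send_and_check(payload: bytes, address) -> int:
    """Send one UDP datagram and return the number of bytes accepted for sending."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        return sender.sendto(payload, address)


# Hypothetical local receiver standing in for the real monitoring collector.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
receiver.settimeout(2.0)

payload = json.dumps({"site_name": "T2_CH_CERN", "read_bytes": 632935608}).encode("utf-8")
sent = send_and_check(payload, receiver.getsockname())
received, _ = receiver.recvfrom(65535)
receiver.close()

print("bytes sent match payload size:", sent == len(payload))
print("payload arrived intact:", received == payload)
```

A matching byte count only shows that the datagram left the sender; it says nothing about whether the downstream monitoring accepts the document.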

@davidlange6
Contributor Author

looking at monitoring this morning, I see the file reads from CMSSW_10_6_0, CMSSW_11_3_0 and CMSSW_12_0_0, but not from those with CMSSW_12_3_0. I somehow missed doing CMSSW_12_1_0. Will add that to today's tests..

Second, if the cmssw_version was backported to all releases, either no one is using those releases or all such releases are failing. I see no records with the cmssw release version in the json (in 2023).

@makortel
Contributor

The earlier-than-12_4_X releases where the cmssw_version was backported are

  • 12_3_0
  • 12_2_2
  • 10_6_31
  • 10_2_28_patch1
  • 8_0_36_patch2

(in many release cycles where the backport was done, no release has been built since then).

@davidlange6
Contributor Author

For example, I tried using 10_6_33 yesterday, and it doesn't look like it worked. (Nothing I did yesterday showed up in the monitoring, so I will redo it with some releases that I expect to work.)

@davidlange6
Contributor Author

results from my tests of yesterday:

  • CMSSW_10_6_0: ok
  • CMSSW_10_6_33: not ok
  • CMSSW_11_3_0: ok
  • CMSSW_12_0_0: ok
  • CMSSW_12_1_0: not ok
  • CMSSW_12_3_0: not ok

@makortel
Contributor

From the CMSSW side, the list of "ok" and "not ok" would be easiest to explain with #35362 + #35505 somehow breaking things. @davidlange6 Would you be able to test 10_6_29 and 10_6_29_patch1?

@davidlange6
Contributor Author

my tests suggest that 10_6_29 is ok and 10_6_29_patch1 is not working.
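
For reference, the 10_6_X results reported in this thread can be put in version order to find the first failing release. This is only a sketch: `version_key` and `first_failing` are hypothetical helper names, and ordering by release name is assumed to be a good enough proxy for build order within one release cycle.

```python
import re

# Test outcomes reported in this thread (True = reads showed up in the monitoring).
results = {
    "CMSSW_10_6_0": True,
    "CMSSW_10_6_29": True,
    "CMSSW_10_6_29_patch1": False,
    "CMSSW_10_6_33": False,
}


def version_key(release):
    """Turn 'CMSSW_10_6_29_patch1' into (10, 6, 29, 1); no patch suffix counts as patch 0."""
    match = re.fullmatch(r"CMSSW_(\d+)_(\d+)_(\d+)(?:_patch(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised release name: {release}")
    major, minor, build, patch = match.groups()
    return int(major), int(minor), int(build), int(patch or 0)


def first_failing(outcomes):
    """Return the earliest release (in version order) whose outcome is False, or None."""
    for release in sorted(outcomes, key=version_key):
        if not outcomes[release]:
            return release
    return None


print(first_failing(results))
```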

@makortel
Contributor

makortel commented Jun 20, 2023

my tests suggest that 10_6_29 is ok and 10_6_29_patch1 is not working.

Thanks, this is quite strong evidence that the problem stems from the backport of #35362 + #35505 (which is also supported by the recent comments in the Monitoring Mattermost channel).

@davidlange6
Contributor Author

it looks like the problem is that "type" is not a legal key in the json: it is used for something else in the monitoring data stream and confuses things. We should rename it "read_type" or something like that.

@makortel
Contributor

Can we have a list of all other key names that are not allowed in the monitoring system?

@makortel
Contributor

Can we have a list of all other key names that are not allowed in the monitoring system?

Just to document here, David pointed to https://monit-docs.web.cern.ch/metrics/amq/ as a hint.

@makortel
Contributor

Can we have a list of all other key names that are not allowed in the monitoring system?

Just to document here, David pointed to https://monit-docs.web.cern.ch/metrics/amq/ as a hint.

This document was confirmed in https://mattermost.web.cern.ch/cms-o-and-c/pl/6fqtne7uytfpby3rui9as38w7e to be the correct documentation, and the list of key names has been stable over the years (there are also many other users, so on the AMQ side they would hopefully avoid potentially breaking changes).

The list of reserved keywords is then

  • producer
  • type
  • type_prefix
  • timestamp
  • host

@makortel
Contributor

Fix is in #42060

@makortel
Contributor

The fix and backports have been merged

@makortel
Contributor

+core

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
