Bad sizes in Popularity #36349

Closed
joseflix opened this issue Dec 3, 2021 · 22 comments

Comments

@joseflix

joseflix commented Dec 3, 2021

We have noticed that records in monit_prod_cmssw_pop_* sometimes have an incorrect file size associated with a file. For example [1]:

Screenshot 2021-12-03 at 14 36 23

The actual file size according to DAS [2] is 4191351315 bytes (4.2 GB). Sometimes accesses show fewer bytes, always the same value. What could be the reason for this?

This particular file is at PIC Tier-1 storage, and it has the correct value:

[root@dtn01 ~]# ls -ltr /pnfs/pic.es/data/cms/disk/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root
-rw-r--r-- 1 cms001 cms 4191351315 Apr 5 2021 /pnfs/pic.es/data/cms/disk/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root

However, we see this happening for many other files.
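A check like this could be automated across many files. The following is only a minimal sketch, assuming the popularity records have been exported as flat dictionaries with `file_lfn` and `file_size` keys (named after the `data.file_lfn` and `data.file_size` columns in [1]) and that the reference sizes come from a catalog lookup such as DAS. The function name, the record layout, and the wrong value `1234567` are made up for illustration.

```python
from typing import Dict, Iterable, List


def find_size_mismatches(records: Iterable[dict],
                         catalog_sizes: Dict[str, int]) -> List[dict]:
    """Return the records whose reported size differs from the catalog size."""
    mismatches = []
    for rec in records:
        expected = catalog_sizes.get(rec["file_lfn"])
        if expected is None:
            continue  # no reference size known for this file, cannot judge
        if rec["file_size"] != expected:
            mismatches.append(rec)
    return mismatches


LFN = ("/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/"
       "PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root")

# Reference size as reported by DAS and by ls on the storage element.
catalog = {LFN: 4191351315}

# Hypothetical exported records; the second size is an invented wrong value.
records = [
    {"file_lfn": LFN, "file_size": 4191351315},
    {"file_lfn": LFN, "file_size": 1234567},
]

bad = find_size_mismatches(records, catalog)
print(f"{len(bad)} of {len(records)} records have a bad size")
```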

Valentin pointed out to me that we should open a ticket here. This is part of the CMSSW report on popularity, so this might need a deeper look.

[1] https://monit-kibana.cern.ch/kibana/app/kibana#/discover?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now%2Fy,to:now%2Fy))&_a=(columns:!(data.file_lfn,data.file_size,data.read_bytes,data.client_host),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'79cbcb40-e78f-11ea-966a-e1c0a7950cea',key:data.file_lfn,negate:!f,params:(query:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root),type:phrase,value:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root),query:(match:(data.file_lfn:(query:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root,type:phrase))))),index:'79cbcb40-e78f-11ea-966a-e1c0a7950cea',interval:auto,query:(language:kuery,query:''),sort:!(metadata.timestamp,desc))

[2] https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2021

A new Issue was created by @joseflix Josep Flix, PhD.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@vkuznet
Contributor

vkuznet commented Dec 3, 2021

I think the issue is related to the information sent by CMSSW jobs via UDP to our collector. I suggest that somebody check the information provided by the jobs.

@makortel
Contributor

makortel commented Dec 3, 2021

assign core

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2021

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Dec 3, 2021

Thanks @vkuznet for the clarification on where the information originates from.

While addressing #34873 we realized that the file size reported in the UDP packets was set in a non-thread-safe way (#34873 (comment)) that could cause problems like this.
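To illustrate how state set in a non-thread-safe way can lead to a wrong size being reported, here is a minimal sketch in Python. It is not the StatisticsSenderService code: the class names, method names, file names, and sizes are all hypothetical, and the interleaving that concurrent threads might produce is replayed in a fixed order so that the result is deterministic.

```python
import threading


class SharedSizeSender:
    """Problematic pattern: one size slot shared by all open files."""

    def __init__(self):
        self._size = 0
        self.sent = []

    def open_file(self, lfn, size):
        self._size = size  # overwritten by whichever file was opened last

    def close_file(self, lfn):
        self.sent.append((lfn, self._size))


class PerFileSizeSender:
    """Safer pattern: sizes are keyed by file and guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sizes = {}
        self.sent = []

    def open_file(self, lfn, size):
        with self._lock:
            self._sizes[lfn] = size

    def close_file(self, lfn):
        with self._lock:
            self.sent.append((lfn, self._sizes.pop(lfn)))


def replay(sender):
    # One interleaving that two concurrent streams could produce.
    sender.open_file("a.root", 4000)
    sender.open_file("b.root", 1000)
    sender.close_file("a.root")
    sender.close_file("b.root")
    return sender.sent


buggy = replay(SharedSizeSender())
fixed = replay(PerFileSizeSender())
print(buggy)
print(fixed)
```

With the shared slot, `a.root` is reported with the size of `b.root`; with per-file bookkeeping each file keeps its own size.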

The overhaul of StatisticsSenderService in #35505 should have fixed this issue as well. That PR was merged in 12_1_0_pre5. Would it be useful in earlier (active) release cycles as well?

@davidlange6
Contributor

davidlange6 commented Dec 3, 2021 via email

@makortel
Contributor

makortel commented Dec 6, 2021

@perrotta asked in #36355 (comment)

just to get an idea of the plans: would these backports in old but still active cycles require a new (patch) release? Or they should be merged "just if a new release has to be built independently"? (Or maybe they don't even need a new release: sorry, I don't know the details about how those popularity monitorings work...)

New releases will be needed for these changes to take effect. The changes themselves wouldn't necessarily warrant new (patch) releases, but perhaps we should build them anyway so that the updated code gets run by WM?

@makortel
Contributor

Backports down to 9_4_X have been done.

@makortel
Contributor

The backport PRs have been merged

@makortel
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@joseflix
Author

Hi there. Sorry, but is this solved already? I still see inconsistent values in the popularity Kibana views, for example in the snapshot I attach.

Screenshot 2022-03-10 at 10 50 42

In Popularity we don't have the CMSSW version that was used:

Screenshot 2022-03-10 at 10 53 15

So I am not sure whether this is because those jobs are using a CMSSW release to which the fix was not backported.

@joseflix
Author

@makortel @qliphy

@davidlange6
Contributor

davidlange6 commented Mar 10, 2022 via email

@joseflix
Author

Hi @davidlange6, thanks for the answer. How can I find out which release the jobs I report here were using? Maybe we should add the CMSSW version to the popularity records?

@davidlange6
Contributor

davidlange6 commented Mar 10, 2022 via email

@makortel
Contributor

For a limited number of cases, DAS/DBS should tell you the release with which a file/dataset was produced, so at least for the example case you should be able to check that. In the longer term, adding a CMSSW release field to the popularity information looks straightforward.

@vkuznet
Contributor

vkuznet commented Mar 10, 2022

In DAS, getting the release info is as simple as:

./dasgoclient -query="release dataset=/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM"
CMSSW_4_2_8_SLHC1

In DBS it is also trivial:

https://cmsweb.cern.ch/dbs/prod/global/DBSReader/releaseversions?dataset=/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM

and it yields:

[
{"release_version":"CMSSW_4_2_8_SLHC1"}
,{"release_version":"CMSSW_4_2_8_SLHC1"}
,{"release_version":"CMSSW_4_2_8_SLHC1"}
]

(I need to check why we have three identical entries; most likely it is a loose constraint on the SQL query JOINs.)
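Until the duplicates are understood, a client can collapse them itself. This is a minimal sketch, assuming the response body has the shape shown above (a JSON list of objects with a `release_version` key); the function name `unique_releases` is made up for illustration, and the response text is copied from the output above rather than fetched.

```python
import json


def unique_releases(dbs_response_text):
    """Return the distinct release versions, keeping first-seen order."""
    seen = []
    for entry in json.loads(dbs_response_text):
        release = entry["release_version"]
        if release not in seen:
            seen.append(release)
    return seen


# Response body as shown above, including the three identical entries.
response = """[
{"release_version":"CMSSW_4_2_8_SLHC1"}
,{"release_version":"CMSSW_4_2_8_SLHC1"}
,{"release_version":"CMSSW_4_2_8_SLHC1"}
]"""

releases = unique_releases(response)
print(releases)
```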

@davidlange6
Contributor

davidlange6 commented Mar 10, 2022 via email

@joseflix
Author

Yes, I want the release used by cmsRun to open those files. If I am not mistaken, the popularity DB has no information on the generated output files, only on the input files.

@makortel
Contributor

You want the release used by the cmsRun job not the release used to create the input data set, no?

Ah, good point. Ok, we can add the CMSSW version in the popularity data. Unfortunately, as David mentioned above, that will be available only in new releases (of old release cycles).

@makortel
Contributor

#37220 adds CMSSW version to the UDP packets.
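Once records carry the release, bad sizes could be grouped by release to see whether they come only from jobs without the fix. The sketch below assumes flat record dictionaries with `file_lfn` and `file_size` keys plus a `cmssw_version` key; that field name, the function name, and the sample file names, sizes, and release labels are all hypothetical, since the actual packet and index layout is not shown in this thread.

```python
from collections import Counter


def bad_size_counts_by_release(records, catalog_sizes):
    """Count records with a wrong file size, grouped by reported release."""
    counts = Counter()
    for rec in records:
        expected = catalog_sizes.get(rec["file_lfn"])
        if expected is not None and rec["file_size"] != expected:
            # Records from jobs that predate the new field carry no version.
            counts[rec.get("cmssw_version", "unknown")] += 1
    return counts


# Invented sample data, for illustration only.
catalog = {"/store/x.root": 4000, "/store/y.root": 2500}
records = [
    {"file_lfn": "/store/x.root", "file_size": 4000, "cmssw_version": "release_new"},
    {"file_lfn": "/store/x.root", "file_size": 1000, "cmssw_version": "release_old"},
    {"file_lfn": "/store/y.root", "file_size": 1000, "cmssw_version": "release_old"},
    {"file_lfn": "/store/y.root", "file_size": 1000},
]

counts = bad_size_counts_by_release(records, catalog)
print(dict(counts))
```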
