
Issues related to FastTimerService and HLT #39756

Closed
silviodonato opened this issue Oct 18, 2022 · 13 comments · Fixed by #39859

@silviodonato
Contributor

I just want to report here two problems with the FastTimerService and the HLT online DQM.

  1. DQM: HLT / TimerService / Running on AMD EPYC 7763 64-Core Processor with 24 streams on 32 threads

Here the timing seems off by a factor of 2. Our offline measurements showed that the timing should be above 300 ms at high pileup.

  2. The timing VsPU and VsSCAL plots are empty.

@cms-sw/hlt-l2

@cmsbuild
Contributor

A new issue was created by @silviodonato (Silvio Donato).

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign hlt

@cmsbuild
Contributor

New categories assigned: hlt

@missirol, @Martin-Grunewald you have been requested to review this pull request/issue and eventually sign. Thanks

@missirol
Contributor

(Noted, I will try to understand this in the next few days, unless it is considered important to fix it asap.)

FYI: @fwyzard

@fwyzard
Contributor

fwyzard commented Oct 18, 2022

The timing VsPU and VsSCAL plots are empty.

The reason is that the FastTimerServiceClient responsible for filling those plots tries to read the information about luminosity and pileup from SCAL, which no longer exists in Run-3.

I inquired about it a few months ago, and according to @mmusich's answer the solution would be to change the code to read that information from the OnlineLuminosityRecord produced by the onlineMetaDataDigis.
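
For illustration only, here is a minimal sketch of how that information could be read in an event-by-event module. This is not the actual FastTimerServiceClient code (the real fix may well live in a DQM producer feeding the client); the module name and the input tag "onlineMetaDataDigis" are assumptions, while instLumi() and avgPileUp() are the accessors provided by OnlineLuminosityRecord in DataFormats/OnlineMetaData:

```cpp
// Illustrative sketch, not the actual FastTimerServiceClient code: a plain
// EDAnalyzer that reads the instantaneous luminosity and average pileup from
// the OnlineLuminosityRecord produced by the onlineMetaDataDigis module.
#include "DataFormats/OnlineMetaData/interface/OnlineLuminosityRecord.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/global/EDAnalyzer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/InputTag.h"

class OnlineLumiPileupReader : public edm::global::EDAnalyzer<> {
public:
  explicit OnlineLumiPileupReader(edm::ParameterSet const&)
      // "onlineMetaDataDigis" is assumed to be the label of the OnlineMetaData unpacker
      : lumiToken_{consumes<OnlineLuminosityRecord>(edm::InputTag("onlineMetaDataDigis"))} {}

  void analyze(edm::StreamID, edm::Event const& event, edm::EventSetup const&) const override {
    auto const& record = event.get(lumiToken_);
    // these two quantities replace what was previously read from SCAL
    float const instLumi = record.instLumi();
    float const pileup = record.avgPileUp();
    // ... here one would fill (or pass on) the timing-vs-lumi / timing-vs-pileup inputs ...
    (void)instLumi;
    (void)pileup;
  }

private:
  edm::EDGetTokenT<OnlineLuminosityRecord> const lumiToken_;
};

DEFINE_FWK_MODULE(OnlineLumiPileupReader);
```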

Unfortunately I never had time to work on the changes :-/

@fwyzard
Contributor

fwyzard commented Oct 18, 2022

Here the timing seems off by a factor of 2. Our offline measurements showed that the timing should be above 300 ms at high pileup.

@silviodonato keep in mind that as long as the CPU usage is below ~70%, it's almost like running without hyperthreading, so it would make sense to observe a timing roughly a factor of 2 (I'd expect ~1.8x) faster than on a fully used machine.
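
(As a rough back-of-the-envelope check with the numbers quoted above, assuming a ~1.8x hyperthreading factor: a job measuring ~300 ms/ev on a fully loaded machine would be expected to show roughly 300 / 1.8 ≈ 165 ms/ev on a mostly idle one, which is in the same ballpark as the apparent factor-of-2 difference in the DQM plot.)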

@fwyzard
Contributor

fwyzard commented Oct 18, 2022

Unfortunately I never had time to work on the changes :-/

By the way, if anyone else makes the necessary changes, I would suggest also renaming these plots to _vs_pileup and _vs_lumi ...

@fwyzard
Contributor

fwyzard commented Oct 18, 2022

OK, some more information: I've re-run over the first lumisections of run 360459 (the same run as in @silviodonato's plot), using conditions similar to what we have online (CMSSW_12_4_10, same menu and global tag, 8 jobs with 32 threads / 24 streams each), and looked at the CPU time (Silvio's plot is for CPU time).

Taking total - other (which is what the FastTimerService plots), I get 260 ms/ev.
If I zoom in on the DQM plot, we see that for the first 2-3 lumisections the CPU time measured on the HLT farm was indeed 258 ms/ev.
I'm looking at the first two lumisections because at the beginning of the run the HLT had buffered some data while it was loading the application, starting the jobs, and getting the first conditions -- so the whole farm runs at maximum capacity until the buffer has been drained.

Keeping all these effects in mind, I would say that the online measurement is in very good agreement with an online-like measurement done under the same conditions 👍🏻

One last comment is about the CPU time vs real (wall clock) time: of course what actually matters for keeping up with the L1 rate is the latter.

The plot for the real time looks similar, just a bit higher.
Zooming in on the first lumisections shows a similar effect, with a peak for the first lumisection around 298 ms/ev.
From my online-like measurement I get 417 - 118 = 299 ms/ev for the real time, which is also very consistent with the online value for the first lumisection.

So... the measurements done on the online machines reproduce pretty accurately the HLT timing measured online (better than I imagined before making this check).

And the comparison between the timing value of ~ 200 ms/ev (~210 ms/ev real) after the first lumisections and the initial peak of 260 ms/ev (~300 ms/ev real) gives us an indication of the effect of hyperthreading at the level of occupancy we have around pileup 50.
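
For quick reference, the numbers quoted above can be summarized as follows (values rounded):

                            CPU time      real time
  online-like re-run        260 ms/ev     299 ms/ev
  online DQM, first LSs     ~258 ms/ev    ~298 ms/ev
  online DQM, later LSs     ~200 ms/ev    ~210 ms/ev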

@silviodonato
Contributor Author

silviodonato commented Oct 19, 2022

Thanks a lot @fwyzard! So:

  • we should/can check the timing including the hyper-threading effect by looking at the first lumisections
  • the DQM plot doesn't include the "others" part (~10%).

I will keep the issue open for item 2 (which is not urgent).

@fwyzard
Contributor

fwyzard commented Oct 19, 2022

we should/can check the timing including the hyper-threading effect by looking at the first lumisections

Yes, but only for runs that start already in stable beams; otherwise there is very little data to run on.

the DQM plot doesn't include the "others" part (~10%).

Correct - and the difference is more significant for "real time" than for "CPU time".

@missirol
Contributor

To fix the empty plots, an attempt is in #39859.

@missirol
Contributor

missirol commented Nov 4, 2022

+hlt

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2022

This issue is fully signed and ready to be closed.
