Fast Calorimeter Learning (FACILE) algorithm for HCAL reconstruction #32272
Conversation
…ted on GPU with TritonClient, with example config for running in HLT
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32272/20056
|
A new Pull Request was created by @jeffkrupa (Jeff Krupa) for master. It involves the following packages: Configuration/ProcessModifiers @perrotta, @silviodonato, @cmsbuild, @franzoni, @slava77, @jpata, @qliphy, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins.
|
+1 |
Comparison job queued. |
@igv4321 please, take a look. |
Comparison is ready Comparison Summary:
|
From what I can surmise of the discussion so far, the use case of FACILE, currently, is:
The integration of FACILE in CMSSW is not objected to by the HCAL DPG. Currently, the only use case of FACILE is the further development of FACILE itself, and as the first use case for Sonic + Triton. And, currently, Sonic + Triton is only useful for testing FACILE. So the full request is
with the stated goal of furthering the development of Sonic and FACILE. Is the summary correct? |
Do we need to get to this point? Since 2016 I have had the responsibility to maintain the HCAL local reconstruction (indeed I was HCAL DPG only in 2018-2019), so I think I have the scientific responsibility to review at least what HCAL will use for the initial part of Run 3. In 2021 we will come back after a very long shutdown, and we should not repeat the errors of 2015! I'm trying to help you guys here. I do not want to enter the details of the algorithm now; I think it has conceptual problems, and the presentation at the CMS week showed them. I interpret your last message as meaning that you are not satisfied with merging code for study (as Igor and Markus mentioned). Sneaking a change into production is not an option. |
Objectively false.
These elements are already in CMSSW. As discussed at the core software meeting today, some further automation will be coming soon.
I'm not sure I understand this point. To clarify, the availability of a process modifier is intended to facilitate further testing, not immediately for production. |
As the person who actually led and coordinated the HCAL Phase 1 software development, I'm well aware of the importance of having things ready in advance. This PR in no way interferes with other Run 3 preparations.
I'm not satisfied with specious arguments and special pleading targeted at this PR because some people happen not to like it for subjective reasons. There have been probably more than a dozen PRs this year alone introducing new, often ML-based reconstruction alternatives with associated modifiers. I'm honestly taken aback by the amount of spurious objections being thrown at this PR. Throughout my tenure as a CMSSW L2, I've actively encouraged iterative development practices. If CMS no longer wants to support iterative development in CMSSW, perhaps many of us should reconsider how actively we want to contribute to the experiment software. |
OK, then I misunderstood the results I have seen
They are in CMSSW, but the question as to "what are they currently useful for" remains.
While I don't disagree, from the ongoing discussion it looks like not including the modifiers and the matrix workflow would simplify the integration? |
We have preliminary numbers and are working (right now) on making sure they're valid. (But we definitely see an improvement.) We know that FACILE inference, even on CPU, is significantly faster than MAHI (~10ms vs. ~60ms).
Let me lay out in more detail what I see as the conflict in the discussion here. Stated and at least semi-official CMSSW policies include:
We have explicitly followed 1. in this PR, and work is ongoing to make sure we can follow 2. Some of the suggestions in this PR discussion actively violate one or both policies for reasons that are definitely not being applied uniformly across all CMSSW PRs. |
@@ -0,0 +1,3 @@
+import FWCore.ParameterSet.Config as cms
+
+enableSonicTriton = cms.Modifier()
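For context, a process modifier like the one added above takes effect when the cms.Process is constructed. A minimal usage sketch follows; the import path is an assumption based on the Configuration/ProcessModifiers package named earlier, not code quoted from this PR:

```python
# Sketch only: how a process modifier is typically activated in a CMSSW config.
# The cff import path below is assumed (the PR adds the modifier under
# Configuration/ProcessModifiers); passing "--procModifier enableSonicTriton"
# to cmsDriver.py has the same effect.
import FWCore.ParameterSet.Config as cms
from Configuration.ProcessModifiers.enableSonicTriton_cff import enableSonicTriton

process = cms.Process("RECO", enableSonicTriton)
```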
Is this meant to enable only the FACILE variant, or really everything that can potentially use SonicTriton? I'm trying to understand the expected design of the modifier mix. IIUC, there will/can be many inference variants/models using SonicTriton, and many of them will be for tests only. There should be a way to distinguish them.
I guess one option is to assume that this specific modifier will flip from one "default" variant of the algorithms to another equivalent variant that is remote-offloaded via SONIC+Triton. This option would apply to #32048: in a mix with the mlpf modifier enabled, adding enableSonicTriton will switch how an equivalent algorithm is executed (sort of a switch-producer).
That is clearly not the case in this PR: a completely different algorithm is substituted. It seems to me that we need a more targeted, named modifier, something like run3_HB_FACILE or similarly compact.
Indeed, enableSonicTriton is intended to enable all available producers that use Sonic+Triton. Introducing a separate modifier specific to this PR seems reasonable to me. (I don't think we necessarily need to have non-offloaded versions of every ML producer, but again, I think this can be left to developers' discretion right now.)
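To make the "switch one variant for another" semantics discussed above concrete: a modifier typically rewires an existing module with toModify or toReplaceWith. A hedged sketch, in which the module label and both producer types are illustrative rather than taken from this PR:

```python
# Sketch only: how a process modifier can substitute one producer for another.
# The label ("hbheprereco") and the producer names here are illustrative;
# real producers carry many more parameters.
import FWCore.ParameterSet.Config as cms
from Configuration.ProcessModifiers.enableSonicTriton_cff import enableSonicTriton

hbheprereco = cms.EDProducer("HBHEPhase1Reconstructor")  # default algorithm

# When enableSonicTriton is active, run the SONIC-based producer instead,
# under the same module label, so downstream consumers are unaffected:
enableSonicTriton.toReplaceWith(
    hbheprereco,
    cms.EDProducer("FacileHcalReconstructor"),
)
```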
Kevin, we have been porting MAHI to GPU; we get identical results, and the HCAL speed is now negligible in the Run 3 HLT. We are good for HLT, so using ML is not helping to save time. So what is the real problem with the MAHI OOT algorithm that you are trying to overcome? It is not clear to me. I think MAHI is simple and gives you a good benchmark to develop your ML and test your Triton setup.
|
The question of what constitutes duplicating effort is much larger than this particular PR. However, since you raised the topic: this PR achieves a significant speedup wrt MAHI (even on CPU) with O(100) lines of C++ code. (If you want to count the training code, which currently isn't included in CMSSW for most ML algorithms, this number grows somewhat, but stays at a similar order of magnitude.) The resulting algorithm can be ported to essentially any coprocessor with little more than the press of a button. It does not require anyone to write or maintain any coprocessor-specific code whatsoever. In the long run, I (and many others) expect that this approach will result in significantly less duplication of effort.

Also, FACILE is explicitly trained not just to be aware of PU, but to remove it, which we suspect results in the improvement in the MET resolution that we have observed and presented in the HCAL DPG. Conditions are also included in the training. Parameterizing the DNN in terms of the detector conditions, to reduce the need for retraining, is an interesting future R&D project that would be facilitated by having a workflow in CMSSW to run the inference and get high-level physics results. |
Just being curious, I think it would be interesting to compare the performance of the inference server also to within-CMSSW inference with TensorFlow/ONNX to understand the magnitude of the overhead. |
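One way to frame that comparison: the server route computes the same inference but adds request/response (de)serialization and a network round trip on top of it. A toy, self-contained sketch of the two paths (plain numpy, no Triton or TensorFlow involved; the "model" is a stand-in dense layer):

```python
# Toy illustration (no Triton/TF involved): the remote path returns the same
# result as in-process inference but pays serialization + transport costs,
# which is the overhead the comparison above would measure.
import pickle
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((47, 1))       # stand-in "model": one dense layer

def model(x):
    return x @ W                       # stand-in for TF/ONNX inference

def infer_local(x):
    return model(x)                    # in-process inference

def infer_remote(x):
    wire_in = pickle.dumps(x)          # client-side request serialization
    y = model(pickle.loads(wire_in))   # "server-side" inference
    wire_out = pickle.dumps(y)         # server-side response serialization
    return pickle.loads(wire_out)      # client-side deserialization

x = rng.standard_normal((1000, 47))
assert np.allclose(infer_local(x), infer_remote(x))  # same answer either way
```

Timing both paths with timeit over many calls would expose the per-call overhead; against a real server, network latency would dominate the pickling cost simulated here.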
RecHits at high energy should get the same estimated energy from MAHI or any other method. Subtracting in-time PU from one detector is bad.
Technology-wise I'm open; that's why we are experimenting. But you probably need to discuss this somewhere else. |
Hi Kevin, we were not expecting any PR before the issues brought up in the DPG meetings had been addressed. And it was even more surprising to hear from a third party that this was sold as an HCAL request in the ORP meeting.
We are happy to have this in as a technology demonstrator for SonicTriton and to ease future R&D. But note that currently there is no clear path to production (and maintenance by the HCAL DPG), as our default algorithm is fast enough to be a negligible portion of the whole reco step (28 ms on CPU for Phase 2) and respects all relevant input conditions. If the reco conveners prefer to have modifiers added for every experimental algorithm, that is fine for us. On the physics content:
|
From the reco point of view:
Next, we expect 1-2 to be addressed by the authors. Thanks! |
@intrepid42 just to make sure we're all on the same page:
|
Ok, thank you for clarifying this misunderstanding! |
Here are some personal considerations about R&D projects in CMSSW.
About this specific PR, given that HCAL DPG is ok with merging this PR to facilitate the R&D, I'm ok to have this code in CMSSW. |
I agree that putting it in test is one way to isolate the dependencies (for now). Coupled with the expressed preference for a customization function (below), this also implies removing the cfi file and putting the module configuration inside the customization function. This might be okay for now.
If this is the preference of the release management team, we can look into this alternative. I think it would be useful to have a clearer delineation regarding when customization functions are preferable vs. modifiers. My working understanding, since modifiers were introduced, was that they were always preferred whenever feasible.
We currently allow non-functional workflows in the upgrade matrix to facilitate development. They are simply not forwarded to the regular matrix, so not automatically tested in IBs. This has been an accepted practice for upgrade workflows for many years now. Maybe it's worth adding a separate upgrade-workflow test command to the bot (for which it would explicitly use |
Ok
I think the modifiers are more useful when you have something "official" that you want to run in many conditions/eras and then you need to play with the other modifiers. For R&D and testing, I think it is better to use a customization function without touching the already existing sequences.
Ok, no problem from my side with using the upgrade matrix. |
Does it mean that any other non-working stuff can go in, too ? |
On 12/11/20 12:28 AM, Silvio Donato wrote:
Do you see any bad side effects @cms-sw/reconstruction-l2 @cms-sw/core-l2 ?
No particularly strong objections to customization functions. I think that using modifiers for a test-like case signals a stronger support commitment and more clearly indicates that this is on its way to becoming a production setup soon.
|
I agree. |
Currently, there are about 10k workflows in the upgrade matrix. I'm not too worried about it, as long as the change does not affect the regular matrix and the other workflows.
ok, so let's keep the customize function for the time being. |
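For the record, the customization-function route agreed on here typically has the following shape; the function name, module label, and lack of parameters are assumptions for illustration, not code from this PR:

```python
# Sketch only: a customization function as applied via
#   cmsDriver.py ... --customise <module>.customiseHcalFacile
# Function name and module label are illustrative; a real producer would
# carry the full set of client and input parameters.
import FWCore.ParameterSet.Config as cms

def customiseHcalFacile(process):
    # Replace the default HBHE producer with the SONIC/Triton-based one,
    # leaving the already existing sequences otherwise untouched.
    process.hbheprereco = cms.EDProducer("FacileHcalReconstructor")
    return process
```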
kind ping, given that the TritonService PR #32576 was merged. |
kind ping - let us know how you want to proceed with this (it has had conflicts for some time) |
We have decided to close this pull request pending additional timing studies. Creating the |
PR description:

- The algorithm is implemented in RecoLocalCalo/HcalRecProducers/src/FacileHcalReconstructor.cc. It loads the HBHEChannelInfo collection and employs a client with Services for Optimized Network Inference in Coprocessors (SONIC) to communicate with a server in as-a-service operation using Triton Inference Server.
- RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc is replaced, and the HBHERecHit collection is made with the new producer.
- The model, facile_all_v5, is in tensorflow SavedModel format (https://github.com/fastmachinelearning/sonic-models). We would like to request a space cms-data:RecoLocalCalo/HcalRecProducers to store the model.
- An example HLT config (RecoLocalCalo/HcalRecProducers/test/sonic_hlt_test.py) and offline reco modifiers (RecoLocalCalo/Configuration/python/hcalGlobalReco_cff.py) are included for testing.

PR validation:

- runTheMatrix.py -l 11634.0 --command="--procModifier enableSonicTriton", and
- cmsRun sonic_hlt_test.py.

Note: to start a local instance of the Triton server for running the commands above, do:
@violatingcp