
recover TF inference for deepJet and deepDoubleX and MXNET inf for deepBoosted (migrated to ONNX in #28112) #28959

Closed
slava77 opened this issue Feb 13, 2020 · 12 comments


slava77 commented Feb 13, 2020

[This issue is contingent on making ONNX work properly on PPC.]

As part of #28112, support for running DeepFlavour and DeepDoubleX inference with TensorFlow was removed and replaced with ONNX-based inference.
The trained model is the same in both cases.
The migration was done to minimize code replication and to avoid implementing and supporting an adaptor layer that would be needed to support both inference engines consistently.

Similarly, the deepBoosted tagger was migrated from MXNet.

However, it was later discovered that ONNX lacks support for the PPC architecture.
Significant effort was made on the CMS side to add support, e.g. in cms-externals/onnxruntime#4, but this did not give numerically correct results at run time.

There is significant motivation to provide PPC support with numerically correct performance.
So, the more straightforward solution for now is to keep support for TF-based [and MXNet] inference on PPC.

@hqucms @makortel @mrodozov


slava77 commented Feb 13, 2020

assign reconstruction

@cmsbuild

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

@slava77 Thanks for opening the issue. Are you essentially suggesting to continue using ONNX inference on x86 and ARM, and TF on POWER (i.e. not using TF for all architectures, e.g. after #28711)?

Technically we could do that with SwitchProducer, but we would need to solve #28576 first (and that will likely take a while).

@mrodozov @smuzaffar We should probably anyway disable building the ONNX-dependent libraries and plugins on POWER.


slava77 commented Feb 13, 2020

@makortel
I think that a SwitchProducer would be a good option.
The alternative is some "ppcProduction" process modifier, which may be possible to pass through the submission infrastructure.

I'm not sure if #28576 fully applies here. IIUC, it is possible to build ONNX on Power and load libraries without anything exploding.
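The per-architecture dispatch being discussed here can be sketched in plain Python. This is illustrative only, not actual CMSSW code; the function name and backend labels are assumptions standing in for the SwitchProducer / modifier mechanics:

```python
import platform

def pick_backend(arch=None):
    """Return an inference backend name for a CPU architecture.

    Sketch of the idea under discussion: ONNX on x86 and ARM,
    falling back to the TensorFlow (or MXNet) inference path on
    POWER, where ONNXRuntime was found to be numerically unreliable.
    """
    arch = arch or platform.machine()
    if arch == "ppc64le":
        return "tensorflow"  # pre-#28112 inference path
    return "onnx"

print(pick_backend("x86_64"))   # -> onnx
print(pick_backend("ppc64le"))  # -> tensorflow
```

A SwitchProducer would make this choice at runtime inside an unchanged configuration, whereas a modifier bakes it into the configuration itself.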

@makortel

> The alternative is some "ppcProduction" process modifier, which may be possible to pass through the submission infrastructure.

I just want to remind that in that case the configuration hash would be different between x86 and POWER (which motivated the SwitchProducer).

> I'm not sure if #28576 fully applies here. IIUC, it is possible to build ONNX on Power and load libraries without anything exploding.

At the moment the POWER build fails:
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_ppc64le_gcc820/CMSSW_11_1_X_2020-02-12-2300/PhysicsTools/ONNXRuntime

Do I interpret correctly this means that cms-externals/onnxruntime#4 (or something like it) is needed to build ONNX for POWER in the first place?
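The configuration-hash concern above can be shown with a minimal sketch (the producer name is hypothetical; this is not CMSSW code): embedding the backend choice in the configuration, as a "ppcProduction"-style modifier would, makes the hash differ per architecture, which is exactly what the SwitchProducer approach avoids by keeping the configuration identical and deferring the choice to runtime.

```python
import hashlib
import json

def config_hash(backend):
    """Hash a toy configuration that embeds the backend choice."""
    cfg = {"producer": "pfDeepFlavourTagInfos", "backend": backend}
    blob = json.dumps(cfg, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# A modifier changes the config content, hence the hash:
assert config_hash("onnx") != config_hash("tensorflow")
```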


slava77 commented Feb 13, 2020

> Do I interpret correctly this means that cms-externals/onnxruntime#4 (or something like it) is needed to build ONNX for POWER in the first place?

right, that's what I implied by "it's possible to build ONNX on power"

@makortel

> > Do I interpret correctly this means that cms-externals/onnxruntime#4 (or something like it) is needed to build ONNX for POWER in the first place?
>
> right, that's what I implied by "it's possible to build ONNX on power"

Thanks for the clarification. I'm not really fond of "hacking an external to build" as a mid/long-term solution. I suppose the added complexity might be justified in the (very) short term if ONNX continues to be "much faster" than TF for these models even after #28711.

I'd prefer to find a general solution (#28576) because I'm afraid the situation will only get worse in the future (but I'm happy to be proven wrong).

@smuzaffar

cms-externals/onnxruntime#4 only allows us to build, but it does not run: the ONNX unit tests failed on PowerPC.

@slava77 slava77 changed the title recover TF inference for deepJet and DeepDoubleX (migrated to ONNX in #28112) recover TF inference for deepJet and deepDoubleX and MXNET inf for deepBoosted (migrated to ONNX in #28112) Feb 26, 2020

slava77 commented Feb 26, 2020

I just realized that this issue should extend to MXNet as well (used in the deepBoosted taggers).


hqucms commented Mar 10, 2020

#29172 is open to address this issue. Feedback is welcome!


slava77 commented May 1, 2020

I'm closing this for now, after cms-sw/cmsdist#5743 was integrated

@slava77 slava77 closed this as completed May 1, 2020