
ConvertToOnnx options to exclude the data generation pipeline #5271

Open
go2ready opened this issue Jul 1, 2020 · 8 comments
Labels
enhancement (New feature or request), P2 (Priority of the issue for triage purpose: needs to be fixed at some point)

Comments

@go2ready

go2ready commented Jul 1, 2020

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info): 4.7

Issue

  • What did you do?
    Calling mlContext.Model.ConvertToOnnx(model, colSelTrainingData2, fs);

The model's pipeline involved data transformations such as encoding, concatenation, and column manipulation.

  • What happened?

The entire data transformation pipeline is included in the final ONNX model, which I do not want.

  • What did you expect?

I want the model to be converted into an ONNX model whose InputColumn is the InputColumn that is actually fit to the model, excluding all the data transformation steps (encoding, concatenation, and column manipulation) applied before fitting the model.

Source code / logs

@antoniovs1029
Member

Hi, @go2ready. Can you please share the code where you create your model, so that I can fully understand your question? Thanks 😄

antoniovs1029 added the Awaiting User Input (awaiting author to supply further info such as data, model, or repro; will close issue if no more info given) label Jul 1, 2020
antoniovs1029 self-assigned this Jul 1, 2020
@go2ready
Author

go2ready commented Jul 1, 2020

Thank you very much @antoniovs1029, this gist is what I used to train the model and convert it to ONNX: https://gist.github.com/go2ready/05a5f6cf95d98ee8f12c8dda71294f4f

This is what the exported ONNX model looks like:
[screenshot: ONNX graph of the exported model showing the full pipeline]

Ideally I want to produce an ONNX model that contains just the part I circled in red, excluding all previous transformations like OneHotEncoder etc. Is that possible?
[screenshot: ONNX graph with the desired subgraph circled in red]

@antoniovs1029
Member

Hi, @go2ready. Unfortunately there's no option in ConvertToOnnx to achieve this. In fact, I'm not yet sure how to achieve it correctly; I have some ideas, but I'll have to explore them and see if they work before getting back to you.

In the meantime, may I ask what your use case is and why you want to separate the preprocessing and inferencing steps in your ONNX model? Why is exporting the full ONNX model not suitable for you? Thanks.

@go2ready
Author

go2ready commented Jul 1, 2020

Hey @antoniovs1029, thanks for looking into this. Our use case is that training and inferencing are two separate systems, and we have to deploy the model from the training environment to the inferencing environment in ONNX format. The preprocessing steps from training are not needed in our inferencing stage, which already has all the features gathered; the preprocessing steps also make the model file larger and inference slower.

@antoniovs1029
Member

antoniovs1029 commented Jul 1, 2020

Hi, @go2ready. Thanks for your answer. I still haven't had the chance to try out a "hack" that would let you split your ONNX model in two, but I do have some follow-up questions and some suggestions on how to do the hack yourself in the meantime. First, though, please note that ML.NET doesn't provide any direct way to achieve the result you want, and we don't actually encourage users to split ONNX models created with ML.NET. There are some hacks to attempt it, but they are, precisely, "hacks".

So, the follow-up questions are:

  1. In your inferencing system, is your preprocessing done with ML.NET (either an ML.NET pipeline, model, or onnx exported model)?
  2. In your training system, is the preprocessing done with ML.NET (either an ML.NET pipeline, model, or onnx exported model)?

Question 1: In your inferencing system, is your preprocessing done with ML.NET?
If it is not done with ML.NET, then you need to be sure that it actually preprocesses the data in the same way ML.NET does. Looking at the model capture and code snippet you've shared, it seems that your model only does some OneHotEncoding and MapValueToKey operations, which aren't hard to implement outside of ML.NET, but you'd need to make completely sure that the mappings in your inferencing system are exactly the same as the mappings used during training. In more complex models, such as ones using text featurization, making sure that preprocessing is done the same way as in training is very tricky... so that's one of the reasons we don't recommend trying to separate the ONNX pipeline.
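To illustrate why that's brittle, here is a minimal sketch (with a hypothetical column and hypothetical category values, not taken from your data) of what replicating a trained OneHotEncoding by hand outside ML.NET might look like; the hard-coded mapping has to match exactly what ML.NET learned during Fit(), and has to be regenerated whenever the training data changes:

    using System.Collections.Generic;

    public static class ManualOneHot
    {
        // This mapping must match the category-to-slot mapping that ML.NET learned
        // when fitting the OneHotEncoding transform; keeping it in sync is manual work.
        private static readonly Dictionary<string, int> CategoryToSlot = new Dictionary<string, int>
        {
            { "CategoryA", 0 },   // hypothetical category values
            { "CategoryB", 1 },
            { "CategoryC", 2 },
        };

        public static float[] Encode(string category)
        {
            var vector = new float[CategoryToSlot.Count];
            if (CategoryToSlot.TryGetValue(category, out int slot))
                vector[slot] = 1f;
            // Categories unseen at training time are left as an all-zeros vector here;
            // whether that matches the trained transform depends on its configuration.
            return vector;
        }
    }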

If it is done with ML.NET, then please consider simply exporting the whole pipeline to ONNX and using that in your inferencing system, since your system would be spending time on preprocessing anyway, and you would be spending some disk space on the preprocessing model either way (whether it's ML.NET-based or the ONNX model exported from ML.NET).

Question 2: In your training system, is the preprocessing done with ML.NET?
If it is not done with ML.NET, then simply creating a pipeline containing only your LightGBM ranking trainer would be enough. Feed it the data from whatever system you use for preprocessing, train the model, and export it to ONNX; it won't have the preprocessing steps. Note that if in Question 1 you answered that your preprocessing is done without ML.NET, you could use that same preprocessing system to preprocess your data when training, and then this approach will work.
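For illustration, a minimal sketch of such a trainer-only setup (the row class, column names, and the GetPreprocessedTrainingRows helper are assumptions for this example, not taken from your gist):

    using System.Collections.Generic;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Hypothetical row type for data that the external system has already preprocessed.
    public class PreprocessedRow
    {
        [KeyType(1000)] public uint GroupId { get; set; }       // query/group id as a key column
        public float Label { get; set; }                         // relevance label
        [VectorType(20)] public float[] Features { get; set; }   // already-encoded feature vector
    }

    // ... (inside your training program)

    var mlContext = new MLContext();

    // Rows produced by the external preprocessing system (hypothetical source).
    IEnumerable<PreprocessedRow> rows = GetPreprocessedTrainingRows();
    IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

    // The pipeline contains only the ranking trainer, so the exported ONNX graph should
    // contain only the LightGBM scoring nodes, with "Features" as its input.
    var pipeline = mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: "Label", featureColumnName: "Features", rowGroupColumnName: "GroupId");
    var model = pipeline.Fit(trainingData);

    using (var stream = File.Create("ranker-only.onnx"))
        mlContext.Model.ConvertToOnnx(model, trainingData, stream);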

If it is done with ML.NET and you can't change that, then it gets trickier. The current implementation of ConvertToOnnx() needs you to pass both the model and the inputDataView, because exporting to ONNX requires information found in both. I won't go into the details of how IDataView is implemented, but the way it is implemented makes it hard for us to have a version of ConvertToOnnx() that can output only partial ONNX models (like the one you want). So if you need a partial model (one that contains just the inferencing steps without the preprocessing steps), you'll need to do a hack similar to what is done in this test inside ML.NET. For your case, the hack would be something like this (a rough code sketch follows the numbered steps):

  1. Transform your data with your preprocessing pipeline: IDataView transformedTrainingData = dataPrepTransformer.Transform(data);
  2. Save the data to disk: mlContext.Data.SaveAsBinary(transformedTrainingData, stream, keepHidden: false);
  3. Load back the data from disk: IDataView reloadedData = mlContext.Data.LoadFromBinary(savedDataPath);
  4. Use the loaded data to train your ranking model: var model = pipeline.Fit(reloadedData);
  5. Export your model to ONNX. Now it should only have the LightGBM inferencing steps.
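Put together, the workaround might look roughly like the following. The raw row type, column names, and preprocessing/trainer pipelines here are illustrative assumptions rather than the code from your gist or from the referenced test; only the save/reload/fit/export calls mirror the steps above:

    using System.Collections.Generic;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Hypothetical raw row type with columns that still need preprocessing.
    public class RawRow
    {
        public string Category { get; set; }
        public float NumericFeature { get; set; }
        public uint GroupId { get; set; }
        public float Label { get; set; }
    }

    // ... (inside your training program)

    var mlContext = new MLContext();
    IDataView rawTrainingData = mlContext.Data.LoadFromEnumerable(
        new List<RawRow> { /* your raw training rows */ });

    // Preprocessing pipeline: the part we do NOT want in the exported ONNX model.
    var dataPrepPipeline = mlContext.Transforms.Categorical.OneHotEncoding("CategoryEncoded", "Category")
        .Append(mlContext.Transforms.Concatenate("Features", "CategoryEncoded", "NumericFeature"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("GroupId"));

    // 1. Transform the raw training data with the preprocessing pipeline.
    ITransformer dataPrepTransformer = dataPrepPipeline.Fit(rawTrainingData);
    IDataView transformedTrainingData = dataPrepTransformer.Transform(rawTrainingData);

    // 2. Save the preprocessed data to disk in ML.NET's binary (.idv) format.
    string savedDataPath = "preprocessed-data.idv";
    using (var stream = File.Create(savedDataPath))
        mlContext.Data.SaveAsBinary(transformedTrainingData, stream, keepHidden: false);

    // 3. Load it back; the reloaded IDataView carries no preprocessing history.
    //    (MultiFileSource simply wraps the file path for the binary loader.)
    IDataView reloadedData = mlContext.Data.LoadFromBinary(new MultiFileSource(savedDataPath));

    // 4. Fit only the trainer (here, the LightGBM ranking trainer) on the reloaded data.
    var trainerPipeline = mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: "Label", featureColumnName: "Features", rowGroupColumnName: "GroupId");
    ITransformer model = trainerPipeline.Fit(reloadedData);

    // 5. Export to ONNX; the graph should now contain only the trainer's inference steps.
    using (var onnxStream = File.Create("model-without-preprocessing.onnx"))
        mlContext.Model.ConvertToOnnx(model, reloadedData, onnxStream);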

Notice that this needs to be done only when training; once you have your ONNX model for inferencing, you won't need to repeat it at inference time.

Also, I think it's possible to skip steps 2 and 3 (saving/loading the preprocessed data to/from disk) by using CreateEnumerable and LoadFromEnumerable respectively, which would be a lot more efficient. But I have never tried out this other hack, and I'm not sure whether it would work.
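For reference, that untried in-memory variant might look roughly like this (TransformedRow is a hypothetical class whose properties mirror the preprocessed columns the trainer needs, and the other variables are the same placeholders as in the sketch above):

    // Hypothetical row type matching the preprocessed columns; the vector size and
    // key cardinality must match the actual preprocessed schema.
    public class TransformedRow
    {
        [KeyType(1000)] public uint GroupId { get; set; }
        public float Label { get; set; }
        [VectorType(4)] public float[] Features { get; set; }
    }

    // ... (replacing steps 2 and 3 above)

    // Materialize the preprocessed rows in memory instead of writing them to disk...
    IEnumerable<TransformedRow> rows = mlContext.Data
        .CreateEnumerable<TransformedRow>(transformedTrainingData, reuseRowObject: false);

    // ...and load them back as a fresh IDataView with no preprocessing steps attached.
    IDataView inMemoryData = mlContext.Data.LoadFromEnumerable(rows);

    // Fit the trainer-only pipeline and export to ONNX exactly as in steps 4 and 5 above.
    ITransformer model = trainerPipeline.Fit(inMemoryData);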

Please let us know if the suggestions I made work for you... and if you end up using the hack I described, please let me know if you run into any problems with it.

@antoniovs1029 antoniovs1029 added P3 Doc bugs, questions, minor issues, etc. question Further information is requested labels Jul 1, 2020
@antoniovs1029
Member

By the way, another alternative would be to explore whether there are any tools elsewhere that let you manipulate an ONNX graph, so that you can manually remove the preprocessing steps. I don't know of any such tool, but maybe OnnxRuntime provides something for this.

@antoniovs1029
Member

Hi, @go2ready. Any updates on this? Were you able to apply any of the suggestions I made? Which one?
Getting feedback from users on this would help us better understand what changes we should be targeting in the ConvertToOnnx method if we were to enable users to export only partial pipelines.

Thanks!

@antoniovs1029
Member

antoniovs1029 commented Jul 10, 2020

I'm tagging this as P2 / Feature Enhancement to keep track of the feature request for enabling users to export only part of their pipeline, instead of the whole pipeline, without having to go through the workaround described in my previous comment: #5271 (comment)

antoniovs1029 added the enhancement (New feature or request) and P2 (Priority of the issue for triage purpose: needs to be fixed at some point) labels and removed the P3 (Doc bugs, questions, minor issues, etc.) and question (Further information is requested) labels Jul 10, 2020
antoniovs1029 removed their assignment Jul 13, 2020
mstfbl removed the Awaiting User Input label Jul 20, 2020