
ConvertToOnnx options to exclude the data generation pipeline #5271

Open
go2ready opened this issue Jul 1, 2020 · 8 comments
Labels
enhancement (New feature or request), P2 (Priority of the issue for triage purpose: needs to be fixed at some point)

Comments

@go2ready

go2ready commented Jul 1, 2020

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info): 4.7

Issue

  • What did you do?
    Calling mlContext.Model.ConvertToOnnx(model, colSelTrainingData2, fs);

The model's pipeline involved data transformations such as encoding, concatenation, and column manipulation.

  • What happened?

The entire data transformation pipeline is included in the final ONNX model, which I do not want.

  • What did you expect?

I want the model to be converted into an ONNX model whose InputColumn is the InputColumn that is actually fit to the model, excluding all the data transformation steps (encoding, concatenation, and column manipulation) applied before fitting the model.

Source code / logs

@antoniovs1029
Member

Hi, @go2ready. Can you please share the code where you create your model, so that I can fully understand your question? Thanks 😄

antoniovs1029 added the Awaiting User Input (awaiting author to supply further info such as data, model, or repro; will close issue if no more info given) label Jul 1, 2020
antoniovs1029 self-assigned this Jul 1, 2020
@go2ready
Author

go2ready commented Jul 1, 2020

Thank you very much @antoniovs1029, this gist is what I used to train the model and convert it to ONNX: https://gist.github.com/go2ready/05a5f6cf95d98ee8f12c8dda71294f4f

This is what the exported ONNX model looks like:
[screenshot: ONNX graph of the exported model showing the full pipeline]

Ideally I want to produce an ONNX model that contains just the part I circled in red, excluding all previous transformations like OneHotEncoder etc. Is that possible?
[screenshot: ONNX graph with the desired subgraph circled in red]

@antoniovs1029
Member

Hi, @go2ready. Unfortunately there's no option in ConvertToOnnx to achieve this. In fact, I'm not yet sure how to achieve it correctly; I have some ideas, but I'll have to explore them and see if they work before getting back to you.

In the meantime, may I ask what your use case is and why you want to separate the preprocessing and inferencing steps in your ONNX model? Why is exporting the full ONNX model not suitable for you? Thanks.

@go2ready
Author

go2ready commented Jul 1, 2020

Hey @antoniovs1029, thanks for looking into this. Our use case is that training and inferencing are two separate systems, and we have to deploy the model from the training environment to the inferencing environment in ONNX format. The preprocessing steps from training are not needed in our inferencing stage, which already has all the features gathered; the preprocessing steps also make the model file larger and inference slower.

@antoniovs1029
Member

antoniovs1029 commented Jul 1, 2020

Hi, @go2ready. Thanks for your answer. I still haven't had the chance to try out a "hack" that would let you split your ONNX model in two, but I do have some follow-up questions and some suggestions on how to do the hack yourself in the meantime. First, though, please note that ML.NET doesn't provide any direct way to achieve the result you want, and we don't actually encourage users to split ONNX models created with ML.NET. There are some hacks to attempt it, but they are, precisely, "hacks".

So, the follow-up questions are:

  1. In your inferencing system, is your preprocessing done with ML.NET (either an ML.NET pipeline, model, or onnx exported model)?
  2. In your training system, is the preprocessing done with ML.NET (either an ML.NET pipeline, model, or onnx exported model)?

Question 1: In your inferencing system, is your preprocessing done with ML.NET?
If it is not done with ML.NET, then you need to be sure that it actually preprocesses the data in the same way ML.NET does. Looking at the model capture and code snippet you've shared, it seems that your model only does some OneHotEncoding and MapValueToKey operations, which aren't hard to implement outside of ML.NET, but you'd need to make completely sure that the mappings in your inferencing system are exactly the same as the mappings used during training. In more complex models, such as ones using text featurization, making sure that preprocessing is done the same way as in training is very tricky... so that's one of the reasons we don't recommend trying to separate the ONNX pipeline.
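To illustrate why that's brittle, here is a minimal sketch (with a hypothetical column and hypothetical category values, not taken from your data) of what replicating a trained OneHotEncoding by hand outside ML.NET might look like; the hard-coded mapping has to match exactly what ML.NET learned during Fit(), and has to be regenerated whenever the training data changes:

    using System.Collections.Generic;

    public static class ManualOneHot
    {
        // This mapping must match the category-to-slot mapping that ML.NET learned
        // when fitting the OneHotEncoding transform; keeping it in sync is manual work.
        private static readonly Dictionary<string, int> CategoryToSlot = new Dictionary<string, int>
        {
            { "CategoryA", 0 },   // hypothetical category values
            { "CategoryB", 1 },
            { "CategoryC", 2 },
        };

        public static float[] Encode(string category)
        {
            var vector = new float[CategoryToSlot.Count];
            if (CategoryToSlot.TryGetValue(category, out int slot))
                vector[slot] = 1f;
            // Categories unseen at training time are left as an all-zeros vector here;
            // whether that matches the trained transform depends on its configuration.
            return vector;
        }
    }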

If it is done with ML.NET, then please consider simply exporting the whole pipeline to ONNX and using that in your inferencing system, since your system would be spending time on preprocessing anyway, and you would be spending some disk space on the preprocessing model either way (whether it's ML.NET-based or the ONNX model exported from ML.NET).

Question 2: In your training system, is the preprocessing done with ML.NET?
If it is not done with ML.NET, then simply creating a pipeline containing only your LightGBM ranking trainer would be enough. Feed it the data from whatever system you use for preprocessing, train the model, and export it to ONNX; it won't have the preprocessing steps. Note that if in Question 1 you answered that your preprocessing is done without ML.NET, you could use that same preprocessing system to preprocess your data when training, and then this approach will work.
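For illustration, a minimal sketch of such a trainer-only setup (the row class, column names, and the GetPreprocessedTrainingRows helper are assumptions for this example, not taken from your gist):

    using System.Collections.Generic;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Hypothetical row type for data that the external system has already preprocessed.
    public class PreprocessedRow
    {
        [KeyType(1000)] public uint GroupId { get; set; }       // query/group id as a key column
        public float Label { get; set; }                         // relevance label
        [VectorType(20)] public float[] Features { get; set; }   // already-encoded feature vector
    }

    // ... (inside your training program)

    var mlContext = new MLContext();

    // Rows produced by the external preprocessing system (hypothetical source).
    IEnumerable<PreprocessedRow> rows = GetPreprocessedTrainingRows();
    IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

    // The pipeline contains only the ranking trainer, so the exported ONNX graph should
    // contain only the LightGBM scoring nodes, with "Features" as its input.
    var pipeline = mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: "Label", featureColumnName: "Features", rowGroupColumnName: "GroupId");
    var model = pipeline.Fit(trainingData);

    using (var stream = File.Create("ranker-only.onnx"))
        mlContext.Model.ConvertToOnnx(model, trainingData, stream);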

If it is done with ML.NET and you can't change that, then it gets trickier. The current implementation of ConvertToOnnx() needs you to pass both the model and the inputDataView, because exporting to ONNX requires information found in both. I won't go into the details of how IDataView is implemented, but the way it is implemented makes it hard for us to have a version of ConvertToOnnx() that can output only partial ONNX models (like the one you want). So if you need a partial model (one that contains just the inferencing steps without the preprocessing steps), you'll need to do a hack similar to what is done in this test inside ML.NET. For your case, the hack would be something like this (a rough code sketch follows the numbered steps):

  1. Transform your data with your preprocessing pipeline: IDataView transformedTrainingData = dataPrepTransformer.Transform(data);
  2. Save the data to disk: mlContext.Data.SaveAsBinary(transformedTrainingData, stream, keepHidden: false);
  3. Load back the data from disk: IDataView reloadedData = mlContext.Data.LoadFromBinary(savedDataPath);
  4. Use the loaded data to train your ranking model: var model = pipeline.Fit(reloadedData);
  5. Export your model to ONNX. Now it should only have the LightGBM inferencing steps.
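Put together, the workaround might look roughly like the following. The raw row type, column names, and preprocessing/trainer pipelines here are illustrative assumptions rather than the code from your gist or from the referenced test; only the save/reload/fit/export calls mirror the steps above:

    using System.Collections.Generic;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Hypothetical raw row type with columns that still need preprocessing.
    public class RawRow
    {
        public string Category { get; set; }
        public float NumericFeature { get; set; }
        public uint GroupId { get; set; }
        public float Label { get; set; }
    }

    // ... (inside your training program)

    var mlContext = new MLContext();
    IDataView rawTrainingData = mlContext.Data.LoadFromEnumerable(
        new List<RawRow> { /* your raw training rows */ });

    // Preprocessing pipeline: the part we do NOT want in the exported ONNX model.
    var dataPrepPipeline = mlContext.Transforms.Categorical.OneHotEncoding("CategoryEncoded", "Category")
        .Append(mlContext.Transforms.Concatenate("Features", "CategoryEncoded", "NumericFeature"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("GroupId"));

    // 1. Transform the raw training data with the preprocessing pipeline.
    ITransformer dataPrepTransformer = dataPrepPipeline.Fit(rawTrainingData);
    IDataView transformedTrainingData = dataPrepTransformer.Transform(rawTrainingData);

    // 2. Save the preprocessed data to disk in ML.NET's binary (.idv) format.
    string savedDataPath = "preprocessed-data.idv";
    using (var stream = File.Create(savedDataPath))
        mlContext.Data.SaveAsBinary(transformedTrainingData, stream, keepHidden: false);

    // 3. Load it back; the reloaded IDataView carries no preprocessing history.
    //    (MultiFileSource simply wraps the file path for the binary loader.)
    IDataView reloadedData = mlContext.Data.LoadFromBinary(new MultiFileSource(savedDataPath));

    // 4. Fit only the trainer (here, the LightGBM ranking trainer) on the reloaded data.
    var trainerPipeline = mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: "Label", featureColumnName: "Features", rowGroupColumnName: "GroupId");
    ITransformer model = trainerPipeline.Fit(reloadedData);

    // 5. Export to ONNX; the graph should now contain only the trainer's inference steps.
    using (var onnxStream = File.Create("model-without-preprocessing.onnx"))
        mlContext.Model.ConvertToOnnx(model, reloadedData, onnxStream);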

Notice that this needs to be done only when training; once you have your ONNX model for inferencing, you won't need to repeat it at inference time.

Also, I think it's possible to skip steps 2 and 3 (saving/loading the preprocessed data to/from disk) by using CreateEnumerable and LoadFromEnumerable respectively, which would be a lot more efficient. But I have never tried out this other hack, and I'm not sure whether it would work.
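For reference, that untried in-memory variant might look roughly like this (TransformedRow is a hypothetical class whose properties mirror the preprocessed columns the trainer needs, and the other variables are the same placeholders as in the sketch above):

    // Hypothetical row type matching the preprocessed columns; the vector size and
    // key cardinality must match the actual preprocessed schema.
    public class TransformedRow
    {
        [KeyType(1000)] public uint GroupId { get; set; }
        public float Label { get; set; }
        [VectorType(4)] public float[] Features { get; set; }
    }

    // ... (replacing steps 2 and 3 above)

    // Materialize the preprocessed rows in memory instead of writing them to disk...
    IEnumerable<TransformedRow> rows = mlContext.Data
        .CreateEnumerable<TransformedRow>(transformedTrainingData, reuseRowObject: false);

    // ...and load them back as a fresh IDataView with no preprocessing steps attached.
    IDataView inMemoryData = mlContext.Data.LoadFromEnumerable(rows);

    // Fit the trainer-only pipeline and export to ONNX exactly as in steps 4 and 5 above.
    ITransformer model = trainerPipeline.Fit(inMemoryData);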

Please let us know if the suggestions I made work for you... and if you end up using the hack I described, please let me know if you run into any problems with it.

@antoniovs1029 antoniovs1029 added P3 Doc bugs, questions, minor issues, etc. question Further information is requested labels Jul 1, 2020
@antoniovs1029
Member

By the way, another alternative would be to explore whether there are any tools elsewhere that let you manipulate an ONNX graph, so that you can manually remove the preprocessing steps. I don't know of any such tool, but maybe OnnxRuntime provides something for this.

@antoniovs1029
Member

Hi, @go2ready. Any updates on this? Were you able to apply any of the suggestions I made? Which one?
Getting feedback from users on this would help us better understand what changes we should be targeting in the ConvertToOnnx method if we were to enable users to export only partial pipelines.

Thanks!

@antoniovs1029
Member

antoniovs1029 commented Jul 10, 2020

I'm tagging this as P2 / Feature Enhancement to keep track of the feature request for enabling users to export only part of their pipeline, instead of the whole pipeline, without having to go through the workaround described in my previous comment: #5271 (comment)

antoniovs1029 added the enhancement (New feature or request) and P2 (Priority of the issue for triage purpose: needs to be fixed at some point) labels and removed the P3 (Doc bugs, questions, minor issues, etc.) and question (Further information is requested) labels Jul 10, 2020
antoniovs1029 removed their assignment Jul 13, 2020
mstfbl removed the Awaiting User Input label Jul 20, 2020