Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Avoid propagating input columns when applying Onnx models #4971

Open
wants to merge 8 commits into
base: master
from

Conversation

@antoniovs1029
Copy link
Member

antoniovs1029 commented Mar 25, 2020

The goal of this PR is to:

  1. Fix #4970 regarding ColumnSelectingTransformer onnx exported models not working as expected when applying it with ML.NET
  2. Avoid propagating the input columns through an OnnxTransformer. This is a new requirement that was discussed offline with @harishsk . The idea is that the output schema of the onnxtransformer should contain only a column for each output of the onnx model. Until now, since OnnxTransformer is a RowToRowTransformer, the input schema was propagated to the output.

Fixing point number 2 would actually fix point number 1, given the way that the onnx export works today for ColumnSelectingTransformer (where the dropped columns are removed from the onnx model output inside the onnx graph).

@antoniovs1029 antoniovs1029 requested a review from dotnet/mlnet-core as a code owner Mar 25, 2020
@antoniovs1029 antoniovs1029 changed the title Is25onnx col selector harish Avoid propagating input columns when applying Onnx models Mar 25, 2020
@antoniovs1029

This comment has been minimized.

Copy link
Member Author

antoniovs1029 commented Mar 25, 2020

I think the main problem here is that OnnxTransformer is a RowToRowTransformer, and RTRT's don't allow transformers to remove columns from the output schema, only add columns to it. Because of this, I'd believe there are 2 possible solutions:

  1. Change OnnxTransformer to stop inheriting from RTRT, and add any necessary behavior to it
  2. Change RTRT to make it possible for OnnxTransformer to remove columns through some mechanism.

Currently, this PR uses the second approach, and it's based on a previous attempt that @harishsk tried to implement to solve these problems some months ago, but that weren't included on his final commit because other solutions seemed more appropriate at the time.

With his approach, this PR adds a OnnxDataTransform which will actually transform the input to the output, by still using OnnxTransformer's mapper. It is also necessary to modify RowToRowTransformer in order to make it possible to OnnxTransformer to let OnnxDataTransform to take responsibility of doing the transform.

Personally I am still not convinced that this is the best way to go, particularly after fixing some issues that appeared on our tests while trying to use this approach. I feel that the correct solution should be to stop inheriting from RTRT and then OnnxDataTransform might not be necessary. I still want to explore that other option, perhaps on another PR, but I haven't managed to fully accomplish it, so I'll just leave this PR here and see if anyone has some comments on it.

Also, there are still things that I might be able to remove from OnnxTransformer and OnnxDataTransform, but I will still need to look more into it.

@@ -416,7 +419,7 @@ public void OnnxModelInMemoryImage()
// Convert IDataView back to IEnumerable<ImageDataPoint> so that user can inspect the output, column "softmaxout_1", of the ONNX transform.
// Note that Column "softmaxout_1" would be stored in ImageDataPont.Scores because the added attributed [ColumnName("softmaxout_1")]
// tells that ImageDataPont.Scores is equivalent to column "softmaxout_1".
var transformedDataPoints = ML.Data.CreateEnumerable<ImageDataPoint>(onnx, false).ToList();

This comment has been minimized.

Copy link
@antoniovs1029

antoniovs1029 Mar 25, 2020

Author Member

This test needed to be changed because now we're not propagating the input Image column to the output.

protected override IRowToRowMapper GetRowToRowMapperCore(DataViewSchema inputSchema)
{
Host.CheckValue(inputSchema, nameof(inputSchema));
return new OnnxDataTransform(Host, new EmptyDataView(Host, inputSchema), new Mapper(this, inputSchema));
}

protected override DataViewSchema GetOutputSchemaCore(DataViewSchema inputSchema)
{
return new OnnxDataTransform(Host, new EmptyDataView(Host, inputSchema), new Mapper(this, inputSchema)).OutputSchema;
}

private protected override IDataView MakeDataTransformCore(IDataView input)
{
Host.CheckValue(input, nameof(input));
return new OnnxDataTransform(Host, input, new Mapper(this, input.Schema));
}

Comment on lines +343 to +359

This comment has been minimized.

Copy link
@antoniovs1029

antoniovs1029 Mar 25, 2020

Author Member

Besides the fact that I don't fully like modifying RowToRowTransformer to make these methods overridable, I don't like the way this looks either. It makes it kinda tricky to have a different transform set the output of this mapper, but it's the only way to go with this approach... it gets weirder because the OnnxDataTransform simply gets the output by accessing members that exist on this mapper.

But it's necessary that OnnxTransformer, OnnxTransformer.Mapper, OnnxDataTransform, and OnnxScoringEstimator all give the same output schema, since different code paths will require the output schema from any of these classes. So the entanglement here seems to be necessary. The only alternative is to have each of the classes determine the output schema by themselves, but it might become more difficult to maintain the same code on different places.

@antoniovs1029 antoniovs1029 changed the title Avoid propagating input columns when applying Onnx models [WIP] Avoid propagating input columns when applying Onnx models Mar 25, 2020
@yaeldekel

This comment has been minimized.

Copy link
Member

yaeldekel commented Mar 25, 2020

In the vast majority of cases (including the test that Antonio mentioned needs to change) this will be doing the wrong thing. The ML.NET pipelines (unless they explicitly contain a DropColumns transformer), only add columns, and they don't do anything to the existing columns of the data. A pipeline can contain columns that are not related to the OnnxTransformer, and removing them (especially without giving the user a workaround to not remove them) would make the OnnxTransformer unusable in these cases. For example, imagine that a pipeline has two feature columns - one which the user wants to featurize using an ONNX model, and one that doesn't need featurization, and then the user wants to concatenate the unchanged features column to the output of the OnnxTransformer. Another example of when this would do the wrong thing is if a pipeline contains more than one OnnxTransformer - the first one would remove the input columns, and would make the second one be unusable.

In general, I think that the problem this PR is trying to solve is not a bug. The contract of the OnnxTransformer is to compute outputs of the ONNX model, it does not say anything about the output schema matching the output schema of the ONNX model, so I am not sure there is a change that needs to be made. But I could be misunderstanding the problem that this PR is trying to fix?


In reply to: 603839891 [](ancestors = 603839891)

@antoniovs1029

This comment has been minimized.

Copy link
Member Author

antoniovs1029 commented Mar 25, 2020

In the vast majority of cases (including the test that Antonio mentioned needs to change) this will be doing the wrong thing.

You're right. In fact, the test that is failing right now on the CI demonstrates this. It is failing because it is trying to apply a DNN Featurizer .onnx model that has only 1 input and 1 output to a feature column, before using a classifier... and, with the current implementation on this PR, applying the featurizer also drops the "Label" column, and so then the classifier throws an exception because it can't find that column.

I have discussed this offline with Harish, and I had misunderstood what columns where expected to be propagated. So this was my mistake.

In general, the onnx transformer should only drop input columns when the input columns are actually mentioned as input inside the .onnx model, but they aren't connected to the output of the .onnx model. This way it will actually make the ColumnSelectingTransformer to work as expected (i.e. with the onnx transformer actually dropping the columns).

I will update the PR and my comments here to fit the actual expected behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

2 participants
You can’t perform that action at this time.