
[ML] add new multi custom processor for data frame analytics and model inference #67362

Merged

Conversation

benwtrent (Member)

This adds the multi custom feature processor to data frame analytics and inference.

The multi processor allows custom processors to be chained together and use the outputs from one processor as the inputs to another.
Example:

{
    "multi_encoding": {
        "processors": [
            {
                "ngram_encoding" : {
                    "field": "foo",
                    "feature_prefix": "f",
                    "n_grams": [1],
                    "length": 2   
                }
            },
            {
                "one_hot_encoding": {
                    "field": "f.11", //OUTPUT from earlier ngram_encoding
                    "hot_map": {"a": "col_a"}
                }
            },
            {
                "one_hot_encoding": {
                    "field": "some_additional_doc_field", // Some other outside field
                    "hot_map": {"cat": "col_cat"}
                }
            }
        ]
    }
}

This definition requires the input fields ["foo", "some_additional_doc_field"] and produces the output fields ["f.12", "col_a", "col_cat"]. f.12 is included because it is an output of ngram_encoding that is not consumed by any later processor in the array.
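The input/output resolution described above can be sketched as follows. This is a hedged Python illustration, not the actual Java implementation in Elasticsearch; the processor tuples simply mirror the JSON example, assuming the ngram encoder emits `f.11` and `f.12` from `foo`:

```python
# Sketch (not the real Elasticsearch code) of how a multi processor could
# derive its overall required inputs and final outputs from a chain.
def required_inputs(processors):
    """Fields the document must supply: inputs not produced by an earlier processor."""
    produced, required = set(), []
    for inputs, outputs in processors:
        for field in inputs:
            if field not in produced and field not in required:
                required.append(field)
        produced.update(outputs)
    return required

def final_outputs(processors):
    """Outputs not consumed by any later processor; consumed ones are internal."""
    outputs = []
    for i, (_, outs) in enumerate(processors):
        consumed_later = set()
        for inputs, _ in processors[i + 1:]:
            consumed_later.update(inputs)
        outputs.extend(o for o in outs if o not in consumed_later)
    return outputs

# Mirrors the JSON example: (input_fields, output_fields) per processor.
chain = [
    (["foo"], ["f.11", "f.12"]),                   # ngram_encoding
    (["f.11"], ["col_a"]),                         # one_hot_encoding
    (["some_additional_doc_field"], ["col_cat"]),  # one_hot_encoding
]
print(required_inputs(chain))  # ['foo', 'some_additional_doc_field']
print(final_outputs(chain))    # ['f.12', 'col_a', 'col_cat']
```

Note that `f.11` drops out of both lists: it is produced by the first processor and consumed by the second, so it is internal to the chain.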

@elasticmachine (Collaborator)

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou (Contributor) left a comment

Looks good! A few comments to get through.

(p, c, n) -> lenient ?
    p.namedObject(LenientlyParsedPreProcessor.class, n, PreProcessor.PreProcessorParseContext.DEFAULT) :
    p.namedObject(StrictlyParsedPreProcessor.class, n, PreProcessor.PreProcessorParseContext.DEFAULT),
(multiBuilder) -> multiBuilder.setOrdered(true),
Contributor

I do not understand why we need to call setOrdered. I also do not understand what orderedModeCallback is all about; it seems to me it's just a way to set some additional property on the created objects. And I find it strange that the other version of declareNamedObjects, the one that doesn't take an orderedModeCallback, passes in one that throws if the value is an array, given that it's declareNamedObjects plural. Not really this PR's issue, but I would like to clear this up. Could you help us with this @nik9000 please?

benwtrent (Member, Author)

@dimitris-athanasiou a user can declare named objects in two ways:

  • a hashmap keyed by the named object names (unordered)
  • an array of objects (ordered)

If I did not pass this callback (and subsequently check in the builder that it was called), the user could pass an unordered hashmap, which we don't allow.

Example of the unordered form:

{
    "multi_encoding": {
        "processors": {
            "ngram_encoding": {
                "field": "foo",
                "feature_prefix": "f",
                "n_grams": [1],
                "length": 2
            },
            "one_hot_encoding": {
                "field": "f.11", //OUTPUT from earlier ngram_encoding
                "hot_map": {"a": "col_a"}
            }
        }
    }
}

^ This is a valid named objects collection, but obviously, its order is not guaranteed. We need to throw in this case.

As for making the declareNamedObjects array parsing more clear, that is probably best for another PR.
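The ordered-only constraint above can be sketched like this. A hedged Python illustration, not the real parser (which uses XContent parsing via declareNamedObjects with an orderedModeCallback; the `parse_processors` helper here is invented): only the array form is accepted, since JSON guarantees the order of array elements but not of object keys.

```python
def parse_processors(multi_encoding):
    """Accept only the ordered (array) form of a named-objects collection."""
    processors = multi_encoding["processors"]
    if isinstance(processors, dict):
        # The map form is a valid named-objects collection, but JSON does
        # not guarantee object key order, so a chain that feeds one
        # processor's outputs into the next cannot rely on it.
        raise ValueError("[processors] must be an array of single-entry objects")
    if not isinstance(processors, list):
        raise ValueError("[processors] must be an array")
    return processors

ordered = {"processors": [{"ngram_encoding": {"field": "foo"}},
                          {"one_hot_encoding": {"field": "f.11"}}]}
unordered = {"processors": {"ngram_encoding": {"field": "foo"},
                            "one_hot_encoding": {"field": "f.11"}}}

print(len(parse_processors(ordered)))  # 2
try:
    parse_processors(unordered)
except ValueError as e:
    print("rejected:", e)
```

The array form keeps each processor in its own single-entry object, so iteration order is well defined; the map form is rejected outright rather than silently reordered.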

@dimitris-athanasiou (Contributor) left a comment

LGTM

@benwtrent benwtrent merged commit cb34ca6 into elastic:master Jan 15, 2021
@benwtrent benwtrent deleted the feature/ml-inference-multi-pre-processor branch January 15, 2021 17:43
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jan 15, 2021
…l inference (elastic#67362)

This adds the multi custom feature processor to data frame analytics and inference.

The `multi_encoding` processor allows custom processors to be chained together and use the outputs from one processor as the inputs to another.
benwtrent added a commit that referenced this pull request Jan 19, 2021
…d model inference (#67362) (#67595)

alyokaz pushed a commit to alyokaz/elasticsearch that referenced this pull request Mar 10, 2021
…l inference (elastic#67362)
