
[ML] adds new n_gram_encoding custom processor #61578

Merged

Conversation

benwtrent
Member

@benwtrent benwtrent commented Aug 26, 2020

This adds a new n_gram_encoding feature processor for analytics and inference.

The focus of this processor is simple ngram encodings that allow:

  • multiple ngrams [1..5]
  • Prefix, infix, suffix

Format

"n_gram_encoding": {
  "field": <input field name>,
  "n_grams": <array of int indicating ngrams desired, required. Max val 5, min val 1>,
  "feature_prefix": <optional feature name prefix. Defaults n_gram_<start>_<length>,
  "start": optional start index. Defaults to 0. Can be negative to indicate suffix starting,
  "length": optional string length to encode to ngrams. Default to 50, max 100,
}

Example usage:

PUT _ml/data_frame/analytics/foo
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "goof"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "DistanceKilometers",
      "num_top_feature_importance_values": 3,
      "feature_processors": [{
        "n_gram_encoding": {
          "field": "OriginCityName",
          "n_grams": [1, 2, 3],
          "feature_prefix": "f"
        }
      }]
    }
  },
  "analyzed_fields": {"includes": ["OriginCityName","DistanceKilometers"]},
  "model_memory_limit": "1gb"
}

The feature names returned from the encoding have the following format:

<feature_prefix>.<n_gram><pos>

Example:
for the string "cat" with feature_prefix: "f"

f.20: "ca" (the 2-gram starting at position 0)

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent marked this pull request as draft August 26, 2020 12:22
@benwtrent benwtrent marked this pull request as ready for review August 27, 2020 18:07
@benwtrent
Member Author

run elasticsearch-ci/packaging-sample-windows

@benwtrent
Member Author

@elasticmachine update branch

Member

@davidkyle davidkyle left a comment

LGTM

 */
public class NGram implements PreProcessor {

    public static final long SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(NGram.class);
Member

I can't see this field being used in the client.

Member Author

Yep, should be deleted.

this.field = ExceptionsHelper.requireNonNull(field, FIELD);
this.featurePrefix = ExceptionsHelper.requireNonNull(featurePrefix, FEATURE_PREFIX);
this.nGrams = ExceptionsHelper.requireNonNull(nGrams, NGRAMS);
if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {
Member

Suggested change
if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {
if (Arrays.stream(this.nGrams).anyMatch(i -> (i < MIN_GRAM) || (i > MAX_GRAM))) {

final int len = Math.min(startPos + length, stringValue.length());
for (int i = 0; i < len; i++) {
    for (int nGram : nGrams) {
        if (startPos + i + nGram - 1 >= len) {
Member

Suggested change
if (startPos + i + nGram - 1 >= len) {
if (startPos + i + nGram > len) {

null,
null,
Collections.singletonList(new NGram(TEXT_FIELD, "f", new int[]{1, 2}, 0, 2, true))))
.setAnalyzedFields(new FetchSourceContext(true, new String[]{TEXT_FIELD, NUMERICAL_FIELD}, new String[]{}))
Member

I found this confusing at first because I thought the analyzed fields should include the ngram f.x fields and exclude the TEXT_FIELD. setAnalyzedFields is now poorly named; it is more like setFetchedFields.

Is there a way of specifying which ngram fields should be modelled, or, for the output of any pre-processor, which fields are used?

Member Author

analyzed_fields = All fields grabbed from docs. These fields are chosen for FULL analysis (including being processed)

There is no way of specifying feature inclusion for processed features. They are always included. This is for API simplicity.

Maybe renaming analyzed_fields to fetched_fields would be more appropriate.

@dimitris-athanasiou ^ what do you think?

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit 2341b20 into elastic:master Sep 3, 2020
@benwtrent benwtrent deleted the feature/ml-analytics-ngram-processor branch September 3, 2020 16:23
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Sep 3, 2020
This adds a new `n_gram_encoding` feature processor for analytics and inference.

The focus of this processor is simple ngram encodings that allow:
 - multiple ngrams [1..5]
 - Prefix, infix, suffix
benwtrent added a commit that referenced this pull request Sep 4, 2020
* [ML] adds new n_gram_encoding custom processor (#61578)

This adds a new `n_gram_encoding` feature processor for analytics and inference.

The focus of this processor is simple ngram encodings that allow:
 - multiple ngrams [1..5]
 - Prefix, infix, suffix
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021