[ML] Add new ml_standard tokenizer for ML categorization #72744

Merged
merged 1 commit into from
May 5, 2021


droberts195
Contributor

This new tokenizer, ml_standard, is very similar to the original
ml_classic tokenizer.

The difference is that ml_standard aims to parse URLs and
filesystem paths as single tokens instead of splitting them at
the slashes.
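
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the Elasticsearch implementation: the regular expressions, the function names `tokenize_classic_like` and `tokenize_standard_like`, and the sample log line are all assumptions chosen for illustration, and the real tokenizers apply further rules that are not modelled here.

```python
import re

# Hypothetical illustration only; NOT the actual ml_classic / ml_standard code.
# "Classic-like" rule: a token is a run of letters, digits, '_', '.' or '-'.
# Slashes and colons are not token characters, so paths and URLs get split.
CLASSIC_TOKEN = re.compile(r"[A-Za-z0-9_.\-]+")

# "Standard-like" rule (assumed): a whitespace-delimited chunk that looks like
# a URL (scheme://...) or a slash-separated path is kept as one token.
URL_OR_PATH = re.compile(
    r"(?:[A-Za-z][A-Za-z0-9+.\-]*://\S+"   # URL with a scheme
    r"|/?(?:[\w.\-]+/)+[\w.\-]*"           # path containing at least one slash
    r"|/[\w.\-]+)"                         # single-component absolute path
)


def tokenize_classic_like(text):
    """Split everywhere a non-token character occurs, including slashes."""
    return CLASSIC_TOKEN.findall(text)


def tokenize_standard_like(text):
    """Keep URL-like and path-like chunks whole; otherwise use the classic rule."""
    tokens = []
    for chunk in text.split():
        chunk = chunk.strip(",;:()[]\"'")  # drop surrounding punctuation
        if not chunk:
            continue
        if URL_OR_PATH.fullmatch(chunk):
            tokens.append(chunk)
        else:
            tokens.extend(CLASSIC_TOKEN.findall(chunk))
    return tokens


if __name__ == "__main__":
    line = "Failed to open /var/log/app/server.log from https://example.com/api/v1"
    print(tokenize_classic_like(line))
    print(tokenize_standard_like(line))
```

With the classic-like rule the path and the URL each break into several short tokens, whereas the standard-like rule yields one token for each.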

@droberts195 droberts195 requested a review from edsavage May 5, 2021 11:03
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 5, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor Author

This new tokenizer is not used anywhere by default at the time of opening this PR. The intention is that, if testing in conjunction with C++ changes goes well, a follow-up PR will make it the default for new jobs from 7.14 onwards. (If the overall categorization project isn't ready for 7.14, this change will probably be backed out of 7.14.)

Contributor

@edsavage edsavage left a comment


LGTM

@droberts195 droberts195 merged commit 9bac76b into elastic:master May 5, 2021
@droberts195 droberts195 deleted the add_ml_standard_tokenizer branch May 5, 2021 12:35
droberts195 added a commit that referenced this pull request May 5, 2021
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1
4 participants