Add `keep_types` for filtering by token type #7120

Closed
wants to merge 4 commits into from

6 participants
@rmuir
Contributor

commented Aug 1, 2014

For example, you can combine this with UAXURLTokenizer to extract URLs from text.

This patch works, but I am unsure about the naming of the filter and its parameters.
The underlying Lucene filter actually supports two modes, "stop" and "keep".
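
To make the two modes concrete, here is a minimal Python sketch of filtering by token type. It is an illustration only, not the Lucene or Elasticsearch implementation: `filter_by_type`, the `(text, type)` tuple representation, and the sample tokens are all assumptions made for this example. The type strings `<URL>` and `<ALPHANUM>` follow the style of tags emitted by Lucene's standard tokenizers.

```python
from typing import Iterable, List, Set, Tuple

# (text, type): a simplified stand-in for a token's term text and type tag.
Token = Tuple[str, str]


def filter_by_type(tokens: Iterable[Token], types: Set[str], mode: str = "keep") -> List[Token]:
    """Keep or drop tokens according to their type tag.

    mode="keep" retains only tokens whose type is in `types`;
    mode="stop" drops those tokens and retains everything else.
    """
    if mode not in ("keep", "stop"):
        raise ValueError(f"unknown mode: {mode!r}")
    keep = mode == "keep"
    return [tok for tok in tokens if (tok[1] in types) == keep]


# Hand-written example of what a URL-aware tokenizer might emit (assumed data).
tokens = [
    ("see", "<ALPHANUM>"),
    ("http://example.com/docs", "<URL>"),
    ("for", "<ALPHANUM>"),
    ("details", "<ALPHANUM>"),
]

urls_only = filter_by_type(tokens, {"<URL>"}, mode="keep")
without_urls = filter_by_type(tokens, {"<URL>"}, mode="stop")
```

In "keep" mode only the URL token survives, which is the URL-extraction use case described above; "stop" mode is the inverse.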

@jpountz

Contributor

commented Aug 1, 2014

LGTM

@dakrone dakrone added v1.4.0 labels Aug 1, 2014

@dweinstein


commented Aug 1, 2014

👍

@dweinstein


commented Aug 1, 2014

Will this work for custom patterns? I thought it was great that a UAXURLTokenizer already exists, but what if one isn't so lucky in the future and has to write a tokenizer using a custom pattern? What would the type of those tokens be?

@rmuir

Contributor Author

commented Aug 1, 2014

Well, this is about the token type, which is like a "tag" that the analysis chain attaches to each token. Typically the Lucene tokenizers provide tags if they recognize different types of tokens, but nothing limits it to that. For example, it could hold a part-of-speech tag or whatever else is useful.

This filter just filters by tag; it doesn't do any tagging itself. It couldn't meet all the possible use cases for token types :)

So to tag by "pattern", we would just need a filter that does that. It's separate from what action to take on the actual tags...
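
The separation between tagging and filtering can be sketched in Python as two independent steps. This is a hypothetical illustration, not an existing Lucene or Elasticsearch filter: `tag_by_pattern`, `keep_types`, the `<ISSUE>` type name, and the regular expression are all invented for the example.

```python
import re
from typing import Iterable, Iterator, List, Set, Tuple

# (text, type): a simplified stand-in for a token's term text and type tag.
Token = Tuple[str, str]


def tag_by_pattern(tokens: Iterable[Token], pattern: "re.Pattern", type_name: str) -> Iterator[Token]:
    """Step 1 (tagging): retag tokens whose full text matches `pattern`."""
    for text, tok_type in tokens:
        if pattern.fullmatch(text):
            yield (text, type_name)
        else:
            yield (text, tok_type)


def keep_types(tokens: Iterable[Token], types: Set[str]) -> List[Token]:
    """Step 2 (filtering): keep only tokens whose type tag is in `types`."""
    return [tok for tok in tokens if tok[1] in types]


# Hypothetical custom pattern: issue references such as "#7120".
ISSUE_REF = re.compile(r"#\d+")

raw = [("fixes", "word"), ("#7120", "word"), ("and", "word"), ("#42", "word")]
tagged = list(tag_by_pattern(raw, ISSUE_REF, "<ISSUE>"))
issues = keep_types(tagged, {"<ISSUE>"})
```

Because the filtering step only looks at the tag, it works the same whether the tag came from a tokenizer, a pattern-based tagger like the one sketched here, or something else such as a part-of-speech tagger.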

@rmuir rmuir added the review label Aug 1, 2014

@rmuir

Contributor Author

commented Aug 1, 2014

Adding review just for another opinion on the API before pushing it.

@mikemccand

Contributor

commented Aug 4, 2014

LGTM, except minor naming nit: it's called "keep_types" publicly (I like this name) but in the code it's KeepType (without the s).

@@ -0,0 +1,37 @@
[[analysis-keep-words-tokenfilter]]

@clintongormley

clintongormley Aug 4, 2014

Member

The id of the section should be:

[[analysis-keep-types-tokenfilter]]

@rmuir

Contributor Author

commented Aug 15, 2014

Thanks @clintongormley and @mikemccand

I updated the PR.

@mikemccand

Contributor

commented Aug 15, 2014

LGTM

@mikemccand

Contributor

commented Aug 15, 2014

LGTM, thanks for the rename!

@rmuir rmuir changed the title from "Add keep_types for filtering by token type" to "Analysis: Add keep_types for filtering by token type" Aug 15, 2014

@rmuir rmuir closed this Aug 15, 2014

@jpountz jpountz removed the review label Aug 18, 2014

@clintongormley clintongormley changed the title from "Analysis: Add keep_types for filtering by token type" to "Add `keep_types` for filtering by token type" Jun 6, 2015
