
Support analyzer for keyword type #18064

Closed
dadoonet opened this issue Apr 29, 2016 · 12 comments

@dadoonet
Member

Sometimes you want to analyze text to make it consistent when running aggregations on top of it.

For example, let's say I have a city field mapped as a keyword.

This field can contain San Francisco, SAN FRANCISCO, San francisco...

If I build a terms aggregation on top of it, I will end up with:

San Francisco: 1
SAN FRANCISCO: 1
San francisco: 1

I'd like to be able to analyze this text before it gets indexed. Of course I could use a text field instead and set fielddata: true but that would not create doc values for this field.
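For concreteness, here is the kind of request I mean (a sketch; my_index is an assumed index name, with city mapped as keyword as above):

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": { "field": "city" }
    }
  }
}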

I can imagine allowing an analyzer at index time for this field.

We could restrict its usage if we wish and only allow analyzers that use tokenizers like lowercase, keyword, or path, but I would let the user decide.

If we allow setting analyzer: simple for example, my aggregation will become:

san francisco: 3

The same applies to the path tokenizer.

Let's say I'm building a directory tree like:

/tmp/dir1/file1.txt
/tmp/dir1/file2.txt
/tmp/dir2/file3.txt
/tmp/dir2/file4.txt

Applying a path tokenizer would help me generate an aggregation like:

/tmp/dir1: 2
/tmp/dir2: 2
/tmp: 4
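For reference, the tokens a path tokenizer produces can be previewed with the _analyze API (a sketch, assuming the built-in path_hierarchy tokenizer and the 5.x request syntax):

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/tmp/dir1/file1.txt"
}

which returns /tmp, /tmp/dir1 and /tmp/dir1/file1.txt as tokens.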
@clintongormley clintongormley added >enhancement help wanted adoptme :Search/Mapping Index mappings, including merging and defining field types and removed discuss labels Apr 29, 2016
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 23, 2016
…lastic#19028

This is the same as what Lucene does for its analysis factories, and we have
tests that make sure that the elasticsearch factories are in sync with
Lucene's. This is a first step to move forward on elastic#9978 and elastic#18064.
@jpountz
Contributor

jpountz commented Jul 13, 2016

Most of the work needed to implement this feature has been merged into Lucene and will be available in 6.2. Analyzer got a new method called normalize that applies only the subset of the analysis chain that deals with normalization (and not e.g. stemming): https://issues.apache.org/jira/browse/LUCENE-7355.

Note that it would NOT work for the path tokenization use case mentioned above, since normalize is restricted to producing a single token; such use cases would have to be handled differently, e.g. using an ingest processor.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing. I am currently thinking about:

"my_field": {
  "type": "keyword",
  "normalizer": "standard"
}

This would avoid potential confusion about what happens with analyzers that generate multiple tokens, and make it clearer that only normalization is applied.

@dadoonet
Member Author

Note that it would NOT work for the path tokenization use case mentioned above, since normalize is restricted to producing a single token; such use cases would have to be handled differently, e.g. using an ingest processor.

That would complicate the process but I guess we have to live with that. At least, we have a workaround.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing.

Totally agree.

@synhershko
Contributor

Instead of calling it a "normalizer", I'd call it by its name, token_filters, and accept an array of token filters. I don't think analyzers should be used here, as they imply the use of a tokenizer.
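Something like this, as a hypothetical sketch of the proposed mapping (token_filters is not an existing property):

"my_field": {
  "type": "keyword",
  "token_filters": ["lowercase", "asciifolding"]
}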

@jpountz
Contributor

jpountz commented Nov 2, 2016

I think I agree with that. I initially thought that integration with https://issues.apache.org/jira/browse/LUCENE-7355 would make sense, but maybe we should just apply a list of token filters manually; this would probably be simpler.

@synhershko
Contributor

Yeah, I think it's a much simpler approach than involving a queryparser here. No need for one IMO. Also please note that order matters in the token_filters array.

@clintongormley

What about character filters? They can also be useful here. My initial thought was to keep it as analyzers and to only allow analyzers which use the keyword tokenizer. But normalizers would work too...

@ugolas

ugolas commented Nov 9, 2016

Hi guys, great to see there's an enhancement for this requirement!

Any idea how I can support case-insensitive search on a "keyword" type field (which I also use for aggregations) in v5.0?

In ES 2.3 I used:

"analyzer_keyword": {
  "tokenizer": "keyword",
  "filter": "lowercase"
}

But that does not seem to work without enabling fielddata in ES 5.

Any workaround I can use for now?

@dadoonet
Member Author

dadoonet commented Nov 9, 2016

You can use ingest to lowercase your field.
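For example, a sketch using the lowercase processor (the pipeline name lowercase_city is arbitrary):

PUT _ingest/pipeline/lowercase_city
{
  "processors": [
    { "lowercase": { "field": "city" } }
  ]
}

and then index documents with ?pipeline=lowercase_city so the value is lowercased before it is indexed.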

@jpountz jpountz self-assigned this Nov 25, 2016
@sumithub

sumithub commented Dec 1, 2016

Hi guys,
Since Lucene added custom analyzer normalization in the 6.2 release (https://issues.apache.org/jira/browse/LUCENE-7355), just wondering whether this feature will be available in Elasticsearch soon?
Our application makes heavy use of aggregations on lowercase-filtered fields, which need doc values.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Dec 27, 2016
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes elastic#18064
jpountz added a commit that referenced this issue Dec 30, 2016
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes #18064
@wgerlach

I understand from this thread that the ability to sort case-insensitively has been added. But how? Is there documentation or an example available?

@fabiocatalao

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5
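In short, the setup looks something like this (a sketch assuming the 5.x syntax; the index, type and normalizer names are just examples):

PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}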

jarib added a commit to liquidinvestigations/hoover-snoop2 that referenced this issue Aug 2, 2018
Currently, using the path_hierarchy tokenizer means we can't aggregate
on the field. That means we would have to set `"fielddata": true`, which comes
with a memory cost. This was discussed but not solved in elastic/elasticsearch#18064.

Perhaps fielddata would be OK for file paths like ours, but this seems safer.
@coreation

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5

Thanks a million!
