
Support analyzer for keyword type #18064

Closed
dadoonet opened this issue Apr 29, 2016 · 12 comments

@dadoonet
Member

Sometimes you want to analyze text to make it consistent when running aggregations on top of it.

For example, let's say I have a city field mapped as a keyword.

This field can contain San Francisco, SAN FRANCISCO, San francisco...

If I build a terms aggregation on top of it, I will end up with:

San Francisco: 1
SAN FRANCISCO: 1
San francisco: 1

I'd like to be able to analyze this text before it gets indexed. Of course I could use a text field instead and set fielddata: true but that would not create doc values for this field.
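For concreteness, here is the kind of request I mean (a sketch; my_index is an assumed index name, with city mapped as keyword as above):

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": { "field": "city" }
    }
  }
}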

I can imagine allowing an analyzer at index time for this field.

We could restrict its usage if we wish and only allow analyzers that use tokenizers like lowercase, keyword, or path, but I would let the user decide.

If we allow setting analyzer: simple for example, my aggregation will become:

san francisco: 3

The same applies to the path tokenizer.

Let's say I'm building a directory tree like:

/tmp/dir1/file1.txt
/tmp/dir1/file2.txt
/tmp/dir2/file3.txt
/tmp/dir2/file4.txt

Applying a path tokenizer would help me generate an aggregation like:

/tmp/dir1: 2
/tmp/dir2: 2
/tmp: 4
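For reference, the tokens a path tokenizer produces can be previewed with the _analyze API (a sketch, assuming the built-in path_hierarchy tokenizer and the 5.x request syntax):

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/tmp/dir1/file1.txt"
}

which returns /tmp, /tmp/dir1 and /tmp/dir1/file1.txt as tokens.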
@clintongormley clintongormley added >enhancement help wanted adoptme :Search/Mapping Index mappings, including merging and defining field types and removed discuss labels Apr 29, 2016
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 23, 2016
…lastic#19028

This is the same as what Lucene does for its analysis factories, and we have
tests that make sure that the elasticsearch factories are in sync with
Lucene's. This is a first step to move forward on elastic#9978 and elastic#18064.
@jpountz
Contributor

jpountz commented Jul 13, 2016

Most of the work needed to implement this feature has been merged into Lucene and will be available in 6.2. Analyzer got a new method called normalize that applies only the subset of the analysis chain that deals with normalization (and not e.g. stemming): https://issues.apache.org/jira/browse/LUCENE-7355.

Note that it would NOT work for the path tokenization use case mentioned above, since normalize is restricted to producing a single token; such use cases would have to be handled differently, e.g. using an ingest processor.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing. I am currently thinking about:

"my_field": {
  "type": "keyword",
  "normalizer": "standard"
}

This would avoid potential confusion about what happens with analyzers that generate multiple tokens, and make it clearer that only normalization is applied.

@dadoonet
Member Author

Note that it would NOT work for the path tokenization use case mentioned above, since normalize is restricted to producing a single token; such use cases would have to be handled differently, e.g. using an ingest processor.

That would complicate the process but I guess we have to live with that. At least, we have a workaround.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing.

Totally agree.

@synhershko
Contributor

Instead of calling it a "normalizer", I'd call it by its name, token_filters, and accept an array of token filters. I don't think analyzers should be used here, as they imply the use of a tokenizer.
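Something like this, as a hypothetical sketch of the proposed mapping (token_filters is not an existing property):

"my_field": {
  "type": "keyword",
  "token_filters": ["lowercase", "asciifolding"]
}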

@jpountz
Contributor

jpountz commented Nov 2, 2016

I think I agree with that. I initially thought that integration with https://issues.apache.org/jira/browse/LUCENE-7355 would make sense, but maybe we should just apply a list of token filters manually; this would probably be simpler.

@synhershko
Contributor

Yeah, I think it's a much simpler approach than involving a queryparser here. No need for one IMO. Also please note that order matters in the token_filters array.

@clintongormley

What about character filters? They can also be useful here. My initial thought was to keep it as analyzers and to only allow analyzers which use the keyword tokenizer. But normalizers would work too...

@ugolas

ugolas commented Nov 9, 2016

Hi guys, great to see there's an enhancement for this requirement!

Any idea how I can support case-insensitive search on a "keyword" type field (which I also use for aggregations) in v5.0?

In ES 2.3 I used:

"analyzer_keyword": {
  "tokenizer": "keyword",
  "filter": "lowercase"
}

But that does not seem to work without enabling fielddata in ES 5.

Any workaround I can use for now?

@dadoonet
Member Author

dadoonet commented Nov 9, 2016

You can use ingest to lowercase your field.
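For example, a sketch using the lowercase processor (the pipeline name lowercase_city is arbitrary):

PUT _ingest/pipeline/lowercase_city
{
  "processors": [
    { "lowercase": { "field": "city" } }
  ]
}

and then index documents with ?pipeline=lowercase_city so the value is lowercased before it is indexed.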

@jpountz jpountz self-assigned this Nov 25, 2016
@sumithub

sumithub commented Dec 1, 2016

Hi guys,
Since Lucene added custom analyzer normalization in the 6.2 release (https://issues.apache.org/jira/browse/LUCENE-7355), just wondering whether this feature will be available in Elasticsearch soon?
Our application makes heavy use of aggregations on lowercase-filtered fields, which need doc values.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Dec 27, 2016
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes elastic#18064
jpountz added a commit that referenced this issue Dec 30, 2016
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes #18064
@wgerlach

I understand from this thread that the ability to sort case-insensitively has been added. But how? Is there documentation or an example available?

@fabiocatalao

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5
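In short, the setup looks something like this (a sketch assuming the 5.x syntax; the index, type and normalizer names are just examples):

PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}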

jarib added a commit to liquidinvestigations/hoover-snoop2 that referenced this issue Aug 2, 2018
Currently, using the path_hierarchy tokenizer means we can't aggregate
on the field. That means we would have to set `"fielddata": true`, which comes
with a memory cost. This was discussed but not solved in elastic/elasticsearch#18064.

Perhaps fielddata would be OK for file paths like ours, but this seems safer.
@coreation

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5

Thanks a million!
