
Change default Analyzer in next major version #3775

Closed
s1monw opened this Issue Sep 25, 2013 · 15 comments

@s1monw (Contributor) commented Sep 25, 2013

The default analyzer (StandardAnalyzer in Lucene terms) is not a good default, since it is really aimed at English full text. I think it would be wise to use only a StandardTokenizer, which is based on Unicode Standard Annex #29, plus a LowercaseFilter. I could see using an ASCIIFoldingFilter as well, since most users will expect folding to work out of the box.

This is really a basis for discussion, but I think we should get rid of stopwords in the default analyzer, since they are trappy.
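[Editor's note: as a concrete illustration, the proposed chain could be built in plain Lucene roughly like this. This is a minimal sketch against recent Lucene APIs; the class name is made up, and the package location of LowerCaseFilter varies across Lucene versions.]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of the proposed default: UAX #29 word-boundary tokenization plus
// lowercasing, and nothing else -- in particular, no stopword removal.
public final class ProposedDefaultAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();     // Unicode Standard Annex #29
        TokenStream sink = new LowerCaseFilter(source); // lowercase only
        return new TokenStreamComponents(source, sink);
    }
}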

ghost assigned s1monw on Sep 25, 2013

@nik9000 (Contributor) commented Sep 25, 2013

I like this idea but you should be careful about upgrades. Maybe make a new analyzer and default new fields to that. Everything that was created as standard (defaulted or not) should stay standard or upgrading is going to be rough.

I'm torn on ASCIIFoldingFilter. English users expect it, but it is trappy for a bunch of other languages (folding ü to u, for example, conflates distinct German words such as schon and schön).

+1 on removing stop words.

@s1monw (Contributor, Author) commented Sep 25, 2013

@nik9000 We maintain compatibility on upgrades; if you create a new index, though, you will get the new behaviour. IMO we should keep the name standard for the Lucene standard analyzer, call this one es_default, and make the default analyzer point to es_default when a new index is created. I am on the fence about ASCII folding as well. IMO we should just use tokenization and lowercasing, and drop stopwords.

@nik9000 (Contributor) commented Sep 25, 2013

> We maintain compatibility on upgrades; if you create a new index, though, you will get the new behaviour. IMO we should keep the name standard for the Lucene standard analyzer, call this one es_default, and make the default analyzer point to es_default when a new index is created. I am on the fence about ASCII folding as well. IMO we should just use tokenization and lowercasing, and drop stopwords.

Perfect.

@drewr (Member) commented Sep 25, 2013

+1 👍

@brusic (Contributor) commented Sep 25, 2013

Another perspective:

Better documentation can alleviate some of the issues faced by beginners with Elasticsearch. Currently, the only references to how analysis works are buried in the documentation of the index module. Users are confused by the standard analyzer because they do not understand the analysis pipeline. Analysis needs to be documented at the top-most level. If a new user understands analysis from the start, then the "pitfalls" of the standard analyzer can be anticipated.
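[Editor's note: one way documentation could make the pipeline visible is a snippet that prints what an analyzer actually emits. This is a sketch using recent Lucene APIs; dumpTokens is a made-up helper name.]

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Print the tokens an analyzer produces, making the analysis pipeline visible.
static void dumpTokens(Analyzer analyzer, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("field", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term);
        }
        ts.end();
    }
}

// With English stopwords configured, "The Quick brown fox" comes out as
// "quick", "brown", "fox" -- the silently dropped "the" is exactly the
// pitfall beginners trip over:
// dumpTokens(new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET),
//            "The Quick brown fox");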

Many of us come from a Lucene background, so the concepts are already familiar, but we might forget that not everyone has the same prior knowledge.

@clintongormley (Member) commented Sep 25, 2013

I wouldn't add in the ASCII folding filter; I'd rather stick with a good generic standard.

@synhershko (Contributor) commented Sep 26, 2013

+1, and I'm pro ASCIIFoldingFilter.

Maybe the right route here is to create multiple out-of-the-box named analyzers, each a good default for a different purpose, and let the documentation do its thing. The actual default could be something not too strict (e.g. including the ASCII filter), while making it easy to switch to something else. Something like a "Default English analyzer", a "Non-ASCII-folding English analyzer", a "French analyzer", and so on.

Another idea is to have that filter as a toggle on all known analyzers. I've seen many use cases where you want to use an analyzer that doesn't do ASCII folding, but you want it to. Instead of subclassing, this could be a great way out.
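[Editor's note: Lucene's AnalyzerWrapper already allows roughly this kind of toggle without subclassing each analyzer. A sketch against Lucene 8.x-era APIs; the wrapper class name is made up.]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

// Hypothetical "folding toggle": wrap any analyzer and append an
// ASCIIFoldingFilter to the end of its token stream.
public final class AsciiFoldingWrapper extends AnalyzerWrapper {
    private final Analyzer delegate;

    public AsciiFoldingWrapper(Analyzer delegate) {
        super(delegate.getReuseStrategy());
        this.delegate = delegate;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
        return delegate;
    }

    @Override
    protected TokenStreamComponents wrapComponents(String fieldName,
                                                   TokenStreamComponents components) {
        TokenStream folded = new ASCIIFoldingFilter(components.getTokenStream());
        return new TokenStreamComponents(components.getSource(), folded);
    }
}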

It might be worth noting that the actual tokenizer is a source of trouble as well: it will no longer preserve acronyms or emails, and some users might have grown to expect that.

@clintongormley (Member) commented Sep 26, 2013

The danger that I see in adding too much stuff to the default analyzer is that you always have your own use case in mind. So it ends up being very useful for that use case, and problematic for others.

The goal of the standard tokenizer is very simple and clean: break words on word boundaries. Nothing more, nothing less. That, plus lowercasing, makes for a good general-purpose analyzer.

We also have the language analyzers available by default. I'd consider perhaps making asciifolding an option there, but then you have to ask yourself: do you want to strip diacritics, or do you want to index both versions, with and without diacritics? Suddenly the number of choices starts exploding.
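[Editor's note: on the "index both versions" point, Lucene's ASCIIFoldingFilter has a preserveOriginal flag that stacks the unfolded token at the same position. A sketch of that variant, illustrative only, against recent Lucene APIs:]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: fold diacritics but keep the original token too, so both
// "uber" and "über" end up in the index at the same position.
Analyzer bothVariants = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream chain = new LowerCaseFilter(source);
        chain = new ASCIIFoldingFilter(chain, true); // true = preserveOriginal
        return new TokenStreamComponents(source, chain);
    }
};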

We have a very flexible system for creating your own custom analyzers which do exactly what you want, so I'm not in favour of adding loads of prebuilt analyzers to try to cover every circumstance. Too many options will just make it more difficult for the user to understand.

And as @brusic said, better documentation will help. /me is working on that

@lukas-vlcek (Contributor) commented Sep 26, 2013

+1 @brusic

@clintongormley If you are interested in contributions to analyzer documentation, I can offer updated materials that I have written about "Setting up Czech analysis in Elasticsearch" for the dev site Zdrojak.cz [1,2]. The second part especially focuses on the Hunspell token filter, and to date I am not aware of any other resource that goes into such important details (patting myself on the back). These concepts apply generally to a lot of other languages, and I am sure it would be beneficial to a wider audience if it were translated to English. Feel free to let me know if there is any public repo where people can contribute.

.oO( Do I mind being an asshole? Should I say the next paragraph? ... sigh! )

Don't get me wrong, but sometimes the new ES documentation phenomenon feels like an Apple TV to me: we all know nothing about it, but still we somehow hope it must be cool and worth waiting for, even though it does little to help users in the meantime. Seriously, is there any reason why the documentation work cannot happen in a more open and transparent way? Especially for the analysis part, input from more people with experience in different languages could be extremely useful. ( //cc @lhawthorn )

[1] http://www.zdrojak.cz/clanky/elasticsearch-vyhledavame-cesky/
[2] http://www.zdrojak.cz/clanky/elasticsearch-vyhledavame-hezky-cesky-ii-a-taky-slovensky/

@brusic (Contributor) commented Sep 26, 2013

Sorry to derail the original topic with more about documentation, but what is the copyright status of Lucene's documentation? A simple start would be to copy the existing documentation at http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html

@s1monw (Contributor, Author) commented Sep 26, 2013

@brusic @lukas-vlcek Can you take this discussion offline? I don't like it when issues like this get hijacked. I appreciate your openness, and I share your concern about the lack of documentation. Can we open a separate issue for this? I am happy to participate.

@nik9000 (Contributor) commented Sep 26, 2013

On Thu, Sep 26, 2013 at 3:26 PM, Lukáš Vlček notifications@github.com wrote:

[quotes @lukas-vlcek's comment above]

I'd love to see better language documentation.

I'd especially like English documentation on setting up some of the plugins. Unfortunately I can't read all the languages I have to support.

Nik

@lukas-vlcek (Contributor) commented Sep 26, 2013

@s1monw I am sorry. I did not mean to hijack. /me shuts up!

@s1monw (Contributor, Author) commented Sep 26, 2013

@lukas-vlcek No, speak up - this is open source. Just open another issue; I am happy to join you there!

@kimchy (Member) commented Sep 26, 2013

@lukas-vlcek I will just answer your question regarding the docs, because I want to answer it here where people will see it (and we can continue the discussion on another thread).

We can't publish any content related to the book we are working on because the contract with the publisher has not been finalized yet. It took some time, since we were very adamant that the book will be free online, which created complexities in signing the contract. We have been open about that, btw, and I specifically answered that question raised by you several times already.

The work on moving the current guide to asciidoc is all in the open (and has required extensive work), btw, and we obviously plan to continue and improve it; that is already all in the open. You can already submit pull requests if you want to improve the guide docs (as you could before), though as is typical with us, we don't expect you to and will make sure it happens anyhow. For example, your Hunspell docs could easily be added to our guide if appropriate.

I assume you didn't mean it that way, even though you stated it in the .oO, but that question could have been asked in a much nicer tone. This is the type of tone that I have tried, and we try, to uphold in the elasticsearch community since inception. You should assume no malice, not the other way around.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Nov 5, 2013

Change 'standard' analyzer to use an empty stopword list by default.

The 'default' / 'standard' analyzer can be a trappy default since it filters
English stopwords by default. Yet a default should not be dedicated to a certain
language, since elasticsearch is used in many different scenarios where a standard
analysis chain specialized to English full text might be rather counterproductive.

This commit changes the 'standard' analyzer to use an empty stopword list for
indices created from version 1.0.0.Beta1 onwards, but maintains backwards
compatibility for older indices.

Closes elastic#3775
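[Editor's note: the gist of the committed change, sketched below. Names are approximate, not the exact committed code; Version here stands for Elasticsearch's index-created version.]

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

// Approximate sketch of the compatibility gate: newly created indices get an
// empty stopword set, pre-1.0.0.Beta1 indices keep the English list.
CharArraySet defaultStopwords(Version indexCreatedVersion) {
    return indexCreatedVersion.onOrAfter(Version.V_1_0_0_Beta1)
            ? CharArraySet.EMPTY_SET
            : EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
}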

@s1monw closed this in 9654631 on Nov 5, 2013
