
Prune stop words for text index #5297

Merged: 1 commit, Apr 27, 2020

Conversation

@siddharthteotia (Contributor) commented Apr 24, 2020

For general English text, indexing stop words can easily lead to an explosion in the size of the text index and consequently hurt query performance. For now, we use a predefined list of common English stop words that is passed to the text analyzer both during indexing and querying. The analyzer ensures that stop words in the input text are pruned and not indexed. Similarly, on the query path, if stop words are present in the search expression, matching is done without them.

This is something that could eventually be configurable per use case (i.e., per text index). Currently we use a hardcoded list of common English stop words.
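The key invariant described above is that the same stop-word pruning is applied on both the indexing path and the query path, so the two sides agree on which terms exist. A minimal stdlib-only sketch of that idea (this is an illustration, not Pinot's actual analyzer; the word list below is a hypothetical subset of the hardcoded English stop words):

```java
import java.util.*;

public class StopWordPruner {
    // Hypothetical subset of the hardcoded English stop word list (illustration only).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it", "no", "not", "of",
        "on", "or", "such", "that", "the", "their", "then", "there",
        "these", "they", "this", "to", "was", "will", "with"));

    // Lowercase, tokenize on whitespace, and drop stop words. The same
    // analysis must run on both the indexing path and the query path so
    // that a query like "the fox" still matches a document containing "fox".
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index-time analysis of a document.
        System.out.println(analyze("The quick brown fox is in the barn"));
        // Query-time analysis: the stop word "the" is pruned the same way.
        System.out.println(analyze("the fox"));
    }
}
```

Because the stop words never enter the index, pruning them again at query time is what keeps matching consistent.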

@@ -50,6 +58,15 @@
private final Directory _indexDirectory;
private final IndexWriter _indexWriter;

public static final CharArraySet ENGLISH_STOP_WORDS_SET =
new CharArraySet(Arrays.asList(
Member

Don't we have this in Lucene's StopAnalyzer?
StopAnalyzer.ENGLISH_STOP_WORDS_SET

Contributor Author

No, this is not in StopAnalyzer; that also takes a caller-supplied list of stop words.

The list is borrowed from EnglishAnalyzer. The reason for not using EnglishAnalyzer directly is that it does word stemming and reduces words to their root form. I don't think we need that.

Contributor Author

StopAnalyzer can't be used either, because of how it tokenizes input text. So we stick with StandardAnalyzer, which uses StandardTokenizer (based on the Unicode general-purpose text segmentation algorithm).
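For illustration, Unicode word segmentation of the kind StandardTokenizer is based on is also available in the JDK via java.text.BreakIterator. A rough stdlib-only sketch of word-boundary tokenization (this is not Lucene's actual tokenizer, just a demonstration of the segmentation style):

```java
import java.text.BreakIterator;
import java.util.*;

public class WordSegmentation {
    // Split text on Unicode word boundaries, keeping only tokens that
    // contain a letter or digit (punctuation and whitespace segments
    // are discarded), similar in spirit to a word-oriented tokenizer.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("The quick brown fox"));
    }
}
```

The output tokens would then be run through the stop-word filter before being indexed or matched.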

Member

Thanks, add this comment in the code.

@siddharthteotia siddharthteotia merged commit cb62dfc into apache:master Apr 27, 2020