Unnecessary blank lines found in stopwords.txt of SmartChineseAnalyzer #12291

Closed
JerryChin opened this issue May 15, 2023 · 5 comments · Fixed by #12299

Comments

@JerryChin
Contributor

JerryChin commented May 15, 2023

Description

Hi team,

This issue is a spin-off from the java-user list thread.

The stopwords.txt of SmartChineseAnalyzer contains two blank lines, at L56 and L58. As a result, SmartChineseAnalyzer.getDefaultStopSet() produces a stop set that contains the empty string, which makes no sense as a stop word.

Maybe we can improve it?

@tang-hi
Contributor

tang-hi commented May 15, 2023

Good catch! Could you submit a PR to fix that?

@uschindler
Contributor

uschindler commented May 15, 2023

In general, I'd suggest figuring out whether we should change the stopword file parser to strip blank/empty lines the same way it strips comments.

@mikemccand
Member

I think the stoplist loader already ignores comment lines but does not ignore empty lines! The darned empty string rears its head at us again...
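
To illustrate the behavior described here, the following is a minimal, self-contained sketch of a word-list loader that filters comment lines only. It is not Lucene's actual loader: the class name `NaiveStopwordLoader`, the `load` method, the `//` comment prefix, and the sample input are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for a stopword loader; NOT the real Lucene class.
class NaiveStopwordLoader {

    // Trims every line and drops lines that start with the comment prefix.
    // Blank lines are not filtered, so they end up in the set as "".
    static Set<String> load(Reader reader, String comment) {
        return new BufferedReader(reader)
                .lines()
                .map(String::trim)
                .filter(word -> !word.startsWith(comment))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Assumed sample file: one comment, two words, one blank line between them.
        String file = "// header comment\nthe\n\na\n";
        Set<String> words = load(new StringReader(file), "//");
        System.out.println(words.contains("")); // prints true
    }
}
```

With this kind of filter, a single blank line anywhere in the file is enough to put the empty string into the resulting stop set.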

@JerryChin
Contributor Author

JerryChin commented May 15, 2023

Hi @tang-hi,

I can submit a PR to fix this issue.

How about skipping blank lines before

and removing the blank lines from stopwords.txt?

What do you think?

@uschindler
Contributor

Also here:

if (word.startsWith(comment) == false) {
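
One way the check above could be extended to skip blank lines is sketched below. This is a self-contained illustration, not the change that was actually merged: the class name `BlankSkippingStopwordLoader`, the `load` signature, and the `//` comment prefix are hypothetical, and only the `word.startsWith(comment)` test comes from the thread.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical loader used only to illustrate the proposed check.
class BlankSkippingStopwordLoader {

    // Keeps a line only if, after trimming, it is non-empty
    // and does not start with the comment prefix.
    static Set<String> load(Reader reader, String comment) {
        Set<String> words = new LinkedHashSet<>();
        new BufferedReader(reader).lines().forEach(line -> {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith(comment)) {
                words.add(word);
            }
        });
        return words;
    }

    public static void main(String[] args) {
        // Assumed sample file with an empty line and a whitespace-only line.
        String file = "// header comment\nthe\n\n   \na\n";
        Set<String> words = load(new StringReader(file), "//");
        System.out.println(words); // prints [the, a]
    }
}
```

Trimming before the emptiness test means whitespace-only lines are skipped as well as truly empty ones, so the data file no longer has to be perfectly clean for the stop set to be free of the empty string.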
