
OPENNLP-660: Include list of stop words for various languages #1056

Open
rzo1 wants to merge 3 commits into main from OPENNLP-660

Conversation

rzo1 (Contributor) commented May 19, 2026

Summary

Resolves OPENNLP-660: bundles stopword lists for the eleven languages enumerated in the JIRA and ships a small, pluggable API that lets users supply or extend their own lists.

Checklist

For all changes:

  • JIRA ticket OPENNLP-660 referenced in the commit message.
  • PR title starts with OPENNLP-660.
  • Branched off current main.
  • Single commit.

For code changes:

  • Targeted stopword test runs green (./mvnw -pl opennlp-api,opennlp-core/opennlp-runtime,opennlp-core/opennlp-cli -am test -Dtest='*Stopword*').
  • New unit tests added.
  • No new dependencies. Bundled resources are Apache-2.0-compatible Apache Lucene snowball files with their original headers preserved.
  • LICENSE file modified
  • NOTICE updated; rat-excludes updated.

For documentation related changes:

  • DocBook chapter renders correctly: ./mvnw -pl opennlp-docs verify produces the PDF with the new chapter in the TOC; xmllint --noout clean.

Adds an immutable, thread-safe StopwordFilter API with bundled 1-gram lists
for 11 languages (bg, da, de, en, es, fi, fr, it, nl, pt, ru) derived from
Apache Lucene snowball; original copyright headers preserved verbatim and
attributed in NOTICE. n-gram support via greedy longest-first window scan,
plus a StopwordFilteringTokenizer decorator, a StopwordFilterStream, and a
new "opennlp StopwordFilter <lang>" CLI tool. Docs chapter included.
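
The greedy longest-first window scan mentioned in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class GreedyStopwordScan, its filter method, and the convention of storing n-gram entries as space-joined lower-case strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not the StopwordFilter API added by this PR. */
final class GreedyStopwordScan {

  private GreedyStopwordScan() {
  }

  /**
   * Drops stopword n-grams from the token list, trying the longest window first.
   *
   * @param tokens    input tokens, in order
   * @param stopwords entries stored lower-cased, tokens joined by a single space (assumption)
   * @param maxN      number of tokens in the longest stopword entry
   */
  static List<String> filter(List<String> tokens, Set<String> stopwords, int maxN) {
    List<String> kept = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      int matched = 0;
      // Try the widest window that still fits, then shrink it one token at a time.
      for (int n = Math.min(maxN, tokens.size() - i); n >= 1; n--) {
        String candidate =
            String.join(" ", tokens.subList(i, i + n)).toLowerCase(Locale.ROOT);
        if (stopwords.contains(candidate)) {
          matched = n;
          break;
        }
      }
      if (matched > 0) {
        i += matched; // skip every token of the matched n-gram
      } else {
        kept.add(tokens.get(i));
        i++;
      }
    }
    return kept;
  }
}
```

With the stopwords "as well as", "as" and "he" and maxN = 3, the input [He, ran, as, well, as, she, did] reduces to [ran, she, did]: the three-token entry is tried before the single token "as", so the middle "well" is not left behind.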
rzo1 requested review from krickert and mawiesne on May 19, 2026, 20:17
rzo1 self-assigned this on May 19, 2026
krickert (Contributor) commented:

Started this. I'll put in some comments later today.

Comment thread opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt Outdated
Comment thread NOTICE
- Expand Finnish (fi.txt) Snowball paradigm rows to one token per line so
  each pronoun/determiner form is registered as an individual stopword.
- Apply the same greedy longest-match window scan in
  StopwordFilteringTokenizer.tokenizePos so n-gram entries are dropped
  consistently across tokenize, tokenizePos and StopwordFilterStream.
- Cache bundled, immutable filters per normalized language code in
  StopwordLists.forLanguage to avoid re-parsing the resource on every call.
- Precompute the ISO 639-3 to 639-1 lookup once at class init instead of
  scanning Locale.getAvailableLocales() on each call.
- Add a BSD-license entry for the bundled stopword lists to LICENSE.
- Add tests covering the Finnish forms, the cache, and the Span-based
  multi-word handling.
- StopwordFilterTool CLI now accepts either a bundled ISO 639 code or a path
  to a custom stopword list file (bundled code takes precedence; tokens are
  still read from stdin). Update help text and the DocBook CLI section.
- Drop the unused boolean parameter from the private DictionaryStopwordFilter
  constructor by removing it and having Builder.build() use the public
  constructor.
- Correct the DocBook note that wrongly stated the bundled lists are
  Apache-2.0 licensed; they retain their original BSD license.
- Add CLI tests for the custom-list file path and the unknown source error.
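
The per-language cache described in the list above could look roughly like this. It is a sketch under stated assumptions, not the StopwordLists implementation from this PR: the class CachedStopwordLists and the injected loader function are hypothetical, and the normalization shown (trim plus lower-casing) is only one plausible reading of "normalized language code".

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; the real StopwordLists class may be structured differently. */
final class CachedStopwordLists {

  // One immutable set per normalized language code, filled lazily.
  private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();

  // Stands in for whatever parses the bundled resource (assumption).
  private final Function<String, Set<String>> loader;

  CachedStopwordLists(Function<String, Set<String>> loader) {
    this.loader = loader;
  }

  Set<String> forLanguage(String code) {
    String key = code.trim().toLowerCase(Locale.ROOT);
    // computeIfAbsent runs the loader at most once per key, even under concurrent calls.
    return cache.computeIfAbsent(key, k -> Set.copyOf(loader.apply(k)));
  }
}
```

Because the cached sets are immutable copies, the same instance can be handed to every caller; asking for "EN" and then " en " would trigger a single load and return the same object both times.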
rzo1 (Contributor, Author) commented May 21, 2026

I think it might make sense to centralize the language ISO code handling. We have this scattered across OpenNLP in several different flavours, so a follow-up could move it into one utility instead.

Thanks for the review so far, @krickert.
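
As a rough idea of what such a centralized utility might contain, here is a minimal sketch of a three-letter to two-letter lookup that is built once at class initialization. The class name IsoLanguageCodes and the method toTwoLetter are hypothetical and do not exist in OpenNLP. The index is built from Locale.getISOLanguages(), which is a different source than the one the PR mentions, and Locale.getISO3Language() returns ISO 639-2/T codes, which are assumed here to match the ISO 639-3 codes of the bundled languages.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

/** Hypothetical utility sketch; no such class exists in OpenNLP. */
final class IsoLanguageCodes {

  // Built once when the class is initialized, then only read.
  private static final Map<String, String> THREE_TO_TWO = buildIndex();

  private IsoLanguageCodes() {
  }

  private static Map<String, String> buildIndex() {
    Map<String, String> index = new HashMap<>();
    for (String two : Locale.getISOLanguages()) {
      try {
        String three = Locale.forLanguageTag(two).getISO3Language();
        if (!three.isEmpty()) {
          index.putIfAbsent(three, two);
        }
      } catch (MissingResourceException e) {
        // The JDK knows no three-letter code for this entry; skip it.
      }
    }
    return Map.copyOf(index);
  }

  /** Maps a three-letter code to its two-letter form; other input is only normalized. */
  static String toTwoLetter(String code) {
    String key = code.trim().toLowerCase(Locale.ROOT);
    return key.length() == 3 ? THREE_TO_TWO.getOrDefault(key, key) : key;
  }
}
```

Unknown three-letter codes fall through unchanged in this sketch, which leaves the decision about how to report them to the caller.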

krickert (Contributor) left a comment


Looks great: Finnish list, cache, tokenizePos n-grams, LICENSE, CLI file path. I'm a fan of the bundled Lucene/Snowball lists; I have used them in other projects before.

+1

I may suggest a small follow-up to accept "-" as the stopword-list argument, if that is worth working out the stdin semantics; it is not needed to merge this PR.

rzo1 (Contributor, Author) commented May 21, 2026

Sure, we can fine-tune that feature anyway. 3.0.0 isn't released yet and this is a new feature, so the CLI semantics can still change without breaking anyone.

jzonthemtn (Contributor) left a comment


Nice!
