OPENNLP-660: Include list of stop words for various languages by rzo1 · Pull Request #1056 · apache/opennlp

rzo1 · 2026-05-19T20:17:24Z

Summary

Resolves OPENNLP-660: bundles stopword lists for the eleven languages enumerated in the JIRA and ships a small, pluggable API that lets users supply or extend their own lists.

Checklist

For all changes:

JIRA ticket OPENNLP-660 referenced in the commit message.
PR title starts with OPENNLP-660-.
Branched off current main.
Single commit.

For code changes:

Targeted stopword test runs green (./mvnw -pl opennlp-api,opennlp-core/opennlp-runtime,opennlp-core/opennlp-cli -am test -Dtest='*Stopword*').
New unit tests added.
No new dependencies. Bundled resources are Apache-2.0-compatible Apache Lucene snowball files with their original headers preserved.
LICENSE file modified.
NOTICE updated; rat-excludes updated.

For documentation related changes:

DocBook chapter renders correctly: ./mvnw -pl opennlp-docs verify produces the PDF with the new chapter in the TOC; xmllint --noout clean.

Adds an immutable, thread-safe StopwordFilter API with bundled 1-gram lists for 11 languages (bg, da, de, en, es, fi, fr, it, nl, pt, ru) derived from the Apache Lucene Snowball files; the original copyright headers are preserved verbatim and attributed in NOTICE. N-gram entries are supported via a greedy longest-first window scan. Also adds a StopwordFilteringTokenizer decorator, a StopwordFilterStream, and a new "opennlp StopwordFilter <lang>" CLI tool. A docs chapter is included.
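For readers unfamiliar with the term, a greedy longest-first window scan can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class GreedyStopwordScan, the filter method, and the maxN parameter are hypothetical names, and the sketch assumes tokens are already normalized (for example lowercased) and that multi-word entries are stored space-joined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a greedy longest-first scan; not the OpenNLP implementation. */
final class GreedyStopwordScan {

  private GreedyStopwordScan() {
  }

  /**
   * Returns the tokens that are not covered by any stopword entry.
   * Entries are assumed to be space-joined n-grams; maxN is the length
   * (in tokens) of the longest entry.
   */
  static List<String> filter(List<String> tokens, Set<String> stopwords, int maxN) {
    List<String> kept = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      int matched = 0;
      // Try the widest window first so "in front of" wins over "in".
      for (int n = Math.min(maxN, tokens.size() - i); n >= 1; n--) {
        String candidate = String.join(" ", tokens.subList(i, i + n));
        if (stopwords.contains(candidate)) {
          matched = n;
          break;
        }
      }
      if (matched > 0) {
        i += matched; // skip the whole matched n-gram
      } else {
        kept.add(tokens.get(i));
        i++;
      }
    }
    return kept;
  }
}
```

With the stopwords "the" and "in front of" and maxN = 3, the tokens "the cat sat in front of the door" reduce to "cat sat door": the three-token entry is consumed as a unit before any shorter window is tried.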

krickert · 2026-05-20T16:43:56Z

Started this. I'll put in some comments later today.

- Expand Finnish (fi.txt) Snowball paradigm rows to one token per line so each pronoun/determiner form is registered as an individual stopword.
- Apply the same greedy longest-match window scan in StopwordFilteringTokenizer.tokenizePos so n-gram entries are dropped consistently across tokenize, tokenizePos and StopwordFilterStream.
- Cache bundled, immutable filters per normalized language code in StopwordLists.forLanguage to avoid re-parsing the resource on every call.
- Precompute the ISO 639-3 to 639-1 lookup once at class init instead of scanning Locale.getAvailableLocales() on each call.
- Add a BSD-license entry for the bundled stopword lists to LICENSE.
- Add tests covering the Finnish forms, the cache, and the Span-based multi-word handling.
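The caching and the precomputed ISO lookup described in this commit could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in StopwordLists: the class LanguageCache, its normalize and forLanguage methods, and the loader parameter are hypothetical, and a plain Set<String> stands in for the immutable filter.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; names and shape are assumptions, not the PR's code. */
final class LanguageCache {

  // Built once at class initialization: ISO 639-3 code -> ISO 639-1 code.
  private static final Map<String, String> ISO3_TO_ISO1 = buildIso3Map();

  // One immutable entry per normalized language code.
  private static final Map<String, Set<String>> CACHE = new ConcurrentHashMap<>();

  private LanguageCache() {
  }

  private static Map<String, String> buildIso3Map() {
    Map<String, String> map = new HashMap<>();
    for (Locale locale : Locale.getAvailableLocales()) {
      String iso1 = locale.getLanguage();
      if (iso1.length() != 2) {
        continue; // root locale or a language without a two-letter code
      }
      try {
        map.putIfAbsent(locale.getISO3Language(), iso1);
      } catch (MissingResourceException e) {
        // No three-letter code known for this locale; skip it.
      }
    }
    return Map.copyOf(map);
  }

  /** Trims and lowercases the code, then maps three-letter codes to two-letter ones. */
  static String normalize(String code) {
    String c = code.trim().toLowerCase(Locale.ROOT);
    return c.length() == 3 ? ISO3_TO_ISO1.getOrDefault(c, c) : c;
  }

  /** Invokes the loader at most once per normalized code and reuses the result. */
  static Set<String> forLanguage(String code, Function<String, Set<String>> loader) {
    return CACHE.computeIfAbsent(normalize(code), loader);
  }
}
```

In this sketch, forLanguage("de", loader) and forLanguage("DEU", loader) return the same instance, and the loader runs only for the first call.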

- StopwordFilterTool CLI now accepts either a bundled ISO 639 code or a path to a custom stopword list file (bundled code takes precedence; tokens are still read from stdin). Update help text and the DocBook CLI section.
- Drop the unused boolean parameter from the private DictionaryStopwordFilter constructor by removing it and having Builder.build() use the public constructor.
- Correct the DocBook note that wrongly stated the bundled lists are Apache-2.0 licensed; they retain their original BSD license.
- Add CLI tests for the custom-list file path and the unknown source error.
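The precedence rule for the CLI argument (bundled code first, then file path, otherwise an error) can be illustrated with a small sketch. Everything here is hypothetical and only mirrors the behavior described in the commit message: StopwordSourceResolver, Kind, and resolve are illustrative names, and the set of bundled codes is passed in rather than discovered.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the argument resolution; not the StopwordFilterTool code. */
final class StopwordSourceResolver {

  enum Kind { BUNDLED, FILE }

  private StopwordSourceResolver() {
  }

  /**
   * Decides how to interpret the CLI argument. A bundled language code
   * takes precedence over a file of the same name.
   */
  static Kind resolve(String arg, Set<String> bundledCodes) {
    if (bundledCodes.contains(arg.toLowerCase(Locale.ROOT))) {
      return Kind.BUNDLED;
    }
    if (Files.isRegularFile(Path.of(arg))) {
      return Kind.FILE;
    }
    throw new IllegalArgumentException("Unknown stopword source: " + arg);
  }
}
```

Checking the bundled codes before touching the file system means a file that happens to be named "de" in the working directory cannot shadow the bundled German list.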

rzo1 · 2026-05-21T13:08:45Z

I think it might make sense to centralize the ISO language-code handling. We have this in several places in OpenNLP, scattered and in different flavours. A follow-up could move it into a single utility.

Thx for the review so far @krickert

krickert

Looks great: Finnish list, cache, tokenizePos n-grams, LICENSE, CLI file path. I'm a fan of the bundled Lucene/Snowball lists; I have used them in other projects before.

+1

I may suggest a small follow-up that accepts "-" as the stopword-list argument, if the stdin semantics turn out to be worth it; not needed to merge this PR.

rzo1 · 2026-05-21T13:12:54Z

Sure, we can fine-tune that feature anyway. 3.0.0 isn't released yet and this is a new feature, so the CLI semantics can still change without breaking anyone.

jzonthemtn

Nice!

rzo1 requested review from krickert and mawiesne May 19, 2026 20:17

rzo1 self-assigned this May 19, 2026