OPENNLP-660: Include list of stop words for various languages #1056
rzo1 wants to merge 3 commits
Adds an immutable, thread-safe StopwordFilter API with bundled 1-gram lists for 11 languages (bg, da, de, en, es, fi, fr, it, nl, pt, ru) derived from Apache Lucene's Snowball lists; original copyright headers are preserved verbatim and attributed in NOTICE. N-gram entries are supported via a greedy longest-first window scan. Also included: a StopwordFilteringTokenizer decorator, a StopwordFilterStream, a new "opennlp StopwordFilter <lang>" CLI tool, and a docs chapter.
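For readers unfamiliar with the term, a greedy longest-first window scan can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name GreedyStopwordScan, the maxN parameter, and the choice to lower-case and space-join each window are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not the StopwordFilter implementation from this PR. */
final class GreedyStopwordScan {

  /**
   * Returns the tokens that survive filtering. At each position the longest
   * window (up to maxN tokens) is tried first; a window is dropped when its
   * lower-cased, space-joined form is contained in the stopword set.
   */
  static List<String> filter(List<String> tokens, Set<String> stopwords, int maxN) {
    List<String> kept = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      int matched = 0;
      // Try the widest window that still fits, then shrink it token by token.
      for (int n = Math.min(maxN, tokens.size() - i); n >= 1; n--) {
        String candidate =
            String.join(" ", tokens.subList(i, i + n)).toLowerCase(Locale.ROOT);
        if (stopwords.contains(candidate)) {
          matched = n;
          break;
        }
      }
      if (matched > 0) {
        i += matched;            // skip the whole matched n-gram
      } else {
        kept.add(tokens.get(i)); // no window matched: keep this token
        i++;
      }
    }
    return kept;
  }
}
```

Trying the longest window first means a multi-word entry such as "in spite of" is removed as a unit before the shorter entry "in" gets a chance to match.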
Started this. I'll put in some comments later today.
- Expand Finnish (fi.txt) Snowball paradigm rows to one token per line so each pronoun/determiner form is registered as an individual stopword.
- Apply the same greedy longest-match window scan in StopwordFilteringTokenizer.tokenizePos so n-gram entries are dropped consistently across tokenize, tokenizePos and StopwordFilterStream.
- Cache bundled, immutable filters per normalized language code in StopwordLists.forLanguage to avoid re-parsing the resource on every call.
- Precompute the ISO 639-3 to 639-1 lookup once at class init instead of scanning Locale.getAvailableLocales() on each call.
- Add a BSD-license entry for the bundled stopword lists to LICENSE.
- Add tests covering the Finnish forms, the cache, and the Span-based multi-word handling.
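The caching and lookup items in this commit could look roughly like the sketch below. It is not the code from StopwordLists: the class names LanguageCodes and FilterCache, the normalize method, and the loader function are hypothetical, and the exact normalization rules (trim, lower-case, map three-letter codes) are an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: ISO 639-3 to 639-1 lookup built once at class init. */
final class LanguageCodes {

  private static final Map<String, String> ISO3_TO_ISO1 = buildLookup();

  private static Map<String, String> buildLookup() {
    Map<String, String> lookup = new HashMap<>();
    for (Locale locale : Locale.getAvailableLocales()) {
      String iso1 = locale.getLanguage();
      if (iso1.length() != 2) {
        continue; // skip the root locale and languages without a 2-letter code
      }
      try {
        lookup.putIfAbsent(locale.getISO3Language(), iso1);
      } catch (MissingResourceException e) {
        // no 3-letter code known for this language: leave it out
      }
    }
    return lookup;
  }

  /** Trims and lower-cases the code; maps a known 3-letter code to its 2-letter form. */
  static String normalize(String code) {
    String c = code.trim().toLowerCase(Locale.ROOT);
    return c.length() == 3 ? ISO3_TO_ISO1.getOrDefault(c, c) : c;
  }
}

/** Hypothetical cache: one immutable value per normalized language code. */
final class FilterCache<T> {

  private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
  private final Function<String, T> loader;

  FilterCache(Function<String, T> loader) {
    this.loader = loader;
  }

  T forLanguage(String code) {
    // computeIfAbsent invokes the loader at most once per key
    return cache.computeIfAbsent(LanguageCodes.normalize(code), loader);
  }
}
```

Because the cached values are immutable, sharing one instance between callers is safe, and "deu" and "de" resolve to the same entry after normalization.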
- StopwordFilterTool CLI now accepts either a bundled ISO 639 code or a path to a custom stopword list file (bundled code takes precedence; tokens are still read from stdin). Update help text and the DocBook CLI section.
- Drop the unused boolean parameter from the private DictionaryStopwordFilter constructor by removing it and having Builder.build() use the public constructor.
- Correct the DocBook note that wrongly stated the bundled lists are Apache-2.0 licensed; they retain their original BSD license.
- Add CLI tests for the custom-list file path and the unknown source error.
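The precedence rule for the CLI argument (bundled code first, then file path, otherwise an error) might be sketched as follows. This is not StopwordFilterTool itself: the class StopwordSourceResolver, the bundled map, and the file format (one entry per line, "#" starting a comment) are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of resolving the CLI's stopword-list argument. */
final class StopwordSourceResolver {

  static Set<String> resolve(String source, Map<String, Set<String>> bundled) {
    // 1. A bundled language code takes precedence over a file of the same name.
    Set<String> hit = bundled.get(source.toLowerCase(Locale.ROOT));
    if (hit != null) {
      return hit;
    }
    // 2. Otherwise treat the argument as a path to a custom list.
    Path file = Path.of(source);
    if (Files.isRegularFile(file)) {
      try {
        Set<String> words = new LinkedHashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
          String word = line.trim();
          // Assumed format: blank lines and "#" comment lines are ignored.
          if (!word.isEmpty() && !word.startsWith("#")) {
            words.add(word);
          }
        }
        return words;
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    // 3. Neither a bundled code nor a readable file.
    throw new IllegalArgumentException("Unknown stopword source: " + source);
  }
}
```

Checking the bundled codes first keeps "en" meaning the English list even if a file called "en" happens to exist in the working directory.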
I think it might make sense to centralize the ISO language-code handling. We have this in several places in OpenNLP, in different flavours (scattered). It might make sense to move it into one utility in a follow-up. Thx for the review so far @krickert
krickert left a comment
Looks great: Finnish list, cache, tokenizePos n-grams, LICENSE, CLI file path. I'm a fan of the Lucene/Snowball bundled lists; I already used those in other projects before.
+1
I may suggest a small follow-up to accept "-" as the stopword-list argument, if that is worth sorting out the stdin semantics; it is not needed to merge this PR.
Sure, we can fine-tune that feature anyway. 3.0.0 isn't released yet and this is a new feature, so the CLI semantics can still change without breaking anyone.
Summary
Resolves OPENNLP-660: bundles stopword lists for the eleven languages enumerated in the JIRA and ships a small, pluggable API that lets users supply or extend their own lists.
Checklist
For all changes:
OPENNLP-660-.main.
For code changes:
- ./mvnw -pl opennlp-api,opennlp-core/opennlp-runtime,opennlp-core/opennlp-cli -am test -Dtest='*Stopword*'
- NOTICE updated; rat-excludes updated.
For documentation related changes:
- ./mvnw -pl opennlp-docs verify produces the PDF with the new chapter in the TOC; xmllint --noout clean.