diff --git a/LICENSE b/LICENSE
index 27da2a08f..58a20c820 100644
--- a/LICENSE
+++ b/LICENSE
@@ -230,6 +230,41 @@ The following license applies to the Snowball stemmers:
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+The following license applies to the bundled stopword lists in
+opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword.
+These lists are derived from Apache Lucene, which redistributes them from
+the Snowball project; the Bulgarian list (bg.txt) was created by Jacques
+Savoy (http://members.unine.ch/jacques.savoy/clef/index.html). They are
+distributed under the BSD license:
+
+ Copyright (c) 2001, Dr Martin Porter
+ Copyright (c) 2002, Richard Boulton
+ Copyright (c) Jacques Savoy
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * Neither the name of the copyright holders nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
 The following license applies to the Wordpiece tokenizer implementation:
 The MIT License (MIT)
diff --git a/NOTICE b/NOTICE
index 0a9b44508..bef1f6f2a 100644
--- a/NOTICE
+++ b/NOTICE
@@ -14,6 +14,19 @@ http://snowball.tartarus.org/
 ============================================================================
+The bundled stopword lists in
+opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword
+are derived from Apache Lucene
+(https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis),
+which in turn distributes them under the BSD license from the Snowball project
+(https://snowballstem.org/license.html). The Bulgarian list (bg.txt) is the
+Lucene per-language Bulgarian stopwords file originally created by Jacques
+Savoy (http://members.unine.ch/jacques.savoy/clef/index.html) and also
+distributed under the BSD license. The original upstream license and
+attribution headers are preserved verbatim at the top of each bundled file.
+
+============================================================================
+
 The Wordpiece tokenizer in opennlp-tools/main/java/opennlp/tools/tokenize is taken from https://github.com/robrua/easy-bert licensed under
diff --git a/README.md b/README.md
index 10a5ad074..d9c326434 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ The Apache OpenNLP library is a machine learning based toolkit for the processin
 This toolkit is written completely in Java and provides support for common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing,
- coreference resolution, language detection and more!
+ coreference resolution, language detection, stopword filtering (with bundled lists for 11 languages) and more!
 These tasks are usually required to build more advanced text processing services.
diff --git a/opennlp-api/src/main/java/opennlp/tools/stopword/StopwordFilter.java b/opennlp-api/src/main/java/opennlp/tools/stopword/StopwordFilter.java
new file mode 100644
index 000000000..abdfe7b16
--- /dev/null
+++ b/opennlp-api/src/main/java/opennlp/tools/stopword/StopwordFilter.java
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.stopword;
+
+import java.util.Set;
+
+/**
+ * A pluggable filter that decides whether a token (or a sequence of tokens)
+ * is a stopword that should be removed during downstream text processing.
+ *
+ * Implementations may be backed by a static bundled list, a user-supplied
+ * file, an in-memory data structure, or any other source.
+ * Both single-token and multi-token (n-gram) membership tests are supported.
+ *
+ * @see opennlp.tools.util.LanguageCodeValidator
+ */
+public interface StopwordFilter {
+
+ /**
+ * Checks whether the given token is a single-token stopword.
+ * Equivalent to {@code isStopword(new String[] { token.toString() })} when
+ * {@code token} is non-{@code null}.
+ *
+ * @param token The token to test. May be {@code null}, in which case
+ * implementations should return {@code false}.
+ * @return {@code true} if {@code token} is registered as a single-token
+ * stopword, {@code false} otherwise.
+ */
+ boolean isStopword(final CharSequence token);
+
+ /**
+ * Checks whether the given sequence of tokens is a multi-token stopword
+ * (n-gram). For a single token this is equivalent to
+ * {@link #isStopword(CharSequence)}.
+ *
+ * @param tokens The tokens to test as one entry. May be {@code null} or
+ * empty, in which case implementations should return {@code false}.
+ * @return {@code true} if the sequence is registered as a stopword,
+ * {@code false} otherwise.
+ */
+ boolean isStopword(final String... tokens);
+
+ /**
+ * Returns a copy of {@code tokens} with stopword matches removed,
+ * preserving the input order.
+ *
+ * Implementations should honor both 1-gram and n-gram entries. A
+ * recommended strategy is a greedy left-to-right window scan: at each
+ * position try the longest registered window first; if it matches, skip
+ * those tokens; otherwise advance by one and keep the current token.
+ * Implementations that do not support n-gram entries may fall back to
+ * 1-gram filtering.
+ *
+ * @param tokens The input token array. Must not be {@code null}.
+ * Individual array elements may be {@code null} and are kept as-is.
+ * @return A new array containing the surviving tokens. Never {@code null}.
+ * @throws IllegalArgumentException if {@code tokens} is {@code null}.
+ */
+ String[] filter(final String[] tokens);
+
+ /**
+ * @return {@code true} if this filter performs case-sensitive matching;
+ * {@code false} if matching is case-insensitive.
+ */
+ boolean isCaseSensitive();
+
+ /**
+ * Returns an unmodifiable snapshot of the registered single-token
+ * stopwords. Multi-token (n-gram) entries are not included in this view
+ * and must be tested via {@link #isStopword(String...)}.
+ *
+ * Attempts to mutate the returned {@link Set} will fail.
+ *
+ * @return An unmodifiable {@link Set} of stopwords. Never {@code null}.
+ * @throws UnsupportedOperationException if a caller attempts to add to,
+ * remove from, or otherwise mutate the returned {@link Set}.
+ */
+ Set Usage: {@code opennlp StopwordFilter
+ * The backing store supports both 1-gram and n-gram entries. Multi-word
+ * entries are queried via {@link #isStopword(String...)}; the
+ * {@link #filter(String[])} method performs a greedy left-to-right window
+ * scan, preferring the longest registered match at each position.
+ *
+ * Instances are constructed once and never modified afterwards. Use the
+ * {@link Builder} ({@link #builder()}) to assemble a filter from one or
+ * more sources (programmatic entries, an input stream, an existing
+ * {@link Dictionary}), or the public constructors for the common cases.
+ *
+ * Thread-safety: instances are immutable after
+ * construction and may be shared freely across threads without external
+ * synchronization. All fields are {@code final}; the only mutation of the
+ * backing {@link Dictionary} happens inside the constructor / builder before
+ * the instance is published.
+ */
+@ThreadSafe
+public final class DictionaryStopwordFilter implements StopwordFilter {
+
+ private static final String COMMENT_PREFIX = "#";
+
+ private final Dictionary backing;
+
+ /**
+ * Loads a stopword list from the given input stream and freezes it into
+ * an immutable filter.
+ *
+ * Format: text decoded with the supplied {@link Charset} (typically UTF-8), one entry per line.
+ * Whitespace-separated tokens on the same line form one multi-word entry.
+ * Blank lines and lines starting with {@code #} are skipped.
+ *
+ * @param in The input stream to read from. Must not be {@code null}.
+ * @param cs The {@link Charset} to decode with. Must not be {@code null}.
+ * @param caseSensitive Whether matching is case-sensitive.
+ * @throws IllegalArgumentException if {@code in} or {@code cs} is
+ * {@code null}.
+ * @throws IOException Thrown if an IO error occurs while reading.
+ */
+ public DictionaryStopwordFilter(final InputStream in, final Charset cs,
+ final boolean caseSensitive) throws IOException {
+ if (in == null) {
+ throw new IllegalArgumentException("in must not be null");
+ }
+ if (cs == null) {
+ throw new IllegalArgumentException("cs must not be null");
+ }
+ this.backing = parseStream(in, cs, caseSensitive);
+ }
+
+ /**
+ * Creates an immutable filter from a defensive copy of {@code source}.
+ * Subsequent mutation of {@code source} does not affect this filter.
+ *
+ * @param source The dictionary whose contents seed the filter. Must not
+ * be {@code null}.
+ * @throws IllegalArgumentException if {@code source} is {@code null}.
+ */
+ public DictionaryStopwordFilter(final Dictionary source) {
+ if (source == null) {
+ throw new IllegalArgumentException("source must not be null");
+ }
+ final Dictionary copy = new Dictionary(source.isCaseSensitive());
+ for (final StringList entry : source) {
+ copy.put(entry);
+ }
+ this.backing = copy;
+ }
+
+ /**
+ * @return A new {@link Builder} that assembles a {@link DictionaryStopwordFilter}.
+ */
+ public static Builder builder() {
+ return new Builder();
+ }
+
+ /**
+ * Convenience factory equivalent to
+ * {@link #DictionaryStopwordFilter(InputStream, Charset, boolean)} but
+ * wrapping any {@link IOException} thrown during reading in an
+ * {@link UncheckedIOException}. Useful in contexts where a checked
+ * exception is inconvenient (e.g. lambdas, static initializers).
+ *
+ * @param in The input stream. Must not be {@code null}.
+ * @param cs The charset. Must not be {@code null}.
+ * @param caseSensitive Whether matching is case-sensitive.
+ * @return A new filter loaded from {@code in}.
+ * @throws IllegalArgumentException if {@code in} or {@code cs} is
+ * {@code null}.
+ * @throws UncheckedIOException if an IO error occurs while reading from
+ * {@code in}.
+ */
+ public static DictionaryStopwordFilter loadUnchecked(final InputStream in,
+ final Charset cs,
+ final boolean caseSensitive) {
+ try {
+ return new DictionaryStopwordFilter(in, cs, caseSensitive);
+ } catch (final IOException e) {
+ throw new UncheckedIOException(e);
+ }
+ }
+
+ /**
+ * {@inheritDoc}
+ *
+ * @param token The token to test. May be {@code null}, in which case this
+ * method returns {@code false}.
+ * @return {@code true} if {@code token} is registered as a single-token
+ * stopword, {@code false} otherwise.
+ */
+ @Override
+ public boolean isStopword(final CharSequence token) {
+ if (token == null) {
+ return false;
+ }
+ return backing.contains(new StringList(token.toString()));
+ }
+
+ /**
+ * {@inheritDoc}
+ *
+ * @param tokens The tokens to test as one entry. May be {@code null} or
+ * empty, in which case this method returns {@code false}.
+ * @return {@code true} if the sequence is registered as a stopword,
+ * {@code false} otherwise.
+ */
+ @Override
+ public boolean isStopword(final String... tokens) {
+ if (tokens == null || tokens.length == 0) {
+ return false;
+ }
+ for (final String t : tokens) {
+ if (t == null) {
+ return false;
+ }
+ }
+ return backing.contains(new StringList(tokens));
+ }
+
+ /**
+ * {@inheritDoc}
+ *
+ * Performs a greedy left-to-right window scan: at each position the
+ * longest registered window is tried first. If it matches, those tokens
+ * are dropped; otherwise the position advances by one and the current
+ * token is kept. {@code null} elements never participate in a window and
+ * are kept as-is.
+ *
+ * @throws IllegalArgumentException if {@code tokens} is {@code null}.
+ */
+ @Override
+ public String[] filter(final String[] tokens) {
+ if (tokens == null) {
+ throw new IllegalArgumentException("tokens must not be null");
+ }
+ final int maxWindow = backing.getMaxTokenCount();
+ final List
+ * Operations are applied at {@link #build()} time in the order
+ * "all adds, then all removes". Within each phase, insertion order is
+ * preserved but is not externally observable.
+ */
+ public static final class Builder {
+
+ private final List
+ * Stopword membership is decided by the supplied {@link StopwordFilter};
+ * filtering is delegated to {@link StopwordFilter#filter(String[])} so the
+ * relative order of surviving tokens within a sample is preserved.
+ *
+ * {@link #reset()} and {@link #close()} are inherited from
+ * {@link FilterObjectStream} and simply forward to the wrapped stream.
+ */
+public final class StopwordFilterStream extends FilterObjectStream
+ * Both {@link #tokenize(String)} and {@link #tokenizePos(String)} apply the
+ * filter using the same greedy longest-match window scan, so single-token
+ * (1-gram) and multi-token (n-gram) stopword entries are dropped identically
+ * across {@link #tokenize(String)}, {@link #tokenizePos(String)} and
+ * {@link StopwordFilterStream}. For {@link #tokenizePos(String)} the
+ * {@link Span Spans} covering a matched entry are dropped while the offsets of
+ * the remaining spans are kept intact (they continue to refer to positions in
+ * the original input string).
+ *
+ * Instances are immutable and therefore safe for concurrent use provided that
+ * both the wrapped {@link Tokenizer} and the {@link StopwordFilter} are
+ * thread-safe. {@link DictionaryStopwordFilter} is unconditionally
+ * thread-safe; combined with a thread-safe delegate tokenizer
+ * (e.g. {@code SimpleTokenizer.INSTANCE}) the resulting decorator is
+ * thread-safe with no further synchronization required.
+ */
+@ThreadSafe
+public final class StopwordFilteringTokenizer implements Tokenizer {
+
+ private final Tokenizer delegate;
+ private final StopwordFilter filter;
+
+ /**
+ * Initializes a {@link StopwordFilteringTokenizer}.
+ *
+ * @param delegate The underlying {@link Tokenizer} that produces the raw
+ * tokens. Must not be {@code null}.
+ * @param filter The {@link StopwordFilter} which decides whether a token
+ * is a stopword. Must not be {@code null}.
+ * @throws IllegalArgumentException if {@code delegate} or {@code filter} is
+ * {@code null}.
+ */
+ public StopwordFilteringTokenizer(final Tokenizer delegate, final StopwordFilter filter) {
+ if (delegate == null) {
+ throw new IllegalArgumentException("delegate must not be null");
+ }
+ if (filter == null) {
+ throw new IllegalArgumentException("filter must not be null");
+ }
+ this.delegate = delegate;
+ this.filter = filter;
+ }
+
+ /**
+ * Tokenizes the supplied string with the wrapped {@link Tokenizer} and then
+ * removes any tokens which the {@link StopwordFilter} considers a stopword.
+ *
+ * @param s The string to be tokenized.
+ * @return The remaining tokens in their original order.
+ */
+ @Override
+ public String[] tokenize(final String s) {
+ return filter.filter(delegate.tokenize(s));
+ }
+
+ /**
+ * Computes token spans with the wrapped {@link Tokenizer} and then drops
+ * the spans covering any stopword entry according to the
+ * {@link StopwordFilter}. A greedy left-to-right window scan mirrors
+ * {@link StopwordFilter#filter(String[])}: at each position the longest
+ * window of consecutive spans whose covered texts form a registered entry is
+ * removed; otherwise the current span is kept and the scan advances by one.
+ * Multi-word (n-gram) entries are therefore dropped here as they are by
+ * {@link #tokenize(String)}, provided the filter implements the recommended
+ * greedy scan. The order and offsets of the surviving spans are preserved.
+ *
+ * @param s The string to be tokenized.
+ * @return The remaining {@link Span Spans} in their original order.
+ */
+ @Override
+ public Span[] tokenizePos(final String s) {
+ final Span[] spans = delegate.tokenizePos(s);
+ if (spans == null || spans.length == 0) {
+ return spans;
+ }
+ final List<Span> kept = new ArrayList<>(spans.length);
+ int i = 0;
+ while (i < spans.length) {
+ int matched = 0;
+ // Try the longest window first (every remaining span), then shrink it down to a single span.
+ for (int w = spans.length - i; w >= 1; w--) {
+ final String[] window = new String[w];
+ for (int k = 0; k < w; k++) {
+ window[k] = spans[i + k].getCoveredText(s).toString();
+ }
+ if (filter.isStopword(window)) {
+ matched = w;
+ break;
+ }
+ }
+ if (matched > 0) {
+ i += matched;
+ } else {
+ kept.add(spans[i]);
+ i++;
+ }
+ }
+ return kept.toArray(new Span[0]);
+ }
+}
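Reviewer note on the greedy window scan described in the Javadoc above: the following is a minimal, standalone sketch of that technique using only `java.util`. The class name `GreedyStopwordScanSketch`, its space-joined entry set, and the `maxTokens` cap are illustrative assumptions and not code from this patch; the sketch matches case-sensitively and assumes non-null tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, standalone sketch of a greedy longest-match scan; not code from this patch. */
final class GreedyStopwordScanSketch {

  // Entries are stored as space-joined token sequences, e.g. "in spite of".
  private final Set<String> entries = new HashSet<>();
  // Token count of the longest registered entry; bounds the window size.
  private int maxTokens = 0;

  GreedyStopwordScanSketch(String... lines) {
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      entries.add(String.join(" ", parts));
      maxTokens = Math.max(maxTokens, parts.length);
    }
  }

  String[] filter(String[] tokens) {
    List<String> kept = new ArrayList<>(tokens.length);
    int i = 0;
    while (i < tokens.length) {
      int matched = 0;
      // Longest window first, capped by the longest registered entry.
      for (int w = Math.min(maxTokens, tokens.length - i); w >= 1; w--) {
        String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + w));
        if (entries.contains(candidate)) {
          matched = w;
          break;
        }
      }
      if (matched > 0) {
        i += matched;        // drop the whole matched window
      } else {
        kept.add(tokens[i]); // keep the current token and advance by one
        i++;
      }
    }
    return kept.toArray(new String[0]);
  }

  public static void main(String[] args) {
    GreedyStopwordScanSketch sketch = new GreedyStopwordScanSketch("the", "in spite of", "of");
    String[] out = sketch.filter(new String[] {"in", "spite", "of", "the", "rain", "in", "May"});
    System.out.println(Arrays.toString(out)); // prints [rain, in, May]
  }
}
```

Capping the window at the longest registered entry bounds the work per position; the `tokenizePos` loop in this file instead starts each position with a window covering all remaining spans.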
diff --git a/opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/stopword/StopwordLists.java b/opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/stopword/StopwordLists.java
new file mode 100644
index 000000000..59f1c1550
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/stopword/StopwordLists.java
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.stopword;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.UncheckedIOException;
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.LinkedHashSet;
+import java.util.Locale;
+import java.util.Map;
+import java.util.MissingResourceException;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+
+import opennlp.tools.util.LanguageCodeValidator;
+
+/**
+ * Static factory for {@link StopwordFilter} instances backed by bundled
+ * language-specific stopword resources or caller-supplied input streams.
+ *
+ * Bundled lists ship for the eleven languages enumerated in
+ * OPENNLP-660:
+ * Bulgarian (bg), Danish (da), German (de), English (en), Spanish (es),
+ * Finnish (fi), French (fr), Italian (it), Dutch (nl), Portuguese (pt),
+ * Russian (ru). Each list is keyed by its ISO 639-1 two-letter code.
+ */
+public final class StopwordLists {
+
+ private static final String RESOURCE_PATH_PREFIX = "/opennlp/tools/stopword/";
+
+ private static final Set
+ * Two-letter inputs are simply lower-cased and returned. Three-letter inputs
+ * are resolved with a single lookup against {@link #ISO6393_TO_ISO6391},
+ * which is precomputed once at class-initialization time (covering both the
+ * terminologic forms produced by {@link Locale#getISO3Language()} and the
+ * ISO 639-2 bibliographic forms {@code dut}, {@code fre} and {@code ger}).
+ * Unresolved codes are returned lower-cased and unchanged.
+ */
+ private static String normalizeToIso6391(final String code) {
+ final String lower = code.toLowerCase(Locale.ROOT);
+ if (lower.length() == 2) {
+ return lower;
+ }
+ return ISO6393_TO_ISO6391.getOrDefault(lower, lower);
+ }
+}
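Reviewer note on the language-code normalization documented above: the sketch below shows one way such a lookup table could be precomputed, assuming it is built from `Locale.getISOLanguages()` plus the three bibliographic aliases named in the Javadoc. The class `LanguageCodeSketch` and its method names are hypothetical; how the patch actually populates `ISO6393_TO_ISO6391` is not visible in this diff.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

/** Hypothetical, standalone sketch; not code from this patch. */
final class LanguageCodeSketch {

  // Precomputed once: three-letter code -> ISO 639-1 two-letter code.
  private static final Map<String, String> THREE_TO_TWO = buildTable();

  private static Map<String, String> buildTable() {
    Map<String, String> table = new HashMap<>();
    for (String two : Locale.getISOLanguages()) {
      try {
        // Terminologic three-letter form as reported by the JDK, e.g. "deu" for "de".
        String three = Locale.forLanguageTag(two).getISO3Language();
        if (!three.isEmpty()) {
          table.putIfAbsent(three, two);
        }
      } catch (MissingResourceException e) {
        // This JDK knows no three-letter form for the code; skip it.
      }
    }
    // ISO 639-2 bibliographic forms that differ from the terminologic ones.
    table.put("dut", "nl");
    table.put("fre", "fr");
    table.put("ger", "de");
    return table;
  }

  static String normalize(String code) {
    String lower = code.toLowerCase(Locale.ROOT);
    if (lower.length() == 2) {
      return lower; // two-letter input is only lower-cased
    }
    // Unresolved codes fall through lower-cased and otherwise unchanged.
    return THREE_TO_TWO.getOrDefault(lower, lower);
  }

  public static void main(String[] args) {
    System.out.println(normalize("DEU"));
    System.out.println(normalize("ger"));
    System.out.println(normalize("EN"));
  }
}
```

Building the table once keeps each call to a single map lookup, which matches the "single lookup" wording in the Javadoc.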
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/bg.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/bg.txt
new file mode 100644
index 000000000..dbf47d565
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/bg.txt
@@ -0,0 +1,194 @@
+# This file was created by Jacques Savoy and is distributed under the BSD license.
+# See http://members.unine.ch/jacques.savoy/clef/index.html.
+# Also see http://www.opensource.org/licenses/bsd-license.html
+# Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+а
+аз
+ако
+ала
+бе
+без
+беше
+би
+бил
+била
+били
+било
+близо
+бъдат
+бъде
+бяха
+в
+вас
+ваш
+ваша
+вероятно
+вече
+взема
+ви
+вие
+винаги
+все
+всеки
+всички
+всичко
+всяка
+във
+въпреки
+върху
+г
+ги
+главно
+го
+д
+да
+дали
+до
+докато
+докога
+дори
+досега
+доста
+е
+едва
+един
+ето
+за
+зад
+заедно
+заради
+засега
+затова
+защо
+защото
+и
+из
+или
+им
+има
+имат
+иска
+й
+каза
+как
+каква
+какво
+както
+какъв
+като
+кога
+когато
+което
+които
+кой
+който
+колко
+която
+къде
+където
+към
+ли
+м
+ме
+между
+мен
+ми
+мнозина
+мога
+могат
+може
+моля
+момента
+му
+н
+на
+над
+назад
+най
+направи
+напред
+например
+нас
+не
+него
+нея
+ни
+ние
+никой
+нито
+но
+някои
+някой
+няма
+обаче
+около
+освен
+особено
+от
+отгоре
+отново
+още
+пак
+по
+повече
+повечето
+под
+поне
+поради
+после
+почти
+прави
+пред
+преди
+през
+при
+пък
+първо
+с
+са
+само
+се
+сега
+си
+скоро
+след
+сме
+според
+сред
+срещу
+сте
+съм
+със
+също
+т
+тази
+така
+такива
+такъв
+там
+твой
+те
+тези
+ти
+тн
+то
+това
+тогава
+този
+той
+толкова
+точно
+трябва
+тук
+тъй
+тя
+тях
+у
+харесва
+ч
+че
+често
+чрез
+ще
+щом
+я
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/da.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/da.txt
new file mode 100644
index 000000000..c3608fd52
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/da.txt
@@ -0,0 +1,110 @@
+# From https://snowballstem.org/algorithms/danish/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# A Danish stop word list. Comments begin with `#` (vertical bar upstream). Each stop
+# word is at the start of a line.
+
+# This is a ranked list (commonest to rarest) of stopwords derived from
+# a large text sample.
+
+
+og
+i
+jeg
+det
+at
+en
+den
+til
+er
+som
+på
+de
+med
+han
+af
+for
+ikke
+der
+var
+mig
+sig
+men
+et
+har
+om
+vi
+min
+havde
+ham
+hun
+nu
+over
+da
+fra
+du
+ud
+sin
+dem
+os
+op
+man
+hans
+hvor
+eller
+hvad
+skal
+selv
+her
+alle
+vil
+blev
+kunne
+ind
+når
+være
+dog
+noget
+ville
+jo
+deres
+efter
+ned
+skulle
+denne
+end
+dette
+mit
+også
+under
+have
+dig
+anden
+hende
+mine
+alt
+meget
+sit
+sine
+vor
+mod
+disse
+hvis
+din
+nogle
+hos
+blive
+mange
+ad
+bliver
+hendes
+været
+thi
+jer
+sådan
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/de.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/de.txt
new file mode 100644
index 000000000..f297306ee
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/de.txt
@@ -0,0 +1,294 @@
+# From https://snowballstem.org/algorithms/german/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# A German stop word list. Comments begin with `#` (vertical bar upstream). Each stop
+# word is at the start of a line.
+
+# The number of forms in this list is reduced significantly by passing it
+# through the German stemmer.
+
+
+aber
+
+alle
+allem
+allen
+aller
+alles
+
+als
+also
+am
+an
+
+ander
+andere
+anderem
+anderen
+anderer
+anderes
+anderm
+andern
+anderr
+anders
+
+auch
+auf
+aus
+bei
+bin
+bis
+bist
+da
+damit
+dann
+
+der
+den
+des
+dem
+die
+das
+
+daß
+
+derselbe
+derselben
+denselben
+desselben
+demselben
+dieselbe
+dieselben
+dasselbe
+
+dazu
+
+dein
+deine
+deinem
+deinen
+deiner
+deines
+
+denn
+
+derer
+dessen
+
+dich
+dir
+du
+
+dies
+diese
+diesem
+diesen
+dieser
+dieses
+
+
+doch
+dort
+
+
+durch
+
+ein
+eine
+einem
+einen
+einer
+eines
+
+einig
+einige
+einigem
+einigen
+einiger
+einiges
+
+einmal
+
+er
+ihn
+ihm
+
+es
+etwas
+
+euer
+eure
+eurem
+euren
+eurer
+eures
+
+für
+gegen
+gewesen
+hab
+habe
+haben
+hat
+hatte
+hatten
+hier
+hin
+hinter
+
+ich
+mich
+mir
+
+
+ihr
+ihre
+ihrem
+ihren
+ihrer
+ihres
+euch
+
+im
+in
+indem
+ins
+ist
+
+jede
+jedem
+jeden
+jeder
+jedes
+
+jene
+jenem
+jenen
+jener
+jenes
+
+jetzt
+kann
+
+kein
+keine
+keinem
+keinen
+keiner
+keines
+
+können
+könnte
+machen
+man
+
+manche
+manchem
+manchen
+mancher
+manches
+
+mein
+meine
+meinem
+meinen
+meiner
+meines
+
+mit
+muss
+musste
+nach
+nicht
+nichts
+noch
+nun
+nur
+ob
+oder
+ohne
+sehr
+
+sein
+seine
+seinem
+seinen
+seiner
+seines
+
+selbst
+sich
+
+sie
+ihnen
+
+sind
+so
+
+solche
+solchem
+solchen
+solcher
+solches
+
+soll
+sollte
+sondern
+sonst
+über
+um
+und
+
+uns
+unse
+unsem
+unsen
+unser
+unses
+
+unter
+viel
+vom
+von
+vor
+während
+war
+waren
+warst
+was
+weg
+weil
+weiter
+
+welche
+welchem
+welchen
+welcher
+welches
+
+wenn
+werde
+werden
+wie
+wieder
+will
+wir
+wird
+wirst
+wo
+wollen
+wollte
+würde
+würden
+zu
+zum
+zur
+zwar
+zwischen
+
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/en.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/en.txt
new file mode 100644
index 000000000..7ae6d01dd
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/en.txt
@@ -0,0 +1,320 @@
+# From https://snowballstem.org/algorithms/english/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# An English stop word list. Comments begin with `#` (vertical bar upstream). Each stop
+# word is at the start of a line.
+
+# Many of the forms below are quite rare (e.g. "yourselves") but included for
+# completeness.
+
+# PRONOUNS FORMS
+# 1st person sing
+
+i
+
+me
+my
+# the possessive pronoun `mine' is best suppressed, because of the
+# sense of coal-mine etc.
+myself
+# 1st person plural
+we
+
+# us | object
+# care is required here because US = United States. It is usually
+# safe to remove it if it is in lower case.
+our
+ours
+ourselves
+# second person (archaic `thou' forms not included)
+you
+your
+yours
+yourself
+yourselves
+# third person singular
+he
+him
+his
+himself
+
+she
+her
+hers
+herself
+
+it
+its
+itself
+# third person plural
+they
+them
+their
+theirs
+themselves
+# other forms (demonstratives, interrogatives)
+what
+which
+who
+whom
+this
+that
+these
+those
+
+# VERB FORMS (using F.R. Palmer's nomenclature)
+# BE
+am
+is
+are
+was
+were
+be
+been
+being
+# HAVE
+have
+has
+had
+having
+# DO
+do
+does
+did
+doing
+
+# The forms below are, I believe, best omitted, because of the significant
+# homonym forms:
+
+# He made a WILL
+# old tin CAN
+# merry month of MAY
+# a smell of MUST
+# fight the good fight with all thy MIGHT
+
+# would, could, should, ought might however be included
+
+# | AUXILIARIES
+# | WILL
+#will
+
+would
+
+# | SHALL
+#shall
+
+should
+
+# | CAN
+#can
+
+could
+
+# | MAY
+#may
+#might
+# | MUST
+#must
+# | OUGHT
+
+ought
+
+# COMPOUND FORMS, increasingly encountered nowadays in 'formal' writing
+# pronoun + verb
+
+i'm
+you're
+he's
+she's
+it's
+we're
+they're
+i've
+you've
+we've
+they've
+i'd
+you'd
+he'd
+she'd
+we'd
+they'd
+i'll
+you'll
+he'll
+she'll
+we'll
+they'll
+
+# verb + negation
+
+isn't
+aren't
+wasn't
+weren't
+hasn't
+haven't
+hadn't
+doesn't
+don't
+didn't
+
+# auxiliary + negation
+
+won't
+wouldn't
+shan't
+shouldn't
+can't
+cannot
+couldn't
+mustn't
+
+# miscellaneous forms
+
+let's
+that's
+who's
+what's
+here's
+there's
+when's
+where's
+why's
+how's
+
+# rarer forms
+
+# daren't needn't
+
+# doubtful forms
+
+# oughtn't mightn't
+
+# ARTICLES
+a
+an
+the
+
+# THE REST (Overlap among prepositions, conjunctions, adverbs etc is so
+# high, that classification is pointless.)
+and
+but
+if
+or
+because
+as
+until
+while
+
+of
+at
+by
+for
+with
+about
+against
+between
+into
+through
+during
+before
+after
+above
+below
+to
+from
+up
+down
+in
+out
+on
+off
+over
+under
+
+again
+further
+then
+once
+
+here
+there
+when
+where
+why
+how
+
+all
+any
+both
+each
+few
+more
+most
+other
+some
+such
+
+no
+nor
+not
+only
+own
+same
+so
+than
+too
+very
+
+# Just for the record, the following words are among the commonest in English
+
+# one
+# every
+# least
+# less
+# many
+# now
+# ever
+# never
+# say
+# says
+# said
+# also
+# get
+# go
+# goes
+# just
+# made
+# make
+# put
+# see
+# seen
+# whether
+# like
+# well
+# back
+# even
+# still
+# way
+# take
+# since
+# another
+# however
+# two
+# three
+# four
+# five
+# first
+# second
+# new
+# old
+# high
+# long
+
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/es.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/es.txt
new file mode 100644
index 000000000..f1955d62e
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/es.txt
@@ -0,0 +1,356 @@
+# From https://snowballstem.org/algorithms/spanish/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# A Spanish stop word list. Comments begin with `#` (vertical bar upstream). Each stop
+# word is at the start of a line.
+
+
+# The following is a ranked list (commonest to rarest) of stopwords
+# deriving from a large sample of text.
+
+# Extra words have been added at the end.
+
+de
+la
+que
+el
+en
+y
+a
+los
+del
+se
+las
+por
+un
+para
+con
+no
+una
+su
+al
+# es from SER
+lo
+como
+más
+pero
+sus
+le
+ya
+o
+# fue from SER
+este
+# ha from HABER
+sí
+porque
+esta
+# son from SER
+entre
+# está from ESTAR
+cuando
+muy
+sin
+sobre
+# ser from SER
+# tiene from TENER
+también
+me
+hasta
+hay
+donde
+# han from HABER
+quien
+# están from ESTAR
+# estado from ESTAR
+desde
+todo
+nos
+durante
+# estados from ESTAR
+todos
+uno
+les
+ni
+contra
+otros
+# fueron from SER
+ese
+eso
+# había from HABER
+ante
+ellos
+e
+esto
+mí
+antes
+algunos
+qué
+unos
+yo
+otro
+otras
+otra
+él
+tanto
+esa
+estos
+mucho
+quienes
+nada
+muchos
+cual
+# sea from SER
+poco
+ella
+estar
+# haber from HABER
+estas
+# estaba from ESTAR
+# estamos from ESTAR
+algunas
+algo
+nosotros
+
+# other forms
+
+mi
+mis
+tú
+te
+ti
+tu
+tus
+ellas
+nosotras
+vosotros
+vosotras
+os
+mío
+mía
+míos
+mías
+tuyo
+tuya
+tuyos
+tuyas
+suyo
+suya
+suyos
+suyas
+nuestro
+nuestra
+nuestros
+nuestras
+vuestro
+vuestra
+vuestros
+vuestras
+esos
+esas
+
+# forms of estar, to be (not including the infinitive):
+estoy
+estás
+está
+estamos
+estáis
+están
+esté
+estés
+estemos
+estéis
+estén
+estaré
+estarás
+estará
+estaremos
+estaréis
+estarán
+estaría
+estarías
+estaríamos
+estaríais
+estarían
+estaba
+estabas
+estábamos
+estabais
+estaban
+estuve
+estuviste
+estuvo
+estuvimos
+estuvisteis
+estuvieron
+estuviera
+estuvieras
+estuviéramos
+estuvierais
+estuvieran
+estuviese
+estuvieses
+estuviésemos
+estuvieseis
+estuviesen
+estando
+estado
+estada
+estados
+estadas
+estad
+
+# forms of haber, to have (not including the infinitive):
+he
+has
+ha
+hemos
+habéis
+han
+haya
+hayas
+hayamos
+hayáis
+hayan
+habré
+habrás
+habrá
+habremos
+habréis
+habrán
+habría
+habrías
+habríamos
+habríais
+habrían
+había
+habías
+habíamos
+habíais
+habían
+hube
+hubiste
+hubo
+hubimos
+hubisteis
+hubieron
+hubiera
+hubieras
+hubiéramos
+hubierais
+hubieran
+hubiese
+hubieses
+hubiésemos
+hubieseis
+hubiesen
+habiendo
+habido
+habida
+habidos
+habidas
+
+# forms of ser, to be (not including the infinitive):
+soy
+eres
+es
+somos
+sois
+son
+sea
+seas
+seamos
+seáis
+sean
+seré
+serás
+será
+seremos
+seréis
+serán
+sería
+serías
+seríamos
+seríais
+serían
+era
+eras
+éramos
+erais
+eran
+fui
+fuiste
+fue
+fuimos
+fuisteis
+fueron
+fuera
+fueras
+fuéramos
+fuerais
+fueran
+fuese
+fueses
+fuésemos
+fueseis
+fuesen
+siendo
+sido
+# sed also means 'thirst'
+
+# forms of tener, to have (not including the infinitive):
+tengo
+tienes
+tiene
+tenemos
+tenéis
+tienen
+tenga
+tengas
+tengamos
+tengáis
+tengan
+tendré
+tendrás
+tendrá
+tendremos
+tendréis
+tendrán
+tendría
+tendrías
+tendríamos
+tendríais
+tendrían
+tenía
+tenías
+teníamos
+teníais
+tenían
+tuve
+tuviste
+tuvo
+tuvimos
+tuvisteis
+tuvieron
+tuviera
+tuvieras
+tuviéramos
+tuvierais
+tuvieran
+tuviese
+tuvieses
+tuviésemos
+tuvieseis
+tuviesen
+teniendo
+tenido
+tenida
+tenidos
+tenidas
+tened
+
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt
new file mode 100644
index 000000000..667c57a3c
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt
@@ -0,0 +1,265 @@
+# From https://snowballstem.org/algorithms/finnish/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+# - The pronoun/determiner paradigm rows (originally whitespace-separated columns) were expanded to one token per line so that each form is registered as an individual stopword by OpenNLP's loader.
+#
+
+# forms of BE
+
+olla
+olen
+olet
+on
+olemme
+olette
+ovat
+ole
+
+oli
+olisi
+olisit
+olisin
+olisimme
+olisitte
+olisivat
+olit
+olin
+olimme
+olitte
+olivat
+ollut
+olleet
+
+en
+et
+ei
+emme
+ette
+eivät
+
+# Personal pronoun paradigms
+# Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat (Ess Trans where present)
+minä
+minun
+minut
+minua
+minussa
+minusta
+minuun
+minulla
+minulta
+minulle
+sinä
+sinun
+sinut
+sinua
+sinussa
+sinusta
+sinuun
+sinulla
+sinulta
+sinulle
+hän
+hänen
+hänet
+häntä
+hänessä
+hänestä
+häneen
+hänellä
+häneltä
+hänelle
+me
+meidän
+meidät
+meitä
+meissä
+meistä
+meihin
+meillä
+meiltä
+meille
+te
+teidän
+teidät
+teitä
+teissä
+teistä
+teihin
+teillä
+teiltä
+teille
+he
+heidän
+heidät
+heitä
+heissä
+heistä
+heihin
+heillä
+heiltä
+heille
+
+# Demonstrative pronoun paradigms
+tämä
+tämän
+tätä
+tässä
+tästä
+tähän
+tällä
+tältä
+tälle
+tänä
+täksi
+tuo
+tuon
+tuota
+tuossa
+tuosta
+tuohon
+tuolla
+tuolta
+tuolle
+tuona
+tuoksi
+se
+sen
+sitä
+siinä
+siitä
+siihen
+sillä
+siltä
+sille
+sinä
+siksi
+nämä
+näiden
+näitä
+näissä
+näistä
+näihin
+näillä
+näiltä
+näille
+näinä
+näiksi
+nuo
+noiden
+noita
+noissa
+noista
+noihin
+noilla
+noilta
+noille
+noina
+noiksi
+ne
+niiden
+niitä
+niissä
+niistä
+niihin
+niillä
+niiltä
+niille
+niinä
+niiksi
+
+# Interrogative pronoun paradigms
+kuka
+kenen
+kenet
+ketä
+kenessä
+kenestä
+keneen
+kenellä
+keneltä
+kenelle
+kenenä
+keneksi
+ketkä
+keiden
+keitä
+keissä
+keistä
+keihin
+keillä
+keiltä
+keille
+keinä
+keiksi
+mikä
+minkä
+mitä
+missä
+mistä
+mihin
+millä
+miltä
+mille
+minä
+miksi
+mitkä
+
+# Relative pronoun paradigms
+joka
+jonka
+jota
+jossa
+josta
+johon
+jolla
+jolta
+jolle
+jona
+joksi
+jotka
+joiden
+joita
+joissa
+joista
+joihin
+joilla
+joilta
+joille
+joina
+joiksi
+
+# conjunctions
+
+että
+ja
+jos
+koska
+kuin
+mutta
+niin
+sekä
+sillä
+tai
+vaan
+vai
+vaikka
+
+
+# prepositions
+
+kanssa
+mukaan
+noin
+poikki
+yli
+
+# other
+
+kun
+nyt
+itse
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fr.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fr.txt
new file mode 100644
index 000000000..e721a2a64
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fr.txt
@@ -0,0 +1,186 @@
+# From https://snowballstem.org/algorithms/french/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# A French stop word list. Comments begin with vertical bar. Each stop
+# word is at the start of a line.
+
+au
+aux
+avec
+ce
+ces
+dans
+de
+des
+du
+elle
+en
+et
+eux
+il
+je
+la
+le
+leur
+lui
+ma
+mais
+me
+même
+mes
+moi
+mon
+ne
+nos
+notre
+nous
+on
+ou
+par
+pas
+pour
+qu
+que
+qui
+sa
+se
+ses
+# son | his, her (masc). Omitted because it is homonym of "sound"
+sur
+ta
+te
+tes
+toi
+ton
+tu
+un
+une
+vos
+votre
+vous
+
+# single letter forms
+
+c
+d
+j
+l
+à
+m
+n
+s
+t
+y
+
+# forms of être (not including the infinitive):
+# été - Omitted because it is homonym of "summer"
+étée
+étées
+# étés - Omitted because it is homonym of "summers"
+étant
+suis
+es
+# est - Omitted because it is homonym of "east"
+# sommes - Omitted because it is homonym of "sums"
+êtes
+sont
+serai
+seras
+sera
+serons
+serez
+seront
+serais
+serait
+serions
+seriez
+seraient
+étais
+était
+étions
+étiez
+étaient
+fus
+fut
+fûmes
+fûtes
+furent
+sois
+soit
+soyons
+soyez
+soient
+fusse
+fusses
+# fût - Omitted because it is homonym of "tap", like in "beer on tap"
+fussions
+fussiez
+fussent
+
+# forms of avoir (not including the infinitive):
+ayant
+eu
+eue
+eues
+eus
+ai
+# as - Omitted because it is homonym of "ace"
+avons
+avez
+ont
+aurai
+# auras - Omitted because it is also the name of a kind of wind
+# aura - Omitted because it is also the name of a kind of wind and homonym of "aura"
+aurons
+aurez
+auront
+aurais
+aurait
+aurions
+auriez
+auraient
+avais
+avait
+# avions - Omitted because it is homonym of "planes"
+aviez
+avaient
+eut
+eûmes
+eûtes
+eurent
+aie
+aies
+ait
+ayons
+ayez
+aient
+eusse
+eusses
+eût
+eussions
+eussiez
+eussent
+
+# Later additions (from Jean-Christophe Deschamps)
+ceci
+cela
+celà
+cet
+cette
+ici
+ils
+les
+leurs
+quel
+quels
+quelle
+quelles
+sans
+soi
+
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/it.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/it.txt
new file mode 100644
index 000000000..dbaf5e860
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/it.txt
@@ -0,0 +1,303 @@
+# From https://snowballstem.org/algorithms/italian/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# An Italian stop word list. Comments begin with vertical bar. Each stop
+# word is at the start of a line.
+
+ad
+al
+allo
+ai
+agli
+all
+agl
+alla
+alle
+con
+col
+coi
+da
+dal
+dallo
+dai
+dagli
+dall
+dagl
+dalla
+dalle
+di
+del
+dello
+dei
+degli
+dell
+degl
+della
+delle
+in
+nel
+nello
+nei
+negli
+nell
+negl
+nella
+nelle
+su
+sul
+sullo
+sui
+sugli
+sull
+sugl
+sulla
+sulle
+per
+tra
+contro
+io
+tu
+lui
+lei
+noi
+voi
+loro
+mio
+mia
+miei
+mie
+tuo
+tua
+tuoi
+tue
+suo
+sua
+suoi
+sue
+nostro
+nostra
+nostri
+nostre
+vostro
+vostra
+vostri
+vostre
+mi
+ti
+ci
+vi
+lo
+la
+li
+le
+gli
+ne
+il
+un
+uno
+una
+ma
+ed
+se
+perché
+anche
+come
+dov
+dove
+che
+chi
+cui
+non
+più
+quale
+quanto
+quanti
+quanta
+quante
+quello
+quelli
+quella
+quelle
+questo
+questi
+questa
+queste
+si
+tutto
+tutti
+
+# single letter forms:
+
+a
+c
+e
+i
+l
+o
+
+# forms of avere, to have (not including the infinitive):
+
+ho
+hai
+ha
+abbiamo
+avete
+hanno
+abbia
+abbiate
+abbiano
+avrò
+avrai
+avrà
+avremo
+avrete
+avranno
+avrei
+avresti
+avrebbe
+avremmo
+avreste
+avrebbero
+avevo
+avevi
+aveva
+avevamo
+avevate
+avevano
+ebbi
+avesti
+ebbe
+avemmo
+aveste
+ebbero
+avessi
+avesse
+avessimo
+avessero
+avendo
+avuto
+avuta
+avuti
+avute
+
+# forms of essere, to be (not including the infinitive):
+sono
+sei
+è
+siamo
+siete
+sia
+siate
+siano
+sarò
+sarai
+sarà
+saremo
+sarete
+saranno
+sarei
+saresti
+sarebbe
+saremmo
+sareste
+sarebbero
+ero
+eri
+era
+eravamo
+eravate
+erano
+fui
+fosti
+fu
+fummo
+foste
+furono
+fossi
+fosse
+fossimo
+fossero
+essendo
+
+# forms of fare, to do (not including the infinitive, fa, fat-):
+faccio
+fai
+facciamo
+fanno
+faccia
+facciate
+facciano
+farò
+farai
+farà
+faremo
+farete
+faranno
+farei
+faresti
+farebbe
+faremmo
+fareste
+farebbero
+facevo
+facevi
+faceva
+facevamo
+facevate
+facevano
+feci
+facesti
+fece
+facemmo
+faceste
+fecero
+facessi
+facesse
+facessimo
+facessero
+facendo
+
+# forms of stare, to be (not including the infinitive):
+sto
+stai
+sta
+stiamo
+stanno
+stia
+stiate
+stiano
+starò
+starai
+starà
+staremo
+starete
+staranno
+starei
+staresti
+starebbe
+staremmo
+stareste
+starebbero
+stavo
+stavi
+stava
+stavamo
+stavate
+stavano
+stetti
+stesti
+stette
+stemmo
+steste
+stettero
+stessi
+stesse
+stessimo
+stessero
+stando
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/nl.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/nl.txt
new file mode 100644
index 000000000..805fe2a8f
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/nl.txt
@@ -0,0 +1,121 @@
+# From https://snowballstem.org/algorithms/dutch/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+
+# A Dutch stop word list. Comments begin with vertical bar. Each stop
+# word is at the start of a line.
+
+# This is a ranked list (commonest to rarest) of stopwords derived from
+# a large sample of Dutch text.
+
+# Dutch stop words frequently exhibit homonym clashes. These are indicated
+# clearly below.
+
+de
+en
+van
+ik
+te
+dat
+die
+in
+een
+hij
+het
+niet
+zijn
+is
+was
+op
+aan
+met
+als
+voor
+had
+er
+maar
+om
+hem
+dan
+zou
+of
+wat
+mijn
+men
+dit
+zo
+door
+over
+ze
+zich
+bij
+ook
+tot
+je
+mij
+uit
+der
+daar
+haar
+naar
+heb
+hoe
+heeft
+hebben
+deze
+u
+want
+nog
+zal
+me
+zij
+nu
+ge
+geen
+omdat
+iets
+worden
+toch
+al
+waren
+veel
+meer
+doen
+toen
+moet
+ben
+zonder
+kan
+hun
+dus
+alles
+onder
+ja
+eens
+hier
+wie
+werd
+altijd
+doch
+wordt
+wezen
+kunnen
+ons
+zelf
+tegen
+na
+reeds
+wil
+kon
+niets
+uw
+iemand
+geweest
+andere
+
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/pt.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/pt.txt
new file mode 100644
index 000000000..e54eb08a3
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/pt.txt
@@ -0,0 +1,253 @@
+# From https://snowballstem.org/algorithms/portuguese/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# A Portuguese stop word list. Comments begin with vertical bar. Each stop
+# word is at the start of a line.
+
+
+# The following is a ranked list (commonest to rarest) of stopwords
+# deriving from a large sample of text.
+
+# Extra words have been added at the end.
+
+de
+a
+o
+que
+e
+do
+da
+em
+um
+para
+# é from SER
+com
+não
+uma
+os
+no
+se
+na
+por
+mais
+as
+dos
+como
+mas
+# foi from SER
+ao
+ele
+das
+# tem from TER
+à
+seu
+sua
+ou
+# ser from SER
+quando
+muito
+# há from HAV
+nos
+já
+# está from EST
+eu
+também
+só
+pelo
+pela
+até
+isso
+ela
+entre
+# era from SER
+depois
+sem
+mesmo
+aos
+# ter from TER
+seus
+quem
+nas
+me
+esse
+eles
+# estão from EST
+você
+# tinha from TER
+# foram from SER
+essa
+num
+nem
+suas
+meu
+às
+minha
+# têm from TER
+numa
+pelos
+elas
+# havia from HAV
+# seja from SER
+qual
+# será from SER
+nós
+# tenho from TER
+lhe
+deles
+essas
+esses
+pelas
+este
+# fosse from SER
+dele
+
+# other words. There are many contractions such as naquele = em+aquele,
+# mo = me+o, but they are rare.
+# Indefinite article plural forms are also rare.
+
+tu
+te
+vocês
+vos
+lhes
+meus
+minhas
+teu
+tua
+teus
+tuas
+nosso
+nossa
+nossos
+nossas
+
+dela
+delas
+
+esta
+estes
+estas
+aquele
+aquela
+aqueles
+aquelas
+isto
+aquilo
+
+# forms of estar, to be (not including the infinitive):
+estou
+está
+estamos
+estão
+estive
+esteve
+estivemos
+estiveram
+estava
+estávamos
+estavam
+estivera
+estivéramos
+esteja
+estejamos
+estejam
+estivesse
+estivéssemos
+estivessem
+estiver
+estivermos
+estiverem
+
+# forms of haver, to have (not including the infinitive):
+hei
+há
+havemos
+hão
+houve
+houvemos
+houveram
+houvera
+houvéramos
+haja
+hajamos
+hajam
+houvesse
+houvéssemos
+houvessem
+houver
+houvermos
+houverem
+houverei
+houverá
+houveremos
+houverão
+houveria
+houveríamos
+houveriam
+
+# forms of ser, to be (not including the infinitive):
+sou
+somos
+são
+era
+éramos
+eram
+fui
+foi
+fomos
+foram
+fora
+fôramos
+seja
+sejamos
+sejam
+fosse
+fôssemos
+fossem
+for
+formos
+forem
+serei
+será
+seremos
+serão
+seria
+seríamos
+seriam
+
+# forms of ter, to have (not including the infinitive):
+tenho
+tem
+temos
+tém
+tinha
+tínhamos
+tinham
+tive
+teve
+tivemos
+tiveram
+tivera
+tivéramos
+tenha
+tenhamos
+tenham
+tivesse
+tivéssemos
+tivessem
+tiver
+tivermos
+tiverem
+terei
+terá
+teremos
+terão
+teria
+teríamos
+teriam
diff --git a/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/ru.txt b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/ru.txt
new file mode 100644
index 000000000..311f57b0e
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/ru.txt
@@ -0,0 +1,244 @@
+# From https://snowballstem.org/algorithms/russian/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+
+# a russian stop word list. comments begin with vertical bar. each stop
+# word is at the start of a line.
+
+# this is a ranked list (commonest to rarest) of stopwords derived from
+# a large text sample.
+
+# letter `ё' is translated to `е'.
+
+и
+в
+во
+не
+что
+он
+на
+я
+с
+со
+как
+а
+то
+все
+она
+так
+его
+но
+да
+ты
+к
+у
+же
+вы
+за
+бы
+по
+только
+ее
+мне
+было
+вот
+от
+меня
+еще
+нет
+о
+из
+ему
+теперь
+когда
+даже
+ну
+вдруг
+ли
+если
+уже
+или
+ни
+быть
+был
+него
+до
+вас
+нибудь
+опять
+уж
+вам
+сказал
+ведь
+там
+потом
+себя
+ничего
+ей
+может
+они
+тут
+где
+есть
+надо
+ней
+для
+мы
+тебя
+их
+чем
+была
+сам
+чтоб
+без
+будто
+человек
+чего
+раз
+тоже
+себе
+под
+жизнь
+будет
+ж
+тогда
+кто
+этот
+говорил
+того
+потому
+этого
+какой
+совсем
+ним
+здесь
+этом
+один
+почти
+мой
+тем
+чтобы
+нее
+кажется
+сейчас
+были
+куда
+зачем
+сказать
+всех
+никогда
+сегодня
+можно
+при
+наконец
+два
+об
+другой
+хоть
+после
+над
+больше
+тот
+через
+эти
+нас
+про
+всего
+них
+какая
+много
+разве
+сказала
+три
+эту
+моя
+впрочем
+хорошо
+свою
+этой
+перед
+иногда
+лучше
+чуть
+том
+нельзя
+такой
+им
+более
+всегда
+конечно
+всю
+между
+
+
+# b: some paradigms
+#
+# personal pronouns
+#
+# я меня мне мной [мною]
+# ты тебя тебе тобой [тобою]
+# он его ему им [него, нему, ним]
+# она ее эи ею [нее, нэи, нею]
+# оно его ему им [него, нему, ним]
+#
+# мы нас нам нами
+# вы вас вам вами
+# они их им ими [них, ним, ними]
+#
+# себя себе собой [собою]
+#
+# demonstrative pronouns: этот (this), тот (that)
+#
+# этот эта это эти
+# этого эты это эти
+# этого этой этого этих
+# этому этой этому этим
+# этим этой этим [этою] этими
+# этом этой этом этих
+#
+# тот та то те
+# того ту то те
+# того той того тех
+# тому той тому тем
+# тем той тем [тою] теми
+# том той том тех
+#
+# determinative pronouns
+#
+# (a) весь (all)
+#
+# весь вся все все
+# всего всю все все
+# всего всей всего всех
+# всему всей всему всем
+# всем всей всем [всею] всеми
+# всем всей всем всех
+#
+# (b) сам (himself etc)
+#
+# сам сама само сами
+# самого саму само самих
+# самого самой самого самих
+# самому самой самому самим
+# самим самой самим [самою] самими
+# самом самой самом самих
+#
+# stems of verbs `to be', `to have', `to do' and modal
+#
+# быть бы буд быв есть суть
+# име
+# дел
+# мог мож мочь
+# уме
+# хоч хот
+# долж
+# можн
+# нужн
+# нельзя
+
diff --git a/opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/stopword/DictionaryStopwordFilterTest.java b/opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/stopword/DictionaryStopwordFilterTest.java
new file mode 100644
index 000000000..8f4683603
--- /dev/null
+++ b/opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/stopword/DictionaryStopwordFilterTest.java
@@ -0,0 +1,467 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.stopword;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.Set;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.dictionary.Dictionary;
+import opennlp.tools.util.StringList;
+
+public class DictionaryStopwordFilterTest {
+
+ private static DictionaryStopwordFilter empty() {
+ return DictionaryStopwordFilter.builder().build();
+ }
+
+ private static DictionaryStopwordFilter withEntries(final String[]... entries) {
+ final DictionaryStopwordFilter.Builder b = DictionaryStopwordFilter.builder();
+ for (final String[] e : entries) {
+ b.add(e);
+ }
+ return b.build();
+ }
+
+ @Test
+ void testEmptyBuilderProducesCaseInsensitiveEmptyFilter() {
+ final DictionaryStopwordFilter filter = empty();
+ Assertions.assertFalse(filter.isCaseSensitive());
+ Assertions.assertTrue(filter.stopwords().isEmpty());
+ Assertions.assertFalse(filter.isStopword("the"));
+ }
+
+ @Test
+ void testCaseInsensitiveMatching() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .add("the")
+ .build();
+ Assertions.assertTrue(filter.isStopword("the"));
+ Assertions.assertTrue(filter.isStopword("THE"));
+ Assertions.assertTrue(filter.isStopword("The"));
+ }
+
+ @Test
+ void testCaseSensitiveMatching() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .caseSensitive(true)
+ .add("the")
+ .build();
+ Assertions.assertTrue(filter.isCaseSensitive());
+ Assertions.assertTrue(filter.isStopword("the"));
+ Assertions.assertFalse(filter.isStopword("The"));
+ Assertions.assertFalse(filter.isStopword("THE"));
+ }
+
+ @Test
+ void testFilterPreservesOrderAndDropsOneGramStopwords() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .add("the")
+ .add("a")
+ .build();
+
+ final String[] input = { "the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog" };
+ final String[] expected = { "quick", "brown", "fox", "jumps", "over", "lazy", "dog" };
+ final String[] actual = filter.filter(input);
+
+ Assertions.assertArrayEquals(expected, actual);
+ }
+
+ @Test
+ void testBuilderRemoveUndoesAdd() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .add("foo")
+ .remove("foo")
+ .build();
+ Assertions.assertFalse(filter.isStopword("foo"));
+ }
+
+ @Test
+ void testBuilderAddAllAndRemoveAll() {
+ final DictionaryStopwordFilter added = DictionaryStopwordFilter.builder()
+ .addAll(Arrays.asList(new String[] {"alpha"}, new String[] {"beta"}))
+ .build();
+ Assertions.assertTrue(added.isStopword("alpha"));
+ Assertions.assertTrue(added.isStopword("beta"));
+
+ final DictionaryStopwordFilter undone = DictionaryStopwordFilter.builder()
+ .addAll(Arrays.asList(new String[] {"alpha"}, new String[] {"beta"}))
+ .removeAll(Arrays.asList(new String[] {"alpha"}, new String[] {"beta"}))
+ .build();
+ Assertions.assertFalse(undone.isStopword("alpha"));
+ Assertions.assertFalse(undone.isStopword("beta"));
+ }
+
+ @Test
+ void testMultiWordIsStopwordAndIndividualTokensNotMembers() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .add("of", "the")
+ .build();
+
+ Assertions.assertTrue(filter.isStopword("of", "the"));
+ Assertions.assertFalse(filter.isStopword("of"));
+ Assertions.assertFalse(filter.isStopword("the"));
+ }
+
+ @Test
+ void testFilterDropsNGramMatches() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"of", "the", "cat"});
+ Assertions.assertArrayEquals(new String[] {"cat"}, result);
+ }
+
+ @Test
+ void testFilterPrefersLongestMatchGreedy() {
+ final DictionaryStopwordFilter filter = withEntries(
+ new String[] {"of"}, new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"of", "the", "cat"});
+ Assertions.assertArrayEquals(new String[] {"cat"}, result);
+ }
+
+ @Test
+ void testFilterMixedOneAndTwoGramEntries() {
+ final DictionaryStopwordFilter filter = withEntries(
+ new String[] {"the"}, new String[] {"in", "spite"});
+
+ final String[] result = filter.filter(
+ new String[] {"the", "cat", "sat", "in", "spite", "of", "rain"});
+ Assertions.assertArrayEquals(
+ new String[] {"cat", "sat", "of", "rain"}, result);
+ }
+
+ @Test
+ void testFilterNGramAtStartOfInput() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"as", "well", "as"});
+ final String[] result = filter.filter(new String[] {"as", "well", "as", "cats"});
+ Assertions.assertArrayEquals(new String[] {"cats"}, result);
+ }
+
+ @Test
+ void testFilterNGramAtEndOfInput() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"king", "of", "the"});
+ Assertions.assertArrayEquals(new String[] {"king"}, result);
+ }
+
+ @Test
+ void testFilterNGramInMiddleOfInput() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"king", "of", "the", "hill"});
+ Assertions.assertArrayEquals(new String[] {"king", "hill"}, result);
+ }
+
+ @Test
+ void testFilterPartialTailDoesNotMatch() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"king", "of"});
+ Assertions.assertArrayEquals(new String[] {"king", "of"}, result);
+ }
+
+ @Test
+ void testFilterWindowLongerThanInput() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"a", "b", "c", "d"});
+ final String[] result = filter.filter(new String[] {"a", "b"});
+ Assertions.assertArrayEquals(new String[] {"a", "b"}, result);
+ }
+
+ @Test
+ void testFilterTwoConsecutiveNGramMatches() {
+ final DictionaryStopwordFilter filter = withEntries(
+ new String[] {"of", "the"}, new String[] {"in", "spite"});
+ final String[] result = filter.filter(
+ new String[] {"of", "the", "in", "spite", "rain"});
+ Assertions.assertArrayEquals(new String[] {"rain"}, result);
+ }
+
+ @Test
+ void testFilterThreeGramEntry() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"in", "spite", "of"});
+ final String[] result = filter.filter(
+ new String[] {"won", "in", "spite", "of", "rain"});
+ Assertions.assertArrayEquals(new String[] {"won", "rain"}, result);
+ }
+
+ @Test
+ void testFilterLongestMatchWhenShorterOverlapAlsoMatches() {
+ final DictionaryStopwordFilter filter = withEntries(
+ new String[] {"a", "b"}, new String[] {"a", "b", "c"});
+ final String[] result = filter.filter(new String[] {"a", "b", "c", "d"});
+ Assertions.assertArrayEquals(new String[] {"d"}, result);
+ }
+
+ @Test
+ void testFilterFallsBackToShorterMatchWhenLongestDoesNotApply() {
+ final DictionaryStopwordFilter filter = withEntries(
+ new String[] {"a", "b"}, new String[] {"a", "b", "c"});
+ final String[] result = filter.filter(new String[] {"a", "b", "x", "d"});
+ Assertions.assertArrayEquals(new String[] {"x", "d"}, result);
+ }
+
+ @Test
+ void testFilterNullElementInterruptsWindow() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"of", null, "the", "cat"});
+ Assertions.assertArrayEquals(new String[] {"of", null, "the", "cat"}, result);
+ }
+
+ @Test
+ void testFilterLeadingNullPassesThrough() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"the"});
+ final String[] result = filter.filter(new String[] {null, "the", "cat"});
+ Assertions.assertArrayEquals(new String[] {null, "cat"}, result);
+ }
+
+ @Test
+ void testFilterNGramCaseInsensitiveByDefault() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ // caseSensitive(...) is deliberately not called: the builder default is case-insensitive
+ .add("of", "the")
+ .build();
+ final String[] result = filter.filter(new String[] {"Of", "THE", "cat"});
+ Assertions.assertArrayEquals(new String[] {"cat"}, result);
+ }
+
+ @Test
+ void testFilterNGramCaseSensitiveHonorsCasing() {
+ final DictionaryStopwordFilter filter = DictionaryStopwordFilter.builder()
+ .caseSensitive(true)
+ .add("of", "the")
+ .build();
+ final String[] caseDiff = filter.filter(new String[] {"Of", "THE", "cat"});
+ Assertions.assertArrayEquals(new String[] {"Of", "THE", "cat"}, caseDiff);
+
+ final String[] exact = filter.filter(new String[] {"of", "the", "cat"});
+ Assertions.assertArrayEquals(new String[] {"cat"}, exact);
+ }
+
+ @Test
+ void testFilterDoesNotEatRegisteredOneGramAfterAddingTwoGram() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"king", "of", "rain"});
+ Assertions.assertArrayEquals(new String[] {"king", "of", "rain"}, result);
+ }
+
+ @Test
+ void testFilterEmptyDictionaryKeepsAllTokens() {
+ final DictionaryStopwordFilter filter = empty();
+ final String[] result = filter.filter(new String[] {"the", "quick", "brown", "fox"});
+ Assertions.assertArrayEquals(new String[] {"the", "quick", "brown", "fox"}, result);
+ }
+
+ @Test
+ void testFilterAdjacentSameNGramMatchesBoth() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"of", "the", "of", "the", "end"});
+ Assertions.assertArrayEquals(new String[] {"end"}, result);
+ }
+
+ @Test
+ void testFilterNGramMatchAfterUnmatchedToken() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"of", "the"});
+ final String[] result = filter.filter(new String[] {"x", "of", "the", "y"});
+ Assertions.assertArrayEquals(new String[] {"x", "y"}, result);
+ }
+
+ @Test
+ void testFilterReturnsNewArrayInstance() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"the"});
+ final String[] input = new String[] {"the", "cat"};
+ final String[] output = filter.filter(input);
+ Assertions.assertNotSame(input, output);
+ input[1] = "dog";
+ Assertions.assertArrayEquals(new String[] {"cat"}, output);
+ }
+
+ @Test
+ void testFilterInputNullThrowsIllegalArgument() {
+ final DictionaryStopwordFilter filter = empty();
+ Assertions.assertThrows(IllegalArgumentException.class,
+ () -> filter.filter(null));
+ }
+
+ @Test
+ void testInputStreamConstructorParsesBlanksCommentsAndMultiWordLines() throws Exception {
+ final String contents = "# this is a comment header\n"
+ + "\n"
+ + "the\n"
+ + " and \n"
+ + "# another comment\n"
+ + "of the\n"
+ + "\n"
+ + "by\n";
+
+ final DictionaryStopwordFilter filter;
+ try (ByteArrayInputStream in =
+ new ByteArrayInputStream(contents.getBytes(StandardCharsets.UTF_8))) {
+ filter = new DictionaryStopwordFilter(in, StandardCharsets.UTF_8, false);
+ }
+
+ Assertions.assertTrue(filter.isStopword("the"));
+ Assertions.assertTrue(filter.isStopword("and"));
+ Assertions.assertTrue(filter.isStopword("by"));
+ Assertions.assertTrue(filter.isStopword("of", "the"));
+
+ Assertions.assertFalse(filter.isStopword("#"));
+ Assertions.assertFalse(filter.isStopword(""));
+ Assertions.assertFalse(filter.isStopword("dog"));
+ }
+
+ @Test
+ void testBuilderLoadParsesStream() throws Exception {
+ final String contents = "# bundled-style file\nthe\nof the\n";
+ final DictionaryStopwordFilter filter;
+ try (ByteArrayInputStream in =
+ new ByteArrayInputStream(contents.getBytes(StandardCharsets.UTF_8))) {
+ filter = DictionaryStopwordFilter.builder()
+ .load(in, StandardCharsets.UTF_8)
+ .add("extra")
+ .build();
+ }
+ Assertions.assertTrue(filter.isStopword("the"));
+ Assertions.assertTrue(filter.isStopword("of", "the"));
+ Assertions.assertTrue(filter.isStopword("extra"));
+ }
+
+ @Test
+ void testStopwordsViewIsUnmodifiable() {
+ final DictionaryStopwordFilter filter = withEntries(new String[] {"the"});
+ final Set
+ opennlp.tools.stopword package. The central abstraction is
+ the StopwordFilter interface, with a
+ Dictionary-backed default implementation
+ DictionaryStopwordFilter. Bundled stopword lists are
+ available for eleven languages and can be loaded by ISO 639-1 code via
+ the StopwordLists factory. Users may also load custom lists
+ from any InputStream, mix them with the bundled defaults
+ and add or remove individual entries at runtime.
+ StopwordFilteringTokenizer) and an
+ ObjectStream adapter (StopwordFilterStream) to
+ plug stopword filtering into existing tokenization or training data
+ pipelines, plus a command-line tool for ad-hoc filtering.
+ StopwordLists factory by passing the desired ISO 639-1
+ language code. The returned filter is a
+ DictionaryStopwordFilter that is case-insensitive by
+ default.
+ InputStream
+ via StopwordLists.load. The method takes the input stream,
+ a character set and a flag indicating whether matching should be
+ case-sensitive.
+ # are treated as comments
+ and ignored.
+ DictionaryStopwordFilter is immutable once constructed.
+ To tailor a bundled list to a specific domain — for example to add
+ project-specific noise terms or to retain a particular word that would
+ otherwise be filtered — use the nested
+ DictionaryStopwordFilter.Builder. The builder loads the
+ bundled resource, layers user-supplied add /
+ remove operations on top, and produces a fresh immutable
+ filter from build().
+ addAll(Collection<String[]>) and
+ removeAll(Collection<String[]>), which accept
+ multi-token entries. build() applies all queued additions
+ first, then all queued removals.
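The ordering rule stated above (all queued additions first, then all queued removals) can be illustrated with a minimal standalone sketch. This is not the DictionaryStopwordFilter.Builder source: the class name QueuedStopwordBuilderSketch and its representation of entries as lists of tokens are assumptions made purely for illustration, and only java.util is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stopword builder; not the OpenNLP class.
final class QueuedStopwordBuilderSketch {

  // Operations are only queued here; nothing is applied until build().
  private final List<List<String>> additions = new ArrayList<>();
  private final List<List<String>> removals = new ArrayList<>();

  // Queue a single- or multi-token entry for addition.
  QueuedStopwordBuilderSketch add(String... tokens) {
    additions.add(new ArrayList<>(Arrays.asList(tokens)));
    return this;
  }

  // Queue a single- or multi-token entry for removal.
  QueuedStopwordBuilderSketch remove(String... tokens) {
    removals.add(new ArrayList<>(Arrays.asList(tokens)));
    return this;
  }

  // Apply every queued addition first, then every queued removal,
  // regardless of the order in which add/remove were called.
  Set<List<String>> build() {
    Set<List<String>> result = new LinkedHashSet<>(additions);
    for (List<String> entry : removals) {
      result.remove(entry);
    }
    return Collections.unmodifiableSet(result);
  }
}
```

Under this ordering, an entry that is both added and removed is absent from the built result, even if the add call comes after the remove call.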
+ StopwordFilter can match n-grams.
+ The overloaded isStopword(String...) method accepts a
+ sequence of tokens and returns true if the entire sequence
+ is registered as a multi-word stopword entry.
+ filter(String[]) method honors both 1-gram and n-gram
+ entries. It performs a greedy, left-to-right window scan: at each
+ position the longest registered window is tried first; if it matches,
+ the entire window is dropped and scanning resumes after it. Otherwise
+ the head token is kept and scanning advances by one. Tokens that are
+ null are kept in place and never participate in a window
+ match.
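The greedy window scan described above can be sketched in plain Java as follows. This is an illustrative reimplementation under stated assumptions, not the DictionaryStopwordFilter source: the class GreedyStopwordScanSketch is hypothetical, entries are stored as lower-cased token lists, and matching is assumed to be case-insensitive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of a greedy, left-to-right n-gram stopword scan.
final class GreedyStopwordScanSketch {

  private final Set<List<String>> entries = new HashSet<>();
  private int maxLength = 0; // length of the longest registered entry

  // Register a 1-gram or n-gram entry (lower-cased for case-insensitive matching).
  void add(String... tokens) {
    List<String> key = new ArrayList<>(tokens.length);
    for (String token : tokens) {
      key.add(token.toLowerCase(Locale.ROOT));
    }
    entries.add(key);
    maxLength = Math.max(maxLength, tokens.length);
  }

  // Returns a new array containing the tokens that survive filtering.
  String[] filter(String[] tokens) {
    List<String> kept = new ArrayList<>(tokens.length);
    int i = 0;
    while (i < tokens.length) {
      int matched = 0;
      // Try the longest possible window first, then shrink it.
      for (int w = Math.min(maxLength, tokens.length - i); w >= 1; w--) {
        if (matches(tokens, i, w)) {
          matched = w;
          break;
        }
      }
      if (matched > 0) {
        i += matched;        // drop the whole window and resume after it
      } else {
        kept.add(tokens[i]); // keep the head token and advance by one
        i++;
      }
    }
    return kept.toArray(new String[0]);
  }

  // A window containing a null token never matches, so nulls are always kept.
  private boolean matches(String[] tokens, int start, int width) {
    List<String> window = new ArrayList<>(width);
    for (int k = start; k < start + width; k++) {
      if (tokens[k] == null) {
        return false;
      }
      window.add(tokens[k].toLowerCase(Locale.ROOT));
    }
    return entries.contains(window);
  }
}
```

With the entries "of the" and "the" registered, filtering x, of, the, y keeps only x and y, because the two-token window "of the" is tried before the single token "of".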
+ Tokenizer implementation can be wrapped in a
+ StopwordFilteringTokenizer to transparently remove
+ stopwords from its output. The decorator delegates tokenization to the
+ wrapped instance and then runs the resulting token array through the
+ provided StopwordFilter.
+ Tokenizer
+ is expected, including in downstream OpenNLP pipelines.
+ ObjectStream<String[]> (for example tokenized
+ sentences), the StopwordFilterStream adapter provides a
+ drop-in filter that strips stopwords from each emitted token array.
+ StopwordFilter command reads whitespace-separated tokens
+ from standard input and writes the non-stopword tokens to standard
+ output. The single argument is either the ISO 639-1 code of a bundled
+ list, or a path to a custom stopword list file (same format as the
+ Java API: one entry per line, with # comments and blank
+ lines ignored, loaded case-insensitively).
+ ./en.
+ The tool is intended for quick interactive checks and for use inside
+ shell pipelines, for example chained behind a tokenizer:
+ StopwordLists.forLanguage:
+ DictionaryStopwordFilter is immutable once constructed and
+ is therefore safe to share across threads without external
+ synchronization. A filter returned by
+ StopwordLists.forLanguage(...) or assembled via the
+ Builder can be stored in a static field and accessed
+ concurrently from any number of readers.
+ StopwordFilteringTokenizer and
+ StopwordFilterStream are also stateless decorators with
+ only final fields, so they inherit the thread-safety of the
+ components they wrap. When paired with a
+ DictionaryStopwordFilter and a thread-safe delegate
+ tokenizer (e.g. SimpleTokenizer.INSTANCE or
+ WhitespaceTokenizer.INSTANCE) the resulting pipeline is
+ fully thread-safe.
+