OPENNLP-660: Include list of stop words for various languages #1056
Open: rzo1 wants to merge 3 commits into main from OPENNLP-660
94 changes: 94 additions & 0 deletions
opennlp-api/src/main/java/opennlp/tools/stopword/StopwordFilter.java
@@ -0,0 +1,94 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package opennlp.tools.stopword;

import java.util.Set;

/**
 * A pluggable filter that decides whether a token (or a sequence of tokens)
 * is a stopword that should be removed during downstream text processing.
 * <p>
 * Implementations may be backed by a static bundled list, a user-supplied
 * file, an in-memory data structure, or any other source.
 * Both single-token and multi-token (n-gram) membership tests are supported.
 *
 * @see opennlp.tools.util.LanguageCodeValidator
 */
public interface StopwordFilter {

  /**
   * Checks whether the given token is a single-token stopword.
   * Equivalent to {@code isStopword(new String[] { token.toString() })} when
   * {@code token} is non-{@code null}.
   *
   * @param token The token to test. May be {@code null}, in which case
   *     implementations should return {@code false}.
   * @return {@code true} if {@code token} is registered as a single-token
   *     stopword, {@code false} otherwise.
   */
  boolean isStopword(final CharSequence token);

  /**
   * Checks whether the given sequence of tokens is a multi-token stopword
   * (n-gram). For a single token this is equivalent to
   * {@link #isStopword(CharSequence)}.
   *
   * @param tokens The tokens to test as one entry. May be {@code null} or
   *     empty, in which case implementations should return {@code false}.
   * @return {@code true} if the sequence is registered as a stopword,
   *     {@code false} otherwise.
   */
  boolean isStopword(final String... tokens);

  /**
   * Returns a copy of {@code tokens} with stopword matches removed,
   * preserving the input order.
   * <p>
   * Implementations should honor both 1-gram and n-gram entries. A
   * recommended strategy is a greedy left-to-right window scan: at each
   * position try the longest registered window first; if it matches, skip
   * those tokens; otherwise advance by one and keep the current token.
   * Implementations that do not support n-gram entries may fall back to
   * 1-gram filtering.
   *
   * @param tokens The input token array. Must not be {@code null}.
   *     Individual array elements may be {@code null} and are kept as-is.
   * @return A new array containing the surviving tokens. Never {@code null}.
   * @throws IllegalArgumentException if {@code tokens} is {@code null}.
   */
  String[] filter(final String[] tokens);

  /**
   * @return {@code true} if this filter performs case-sensitive matching;
   *     {@code false} if matching is case-insensitive.
   */
  boolean isCaseSensitive();

  /**
   * Returns an unmodifiable snapshot of the registered single-token
   * stopwords. Multi-token (n-gram) entries are not included in this view
   * and must be tested via {@link #isStopword(String...)}.
   * <p>
   * Attempts to mutate the returned {@link Set} will fail.
   *
   * @return An unmodifiable {@link Set} of stopwords. Never {@code null}.
   * @throws UnsupportedOperationException if a caller attempts to add to,
   *     remove from, or otherwise mutate the returned {@link Set}.
   */
  Set<String> stopwords();
}
```
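
The Javadoc of `filter` recommends a greedy left-to-right window scan. The following is a minimal sketch of that strategy, not the implementation from this pull request. The class `GreedyStopwordScan`, its `add` method, and the choice to lower-case tokens for case-insensitive matching are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration class; not part of the pull request. */
final class GreedyStopwordScan {

  // Each entry is a token sequence; a 1-gram is a list of length one.
  private final Set<List<String>> entries = new HashSet<>();
  private int maxWindow = 1;

  /** Registers a 1-gram or n-gram entry, normalized to lower case (assumption). */
  void add(final String... tokens) {
    final List<String> key = new ArrayList<>(tokens.length);
    for (final String t : tokens) {
      key.add(t.toLowerCase(Locale.ROOT));
    }
    entries.add(key);
    maxWindow = Math.max(maxWindow, tokens.length);
  }

  /** Returns a new array without stopword matches, preserving input order. */
  String[] filter(final String[] tokens) {
    if (tokens == null) {
      throw new IllegalArgumentException("tokens must not be null");
    }
    final List<String> kept = new ArrayList<>(tokens.length);
    int i = 0;
    while (i < tokens.length) {
      final int matched = longestMatchAt(tokens, i);
      if (matched > 0) {
        i += matched;        // skip the whole matched window
      } else {
        kept.add(tokens[i]); // no match: keep the token (null elements included)
        i++;
      }
    }
    return kept.toArray(new String[0]);
  }

  /** Length of the longest registered window starting at start, or 0 if none matches. */
  private int longestMatchAt(final String[] tokens, final int start) {
    final int limit = Math.min(maxWindow, tokens.length - start);
    for (int len = limit; len >= 1; len--) {
      final List<String> window = new ArrayList<>(len);
      boolean hasNull = false;
      for (int k = start; k < start + len; k++) {
        if (tokens[k] == null) {
          hasNull = true;    // a null element can never be part of a match
          break;
        }
        window.add(tokens[k].toLowerCase(Locale.ROOT));
      }
      if (!hasNull && entries.contains(window)) {
        return len;
      }
    }
    return 0;
  }
}
```

With the entries `the` and `as well as` registered, filtering `The cat as well as dog as` followed by a `null` element keeps `cat`, `dog`, the trailing `as`, and the `null`. Trying the longest window first is what lets the trigram win over any shorter entry that starts at the same position.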
148 changes: 148 additions & 0 deletions
...nlp-core/opennlp-cli/src/main/java/opennlp/tools/cmdline/stopword/StopwordFilterTool.java
@@ -0,0 +1,148 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package opennlp.tools.cmdline.stopword;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

import opennlp.tools.cmdline.BasicCmdLineTool;
import opennlp.tools.cmdline.CLI;
import opennlp.tools.cmdline.TerminateToolException;
import opennlp.tools.stopword.StopwordFilter;
import opennlp.tools.stopword.StopwordLists;

/**
 * A command line tool that filters stop words from whitespace-separated
 * tokens read on standard input and prints the kept tokens to standard
 * output, one input line per output line.
 *
 * <p>Usage: {@code opennlp StopwordFilter <lang|file>}. The single argument is
 * either an ISO 639 language code matching one of the bundled lists, or a path
 * to a custom stopword list file (one entry per line, {@code #} comments and
 * blank lines ignored, loaded case-insensitively). The tokens to filter are
 * always read from standard input. A bundled language code takes precedence;
 * to force loading a file whose name happens to be a language code, qualify it
 * with a path (e.g. {@code ./en}).
 */
public final class StopwordFilterTool extends BasicCmdLineTool {

  @Override
  public String getShortDescription() {
    return "filters stop words from tokens read on stdin";
  }

  @Override
  public String getHelp() {
    return "Usage: " + CLI.CMD + " " + getName() + " <lang|file>\n"
        + " <lang> ISO 639 code of a bundled list; supported: "
        + StopwordLists.supportedLanguages() + "\n"
        + " <file> path to a custom stopword list (one entry per line; "
        + "'#' comments and blank lines ignored)";
  }

  @Override
  public boolean hasParams() {
    return true;
  }

  @Override
  public void run(final String[] args) {
    if (args.length != 1) {
      System.out.println(getHelp());
      return;
    }

    final StopwordFilter filter = resolveFilter(args[0]);

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8));
         PrintWriter writer = new PrintWriter(
             new java.io.OutputStreamWriter(System.out, StandardCharsets.UTF_8))) {

      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          writer.println();
          continue;
        }
        final String[] tokens = line.split("\\s+");
        final String[] kept = filter.filter(tokens);
        writer.println(String.join(" ", kept));
      }

      writer.flush();
    } catch (final IOException e) {
      throw new TerminateToolException(1, "Error reading from stdin: " + e.getMessage(), e);
    }
  }

  /**
   * Resolves the {@code <lang|file>} argument to a {@link StopwordFilter}. A
   * bundled language code is preferred; otherwise the argument is treated as a
   * path to a custom stopword list file loaded via
   * {@link StopwordLists#load(InputStream, java.nio.charset.Charset, boolean)}.
   */
  private static StopwordFilter resolveFilter(final String source) {
    final StopwordFilter bundled = tryBundled(source);
    if (bundled != null) {
      return bundled;
    }

    final Path path;
    try {
      path = Paths.get(source);
    } catch (final InvalidPathException e) {
      throw new TerminateToolException(1, neitherMessage(source));
    }

    try (InputStream in = Files.newInputStream(path)) {
      return StopwordLists.load(in, StandardCharsets.UTF_8, false);
    } catch (final NoSuchFileException e) {
      throw new TerminateToolException(1, neitherMessage(source));
    } catch (final IOException e) {
      throw new TerminateToolException(1,
          "Error reading stopword list file '" + source + "': " + e.getMessage(), e);
    }
  }

  /**
   * @return A bundled {@link StopwordFilter} for {@code code}, or {@code null}
   *     if {@code code} is not a supported bundled ISO 639 language code.
   */
  private static StopwordFilter tryBundled(final String code) {
    try {
      return StopwordLists.forLanguage(code);
    } catch (final IllegalArgumentException e) {
      return null;
    }
  }

  private static String neitherMessage(final String source) {
    return "'" + source + "' is neither a supported language code "
        + StopwordLists.supportedLanguages() + " nor an existing file.";
  }
}
```
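
The tool's Javadoc describes the custom list format as one entry per line, with `#` comments and blank lines ignored, loaded case-insensitively. The implementation of `StopwordLists.load` is not shown in this part of the diff, so the sketch below only illustrates how such a format could be read. The class `StopwordListSketch`, the method `parse`, the treatment of `#` as a whole-line comment marker, and lower-casing as the normalization step are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration class; not the StopwordLists code from the pull request. */
final class StopwordListSketch {

  /**
   * Reads one entry per line, skipping blank lines and lines that start with '#'.
   * Entries are lower-cased, which is one simple way to support case-insensitive lookups.
   */
  static Set<String> parse(final Reader source) {
    final Set<String> entries = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        final String entry = line.trim();
        if (entry.isEmpty() || entry.startsWith("#")) {
          continue; // blank line or whole-line comment (assumed comment style)
        }
        entries.add(entry.toLowerCase(Locale.ROOT));
      }
    } catch (final IOException e) {
      // Wrapped so callers do not need to handle a checked exception in this sketch.
      throw new UncheckedIOException(e);
    }
    return entries;
  }
}
```

A caller could pass `new StringReader(...)` for in-memory text, or a reader over a file opened with an explicit charset such as UTF-8, which matches how the tool opens custom list files.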