[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary
## What changes were proposed in this pull request?

Add tokenize_ja_neologd UDF that uses Neologd dictionary for Kuromoji tokenization.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-305

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #235 from myui/neologd.
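The `stopWords` and `stopTags` arguments in the signature above suggest that tokens are dropped either by surface/base form or by part-of-speech tag. The following is only a conceptual sketch of that kind of filtering, not the UDF's actual implementation: `StopFilterSketch`, `Token`, `sampleTokens`, and `sampleStopTags` are hypothetical names, and the token list and part-of-speech tags are hand-written for illustration rather than real Kuromoji output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilterSketch {

    /** Hypothetical token holder: a base form plus a part-of-speech tag. */
    static final class Token {
        final String baseForm;
        final String posTag;

        Token(String baseForm, String posTag) {
            this.baseForm = baseForm;
            this.posTag = posTag;
        }
    }

    /** Keeps the base form of every token that matches neither stop list. */
    static List<String> filter(List<Token> tokens, Set<String> stopWords, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (stopWords.contains(t.baseForm) || stopTags.contains(t.posTag)) {
                continue; // dropped by word or by part-of-speech tag
            }
            kept.add(t.baseForm);
        }
        return kept;
    }

    /** Hand-written tokens for the example sentence (assumed, not tokenizer output). */
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("彼女", "名詞-代名詞-一般"),
            new Token("は", "助詞-係助詞"),
            new Token("ペンパイナッポーアッポーペン", "名詞-固有名詞-一般"),
            new Token("と", "助詞-格助詞-一般"),
            new Token("恋ダンス", "名詞-固有名詞-一般"),
            new Token("を", "助詞-格助詞-一般"),
            new Token("踊る", "動詞-自立"),
            new Token("た", "助動詞"),
            new Token("。", "記号-句点"));
    }

    /** Illustrative stop tags: particles, auxiliary verbs, and sentence-final punctuation. */
    static Set<String> sampleStopTags() {
        return new HashSet<>(Arrays.asList(
            "助詞-係助詞", "助詞-格助詞-一般", "助動詞", "記号-句点"));
    }

    public static void main(String[] args) {
        // No stop words here; only the tag-based filter removes tokens.
        System.out.println(filter(sampleTokens(), new HashSet<String>(), sampleStopTags()));
    }
}
```

With these assumed inputs, the tag filter alone leaves the four content words from the example sentence; passing a non-empty stop word set would remove further tokens by their base form.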
Showing 16 changed files with 1,019 additions and 65 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -31,3 +31,4 @@ docs/gitbook/node_modules/** | ||
**/derby.log | ||
**/LICENSE-*.txt | ||
**/Base91.java | ||
**/*.properties |