[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary
## What changes were proposed in this pull request?

Add a `tokenize_ja_neologd` UDF that uses the NEologd dictionary for Kuromoji tokenization.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-305

## How was this patch tested?

Unit tests and manual tests on EMR.

## How to use this feature?

```sql
tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```
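
The optional arguments follow the same signature as the existing `tokenize_ja` UDF (see the tokenizer docs updated by this PR); an illustrative sketch with an arbitrary stop-word list:

```sql
-- "search" mode with an explicit stop-word list; null keeps the default stop tags
select tokenize_ja_neologd("彼女は恋ダンスを踊った。", "search", array("は", "を"), null);
```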

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #235 from myui/neologd.
myui committed Apr 22, 2021
1 parent dc461c2 commit b56c477a20ef6d7be143cddc49d9f9f85e144b63
Showing 16 changed files with 1,019 additions and 65 deletions.
@@ -31,3 +31,4 @@ docs/gitbook/node_modules/**
**/derby.log
**/LICENSE-*.txt
**/Base91.java
**/*.properties
@@ -52,13 +52,5 @@ define_additional() {
read -p "Function name (e.g., 'hivemall_version'): " function_name
read -p "Class path (e.g., 'hivemall.HivemallVersionUDF'): " class_path

prefix="$(echo "$class_path" | cut -d'.' -f1,2)"
if [[ $prefix == 'hivemall.xgboost' ]]; then
define_all_as_permanent
define_additional
elif [[ $prefix == 'hivemall.nlp' ]]; then
define_additional
else
define_all
define_all_as_permanent
fi
define_all
define_all_as_permanent
@@ -122,6 +122,7 @@
<include>org.apache.lucene:lucene-analyzers-smartcn</include>
<include>org.apache.lucene:lucene-analyzers-common</include>
<include>org.apache.lucene:lucene-core</include>
<include>io.github.myui:lucene-analyzers-kuromoji-neologd</include>
<!-- hivemall-xgboost -->
<include>org.apache.hivemall:hivemall-xgboost</include>
<include>io.github.myui:xgboost4j</include>
@@ -1050,6 +1050,31 @@ Reference: <a href="https://papers.nips.cc/paper/3848-adaptive-regularization-of

- `tfidf(double termFrequency, long numDocs, const long totalNumDocs)` - Return a smoothed TFIDF score in double.

# NLP

- `stoptags_exclude(array<string> excludeTags [, const string lang='ja'])` - Returns stoptags excluding the given tags
```sql
SELECT stoptags_exclude(array('名詞-固有名詞', '形容詞'))
```

- `tokenize_cn(String line [, const list<string> stopWords])` - returns tokenized strings in array&lt;string&gt;

- `tokenize_ja(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)])` - returns tokenized strings in array&lt;string&gt;
```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","","","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
```

- `tokenize_ja_neologd(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)])` - returns tokenized strings in array&lt;string&gt;
```sql
select tokenize_ja_neologd("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","","","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
```

# Others

- `hivemall_version()` - Returns the version of Hivemall
@@ -28,31 +28,49 @@ tokenize(text input, optional boolean toLowerCase = false)

# Tokenizer for Non-English Texts

The Hivemall NLP module provides several non-English text tokenizer UDFs, described below.

First of all, you need to issue the following DDLs to use the NLP module. Note that the NLP module is not included in `hivemall-with-dependencies.jar`.

> add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;
> source /path/to/define-additional.hive;
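
If you would rather not source the whole DDL file, the tokenizer UDFs can also be registered individually. A minimal sketch, assuming the existing `hivemall.nlp.tokenizer.KuromojiUDF` class for `tokenize_ja` and a `KuromojiNEologdUDF` class for the new UDF in this PR (check `resources/ddl/define-additional.hive` for the authoritative entries):

```sql
-- register only the Japanese tokenizer UDFs for the current session (jar path is a placeholder)
add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;
create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
-- class name assumed by analogy with KuromojiUDF; verify against define-additional.hive
create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
```
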
## Japanese Tokenizer

The Japanese text tokenizer UDF uses [Kuromoji](https://github.com/atilika/kuromoji).

The signature of the UDF is as follows:

```sql
-- tokenize_ja uses the default Kuromoji dictionary
tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)
-- tokenize_ja_neologd uses mecab-ipadic-NEologd for its dictionary
tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)
```

> #### Note
> `tokenize_ja` is supported since Hivemall v0.4.1, and the fifth argument is supported in v0.5-rc.1 and later.
> `tokenize_ja_neologd` returns tokenized strings in an array by using the NEologd dictionary. [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) is a customized system dictionary for MeCab including new vocabularies extracted from many resources on the Web.

The following example shows the difference between tokenization with and without the NEologd dictionary:

```sql
select tokenize_ja("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
>["彼女","ペンパイナッポーアッポーペン","","ダンス","踊る"]
select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```

You can print the versions of the Kuromoji UDFs as follows:

```sql
select tokenize_ja();
> ["8.8.2"]
select tokenize_ja_neologd();
> ["8.8.2-20200910.2"]
```

The basic usage of `tokenize_ja` is as follows:

```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
```

> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
In addition, the third and fourth argument respectively allow you to use your own list of stop words and stop tags. For example, the following query simply ignores "kuromoji" (as a stop word) and noun word "分かち書き" (as a stop tag):
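The full query lies beyond this diff hunk; a sketch of what it looks like, assuming the ipadic stop tag `名詞-一般` covers 分かち書き (verify the tag against the POS output of your dictionary):

```sql
-- illustrative: drop "kuromoji" via the stop-word list and general nouns (名詞-一般) via the stop-tag list
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal",
                   array("kuromoji"), array("名詞-一般"));
```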
@@ -70,10 +88,10 @@ select tokenize_ja("kuromojiを使った分かち書きのテストです。", "
`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a useful UDF for getting [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) excluding the given part-of-speech tags, as seen below:


```sql
select stoptags_exclude(array("名詞-固有名詞"));
```

> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞"
,"助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞
","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助
@@ -106,16 +124,19 @@ If you have a large custom dictionary as an external file, `userDict` can also b
```sql
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
"https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt");
```
> ["日本","経済","新聞","関西","国際","空港"]

> #### Note
> The dictionary SHOULD be accessible through the http/https protocol, and it SHOULD be compressed using gzip with a `.gz` suffix, because the maximum dictionary size is limited to 32MB, the read timeout is set to 60 sec, and the connection must be established within 10 sec.
>
> If you want to use HTTP Basic Authentication, please use the following form: `https://user:password@www.siteurl.com/my_dict.txt.gz` (see Sec 3.1 of [rfc1738](https://www.ietf.org/rfc/rfc1738.txt)).
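
A sketch of a remote dictionary reference that satisfies these constraints (the host and file name below are placeholders):

```sql
-- hypothetical gzip-compressed user dictionary served over https with basic authentication
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
  "https://user:password@example.com/my_dict.txt.gz");
```
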
For detailed APIs, please refer to the Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.



## Part-of-speech

From Hivemall v0.6.0, the second argument can also accept the following option format:
@@ -16,7 +16,9 @@
specific language governing permissions and limitations
under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
@@ -33,6 +35,7 @@
<properties>
<main.basedir>${project.parent.basedir}</main.basedir>
<lucene.version>8.8.2</lucene.version>
<lucene-analyzers-kuromoji-neologd.version>8.8.2-20200910.2</lucene-analyzers-kuromoji-neologd.version>
</properties>

<dependencies>
@@ -109,6 +112,12 @@
<version>${lucene.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>io.github.myui</groupId>
<artifactId>lucene-analyzers-kuromoji-neologd</artifactId>
<version>${lucene-analyzers-kuromoji-neologd.version}</version>
<scope>compile</scope>
</dependency>

<!-- test scope -->
<dependency>
@@ -125,4 +134,13 @@

</dependencies>

<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
</build>

</project>
