
Add support for additional Japanese tokenizers #1786

Merged: 3 commits into flairNLP:master from feat/support-japanese-tokenizers on Aug 3, 2020

Conversation

@himkt (Contributor) commented Jul 30, 2020

Related: #1296

Hello all! I'm the author of #1267.
Recently, I updated konoha to add support for new tokenizers.
In this PR, I add support for two Japanese tokenizers, Janome and SudachiPy, to flair.tokenization.JapaneseTokenizer.

These tokenizers work without building any external software beyond pip install.
So I'm wondering if I can add built-in Japanese tokenization support to flair.
(Of course, it's not a strong opinion. I'd like feedback from the flair team!)
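
For reference, a minimal install sketch. The konoha[sudachi] extra is confirmed later in this thread; the janome extra is an assumption based on the "pip install konoha[{tokenizer_name}]" pattern shown in Case 2 below:

> pip install konoha[janome]   # Janome backend (assumed extra name)
> pip install konoha[sudachi]  # SudachiPy backend (confirmed below)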

I attach examples of using JapaneseTokenizer in a few cases:

Case 1. Newly available tokenizers for Japanese: Janome and SudachiPy

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("私はベルリンが好き", use_tokenizer=flair.tokenization.JapaneseTokenizer("janome")))
Sentence: "私 は ベルリン が 好き"   [− Tokens: 5]
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("sudachi", sudachi_mode="A")))
Sentence: "高輪 ゲートウェイ 駅"   [− Tokens: 3]
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("sudachi", sudachi_mode="C")))
Sentence: "高輪ゲートウェイ駅"   [− Tokens: 1]

Case 2. When konoha, the library for Japanese tokenization, is not installed (the message is almost the same as the current one).

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("janome")))
2020-07-31 01:15:32,426 ----------------------------------------------------------------------------------------------------
2020-07-31 01:15:32,426 ATTENTION! The library "konoha" is not installed!
2020-07-31 01:15:32,426 - If you want to use MeCab, install mecab with "sudo apt install mecab libmecab-dev mecab-ipadic".
2020-07-31 01:15:32,426 - Install konoha with "pip install konoha[{tokenizer_name}]"
2020-07-31 01:15:32,426   - You can choose tokenizer from ["mecab", "janome", "sudachi"].
2020-07-31 01:15:32,426 ----------------------------------------------------------------------------------------------------
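
The message above is produced by an import guard; a minimal sketch of the pattern (a hypothetical simplification, not the exact code in flair/tokenization.py):

import logging

log = logging.getLogger("flair")

try:
    import konoha  # optional dependency for Japanese tokenization
except ModuleNotFoundError:
    # Warn instead of failing hard, and point the user to the pip extras.
    log.warning("-" * 100)
    log.warning('ATTENTION! The library "konoha" is not installed!')
    log.warning('- Install konoha with "pip install konoha[{tokenizer_name}]"')
    log.warning('  - You can choose a tokenizer from ["mecab", "janome", "sudachi"].')
    log.warning("-" * 100)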

Case 3. If users specify a tokenizer that is not supported.

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("unknown_tokenizer")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/makoto-hiramatsu/work/github.com/himkt/flair/flair/tokenization.py", line 182, in __init__
    raise NotImplementedError(
NotImplementedError: Currently, unknown_tokenizer is only supported. Supported tokenizers: ['mecab', 'janome', 'sudachi'].
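
For context, the traceback corresponds to a name check in JapaneseTokenizer.__init__; a minimal sketch reconstructed from the traceback (a hypothetical helper, not the exact flair code — note the message interpolates the requested name, which is why it reads oddly):

SUPPORTED_TOKENIZERS = ["mecab", "janome", "sudachi"]

def check_tokenizer_name(tokenizer: str) -> None:
    # Reject tokenizer names that konoha does not support.
    if tokenizer not in SUPPORTED_TOKENIZERS:
        raise NotImplementedError(
            f"Currently, {tokenizer} is only supported. "
            f"Supported tokenizers: {SUPPORTED_TOKENIZERS}."
        )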

Thanks as always for maintaining flair!

@alanakbik (Collaborator)

Hello @himkt thanks for adding this! Yes, if the dependencies are fully installable through pip we can add it. I tested the example and the "janome" tokenizer works great. In order to run "sudachi" I first had to run pip install konoha[sudachi].

@himkt (Contributor, Author) commented Aug 3, 2020

Thank you for trying it out; I really appreciate it as always!

In order to run "sudachi" I first had to run pip install konoha[sudachi]

You're right. It's Case 2 in the example above.
(The message suggests installing konoha with a tokenizer extra, but the error message may not be straightforward.)

Please let me know if you have a good idea for guiding users to set up the environment more easily.

@alanakbik (Collaborator)

I think this is good! Thanks again for adding this!

@alanakbik alanakbik merged commit d53f522 into flairNLP:master Aug 3, 2020
@himkt himkt deleted the feat/support-japanese-tokenizers branch August 3, 2020 23:40