
Add support for additional Japanese tokenizers #1786

Merged: 3 commits into flairNLP:master from feat/support-japanese-tokenizers on Aug 3, 2020

Conversation

@himkt (Contributor) commented Jul 30, 2020

Related: #1296

Hello all! I'm the author of #1267.
Recently, I updated konoha to add support for new tokenizers.
In this PR, I add support for two Japanese tokenizers, Janome and SudachiPy, to flair.tokenization.JapaneseTokenizer.

These tokenizers work without building any external software beyond pip install.
So I'm wondering if I can add built-in Japanese tokenization support to flair.
(Of course, it's not a strong opinion. I'd like feedback from the flair team!)
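
For reference, a minimal install sketch. The konoha[sudachi] extra is confirmed later in this thread; the janome extra is an assumption based on the "pip install konoha[{tokenizer_name}]" pattern shown in Case 2 below:

> pip install konoha[janome]   # Janome backend (assumed extra name)
> pip install konoha[sudachi]  # SudachiPy backend (confirmed below)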

I attach examples of using JapaneseTokenizer in a few cases:

Case 1. Newly available tokenizers for Japanese: Janome and SudachiPy

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("私はベルリンが好き", use_tokenizer=flair.tokenization.JapaneseTokenizer("janome")))
Sentence: "私 は ベルリン が 好き"   [− Tokens: 5]
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("sudachi", sudachi_mode="A")))
Sentence: "高輪 ゲートウェイ 駅"   [− Tokens: 3]
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("sudachi", sudachi_mode="C")))
Sentence: "高輪ゲートウェイ駅"   [− Tokens: 1]

Case 2. When konoha, the library for Japanese tokenization, is not installed (the message is almost the same as the current one).

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("janome")))
2020-07-31 01:15:32,426 ----------------------------------------------------------------------------------------------------
2020-07-31 01:15:32,426 ATTENTION! The library "konoha" is not installed!
2020-07-31 01:15:32,426 - If you want to use MeCab, install mecab with "sudo apt install mecab libmecab-dev mecab-ipadic".
2020-07-31 01:15:32,426 - Install konoha with "pip install konoha[{tokenizer_name}]"
2020-07-31 01:15:32,426   - You can choose tokenizer from ["mecab", "janome", "sudachi"].
2020-07-31 01:15:32,426 ----------------------------------------------------------------------------------------------------
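
The message above is produced by an import guard; a minimal sketch of the pattern (a hypothetical simplification, not the exact code in flair/tokenization.py):

import logging

log = logging.getLogger("flair")

try:
    import konoha  # optional dependency for Japanese tokenization
except ModuleNotFoundError:
    # Warn instead of failing hard, and point the user to the pip extras.
    log.warning("-" * 100)
    log.warning('ATTENTION! The library "konoha" is not installed!')
    log.warning('- Install konoha with "pip install konoha[{tokenizer_name}]"')
    log.warning('  - You can choose a tokenizer from ["mecab", "janome", "sudachi"].')
    log.warning("-" * 100)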

Case 3. If users specify a tokenizer that is not supported.

> python
Python 3.8.5 (default, Jul 24 2020, 16:45:21)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair
>>> print(flair.data.Sentence("高輪ゲートウェイ駅", use_tokenizer=flair.tokenization.JapaneseTokenizer("unknown_tokenizer")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/makoto-hiramatsu/work/github.com/himkt/flair/flair/tokenization.py", line 182, in __init__
    raise NotImplementedError(
NotImplementedError: Currently, unknown_tokenizer is only supported. Supported tokenizers: ['mecab', 'janome', 'sudachi'].
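
For context, the traceback corresponds to a name check in JapaneseTokenizer.__init__; a minimal sketch reconstructed from the traceback (a hypothetical helper, not the exact flair code — note the message interpolates the requested name, which is why it reads oddly):

SUPPORTED_TOKENIZERS = ["mecab", "janome", "sudachi"]

def check_tokenizer_name(tokenizer: str) -> None:
    # Reject tokenizer names that konoha does not support.
    if tokenizer not in SUPPORTED_TOKENIZERS:
        raise NotImplementedError(
            f"Currently, {tokenizer} is only supported. "
            f"Supported tokenizers: {SUPPORTED_TOKENIZERS}."
        )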

Thanks as always for maintaining flair!

@alanakbik (Collaborator)

Hello @himkt thanks for adding this! Yes, if the dependencies are fully installable through pip we can add it. I tested the example and the "janome" tokenizer works great. In order to run "sudachi" I first had to run pip install konoha[sudachi].

@himkt (Contributor, Author) commented Aug 3, 2020

Thank you for trying it out; I really appreciate it as always!

In order to run "sudachi" I first had to run pip install konoha[sudachi]

You're right. It's Case 2 in the example above.
(The message suggests installing konoha with a tokenizer extra, but the error message may not be straightforward.)

Please let me know if you have a good idea for guiding users to set up the environment more easily.

@alanakbik (Collaborator)

I think this is good! Thanks again for adding this!

@alanakbik alanakbik merged commit d53f522 into flairNLP:master Aug 3, 2020
@himkt himkt deleted the feat/support-japanese-tokenizers branch August 3, 2020 23:40