
JapaneseTokenizer.pipe added #6515

Merged: 2 commits merged into explosion:master on Dec 8, 2020

Conversation

@KoichiYasuoka (Contributor) commented Dec 7, 2020

Description

For spacymoji with Japanese().

Types of change

To make JapaneseTokenizer.pipe usable in the same way as Tokenizer.pipe

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / tokenizer Feature: Tokenizer lang / ja Japanese language data and models labels Dec 7, 2020
@adrianeboyd (Contributor)
Hi, thanks for this PR! Looking at the details, I think it would make more sense to add it to DummyTokenizer so it fixes this issue for several languages at once (Chinese, Japanese, Korean, Thai), but then you also have to handle an unimplemented __call__ method in DummyTokenizer. I think you could add:

class DummyTokenizer(object):
    def __call__(self, text):
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        for text in texts:
            yield self(text)

    ...

If you wanted you could warn about n_threads as in Tokenizer.pipe, but I don't think this is a big deal. None of the tokenizers do any batching, and nlp.pipe() just calls nlp.make_doc(), so you don't notice the missing tokenizer.pipe() in most typical use cases.
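The suggested change can be sketched in isolation. The snippet below is a minimal, self-contained illustration of the fallback `pipe()` pattern from the comment above; `WhitespaceTokenizer` is a hypothetical stand-in subclass (a real subclass such as spaCy's JapaneseTokenizer would wrap SudachiPy instead), used here only to show that any subclass implementing `__call__` gets `pipe()` for free.

```python
class DummyTokenizer(object):
    """Base class for custom tokenizers, as suggested in the review comment."""

    def __call__(self, text):
        # Subclasses must implement actual tokenization.
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        # No batching: tokenize each text one at a time by calling
        # the subclass's __call__, mirroring Tokenizer.pipe's interface.
        for text in texts:
            yield self(text)


class WhitespaceTokenizer(DummyTokenizer):
    """Hypothetical subclass for illustration: splits on whitespace."""

    def __call__(self, text):
        return text.split()


tok = WhitespaceTokenizer()
docs = list(tok.pipe(["foo bar", "baz"]))
```

Because `pipe()` lives on the shared base class, the same fix covers every language whose tokenizer inherits from DummyTokenizer (Chinese, Japanese, Korean, Thai) at once.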

@adrianeboyd (Contributor)

Thanks, this looks good!

@KoichiYasuoka (Contributor, PR author) commented Dec 8, 2020

Thank you for your suggestion, @adrianeboyd. Thai() now seems to work with spacymoji as well. However, one of the CI checks, explosion.spaCy (Test Python35Windows), failed. Umm...

@adrianeboyd (Contributor)

Don't worry about the python 3.5 test, that's unrelated (and should be fixed by #6522).

@adrianeboyd adrianeboyd closed this Dec 8, 2020
@adrianeboyd adrianeboyd reopened this Dec 8, 2020
@adrianeboyd adrianeboyd merged commit 0afb54a into explosion:master Dec 8, 2020