
JapaneseTokenizer.pipe added #6515

Merged: 2 commits merged into explosion:master on Dec 8, 2020

Conversation

@KoichiYasuoka (Contributor) commented Dec 7, 2020

Description

For spacymoji with Japanese().

Types of change

To make JapaneseTokenizer.pipe usable in the same way as Tokenizer.pipe

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / tokenizer Feature: Tokenizer lang / ja Japanese language data and models labels Dec 7, 2020
@adrianeboyd (Contributor)
Hi, thanks for this PR! Looking at the details, I think it would make more sense to add it to DummyTokenizer so it fixes this issue for several languages at once (Chinese, Japanese, Korean, Thai), but then you also have to handle an unimplemented __call__ method in DummyTokenizer. I think you could add:

class DummyTokenizer(object):
    def __call__(self, text):
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        for text in texts:
            yield self(text)

    ...

If you wanted you could warn about n_threads as in Tokenizer.pipe, but I don't think this is a big deal. None of the tokenizers do any batching, and nlp.pipe() just calls nlp.make_doc(), so you don't notice the missing tokenizer.pipe() in most typical use cases.
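The suggested change can be sketched in isolation. The snippet below is a minimal, self-contained illustration of the fallback `pipe()` pattern from the comment above; `WhitespaceTokenizer` is a hypothetical stand-in subclass (a real subclass such as spaCy's JapaneseTokenizer would wrap SudachiPy instead), used here only to show that any subclass implementing `__call__` gets `pipe()` for free.

```python
class DummyTokenizer(object):
    """Base class for custom tokenizers, as suggested in the review comment."""

    def __call__(self, text):
        # Subclasses must implement actual tokenization.
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        # No batching: tokenize each text one at a time by calling
        # the subclass's __call__, mirroring Tokenizer.pipe's interface.
        for text in texts:
            yield self(text)


class WhitespaceTokenizer(DummyTokenizer):
    """Hypothetical subclass for illustration: splits on whitespace."""

    def __call__(self, text):
        return text.split()


tok = WhitespaceTokenizer()
docs = list(tok.pipe(["foo bar", "baz"]))
```

Because `pipe()` lives on the shared base class, the same fix covers every language whose tokenizer inherits from DummyTokenizer (Chinese, Japanese, Korean, Thai) at once.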

@adrianeboyd (Contributor)

Thanks, this looks good!

@KoichiYasuoka (Contributor, PR author) commented Dec 8, 2020

Thank you for your suggestion, @adrianeboyd. Thai() now seems to work with spacymoji as well. However, one of the CI checks, explosion.spaCy (Test Python35Windows), failed. Umm...

@adrianeboyd (Contributor)

Don't worry about the python 3.5 test, that's unrelated (and should be fixed by #6522).

@adrianeboyd adrianeboyd closed this Dec 8, 2020
@adrianeboyd adrianeboyd reopened this Dec 8, 2020
@adrianeboyd adrianeboyd merged commit 0afb54a into explosion:master Dec 8, 2020