Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make "text" key in JSONL format optional when "tokens" key is provided #3694

Closed
devforfu opened this issue May 7, 2019 · 4 comments
Closed
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome!

Comments

@devforfu
Copy link
Contributor

devforfu commented May 7, 2019

Feature description

I would like to build a pre-trained model using an already tokenized dataset. I store my data in JSONL format as described in the docs.

{"tokens": ["my", "tokenized", "data", "."]}
{"tokens": ["one", "more", "example", "."]}
...

As I understand, the text attribute is obligatory. (The CLI tool raises error trying to access text key). However, it seems that these lines use exactly one of these two keys:

def make_docs(nlp, batch, min_length, max_length):
    docs = []
    for record in batch:
        text = record["text"]
        if "tokens" in record:
            doc = Doc(nlp.vocab, words=record["tokens"])  # use tokens
        else:
            doc = nlp.make_doc(text)  # use "raw" text
        ... # the rest of the code

Is it possible to make the text key optional if tokens are provided? It is not a big deal to extend my data with something like {"text": null, "tokens": [...]} but it seems a bit clumsy taking into account that this fragment can be easily refactored to account both cases without explicitly asking for a missing key.

Or am I missing something and these keys are used in some other places also?

@BreakBB
Copy link
Contributor

BreakBB commented May 8, 2019

In the code the text key is used only optional, since it is just used in the else part. So moving text = record["text"] in the else part should solve your issue and would make pre-training require either text or tokens instead of text and optional tokens.

Moreover the heads key (see these lines) isn't documented at all in the CLI docs.

@devforfu
Copy link
Contributor Author

devforfu commented May 8, 2019

Yes, that's what I was talking about. The text attributed is forced while it is not really required and can be accessed only in the second branch of the conditional logic. Also, a good point about heads thing.

@ines ines added enhancement Feature requests and improvements feat / cli Feature: Command-line interface help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) labels May 10, 2019
@ines
Copy link
Member

ines commented May 10, 2019

Yes, good point – moving text = record["text"] into the else should be fine. If you want to submit a PR with this, that'd be great! 👍

@lock
Copy link

lock bot commented Jun 10, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome!
Projects
None yet
Development

No branches or pull requests

3 participants