Make "text" key in JSONL format optional when "tokens" key is provided #3694

devforfu · 2019-05-07T15:11:50Z

Feature description

I would like to build a pre-trained model using an already tokenized dataset. I store my data in JSONL format as described in the docs.

{"tokens": ["my", "tokenized", "data", "."]}
{"tokens": ["one", "more", "example", "."]}
...
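
For reference, a file in this shape can be read one record per line with the standard library alone. This is only a minimal sketch: `read_jsonl` is a hypothetical helper, not part of spaCy, and the in-memory stream stands in for a real file.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a JSONL file that holds only "tokens" records.
sample = io.StringIO(
    '{"tokens": ["my", "tokenized", "data", "."]}\n'
    '{"tokens": ["one", "more", "example", "."]}\n'
)
records = list(read_jsonl(sample))
```

None of these records carries a "text" key, which is the case discussed below.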

As I understand it, the text attribute is obligatory (the CLI tool raises an error when it tries to access the text key). However, it seems that these lines use exactly one of these two keys:

def make_docs(nlp, batch, min_length, max_length):
    docs = []
    for record in batch:
        text = record["text"]
        if "tokens" in record:
            doc = Doc(nlp.vocab, words=record["tokens"])  # use tokens
        else:
            doc = nlp.make_doc(text)  # use "raw" text
        ... # the rest of the code

Is it possible to make the text key optional if tokens are provided? It is not a big deal to extend my data with something like {"text": null, "tokens": [...]}, but that seems a bit clumsy, given that this fragment can easily be refactored to handle both cases without requiring a key that is never used.

Or am I missing something, and are these keys also used in other places?


BreakBB · 2019-05-08T06:21:48Z

In the code, the text key is effectively optional already, since it is only used in the else branch. Moving text = record["text"] into the else branch should solve your issue, and pre-training would then require either text or tokens instead of text plus optional tokens.
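
A minimal sketch of that change, under some assumptions: the `from_tokens` and `from_text` parameters are hypothetical stand-ins for `Doc(nlp.vocab, words=...)` and `nlp.make_doc(...)`, so the example runs without spaCy, and the length filtering and the rest of the original function are left out.

```python
def make_docs(records, from_tokens, from_text):
    """Build one doc per record, reading "text" only when "tokens" is absent.

    from_tokens and from_text are hypothetical stand-ins for
    Doc(nlp.vocab, words=...) and nlp.make_doc(...).
    """
    docs = []
    for record in records:
        if "tokens" in record:
            doc = from_tokens(record["tokens"])  # pre-tokenized input
        else:
            # Only this branch needs "text", so the lookup lives here.
            doc = from_text(record["text"])
        docs.append(doc)
    return docs


# Stand-in builders so the sketch runs without spaCy installed.
records = [
    {"tokens": ["my", "tokenized", "data", "."]},
    {"text": "one more example ."},
]
docs = make_docs(records, from_tokens=list, from_text=str.split)
```

With the lookup inside the else branch, a record that has only "tokens" no longer raises a KeyError, while a record with neither key still fails.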

Moreover, the heads key (see these lines) isn't documented at all in the CLI docs.

devforfu · 2019-05-08T15:22:56Z

Yes, that's what I was talking about. The text attribute is required even though it is not really needed: it is only accessed in the second branch of the conditional. Also, good point about the heads key.

ines · 2019-05-10T12:04:16Z

Yes, good point – moving text = record["text"] into the else should be fine. If you want to submit a PR with this, that'd be great! 👍

lock · 2019-06-10T16:47:32Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the enhancement, feat / cli, help wanted, and help wanted (easy) labels on May 10, 2019

devforfu mentioned this issue May 10, 2019

Make "text" key in JSONL format optional when "tokens" key is provided #3721

Merged

3 tasks

ines closed this as completed May 11, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2019
