Make "text" key in JSONL format optional when "tokens" key is provided #3694
Labels
enhancement
Feature requests and improvements
feat / cli
Feature: Command-line interface
help wanted (easy)
Contributions welcome! (also suited for spaCy beginners)
help wanted
Contributions welcome!
Feature description
I would like to build a pre-trained model using an already tokenized dataset. I store my data in JSONL format as described in the docs.
As I understand it, the "text" attribute is obligatory (the CLI tool raises an error when trying to access the "text" key). However, it seems that these lines use exactly one of these two keys.

Is it possible to make the "text" key optional if tokens are provided? It is not a big deal to extend my data with something like {"text": null, "tokens": [...]}, but it seems a bit clumsy, given that this fragment can easily be refactored to handle both cases without explicitly requiring a missing key.

Or am I missing something, and are these keys used in other places as well?
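To illustrate the kind of refactoring I have in mind, here is a minimal sketch in plain Python. It is not spaCy's actual code: the helper name `words_from_record` and the `tokenize` parameter are hypothetical, and `str.split` only stands in for a real tokenizer. The idea is simply to look at "tokens" first and fall back to "text" only when no tokens are given.

```python
import json


def words_from_record(record, tokenize=str.split):
    """Return the token list for one JSONL record (hypothetical helper).

    Prefers the "tokens" key; falls back to tokenizing "text".
    """
    tokens = record.get("tokens")
    if tokens is not None:
        # Pre-tokenized input: "text" is never accessed, so it may be absent.
        return list(tokens)
    text = record.get("text")
    if text is None:
        raise ValueError("record needs either a 'tokens' or a 'text' key")
    # Assumption: `tokenize` is a stand-in for the real tokenizer.
    return tokenize(text)


lines = [
    '{"tokens": ["Already", "tokenized", "."]}',
    '{"text": "Raw text only"}',
    '{"text": null, "tokens": ["Both", "keys"]}',
]
parsed = [words_from_record(json.loads(line)) for line in lines]
```

With a check like this, records that contain only "tokens" would be accepted, and records with only "text" would keep working as before.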