sent_id format and parallel treebanks #321
In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header (comment) line in a standardized way, e.g.
in short: bundle_id/zone
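To make the `bundle_id/zone` shorthand concrete, here is a minimal Python sketch of how a tool might split such an ID. It is only an illustration, not the validator's actual logic. The names `SentId` and `split_sent_id` are hypothetical, and the sketch assumes that the zone part is optional and that an ID contains at most one slash; the full syntax is not spelled out in this thread.

```python
from typing import NamedTuple, Optional


class SentId(NamedTuple):
    """Hypothetical container for the two parts of a sentence ID."""
    bundle_id: str
    zone: Optional[str]  # None when the ID has no "/zone" suffix (assumed optional)


def split_sent_id(sent_id: str) -> SentId:
    """Split 'bundle_id/zone' at the slash.

    Assumptions (not confirmed by the thread): the zone is optional,
    and at most one slash may appear in an ID.
    """
    bundle_id, sep, zone = sent_id.partition("/")
    if not bundle_id:
        raise ValueError(f"empty bundle_id in sent_id {sent_id!r}")
    if sep and not zone:
        raise ValueError(f"slash without a zone in sent_id {sent_id!r}")
    if "/" in zone:
        raise ValueError(f"more than one slash in sent_id {sent_id!r}")
    return SentId(bundle_id, zone if sep else None)
```

With these assumptions, `split_sent_id("s1/en")` gives a bundle ID of `s1` with zone `en`, while a plain `s1` has no zone. Existing IDs that already use the slash for something else would either be misread or rejected by a check like this.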
I know not everyone needs to work with (multi-) parallel treebanks stored in one file, so this proposal may sound too complex. However, note that
From the UD perspective, this proposal just reserves certain characters ("/") for specialized usage. That usage goes beyond the current scope of the UD project, yet I find it useful, and it has actually been deployed already. We will have to modify the sentence IDs in Arabic UD, where the slash is used, but that should not be a problem.
The specification should go into version 2 of the UD guidelines. (To keep the format.md page focused on UD, I would just say that the slash has a special meaning in IDs, and put the details on a separate page linked from there. However, the validator would have to check the entire syntax.)
Updated to add
which I hope is sufficient for the initial release. If anyone is interested in documenting and linking these use cases, please do!