sent_id format and parallel treebanks #321

Closed
martinpopel opened this Issue Jul 6, 2016 · 5 comments

Member

martinpopel commented Jul 6, 2016

In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header comment in a standardized way, e.g. # sent_id = 123. This issue is about the format of the ID itself (i.e. the 123 part) and about the related question of storing parallel treebanks in CoNLL-U.

My motivation

  • The CoNLL-U format should be used not only for storing UD treebanks (frozen in v1.2, v1.3, etc.) but also as a data interchange format for various NLP tools, in all intermediate stages of the pipeline. See #242.
  • I would like to store parallel treebanks with word alignment in CoNLL-U format. For many reasons (e.g. efficient parallel processing, serialization, streaming, consistency and alignment) it is useful to have all the languages in one file (interleaved as sent1-langA, sent1-langB, sent2-langA, sent2-langB, etc.). We plan to release the Czech-English treebank CzEng 1.6, with 62M sentences, in this format. See a sample.
    By parallel treebanks I mean not only different languages and paraphrases, but also alternative annotations of the same sentence, e.g. gold and automatic.
  • I would like to store word-alignment and coreference links (and possibly other types of relations) in CoNLL-U files. Coreference can cross sentence boundaries, which has some consequences for IDs. I plan to open a separate issue for this soon.
  • I would like to keep the CoNLL-U format simple (not bloated like CoNLL2009).

My proposal

in short: bundle_id/zone
An example of a valid sent_id is f123-s9/en_udpipe.

  • The first part (f123-s9) is called the bundle_id. In parallel treebanks it is shared by all translations of the same sentence (which together form a so-called bundle). The internal structure of bundle_id can reflect the original treebank numbering, e.g. here f123 is the filename and s9 is the 9th bundle in that file. I suggest restricting bundle_id to the regex ^[a-zA-Z0-9_-]+$. We can make it less strict if needed for some legacy data, but it should contain neither whitespace nor a slash.
  • The second part (en_udpipe) is the so-called zone. It can be omitted in treebanks where each bundle has just one zone (the zone is then an empty string). If present, it must be separated from the bundle_id by a slash and it must match the regex ^[a-z-]+(_[a-zA-Z0-9-]+)?$. The internal structure of zone is language_selector, where the _selector part is optional.
  • language is an ISO 639 (or rather IETF) language code.
  • selector is any string matching ^[a-zA-Z0-9-]+$. It makes it possible to store parallel sentences in the same language. E.g. udpipe indicates that the tree was parsed using UDPipe. Another example: the selectors ref and mt may distinguish a reference translation from a machine translation.
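
The rules above can be sketched as a small Python parser. This is only an illustration of the proposal, not code from Udapi or from the UD validator: the names SentId and parse_sent_id are made up for this example, and rejecting a trailing slash with an empty zone (e.g. "f123-s9/") is my own assumption, since the proposal does not cover that case.

```python
import re
from typing import NamedTuple

# Regexes taken from the proposal; fullmatch() supplies the ^...$ anchoring.
BUNDLE_RE = re.compile(r"[a-zA-Z0-9_-]+")
ZONE_RE = re.compile(r"[a-z-]+(_[a-zA-Z0-9-]+)?")


class SentId(NamedTuple):
    """Hypothetical container for the parts of a sent_id."""
    bundle_id: str
    language: str  # empty string when the zone is omitted
    selector: str  # empty string when there is no selector


def parse_sent_id(sent_id: str) -> SentId:
    """Split 'bundle_id/zone' into its parts, raising ValueError if invalid."""
    bundle_id, slash, zone = sent_id.partition("/")
    if not BUNDLE_RE.fullmatch(bundle_id):
        raise ValueError(f"invalid bundle_id in {sent_id!r}")
    if not slash:
        # No zone given: valid for treebanks with one zone per bundle.
        return SentId(bundle_id, "", "")
    # Assumption: a slash followed by an empty zone is treated as an error.
    if not ZONE_RE.fullmatch(zone):
        raise ValueError(f"invalid zone in {sent_id!r}")
    language, _, selector = zone.partition("_")
    return SentId(bundle_id, language, selector)
```

For example, parse_sent_id("f123-s9/en_udpipe") yields the bundle_id f123-s9, the language en and the selector udpipe, while a plain integer ID such as "123" is accepted with an empty zone.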

Notes

I know not everyone needs to work with (multi-)parallel treebanks stored in one file, so this proposal may sound too complex. However, note that:

  • You can use simple IDs (e.g. integers) as sent_id and just one language (one zone) per file. It is still valid according to the proposal.
  • I think IDs should be optional in CoNLL-U (though I would like to see them in all UD v2 treebanks). All UD-compatible tools should handle files without IDs. This proposal is just for those who need IDs, so that they use them in the same standardized way, allowing interoperability.
  • We have a real need for such a format (e.g. releasing the CzEng treebank in CoNLL-U, evaluation and visualization tools, an MT system).
  • We are working on a Python+Perl+Java API for UD called Udapi, which benefits from the proposal and also makes it easy to use (e.g. extract trees from one zone and store in a separate file). We want to invite the UD community to contribute to Udapi soon.
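
To make the "extract trees from one zone" use case concrete, here is a rough Python sketch that works directly on CoNLL-U text. It does not use or imitate the Udapi API; iter_sentences, zone_of and extract_zone are hypothetical helper names, and the sample sentences are invented. It assumes sentences are separated by blank lines and that the ID is given as a "# sent_id = ..." comment.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: the ID comment looks like "# sent_id = f123-s9/en_udpipe".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of lines (blocks split on blank lines)."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def zone_of(block: List[str]) -> str:
    """Return the zone part of the block's sent_id, or '' if there is none."""
    for line in block:
        match = SENT_ID_RE.match(line)
        if match:
            return match.group(1).partition("/")[2]
    return ""


def extract_zone(lines: Iterable[str], zone: str) -> List[List[str]]:
    """Keep only the sentences whose sent_id belongs to the given zone."""
    return [block for block in iter_sentences(lines) if zone_of(block) == zone]


# Invented interleaved sample: two bundles, each with an en and a cs zone.
SAMPLE = (
    "# sent_id = s1/en\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s1/cs\n"
    "1\tAhoj\tahoj\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s2/en\n"
    "1\tThanks\tthanks\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s2/cs\n"
    "1\tDíky\tdíky\tINTJ\t_\t_\t0\troot\t_\t_\n"
)

czech_only = extract_zone(SAMPLE.splitlines(), "cs")
```

Writing the selected blocks back out, separated by blank lines, gives a single-zone file. Sentences without a sent_id are treated here as belonging to the empty zone, which is a choice of this sketch rather than part of the proposal.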

Member

dan-zeman commented Aug 13, 2016

+1

From the UD perspective, this proposal just reserves certain characters ("/") for specialized usage, which goes beyond the current scope of the UD project, yet I find it useful, and it has actually been deployed already. We will have to modify sentence IDs in Arabic UD, where the slash is used, but that should not be a problem.

The specification should go into version 2 of the UD guidelines. (To keep the format.md page focused on UD, I would just say that the slash has a special meaning in IDs, and put the details on a separate page linked from there. However, the validator would have to check the entire syntax.)

Contributor

manning commented Sep 26, 2016

Approve!

dan-zeman referenced this issue in UniversalDependencies/UD_v2 Nov 13, 2016: Support for alignment and standoff annotation #25 (Open)

Member

spyysalo commented Nov 30, 2016

The "slash is special" constraint made it to http://universaldependencies.org/v2/conll-u.html but doesn't appear in the current v2 draft of the format page. Was this rejected or just forgotten? ( @jnivre )

Contributor

jnivre commented Nov 30, 2016

Just forgotten. Can you add it?

spyysalo self-assigned this Nov 30, 2016

Member

spyysalo commented Nov 30, 2016

Updated to add:

In sentence ids, the slash character ("/") is reserved for specialized downstream use and should be avoided in UD treebanks.

which I hope is sufficient for the initial release. If anyone is interested in documenting and linking these use cases, please do!
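
For treebank maintainers who want to check their own data against that sentence, a minimal Python sketch might look like the following. It is not the UD validator and does not reflect its implementation; sent_ids_with_slash is a hypothetical helper, and it assumes IDs appear as "# sent_id = ..." comment lines.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the ID comment looks like "# sent_id = 123".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def sent_ids_with_slash(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, sent_id) for every sent_id containing a slash."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = SENT_ID_RE.match(line)
        if match and "/" in match.group(1):
            hits.append((lineno, match.group(1)))
    return hits
```

Running it over the lines of a CoNLL-U file returns an empty list when no sentence ID uses the reserved character.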
