Releases: amir-zeldes/gum
V12.1.0 - Closed book summaries, UD v2.18
This release adds closed-book summaries to dev set documents, contains minor corrections, and is meant to be used for parity with Universal Dependencies release v2.18.
- Added 1-3 closed book summaries (written from memory) to dev set documents
- Content-identical with UD v2.18
- Numerous corrections
V12.0.0 - New documents, new bridging anaphora scheme
This is the initial release of GUM series 12:
- Added GUM V12 documents
- Completely reworked GUMBridge bridging annotation scheme:
- Manual re-annotation effort of the entire corpus
- Much more densely and consistently annotated using new guidelines
- 11 subtypes of bridging anaphora
- Multiple concurrent bridging subtypes are now possible
The entire corpus now contains 291,056 tokens, including the GENTLE test2 partition.
V11.1.0 - minor corrections, UD v2.16
This release contains minor corrections and is meant to be used for parity with Universal Dependencies release v2.16.
- Content identical with UD v2.16
- Numerous corrections
V11.0.0 - New documents, five summaries and graded salience
This is the initial release of GUM series 11:
- Added GUM V11 documents
- GENTLE data is now merged into the GUM repo as test2 partition
- Added graded salience annotations (this changes the tsv format)
- Added 5 summaries per document (also appear in tsv format)
- New dependency subtype nmod:desc for titles etc.
The entire corpus now contains 268,208 tokens, including the GENTLE test2 partition.
V10.2.0 - PDTB style discourse relations and more
- Content identical with UD v2.15
- Add GDTB version of discourse relation annotations following PDTB v3 guidelines
- Change :npmod and :tmod UD subtypes to :unmarked
- Support gold status for eRST signals in .rs4 format
- New DISRPT rels format with relation types (explicit/implicit) and raw text columns
- Use chain discourse dependency algorithm for DISRPT data
- Numerous corrections
This is the final release for GUM series 10, to be followed by V11 with new data this winter.
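Code written against older releases may still look for the :npmod and :tmod subtypes that this release folds into :unmarked. The following is a minimal sketch of how such labels could be migrated, not the project's own conversion script. The function names are hypothetical, and it assumes standard 10-column CoNLL-U token lines with the relation in the 8th column. The enhanced dependencies column (DEPS) is deliberately left untouched.

```python
# Old UD relation subtypes that V10.2.0 replaces with ":unmarked" (per the release note).
OLD_SUBTYPES = ("npmod", "tmod")


def migrate_deprel(deprel: str) -> str:
    """Rewrite a label such as 'obl:tmod' to 'obl:unmarked'; leave other labels alone."""
    head, sep, subtype = deprel.partition(":")
    if sep and subtype in OLD_SUBTYPES:
        return f"{head}:unmarked"
    return deprel


def migrate_conllu_line(line: str) -> str:
    """Apply migrate_deprel to the DEPREL column (8th field) of a CoNLL-U token line."""
    # Blank lines and comment lines carry no dependency relation.
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Anything that is not a 10-column token line is passed through unchanged.
    if len(fields) != 10:
        return line
    fields[7] = migrate_deprel(fields[7])
    return "\t".join(fields) + newline
```

Only the subtype after the colon is inspected, so labels such as nmod:poss or plain nsubj pass through unchanged.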
V10.1.0 - corrections and minor updates
This is a corrected version of GUM series 10 (no additional documents since V10.0.0)
- Added ExtPos to multiword fixed expressions
- Revised Cxn annotations to follow latest UCxn standard for construction annotation
- Content-identical with UD v2.14
V10.0.0 - added court, essay, letter and podcast genres
This is the first release of GUM series 10, with 16 genres in total.
- Four new growing genres:
- court - courtroom transcripts
- essay - argumentative essays
- letter - personal and professional correspondence on paper (not e-mails)
- podcast - podcast on various topics
- Many corrections to all annotation layers
Note on document names compared to V9:
- With the addition of the court genre, one conversation from GUM V9 which is actually from courtroom proceedings has been moved to the new court genre (GUM_conversation_court -> GUM_court_carpet)
- To compensate for the removed conversation, an additional conversation has been added in V10: GUM_conversation_toys
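Scripts that key results by document name may need to account for this rename when comparing V9 and V10. A minimal sketch, with hypothetical helper names, and assuming that the genre is the second underscore-separated part of a document name (as in the names listed above):

```python
# Document renamed between V9 and V10, per the note above.
RENAMED_V9_TO_V10 = {"GUM_conversation_court": "GUM_court_carpet"}


def to_v10_name(doc_id: str) -> str:
    """Map a V9 document name to its V10 name; unaffected names are returned as-is."""
    return RENAMED_V9_TO_V10.get(doc_id, doc_id)


def genre_of(doc_id: str) -> str:
    """Return the genre, assumed to be the second underscore-separated part of the name."""
    return doc_id.split("_")[1]
```

With this mapping, the moved document is counted under court rather than conversation, while GUM_conversation_toys is a new V10 document and has no V9 counterpart.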
V9.2.0 - RST++, MSeg and CxG
This is the final release of the GUM 9.X series, which is the basis for the contents of the equivalent Universal Dependencies release v2.13. New in this version:
- Enhanced Rhetorical Structure Theory annotations using RST++:
- Additional, tree-breaking secondary discourse relations
- Annotation of connectives and many other signaling devices for discourse relations
- Morphological segmentation based on Unimorph in the MSeg annotation (e.g. un-break-able)
- Construction Grammar annotation of constructions in the Cxn annotation
- A second human-written summary for each document in the test set
- Numerous corrections and consistency improvements bringing this corpus and the English Web Treebank (EWT) closer
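As an illustration of the morphological segmentation mentioned above, the sketch below splits a segmented form such as un-break-able into its morphs. It assumes the segmentation is stored as an MSeg=... entry in the MISC column of the CoNLL-U files, with hyphens as segment delimiters. The function names are hypothetical, and words that themselves contain hyphens would need extra care.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U MISC field such as 'A=1|B=2' into a dict; '_' means empty."""
    if not misc or misc == "_":
        return {}
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def morphs(misc: str) -> list:
    """Return the morph segments of an assumed 'MSeg=un-break-able' entry, else []."""
    mseg = parse_misc(misc).get("MSeg")
    return mseg.split("-") if mseg else []
```

For example, morphs("MSeg=un-break-able") yields the three segments un, break and able, and a token without an MSeg entry yields an empty list.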
V9.1.0 - Numerous corrections
- Numerous corrections to all layers
- Consistency improved with other LDC and UD English corpora
- Added xpos tag GW for goeswith handling as in EWT
- MWT fixed for "let's"
- Label consistency with EWT for assigning iobj without obj
- Many RST corrections for the DISRPT shared task
- Data in this version is even with the UD v2.12 release
V9.0.0 - new data, summaries and entity salience
- 20 documents added including more conversational data (total tokens: 203,879)
- Abstractive summaries for each document in metadata
- Annotations for most salient entities in each document
- Foreign language tags identify individual source languages
- New process for reconstructing Reddit text data in top-level folders (see README.md)
- Many corrections to all annotation layers