Releases: amir-zeldes/gum

V12.1.0 - Closed book summaries, UD v2.18

02 May 21:01
22fdf87

This release adds closed-book summaries to dev set documents, contains minor corrections, and is meant to be used for parity with Universal Dependencies release v2.18.

  • Added 1-3 closed-book summaries (written from memory) to dev set documents
  • Content-identical with UD v2.18
  • Numerous corrections

V12.0.0 - New documents, new bridging anaphora scheme

03 Mar 22:15
dcaa405

This is the initial release of GUM series 12:

  • Added GUM V12 documents
  • Completely reworked GUMBridge bridging annotation scheme:
    • Manual re-annotation effort of the entire corpus
    • Much more densely and consistently annotated using new guidelines
    • 11 subtypes of bridging anaphora
    • Multiple concurrent bridging subtypes are now possible

The entire corpus now contains 291,056 tokens, including the GENTLE test2 partition.

V11.1.0 - minor corrections, UD v2.16

12 May 16:01
72ce5ac

This release contains minor corrections and is meant to be used for parity with Universal Dependencies release v2.16.

  • Content-identical with UD v2.16
  • Numerous corrections

V11.0.0 - New documents, five summaries and graded salience

13 Mar 20:34
1f87216

This is the initial release of GUM series 11:

  • Added GUM V11 documents
  • GENTLE data is now merged into the GUM repo as test2 partition
  • Added graded salience annotations (this changes the tsv format)
  • Added 5 summaries per document (also appear in tsv format)
  • New dependency subtype nmod:desc for titles etc.

The entire corpus now contains 268,208 tokens, including the GENTLE test2 partition.

V10.2.0 - PDTB style discourse relations and more

27 Nov 22:57
aaa74a3

V10.2.0

  • Content-identical with UD v2.15
  • Add GDTB version of discourse relation annotations following PDTB v3 guidelines
  • Change :npmod and :tmod UD subtypes to :unmarked
  • Support gold status for eRST signals in .rs4 format
  • New DISRPT rels format with relation types (explicit/implicit) and raw text columns
  • Use chain discourse dependency algorithm for DISRPT data
  • Numerous corrections
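
For users with scripts or data keyed to the old labels, the subtype change above can be sketched as a small relabeling step. This is an illustrative sketch only, not the project's own conversion code: the helper names `update_deprel` and `update_conllu_line` are hypothetical, and it assumes the change applies uniformly to any relation ending in :npmod or :tmod and that token lines have the standard ten tab-separated CoNLL-U columns.

```python
import re

# Assumption: every relation ending in :npmod or :tmod becomes :unmarked.
_OLD_SUBTYPE = re.compile(r":(npmod|tmod)$")


def update_deprel(deprel: str) -> str:
    """Map an old-style subtype (e.g. obl:tmod) to the :unmarked subtype."""
    return _OLD_SUBTYPE.sub(":unmarked", deprel)


def update_conllu_line(line: str) -> str:
    """Rewrite the DEPREL column (8th field) of a CoNLL-U token line.

    Comment lines, blank lines, and lines without ten columns are
    returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line
    cols[7] = update_deprel(cols[7])
    return "\t".join(cols) + newline


if __name__ == "__main__":
    token = "3\ttoday\ttoday\tNOUN\tNN\tNumber=Sing\t2\tobl:tmod\t_\t_"
    print(update_conllu_line(token))
```

The sketch touches only the basic DEPREL column; relations that also appear in the enhanced dependencies column (DEPS) would need the same treatment.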

This is the final release for GUM series 10, to be followed by V11 with new data this winter.

V10.1.0 - corrections and minor updates

16 May 20:45
9df08e9

This is a corrected version of GUM series 10 (no additional documents since V10.0.0)

  • Added ExtPos to multiword fixed expressions
  • Revised Cxn annotations to follow latest UCxn standard for construction annotation
  • Content-identical with UD v2.14

V10.0.0 - added court, essay, letter and podcast genres

15 Feb 20:19
e7491c8

This is the first release of GUM series 10, with 16 genres in total.

  • Four new growing genres:
    • court - courtroom transcripts
    • essay - argumentative essays
    • letter - personal and professional correspondence on paper (not e-mails)
    • podcast - podcasts on various topics
  • Many corrections to all annotation layers

Note on document names compared to V9:

  • With the addition of the court genre, one conversation from GUM V9 which is actually from courtroom proceedings has been moved to the new court genre (GUM_conversation_court -> GUM_court_carpet)
  • To compensate for the removed conversation, an additional conversation has been added in V10: GUM_conversation_toys
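
Code that refers to V9 document names may need a small compatibility shim for the rename noted above. The following is a minimal sketch, not part of the GUM tooling: the names `V9_TO_V10_RENAMES`, `v10_document_name` and `genre_of` are hypothetical, and `genre_of` assumes document names follow the `GUM_<genre>_<title>` pattern seen in the names listed here.

```python
# The only rename stated in the V10.0.0 release note.
V9_TO_V10_RENAMES = {
    "GUM_conversation_court": "GUM_court_carpet",
}


def v10_document_name(v9_name: str) -> str:
    """Return the V10 name for a V9 document; unchanged names pass through."""
    return V9_TO_V10_RENAMES.get(v9_name, v9_name)


def genre_of(doc_name: str) -> str:
    """Extract the genre, assuming the GUM_<genre>_<title> naming pattern."""
    return doc_name.split("_")[1]


if __name__ == "__main__":
    new_name = v10_document_name("GUM_conversation_court")
    print(new_name, genre_of(new_name))
```

Keeping the mapping in a dictionary makes it easy to extend if later releases move further documents between genres.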

V9.2.0 - RST++, MSeg and CxG

10 Nov 16:29
3b0ab7d

This is the final release of the GUM 9.X series, which is the basis for the contents of the equivalent Universal Dependencies release v2.13. New in this version:

  • Enhanced Rhetorical Structure Theory annotations using RST++:
    • Additional, tree-breaking secondary discourse relations
    • Annotation of connectives and many other signaling devices for discourse relations
  • Morphological segmentation based on Unimorph in the MSeg annotation (e.g. un-break-able)
  • Construction Grammar annotation of constructions in the Cxn annotation
  • A second human-written summary for each document in the test set
  • Numerous corrections and consistency improvements bringing this corpus and the English Web Treebank (EWT) closer together

V9.1.0 - Numerous corrections

05 May 16:42
b153503

  • Numerous corrections to all layers
  • Consistency improved with other LDC and UD English corpora
    • Added xpos tag GW for goeswith handling as in EWT
    • MWT fixed for "let's"
    • Label consistency with EWT for assigning iobj without obj
    • Many RST corrections for the DISRPT shared task
  • Data in this version matches the UD v2.12 release

V9.0.0 - new data, summaries and entity salience

02 Feb 18:55
5f724df

  • 20 documents added including more conversational data (total tokens: 203,879)
  • Abstractive summaries for each document in metadata
  • Annotations for most salient entities in each document
  • Foreign language tags identify individual source languages
  • New process for reconstructing Reddit text data in top-level folders (see README.md)
  • Many corrections to all annotation layers