Summary

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

Introduction

ISDT is a resource annotated according to the Stanford dependencies scheme (de Marneffe et al. 2008, 2013a, 2013b, 2014), obtained through a semi-automatic conversion process starting from MIDT (the Merged Italian Dependency Treebank). MIDT, in turn, is the result of an earlier effort to improve the interoperability of data sets available for Italian by harmonizing and merging two existing dependency-based resources that differ in both corpus composition and annotation scheme, namely:

  • TUT, the Turin University Treebank (Bosco et al. 2000);
  • ISST-TANL, first released as ISST-CoNLL for the CoNLL-2007 shared task (Montemagni, Simi 2007), which was developed as a joint effort by the Istituto di Linguistica Computazionale (ILC–CNR) and the University of Pisa and originates from the Italian Syntactic–Semantic Treebank (ISST, Montemagni et al. 2003).

The harmonization and conversion process leading to MIDT is described in (Bosco, Montemagni, Simi, 2012). The Stanford annotation scheme, obtained from an enriched version of MIDT, was adapted to the specific features of the Italian language; see (Bosco, Montemagni, Simi, 2013 and 2014) for a discussion.

Acknowledgments

We wish to thank all of the contributors to the original annotation efforts, as well as the supporting organizations, i.e. the Institute for Computational Linguistics "A. Zampolli", the University of Pisa, and the University of Torino. Thanks go to Chiara Alzetta and Giulia Venturi for their work in defining the error detection methodology and for the manual revision and correction of automatically identified errors in Version 2.1.

Main contributors

  • Cristina Bosco - Università di Torino, Dipartimento di Informatica
  • Alessandro Lenci - Università di Pisa, Dipartimento di Filologia, Letteratura, Linguistica
  • Simonetta Montemagni - Istituto di Linguistica Computazionale A. Zampolli, CNR, Pisa
  • Maria Simi - Università di Pisa, Dipartimento di Informatica

Corpus composition

| Original format | Source | Genre | Size in tokens | Size in sentences |
|---|---|---|---|---|
| TUT-CONLL | Evalita 2011 Dependency parsing | Legal texts, news articles, Wikipedia articles | 101,309 | 3,842 |
| ISST-TANL | Evalita 2011 Domain adaptation task | Newspaper articles | 80,967 | 4,135 |
| ISST-TANL | SPLeT 2012 | Legal texts: European directives | 6,166 | 260 |
| MIDT | Several QA competitions | Questions | 20,680 | 2,228 |
| MIDT | Evalita 2014 Dependency parsing: test data set (partial) | News articles | 7,618 | 304 |
| TUT-CONLL | Parallel TUT (Italian part) | Various genres | 55,942 | 2,131 |
| UD | Due Parole | Simplified Italian news | 24,977 | 1,421 |
| UD2 | New data | Various sentences | 2,504 | 150 |

Sentence ids explicitly mark the source of each sentence.
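
Because the source is encoded in the sentence id, the composition of a file can be tallied from its `# sent_id = ...` comment lines. The sketch below is illustrative only: it assumes ids of the form `prefix-number`, and the prefixes `sourceA` and `sourceB` are made up for the example; the actual prefixes should be read from the data files.

```python
from collections import Counter


def sources_by_prefix(conllu_text):
    """Count sentences per source, taking the source to be the part of
    the sent_id before its last hyphen (an assumption, see above)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.startswith("# sent_id"):
            continue
        _, sep, value = line.partition("=")
        if not sep:
            continue  # malformed comment without "="
        sent_id = value.strip()
        prefix = sent_id.rsplit("-", 1)[0]
        counts[prefix] += 1
    return dict(counts)


# Hypothetical ids, not taken from the treebank.
SAMPLE_IDS = "\n".join([
    "# sent_id = sourceA-1",
    "# sent_id = sourceA-2",
    "# sent_id = sourceB-1",
])

print(sources_by_prefix(SAMPLE_IDS))
```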

Corpus splitting

The corpus (14,167 sentences; 278,429 tokens; 298,344 words) has been randomly split as follows:

  • it-ud-train.conllu: 257616 tokens (13121 sentences)
  • it-ud-dev.conllu: 11133 tokens (564 sentences)
  • it-ud-test.conllu: 9680 tokens (482 sentences)
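
The figures above can be cross-checked with a few lines of Python. The sketch below sums the split sizes and counts sentences, tokens and words in a CoNLL-U string, assuming the usual CoNLL-U conventions (ID ranges such as `2-3` for multiword tokens, decimal IDs for empty nodes). The function name `count_conllu` and the sample sentence are illustrative and not taken from the treebank.

```python
# (tokens, sentences) per split, copied from the list above.
SPLITS = {
    "train": (257616, 13121),
    "dev": (11133, 564),
    "test": (9680, 482),
}
total_tokens = sum(tokens for tokens, _ in SPLITS.values())
total_sentences = sum(sents for _, sents in SPLITS.values())


def count_conllu(text):
    """Return (sentences, tokens, words) for a CoNLL-U string."""
    sentences = tokens = words = 0
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multiword token
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):  # comment line
            continue
        if not in_sentence:
            in_sentence = True
            sentences += 1
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            # Multiword token range: counts as one surface token.
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        elif "." in token_id:
            continue  # empty node, neither a token nor a word
        else:
            words += 1
            if int(token_id) > covered_until:
                tokens += 1  # word that is also a surface token
    return sentences, tokens, words


# Made-up sentence with one multiword token ("al" = "a" + "il").
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Vado al mare.",
    "1\tVado\tandare\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_",
    "3\til\til\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tmare\tmare\tNOUN\t_\t_\t1\tobl\t_\tSpaceAfter=No",
    "5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]) + "\n"

print(total_tokens, total_sentences)  # 278429 14167
print(count_conllu(SAMPLE))           # (1, 4, 5)
```

The split sizes add up to the corpus totals given above. In the sample, the word count exceeds the token count because the articulated preposition is split into two syntactic words, which is the kind of difference behind the separate token and word totals.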

Changelog

  • 2018-11-01 v2.3

    • added enhanced dependencies
  • 2018-04-01 v2.2

    • Repository renamed from UD_Italian to UD_Italian-ISDT.
    • Additional corrections of 1340 arcs, specifically:
      • 525 arcs retrieved with the methodology already used in the previous release, applied to the rest of the treebank;
      • 815 non-projective arcs were also corrected.
    • Added to the train set a new section of 2Parole, a newspaper of simplified Italian texts (283 sentences, 4985 tokens)
  • 2017-11-01 v2.1

    • Corrected 786 dependency errors spread across 567 sentences:
      • Auxiliary verbs erroneously treated as head of a dependency relation
      • Bare past participles functioning as adjectival modifiers of nouns erroneously annotated as clausal modifiers
      • Adjectives functioning as secondary predicates erroneously annotated as adjectival modifiers
      • Coordinating conjunctions erroneously headed by the first conjunct
      • Oblique nominal arguments erroneously annotated as nominal modifiers
      • Nonfinite verbs functioning as nominals erroneously annotated as oblique nominals
    • Consistency in the treatment of fixed multi-word expressions has been checked and improved.
  • 2017-02-15 v2.0

    • Changes to comply with UD v2.
    • Splitting revised to comply with the shared task.
  • 2016-11-01 v1.4

    • Complete revision of the treatment of clitic pronouns
    • Added dependency subtype expl:pass, used in passive constructions
    • Added a new collection of texts from 2Parole, a newspaper of simplified Italian texts (25995 tokens)
  • 2016-05-01 v1.3

    • Added feature value PronType=Ord for ordinal pronouns
    • Added feature value PronType=Predet for predeterminers
    • Added feature value NumType=Range
    • Added feature value NumType=Gen
    • Added sentence full text as comment
    • Added SpaceAfter=No, needed for recovering original text
    • Fixed errors found running content validation queries
  • 2015-11-01 v1.2

    • Added dependencies expl:impers as specialization of expl for impersonal clitic pronouns
    • Fixed case in articulated preposition, previously lost during splitting
    • More fixes to xcomp/ccomp distinction
    • Harmonization of case marking for infinitive verbs introduced by articles
    • Harmonization of Light Verb constructions
    • Eliminated duplicate sentences and train/dev and train/test overlaps
    • Added short sentences to train
  • 2015-05-15 v1.1

    • Added Italian section of ParTUT (71645 tokens)
    • Checked SYM
    • Checked X
    • Added more negation adverbs
    • Eliminated Gender=Com and Number=Com
    • Eliminated Negation=Neg
    • Added language specific feature PronType=Clit
    • Changed 'case' into 'mark' for 'xcomp'
    • Fixed xcomp/ccomp distinction
    • Checked dependencies marked 'dep', and resolved most of them
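
The v1.3 entry notes that `SpaceAfter=No` is needed for recovering the original text. A minimal sketch of that recovery follows, assuming standard ten-column CoNLL-U lines with the MISC field in the last column. The helper name `detokenize` and the sample sentence are illustrative and not part of the treebank or its tooling.

```python
def detokenize(sentence_lines):
    """Rebuild the surface text of one sentence from its CoNLL-U lines."""
    pieces = []
    covered_until = 0  # last word ID covered by the current multiword token
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        if "." in token_id:
            continue  # empty node: no surface form
        if "-" in token_id:
            # Multiword token: use its surface form, skip its parts.
            covered_until = int(token_id.split("-")[1])
        elif int(token_id) <= covered_until:
            continue  # part of a multiword token already emitted
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Made-up sentence; "al" is a multiword token covering words 2 and 3.
SAMPLE_LINES = [
    "# text = Vado al mare.",
    "1\tVado\tandare\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_",
    "3\til\til\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tmare\tmare\tNOUN\t_\t_\t1\tobl\t_\tSpaceAfter=No",
    "5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]

print(detokenize(SAMPLE_LINES))  # Vado al mare.
```

The result can be compared against the `# text = ...` comment that the same release added to each sentence.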

References

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.0
License: CC BY-NC-SA 3.0
Includes text: yes
Genre: legal news wiki
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Bosco, Cristina; Lenci, Alessandro; Montemagni, Simonetta; Simi, Maria
Contributing: elsewhere
Contact: simi@di.unipi.it