This repository contains the annotated datasets of the papers:
- Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus.
In Findings of EMNLP 2021. Daniela Trotta, Raffaele Guarasci, Elisa Leonardelli, Sara Tonelli.
- Work Hard, Play Hard: Collecting Acceptability Annotations through a 3D Game. In Proceedings of the Language Resources and Evaluation Conference (LREC 2022). ELRA, 2022. Federico Bonetti, Elisa Leonardelli, Daniela Trotta, Raffaele Guarasci, Sara Tonelli.
[cite] [read the paper] [try the game here] (click on "DEMO" when prompted for login)
The Italian Corpus of Linguistic Acceptability (ItaCoLA) includes almost 10k sentences taken from the linguistic literature, each with a binary acceptability annotation made by the original authors themselves. The work is inspired by CoLA [1].
Read the paper at https://aclanthology.org/2021.findings-emnlp.250/
Download ItaCoLA from the file `ItaCoLA_dataset.tsv`.
ItaCoLA is split into:
- `train`: 7801 sentences
- `dev`: 946 sentences
- `test`: 975 sentences
Each line in the `.tsv` file consists of 5 tab-separated columns.
- Column 1: a unique ID
- Column 2: the source of the sentence
- Column 3: the acceptability judgment label (0=unacceptable, 1=acceptable)
- Column 4: the sentence
- Column 5: the split to which the sentence belongs
UniqueID | Source | Judgement | Sentence | Split |
---|---|---|---|---|
3 | Graffi_1994 | 1 | Questa donna mi ha colpito. (This woman impressed me.) | train |
5784 | Vietri_2017 | 1 | Alice ha fatto terrorizzare Francesco da quell'uomo. (Alice made Francesco terrified of the man.) | train |
8307 | Vietri_2004 | 0 | Quell'architetto ha alcuni progettato musei. (That architect has some designed museums.) | dev |
9206 | Elia-et-al_1981 | 0 | Il ministro è dal ritiro del passaporto. (The minister is from passport withdrawal) | test |
9366 | Vietri_2004 | 1 | Edoardo ne ride. (Edward laughs about it) | test |
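The five-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch, not the authors' loading code: it parses columns by position, and it assumes the file starts with a header row (pass `has_header=False` if it does not). The field names in `COLUMNS` are chosen for illustration and may differ from the header in the actual file.

```python
import csv
from collections import defaultdict
from typing import Iterable

# Illustrative field names, in the column order described above.
# The header names in the real file may differ.
COLUMNS = ("unique_id", "source", "label", "sentence", "split")


def read_itacola(lines: Iterable[str], has_header: bool = True) -> dict[str, list[dict]]:
    """Parse tab-separated lines and group the records by split."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: the first line is a header row
    by_split = defaultdict(list)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip empty or malformed lines
        record = dict(zip(COLUMNS, row))
        record["label"] = int(record["label"])  # 0 = unacceptable, 1 = acceptable
        by_split[record["split"]].append(record)
    return dict(by_split)


# Three rows taken from the example table, preceded by a header line.
sample = [
    "UniqueID\tSource\tJudgement\tSentence\tSplit",
    "3\tGraffi_1994\t1\tQuesta donna mi ha colpito.\ttrain",
    "8307\tVietri_2004\t0\tQuell'architetto ha alcuni progettato musei.\tdev",
    "9366\tVietri_2004\t1\tEdoardo ne ride.\ttest",
]
splits = read_itacola(sample)
print({name: len(rows) for name, rows in splits.items()})
```

To read the downloaded file instead of the in-memory sample, pass an open file handle, for example `open("ItaCoLA_dataset.tsv", encoding="utf-8", newline="")`, as the `lines` argument.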
The sentences come from different sources in the linguistic literature, covering a wide range of topics.
Source Legend | Topic |
---|---|
D-Agostino_1983 [2] | locative constructions |
D-Agostino_1992 [3] | discourse analysis |
Elia-et-al_1981 [4] | lexicon and syntactic structures |
Elia_1982 [5] | locative adverbs and idioms |
Graffi-Scalise_2002 [6] | theoretical linguistics |
Graffi_1994 [7] | syntax |
Graffi_2008 [8] | generative grammar |
Jezek_2003 [9] | verb classification |
Simone-Masini_2013 [10] | theoretical linguistics |
Vietri_1985 [11] | idiomatic expressions |
Vietri_2004 [12] | lexicon-grammar approach |
Vietri_2017 [13] | anticausative sentences |
Part of the dataset has been manually annotated with 9 linguistic phenomena.
Phenomenon |
---|
Cleft constructions |
Copular constructions |
Subject-Verb Agreement |
Wh-islands violations |
Simple |
Question |
Auxiliary |
Bind |
Indefinite pronouns |
The file containing the annotated sentences can be downloaded as `ItaCoLA_dataset_phenomenon`. Every annotated sentence in this file is linked to the corresponding sentence in the main corpus through its unique ID. The `.tsv` file has a structure similar to the main corpus, but each phenomenon is represented by its own column (1 if present, 0 if not present).
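As a rough illustration of how the phenomenon annotations can be linked back to the main corpus through the unique ID, consider the sketch below. The header names (`id`, `cleft`, `question`) and the two toy rows are hypothetical, made up only to show the mechanics; check the header of the actual file and pass the real column names instead.

```python
import csv
from typing import Iterable


def phenomena_by_id(
    lines: Iterable[str], id_column: str, phenomenon_columns: list[str]
) -> dict[str, set[str]]:
    """Map each unique ID to the set of phenomena marked with 1."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    result = {}
    for row in reader:
        result[row[id_column]] = {
            name for name in phenomenon_columns if row[name] == "1"
        }
    return result


# Hypothetical header and toy rows, NOT taken from the real file.
annotated = [
    "id\tsentence\tcleft\tquestion",
    "1\tplaceholder sentence A\t0\t1",
    "2\tplaceholder sentence B\t1\t0",
]
# Toy stand-in for the main corpus, keyed by unique ID.
main_corpus = {
    "1": "placeholder sentence A",
    "2": "placeholder sentence B",
    "3": "placeholder sentence C",
}

tags = phenomena_by_id(annotated, "id", ["cleft", "question"])
for uid, sentence in main_corpus.items():
    # Sentences without phenomenon annotations get an empty list.
    print(uid, sentence, sorted(tags.get(uid, set())))
```

Because only part of the dataset is annotated with phenomena, the lookup uses `tags.get(uid, set())` so that unannotated sentences are handled without an error.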
Trotta D., Guarasci R., Leonardelli E., Tonelli S. Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus. In Findings of EMNLP 2021.
ItaCoLA_dataset_non-expertannotations is a subset of about 1,000 sentences from the ItaCoLA dataset. These sentences have been re-annotated for acceptability by non-expert annotators (i.e. non-linguists) through a 3D video game.
Download the ItaCoLA-nonexpert dataset from the file `ItaCoLA_dataset_non-expertannotations.tsv`.
ItaCoLA-nonexpert consists of 1062 sentences.
Each line in the `.tsv` file consists of 7 tab-separated columns.
- Column 1 `UniqueIndexID_ItaCoLA`: a unique ID that links to ItaCoLA
- Column 2 `Sentence`: the sentence
- Column 3 `ExpertAcceptability`: the acceptability judgment label from the original ItaCoLA (0=unacceptable, 1=acceptable)
- Column 4 `PlayersAcceptability`: the acceptability judgment label from the non-expert annotators, based on majority voting (0=unacceptable, 1=acceptable)
- Column 5 `NumAnnotations`: total number of non-expert annotations collected
- Column 6 `DisaggragatedAnnotations`: the individual non-expert annotations
- Column 7 `PlayersID`: anonymized non-expert identifiers. The order of the identifiers matches the order of the annotations in `DisaggragatedAnnotations`
ItaCoLA | Sentence | ExpertAcc. | PlayersAcc. | NumAnn | Disaggragated | PlayersID |
---|---|---|---|---|---|---|
8897 | È arrivato tuo padre? | 1 | 1 | 4 | 1,1,1,1 | vj6buGww,HQ8yy7gw,OeynLncz,C62PYy1K |
4042 | Alice smette del pasto. | 0 | 0 | 2 | 0,0 | vj6buGww,HQ8yy7gw |
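The comma-separated `DisaggragatedAnnotations` and `PlayersID` fields can be unpacked and paired by position, and a majority label can be recomputed from the individual votes. The sketch below assumes both fields are comma-separated exactly as in the example rows above. How ties were resolved in the released `PlayersAcceptability` column is not stated here, so this sketch simply returns `None` for a tie rather than guessing.

```python
from collections import Counter
from typing import Optional


def parse_votes(disaggregated: str) -> list[int]:
    """Turn a string such as "1,0,1" into [1, 0, 1]."""
    return [int(v) for v in disaggregated.split(",") if v.strip()]


def majority_label(disaggregated: str) -> Optional[int]:
    """Return the majority vote (0 or 1), or None when there is no majority."""
    counts = Counter(parse_votes(disaggregated))
    if counts[0] == counts[1]:
        return None  # tie or no votes: the tie-breaking rule is an open assumption
    return 1 if counts[1] > counts[0] else 0


def votes_by_player(disaggregated: str, players: str) -> list[tuple[str, int]]:
    """Pair each anonymized player ID with the annotation in the same position."""
    labels = parse_votes(disaggregated)
    ids = [p.strip() for p in players.split(",") if p.strip()]
    if len(labels) != len(ids):
        raise ValueError("number of annotations and player IDs differ")
    return list(zip(ids, labels))


# Values taken from the two example rows above.
print(majority_label("1,1,1,1"))
print(votes_by_player("0,0", "vj6buGww,HQ8yy7gw"))
```

Comparing the recomputed label with `ExpertAcceptability` row by row gives a quick view of where non-expert players and the original expert judgments disagree.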
Discover more about the videogame here
You can also [try the game] (click on "DEMO" when prompted for login).
Bonetti F., Leonardelli E., Trotta D., Guarasci R., Tonelli S. Work Hard, Play Hard: Collecting Acceptability Annotations through a 3D Game. In Proceedings of Language Resources and Evaluation Conference (LREC 2022). ELRA, 2022.
1. Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
2. Emilio D’Agostino. 1983. Lessico e sintassi delle costruzioni locative: materiali per la didattica dell’italiano. Liguori.
3. Emilio D’Agostino. 1992. Analisi del discorso: metodi descrittivi dell’italiano d’uso. Loffredo.
4. Annibale Elia, Maurizio Martinelli, and Emilio D’Agostino. 1981. Lessico e strutture sintattiche: introduzione alla sintassi del verbo italiano. Liguori, Napoli.
5. Annibale Elia. 1982. Avverbi ed espressioni idiomatiche di carattere locativo. Studi di Grammatica Italiana, Firenze, 11:327–379.
6. Giorgio Graffi and Sergio Scalise. 2002. Le lingue e il linguaggio. Introduzione alla linguistica. Il Mulino, Bologna, Italy.
7. Giorgio Graffi. 1994. Le strutture del linguaggio. Sintassi. Il Mulino, Bologna, Italy.
8. Giorgio Graffi. 2008. Che cos’è la grammatica generativa. Carocci editore, Roma, Italy.
9. Elisabetta Jezek. 2003. Classi di verbi tra semantica e sintassi. Edizioni ETS, Pisa, Italy.
10. Raffaele Simone and Francesca Masini. 2013. Nuovi fondamenti di linguistica. McGraw Hill.
11. Simonetta Vietri. 2004. Lessico-grammatica dell’italiano. Metodi, descrizioni e applicazioni. UTET Università.
12. Simonetta Vietri. 2014. Idiomatic constructions in Italian: a lexicon-grammar approach, volume 31. John Benjamins Publishing Company.
13. Simonetta Vietri. 2017. Usi verbali dell’italiano: le frasi anticausative. Carocci editore.