Universal Proposition Banks

This is release 1.0 of the Universal Proposition Banks. It is built upon release 1.4 of the Universal Dependency Treebanks and inherits their license. We use the frame and role labels from the English Proposition Bank version 3.0.

News (10/01/2019): Two domain-specific PropBanks released (Contract, Finance)!

News (02/10/2017): Initial version of Italian UP released!

News (01/31/2017): Initial versions of Finnish, Portuguese and Spanish UP released!

News (04/15/2022): We are freezing the resources in this repository.

To be consistent with the UP2.0 repository format, we have reorganized this repo and copied the data from each language-specific folder to a language-specific repository. The changes are as follows:

Introduced language- and corpus-specific repositories, similar to the Universal Dependencies project.

All UP1.0 resources have been moved to language-specific repositories. The following folders have been copied to the corresponding repositories.

No further changes will be made to this repository (all resources are frozen). All language-specific updates will be made in the corresponding repositories, named UP_<language>-<corpus>. To keep this data available as it is, a release named "v1.0 data release" will be made. For more information, see the Universal PropBanks website: https://universalpropositions.github.io/

Languages

This release contains propbanks for the following languages:

Chinese UP - Inherits license CC BY-NC-SA 3.0 US from the Chinese Universal Treebank
Finnish UP - Inherits license CC BY-NC-SA 3.0 US from the Finnish Universal Treebank
French UP - Inherits license CC BY-NC-SA 3.0 US from the French Universal Treebank
German UP - Inherits license CC BY-NC-SA 3.0 US from the German Universal Treebank
Italian UP - Inherits license CC BY-NC-SA 3.0 US from the Italian Universal Treebank
Portuguese UP - Inherits license CC BY-NC-SA 3.0 US from the Portuguese Universal Treebank
Spanish UP - Inherits license CC BY-NC-SA 3.0 US from the Spanish Universal Treebank

Multilingual SRL

Using this data, we can create SRL systems that predict English PropBank labels for many different languages. See a recent demo screencast of this SRL for English, French and German here.

Introduction

This project aims to annotate text in different languages with a layer of "universal" semantic role labeling annotation. For this purpose, we use the frame and role labels of the English Proposition Bank to label shallow semantics in sentences in new target languages.

For instance, consider the German sentence "Seine Arbeit wird von ehrenamtlichen Helfern und Regionalgruppen des Vereins unterstützt" (His work is supported by volunteers and regional groupings of the association). In CoNLL format, it looks like this, with English PropBank labels in the last two columns:

Id	Form	POS	HeadId	Deprel	Frame	Role
1	Seine	DET	2	det:poss	_	_
2	Arbeit	NOUN	11	nsubjpass	_	A1
3	wird	AUX	11	auxpass	_	_
4	von	ADP	6	case	_	_
5	ehrenamtlichen	ADJ	6	amod	_	_
6	Helfern	NOUN	11	nmod	_	A0
7	und	CONJ	6	cc	_	_
8	Regionalgruppen	NOUN	6	conj	_	_
9	des	DET	10	det	_	_
10	Vereins	NOUN	8	nmod	_	_
11	unterstützt	VERB	0	root	support.01	_
12	.	PUNCT	11	punct	_	_

The German verb 'unterstützt' is labeled as evoking the 'support.01' frame with two roles: "Seine Arbeit" (his work) is labeled A1 (project being supported) and "ehrenamtlichen Helfern und Regionalgruppen des Vereins" (volunteers and regional groupings of the association) is labeled A0 (the helper).
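
The example above can be read programmatically. The following is a minimal sketch, not part of the project's tooling: it parses the simplified seven-column table shown above, finds the frame-evoking token, and expands each role label from its head word to the full dependency subtree. The function names `parse` and `span` are hypothetical helpers introduced only for this illustration.

```python
# The example sentence in the simplified seven-column layout shown above:
# Id, Form, POS, HeadId, Deprel, Frame, Role (tab-separated).
EXAMPLE = "\n".join([
    "1\tSeine\tDET\t2\tdet:poss\t_\t_",
    "2\tArbeit\tNOUN\t11\tnsubjpass\t_\tA1",
    "3\twird\tAUX\t11\tauxpass\t_\t_",
    "4\tvon\tADP\t6\tcase\t_\t_",
    "5\tehrenamtlichen\tADJ\t6\tamod\t_\t_",
    "6\tHelfern\tNOUN\t11\tnmod\t_\tA0",
    "7\tund\tCONJ\t6\tcc\t_\t_",
    "8\tRegionalgruppen\tNOUN\t6\tconj\t_\t_",
    "9\tdes\tDET\t10\tdet\t_\t_",
    "10\tVereins\tNOUN\t8\tnmod\t_\t_",
    "11\tunterstützt\tVERB\t0\troot\tsupport.01\t_",
    "12\t.\tPUNCT\t11\tpunct\t_\t_",
])


def parse(block):
    """Turn the tab-separated table into a list of token dictionaries."""
    rows = []
    for line in block.strip().splitlines():
        tok_id, form, _pos, head, _deprel, frame, role = line.split("\t")
        rows.append({"id": int(tok_id), "form": form, "head": int(head),
                     "frame": frame, "role": role})
    return rows


def span(rows, head_id):
    """Forms of head_id and all of its dependency descendants, in sentence order."""
    ids = {head_id}
    grew = True
    while grew:  # repeat until no new descendant is found
        grew = False
        for row in rows:
            if row["head"] in ids and row["id"] not in ids:
                ids.add(row["id"])
                grew = True
    return " ".join(row["form"] for row in rows if row["id"] in ids)


rows = parse(EXAMPLE)
# Exactly one token in this sentence evokes a frame.
predicate = next(row for row in rows if row["frame"] != "_")
# Role labels sit on the head word; expand each to its subtree.
arguments = {row["role"]: span(rows, row["id"]) for row in rows if row["role"] != "_"}

print(predicate["frame"], predicate["form"])
for role, text in arguments.items():
    print(role, text)
```

Note that the subtree of "Helfern" also contains the case marker "von", so the A0 span produced here is slightly wider than the phrase quoted in the paragraph above.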

Format

The universal propbank (UP) for each language consists of three files (training, dev and test data). The files carry the extension .conllu, but they currently encode an extension of the CoNLL-U format. This extension is based on the CoNLL format produced by the PropBank conversion scripts, called .gold_conll.

In addition to the original 10 columns of the CoNLL-U format, the roleset column (column 11) gives the actual sense used, and that sense provides roleset-specific meanings for each of the numbered arguments. Every column after the eleventh corresponds to one predicate, in the order the predicates appear in the sentence. Note that the PropBank .gold_conll files contain a "frame file" column (column 11) that tells you which ".xml" file contains the actual semantic form for the predicate in question (the file name is not always the same as the predicate: one must reference "lighten.xml" for lighten_up.02). Since every predicate identifier is unique, we have not preserved this column.
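
The column layout described above can be read roughly as follows. This is a sketch under stated assumptions rather than a reference reader: it assumes tab-separated rows, blank lines between sentences, `#` comment lines as in standard CoNLL-U, `_` for empty cells, and that the i-th argument column belongs to the i-th token with a non-empty roleset. The sample rows are built from the German example in this README with unused CoNLL-U fields left as `_`; they are not copied from a release file, and how real files mark the predicate's own row in its argument column is not shown here. `make_row`, `read_sentences` and `propositions` are hypothetical names.

```python
def make_row(tok_id, form, upos, head, deprel, roleset, *args):
    """Build a 10-column CoNLL-U row (unused fields as '_') plus roleset and argument columns."""
    return "\t".join([tok_id, form, "_", upos, "_", "_", head, deprel, "_", "_", roleset, *args])


# Illustrative sample only; one predicate, hence one argument column.
SAMPLE = "\n".join([
    "# illustrative sentence, not copied from a release file",
    make_row("1", "Seine", "DET", "2", "det:poss", "_", "_"),
    make_row("2", "Arbeit", "NOUN", "11", "nsubjpass", "_", "A1"),
    make_row("3", "wird", "AUX", "11", "auxpass", "_", "_"),
    make_row("4", "von", "ADP", "6", "case", "_", "_"),
    make_row("5", "ehrenamtlichen", "ADJ", "6", "amod", "_", "_"),
    make_row("6", "Helfern", "NOUN", "11", "nmod", "_", "A0"),
    make_row("7", "und", "CONJ", "6", "cc", "_", "_"),
    make_row("8", "Regionalgruppen", "NOUN", "6", "conj", "_", "_"),
    make_row("9", "des", "DET", "10", "det", "_", "_"),
    make_row("10", "Vereins", "NOUN", "8", "nmod", "_", "_"),
    make_row("11", "unterstützt", "VERB", "0", "root", "support.01", "_"),
    make_row("12", ".", "PUNCT", "11", "punct", "_", "_"),
    "",
])


def read_sentences(lines):
    """Yield each sentence as a list of column lists, skipping '#' comment lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def propositions(sentence):
    """Pair the i-th predicate (non-'_' in column 11) with the i-th argument column."""
    predicates = [row for row in sentence if len(row) > 10 and row[10] != "_"]
    result = []
    for i, pred in enumerate(predicates):
        col = 11 + i  # zero-based index of this predicate's argument column
        args = [(row[col], row[1]) for row in sentence
                if len(row) > col and row[col] != "_"]
        result.append((pred[10], pred[1], args))
    return result


for sentence in read_sentences(SAMPLE.splitlines()):
    for roleset, form, args in propositions(sentence):
        print(roleset, form, args)
```

Each result is a triple of roleset, predicate form, and a list of (label, head word) pairs; the labels are taken verbatim from the file, so any marker a real file places in an argument column is passed through unchanged.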

The English dataset was the only one obtained in a different manner. See the README.org file in that directory for information.

In addition, each language has a folder with verb overview files (produced from the frame files) in HTML format. These files can be viewed in a browser and give an overview of all English frames that each target language verb can evoke.

Scope

Our current focus is to annotate all target language verbs with appropriate English frames. This means that the scope of frame-evoking elements is currently limited to verbs. We also do not label target language auxiliary verbs. For each universal propbank, about 90% of all verbs are currently labeled. Unlabeled verbs often either convey semantics for which we could not find an appropriate English verb or are part of complex verb constructions that we currently do not handle.
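
The share of labeled verbs can be checked on any UP file with a few lines of code. The sketch below is illustrative only and makes several assumptions: rows are already split into columns, the UPOS tag sits in column 4 and the roleset in column 11, auxiliaries are tagged AUX (and therefore excluded), and an unlabeled verb has `_` in the roleset column. The toy rows, including the placeholder token `VERB_WITHOUT_FRAME`, are made up for the example and do not come from the data.

```python
def toy_row(form, upos, roleset):
    """Hypothetical 11-column row: only FORM, UPOS and the roleset are filled in."""
    row = ["_"] * 11
    row[1], row[3], row[10] = form, upos, roleset
    return row


def verb_coverage(rows):
    """Fraction of VERB tokens that carry a roleset, or None if there are no verbs."""
    verbs = [row for row in rows if row[3] == "VERB"]
    if not verbs:
        return None
    labeled = [row for row in verbs if row[10] != "_"]
    return len(labeled) / len(verbs)


# Toy data: one labeled verb, one unlabeled verb, one auxiliary, one noun.
TOY = [
    toy_row("Arbeit", "NOUN", "_"),
    toy_row("wird", "AUX", "_"),
    toy_row("unterstützt", "VERB", "support.01"),
    toy_row("VERB_WITHOUT_FRAME", "VERB", "_"),
]

coverage = verb_coverage(TOY)
print(f"{coverage:.0%} of verbs labeled")
```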

A note on quality

This is an ongoing research project in which we use a combination of data-driven methods and some post-processing to generate these resources. This means that the labels in the UPs are mostly predicted by models trained on a different domain, which affects the quality. A good example is the German verb "angeben": in our source data it was mostly used in the "brag.01" sense, whereas in the German UD data it is mostly used in the "report.01" sense but is almost never detected as such.

Current and future work

This is an ongoing project which we are improving along four lines: (1) We are working on adding new languages to the current release. (2) We are working to curate the data to improve the quality of SRL annotation. (3) We are looking into extending the scope of frame-evoking elements to other types of predicates besides verbs. (4) We will migrate the data to a newer UD standard.

Publications

Crowd-in-the-Loop: A Hybrid Approach for Annotating Semantic Roles. Chenguang Wang, Alan Akbik, Laura Chiticariu, Yunyao Li, Fei Xia and Anbang Xu. 2017 Conference on Empirical Methods in Natural Language Processing EMNLP 2017.

Active Learning for Black-Box Semantic Role Labeling with Neural Factors. Chenguang Wang, Laura Chiticariu and Yunyao Li. 2017 International Joint Conference on Artificial Intelligence IJCAI 2017.

Multilingual Aliasing for Auto-Generating Proposition Banks. Alan Akbik, Xinyu Guan and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

K-SRL: Instance-based Learning for Semantic Role Labeling. Alan Akbik and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

Multilingual Information Extraction with PolyglotIE. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yonas Kbrom, Yunyao Li and Huaiyu Zhu. 26th International Conference on Computational Linguistics COLING 2016.

Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages. Alan Akbik, Vishwajeet Kumar and Yunyao Li. 2016 Conference on Empirical Methods in Natural Language Processing EMNLP 2016.

Polyglot: Multilingual Semantic Role Labeling with Unified Labels. Alan Akbik and Yunyao Li. 54th Annual Meeting of the Association for Computational Linguistics ACL 2016.

Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan and Huaiyu Zhu. 53rd Annual Meeting of the Association for Computational Linguistics ACL 2015.

People

Contact

Please email your questions or comments to Huaiyu Zhu.

Core Team

Alan Akbik
Laura Chiticariu
Marina Danilevsky
Yunyao Li
Huaiyu Zhu

Contributors

Xinyu Guan, Yale University
Tomer Mahlin, IBM Systems Division, Israel
Vishwajeet Kumar, IIT Bombay
Fei Xia, University of Washington
Chenguang (Ray) Wang, Amazon

