Permalink
Switch branches/tags
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
165 lines (117 sloc) 6.41 KB
# Summary
UD English_LinES is the English half of the LinES Parallel Treebank
with the original dependency annotation first automatically converted
into Universal Dependencies and then partially reviewed. Its contents cover
literature, an online manual and Europarl data.
# Introduction
UD English_LinES is the English half of the LinES Parallel Treebank with UD annotations.
The majority of segments are from literature but there is also a section with online manual
data and one section with Europarl data. All segments have an associated translation in the
UD Swedish_LinES treebank (with the same segment index). The original dependency annotation
was first automatically converted to Universal Dependencies and
then partially reviewed (Ahrenberg, 2015). In January-February 2017 it was converted to UD version 2
and again reviewed for errors. With version 2.1 lemma information has been added.
The treebank is being developed continuously.
# Acknowledgements
Three of the source texts were collected as part of the Linköping Translation Corpus Corpus
(Merkel, 1999). The treebank was first developed in the project 'Micro- and macro-level
analysis of translations' funded by the Swedish Research Council (Ahrenberg, 2007).
# Details on the source texts
UD English_LinES contains segments from seven different sources, three of which are
part of the Linköping Translation Corpus Corpus (Merkel, 1999). The
treebank was first developed in the project 'Micro- and macro-level
analysis of translations' funded by the Swedish Research Council (Ahrenberg, 2007).
Five of the sub-corpora are taken from literary works:
Paul Auster: City of Glass. The New York Trilogy, Volume One. First
published by Faber & Faber in 1985.
Saul Bellow: To Jerusalem and back: a personal accunt. First published
in 1976.
Joseph Conrad: Heart of darkness. First published in 1899 as a serial
in Blackwood's Magazine, in 1902 as a book.
Nadine Gordimer: A Guest of Honour. First published in 1970.
J. K. Rowling: Harry Potter and the Chamber of Secrets. First published
in 1998.
In addition the corpus includes segments from Microsoft Access 2002
Online Help and the Englsh part of the Europarl corpus (v.7).
The segments have been word-aligned to the corresponding segments in the
UD Swedish_LinES treebank. Contact Lars Ahrenberg if you are interested in
obtaining the word alignments.
DATA SPLITS
The data has been split so that about 20% is used for the dev-file, 20% for the test file and the rest for training. In each file, segments from the same sub-corpus are held together. The files are named
- en_lines-ud-test.conllu
- en_lines-ud-dev.conllu
- en_lines-ud-train.conllu
English_LinES and Swedish_LinES have been split the same way.
BASIC STATISTICS
Tree count: 4564
Word count: 82821
Token count: 82821
Dep. relations: 40 of which 7 language specific
POS tags: 17
Category=value feature pairs: 0
TOKENIZATION
The tokenization is largely based on whitespace, but punctuation marks
except word-internal hyphens are treated as separate tokens. The
original file also has several multi-word tokens, but these are
separated in the UD version with all parts except the first assigned
the UD dependency function 'fixed'. There are no blanks inside tokens.
MORPHOLOGY
From version 2.2 the UFEATS column is filled. The XPOS column has features from the original LinES
with the exception of nouns that are not annotated for case, only number. Verbs are annotated for tense and,
adjectives for degree. Pronouns are sub-divided in the morphological
description into Personal, Demonstrative, Interrogative, Indefinite,
Relative, Total, and Expletive, and are annotated for Number, Person
and Case, when relevant.
The mapping from language-specific part-of-speech tags to universal tags
was done automatically. Errors have been corrected when found but
there may still be some errors remaining.
SYNTAX
The syntactic annotation was first automatically converted from the original
LinES annotation scheme as described in Ahrenberg (2015). Then converted again, mostly
automatically to UD version 2.0. The test sample has been thoroughly reviewed before
the release of version 2.1.
For version 2.2 the relative word 'that' is analysed as PRON and assigned dependency
relations from the clausal relations nsubj, obj, obl, xcomp. Adposition introducing
clauses are assigned the relation 'mark' consistently. Unlike previous versions,
the relation 'obl:agent' is not used, instead 'obl' is used as in other English
treebanks.
There may still be occasional deviations from the general guidelines.
REFERENCES
Lars Ahrenberg, 2007. LinES: An English-Swedish Parallel
Treebank. Proceedings of the 16th Nordic Conference of Computational
Linguistics (NODALIDA, 2007).
Lars Ahrenberg, 2015. Converting an English-Swedish Parallel Treebank
to Universal Dependencies. Proceedings of the Third International
Conference on Dependency Linguistics (DepLing 2015), Uppsala, August
24-26, 2015, pp. 10-19. ACL Anthology W15-2103.
Magnus Merkel, 1999: Understanding and enhancing translation by
parallel text processing. Linköping Studies in Science and Technology,
Dissertation No. 607.
Changelog
From UD version 1.3 to UD version 2.0
* changes of part-of-speech labels and dependency labels in accordance with 2.0 guidelines
* addition of comments for sent_id, text, and document boundaries
* addition of SpaceAfter=No features in the MISC column
From UD version 2.0 to UD version 2.1
* all tokens have received a lemma
* the test data have been manually reviewed to correct errors and make data agree better with the version 2 guidelines.
Changes affect about 14% of all tokens and some 36% of all punctuation tokens.
From UD version 2.1 to UD version 2.2
* features have been added to the UFEATS column. They have been mapped from UD_English v2.1 and then manually reviewed.
* the word 'that' when introducing a relative clause is tagged PRON and assigned a clausal relation.
* the train and dev data have been partially reviewed to correct errors and make data agree better with the
version 2 guidelines.
--- Machine readable metadata ---
Documentation status: partial
Data source: semi-automatic
Data available since: UD v1.3
License: CC BY-NC-SA 4.0
Genre: fiction nonfiction spoken
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual
Features: not available
Relations: converted from manual and corrected
Contributors: Ahrenberg, Lars
Contributing: elsewhere
Contact: lars.ahrenberg@liu.se