Summary
UD_Icelandic-IcePaHC is a conversion of the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.
The conversion was done using UDConverter.
Introduction
The Icelandic Parsed Historical Corpus (IcePaHC) is a one-million-word, diachronic corpus which includes 61 texts from the 12th to 21st centuries. These texts were originally manually parsed according to the Penn Parsed Corpora of Historical English (PPCHE) annotation scheme. These parsed texts where then automatically converted to the Universal Dependencies scheme to create UD_Icelandic-IcePaHC.
Text categories
UD_Icelandic-IcePaHC contains the following main genres:
- NAR: Narratives (sagas, fiction)
- REL: Religious texts (bible, sermons)
- SCI: Science (linguistics, natural sciences, history)
- BIO: Biographical material (biographies, travelogues)
- LAW: Law texts
Further subclassification is reflected in the extended genre label. For example NAR-SAG means narrative-saga and REL-BIB means religious text-bible
Each sentence ID in UD-Icelandic-IcePaHC carries the following information:
1150.FIRSTGRAMMAR.SCI-LIN,1.1
- Publication year of the text (
1150
) - Name of the text (
FIRSTGRAMMAR
) - Text genre (
SCI-LIN
) - Index within text (
1
) - Index within file (
1
)
Using the sentence IDs within UD_Icelandic-IcePaHC, specific genres or periods can be extracted or filtered from the treebank CoNLL-U files.
Data split
For further info on each text, see the IcePaHC documnentation.
TRAIN:
1150.HOMILIUBOK.REL-SER
1210.THORLAKUR.REL-SAG
1250.STURLUNGA.NAR-SAG
1260.JOMSVIKINGAR.NAR-SAG
1270.GRAGAS.LAW-LAW
1275.MORKIN.NAR-HIS
1300.ALEXANDER.NAR-SAG
1310.GRETTIR.NAR-SAG
1325.ARNI.NAR-SAG
1350.FINNBOGI.NAR-SAG
1400.GUNNAR.NAR-SAG
1400.VIGLUNDUR.NAR-SAG
1450.ECTORSSAGA.NAR-SAG
1450.JUDIT.REL-BIB
1450.VILHJALMUR.NAR-SAG
1480.JARLMANN.NAR-SAG
1525.ERASMUS.NAR-SAG
1540.NTJOHN.REL-BIB
1593.EINTAL.REL-OTH
1611.OKUR.REL-OTH
1650.ILLUGI.NAR-SAG
1659.PISLARSAGA.BIO-AUT
1661.INDIAFARI.BIO-TRA
1675.ARMANN.NAR-FIC
1675.MAGNUS.BIO-OTH
1675.MODARS.NAR-FIC
1680.SKALHOLT.NAR-REL
1725.BISKUPASOGUR.NAR-REL
1790.FIMMBRAEDRA.NAR-SAG
1791.JONSTEINGRIMS.BIO-AUT
1830.HELLISMENN.NAR-SAG
1835.JONASEDLI.SCI-NAT
1859.HUGVEKJUR.REL-SER
1861.ORRUSTA.NAR-FIC
1882.TORFHILDUR.NAR-FIC
1888.VORDRAUMUR.NAR-FIC
1907.LEYSING.NAR-FIC
1908.OFUREFLI.NAR-FIC
1985.MARGSAGA.NAR-FIC
1985.SAGAN.NAR-FIC
2008.MAMMA.NAR-FIC
TEST:
1150.FIRSTGRAMMAR.SCI-LIN
1210.JARTEIN.REL-SAG
1350.MARTA.REL-SAG
1450.BANDAMENN.NAR-SAG
1400.GUNNAR2.NAR-SAG
1540.NTACTS.REL-BIB
1628.OLAFUREGILS.BIO-TRA
1745.KLIM.NAR-FIC
1850.PILTUR.NAR-FIC
1920.ARIN.REL-SER
DEV:
1250.THETUBROT.NAR-SAG
1350.BANDAMENNM.NAR-SAG
1475.AEVINTYRI.NAR-REL
1525.GEORGIUS.NAR-REL
1630.GERHARD.REL-OTH
1720.VIDALIN.REL-SER
1888.GRIMUR.NAR-FIC
1883.VOGGUR.NAR-FIC
1902.FOSSAR.NAR-FIC
2008.OFSI.NAR-SAG
Acknowledgments
This project is funded by The Strategic Research and Development Programme for Language Technology, grant no. 180020-5301. Thanks are due to Örvar Kárason, whose previous work was used as a basis for the conversion.
The Icelandic Parsed Historical Corpus (IcePaHC) is available at https://linguist.is/icelandic_treebank/Download.
Morphological features were generated using ABLTagger, a PoS tagger for Icelandic, developed by Steinþór Steingrímsson, Örvar Kárason and Hrafn Loftsson and available here.
References
@inproceedings{arnardottir-etal-2020-universal,
title = "A {U}niversal {D}ependencies Conversion Pipeline for a {P}enn-format Constituency Treebank",
author = "Arnard{\'o}ttir, {\TH}{\'o}runn and
Hafsteinsson, Hinrik and
Sigur{\dh}sson, Einar Freyr and
Bjarnad{\'o}ttir, Krist{\'\i}n and
Ingason, Anton Karl and
J{\'o}nsd{\'o}ttir, Hildur and
Steingr{\'\i}msson, Stein{\th}{\'o}r",
booktitle = "Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.udw-1.3",
pages = "16--25",
abstract = "The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.",
}
Changelog
- 2022-11-15 v2.11
- Various lemmas fixed.
- Validation syntax errors (too many subjects).
- Various minor fixes for UPOS, XPOS, deprels and UD features.
- Missing UD features added to
is_icepahc-trian.conllu
.- 1680.SKALHOLT.NAR-REL, sentences 150.21339 to 223.21412.
- Missing UD features added to joined tokens, mostly pronomial clitics.
- Incorrect
case
deprels changed tomark
for tokens 'ef', 'þegar', 'nema', 'þótt', 'þó'.
- 2022-05-15 v2.10
- A few errors, such as wrong lemmas, fixed.
- 2020-11-15 v2.7
- Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================ Data available since: UD v2.7 License: CC BY-SA 4.0 Includes text: yes Genre: fiction bible nonfiction legal Lemmas: converted from manual UPOS: converted from manual XPOS: manual native Features: automatic Relations: converted from manual Contributors: Arnardóttir, Þórunn; Hafsteinsson, Hinrik; Sigurðsson, Einar Freyr; Jónsdóttir, Hildur; Bjarnadóttir, Kristín; Ingason, Anton Karl; Rúnarsson, Kristján; Steingrímsson, Steinþór; Wallenberg, Joel C.; Rögnvaldsson, Eiríkur Contributing: elsewhere Contact: thar@hi.is, hinrik.hafst@gmail.com, einar.freyr.sigurdsson@arnastofnun.is ===============================================================================