A treebank of Scottish Gaelic based on the Annotated Reference Corpus Of Scottish Gaelic (ARCOSG).
The Scottish Gaelic treebank takes its data from ARCOSG, the Annotated Reference Corpus of Scottish Gaelic (Lamb et al. 2016), with an annotation scheme based on that of the Irish UD treebank. Full bibliographic details are to be had there.
It contains eight subcorpora, each made up of a varying number of original files of approximately 1000 tokens each. Not all of the trees made it into release 2.9, in which only the test and dev files were complete; the training set was filled out in release 2.10, which contains all of ARCOSG. All files listed below are in the training set unless they are explicitly marked as being in test or dev. In the ARCOSG documentation the names of contributors are largely given in Gaelic; I have kept these and glossed them with the English form of the name where that will be familiar to non-Gaelic speakers.
- Conversation. c01 is in test, c03 in dev and the rest in train. These are transcripts of interviews in the Western Isles from 1998 to 2000. In c03 and c04 speakers 2, 4 and 5 are children.
- Sport. s06 is in test, s08 in dev and the rest in train. s01 to s05 are Radio nan Gàidheal commentary on a match between Scotland and Australia; s06 to s10 on Scotland vs. Yugoslavia.
- Oral narrative.
  - n01: Na Trì Leinntean Canaich (test)
  - n02: Conall Gulban (dev)
  - n03: Na Fiantaichean
  - n04: Gille an Fheadain Duibh
  - n05: Bodach Ròcabarraigh
  - n06: Iain Beag MacAnndra
  - n07: Fear a' Churracain Ghlais
  - n08: Boban Saor
  - n09: Bean 'ic Odrum
  - n10: Blàr Chàirinis
- News scripts from Radio nan Gàidheal in the early 1990s.
  - ns01: Màiri Anna NicUalraig (Mary Ann Kennedy)
  - ns02: Dòmhnall Moireasdan
  - ns03: Iseabail NicIllinnein
  - ns04: Innes Rothach
  - ns05: Innes Rothach (test)
  - ns06: Pàdraig MacAmhlaigh (dev)
  - ns07: Dòmhnall Moireasdan (test)
  - ns08: Màiri Anna NicUalraig (dev)
  - ns09: Seumas Domhnallach
  - ns10: Seumas Domhnallach
- Public interview
  - p01: Peataichean, conversation on Coinneach MacÌomhair's programme
  - p02: Fred MacAulay and Martin MacDonald
  - p03: John MacInnes and William Matheson
  - p04: Geamaichean Sholais 1, conversation on Coinneach MacÌomhair's programme (test)
  - p05: Geamaichean Sholais 2 (dev)
  - p06: Bonn Comhraidh, 1980s political discussion programme
  - p07: Conversation on Coinneach MacÌomhair's programme 2000-01-17 part 1
  - p08: Conversation on Coinneach MacÌomhair's programme 2000-01-17 part 2
- Fiction
  - f01: Am Fainne by Eilidh Watt
  - f02: from Cùmhnantan by Tormod MacGill-Eain
  - f03: Droch Àm by Pòl MacAonghais (test)
  - f04: Spàl Tìm by Cailean T. MacCoinneach
  - f05: Teine a Loisgeas by Eilidh Watt
  - f06: Beul na h-Oidhche by Somhairle MacGill-Eain (Sorley Maclean)
  - f07: from An t-Aonaran by Iain Mac a' Ghobhainn (Iain Crichton Smith)
  - f08: Briseadh na Cloiche by Iain Moireach (dev)
- Formal prose:
  - fp01: Trì Ginealaichean by D. E. Dòmhnallach
  - fp02: Nua-Bhàrdachd Ghàidhlig by Dòmhnall MacAmhlaigh (Donald MacAulay)
  - fp03: Mairead N. Lachlainn by Somhairle MacGill-Eain (test)
  - fp04: from Bith-eòlas ('Biology'), a translation by Ruairidh MacThòmais (Derick Thomson)
  - fp05: Aramach am Bearnaraidh
  - fp06: Blàr a' Chumhaing by Iain A. MacDonald
  - fp07: Na Marbhrannan by Coinneach D. MacDhòmhnaill
  - fp08: Cainnt is Cànan by J. MacInnes
  - fp09: from Dòmhnall Uilleam Stiùbhart (Donald William Stewart)'s unpublished PhD thesis (dev)
- Popular writing: columns from The Scotsman:
  - pw01: An Cuir am Papa... by Aileig O Hianlaidh (Alex O'Henley)
  - pw02: A bith mar Chorra... by Joina NicDhomnaill (test)
  - pw03: Pàdraig Sellar by Ùisdean MacIllinnein
  - pw04: A' Cur Às Dhuinn Fhìn by Aonghas Mac-a-Phì
  - pw05: Aon Dùthaich by Murchadh MacLeòid
  - pw06: Blas a' Ghuga by Coinneach MacLeòid (dev)
  - pw07: Luchd-ciùil by Criosaidh Dick
  - pw08: Na Gàidheil Ùra by Criosaidh Dick
  - pw09: A' Siubhail gu Rèidh by Tormod Domhnallach (dev)
  - pw10: Poileaticeans by Niall M. Brownlie
  - pw11: Oifigeir Gàidhlig by Aileig O Hianlaidh (test)
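
The file identifiers above follow a simple pattern: a letter prefix for the subcorpus, then a two-digit number. The following is a minimal Python sketch of how that listing could be turned into a lookup. The names `SUBCORPORA`, `TEST`, `DEV` and `describe` are hypothetical and not part of the treebank or any tool, and the label "Fiction" for the `f` prefix is inferred from the `Genre` metadata rather than stated in the listing. The sketch does not check that a given number actually exists in the corpus.

```python
import re

# Prefix -> subcorpus, following the listing above.
# "Fiction" for the "f" prefix is an assumption based on the Genre metadata.
SUBCORPORA = {
    "c": "Conversation",
    "s": "Sport",
    "n": "Oral narrative",
    "ns": "News scripts",
    "p": "Public interview",
    "f": "Fiction",
    "fp": "Formal prose",
    "pw": "Popular writing",
}

# Files explicitly marked as test or dev in the listing; all others are train.
TEST = {"c01", "s06", "n01", "ns05", "ns07", "p04", "f03", "fp03", "pw02", "pw11"}
DEV = {"c03", "s08", "n02", "ns06", "ns08", "p05", "f08", "fp09", "pw06", "pw09"}


def describe(file_id):
    """Return (subcorpus, split) for a file identifier such as 'ns05'."""
    match = re.fullmatch(r"([a-z]+)(\d+)", file_id)
    if match is None or match.group(1) not in SUBCORPORA:
        raise ValueError(f"unrecognised file identifier: {file_id!r}")
    if file_id in TEST:
        split = "test"
    elif file_id in DEV:
        split = "dev"
    else:
        split = "train"
    return SUBCORPORA[match.group(1)], split
```

For example, `describe("ns05")` gives the News scripts subcorpus and the test split, while `describe("f01")` falls through to train.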
See https://universaldependencies.org/gd/index.html for detailed linguistic documentation.
We wish to thank all of the contributors to ARCOSG and fellow Celtic language UD developers Teresa Lynn, Kevin Scannell, Johannes Heinecke and Fran Tyers.
- Batchelor, Colin. 2019. Universal dependencies for Scottish Gaelic: syntax. In Proceedings of CLTW2019 at Machine Translation Summit XVII, Dublin, August 2019.
- Lamb, William, Sharon Arbuthnot, Susanna Naismith, and Samuel Danso. 2016. Annotated Reference Corpus of Scottish Gaelic (ARCOSG), 1997–2016 [dataset]. Technical report, University of Edinburgh; School of Literatures, Languages and Cultures; Celtic and Scottish Studies. https://doi.org/10.7488/ds/1411.
- Lynn, Teresa and Jennifer Foster. 2016. [Universal Dependencies for Irish](http://www.nclt.dcu.ie/~tlynn/Lynn_CLTW2016.pdf). CLTW 2016, Paris, France, July 2016.
- 2022-05-15 v2.10
  - All of ARCOSG now in the treebank.
- 2021-11-15 v2.9
  - Small fixes to README.md
  - Some missing sentences added.
  - `PronType=Int` for interrogative pronouns and
  - Made sure interrogative pronouns were all pronouns and adjusted trees and documentation accordingly.
- 2021-05-15 v2.8
  - ri linn 's is a fixed expression now.
  - the ' in, for example, 'dol is no longer a separate token.
  - `flat` has been replaced with `flat:name` in personal names and `flat:foreign` in foreign expressions. It remains for placenames, dates and telephone numbers.
  - `obl` have been reviewed and corrected throughout the corpus and now replace `compound` for f(h)(è)in and a/ri chèile.
  - Documents identified with
- 2020-11-15 v2.7
  - `Poss=Yes` added in line with Irish.
  - Tokens in the original with XPOS beginning `Spp` are divided into their component words.
  - Systematic tidying of `Form=Emp` in line with Irish and extended to other parts of speech.
  - `PART`s with XPOS `Qa` now tagged correctly
  - Words with UPOS `AUX` now have full features.
  - The English borrowing so is
  - 's in fad 's and the like is now related to fad or o chionn by
  - Cosubordinative agus and is are now `SCONJ` like in Irish.
  - ach is `PART` where it is a focus particle rather than a preposition or a conjunction.
  - Use of `xcomp:pred` consistent in the sport subcorpora where the root is a footballer rather than bi.
- 2020-05-15 v2.6
  - Small fixes to README.md.
  - Some missing sentences added to dev and test, bringing them both over 10000 words.
- 2019-11-15 v2.5
  - Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.5
License: CC BY-SA 4.0
Includes text: yes
Genre: nonfiction fiction news spoken
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Batchelor, Colin
Contributing: here
Contact: email@example.com
===============================================================================
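
The metadata block is meant to be read by scripts as well as by people. As a rough illustration only, here is a Python sketch that collects the `Key: value` pairs into a dictionary. It assumes the block is laid out with one pair per line between two lines beginning with `===`. The function name `parse_metadata` and the shortened `SAMPLE` text are hypothetical, and this is not the parser used by the UD infrastructure.

```python
def parse_metadata(readme_text):
    """Collect 'Key: value' pairs found between the two '===' delimiter lines."""
    metadata = {}
    inside = False
    for line in readme_text.splitlines():
        if line.startswith("==="):
            if inside:
                break  # closing delimiter reached
            inside = True  # opening delimiter reached
            continue
        if inside and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    return metadata


# Shortened sample using values from this README (assumed one pair per line).
SAMPLE = """\
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.5
License: CC BY-SA 4.0
Includes text: yes
Genre: nonfiction fiction news spoken
===============================================================================
"""

meta = parse_metadata(SAMPLE)
genres = meta["Genre"].split()  # space-separated list of genres
```

Splitting on the first colon only keeps the key names intact even when a value happens to contain a colon of its own.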