Joint UD Parsing of Norwegian Bokmål and Nynorsk

This repo contains the data and models used for the experiments in the paper Joint UD Parsing of Norwegian Bokmål and Nynorsk by E. Velldal, L. Øvrelid and P. Hohle for Nodalida 2017 (http://www.ep.liu.se/ecp/131/001/ecp17131001.pdf).

Abstract

This paper investigates interactions in parser performance for the two official standards for written Norwegian: Bokmål and Nynorsk. We demonstrate that while applying models across standards yields poor performance, combining the training data for both standards yields better results than previously achieved for each of them in isolation. This has immediate practical value for processing Norwegian, as it means that a single parsing pipeline is sufficient to cover both varieties, with no loss in accuracy. Based on the Norwegian Universal Dependencies treebank we present results for multiple taggers and parsers, experimenting with different ways of varying the training data given to the learners, including the use of machine translation.

Datasets

The experiments in the paper are based on both the Bokmål and Nynorsk sections of Norwegian Dependency Treebank (NDT), converted to Universal Dependencies v1.2. We also experimented with translating between the two standards using the Apertium MT toolkit.

Training and applying models

In addition to the datasets, this repo provides the key UDPipe models used for tagging and parsing, in addition to the script used for training; scripts/train-udpipe.sh. Assuming you have UDPipe and word2vec installed:

Training:

./train-udpipe.sh model-prefix training-data validation-data

Example:

./train-udpipeline.sh my-bm ../ud-data/bm/bm-train.conllu ../ud-data/bm/bm-dev.conllu

To evaluate:

udpipe --accuracy --tag --parse ../udpipe-models/my-bm.model ../ud-data/bm/bm-dev.conllu

All model files (word2vec embeddings and udpipe model) are created under ../udpipe-models/.

Cite

@InProceedings{VelOvrHoh17,
  author = {Erik Velldal and Lilja Øvrelid and Petter Hohle},
  title = {Joint UD Parsing of Norwegian Bokmål and Nynorsk},
  booktitle = {Proceedings of the 21st Nordic Conference of Computational Linguistics},
  pages = {1--10},
  year = {2017},
  address = {Gothenburg, Sweden}
}

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
scripts		scripts
ud-data		ud-data
udpipe-models		udpipe-models
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts

scripts

ud-data

ud-data

udpipe-models

udpipe-models

README.md

README.md

Repository files navigation

Joint UD Parsing of Norwegian Bokmål and Nynorsk

Abstract

Datasets

Training and applying models

Cite

About

Releases

Packages

Languages

erikve/bm-nn-parsing

Folders and files

Latest commit

History

Repository files navigation

Joint UD Parsing of Norwegian Bokmål and Nynorsk

Abstract

Datasets

Training and applying models

Cite

About

Resources

Stars

Watchers

Forks

Languages