FeforParCorp
Contents
- Parallel Corpora for DELPH-IN
  - Collections/Samples of available parallel corpora
    - Europarl Corpus
    - OPUS: Technical Documentation (plus Europarl and European Constitution)
    - The Sofie Treebank
    - The JRC-Acquis Multilingual Parallel Corpus
    - Cathedral and the Bazaar
    - Universal Declaration of Human Rights
    - Scroogled
  - Some criteria for choosing a corpus
Collections/Samples of available parallel corpora
Europarl Corpus
- URL: http://people.csail.mit.edu/koehn/publications/europarl/
- Samples of Europarl Corpus
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit by basename of file and relative position in raw sentence-separated ASCII files
- Todo: complete cross-lingual alignment (currently only pair-wise implicit alignment). Possibly we can get something along these lines from Andreas Eisele.
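The implicit alignment described above (same file basename, same line position) can be made explicit with a few lines of Python. The following is only a minimal sketch under assumptions not stated on this page: one directory per language, identically named files in each, one sentence per line, and a hypothetical helper name `align_pair`. Files whose line counts differ are skipped, since relative position is then no longer a reliable alignment.

```python
from pathlib import Path
from typing import Iterator


def read_sentences(path: Path, encoding: str = "utf-8") -> list[str]:
    """Return the sentences of one file, assuming one sentence per line."""
    with path.open(encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]


def align_pair(src_dir: Path, tgt_dir: Path) -> Iterator[tuple[str, int, str, str]]:
    """Yield (file name, line position, source sentence, target sentence).

    Assumed layout (hypothetical): src_dir and tgt_dir each hold the files
    of one language, and translations of the same text share a file name.
    """
    src_files = {p.name: p for p in src_dir.iterdir() if p.is_file()}
    tgt_files = {p.name: p for p in tgt_dir.iterdir() if p.is_file()}

    # Only files present in both languages can be aligned by basename.
    for name in sorted(src_files.keys() & tgt_files.keys()):
        src = read_sentences(src_files[name])
        tgt = read_sentences(tgt_files[name])
        if len(src) != len(tgt):
            # Line counts differ, so position does not identify a pair.
            continue
        for position, (s, t) in enumerate(zip(src, tgt)):
            yield name, position, s, t
```

Calling `list(align_pair(Path("en"), Path("de")))` on two such directories would give an explicit pair-wise alignment table. It does not address the todo item: aligning all languages at once needs more than intersecting file names, because sentence segmentation can differ between language pairs.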
OPUS: Technical Documentation (plus Europarl and European Constitution)
- URL: http://logos.uio.no/opus/
The Sofie Treebank
- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part. Information about status, availability, formats, and analyses can be found at the URL below.
- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html
This is not redistributable:
- "Permission to use the corpus can be given to those signing an agreement that they will only use the corpus for research, development and teaching. A web-form will be available soon, in the meantime, contact Lars Nygaard. If you already have got a permission, click here to use the corpus."
Cathedral and the Bazaar
Translations in other languages exist (including Japanese), which we may be able to get permission for.
We decided to use this as a corpus; the full description is now up at MatrixMrsCatb.
Universal Declaration of Human Rights
The preamble (a single sentence spanning several paragraphs) is impossible to parse, but apart from that the text isn't too difficult, and it provides some nice universal quantifiers and modals. It is a little short (65 sentences), but there are many other declarations. There are 369 different translations (4 more than last year), most of excellent quality --- the multilinguality is the main selling point. It is freely available. There is a little synergy, as it is the de facto standard for testing Unicode fonts --- it should print nicely.
Scroogled
A short story with many free translations. It is a bit short: about 500 sentences.
Some criteria for choosing a corpus
- difficulty --- we need to have some hope of parsing it
- size --- to build statistical models it has to be a certain size
- quality --- the language should be natural (often a problem for translations)
- availability --- we need to be able to share the data
- multilinguality --- it would be nice to have existing translations
- relevance --- the genre should be one you are interested in
- synergy --- it is nice to reuse/complement existing markup
- diversity --- it can be interesting to experiment with a mixture of corpora of different text types