Unitex/GramLab is an open source, cross-platform, multilingual, lexicon- and grammar-based corpus processing suite.
This repository contains the Language Resources which are distributed within Unitex/GramLab.
Language name | Native name | Language Family | IETF | ISO 639-2 | ISO 639-1 |
---|---|---|---|---|---|
Arabic | العربية | Afro-Asiatic | ar | ara | ar |
Chinese | 汉语/漢語 | Sino-Tibetan | zh | chi/zho | zh |
English | English | Indo-European | en | eng | en |
Finnish | Suomi | Uralic | fi | fin | fi |
French | Français | Indo-European | fr | fra | fr |
Georgian (ancient) | ქართული | South Caucasian | oge | | |
German | Deutsch | Indo-European | de | deu | de |
Greek (ancient) | Αρχαία Ελληνικά | Indo-European | grc | grc | |
Greek (modern) | Ελληνικά | Indo-European | el | ell | el |
Italian | Italiano | Indo-European | it | ita | it |
Korean | 한국어 | Koreanic | ko | kor | ko |
Latin | Latine | Indo-European | la | lat | la |
Malagasy | Malagasy | Austronesian | mg | mlg | mg |
Norwegian Bokmål | Norsk bokmål | Indo-European | no | nob | nb |
Norwegian Nynorsk | Norsk nynorsk | Indo-European | nn | nno | nn |
Polish | Polski | Indo-European | pl | pol | pl |
Portuguese (Portugal) | Português (Portugal) | Indo-European | pt-PT | | |
Portuguese (Brazil) | Português (Brasil) | Indo-European | pt-BR | | |
Russian | Русский | Indo-European | ru | rus | ru |
Serbian-Cyrillic | Српски | Indo-European | sr-Cyrl | sro | sr |
Serbian-Latin | Serbian (Latin) | Indo-European | sr-Latn | srm | |
Spanish | Español | Indo-European | es | spa | es |
Thai | ไทย | Tai–Kadai | th | tha | th |
Everyone is welcome to help improve the Unitex/GramLab Language Resources by forking this repository and sending a pull request with their changes.
To add a new language to Unitex:
- Copy the folder template `zxx-t-Skel` and rename it according to the ISO 639-1 code of the new language
- Use the IETF language tag if the ISO 639-1 code is not available for your language.
Your new language must provide at least:
- An alphabet file (`Alphabet.txt`) and optionally a sorted alphabet (`Alphabet_sort.txt`)
- A sample corpus (`Corpus/Corpus.txt`). Make sure you have the rights to share this resource and provide the author information in `Corpus/Corpus.info`
- A sample dictionary (`Dela/lang-CODE.dic`) containing at least the words of the sample text
- A sentence delimitation graph (`Graphs/Preprocessing/Sentence/Sentence.grf`)
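The steps above can be sketched as a small helper script. This is only an illustration, not a tool shipped with Unitex/GramLab: the function names `create_language_folder` and `missing_resources` are hypothetical, and the sketch assumes that any `.dic` file placed directly under `Dela/` counts as the sample dictionary.

```python
import shutil
from pathlib import Path

# Minimum files listed above, relative to the language folder.
REQUIRED_FILES = [
    "Alphabet.txt",
    "Corpus/Corpus.txt",
    "Corpus/Corpus.info",
    "Graphs/Preprocessing/Sentence/Sentence.grf",
]


def create_language_folder(repo_root: Path, code: str) -> Path:
    """Copy the zxx-t-Skel template to a folder named after the language code."""
    template = repo_root / "zxx-t-Skel"
    target = repo_root / code
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copytree(template, target)
    return target


def missing_resources(lang_dir: Path) -> list[str]:
    """Return the minimum resources that are absent from lang_dir."""
    missing = [rel for rel in REQUIRED_FILES if not (lang_dir / rel).is_file()]
    # Assumption: any .dic file directly under Dela/ is the sample dictionary.
    if not any((lang_dir / "Dela").glob("*.dic")):
        missing.append("Dela/*.dic")
    return missing
```

Calling `create_language_folder(Path("."), "xx")` from the repository root and then `missing_resources(Path("xx"))` would list whatever still has to be added before the new language meets the minimum; `xx` stands in for the real language code.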
Before sharing your contribution, make sure that:
- File names only use 7-bit ASCII characters.
- For each compiled graph `.fst2` you are also providing the `.grf` version.
- For each dictionary `.dic` you are also providing a `.info` file describing the dictionary content (codes used in it, number of entries, authors, etc.).
- You accept the LGPLLR license.
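The file-related checks in this list can be automated before opening a pull request. The following is a minimal sketch, not an official validation tool: `check_contribution` is a hypothetical name, and the sketch assumes that a `.grf` source sits next to its compiled `.fst2` with the same base name, and that a `.info` file shares the base name of the dictionary it describes.

```python
from pathlib import Path


def check_contribution(lang_dir: Path) -> list[str]:
    """Return a list of human-readable problems found under lang_dir."""
    problems = []
    for path in sorted(lang_dir.rglob("*")):
        rel = path.relative_to(lang_dir).as_posix()
        # File and folder names must use 7-bit ASCII characters only.
        if not path.name.isascii():
            problems.append(f"non-ASCII name: {rel}")
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        # Assumption: the source graph sits next to the compiled one.
        if suffix == ".fst2" and not path.with_suffix(".grf").is_file():
            problems.append(f"missing .grf for: {rel}")
        # Assumption: the description file shares the dictionary's base name.
        if suffix == ".dic" and not path.with_suffix(".info").is_file():
            problems.append(f"missing .info for: {rel}")
    return problems
```

An empty list means none of the three file checks failed; accepting the LGPLLR license remains a manual step.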
Language Resources are mainly built and maintained by the members of the RELEX network, an international network of laboratories specialized in Computational Linguistics that was created by Maurice Gross and his LADL (Laboratoire d'Automatique Documentaire et Linguistique) team.
The User's Manual (in PDF format) is available in English and French (more translations are welcome). You can view and print it with Evince, downloadable here. The latest version of the User's Manual is accessible here.
Support questions can be posted in the community support forum. Please feel free to submit any suggestions or requests for new features too. Some general advice about asking technical support questions can be found here.
See the Bug Reporting Guide for information on how to report bugs.
Unitex/GramLab project decision-making is based on a community meritocratic process: anyone with an interest in the project can join the community, contribute to the project design, and participate in decisions. The Unitex/GramLab Governance Model describes how this participation takes place and how to earn merit within the project community.
Unitex/GramLab is spelled with capitals "U", "G" and "L", and with everything else in lower case. Apart from the forward slash, do not put a space or any other character between the words. Where the forward slash is not allowed, you can simply write "UnitexGramLab".
It's common to refer to the Unitex/GramLab Core as "Unitex", and to the Unitex Project-oriented IDE as "GramLab". If you are mentioning the distribution suite (Core, IDE, Linguistic Resources and other bundled tools), always use "Unitex/GramLab".
Language Resources are distributed under the terms of the Lesser General Public License For Linguistic Resources (LGPLLR). Contact unitex-devel@univ-mlv.fr for further inquiries.
Copyright (C) 2019 Université Paris-Est Marne-la-Vallée