Unitex/GramLab Language Resources

Unitex/GramLab is an open-source, cross-platform, multilingual, lexicon- and grammar-based corpus processing suite.

This repository contains the Language Resources that are distributed with Unitex/GramLab.

Languages

| Language name | Native name | Language family | IETF | ISO 639-2 | ISO 639-1 |
|---|---|---|---|---|---|
| Arabic | العربية | Afro-Asiatic | ar | ara | ar |
| English | English | Indo-European | en | eng | en |
| Finnish | Suomi | Uralic | fi | fin | fi |
| French | Français | Indo-European | fr | fra | fr |
| Georgian (Ancient) | ქართული | South Caucasian | oge | | |
| German | Deutsch | Indo-European | de | deu | de |
| Greek (ancient) | Αρχαία Ελληνικα | Indo-European | grc | grc | |
| Greek (modern) | Ελληνικά | Indo-European | el | ell | el |
| Italian | Italiano | Indo-European | it | ita | it |
| Korean | 한국어 | Koreanic | ko | kor | ko |
| Latin | Latine | Indo-European | la | lat | la |
| Malagasy | Malagasy | Austronesian | mg | mlg | mg |
| Norwegian Bokmål | Norsk bokmål | Indo-European | no | nob | nb |
| Norwegian Nynorsk | Norsk nynorsk | Indo-European | nn | nno | nn |
| Polish | Polski | Indo-European | pl | pol | pl |
| Portuguese (Portugal) | Português (Portugal) | Indo-European | pt-PT | | |
| Portuguese (Brazil) | Português (Brasil) | Indo-European | pt-BR | | |
| Russian | Русский | Indo-European | ru | rus | ru |
| Serbian-Cyrillic | Српски | Indo-European | sr-Cyrl | sro | sr |
| Serbian-Latin | Serbian (Latin) | Indo-European | sr-Latn | srm | |
| Spanish | Español | Indo-European | es | spa | es |
| Thai | ไทย | Tai–Kadai | th | tha | th |

Contributing

Everyone is welcome to help improve the Unitex/GramLab Language Resources by forking this repository and sending a pull request with their changes.

How to add support for a new language in Unitex

To add a new language to Unitex:

  • Copy the template folder zxx-t-Skel and rename the copy to the ISO 639-1 code of the new language.
  • If your language has no ISO 639-1 code, use its IETF language tag instead.

Your new language must provide at least:

  • An alphabet file (Alphabet.txt) and optionally a sorted alphabet (Alphabet_sort.txt)
  • A sample corpus (Corpus/Corpus.txt). Make sure you have the rights to share this resource, and provide the author information in Corpus/Corpus.info
  • A sample dictionary (Dela/lang-CODE.dic) containing at least the words of the sample text
  • A sentence delimitation graph (Graphs/Preprocessing/Sentence/Sentence.grf)
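
The list of required resources above can be checked mechanically before opening a pull request. The following is only a minimal sketch, not part of the Unitex/GramLab tooling: the function name `missing_resources` is hypothetical, and it assumes that any `.dic` file inside `Dela/` counts as the sample dictionary.

```python
from pathlib import Path

# Paths taken from the list above, relative to the language folder.
REQUIRED_FILES = [
    "Alphabet.txt",
    "Corpus/Corpus.txt",
    "Corpus/Corpus.info",
    "Graphs/Preprocessing/Sentence/Sentence.grf",
]


def missing_resources(lang_dir):
    """Return the required resources that are absent from lang_dir."""
    root = Path(lang_dir)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    # Assumption: any .dic file in Dela/ counts as the sample dictionary.
    dela = root / "Dela"
    if not (dela.is_dir() and any(dela.glob("*.dic"))):
        missing.append("Dela/*.dic")
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_language.py <language-folder>
    for rel in missing_resources(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("missing:", rel)
```

An empty result means that every resource named in the list is present. The optional Alphabet_sort.txt is deliberately not checked.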

Before sharing your contribution, make sure that:

  • File names only use 7-bit ASCII characters.
  • For each compiled graph (.fst2), you are also providing the .grf version.
  • For each dictionary .dic you are also providing a .info file describing the dictionary content (codes used in it, number of entries, authors, etc).
  • You accept the LGPLLR license.
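
The first three checks in this list lend themselves to automation. The sketch below is illustrative only: the function name `contribution_problems` is hypothetical, and it assumes that a `.grf` or `.info` companion file sits in the same folder and shares the base name of the `.fst2` or `.dic` file it belongs to.

```python
from pathlib import Path


def contribution_problems(lang_dir):
    """Return human-readable problems found in the files under lang_dir."""
    root = Path(lang_dir)
    problems = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # File names should use 7-bit ASCII characters only.
        if not path.name.isascii():
            problems.append(f"non-ASCII name: {rel}")
        # Assumption: companion files share the same folder and base name.
        if path.suffix == ".fst2" and not path.with_suffix(".grf").is_file():
            problems.append(f"no .grf for: {rel}")
        if path.suffix == ".dic" and not path.with_suffix(".info").is_file():
            problems.append(f"no .info for: {rel}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_contribution.py <language-folder>
    for problem in contribution_problems(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(problem)
```

The license condition cannot be verified by a script, and the content of each .info file still needs a manual review.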

RELEX network

Language Resources are mainly built and maintained by the members of the RELEX network, an international network of laboratories specialized in Computational Linguistics that was created by Maurice Gross and his LADL (Laboratoire d'Automatique Documentaire et Linguistique) team.

| Country | Partner |
|---|---|
| Belgium | Catholic University of Leuven |
| Belgium | CENTAL |
| Brazil | Federal University of Goias |
| Brazil | NILC |
| Brazil | Projeto Relex |
| Brazil | PUC RIO |
| Canada | University of Montréal |
| Denmark | University of Copenhagen |
| England | Research and Development Unit for English Studies |
| France | CRISCO |
| France | EHESS |
| France | LDI |
| France | LIGM |
| France | LIMSI |
| France | LIP6 |
| France | LORIA |
| France | UFRL |
| France | Université de Tours |
| France | University Bordeaux 3 |
| France | University Grenoble 3 |
| France | University of Franche-Comté |
| France | University of Paris-Est Marne-la-Vallée |
| France | University of Rouen |
| France | University of Strasbourg |
| France | University Paris 8 |
| France | University Paris-Sorbonne |
| Germany | CIS, University of Munich |
| Germany | University of Heidelberg |
| Greece | ILSP |
| Greece | University of Thessaloniki |
| Hong Kong | City University of Hong Kong |
| Hungary | Research Institute for Linguistics |
| Israel | University of Tel Aviv |
| Italy | University of Bari |
| Italy | University of Salerno |
| Japan | Information Science Research Center |
| Korea | Hankuk University of Foreign Studies |
| Madagascar | University of Antananarivo |
| Norway | University of Bergen |
| Poland | Adam Mickiewicz University |
| Portugal | LabEL |
| Portugal | University of Algarve |
| Serbia | University of Belgrade |
| Slovakia | The Faculty of Economics |
| Spain | Autonomous University of Barcelona |
| Spain | University of Alicante |
| Switzerland | University of Genève |
| Switzerland | University of Zürich |
| United States | Florida International University |
| United States | New York University |
| United States | University of California San Diego |
| United States | University of North Texas |

Documentation

The User's Manual (in PDF format) is available in English and French (more translations are welcome). You can view and print it with Evince, downloadable here. The latest version of the User's Manual is accessible here.

Support

Support questions can be posted in the community support forum. Please feel free to submit any suggestions or requests for new features too. Some general advice about asking technical support questions can be found here.

Reporting Bugs

See the Bug Reporting Guide for information on how to report bugs.

Governance Model

Unitex/GramLab project decision-making is based on a community meritocratic process: anyone with an interest in the project can join the community, contribute to the project design, and participate in decisions. The Unitex/GramLab Governance Model describes how this participation takes place and how to set about earning merit within the project community.

Spelling

Unitex/GramLab is spelled with capital "U", "G" and "L", and with everything else in lower case. Apart from the forward slash, do not put a space or any other character between the words. Only where the forward slash is not allowed may you write “UnitexGramLab”.

It's common to refer to the Unitex/GramLab Core as "Unitex", and to the Unitex Project-oriented IDE as "GramLab". If you are referring to the distribution suite (Core, IDE, Linguistic Resources and other bundled tools), always use "Unitex/GramLab".

License

Language Resources are distributed under the terms of the Lesser General Public License For Linguistic Resources (LGPLLR). Contact unitex-devel@univ-mlv.fr for further inquiries.

--

Copyright (C) 2018 Université Paris-Est Marne-la-Vallée