This is a legacy repository for the STB subcorpora of the Nanyang Technological University - Multilingual Corpus (NTU-MC) project. New editions of NTU-MC are maintained by the NTU Computational Linguistics Lab.
- NTU-MC Toolkit: An annotation toolkit for multilingual text (supports Arabic, Chinese, Japanese, Korean, Indonesian, Vietnamese and English)
- GaChalign: A Python implementation of the Gale-Church sentence-level aligner with variable parameters
- Mini-segmenter: A dictionary-based Chinese segmenter
- Indotag: An implementation of the Pisceldo et al. (2010) Bahasa Indonesia part-of-speech tagger, using a 1M-word corpus from the Pan Asia Networking Localization Project
- NTU-MC v5.1 (26.08.14): Added NTU-MC Toolkit.
- NTU-MC v5.0 (29.04.13): Better cleaning with titles.
- NTU-MC v4.1 (08.04.13): Scheduled release.
- NTU-MC v4.0 (27.01.13): Re-cleaned and retagged from scratch.
- NTU-MC v3.0 (01.05.12): Scheduled release for IJALP.
- NTU-MC v2.0 (20.08.11): Cleaned and sentence aligned.
- NTU-MC v1.0 (01.05.11): Foundation text.
Please cite the following when using the data/scripts from the NTU-MC:
author = {Liling Tan and
Francis Bond},
title = {Building and Annotating the Linguistically Diverse NTU-MC
(NTU-Multilingual Corpus)},
booktitle = {PACLIC},
year = {2011},
pages = {362-371},
ee = {},
Liling Tan. 2011. Building the foundation text for Nanyang Technological University - Multilingual Corpus (NTU-MC). Bachelor Final Year Project. Nanyang Technological University: Singapore.
Liling Tan and Francis Bond. 2012. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). International Journal of Asian Language Processing, 22(4):161–174.
Liling Tan and Francis Bond. 2014. NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin, Ireland.
Other References: