Nanyang Technological University - Multilingual Corpus (STB subcorpora)
This is a legacy repository for the STB subcorpora of the Nanyang Technological University - Multilingual Corpus (NTU-MC) project. New editions of NTU-MC are maintained by the NTU Computational Linguistics Lab.


  • NTU-MC Toolkit: An annotation toolkit for multilingual text (supports Arabic, Chinese, Japanese, Korean, Indonesian, Vietnamese and English)
  • GaChalign: A Python implementation of the Gale-Church sentence-level aligner with variable parameters
  • Mini-segmenter: A dictionary-based Chinese segmenter
  • Indotag: An implementation of the Pisceldo et al. (2010) Bahasa Indonesia part-of-speech tagger, using the 1M-word corpus from the Pan Asia Networking Localization Project
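
To give a rough idea of what dictionary-based segmentation involves, here is a minimal sketch of greedy forward maximum matching. It is an illustration only and is not taken from Mini-segmenter; the function name `max_match_segment`, the `max_word_len` parameter and the toy dictionary are all assumptions made for this example, and the actual tool may use a different strategy.

```python
def max_match_segment(text, dictionary, max_word_len=None):
    """Segment `text` by greedy forward maximum matching.

    Hypothetical helper for illustration; not part of Mini-segmenter.
    `dictionary` is any container of known words (a set is fastest).
    """
    if max_word_len is None:
        # Longest dictionary entry bounds the window; fall back to 1 if empty.
        max_word_len = max((len(word) for word in dictionary), default=1)
    max_word_len = max(1, max_word_len)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, invented for this example.
toy_dictionary = {"南洋", "理工", "大学", "南洋理工大学", "多语", "语料库"}
print(max_match_segment("南洋理工大学多语语料库", toy_dictionary))
# → ['南洋理工大学', '多语', '语料库']
```

Greedy matching prefers the longest entry at each position, so the result depends heavily on dictionary coverage; characters not covered by any entry come out as single-character tokens.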


  • NTU-MC v5.1 (26.08.14): Added NTU-MC Toolkit
  • NTU-MC v5.0 (29.04.13): Better cleaning with titles
  • NTU-MC v4.1 (08.04.13): Scheduled release.
  • NTU-MC v4.0 (27.01.13): Re-cleaned and retagged from scratch.
  • NTU-MC v3.0 (01.05.12): Scheduled release for IJALP
  • NTU-MC v2.0 (20.08.11): Cleaned and sentence aligned.
  • NTU-MC v1.0 (01.05.11): Foundation text.


Please cite the following when using the data/scripts from the NTU-MC:

  author    = {Liling Tan and
               Francis Bond},
  title     = {Building and Annotating the Linguistically Diverse NTU-MC
               (NTU-Multilingual Corpus)},
  booktitle = {PACLIC},
  year      = {2011},
  pages     = {362-371},
  ee        = {},

Other References: