Nanyang Technological University - Multilingual Corpus (STB subcorpora)


This is a legacy repository for the STB subcorpora of the Nanyang Technological University - Multilingual Corpus (NTU-MC) project. New editions of NTU-MC are maintained by the NTU Computational Linguistics Lab.


  • NTU-MC Toolkit: An annotation toolkit for multilingual text (supports Arabic, Chinese, Japanese, Korean, Indonesian, Vietnamese and English)
  • GaChalign: A Python implementation of the Gale-Church sentence-level aligner with variable parameters
  • Mini-segmenter: A dictionary-based Chinese segmenter
  • Indotag: Implementation of the Pisceldo et al. (2010) Bahasa Indonesia part-of-speech tagger, using a 1M-word corpus from the Pan Asia Networking Localization Project
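
As an illustration of what dictionary-based segmentation means in general, here is a minimal sketch using greedy forward maximum matching. This is an assumption chosen for illustration, not necessarily the method Mini-segmenter uses; the function name `segment`, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching over a word list (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # so unknown characters never stall the loop.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real segmenter would load a full dictionary.
lexicon = {"南洋", "理工", "大学", "南洋理工大学", "语料库"}
print(segment("南洋理工大学语料库", lexicon, max_len=6))
```

Greedy matching prefers the longest dictionary entry at each position, so the quality of the output depends heavily on the coverage of the lexicon.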


  • NTU-MC v5.1 (26.08.14): Added NTU-MC Toolkit.
  • NTU-MC v5.0 (29.04.13): Better cleaning with titles.
  • NTU-MC v4.1 (08.04.13): Scheduled release.
  • NTU-MC v4.0 (27.01.13): Re-cleaned and retagged from scratch.
  • NTU-MC v3.0 (01.05.12): Scheduled release for IJALP.
  • NTU-MC v2.0 (20.08.11): Cleaned and sentence aligned.
  • NTU-MC v1.0 (01.05.11): Foundation text.


Please cite the following when using the data/scripts from the NTU-MC:

  author    = {Liling Tan and
               Francis Bond},
  title     = {Building and Annotating the Linguistically Diverse NTU-MC
               (NTU-Multilingual Corpus)},
  booktitle = {PACLIC},
  year      = {2011},
  pages     = {362-371},
  ee        = {},

Other References: