MeCab on the Web!
MeCab is a popular Japanese part-of-speech and morphological analyzer. Originally a native compiled application, it can now be used from your web browser at http://fasiha.github.io/mecab-emscripten/.
Usage on the web
Much of the documentation on MeCab is in Japanese, including the official documentation as well as unofficial man pages. There is also a MeCab tutorial video by Jeffrey Berhow that may be of help (though it is mainly concerned with installing MeCab in Windows). I am not an expert on using MeCab: the instructions accompanying the online MeCab interface exhaust my knowledge. I will gratefully accept any and all contributions of a tutorial nature explaining how to use this tool.
MeCab on the Web was built on a fresh install of Ubuntu 14.04 LTS (inside a VirtualBox virtual machine, if you must know), but has also been successfully built on Mac OS X.
Native binary MeCab
First, if you don't already have MeCab installed as a native compiled application, build it. Download the latest source release (mecab-0.996.tar.gz as of September 2014), uncompress it, then build and install it via:
$ ./configure --with-charset=utf8 && make && make test && sudo make install
In Linux, you may need to run the following before the mecab executable will work:
$ export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
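If the export above ends up in a shell startup file, it may be worth making it idempotent so the path is not duplicated on every new shell. The following is only a sketch: `prepend_lib_path` is a hypothetical helper name introduced here for illustration, not something MeCab provides.

```shell
# Prepend a directory to LD_LIBRARY_PATH only if it is not already listed.
# prepend_lib_path is a hypothetical helper name, not part of MeCab.
prepend_lib_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Same effect as the export shown above, but safe to run repeatedly.
prepend_lib_path /usr/local/lib
```

Wrapping the current value in colons before matching avoids false positives on partial paths such as /usr/local/lib64.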
Next, download and build the IPADIC dictionary for your native binary MeCab. Download the latest source release (mecab-ipadic-2.7.0-20070801.tar.gz as of September 2014). Decompress it, then run the following to configure, build, and install it:
$ ./configure --with-charset=utf-8 && make && sudo make install
You may test your installation of MeCab and the IPADIC dictionary by confirming that this command produces the following output:
$ echo test | mecab
test	名詞,固有名詞,組織,*,*,*,*
EOS
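For a scripted version of this sanity check, something like the sketch below might do. It assumes MeCab's default output format, where each token line is the surface form, a tab, and a comma-separated feature list, followed by a final EOS line. `check_mecab_output` is a hypothetical helper name, and the check is deliberately loose: it does not compare the exact features, which depend on the dictionary.

```shell
# check_mecab_output is a hypothetical helper: it reads MeCab output on stdin
# and succeeds only if the first line is "surface<TAB>features" and the last
# line is the EOS marker.
check_mecab_output() {
    awk -F '\t' '
        NR == 1 { ok = (NF == 2 && $1 != "" && $2 != "") }
        { last = $0 }
        END { exit !(ok && last == "EOS") }
    '
}

# Only run the live check when a mecab executable is on the PATH.
if command -v mecab >/dev/null 2>&1; then
    if echo test | mecab | check_mecab_output; then
        echo "MeCab looks healthy"
    else
        echo "unexpected MeCab output" >&2
    fi
fi
```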
We finally come to Emscripten. Clone this repository. It contains the same files as the 0.996 release of MeCab, with some modifications needed for Emscripten.
MeCab's stock configure script overrides the CXXFLAGS arguments that are passed into it, which are used to control optimization levels. Therefore, lines 17991-17994 have to be commented out. The configure script included with this repository contains this update.
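Should you need to apply the same change to a pristine copy of the MeCab 0.996 configure script, a range of lines can be commented out with sed. This is a sketch only: `comment_out_lines` is a hypothetical helper name, and it assumes the affected lines are ordinary shell statements (not part of a here-document), so it is worth inspecting them first with `sed -n '17991,17994p' configure`.

```shell
# comment_out_lines FILE START END: prefix lines START..END of FILE with "# ",
# leaving a FILE.bak backup. A hypothetical helper, not part of MeCab.
comment_out_lines() {
    sed -i.bak "$2,$3 s/^/# /" "$1"
}

# For the unmodified MeCab 0.996 configure script, the call would be:
#   comment_out_lines configure 17991 17994
```

The `-i.bak` form is used because it is accepted by both GNU sed and the BSD sed shipped with Mac OS X.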
Configure and compile MeCab to LLVM intermediate representation (IR) with the following:
$ EMCONFIGURE_JS=1 emconfigure ./configure --with-charset=utf8 CXXFLAGS="-std=c++11 -O1" CFLAGS="-O1" && emmake make
(The configure step will produce some scary-looking errors, but they are merely the bundled script doing some unusual things that Emscripten can't yet handle neatly.)
Next, rename the MeCab LLVM IR (which resides in src/.libs) so Emscripten knows what to do with it, and copy in the default mecabrc configuration and the IPADIC dictionary data files.
$ cd src/.libs
$ cp mecab mecab.bc
$ cp /usr/local/etc/mecabrc .
$ cp -r /usr/local/lib/mecab/dic/ipadic .
$ em++ -O1 mecab.bc libmecab.so -o mecab.js -s EXPORTED_FUNCTIONS="['_mecab_do2']" --preload-file mecabrc --preload-file ipadic/
Note that only Linux will produce libmecab.so; in Mac OS X, this file will be called libmecab.dylib, so adjust accordingly. Also note that the mecab_do2 function was added for this project: it is a copy of [...] mecab.data files. Along with index.js, which are included in this repository, these files make up the entire MeCab on the Web project.
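If you script the build, the difference in the shared library's file name between Linux and Mac OS X could be handled by switching on the kernel name. The sketch below assumes `uname -s` prints Darwin on Mac OS X; `libmecab_name` and the `LIBMECAB` variable are hypothetical names chosen for illustration.

```shell
# libmecab_name is a hypothetical helper: given a kernel name as printed by
# `uname -s`, print the shared-library file name to pass to em++.
libmecab_name() {
    case "$1" in
        Darwin) echo libmecab.dylib ;;
        *)      echo libmecab.so ;;
    esac
}

LIBMECAB=$(libmecab_name "$(uname -s)")
echo "shared library to pass to em++: $LIBMECAB"
```

In a build script, `$LIBMECAB` would then take the place of libmecab.so in the em++ command.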
MeCab is released under BSD, GPL, or LGPL licenses, and so I have chosen to license MeCab on the Web (this repository) under the most liberal of these licenses, the BSD license.
MeCab is copyrighted by Taku Kudo and NTT. The IPADIC data is copyrighted by the Nara Institute of Science and Technology, and its authors have released it under a BSD-like license.
Emscripten is the work of Alon Zakai, of the Mozilla Foundation, and many contributors. This project also uses D3.js, by Mike Bostock, of the New York Times, and many contributors.
Many thanks to all these projects' authors.
In case you missed it above, MeCab on the Web is released under the BSD license.