Speed Comparison

Koichi Akabe edited this page Jun 11, 2022 · 5 revisions

This page compares the analysis speed of python-vaporetto with that of other tokenizers and morphological analyzers.

Experimental Setup

We compared the following software: python-vaporetto, Mykytea-python, mecab-python3, and SudachiPy.

For python-vaporetto and Mykytea-python, we used the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page. For mecab-python3, we used unidic 1.1.0. For SudachiPy, we used SudachiDict-core 20220519, which is based on UniDic, in both "a" and "c" modes.

We tokenized "I Am a Cat" (by Soseki Natsume), which is available at Aozora Bunko, and measured the elapsed time of three operations: counting tokens, concatenating all surfaces, and directly generating tokenized strings.
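The three measurements can be sketched as follows. This is a minimal illustration and not the benchmark code used for this page: `toy_tokenize` is a hypothetical whitespace-splitting stand-in for a real tokenizer, and the repeat count and the use of the sample standard deviation are assumptions.

```python
import statistics
import time


def toy_tokenize(line):
    # Hypothetical stand-in for a real tokenizer: splits on whitespace.
    return line.split()


def count_tokens(lines):
    # "Counting": only the number of tokens is needed.
    return sum(len(toy_tokenize(line)) for line in lines)


def concat_surfaces(lines):
    # "Concatenating": every token surface is read and joined together.
    return ''.join(''.join(toy_tokenize(line)) for line in lines)


def to_strings(lines):
    # "To String": produce one space-separated tokenized string per line.
    # A real library may offer a direct call for this; here it is emulated.
    return [' '.join(toy_tokenize(line)) for line in lines]


def measure(func, lines, repeats=10):
    # Returns (mean, sample standard deviation) of elapsed time in ms.
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(lines)
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed), statistics.stdev(elapsed)


if __name__ == '__main__':
    # Pre-spaced sample text, because the toy tokenizer needs spaces.
    sample = ['I am a cat .', 'As yet I have no name .']
    operations = [
        ('Counting', count_tokens),
        ('Concatenating', concat_surfaces),
        ('To String', to_strings),
    ]
    for name, func in operations:
        mean_ms, std_ms = measure(func, sample)
        print(f'{name}: {mean_ms:.3f} ms (STD {std_ms:.3f})')
```

To time an actual tokenizer, replace `toy_tokenize` with a call into that library and load the input text once, outside the timed region, so that only the analysis itself is measured.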

The benchmark machine had the following specification:

  • CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  • Memory: 64GiB
  • OS: CentOS Linux release 7.5.1804 (Core)

The benchmark code can be found here.

Results

Tool Name         Counting [ms]   STD   Concatenating [ms]   STD   To String [ms]   STD
Mykytea-python              883     3                2,227   105             ----  ----
python-vaporetto            177     0                  424     3              166     0
mecab-python3              ----  ----                 ----  ----              207     7
SudachiPy (a)               635     2                1,097    44              788     1
SudachiPy (c)               614     1                1,044     3              773     3
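
To make the figures easier to compare, the mean times can be expressed relative to one tool. The sketch below copies the mean values from the results table and divides each by the python-vaporetto time for the same operation. The names `RESULTS_MS` and `relative_to` are hypothetical helpers introduced here for illustration, and `None` marks a combination for which the table reports no value.

```python
# Mean elapsed times [ms] copied from the results table.
# None marks a tool/operation pair with no reported value.
RESULTS_MS = {
    'Mykytea-python': {'counting': 883, 'concatenating': 2227, 'to_string': None},
    'python-vaporetto': {'counting': 177, 'concatenating': 424, 'to_string': 166},
    'mecab-python3': {'counting': None, 'concatenating': None, 'to_string': 207},
    'SudachiPy (a)': {'counting': 635, 'concatenating': 1097, 'to_string': 788},
    'SudachiPy (c)': {'counting': 614, 'concatenating': 1044, 'to_string': 773},
}


def relative_to(baseline, results=RESULTS_MS):
    """Return each tool's time divided by the baseline tool's time."""
    base = results[baseline]
    ratios = {}
    for tool, times in results.items():
        ratios[tool] = {
            op: (t / base[op] if t is not None and base[op] is not None else None)
            for op, t in times.items()
        }
    return ratios


if __name__ == '__main__':
    for tool, row in relative_to('python-vaporetto').items():
        cells = ', '.join(
            f'{op}: ' + ('n/a' if r is None else f'{r:.2f}x')
            for op, r in row.items()
        )
        print(f'{tool}: {cells}')
```

A ratio above 1 means the tool took longer than python-vaporetto for that operation on this machine and input.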