A PHP parser for the CC-CEDICT Chinese-English dictionary
This branch is 7 commits ahead, 3 commits behind mdsills:master.

cccedict

About

An object-oriented PHP parser for the Chinese-English dictionary CC-CEDICT. It is fully customizable, light on memory and CPU, and outputs structured data.

PHP Version

This parser is written for PHP 7. It will not work on PHP 5.

Demo

Clone the repository to your system, then run the following commands in the repository root:

wget -O demo/cedict.gz http://www.mdbg.net/chindict/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz
php -f demo/index.php

Options

Required settings

  • setFilePath(string) sets the path of the file to extract and process

Optional settings

  • setBlockSize(int) sets the number of lines to read and parse at a time
  • setStartLine(int) sets the line to start from, in case you don't want to start from the beginning
  • setNumberOfBlocks(float) sets how many blocks to read, in case you don't want to read all the way to the end. You can pass INF to read everything (INF is a float in PHP, hence the parameter type).
  • setOptions(array) defines which data you want returned (see below)

Returned data

The parser will return an array with:

  • an array of Entry objects filled with data as per your configuration (see below)
  • an array of any skipped lines
  • the number of parsed lines
  • the number of skipped lines
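
As an illustration of that four-part result, the following sketch parses a few lines in CC-CEDICT format (`Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`) and collects entries, skipped lines and both counts. The function name `parseBlock()`, the array keys and the use of plain arrays instead of Entry objects are all assumptions made for this example. It also assumes that comment lines and lines that do not match the format count as skipped.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch, not the library's code: parses lines into a four-part result.
 */
function parseBlock(array $lines): array
{
    $result = ['entries' => [], 'skipped' => [], 'parsedCount' => 0, 'skippedCount' => 0];

    // CC-CEDICT line format: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
    $pattern = '~^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$~u';

    foreach ($lines as $line) {
        $line = trim($line);
        $isComment = $line === '' || $line[0] === '#';
        if (!$isComment && preg_match($pattern, $line, $m) === 1) {
            $result['entries'][] = [
                'traditional' => $m[1],
                'simplified'  => $m[2],
                'pinyin'      => $m[3],
                'english'     => explode('/', $m[4]),
            ];
            $result['parsedCount']++;
        } else {
            $result['skipped'][] = $line;
            $result['skippedCount']++;
        }
    }

    return $result;
}

// Illustrative input: one comment line and two entries.
$result = parseBlock([
    '# CC-CEDICT',
    '中國 中国 [Zhong1 guo2] /China/',
    '你好 你好 [ni3 hao3] /hello/hi/',
]);
```

With this input, two lines are parsed and the comment line ends up in the skipped list.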

Basic Entry object

By default, the parser will fill the Entry object with:

  • a string of traditional characters from the dictionary entry
  • a string of simplified characters from the dictionary entry
  • an array of English translations from the dictionary entry
  • an array of pinyin syllables with diacritics
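
The pinyin in CC-CEDICT is numeric (for example `Zhong1 guo2`), so producing syllables with diacritics requires a conversion step. The sketch below shows one way this could be done, using the standard placement rule for tone marks. `toDiacritic()` is a hypothetical helper, not the parser's implementation, and it needs the mbstring extension. It handles two CC-CEDICT conventions: `u:` for ü, and tone 5 for the neutral tone.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: converts one numeric pinyin syllable (e.g. "guo2") to diacritic form.
 */
function toDiacritic(string $syllable): string
{
    // CC-CEDICT writes ü as "u:" (e.g. "lu:4").
    $syllable = str_replace(['u:', 'U:'], ['ü', 'Ü'], $syllable);

    if (preg_match('/^(.+)([1-5])$/u', $syllable, $m) !== 1) {
        return $syllable; // no tone digit: leave untouched
    }
    $base = $m[1];
    $tone = (int) $m[2];
    if ($tone === 5) {
        return $base; // neutral tone carries no mark
    }

    $marks = [
        'a' => ['ā', 'á', 'ǎ', 'à'],
        'e' => ['ē', 'é', 'ě', 'è'],
        'i' => ['ī', 'í', 'ǐ', 'ì'],
        'o' => ['ō', 'ó', 'ǒ', 'ò'],
        'u' => ['ū', 'ú', 'ǔ', 'ù'],
        'ü' => ['ǖ', 'ǘ', 'ǚ', 'ǜ'],
    ];

    // Split into UTF-8 characters (works on PHP 7.0, unlike mb_str_split).
    $chars = preg_split('//u', $base, -1, PREG_SPLIT_NO_EMPTY);
    $lower = array_map('mb_strtolower', $chars);

    // Placement rule: "a" or "e" takes the mark, then the "o" of "ou", otherwise the last vowel.
    $target = array_search('a', $lower, true);
    if ($target === false) {
        $target = array_search('e', $lower, true);
    }
    if ($target === false) {
        $o = array_search('o', $lower, true);
        if ($o !== false && ($lower[$o + 1] ?? '') === 'u') {
            $target = $o;
        }
    }
    if ($target === false) {
        for ($i = count($lower) - 1; $i >= 0; $i--) {
            if (isset($marks[$lower[$i]])) {
                $target = $i;
                break;
            }
        }
    }
    if ($target === false) {
        return $base; // no vowel to carry the mark
    }

    $marked = $marks[$lower[$target]][$tone - 1];
    if ($chars[$target] !== $lower[$target]) {
        $marked = mb_strtoupper($marked); // keep the capitalisation of proper nouns
    }
    $chars[$target] = $marked;

    return implode('', $chars);
}

$syllables = array_map('toDiacritic', explode(' ', 'Zhong1 guo2'));
```

Splitting on spaces first and converting each syllable separately yields the array form directly. Joining the result with spaces gives the string form.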

Customising the Entry object

With setOptions(array), you can change the data included in the Entry object. Note that as soon as any options are set, the Entry will include only the data specified with setOptions()!

  • Entry::F_ORIGINAL the original unparsed line from CC-CEDICT as a string
  • Entry::F_TRADITIONAL a string with the dictionary entry in traditional characters
  • Entry::F_TRADITIONAL_CHARS an array of the above but stripping out any non-Han characters
  • Entry::F_SIMPLIFIED same as Entry::F_TRADITIONAL but in simplified characters
  • Entry::F_SIMPLIFIED_CHARS an array of the above but stripping out any non-Han characters
  • Entry::F_PINYIN a string of pinyin as formatted in CC-CEDICT (like numeric Hanyu Pinyin but with idiosyncrasies)
  • Entry::F_PINYIN_NUMERIC a string of pinyin converted to numeric Hanyu Pinyin
  • Entry::F_PINYIN_NUMERIC_EXPANDED an array of the above
  • Entry::F_PINYIN_DIACRITIC a string of pinyin converted to Hanyu Pinyin with diacritics
  • Entry::F_PINYIN_DIACRITIC_EXPANDED an array of the above
  • Entry::F_ENGLISH a string with all the English translations for the dictionary entry
  • Entry::F_ENGLISH_EXPANDED an array of the above

Limitations, bugs, roadmap

System requirements

Works on any system with PHP 7 or higher. It will not work on PHP 5.

During tests, the PHP 7.0.10 CLI on Windows 10 used a mere 9.2 MB of RAM when parsing the dictionary in chunks of 2200 lines, largely due to the PHP CLI's own overhead. Beyond 2200 lines, memory consumption rose fairly evenly, by about 1 MB per 700 additional lines of chunk size. By that estimate, a typical web server with 64 MB of RAM has 54.8 MB of headroom, or about 38,360 lines on top of the 2200-line baseline, so it should handle chunk sizes of up to roughly 40,000 lines. Chunk sizes below 2200 lines bring no real memory benefit, so they only make sense if the receiving interface has other limitations.
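
The estimate can be reproduced with a few lines of arithmetic. The sketch below encodes the linear model from the measurements (a 9.2 MB baseline up to 2200 lines, then about 1 MB per 700 extra lines). `maxChunkSize()` is a hypothetical helper, and the figures only apply to the test setup described here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: estimates the largest chunk size (in lines) for a memory budget,
 * using the linear model from the measurements above.
 */
function maxChunkSize(float $memoryMb, float $baseMb = 9.2, int $baseLines = 2200, int $linesPerMb = 700): int
{
    if ($memoryMb < $baseMb) {
        return 0; // not even the baseline fits
    }
    // round() guards against floating-point noise in the subtraction.
    $extraLines = (int) round(($memoryMb - $baseMb) * $linesPerMb);

    return $baseLines + $extraLines;
}

$total = maxChunkSize(64.0);   // baseline plus headroom
$headroom = $total - 2200;     // lines on top of the 2200-line baseline
```

For 64 MB this gives 38,360 lines of headroom above the baseline, or 40,560 lines in total. In practice it is wise to leave a margin for whatever else the script holds in memory.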

Opportunities for improvement

  • Perhaps it could output various formats (e.g. JSON) instead of arrays
  • Any further Chinese in the English translation (references, alternative spellings, or full forms of abbreviations) could be structured and nested
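
As a rough idea of what both improvements could involve, the sketch below pulls Chinese cross-references out of an English gloss and encodes them as JSON. It is not part of the parser: `extractReferences()` is a hypothetical helper, the gloss is sample data, and it assumes references follow the CC-CEDICT convention `Traditional|Simplified[pin1 yin1]` (or `Word[pin1 yin1]` when both forms are identical).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: extracts cross-references such as 著|着[zhe5] from a gloss.
 */
function extractReferences(string $gloss): array
{
    preg_match_all(
        '/(\p{Han}+)(?:\|(\p{Han}+))?\[([^\]]+)\]/u',
        $gloss,
        $matches,
        PREG_SET_ORDER
    );

    $references = [];
    foreach ($matches as $m) {
        $references[] = [
            'traditional' => $m[1],
            // Without a "|simplified" part, both forms are the same.
            'simplified'  => $m[2] !== '' ? $m[2] : $m[1],
            'pinyin'      => $m[3],
        ];
    }

    return $references;
}

$references = extractReferences('variant of 著|着[zhe5]');

// JSON_UNESCAPED_UNICODE keeps the Chinese characters readable instead of \uXXXX escapes.
$json = json_encode($references, JSON_UNESCAPED_UNICODE);
```

The same json_encode() call works on any array built from the parser's output, so JSON export can already be done on the caller's side.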