hanzi2reading

A library for transcribing strings of Chinese characters to their readings in Mandarin.

An example JavaScript application: http://bdon.org/hanzireader/

Disambiguates multiple-reading characters based on a dictionary.
Defines a binary format for dictionaries that can be loaded at runtime.
- The dictionary format is designed to be as compact as possible.
- Dictionaries are agnostic to Traditional/Simplified script and transliteration format, and store pronunciations as 2-byte syllable sequences based on Zhuyin.
- A typical dictionary, such as CC-CEDICT, is around 700 kB in this format, or less than 300 kB when Brotli-compressed, so it is practical to load the entire dictionary once over the web and then transcribe without any further network communication.
The library and dictionary can be shared across multiple programming languages. Python and JavaScript are supported right now.

Installation

JavaScript: npm install hanzi2reading
Python: pip install hanzi2reading

Dictionaries

CC-CEDICT. Licensed CC-BY-SA.
Moedict. Licensed CC-BY-ND. https://github.com/g0v/moedict-data/blob/master/README.md
Unihan database, which contains 1-grams (single characters) only. Licensed under the Unicode License.

Limitations

This library only does dictionary-based lookups of character sequences. It does not attempt to disambiguate readings based on parts of speech, which is necessary for transcribing complete sentences.
Word segmentation and proper-noun handling for formatted Pinyin are not supported, but may be in the future.
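
To make "dictionary-based lookup" concrete, here is a minimal sketch of one way such a lookup could work: greedy longest-match against a dictionary of character sequences. This is an illustration only, not the library's actual algorithm or API. The function name `transcribe`, the `TOY_DICT` data, and the use of Pinyin strings instead of packed syllables are all assumptions made for readability.

```python
# Hypothetical toy dictionary: keys are character sequences, values are readings.
# A real dictionary stores 2-byte syllables; Pinyin strings are used here for clarity.
TOY_DICT = {
    "行": ["xíng"],          # default single-character reading
    "银": ["yín"],
    "银行": ["yín", "háng"],  # in this word, 行 is read háng
}


def transcribe(text, dictionary=TOY_DICT):
    """Greedy longest-match lookup (an assumed strategy, for illustration)."""
    max_len = max(len(key) for key in dictionary)
    readings = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                readings.extend(dictionary[chunk])
                i += length
                break
        else:
            # No entry covers this character; mark it as unknown.
            readings.append(None)
            i += 1
    return readings
```

With this toy data, `transcribe("行")` yields the default reading, while `transcribe("银行")` picks the reading attached to the two-character entry. Nothing in this approach looks at grammar or sentence context, which is the limitation described above.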

Syllable Format

Part	Bits
Initial	5
Medial	2
Final	4
Tone	3
Er	1

A syllable is serialized in a dictionary as a 2-byte sequence (little-endian). When loaded into a programming runtime, a syllable is a tuple or array of five integers. Example: the syllable kiāng ㄎㄧㄤ corresponds to the array [10,1,11,1,0] or the byte sequence 0b 1011 0010 0010 1001
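
The worked example above can be reproduced with a short packing sketch. The bit layout used here (one unused high bit, then Initial, Medial, Final, Tone, Er from most to least significant) is inferred from the example bytes rather than taken from the library source, and the function names `pack_syllable` and `unpack_syllable` are hypothetical.

```python
import struct


def pack_syllable(initial, medial, final, tone, er):
    """Pack five fields into 2 little-endian bytes (layout inferred from the example)."""
    value = (initial << 10) | (medial << 8) | (final << 4) | (tone << 1) | er
    return struct.pack("<H", value)


def unpack_syllable(data):
    """Unpack 2 little-endian bytes into [initial, medial, final, tone, er]."""
    (value,) = struct.unpack("<H", data)
    return [
        (value >> 10) & 0x1F,  # Initial: 5 bits
        (value >> 8) & 0x03,   # Medial: 2 bits
        (value >> 4) & 0x0F,   # Final: 4 bits
        (value >> 1) & 0x07,   # Tone: 3 bits
        value & 0x01,          # Er: 1 bit
    ]
```

Under this assumed layout, `pack_syllable(10, 1, 11, 1, 0)` produces the bytes `0xB2 0x29`, which is the bit pattern `1011 0010 0010 1001` shown above, and `unpack_syllable` reverses it. The five fields use 15 of the 16 available bits.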

Notes

Resources

https://github.com/mozillazg/python-pinyin (SC only, data embedded in code)
https://github.com/tsroten/dragonmapper (data is in large CSV files, Python only)
https://github.com/g0v/moedict-data
https://cc-cedict.org/editor/editor.php
https://chrome.google.com/webstore/detail/zhongwen-chinese-english/kkmlkkjojmombglmlpbpapmhcaljjkde
https://github.com/skishore/makemeahanzi
