bdon/hanzi2reading

hanzi2reading

A library for transcribing strings of Chinese characters to their readings in Mandarin.

An example JavaScript application: http://bdon.org/hanzireader/

  • Disambiguates multiple-reading characters based on a dictionary.
  • Defines a binary format for dictionaries that can be loaded at runtime.
    • The dictionary format is designed to be as compact as possible.
    • Dictionaries are agnostic to Traditional/Simplified script and transliteration format; they store pronunciations as 2-byte syllable sequences based on Zhuyin.
    • A typical dictionary such as CC-CEDICT is around 700 kB in this format, or less than 300 kB when Brotli-compressed, so it is practical to load the entire dictionary once over the web and then transcribe without any further network communication.
  • The library and dictionary can be shared across multiple programming languages. Python and JavaScript are supported right now.

Installation

  • JavaScript: npm install hanzi2reading
  • Python: pip install hanzi2reading

Dictionaries

Limitations

  • This library only does dictionary-based lookups of character sequences. It does not attempt to disambiguate readings based on parts of speech, which is necessary for transcribing complete sentences.
  • Word segmentation and proper-noun handling for formatted Pinyin are not supported, but may be added in the future.
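
To make "dictionary-based lookups of character sequences" concrete, here is a minimal sketch of a greedy longest-match lookup. It is an illustration only and not the library's actual algorithm or API: the `TOY_DICT` data, the `transcribe` function, and the use of Pinyin strings instead of the binary syllable format are all assumptions made for this example.

```python
# Hypothetical toy dictionary: character sequences -> one reading per character.
# The real library loads a binary dictionary; plain Pinyin strings are used
# here only to keep the example readable.
TOY_DICT = {
    "银行": ["yín", "háng"],   # 行 is read "háng" in this word
    "行人": ["xíng", "rén"],   # 行 is read "xíng" in this word
    "银": ["yín"],
    "行": ["xíng"],
    "人": ["rén"],
}


def transcribe(text, dictionary=TOY_DICT):
    """Greedy longest-match lookup; unknown characters map to None."""
    max_len = max(len(key) for key in dictionary)
    readings = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            entry = dictionary.get(text[i:i + length])
            if entry is not None:
                readings.extend(entry)
                i += length
                break
        else:
            # No entry covers this character at all.
            readings.append(None)
            i += 1
    return readings


print(transcribe("银行"))  # ['yín', 'háng']
print(transcribe("行人"))  # ['xíng', 'rén']
```

A lookup like this picks a reading purely from the surrounding characters found in the dictionary, which is why it cannot resolve readings that depend on part of speech.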

Syllable Format

| Part    | Bits |
|---------|------|
| Initial | 5    |
| Medial  | 2    |
| Final   | 4    |
| Tone    | 3    |
| Er      | 1    |

A syllable is serialized in a dictionary as a 2-byte sequence (little-endian). When loaded into a programming runtime, a syllable is a tuple or array of five integers, in the order initial, medial, final, tone, er. Example: the syllable kiāng (ㄎㄧㄤ) corresponds to the array [10,1,11,1,0] or the byte sequence 0b 1011 0010 0010 1001 (bytes 0xB2 0x29).
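
The packing can be sketched in a few lines of Python. The bit layout below (fields packed most-significant first into a 15-bit value, with the top bit of the 16-bit word unused) is inferred from the worked example above rather than taken from the library's source, and the names `FIELDS`, `pack_syllable`, and `unpack_syllable` are hypothetical helpers, not part of the hanzi2reading API.

```python
import struct

# Field widths from the table above, most significant field first.
FIELDS = (("initial", 5), ("medial", 2), ("final", 4), ("tone", 3), ("er", 1))


def pack_syllable(parts):
    """Pack [initial, medial, final, tone, er] into 2 little-endian bytes."""
    value = 0
    for (name, bits), part in zip(FIELDS, parts):
        if not 0 <= part < (1 << bits):
            raise ValueError(f"{name}={part} does not fit in {bits} bits")
        value = (value << bits) | part
    return struct.pack("<H", value)  # "<H": little-endian unsigned 16-bit


def unpack_syllable(data):
    """Inverse of pack_syllable: 2 bytes -> list of five integers."""
    (value,) = struct.unpack("<H", data)
    parts = []
    # Peel fields off from the least significant end, then restore the order.
    for _, bits in reversed(FIELDS):
        parts.append(value & ((1 << bits) - 1))
        value >>= bits
    return parts[::-1]


packed = pack_syllable([10, 1, 11, 1, 0])
print(packed.hex())             # b229
print(unpack_syllable(packed))  # [10, 1, 11, 1, 0]
```

The packed value for the example is 0x29B2, which is stored as the bytes 0xB2 0x29 and matches the bit string given above.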

Notes

Resources

About

Library for Standard Chinese pronunciation with swappable dictionary backends
