Skip to content

fanzfeng/DaCiDian

Repository files navigation

DaCiDian (大词典)

DaCiDian is an open-sourced lexicon for Chinese Automatic Speech Recognition(ASR)


Design

In mainstream ASR system, lexicon is a core component, that maps word into acoustic modeling units(such as phone). In DaCiDian, we break the mapping into 2 independent layers:

word --> PinYin syllable --> phoneme

The purpose of this design is as follows:

  • Anyone who is familiar with PinYin (basically every mandarin speaker), can enrich DaCiDian's vocabulary, by adding new entry(word) into the layer-1 mapping.

  • ASR system developers can easily adapt DaCiDian to their own phone set by defining their own layer-2 mapping.


Layer-1: Word -> Syllable Mapper (word_to_pinyin.txt)

  • word and its pronunciation(s) are seperated by tab
  • multiple pronunciations are seperated by ;
  • each pronunciation is a sequence of Chinese PinYin syllables, seperated by space
  • tone infomation are encoded into 0,1,2,3,4

examples:

...
裤子    KU_4 ZI_0
好事    HAO_4 SHI_4;HAO_3 SHI_4
教授    JIAO_1 SHOU_4;JIAO_4 SHOU_4
...
语音识别  YU_3 YIN_1 SHI_2 BIE_2
傅里叶变换 FU_4 LI_3 YE_4 BIAN_4 HUAN_4

Layer-2: Syllabel->Phone Mapper (pinyin_to_phone.txt)

pinyin_to_phone is a user-defined mapping from PinYin syllables to target phone set

Take traditional PinYin's Initial-Final structure for example, a mapping should be defined as follows:

A	$0 a
AI	$0 ai
AN	$0 an
ANG	$0 ang
AO	$0 ao
BA	b a
BAI	b ai
BAN	b an
BANG	b ang
BAO	b ao
...
...
...
ZONG	z ong
ZOU	z ou
ZU	z u
ZUAN	z uan
ZUI	z ui
ZUN	z un
ZUO	z uo
  • syllable and its phoneme representation are seperated by tab
  • $0 is the NULL-Initial phone.

Notes

  • English word in Chinese context is a problem. An upgraded version covering both English and Chinese can be found here BigCiDian

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages