This is the code associated with the publication "Glyph-aware Embedding of Chinese Characters" by Dai and Cai. Please consider citing the paper if you find the code useful in some way for your research.

Dai, Falcon Z., and Zheng Cai. "Glyph-aware Embedding of Chinese Characters." EMNLP 2017 (2017): 64.

@article{dai2017glyph,
  title={Glyph-aware Embedding of Chinese Characters},
  author={Dai, Falcon Z and Cai, Zheng},
  journal={EMNLP 2017},
  year={2017}
}


  • We used the Google Noto font for all of our experiments. Download the Google Noto Simplified Chinese fonts and unzip them under the project directory; they are needed to render the glyphs.
  • Requires TensorFlow v1.1 and Python 2.7.x.
  • Clone the repo and check out a particular branch or a specific commit with $ git checkout <branch-name or git-tag>


For the sake of replicability, we git-tagged the original commits we used to obtain the published figures. Please see the releases page for a complete list of git tags (compare them with the model names in the paper). Please use the issues page to contact us about code problems so that more people can benefit from the conversation.

Summary of our implementation

The commit tagged msr-m1 is a good place to start for language modeling. A few related models are available; they differ by whether they use character-id embedding, glyph embedding, or both. For the Chinese segmentation task (tokenizing Chinese sentences, which by convention lack whitespace), you probably want to consult the corresponding segmentation tags.

At a high level, our implementation uses no pre-trained embeddings and renders the characters into glyphs on the fly. Glyph-rendering calls are slow, so we cache the glyphs of seen characters, which gives a dramatic speedup. We consider the input activation to the RNN (the combined output of a CNN over the glyph and a trained character-id embedding) to be the effective embedding of an input character.
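The caching described above can be sketched as a simple memoization of the rendering call. This is an illustration, not the repo's code: `render_glyph` here is a stand-in for the real (slow) font-rasterization call, and the bitmap it returns is faked so only the caching logic is demonstrated.

```python
def render_glyph(char, size=24):
    # Placeholder for an actual rasterization call (e.g. via a font
    # library over the Noto font); we fabricate a size x size bitmap
    # so the caching behavior can be shown without font dependencies.
    return [[ord(char) % 2] * size for _ in range(size)]

_glyph_cache = {}

def glyph_for(char):
    """Return a character's glyph, rendering it only on first sight."""
    if char not in _glyph_cache:
        _glyph_cache[char] = render_glyph(char)
    return _glyph_cache[char]
```

Because a corpus reuses a relatively small set of characters, almost every lookup after warm-up is a dictionary hit rather than a rendering call, which is where the dramatic speedup comes from.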

In terms of implementation:

  1. It takes the path to a text file (UTF-8 encoded) and the path to a vocabulary file as input, and builds a TensorFlow input pipeline. In the case of segmentation, it additionally takes a path to the ground-truth segmentation annotations.
  2. The characters are rendered into glyphs and passed to the CNN. In parallel, we also look up each character's embedding by its vocabulary id. (We compute both paths in all models, and then simply use a 0/1 multiplier to shut down the path a specific model variant does not need before outputting to the RNN.)
  3. Lastly, the output is fed into a standard RNN, as is common in other contemporary work.
  4. The model is trained end-to-end for the given task.
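The 0/1 gating in step 2 can be sketched as follows. The function name and vector layout are illustrative, not taken from the repo: both the CNN's glyph features and the character-id embedding are always computed, and each model variant fixes the two multipliers to 0 or 1 to select which path reaches the RNN.

```python
def effective_embedding(cnn_glyph_feat, char_id_embed, use_glyph, use_id):
    """Combine the two embedding paths, zeroing out the unused one.

    use_glyph and use_id are constant 0/1 multipliers chosen per model
    variant (glyph-only, id-only, or both).
    """
    gated_glyph = [use_glyph * x for x in cnn_glyph_feat]
    gated_id = [use_id * x for x in char_id_embed]
    return gated_glyph + gated_id  # fed to the RNN as one input vector

# For example, a glyph-only variant sets use_glyph=1, use_id=0:
# effective_embedding([1.0, 2.0], [3.0], 1, 0) keeps the glyph features
# and zeroes the character-id part.
```

Keeping both paths in the graph and gating them lets all variants share one implementation, at the cost of computing features that some variants discard.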


Falcon Dai

Zheng Cai


This repo explores Chinese language models with sub-character-level visual information.






