

This project is the offline part of Icytranslate, an English-Chinese translation platform. The output of this project is a translation model, which is the core component of Icytranslate.

data preparation

We use UM-corpus as our default training dataset. You can apply for it here:

Use your own dataset

Although UM-corpus is a fine dataset, we encourage you to use your own dataset and report your results. If you use another dataset, you may need to modify the code and change train_prefix and test_prefix to point to the actual data dir.

Beware: the dataset you use should have the same data structure as UM-corpus. Otherwise you will need to read the data-loading code and modify parts of it.

data preprocessing


We first need to segment the corpus into sequences of words by running the preprocessing script with python. The output should be segment_train.pkl in the middleresult dir.

All English sentences are split into lower-case words, and all Chinese sentences are split into lists of Chinese characters.
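The segmentation rule above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names, the regular expression used for English, and the layout of the pickled object (a list of (english_tokens, chinese_chars) pairs) are all assumptions.

```python
import pickle
import re
from pathlib import Path


def tokenize_english(sentence):
    """Lowercase an English sentence and split it into words and punctuation marks."""
    # Words may carry one apostrophe suffix ("don't"); any other non-space symbol is its own token.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", sentence.lower())


def tokenize_chinese(sentence):
    """Split a Chinese sentence into a list of characters, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def segment_pairs(pairs):
    """Segment (english, chinese) sentence pairs into (word list, character list) pairs."""
    return [(tokenize_english(en), tokenize_chinese(zh)) for en, zh in pairs]


def save_segmented(segmented, out_path="middleresult/segment_train.pkl"):
    """Pickle the segmented corpus. The object layout is an assumption of this sketch."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(segmented, f)
```

For example, `segment_pairs([("I like tea.", "我喜欢茶。")])` yields one pair whose English side is a list of lower-case words and whose Chinese side is a list of single characters.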

encode the tokenized series

The next step is to convert the tokenized sentences into sequences of words. To do that, you only need to run:

python --max_words=[max words in a sentence that you want]
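As a rough picture of what an encoding step with a --max_words limit usually does, the sketch below maps tokens to integer ids and pads or truncates every sentence to a fixed length. It assumes the encoder builds a frequency-ordered vocabulary with padding and unknown-word entries; the special tokens, the function names, and the way the flag is parsed are hypothetical and not taken from the project's script.

```python
import argparse
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def parse_args(argv):
    """Parse the --max_words flag from an explicit argument list."""
    parser = argparse.ArgumentParser(description="Encode tokenized sentences.")
    parser.add_argument("--max_words", type=int, required=True,
                        help="maximum number of tokens kept per sentence")
    return parser.parse_args(argv)


def build_vocab(sentences):
    """Assign an integer id to every token, most frequent tokens first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_words):
    """Truncate to max_words, map tokens to ids, then right-pad with the PAD id."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in sentence[:max_words]]
    return ids + [vocab[PAD]] * (max_words - len(ids))
```

With a fixed length, every encoded sentence has the same shape, so a whole batch can be stacked into a single array before it is fed to the network.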

model training

Now we can train the model. The folder contains an align-and-translate-char ipynb file. Open it with an IDE or Jupyter Notebook and follow the steps there. You should end up with a trained model and a test BLEU of around 0.22.
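To make the BLEU figure easier to interpret, here is a simplified, dependency-free sketch of corpus-level BLEU on a 0 to 1 scale. It assumes one reference per hypothesis, uniform weights over 1-grams to 4-grams, and no smoothing. The project computes its score with nltk, so this function (a hypothetical `simple_corpus_bleu`) only illustrates the metric and may not reproduce the notebook's exact numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref, n)
            hyp_counts = ngram_counts(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a single n-gram order with no match drives the score to zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Because the model works on Chinese characters, the token lists passed in would be lists of characters, for example `simple_corpus_bleu([list("我喜欢机器翻译")], [list("我喜欢机器翻译")])` for a perfect match.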


tensorflow 1.2.0 for the neural network
jieba as the English word tokenizer
nltk to calculate the BLEU score
sklearn and numpy as toolkits




