Python utilities for Chinese Notes Chinese-English dictionary
Prerequisite: Python 3.6+
Create a virtual environment
python3 -m venv venv
Activate the virtual environment
source venv/bin/activate
Install the software, and activate the virtual environment again when you come back.
You can use the chinesenotes Python package for basic command line utilities as described here.
From a directory above chinesenotes-python, clone the chinesenotes.com project
git clone https://github.com/alexamies/chinesenotes.com.git
Set the location in an environment variable
export CNREADER_HOME=$HOME/chinesenotes.com
Look up a word in the dictionary
python -m chinesenotes.cndict --lookup "你好"
You should see output like
INFO:root:Opening the Chinese Notes Reader dictionary
INFO:root:OpenDictionary completed with 141896 entries
INFO:root:hello
The environment setup is the same as above. To run the tokenization utility:
python -m chinesenotes.cndict --tokenize "東家人死。西家人助哀。"
You should see output like
INFO:root:Opening the Chinese Notes Reader dictionary
INFO:root:OpenDictionary completed with 141896 entries
INFO:root:Greedy dictionary-based text segmentation
INFO:root:Chunk: 東家人死。西家人助哀。
INFO:root:Segments: ['東家', '人', '死', '。', '西家', '人', '助', '哀', '。']
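The segmenter uses simple greedy, longest-match-first lookup against the dictionary headwords. As a rough sketch of the idea (not the actual cndict implementation), using a toy word list:

# Minimal sketch of greedy (longest-match-first) dictionary segmentation.
# The toy WORDS set stands in for the Chinese Notes dictionary headwords;
# the real utility loads them from the chinesenotes.com data files.
WORDS = {'東家', '西家', '人', '死', '助', '哀', '。'}
MAX_LEN = max(len(w) for w in WORDS)

def greedy_segment(text: str) -> list:
    """Scan left to right, always taking the longest dictionary match."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Accept a dictionary match, or fall back to a single character.
            if length == 1 or candidate in WORDS:
                segments.append(candidate)
                i += length
                break
    return segments

print(greedy_segment('東家人死。西家人助哀。'))
# ['東家', '人', '死', '。', '西家', '人', '助', '哀', '。']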
To run the word similarity tool
python -m chinesenotes.similarity --word TARGET_WORD
Substitute TARGET_WORD with the word you want to search for.
To convert traditional to simplified
python main.py --tosimplified "四種廣說"
To convert to traditional
python main.py --totraditional "操作系统"
To get pinyin
python main.py --topinyin "操作系统"
The Colab notebook for adding new words is at http://colab.research.google.com/github/alexamies/chinesenotes-python/blob/master/add_mod_entry.ipynb (open with Chrome).
The text analysis programs require the Apache Beam Python SDK. See the Apache Beam Python SDK Quickstart for details on running Apache Beam. You can run pipelines locally or in the cloud with Google Cloud Dataflow or another runner.
pip install apache-beam
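The character count pipeline below is a standard count-per-element Beam job. As a minimal local sketch of the pipeline shape (the repo's charcount.py adds options such as the ignore-lines filter and corpus traversal):

# Minimal local Apache Beam sketch of a character-frequency count.
# An illustration of the pipeline shape, not the repo's charcount.py itself.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('corpus/shijing/shijing001.txt')
     | 'ToChars' >> beam.FlatMap(list)                    # split each line into characters
     | 'DropSpace' >> beam.Filter(lambda c: not c.isspace())
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Format' >> beam.Map(lambda kv: f'{kv[0]}\t{kv[1]}')
     | 'Write' >> beam.io.WriteToText('outputs'))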
A small corpus of Chinese texts is included in this repo. To run on a full corpus, download either the Chinese Notes corpus of literary Chinese
git clone https://github.com/alexamies/chinesenotes.com.git
or the NTI Reader corpus for the Taisho Tripitaka version of the Chinese Buddhist canon
git clone https://github.com/alexamies/buddhist-dictionary.git
Set environment variables
GOOGLE_APPLICATION_CREDENTIALS=credentials.json
INPUT_BUCKET=ntinreader-text
OUTPUT_BUCKET=ntinreader-analysis
PROJECT=[your project]
To list all options
python charcount.py --help
To run locally, reading one file only
CORPUS_HOME=.
python charcount.py \
--input $CORPUS_HOME/corpus/shijing/shijing001.txt \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
To compute the character count for all files in the corpus
CORPUS_HOME=.
python charcount.py \
--corpus_home $CORPUS_HOME \
--corpus_prefix corpus \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
Move the results to a convenient location
cat output* > data/corpus/analysis/char_freq.tsv
rm output*
Run with Dataflow. You will need to copy the corpus text files into GCS first. For a single file
python charcount.py \
--input gs://$INPUT_BUCKET/taisho/t2003_01.txt \
--output gs://$OUTPUT_BUCKET/analysis/outputs \
--runner DataflowRunner \
--project $PROJECT \
--temp_location gs://$OUTPUT_BUCKET/tmp/
Results
gsutil cat "gs://$OUTPUT_BUCKET/analysis/outputs*" > output.txt
less output.txt
rm output.txt
For the whole corpus, running on Dataflow
python charcount.py \
--corpus_home gs://$INPUT_BUCKET \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output gs://$OUTPUT_BUCKET/charcount/outputs \
--runner DataflowRunner \
--project $PROJECT \
--temp_location gs://$OUTPUT_BUCKET/tmp/
Get the results
mkdir tmp
gsutil -m cp gs://$OUTPUT_BUCKET/charcount/* tmp/
cat tmp/* > char_freq.tsv
rm -rf tmp
The command line options are the same as for the charcount.py program. To run locally, reading one file only
CORPUS_HOME=.
python char_bigram_count.py \
--input $CORPUS_HOME/corpus/shijing/shijing001.txt \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
For the whole corpus
python char_bigram_count.py \
--corpus_home $CORPUS_HOME \
--corpus_prefix corpus \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
Move the results to a convenient location
cat output* > data/corpus/analysis/char_bigram_freq.tsv
Run on Dataflow
python char_bigram_count.py \
--corpus_home gs://$INPUT_BUCKET \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output gs://$OUTPUT_BUCKET/charbigramcount/outputs \
--runner DataflowRunner \
--project $PROJECT \
--temp_location gs://$OUTPUT_BUCKET/tmp/
To list all options
python term_frequency.py --help
Run locally, read one file only
CORPUS_HOME=.
python term_frequency.py \
--input $CORPUS_HOME/corpus/shijing/shijing001.txt \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
For example, Blue Cliff Record Scroll 1:
CORPUS_HOME=../buddhist-dictionary
python term_frequency.py \
--input $CORPUS_HOME/corpus/taisho/t2003_01.txt \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output output/bluecliff01.tsv
Run on the entire test corpus
python term_frequency.py \
--corpus_home $CORPUS_HOME \
--corpus_prefix corpus \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output outputs
Move the results to a convenient location
cat output* > data/corpus/analysis/term_freq.tsv
Run using Dataflow
python term_frequency.py \
--corpus_home gs://$INPUT_BUCKET \
--ignorelines $CORPUS_HOME/data/corpus/ignorelines.txt \
--output gs://$OUTPUT_BUCKET/termfreq/outputs \
--runner DataflowRunner \
--project $PROJECT \
--setup_file ./setup.py \
--temp_location gs://$OUTPUT_BUCKET/tmp/
Get the results
mkdir tmp
gsutil -m cp gs://$OUTPUT_BUCKET/termfreq/* tmp/
cat tmp/* > term_freq.tsv
rm -rf tmp
To compute the mutual information for each term and write it to an output file:
python chinesenotes/mutualinfo.py \
--char_freq_file [FILE_NAME] \
--bigram_freq_file [FILE_NAME] \
--filter_file [FILE_NAME] \
--output_file [FILE_NAME]
For example, for the test corpus
CORPUS_HOME=.
python chinesenotes/mutualinfo.py \
--char_freq_file $CORPUS_HOME/data/corpus/analysis/char_freq.tsv \
--bigram_freq_file $CORPUS_HOME/data/corpus/analysis/char_bigram_freq.tsv \
--filter_file $CORPUS_HOME/data/corpus/analysis/term_freq.tsv \
--output_file $CORPUS_HOME/data/corpus/analysis/mutual_info.tsv
For the NTI Reader Taisho corpus
CORPUS_HOME=../buddhist-dictionary
python chinesenotes/mutualinfo.py \
--char_freq_file $CORPUS_HOME/index/char_freq.tsv \
--bigram_freq_file $CORPUS_HOME/index/char_bigram_freq.tsv \
--filter_file $CORPUS_HOME/index/term_freq.tsv \
--output_file $CORPUS_HOME/index/mutual_info.tsv
Filter to specific terms, for example, the terms in the Blue Cliff Record, Scroll 1:
CORPUS_HOME=../buddhist-dictionary
python chinesenotes/mutualinfo.py \
--char_freq_file $CORPUS_HOME/index/char_freq.tsv \
--bigram_freq_file $CORPUS_HOME/index/char_bigram_freq.tsv \
--filter_file output/bluecliff01.tsv \
--output_file output/bluecliff01_mi.tsv
Check for the term 蠛蠓 'midge' in the NTI Reader corpus:
Bigram freq (蠛蠓 + 蠓蠛): 13 + 1 = 14
Character freq (蠛): 19
Character freq (蠓): 23
Total characters: 85,519,494
Total bigrams: 83,666,199
Mutual information:
I(蠛, 蠓) = log2[P(a, b) / (P(a) P(b))] = log2[(14 / 83666199) / ((19 / 85519494) * (23 / 85519494))] = log2(2800443.4) = 21.42
which matches the output of the program.
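A quick way to double-check that figure from the counts quoted above:

import math

# Counts for 蠛蠓 taken from the NTI Reader corpus figures quoted above.
bigram_count = 14          # f(蠛蠓) + f(蠓蠛) = 13 + 1
freq_a, freq_b = 19, 23    # character frequencies of 蠛 and 蠓
total_chars = 85_519_494
total_bigrams = 83_666_199

p_ab = bigram_count / total_bigrams
p_a = freq_a / total_chars
p_b = freq_b / total_chars
print(math.log2(p_ab / (p_a * p_b)))  # ≈ 21.42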
To process an annotated corpus file
python chinesenotes/process_annotated.py \
--filename corpus/shijing/shijing_annotated_example.md \
--mutual_info data/corpus/analysis/mutual_info.tsv \
--outfile data/corpus/analysis/shijing_training_example.tsv
A tab separated output file will be written containing all the terms in the annotated corpus and whether they were tokenized correctly.
To run with the NTI Reader Blue Cliff Record data set
export CORPUS_HOME=../buddhist-dictionary
python chinesenotes/process_annotated.py \
--filename $CORPUS_HOME/data/corpus/analysis/tokenization_annotated.md \
--mutual_info $CORPUS_HOME/index/mutual_info.tsv \
--outfile $CORPUS_HOME/data/corpus/analysis/tokenization_training.tsv
First, install Matplotlib and a graphics backend. For example, on Debian
sudo apt-get install python3-tk
To plot the result of processing the annotated corpus file
python chinesenotes/plot_tokenization_results.py \
--infile data/corpus/analysis/shijing_training_example.tsv \
--outfile data/corpus/analysis/shijing_training_example.png
For the Blue Cliff Record
python chinesenotes/plot_tokenization_results.py \
--infile $CORPUS_HOME/data/corpus/analysis/tokenization_training.tsv \
--outfile $CORPUS_HOME/data/corpus/analysis/tokenization_training.png \
--decision_point -0.507
Dictionary tokenization is still used, but a filter is trained to decide whether to accept each token. A scikit-learn decision tree classifier is used for this.
Install scikit-learn and graphviz
pip install -U scikit-learn
pip install graphviz
Run the trainer on the example corpus
python chinesenotes/train_tokenizer.py \
--infile data/corpus/analysis/shijing_training_example.tsv \
--outfile data/corpus/analysis/shijing_example_decision_tree.png
For the Blue Cliff Record
python chinesenotes/train_tokenizer.py \
--infile $CORPUS_HOME/data/corpus/analysis/tokenization_training.tsv \
--outfile $CORPUS_HOME/data/corpus/analysis/tokenization_decision_tree.svg
Points with low mutual information can also be added before training.
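Conceptually, the trainer fits a decision tree on features of each token (such as its mutual information) with a label saying whether the dictionary segmentation was correct. A hedged sketch of that core step with scikit-learn follows; the column names here are illustrative and may differ from the actual training TSV produced by process_annotated.py.

# Sketch of training a decision tree to accept or reject dictionary tokens.
# Column names are illustrative; check the real training TSV for the actual ones.
import pandas as pd
from sklearn import tree

df = pd.read_csv('data/corpus/analysis/shijing_training_example.tsv', sep='\t')
X = df[['mutual_information']]      # feature(s), e.g. mutual information of the bigram
y = df['correct']                   # 1 if the dictionary token was correct, else 0

clf = tree.DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
print(tree.export_text(clf, feature_names=['mutual_information']))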
Run unit tests with the command
python -m unittest discover -s tests -p "*_test.py"
Environment variables
PROJECT_ID=[your project]
DATASET=[your dataset]
ANALYSIS_BUCKET=[your GCS bucket]
Move the frequency data to Google Cloud Storage
gsutil cp index/word_freq_doc.txt gs://${ANALYSIS_BUCKET}
Create the word frequency table and load data
bq load --field_delimiter='\t' \
${PROJECT_ID}:${DATASET}.word_freq_doc \
gs://${ANALYSIS_BUCKET}/word_freq_doc.txt \
word:string,frequency:integer,collection:string,document:string,idf:float,doc_len:integer
Query word frequency to find top 2,000 docs with the term 陀羅尼 (dharani):
bq query --destination_table ${PROJECT_ID}:${DATASET}.dharani_doc_freq \
"SELECT * FROM ${DATASET}.word_freq_doc WHERE word='陀羅尼' ORDER BY frequency DESC LIMIT 2000"
Export results from BQ to GCS:
bq extract \
--destination_format CSV \
--field_delimiter '\t' \
--print_header=true \
${PROJECT_ID}:${DATASET}.dharani_doc_freq \
gs://${ANALYSIS_BUCKET}/dharani_doc_freq.tsv
Download file from GCS:
gsutil cp gs://${ANALYSIS_BUCKET}/dharani_doc_freq.tsv dharani_doc_freq.tsv
Delete the dataset in BQ:
bq rm ${PROJECT_ID}:${DATASET}.dharani_doc_freq
Delete the file in GCS:
gsutil rm gs://${ANALYSIS_BUCKET}/dharani_doc_freq.tsv
The format of the text below is: Segmented Chinese text [English, no. of segments]
Legend:
Error: false positive (segmentation incorrectly predicted by the NTI Reader)
Error: false negative (not in dictionary on first pass, or not detected)
The English translation is followed by the number of segments. Segments are delimited by the Chinese enumeration mark 、.
Source (English): Cleary, T 1998, The Blue Cliff Record, Berkeley: Numata Center for Buddhist Translation and Research, https://www.bdkamerica.org/book/blue-cliff-record.
Source (Chinese): Chong Xian and Ke Qin, 《佛果圜悟禪師碧巖錄》 'The Blue Cliff Record (Biyanlu),' in Taishō shinshū Daizōkyō 《大正新脩大藏經》, in Takakusu Junjiro, ed., (Tokyo: Taishō Shinshū Daizōkyō Kankōkai, 1988), Vol. 48, No. 2003, accessed 2020-01-26, http://ntireader.org/taisho/t2003_01.html.
Koan 1
Source: Cleary 1998, pp. 11-12
舉、梁武帝、問、達磨、大師 Story: The Emperor Wu of Liang asked the great teacher Bodhidharma 5
(說、這、不、唧𠺕、漢) (Here’s someone talking such nonsense.) 5
如何、是、聖諦、第一義 “What is the ultimate meaning of the holy truths?” 4
(是、甚、繫驢橛) (What donkey-tethering stake is this?) 3
磨、云。 Bodhidharma said, 2
廓然無聖 “Empty, nothing holy.” 1
(將、謂、多少、奇特。 (One might have thought he’d say something extraordinary. 4
箭、過、新羅。 The point has already whizzed past. 3
可、殺、明白) It’s quite clear.) 3
帝、曰。 The emperor said, 2
對、朕、者、誰 “Who is answering me?” 4
(滿面、慚惶。 (Filled with embarrassment, 2
強、惺惺、果然。 he tries to force himself to be astute. 3
摸索、不、着) After all he gropes without finding.) 3
磨、云。 Bodhidharma said, 2
不識 “Don’t know.” 1
(咄。 (Tsk! 1
再、來、不、直、半、文、錢) A second try isn’t worth half a cent.) 7
帝、不、契 The emperor didn’t understand. 3
(可惜、許。 (Too bad. 2
却、較、些、子) Still, this is getting somewhere.) 4
達磨、遂、渡江、至、魏 Bodhidharma subsequently crossed the Yangtse River into the kingdom of Wei. 5
(這、野狐精。 (Foxy devil! 2
不免、一、場、懡、㦬。 He can’t avoid embarrassment. 5
從、西、過、東。 He goes from west to east, 4
從、東、過、西) east to west.) 4
帝、後、舉、問、志、公 Later the emperor brought this up to Master Zhi and asked him about it. 6
(貧、兒、思、舊、債。 (A poor man remembers an old debt. 5
傍人、有、眼) The bystander has eyes.) 3
志、公、云。 Master Zhi said, 3
陛下、還、識、此、人、否 “Did you recognize the man?” 6
(和、志、公、趕、出國、始、得。 (He should drive Master Zhi out of the country too. 7
好、與、三十、棒。 He deserves a beating. 4
達磨、來、也) Bodhidharma is here.) 3
帝、云。 The emperor said 2
不識 he didn’t know him. 1
(却是、武帝、承當、得、達磨、公案) (So after all the Emperor Wu has understood Bodhidharma’s case.) 6
志、公、云。 Master Zhi said, 3
此、是、觀音、大士。 “He is Mahasattva Avalokitesvara, 4
傳、佛心印 transmitting the seal of the Buddha mind.” 2
(胡亂、指、注。 (An arbitrary explanation. 3
臂膊、不、向、外、曲) The elbow doesn’t bend outwards.) 5
帝、悔。 The emperor, regretful, 2
遂、遣使、去、請 sent an emissary to invite Bodhidharma back. 4
(果然、把、不住。 (After all Wu can’t hold Bodhidharma back; 3
向、道、不唧、𠺕) I told you he was a dunce.) 4
志、公、云。 Master Zhi said, 3
莫道、陛下、發、使、去、取 “Don’t tell me you’re going to send an emissary to get him!” 6
(東家、人、死。 (When someone in the house to the east dies, 3
西家、人、助、哀。 someone from the house to the west helps in the mourning. 4
Error: 西家 (false negative, missing term)
也好、一時、趕、出國) Better they should all be driven out of the country at once.) 4
闔、國人、去。 “Even if everyone in the country went, 4
他、亦、不、回 he wouldn’t return.” 4
(志、公、也好、與、三十、棒。 (Master Zhi again deserves a beating. 6
不知、脚跟、下放、大、光明)。 He doesn’t know the great illumination shines forth right where one is.) 5
Error: 下放 (false positive)
Note: total 199 segments, 2 errors (1 false negative, 1 false positive)
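To make the annotated format concrete, here is a minimal sketch of reading one annotated line, splitting the Chinese on the enumeration mark 、, and checking the segment count; the real process_annotated.py also handles the error markers and multi-token English text.

# Minimal sketch: split the annotated Chinese on the enumeration mark 、
# and compare the number of segments with the count given at the end.
import re

line = '帝、後、舉、問、志、公 Later the emperor brought this up to Master Zhi and asked him about it. 6'

match = re.match(r'([^ ]+) (.+) (\d+)$', line)
chinese, english, count = match.group(1), match.group(2), int(match.group(3))
segments = chinese.split('、')
print(segments)                  # ['帝', '後', '舉', '問', '志', '公']
print(len(segments) == count)    # True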
Install the prerequisite libraries in the virtual env:
python -m pip install -U matplotlib
python -m pip install -U graphviz
python -m pip install -U scikit-learn
python -m pip install -U tensorflow
python -m pip install -U pandas
Use sim_log_parser to parse chinesenotes-go web app logs, including similarity results:
python -m chinesenotes.sim_log_parser \
--outfile=data/phrase_similarity_training_logs.csv
For the Chunqiu Fanlu data set:
python -m chinesenotes.sim_log_parser \
--outfile=data/phrase_similarity_chunqiu_logs.csv
Score the results for relevance in a spreadsheet and export to the CSV file data/training_balanced.csv.
Train and validate a decision tree classifier:
python -m chinesenotes.similarity_train \
--infile=data/training_balanced.csv \
--outfile=drawings/phrase_similarity_graph.png \
--valfile=data/validation_biyan.csv
Training results
precision recall f1-score support
0 0.84 0.94 0.89 170
1 0.63 0.35 0.45 48
accuracy 0.81 218
macro avg 0.73 0.65 0.67 218
weighted avg 0.79 0.81 0.79 218
|--- Unigram count <= 2.50
| |--- Hamming distance <= 2.50
| | |--- class: 0
| |--- Hamming distance > 2.50
| | |--- class: 0
|--- Unigram count > 2.50
| |--- Hamming distance <= 9.50
| | |--- class: 1
| |--- Hamming distance > 9.50
| | |--- class: 0
Validation results
precision recall f1-score support
0 0.88 0.94 0.91 49
1 0.62 0.45 0.53 11
accuracy 0.85 60
macro avg 0.75 0.70 0.72 60
weighted avg 0.84 0.85 0.84 60
Score the combined Biyan and Chunqiu training data for relevance in a spreadsheet and export to the CSV file data/training_combined.csv.
Train a decision tree classifier:
python -m chinesenotes.similarity_train \
--infile=data/training_combined.csv \
--outfile=drawings/phrase_similarity_combined_graph.png
# output
precision recall f1-score support
0 0.80 0.83 0.82 173
1 0.52 0.48 0.50 67
accuracy 0.73 240
macro avg 0.66 0.65 0.66 240
weighted avg 0.73 0.73 0.73 240
|--- Unigram count <= 2.50
| |--- Query length <= 3.50
| | |--- class: 0
| |--- Query length > 3.50
| | |--- class: 0
|--- Unigram count > 2.50
| |--- Hamming distance <= 9.50
| | |--- class: 1
| |--- Hamming distance > 9.50
| | |--- class: 0
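Read as a rule, the combined tree predicts a pair as relevant only when the unigram count is above 2.5 and the Hamming distance is at most 9.5; the query-length split in the left branch does not change the predicted class. As a plain function:

def is_relevant(unigram_count: float, hamming_distance: float) -> bool:
    """Decision rule read off the combined decision tree above.

    A phrase pair is predicted relevant only when it shares more than
    2.5 unigrams with the query and the Hamming distance is at most 9.5.
    """
    return unigram_count > 2.5 and hamming_distance <= 9.5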
Plot the results for Biyan only
python -m chinesenotes.plot_sim_training \
--infile=data/training_balanced.csv \
--outfile2=drawings/phrase_similarity_plot.png \
--unigram_lim2=0.41 \
--hamming_lim2=0.62 \
--outfile3=drawings/phrase_similarity_plot3.png \
--unigram_lim3=2.5 \
--hamming_lim3=9.5
Plot the results for Biyan and Chunqiu combined
python -m chinesenotes.plot_sim_training \
--infile=data/training_combined.csv \
--outfile2=drawings/phrase_similarity_combined_plot2.png \
--unigram_lim2=0.41 \
--hamming_lim2=0.62 \
--outfile3=drawings/phrase_similarity_combined_plot.png \
--unigram_lim3=2.5 \
--hamming_lim3=9.5
Train the neural net with the command
python -m chinesenotes.similarity_tf \
--trainfile=data/training_combined.csv \
--valfile=data/validation_biyan.csv
# output
Epoch 10/10
48/48 [==============================] - 0s 835us/step - loss: 0.5144 - accuracy: 0.7416 - val_loss: 0.4352 - val_accuracy: 0.8333
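similarity_tf.py trains a small neural network on the phrase-pair features. As a hedged sketch of what such a model could look like with tf.keras (the actual architecture and the column names in the CSV files may differ):

# Sketch of a small binary classifier for phrase-pair relevance.
# Feature and label column names are illustrative, not necessarily those in the CSV files.
import pandas as pd
import tensorflow as tf

FEATURES = ['unigram_count', 'hamming_distance', 'query_length']

train = pd.read_csv('data/training_combined.csv')
val = pd.read_csv('data/validation_biyan.csv')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(len(FEATURES),)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train[FEATURES], train['relevant'], epochs=10,
          validation_data=(val[FEATURES], val['relevant']))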
The probability p(a, b) can be computed as
p(a, b) = [f(ab) + f(ba)] / B
Where f(ab) is the frequency of the character bigram ab, f(ba) is the frequency of the character bigram ba, and B is the total number of character bigrams.
These should be computed over the entire corpus, but let’s use the term 西家 from Scroll 1 of the Blue Cliff Record for a simple illustration. It is preferable to compute the frequencies over the entire corpus (volumes 1-55 of the Taisho) rather than just the document in question, since that gives more representative statistical values. 西家 occurs 53 times in the corpus and 家西 occurs 4 times. The corpus contains 85,519,494 characters and 83,666,199 character bigrams. 西 occurs 30,743 times and 家 occurs 66,590 times. The pointwise mutual information is
p(西家) = (53 + 4) / 83666199 = 0.0000006665145
p(西) = 30743 / 83666199 = 0.000367448
p(家) = 66590 / 83666199 = 0.000795901
I(西, 家) = log2[0.0000006665145 / (0.000367448 * 0.000795901)] = log2(2.27905) = 1.19

T value

The t value for the character bigram 西家 is approximately
x ≈ p(西家) = 53 / 83666199 = 0.0000006334697
s^2 ≈ x
μ = p(西) * p(家) = 0.000367448 * 0.000795901 = 0.0000002924522
t = (x - μ) / sqrt(s^2 / N) = (0.0000006334697 - 0.0000002924522) / sqrt(0.0000006334697 / 85519494) = 3.96

Chi square

For a 2-by-2 contingency table, such as the test for bigram correlation, the chi-square statistic is given by the formula
X^2 = N * (O11*O22 - O12*O21)^2 / [(O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22)]
Where N is the total number of bigrams in the corpus and Oij are the observed counts for each combination of characters i and j. For example, for 西家
O11 = 53 (count for 西 followed by 家)
O12 = 30743 (count for 西 not followed by 家)
O21 = 66590 (count for not 西 followed by 家)
O22 = 85519494 - 66590 - 30743 = 85422161 (count for not 西 followed by not 家)
X^2 = 85519494 * (53 * 85422161 - 30743 * 66590)^2 / ((53 + 30743) * (53 + 66590) * (30743 + 85422161) * (66590 + 85422161)) = 35.1
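Both statistics are easy to reproduce in a few lines of Python using the counts quoted above:

import math

# Corpus counts for 西 and 家 quoted above.
n_bigram = 53            # 西 followed by 家
n_xi, n_jia = 30743, 66590
total_bigrams = 83_666_199
N = 85_519_494           # value used as N in the worked examples above

# t score
x = n_bigram / total_bigrams
mu = (n_xi / total_bigrams) * (n_jia / total_bigrams)
t = (x - mu) / math.sqrt(x / N)
print(round(t, 2))       # ≈ 3.96

# chi-square for the 2-by-2 contingency table
o11, o12, o21 = n_bigram, n_xi, n_jia
o22 = N - n_jia - n_xi
chi2 = N * (o11 * o22 - o12 * o21) ** 2 / (
    (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22))
print(round(chi2, 1))    # ≈ 35.1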