LSTM-(CNN)-CRF for CWS

Bi-LSTM+CNN+CRF for Chinese word segmentation.

A new version is now available. The old version remains available on another branch.

Usage

What's new?

The system is organized more cleanly;
The CNN model has been tweaked;
The limit on the maximum sentence length has been removed, although you can still set one;
Gradient clipping has been added;
Pretrained embeddings are optional, although I did not see a non-trivial margin from them in my experiments;
The system can save the best model during training, scored by F-value.
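As an illustration of the gradient clipping item above, here is a minimal sketch in plain Python. It assumes clipping by global norm, which is one common variant; the rule and threshold actually used in this repository may differ, and `clip_by_global_norm` and `max_norm` are hypothetical names chosen for this example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Assumption: clipping by global norm; this is a sketch, not the repository's code.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm or global_norm == 0.0:
        # Already within the limit: return copies unchanged.
        return [list(vec) for vec in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

# Toy gradients for two parameter groups; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, clipped_norm)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its magnitude is limited.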

Command Step by Step

Preprocessing
Generates training files from corpora such as People 2014 and icwb2-data. See the source code or run python preprocess.py -h for more details.

For example, for the People data, use the default arguments. (The only input file is --all_corpora; the others are output files.)

For the icwb2-data such as PKU: (The input files are --all_corpora and --gold_file)
python3 preprocess.py --all_corpora /home/synrey/data/icwb2-data/training/pku_training.utf8 --vob_path /home/synrey/data/icwb2-data/data-pku/vocab.txt --char_file /home/synrey/data/icwb2-data/data-pku/chars.txt --train_file_pre /home/synrey/data/icwb2-data/data-pku/train --eval_file_pre /home/synrey/data/icwb2-data/data-pku/eval --gold_file /home/synrey/data/icwb2-data/gold/pku_test_gold.utf8 --is_people False --word_freq 2
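To make the preprocessing step more concrete, the sketch below turns one whitespace-segmented line (the format of the icwb2-data training files) into characters with per-character labels. It assumes a four-tag B/M/E/S scheme, which is common for word segmentation; the labels and file layout that preprocess.py actually writes may differ, and `words_to_tags` and `line_to_example` are hypothetical helper names.

```python
def words_to_tags(words):
    """Map a list of words to parallel lists of characters and B/M/E/S tags.

    Assumption: S = single-character word, B/M/E = begin/middle/end of a longer word.
    """
    chars, tags = [], []
    for word in words:
        if not word:
            continue
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags

def line_to_example(line):
    """Split a segmented line on whitespace and tag its characters."""
    return words_to_tags(line.split())

chars, tags = line_to_example("我 爱 北京 天安门")
print(list(zip(chars, tags)))
```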
Pretraining
You may first need to compile word2vec.c using the script third_party/compile_w2v.sh.
For the PKU corpus:
./third_party/word2vec -train /home/synrey/data/icwb2-data/data-pku/chars.txt -output /home/synrey/data/icwb2-data/data-pku/char_vec.txt -size 100 -sample 1e-4 -negative 0 -hs 1 -min-count 2

For the People corpus:
./third_party/word2vec -train /home/synrey/data/cws-v2-data/chars.txt -output /home/synrey/data/cws-v2-data/char_vec.txt -size 100 -sample 1e-4 -negative 0 -hs 1 -min-count 3
Training
For example:

python3 -m sycws.sycws --train_prefix /home/synrey/data/cws-v2-data/train --eval_prefix /home/synrey/data/cws-v2-data/eval --vocab_file /home/synrey/data/cws-v2-data/vocab.txt --out_dir /home/synrey/data/cws-v2-data/model --model CNN-CRF

If you want to use the pretrained embeddings, set the argument --embed_file to the path of your embeddings, such as --embed_file /home/synrey/data/cws-v2-data/char_vec.txt
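For reference, the word2vec commands above do not pass -binary, so the embeddings are written as text: a header line with the vocabulary size and dimension, then one token per line followed by its values. The sketch below parses that format in plain Python. It is only an illustration of the file layout; how sycws itself reads --embed_file is not shown here, and `load_word2vec_text` is a hypothetical name.

```python
import io

def load_word2vec_text(handle):
    """Parse word2vec text output into a {token: vector} dict.

    Lines whose length does not match the header dimension are skipped.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

# Tiny in-memory stand-in for a char_vec.txt file (values are made up).
sample = io.StringIO("3 2\n</s> 0.0 0.0 \n中 0.1 -0.2 \n国 0.3 0.4 \n")
vectors, dim = load_word2vec_text(sample)
print(dim, sorted(vectors))
```

With a real file, open it with `open(path, encoding="utf-8")` and pass the handle in the same way.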

See the source code for the remaining arguments. It should perform well with the default parameters, but you may also try other parameter settings.

About the Models

Bi-LSTM-SL-CRF

This model follows Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural Architectures for Named Entity Recognition. In Proc. NAACL-HLT. 2016.

Specifically, there is a single layer (SL) between the BiLSTM and the CRF.

Bi-LSTM-CNN-CRF

See Here.
Namely, the single layer between the BiLSTM and the CRF is replaced by a CNN layer.
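Both models end in a CRF layer, which picks the best tag sequence for the whole sentence rather than the best tag at each position. The sketch below shows generic Viterbi decoding for a linear-chain CRF in plain Python. It is not the repository's implementation; the two-tag setup and all scores are made up for illustration, and `viterbi_decode` is a hypothetical name.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][k]: score of tag k at position t (e.g. from the BiLSTM/CNN layers).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best_last = max(range(num_tags), key=lambda i: scores[i])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]

# Toy setup: tag 0 = B, tag 1 = E; repeating the same tag is heavily penalized.
transitions = [[-10.0, 0.0], [0.0, -10.0]]
emissions = [[1.0, 0.5], [0.6, 0.5], [0.2, 0.3]]
path, score = viterbi_decode(emissions, transitions)
print(path, score)
```

In this toy input the emission scores at the second position slightly prefer tag 0, but the transition scores make the decoder choose tag 1 there, which is the kind of sequence-level constraint the CRF contributes.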

Comparison

Results on the People 2014 corpus:

Models	Bi-LSTM-SL-CRF	Bi-LSTM-CNN-CRF
Precision	96.25%	96.30%
Recall	95.34%	95.70%
F-value	95.79%	96.00%
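The F-values in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the scoring script may compute it from raw counts instead) and recomputes both entries from the reported precision and recall.

```python
def f_value(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in percent) as reported in the table above.
reported = {
    "Bi-LSTM-SL-CRF": (96.25, 95.34),
    "Bi-LSTM-CNN-CRF": (96.30, 95.70),
}
for name, (p, r) in reported.items():
    print(f"{name}: {f_value(p, r):.2f}%")
```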

Segmentation

Inference
For example, to decode with the BiLSTM-CNN-CRF model:

python3 -m sycws.sycws --vocab_file /home/synrey/data/cws-v2-data/vocab.txt --out_dir /home/synrey/data/cws-v2-data/model/best_Fvalue --inference_input_file /home/synrey/data/cws-v2-data/test.txt --inference_output_file /home/synrey/data/cws-v2-data/result.txt

Set --model CRF to use the BiLSTM-SL-CRF model for inference. Note that the inference command is the same even if you trained with pretrained embeddings.
PRF Scoring

python3 PRF_Score.py <test_file> <gold_file>
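As a rough picture of what word-level scoring involves, the sketch below counts a predicted word as correct only if both of its boundaries match a gold word, then derives precision, recall, and F-value. This is a common way to score segmentation, written here under that assumption; it is not taken from PRF_Score.py, whose exact counting rules may differ, and `to_spans` and `prf` are hypothetical names.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def prf(test_lines, gold_lines):
    """Word-level precision, recall, and F-value over paired segmented lines."""
    correct = predicted = gold = 0
    for test_line, gold_line in zip(test_lines, gold_lines):
        test_spans = to_spans(test_line.split())
        gold_spans = to_spans(gold_line.split())
        correct += len(test_spans & gold_spans)
        predicted += len(test_spans)
        gold += len(gold_spans)
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

# One sentence where the system wrongly splits the last word.
p, r, f = prf(["我 爱 北京 天 安门"], ["我 爱 北京 天安门"])
print(p, r, f)
```

Here three of the five predicted words match the four gold words, so a single wrong boundary costs both precision and recall.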
