Self-Attention Model for Segmentation (SAMS)

This is a Python implementation of the Chinese word segmentation (CWS) model described in "Attention Is All You Need for Chinese Word Segmentation".

Requirements

Python 3.6
PyTorch 1.0.0 or later (1.5.0 also works)
SciPy 1.5.2

Pre-trained Embeddings

Gensim

Training

Data

The training data comes from SIGHAN05, which provides four corpora:

Corpus	Encoding	Word Types	Words	Character Types	Characters
Peking University (PKU)	CP936/Unicode	55,303	1,109,947	4,698	1,826,448
Microsoft Research (MSR)	CP936/Unicode	88,119	2,368,391	5,167	4,050,469
Academia Sinica (AS)	Unicode/Big Five Plus	141,340	5,449,698	6,117	8,368,050
City University of Hong Kong (CITYU)	HKSCS Unicode/Big Five	69,085	1,455,629	4,923	2,403,355

Training Instructions

We provide run512msrpkuascityu.sh for training. This shell script preprocesses the training data and trains the model. After preprocessing, the given data is split into two parts, one for training and one for development.

To train the model, fill in the following arguments in the script:

name of the dataset
namespace of your training run
path to your training data
dropout
learning rate
warmup steps
number of layers
number of heads
dimension of the model
dimension of the feed-forward layer

The following values can be used to train the model:

dropout=0.1
learning rate=0.0003
warmup steps=16000
number of layers=6
number of heads=4
dimension of the model=256
dimension of the feed-forward layer=1024

bash predatas.sh $name

dropout=0.1
lr=0.0003
warmup=16000

layer=6
head=4
dim=256
ff=1024
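To make the interaction between lr and warmup concrete, here is a minimal Python sketch of a linear warmup schedule. It is an illustration only: the README does not state which schedule train_single.py uses, so the function name warmup_lr, the linear ramp, the starting rate of 0.0, and the constant rate after warmup are all assumptions.

```python
def warmup_lr(step, base_lr=0.0003, warmup_steps=16000, start_lr=0.0):
    """Return the learning rate at a given step (hypothetical schedule).

    Assumption: the rate rises linearly from start_lr to base_lr over
    warmup_steps steps and then stays at base_lr.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    fraction = step / warmup_steps  # progress through warmup, in [0, 1)
    return start_lr + fraction * (base_lr - start_lr)


# Learning rate at the start, midpoint, and end of warmup.
for s in (0, 8000, 16000):
    print(s, warmup_lr(s))
```

With the values listed above, this sketch reaches the full learning rate of 0.0003 at step 16000.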

After filling in the script, simply run:

bash run512msrpkuascityu.sh

The log is saved to

gofile/log/your namespace

Two checkpoints will be saved to

gofile/checkpoints/your namespace

One is the last checkpoint and the other is the best checkpoint.

You can also train the model with your own shell scripts.

Arguments for Train

To train the model from your own shell script, use the command

python train_single.py

with the following arguments:

Argument	Description	Type	Required
`--logfiles`	Path to save log	string	Yes
`--name`	Name of dataset	string	Yes
`--dataset`	Name of your training	string	No
`--trainfile`	Path to train data	string	Yes
`--evalfile`	Path to development data	string	Yes
`--savefiles`	Path to save checkpoint	string	Yes
`--savesteps`	Validate and save a checkpoint every N steps	int	Yes
`--savevalid`	Path to save result of eval	string	Yes
`--train_from`	Path to load checkpoint	string	No
`--showsteps`	Show the status of the model every N steps	int	No
`--gpu`	Use gpu		No
`--use_buffers`	Use buffer for dataset		No
`--dropout`	Dropout for Model	float	Yes
`--des`	Sort data		No
`--buffer_size`	Size of buffer	int	No
`--token`	Token Type	token or sentence	No
`-e`,`--epoch`	Epoch	int	Yes
`--warmup_start_lr`	Warmup learning rate	float	No
`--warmup_steps`	Steps for warmup	int	No
`--batch_size`,`-b`	Size of batch	int	No
`--loss`	Type of loss	crossentropyloss,nllloss	No
`--learning_rate`	Learning rate	float	No
`--adam_beta1`	Beta1 for Adam	float	No
`--adam_beta2`	Beta2 for Adam	float	No
`--optim`	Optimizer	sgd,adagrad,adam	No
`--cyc`	Train model using steps only and epoch will not be shown		No
`--seglayers`	Using HiRED layer		No
`--segwords`	Using result of HiRED layer to generate a word vector		No
`--middecode`	Using output vector of HiRED layer as feature		No
`--gate`	Using gate to incorporate directional representation		No
`--head`	Heads of multi-head attention	int	Yes
`-l`,`--layer`	Layers of model	int	Yes
`-d`,`--dimension`	Dimension of model	int	Yes
`-f`,`--ff`	Dimension of feed-forward layer	int	Yes
`-p`,`--position_encoding`	Using position-encoding		No
`--norm_after`	Norm after		No
`--reloadlrs`	Reload learning rate		No
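As a rough illustration of how the required arguments fit together, the following Python sketch assembles a train_single.py command line. Only the flag names come from the table above. The helper build_train_command, every path, the namespace demo, and the values for --savesteps and --epoch are hypothetical placeholders; the model sizes reuse the values suggested in the training instructions.

```python
import shlex


def build_train_command(args, extra_flags=(), python="python"):
    """Assemble an argument list for train_single.py (hypothetical helper).

    args is a list of (flag, value) pairs; extra_flags holds switches
    that take no value, such as --gpu.
    """
    cmd = [python, "train_single.py"]
    for flag, value in args:
        cmd.extend([flag, str(value)])
    cmd.extend(extra_flags)
    return cmd


# All paths and the savesteps/epoch values below are placeholders.
required = [
    ("--logfiles", "gofile/log/demo"),
    ("--name", "pku"),
    ("--trainfile", "data/pku_train.txt"),
    ("--evalfile", "data/pku_dev.txt"),
    ("--savefiles", "gofile/checkpoints/demo"),
    ("--savesteps", 1000),
    ("--savevalid", "gofile/valid/demo"),
    ("--dropout", 0.1),
    ("--epoch", 10),
    ("--head", 4),
    ("--layer", 6),
    ("--dimension", 256),
    ("--ff", 1024),
]

cmd = build_train_command(required, extra_flags=["--gpu"])
# Print a shell-safe version of the command for copying into a script.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell script.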

You can also use the shell script createTrainData.sh to generate the training and validation datasets.

bash createTrainData.sh indata sizeofvalid nameofdataset
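For readers who want to see what such a split amounts to, here is a minimal Python sketch. It does not reproduce createTrainData.sh or gettraindev.py, whose exact behaviour is not described in this README. The function name split_train_dev and the choice to hold out the last dev_size sentences are assumptions.

```python
from typing import List, Tuple


def split_train_dev(lines: List[str], dev_size: int) -> Tuple[List[str], List[str]]:
    """Hold out the last dev_size sentences as a development set.

    Assumption: one segmented sentence per line, and the held-out
    sentences are taken from the end of the data.
    """
    if not 0 <= dev_size <= len(lines):
        raise ValueError("dev_size must be between 0 and the number of lines")
    cut = len(lines) - dev_size
    return lines[:cut], lines[cut:]


# Toy data: each string stands for one whitespace-segmented sentence.
sentences = ["a b", "c d e", "f", "g h"]
train, dev = split_train_dev(sentences, dev_size=1)
print(len(train), len(dev))
```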

Evaluation Instructions

With a saved checkpoint, you can use

bash dovalids.sh

to evaluate your model.

Before evaluation, fill in the following arguments in dovalids.sh:

name, name of your evaluation
checkpoint, path to checkpoint
testfile, path to file for testing
type, one of pku, msr, as, cityu
gold, path to gold
words, path to vocabulary

This process creates several files; the one whose name ends with eval_repo contains the result. The precision, recall, and F1 score are printed.

You can also use

perl score path_to_words path_to_gold path_to_result

or

python calculatePRF1.py --pred path_to_result --gold path_to_gold --word path_to_words

to compute the score.
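For reference, word-level precision, recall, and F1 for segmentation are commonly computed by comparing word spans, as in the minimal Python sketch below. This is not the code of calculatePRF1.py or the score script; the function names word_spans and prf1 are hypothetical, and the sketch ignores the vocabulary file that those tools accept.

```python
from typing import Iterable, List, Set, Tuple


def word_spans(words: Iterable[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 over paired sentences."""
    correct = gold_total = pred_total = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_set, pred_set = word_spans(gold_words), word_spans(pred_words)
        correct += len(gold_set & pred_set)  # words with identical boundaries
        gold_total += len(gold_set)
        pred_total += len(pred_set)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The prediction wrongly splits the second gold word into two characters.
gold = [["我", "喜欢", "北京"]]
pred = [["我", "喜", "欢", "北京"]]
p, r, f = prf1(gold, pred)
print(p, r, f)
```

In this toy example, two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3.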

Sequence Labeling

Coming soon.

Citation

If you use this software for research, please cite our paper as follows:

@inproceedings{duan-zhao-2020-attention,
    title = "Attention Is All You Need for {C}hinese Word Segmentation",
    author = "Duan, Sufeng  and Zhao, Hai",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "3862--3872"
}
