Which Is Better For Vietnamese Named Entity Recognition: Features or Embeddings?

With the move from traditional machine learning to neural network models, the emphasis on feature engineering has diminished accordingly. It is often claimed that neural networks can learn the information needed for the task at hand directly from the dataset. In this paper, we seek to understand the power of neural networks by exploring whether handcrafted features still have a role in the era of deep learning, using the Named Entity Recognition task for the Vietnamese language. The results show that, though the model architecture is indeed crucial, word embeddings are not necessarily superior to handcrafted features.


Dependencies

You need to install the PyTorch library.
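
For example, assuming a pip-based setup, you can install it with:

pip install torch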

Data and pre-trained embeddings

For the NER dataset, please see the VLSP web page.

For pre-trained embeddings, you can download fastText here and GloVe here.

If you wish to see the data distribution on the training and test sets, type:

python data_stats.py

Models

To run experiments on the perceptron, type:

python perceptron_exp.py

The perceptron accepts these arguments (an example invocation follows the list):

--path_to_data_dir  path to data directory
--data_file         name of data file/corpus to load or create
--no_misc           set to True to skip recognition of the MISC entity type, default=True
--learning_rate     learning rate, default=0.2
--num_epochs        number of training epochs, default=20
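
For example, assuming the corpus lives in a data directory (the path and values below are illustrative), a run with a custom learning rate and epoch count looks like:

python perceptron_exp.py --path_to_data_dir data --learning_rate 0.1 --num_epochs 30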

To run experiments on neural network models (crf, lstm, lstm-crf), type:

python nn_experiments.py

It accepts these arguments:

--data_path           path to data directory
--path_to_emb_file    path to embedding file
--no_misc             set to True to skip recognition of the MISC entity type, default=True
--num_epochs          number of training epochs, default=20
--lstm_hidden_dim     size of hidden layer, default=100
--batch_size          size of training batch, default=30
--val_batch_size      size of validation batch, default=10
--learning_rate       learning rate, default=0.001
--input_type          type of input representation, choices=['features', 'embeddings', 'stackings'], default='features'
--model_architecture  the model for NER task, choices=['crf', 'lstm', 'lstm-crf'], default='lstm-crf'
--more_training       continue training for more epochs

For example, assuming you have the data available, if you wish to perform an experiment with the stacking (features + embeddings) representation on the LSTM, type:

python nn_experiments.py --input_type stackings --model_architecture lstm

To continue training, simply type:

python nn_experiments.py --more_training

You can also specify the number of training epochs with the --num_epochs argument.
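
For example, to resume training for 40 more epochs (the epoch count here is purely illustrative):

python nn_experiments.py --more_training --num_epochs 40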
