
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer #2155

Merged
merged 2 commits into from
Mar 29, 2018

Conversation

trungtv
Contributor

@trungtv trungtv commented Mar 28, 2018

Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer

Description

  1. The entire vi package, which can be installed in the spacy/lang folder. I followed the way spaCy adds support for Chinese and did the same for Vietnamese.
  2. External package to be installed: pip install pyvi (PS: I am the owner of pyvi, so I can do my best to improve it in the long run.)
    Please take a look at my compiled model: vi_core_news_md
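The pattern described above (and used for Chinese with jieba) delegates word segmentation to the external library, then splits its output back into word tokens. A minimal pure-Python sketch of that idea, where `pyvi_segment` is a stand-in for pyvi's tokenizer and all names are illustrative, not spaCy's actual API:

```python
# Sketch of wrapping an external word segmenter, as the Chinese support
# does with jieba and this PR does with pyvi. pyvi_segment is a stub
# mimicking the shape of pyvi's output (syllables of a compound word
# joined by underscores); it is NOT the real pyvi API.

def pyvi_segment(text):
    # Stand-in: the real segmenter would analyze `text`.
    return "Bách_Khoa Hà_Nội"

class VietnameseTokenizerSketch:
    """Turns the segmenter's underscore-joined output into word tokens."""
    def __call__(self, text):
        return [w.replace("_", " ") for w in pyvi_segment(text).split()]

tok = VietnameseTokenizerSketch()
print(tok("Bách Khoa Hà Nội"))  # ['Bách Khoa', 'Hà Nội']
```

In the actual package, the tokenizer would build a spaCy Doc from these words rather than return a plain list.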

Types of change

It is a new feature adding support for Vietnamese to spaCy.

Checklist

  • [x] I have submitted the spaCy Contributor Agreement.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

@explosion-bot
Collaborator

Hi @trungtv, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agreement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the .github/contributors directory and add it to this pull request. If your pull request targets a branch that's not master, for example develop, make sure to submit the Contributor Agreement to the master branch. Thanks a lot!

If you've already included the Contributor Agreement in your pull request above, you can ignore this message.

@ines added the labels "enhancement (Feature requests and improvements)" and "lang / vi (Vietnamese language data and models)" on Mar 28, 2018
@honnibal
Member

Thanks! Looks great!

Actually in v2.1 we should be able to supply nice Vietnamese support in spaCy itself as well. If you check out the develop branch and download the Universal Dependencies corpora, you should be able to do:

spacy ud-train /path/to/ud-treebanks-conll2017 /tmp/parses /path/to/config.json UD_Vietnamese

An example config.json is:

{
    "multitask_tag": true,
    "multitask_sent": true,
    "dropout": 0.2,
    "batch_size": 5000
}

(Note batch size is in number of words, not documents or sentences!)

I've attached an example accuracy log below.

The parser on the develop branch (which will be in v2.1) learns to assign the label subtok to parts of a word that the tokenizer has split up. This is really nice for Vietnamese, where the problem is exactly that the tokenizer over-segments.

I'm really pleased with how accuracy is coming along for Vietnamese, so I'm very glad you submitted this: with the pyvi option for segmentation and the language data, we should be able to offer much better support for Vietnamese than other tools.

I'm also hoping that the joint segmentation and parsing strategy can allow some semi-supervised learning. It may be that a word segmenter like pyvi makes different errors from the parser. If so, we may be able to find a way to partially train the parser on its output.

Train and evaluate UD_Vietnamese using lang vi
Epoch   Loss    LAS     UAS     TAG     SENT    WORD                                                
0       3189.4  7.7     11.3    46.5    85.1    60.5
1       2242.0  18.5    22.7    60.2    90.4    69.2                                                
2       1969.0  28.4    33.6    68.7    92.9    77.8                                                
3       1491.5  34.1    39.9    72.2    94.2    81.8                                                
4       1293.8  37.8    43.8    74.7    94.0    84.3                                                
5       1102.9  39.5    45.2    75.2    94.6    85.1                                                
6       1068.9  41.6    47.1    75.4    95.0    85.5                                                
7       930.8   43.1    48.8    76.0    94.6    86.2                                                
8       858.3   43.9    49.6    76.4    95.6    86.7                                                
9       751.6   44.8    50.5    76.8    95.4    86.8                                                
10      700.6   43.7    49.2    76.4    96.1    86.6                                                
11      688.9   44.7    50.3    76.6    96.1    87.0                                                
12      543.6   44.8    50.4    76.5    95.7    86.9                                                
13      471.9   45.1    50.6    76.7    96.1    87.1                                                
14      420.6   45.6    51.2    76.6    95.9    87.4                                                
15      398.9   46.2    51.5    76.5    95.9    87.2                                                
16      374.8   46.0    51.3    76.5    95.6    87.2                                                
17      338.5   46.0    51.5    76.5    95.7    87.3                                                
18      307.7   45.7    51.1    76.4    95.9    87.0                                                
19      260.2   46.0    51.4    76.6    95.7    87.2                                                
20      275.7   46.1    51.5    76.4    95.9    87.0                                                
21      308.0   46.1    51.5    76.6    95.6    87.4                                                
22      231.5   46.4    52.0    76.8    95.3    87.7                                                
23      216.4   46.3    52.0    76.8    95.1    87.6
24      183.5   46.1    51.9    76.8    95.6    87.7                                                
25      157.0   46.2    52.1    76.9    95.7    87.7
26      182.6   46.1    52.0    76.8    95.2    87.7                                                

@trungtv
Contributor Author

trungtv commented Mar 29, 2018

Wow, the future looks bright for our Vietnamese NLP community. Until now, we have not had a robust, industrial-strength NLP toolkit for Vietnamese. Don't hesitate to guide us in making this a meaningful milestone. Speaking of manpower, other people from our (research/applied) Vietnamese NLP community, as well as from underthesea, have already agreed to lend a hand.

@honnibal
Member

What's your PyVI model trained on? Is the token F1 score the same metric as the UD word segmentation evaluation? At first I wondered whether the difference was that the CoNLL 2017 scoring script was evaluating whole words, while you're evaluating the surface tokens. But the CoNLL script outputs the same score for "word" and "token" F1, so I'm a bit confused.

The CoNLL 2017 evaluation showed quite low word F1 for all participants. Our current score of 88% would put us at the top of the pack, but it's much below the F1 you're reporting. Is it a different data set, or is it just that none of the participants took Vietnamese-specific measures that easily fix the accuracy problems?

There was a team from Facebook who submitted a CRF-based system in CoNLL 2017. Their score on Vietnamese was very low, while they scored first in Chinese. So it does seem possible they made one more mistake on Vietnamese than everyone else, and you've done one fewer :). Or it could just be the datasets...

@honnibal
Member

Merging this as I'm keen to start playing with the model!

@honnibal honnibal merged commit ea2af94 into explosion:master Mar 29, 2018
@trungtv
Contributor Author

trungtv commented Mar 30, 2018

My model was trained on a different dataset (the Vietnamese treebank). Also, there is a common practice in our Vietnamese NLP community to join compound words with '_' after tokenization. For example, "Bách Khoa Hà Nội" (en: Hanoi University of Science and Technology) would become "Bách_Khoa Hà_Nội". We can then use the tokenized text, where words are separated by spaces as in other languages, directly for word2vec training in gensim.
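The convention described above can be sketched in a few lines: joining the syllables of each compound word with '_' means whitespace splitting recovers word boundaries, which is exactly what whitespace-based word2vec pipelines expect. The strings below are just the example from this comment.

```python
# Vietnamese NLP community convention: pyvi-style output joins the
# syllables of a compound word with '_', so splitting on whitespace
# yields one token per word, and stripping '_' restores the surface form.
tokenized = "Bách_Khoa Hà_Nội"  # underscore-joined segmenter output
words = [w.replace("_", " ") for w in tokenized.split()]
print(words)  # ['Bách Khoa', 'Hà Nội']
```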

@honnibal
Member

In spaCy the .orth_ attribute should always match up to the original string, or we lose information. We don't actually keep the string anywhere; we rely on reconstructing it with the words and the token.spacy boolean (this is why we have some purely whitespace tokens). We can have an underscored version in the .norm_ attribute though?

Either way it will be easy to output a Gensim-friendly version by replacing the spaces within each token.text when printing.
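The space-replacement step suggested above is a one-liner. A minimal sketch, using plain strings to stand in for each spaCy `token.text` (so no spaCy install is assumed):

```python
# Producing a Gensim-friendly line from tokens whose multi-syllable words
# keep internal spaces (matching the original string, as .orth_ requires).
# token_texts stands in for [t.text for t in doc].
token_texts = ["Bách Khoa", "Hà Nội"]
gensim_line = " ".join(t.replace(" ", "_") for t in token_texts)
print(gensim_line)  # Bách_Khoa Hà_Nội
```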

Would it be possible to license the Vietnamese treebank for commercial purposes? I know this is often tricky when multiple institutions are involved.

@trungtv
Contributor Author

trungtv commented Mar 30, 2018

You are right. It'd be better to have the underscored version in .norm_. Regarding the license, I think we can sort it out, as long as we build something good for our VLSP community.
