
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer #2155

Merged
merged 2 commits into from
Mar 29, 2018

Conversation

trungtv
Contributor

@trungtv trungtv commented Mar 28, 2018

Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer

Description

  1. The entire vi package, which can be installed in the spacy/lang folder. I followed the way spaCy adds support for Chinese and did the same for Vietnamese.
  2. External package to be installed: pip install pyvi (PS: I am the owner of pyvi, so I can do my best to improve it in the long run.)
    Please take a look at my compiled model: vi_core_news_md
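The pattern described above (and used for Chinese with jieba) delegates word segmentation to the external library, then splits its output back into word tokens. A minimal pure-Python sketch of that idea, where `pyvi_segment` is a stand-in for pyvi's tokenizer and all names are illustrative, not spaCy's actual API:

```python
# Sketch of wrapping an external word segmenter, as the Chinese support
# does with jieba and this PR does with pyvi. pyvi_segment is a stub
# mimicking the shape of pyvi's output (syllables of a compound word
# joined by underscores); it is NOT the real pyvi API.

def pyvi_segment(text):
    # Stand-in: the real segmenter would analyze `text`.
    return "Bách_Khoa Hà_Nội"

class VietnameseTokenizerSketch:
    """Turns the segmenter's underscore-joined output into word tokens."""
    def __call__(self, text):
        return [w.replace("_", " ") for w in pyvi_segment(text).split()]

tok = VietnameseTokenizerSketch()
print(tok("Bách Khoa Hà Nội"))  # ['Bách Khoa', 'Hà Nội']
```

In the actual package, the tokenizer would build a spaCy Doc from these words rather than return a plain list.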

Types of change

It is a new feature adding support for Vietnamese to spaCy.

Checklist

  • [x] I have submitted the spaCy Contributor Agreement.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

@explosion-bot
Collaborator

Hi @trungtv, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agreement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the .github/contributors directory and add it to this pull request. If your pull request targets a branch that's not master, for example develop, make sure to submit the Contributor Agreement to the master branch. Thanks a lot!

If you've already included the Contributor Agreement in your pull request above, you can ignore this message.

@ines added the labels "enhancement (Feature requests and improvements)" and "lang / vi (Vietnamese language data and models)" on Mar 28, 2018
@honnibal
Member

Thanks! Looks great!

Actually in v2.1 we should be able to supply nice Vietnamese support in spaCy itself as well. If you check out the develop branch and download the Universal Dependencies corpora, you should be able to do:

spacy ud-train /path/to/ud-treebanks-conll2017 /tmp/parses /path/to/config.json UD_Vietnamese

An example config.json is:

{
    "multitask_tag": true,
    "multitask_sent": true,
    "dropout": 0.2,
    "batch_size": 5000
}

(Note batch size is in number of words, not documents or sentences!)

I've attached an example accuracy log below.

The parser on the develop branch (which will be in v2.1) learns to assign the label subtok to parts of a word that the tokenizer has split up. This is really nice for Vietnamese, where the problem is exactly that the tokenizer over-segments.

I'm really pleased with how accuracy is coming along for Vietnamese, so I'm very glad you submitted this: with the pyvi option for segmentation and the language data, we should be able to offer much better support for Vietnamese than other tools.

I'm also hoping that the joint segmentation and parsing strategy can allow some semi-supervised learning. It may be that a word segmenter like pyvi makes different errors from the parser. If so, we may be able to find a way to partially train the parser on its output.

Train and evaluate UD_Vietnamese using lang vi
Epoch   Loss    LAS     UAS     TAG     SENT    WORD                                                
0       3189.4  7.7     11.3    46.5    85.1    60.5
1       2242.0  18.5    22.7    60.2    90.4    69.2                                                
2       1969.0  28.4    33.6    68.7    92.9    77.8                                                
3       1491.5  34.1    39.9    72.2    94.2    81.8                                                
4       1293.8  37.8    43.8    74.7    94.0    84.3                                                
5       1102.9  39.5    45.2    75.2    94.6    85.1                                                
6       1068.9  41.6    47.1    75.4    95.0    85.5                                                
7       930.8   43.1    48.8    76.0    94.6    86.2                                                
8       858.3   43.9    49.6    76.4    95.6    86.7                                                
9       751.6   44.8    50.5    76.8    95.4    86.8                                                
10      700.6   43.7    49.2    76.4    96.1    86.6                                                
11      688.9   44.7    50.3    76.6    96.1    87.0                                                
12      543.6   44.8    50.4    76.5    95.7    86.9                                                
13      471.9   45.1    50.6    76.7    96.1    87.1                                                
14      420.6   45.6    51.2    76.6    95.9    87.4                                                
15      398.9   46.2    51.5    76.5    95.9    87.2                                                
16      374.8   46.0    51.3    76.5    95.6    87.2                                                
17      338.5   46.0    51.5    76.5    95.7    87.3                                                
18      307.7   45.7    51.1    76.4    95.9    87.0                                                
19      260.2   46.0    51.4    76.6    95.7    87.2                                                
20      275.7   46.1    51.5    76.4    95.9    87.0                                                
21      308.0   46.1    51.5    76.6    95.6    87.4                                                
22      231.5   46.4    52.0    76.8    95.3    87.7                                                
23      216.4   46.3    52.0    76.8    95.1    87.6
24      183.5   46.1    51.9    76.8    95.6    87.7                                                
25      157.0   46.2    52.1    76.9    95.7    87.7
26      182.6   46.1    52.0    76.8    95.2    87.7                                                

@trungtv
Contributor Author

trungtv commented Mar 29, 2018

Wow, the future looks bright for our Vietnamese NLP community. Until now, we have not had a robust, industrial-strength NLP toolkit for Vietnamese. Don't hesitate to guide us in making this a meaningful milestone. Speaking of manpower, other people from our (research/applied) Vietnamese NLP community, as well as from underthesea, have already agreed to lend a hand.

@honnibal
Member

What's your PyVI model trained on? Is the token F1 score the same metric as the UD word segmentation evaluation? At first I wondered whether the difference was that the CoNLL 2017 scoring script was evaluating whole words, while you're evaluating the surface tokens. But the CoNLL script outputs the same score for "word" and "token" F1, so I'm a bit confused.

The CoNLL 2017 evaluation showed quite low word F1 for all participants. Our current score of 88% would put us at the top of the pack, but it's much below the F1 you're reporting. Is it a different data set, or is it just that none of the participants took Vietnamese-specific measures that easily fix the accuracy problems?

There was a team from Facebook who submitted a CRF-based system in CoNLL 2017. Their score on Vietnamese was very low, while they scored first in Chinese. So it does seem possible they made one more mistake on Vietnamese than everyone else, and you've done one fewer :). Or it could just be the datasets...

@honnibal
Member

Merging this as I'm keen to start playing with the model!

@honnibal honnibal merged commit ea2af94 into explosion:master Mar 29, 2018
@trungtv
Contributor Author

trungtv commented Mar 30, 2018

My model was trained on a different dataset (the Vietnamese treebank). Also, there is a common practice in our Vietnamese NLP community to join compound words with '_' after tokenization. For example, "Bách Khoa Hà Nội" (en: Hanoi University of Science and Technology) would become "Bách_Khoa Hà_Nội". We can then use the tokenized text, where words are separated by spaces as in other languages, directly for word2vec training in gensim.
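The convention described above can be sketched in a few lines: joining the syllables of each compound word with '_' means whitespace splitting recovers word boundaries, which is exactly what whitespace-based word2vec pipelines expect. The strings below are just the example from this comment.

```python
# Vietnamese NLP community convention: pyvi-style output joins the
# syllables of a compound word with '_', so splitting on whitespace
# yields one token per word, and stripping '_' restores the surface form.
tokenized = "Bách_Khoa Hà_Nội"  # underscore-joined segmenter output
words = [w.replace("_", " ") for w in tokenized.split()]
print(words)  # ['Bách Khoa', 'Hà Nội']
```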

@honnibal
Member

In spaCy the .orth_ attribute should always match up to the original string, or we lose information. We don't actually keep the string anywhere; we rely on reconstructing it with the words and the token.spacy boolean (this is why we have some purely whitespace tokens). We can have an underscored version in the .norm_ attribute though?

Either way it will be easy to output a Gensim-friendly version by replacing the spaces within each token.text when printing.
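The space-replacement step suggested above is a one-liner. A minimal sketch, using plain strings to stand in for each spaCy `token.text` (so no spaCy install is assumed):

```python
# Producing a Gensim-friendly line from tokens whose multi-syllable words
# keep internal spaces (matching the original string, as .orth_ requires).
# token_texts stands in for [t.text for t in doc].
token_texts = ["Bách Khoa", "Hà Nội"]
gensim_line = " ".join(t.replace(" ", "_") for t in token_texts)
print(gensim_line)  # Bách_Khoa Hà_Nội
```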

Would it be possible to license the Vietnamese treebank for commercial purposes? I know this is often tricky when multiple institutions are involved.

@trungtv
Contributor Author

trungtv commented Mar 30, 2018

You are right. It'd be better to have the underscored version in .norm_. Regarding the license, I think we can sort it out, as long as we build something good for our VLSP community.
