Code for paper Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging
In train.py, set the model_path and the LANG (language) you want in arg_list. The options for LANG are:
'en1.2' for English UD 1.2
'en' for English UD 2.0
'ja' for Japanese UD 2.0
'zh' for Chinese UD 2.0
'vi' for Vietnamese UD 2.0
You can also select the segment constructor (grConv or SRNN; the default is grConv).
To train the model with the default parameters, run
After training a model, go to predict.py and set the model_path and the LANG you want to evaluate in args_list.
The current repo includes an untrained model for Vietnamese.
For the MarMot evaluation (Tables 1 and 4), run scripts
The relaxed evaluation constructs a label for each gold token from the labels output by UDPipe + MarMot.
Four possible tokenization output cases are detected:
1. A correct token was detected, e.g. Wonderful.
The label selected in this case is the predicted one.
2. A corrupted space tokenization wasn't merged, e.g. Wonderful -> Wo-nd-er-ful.
The label selected in this case is the gold one if any of the sub-tokens has the correct label.
For example, if any of the (Wo, nd, er, ful) tokens has an ADJ label (Wonderful is an adjective), the gold label is selected as output.
If none of them has the gold label, a random wrong label is selected instead.
3. A merge of two or more tokens occurred, e.g. Don'tgo.
In this case MarMot outputs only one label for all three separate tokens. For each individual gold token we output the gold label only if it equals the single predicted label; otherwise we output a random wrong label for that token.
e.g. Don'tgo GOLD -> [(Do) VERB, (n't) ADV, (go) VERB]
prediction -> [ADV]
constructed output -> [WRONG_LABEL, ADV, WRONG_LABEL]
4. A space tokenization wasn't merged, and the last sub-token was also merged with the next word (or more), e.g. Wonderful world -> Wo-nd-er-fulworld.
In this case all of the sub-tokens before the last one (Wo-nd-er) are used to create the label for the first word (Wonderful), in the same way as in case 2. The merged part (-fulworld) is treated as in case 3.
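The four cases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the repo's evaluation script: the function names relaxed_labels and char_spans are hypothetical, the gold and predicted tokens are assumed to concatenate to the same character sequence once spaces are ignored, and the "random wrong label" is simplified to the fixed placeholder WRONG_LABEL used in the example above.

```python
WRONG = "WRONG_LABEL"  # placeholder standing in for "a random wrong label"


def char_spans(tokens):
    """Return (start, end) character offsets of each token, ignoring spaces."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def relaxed_labels(gold_tokens, gold_labels, pred_tokens, pred_labels):
    """Construct one label per gold token from the predicted tokenization."""
    pred = list(zip(char_spans(pred_tokens), pred_labels))
    output = []
    for (gs, ge), gold in zip(char_spans(gold_tokens), gold_labels):
        # Predicted tokens sharing at least one character with the gold token.
        overlapping = [(span, lab) for span, lab in pred
                       if span[0] < ge and span[1] > gs]
        # Predicted sub-tokens lying entirely inside the gold token.
        inside = [lab for (ps, pe), lab in overlapping if gs <= ps and pe <= ge]
        if len(overlapping) == 1 and overlapping[0][0] == (gs, ge):
            # Case 1: exact match, keep the predicted label.
            output.append(overlapping[0][1])
        elif inside:
            # Case 2 (and the split part of case 4): gold label if any
            # sub-token inside the gold token was labeled correctly.
            output.append(gold if gold in inside else WRONG)
        else:
            # Case 3 (and the merged part of case 4): the gold token is
            # covered by a merged prediction; accept only an equal label.
            labels = [lab for _, lab in overlapping]
            output.append(gold if gold in labels else WRONG)
    return output


print(relaxed_labels(["Do", "n't", "go"], ["VERB", "ADV", "VERB"],
                     ["Don'tgo"], ["ADV"]))
# ['WRONG_LABEL', 'ADV', 'WRONG_LABEL']
```

Aligning by character offsets rather than by token index is a design choice of this sketch; it lets splits and merges be detected without any extra alignment step.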