💫 Bring English tag_map in line with UD Treebank #3455
Merged
I wrote a small script to read the UD English training data and check
that our tag map and morph rules produce the best possible POS mapping.
This hadn't been done for some time, and the UD schema has changed in
various ways since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
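As a rough illustration of this kind of check (not the script used for this PR), the sketch below counts how often each fine-grained tag co-occurs with each UD POS tag in CoNLL-U text, then flags tags whose mapping disagrees with the most frequent UD tag. The function names, the plain-string `toy_tag_map`, and the three-token `SAMPLE` are all hypothetical; the real tag map stores richer values than a single string.

```python
from collections import Counter, defaultdict


def tag_pos_distribution(conllu_text):
    """Count how often each fine-grained tag (XPOS) co-occurs with each UPOS."""
    dist = defaultdict(Counter)
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # CoNLL-U token lines have exactly 10 columns
            continue
        tok_id, upos, xpos = cols[0], cols[3], cols[4]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in tok_id or "." in tok_id:
            continue
        dist[xpos][upos] += 1
    return dist


def find_misassignments(tag_map, dist):
    """Return tags whose mapped POS is not the most frequent UPOS in the data."""
    problems = {}
    for tag, counts in dist.items():
        best_pos, _ = counts.most_common(1)[0]
        if tag_map.get(tag) != best_pos:
            problems[tag] = (tag_map.get(tag), dict(counts))
    return problems


# Toy data for illustration only; these are not counts from the treebank.
SAMPLE = (
    "# text = I can go\n"
    "1\tI\tI\tPRON\tPRP\t_\t3\tnsubj\t_\t_\n"
    "2\tcan\tcan\tAUX\tMD\t_\t3\taux\t_\t_\n"
    "3\tgo\tgo\tVERB\tVB\t_\t0\troot\t_\t_\n"
)
toy_tag_map = {"PRP": "PRON", "MD": "VERB", "VB": "VERB"}  # hypothetical, simplified

dist = tag_pos_distribution(SAMPLE)
problems = find_misassignments(toy_tag_map, dist)
```

With the toy inputs, only `MD` is flagged, because the sample tags it as `AUX` while the toy map says `VERB`.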
The following cases were clear-cut misassignments in the tag map. The format here is the previous tag map mapping (of `TAG->POS`) followed by the frequency distribution observed. For instance, we had mapped `MD` to `VERB`, but in the UD data `MD` maps to `AUX` 3294 times, and no other tag ever.

Having corrected the tag map, I also reduced the errors by updating the `morph_rules` table. This is keyed by both a tag and a word form, allowing overrides for closed-class words.
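To make the two-level keying concrete, here is a minimal sketch of how a (tag, word form) override could take precedence over the tag-level default. The `assign_pos` helper, the plain-string values, and the `IN`/`that` entry are illustrative assumptions, not entries taken from this PR, and the lookup uses exact word forms for simplicity.

```python
def assign_pos(word, tag, tag_map, morph_rules):
    """Return the POS for a token, preferring a (tag, word) override if one exists."""
    # morph_rules is keyed first by tag, then by word form.
    override = morph_rules.get(tag, {}).get(word, {})
    # Fall back to the tag-level mapping when there is no word-specific rule.
    return override.get("POS", tag_map[tag])


# Hypothetical, simplified tables for illustration only.
toy_tag_map = {"IN": "ADP"}
toy_morph_rules = {"IN": {"that": {"POS": "SCONJ"}}}

pos_that = assign_pos("that", "IN", toy_tag_map, toy_morph_rules)
pos_of = assign_pos("of", "IN", toy_tag_map, toy_morph_rules)
```

Here `that` picks up the word-specific override, while `of` falls back to the default for its tag. This is why the table suits closed-class words: the set of forms needing exceptions is small and enumerable.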
Here's a summary of the remaining error counts. The table reads `TAG`, correct `POS`, frequency:
Here are the previous counts, before the updates to the morph rules:
Checklist