💫 Bring English tag_map in line with UD Treebank #3455

Merged
merged 2 commits into from Mar 21, 2019

Conversation

honnibal
Member

@honnibal honnibal commented Mar 21, 2019

I wrote a small script to read the UD English training data and check
that our tag map and morph rules produce the best POS mapping.
This hadn't been done for some time, and the UD schema has changed in
various ways since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.

The following cases were clear-cut misassignments in the tag map. Each line gives the previous tag map entry (TAG->POS), followed by the frequency distribution of POS tags observed for that tag in the UD data. For instance, we had mapped MD to VERB, but in the UD data MD maps to AUX 3294 times and never to any other tag.

MD->VERB [('AUX', 3294)]
PRP$->DET [('PRON', 3063)]
WDT->DET [('PRON', 855), ('DET', 93)]
RP->PART [('ADP', 746), ('ADV', 9)]
EX->ADV [('PRON', 359)]
LS->PUNCT [('X', 117)]
WP$->DET [('PRON', 15)]
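
A check like the one described above can be sketched as follows. This is not the script used for this PR, only a minimal illustration: it assumes the training data is in CoNLL-U format (UPOS in the fourth column, the fine-grained tag in the fifth), and the names `tag_pos_distribution`, `report_mismatches`, `SAMPLE` and `OLD_TAG_MAP` are hypothetical. The sample sentence is toy data, not taken from the treebank.

```python
from collections import Counter, defaultdict


def tag_pos_distribution(conllu_lines):
    """Count how often each fine-grained tag (XPOS) co-occurs with each UPOS."""
    dist = defaultdict(Counter)
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        upos, xpos = cols[3], cols[4]
        dist[xpos][upos] += 1
    return dist


def report_mismatches(tag_map, dist):
    """List tag map entries whose POS is not the most frequent UPOS for that tag."""
    lines = []
    for tag, pos in tag_map.items():
        counts = dist.get(tag)
        if not counts:
            continue  # tag never observed in the data
        most_frequent = counts.most_common(1)[0][0]
        if most_frequent != pos:
            lines.append(f"{tag}->{pos} {counts.most_common()}")
    return lines


# Toy input (hypothetical, for illustration only).
SAMPLE = [
    "# text = I can go",
    "1\tI\tI\tPRON\tPRP\t_\t3\tnsubj\t_\t_",
    "2\tcan\tcan\tAUX\tMD\t_\t3\taux\t_\t_",
    "3\tgo\tgo\tVERB\tVB\t_\t0\troot\t_\t_",
    "",
]
OLD_TAG_MAP = {"MD": "VERB", "PRP": "PRON", "VB": "VERB"}

dist = tag_pos_distribution(SAMPLE)
mismatches = report_mismatches(OLD_TAG_MAP, dist)
for entry in mismatches:
    print(entry)
```

On real data the input would be the lines of the treebank's training file rather than `SAMPLE`.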

Having corrected the tag map, I also reduced the errors by updating the morph_rules table. This
table is keyed by both a tag and a word form, allowing overrides for closed-class words.
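
To illustrate how a table keyed by tag and word form can override the tag map, here is a simplified sketch. It does not reproduce spaCy's actual data structures: the dictionaries below hold plain strings, the entries are hypothetical examples rather than the rules added in this PR, and lowercasing the word before lookup is an assumption.

```python
# Simplified stand-ins (hypothetical entries, not the real tables).
TAG_MAP = {"IN": "ADP", "VBZ": "VERB", "MD": "AUX"}
MORPH_RULES = {
    "IN": {"because": "SCONJ"},
    "VBZ": {"is": "AUX"},
}


def assign_pos(word, tag, tag_map=TAG_MAP, morph_rules=MORPH_RULES):
    """Return the POS for (word, tag): a word-specific override if one
    exists, otherwise the tag map default."""
    override = morph_rules.get(tag, {}).get(word.lower())
    if override is not None:
        return override
    return tag_map[tag]


print(assign_pos("is", "VBZ"))       # override applies
print(assign_pos("runs", "VBZ"))     # falls back to the tag map
print(assign_pos("because", "IN"))   # override applies
```

Because the override is only consulted for listed word forms, open-class words keep the tag map default, while closed-class words such as auxiliaries can be given their own POS.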

Here's a summary of the remaining error counts. The table reads TAG, correct POS, frequency:

NN SYM 34
NN PRON 10
NN ADJ 2
NN VERB 2
NN PROPN 1
NN NUM 1
IN SCONJ 21
IN SYM 1
IN CCONJ 1
DT PRON 9
DT ADJ 3
NNP ADJ 2
NNP X 2
NNP NOUN 1
JJ ADV 6
JJ PROPN 3
JJ NOUN 1
RB PART 46
RB ADP 5
RB ADJ 3
RB CCONJ 1
NNS VERB 2
VB AUX 13
CC ADV 1
VBN AUX 1
VBD AUX 19
TO ADP 7
VBG AUX 15
VBG NOUN 1
VBP AUX 68
VBZ AUX 120
WDT DET 93
RP ADV 9
UH PUNCT 1
PRP DET 1
NFP SYM 149
PDT PUNCT 1

Here are the previous counts, before the updates to the morph rules:

NN PRON 532
NN SYM 34
NN ADJ 2
NN VERB 2
NN PROPN 1
NN NUM 1
IN SCONJ 3848
IN SYM 1
IN CCONJ 1
DT PRON 798
DT ADJ 3
NNP ADJ 2
NNP X 2
NNP NOUN 1
PRP DET 1
JJ ADV 6
JJ PROPN 3
JJ NOUN 1
RB PART 1604
RB ADP 5
RB ADJ 3
RB CCONJ 1
VB AUX 1285
NNS VERB 2
CC ADV 1
VBD AUX 1879
VBP AUX 2642
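
Error tables in the TAG, correct POS, frequency format could be tallied roughly as below. This is a sketch under assumptions: `count_errors` and `format_errors` are hypothetical helper names, `predict` stands in for whatever combination of tag map and morph rules is being evaluated, and the tokens are toy data rather than treebank counts.

```python
from collections import Counter


def count_errors(tokens, predict):
    """Count (tag, gold POS) pairs where the predicted POS disagrees with gold.

    tokens: iterable of (word, tag, gold_pos) triples.
    predict: callable taking (word, tag) and returning a POS string.
    """
    errors = Counter()
    for word, tag, gold_pos in tokens:
        if predict(word, tag) != gold_pos:
            errors[(tag, gold_pos)] += 1
    return errors


def format_errors(errors):
    """Render the counts as 'TAG POS frequency' lines."""
    return [f"{tag} {pos} {n}" for (tag, pos), n in errors.items()]


# Toy data (hypothetical, for illustration only).
TOKENS = [
    ("is", "VBZ", "AUX"),
    ("has", "VBZ", "AUX"),
    ("runs", "VBZ", "VERB"),
    ("because", "IN", "SCONJ"),
]


def tag_map_only(word, tag):
    # Predict from the tag alone, ignoring the word form.
    return {"VBZ": "VERB", "IN": "ADP"}[tag]


errors = count_errors(TOKENS, tag_map_only)
rows = format_errors(errors)
for row in rows:
    print(row)
```

Running the same tally once with a tag-only predictor and once with a predictor that also applies word-form overrides gives the kind of before and after comparison shown in the two tables.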

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal honnibal requested a review from ines March 21, 2019 15:35
@ines ines changed the title Bring English tag_map in line with UD Treebank 💫 Bring English tag_map in line with UD Treebank Mar 21, 2019
@ines ines added the labels bug (Bugs and behaviour differing from documentation) and feat / tagger (Feature: Part-of-speech tagger) Mar 21, 2019
@honnibal honnibal merged commit 7ec64a3 into master Mar 21, 2019
@ines ines deleted the bugfix/fix-en-tag-map branch May 11, 2019 21:05