Handle "_" value for token pos in conllu data #9903
Merged
Description
If a conllu file has no data for a token's part of speech, conllu_sentence_to_doc() raises an error saying that '_' is not a valid upos value. To allow tokens with no pos data, this PR changes the "_" to "" so that pos assignment succeeds during spacy convert from conllu. It's a small change that will help our users whose treebanks are less than complete and polished.
Types of change
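As a rough illustration of the idea only (not the actual diff in this PR), the normalization can be sketched in plain Python. The helper names normalize_upos and upos_from_conllu_line are hypothetical and are not part of spaCy; the sketch assumes a standard 10-column CoNLL-U token line in which UPOS is the fourth column.

```python
def normalize_upos(upos: str) -> str:
    """Map the CoNLL-U placeholder "_" to an empty string (hypothetical helper)."""
    return "" if upos == "_" else upos


def upos_from_conllu_line(line: str) -> str:
    """Return the normalized UPOS (4th column) of a CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(columns)}")
    return normalize_upos(columns[3])


# One token with a UPOS tag and one where every annotation column is "_".
tagged = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
untagged = "1\tdogs\tdog\t_\t_\t_\t_\t_\t_\t_"

print(repr(upos_from_conllu_line(tagged)))    # 'NOUN'
print(repr(upos_from_conllu_line(untagged)))  # ''
```

An empty string stands for "no tag", so a token without pos data can pass through conversion instead of being rejected for carrying the literal "_".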
Handles an exception when UD has a upos value of "_".
Checklist