Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle "_" value for token pos in conllu data #9903

Merged
merged 2 commits into from
Dec 21, 2021

Conversation

apjanco
Copy link
Contributor

@apjanco apjanco commented Dec 18, 2021

Description

If a conllu file has no data for a token's part of speech, conllu_sentence_to_doc() throws an error that '_' is not a valid upos value.
To allow tokens with no pos data, this PR changes the "_" to "" to allow pos assignment during spacy convert from conllu. It's a small change that will help our users with less than complete and polished treebanks.

Types of change

Handles an exception when UD has a upos value of "_".

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm polm added the feat / cli Feature: Command-line interface label Dec 20, 2021
@adrianeboyd
Copy link
Contributor

Thanks, this is a good idea, since spacy definitely supports missing values in annotation. Let me just make a short suggestion (you can commit it directly from the comment) that uses the same style as the surrounding code.

@adrianeboyd
Copy link
Contributor

adrianeboyd commented Dec 20, 2021

And looking at this again, I don't think the morph _ value is handled correctly here, since this is a valid empty value rather than a missing value, but this would be a separate PR...

Edit: most likely nevermind about this (I've gone through all of this in detail multiple times in the past, since empty vs. missing is always tricky), since adding the morph value "" is converted to the empty morph value "_". It's possible there's no good way to have a missing value in the JSON training data because of this, but CoNLL-U can't represent this distinction either, so assuming that it's empty is probably fine for this converter in practice.

@svlandeg svlandeg added the enhancement Feature requests and improvements label Dec 20, 2021
@adrianeboyd adrianeboyd merged commit 3cfeb51 into explosion:develop Dec 21, 2021
@apjanco
Copy link
Contributor Author

apjanco commented Dec 21, 2021

Thank you for all your help and style suggestion. My first PR with spaCy😊

@danieldk danieldk mentioned this pull request Jan 13, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants