Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spans to doc.to_json #10073

Merged
merged 6 commits into from
Mar 14, 2022
Merged

Add spans to doc.to_json #10073

merged 6 commits into from
Mar 14, 2022

Conversation

thomashacker
Copy link
Contributor

Description

This PR adds spans to doc.to_json.

It is saved as a dict with SpanGroup names as keys and list[Span] as values.
The stored spans contain: start, end, label, and kb_id

Types of change

New feature to existing function

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd
Copy link
Contributor

adrianeboyd commented Jan 18, 2022

This opens a can of worms related to whatever we're calling the "JSON format" and options for serializing docs to JSON.

What are the intended uses of Doc.to_json?

Doc.to_json claims:

RETURNS (dict): The data in spaCy's JSON format.

but it's not the same format that can be loaded by spacy.training.converters.json_to_doc (and therefore not the format documented at https://spacy.io/api/data-formats#json-input), and this may really get confusing, especially since there's also no parallel Doc.from_json.

Doc.to_json diverges a lot from spacy.training.gold_io.docs_to_json and I'm trying to remember what the reasons were for these to be implemented separately / differently...

@svlandeg svlandeg added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels Jan 18, 2022
@svlandeg svlandeg marked this pull request as draft February 4, 2022 20:10
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree with Adriane's concerns about "the JSON format", I think that's an issue we'll have to address more properly in a separate discussion/PR. Given that we currently do already have doc.to_json and that we won't be removing this in a next release, we might as well make it support spans.

Good work on also including an appropriate unit test, @thomashacker !

I had a few more small comments, cf below.

spacy/tokens/doc.pyx Outdated Show resolved Hide resolved
spacy/tokens/doc.pyx Outdated Show resolved Hide resolved
@svlandeg svlandeg marked this pull request as ready for review March 3, 2022 15:56
spacy/tokens/doc.pyx Outdated Show resolved Hide resolved
spacy/tokens/doc.pyx Outdated Show resolved Hide resolved
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@svlandeg svlandeg merged commit b68bf43 into explosion:master Mar 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants