feature: import of NER JSON #32
@DSLituiev For sure, thanks for creating a feature request for it. Right now, if you want to import NER you have to construct a JSON file in this JSON format. We would build an import button for whatever file format you're using if it's a common standard. What do you currently use to store NER? |
I currently store NER as JSONL (entry-per-line) in the following formats (using your example):
Or
I see you require a schema, which is fair. I would prefer it still to be in JSONL format, with "header" / first line representing schema. But I'll be happy with any import functionality. |
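For concreteness, an entry-per-line JSONL record for NER can look like the following. The field names here (`text`, `labels` with character-offset spans, in the doccano style) are illustrative assumptions, not necessarily the exact formats referenced above:

```python
import json

# Hypothetical JSONL for NER: one document per line, entity spans given as
# (start, end, label) character offsets. Field names are assumed.
lines = [
    '{"text": "Alice moved to Paris.", "labels": [[0, 5, "PERSON"], [15, 20, "LOC"]]}',
    '{"text": "Nothing to tag here.", "labels": []}',
]

# Parsing is one json.loads per line.
records = [json.loads(line) for line in lines]
for rec in records:
    for start, end, label in rec["labels"]:
        print(rec["text"][start:end], label)
```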
We're working on a JSONL version of the udt format. Right now, the CSV format is very similar to the JSONL you've written. We don't want to create a format that decouples the interface data from the sample data, but I understand that this is sometimes useful. I think a CSV import will make this fairly easy to do and will work across all datatypes that we currently support. In a CSV import, the interface data can be ignored. An example import would look like the following...
The reason to prefer the csv over JSONL for the moment (and the difficulty in general with JSONL) is interface data is easily included with the csv format e.g. ...
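As a sketch of the idea (the column layout below is a hypothetical illustration, not the actual udt CSV spec), interface data can ride along as embedded JSON in a row of the same CSV file:

```python
import csv
import io
import json

# Hypothetical path/value CSV: one row carries the interface (schema) as
# embedded JSON, the remaining rows carry sample data. Column names assumed.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["path", "value"])
writer.writerow(["interface", json.dumps(
    {"type": "text_entity_recognition", "labels": ["PERSON", "LOC"]})])
writer.writerow(["samples.0.document", "Alice moved to Paris."])
print(buf.getvalue())
```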
|
Glad to hear. I am not sure I grasp what you mean by "interface" when referring to data -- is it metadata / schema / label categories? Well, IMO it is no easier than having a JSONL with the first line being the schema, but up to you. |
I gave it a try. This example fails for no obvious reason:
|
I am not clear how text and labels are linked per this document. Is it just sequence order? What if some documents have no annotations? |
Thanks @DSLituiev, and sorry for the delay in answering. It's really important that the format is as easy to use as possible. I'm taking a look at the details you've posted to understand where the confusion is. There is an update coming to the format that alleviates the need for embedded JSON for most things except the interface. |
I believe the issue is the string delimiting with apostrophes instead of quotes. Our CSV parser is probably trying to be compliant with RFC 4180 (check out section 2.7 to see how to embed quotes; most libraries take care of this for you). That said, it is an extremely high priority to be easy to use, so if possible I'll adjust the CSV parsing library to handle apostrophes. I will also clarify our CSV standard. Edit: I was wrong about this.

Regarding annotations: yes, it is currently sequence order. This is my least favorite part of the format. I think it should probably be more like this:

```
{
  "interface": { /* ... */ },
  "samples": Array<{
    /* document, imageUrl etc. */
    "output": {
      /* entities etc. */
    }
  }>
}
```

Currently if a sample does not have annotations, it is represented by `null`. If a sample has been annotated to be empty, it has an empty array in `entities`. How do you feel about that revised format? |
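In Python terms, that proposed shape can be built with the stdlib `json` module. This is a sketch of the proposal, not the shipped format; the `interface` contents are elided, and the key distinction is `null` for un-annotated samples vs. an empty `entities` array for samples annotated as containing nothing:

```python
import json

doc = {
    "interface": {"type": "text_entity_recognition"},  # details elided
    "samples": [
        # Annotated sample: the output travels with its document.
        {"document": "Alice moved to Paris.",
         "output": {"entities": [{"start": 0, "end": 5, "label": "PERSON"}]}},
        # Annotated as containing nothing: empty entities array.
        {"document": "Nothing here.", "output": {"entities": []}},
        # Not yet annotated: output is null.
        {"document": "Pending.", "output": None},
    ],
}
print(json.dumps(doc, indent=2))
```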
I would rename "output" to "labels" like doccano uses. It sounds more intuitive, especially when "output" is in the input.
|
@DSLituiev what language/framework do you use (if I may ask)? Would python/npm bindings help for manipulating the format? It looks like it'll be really hard to support single-quote style CSVs because there's ambiguity in CSVs that isn't easily resolved automatically. That said, the "trim()" error is a real error in our CSV parsing library, which I've fixed today. Thanks for reporting :) We had multiple issues which your bug report helped identify; the import feature was released fairly recently. Corrected CSV Document
|
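On the quoting point raised earlier in the thread: RFC 4180 escapes an embedded double quote by doubling it, and most CSV libraries (Python's stdlib `csv` included) handle this automatically; apostrophe-delimited fields are what strict parsers reject. A quick illustration:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A field containing both a comma and a double quote: per RFC 4180 the
# writer wraps it in quotes and doubles the embedded quote.
writer.writerow(['He said "hi", then left', "plain"])
print(buf.getvalue())  # "He said ""hi"", then left",plain

# Round-trip: the reader undoes the escaping.
row = next(csv.reader(io.StringIO(buf.getvalue())))
```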
I use python mostly. I have been using doccano, which has a pretty simple JSONL import interface (which lacks a labelling schema though). I would very much advocate for JSONL with the first line for the labelling schema. Once I understand udt, I might help build a translator. |
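Such a translator could be quite small. The sketch below assumes doccano-style `text`/`labels` records on the input side and the `samples`/`output` shape discussed earlier in this thread on the output side; both sets of field names are assumptions, not either project's confirmed spec:

```python
import json

def doccano_jsonl_to_udt(lines, label_set):
    """Convert doccano-style JSONL lines to a udt-like dict (field names assumed)."""
    samples = []
    for line in lines:
        rec = json.loads(line)
        # doccano spans are [start, end, label]; re-shape into named fields.
        entities = [{"start": s, "end": e, "label": lab}
                    for s, e, lab in rec.get("labels", [])]
        samples.append({"document": rec["text"],
                        "output": {"entities": entities}})
    return {"interface": {"type": "text_entity_recognition",
                          "labels": label_set},
            "samples": samples}

udt = doccano_jsonl_to_udt(
    ['{"text": "Alice moved to Paris.", "labels": [[15, 20, "LOC"]]}'],
    ["PERSON", "LOC"])
```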
Thanks, @DSLituiev, I think other projects using JSONL is evidence of how understandable it is. Is there a reason to prefer JSONL over the JSON format? Are there programs that make JSONL easier to read? Or is it just easier for maintaining doccano compatibility? I've created issue #78 to help with importing doccano files. Also note that #75 (now merged; desktop app still building though) included the changes that fixed the bugs in CSV importing you found :) As of now I'm thinking this project should probably support JSONL and should clean up the format. |
Thank you guys for the quick response.
|
The reason to prefer JSONL is that one can use unix command-line tools with it. |
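That streaming property is easy to demonstrate: each line is a complete record, so tools like `head`, `grep`, or `jq` (or a few lines of stdlib Python) can filter a file without loading or parsing the whole thing. A minimal sketch, using the same assumed field names as earlier:

```python
import io
import json

# Stand-in for an open JSONL file; each line is a full JSON document.
jsonl = io.StringIO(
    '{"text": "Alice moved to Paris.", "labels": [[15, 20, "LOC"]]}\n'
    '{"text": "Nothing here.", "labels": []}\n'
)

with_labels = []
for line in jsonl:            # streams one record at a time
    rec = json.loads(line)    # no whole-file parse needed
    if rec["labels"]:
        with_labels.append(rec)
```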
Hi there. Thank you for a great tool.
I am curious whether you are considering support for importing pre-annotated text for NER?
This is a very common task in an active learning setup / post-regex-cleanup step.
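To illustrate the workflow being requested: a regex pass can emit candidate entity spans that a human then confirms or cleans up in the annotation tool. A minimal sketch; the patterns and labels here are made up for illustration:

```python
import re

TEXT = "Contact alice@example.com or visit Paris in 2020."

# Hypothetical regex pre-annotators: pattern -> label.
PATTERNS = {
    r"[\w.]+@[\w.]+": "EMAIL",
    r"\b(19|20)\d{2}\b": "YEAR",
}

def pre_annotate(text, patterns):
    """Emit candidate (start, end, label) spans for later human review."""
    spans = []
    for pattern, label in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

candidates = pre_annotate(TEXT, PATTERNS)
```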