feature: import of NER JSON #32

Open
DSLituiev opened this issue Mar 23, 2020 · 14 comments

Comments

@DSLituiev

Hi there. Thank you for a great tool.
I am curious whether you are considering support for importing pre-annotated text for NER?
This is a very common task in an active learning setup or as a post-regex cleanup step.

@seveibar
Collaborator

@DSLituiev for sure, thanks for creating a feature request for it.

Right now, if you want to import NER you have to construct a JSON file in this JSON format.

We would build an import button for whatever file format you're using if it's a common standard. What do you currently use to store NER?

@DSLituiev
Author

DSLituiev commented Mar 27, 2020

I currently store NER as JSONL (one entry per line) in one of the following formats (using your example):

{ "title": "document123",   "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!",  "entities": [    { "label": "hat", "start": 5, "end": 13 },    { "label": "food", "start": 60, "end": 69 }  ] }

Or

{ "title": "document123",   "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!",   "entities": [[5,13,"hat"],  [60, 69, "food"]] }

I see you require a schema, which is fair. I would prefer it still to be in JSONL format, with "header" / first line representing schema. But I'll be happy with any import functionality.
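
The two entity layouts above carry the same information, so a small normalizer can accept either one. The following is a minimal Python sketch, not part of the tool: the function names `normalize_entities` and `normalize_jsonl_line` are made up for illustration, and the triple order `[start, end, label]` is taken from the second example.

```python
import json

def normalize_entities(entities):
    """Return entities as a list of {"label", "start", "end"} dicts.

    Accepts either dicts or [start, end, label] triples (hypothetical helper).
    """
    normalized = []
    for ent in entities:
        if isinstance(ent, dict):
            normalized.append(
                {"label": ent["label"], "start": ent["start"], "end": ent["end"]}
            )
        else:
            start, end, label = ent
            normalized.append({"label": label, "start": start, "end": end})
    return normalized

def normalize_jsonl_line(line):
    """Parse one JSONL line and normalize its "entities" field."""
    record = json.loads(line)
    record["entities"] = normalize_entities(record.get("entities", []))
    return record

line = (
    '{"title": "document123", '
    '"document": "This strainer makes a great hat, I\'ll wear it while I serve spaghetti!", '
    '"entities": [[5, 13, "hat"], [60, 69, "food"]]}'
)
record = normalize_jsonl_line(line)
```

Here the offsets are treated as end-exclusive character positions, which matches the example: `record["document"][60:69]` is the word "spaghetti".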

@seveibar
Collaborator

We're working on a JSONL version of the udt format. Right now, the CSV format is very similar to the JSONL you've written. We don't want to create a format that decouples the interface data from the sample data, but I understand that this is sometimes useful.

I think a CSV import will make this fairly easy to do and will work across all datatypes that we currently support. In a CSV import, the interface data can be ignored. An example import would look like the following...

myimport.udt.csv

path | document | output.entities
--- | --- | ---
samples.0 | This strainer makes a great... | [ { "label": "hat", "start": ... } ]
samples.1 | Boy spaghetti is sure tasty... | [ { "label": "food", "start": ... } ]

The *.udt.csv format is fairly flexible with column labels, so this would also be acceptable...

myimport.udt.csv

path | document | output
--- | --- | ---
samples.0 | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ]}
samples.1 | Boy spaghetti is sure tasty... | {"entities": [ { "label": "food", "start": ... } ]}

The reason to prefer the CSV over JSONL for the moment (and the difficulty in general with JSONL) is that interface data is easily included with the CSV format, e.g. ...

path | . | document | output
--- | --- | --- | ---
interface | { .... } | |
samples.0 | | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ]}
samples.1 | | Boy spaghetti is sure tasty... | {"entities": [ { "label": "food", "start": ... } ]}

@DSLituiev
Author

Glad to hear. I am not sure I grasp what you mean by "interface" when referring to data -- is it metadata / schema / label categories?

Well, IMO it is not any easier than having a JSONL with first line being the schema, but up to you.

@DSLituiev
Author

DSLituiev commented Apr 10, 2020

I gave it a try.
My impression: formatting data into udt.csv is hard. One needs to write tons of custom formatting pieces, come up with quoting to escape the JSON quotes, etc., given the hybrid nature of this format (CSV + JSON).

This example fails for no obvious reason:


path,.,document,output
interface,'[{"id":"diseases","displayName":"disease"},{"id":"hx_diseases","displayName":"history of disease"},{"id":"neg_diseases","displayName":"negated disease"},{"id":"medications","displayName":"medication"},{"id":"hx_medications","displayName":"history of medication"},{"id":"neg_medications","displayName":"negated medication"},{"id":"procedures","displayName":"procedure"},{"id":"hx_procedures","displayName":"history of procedure"},{"id":"neg_procedures","displayName":"negated procedure"},{"id":"symptoms","displayName":"symptom"},{"id":"hx_symptoms","displayName":"history of symptom"},{"id":"neg_symptoms","displayName":"negated symptom"}]',,
'admission-note-for-abdominal-pain.txt',,'Admission Note for Abdominal Pain\nDATE: ................\n\nCHIEF COMPLAINT: Abdominal pain x ......... hours/days/months\n\nHISTORY OF PRESENT ILLNESS:\n\nSite -\nOnset -\nCharacter -\nRadiation -\nAlleviating factors -\nTime course -\nExacerbating factors -\nSeverity -\nSimilar pain before -\nNausea -\nVomiting -\nDiarrhea -\nConstipation -\nLoss of appetite -\nBlack/bloody stools -\nSick contacts -, Suspicious food consumed -\nFever/chills -, SOB -, Chest pain -, Headache -\nDysuria -\n\nER Tx given -\n\nPAST MEDICAL HISTORY: (circle all that apply)\nPUD Gallstones Kidney stones UTIs MI CAD HTN DM\nStroke CA PVD DVT COPD Asthma\nEGD -\nColonoscopy -\n\nPAST SURGICAL HISTORY: (circle all that apply)\nCholecystectomy Hernia Appendectomy Hysterectomy\n\nMEDICATIONS:\n\nALLERGY: NKDA\n\nFMH: (circle all that apply)\nCAD 55 yo DM Stroke HTN CA\n\nSOCIAL HISTORY: (circle all that apply)\nIndependent NH Lives w spouse son daughter\nAlcohol - no heavy occasional last drink\nSmoker - no\nIllicit drugs - no cocaine heroin marijuana\n\nREVIEW OF SYSTEMS: unremarkable apart from above symptoms\n\nPHYSICAL EXAM:\nVITALS: Orthostatics -\nSpO2 - Initial vitals -\n\nGENERAL APPEARANCE: WD/WN in NAD\nSKIN: no rash\nHEENT: NC/AT, PERRLA (B), moist MM, no epistaxis\nNECK: Supple, no JVD +JVD\nLUNGS: CTA (B) crackles L R B wheezing\nHEART: Clear S1S2, RRR irregular murmur S D /6 S3\nABDOMEN: Soft, NT, ND, +BS\nRectal exam:\nEXTREMITIES: no edema +edema\nPERIPHERAL VASCULAR: palpable nonpalpable Doppler\nNEURO:\nAAO x 3, CN 2-12: non focal\nMUSCLE STRENGHT: 5/5 (B), SENSATION: nonfocal\nDTR: ++, CEREBELLAR: non focal\n\nLABS:\n\nN= B= L= AG= LFT\nAmylase , Lipase\nCardiac enzymes x 1 - negative , UA:\nBlood cx:\nCXR:\nKUB:\nEKG:\n\nASSESSMENT:\n- Abdominal pain due to\n*Gastroenteritis\n*Gastritis\n*PUD\n*Pancreatitis\n*Cholecystitis\n*Diverticulitis\n*UTI\n\nPLAN:\n- NPO apart from meds\n- IVF, D5 1/2 NS at 125 cc/hr x 2 L\n- EKG in AM\n- 
Urine C+S\n- Morphine 2 mg IV q 2-4 hr PRN pain\n- Liver/gallbladder U/S\n- CT abdomen (with or without PO and IV contrast)\n- GI consult\n- CBCD, CMP in AM\n\nSignature:\n\n\nPublished: 02/12/2005\nUpdated: 03/08/2009\n','{"entities": [{"start": 19, "end": 33, "label": "symptoms"}, {"start": 64, "end": 73, "label": "symptoms"}, {"start": 75, "end": 89, "label": "symptoms"}, {"start": 140, "end": 147, "label": "hx_symptoms"}, {"start": 177, "end": 186, "label": "procedures"}, {"start": 267, "end": 271, "label": "symptoms"}, {"start": 281, "end": 287, "label": "symptoms"}, {"start": 290, "end": 298, "label": "symptoms"}, {"start": 301, "end": 309, "label": "symptoms"}, {"start": 312, "end": 324, "label": "symptoms"}, {"start": 327, "end": 343, "label": "symptoms"}, {"start": 346, "end": 365, "label": "symptoms"}, {"start": 368, "end": 372, "label": "symptoms"}, {"start": 412, "end": 424, "label": "symptoms"}, {"start": 428, "end": 431, "label": "symptoms"}, {"start": 435, "end": 445, "label": "symptoms"}, {"start": 449, "end": 457, "label": "symptoms"}, {"start": 460, "end": 467, "label": "symptoms"}, {"start": 536, "end": 546, "label": "diseases"}, {"start": 547, "end": 560, "label": "diseases"}, {"start": 561, "end": 565, "label": "diseases"}, {"start": 569, "end": 572, "label": "diseases"}, {"start": 573, "end": 576, "label": "diseases"}, {"start": 580, "end": 586, "label": "diseases"}, {"start": 590, "end": 593, "label": "diseases"}, {"start": 594, "end": 597, "label": "diseases"}, {"start": 598, "end": 602, "label": "diseases"}, {"start": 603, "end": 609, "label": "diseases"}, {"start": 610, "end": 613, "label": "procedures"}, {"start": 616, "end": 627, "label": "procedures"}, {"start": 678, "end": 693, "label": "procedures"}, {"start": 694, "end": 700, "label": "diseases"}, {"start": 701, "end": 713, "label": "procedures"}, {"start": 714, "end": 726, "label": "procedures"}, {"start": 742, "end": 749, "label": "symptoms"}, {"start": 786, "end": 789, "label": 
"diseases"}, {"start": 799, "end": 805, "label": "diseases"}, {"start": 806, "end": 809, "label": "diseases"}, {"start": 814, "end": 828, "label": "symptoms"}, {"start": 854, "end": 865, "label": "symptoms"}, {"start": 897, "end": 904, "label": "medications"}, {"start": 938, "end": 944, "label": "symptoms"}, {"start": 950, "end": 963, "label": "medications"}, {"start": 969, "end": 976, "label": "medications"}, {"start": 977, "end": 983, "label": "neg_medications"}, {"start": 984, "end": 993, "label": "neg_medications"}, {"start": 1146, "end": 1149, "label": "medications"}, {"start": 1159, "end": 1163, "label": "neg_symptoms"}, {"start": 1203, "end": 1212, "label": "neg_symptoms"}, {"start": 1230, "end": 1233, "label": "neg_symptoms"}, {"start": 1235, "end": 1238, "label": "symptoms"}, {"start": 1254, "end": 1262, "label": "symptoms"}, {"start": 1269, "end": 1277, "label": "symptoms"}, {"start": 1311, "end": 1317, "label": "symptoms"}, {"start": 1355, "end": 1361, "label": "medications"}, {"start": 1384, "end": 1389, "label": "neg_symptoms"}, {"start": 1391, "end": 1396, "label": "neg_symptoms"}, {"start": 1439, "end": 1446, "label": "procedures"}, {"start": 1508, "end": 1517, "label": "symptoms"}, {"start": 1560, "end": 1564, "label": "symptoms"}, {"start": 1580, "end": 1583, "label": "procedures"}, {"start": 1584, "end": 1591, "label": "medications"}, {"start": 1594, "end": 1600, "label": "medications"}, {"start": 1601, "end": 1616, "label": "medications"}, {"start": 1648, "end": 1651, "label": "procedures"}, {"start": 1658, "end": 1661, "label": "procedures"}, {"start": 1678, "end": 1692, "label": "symptoms"}, {"start": 1701, "end": 1716, "label": "diseases"}, {"start": 1718, "end": 1727, "label": "diseases"}, {"start": 1734, "end": 1746, "label": "diseases"}, {"start": 1748, "end": 1761, "label": "diseases"}, {"start": 1763, "end": 1777, "label": "diseases"}, {"start": 1779, "end": 1782, "label": "diseases"}, {"start": 1784, "end": 1788, "label": "diseases"}, 
{"start": 1792, "end": 1795, "label": "procedures"}, {"start": 1814, "end": 1817, "label": "procedures"}, {"start": 1850, "end": 1853, "label": "procedures"}, {"start": 1874, "end": 1882, "label": "medications"}, {"start": 1904, "end": 1908, "label": "symptoms"}, {"start": 1935, "end": 1945, "label": "procedures"}, {"start": 1973, "end": 1981, "label": "medications"}, {"start": 2004, "end": 2007, "label": "medications"}]}'

Error: 
JSON Error: SyntaxError: Unexpected token p in JSON at position 0
CSV Error: TypeError: Cannot read property 'trim' of undefined

@DSLituiev
Author

I am not clear how text and labels are linked per this document. Is it just sequence order? What if some documents have no annotations?

@seveibar
Collaborator

Thanks @DSLituiev, and sorry for the delay in answering. It's really important that the format is as easy to use as possible.

I'm taking a look at the details you've posted to understand where the confusion is. There is an update coming to the format that alleviates the need for embedded JSON for most things except the interface.

@seveibar
Collaborator

seveibar commented Apr 11, 2020

I believe the issue is the string delimitation with apostrophes instead of double quotes. Our CSV parser is probably trying to be compliant with RFC 4180 (check out section 2, item 7 to see how to embed quotes; most libraries take care of this for you). That said, it is an extremely high priority to be easy to use, so if possible I'll adjust the CSV parsing library to handle apostrophes. I will also clarify our CSV standard.

Edit: I was wrong about that. The path variable is what is confusing it. I'm changing the error message to reflect something that makes more sense in the future. I'll post an update soon.

Regarding annotations: yes, it is currently sequence order. This is my least favorite part of the format. I think it should probably be more like this:

{
  "interface": { /* ... */ },
  "samples": Array<{
      /* document, imageUrl etc. */
     "output": {
      /* entities etc. */
     }
   }>
}

Currently if a sample does not have annotations, it is represented by null. If a sample has been annotated to be empty, it has an empty array in entities.

How do you feel about that revised format?
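
To make the proposal concrete, here is a minimal Python sketch that builds a dataset in the revised shape. It rests on assumptions not confirmed in this thread: that an unannotated sample would carry `"output": null` in the revised format, and that the text lives under a `document` key. The helper name `build_revised_dataset` is hypothetical.

```python
import json

def build_revised_dataset(interface, samples):
    """Build a dict in the proposed {"interface", "samples"} shape.

    samples: iterable of (document, entities) pairs, where entities is
    None for "not annotated yet" or a list (possibly empty) otherwise.
    Representing "not annotated" as output=None is an assumption.
    """
    dataset = {"interface": interface, "samples": []}
    for document, entities in samples:
        output = None if entities is None else {"entities": list(entities)}
        dataset["samples"].append({"document": document, "output": output})
    return dataset

interface = {
    "type": "text_entity_recognition",
    "labels": [{"id": "food", "displayName": "food"}],
}
dataset = build_revised_dataset(
    interface,
    [
        ("Boy spaghetti is sure tasty", None),  # not annotated yet
        ("Nothing to label here", []),          # annotated, no entities
        ("Boy spaghetti is sure tasty", [{"label": "food", "start": 4, "end": 13}]),
    ],
)
serialized = json.dumps(dataset, indent=2)
```

Keeping each sample's output next to its text removes the reliance on sequence order, and the `None` versus empty-list distinction survives a JSON round trip.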

@DSLituiev
Author

DSLituiev commented Apr 11, 2020 via email

@seveibar
Collaborator

seveibar commented Apr 12, 2020

@DSLituiev what language/framework do you use (if I may ask)? Would python/npm bindings help for manipulating udt.* files?

It looks like it'll be really hard to support single-quote style CSVs because there's ambiguity in CSVs that isn't easily figured out automatically. That said, the "trim()" error is a real error in our CSV parsing library, which I've fixed today. Thanks for reporting :) We had multiple issues which your bug report helped identify; the import feature was released fairly recently.
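
The ambiguity can be illustrated with Python's standard `csv` module (used here only as a stand-in; the tool's own parser is a different library and may behave differently). With the default double-quote dialect, apostrophes are ordinary characters, so a comma inside an apostrophe-delimited field splits it. The sample line is made up from a fragment of the note above.

```python
import csv

# One row whose middle field is wrapped in apostrophes and contains a comma.
line = "samples.0,'Sick contacts -, Suspicious food consumed -',{}"

# Default dialect: only double quotes delimit fields, so the comma splits the text.
default_fields = next(csv.reader([line]))

# Explicitly declaring the apostrophe as the quote character keeps the field whole.
single_quote_fields = next(csv.reader([line], quotechar="'"))

print(len(default_fields), len(single_quote_fields))
```

A parser cannot know which reading is intended without being told, which is why double-quoted fields are the safer choice.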

Corrected CSV Document


path,.,document,output
interface,"{""type"": ""text_entity_recognition"",""labels"": [{""id"":""diseases"",""displayName"":""disease""},{""id"":""hx_diseases"",""displayName"":""history of disease""},{""id"":""neg_diseases"",""displayName"":""negated disease""},{""id"":""medications"",""displayName"":""medication""},{""id"":""hx_medications"",""displayName"":""history of medication""},{""id"":""neg_medications"",""displayName"":""negated medication""},{""id"":""procedures"",""displayName"":""procedure""},{""id"":""hx_procedures"",""displayName"":""history of procedure""},{""id"":""neg_procedures"",""displayName"":""negated procedure""},{""id"":""symptoms"",""displayName"":""symptom""},{""id"":""hx_symptoms"",""displayName"":""history of symptom""},{""id"":""neg_symptoms"",""displayName"":""negated symptom""}]}",,
samples.0,,"Admission Note for Abdominal Pain\nDATE: ................\n\nCHIEF COMPLAINT: Abdominal pain x ......... hours/days/months\n\nHISTORY OF PRESENT ILLNESS:\n\nSite -\nOnset -\nCharacter -\nRadiation -\nAlleviating factors -\nTime course -\nExacerbating factors -\nSeverity -\nSimilar pain before -\nNausea -\nVomiting -\nDiarrhea -\nConstipation -\nLoss of appetite -\nBlack/bloody stools -\nSick contacts -, Suspicious food consumed -\nFever/chills -, SOB -, Chest pain -, Headache -\nDysuria -\n\nER Tx given -\n\nPAST MEDICAL HISTORY: (circle all that apply)\nPUD Gallstones Kidney stones UTIs MI CAD HTN DM\nStroke CA PVD DVT COPD Asthma\nEGD -\nColonoscopy -\n\nPAST SURGICAL HISTORY: (circle all that apply)\nCholecystectomy Hernia Appendectomy Hysterectomy\n\nMEDICATIONS:\n\nALLERGY: NKDA\n\nFMH: (circle all that apply)\nCAD 55 yo DM Stroke HTN CA\n\nSOCIAL HISTORY: (circle all that apply)\nIndependent NH Lives w spouse son daughter\nAlcohol - no heavy occasional last drink\nSmoker - no\nIllicit drugs - no cocaine heroin marijuana\n\nREVIEW OF SYSTEMS: unremarkable apart from above symptoms\n\nPHYSICAL EXAM:\nVITALS: Orthostatics -\nSpO2 - Initial vitals -\n\nGENERAL APPEARANCE: WD/WN in NAD\nSKIN: no rash\nHEENT: NC/AT, PERRLA (B), moist MM, no epistaxis\nNECK: Supple, no JVD +JVD\nLUNGS: CTA (B) crackles L R B wheezing\nHEART: Clear S1S2, RRR irregular murmur S D /6 S3\nABDOMEN: Soft, NT, ND, +BS\nRectal exam:\nEXTREMITIES: no edema +edema\nPERIPHERAL VASCULAR: palpable nonpalpable Doppler\nNEURO:\nAAO x 3, CN 2-12: non focal\nMUSCLE STRENGHT: 5/5 (B), SENSATION: nonfocal\nDTR: ++, CEREBELLAR: non focal\n\nLABS:\n\nN= B= L= AG= LFT\nAmylase , Lipase\nCardiac enzymes x 1 - negative , UA:\nBlood cx:\nCXR:\nKUB:\nEKG:\n\nASSESSMENT:\n- Abdominal pain due to\n*Gastroenteritis\n*Gastritis\n*PUD\n*Pancreatitis\n*Cholecystitis\n*Diverticulitis\n*UTI\n\nPLAN:\n- NPO apart from meds\n- IVF, D5 1/2 NS at 125 cc/hr x 2 L\n- EKG in AM\n- Urine C+S\n- Morphine 2 mg IV 
q 2-4 hr PRN pain\n- Liver/gallbladder U/S\n- CT abdomen (with or without PO and IV contrast)\n- GI consult\n- CBCD, CMP in AM\n\nSignature:\n\n\nPublished: 02/12/2005\nUpdated: 03/08/2009\n","{""entities"": [{""start"": 19, ""end"": 33, ""label"": ""symptoms""}, {""start"": 64, ""end"": 73, ""label"": ""symptoms""}, {""start"": 75, ""end"": 89, ""label"": ""symptoms""}, {""start"": 140, ""end"": 147, ""label"": ""hx_symptoms""}, {""start"": 177, ""end"": 186, ""label"": ""procedures""}, {""start"": 267, ""end"": 271, ""label"": ""symptoms""}, {""start"": 281, ""end"": 287, ""label"": ""symptoms""}, {""start"": 290, ""end"": 298, ""label"": ""symptoms""}, {""start"": 301, ""end"": 309, ""label"": ""symptoms""}, {""start"": 312, ""end"": 324, ""label"": ""symptoms""}, {""start"": 327, ""end"": 343, ""label"": ""symptoms""}, {""start"": 346, ""end"": 365, ""label"": ""symptoms""}, {""start"": 368, ""end"": 372, ""label"": ""symptoms""}, {""start"": 412, ""end"": 424, ""label"": ""symptoms""}, {""start"": 428, ""end"": 431, ""label"": ""symptoms""}, {""start"": 435, ""end"": 445, ""label"": ""symptoms""}, {""start"": 449, ""end"": 457, ""label"": ""symptoms""}, {""start"": 460, ""end"": 467, ""label"": ""symptoms""}, {""start"": 536, ""end"": 546, ""label"": ""diseases""}, {""start"": 547, ""end"": 560, ""label"": ""diseases""}, {""start"": 561, ""end"": 565, ""label"": ""diseases""}, {""start"": 569, ""end"": 572, ""label"": ""diseases""}, {""start"": 573, ""end"": 576, ""label"": ""diseases""}, {""start"": 580, ""end"": 586, ""label"": ""diseases""}, {""start"": 590, ""end"": 593, ""label"": ""diseases""}, {""start"": 594, ""end"": 597, ""label"": ""diseases""}, {""start"": 598, ""end"": 602, ""label"": ""diseases""}, {""start"": 603, ""end"": 609, ""label"": ""diseases""}, {""start"": 610, ""end"": 613, ""label"": ""procedures""}, {""start"": 616, ""end"": 627, ""label"": ""procedures""}, {""start"": 678, ""end"": 693, ""label"": ""procedures""}, {""start"": 694, 
""end"": 700, ""label"": ""diseases""}, {""start"": 701, ""end"": 713, ""label"": ""procedures""}, {""start"": 714, ""end"": 726, ""label"": ""procedures""}, {""start"": 742, ""end"": 749, ""label"": ""symptoms""}, {""start"": 786, ""end"": 789, ""label"": ""diseases""}, {""start"": 799, ""end"": 805, ""label"": ""diseases""}, {""start"": 806, ""end"": 809, ""label"": ""diseases""}, {""start"": 814, ""end"": 828, ""label"": ""symptoms""}, {""start"": 854, ""end"": 865, ""label"": ""symptoms""}, {""start"": 897, ""end"": 904, ""label"": ""medications""}, {""start"": 938, ""end"": 944, ""label"": ""symptoms""}, {""start"": 950, ""end"": 963, ""label"": ""medications""}, {""start"": 969, ""end"": 976, ""label"": ""medications""}, {""start"": 977, ""end"": 983, ""label"": ""neg_medications""}, {""start"": 984, ""end"": 993, ""label"": ""neg_medications""}, {""start"": 1146, ""end"": 1149, ""label"": ""medications""}, {""start"": 1159, ""end"": 1163, ""label"": ""neg_symptoms""}, {""start"": 1203, ""end"": 1212, ""label"": ""neg_symptoms""}, {""start"": 1230, ""end"": 1233, ""label"": ""neg_symptoms""}, {""start"": 1235, ""end"": 1238, ""label"": ""symptoms""}, {""start"": 1254, ""end"": 1262, ""label"": ""symptoms""}, {""start"": 1269, ""end"": 1277, ""label"": ""symptoms""}, {""start"": 1311, ""end"": 1317, ""label"": ""symptoms""}, {""start"": 1355, ""end"": 1361, ""label"": ""medications""}, {""start"": 1384, ""end"": 1389, ""label"": ""neg_symptoms""}, {""start"": 1391, ""end"": 1396, ""label"": ""neg_symptoms""}, {""start"": 1439, ""end"": 1446, ""label"": ""procedures""}, {""start"": 1508, ""end"": 1517, ""label"": ""symptoms""}, {""start"": 1560, ""end"": 1564, ""label"": ""symptoms""}, {""start"": 1580, ""end"": 1583, ""label"": ""procedures""}, {""start"": 1584, ""end"": 1591, ""label"": ""medications""}, {""start"": 1594, ""end"": 1600, ""label"": ""medications""}, {""start"": 1601, ""end"": 1616, ""label"": ""medications""}, {""start"": 1648, ""end"": 1651, 
""label"": ""procedures""}, {""start"": 1658, ""end"": 1661, ""label"": ""procedures""}, {""start"": 1678, ""end"": 1692, ""label"": ""symptoms""}, {""start"": 1701, ""end"": 1716, ""label"": ""diseases""}, {""start"": 1718, ""end"": 1727, ""label"": ""diseases""}, {""start"": 1734, ""end"": 1746, ""label"": ""diseases""}, {""start"": 1748, ""end"": 1761, ""label"": ""diseases""}, {""start"": 1763, ""end"": 1777, ""label"": ""diseases""}, {""start"": 1779, ""end"": 1782, ""label"": ""diseases""}, {""start"": 1784, ""end"": 1788, ""label"": ""diseases""}, {""start"": 1792, ""end"": 1795, ""label"": ""procedures""}, {""start"": 1814, ""end"": 1817, ""label"": ""procedures""}, {""start"": 1850, ""end"": 1853, ""label"": ""procedures""}, {""start"": 1874, ""end"": 1882, ""label"": ""medications""}, {""start"": 1904, ""end"": 1908, ""label"": ""symptoms""}, {""start"": 1935, ""end"": 1945, ""label"": ""procedures""}, {""start"": 1973, ""end"": 1981, ""label"": ""medications""}, {""start"": 2004, ""end"": 2007, ""label"": ""medications""}]}"

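A file with the same column layout as the corrected document above can be generated without hand-escaping. The sketch below uses Python's standard `csv` and `json` modules; the default `csv` dialect doubles embedded double quotes, as in the corrected file. The function name `write_udt_csv` is hypothetical, and whether the importer accepts real line breaks inside quoted fields is not established in this thread.

```python
import csv
import io
import json

def write_udt_csv(fh, interface, samples):
    """Write rows in the path / . / document / output layout shown above.

    samples: list of (document, entities) pairs (hypothetical helper).
    """
    writer = csv.writer(fh)  # default dialect doubles embedded double quotes
    writer.writerow(["path", ".", "document", "output"])
    writer.writerow(["interface", json.dumps(interface), "", ""])
    for i, (document, entities) in enumerate(samples):
        output = json.dumps({"entities": entities})
        writer.writerow([f"samples.{i}", "", document, output])

interface = {
    "type": "text_entity_recognition",
    "labels": [{"id": "symptoms", "displayName": "symptom"}],
}
samples = [
    ("Admission Note for Abdominal Pain",
     [{"start": 19, "end": 33, "label": "symptoms"}]),
]

buf = io.StringIO()
write_udt_csv(buf, interface, samples)
csv_text = buf.getvalue()
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` as the `csv` module documentation recommends.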

@DSLituiev
Author

I use python mostly. I have been using doccano, which has a pretty simple JSONL import interface (which lacks labelling schema though).

I would very much advocate for JSONL with the first line holding the labelling schema. Once I understand udt, I might help build a translator.

@seveibar
Collaborator

For reference, doccano's file format can be found here: https://github.com/doccano/doccano/wiki/Import-and-Export-File-Formats

Thanks, @DSLituiev, I think other projects using JSONL is evidence as to how understandable it is. Is there a reason to prefer JSONL over the JSON format? Are there programs that make JSONL easier to read? Or is it just easier for maintaining doccano compatibility?

I've created issue #78 for helping with importing doccano files.

Also note that #75 (now merged, desktop app still building though) included the changes that fixed the bugs in CSV importing you found :)

As of now I'm thinking this project should probably support JSONL and should clean up the *.udt.json format; I'll start some of that today. I think working with the udt format is a major ergonomic pain point that we could make really easy. I've started a repository to begin the specification of the pip module. https://github.com/UniversalDataTool/python-universaldatatool/blob/master/README.md

@DSLituiev
Author

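Thank you guys for the quick response.
Here is how I would read JSONL:

import json

def read_jsonl_w_header(filename):
    """Read a JSONL file whose first line is the schema header."""
    result = []
    with open(filename) as fh:
        # The first line holds the schema; every later line is one sample.
        header = json.loads(next(fh))
        for line in fh:
            result.append(json.loads(line))
    return header, result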

@DSLituiev
Author

The reason to prefer JSONL is that one can use Unix command-line tools with it, like head and tail.
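
The same line-oriented property helps inside a program: the first few records can be parsed without loading the whole file. Below is a minimal Python sketch of a `head`-like reader. The name `head_jsonl` and the `{"labels": [...]}` header line are assumptions for illustration, not an agreed schema.

```python
import io
import itertools
import json

def head_jsonl(fh, n):
    """Parse only the first n lines of an open JSONL stream, like `head -n`."""
    return [json.loads(line) for line in itertools.islice(fh, n)]

# In-memory stand-in for a JSONL file whose first line is a schema header.
stream = io.StringIO(
    '{"labels": ["hat", "food"]}\n'
    '{"document": "This strainer makes a great hat", "entities": [[5, 13, "hat"]]}\n'
    '{"document": "Boy spaghetti is sure tasty", "entities": [[4, 13, "food"]]}\n'
)
schema, first_sample = head_jsonl(stream, 2)
```

Only two lines are consumed here; the remaining samples stay unread in the stream, which is what makes the format convenient for large annotation files.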

@seveibar seveibar mentioned this issue Apr 21, 2020