# Doccano JSONL to spaCy v2.0 JSON format

Data exported from Doccano is usually in JSONL format and references entities in the text differently from spaCy v2.0. So far I haven't found a way to convert Doccano-formatted data directly into a format ready for spaCy v3.0, so converting to v2.0 as an intermediate step will have to do for now.

Import the json and random modules

In [1]:
import json
import random

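Read each JSONL file line by line and parse every line with json.loads(). Then, for each line, reorganise the text and labels into the spaCy *[text, {"entities":label}]* format. Note that the first file stores the text under the 'data' key, while the second stores it under 'text'.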

In [28]:
results = []

with open("../../data/doccano_annotated_data/edited_annotations.jsonl") as annotations_in_jsonl:
    for line in annotations_in_jsonl:
        j_line = json.loads(line)
        # Reorganise data to spaCy's [text, {"entities":label}] format
        line_results = [j_line['data'], {"entities": j_line['label']}]
        results.append(line_results)

In [29]:
with open("../../data/doccano_annotated_data/v3.1_add_data.jsonl") as annotations_in_jsonl2:
    for line in annotations_in_jsonl2:
        j_line = json.loads(line)
        # Reorganise data to spaCy's [text, {"entities":label}] format
        line_results = [j_line['text'], {"entities": j_line['label']}]
        results.append(line_results)
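
The two loading cells above differ only in the key that holds the text. As a minimal sketch, they could be folded into one helper. The function name `load_doccano_jsonl` and its parameters are hypothetical, not part of Doccano or spaCy, and the helper assumes every record carries its spans under 'label' as in the files above.

```python
import json


def load_doccano_jsonl(path, text_keys=("text", "data"), label_key="label"):
    """Read a Doccano JSONL export into [text, {"entities": label}] pairs.

    Hypothetical helper: `text_keys` lists the keys that may hold the
    sentence text, tried in order for each record.
    """
    examples = []
    with open(path, encoding="utf-8") as jsonl_file:
        for line in jsonl_file:
            line = line.strip()
            if not line:  # skip blank lines, e.g. a trailing newline
                continue
            record = json.loads(line)
            # Pick whichever text key this export uses
            text_key = next((key for key in text_keys if key in record), None)
            if text_key is None:
                raise KeyError(f"none of {text_keys} found in record")
            examples.append([record[text_key], {"entities": record[label_key]}])
    return examples
```

With this helper, `results` would be the concatenation of `load_doccano_jsonl(...)` called once per export file.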

In [43]:
print(len(results))
print(results[2615])

2616
["The three landlords currently each have a building seeking the certifications - Keppel REIT's Keppel Bay Tower, a 386,000 sq ft office building at Harbourfront; Shaw Towers Realty's Shaw Tower, a 33-storey building on Beach Road currently undergoing redevelopment that is managed by Lendlease, as well as PAG's 400,000 sq ft StarHub Green building in Paya Lebar.", {'entities': [[80, 91, 'ORG'], [94, 110, 'LOC'], [147, 159, 'LOC'], [161, 179, 'ORG'], [182, 192, 'LOC'], [218, 228, 'LOC'], [283, 292, 'ORG'], [305, 308, 'ORG'], [325, 338, 'LOC'], [351, 361, 'LOC']]}]


In [24]:
print(results[2599])

['A customer scans a QR code using the TraceTogether contact-tracing phone app before entering the United Overseas Bank Ltd. (UOB) main branch at UOB Plaza in Singapore, on Thursday, March 4, 2021.', {'entities': [[97, 121, 'ORG'], [124, 127, 'ORG'], [144, 153, 'LOC'], [157, 166, 'LOC']]}]
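
Each entity in the output above is a [start, end, label] triple of character offsets into the text. Since spaCy does not accept overlapping entity spans, it may be worth checking the offsets before going further. The following is a rough sketch of such a check; `find_bad_spans` is a hypothetical helper and the sample sentence is made up for illustration.

```python
def find_bad_spans(example):
    """Return entity spans that are empty, out of bounds, or overlapping.

    `example` is one [text, {"entities": [[start, end, label], ...]}] pair.
    """
    text, annotations = example
    bad = []
    previous_end = 0
    # Sort by start offset so an overlap shows up as start < previous_end
    for start, end, label in sorted(annotations["entities"]):
        in_bounds = 0 <= start < end <= len(text)
        if not in_bounds or start < previous_end:
            bad.append([start, end, label])
        else:
            previous_end = end
    return bad


# Illustrative example, not taken from the dataset
example = ["UOB Plaza in Singapore", {"entities": [[0, 9, "LOC"], [13, 22, "LOC"]]}]
print(find_bad_spans(example))  # → []
```

Running it over the whole list, for instance `[ex for ex in results if find_bad_spans(ex)]`, would surface the examples that need a second look in Doccano.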


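Shuffle the dataset, then split it: the last sixth (436 examples) is held out as a golden evaluation set, and the remaining 2,180 examples are divided into training and validation data in an 80:20 ratio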

In [53]:
random.shuffle(results)

In [4]:
# fulltrain_data_end_index = int(len(results) - (len(results) / 6))
# golden_data_start_index = int(len(results) - (len(results) / 6) + 1)
# final_index = int(len(results) - 1)

In [54]:
trainset_end = 1744
valset_start = 1744
valset_end = 2180
goldset_start = 2180
goldset_end = 2616

In [55]:
training_set = results[0:trainset_end]
validation_set = results[valset_start:valset_end]
evaluation_set = results[goldset_start:goldset_end]

In [56]:
print(len(training_set), len(validation_set), len(evaluation_set))
print(evaluation_set[435])

1744 436 436
['Currently closed for renovation works, the S$90 million revamp of the Singapore Arts Museum (SAM) will be completed in 2026.', {'entities': [[70, 91, 'LOC'], [93, 96, 'LOC']]}]
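
The split boundaries above are hardcoded for 2,616 examples. As a sketch of how they could be derived from the dataset size instead, assuming the same scheme (hold out one sixth, then split the rest 80:20), a hypothetical `split_bounds` helper might look like this:

```python
def split_bounds(n_examples, holdout_fraction=1 / 6, train_fraction=0.8):
    """Return (trainset_end, valset_end) for a train/validation/golden split.

    The golden set runs from valset_end to the end of the list.
    """
    valset_end = n_examples - round(n_examples * holdout_fraction)
    trainset_end = round(valset_end * train_fraction)
    return trainset_end, valset_end


trainset_end, valset_end = split_bounds(2616)
print(trainset_end, valset_end)  # → 1744 2180
```

The three sets would then be `results[:trainset_end]`, `results[trainset_end:valset_end]` and `results[valset_end:]`. Calling random.seed() with a fixed value before random.shuffle() would also make the split reproducible across runs.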


Save to JSON files for further conversion to the spaCy v3.0 format

In [57]:
save_data_path = "../../data/training_datasets/"

def save_data(file, data):
    with open(save_data_path + file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data("full_train_data_v3.1.json", results)
save_data("training_set_v3.1.json", training_set)
save_data("validation_set_v3.1.json", validation_set)
save_data("golden_set.json", evaluation_set)