# Doccano JSONL to spaCy v2.0 JSON format

Data exported from Doccano is usually in JSONL format and references the entities in the text differently from spaCy v2.0. So far I haven't found a way to convert Doccano-formatted data directly into a format ready for use with spaCy v3.0, so converting to v2.0 as an intermediate step will have to do for now.

Import the json and random modules

In [1]:
import json
import random

Use json.loads() to parse each line of the JSONL file, then reorganise the text and labels of each line into the spaCy *[text, {"entities":label}]* format.

In [2]:
results = []

with open("../../data/doccano_annotated_data/edited_annotations.jsonl", encoding="utf-8") as annotations_in_jsonl:
    for line in annotations_in_jsonl:
        if not line.strip():
            continue  # skip blank lines, which json.loads() cannot parse
        j_line = json.loads(line)
        # Reorganise data to spaCy's [text, {"entities": label}] format
        line_results = [j_line['data'], {"entities": j_line['label']}]
        results.append(line_results)
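
To make the reshaping concrete, here is a minimal sketch that converts a single line. The sample record and the `doccano_line_to_spacy` helper are made up for illustration; the `data` and `label` keys are the ones used in the cell above, and other Doccano export versions may name them differently.

```python
import json

# Hypothetical Doccano export line: "label" holds [start, end, tag] character offsets.
sample_line = '{"id": 1, "data": "Apple opened an office in Berlin.", "label": [[0, 5, "ORG"], [26, 32, "GPE"]]}'


def doccano_line_to_spacy(line):
    """Convert one JSONL line to spaCy v2's [text, {"entities": [...]}] shape."""
    record = json.loads(line)
    return [record["data"], {"entities": record["label"]}]


converted = doccano_line_to_spacy(sample_line)
text, annotations = converted

# Print each tag next to the slice of text it covers, as a quick sanity check.
for start, end, tag in annotations["entities"]:
    print(tag, repr(text[start:end]))
```

Slicing the text with each pair of offsets is a cheap way to spot annotations that drifted after manual edits to the JSONL file.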

Shuffle the dataset, then split it into training and validation data in an 80:20 ratio

In [3]:
random.shuffle(results)

In [4]:
# Index of the first validation sample (80% of the data, rounded down)
split_index = len(results) * 4 // 5

In [5]:
# Slice ends are exclusive, so both sets share the same boundary index
# and no sample is dropped between or after them
training_set = results[:split_index]
validation_set = results[split_index:]

In [8]:
print(len(training_set), len(validation_set))

1280 320
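
The shuffle and split can also be wrapped in one helper, which makes the result reproducible and easy to check. This is only a sketch: `split_dataset`, its `train_fraction` and `seed` parameters, and the stand-in data are hypothetical names that do not appear elsewhere in this notebook.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=None):
    """Shuffle a copy of samples and split it into (training, validation)."""
    shuffled = list(samples)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    # Both slices share split_index, so every sample lands in exactly one set.
    return shuffled[:split_index], shuffled[split_index:]


# Stand-in data; in the notebook this would be `results`.
demo_samples = [[f"text {i}", {"entities": []}] for i in range(10)]
demo_train, demo_validation = split_dataset(demo_samples, seed=42)
print(len(demo_train), len(demo_validation))
```

Passing a fixed `seed` gives the same split on every run, which helps when comparing models trained on different days.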


Save to JSON files for further conversion to the spaCy v3.0 format

In [6]:
save_data_path = "../../data/training_datasets/"

def save_data(file, data):
    with open(save_data_path + file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data("full_train_data_doccano.json", results)
save_data("training_set_doccano.json", training_set)
save_data("validation_set_doccano.json", validation_set)
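
Before handing the files to the next conversion step, it can be worth reloading one and checking that every entity span fits inside its text. The sketch below assumes each entity is a [start, end, tag] triple; `check_spacy_v2_file` and the demo file are hypothetical and written to a temporary directory so the example runs on its own.

```python
import json
import os
import tempfile


def check_spacy_v2_file(path):
    """Return the number of samples in path; raise ValueError on a bad span."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    for index, (text, annotations) in enumerate(samples):
        for start, end, tag in annotations["entities"]:
            # A valid span is non-empty and lies within the text.
            if not (0 <= start < end <= len(text)):
                raise ValueError(
                    f"sample {index}: span ({start}, {end}, {tag!r}) is outside the text"
                )
    return len(samples)


# Demo on a throwaway file with one made-up sample.
demo_data = [["Apple opened an office in Berlin.", {"entities": [[0, 5, "ORG"], [26, 32, "GPE"]]}]]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_set.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_data, f, indent=4)
    checked_count = check_spacy_v2_file(demo_path)

print(checked_count)
```

In the notebook, the same function could be pointed at `save_data_path + "training_set_doccano.json"` once the cell above has run.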