# spaCy v3 Data Converter
This notebook converts the training and validation data stored in JSON format to spaCy's binary (.spacy) file format. The data converted here is the doccano training data, but the file paths can be changed to load spaCy v2-style training data from other files.

Import DocBin from spaCy's tokens module, along with spaCy itself and json.

In [1]:
from spacy.tokens import DocBin
import spacy
import json

Function to load a JSON file.

In [2]:
def load_data(file):
    # Read a JSON file and return its parsed contents.
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

Paths to the training and validation set JSON files. Modify the paths if needed.

In [3]:
training_set = load_data("../data/training_datasets/training_set_doccano.json")
validation_set = load_data("../data/training_datasets/validation_set_doccano.json")

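Inspect one record to check that the training data is in the expected format: a text string followed by a dictionary whose 'entities' key lists each entity as [start, end, label] character offsets.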

In [4]:
training_set[20]

['On 26 April 2005, Today reported that soil tests were being conducted around Chinatown station, raising speculations of a possible rail link between Chinatown and Marina Bay via a new developing downtown.',
 {'entities': [[18, 23, 'ORG'],
   [77, 94, 'LOC'],
   [149, 158, 'LOC'],
   [163, 173, 'LOC']]}]
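
The check above inspects a single record. As a minimal sketch, assuming every record follows the [text, {'entities': [[start, end, label], ...]}] layout shown in that output, the whole dataset could be scanned before conversion. The helper name find_format_problems and the sample records are hypothetical and not part of the original notebook.

```python
def find_format_problems(dataset):
    """Return (index, message) pairs for records that look malformed."""
    problems = []
    for i, record in enumerate(dataset):
        # Each record should be a two-item [text, annotations] pair.
        if not (isinstance(record, (list, tuple)) and len(record) == 2):
            problems.append((i, "record is not a [text, annotations] pair"))
            continue
        text, annot = record
        if not isinstance(text, str) or not isinstance(annot, dict) or "entities" not in annot:
            problems.append((i, "expected a string and a dict with an 'entities' key"))
            continue
        for ent in annot["entities"]:
            if len(ent) != 3:
                problems.append((i, f"entity {ent!r} is not [start, end, label]"))
                continue
            start, end, label = ent
            # Offsets must be integers that stay inside the text.
            in_range = (
                isinstance(start, int)
                and isinstance(end, int)
                and 0 <= start < end <= len(text)
            )
            if not in_range:
                problems.append((i, f"offsets {start}-{end} fall outside the text"))
    return problems


# Hypothetical sample records, for illustration only.
sample = [
    ["Chinatown station opened.", {"entities": [[0, 17, "LOC"]]}],
    ["Too short.", {"entities": [[0, 50, "LOC"]]}],
]
print(find_format_problems(sample))
```

Running find_format_problems(training_set) and getting an empty list back would suggest the file is structurally sound. It does not confirm that the offsets line up with token boundaries, which the conversion step checks separately.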

Function to convert the data to DocBin format. Entities whose character offsets do not align with token boundaries are skipped.

In [5]:
nlp = spacy.blank("en")  # blank English pipeline, used here only for tokenization

def create_training(TRAIN_DATA):
    db = DocBin()
    for text, annot in TRAIN_DATA:
        doc = nlp(text)
        ents = []
        for start, end, label in annot["entities"]:
            # char_span returns None when the offsets do not match token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        print(doc, doc.ents)
        db.add(doc)
    return db
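
The function above skips entities that do not align with token boundaries, but it does not guard against two annotations covering the same characters. To my understanding, spaCy rejects overlapping spans when they are assigned to doc.ents, so it may be worth checking for them beforehand. The following is a sketch in plain Python, not part of the original notebook. The helper name find_overlaps is hypothetical, and the second example list is made up for illustration.

```python
def find_overlaps(entities):
    """Flag each [start, end, label] entry that starts before an earlier one ends."""
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    overlaps = []
    furthest = None  # entity seen so far that reaches furthest to the right
    for ent in ordered:
        if furthest is not None and ent[0] < furthest[1]:
            overlaps.append((furthest, ent))
        if furthest is None or ent[1] > furthest[1]:
            furthest = ent
    return overlaps


# Entities from the record inspected earlier: no overlaps expected.
clean = [[18, 23, "ORG"], [77, 94, "LOC"], [149, 158, "LOC"], [163, 173, "LOC"]]
print(find_overlaps(clean))

# Made-up example in which the third entity sits inside the second.
nested = [[18, 23, "ORG"], [77, 94, "LOC"], [87, 94, "LOC"]]
print(find_overlaps(nested))
```

End offsets are treated as exclusive, matching the character offsets used in the data, so two entities that merely touch (one ends where the next begins) are not reported.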

Call the function to convert the data, and save the spaCy binary files. The output prints each text followed by the entities kept for it.

In [7]:
spacy_training_set = create_training(training_set)
spacy_training_set.to_disk("../data/training_datasets/training_set_doccano.spacy")

The construction of the North-South Corridor (NSC) has resulted in road diversions around the Novena area. (North-South Corridor, NSC, Novena)
Launched by Ah Girls Go Army actress Debra Loi, Singapore Crawfish Ramen in Shenton Food Hall claims to be the first crawfish ramen in Singapore. The food hall is located inside Shenton House. (Singapore Crawfish Ramen, Shenton Food Hall, Singapore, Shenton House)
Following Race Course Road and Serangoon Road through Little India and Boon Keng, it cuts through Whampoa River and Kallang River before reaching Potong Pasir. (Race Course Road, Serangoon Road, Little India, Boon Keng, Whampoa River, Kallang River, Potong Pasir)
The stadium (replete with a full 400m track) can also be found nearby, as well as a SAFRA Toa Payoh club. (SAFRA Toa Payoh,)
Phase 1 includes new developments such as the already-completed Outram Community Hospital, SGH Accident & Emergency Block, SGH Elective Care Centre, a new National Dental Centre Singapore (NDCS), and a n

In [8]:
spacy_validation_set = create_training(validation_set)
spacy_validation_set.to_disk("../data/training_datasets/validation_set_doccano.spacy")

Please avoid the above areas for the next one hour," said PUB in a Facebook post. (PUB, Facebook)
The new Comcentre will also be equipped with hybrid working spaces for other tenants. (Comcentre,)
The Straits Times understands that the accident took place at the site of the upcoming Hougang Olive BTO project in Hougang Avenue 3. (Hougang Olive, Hougang Avenue 3)
This incident occurred near the proposed site of the Nicoll Highway station, not far from the Merdeka Bridge. (Nicoll Highway station,)
Housing two decommissioned power plants, oil and gas tanks, and ancillary buildings, the 15-hectare Pasir Panjang Power District, located next to the Labrador Nature Reserve, will be redeveloped as part of the Greater Southern Waterfront transformation plans. (Pasir Panjang Power District, Labrador Nature Reserve, Greater Southern Waterfront)
The Institute of Policy Studies falls under the Lee Kuan Yew School of Public Policy at the National University of Singapore. It is located at the univers