# spaCy v3 Data Converter
This notebook converts the training, validation and evaluation data from JSON to spaCy's binary file format (.spacy). The data converted here is doccano annotation data that has already been cleaned into the spaCy v2 training format; the file paths can be changed to convert spaCy v2 training data from other files.

Import DocBin from spacy.tokens, along with spaCy itself and json.

In [1]:
from spacy.tokens import DocBin
import spacy
import json

Function to load a JSON file.

In [2]:
def load_data(file):
    # Read a JSON file and return the parsed Python object.
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

Paths to the training, validation and evaluation set JSON files. Modify the paths if needed.

In [3]:
training_set = load_data("../data/training_datasets/dataset_for_v3.1/training_set_v3.1.json")
validation_set = load_data("../data/training_datasets/dataset_for_v3.1/validation_set_v3.1.json")
evaluation_set = load_data("../data/training_datasets/evaluation_for_v3.1+.json")

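Inspect one record to check that the training data is in the expected format: a text string followed by a dictionary of entities, each given as [start, end, label] character offsets.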

In [4]:
training_set[20]

['Exit A and Exit E of Choa Chu Kang station is linked to Lot One.',
 {'entities': [[0, 6, 'FAC'],
   [11, 17, 'FAC'],
   [21, 42, 'LOC'],
   [56, 63, 'LOC']]}]
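
Inspecting a single record only confirms that one entry. The same layout can be checked across a whole dataset; the following is a minimal sketch, assuming every record is a [text, {"entities": [[start, end, label], ...]}] pair as in the output above. The helper name find_format_problems is hypothetical and is not part of spaCy or of this notebook.

```python
def find_format_problems(dataset):
    """Return (record_index, message) pairs for records that break the expected layout."""
    problems = []
    for i, record in enumerate(dataset):
        # Each record should be a two-item [text, annotations] pair.
        if not (isinstance(record, (list, tuple)) and len(record) == 2):
            problems.append((i, "record is not a [text, annotations] pair"))
            continue
        text, annot = record
        if not isinstance(text, str):
            problems.append((i, "text is not a string"))
            continue
        if not isinstance(annot, dict) or "entities" not in annot:
            problems.append((i, "annotations lack an 'entities' key"))
            continue
        for ent in annot["entities"]:
            # Each entity should be [start, end, label].
            if not (isinstance(ent, (list, tuple)) and len(ent) == 3):
                problems.append((i, f"entity {ent!r} is not [start, end, label]"))
                continue
            start, end, label = ent
            if not (isinstance(start, int) and isinstance(end, int)
                    and 0 <= start < end <= len(text)):
                problems.append((i, f"offsets {start}-{end} fall outside the text"))
            elif not isinstance(label, str):
                problems.append((i, f"label {label!r} is not a string"))
    return problems


# The record shown above, used as a one-item dataset.
sample = [
    ["Exit A and Exit E of Choa Chu Kang station is linked to Lot One.",
     {"entities": [[0, 6, "FAC"], [11, 17, "FAC"], [21, 42, "LOC"], [56, 63, "LOC"]]}],
]
print(find_format_problems(sample))
```

An empty list means no structural problems were found. This check does not verify that the offsets align with token boundaries.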

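The function below converts the data to DocBin format. Entities whose character offsets do not align with token boundaries are skipped, and a message is printed.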

In [5]:
nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
def create_training(TRAIN_DATA):
    db = DocBin()
    for text, annot in TRAIN_DATA:
        doc = nlp(text)
        ents = []
        for start, end, label in annot["entities"]:
            # "strict" returns None when the offsets do not fall on token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        print(doc, doc.ents)
        db.add(doc)
    return db
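
The conversion function relies on alignment_mode="strict", so any annotation whose offsets cut through a token is dropped. The sketch below illustrates how the three alignment modes of Doc.char_span treat such a case. The offsets (56, 62) are a made-up, deliberately misaligned annotation that ends one character short of "Lot One"; they are not taken from the dataset.

```python
import spacy

nlp = spacy.blank("en")
text = "Exit A and Exit E of Choa Chu Kang station is linked to Lot One."
doc = nlp(text)

# Hypothetical misaligned annotation: text[56:62] is "Lot On", not "Lot One".
start, end = 56, 62

# strict: offsets must match token boundaries exactly, otherwise None is returned.
strict = doc.char_span(start, end, label="LOC", alignment_mode="strict")
# contract: shrink to the tokens that lie completely inside the offsets.
contract = doc.char_span(start, end, label="LOC", alignment_mode="contract")
# expand: grow to cover every token that the offsets touch.
expand = doc.char_span(start, end, label="LOC", alignment_mode="expand")

for name, span in [("strict", strict), ("contract", contract), ("expand", expand)]:
    print(name, None if span is None else span.text)
```

Keeping "strict" and skipping misaligned entities, as the notebook does, avoids silently changing annotation boundaries. The other two modes recover a span at the cost of altering what the annotator marked.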

Call the function to convert each dataset and save the resulting spaCy binary file.

**NOTE: To prevent data from being overwritten, write locations have NOT been updated. DO NOT WRITE TO evaluation_for_v3.1+.spacy - CHANGES TO THE EVALUATION SET MAY BIAS IT TOWARDS MODEL V3.1 AND MAKE IT USELESS FOR EVALUATION**

In [6]:
spacy_training_set = create_training(training_set)
spacy_training_set.to_disk("../data/training_datasets/training_set_v3.1.spacy")

Take a stroll along the Prunus-Petai Trail, a 2km boardwalk close to the entrance of MacRitchie Reservoir Park. You'll be rewarded with beautiful views of the calm forest against the backdrop of dense forest, and close-up sightings of various plants and wildlife, including the Singapore Rhododendron, clouded monitor lizards and orange-bellied squirrels. (Prunus-Petai Trail, MacRitchie Reservoir Park)
Nestled along Mackenzie Road, the cafe can be easily spotted from afar, thanks to a giant curry puff planted right outside the store. (Mackenzie Road,)
An angry Facebook user reported a case of a cut sustained (on the face) by her 14 year old niece at the latest attraction - Mirror Maze, located at Jewel Changi Airport. (Facebook, Mirror Maze, Jewel Changi Airport)
The SLA, which has been managing Turf City since 1999, leases the site for lifestyle and recreational uses on an interim basis. (SLA, Turf City)
A two-station extension to Marina Bay opened on 14 January 2012. (Marina Bay,)
For 

In [7]:
spacy_validation_set = create_training(validation_set)
spacy_validation_set.to_disk("../data/training_datasets/validation_set_v3.1.spacy")

Get started at the Changi Chapel & Museum in Singapore, where you can view prized artefacts such as photographs, drawings and letters by the POWs and take a complimentary guided tour around the site. To date, Changi Chapel & Museum has amassed almost 5,000 pieces of documents from the POWs during the war. (Changi Chapel & Museum, Singapore, Changi Chapel & Museum)
A proposal has been further mooted to extend the line from Bukit Panjang towards Sungei Kadut which will interchange with the North South Line. (Bukit Panjang, Sungei Kadut, North South Line)
There were 459 such flats on offer across two projects - Hougang Citrine and Kovan Wellspring - which attracted 10,602 applicants as at 5pm on Tuesday (Aug 17). (Hougang Citrine, Kovan Wellspring)
The 43rd case had not travelled to China recently, but visited Malaysia on Jan 26. On Jan 30, he reported the onset of symptoms and visited two GP clinics on Jan 30, Feb 5 and 6. He then sought treatment at Sengkang General Hospital on Feb 6 an

In [8]:
spacy_evaluation_set = create_training(evaluation_set)
spacy_evaluation_set.to_disk("../data/training_datasets/golden_set.spacy")

There will be two more polyclinics in Bidadari and Bishan and a new and larger facility in Tiong Bahru by 2030. (Bidadari, Bishan, Tiong Bahru)
Skipping entity
This is particularly evident in the first few stages of the North South and East West lines that opened between 1987 and 1988 from Yio Chu Kang to Clementi. (North South, Yio Chu Kang, Clementi)
This land acquisition will result in the combined site area to be 50% more than the existing Guoco Midtown, and is set to become one of the largest developments in the Central Business District (CBD). (Guoco Midtown, Central Business District, CBD)
Sree Narayana Mission was one of four nursing homes that started these rides in November last year, but it stopped them in January this year due to the tightening of Covid-19 restrictions. (Sree Narayana Mission,)
The reservoir, together with the MacRitchie Reservoir, the Lower Peirce Reservoir and the Upper Seletar Reservoir, bound the Central Catchment Nature Reserve. The nature reserve acts
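
After saving, a .spacy file can be read back to confirm that the documents and entities survived the round trip. The following is a minimal sketch that writes to a temporary directory so that none of the real dataset files are touched. The file name roundtrip_check.spacy and the LOC label on "Marina Bay" are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Build one annotated document; the label is assumed for this example.
doc = nlp("A two-station extension to Marina Bay opened on 14 January 2012.")
span = doc.char_span(27, 37, label="LOC")  # characters 27-37 cover "Marina Bay"
doc.ents = [span] if span is not None else []

db = DocBin()
db.add(doc)

# Write to a throwaway location, then load the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip_check.spacy"
    db.to_disk(path)
    loaded = DocBin().from_disk(path)

# get_docs needs a vocab to rebuild the Doc objects.
docs = list(loaded.get_docs(nlp.vocab))
restored = [(ent.text, ent.label_) for d in docs for ent in d.ents]
print(len(docs), restored)
```

The same from_disk and get_docs calls can be pointed at one of the saved dataset files in read-only fashion, for example to compare the number of documents against the length of the source JSON list.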