# Data Conversion
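## ECHR JSON files to spaCy JSON files
modifications: de-duplicates documents, renames some TAB entity types to their spaCy equivalents (LOC to GPE, DATETIME to DATE), and drops the QUANTITY, MISC, and CODE entity types, which the code treats as overlapping with the pretrained spaCy model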

modified from original: Brad Payne
original file: https://github.com/NorskRegnesentral/text-anonymization-benchmark/blob/master/longformer_experiments/data_manipulation.py
original author: Norsk Regnesentral
license: MIT


**spaCy Entities:**

PERSON:      People, including fictional.<br>
NORP:        Nationalities or religious or political groups.<br>
FAC:         Buildings, airports, highways, bridges, etc.<br>
ORG:         Companies, agencies, institutions, etc.<br>
GPE:         Countries, cities, states.<br>
LOC:         Non-GPE locations, mountain ranges, bodies of water.<br>
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)<br>
EVENT:       Named hurricanes, battles, wars, sports events, etc.<br>
WORK_OF_ART: Titles of books, songs, etc.<br>
LAW:         Named documents made into laws.<br>
LANGUAGE:    Any named language.<br>
DATE:        Absolute or relative dates or periods.<br>
TIME:        Times smaller than a day.<br>
PERCENT:     Percentage, including “%”.<br>
MONEY:       Monetary values, including unit.<br>
QUANTITY:    Measurements, as of weight or distance.<br>
ORDINAL:     “first”, “second”, etc.<br>
CARDINAL:    Numerals that do not fall under another type.<br>

**Text Anonymization Benchmark Entities:**

PERSON:	    Names of people, including nicknames/aliases, usernames and initials<br>
CODE:	    Numbers and codes that identify something, such as SSN, phone number, passport number, license plate<br>
LOC:	    Places and locations, such as cities, areas, countries, etc.; addresses; named infrastructures (bus stops, bridges, etc.)<br>
ORG:	    Names of organisations, such as public and private companies; schools, universities, public institutions, prisons, healthcare institutions; non-governmental organisations, churches, etc.<br>
DEM:	    Demographic attributes of a person, such as native language, descent, heritage, ethnicity; job titles, ranks, education; physical descriptions, diagnoses, birthmarks, ages<br>
DATETIME:	Description of a specific date (e.g. October 3, 2018), time (e.g. 9:48 AM) or duration (e.g. 18 years).<br>
QUANTITY:	Description of a meaningful quantity, e.g. percentages or monetary values.<br>
MISC:	    Every other type of information that describes an individual and that does not belong to the categories above<br>
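
The two label sets above are reconciled by renaming two TAB types and dropping three. The snippet below is a minimal sketch of that rule applied to individual labels. The mapping and the drop list mirror the conversion cell in this notebook; the helper name `to_spacy_label` and the sample labels are hypothetical and only there for illustration.

```python
from typing import Optional

# Mapping and drop list copied from the conversion cell in this notebook.
TAB_TO_SPACY = {"LOC": "GPE", "DATETIME": "DATE"}
DROPPED = {"QUANTITY", "MISC", "CODE"}


def to_spacy_label(tab_label: str) -> Optional[str]:
    """Return the renamed label, or None when the mention is dropped.

    Hypothetical helper: the notebook does the rename and the filter
    inline rather than in one function.
    """
    label = TAB_TO_SPACY.get(tab_label, tab_label)
    return None if label in DROPPED else label


# Sample TAB labels, for illustration only.
samples = ["PERSON", "LOC", "DATETIME", "CODE", "DEM"]
converted = {label: to_spacy_label(label) for label in samples}
print(converted)
```

Note that DEM is neither renamed nor dropped, so it passes through unchanged and ends up in the output as a label that the spaCy list above does not contain.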

In [1]:
import json
from typing import Dict

translator = {
    "LOC": "GPE",
    "DATETIME": "DATE",
}

def update_entity_types(swap: str, entity_mapping: Dict[str, str]) -> str:
    """Return the mapped entity type, or `swap` unchanged if it has no mapping."""
    return entity_mapping.get(swap, swap)

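def de_dupe(duplicated_list: list) -> list:
    """Return the items in first-seen order with equal items removed.

    The items are dicts (unhashable), so this uses a linear membership test.
    """
    deduplicated_list = []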
    for item in duplicated_list:
        if item not in deduplicated_list:
            deduplicated_list.append(item)
    return deduplicated_list

with open('../../data/annotated/echr_train.json', "r", encoding="utf-8") as f1, \
        open('../../data/annotated/echr_dev.json', "r", encoding="utf-8") as f2, \
        open('../../data/annotated/echr_test.json', "r", encoding="utf-8") as f3:
    train = json.load(f1)
    dev = json.load(f2)
    test = json.load(f3)
    training_raw = []
    dev_raw = []
    test_raw = []

    # TAB entity types dropped from the output because they are treated as
    # overlapping with the pretrained spaCy model
    overlap = ['QUANTITY', 'MISC', 'CODE']
    for ann_data in train:
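        # NOTE: dct is created once per document, so each annotator iteration below
        # re-appends the same dict object. It ends up holding the last annotator's
        # spans, and de_dupe() later collapses the repeats to one record per document.
        # The dev and test loops follow the same pattern.
        dct = {}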
        for annotator in ann_data['annotations']:
            dct['full_text'] = ann_data['text']
            annotations = []
            for annotation in ann_data['annotations'][annotator]['entity_mentions']:
                span = {
                    'entity_type': update_entity_types(annotation['entity_type'], translator),
                    'entity_value': annotation['span_text'],
                    'start_position': annotation['start_offset'],
                    'end_position': annotation['end_offset']
                }
                if span['entity_type'] not in overlap:
                    annotations.append(span)
            dct['spans'] = annotations
            training_raw.append(dct)
    print("Number of Training files: ", len(training_raw))
    # merging 'dev' into 'training dataset' is intentional, as Spacy doesn't require a dev dataset
    for ann_data in dev:
        dct = {}
        for annotator in ann_data['annotations']:
            dct['full_text'] = ann_data['text']
            annotations = []
            for annotation in ann_data['annotations'][annotator]['entity_mentions']:
                span = {
                    'entity_type': update_entity_types(annotation['entity_type'], translator),
                    'entity_value': annotation['span_text'],
                    'start_position': annotation['start_offset'],
                    'end_position': annotation['end_offset']
                }
                if span['entity_type'] not in overlap:
                    annotations.append(span)
            dct['spans'] = annotations
            training_raw.append(dct)
    print("Number of Training files after appending Dev files: ", len(training_raw))
    for ann_data in test:
        dct = {}
        for annotator in ann_data['annotations']:
            dct['full_text'] = ann_data['text']
            annotations = []
            for annotation in ann_data['annotations'][annotator]['entity_mentions']:
                span = {
                    'entity_type': update_entity_types(annotation['entity_type'], translator),
                    'entity_value': annotation['span_text'],
                    'start_position': annotation['start_offset'],
                    'end_position': annotation['end_offset']
                }
                if span['entity_type'] not in overlap:
                    annotations.append(span)
            dct['spans'] = annotations
            test_raw.append(dct)
    print("Number of Test files: ", len(test_raw))
de_duped_training = de_dupe(training_raw)
de_duped_test = de_dupe(test_raw)
print("Number of Training files after de-duplication: ", len(de_duped_training))
print("Number of Test files after de-duplication: ", len(de_duped_test))
with open('../../data/annotated/train_data.json', 'w', encoding='utf-8') as f1:
    json.dump(de_duped_training, f1, ensure_ascii=False, indent=4)
with open('../../data/annotated/test_data.json', 'w', encoding='utf-8') as f3:
    json.dump(de_duped_test, f3, ensure_ascii=False, indent=4)

TR 1112
TR2 1653
TE 555
TR-dd 1141
TE-dd 127
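
Before training on the converted files, it can help to confirm that every span's offsets still point at its text. The snippet below is a minimal sketch of such a check, not part of the original notebook. It assumes the record shape written by the cell above (`full_text`, `spans`, `start_position`, `end_position`, `entity_value`) and assumes that end offsets are exclusive; the function name `find_misaligned_spans` and the sample record are hypothetical.

```python
from typing import Dict, List


def find_misaligned_spans(records: List[Dict]) -> List[Dict]:
    """Return the spans whose offsets do not slice out their entity_value.

    Assumes end_position is exclusive, as in Python slicing.
    """
    bad = []
    for record in records:
        text = record["full_text"]
        for span in record["spans"]:
            sliced = text[span["start_position"]:span["end_position"]]
            if sliced != span["entity_value"]:
                bad.append(span)
    return bad


# Hypothetical record in the same shape as the entries in train_data.json.
records = [{
    "full_text": "Mr Smith was born in Oslo.",
    "spans": [
        {"entity_type": "PERSON", "entity_value": "Smith",
         "start_position": 3, "end_position": 8},
        {"entity_type": "GPE", "entity_value": "Oslo",
         "start_position": 21, "end_position": 25},
        # Deliberately shifted by one character to show a failure.
        {"entity_type": "GPE", "entity_value": "Oslo",
         "start_position": 20, "end_position": 24},
    ],
}]

misaligned = find_misaligned_spans(records)
print(len(misaligned))
```

To run the same check on the real output, load `train_data.json` with `json.load` and pass the resulting list to the function; an empty result means every span matches its text under the exclusive-end assumption.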
