# Preparing data for labelling (phase 2)

## I. Manual labelling

The goal of manual labelling was to determine whether sampled phrases represent a named entity (NE). 
More specifically, annotators were given three choices for rating how well a phrase matches an NE: 'partial match', 'full match' and 'no'.

Annotators also followed the [Estonian NE annotation guidelines by Laura Katrin Leman and Kairit Sirts](https://docs.google.com/document/d/1gZcNHmSEK3ua6EwsGJJgRUbfOTzSSa6LuQaDvgNThM4/edit#heading=h.ottwb26al57x).

The first phase of manual labelling covered 17 part-of-speech subsets, each containing at most 100 sentences.

## II. New data preparation

Prepare data for the next annotation phase:
* take the 17 unlabelled part-of-speech subsets (1000 samples each) as a basis;
* exclude part-of-speech subsets that contained neither partial nor full NE phrase matches in the first phase (O, X, U, P, V, K, J);
* exclude sentences that were already annotated in the first phase.

In [1]:
import os
import json
import copy
import random
import os.path
from tqdm import tqdm
import pandas as pd

First, gather sentences and spans that have already been annotated.

In [2]:
input_dir = "labelled/extended_100"
assert os.path.exists(input_dir), \
    f'(!) Missing input dir {input_dir!r}. Complete the first annotation phase and download annotated data.'

In [3]:
def read_json_file(filename):
    # Load a UTF-8 encoded JSON file; the context manager closes it even on errors.
    with open(filename, encoding="utf-8") as f:
        return json.load(f)

already_annotated = []
for file in os.listdir(input_dir):
    pos = read_json_file(os.path.join(input_dir, file))
    try:
        for elem in pos:
            start  = elem["predictions"][0]["result"][0]["value"]["start"]
            end    = elem["predictions"][0]["result"][0]["value"]["end"]
            phrase = elem["predictions"][0]["result"][0]["value"]["text"]
            text   = elem["data"]["text"]
            already_annotated.append((file, phrase, text, start, end))
    except Exception as e:
        print(str(e))
        print(elem)
        break
    #break

already_annotated_df = pd.DataFrame(already_annotated, columns=["file", "phrase", "text", "start", "end"])
already_annotated_df = already_annotated_df.drop_duplicates()
already_annotated_df

'int' object is not subscriptable
{'id': 1001, 'annotations': [{'id': 8, 'completed_by': 1, 'result': [{'value': {'start': 116, 'end': 133, 'text': 'tekitatud kaldale', 'labels': ['v172_geo_terms']}, 'id': 'Kd6TO66VlK', 'from_name': 'label', 'to_name': 'text', 'type': 'labels', 'origin': 'prediction'}, {'value': {'choices': ['no']}, 'id': 'J7PutgBJ_X', 'from_name': 'review', 'to_name': 'text', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2023-06-14T11:03:41.014143Z', 'updated_at': '2023-06-14T11:03:41.014143Z', 'draft_created_at': None, 'lead_time': 49.145, 'prediction': {'id': 1001, 'model_version': 'undefined', 'created_ago': '21\xa0hours, 28\xa0minutes', 'result': [{'value': {'start': 116, 'end': 133, 'text': 'tekitatud kaldale', 'idx': 962, 'labels': ['v172_geo_terms']}, 'to_name': 'text', 'from_name': 'label', 'type': 'labels'}], 'score': None, 'cluster': None, 'neighbors': None, 'mislabeling': 0.0, 'created_at': '2023-06-1

Unnamed: 0,file,phrase,text,start,end


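**TODO:** The previous cell fails because the input data does not have the expected structure: the "predictions" slot contains integers instead of nested dictionaries. As a result, `already_annotated_df` ends up empty and no previously annotated sentences are excluded in the cells below. The cause remains to be investigated.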
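One possible explanation, consistent with the element printed above, is that the export stores the labelled span under `annotations`, while `predictions` holds only prediction IDs. The sketch below illustrates reading the span from `annotations` instead. It is an assumption, not a verified fix: the helper name `extract_span` is hypothetical, and the example task only mirrors the fields visible in the printed element.

```python
def extract_span(task):
    """Return (phrase, start, end) from the first result of type 'labels'
    found among the task's annotations, or None if there is none."""
    for annotation in task.get("annotations", []):
        if not isinstance(annotation, dict):
            continue  # skip anything that is not a nested dictionary
        for result in annotation.get("result", []):
            if result.get("type") == "labels":
                value = result["value"]
                return value["text"], value["start"], value["end"]
    return None

# Minimal task mirroring the shape of the element printed above.
example_task = {
    "id": 1001,
    "annotations": [{
        "result": [
            {"value": {"start": 116, "end": 133, "text": "tekitatud kaldale",
                       "labels": ["v172_geo_terms"]}, "type": "labels"},
            {"value": {"choices": ["no"]}, "type": "choices"},
        ],
    }],
    "predictions": [1001],  # assumed: prediction IDs rather than nested dicts
}
print(extract_span(example_task))
```

Returning `None` for tasks without a span lets the caller skip them explicitly instead of relying on a broad `except` clause.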

In [4]:
hopeless = ["pos_O", "pos_X", "pos_U", "pos_P", "pos_V", "pos_K", "pos_J"]

In [5]:
input_dir = 'unlabelled/pos_terms_1000_extended'
assert os.path.exists(input_dir), \
    f'(!) Missing input dir {input_dir!r}. Please run "02_prepare_for_labelling.ipynb" before running this.'

In [6]:
additional_data = []
for file in tqdm(os.listdir(input_dir)):
    stop = False
    if not any(hl in file for hl in hopeless):
        pos = read_json_file(os.path.join(input_dir, file))
        orig_file = file[:5]
        for elem in pos:
            new = True
            try:
                if len(elem['predictions'][0]['result']) != 0:
                    text = elem["data"]["text"]
                    start = elem['predictions'][0]['result'][0]['value']['start']
                    end = elem['predictions'][0]['result'][0]['value']['end']
                    phrase = elem['predictions'][0]['result'][0]['value']['text']
                    # Check whether this sentence was previously annotated:
                    temp = already_annotated_df[already_annotated_df["file"].str.match(orig_file)]
                    for _, row in temp.iterrows():
                        if row["phrase"]==phrase and row["text"]==text and row["start"]==start and row["end"]==end:
                            # Affirmative: this is not new
                            new = False
                            break
                    if new:        
                        additional_data.append((orig_file, phrase, text, start, end))
            except Exception as e:
                print(str(e))
                print(orig_file, elem)
                stop=True
                break
    if stop:
        break

100%|██████████| 17/17 [00:01<00:00, 12.30it/s]
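The row-by-row comparison in the loop above can also be expressed as a pandas anti-join. The following is a minimal sketch on made-up toy data, assuming both frames share the columns `file`, `phrase`, `text`, `start` and `end`, and that the `file` column has been normalised to the same prefix (e.g. `pos_A`) in both. The helper name `drop_already_annotated` is hypothetical and not part of the original notebook.

```python
import pandas as pd

KEYS = ["file", "phrase", "text", "start", "end"]

def drop_already_annotated(candidates, annotated):
    """Keep only candidate rows that have no exact match in `annotated` on KEYS."""
    if annotated.empty:
        # Nothing to exclude; also avoids dtype mismatches when merging
        # against an empty frame.
        return candidates.reset_index(drop=True)
    merged = candidates.merge(
        annotated[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    new_rows = merged[merged["_merge"] == "left_only"]
    return new_rows.drop(columns="_merge").reset_index(drop=True)

# Toy data, made up for illustration.
candidates = pd.DataFrame(
    [("pos_A", "sooja merd", "sentence one", 15, 25),
     ("pos_A", "kohalikke panku", "sentence two", 17, 32)],
    columns=KEYS,
)
annotated = pd.DataFrame(
    [("pos_A", "sooja merd", "sentence one", 15, 25)],
    columns=KEYS,
)
remaining = drop_already_annotated(candidates, annotated)
print(remaining)
```

With `indicator=True`, the merge adds a `_merge` column, and rows marked `left_only` are exactly those that appear in the candidates but not among the annotated spans.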


In [7]:
ext_df = pd.DataFrame(additional_data, columns=["file", "phrase", "text", "start", "end"])
ext_df

Unnamed: 0,file,phrase,text,start,end
0,pos_A,sooja merd,“ Me armastame sooja merd ning ilusaid ja vana...,15,25
1,pos_A,pruunide säärte,"Kepka viltu peas , sööstab Ülle pruunide säärt...",32,47
2,pos_A,usaldusväärne allikas,“ See lugu kõlab nende esituses paremini kui i...,80,101
3,pos_A,kaheksatuhandelist mäge,"Teisegi tipptulemuse tegi ta läinud aastal , v...",77,100
4,pos_A,kohalikke panku,Valitsus mõjutas kohalikke panku ja kindlustus...,17,32
...,...,...,...,...,...
9053,pos_Z,PIPI-vs-banaanikalajutt: oja,PIPI-vs-banaanikalajutt: oja käskis lugeda ...,0,28
9054,pos_Z,Polla: kanalite,Polla: kanalite statistika,0,15
9055,pos_Z,kiisumiisu: raba,kiisumiisu: raba on sinilill,0,16
9056,pos_Z,Polla: kanalite,Polla: kanalite statistika,0,15


In [8]:
file_counts = dict(ext_df["file"].value_counts())
file_counts

{'pos_C': 994,
 'pos_A': 991,
 'pos_G': 991,
 'pos_S': 981,
 'pos_N': 979,
 'pos_H': 976,
 'pos_D': 973,
 'pos_Y': 925,
 'pos_Z': 902,
 'pos_I': 346}

Export as Label Studio files.

In [9]:
from pd_collection_to_ls import collection_to_labelstudio

output_dir = 'unlabelled/pos_terms_1000_extended_phase_2'
os.makedirs(output_dir, exist_ok=True)

for key in file_counts.keys():
    file_df = ext_df[ext_df["file"]==key]
    output_path = os.path.join(output_dir, f'{key}.json')
    collection_to_labelstudio(file_df, "v172_geo_terms", output_path)
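The implementation of `collection_to_labelstudio` is not shown in this notebook. As a rough illustration of the kind of task structure involved, the sketch below builds Label Studio-style tasks with one pre-annotated span each, using the field names visible in the element printed earlier (`data.text`, `from_name: label`, `to_name: text`, `type: labels`). The function `rows_to_tasks` is hypothetical and may differ from what `collection_to_labelstudio` actually does.

```python
import json

def rows_to_tasks(rows, label):
    """Build one task dictionary per (phrase, text, start, end) row."""
    tasks = []
    for phrase, text, start, end in rows:
        tasks.append({
            "data": {"text": text},
            "predictions": [{
                "result": [{
                    "value": {"start": start, "end": end,
                              "text": phrase, "labels": [label]},
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
                }],
            }],
        })
    return tasks

# One row taken from the table above.
rows = [("Polla: kanalite", "Polla: kanalite statistika", 0, 15)]
tasks = rows_to_tasks(rows, "v172_geo_terms")
print(json.dumps(tasks, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Estonian characters such as `ä` and `õ` readable in the exported JSON.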

**Note**: this notebook contains refactored code for data preparation, but the original input data this code was written for is no longer fully available, because the data sampling seed is missing. Thus, the outputs printed in this notebook do not correspond exactly to the outputs of the original data preparation notebooks (which are distributed elsewhere).