### Replicate Tagged Entities
We'll take a dataset that was partially annotated in Doccano and propagate the entities that were already tagged to the remaining, unannotated documents.

Steps:
1. Load the partially tagged dataset.
2. Create a dictionary of the tagged entities.
3. Loop through the unannotated documents, searching each one for entities in the dictionary.
4. Save the resulting dataset and load it into Doccano to finish tagging manually.



In [54]:
import pandas as pd
import os

#### 1. Load the dataset

In [55]:
path_to_test = "acerpi_dataset/test/"
path_to_train = "acerpi_dataset/train/"

In [56]:
ner_dataset = pd.read_json(os.path.join(path_to_train, 'annotated_10_percent.jsonl'), orient='records', lines=True)
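
For reference, the loader above expects one JSON object per line. The following is a minimal sketch of that shape, assuming each record has a `text` field and a `label` list of `[start, end, label]` triples, which is how the later cells read them. The sample sentences and the `ORG` tag are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one annotated, one not.
sample_jsonl = (
    '{"text": "Acme Corp opened a new office.", "label": [[0, 9, "ORG"]]}\n'
    '{"text": "The office belongs to Acme Corp.", "label": []}\n'
)

sample_df = pd.read_json(io.StringIO(sample_jsonl), orient="records", lines=True)

# Each label is [start, end, tag]; slicing the text recovers the entity string.
start, end, tag = sample_df.loc[0, "label"][0]
entity_text = sample_df.loc[0, "text"][start:end]
print(entity_text, tag)
```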

In [57]:
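# This split assumes the first 80 rows are the manually annotated documents
# and every later row is still unannotated.
tagged_sentences = ner_dataset.iloc[0:80].copy()
untagged_sentences = ner_dataset.iloc[80:].copy()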
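
The positional split above depends on knowing that exactly 80 rows were annotated. A possible alternative, sketched here on a small made-up frame, is to split on whether the `label` list is empty. This assumes unannotated documents carry an empty list; note that a document that was reviewed and genuinely contains no entities would also land in the untagged group.

```python
import pandas as pd

# Hypothetical stand-in for ner_dataset: two annotated rows, two empty ones.
demo = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "label": [[[0, 1, "X"]], [], [[0, 1, "Y"]], []],
})

# True where the document already carries at least one annotation.
has_labels = demo["label"].apply(len) > 0

tagged_demo = demo[has_labels].copy()
untagged_demo = demo[~has_labels].copy()
print(len(tagged_demo), len(untagged_demo))
```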

#### 2. Create dictionary of entities

In [58]:
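# Map each tagged surface string to its label.
# If the same string was tagged with different labels, the last one seen wins.
known_entities = {}

for index, row in tagged_sentences.iterrows():
    for entity in row['label']:
        # Each annotation is [start, end, label] with character offsets into the text.
        start_pos = int(entity[0])
        end_pos = int(entity[1])
        label = entity[2]
        entity_text = row['text'][start_pos:end_pos]
        known_entities[entity_text] = label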
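
Because the dictionary keeps only one label per string, a string that was annotated inconsistently silently takes whichever label came last. One way to handle this, shown as a sketch rather than as the method used in this notebook, is to count labels per string, keep the most frequent one, and list the ambiguous strings for review. The `observed` pairs below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (entity_text, label) pairs as they might be collected from the tagged rows.
observed = [("Paris", "LOC"), ("Paris", "PER"), ("Paris", "LOC"), ("Acme", "ORG")]

label_counts = defaultdict(Counter)
for entity_text, label in observed:
    label_counts[entity_text][label] += 1

# Keep the most frequent label for each surface string.
majority_entities = {
    text: counts.most_common(1)[0][0] for text, counts in label_counts.items()
}

# Strings that were tagged with more than one label, for manual review.
ambiguous = {text for text, counts in label_counts.items() if len(counts) > 1}
print(majority_entities, ambiguous)
```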


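#### 3. Search the unannotated documents

In [59]:
def find_matching_entities(sentence, dictionary):
    """Return [start, end, label] for the first occurrence of each known entity.

    Matching is a plain substring search, so an entity can also match inside
    a longer word, and repeated occurrences after the first are not recorded.
    """
    matches = []
    for entity in dictionary:
        index = sentence.find(entity)
        if index != -1:
            matches.append([index, index + len(entity), dictionary[entity]])
    return matches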
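
`find_matching_entities` uses `str.find`, so it reports only the first occurrence of each entity, can match inside longer words, and may return overlapping spans when one entity contains another. The sketch below is one possible stricter variant, not part of the original notebook: it uses `re.finditer` to collect every occurrence, requires that the match is not surrounded by word characters, and tries longer entities first so that they take precedence over shorter ones. The function name and the sample dictionary are hypothetical.

```python
import re

def find_all_matching_entities(sentence, dictionary):
    """Return [start, end, label] for every non-overlapping whole-word match."""
    matches = []
    taken = []  # (start, end) spans already claimed by a longer entity
    # Longer entities first, so "New York City" wins over "New York".
    for entity in sorted(dictionary, key=len, reverse=True):
        if not entity:
            continue  # an empty string would match everywhere
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        for m in re.finditer(pattern, sentence):
            start, end = m.span()
            # Accept the span only if it does not overlap a claimed one.
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                matches.append([start, end, dictionary[entity]])
    return sorted(matches)

demo_entities = {"New York": "LOC", "New York City": "LOC", "Ann": "PER"}
demo_sentence = "Ann left New York City; Anna stayed in New York."
demo_matches = find_all_matching_entities(demo_sentence, demo_entities)
print(demo_matches)
```

Whether whole-word matching is appropriate depends on the language and tokenization of the corpus, so treat the boundary rule as an assumption to adjust.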

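In [61]:
# row['label'] is the same list object that the DataFrame holds,
# so extend() updates the stored labels in place.
for index, row in untagged_sentences.iterrows():
    row['label'].extend(find_matching_entities(row['text'], known_entities))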
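
The loop above mutates the label lists in place, so running the cell a second time appends the same spans again. A sketch of a variant that is safe to re-run, under the assumption that duplicate spans are unwanted, builds a new list per row and skips spans that are already present. `find_first` is a stand-in for `find_matching_entities`, and `merge_labels` and the demo data are hypothetical.

```python
import pandas as pd

def find_first(sentence, dictionary):
    # Stand-in for find_matching_entities: first occurrence of each entity.
    return [
        [sentence.find(e), sentence.find(e) + len(e), dictionary[e]]
        for e in dictionary
        if sentence.find(e) != -1
    ]

def merge_labels(existing, new):
    """Return a new list holding the existing spans plus any unseen new ones."""
    merged = [list(span) for span in existing]
    for span in new:
        if list(span) not in merged:
            merged.append(list(span))
    return merged

demo = pd.DataFrame({
    "text": ["Acme hired Ann.", "Ann left."],
    "label": [[[0, 4, "ORG"]], []],
})
demo_entities = {"Acme": "ORG", "Ann": "PER"}

new_labels = [
    merge_labels(labels, find_first(text, demo_entities))
    for text, labels in zip(demo["text"], demo["label"])
]
demo["label"] = pd.Series(new_labels, index=demo.index)

# Applying the same step again leaves the labels unchanged.
again = [
    merge_labels(labels, find_first(text, demo_entities))
    for text, labels in zip(demo["text"], demo["label"])
]
print(demo["label"].tolist())
```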

In [71]:
replicated_data = pd.concat([tagged_sentences, untagged_sentences])
print(
    tagged_sentences.shape,
    '+', untagged_sentences.shape,
    '=', replicated_data.shape)

(80, 7) + (649, 7) = (729, 7)


#### 4. Save the resulting dataset

In [72]:
replicated_data.to_json(os.path.join(path_to_train, 'unique_sentences_replicated.jsonl'), lines=True, orient='records')
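
Before loading the file into Doccano, it can be worth checking that every span lies inside its text. The helper below is a small, hypothetical sanity check, not something the original notebook runs; it assumes the `[start, end, label]` layout used throughout and flags spans that are empty, reversed, or out of range.

```python
def invalid_spans(text, labels):
    """Return the spans that are empty, reversed, or fall outside the text."""
    return [
        span for span in labels
        if not (0 <= span[0] < span[1] <= len(text))
    ]

# Hypothetical example: the second span runs past the end of the text.
demo_text = "Ann left."
demo_labels = [[0, 3, "PER"], [4, 20, "LOC"]]
bad = invalid_spans(demo_text, demo_labels)
print(bad)
```

On the real data, the same function could be applied to each row's `text` and `label` values, and any non-empty result inspected before export.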