# Preprocessing

This notebook preprocesses the raw annotated JSON files (from multiple domains) into JSON Lines files (`ner_dataset.jsonl`, `re_dataset.jsonl`), which are later used to train the Named Entity Recognition (NER) and Relation Extraction (RE) models.

**Input Data:**  
Each file holds a list of documents. Each document includes:
- `doc`: A string containing the full document text.
- `entities`: A list of entity annotations. Each entity is a dictionary with:
  - `id`: A unique ID for the entity (int).
  - `type`: The custom entity type (e.g., "ORG", "GPE", "MISC", "EVENT").
  - `mentions`: A list of all mentions of the entity, including variant surface forms.
- `triples`: A list of relation annotations. Each triple is a dictionary with:
  - `head`: The head entity (subject) from which the relation originates.
  - `tail`: The tail entity that the relation targets.
  - `relation`: The type of the relation (e.g., "PresentedIn", "LocatedIn").
- `label_set`: A list of all relation types used in the document.
- `entity_label_set`: A list of all entity types used in the document.

**Output NER Format:**  
For each document, we produce a JSON object with:
- `doc_id`: A unique identifier for the document.
- `text`: The raw document text.
- `tokens`: The list of tokens obtained from tokenizing the text with a pretrained tokenizer.
- `offsets`: The corresponding character offset mapping for each token.
- `entities`: A list of entities with the following structure:
  - `entity_id`
  - `type`
  - `mentions`: Each mention includes the mention text and its `span` (character offsets).
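
As a rough illustration of the output format above, the following sketch builds one NER record from a toy input document. It is not the notebook's actual implementation: `build_ner_record` and `simple_tokenize` are hypothetical helpers, a regex word splitter stands in for the pretrained tokenizer, the toy document is invented, and mentions are assumed to be plain strings.

```python
import re

def simple_tokenize(text):
    """Stand-in tokenizer (assumption): word and punctuation tokens with offsets."""
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    tokens = [m.group() for m in matches]
    offsets = [(m.start(), m.end()) for m in matches]
    return tokens, offsets

def build_ner_record(doc_id, doc):
    """Hypothetical helper: convert one raw document into the NER output structure."""
    text = doc["doc"]
    tokens, offsets = simple_tokenize(text)
    entities = []
    for entity in doc["entities"]:
        mentions = []
        for mention_text in entity["mentions"]:
            # Record every occurrence of the mention as a character span.
            for m in re.finditer(re.escape(mention_text), text):
                mentions.append({"text": mention_text, "span": [m.start(), m.end()]})
        entities.append(
            {"entity_id": entity["id"], "type": entity["type"], "mentions": mentions}
        )
    return {
        "doc_id": doc_id,
        "text": text,
        "tokens": tokens,
        "offsets": offsets,
        "entities": entities,
    }

# Toy document invented for illustration; it is not taken from the dataset.
toy_doc = {
    "doc": "Acme opened an office in Berlin. Acme Corp now employs 40 people.",
    "entities": [
        {"id": 0, "type": "ORG", "mentions": ["Acme Corp", "Acme"]},
        {"id": 1, "type": "GPE", "mentions": ["Berlin"]},
    ],
}
record = build_ner_record("toy-0", toy_doc)
```

Note that the short mention "Acme" also matches inside "Acme Corp", so overlapping spans can appear; a real pipeline may want to deduplicate them or keep only the longest match.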


In [15]:
!pip install transformers bitsandbytes accelerate peft datasets torch torchinfo matplotlib pandas

# Import Libraries and Define Helper Functions

**load_json**
Loads a JSON file from the given path.

**find_mention_spans**
Finds every occurrence of a mention in the text and returns the character spans as `(start, end)` tuples.

**append_to_jsonl**
Appends a list of documents to a JSON Lines file, one JSON object per line. JSONL is a convenient format for storing structured data that remains human-readable. It is also well suited to the line-by-line processing done in the next step of the pipeline.

In [None]:
import os
import json
import glob
import re
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

input_folder = "./data/raw/train"
file_paths = glob.glob(os.path.join(input_folder, "*.json"))

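def load_json(file_path):
    """Load a JSON file from the given path."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)

def find_mention_spans(doc_text, mention_text):
    """Return (start, end) character spans for every occurrence of mention_text."""
    if not mention_text:
        # An empty pattern would match at every position in the text.
        return []
    pattern = re.escape(mention_text)
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc_text)]

def append_to_jsonl(documents, output_file):
    """Append each document to output_file as one JSON object per line."""
    with open(output_file, "a", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")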
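
To illustrate the line-by-line processing that JSONL allows, here is a minimal round-trip sketch. `read_jsonl` is a hypothetical reader that does not appear in the notebook, and the records and temporary file are made up for the example.

```python
import json
import os
import tempfile

def append_to_jsonl(documents, output_file):
    # Same idea as the notebook helper: one JSON object per line, append mode.
    with open(output_file, "a", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")

def read_jsonl(input_file):
    """Hypothetical reader: yield one parsed document per non-empty line."""
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ner_dataset.jsonl")
    append_to_jsonl([{"doc_id": "a", "text": "first"}], path)
    # A second call appends to the file instead of overwriting it.
    append_to_jsonl([{"doc_id": "b", "text": "second"}], path)
    loaded = list(read_jsonl(path))
```

Because the file is opened in append mode, rerunning a cell that calls `append_to_jsonl` adds duplicate records unless the output file is deleted first.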

# Process and Preprocess the Documents

In [19]:
import json
from pathlib import Path

file_paths = [
    "Communication_all_examples.json",
    "Education_all_examples.json",
    "Energy_all_examples.json",
    "Entertainment_all_examples.json",
    "Government_all_examples.json",
]

merged_data_set = []

for file_path in file_paths:
    data_file_path = Path("data/raw/train/") / file_path
    with data_file_path.open("r", encoding="utf-8") as f:
        data = json.load(f)
        merged_data_set.extend(data)




In [20]:
def get_unique_labels(data_set):
    unique_labels = set()
    
    for entry in data_set:
        for entity in entry["entities"]:
            unique_labels.add(entity["type"])
    
    sorted_labels = sorted(unique_labels)
    return sorted_labels

unique_labels = get_unique_labels(merged_data_set)

label_to_id = {label: idx for idx, label in enumerate(unique_labels)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print("Extracted Labels:", unique_labels)


Extracted Labels: ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MISC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
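
The extracted labels cover entity types only. Token classification usually also needs a label for tokens that lie outside any entity. One common option is BIO tagging, though the notebook does not confirm that it uses this scheme. The sketch below shows how such a label map could be built; `build_bio_label_map` is a hypothetical helper and the three entity types are just a sample.

```python
def build_bio_label_map(entity_types):
    """Hypothetical helper: map "O" plus B-/I- tags per entity type to integer IDs."""
    labels = ["O"]  # "O" marks tokens outside any entity and gets ID 0
    for entity_type in sorted(entity_types):
        labels.append(f"B-{entity_type}")  # first token of a mention
        labels.append(f"I-{entity_type}")  # continuation tokens of a mention
    label_to_id = {label: idx for idx, label in enumerate(labels)}
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label

bio_label_to_id, bio_id_to_label = build_bio_label_map(["ORG", "GPE", "EVENT"])
```

Under this scheme, the 19 entity types listed above would yield 39 labels (two per type plus `O`).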


In [21]:
from datasets import Dataset

train_dataset = Dataset.from_list(merged_data_set)
print(train_dataset)

from transformers import AutoTokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["doc"], truncation=True, is_split_into_words=False)
    # label_to_id holds entity types only and has no "O" entry, so a fallback
    # of 0 would collide with the first entity type. Use a separate ID instead.
    outside_id = label_to_id.get("O", len(label_to_id))
    all_labels = []
    for i, _ in enumerate(examples["entities"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens (word ID None) get -100 so the loss ignores them.
        # Entity mentions are not projected onto tokens here, so every
        # other token receives the outside label.
        label_ids = [-100 if word_idx is None else outside_id for word_idx in word_ids]
        all_labels.append(label_ids)
    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs

train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)

train_dataset.save_to_disk("data/processed/train")


Dataset({
    features: ['domain', 'title', 'doc', 'entities', 'triples', 'label_set', 'entity_label_set'],
    num_rows: 51
})


Map: 100%|██████████| 51/51 [00:00<00:00, 439.88 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 51/51 [00:00<00:00, 3347.99 examples/s]
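
The labeling step could later be extended to project entity mentions onto tokens. The sketch below shows one possible way to do this from character offsets. It is an assumption, not the notebook's method: `spans_to_token_labels` is a hypothetical helper, the BIO label map and the offsets are hand-written for a toy sentence, and `(0, 0)` offsets are treated as special tokens.

```python
def spans_to_token_labels(offsets, entity_spans, label_to_id, ignore_id=-100):
    """Assign BIO label IDs to tokens from character-level entity spans.

    offsets: list of (start, end) character offsets, one per token;
             an empty span such as (0, 0) is treated as a special token.
    entity_spans: list of (start, end, entity_type) tuples.
    """
    labels = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:
            labels.append(ignore_id)  # special token, ignored by the loss
            continue
        label = label_to_id["O"]
        for ent_start, ent_end, ent_type in entity_spans:
            # A token belongs to an entity if it lies fully inside the span.
            if tok_start >= ent_start and tok_end <= ent_end:
                prefix = "B" if tok_start == ent_start else "I"
                label = label_to_id[f"{prefix}-{ent_type}"]
                break
        labels.append(label)
    return labels

# Toy sentence: "Acme Corp is in Berlin", with hand-written token offsets.
toy_offsets = [(0, 0), (0, 4), (5, 9), (10, 12), (13, 15), (16, 22), (0, 0)]
toy_label_to_id = {"O": 0, "B-ORG": 1, "I-ORG": 2, "B-GPE": 3, "I-GPE": 4}
toy_spans = [(0, 9, "ORG"), (16, 22, "GPE")]
toy_labels = spans_to_token_labels(toy_offsets, toy_spans, toy_label_to_id)
```

With a fast Hugging Face tokenizer, the offsets could come from calling the tokenizer with `return_offsets_mapping=True`, and the entity spans from `find_mention_spans`.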
