# Preprocessing

This notebook converts the raw annotated JSON files (from multiple domains) into two JSON Lines files, `ner_dataset.jsonl` and `re_dataset.jsonl`, which are later used to train the Named Entity Recognition (NER) and Relation Extraction (RE) models.

**Input Data:**  
Each file contains a list of documents. Each document includes:
- `doc`: A string containing the full document text.
- `entities`: A list of entity annotations. Each entity is a dictionary with:
  - `id`: A unique ID for the entity (int).
  - `type`: The custom entity type (e.g., "ORG", "GPE", "MISC", "EVENT").
  - `mentions`: A list of the surface forms (strings) under which the entity is mentioned in the text.
- `triples`: A list of relation annotations. Each triple is a dictionary with:
  - `head`: The head entity (subject) from which the relation originates, given by its mention text.
  - `tail`: The tail entity (object) that the relation targets, given by its mention text.
  - `relation`: The type of the relation (e.g., "PresentedIn", "LocatedIn").
- `label_set`: A list of all relation types used in the document.
- `entity_label_set`: A list of all entity types used in the document.
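
For orientation, a single input document with this structure might look like the following. The text, entity IDs, mentions, and triple are invented for illustration and do not come from the dataset; only the field names follow the description above.

```python
import json

# Hypothetical example document; all values are made up for illustration.
example_doc = {
    "doc": "ACME Corp presented its new robot at TechExpo in Berlin.",
    "entities": [
        {"id": 0, "type": "ORG", "mentions": ["ACME Corp", "ACME"]},
        {"id": 1, "type": "EVENT", "mentions": ["TechExpo"]},
        {"id": 2, "type": "GPE", "mentions": ["Berlin"]},
    ],
    "triples": [
        {"head": "TechExpo", "tail": "Berlin", "relation": "LocatedIn"},
    ],
    "label_set": ["LocatedIn"],
    "entity_label_set": ["ORG", "EVENT", "GPE"],
}

# The processing loop iterates over the parsed file content,
# so each raw file is assumed to hold a JSON list of such documents.
raw_file_content = json.dumps([example_doc], indent=2)
print(raw_file_content)
```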

**Output NER Format:**  
For each document, we produce a JSON object with:
- `doc_id`: A unique identifier for the document.
- `text`: The raw document text.
- `tokens`: The tokens produced by the pretrained `allenai/longformer-base-4096` tokenizer.
- `offsets`: The character offsets `(start, end)` of each token in `text`.
- `entities`: A list of entities with the following structure:
  - `entity_id`
  - `type`
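  - `mentions`: Each mention includes the mention text and its `span` (character offsets).

**Output RE Format:**  
Each line describes one ordered pair of entities from a document:
- `doc_id`: The identifier of the source document.
- `text`: The raw document text.
- `head` and `tail`: Each a dictionary with `entity_id`, `mention` (the entity's first listed mention), and `type`.
- `relation`: The annotated relation type, or `"no_relation"` if the pair has no annotated relation. Pairs involving an entity without mentions are skipped.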
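
Because `offsets` and each mention `span` in the NER output are both character ranges over `text`, a later step can map a mention to the tokens it covers by checking for overlap. The following is a minimal sketch of that idea; `span_to_token_indices` is a hypothetical helper that is not part of this notebook, and the toy offsets are written by hand rather than produced by the tokenizer.

```python
def span_to_token_indices(offsets, span):
    """Return indices of tokens whose character range overlaps `span`.

    Tokens with an empty range such as (0, 0), which is how special
    tokens typically appear in an offset mapping, are skipped.
    """
    start, end = span
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < tok_end and tok_start < end and tok_end > start
    ]

text = "ACME Corp is in Berlin."
# Hand-written offsets: special, "ACME", "Corp", "is", "in", "Berlin", ".", special
offsets = [(0, 0), (0, 4), (5, 9), (10, 12), (13, 15), (16, 22), (22, 23), (0, 0)]

print(span_to_token_indices(offsets, (0, 9)))    # tokens covering "ACME Corp"
print(span_to_token_indices(offsets, (16, 22)))  # token covering "Berlin"
```

The resulting token indices are what a token-level labelling scheme (for example BIO tags) would be built from.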


In [None]:
!pip install torch transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
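
One way to silence this warning, following the second suggestion in the message, is to set `TOKENIZERS_PARALLELISM` explicitly before the tokenizer is first used. A minimal sketch, assuming that disabling tokenizer parallelism is acceptable for this preprocessing run:

```python
import os

# Assumption: parallel tokenization is not needed here. Setting the variable
# explicitly, before the tokenizer is used, avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```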




# Import Libraries and Define Helper Functions

**load_json**
Loads a JSON file from the given path.

**find_mention_spans**
Finds all occurrences of a mention in the text and returns their character spans.

**append_to_jsonl**
Appends a list of documents to a JSON Lines file, one JSON object per line. JSON Lines is a convenient format for structured data because it remains human-readable and can be processed line by line, which the next step of the pipeline relies on.
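
To illustrate the line-by-line property, the sketch below writes two records to a temporary JSON Lines file and reads them back one line at a time. The records and the temporary path are placeholders for illustration only.

```python
import json
import os
import tempfile

records = [
    {"doc_id": "doc_0", "text": "first"},
    {"doc_id": "doc_1", "text": "second"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")

    # Write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back; each line is parsed independently of the others.
    loaded = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            loaded.append(json.loads(line))

print(loaded == records)
```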

In [None]:
import os
import json
import glob
import re
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

input_folder = "./data/raw/train"
file_paths = glob.glob(os.path.join(input_folder, "*.json"))

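def load_json(file_path):
    """Load and return the parsed content of a JSON file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)

def find_mention_spans(doc_text, mention_text):
    """Return (start, end) character spans of every exact occurrence of the mention."""
    if not mention_text:
        # An empty pattern would match at every position in the text.
        return []
    # Escape the mention so that characters such as "." or "(" match literally.
    pattern = re.escape(mention_text)
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc_text)]

def append_to_jsonl(documents, output_file):
    """Append each document to the file as one JSON object per line."""
    with open(output_file, "a", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")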
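
As a quick check of `find_mention_spans`, the sketch below repeats the core logic of the helper so that it runs on its own and applies it to an invented sentence. Matching is case-sensitive and substring-based, so a short mention also matches inside a longer one.

```python
import re

def find_mention_spans(doc_text, mention_text):
    # Core logic of the helper above, repeated so this example is self-contained.
    pattern = re.escape(mention_text)
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc_text)]

text = "ACME opened an office. ACME Corp (formerly Acme) is growing."

# Two matches: the standalone "ACME" and the prefix of "ACME Corp".
# The differently cased "Acme" is not matched.
acme_spans = find_mention_spans(text, "ACME")
print(acme_spans)
print([text[start:end] for start, end in acme_spans])

# re.escape makes the parentheses match literally instead of forming a group.
paren_spans = find_mention_spans(text, "(formerly Acme)")
print(paren_spans)
```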

# Process the Documents

In [None]:
# The imports, `tokenizer`, `file_paths`, and `find_mention_spans` are defined in the previous cell.

ner_dataset = []
re_dataset = []

doc_id_counter = 0
for file_path in file_paths:
    with open(file_path, "r") as f:
        docs = json.load(f)
    
    for doc in docs:
        doc_id = f"doc_{doc_id_counter}"
        doc_id_counter += 1

        raw_text = doc["doc"]

        encoding = tokenizer(raw_text, return_offsets_mapping=True, truncation=False)
        tokens = encoding.tokens()
        offsets = encoding.offset_mapping

        entities_out = []
        for entity in doc["entities"]:
            entity_id = entity["id"]
            entity_type = entity["type"]
            mention_annotations = []
            
            for mention in entity["mentions"]:
                spans = find_mention_spans(raw_text, mention)
    
                for span in spans:
                    mention_annotations.append({"mention": mention, "span": span})
            
            if mention_annotations:
                entities_out.append({
                    "entity_id": entity_id,
                    "type": entity_type,
                    "mentions": mention_annotations
                })
        
        ner_example = {
            "doc_id": doc_id,
            "text": raw_text,
            "tokens": tokens,
            "offsets": offsets,
            "entities": entities_out
        }
        ner_dataset.append(ner_example)
        
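        # Map every mention variant (lowercased, stripped) to its entity so that a
        # triple may refer to an entity by any of its surface forms. The first
        # mention stays the representative; on duplicate keys the first entity wins.
        mention_lookup = {}
        for entity in doc["entities"]:
            if entity["mentions"]:
                rep_mention = entity["mentions"][0]
                for variant in entity["mentions"]:
                    mention_lookup.setdefault(variant.lower().strip(), {
                        "entity_id": entity["id"],
                        "mention": rep_mention,
                        "type": entity["type"]
                    })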
        
        positive_pairs = {}
        for triple in doc.get("triples", []):
            head_text = triple["head"].lower().strip()
            tail_text = triple["tail"].lower().strip()
            relation = triple["relation"]
            if head_text in mention_lookup and tail_text in mention_lookup:
                h_info = mention_lookup[head_text]
                t_info = mention_lookup[tail_text]
                positive_pairs[(h_info["entity_id"], t_info["entity_id"])] = relation
                re_dataset.append({
                    "doc_id": doc_id,
                    "text": raw_text,
                    "head": h_info,
                    "tail": t_info,
                    "relation": relation
                })
                
        # Add a "no_relation" example for every ordered pair of distinct entities
        # that has no annotated relation and where both entities have a mention.
        entities_by_id = {entity["id"]: entity for entity in doc["entities"]}
        for i, h_entity in entities_by_id.items():
            for j, t_entity in entities_by_id.items():
                if i == j or (i, j) in positive_pairs:
                    continue
                if not h_entity["mentions"] or not t_entity["mentions"]:
                    continue
                re_dataset.append({
                    "doc_id": doc_id,
                    "text": raw_text,
                    "head": {"entity_id": i, "mention": h_entity["mentions"][0], "type": h_entity["type"]},
                    "tail": {"entity_id": j, "mention": t_entity["mentions"][0], "type": t_entity["type"]},
                    "relation": "no_relation"
                })

output_dir = Path("./data/processed")
output_dir.mkdir(parents=True, exist_ok=True)

with open(output_dir / "ner_dataset.jsonl", "w") as f_out:
    for example in ner_dataset:
        f_out.write(json.dumps(example) + "\n")

with open(output_dir / "re_dataset.jsonl", "w") as f_out:
    for example in re_dataset:
        f_out.write(json.dumps(example) + "\n")

print("Preprocessing complete.")
print(f"NER dataset saved to {output_dir / 'ner_dataset.jsonl'}")
print(f"RE dataset saved to {output_dir / 're_dataset.jsonl'}")


Preprocessing complete.
NER dataset saved to data/processed/ner_dataset.jsonl
RE dataset saved to data/processed/re_dataset.jsonl
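
After the files are written, it can be worth verifying that every stored `span` slices back to its mention text and checking how many `no_relation` examples were generated relative to positive ones. The sketch below shows one possible check on small in-memory records shaped like the output described at the top of the notebook; `find_span_mismatches` and `count_relations` are hypothetical helper names, and the sample records are invented.

```python
from collections import Counter

def find_span_mismatches(ner_record):
    """Return (entity_id, mention, span) tuples whose span does not slice to the mention."""
    text = ner_record["text"]
    mismatches = []
    for entity in ner_record["entities"]:
        for m in entity["mentions"]:
            start, end = m["span"]
            if text[start:end] != m["mention"]:
                mismatches.append((entity["entity_id"], m["mention"], (start, end)))
    return mismatches

def count_relations(re_records):
    """Count RE examples per relation label."""
    return Counter(record["relation"] for record in re_records)

# Invented records; spans are lists because JSON has no tuple type.
ner_record = {
    "doc_id": "doc_0",
    "text": "TechExpo took place in Berlin.",
    "entities": [
        {"entity_id": 0, "type": "EVENT", "mentions": [{"mention": "TechExpo", "span": [0, 8]}]},
        {"entity_id": 1, "type": "GPE", "mentions": [{"mention": "Berlin", "span": [23, 29]}]},
    ],
}
re_records = [
    {"doc_id": "doc_0", "relation": "LocatedIn"},
    {"doc_id": "doc_0", "relation": "no_relation"},
]

print(find_span_mismatches(ner_record))
print(count_relations(re_records))
```

In practice the records would be read line by line from `ner_dataset.jsonl` and `re_dataset.jsonl` instead of being defined inline. Since every unannotated ordered pair becomes a negative example, the `no_relation` count can be much larger than the positive counts, which may matter when training the RE model.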
