# ~~Nested~~ NER with dictionary

This notebook implements a dictionary-based approach to NER. As a minimal baseline, the nested task is reduced to flat NER tagging.

## Imports

In [1]:
%pip install razdel

In [2]:
import json
import os

from collections import Counter, defaultdict
from typing import Iterator

from razdel import tokenize

## Constants

In [3]:
DATA_PATH = "../../data/"
OUT_PATH = "../../out/dict/"

TRAIN_DATA = os.path.join(DATA_PATH, "jsonl/train.jsonl")
TEST_DATA = os.path.join(DATA_PATH, "jsonl/test.jsonl")

SUBMIT_PATH = os.path.join(OUT_PATH, "test.jsonl")

## Utils

In [4]:
def read_jsonl(file_path: str) -> Iterator[dict]:
    """Reads a file in jsonl format"""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def write_jsonl(file_path: str, data: Iterator[dict]):
    """Writes data to a file in jsonl format"""
    with open(file_path, "w", encoding="utf-8") as f:
        for line in data:
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
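
As a quick sanity check, the two helpers can be exercised with a round trip through a temporary file. This is only a sketch: the records below are hypothetical, and the field names (`id`, `sentences`, `ners`) and labels are assumed from how the rest of the notebook reads the data.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator


def read_jsonl(file_path: str) -> Iterator[dict]:
    """Reads a file in jsonl format, one JSON object per line."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def write_jsonl(file_path: str, data: Iterable[dict]) -> None:
    """Writes records to a file in jsonl format, keeping non-ASCII text readable."""
    with open(file_path, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records; offsets in "ners" are inclusive character positions.
records = [
    {"id": 0, "sentences": "Анна живёт в Москве.", "ners": [[0, 3, "PERSON"], [13, 18, "CITY"]]},
    {"id": 1, "sentences": "Без сущностей.", "ners": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    write_jsonl(path, records)
    restored = list(read_jsonl(path))

print(restored == records)  # True
```

Because `read_jsonl` is a generator, the file is only opened once iteration starts, which is why the result is materialized with `list` before the temporary directory is removed.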

## Model training

The dictionary is built from the training data. The surface text of every annotated entity span is stored together with a count of the labels it was annotated with.

In [5]:
model = defaultdict(Counter)

for sample in read_jsonl(TRAIN_DATA):
    for beg, end, label in sample["ners"]:
        # `end` is an inclusive offset, so slice up to end + 1 to get the full span text
        token = sample["sentences"][beg : end + 1]
        model[token][label] += 1
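
To make the resulting structure concrete, here is a minimal sketch that builds the same kind of dictionary from a few hypothetical samples. The texts, labels, and the name `toy_model` are illustrative assumptions, not taken from the real data.

```python
from collections import Counter, defaultdict

# Hypothetical training samples in the assumed format:
# "ners" holds [start, end, label] with inclusive character offsets.
train_samples = [
    {"sentences": "Анна живёт в Москве.", "ners": [[0, 3, "PERSON"], [13, 18, "CITY"]]},
    {"sentences": "Анна уехала.", "ners": [[0, 3, "PERSON"]]},
    {"sentences": "Фильм «Анна» вышел.", "ners": [[7, 10, "WORK_OF_ART"]]},
]

toy_model = defaultdict(Counter)
for sample in train_samples:
    for beg, end, label in sample["ners"]:
        span_text = sample["sentences"][beg : end + 1]
        toy_model[span_text][label] += 1

# The same surface text can collect several labels with different counts.
print(dict(toy_model["Анна"]))                 # {'PERSON': 2, 'WORK_OF_ART': 1}
print(toy_model["Анна"].most_common(1)[0][0])  # PERSON
```

Keeping a `Counter` per span text, rather than a single label, preserves the ambiguity seen in training and leaves the choice of label to inference time.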

## Inference

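Inference tokenizes each test text and assigns every token the label it received most often in the training data. Tokens that are not in the dictionary are skipped. Because the dictionary keys are whole entity spans, a span made of several tokens never matches a single token, so only single-token entities can be predicted.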

In [6]:
predictions = []

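for sample in read_jsonl(TEST_DATA):
    ners = []

    # razdel yields substrings with `start`, `stop` (exclusive) and `text`
    for token in tokenize(sample["sentences"]):
        if token.text in model:
            # Pick the label seen most often with this text in training
            label = model[token.text].most_common(1)[0][0]
            # Convert the exclusive stop offset to an inclusive end offset
            ners.append([token.start, token.stop - 1, label])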

    predictions.append({"id": sample["id"], "ners": ners})
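
The lookup step can be sketched on a toy input as follows. To keep the example free of external dependencies, it uses a simple regex tokenizer as a stand-in for razdel; `simple_tokenize`, `predict`, and the contents of `toy_model` are hypothetical names and data introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical dictionary, shaped like the one built during training.
toy_model = {
    "Анна": Counter({"PERSON": 2, "WORK_OF_ART": 1}),
    "Москве": Counter({"CITY": 1}),
}


def simple_tokenize(text: str):
    """Stand-in tokenizer (not razdel): yields (start, stop, text), stop exclusive."""
    for match in re.finditer(r"\w+|[^\w\s]", text):
        yield match.start(), match.end(), match.group()


def predict(text: str, model: dict) -> list:
    """Labels every known token with its most frequent training label."""
    ners = []
    for start, stop, token in simple_tokenize(text):
        if token in model:
            label = model[token].most_common(1)[0][0]
            ners.append([start, stop - 1, label])  # inclusive end offset
    return ners


toy_ners = predict("Вчера Анна была в Москве.", toy_model)
print(toy_ners)  # [[6, 9, 'PERSON'], [18, 23, 'CITY']]
```

Unknown tokens such as "Вчера" produce no prediction, and the ambiguous "Анна" resolves to its majority label, which mirrors the behavior of the inference cell above.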

## Submission

In [7]:
os.makedirs(OUT_PATH, exist_ok=True)
write_jsonl(SUBMIT_PATH, predictions)

print("Done! Predictions are saved to", SUBMIT_PATH)

Done! Predictions are saved to ../../out/dict/test.jsonl
