# NERC task with CoNLL 2003

This is the first part of a task devoted to researching *Named Entity Recognition and Classification (NERC)* on the *CoNLL 2003 dataset*. In this step we need to find the best way to generate markup for the dataset, which the subsequent steps will rely on. We will also experiment with prompt engineering.


## Dataset analysis

First of all, we need to import the necessary libraries and download the dataset:

In [6]:
# !pip install datasets

In [65]:
from datasets import load_dataset
import pandas as pd
from collections import Counter
import requests
from pprint import pprint
import json
import random

In [None]:
API_URL = "https://llm.ispras.ru/api/chat/completions"
API_KEY = "YOUR_KEY"

In [2]:
dataset = load_dataset("eriktks/conll2003")

Let's look at the information about the dataset. It has three splits: 'train', 'validation', and 'test'.

`dataset.keys()` lists the available splits.


In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


Let's look at the first example of the test split:

In [39]:
pprint(dataset["test"][0])


{'chunk_tags': [11, 0, 11, 21, 11, 12, 0, 11, 13, 11, 12, 0],
 'id': '0',
 'ner_tags': [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'pos_tags': [21, 8, 22, 37, 22, 22, 6, 22, 15, 12, 21, 7],
 'tokens': ['SOCCER',
            '-',
            'JAPAN',
            'GET',
            'LUCKY',
            'WIN',
            ',',
            'CHINA',
            'IN',
            'SURPRISE',
            'DEFEAT',
            '.']}


In [77]:
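def get_sentence(dataset_name: str, idx: int) -> str:
    """Return example `idx` of a split as a single space-joined sentence."""
    return ' '.join(dataset[dataset_name][idx]["tokens"])

def generate_corps(size: int, dataset_name: str):
    """Return a generator of `size` random sentences from a split.

    Indices are drawn with replacement, so a sentence may repeat.
    """
    data_size = len(dataset[dataset_name])
    return (get_sentence(dataset_name, idx)
            for idx in random.choices(range(data_size), k=size))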

In [82]:
print(*generate_corps(10, "train"), sep='\n')


delivered to consumer
shares outstanding
3 - Wayne Ferreira ( South Africa ) beat Jiri Novak ( Czech
LECIVA PRAHA 2470.00 2470.00 1360 3359.200
BOSTON AT CALIFORNIA
-- Helsinki Newsroom +358 - 0 - 680 50 245
More than 1,000 people have been executed in drug-related cases since the law took effect in 1989 .
In another scene , a young girl performed oral sex with an unidentified adult man .
Essex 532-8
ACC sold 9.4 million tonnes in 1995/96 , retaining its top position in the Indian cement industry , Palkhivala said .
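
Because `random.choices` draws with replacement and is unseeded, the sample above can contain duplicates and changes on every run. A minimal sketch of a reproducible alternative follows; the helper name `sample_indices` and the default seed are assumptions made for illustration, not part of the original notebook.

```python
import random

def sample_indices(data_size: int, size: int, seed: int = 0) -> list:
    """Draw `size` distinct indices from range(data_size), reproducibly."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(range(data_size), k=size)

# 14041 is the size of the 'train' split reported above.
indices = sample_indices(14041, 10)
print(indices)
```

The resulting indices could then be passed to `get_sentence` one by one to obtain the same sentences on every run.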


Count the examples in each split:

In [9]:
for split in dataset.keys():
    dataset_split = dataset[split]
    split_len = len(dataset_split)
    print(f"Split '{split}':  {split_len} examples.")


Split 'train':  14041 examples.
Split 'validation':  3250 examples.
Split 'test':  3453 examples.


Dataset features in the training split:


In [10]:
print(dataset["train"].features)


{'id': Value(dtype='string', id=None), 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None), 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None), 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}
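
The `ner_tags` values are integer indices into the label list printed above. As a minimal sketch, the snippet below decodes them and groups `B-`/`I-` tags into entity spans for the first test example shown earlier. The label list is copied by hand from the output above, and the helper names `decode_tags` and `extract_spans` are hypothetical; with the dataset loaded, the same names should also be available via `dataset["train"].features["ner_tags"].feature.names`.

```python
# Label names copied from the `ner_tags` ClassLabel printed above.
NER_LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def decode_tags(tag_ids):
    """Map integer tag IDs to their string labels."""
    return [NER_LABELS[i] for i in tag_ids]

def extract_spans(tokens, tag_ids):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, decode_tags(tag_ids)):
        prefix, _, entity_type = label.partition('-')
        # An I- tag of the same type continues the open span.
        if prefix == 'I' and entity_type == current_type:
            current_tokens.append(token)
            continue
        # Any other label closes the open span, if there is one.
        if current_tokens:
            spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A B- tag (or a stray I- tag) opens a new span.
        if prefix in ('B', 'I'):
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans

# Tokens and tags of dataset["test"][0], as printed earlier.
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
          'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
tag_ids = [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(extract_spans(tokens, tag_ids))
# [('JAPAN', 'LOC'), ('CHINA', 'PER')]
```

The spans reflect the tags exactly as stored in the dataset, including the `PER` label on `CHINA` in this example.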


Let's ask the model to return its output as JSON so that the responses can be parsed programmatically. First, set the request headers:

In [83]:
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}


In [None]:
payload = {
    "model": "llama3.3:latest",
    "messages": [
        {
            "role": "user", 
            "content": f"Classify all named entities in a sentence and categorize their semantic meaning: '{sentence}'"
        } 
        for sentence in generate_corps(2, "train")
    ],
    "format": "json"

}

In [107]:
response = requests.post(API_URL, headers=headers, json=payload)


In [108]:
pprint(payload)

{'format': 'json',
 'messages': [{'content': 'Classify all named entities in a sentence and '
                          "categorize their semantic meaning: 'The won rose "
                          'against the dollar on Friday as banks unwound '
                          'dollar positions on the belief that the won would '
                          "continue to strengthen , dealers said .'",
               'role': 'user'},
              {'content': 'Classify all named entities in a sentence and '
                          "categorize their semantic meaning: 'Add Men 's "
                          "singles , second round'",
               'role': 'user'}],
 'model': 'llama3.3:latest'}
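
The payload above places both sentences in one conversation as consecutive user messages. A chat completion typically returns a single assistant reply for the whole message list, so it can be hard to tell which sentence an answer belongs to. A minimal sketch of an alternative is to build one payload per sentence. The helper `build_payload` is hypothetical; the prompt text, model name, and `format` field are taken from the cell above.

```python
def build_payload(sentence: str, model: str = "llama3.3:latest") -> dict:
    """Build a chat payload that holds exactly one sentence (hypothetical helper)."""
    prompt = (
        "Classify all named entities in a sentence and "
        f"categorize their semantic meaning: '{sentence}'"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",
    }

# Two sentences taken from the random sample printed earlier.
sentences = ["Add Men 's singles , second round", "BOSTON AT CALIFORNIA"]
payloads = [build_payload(sentence) for sentence in sentences]

# Each payload would then be sent on its own, for example:
# responses = [requests.post(API_URL, headers=headers, json=p) for p in payloads]
```

With this layout, response `i` corresponds to sentence `i`, at the cost of one HTTP request per sentence.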


In [121]:
response_json = response.json()
pprint(response.text)

# Collect the assistant's reply from every returned choice.
texts = [choice["message"]["content"] for choice in response_json["choices"]]

('{"id":"llama3.3:latest-28035d26-0682-4392-870e-3d7c7a5c6c7e","created":1740772237,"model":"llama3.3:latest","choices":[{"index":0,"logprobs":null,"finish_reason":"stop","message":{"content":"{ '
 '\\n  \\"entities\\": [\\n    {\\n      \\"entity\\": \\"Men\'s '
 'singles\\",\\n      \\"type\\": \\"Event\\",\\n      \\"semantic meaning\\": '
 '\\"A tennis tournament category\\"\\n    },\\n    {\\n      \\"entity\\": '
 '\\"second round\\",\\n      \\"type\\": \\"Phase\\",\\n      \\"semantic '
 'meaning\\": \\"Stage of a competition or process\\"\\n    }\\n  ]\\n}\\n\\n '
 '\\n  \\n\\n\\n\\n  \\n \\n\\n  \\n\\n\\n\\n\\n\\n  \\n \\n '
 '\\n\\n\\n\\n\\n\\n \\n\\n\\n \\n\\n \\n   \\n\\n\\n '
 '\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n \\n\\n\\n\\n \\n\\n\\n\\n\\n\\n \\n '
 '\\n\\n \\n \\n\\n\\n  \\n\\n \\n\\n\\n\\n\\n\\n\\n '
 '\\n","role":"assistant"}}],"object":"chat.completion","usage":{"response_token/s":"N/A","prompt_token/s":"N/A","total_duration":0,"load_duration":0,"prompt_eval_count":0

Let's parse each reply into a dictionary:

In [117]:
for text in texts:
    pprint(json.loads(text))

{'entities': [{'entity': "Men's singles",
               'semantic meaning': 'A tennis tournament category',
               'type': 'Event'},
              {'entity': 'second round',
               'semantic meaning': 'Stage of a competition or process',
               'type': 'Phase'}]}
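
The types returned here (`Event`, `Phase`) are not among the CoNLL 2003 entity types (`PER`, `ORG`, `LOC`, `MISC`), which matters when the output is meant to serve as markup for this dataset. The sketch below parses a reply and reports any type outside that set. The helper `parse_entities` is hypothetical, and the `entities`/`entity`/`type` keys are assumed from the single reply shown above; the model may use a different schema for other inputs.

```python
import json

# Entity types used by the CoNLL 2003 `ner_tags` labels.
CONLL_TYPES = {"PER", "ORG", "LOC", "MISC"}

def parse_entities(raw: str):
    """Return (entity, type) pairs and the sorted list of non-CoNLL types."""
    data = json.loads(raw)  # leading/trailing whitespace is tolerated
    pairs = [(item["entity"], item["type"]) for item in data.get("entities", [])]
    unknown = sorted({etype for _, etype in pairs if etype not in CONLL_TYPES})
    return pairs, unknown

# A shortened copy of the reply shown above, including trailing whitespace.
raw = (
    '{"entities": ['
    '{"entity": "Men\'s singles", "type": "Event"}, '
    '{"entity": "second round", "type": "Phase"}'
    ']}\n\n  \n'
)
pairs, unknown = parse_entities(raw)
print(pairs)
print(unknown)
# ['Event', 'Phase']
```

A non-empty `unknown` list suggests that the prompt should state the allowed labels explicitly, which is one direction for the prompt engineering experiments.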
