# NERC task with CoNLL 2003

This is the first part of a task devoted to researching *Named Entity Recognition and Classification (NERC)* on the *CoNLL 2003 dataset*. In this step we need to find the best way to generate markup for the dataset, which will be used in the subsequent steps. We will also experiment with prompt engineering.


## Dataset analysis

First, we import the necessary libraries and download the dataset:

In [228]:
# !pip install datasets

In [229]:
from datasets import load_dataset
import pandas as pd
from collections import Counter
import requests
from pprint import pprint
import json
import random
import multiprocessing
import itertools

In [None]:
API_URL = "https://llm.ispras.ru/api/chat/completions"
API_MODEL_URL = "https://llm.ispras.ru/api/models"
API_KEY = "YOUR_TOKEN"

In [231]:
dataset = load_dataset("eriktks/conll2003")

Let's look at the information about the dataset. It has three splits: 'train', 'validation' and 'test'.

`dataset.keys()` lists the available splits.


In [232]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


Let's look at the first example of the test split:

In [233]:
pprint(dataset["test"][0])


{'chunk_tags': [11, 0, 11, 21, 11, 12, 0, 11, 13, 11, 12, 0],
 'id': '0',
 'ner_tags': [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'pos_tags': [21, 8, 22, 37, 22, 22, 6, 22, 15, 12, 21, 7],
 'tokens': ['SOCCER',
            '-',
            'JAPAN',
            'GET',
            'LUCKY',
            'WIN',
            ',',
            'CHINA',
            'IN',
            'SURPRISE',
            'DEFEAT',
            '.']}


Evaluate the number of examples:

In [234]:
for split in dataset.keys():
    dataset_split = dataset[split]
    split_len = len(dataset_split)
    print(f"Split '{split}':  {split_len} examples.")


Split 'train':  14041 examples.
Split 'validation':  3250 examples.
Split 'test':  3453 examples.


### Entity Tags

The original dataset uses named entity recognition (NER) tags in the IOB2 format.

Each token is annotated with three types of tags:
1. POS Tags: Indicate the token's grammatical role (e.g., 'NN', 'VB', etc.).
2. Chunk Tags: Specify the syntactic chunk the token belongs to (e.g., 'B-NP', 'I-NP').
3. NER Tags: Identify named entities using the IOB2 scheme:
   - 'O'      : Token is not part of any entity.
   - 'B-PER'  : Beginning of a person entity.
   - 'I-PER'  : Inside a person entity.
   - 'B-ORG'  : Beginning of an organization entity.
   - 'I-ORG'  : Inside an organization entity.
   - 'B-LOC'  : Beginning of a location entity.
   - 'I-LOC'  : Inside a location entity.
   - 'B-MISC' : Beginning of a miscellaneous entity.
   - 'I-MISC' : Inside a miscellaneous entity.

In [235]:
pos_tags = {'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12,
 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23,
 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33,
 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43,
 'WP': 44, 'WP$': 45, 'WRB': 46}

chunk_tags = {'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8,
 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17,
 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}

ner_tags = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
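
The integer ids stored in the `ner_tags` column can be mapped back to IOB2 labels by inverting the dictionary above. The sketch below does this for the first test example shown earlier and groups `B-`/`I-` labels into entity spans. `decode_tags` and `extract_entities` are hypothetical helper names, not part of the `datasets` library, and the mapping is repeated so that the snippet runs on its own.

```python
# Same mapping as in the cell above, repeated so the snippet is self-contained.
ner_tags = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
            'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id_to_ner = {idx: label for label, idx in ner_tags.items()}


def decode_tags(tag_ids):
    """Map integer tag ids to their IOB2 labels."""
    return [id_to_ner[i] for i in tag_ids]


def extract_entities(tokens, labels):
    """Group IOB2-labelled tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)  # continue the open entity
            continue
        if current:  # any other label closes the open entity
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if label != "O":  # B-* (or a stray I-*) opens a new entity
            current, current_type = [token], label[2:]
    if current:  # entity that runs to the end of the sentence
        entities.append((" ".join(current), current_type))
    return entities


# Tokens and tag ids of the first test example printed earlier.
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
          'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
labels = decode_tags([0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0])
print(extract_entities(tokens, labels))
```

For this example the stored tags decode to `JAPAN` as `LOC` and `CHINA` as `PER`, exactly as they appear in the dataset row.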


Each sentence is split into tokens. 

In [236]:
dataset.data["train"]["tokens"][0]

<pyarrow.ListScalar: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']>

To send sentences to a model, we join the tokens of each sentence back into a single string with the function `get_sentence`.

The function `generate_corps` takes the batch size and the split name (`dataset_name`) and returns a generator over randomly chosen sentences from that split.

This gives us a batch of random sentences to send to the model.


In [237]:
def get_sentence(dataset_name: str, idx: int) -> str: 
    return ' '.join(dataset.data[dataset_name]["tokens"][idx].values.tolist())

def generate_corps(size: int, dataset_name: str):
    data = dataset[dataset_name]
    data_size = data.shape[0]
    return (get_sentence(dataset_name, idx) 
            for idx in random.choices(range(data_size), k=size))

For example, for a batch of 10 random sentences:

```python
print(*generate_corps(10, "train"), sep='\n')
```

we get a batch of unannotated sentences:

```text
delivered to consumer
shares outstanding
3 - Wayne Ferreira ( South Africa ) beat Jiri Novak ( Czech
LECIVA PRAHA 2470.00 2470.00 1360 3359.200
BOSTON AT CALIFORNIA
-- Helsinki Newsroom +358 - 0 - 680 50 245
More than 1,000 people have been executed in drug-related cases since the law took effect in 1989 .
In another scene , a young girl performed oral sex with an unidentified adult man .
Essex 532-8
ACC sold 9.4 million tonnes in 1995/96 , retaining its top position in the Indian cement industry , Palkhivala said .
```

In [238]:
print(*generate_corps(2, "train"), sep='\n')


Ninety-day bank bill rates shed five points to 9.93 percent and September bank bill futures rose four to 90.18 .
The facility has a tenor of six months .
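
Note that `random.choices` samples with replacement, so the same sentence may appear more than once in a batch. If distinct sentences are wanted, `random.sample` can be used instead. The sketch below shows the idea on plain row indices; `sample_indices` is a hypothetical helper, and the seeded `random.Random` instance is an assumption added for reproducibility.

```python
import random
from typing import List, Optional


def sample_indices(data_size: int, size: int, seed: Optional[int] = None) -> List[int]:
    """Pick `size` distinct row indices from range(data_size)."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    # random.sample draws without replacement and raises ValueError
    # if more items are requested than are available.
    return rng.sample(range(data_size), k=size)


# 14041 is the number of rows in the train split reported above.
indices = sample_indices(14041, 10, seed=0)
print(indices)
```

The resulting indices could then be passed to `get_sentence` one by one, in the same way `generate_corps` does.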


## Model requests

To query the models, we first build the request headers and body. We also retrieve the list of available models so that we can choose among them later.

In [239]:
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

Retrieve all model names:



In [None]:
def get_all_model_names():
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(API_MODEL_URL, headers=headers)
    if response.status_code // 100 == 2:
        data = response.json()
        models = data.get("data", [])
        model_names = [model["name"] for model in models]
        return model_names
    return None


In [241]:
model_names = get_all_model_names()

pprint(model_names)

['llama3.3:latest',
 'llama3.1:70b',
 'llama3.1:405b',
 'gemma2:27b',
 'mistral-large:123b',
 'command-r-plus:104b',
 'llama3.1:8b',
 'krith/qwen2.5-coder-32b-instruct:IQ3_M',
 'deepseek-coder-v2:236b',
 'llama3.2:latest',
 'mistral:7b',
 'RuadaptQwen2.5-32B-Pro-Beta:Q8',
 'deepseek-r1:14b',
 'deepseek-r1:70b',
 'deepseek-r1:7b',
 'deepseek-r1:8b',
 'qwen2.5-coder:1.5b',
 'qwen2.5-coder:32b-instruct-q8_0']


To build the request body, we construct several prompts that describe the task to the model in different ways.

In [None]:
prompt_1 = lambda sentence: f"Classify all named entities in a sentence and categorize their semantic meaning: '{sentence}'"

prompt_2 = lambda sentence, tags: f"Classify all named entities in a sentence: '{sentence}', based on following tags: {tags}"

prompt_3 = lambda sentence, pos_tags, chunk_tags, ner_tags: f"Classify all named entities in a sentence: '{sentence}', \
based on following parts of speech tags: {pos_tags}, \
based on following chunk tags: {chunk_tags}, \
based on following named entity recognition tags: {ner_tags}"

prompt_4 = lambda sentence, pos_tags, chunk_tags, ner_tags: f"Determine each entity that can be classified and assign them a pos_tag, a chunk_tag and a ner_tag, using following lists: \
parts of speech tags: [{pos_tags}], \
chunk tags: [{chunk_tags}], \
named entity recognition tags: [{ner_tags}] from the following: '{sentence}'." 

prompt_5 = lambda sentences, pos_tags, chunk_tags, ner_tags: f"""Determine each entity that can be classified and assign them a pos_tag, a chunk_tag and a ner_tag, using following lists: \
parts of speech tags: [{pos_tags}], \
chunk tags: [{chunk_tags}], \
named entity recognition tags: [{ner_tags}] from the following sentences: {', '.join(f"'{sentence}'" for sentence in sentences)}."""

prompt_4_1 = lambda sentence, pos_tags, chunk_tags, ner_tags: f"Determine each entity that can be classified, split it into tokens and assign them a pos_tag, a chunk_tag and a ner_tag, using following lists: \
parts of speech tags: [{pos_tags}], \
chunk tags: [{chunk_tags}], \
named entity recognition tags: [{ner_tags}] from the following: '{sentence}'." 
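
Because `pos_tags`, `chunk_tags` and `ner_tags` are dictionaries, interpolating them into an f-string inserts their full `repr`, including the integer ids. If only the label names should reach the model, they can be joined into a string first. A minimal sketch, where `format_tags` and `build_prompt` are hypothetical helpers and the prompt wording is copied from `prompt_2`:

```python
# Same mapping as defined earlier, repeated so the snippet is self-contained.
ner_tags = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
            'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


def format_tags(tag_map):
    """Return the tag names as a comma-separated string, without their ids."""
    return ", ".join(tag_map)  # iterating a dict yields its keys in insertion order


def build_prompt(sentence, tags):
    """Same wording as prompt_2, written as a regular function."""
    return (f"Classify all named entities in a sentence: '{sentence}', "
            f"based on following tags: {tags}")


sentence = "EU rejects German call to boycott British lamb ."
print(build_prompt(sentence, ner_tags))               # dict repr, ids included
print(build_prompt(sentence, format_tags(ner_tags)))  # label names only
```

Whether the shorter tag list changes the quality of the model's answers is something to check experimentally.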


A function that builds the request payload from a prompt, and a function that sends the request:

In [243]:
def gen_payload(prompt):
    return {
        "model": "llama3.3:latest",
        "messages": [
            {
                "role": "user", 
                "content": prompt
            } 
        ],
        "format": "json"
    }

def perform_request(prompt: str):
    return (
        requests.post(API_URL, headers=headers, json=gen_payload(prompt))
        .json()
    )   
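
The cells further down read the model's answer from `response["choices"][0]["message"]["content"]` and parse it as JSON. A model may return malformed JSON, and an error body may lack `choices` altogether, so a defensive parser can be useful. The sketch below assumes the response layout used in this notebook; `extract_content` is a hypothetical helper, and the two sample responses are hand-made, not real API output.

```python
import json


def extract_content(response):
    """Return the parsed JSON answer, or None if the response is unusable."""
    try:
        content = response["choices"][0]["message"]["content"]
        return json.loads(content)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        # Missing keys, an empty choices list, a non-string content
        # or invalid JSON all end up here.
        return None


# Hand-made responses imitating the assumed layout.
good = {"choices": [{"message": {"content": '{"entities": []}'}}]}
bad = {"error": "something went wrong"}
broken_json = {"choices": [{"message": {"content": "{not json"}}]}

print(extract_content(good))
print(extract_content(bad))
print(extract_content(broken_json))
```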

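An example of a single request. `generate_corps` returns a generator, so the sentence has to be taken out of it with `next` before it is formatted into the prompt. The output recorded below comes from a run in which the generator object itself ended up in the prompt, so the model tagged the text of its `repr` instead of a sentence:

In [None]:
# Take the single sentence out of the generator before building the prompt.
single_sentence = next(generate_corps(1, "train"))
single_prompt = prompt_4(single_sentence, pos_tags, chunk_tags, ner_tags)

response = perform_request(single_prompt)
# pprint(response)
pprint((json.loads(response["choices"][0]["message"]["content"]), single_prompt))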

({'entities': [{'chunk_tag': 'B-NP',
                'entity': '<generator',
                'ner_tag': 'O',
                'pos_tag': 'NN'},
               {'chunk_tag': 'I-NP',
                'entity': 'object',
                'ner_tag': 'O',
                'pos_tag': 'NN'},
               {'chunk_tag': 'I-NP',
                'entity': 'generate_corps',
                'ner_tag': 'O',
                'pos_tag': 'NN'},
               {'chunk_tag': 'I-NP',
                'entity': '<locals>',
                'ner_tag': 'O',
                'pos_tag': 'NN'},
               {'chunk_tag': 'I-NP',
                'entity': '<genexpr>',
                'ner_tag': 'O',
                'pos_tag': 'NN'},
               {'chunk_tag': 'B-PP',
                'entity': 'at',
                'ner_tag': 'O',
                'pos_tag': 'IN'},
               {'chunk_tag': 'I-NP',
                'entity': '0x7fb140311f50',
                'ner_tag': 'O',
                'pos_tag': 'CD'}]},
 'De

To get responses from the model faster, we send the requests in parallel with `multiprocessing`:

In [None]:
corps = tuple(tuple(generate_corps(5, "train")) for _ in range(5))
corps

In [None]:
prompts = tuple(prompt_5(corp, pos_tags, chunk_tags, ner_tags) for corp in corps)
with multiprocessing.Pool(10) as p:
    responses = p.map(perform_request, prompts)
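
HTTP requests are I/O-bound, so a thread pool is a lighter alternative to `multiprocessing.Pool` and does not need to start worker processes. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library; `fake_request` is a stand-in for `perform_request` so that the snippet runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(prompt):
    """Stand-in for perform_request: waits briefly and echoes the prompt."""
    time.sleep(0.05)  # simulate network latency
    return {"prompt": prompt}


prompts = [f"prompt {i}" for i in range(5)]

# executor.map yields results in the same order as the inputs,
# so responses[i] still corresponds to prompts[i].
with ThreadPoolExecutor(max_workers=5) as executor:
    responses = list(executor.map(fake_request, prompts))

print(responses)
```

In the notebook, `fake_request` would be replaced with `perform_request`; the number of workers is a guess and may need tuning to the API's rate limits.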

Let's parse each response into a dictionary and pair it with its batch of sentences:

In [251]:
pprint([(json.loads(response["choices"][0]["message"]["content"] ), corp) for response, corp in zip(responses, corps)])

[({'entities': [{'chunk_tag': 'B-NP',
                 'entity': 'The',
                 'ner_tag': 'O',
                 'pos_tag': 'DT'},
                {'chunk_tag': 'I-NP',
                 'entity': 'American Stock Exchange',
                 'ner_tag': 'B-ORG',
                 'pos_tag': 'NNP'},
                {'chunk_tag': 'B-VP',
                 'entity': 'said',
                 'ner_tag': 'O',
                 'pos_tag': 'VBD'},
                {'chunk_tag': 'B-NP',
                 'entity': 'there',
                 'ner_tag': 'O',
                 'pos_tag': 'EX'},
                {'chunk_tag': 'I-VP',
                 'entity': 'was',
                 'ner_tag': 'O',
                 'pos_tag': 'VBD'},
                {'chunk_tag': 'B-NP',
                 'entity': 'no after-hours activity',
                 'ner_tag': 'O',
                 'pos_tag': 'NN'}]},
  'Peace talks between the two sides were last held in February .'),
 ({'entities': [['Mauritius', 'B-LOC', 