# Generating a Dataset for an NER Task Using LLMs


In this project, we’re building a small dataset for a **Named Entity Recognition (NER)** task set in a fantasy-themed world.

We'll follow these steps:

1. Define entity types (e.g., People, Locations, Creatures, Artifacts).
2. Create a list of example entities for each type.
3. Write base sentences using placeholders like [PER], [LOC], etc.
4. Randomly replace those placeholders with actual entity names from the lists.
5. Format the final output in Prodigy format (text with character-offset spans) or a CoNLL 2003-style format (one token and BIO tag per line) to train or test NER models.
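
To make step 5 concrete, the sketch below shows how a single sentence might look in the two target shapes: a Prodigy-style record with character-offset spans, and token/tag pairs in the BIO scheme. The sentence, the `make_span` helper, and the hand-written token list are illustrative assumptions, not part of the pipeline that follows.

```python
import json

text = "Gandalf entered Mordor."

def make_span(text, surface, label):
    """Locate `surface` in `text` and return a Prodigy-style span dict.

    Hypothetical helper: it only finds the first occurrence.
    """
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "label": label}

# Prodigy-style record: raw text plus character-offset spans
record = {
    "text": text,
    "spans": [make_span(text, "Gandalf", "PER"), make_span(text, "Mordor", "LOC")],
}

# Offsets are end-exclusive, so slicing the text recovers each entity
for span in record["spans"]:
    print(span["label"], repr(text[span["start"]:span["end"]]))

# One JSON object per line (JSONL) is a common way to store such records
print(json.dumps(record))

# CoNLL-style view of the same sentence: one (token, BIO tag) pair per token
conll_tokens = [("Gandalf", "B-PER"), ("entered", "O"), ("Mordor", "B-LOC"), (".", "O")]
for token, tag in conll_tokens:
    print(token, tag)
```

Slicing `text[start:end]` is a cheap way to confirm that a span really points at the entity it claims to label.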



⚠️ Disclaimer: Names from Harry Potter and The Lord of the Rings are used solely for educational, non-commercial purposes in an open-source NLP demo. All rights belong to their respective owners.

🧙‍♂️ Prompt 1: For People, Locations, Creatures, Artifacts
“Give me 10 fantasy-style names for each of the following categories: characters (like wizards or elves), magical locations (castles, forests, cities), mythical creatures (beasts, spirits), and powerful artifacts (wands, swords, relics). The names should sound original and fit in a magical world like Harry Potter or Lord of the Rings.”

To reduce **bias** in the generated names, add constraints to the prompt such as:
- each inspired by a different region or mythology
- an equal mix of masculine, feminine, and gender-neutral names
- drawn from a wide range of linguistic roots
- not based in medieval Europe
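
One way to keep such constraints easy to add or remove is to assemble the prompt programmatically. The following is a minimal sketch: the `build_name_prompt` function and its parameters are hypothetical, and the wording is only loosely based on Prompt 1.

```python
def build_name_prompt(categories, n_names=10, constraints=()):
    """Assemble a name-generation prompt with optional diversity constraints."""
    category_text = ", ".join(categories)
    prompt = (
        f"Give me {n_names} fantasy-style names for each of the following "
        f"categories: {category_text}."
    )
    if constraints:
        # Each constraint becomes its own bullet so it is easy to add or remove
        bullet_list = "\n".join(f"- {c}" for c in constraints)
        prompt += "\nThe names should be:\n" + bullet_list
    return prompt

categories = ["characters", "magical locations", "mythical creatures", "powerful artifacts"]
constraints = [
    "each inspired by a different region or mythology",
    "an equal mix of masculine, feminine, and gender-neutral names",
    "drawn from a wide range of linguistic roots",
    "not based in medieval Europe",
]

print(build_name_prompt(categories, constraints=constraints))
```

The resulting string can be pasted into any chat interface or sent through whichever LLM client is in use; no particular API is assumed here.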


In [6]:
import random
import re

# Sample entity lists
people = [
    "Harry Potter", "Hermione Granger", "Ron Weasley", "Albus Dumbledore", "Severus Snape",
    "Frodo Baggins", "Gandalf", "Aragorn", "Legolas", "Galadriel"
]
locations = [
    "Hogwarts", "Hogsmeade", "The Burrow", "Ministry of Magic", "Diagon Alley",
    "Rivendell", "Shire", "Mordor", "Gondor", "Isengard"
]
creatures = [
    "Hippogriff", "Basilisk", "Dementor", "Thestral", "Acromantula",
    "Balrog", "Nazgûl", "Warg", "Orc", "Shelob"
]
artifacts = [
    "Elder Wand", "Invisibility Cloak", "Resurrection Stone", "Sword of Gryffindor", "Time-Turner",
    "One Ring", "Palantír", "Phial of Galadriel", "Andúril", "Sting"
]

🧙‍♂️ Prompt 2: "Create 10 original fantasy-themed sentences that include placeholders for named entities. Use [PER] for people, [LOC] for locations, [CRE] for creatures, and [ART] for magical artifacts. Each sentence should feel like it belongs in a fantasy novel or adventure log. The placeholders should be naturally embedded into the sentence context. Avoid using real-world names or locations."

Example sentences to include in the prompt as few-shot (in-context) examples:
- [PER] uncovered the [ART] deep beneath the ruins of [LOC].
- Only the [CRE] of [LOC] could sense the power hidden within the [ART].

In [7]:
# Base sentences with placeholders
base_sentences = [
    "[PER] stood at the gates of [LOC], clutching the [ART] with trembling hands.",
    "Legends say the [CRE] once ravaged [LOC] until [PER] rose to stop it.",
    "Deep in the vaults beneath [LOC], the [ART] lies guarded by a sleeping [CRE].",
    "When [PER] disappeared, only the [ART] remained, humming with cursed energy.",
    "[PER] summoned the [CRE] using the forbidden rites hidden within the [ART].",
    "The path to [LOC] is perilous, especially with the [CRE] lurking in the shadows.",
    "No one has entered [LOC] since [PER] unleashed the power of the [ART].",
    "It was foretold that [PER] would ride the [CRE] across the skies of [LOC].",
    "[ART] was never meant to be wielded by mortals—yet [PER] defied fate in [LOC].",
    "A single drop of blood on the [ART] awakened the wrath of the [CRE] near [LOC]."
]
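
LLM-generated templates may not always follow the requested placeholder scheme, so it can help to check them before filling them in. The sketch below is one possible check, assuming the four labels used in this project; `check_template` and the sample templates are hypothetical.

```python
import re

KNOWN_PLACEHOLDERS = {"PER", "LOC", "CRE", "ART"}

def check_template(template, known=KNOWN_PLACEHOLDERS):
    """Return a list of problems found in one template (empty if none)."""
    problems = []
    # Anything in square brackets is treated as a placeholder candidate
    found = re.findall(r"\[([^\[\]]*)\]", template)
    if not found:
        problems.append("no placeholders")
    for name in found:
        if name not in known:
            problems.append(f"unknown placeholder [{name}]")
    return problems

templates = [
    "[PER] uncovered the [ART] deep beneath the ruins of [LOC].",
    "The [MONSTER] fled from [LOC].",   # label the pipeline does not know
    "Nothing happened that night.",     # no entity slots at all
]

for t in templates:
    print(check_template(t) or "ok", "|", t)
```

Templates that report problems can be dropped or sent back to the model for regeneration before they reach the replacement step.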


In [8]:
# Mapping placeholder to entity type and list
entity_map = {
    "PER": ("PER", people),
    "LOC": ("LOC", locations),
    "CRE": ("CRE", creatures),
    "ART": ("ART", artifacts),
}

In [13]:
def replace_entities(sentence_template, entity_map):
    """Replace placeholders with actual entities."""
    sentence = sentence_template
    for placeholder, (_, entity_list) in entity_map.items():
        while f"[{placeholder}]" in sentence:
            entity = random.choice(entity_list)
            sentence = sentence.replace(f"[{placeholder}]", entity, 1)
    return sentence

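def find_entity_spans(sentence, entity_map):
    """Find non-overlapping character spans of known entities in the sentence."""
    candidates = []
    for label, entity_list in entity_map.values():
        for entity in entity_list:
            for match in re.finditer(re.escape(entity), sentence):
                candidates.append((match.start(), match.end(), label))
    # Earliest start first; for equal starts, longest match first
    candidates.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    spans = []
    for start, end, label in candidates:
        # Skip matches that overlap an already accepted span,
        # e.g. "Galadriel" (PER) inside "Phial of Galadriel" (ART)
        if spans and start < spans[-1][1]:
            continue
        spans.append((start, end, label))
    return spans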

def tokenize_and_tag(sentence, spans):
    """Tokenize and assign BIO tags."""
    tokens = []
    pos = 0
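    # Split into word tokens and single punctuation marks
    for word in re.findall(r"\w+|[^\w\s]", sentence):
        # Search from the end of the previous token so repeated words map to the right offset
        start = sentence.find(word, pos)
        end = start + len(word)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if span_start <= start < span_end:
                # The first token of a span gets B-, later tokens get I-
                tag = f"B-{label}" if start == span_start else f"I-{label}"
                break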
        tokens.append((word, tag))
    return tokens

def generate_conll_data(base_sentences, entity_map, num_samples=5):
    """Create a list of samples in CoNLL format."""
    dataset = []
    for _ in range(num_samples):
        template = random.choice(base_sentences)
        sentence = replace_entities(template, entity_map)
        spans = find_entity_spans(sentence, entity_map)
        tagged_tokens = tokenize_and_tag(sentence, spans)
        dataset.append((sentence, tagged_tokens))
    return dataset

def print_conll_format(dataset):
    """Print the data in a simplified CoNLL 2003-style format (token and BIO tag per line)."""
    for sentence, tokens in dataset:
        print(f"# Sentence: {sentence}")
        for token, tag in tokens:
            print(f"{token} {tag}")
        print()

In [15]:
dataset = generate_conll_data(base_sentences, entity_map, num_samples=3)
print_conll_format(dataset)

# Sentence: Galadriel summoned the Warg using the forbidden rites hidden within the Andúril.
Galadriel B-PER
summoned O
the O
Warg B-CRE
using O
the O
forbidden O
rites O
hidden O
within O
the O
Andúril B-ART
. O

# Sentence: It was foretold that Gandalf would ride the Shelob across the skies of Gondor.
It O
was O
foretold O
that O
Gandalf B-PER
would O
ride O
the O
Shelob B-CRE
across O
the O
skies O
of O
Gondor B-LOC
. O

# Sentence: Galadriel summoned the Thestral using the forbidden rites hidden within the Sword of Gryffindor.
Galadriel B-PER
summoned O
the O
Thestral B-CRE
using O
the O
forbidden O
rites O
hidden O
within O
the O
Sword B-ART
of I-ART
Gryffindor I-ART
. O
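
BIO output like the above can be read back into whole entities by grouping each `B-` tag with the `I-` tags that follow it. The sketch below is one simple way to do that; `bio_to_entities` is a hypothetical helper, and joining tokens with spaces is an assumption that only approximates the original text.

```python
def bio_to_entities(tagged_tokens):
    """Group (token, tag) pairs into (entity_text, label) tuples."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

tagged = [
    ("Galadriel", "B-PER"), ("summoned", "O"), ("the", "O"), ("Thestral", "B-CRE"),
    ("using", "O"), ("the", "O"), ("forbidden", "O"), ("rites", "O"),
    ("hidden", "O"), ("within", "O"), ("the", "O"),
    ("Sword", "B-ART"), ("of", "I-ART"), ("Gryffindor", "I-ART"), (".", "O"),
]
print(bio_to_entities(tagged))
# [('Galadriel', 'PER'), ('Thestral', 'CRE'), ('Sword of Gryffindor', 'ART')]
```

Because the tokenizer splits on punctuation, a name such as "Time-Turner" would come back as "Time - Turner" with this space-joined approach; keeping character offsets per token would avoid that.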



In [16]:
def generate_prodigy_data(base_sentences, entity_map, n_examples=10):
    """Create Prodigy-style examples: text plus character-offset spans."""
    # Matches any known placeholder, e.g. [PER] or [LOC]
    placeholder_re = re.compile(r"\[(" + "|".join(map(re.escape, entity_map)) + r")\]")
    data = []

    for _ in range(n_examples):
        sentence_template = random.choice(base_sentences)
        parts = []   # pieces of the filled sentence, in order
        spans = []
        cursor = 0   # current position in the template
        length = 0   # length of the filled sentence so far

        # Fill placeholders left to right so every offset is final when recorded
        for match in placeholder_re.finditer(sentence_template):
            literal = sentence_template[cursor:match.start()]
            label, entity_list = entity_map[match.group(1)]
            replacement = random.choice(entity_list)
            start_idx = length + len(literal)
            end_idx = start_idx + len(replacement)
            spans.append({"start": start_idx, "end": end_idx, "label": label})
            parts.extend([literal, replacement])
            length = end_idx
            cursor = match.end()

        # Append the text after the last placeholder
        parts.append(sentence_template[cursor:])
        data.append({
            "text": "".join(parts),
            "spans": spans
        })

    return data


In [17]:
examples = generate_prodigy_data(base_sentences, entity_map, n_examples=10)

for ex in examples:
    print(ex)

{'text': 'A single drop of blood on the Resurrection Stone awakened the wrath of the Warg near Rivendell.', 'spans': [{'start': 30, 'end': 48, 'label': 'ART'}, {'start': 75, 'end': 79, 'label': 'CRE'}, {'start': 85, 'end': 94, 'label': 'LOC'}]}
{'text': 'A single drop of blood on the Sting awakened the wrath of the Thestral near Isengard.', 'spans': [{'start': 30, 'end': 35, 'label': 'ART'}, {'start': 62, 'end': 70, 'label': 'CRE'}, {'start': 76, 'end': 84, 'label': 'LOC'}]}
{'text': 'A single drop of blood on the Sting awakened the wrath of the Acromantula near Shire.', 'spans': [{'start': 30, 'end': 35, 'label': 'ART'}, {'start': 62, 'end': 73, 'label': 'CRE'}, {'start': 79, 'end': 84, 'label': 'LOC'}]}
{'text': 'Invisibility Cloak was never meant to be wielded by mortals—yet Albus Dumbledore defied fate in Diagon Alley.', 'spans': [{'start': 0, 'end': 18, 'label': 'ART'}, {'start': 64, 'end': 80, 'label': 'PER'}, {'start': 96, 'end': 108, 'label': 'LOC'}]}
{'text': 'Hermione Granger 