# Generating a Dataset for an NER Task Using LLMs


In this project, we’re building a small dataset for a **Named Entity Recognition (NER)** task set in a fantasy-themed world.

We'll follow these steps:

1. Define entity types (e.g., People, Locations, Creatures, Artifacts).
2. Create a list of example entities for each type.
3. Write base sentences using placeholders like [PER], [LOC], etc.
4. Randomly replace those placeholders with actual entity names from the lists.
5. Format the final output as Prodigy JSONL or as CoNLL-2003-style BIO-tagged tokens to train or test NER models.



⚠️ Disclaimer: Names from Harry Potter and The Lord of the Rings are used solely for educational, non-commercial purposes in an open-source NLP demo. All rights belong to their respective owners.

🧙‍♂️ Prompt 1: For People, Locations, Creatures, Artifacts
“Give me 10 fantasy-style names for each of the following categories: characters (like wizards or elves), magical locations (castles, forests, cities), mythical creatures (beasts, spirits), and powerful artifacts (wands, swords, relics). The names should sound original and fit in a magical world like Harry Potter or Lord of the Rings.”

To reduce **bias** in the generated names, add phrases such as these to the prompt:
- each inspired by a different region or mythology
- an equal mix of masculine, feminine, and gender-neutral names
- a wide range of linguistic roots
- not based in medieval Europe
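
As a minimal sketch, phrases like these could be appended to a base prompt programmatically. The names `BASE_PROMPT`, `BIAS_HINTS`, and `build_prompt` are hypothetical, the base prompt is a shortened paraphrase of Prompt 1, and the resulting string would still need to be sent to whichever LLM you use.

```python
# Hypothetical helper: appends bias-reducing requirements to a base prompt.
BASE_PROMPT = (
    "Give me 10 fantasy-style names for each of the following categories: "
    "characters, magical locations, mythical creatures, and powerful artifacts."
)

BIAS_HINTS = [
    "each inspired by a different region or mythology",
    "an equal mix of masculine, feminine, and gender-neutral names",
    "a wide range of linguistic roots",
    "not based in medieval Europe",
]

def build_prompt(base_prompt, hints):
    """Return the base prompt followed by the hints as a bulleted requirement list."""
    if not hints:
        return base_prompt
    requirements = "\n".join(f"- {hint}" for hint in hints)
    return f"{base_prompt}\nThe names should be:\n{requirements}"

prompt = build_prompt(BASE_PROMPT, BIAS_HINTS)
print(prompt)
```

Keeping the hints in a list makes it easy to switch individual constraints on or off and compare the names the model returns.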


In [1]:
import random
import re

# Sample entity lists
PER = [
    "Jordan Peterson", "Andrew Tate", "Emma Watson", "Chimamanda Ngozi Adichie",
    "Ben Shapiro", "Anita Sarkeesian", "Candace Owens", "Gloria Steinem",
    "Brett Kavanaugh", "Ayaan Hirsi Ali", "Valerie Solanas", "Christina Hoff Sommers",
    "Rebecca Solnit", "Milo Yiannopoulos", "Roxane Gay",
    "Malala Yousafzai", "Tarana Burke", "Noor Tagouri", "Margaret Atwood",
    "Leymah Gbowee", "Justin Baldoni", "Adichie Obioma", "Amna Nawaz",
    "Trevor Noah", "Zainab Salbi", "Ngozi Okonjo-Iweala", "Hasan Minhaj",
    "Yara Shahidi", "Bell Hooks", "Laverne Cox"
]

ORG = [
    "UN Women", "National Organization for Women", "A Voice for Men", "The Red Pill subreddit",
    "Women's March", "HeForShe", "The Daily Wire", "Planned Parenthood",
    "Men's Rights Movement", "Feminist Majority Foundation", "Breitbart", "Jezebel",
    "International Women's Health Coalition", "PragerU", "Girl Up",
    "Ms. Foundation for Women", "Men's Health Network", "Equal Rights Advocates",
    "Gender Equality Council", "Women Deliver", "FAIR For All",
    "Men Are Human", "Feminist Frequency", "The Good Men Project",
    "Raising Voices", "Women's Media Center", "Male Survivor",
    "Institute for Gender Equality", "She Should Run", "Center for Masculinities and Social Justice"
]

EMOTION = [
    "anger", "resentment", "frustration", "hate", "outrage", "empowerment", "pride",
    "anxiety", "shame", "fear", "bitterness", "vulnerability", "sadness", "rage", "hope",
    "confusion", "guilt", "loneliness", "disgust", "envy", "helplessness",
    "relief", "despair", "determination", "joy", "numbness"
]

GENDER_ROLE = [
    "housewife", "breadwinner", "strong man", "submissive woman", "provider", "caregiver",
    "alpha male", "boss babe", "traditional wife", "stay-at-home mom", "working dad",
    "protector", "nurturer", "career woman", "homemaker",
    "gentleman", "modern dad", "feminine man", "emotionally available partner",
    "independent woman", "domineering husband", "stay-at-home dad",
    "assertive woman", "passive man", "masculine woman"
]

RELATIONSHIP = [
    "wife", "husband", "girlfriend", "boyfriend", "single mom", "single dad", "ex-wife",
    "baby mama", "fiancé", "partner", "spouse", "co-parent", "ex-husband", "divorcee", "fling",
    "stepfather", "stepmother", "roommate", "child’s father", "child’s mother",
    "dating partner", "former lover", "sugar daddy", "sugar baby", "romantic interest"
]

INSULT_TERMS = [
    "feminazi", "simp", "Karen", "incel", "cuck", "pick-me", "misandrist", "gold digger",
    "blue-pilled", "white knight", "man-hater", "soy boy", "radical feminist", "male tears",
    "toxic feminist",
    "deadbeat dad", "emotionless brute", "chauvinist", "neckbeard", "manchild",
    "macho creep", "womanizer", "broflake", "alpha poser", "nice guy",
    "entitled male", "fragile ego", "wannabe alpha", "mansplainer", "gaslighter"
]


🧙‍♂️ Prompt 2: "Create 10 original fantasy-themed sentences that include placeholders for named entities. Use [PER] for people, [LOC] for locations, [CRE] for creatures, and [ART] for magical artifacts. Each sentence should feel like it belongs in a fantasy novel or adventure log. The placeholders should be naturally embedded into the sentence context. Avoid using real-world names or locations."

Example Output (in-context few-shot learning):
- [PER] uncovered the [ART] deep beneath the ruins of [LOC].
- Only the [CRE] of [LOC] could sense the power hidden within the [ART].
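
LLM output does not always follow the requested placeholder scheme, so it may help to check generated templates before using them. The following is a small sketch under that assumption; `ALLOWED` and `check_template` are illustrative names, and the third template is a made-up example of a bad output.

```python
import re

ALLOWED = {"PER", "LOC", "CRE", "ART"}  # placeholder labels requested in Prompt 2

def check_template(template, allowed=ALLOWED):
    """Return (known, unknown) placeholder labels found in a template."""
    found = re.findall(r"\[([A-Za-z_]+)\]", template)
    known = [label for label in found if label in allowed]
    unknown = [label for label in found if label not in allowed]
    return known, unknown

templates = [
    "[PER] uncovered the [ART] deep beneath the ruins of [LOC].",
    "Only the [CRE] of [LOC] could sense the power hidden within the [ART].",
    "The [WIZARD] fled from [LOC].",  # made-up example with an unexpected label
]

for template in templates:
    known, unknown = check_template(template)
    status = "ok" if known and not unknown else f"check: unknown {unknown}"
    print(f"{status:<28} {template}")
```

Templates flagged here can be fixed by hand or dropped before the substitution step.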

In [2]:
# Base sentences with placeholders
base_sentences = [
    "[PER] accused the [ORG] of promoting [GENDER_ROLE] stereotypes.",
    "With a mix of [EMOTION], [PER] confronted their [RELATIONSHIP] about the comment.",
    "Many people labeled [PER] a [INSULT_TERMS] after the interview.",
    "The [ORG] defended [PER], saying the [INSULT_TERMS] accusations were baseless.",
    "[PER] shared a story that evoked strong [EMOTION] from both sides of the debate.",
    "Critics called the new policy by [ORG] a win for outdated [GENDER_ROLE] norms.",
    "As a proud [RELATIONSHIP], [PER] advocated against harmful [GENDER_ROLE] roles.",
    "[INSULT_TERMS] was trending on social media after [PER]'s statement.",
    "[PER] said their experiences as a [GENDER_ROLE] shaped their views on equality.",
    "The [ORG]'s campaign was described as full of [EMOTION] and lacking in facts.",
    "[PER] responded to the [INSULT_TERMS] slur by sharing their personal [EMOTION].",
    "[RELATIONSHIP] roles are often misunderstood, said [PER] in a speech to [ORG].",
    "Accusations of being a [INSULT_TERMS] didn’t stop [PER] from speaking at [ORG].",
    "The debate between [PER] and [PER] highlighted deep divides over [GENDER_ROLE].",
    "[PER] expressed [EMOTION] when questioned about their stance on [RELATIONSHIP] roles."
]



In [3]:
# Mapping placeholder to entity type and list
entity_map = {
    "PER": ("PER", PER),
    "ORG": ("ORG", ORG),
    "EMOTION": ("EMOTION", EMOTION),
    "GENDER_ROLE": ("GENDER_ROLE", GENDER_ROLE),
    "RELATIONSHIP": ("RELATIONSHIP", RELATIONSHIP),
    "INSULT_TERMS": ("INSULT_TERMS", INSULT_TERMS)
}


In [4]:
PLACEHOLDER_RE = re.compile(r"\[([A-Z_]+)\]")

def replace_entities(sentence_template, entity_map):
    """Fill placeholders left to right; return the sentence and its entity spans.

    Spans are recorded at insertion time rather than searched for afterwards,
    because some entities are substrings of others (e.g. "wife" in "housewife"),
    which makes a string search over the filled sentence ambiguous.
    """
    parts, spans = [], []
    cursor = length = 0  # position in the template / length of the output so far
    for match in PLACEHOLDER_RE.finditer(sentence_template):
        if match.group(1) not in entity_map:
            continue  # leave unknown bracketed text untouched
        label, entity_list = entity_map[match.group(1)]
        literal = sentence_template[cursor:match.start()]
        entity = random.choice(entity_list)
        start = length + len(literal)
        spans.append((start, start + len(entity), label))
        parts += [literal, entity]
        length = start + len(entity)
        cursor = match.end()
    parts.append(sentence_template[cursor:])
    return "".join(parts), spans

def tokenize_and_tag(sentence, spans):
    """Tokenize and assign BIO tags."""
    tokens = []
    pos = 0
    for word in re.findall(r"\w+|[^\w\s]", sentence):
        start = sentence.find(word, pos)
        end = start + len(word)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if span_start <= start < span_end:
                tag = f"B-{label}" if start == span_start else f"I-{label}"
                break
        tokens.append((word, tag))
    return tokens

def generate_conll_data(base_sentences, entity_map, num_samples=5):
    """Create a list of (sentence, tagged_tokens) samples."""
    dataset = []
    for _ in range(num_samples):
        template = random.choice(base_sentences)
        sentence, spans = replace_entities(template, entity_map)
        tagged_tokens = tokenize_and_tag(sentence, spans)
        dataset.append((sentence, tagged_tokens))
    return dataset

def print_conll_format(dataset):
    """Print the data in a simplified two-column CoNLL-style format (token, BIO tag)."""
    for sentence, tokens in dataset:
        print(f"# Sentence: {sentence}")
        for token, tag in tokens:
            print(f"{token} {tag}")
        print()

In [12]:
dataset = generate_conll_data(base_sentences, entity_map, num_samples=1)
print_conll_format(dataset)

# Sentence: Critics called the new policy by FAIR For All a win for outdated nurturer norms.
Critics O
called O
the O
new O
policy O
by O
FAIR B-ORG
For I-ORG
All I-ORG
a O
win O
for O
outdated O
nurturer B-GENDER_ROLE
norms O
. O
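
A quick way to sanity-check tagged output like the above is to decode the BIO tags back into entity strings and compare them with the entity lists. The sketch below assumes the `(token, tag)` pair layout used in this notebook; `bio_to_entities` is a hypothetical helper name, and the sample is an excerpt of the output shown above.

```python
def bio_to_entities(tagged_tokens):
    """Group (token, tag) pairs into (entity_text, label) tuples."""
    entities, current, current_label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open
            if current:
                entities.append((" ".join(current), current_label))
            current, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_label:
            current.append(token)
        else:
            # "O" (or an orphan "I-") ends the open entity, if any
            if current:
                entities.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:
        entities.append((" ".join(current), current_label))
    return entities

sample = [
    ("by", "O"), ("FAIR", "B-ORG"), ("For", "I-ORG"), ("All", "I-ORG"),
    ("a", "O"), ("outdated", "O"), ("nurturer", "B-GENDER_ROLE"), ("norms", "O"),
]
print(bio_to_entities(sample))
# [('FAIR For All', 'ORG'), ('nurturer', 'GENDER_ROLE')]
```

Tokens are rejoined with single spaces, so entities containing punctuation (for example an apostrophe) will not match their original spelling exactly.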



In [10]:
import random
import re
import spacy
import json

nlp = spacy.blank("en")

def generate_prodigy_data(base_sentences, entity_map, n_examples=20):
    """Create Prodigy-style NER examples with "text", "tokens", and "spans"."""
    placeholder_re = re.compile(r"\[([A-Z_]+)\]")
    examples = []

    for _ in range(n_examples):
        template = random.choice(base_sentences)

        # Fill placeholders left to right and record each entity's character
        # offsets as it is inserted. Searching the finished sentence with
        # str.find() would mislabel repeated or nested strings
        # (e.g. "wife" inside "housewife").
        parts, used_entities = [], []
        cursor = length = 0
        for match in placeholder_re.finditer(template):
            if match.group(1) not in entity_map:
                continue  # leave unknown bracketed text untouched
            label, entity_list = entity_map[match.group(1)]
            literal = template[cursor:match.start()]
            entity = random.choice(entity_list)
            start = length + len(literal)
            used_entities.append((start, start + len(entity), label))
            parts += [literal, entity]
            length = start + len(entity)
            cursor = match.end()
        parts.append(template[cursor:])
        filled_sentence = "".join(parts)

        # Tokenize with spaCy
        doc = nlp(filled_sentence)
        tokens = [
            {"text": token.text, "start": token.idx, "end": token.idx + len(token), "id": i}
            for i, token in enumerate(doc)
        ]

        # Map character offsets to token indices; char_span() returns None
        # when the offsets do not line up with token boundaries
        spans = []
        for start, end, label in used_entities:
            span = doc.char_span(start, end)
            if span is None:
                continue
            spans.append({
                "start": start,
                "end": end,
                "token_start": span.start,
                "token_end": span.end - 1,  # index of the last token (inclusive)
                "label": label
            })

        examples.append({
            "text": filled_sentence,
            "tokens": tokens,
            "spans": spans
        })

    return examples


In [11]:
examples = generate_prodigy_data(base_sentences, entity_map, n_examples=20)

with open("feminism_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

for ex in examples:
    print(ex)
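
Before loading the file into an annotation tool, it can be worth verifying that each span's character offsets and token indices agree. The sketch below uses only the standard library and assumes the record layout produced above ("text", "tokens", "spans", with an inclusive "token_end"); `validate_example` is a hypothetical helper, and the record is hand-written for illustration.

```python
import json

def validate_example(example):
    """Return a list of problems found in one Prodigy-style record."""
    problems = []
    text, tokens = example["text"], example["tokens"]
    for span in example["spans"]:
        first, last = tokens[span["token_start"]], tokens[span["token_end"]]
        # Character offsets must coincide with the first and last token of the span
        if first["start"] != span["start"] or last["end"] != span["end"]:
            problems.append(f"misaligned span: {span}")
        if not text[span["start"]:span["end"]].strip():
            problems.append(f"empty span: {span}")
    return problems

# Hand-written record for illustration
record = json.loads(
    '{"text": "Hello Apple", '
    '"tokens": [{"text": "Hello", "start": 0, "end": 5, "id": 0}, '
    '{"text": "Apple", "start": 6, "end": 11, "id": 1}], '
    '"spans": [{"start": 6, "end": 11, "token_start": 1, "token_end": 1, "label": "ORG"}]}'
)
print(validate_example(record))  # an empty list means no problems were found
```

The same function could be applied to every line of the JSONL file by parsing each line with `json.loads`.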