# Create Dataset

I'll need to create a dataset of 1000 entries per entity.
We have 12 entities.
500 entries will be bios.
500 entries will be interviews.
The key variables for each entity's 1000 entries are:
- Included attributes
    - Drop a random third of the attributes from each entry.
- Verbiage
    - Engineer the prompt to encourage diversity of verbiage.
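
A minimal sketch of the attribute dropout, assuming each entity's attributes live in a dict and that "a third" rounds down. The function name `drop_random_third` and the example attributes are hypothetical, for illustration only:

```python
import random

def drop_random_third(attributes, rng=None):
    """Return a copy of `attributes` with a random third of its keys removed."""
    rng = rng or random.Random()
    keys = list(attributes)
    n_drop = len(keys) // 3              # assumption: round down
    dropped = set(rng.sample(keys, n_drop))
    return {k: v for k, v in attributes.items() if k not in dropped}

# hypothetical attribute set for one entity (not real data)
entity = {
    "name": "Ada Example", "birth_year": 1980, "occupation": "engineer",
    "employer": "Acme", "hometown": "Springfield", "degree": "BSc",
}
kept = drop_random_third(entity, random.Random(0))  # seeded for reproducibility
print(sorted(kept))
```

Calling this once per generated entry gives each bio or interview a different subset of attributes.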


- Might need to do [automatic diversity checks](https://arxiv.org/pdf/2410.15226v2).
    - Compute Distinct-n, MAUVE, or the "LLM-cluster-agent" metric to flag near-duplicates every batch.
- Temperature sweeps and nucleus sampling.
- Deduplicate by embedding rows with Sentence-BERT and dropping any row whose cosine similarity to an already-kept row is ≥ 0.92.
- Use multiple (5?) prompt templates per task.
    - Mix writing styles for bios (e.g., neutral encyclopedia, informal blog, professional press release) and interviews (e.g., podcast, panel Q&A, magazine).
    - Randomly choose a different reader persona (e.g., hiring manager, tech journalist, graduate student) to increase diversity and downstream performance.
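
One way the template, style, persona, and sampling choices could be randomized per entry is sketched below. The temperature and top-p values are placeholder assumptions, not tuned settings, and `sample_prompt_spec` and `build_plan` are hypothetical helper names:

```python
import random

BIO_STYLES = ["neutral encyclopedia", "informal blog", "professional press release"]
INTERVIEW_STYLES = ["podcast", "panel Q&A", "magazine"]
PERSONAS = ["hiring manager", "tech journalist", "graduate student"]
N_TEMPLATES = 5                   # the "5?" prompt templates per task
TEMPERATURES = [0.7, 0.9, 1.1]    # assumed sweep values
TOP_PS = [0.9, 0.95, 1.0]         # assumed nucleus-sampling values

def sample_prompt_spec(task, rng):
    """Pick one random combination of generation settings for a single entry."""
    styles = BIO_STYLES if task == "bio" else INTERVIEW_STYLES
    return {
        "task": task,
        "template_id": rng.randrange(N_TEMPLATES),
        "style": rng.choice(styles),
        "persona": rng.choice(PERSONAS),
        "temperature": rng.choice(TEMPERATURES),
        "top_p": rng.choice(TOP_PS),
    }

def build_plan(n_bios=500, n_interviews=500, seed=0):
    """Build the per-entity plan: 500 bio specs followed by 500 interview specs."""
    rng = random.Random(seed)
    plan = [sample_prompt_spec("bio", rng) for _ in range(n_bios)]
    plan += [sample_prompt_spec("interview", rng) for _ in range(n_interviews)]
    return plan

plan = build_plan()
print(len(plan), plan[0])
```

Each spec would then be filled into the matching prompt template before calling the generator.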



## Diversity Gauges

Tips & Variants:
- Memory tight: Process the corpus in chronological order; store only a one-bit-per-entry mask of the 'kept' indices plus the kept embeddings.
- More/less aggressive: Raise the deduplication threshold (e.g., 0.95) to drop only closer matches and keep more text; lower it to drop more near-duplicates.

### Distinct-n

In [None]:
from collections import Counter
import re

def distinct_n(texts, n=1):
    """
    texts : list[str] – corpus
    n     : int       – n‑gram size
    returns float     – Distinct‑n score
    """
    total, unique = 0, set()
    for t in texts:
        # simple regex word tokenizer (lowercased \w+ runs); swap for nltk/spacy if needed
        tokens = re.findall(r"\w+", t.lower())
        grams  = zip(*[tokens[i:] for i in range(n)])
        grams  = [' '.join(g) for g in grams]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0

# example (assumes `corpus` is a list[str] of generated entries)
print("Distinct‑1:", distinct_n(corpus, 1))
print("Distinct‑2:", distinct_n(corpus, 2))
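
For reference, the score computed by `distinct_n` is the ratio of unique to total n-grams pooled over the whole corpus:

$$
\text{Distinct-}n = \frac{\left|\{\text{unique } n\text{-grams}\}\right|}{\text{total number of } n\text{-grams}}
$$

A small worked example, using two made-up sentences (not from the dataset), "the cat sat" and "the cat ran": there are 6 unigrams of which 4 are unique (the, cat, sat, ran), and 4 bigrams of which 3 are unique (the cat, cat sat, cat ran), so

$$
\text{Distinct-}1 = \frac{4}{6} \approx 0.67, \qquad \text{Distinct-}2 = \frac{3}{4} = 0.75
$$

Values closer to 1 indicate less repetition across the corpus.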

### Cosine-deduplication

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

def deduplicate(texts, model_name='all-MiniLM-L6-v2',
                threshold=0.92, batch_size=512):
    """
    Returns list[str] of texts with near-duplicates removed.
    A text is dropped if its cosine similarity to any kept text is >= threshold.
    """
    model  = SentenceTransformer(model_name)
    keep   = []
    vecs   = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        emb   = model.encode(batch, batch_size=len(batch), show_progress_bar=False,
                             normalize_embeddings=True)
        for j, t in enumerate(batch):
            # embeddings are unit-length, so a dot product is the cosine similarity;
            # checking row by row against `vecs` also catches duplicates within a batch
            if not vecs or (np.stack(vecs) @ emb[j]).max() < threshold:
                keep.append(t)
                vecs.append(emb[j])
    return keep

# usage
deduped_corpus = deduplicate(corpus, threshold=0.92)

In [None]:
raw = generate_synthetic_batch(...)
div_score_before = distinct_n(raw, 1)

clean = deduplicate(raw)          # remove near‑clones
div_score_after  = distinct_n(clean, 1)

assert div_score_after >= div_score_before * 0.9  # sanity guard
store(clean)