# Statistical models for aggregating LLM or human coders

This notebook shows how to use statistical models to aggregate labels produced by LLMs or human coders. Every item is labeled by multiple coders, and the goal is to combine those labels into a single label per item.

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score

In [2]:
from annoagg import DSModel, MACEModel, BACEModel

## MACE (Multi-Annotator Competence Estimation; Hovy et al., 2013)

$z_i \in \{1, \ldots, K\}$ is the true (unobserved) label of item $i$.
The base rates of the categories are also unknown and are stored in a
vector $\alpha \in \Delta_K$.

I have $J$ coders, and each gives me a coding for each item $i$, $x_{ij}$.
Each coder has competence $\beta_j \in [0, 1]$. Before a coder generates a
label, nature flips a coin: with probability $\beta_j$ the coder returns the
true label, and with the complementary probability the coder returns a
uniformly random guess (which may still match the true label by chance).


+ $z_i \sim \text{Categorical}(\alpha)$ : True (hidden) labels
+ $\beta_j$ : Coder j's competence parameter
+ $c_{ij} \sim \text{Bernoulli}(\beta_j)$ : Indicator that coder j reports the true label for item i
+ $g_{ij} \sim \text{Uniform}\{1, \ldots, K\}$ : Coder j's guess
+ $x_{ij} = c_{ij} z_i + (1-c_{ij}) g_{ij}$ : Observed labels from coder j for item i
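
The generative process above can be simulated directly. The following is a minimal sketch, not part of `annoagg`: the function name `simulate_mace` and the parameter values are illustrative assumptions, and labels are 0-based for convenience.

```python
import numpy as np


def simulate_mace(n_items, alpha, beta, seed=0):
    """Draw true labels z and observed labels x from the MACE process.

    Hypothetical helper for illustration; labels are 0-based.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    K, J = len(alpha), len(beta)

    z = rng.choice(K, size=n_items, p=alpha)    # z_i ~ Categorical(alpha)
    c = rng.random((n_items, J)) < beta         # c_ij ~ Bernoulli(beta_j)
    g = rng.integers(0, K, size=(n_items, J))   # g_ij ~ uniform over K categories
    x = np.where(c, z[:, None], g)              # x_ij = c_ij z_i + (1 - c_ij) g_ij
    return z, x


z, x = simulate_mace(20000, alpha=[0.5, 0.3, 0.2], beta=[0.9, 0.5])

# Share of labels that match the truth, per coder. In expectation this is
# beta_j + (1 - beta_j) / K, because a guess can be right by chance.
agreement = (x == z[:, None]).mean(axis=0)
print(agreement)
```

Note that the empirical agreement rate sits above $\beta_j$ itself, which is why the competence parameter cannot be read off raw accuracy.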





By Bayes' rule, we can solve for the probability that the true label is $k$
given that a single coder $j$ reported $x_{ij} = k$.

\begin{align*}
P(z_i = k | x_{ij} = k) &= \frac{ P(x_{ij} = k | z_i = k) \cdot
P(z_i = k)}{P(x_{ij} = k)} \\
& =
\frac{ P(x_{ij} = k | z_i = k) \cdot
P(z_i = k)}
{
\underbrace{P(x_{ij} = k | z_i = k)}_{[\beta_j + (1-\beta_j)/K]}
\underbrace{P(z_i = k)}_{\alpha_k} +
\underbrace{P(x_{ij} = k | z_i \neq k)}_{(1-\beta_j)/K}
\underbrace{P(z_i \neq k)}_{1-\alpha_k}
} \\
& =
\frac{[
  \beta_j + (1-\beta_j)/K]\alpha_k}{
[\beta_j + (1-\beta_j)/K]\alpha_k +
[(1-\beta_j)/K](1-\alpha_k)}
\end{align*}
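
As a quick numerical check of the last line, the formula can be evaluated for a few competence values. This is a small sketch; the function name `mace_posterior` and the chosen values of $\alpha_k$ and $K$ are illustrative assumptions.

```python
def mace_posterior(alpha_k, beta_j, K):
    """P(z_i = k | x_ij = k) for a single coder under MACE."""
    hit = beta_j + (1 - beta_j) / K   # P(x_ij = k | z_i = k)
    miss = (1 - beta_j) / K           # P(x_ij = k | z_i != k)
    return hit * alpha_k / (hit * alpha_k + miss * (1 - alpha_k))


# Five categories with base rate alpha_k = 0.2 for category k
for beta in (0.0, 0.5, 0.9, 1.0):
    print(beta, mace_posterior(alpha_k=0.2, beta_j=beta, K=5))
```

With $\beta_j = 0$ the coder's label carries no information and the posterior equals the prior $\alpha_k$; with $\beta_j = 1$ the posterior is 1.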


## BACE (Biased Annotator Competence Estimation; Tyler, 2021)

Coders may have different 'biases': when they guess, they need not choose uniformly across categories. The base rates of the categories need not be uniform either. To accommodate this, we alter the model as follows:

+ $z_i \sim \text{Categorical}(\alpha)$ : True (hidden) labels
+ $\beta_j$ : Coder j's competence parameter
+ $c_{ij} \sim \text{Bernoulli}(\beta_j)$ : Indicator that coder j reports the true label for item i
+ $g_{ij} \sim \text{Categorical}(\gamma_j)$ : Coder j's guess drawn from a distribution. When $\gamma_j = \frac{1}{K} \mathbf{1}$, this is the same as MACE.
+ $x_{ij} = c_{ij} z_i + (1-c_{ij}) g_{ij}$ : Observed labels from coder j for item i

Under BACE, 

$$
P(x_{ij} = k | z_i = k) = \beta_j + (1 - \beta_j) \gamma_{jk}
$$
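
To make this concrete, the full table of $P(x_{ij} = l \mid z_i = k)$ for one coder can be built as a $K \times K$ matrix. The sketch below is illustrative only: `bace_likelihood` is a hypothetical helper, not an `annoagg` function, and the values of $\beta_j$ and $\gamma_j$ are made up.

```python
import numpy as np


def bace_likelihood(beta_j, gamma_j):
    """K x K matrix L with L[k, l] = P(x_ij = l | z_i = k) for one coder."""
    gamma_j = np.asarray(gamma_j, dtype=float)
    K = len(gamma_j)
    # With probability beta_j the coder reports the truth (identity matrix);
    # otherwise the label is a guess drawn from gamma_j, whatever the truth is.
    return beta_j * np.eye(K) + (1 - beta_j) * np.tile(gamma_j, (K, 1))


# A coder with competence 0.7 whose guesses favour the middle category
L = bace_likelihood(0.7, [0.1, 0.8, 0.1])
print(L)
print(L.sum(axis=1))  # each row is a probability distribution over labels
```

The diagonal entries are $\beta_j + (1 - \beta_j)\gamma_{jk}$, and every off-diagonal entry in column $l$ is $(1 - \beta_j)\gamma_{jl}$.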

Accordingly, the label distribution given observed labels is 

\begin{align*}
P(z_i = k | x_{ij} = k) &= \frac{ P(x_{ij} = k | z_i = k) \cdot
P(z_i = k)}{P(x_{ij} = k)} \\
& =
\frac{ P(x_{ij} = k | z_i = k) \cdot
P(z_i = k)}
{
P(x_{ij} = k | z_i = k) P(z_i = k) +
P(x_{ij} = k | z_i \neq k) P(z_i \neq k)
} \\
& =
\frac{[
  \beta_j + (1-\beta_j)\gamma_{jk}]\alpha_k}{
[\beta_j + (1-\beta_j)\gamma_{jk}]\alpha_k +
(1-\beta_j)\gamma_{jk}(1-\alpha_k)}
\end{align*}
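
The effect of the guessing distribution on this posterior can be seen with a few numbers. This is a minimal sketch under assumed parameter values; `bace_posterior` is a hypothetical name used only for illustration.

```python
def bace_posterior(alpha_k, beta_j, gamma_jk):
    """P(z_i = k | x_ij = k) for a single coder under BACE."""
    hit = beta_j + (1 - beta_j) * gamma_jk   # P(x_ij = k | z_i = k)
    miss = (1 - beta_j) * gamma_jk           # P(x_ij = k | z_i != k)
    return hit * alpha_k / (hit * alpha_k + miss * (1 - alpha_k))


K = 5
uniform = bace_posterior(0.2, 0.5, 1 / K)  # gamma_jk = 1/K: the MACE case
favoured = bace_posterior(0.2, 0.5, 0.6)   # coder often guesses category k
rare = bace_posterior(0.2, 0.5, 0.05)      # coder rarely guesses category k
print(uniform, favoured, rare)
```

A label $k$ from a coder who often guesses $k$ is weaker evidence for $z_i = k$ than the same label from a coder who rarely guesses it, so the three values are ordered `rare > uniform > favoured`.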


In [3]:
def generate_synthetic_data(
    n_items=100, n_coders=5, n_classes=5, coder_reliabilities=None, seed=42
):
    """
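    Generate synthetic data with two groups of items having different coder behaviors.

    When a coder is not correct, the guess is drawn from a middle-peaked
    distribution for group A items and from a uniform distribution for
    group B items.

    Parameters:
    -----------
    n_items : int
        Number of items to generate
    n_coders : int
        Number of coders
    n_classes : int
        Number of possible classes/categories
    coder_reliabilities : array-like of length n_coders, optional
        Probability that each coder knows the true label (otherwise the
        coder guesses). If None, drawn from Uniform(0.4, 0.9).
    seed : int
        Random seed for reproducibility

    Returns:
    --------
    df : pandas DataFrame
        Generated data with columns [ii, jj, yy, group, true_label];
        ii, jj, yy and true_label are 1-based
    true_labels : numpy array
        Ground truth labels for items (0-based)
    group_ids : numpy array
        Group assignments for items
    coder_reliabilities : numpy array
        Reliability scores for each coder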
    """
    np.random.seed(seed)

    # Generate true labels
    true_labels = np.random.randint(0, n_classes, n_items)

    # Assign items to groups A and B
    group_ids = np.random.choice(["A", "B"], n_items)

    # Create coder reliabilities if not provided
    if coder_reliabilities is None:
        coder_reliabilities = np.random.uniform(0.4, 0.9, n_coders)

    # Generate coder-specific guessing distributions
    middle_category = (n_classes - 1) / 2  # Allow for even number of classes
    group_A_guess_dist = np.zeros((n_coders, n_classes))
    group_B_guess_dist = np.ones((n_coders, n_classes)) / n_classes

    # Create gaussian-like distribution for group A centered at middle category
    x = np.arange(n_classes)
    for j in range(n_coders):
        # Group A: Peak at middle category with gaussian-like distribution
        dist = np.exp(-0.5 * ((x - middle_category) / (n_classes / 5)) ** 2)
        group_A_guess_dist[j] = dist / dist.sum()

    # Generate labels
    data = []
    for i in range(n_items):
        for j in range(n_coders):
            # Determine if coder is correct
            is_correct = np.random.random() < coder_reliabilities[j]

            if is_correct:
                label = true_labels[i]
            else:
                # Choose guessing distribution based on group
                if group_ids[i] == "A":
                    label = np.random.choice(n_classes, p=group_A_guess_dist[j])
                else:
                    label = np.random.choice(n_classes, p=group_B_guess_dist[j])

            data.append(
                {
                    "ii": i + 1,  # 1-based indexing
                    "jj": j + 1,
                    "yy": label + 1,
                    "group": group_ids[i],
                    "true_label": true_labels[i] + 1,
                }
            )

    df = pd.DataFrame(data)

    # true_labels and group_ids are per item; true_labels is 0-based,
    # while the labels stored in df are 1-based
    return df, true_labels, group_ids, coder_reliabilities


In [4]:
def evaluate_models(df, true_labels, group_ids, printres=False):
    """
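    Fit and evaluate all three models (MACE, BACE, DS).

    If printres is True, print the overall and per-group accuracies together
    with a majority-voting baseline and return None; otherwise return a dict
    of per-model results.
    """
    # Note: the number of classes is inferred from the observed labels, so a
    # class that no coder ever used is not counted.
    n_classes = len(np.unique(df["yy"]))
    n_coders = len(np.unique(df["jj"]))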

    # Initialize models
    mace = MACEModel(n_classes, n_coders)
    bace = BACEModel(n_classes, n_coders)
    ds = DSModel(n_classes, n_coders)

    models = {"MACE": mace, "BACE": bace, "DS": ds}
    results = {}

    for name, model in models.items():
        # Fit model
        for _ in range(50):  # EM iterations
            model.map_update(df)

        # Get predictions
        final_results = model.calc_logliks(df)
        predictions = np.argmax(final_results["post_Z"], axis=1)

        # Calculate metrics
        acc = accuracy_score(true_labels, predictions)

        # Calculate accuracy per group
        group_A_mask = group_ids == "A"
        group_B_mask = group_ids == "B"
        acc_A = accuracy_score(true_labels[group_A_mask], predictions[group_A_mask])
        acc_B = accuracy_score(true_labels[group_B_mask], predictions[group_B_mask])

        results[name] = {
            "accuracy": acc,
            "accuracy_A": acc_A,
            "accuracy_B": acc_B,
            "predictions": predictions,
            "model": model,
        }

    if printres:
        maj_voting = pd.concat(
            [
                df.groupby("ii")["yy"].apply(lambda x: pd.Series.mode(x)[0]),
                df.groupby("ii")["true_label"].first(),
            ],
            axis=1,
        )
        maj_voting.columns = ["pred", "tru"]
        print(
            "Majority voting accuracy: ",
            accuracy_score(maj_voting["tru"], maj_voting["pred"]),
        )
        # Print results
        print("\nOverall Accuracy:")
        for name, res in results.items():
            print(f"{name}: {res['accuracy']:.3f}")

        print("\nGroup A Accuracy (Middle-category bias):")
        for name, res in results.items():
            print(f"{name}: {res['accuracy_A']:.3f}")

        print("\nGroup B Accuracy (Random guessing):")
        for name, res in results.items():
            print(f"{name}: {res['accuracy_B']:.3f}")

        # # Print estimated coder reliabilities
        # print("\nEstimated vs True Coder Reliabilities:")
        # print("True:", coder_reliabilities)
        # for name, res in results.items():
        #     if hasattr(res["model"], "beta"):
        #         print(f"{name}:", res["model"].beta)

        # # For BACE, show learned guessing distributions
        # if "BACE" in results:
        #     print("\nBACE Learned Guessing Distributions:")
        #     bace_model = results["BACE"]["model"]
        #     print(bace_model.gamma)
        return None
    return results

## Simulations


In [5]:
df, true_labels, group_ids, coder_reliabilities = generate_synthetic_data(
    n_coders=6, n_classes=5
)
df.head(10)

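|   | ii | jj | yy | group | true_label |
|---|----|----|----|-------|------------|
| 0 | 1 | 1 | 4 | B | 4 |
| 1 | 1 | 2 | 4 | B | 4 |
| 2 | 1 | 3 | 4 | B | 4 |
| 3 | 1 | 4 | 4 | B | 4 |
| 4 | 1 | 5 | 4 | B | 4 |
| 5 | 1 | 6 | 4 | B | 4 |
| 6 | 2 | 1 | 4 | A | 5 |
| 7 | 2 | 2 | 5 | A | 5 |
| 8 | 2 | 3 | 2 | A | 5 |
| 9 | 2 | 4 | 3 | A | 5 |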


In [6]:
evaluate_models(df, true_labels, group_ids, printres=True)

Majority voting accuracy:  0.88

Overall Accuracy:
MACE: 0.920
BACE: 0.930
DS: 0.900

Group A Accuracy (Middle-category bias):
MACE: 0.929
BACE: 0.946
DS: 0.911

Group B Accuracy (Random guessing):
MACE: 0.909
BACE: 0.909
DS: 0.886


In [7]:
df, true_labels, group_ids, coder_reliabilities = generate_synthetic_data(
    n_coders=3, n_classes=5
)
evaluate_models(df, true_labels, group_ids, printres=True)

Majority voting accuracy:  0.91

Overall Accuracy:
MACE: 0.890
BACE: 0.920
DS: 0.900

Group A Accuracy (Middle-category bias):
MACE: 0.893
BACE: 0.929
DS: 0.893

Group B Accuracy (Random guessing):
MACE: 0.886
BACE: 0.909
DS: 0.909


In [8]:
df, true_labels, group_ids, coder_reliabilities = generate_synthetic_data(
    n_coders=3, n_classes=20
)
evaluate_models(df, true_labels, group_ids, printres=True)

Majority voting accuracy:  0.82

Overall Accuracy:
MACE: 0.840
BACE: 0.860
DS: 0.320

Group A Accuracy (Middle-category bias):
MACE: 0.807
BACE: 0.825
DS: 0.316

Group B Accuracy (Random guessing):
MACE: 0.884
BACE: 0.907
DS: 0.326


In [9]:
df, true_labels, group_ids, coder_reliabilities = generate_synthetic_data(
    n_coders=5, n_classes=10
)
evaluate_models(df, true_labels, group_ids, printres=True)

Majority voting accuracy:  0.81

Overall Accuracy:
MACE: 0.880
BACE: 0.850
DS: 0.840

Group A Accuracy (Middle-category bias):
MACE: 0.895
BACE: 0.860
DS: 0.825

Group B Accuracy (Random guessing):
MACE: 0.860
BACE: 0.837
DS: 0.860


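## LLM labels

Hypothetically, the same models can aggregate labels produced by LLMs, treating each combination of model and sampling temperature as a separate coder: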

```python
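import pandas as pd
from annoagg import BACEModel

temperatures = [0.1, 0.3, 0.7, 1.0, 1.5]
models = ["gpt-4", "gpt-3.5", "claude", "llama"]

# Treat each (model, temperature) pair as one coder
coders = [(model, temp) for model in models for temp in temperatures]

# Generate labels. `items` (a dict mapping item_id -> prompt), `num_classes`,
# and `get_model_response` are placeholders to be supplied by the user;
# `get_model_response` is assumed to return an integer label in 1..num_classes.
labels = []
for item_id, prompt in items.items():
    # Integer coder ids, matching the format of the synthetic data above
    for coder_id, (model, temp) in enumerate(coders, start=1):
        response = get_model_response(prompt, model=model, temperature=temp)
        labels.append({"ii": item_id, "jj": coder_id, "yy": response})
df_llm = pd.DataFrame(labels)

# Use BACE to aggregate
bace = BACEModel(n_classes=num_classes, n_coders=len(coders))
for _ in range(50):  # EM iterations, as in evaluate_models
    bace.map_update(df_llm)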
```