# Advanced Label Models
## IMD3011 - Datacentric AI
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Keypoints

- **Programmatic Weak Supervision (PWS):** This notebook explores advanced label models regarding PWS, where labeling functions (LFs) are used to generate noisy labels for training data.

- **Label Aggregation Challenge:** Combining outputs from multiple LFs, which can be noisy and conflicting, is a central challenge in weak supervision. Label models are designed to address this challenge.

- **Majority Label Voter:** A simple baseline label model that predicts labels based on the majority vote of LFs.

- **Snorkel MeTaL Label Model:**  An advanced label model from the Snorkel library, used as a comparative baseline.

- **Hyper Label Model (HyperLM):** A dataset-agnostic label model using Graph Neural Networks (GNNs) that can be applied to new datasets without retraining. It exists in unsupervised and semi-supervised versions.

- **Dawid & Skene Model:** A probabilistic model that uses the Expectation-Maximization (EM) algorithm to estimate the reliability of each LF (acting as annotators) and infer true labels.

- **Snorkel Generative Model:**  The predecessor to Snorkel MeTaL, it estimates LF accuracy and correlations to produce probabilistic labels.

- **FlyingSquid Model:** A computationally efficient model that uses closed-form solutions based on triplet decomposition and a binary Ising model to aggregate labels.

- **Crowdlab Model:** Integrates predictions from LFs with a classifier, estimates LF quality, and can be used to refine LF sets by removing noisy LFs. It leverages confident learning techniques.

- **Importance of End Models:**  Label models are often paired with downstream "end models" (like SGDClassifier) for effective generalization to new, unseen data. Label models refine noisy labels, and end models learn to generalize from these refined labels.

- **Noisy Labeling Functions Removal:** Crowdlab can be used to identify and remove less reliable or noisy LFs based on quality scores, improving overall label quality.


## Learning goals

By the end of this class, you will be able to:

1. **Explain** the challenges of label aggregation in weak supervision scenarios, particularly when using multiple noisy and potentially conflicting labeling functions.

2. **Describe** the fundamental principles, assumptions, and key components of various advanced label models, including Majority Label Voter, Snorkel MeTaL, HyperLM (Unsupervised and Semi-supervised), Dawid & Skene, Snorkel Generative Model, FlyingSquid, and Crowdlab.

3. **Execute** and **apply** different label models using Python libraries such as Snorkel and HyperLM to aggregate outputs from a set of labeling functions for a given dataset.

4. **Evaluate** and **compare** the performance of different label models in terms of label quality using appropriate metrics like Matthews Correlation Coefficient, and analyze their impact on downstream classification tasks.

5. **Utilize** the Crowdlab framework to assess the quality and reliability of individual labeling functions within a weak supervision pipeline.

6. **Design and apply** a strategy to refine a set of labeling functions by identifying and removing noisy or less effective LFs based on quality scores provided by Crowdlab.

7. **Explain** the cooperative relationship between label models and downstream discriminative "end models" in achieving effective generalization in weakly supervised learning.


# Our Dataset

During [Notebook 03](Notebook_03.ipynb), we have created 113 labeling functions to label our sentiment classification dataset. We've used the `snorkel` library to create these labeling functions and we've also used the `LabelModel` to combine these labeling functions into a single model.

In this notebook, we will use these LFs outputs to dive deeper into other label models. We'll also explore the concept of influence functions and how they can be used to improve our models.

Let's load our dataset with the LF outputs and prepare it to be used in the next steps.

In [1]:
import numpy as np

np.bool = np.bool_

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd
from snorkel.labeling.model import LabelModel, MajorityLabelVoter
from snorkel.labeling.model.label_model import LabelModel

In [2]:
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)

In [3]:
df_train = pd.read_parquet("data/b2w/train.parquet")
df_test = pd.read_parquet("data/b2w/test_cleaned_with_labels.parquet")

In [4]:
# We'll use 10% of the test data as development data

from sklearn.model_selection import train_test_split

df_valid, df_test = train_test_split(
    df_test, train_size=2718, random_state=271828, stratify=df_test["label"]
)

df_train.reset_index(drop=True, inplace=True)
df_valid.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

In [5]:
df_train.shape, df_valid.shape, df_test.shape

((102326, 3), (2718, 4), (23340, 4))

In [6]:
import re

# You can try the list of LFs from Notebook 03. We'll use a smaller set for now.

# List of regexes for positive sentiment
positive_patterns = [
    re.compile(r"(?<!n[ãa]o\s)(gostei|gostou|gostar)", re.IGNORECASE),
    re.compile(r"(?:dentro do|no|antes do) prazo", re.IGNORECASE),
    re.compile(r"(?<!n[ãa]o\s)recomend", re.IGNORECASE),
    re.compile(r"lind[oa]|bonit]oa]|cheiros[oa]|cheirinh|fresc", re.IGNORECASE),
    re.compile(r"\bbo[ma]\b", re.IGNORECASE),
    re.compile(r"[oó]timo|feliz|\bbo[am]\b|resistente|forte", re.IGNORECASE),
    re.compile(
        r"(?<!n[ãa]o\s)(gostei|legal|atencios[oa]|maravilh|\bamei\b|\bamou\b)",
        re.IGNORECASE,
    ),
    re.compile(r"\bfeliz|satisf[ae]"),
    re.compile(r"(?<!n[ãa]o\s)(plenamente|completamente|inteiramente)", re.IGNORECASE),
    re.compile(r"(?<!n[ãa]o\s)(faltou|falta|faltando)", re.IGNORECASE),
    re.compile(r"\bcert[ao]\b|\bcertinh[oa]\b", re.IGNORECASE),
    re.compile(r"\bshow\b|\btop\b|\barrasan|\barrasa\b", re.IGNORECASE),
]

# List of regexes for negative sentiment
negative_patterns = [
    re.compile(r"n[aã]o gost", re.IGNORECASE),
    re.compile(r"(?:fora do|depois do) prazo|atraso|atrasado", re.IGNORECASE),
    re.compile(r"usado|danificado|problema", re.IGNORECASE),
    re.compile(r"n[ãa]o recomend", re.IGNORECASE),
    re.compile(r"nunca mais", re.IGNORECASE),
    re.compile(r"ruim|p[eé]ssim", re.IGNORECASE),
    re.compile(r"n[aã]o funciona|\bquebr[aeo]", re.IGNORECASE),
    re.compile(
        r"ruim|p[eé]ssim[ao]|pior|fr[aá]gil|frac[ao]|horr[ií]vel|horroroso|detest",
        re.IGNORECASE,
    ),
    re.compile(r"n[aã]o gostei|ruim|descaso|\bmal\b", re.IGNORECASE),
    re.compile(r"\btriste|chatead|engan[oa]", re.IGNORECASE),
    re.compile(
        r"proco[nm]|reclam.{1,5}aqui|justi[cç]a|judici[aá]rio|pequenas causas|advogad",
        re.IGNORECASE,
    ),
    re.compile(
        r"(?:parou|deixou) de funcionar|quebrad[oa]|trincad[oa]|rachad[oa]|amassad[oa]|faltou|faltando|estrag|suj[oa]|\bmancha",
        re.IGNORECASE,
    ),
    re.compile(r"nunca mais|n[aã]o recebi", re.IGNORECASE),
    re.compile(r"\bn[ãa]o.{1,8}entreg", re.IGNORECASE),
    re.compile(r"\berrad[ao]", re.IGNORECASE),
    re.compile(r"estorn|devolvi|devolu[cç][aã]o|cancel"),
]

In [7]:
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

In [8]:
# Import necessary functions from the helpers.lf module
from helpers.lf import int_to_alphabetic_string, create_labeling_functions_from_regex

# Initialize an empty list to store the regex patterns along with their corresponding labels and names
regex_patterns = []

# Iterate over the positive patterns and create tuples with pattern, name, positive label, and abstain label
for i, pattern in enumerate(positive_patterns):
    # Convert the index to an alphabetic string for naming the labeling function
    name = f"positive_{int_to_alphabetic_string(i + 1)}"
    # Append the tuple to the regex_patterns list
    regex_patterns.append((pattern, name, POSITIVE, ABSTAIN))

# Iterate over the negative patterns and create tuples with pattern, name, negative label, and abstain label
for i, pattern in enumerate(negative_patterns):
    # Convert the index to an alphabetic string for naming the labeling function
    name = f"negative_{int_to_alphabetic_string(i + 1)}"
    # Append the tuple to the regex_patterns list
    regex_patterns.append((pattern, name, NEGATIVE, ABSTAIN))

# Create labeling functions from the regex patterns using the helper function
lfs = create_labeling_functions_from_regex(regex_patterns)

# Display the list of labeling functions
lfs

[LabelingFunction lf_regex_positive_a, Preprocessors: [],
 LabelingFunction lf_regex_positive_b, Preprocessors: [],
 LabelingFunction lf_regex_positive_c, Preprocessors: [],
 LabelingFunction lf_regex_positive_d, Preprocessors: [],
 LabelingFunction lf_regex_positive_e, Preprocessors: [],
 LabelingFunction lf_regex_positive_f, Preprocessors: [],
 LabelingFunction lf_regex_positive_g, Preprocessors: [],
 LabelingFunction lf_regex_positive_h, Preprocessors: [],
 LabelingFunction lf_regex_positive_i, Preprocessors: [],
 LabelingFunction lf_regex_positive_j, Preprocessors: [],
 LabelingFunction lf_regex_positive_k, Preprocessors: [],
 LabelingFunction lf_regex_positive_l, Preprocessors: [],
 LabelingFunction lf_regex_negative_a, Preprocessors: [],
 LabelingFunction lf_regex_negative_b, Preprocessors: [],
 LabelingFunction lf_regex_negative_c, Preprocessors: [],
 LabelingFunction lf_regex_negative_d, Preprocessors: [],
 LabelingFunction lf_regex_negative_e, Preprocessors: [],
 LabelingFunct

In [9]:
from snorkel.labeling import LFAnalysis, PandasLFApplier

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_valid = applier.apply(df=df_valid)
L_test = applier.apply(df=df_test)

  0%|          | 0/102326 [00:00<?, ?it/s]

100%|██████████| 102326/102326 [00:59<00:00, 1708.19it/s]
100%|██████████| 2718/2718 [00:01<00:00, 1680.29it/s]
100%|██████████| 23340/23340 [00:15<00:00, 1542.39it/s]


In [10]:
LFAnalysis(L=L_train, lfs=lfs).label_coverage()

0.8387995230928601

In [11]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_regex_positive_a,0,[1],0.134208,0.130065,0.012861
lf_regex_positive_b,1,[1],0.125902,0.102652,0.009401
lf_regex_positive_c,2,[1],0.203086,0.165207,0.010916
lf_regex_positive_d,3,[1],0.04427,0.036472,0.00256
lf_regex_positive_e,4,[1],0.262338,0.262338,0.028292
lf_regex_positive_f,5,[1],0.399107,0.354182,0.042238
lf_regex_positive_g,6,[1],0.184802,0.169859,0.012402
lf_regex_positive_h,7,[1],0.069015,0.055323,0.011571
lf_regex_positive_i,8,[1],0.003469,0.002824,0.001231
lf_regex_positive_j,9,[1],0.021539,0.01805,0.015587


In [12]:
from sentence_transformers import SentenceTransformer  # pip install tf-keras
from pathlib import Path
import joblib

# Define the path to the pre-trained SentenceTransformer model
PATH_LM = "data/bin/assin2_sentence-transformer"
PATH_CACHE = Path("outputs/advanced_ws/")

PATH_CACHE.mkdir(parents=True, exist_ok=True)

if not (PATH_CACHE / "X_train.joblib").exists():

    # Load the SentenceTransformer model from the specified path
    model_st = SentenceTransformer(PATH_LM)

    # Encode the cleaned training text data using the SentenceTransformer model
    # batch_size: number of samples to process at a time
    # show_progress_bar: display a progress bar during encoding
    # convert_to_tensor: return the encoded data as a NumPy array (False) instead of a PyTorch tensor
    X_train = model_st.encode(
        df_train.text.values,
        batch_size=128,
        show_progress_bar=True,
        convert_to_tensor=False,
    )
    X_valid = model_st.encode(
        df_valid.text.values,
        batch_size=128,
        show_progress_bar=True,
        convert_to_tensor=False,
    )
    X_test = model_st.encode(
        df_test.text.values,
        batch_size=128,
        show_progress_bar=True,
        convert_to_tensor=False,
    )

    joblib.dump(X_train, PATH_CACHE / "X_train.joblib")
    joblib.dump(X_valid, PATH_CACHE / "X_valid.joblib")
    joblib.dump(X_test, PATH_CACHE / "X_test.joblib")

else:
    X_train = joblib.load(PATH_CACHE / "X_train.joblib")
    X_valid = joblib.load(PATH_CACHE / "X_valid.joblib")
    X_test = joblib.load(PATH_CACHE / "X_test.joblib")

In [13]:
y_valid = df_valid.label.values
y_test = df_test.label.values

y_valid.shape, y_test.shape

((2718,), (23340,))

In [14]:
# Get the columns of the LFs (Labeling Functions) that abstained on all rows of the training set
# This creates a boolean mask where True indicates the LF abstained on all rows
all_abstained_lfs = np.all(L_train == -1, axis=0)

# Remove the abstained LFs from the list of LFs
# This filters out LFs that abstained on all rows by using the boolean mask
lfs = [lf for lf, abstained in zip(lfs, all_abstained_lfs) if not abstained]

# Remove the abstained LFs from the training and validation label matrices
# This keeps only the columns (LFs) that did not abstain on all rows
L_train = L_train[:, ~all_abstained_lfs]
L_valid = L_valid[:, ~all_abstained_lfs]

# Get the rows where all LFs abstained in the training set
# This creates a boolean mask where True indicates all LFs abstained for that row
all_abstained_rows = np.all(L_train == -1, axis=1)

# Remove the rows where all LFs abstained from the training set and feature matrix
# This filters out rows where all LFs abstained by using the boolean mask
L_train = L_train[~all_abstained_rows]
X_train = X_train[~all_abstained_rows]

# Update the training DataFrame to reflect the removed rows
# This filters out rows in the DataFrame where all LFs abstained and resets the index
df_train = df_train.iloc[~all_abstained_rows]
df_train.reset_index(drop=True, inplace=True)

In [15]:
X_train.shape, L_train.shape, df_train.shape

((85831, 768), (85831, 28), (85831, 3))

In [16]:
X_test.shape, L_test.shape, df_test.shape

((23340, 768), (23340, 28), (23340, 4))

In [17]:
X_valid.shape, L_valid.shape, df_valid.shape

((2718, 768), (2718, 28), (2718, 4))

In [18]:
df_train

Unnamed: 0,source,review_id,text
0,b2w,79d9a98a62d9adff5e2c8e2bed824e4d524695e0a1e235...,nao gostei do produto! - o acabamento e muito ...
1,b2w,5177b7800f360f47ccd69afa43def2180777c1f6a3b26d...,"produto nao funciona - produto nao funcio, vei..."
2,b2w,fc0cc6d9c2e4539762936bcb5b0e855df6b5e08229b00d...,"nao recebi, portanto nao conheco o produto - p..."
3,b2w,b3a8b907623ceece9aef15896c82a6ea3d932be9f8a856...,maravilhoso - parabens pela eficiencia na entr...
4,b2w,2a4088fd001bddc6b3b3cfc1eacd7df77f365a96c1d457...,produto bom. - demora na entrega. produto bom....
...,...,...,...
85826,b2w,60edcbe9ac5d63cad10f859cd7434940a9cb9968d69f56...,"frete - eu sempre compro na internet, porem o ..."
85827,b2w,8d1a158926a1d0ced1f34e5cca097582628f9ca32705c4...,"amei o produto - meu filho ficou encantado, le..."
85828,olist,64408d18914abe811a12225c1da97beb,"chegou antes do prazo, produto de acordo com o..."
85829,b2w,a24a9e978f550958d912ecc6598d3df5092f8bd0c6d21d...,elogio - estou muito satisfeita com minha comp...


In [19]:
df_valid

Unnamed: 0,label,source,review_id,text
0,0,b2w,52059e633d1d3d99db67fe9b0dd9ab02313a60c27e47c0...,Fraca - Produto nao aquece o suficiente. Nao a...
1,1,b2w,99e4bad489afcc355d58f315c7bff6dba6b0547e073447...,Ótima qualidade! - Rodapés prontos para instal...
2,0,b2w,1994c6c30f08b31a3f5024868381595e9f157c3b901638...,Não foi entregue - Faz mais de um mês que comp...
3,1,b2w,326e7e6098b15d792fbf99c4649f0ea6ddca50b0861bda...,excelente compra - fiquei muito satisfeito com...
4,1,b2w,c78621b567856fa2b86f43858c4bfeedb442036b5ac320...,Praticidade na limpeza da casa! - Amei o produ...
...,...,...,...,...
2713,1,b2w,a08a04ac3f59185da24eb3c58237ceb75415e0fd0fc229...,Rápida para preparar as receitas e fácil de li...
2714,1,b2w,5dddbb02cc88d1695e16e19aee0bf650163b5157fc843c...,Ótimo livro - Comprei para auxiliar em meu pro...
2715,0,b2w,fb6e74ccd2b79eafeb2872770dda18eb4626fa4e978eb1...,NÃO CARREGA 1/3 DA BATERIA ORIGINAL - A foto d...
2716,1,b2w,95aa3ebeac14c6e8a4c9410feeb14aece55a006f46743a...,"Muito bom recomendo - Achei que fosse menor, o..."


In [20]:
df_test

Unnamed: 0,label,source,review_id,text
0,0,olist,10467dc456e818c60bd750271b5d183e,"atraso na entrega da mercadoria,não cumpre pra..."
1,1,b2w,9457791296d29bd13b2495f820c64bf55e129447c57af9...,"EXCELENTE - EXCELENTE PRODUTO, RECOMENDO SEMPR..."
2,1,olist,e4879e4146d2a906e573e7dfdccfe163,"Entrega rápida, eficiente, produto de acordo c..."
3,0,b2w,c26604478b8cc31f11328e5146746cc1a66e5902f11199...,Onde está o produto? - Não posso avaliar um pr...
4,0,b2w,e163d49d72091e26cec076edce577dd429cdedd100e7bb...,Descrição site - O produto parece exelente e o...
...,...,...,...,...
23335,1,olist,f2ab136899d66379c081ff7df3a49764,Impecável - Sensacional compra. Tudo perfeito....
23336,1,b2w,86627126f8cd43b0114eebc6b98ccb47031ff935a8c5a1...,bom produto. - cumpre bem seu papel. samsung ...
23337,1,b2w,e91cd2435aa32bb37077e20d5c479f27cc2d94f1f4f5fe...,Ótimo custo benefício - O mais potente de todo...
23338,0,olist,2286d30e0344664fb0a24b56a644f9bf,ola pessoal nao entendo porque recebi so os ta...


In [21]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_regex_positive_a,0,[1],0.16,0.155061,0.015332
lf_regex_positive_b,1,[1],0.150097,0.12238,0.011208
lf_regex_positive_c,2,[1],0.242115,0.196957,0.013014
lf_regex_positive_d,3,[1],0.052778,0.043481,0.003053
lf_regex_positive_e,4,[1],0.312754,0.312754,0.033729
lf_regex_positive_f,5,[1],0.475807,0.422248,0.050355
lf_regex_positive_g,6,[1],0.220317,0.202503,0.014785
lf_regex_positive_h,7,[1],0.082278,0.065955,0.013795
lf_regex_positive_i,8,[1],0.004136,0.003367,0.001468
lf_regex_positive_j,9,[1],0.025678,0.021519,0.018583


In [22]:
LFAnalysis(L=L_valid, lfs=lfs).lf_summary(Y=df_valid.label.values)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_regex_positive_a,0,[1],0.13613,0.132818,0.013245,349,21,0.943243
lf_regex_positive_b,1,[1],0.131714,0.107064,0.01067,335,23,0.935754
lf_regex_positive_c,2,[1],0.201619,0.163723,0.009566,538,10,0.981752
lf_regex_positive_d,3,[1],0.051508,0.041943,0.003679,135,5,0.964286
lf_regex_positive_e,4,[1],0.258278,0.258278,0.029065,648,54,0.923077
lf_regex_positive_f,5,[1],0.395143,0.354673,0.043782,1000,74,0.931099
lf_regex_positive_g,6,[1],0.185798,0.168506,0.012141,497,8,0.984158
lf_regex_positive_h,7,[1],0.054452,0.044518,0.009566,118,30,0.797297
lf_regex_positive_i,8,[1],0.001104,0.001104,0.000368,1,2,0.333333
lf_regex_positive_j,9,[1],0.01766,0.014717,0.013245,8,40,0.166667


In [23]:
LFAnalysis(L=L_train, lfs=lfs).label_coverage()

1.0

In [24]:
LFAnalysis(L=L_valid, lfs=lfs).label_coverage()

0.8421633554083885

In [25]:
class_balance = (
    pd.Series(df_valid.label).value_counts(normalize=True).sort_index().values
)
class_balance

array([0.30206034, 0.69793966])

In [26]:
# A trick to approximate class balance without using a validation set is to use the MajorityLabelVoter model on the LFs

majority_model = MajorityLabelVoter(cardinality=2)
preds_train = majority_model.predict(
    L=L_train, tie_break_policy="random"
)  # Random tie-breaking

# Calculate class balance
approximated_class_balance = np.unique(preds_train, return_counts=True)[1] / len(
    preds_train
)
approximated_class_balance

array([0.24668243, 0.75331757])

In [27]:
from typing import Any
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import matthews_corrcoef
import helpers.classification


def train_and_evaluate_model(
    X_train: Any,
    y_train: Any,
    X_test: Any,
    y_test: Any,
    label_model_name: str,
    verbose: bool = True,
    type_label: str = "LM + EM",
) -> dict[str, Any]:
    """
    Train a SGDClassifier model and evaluate its performance on the test set.

    Args:
        X_train (Any): Training data features.
        y_train (Any): Training data labels.
        X_test (Any): Test data features.
        y_test (Any): Test data labels.
        label_model_name (str): Name of the label model.
        verbose (bool, optional): Whether to print detailed metrics. Defaults to True.
        type_label (str, optional): Type of label model. Defaults to 'LM + EM'.

    Returns:
        dict[str, Any]: A dictionary containing the label model name and Matthews correlation coefficient.

    Example:
        result = train_and_evaluate_model(X_train, y_train, X_test, y_test, 'SGDClassifier')
    """
    # Initialize the SGDClassifier classifier with specified parameters
    model = SGDClassifier(
        loss="log_loss",
        max_iter=10000,
        n_jobs=-1,
        random_state=271828,
        class_weight="balanced",
    )

    # Train the model on the training data
    model.fit(X_train, y_train)

    # Predict the labels for the test data
    y_test_pred = model.predict(X_test)

    if verbose:
        # Print the label model name and classification metrics if verbose is True
        print(f"\n\nLabel Model name: {label_model_name}")
        helpers.classification.print_classification_metrics(y_test, y_test_pred)

    # Return the label model name and Matthews correlation coefficient
    return {
        "Label Model": label_model_name + " + End Model",
        "Matthews Correlation": matthews_corrcoef(y_test, y_test_pred),
        "Type": type_label,
    }


def evaluate_noisy_labels(
    true_labels: Any, predicted_labels: Any, label_model_name: str, verbose: bool = True
) -> dict[str, Any]:
    """
    Evaluate the performance of noisy labels against the true labels.

    Args:
        true_labels (Any): The true labels.
        predicted_labels (Any): The predicted labels.
        label_model_name (str): Name of the label model.
        verbose (bool, optional): Whether to print detailed metrics. Defaults to True.

    Returns:
        dict[str, Any]: A dictionary containing the label model name and Matthews correlation coefficient.

    Example:
        result = evaluate_noisy_labels(y_true, y_pred)
    """
    if verbose:
        # Print the label model name and classification metrics if verbose is True
        print(f"\n\nLabel Model name: {label_model_name}")
        helpers.classification.print_classification_metrics(
            true_labels, predicted_labels
        )

    # Return the label model name and Matthews correlation coefficient
    return {
        "Label Model": label_model_name,
        "Matthews Correlation": matthews_corrcoef(true_labels, predicted_labels),
        "Type": "LM alone",
    }

# Snorkel Label Models

In [28]:
all_results = []

# Let's see again the performance of the MajorityLabelVoter and Snorkel MeTaL LabelModel. We'll keep the End Model the same.

In [29]:
# Initialize the LabelModel from the Snorkel library
# cardinality: number of unique classes in the validation set
metal_model = LabelModel(cardinality=len(np.unique(y_valid)), verbose=True)

# Fit the LabelModel using the training label matrix (L_train)
# n_epochs: number of training epochs
# log_freq: frequency of logging during training
# seed: random seed for reproducibility
# class_balance: prior class distribution
metal_model.fit(
    L_train, n_epochs=1000, log_freq=100, seed=271828, class_balance=class_balance
)

# Predict the labels for the training, validation, and test sets using the trained LabelModel
# This step uses the learned parameters to infer the most likely labels
y_train_metal = metal_model.predict(L_train)
y_valid_metal = metal_model.predict(L_valid)
y_test_metal = metal_model.predict(L_test)

INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/1000 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.207]
  9%|▉         | 93/1000 [00:00<00:04, 188.13epoch/s]INFO:root:[100 epochs]: TRAIN:[loss=0.044]
 19%|█▉        | 188/1000 [00:01<00:04, 172.95epoch/s]INFO:root:[200 epochs]: TRAIN:[loss=0.040]
 29%|██▉       | 294/1000 [00:01<00:04, 163.10epoch/s]INFO:root:[300 epochs]: TRAIN:[loss=0.039]
 40%|███▉      | 398/1000 [00:02<00:03, 151.50epoch/s]INFO:root:[400 epochs]: TRAIN:[loss=0.039]
 50%|████▉     | 496/1000 [00:02<00:03, 150.70epoch/s]INFO:root:[500 epochs]: TRAIN:[loss=0.039]
 58%|█████▊    | 579/1000 [00:03<00:01, 288.27epoch/s]INFO:root:[600 epochs]: TRAIN:[loss=0.039]
 65%|██████▍   | 649/1000 [00:03<00:00, 402.51epoch/s]INFO:root:[700 epochs]: TRAIN:[loss=0.039]
 76%|███████▌  | 759/1000 [00:03<00:00, 459.85epoch/s]INFO:root:[800 epochs]: TRAIN:[loss=0.039]
 84%|████████▎ | 835/1000 [00:03<00:00, 545.31epoch/s]INFO:root:[900 epochs]: TRAIN:[los

In [30]:
# Train and evaluate a model using the training features (X_train) and the labels predicted by the LabelModel (y_train_metal)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_metal_em = train_and_evaluate_model(
    X_train, y_train_metal, X_test, y_test, "Snorkel MeTaL"
)



Label Model name: Snorkel MeTaL
Metric                                   Score
Accuracy Score:                        0.95274
Balanced Accuracy Score:               0.94776
F1 Score (weighted):                   0.95292
Cohen Kappa Score:                     0.88876
Matthews Correlation Coefficient:      0.88893

Classification Report:

              precision    recall  f1-score   support

           0       0.91      0.94      0.92      7050
           1       0.97      0.96      0.97     16290

    accuracy                           0.95     23340
   macro avg       0.94      0.95      0.94     23340
weighted avg       0.95      0.95      0.95     23340


Confusion Matrix:

Class 0 has 457 false negatives and 646 false positives.
Class 1 has 646 false negatives and 457 false positives.
The total number of errors is 1103 out of 23340 samples (error rate: 0.0473).
Predicted      0       1     All
True                            
0          6,593     457   7,050
1            646  15,

In [31]:
# Evaluate the noisy labels predicted by the LabelModel (y_test_metal) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the predicted labels to the true labels and returns evaluation metrics
results_metal_lm = evaluate_noisy_labels(y_test, y_test_metal, "Snorkel MeTaL")



Label Model name: Snorkel MeTaL
Metric                                   Score
Accuracy Score:                        0.87121
Balanced Accuracy Score:               0.80314
F1 Score (weighted):                   0.86340
Cohen Kappa Score:                     0.66497
Matthews Correlation Coefficient:      0.68579

Classification Report:

              precision    recall  f1-score   support

           0       0.92      0.63      0.75      7050
           1       0.86      0.98      0.91     16290

    accuracy                           0.87     23340
   macro avg       0.89      0.80      0.83     23340
weighted avg       0.88      0.87      0.86     23340


Confusion Matrix:

Class 0 has 2600 false negatives and 406 false positives.
Class 1 has 406 false negatives and 2600 false positives.
The total number of errors is 3006 out of 23340 samples (error rate: 0.1288).
Predicted      0       1     All
True                            
0          4,450   2,600   7,050
1            406  1

In [32]:
# Initialize the MajorityLabelVoter model
# cardinality: number of unique classes in the validation set
majority_model = MajorityLabelVoter(cardinality=len(np.unique(y_valid)))

# Predict the labels for the training, validation, and test sets using the MajorityLabelVoter model
# tie_break_policy="random": randomly break ties when multiple labels have the same majority vote
y_train_majority = majority_model.predict(L_train, tie_break_policy="random")
y_valid_majority = majority_model.predict(L_valid, tie_break_policy="random")
y_test_majority = majority_model.predict(L_test, tie_break_policy="random")

In [33]:
# Train and evaluate a model using the training features (X_train) and the labels predicted by the MajorityLabelVoter (y_train_majority)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_majority_em = train_and_evaluate_model(
    X_train, y_train_majority, X_test, y_test, "Majority Label Voter"
)



Label Model name: Majority Label Voter
Metric                                   Score
Accuracy Score:                        0.94833
Balanced Accuracy Score:               0.92895
F1 Score (weighted):                   0.94779
Cohen Kappa Score:                     0.87501
Matthews Correlation Coefficient:      0.87612

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.88      0.91      7050
           1       0.95      0.98      0.96     16290

    accuracy                           0.95     23340
   macro avg       0.95      0.93      0.94     23340
weighted avg       0.95      0.95      0.95     23340


Confusion Matrix:

Class 0 has 846 false negatives and 360 false positives.
Class 1 has 360 false negatives and 846 false positives.
The total number of errors is 1206 out of 23340 samples (error rate: 0.0517).
Predicted      0       1     All
True                            
0          6,204     846   7,050
1            3

In [34]:
# Evaluate the noisy labels predicted by the MajorityLabelVoter (y_test_majority) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the predicted labels to the true labels and returns evaluation metrics
results_majority_lm = evaluate_noisy_labels(
    y_test, y_test_majority, "Majority Label Voter"
)



Label Model name: Majority Label Voter
Metric                                   Score
Accuracy Score:                        0.86272
Balanced Accuracy Score:               0.83198
F1 Score (weighted):                   0.86198
Cohen Kappa Score:                     0.67098
Matthews Correlation Coefficient:      0.67122

Classification Report:

              precision    recall  f1-score   support

           0       0.78      0.75      0.77      7050
           1       0.90      0.91      0.90     16290

    accuracy                           0.86     23340
   macro avg       0.84      0.83      0.84     23340
weighted avg       0.86      0.86      0.86     23340


Confusion Matrix:

Class 0 has 1732 false negatives and 1472 false positives.
Class 1 has 1472 false negatives and 1732 false positives.
The total number of errors is 3204 out of 23340 samples (error rate: 0.1373).
Predicted      0       1     All
True                            
0          5,318   1,732   7,050
1         

In [35]:
# Append the evaluation results of the LabelModel's noisy labels to the all_results list
# results_metal_lm contains the evaluation metrics for the noisy labels predicted by the LabelModel
all_results.append(results_metal_lm)

# Append the evaluation results of the MajorityLabelVoter's noisy labels to the all_results list
# results_majority_lm contains the evaluation metrics for the noisy labels predicted by the MajorityLabelVoter
all_results.append(results_majority_lm)

# Append the evaluation results of the model trained with the LabelModel's labels to the all_results list
all_results.append(results_metal_em)

# Append the evaluation results of the model trained with the MajorityLabelVoter's labels to the all_results list
all_results.append(results_majority_em)

In [36]:
df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
2,Snorkel MeTaL + End Model,0.888925,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
0,Snorkel MeTaL,0.685793,LM alone
1,Majority Label Voter,0.671221,LM alone


In [37]:
# Let's compare with the performance of a full supervision model trained on the development set (Let's assume we have 1000 labeled samples).

# Train and evaluate a fully supervised model using the first 1000 labeled samples from the development set
results_full_supervision = train_and_evaluate_model(
    X_valid[:1000],
    y_valid[:1000],
    X_test,
    y_test,
    "Full Supervision on Dev Set",
    type_label="Full supervision",
)



Label Model name: Full Supervision on Dev Set
Metric                                   Score
Accuracy Score:                        0.96071
Balanced Accuracy Score:               0.94884
F1 Score (weighted):                   0.96053
Cohen Kappa Score:                     0.90596
Matthews Correlation Coefficient:      0.90621

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.92      0.93      7050
           1       0.97      0.98      0.97     16290

    accuracy                           0.96     23340
   macro avg       0.96      0.95      0.95     23340
weighted avg       0.96      0.96      0.96     23340


Confusion Matrix:

Class 0 has 572 false negatives and 345 false positives.
Class 1 has 345 false negatives and 572 false positives.
The total number of errors is 917 out of 23340 samples (error rate: 0.0393).
Predicted      0       1     All
True                            
0          6,478     572   7,050
1       

In [38]:
# Append the results of the fully supervised model to the all_results list
# This list contains the evaluation metrics for different models and labeling strategies
all_results.append(results_full_supervision)

# Create a DataFrame from the all_results list for easier analysis and comparison
df_baseline = pd.DataFrame(all_results)

# Sort the DataFrame by the 'Matthews Correlation' column in descending order
# This helps in identifying the best performing model based on the Matthews Correlation metric
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
2,Snorkel MeTaL + End Model,0.888925,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
0,Snorkel MeTaL,0.685793,LM alone
1,Majority Label Voter,0.671221,LM alone


# Other Label Models

Programmatic Weak Supervision (PWS) uses *labeling functions* (LFs) to generate noisy labels from simple rules or heuristics. Combining these outputs—especially when the signals may conflict—is a critical challenge in obtaining high-quality labels.

In earlier notebooks, we used the Metal Label Model from Snorkel to aggregate LF outputs. In this section, we review alternative label models that support label aggregation.


## The Hyper Label Model (HyperLM) for Programmatic Weak Supervision

HyperLM addresses label aggregation challenges by learning a dataset-agnostic model that can be applied directly to new datasets without additional training or parameter tuning. Unlike traditional label models that require retraining for each new dataset, HyperLM approximates an optimal solution through deep learning.

### Key Components of HyperLM

1. **Label Matrix ($X$)**

   - **Definition:** The label matrix $X$ is an $n \times m$ matrix containing outputs from $m$ LFs applied to $n$ data points.
   - **Element Meaning:** Each entry $X_{i,j}$ represents the label given by the $j$-th LF for the $i$-th data point, using the following encoding:
     - **$+1$:** Positive label
     - **$-1$:** Negative label
     - **$0$:** Abstention (the LF does not provide a label)

2. **Ground Truth Labels ($y$)**

   - **Goal:** Infer the true label vector $y \in \{+1, -1\}^n$.
   - **Challenge:** In weak supervision, the true label vector $y$ is typically not observed.

3. **Better-Than-Random Assumption**

   - **Concept:** Most LFs for each class perform better than random guessing. Mathematically, let $g(X, y, j, c)$ be an indicator that equals 1 if the $j$-th LF is better-than-random for class $c$, and 0 otherwise. The assumption is expressed as:

     $$
     \frac{1}{m} \sum_{j=0}^{m-1} g(X, y, j, +1) > \frac{1}{2} \quad \text{and} \quad \frac{1}{m} \sum_{j=0}^{m-1} g(X, y, j, -1) > \frac{1}{2}
     $$

4. **Valid Label Vectors**

   - A label vector $y$ is deemed *valid* if it satisfies the better-than-random assumption. All valid candidate vectors form the set $U_y(X)$.

5. **Optimal Analytical Solution**

   - In theory, the optimal estimation of $y$ from $X$ is given by the average of all valid candidate vectors:

     $$
     h^*(X) = \frac{1}{|U_y(X)|} \sum_{y \in U_y(X)} y
     $$

   - This averaging minimizes the average error across all possible outcomes.

6. **Approximating the Optimal Solution**

   - Direct computation is intractable for large datasets. HyperLM uses a Graph Neural Network (GNN) to approximate the analytical solution, effectively handling large-scale problems.



### HyperLM Architecture

HyperLM uses a GNN-based architecture to address several real-world requirements:

1. **Handling Arbitrary Input Sizes**

   - **Feature:** The model accepts varying numbers of data points and LFs.
   - **Method:** The label matrix $X$ is represented as graph, enabling the GNN to work with inputs of arbitrary size.

2. **LF Permutation Invariance**

   - **Requirement:** The output should remain unchanged if the order of LFs is shuffled.
   - **Formalization:** For any permutation matrix $P_m \in \mathbb{R}^{m \times m}$, the model must satisfy:

     $$
     h(XP_m) = h(X)
     $$

3. **Data Point Permutation Equivariance**

   - **Requirement:** Shuffling data points should produce a corresponding permutation in output labels.
   - **Formalization:** For any permutation matrix $P_n \in \mathbb{R}^{n \times n}$:

     $$
     h(P_n X) = P_n h(X)
     $$

   - The GNN's architecture inherently maintains this property.



### Dataset-Agnostic Learning

A central benefit of HyperLM is its ability to work with new datasets without retraining. The process includes:

1. **Synthetic Data Generation**

   - **Procedure:** Train HyperLM with a large dataset of synthetically generated label matrices paired with approximated optimal label vectors.
   - This synthetic generation uniformly samples valid label vectors, ensuring the model learns a general aggregation strategy.

2. **GNN-based Processing**

   - **Graph Representation:** Data points and LFs are nodes in the graph. Edges connect data points with respective LF outputs.
   - **Message Passing:** Through message exchanges between nodes, the GNN learns the relationships required for effective label aggregation.

3. **Single Forward Pass Inference**

   - **Efficiency:** Once trained, the model produces label predictions in a single forward pass, requiring no dataset-specific parameter adjustment.



### Advantages of HyperLM

- **Improved Accuracy:** Provides high label accuracy across varied datasets.
- **Computational Efficiency:** Faster processing compared to methods needing dataset-specific training.
- **Interactive LF Development:** Allows quick evaluation of the effects of new or changed LFs.
- **Scalability:** Suitable for large datasets and complex labeling tasks, relevant in industrial applications.



## Training Approaches for HyperLM

HyperLM supports both unsupervised and semi-supervised training modes.

### Unsupervised HyperLM

In the unsupervised setting, HyperLM relies entirely on synthetic data:

1. **Data Generation**

   - **Sampling:** Generate a label matrix $X$ by randomly choosing $n$ and $m$ from defined ranges.
   - **Validity Check:** Pair $X$ with a label vector $y$ that meets the better-than-random assumption.
  
2. **Training Objective**

   - **Loss Function:** Minimize the cross-entropy loss over the synthetic dataset $D$:

     $$
     \mathcal{L}(h; D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \sum_{j=1}^{n} \ell_{\text{CE}}(h(X_i)[j], y_i[j])
     $$

     where $\ell_{\text{CE}}(\cdot)\ is the cross-entropy loss.

3. **Model Architecture**

   - Use a multi-layer GNN to aggregate message passing.
   - Pool embeddings from the final layer and process them with a multilayer perceptron (MLP) to generate outputs.

4. **Inference**

   - For any new dataset, a single forward pass yields the predicted labels.



### Semi-supervised HyperLM

Fine-tuning HyperLM with a small set of ground truth labels is handled as follows:

1. **Pretraining**

   - **Initialization:** Start with weights from the unsupervised HyperLM trained on synthetic data.

2. **Fine-Tuning Objective**

   - **Loss Function:** Minimize the cross-entropy loss over the ground truth labeled set $D_{\text{gt}}$:

     $$
     \mathcal{L}_{\text{fine-tune}}(h; D_{\text{gt}}) = \sum_{i \in I} \ell_{\text{CE}}(h(X)[i], y[i])
     $$

   - Here, $I$ denotes the indices of data points with labels.

3. **Learning Rate Adjustment**

   - Use a smaller learning rate to reduce overfitting during fine-tuning. The number of fine-tuning epochs can be adjusted based on the size of the labeled set.

4. **Inference**

   - After fine-tuning, obtain predicted labels with a single forward pass across the new dataset.

> **Note:** The choice of training mode (unsupervised vs. semi-supervised) depends on the availability of ground truth labels. In practical applications, unsupervised training may be combined with limited ground truth data for quicker adaptation to specific tasks.


In [None]:
from hyperlm import HyperLabelModel
import torch
import numpy as np

with torch.serialization.safe_globals(
    [np.core.multiarray.scalar, np.dtype, np.dtypes.Float64DType]
):  # This is to avoid the error from torch >=2.6

    # Initialize the HyperLabelModel for unsupervised learning
    # device='cpu' specifies that the model should run on the CPU
    hyper_lm_unsupervised_model = HyperLabelModel(device="cpu")

# Infer the labels for the training set using the unsupervised HyperLabelModel
# L_train is the label matrix for the training set
y_train_hyper_lm_unsupervised = hyper_lm_unsupervised_model.infer(L_train)

# Infer the labels for the test set using the unsupervised HyperLabelModel
# L_test is the label matrix for the test set
y_test_hyper_lm_unsupervised = hyper_lm_unsupervised_model.infer(L_test)

# Display the shapes of the inferred labels and the feature matrices
# This helps in verifying that the dimensions of the inferred labels match the feature matrices
y_train_hyper_lm_unsupervised.shape, y_test_hyper_lm_unsupervised.shape, X_train.shape, X_test.shape

((85831,), (23340,), (85831, 768), (23340, 768))

In [48]:
# Train and evaluate a model using the training features (X_train) and the labels inferred by the unsupervised HyperLabelModel (y_train_hyper_lm_unsupervised)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_hyper_lm_unsupervised_em = train_and_evaluate_model(
    X_train,
    y_train_hyper_lm_unsupervised,
    X_test,
    y_test,
    "HyperLabelModel Unsupervised",
)

# Evaluate the noisy labels inferred by the unsupervised HyperLabelModel (y_test_hyper_lm_unsupervised) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the inferred labels to the true labels and returns evaluation metrics
results_hyper_lm_unsupervised_lm = evaluate_noisy_labels(
    y_test, y_test_hyper_lm_unsupervised, "HyperLabelModel Unsupervised"
)



Label Model name: HyperLabelModel Unsupervised
Metric                                   Score
Accuracy Score:                        0.93089
Balanced Accuracy Score:               0.93492
F1 Score (weighted):                   0.93192
Cohen Kappa Score:                     0.84145
Matthews Correlation Coefficient:      0.84432

Classification Report:

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      7050
           1       0.97      0.92      0.95     16290

    accuracy                           0.93     23340
   macro avg       0.91      0.93      0.92     23340
weighted avg       0.94      0.93      0.93     23340


Confusion Matrix:

Class 0 has 387 false negatives and 1226 false positives.
Class 1 has 1226 false negatives and 387 false positives.
The total number of errors is 1613 out of 23340 samples (error rate: 0.0691).
Predicted      0       1     All
True                            
0          6,663     387   7,050
1   

In [50]:
# Initialize the HyperLabelModel for semi-supervised learning
with torch.serialization.safe_globals(
    [np.core.multiarray.scalar, np.dtype, np.dtypes.Float64DType]
):  # This is to avoid the error from torch >=2.6
    hyper_lm_semi_supervised_model = HyperLabelModel(device="cpu")

# Concatenate the training and validation label matrices
# L_train: label matrix for the training set
# L here will be train + valid (remember this is an Semi-Supervised Learning approach). Let's assume we have 1000 labeled samples.
# L_valid[:1000]: first 1000 samples from the validation set (assumed to be labeled)
L_concat = np.concatenate([L_train, L_valid[:1000]])

# Create index arrays for the training and validation sets within the concatenated label matrix
# idx_train: indices for the training set
# idx_valid: indices for the validation set within the concatenated matrix
idx_train = np.arange(L_train.shape[0])
idx_valid = np.arange(L_train.shape[0], L_concat.shape[0])

# Infer the labels for the training set using the semi-supervised HyperLabelModel
# L_concat: concatenated label matrix
# y_indices=idx_valid: indices of the labeled validation samples
# y_vals=y_valid[:1000]: true labels for the first 1000 validation samples
with torch.serialization.safe_globals(
    [np.core.multiarray.scalar, np.dtype, np.dtypes.Float64DType]
):  # This is to avoid the error from torch >=2.6
    y_train_hyper_lm_semi_supervised = hyper_lm_semi_supervised_model.infer(
        L_concat, y_indices=idx_valid, y_vals=y_valid[:1000]
    )

# Extract the inferred labels for the training set from the concatenated results
y_train_hyper_lm_semi_supervised = y_train_hyper_lm_semi_supervised[idx_train]

In [51]:
# Concatenate the test and validation label matrices
# L_test: label matrix for the test set
# L_valid[:1000]: first 1000 samples from the validation set (assumed to be labeled)
L_concat_2 = np.concatenate([L_test, L_valid[:1000]])

# Create index arrays for the test and validation sets within the concatenated label matrix
# idx_test: indices for the test set
# idx_valid: indices for the validation set within the concatenated matrix
idx_test = np.arange(L_test.shape[0])
idx_valid = np.arange(L_test.shape[0], L_concat_2.shape[0])

# Infer the labels for the test set using the semi-supervised HyperLabelModel
# L_concat_2: concatenated label matrix
# y_indices=idx_valid: indices of the labeled validation samples
# y_vals=y_valid[:1000]: true labels for the first 1000 validation samples
with torch.serialization.safe_globals(
    [np.core.multiarray.scalar, np.dtype, np.dtypes.Float64DType]
):  # This is to avoid the error from torch >=2.6
    y_test_hyper_lm_semi_supervised = hyper_lm_semi_supervised_model.infer(
        L_concat_2, y_indices=idx_valid, y_vals=y_valid[:1000]
    )

# Extract the inferred labels for the test set from the concatenated results
y_test_hyper_lm_semi_supervised = y_test_hyper_lm_semi_supervised[idx_test]

In [52]:
# Train and evaluate a model using the training features (X_train) and the labels inferred by the semi-supervised HyperLabelModel (y_train_hyper_lm_semi_supervised)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_hyper_lm_semi_supervised_em = train_and_evaluate_model(
    X_train,
    y_train_hyper_lm_semi_supervised,
    X_test,
    y_test,
    "HyperLabelModel Semi-Supervised",
)



Label Model name: HyperLabelModel Semi-Supervised
Metric                                   Score
Accuracy Score:                        0.93016
Balanced Accuracy Score:               0.91586
F1 Score (weighted):                   0.93007
Cohen Kappa Score:                     0.83393
Matthews Correlation Coefficient:      0.83394

Classification Report:

              precision    recall  f1-score   support

           0       0.89      0.88      0.88      7050
           1       0.95      0.95      0.95     16290

    accuracy                           0.93     23340
   macro avg       0.92      0.92      0.92     23340
weighted avg       0.93      0.93      0.93     23340


Confusion Matrix:

Class 0 has 848 false negatives and 782 false positives.
Class 1 has 782 false negatives and 848 false positives.
The total number of errors is 1630 out of 23340 samples (error rate: 0.0698).
Predicted      0       1     All
True                            
0          6,202     848   7,050
1  

In [53]:
# Evaluate the noisy labels inferred by the semi-supervised HyperLabelModel (y_test_hyper_lm_semi_supervised) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the inferred labels to the true labels and returns evaluation metrics
results_hyper_lm_semi_supervised_lm = evaluate_noisy_labels(
    y_test, y_test_hyper_lm_semi_supervised, "HyperLabelModel Semi-Supervised"
)



Label Model name: HyperLabelModel Semi-Supervised
Metric                                   Score
Accuracy Score:                        0.84773
Balanced Accuracy Score:               0.86915
F1 Score (weighted):                   0.85284
Cohen Kappa Score:                     0.67152
Matthews Correlation Coefficient:      0.68977

Classification Report:

              precision    recall  f1-score   support

           0       0.68      0.92      0.79      7050
           1       0.96      0.82      0.88     16290

    accuracy                           0.85     23340
   macro avg       0.82      0.87      0.83     23340
weighted avg       0.88      0.85      0.85     23340


Confusion Matrix:

Class 0 has 541 false negatives and 3013 false positives.
Class 1 has 3013 false negatives and 541 false positives.
The total number of errors is 3554 out of 23340 samples (error rate: 0.1523).
Predicted      0       1     All
True                            
0          6,509     541   7,050
1

In [54]:
all_results.append(results_hyper_lm_unsupervised_em)
all_results.append(results_hyper_lm_unsupervised_lm)
all_results.append(results_hyper_lm_semi_supervised_em)
all_results.append(results_hyper_lm_semi_supervised_lm)

df_baseline = pd.DataFrame(all_results)

df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
2,Snorkel MeTaL + End Model,0.888925,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
5,HyperLabelModel Unsupervised + End Model,0.844322,LM + EM
7,HyperLabelModel Semi-Supervised + End Model,0.833944,LM + EM
6,HyperLabelModel Unsupervised,0.704016,LM alone
8,HyperLabelModel Semi-Supervised,0.689767,LM alone
0,Snorkel MeTaL,0.685793,LM alone
1,Majority Label Voter,0.671221,LM alone


## The Dawid & Skene Model for Reliable Label Aggregation

The Dawid & Skene (DS) model, introduced in 1979 ([source](https://www.jstor.org/stable/234680)), is a probabilistic framework for inferring the likely true labels for items when the observed labels come from multiple annotators with unknown reliability. This model is useful in situations where annotations are noisy, such as in crowdsourced settings or when using labeling functions in weak supervision.


### Motivation: Addressing Annotator Variability

When collecting annotations, especially through crowdsourcing, the quality of the labels can vary significantly between annotators. Consider an image classification task with two classes, "cat" and "dog." Some annotators may have difficulty distinguishing between similar breeds or might misinterpret an image, while others deliver high-quality annotations consistently. The DS model handles these discrepancies by simultaneously estimating the true labels and the reliability (error rates) of individual annotators. This approach extends naturally to weak supervision, where labeling functions (LFs) act as annotators that provide possibly noisy labels.


### Model Components

The DS model formulates the problem in a probabilistic manner with the following elements:

1. **True Labels ($Z$)**
   - Each item $i$ has an unobserved true label $Z_i$.
   - For a dataset with $N$ items and $K$ classes, $Z_i \in \{1, 2, \dots, K\}$.

2. **Observed Labels ($X$)**
   - The observed labels come from $M$ annotators. These are represented as matrix $X$ with entries $X_{ij}$, where $X_{ij}$ is the label given by annotator $j$ for item $i$.

3. **Annotator Error Rates ($\pi^{(j)}$)**
   - Each annotator $j$ is characterized by a confusion matrix $\pi^{(j)}$.
   - The entry $\pi^{(j)}_{kl}$ represents the probability that annotator $j$ assigns label $l$ when the true label is $k$. These matrices capture the reliability and error patterns of each annotator.

4. **Class Priors ($\pi_k$)**
   - These are the prior probabilities of each class $k$ in the dataset.
   - They summarize our initial belief about the distribution of the true labels before observing any data.


### Model Assumptions

The DS model is based on a few key assumptions:

- **Conditional Independence:**  
 Given the true label $Z_i$, the labels provided by different annotators are independent. This is expressed as:

  $$
  P(X_{i1}, X_{i2}, \dots, X_{iM} \mid Z_i) = \prod_{j=1}^M P(X_{ij} \mid Z_i)
  $$

- **Stationary Annotator Behavior:**  
 An annotator's error rates, as described by their confusion matrix, are assumed to remain constant across all items. Although this assumption simplifies the model, it may not always hold if annotator performance varies with item characteristics.

- **Fixed Set of Classes:**  
 The set of possible labels is assumed to be known and remains constant throughout the process.



### Estimation with the Expectation-Maximization (EM) Algorithm

The DS model employs the EM algorithm to estimate both the true labels and the parameters (confusion matrices and class priors). The process involves the following steps:

#### 1. Initialization
- **Parameter Setup:**  
 Initiate with estimates for the confusion matrices $\pi^{(j)}$ and the class priors $\pi_k$. Often, these are set uniformly or based on the distribution of observed labels.

#### 2. E-Step (Expectation)
- **Posterior Calculation:**  
 For each item $i$ and each class $k$, calculate the posterior probability that the true label is $k$ given the observed labels:

  $$
  P(Z_i = k \mid X_i, \{\pi^{(j)}\}) = \frac{\pi_k \prod_{j=1}^M \pi^{(j)}_{k, X_{ij}}}{\sum_{k'=1}^{K} \pi_{k'} \prod_{j=1}^M \pi^{(j)}_{k', X_{ij}}}
  $$

  This probability weighs the evidence from all annotators based on their estimated reliability.

#### 3. M-Step (Maximization)
- **Update Confusion Matrices:**  
 Adjust the entries of each annotator's confusion matrix according to:

  $$
  \pi^{(j)}_{kl} = \frac{\sum_{i=1}^N P(Z_i = k \mid X_i, \{\pi^{(j)}\}) \cdot \mathbb{I}(X_{ij} = l)}{\sum_{i=1}^N P(Z_i = k \mid X_i, \{\pi^{(j)}\})}
  $$

  Here, $\mathbb{I}(X_{ij} = l)$ is an indicator function that equals1 if $X_{ij} = l$ and0 otherwise.

- **Update Class Priors:**

  $$
  \pi_k = \frac{1}{N} \sum_{i=1}^N P(Z_i = k \mid X_i, \{\pi^{(j)}\})
  $$

#### 4. Iteration
- **Convergence:**  
 Repeat the E-Step and M-Step iteratively until the parameters stabilize (i.e., changes in the parameters fall below a predefined threshold).



### Instinctive Understanding

> **Note:** The DS model weighs annotator contributions based on their estimated reliability. Annotators with high consensus with inferred true labels have higher influence. In practice, this means that if several annotators frequently agree (and are later estimated as reliable), their votes contribute more strongly to the final label decision. Pay attention because here we are not using human annotators, but labeling functions. If labeling functions are correlated, they will have a higher influence on the final label decision, even if they are noisy.

- **Estimating Reliability:**  
 The confusion matrices serve as measure of each annotator’s accuracy. Annotators whose mistakes rarely deviate from the true labels are considered more reliable.
  
- **Aggregating Noisy Labels:**  
 Rather than treating all annotators equally, the DS model uses the reliability estimates to combine their responses, producing more accurate true label estimates.

### Practical Considerations

- **Multiple Annotations per Item:**  
 The model benefits when each item is labeled by multiple annotators, as more data helps in reliably estimating both the true label and the annotators' error characteristics.

- **Identifiability Issues:**  
 When there are no ground truth labels, several parameter configurations might explain the observed data similarly. Including a small set of known true labels can reduce ambiguity in the estimates.

- **Sensitivity to Initialization:**  
 The EM algorithm may converge to a local minimum depending on the initial parameter values. Running the algorithm multiple times with different initializations or using informed initialization strategies can improve the robustness of the outcomes.

- **Applications Beyond Crowdsourcing:**  
 The DS model also adapts to weak supervision scenarios, where labeling functions, viewed as annotators, produce labels for training models. In such cases, understanding and modeling the error rates of these functions becomes critical.

In [55]:
from helpers.labelmodels import DawidSkene

# Initialize the Dawid-Skene model
# cardinality: number of unique classes in the validation set
ds_model = DawidSkene(cardinality=len(np.unique(y_valid)))

# Fit the Dawid-Skene model using the training label matrix (L_train)
# This step estimates the true labels based on the noisy labels provided by multiple annotators
ds_model.fit(L_train)

# Predict the labels for the training set using the fitted Dawid-Skene model
# This step uses the learned parameters to infer the most likely labels for the training data
y_train_ds = ds_model.predict(L_train)

# Predict the labels for the test set using the fitted Dawid-Skene model
# This step uses the learned parameters to infer the most likely labels for the test data
y_test_ds = ds_model.predict(L_test)

  '''Compute the marginal probabilities of each clique and separator set in the junction tree.
  '''Predict the probabilities of the Y's given the outputs of the LF's.
  '''Predict the value of the Y's that best fits the outputs of the LF's.
  '''Predict the probabilities of the Y's given the outputs of the LF's, marginalizing out all the
  """
  return "\\begin{tabular}{" + tabular_columns_fmt + "}\n\hline"
  _invisible_codes = re.compile("\x1b\[\d*m")  # ANSI color codes
  _invisible_codes_bytes = re.compile(b"\x1b\[\d*m")  # ANSI color codes
  """
  """
  """
  """
  0%|          | 26/10000 [00:02<19:01,  8.74it/s] 


In [56]:
# Train and evaluate a model using the training features (X_train) and the labels inferred by the Dawid-Skene model (y_train_ds)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_ds_em = train_and_evaluate_model(
    X_train, y_train_ds, X_test, y_test, "Dawid-Skene"
)



Label Model name: Dawid-Skene
Metric                                   Score
Accuracy Score:                        0.95094
Balanced Accuracy Score:               0.94941
F1 Score (weighted):                   0.95129
Cohen Kappa Score:                     0.88539
Matthews Correlation Coefficient:      0.88602

Classification Report:

              precision    recall  f1-score   support

           0       0.90      0.95      0.92      7050
           1       0.98      0.95      0.96     16290

    accuracy                           0.95     23340
   macro avg       0.94      0.95      0.94     23340
weighted avg       0.95      0.95      0.95     23340


Confusion Matrix:

Class 0 has 384 false negatives and 761 false positives.
Class 1 has 761 false negatives and 384 false positives.
The total number of errors is 1145 out of 23340 samples (error rate: 0.0491).
Predicted      0       1     All
True                            
0          6,666     384   7,050
1            761  15,52

In [57]:
# Evaluate the noisy labels inferred by the Dawid-Skene model (y_test_ds) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the inferred labels to the true labels and returns evaluation metrics
results_ds_lm = evaluate_noisy_labels(y_test, y_test_ds, "Dawid-Skene")



Label Model name: Dawid-Skene
Metric                                   Score
Accuracy Score:                        0.88685
Balanced Accuracy Score:               0.82927
F1 Score (weighted):                   0.88147
Cohen Kappa Score:                     0.71047
Matthews Correlation Coefficient:      0.72500

Classification Report:

              precision    recall  f1-score   support

           0       0.92      0.68      0.78      7050
           1       0.88      0.97      0.92     16290

    accuracy                           0.89     23340
   macro avg       0.90      0.83      0.85     23340
weighted avg       0.89      0.89      0.88     23340


Confusion Matrix:

Class 0 has 2229 false negatives and 412 false positives.
Class 1 has 412 false negatives and 2229 false positives.
The total number of errors is 2641 out of 23340 samples (error rate: 0.1132).
Predicted      0       1     All
True                            
0          4,821   2,229   7,050
1            412  15,

In [58]:
all_results.append(results_ds_em)
all_results.append(results_ds_lm)
df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
2,Snorkel MeTaL + End Model,0.888925,LM + EM
9,Dawid-Skene + End Model,0.886019,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
5,HyperLabelModel Unsupervised + End Model,0.844322,LM + EM
7,HyperLabelModel Semi-Supervised + End Model,0.833944,LM + EM
10,Dawid-Skene,0.724999,LM alone
6,HyperLabelModel Unsupervised,0.704016,LM alone
8,HyperLabelModel Semi-Supervised,0.689767,LM alone
0,Snorkel MeTaL,0.685793,LM alone


## Snorkel Generative Model

The **Snorkel generative model** is the predecessor to the **Snorkel MeTaL** label model, designed to address the problem of combining noisy and potentially conflicting labels generated by weak supervision sources. It was originally introduced in the [Snorkel paper](https://arxiv.org/abs/1711.10160) by Ratner et al. (2017). Although the MeTaL model has since superseded the generative model in the Snorkel framework, understanding the generative model is important for grasping the fundamental principles of weak supervision and label aggregation.

### Objectives of the Generative Model

1. **Estimate LF Accuracy**:  
 Each LF has a different reliability level. The model learns a weight $w_j$ for each LF, representing its accuracy without direct access to the true labels.

2. **Model LF Correlations**:  
 Some LFs may rely on similar cues or patterns and can be correlated. The model adjusts for this to prevent overcounting evidence from highly correlated LFs.

### Operational Overview

The model treats the true label $y_i$ for each data point $x_i$ as latent variable. Each labeling function $\lambda_j(x_i)$ either assigns label or abstains from labeling. These LF outputs are interpreted as noisy votes about the true label.

#### 1. Labeling Functions as Noisy Voters

- **LF Output**:  
 For each data point $x_i$, each labeling function $\lambda_j(x_i)$ outputs a label or $\emptyset$ (abstention).  
- **Noisy Evidence**:  
 Conflicting labels among LFs are combined to produce a probabilistic estimate of the true label $y_i$.

#### 2. Modeling LF Accuracy

- **Weight Assignment**:  
 Each LF $lambda_j $ receives a weight $ w_j $ that reflects its estimated accuracy.  
- **Accuracy Factor**:  
 An accuracy factor, denoted as $\phi^{\text{Acc}}_{i,j}$, quantifies the probability that LF $ j $ correctly assigns the latent true label $ y_i $.

#### 3. Modeling LF Correlations

- **Correlation Factors**:  
 To adjust for dependencies among LFs, the model incorporates correlation factors $\phi^{\text{Corr}}_{i,j,k}$ that capture pairwise relationships between labeling function outputs.
  
- **Avoiding Overcounting**:  
 These factors help ensure that correlated signals from similar LFs do not result in an overconfident prediction.

#### 4. Factor Graph Representation

The generative model is expressed via a factor graph, linking the latent true labels and LF outputs. For each data point $x_i$, the joint probability distribution over true labels $y_i$ and LF outputs $\lambda_j(x_i)$ can be written as:

$$
P(\mathbf{Y}, \mathbf{\Lambda} \mid \mathbf{w}) = \frac{1}{Z} \exp\left(\sum_{i=1}^{m} \mathbf{w}^T \phi_i(\mathbf{\Lambda}, y_i)\right)
$$

where:

- $\mathbf{Y} = (y_1, y_2, \ldots, y_m)$ is the vector of latent true labels.
- $\mathbf{\Lambda}$ is the matrix of LF outputs.
- $\mathbf{w}$ contains the model parameters (LF accuracies and correlations).
- $Z$ is the normalizing constant.
- $\phi_i(\mathbf{\Lambda}, y_i)$ are feature functions encapsulating accuracy and correlation effects.



### Learning the Model

The goal is to estimate the parameters $\mathbf{w}$ by maximizing the marginal likelihood of the observed LF outputs. The training involves the following steps:

#### 1. Initialization

- **Starting Point**:  
 An initial guess is made for the latent true labels $\mathbf{Y}$.

#### 2. Gibbs Sampling

- **Purpose**:  
 Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC), estimates the joint distribution of the latent labels.
  
- **Conditional Sampling**:  
 For each data point $x_i$, sample the true label from the conditional distribution:

  $$
  P(y_i \mid \mathbf{\Lambda}, \mathbf{Y}_{-i}, \mathbf{w}) \propto \exp\left(\mathbf{w}^T \phi_i(\mathbf{\Lambda}, y_i)\right)
  $$
  
 Here, $\mathbf{Y}_{-i}$ denotes the set of latent labels excluding $y_i$.

#### 3. Parameter Update

- **Optimization Step**:  
 Use stochastic gradient descent to update the parameter vector $\mathbf{w}$ based on the current samples of the latent true labels.

#### 4. Convergence

- **Iterative Process**:  
 Repeat the sampling and parameter update steps until the latent label distribution stabilizes, indicating convergence.
  
- **Probabilistic Labels**:  
 Once converged, the model produces probabilistic label estimates:

  $$
  \hat{y}_i = P(y_i \mid \mathbf{\Lambda}, \mathbf{w})
  $$

  where
  - $\hat{y}_i$ is the estimated probability distribution over the true label $y_i$.
  - $\mathbf{\Lambda}$ is the matrix of LF outputs.
  - $\mathbf{w}$ contains the learned parameters.
  

These estimates serve as training labels for a downstream discriminative model.



### The Role of the Generative Model in the Snorkel Pipeline

The generative model acts as a **bridge** between noisy LF outputs and the final discriminative model:
1. **Labeling Functions**: Users create LFs based on domain knowledge, heuristics, or external resources.
2. **Generative Model**: It estimates LF accuracies and correlations, combining their votes into probabilistic labels.
3. **Discriminative Model**: The probabilistic labels train a high-performing discriminative model, improving predictive performance on new data.

By learning LF accuracies and correlations, the Snorkel generative model produces probabilistic labels that significantly reduce the need for costly manual labeling, enabling rapid deployment of machine learning models in real-world scenarios where labeled datasets are scarce.


In [59]:
import helpers.labelmodels
from multiprocessing import cpu_count

# Initialize the GenerativeModel from the helpers.labelmodels module
# cardinality: number of unique classes in the validation set
generative_model = helpers.labelmodels.GenerativeModel(
    cardinality=len(np.unique(y_valid))
)

# Fit the GenerativeModel using the training label matrix (L_train)
# class_balance: prior class distribution
# threads: number of threads to use for parallel processing
# This step estimates the true labels based on the noisy labels provided by multiple labeling functions (LFs)
generative_model.fit(L_train, class_balance=class_balance, threads=cpu_count())

# Note: The fitting process can take a significant amount of time depending on the size of the LF set and the machine's core count
# For a our LF set, it takes around 2 minutes on a 48-core machine

In [60]:
# Predict the labels for the training set using the fitted GenerativeModel
# This step uses the learned parameters to infer the most likely labels for the training data
y_train_generative = generative_model.predict(L_train)

# Predict the labels for the test set using the fitted GenerativeModel
# This step uses the learned parameters to infer the most likely labels for the test data
y_test_generative = generative_model.predict(L_test)

In [61]:
# Train and evaluate a model using the training features (X_train) and the labels inferred by the GenerativeModel (y_train_generative)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_generative_em = train_and_evaluate_model(
    X_train, y_train_generative, X_test, y_test, "Generative Model"
)



Label Model name: Generative Model
Metric                                   Score
Accuracy Score:                        0.95244
Balanced Accuracy Score:               0.95088
F1 Score (weighted):                   0.95277
Cohen Kappa Score:                     0.88883
Matthews Correlation Coefficient:      0.88941

Classification Report:

              precision    recall  f1-score   support

           0       0.90      0.95      0.92      7050
           1       0.98      0.95      0.97     16290

    accuracy                           0.95     23340
   macro avg       0.94      0.95      0.94     23340
weighted avg       0.95      0.95      0.95     23340


Confusion Matrix:

Class 0 has 374 false negatives and 736 false positives.
Class 1 has 736 false negatives and 374 false positives.
The total number of errors is 1110 out of 23340 samples (error rate: 0.0476).
Predicted      0       1     All
True                            
0          6,676     374   7,050
1            736  

In [62]:
# Evaluate the noisy labels inferred by the GenerativeModel (y_test_generative) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the inferred labels to the true labels and returns evaluation metrics
results_generative_lm = evaluate_noisy_labels(
    y_test, y_test_generative, "Generative Model"
)



Label Model name: Generative Model
Metric                                   Score
Accuracy Score:                        0.84229
Balanced Accuracy Score:               0.83609
F1 Score (weighted):                   0.84535
Cohen Kappa Score:                     0.64248
Matthews Correlation Coefficient:      0.64652

Classification Report:

              precision    recall  f1-score   support

           0       0.71      0.82      0.76      7050
           1       0.92      0.85      0.88     16290

    accuracy                           0.84     23340
   macro avg       0.81      0.84      0.82     23340
weighted avg       0.85      0.84      0.85     23340


Confusion Matrix:

Class 0 has 1266 false negatives and 2415 false positives.
Class 1 has 2415 false negatives and 1266 false positives.
The total number of errors is 3681 out of 23340 samples (error rate: 0.1577).
Predicted      0       1     All
True                            
0          5,784   1,266   7,050
1          2,4

In [63]:
all_results.append(results_generative_em)
all_results.append(results_generative_lm)
df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
11,Generative Model + End Model,0.88941,LM + EM
2,Snorkel MeTaL + End Model,0.888925,LM + EM
9,Dawid-Skene + End Model,0.886019,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
5,HyperLabelModel Unsupervised + End Model,0.844322,LM + EM
7,HyperLabelModel Semi-Supervised + End Model,0.833944,LM + EM
10,Dawid-Skene,0.724999,LM alone
6,HyperLabelModel Unsupervised,0.704016,LM alone
8,HyperLabelModel Semi-Supervised,0.689767,LM alone


## FlyingSquid Model

The **FlyingSquid** model is a framework specifically designed to tackle the computational demands of weak supervision. Traditional approaches often rely on **latent variable models** and iterative algorithms like **stochastic gradient descent (SGD)** to estimate the accuracies of labeling functions (LFs). However, these methods can be computationally intensive, especially with a large number of LFs or data points.

FlyingSquid distinguishes itself by using **closed-form solutions** to estimate model parameters. This approach dramatically reduces computation time, offering a significant speed advantage over iterative optimization methods.

### Key Components

1. **Weak Supervision Setup**

   - **Data and Labels**: Let $X = \{X_1, X_2, \dots, X_n\}$ be the set of unlabeled data points, and $Y = \{Y_1, Y_2, \dots, Y_n\}$ be the set of unobserved true labels. For binary classification, each $Y_i \in \{-1, +1\}$.
   - **Weak Supervision Sources**: We have $m$ weak supervision sources (labeling functions), denoted as $S_1, S_2, \dots, S_m$. Each source $S_j$ provides a noisy label $\lambda_j \in \{-1, 0, 1\}$ for each data point, where $0$ signifies abstention.

2. **Label Model**

   - **Latent Variable Approach**: FlyingSquid employs a **latent variable model** to infer both the accuracies of and the dependencies between the weak supervision sources.
   - **Binary Ising Model**: Specifically, it uses a **binary Ising model** to aggregate weak signals into probabilistic labels. The Ising model captures the relationships between the LFs and the unobserved true labels, allowing for efficient probabilistic aggregation.

3. **Triplet Decomposition**

   - **Key Innovation**: The central idea in FlyingSquid is the **triplet method**. It simplifies parameter estimation by breaking down the problem into smaller, solvable subproblems involving triplets of label sources.
   - **Pairwise Agreements**: For any three label sources $ \lambda_i, \lambda_j, \lambda_k $, the method estimates their accuracies by analyzing their pairwise agreements.
   - **Conditional Independence**:  Assuming conditional independence between sources $\lambda_i$ and $\lambda_j$ given the true label $Y$, the following relationship holds:

     $$
     E[\lambda_i Y] \cdot E[\lambda_j Y] = E[\lambda_i \lambda_j]
     $$

     Here, $E[\lambda_i Y]$ is intended to capture the “accuracy” of the $i$th labeling function (LF) under the assumption that the LF returns values in $\{-1, 0, +1\}$ and the true label $Y \in \{-1, +1\}$.
     
     Under the conditional independence assumption (given $Y$), if we denote $a_i = E[\lambda_i Y]$ as the accuracy of LF $i$ and $a_j = E[\lambda_j Y]$, the product $a_i\,a_j$ naturally appears as the expected pairwise agreement $E[\lambda_i \lambda_j]$.

     This equation relates the expected product of $\lambda_i$ and $Y$ with the expected product of $\lambda_j$ and $Y$ to the directly observable expected product of $\lambda_i$ and $\lambda_j$. By forming triplets and employing this relationship, FlyingSquid sets up a system of equations to estimate the accuracy parameters.

4. **Closed-Form Solution**

   - **Direct Calculation**:  FlyingSquid derives **closed-form solutions** for estimating parameters, in contrast to iterative optimization used in many latent variable models.
   - **Linear Systems**: This is accomplished by solving linear systems based on the observed agreement rates between pairs of weak supervision sources.
   - **Accuracy Estimation Formula**: The accuracy $a_i$ of a source $\lambda_i$ can be computed directly using the agreement rates with two other sources $\lambda_j$ and $\lambda_k$:

     $$
     a_i = \sqrt{\frac{E[\lambda_i \lambda_j] \cdot E[\lambda_i \lambda_k]}{E[\lambda_j \lambda_k]}}
     $$

     If the accuracies of the labeling functions in a triplet are defined as above, then from $E[\lambda_i \lambda_j] = a_i a_j$ and likewise for other pairs, solving for $a_i$ yields this closed-form expression.
     This formula requires that $E[\lambda_j \lambda_k]$ is non-zero, which is usually assumed when the LFs are sufficiently informative and not completely uncorrelated.

   - **Efficiency**: This method is computationally efficient because it scales linearly with the number of sources. It avoids the computational overhead of iterative procedures, making it suitable for large-scale weak supervision tasks.

5. **Binary Ising Model**

   - **Dependency Modeling**: FlyingSquid models the relationships between label sources and true labels using a **binary Ising model**.
   - **Graph Representation**: The Ising model is visualized as a graph $G$ where each node represents a weak supervision source. Edges between nodes can represent dependencies. Each source $\lambda_i$ is implicitly connected to the true label $Y$.
   - **Joint Distribution**: The Ising model defines the joint probability distribution over the true labels $Y$ and the observed labels $\lambda$:

     $$
     P(Y, \lambda) = \frac{1}{Z} \exp \left( \sum_{i} \theta_i Y \lambda_i + \sum_{i,j} \theta_{ij} \lambda_i \lambda_j \right)
     $$

     Here, $\theta_i$ reflects the accuracy of source $\lambda_i$, and $\theta_{ij}$ captures the correlation between sources $\lambda_i$ and $\lambda_j$. The triplet method allows for efficient estimation of these $\theta$ parameters, making the Ising model parameters tractable.

### Generalization Bound

FlyingSquid provides theoretical guarantees on the **generalization error** of models trained using its generated labels. These bounds consider both sampling error and potential model inaccuracies (misspecification).

- **Error Bound**: The generalization error $E$ of a model trained with FlyingSquid labels is bounded by:

  $$
  E \leq \gamma(n) + \frac{8 \|Y\|}{\lambda_{\min}} \|\hat{\mu} - \mu\|_2^2 + \delta(D, P_\mu)
  $$

  Where:

  - $\gamma(n)$ decreases as the number of samples $n$ increases, representing the reduction in sampling error with more data.
  - $\lambda_{\min}$ is the smallest eigenvalue of the covariance matrix of observed variables, influencing the stability of the estimation.
  - $\|\hat{\mu} - \mu\|_2$ measures the error in parameter estimation, i.e., how accurately the model parameters are estimated.
  - $\delta(D, P_\mu)$ is the Kullback-Leibler (KL) divergence, quantifying the difference between the true data distribution $D$ and the Ising model distribution $P_\mu$. This term accounts for model misspecification – if the Ising model does not perfectly represent the data, this term captures the resulting error.

- **Consequences**: This bound suggests that even if the Ising model is an approximation, the performance of a downstream model trained with FlyingSquid labels remains close to the performance achievable with true labels, especially when the sample size is sufficiently large.

> **Analogy**: Imagine you are trying to judge the fairness of three coin-flipping sources (analogous to LFs) without knowing the true probability of heads for each. You can observe pairs of coin flips from each pair of sources. By comparing how often each pair agrees (both heads or both tails), you can infer the reliability (accuracy) of each source without knowing the true outcome of any flip. FlyingSquid’s triplet method is similar—it uses observed agreements between triplets of LFs to estimate their individual accuracies in a computationally efficient way.

>
> That said, these expressions depend on the modeling assumptions (e.g., conditional independence, non-zero pairwise expectations) and may vary slightly with different formulations or under alternate assumptions. The key takeaway is that FlyingSquid provides a computationally efficient method for estimating LF accuracies and dependencies, enabling adaptable weak supervision applications.



In [64]:
# Initialize the FlyingSquid model from the helpers.labelmodels module
# cardinality: number of unique classes in the validation set
fs_model = helpers.labelmodels.FlyingSquid(cardinality=len(np.unique(y_valid)))

# Fit the FlyingSquid model using the training label matrix (L_train)
# class_balance: prior class distribution
# verbose=True: enables detailed logging during the fitting process
# This step estimates the true labels based on the noisy labels provided by multiple labeling functions (LFs)
fs_model.fit(L_train, class_balance=class_balance, verbose=True)

Marginals written down
R vector written down
Expectations to estimate written down
Triplets constructed
Y marginals computed
Y equals one computed
lambda marginals, moments, conditions computed
Unobserved probabilities computed
R values computed


In [65]:
# Predict the labels for the training set using the fitted FlyingSquid model
# This step uses the learned parameters to infer the most likely labels for the training data
y_train_fs = fs_model.predict(L_train)

# Predict the labels for the test set using the fitted FlyingSquid model
# This step uses the learned parameters to infer the most likely labels for the test data
y_test_fs = fs_model.predict(L_test)

In [66]:
# Train and evaluate a model using the training features (X_train) and the labels inferred by the FlyingSquid model (y_train_fs)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_fs_em = train_and_evaluate_model(
    X_train, y_train_fs, X_test, y_test, "FlyingSquid"
)



Label Model name: FlyingSquid
Metric                                   Score
Accuracy Score:                        0.88338
Balanced Accuracy Score:               0.82678
F1 Score (weighted):                   0.87814
Cohen Kappa Score:                     0.70264
Matthews Correlation Coefficient:      0.71562

Classification Report:

              precision    recall  f1-score   support

           0       0.91      0.68      0.78      7050
           1       0.88      0.97      0.92     16290

    accuracy                           0.88     23340
   macro avg       0.89      0.83      0.85     23340
weighted avg       0.89      0.88      0.88     23340


Confusion Matrix:

Class 0 has 2229 false negatives and 493 false positives.
Class 1 has 493 false negatives and 2229 false positives.
The total number of errors is 2722 out of 23340 samples (error rate: 0.1166).
Predicted      0       1     All
True                            
0          4,821   2,229   7,050
1            493  15,

In [67]:
# Evaluate the noisy labels inferred by the FlyingSquid model (y_test_fs) against the true test labels (y_test)
# The function 'evaluate_noisy_labels' compares the inferred labels to the true labels and returns evaluation metrics
results_fs_lm = evaluate_noisy_labels(y_test, y_test_fs, "FlyingSquid")



Label Model name: FlyingSquid
Metric                                   Score
Accuracy Score:                        0.83483
Balanced Accuracy Score:               0.74357
F1 Score (weighted):                   0.81940
Cohen Kappa Score:                     0.55428
Matthews Correlation Coefficient:      0.59127

Classification Report:

              precision    recall  f1-score   support

           0       0.90      0.51      0.65      7050
           1       0.82      0.97      0.89     16290

    accuracy                           0.83     23340
   macro avg       0.86      0.74      0.77     23340
weighted avg       0.84      0.83      0.82     23340


Confusion Matrix:

Class 0 has 3433 false negatives and 422 false positives.
Class 1 has 422 false negatives and 3433 false positives.
The total number of errors is 3855 out of 23340 samples (error rate: 0.1652).
Predicted      0       1     All
True                            
0          3,617   3,433   7,050
1            422  15,

In [68]:
all_results.append(results_fs_em)
all_results.append(results_fs_lm)

df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
11,Generative Model + End Model,0.88941,LM + EM
2,Snorkel MeTaL + End Model,0.888925,LM + EM
9,Dawid-Skene + End Model,0.886019,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
5,HyperLabelModel Unsupervised + End Model,0.844322,LM + EM
7,HyperLabelModel Semi-Supervised + End Model,0.833944,LM + EM
10,Dawid-Skene,0.724999,LM alone
13,FlyingSquid + End Model,0.715622,LM + EM
6,HyperLabelModel Unsupervised,0.704016,LM alone


## Crowdlab Model for Weak Supervision Using Label Functions

The Crowdlab method was originally developed for aggregating annotations from multiple human annotators, but we use it to aggregate outputs from weak supervision sources It kinda merges majority voting with confident learning, aiming to derive high-quality consensus labels and to evaluate the quality of the LFs themselves. Crowdlab achieves this by integrating predictions from LFs with a classifier trained on weakly labeled data, thereby using the complementary strengths and mitigating the weaknesses of both approaches.

### Method Overview

The Crowdlab method is designed for a dataset $D = \{X_i\}_{i=1}^{n}$, where each instance $X_i$ belongs to one of $K$ possible classes. The true label $Y_i$ for each instance is unknown. We have access to $m$ label functions, $LF_j$, each providing a weak label $Y_{ij} \in \{1, \dots, K\}$, or abstaining if $LF_j$ is not applicable to $X_i$. Crowdlab is structured to achieve three primary goals:

1. **Consensus Label Inference**: To combine predictions from LFs and a trained classifier to estimate the most probable true label for each instance.
2. **Confidence Estimation**: To quantify the reliability of these consensus labels, based on the level of agreement between LFs and the classifier.
3. **Label Function Quality Rating**: To evaluate the performance of each LF, assessing its accuracy and consistency with the derived consensus labels.

Through these integrated functionalities, Crowdlab manages the innate noise and potential conflicts present in weak supervision signals.

### Consensus Label Estimation

Crowdlab uses a **weighted ensemble** approach to calculate consensus labels. This method combines the probabilistic outputs of the LFs and a classifier that has been trained using the weakly labeled data. To prevent overfitting of the classifier on the weakly supervised training data, Crowdlab uses **cross-validation**. This ensures that the classifier's predictions used in the ensemble are based on data it has not been directly trained on. The combined probability prediction for each instance $X_i$ belonging to class $k$ is given by:

$$
\hat{p}(Y_i = k \mid X_i, \{Y_{ij}\}_{j=1}^{m}) = \frac{w_M \hat{p}(Y_i = k \mid X_i) + \sum_{j \in J_i} w_j p_j(Y_i = k \mid X_i)}{w_M + \sum_{j \in J_i} w_j}
$$

Let's break down each component of this formula:

- $\hat{p}(Y_i = k \mid X_i)$: This represents the probability that instance $X_i$ belongs to class $k$, as predicted by the classifier. This classifier is typically trained on data weakly labeled by the LFs, and cross-validation is used to get out-of-sample predictions to avoid overestimation of performance in the ensemble.
- $p_j(Y_i = k \mid X_i)$: This is the probabilistic estimate from label function $LF_j$ for instance $X_i$ belonging to class $k$. If $LF_j$ directly outputs a class label, this can be represented as a probability distribution concentrated on that class (e.g., 1 for the predicted class, 0 for others). For LFs that provide scores or confidence levels, these can be converted into probability estimates over classes.
- $w_M$: This is the weight assigned to the classifier, reflecting its estimated overall accuracy and contribution to the consensus. A higher $w_M$ indicates greater reliance on the classifier's predictions in the ensemble.
- $w_j$: This is the weight assigned to label function $LF_j$. It represents the quality and reliability of $LF_j$. Higher weights are given to LFs that are deemed more accurate and consistent.
- $J_i$: This is the set of indices of label functions that provided a non-abstaining label for instance $X_i$.  Only LFs that actually provide a label for $X_i$ contribute to the weighted sum for that instance.

The weights $w_M$ and $w_j$ are determined based on the agreement observed between the classifier's predictions and the labels provided by the LFs. These weights are adjusted dynamically: if the classifier shows higher reliability across the dataset, $w_M$ will be larger, placing more emphasis on its predictions. Conversely, if certain LFs are found to be more consistent and accurate, their corresponding $w_j$ values will increase, giving them more influence in the consensus label.

This adaptive weighting mechanism allows Crowdlab to balance the contributions of both the data-driven classifier and the knowledge-driven LFs, leading to more sturdy consensus labels than relying on either source alone.

### Scoring Label Function Quality

A significant feature of Crowdlab is its capability to assess the quality of individual LFs. This is particularly valuable in weak supervision because LFs inherently vary in accuracy and applicability. Crowdlab assigns a **quality score** $a_j$ to each label function $LF_j$, which serves as an indicator of its reliability and usefulness.

The quality score $a_j$ is computed based on two key factors:

1. **Agreement with Consensus Labels**: This measures the extent to which the labels generated by $LF_j$ align with the final consensus labels derived by Crowdlab. Higher agreement suggests that the LF is effectively capturing the fundamental patterns in the data. The agreement score for $LF_j$ is calculated as:

   $$
   \text{Agreement}_j = \frac{1}{|I_j|} \sum_{i \in I_j} \mathbb{1}(Y_{ij} = \hat{Y_i})
   $$

   - $I_j$: The set of indices of instances for which label function $LF_j$ provided a label.
   - $\hat{Y_i}$: The consensus label for instance $X_i$, as estimated by Crowdlab.
   - $\mathbb{1}(Y_{ij} = \hat{Y_i})$: An indicator function that equals 1 if the label $Y_{ij}$ from $LF_j$ for instance $X_i$ matches the consensus label $\hat{Y_i}$, and 0 otherwise.
   - $|I_j|$: The total number of instances labeled by $LF_j$.
   - $\text{Agreement}_j$: The fraction of times $LF_j$'s label agrees with the consensus label, averaged over all instances it labeled.

2. **Prediction Confidence Alignment**: This factor assesses how well the predictions of $LF_j$ correlate with the confidence of the classifier's predictions. LFs that tend to produce labels in alignment with what the classifier predicts with high confidence are considered more reliable.  A higher confidence alignment suggests that the LF and the classifier are reinforcing each other, potentially pointing to more trustworthy labels.  *(Note: The original text mentions "prediction confidence" but does not give a formula. A possible approach to quantify this "Confidence" could involve measuring how often LF labels align with classes for which the classifier outputs high probabilities, but the provided text does not specify the exact calculation for  $\text{Confidence}_j$. It's often context-dependent.)* For simplicity and based on the provided context focusing on agreement, we can proceed by primarily focusing on the "Agreement" metric as directly described for scoring LF quality.

The final quality score $a_j$ for label function $LF_j$ is a weighted average of these two components:

$$
a_j = \alpha \times \text{Agreement}_j + (1 - \alpha) \times \text{Confidence}_j
$$

- $\alpha$: A hyperparameter that controls the relative importance of agreement versus confidence. It allows for tuning the quality score to emphasize either the direct match with consensus labels or the alignment with classifier confidence, depending on the specific application and characteristics of the LFs and classifier.

This quality score $a_j$ allows for ranking LFs and identifying those that are most reliable and beneficial for the weak supervision task. It can guide efforts in refining LFs, discarding less effective ones, or focusing on improving the most promising labeling strategies.

### Advantages of Crowdlab in Weak Supervision

Applying Crowdlab to weak supervision offers several significant advantages:

- **Integrated Use of Multiple Information Sources**: Crowdlab excels at synergistically combining programmatic LFs with machine learning models. By dynamically weighting each source, it optimally integrates their information. This is especially beneficial in noisy weak supervision environments, leading to more reliable and accurate consensus labels.

- **Effective Handling of Label Function Diversity**:  Recognizing that LFs can vary significantly in noise levels and applicability across different data subsets, Crowdlab's capacity to assign individual weights to each LF is crucial. This tailored approach effectively manages diverse and noisy labeling strategies, enhancing the overall label aggregation process.

- **Classifier Agnostic Design**: Crowdlab's design is independent of the specific classifier model used. This classifier-agnostic nature means it can be integrated with any machine learning classifier. This flexibility ensures that Crowdlab can benefit from ongoing advancements in classifier technology, continually improving the quality of generated labels as better classifiers emerge.

- **Assessment of Label Function Trustworthiness**: By providing a quality score for each LF, Crowdlab offers valuable insights into the contributions of different labeling strategies. This assessment helps identify LFs that provide useful signals and those that introduce noise. Such insights are invaluable for iteratively refining weak supervision pipelines and for selecting the most reliable sources of weak labels, enhancing the overall process of weak supervision.

In conclusion, Crowdlab is another nice framework for weak supervision that not only generates improved consensus labels by combining LFs and classifier predictions but also provides mechanisms to understand and evaluate the reliability of different labeling approaches. This makes Crowdlab a valuable asset in scenarios where weak supervision is a primary method for generating training data, offering a structured and adaptive approach to handling the complexities and uncertainties innate in weakly labeled datasets.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from cleanlab.multiannotator import (
    get_label_quality_multiannotator,
    get_majority_vote_label,
)

In [70]:
# Crowdlab expects abstains to be represented as np.nan (Not a Number)
# Create a copy of the training label matrix (L_train) and convert it to float type
L_train_crowdlab = L_train.copy().astype(float)

# Replace all instances of -1 (abstains) in the copied label matrix with np.nan
L_train_crowdlab[L_train_crowdlab == -1] = np.nan

# Before training a machine learning model, we need to obtain initial consensus labels from the data annotations.
# These consensus labels represent a crude guess of the best label for each example.
# The most straightforward way to obtain an initial set of consensus labels is via simple majority vote.
# Use the get_majority_vote_label function to obtain the majority vote labels for the training set
crowdlab_majority_vote_train = get_majority_vote_label(L_train_crowdlab)

# Display the first 5 majority vote labels for the training set
crowdlab_majority_vote_train[:5]

array([0, 0, 0, 1, 1])

In [71]:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Initialize a Logistic Regression model with specific parameters
# random_state: ensures reproducibility
# n_jobs=-1: uses all available processors
# class_weight='balanced': adjusts weights inversely proportional to class frequencies
# max_iter=1000: sets the maximum number of iterations for the solver
model_lr_crowdlab_train = LogisticRegression(
    random_state=271828, n_jobs=-1, class_weight="balanced", max_iter=1000
)

# Perform cross-validated predictions on the training data
# estimator: the logistic regression model
# X: feature matrix for the training set
# y: majority vote labels for the training set
# cv: StratifiedKFold cross-validator with 10 splits, shuffling, and a fixed random state for reproducibility
# method='predict_proba': returns the predicted probabilities for each class
pred_probs_train = cross_val_predict(
    estimator=model_lr_crowdlab_train,
    X=X_train,
    y=crowdlab_majority_vote_train,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=271828),
    method="predict_proba",
)

# Calculate the quality of the labels provided by multiple annotators
# L_train_crowdlab: label matrix with np.nan for abstains
# pred_probs_train: predicted probabilities from the logistic regression model
# verbose=False: disables verbose output
labels_quality_train = get_label_quality_multiannotator(
    L_train_crowdlab, pred_probs_train, verbose=False
)

In [72]:
labels_quality_train.keys()

dict_keys(['label_quality', 'detailed_label_quality', 'annotator_stats'])

In [73]:
# Access the 'label_quality' column from the labels_quality_train DataFrame
# This column contains the improved consensus labels using information from each of the annotators and the model
# The DataFrame also includes information about the number of annotations, annotator agreement, and consensus quality score for each example
improved_consensus_labels = labels_quality_train["label_quality"]

# Display the improved consensus labels
improved_consensus_labels

Unnamed: 0,consensus_label,consensus_quality_score,annotator_agreement,num_annotations
0,1,0.524353,0.5,4
1,0,0.986632,1.0,3
2,0,0.990948,1.0,2
3,1,0.957038,1.0,1
4,1,0.929227,1.0,2
...,...,...,...,...
85826,1,0.933930,1.0,1
85827,1,0.947261,1.0,1
85828,1,0.976285,1.0,1
85829,1,0.964287,1.0,3


In [74]:
# Access the 'detailed_label_quality' column from the labels_quality_train DataFrame
# This column contains the label quality score for each label given by every annotator
# The label quality score provides insights into the reliability of each annotator's labels
detailed_label_quality_scores = labels_quality_train["detailed_label_quality"]

# Display the detailed label quality scores
detailed_label_quality_scores

Unnamed: 0,quality_annotator_0,quality_annotator_1,quality_annotator_2,quality_annotator_3,quality_annotator_4,quality_annotator_5,quality_annotator_6,quality_annotator_7,quality_annotator_8,quality_annotator_9,quality_annotator_10,quality_annotator_11,quality_annotator_12,quality_annotator_13,quality_annotator_14,quality_annotator_15,quality_annotator_16,quality_annotator_17,quality_annotator_18,quality_annotator_19,quality_annotator_20,quality_annotator_21,quality_annotator_22,quality_annotator_23,quality_annotator_24,quality_annotator_25,quality_annotator_26,quality_annotator_27
0,,,,,0.524353,0.524353,,,,,,,0.475647,,,,,,,,0.475647,,,,,,,
1,,,,,,,,,,,,,,,0.986632,0.986632,,,0.986632,,,,,,,,,
2,,,,,,,,,,,,,,,0.990948,,,,,,,,,,0.990948,,,
3,,,,,,,0.957038,,,,,,,,,,,,,,,,,,,,,
4,,,,,0.929227,0.929227,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85826,,,,,,,,0.933930,,,,,,,,,,,,,,,,,,,,
85827,,,,,,,0.947261,,,,,,,,,,,,,,,,,,,,,
85828,,0.976285,,,,,,,,,,,,,,,,,,,,,,,,,,
85829,,,,0.964287,,,0.964287,0.964287,,,,,,,,,,,,,,,,,,,,


In [75]:
# Access the 'annotator_stats' column from the labels_quality_train DataFrame
# This column provides the annotator quality score for each annotator, alongside other information such as:
# - The number of examples each annotator labeled
# - Their agreement with the consensus labels
# - The class they perform the worst at
# The annotator_stats DataFrame is sorted by increasing annotator_quality, showing the worst annotators first
annotator_stats = labels_quality_train["annotator_stats"]

# Display the annotator statistics
annotator_stats

Unnamed: 0,annotator_quality,agreement_with_consensus,worst_class,num_examples_labeled
9,0.30943,0.281307,1,2204
14,0.561334,0.575787,0,4130
8,0.590159,0.588732,1,355
13,0.703598,0.729555,0,1235
23,0.738175,0.757474,0,2977
18,0.810604,0.84132,0,2546
12,0.82488,0.850331,0,2265
7,0.830256,0.834608,1,7062
21,0.841196,0.881911,0,1465
26,0.850247,0.894737,0,1178


In [76]:
# Extract the consensus labels from the 'label_quality' column of the labels_quality_train DataFrame
# These labels represent the improved consensus labels using information from each of the annotators and the model
y_train_crowdlab = labels_quality_train["label_quality"]["consensus_label"].values

# Train and evaluate a model using the training features (X_train) and the consensus labels (y_train_crowdlab)
# The function 'train_and_evaluate_model' trains a machine learning model and evaluates its performance on the test set
results_crowdlab_em = train_and_evaluate_model(
    X_train, y_train_crowdlab, X_test, y_test, "CrowdLab Model"
)



Label Model name: CrowdLab Model
Metric                                   Score
Accuracy Score:                        0.95197
Balanced Accuracy Score:               0.94242
F1 Score (weighted):                   0.95194
Cohen Kappa Score:                     0.88595
Matthews Correlation Coefficient:      0.88595

Classification Report:

              precision    recall  f1-score   support

           0       0.92      0.92      0.92      7050
           1       0.96      0.97      0.97     16290

    accuracy                           0.95     23340
   macro avg       0.94      0.94      0.94     23340
weighted avg       0.95      0.95      0.95     23340


Confusion Matrix:

Class 0 has 576 false negatives and 545 false positives.
Class 1 has 545 false negatives and 576 false positives.
The total number of errors is 1121 out of 23340 samples (error rate: 0.0480).
Predicted      0       1     All
True                            
0          6,474     576   7,050
1            545  15

In [77]:
all_results.append(results_crowdlab_em)
df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
11,Generative Model + End Model,0.88941,LM + EM
2,Snorkel MeTaL + End Model,0.888925,LM + EM
9,Dawid-Skene + End Model,0.886019,LM + EM
15,CrowdLab Model + End Model,0.885951,LM + EM
3,Majority Label Voter + End Model,0.87612,LM + EM
5,HyperLabelModel Unsupervised + End Model,0.844322,LM + EM
7,HyperLabelModel Semi-Supervised + End Model,0.833944,LM + EM
10,Dawid-Skene,0.724999,LM alone
13,FlyingSquid + End Model,0.715622,LM + EM


### Using CrowdLab to Refine Labeling Function Sets by Removing Noisy LFs

As observed, models like Dawid & Skene and Majority Voting perform pretty well in many label aggregation tasks. However, advanced models such as HyperLM, FlyingSquid, and Crowdlab, offer significant advantages for complex and large-scale weak supervision scenarios due to their computational efficiency, scalability, and adaptability to diverse labeling strategies. Crowdlab especially excels in combining the strengths of label functions and classifier predictions, along with its capacity to estimate LF quality and generate high-quality consensus labels. This section will detail how Crowdlab can be specifically utilized to identify and mitigate the impact of noisy LFs by employing its quality estimation capabilities.

CrowdLab enables a systematic approach to refine sets of LFs by identifying and potentially removing those that are deemed noisy or less reliable. This refinement process is based on evaluating each LF's performance through quality scores, which are derived from the LF's agreement with the consensus labels and its prediction confidence.  We can, then, strategically remove less effective LFs, potentially enhancing the overall quality of the aggregated labels.

We can break down the process of noisy LF removal using Crowdlab into four key steps:

1.  **Quality Score Calculation for Each Labeling Function**:

    -   **Process**: The first step involves computing a complete quality score, denoted as $a_j$, for each labeling function $LF_j$. This score serves as a metric for evaluating the trustworthiness and effectiveness of each LF.
    -   **Basis of Score**: The quality score $a_j$ is typically a composite measure, incorporating at least two key aspects of an LF's performance:
        -   **Agreement with Consensus Labels**: This component quantifies how frequently the labels provided by $LF_j$ match the consensus labels generated by CrowdLab. High agreement indicates that the LF is generally aligned with the aggregated wisdom of all sources.
        -   **Prediction Confidence (or Alignment with Classifier)**: This component assesses how well the predictions of $LF_j$ align with the predictions made by the classifier component within CrowdLab. LFs that frequently agree with high-confidence predictions from the classifier are generally considered more reliable. *Note*: As mentioned previously, the exact method for quantifying "prediction confidence" may vary, but it generally involves assessing the extent to which an LF's outputs are in sync with the classifier's probabilistic predictions or similar confidence measures.
    -   **Weighted Combination**: The final quality score $a_j$ is typically calculated as a weighted combination of these measures, allowing for tunable emphasis on either agreement or prediction confidence, as per the formula:

        $$
        a_j = \alpha \times \text{Agreement}_j + (1 - \alpha) \times \text{Confidence}_j
        $$

        Where $\alpha$ is a weighting parameter (typically between 0 and 1) that determines the balance between these two aspects of LF quality.

    -   **Outcome**: This step results in a quality score $a_j$ for each $LF_j$, providing a quantifiable measure of each LF's performance within the CrowdLab framework.

2.  **Ranking Labeling Functions Based on Quality Scores**:

    -   **Ordering**: Once quality scores are calculated for all LFs, the next step is to rank them in descending order based on these scores. This ranking helps to systematically identify which LFs are considered most reliable (higher scores) and which are potentially less reliable or noisy (lower scores).
    -   **Identification of Noisy LFs**: LFs that appear at the lower end of this ranking, i.e., those with significantly lower quality scores, are flagged as potentially noisy or less effective. These are the primary candidates for removal or further scrutiny.
    -   **Purpose of Ranking**: Ranking provides a clear, ordered view of LF performance, enabling informed decisions about which LFs to retain and which to consider removing to improve overall label quality.

3.  **Thresholding Quality Scores to Filter LFs**:

    -   **Threshold Definition**: To automate or semi-automate the process of noisy LF removal, a quality score threshold can be established. This threshold acts as a cutoff point below which LFs are considered insufficiently reliable.
    -   **Filtering Process**: LFs with quality scores that fall below this predefined threshold are flagged as noisy. These LFs are then considered for exclusion from the subsequent label aggregation process.
    -   **Setting the Threshold**: The selection of an appropriate threshold may involve experimentation and validation. It might be determined based on:
        -   **Empirical Analysis**: Examining the distribution of quality scores and identifying natural breakpoints or lower-performing clusters.
        -   **Performance Benchmarking**: Testing different threshold values and evaluating the impact on the quality of the consensus labels on a validation set or through cross-validation.
        -   **Domain Knowledge**: Incorporating expert insights into what constitutes an acceptable level of LF reliability in the specific application context.

4.  **Validation of Noisy LF Removal**:

    -   **Re-evaluation of Consensus Labels**: After identifying and excluding LFs that fall below the quality threshold, it's crucial to re-evaluate the consensus labels. This involves re-running the CrowdLab aggregation process, but this time, *without* the excluded LFs.
    -   **Quality Metric Measurement**: To objectively assess the impact of removing the noisy LFs, appropriate quality metrics should be measured. This could include:
        -   **Agreement with a Gold Standard**: If a small, high-quality validation set is available, measure the agreement of the new consensus labels with these gold standard labels.
        -   **Internal Consistency Metrics**: Evaluate metrics that reflect the internal consistency and confidence of the aggregated labels, such as entropy or variance in probabilistic label distributions.
        -   **Downstream Task Performance**: Assess the performance of a model trained on the refined consensus labels in a downstream task (if applicable). Improved performance in the downstream task is a strong indicator of enhanced label quality.
    -   **Threshold Adjustment**: Based on the validation results, it might be necessary to adjust the quality score threshold. If removing LFs based on the initial threshold does not yield the desired improvement (or even degrades performance), the threshold could be refined (e.g., made more or less strict) and the validation process repeated. This iterative refinement helps to find an optimal set of LFs that maximizes the quality of the consensus labels.

CrowdLab offers a principled and effective method for identifying and mitigating the impact of noisy labeling functions, ultimately improving the quality of weakly supervised learning pipelines. This process allows for a more strong and reliable label aggregation, enhancing the potential of weak supervision in scenarios where labeled data is limited or expensive to obtain.

In [78]:
# Generate descriptive statistics for the 'annotator_quality' column in the 'annotator_stats' DataFrame

labels_quality_train["annotator_stats"]["annotator_quality"].describe(
    percentiles=[0.05, 0.10, 0.15, 0.20, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.9]
)

count    28.000000
mean      0.840280
std       0.145182
min       0.309430
5%        0.571423
10%       0.669567
15%       0.741797
20%       0.816315
25%       0.828912
30%       0.842101
40%       0.867522
50%       0.879746
60%       0.911009
70%       0.923738
75%       0.927615
90%       0.941487
max       0.957423
Name: annotator_quality, dtype: float64

In [79]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, StratifiedKFold


def remove_worst_annotators_by_quantile(
    quantile: float,
    labels_quality: dict,
    X_train: np.ndarray,
    L_train: np.ndarray,
    labeling_functions: list,
) -> tuple[list, np.ndarray, np.ndarray]:
    """
    Remove the worst annotators based on a given quantile of annotator quality.

    Args:
        quantile (float): The quantile threshold for removing bad annotators.
        labels_quality (dict): Dictionary containing annotator statistics.
        X_train (np.ndarray): Training feature matrix.
        L_train (np.ndarray): Training label matrix.
        labeling_functions (list): list of labeling functions.

    Returns:
        tuple[list, np.ndarray, np.ndarray]: A tuple containing the filtered list of labeling functions,
                                             the filtered training label matrix, and the filtered training feature matrix.
    """
    # Calculate the quality threshold for the given quantile
    quality_threshold = labels_quality["annotator_stats"]["annotator_quality"].quantile(
        quantile
    )

    # Identify annotators whose quality is below or equal to the threshold
    bad_annotators = labels_quality["annotator_stats"][
        labels_quality["annotator_stats"]["annotator_quality"] <= quality_threshold
    ].index.values

    print(
        f'Annotators to remove: {bad_annotators} - ({len(bad_annotators)} out of {len(labels_quality["annotator_stats"])})'
    )

    # Filter out the bad annotators from the list of labeling functions
    good_labeling_functions = [
        lf
        for lf, idx in zip(labeling_functions, range(len(labeling_functions)))
        if idx not in bad_annotators
    ]

    # Filter out the bad annotators from the label matrix
    L_train_filtered = L_train[
        :, [idx for idx in range(len(labeling_functions)) if idx not in bad_annotators]
    ].copy()

    # Convert the filtered label matrix to float type
    L_train_filtered = L_train_filtered.astype(float)

    # Identify rows where all labeling functions abstained (-1 indicates abstain)
    all_abstained_rows = np.all(L_train_filtered == -1, axis=1)

    # Remove rows where all labeling functions abstained
    L_train_filtered = L_train_filtered[~all_abstained_rows]
    X_train_filtered = X_train[~all_abstained_rows]

    # Replace abstain values (-1) with NaN
    L_train_filtered[L_train_filtered == -1] = np.nan

    return good_labeling_functions, L_train_filtered, X_train_filtered


def get_crowdlab_results(
    L_train_crowdlab: np.ndarray,
    X_train_crowdlab: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    quantile: float,
    initial_consensus_labels: np.ndarray | None = None,
) -> dict:
    """
    Train and evaluate a logistic regression model using filtered training data and consensus labels.

    Args:
        L_train_crowdlab (np.ndarray): Filtered training label matrix.
        X_train_crowdlab (np.ndarray): Filtered training feature matrix.
        X_test (np.ndarray): Test feature matrix.
        y_test (np.ndarray): Test labels.
        quantile (float): The quantile threshold for removing bad annotators.
        initial_consensus_labels (Optional[np.ndarray]): Initial consensus labels. If None, majority vote is used.

    Returns:
        dict: Results of the model evaluation.
    """
    if initial_consensus_labels is None:
        # Obtain initial consensus labels using majority vote from the label matrix
        initial_consensus_labels = get_majority_vote_label(L_train_crowdlab)

    # Initialize a logistic regression model with specific parameters
    logistic_model = LogisticRegression(
        random_state=271828, n_jobs=-1, class_weight="balanced", max_iter=1000
    )

    # Perform cross-validated prediction to obtain predicted probabilities
    predicted_probabilities = cross_val_predict(
        estimator=logistic_model,
        X=X_train_crowdlab,
        y=initial_consensus_labels,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=271828),
        method="predict_proba",
    )

    # Calculate the quality of labels using the predicted probabilities
    labels_quality = get_label_quality_multiannotator(
        L_train_crowdlab, predicted_probabilities, verbose=False
    )

    # Extract the consensus labels from the label quality results
    consensus_labels = labels_quality["label_quality"]["consensus_label"].values

    # Train and evaluate the model using the processed training data and consensus labels
    evaluation_results = train_and_evaluate_model(
        X_train_crowdlab,
        consensus_labels,
        X_test,
        y_test,
        f"CrowdLab without {int(quantile * 100)}% worst LFs",
    )

    return evaluation_results

In [80]:
good_labeling_functions_dict = {}

for alpha in [0.05, 0.10, 0.15, 0.20, 0.25]:

    # Remove the worst annotators based on the specified quantile from the training data
    # This function filters out the worst-performing annotators by their quality scores
    good_labeling_functions, L_train_filtered, X_train_filtered = (
        remove_worst_annotators_by_quantile(
            alpha, labels_quality_train, X_train, L_train, lfs
        )
    )

    # Get the results of the CrowdLab model after removing the worst annotators
    # This function trains and evaluates a model using the filtered training data and the test set
    good_labeling_functions_dict[alpha] = get_crowdlab_results(
        L_train_filtered, X_train_filtered, X_test, y_test, alpha
    )

Annotators to remove: [ 9 14] - (2 out of 28)


Label Model name: CrowdLab without 5% worst LFs
Metric                                   Score
Accuracy Score:                        0.95917
Balanced Accuracy Score:               0.94609
F1 Score (weighted):                   0.95894
Cohen Kappa Score:                     0.90208
Matthews Correlation Coefficient:      0.90244

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.91      0.93      7050
           1       0.96      0.98      0.97     16290

    accuracy                           0.96     23340
   macro avg       0.96      0.95      0.95     23340
weighted avg       0.96      0.96      0.96     23340


Confusion Matrix:

Class 0 has 613 false negatives and 340 false positives.
Class 1 has 340 false negatives and 613 false positives.
The total number of errors is 953 out of 23340 samples (error rate: 0.0408).
Predicted      0       1     All
True                      

In [81]:
for alpha, result in good_labeling_functions_dict.items():
    all_results.append(result)


df_baseline = pd.DataFrame(all_results)
df_baseline.sort_values(by="Matthews Correlation", ascending=False)

Unnamed: 0,Label Model,Matthews Correlation,Type
19,CrowdLab without 20% worst LFs + End Model,0.90697,LM + EM
4,Full Supervision on Dev Set + End Model,0.906205,Full supervision
16,CrowdLab without 5% worst LFs + End Model,0.90244,LM + EM
18,CrowdLab without 15% worst LFs + End Model,0.892357,LM + EM
11,Generative Model + End Model,0.88941,LM + EM
2,Snorkel MeTaL + End Model,0.888925,LM + EM
20,CrowdLab without 25% worst LFs + End Model,0.888636,LM + EM
17,CrowdLab without 10% worst LFs + End Model,0.887548,LM + EM
9,Dawid-Skene + End Model,0.886019,LM + EM
15,CrowdLab Model + End Model,0.885951,LM + EM


As previously noted, traditional label aggregation methods such as Dawid & Skene and Majority Voting can perform surprisingly well. However, contemporary models like HyperLM, FlyingSquid, and Crowdlab offer distinct advantages, especially concerning computational efficiency, adaptability, and scalability for sophisticated, large-scale weak supervision tasks. Crowdlab is particularly noteworthy due to its capacity to integrate the strengths of both label functions and classifier-based predictions, alongside its built-in ability to assess label function quality and generate high-quality consensus labels.

The preceding table underscores a crucial insight: the necessity of pairing a label model with a downstream "end model." While label models excel at capturing and deciphering the often complex relationships between labeling functions and the latent true labels, they may not be optimally designed for generalization to entirely new, unseen data instances.  Conversely, end models, often discriminative models like neural networks, are explicitly trained to generalize. By learning from the probabilistic or consensus labels generated by the label model, the end model can effectively generalize beyond the specific heuristics encoded in the labeling functions and perform robustly on novel, unseen examples.

This strategic combination leverages the distinct strengths of each model type:

-   **Label Model Strength**: Expertise in distilling and aggregating noisy signals from multiple weak sources (Labeling Functions), effectively capturing fundamental data patterns and LF inter-dependencies. It is adept at creating a refined, probabilistic label set from initially noisy inputs.
-   **End Model Strength**: Capacity for strong generalization. Trained on the refined labels from the label model, the end model learns to make accurate predictions on new, unseen data, going beyond the limitations of individual LFs.

This collaboration is cooperative; the label model refines the noisy input, and the end model ensures effective real-world applicability.  This pairing consistently leads to enhanced performance on downstream tasks compared to relying solely on either type of model in isolation.

Besides, as previously discussed, the CrowdLab model offers an important additional benefit: the ability to identify and aid the removal of noisy Labeling Functions within weak supervision frameworks. CrowdLab employs confident-learning techniques to evaluate and score each LF's quality. This evaluation is based on the consistency of an LF's outputs with the overall consensus labels and the confidence of its predictions (or alignment with the classifier component). By providing a quantifiable measure of each LF's reliability, CrowdLab allows for the strategic filtering of less reliable label sources. Removing noisy LFs leads to a more refined and accurate label aggregation process, ultimately resulting in higher-quality training labels. Machine learning models trained on these improved weakly supervised datasets subsequently exhibit enhanced performance and robustness.



### Addressing Noisy Examples Beyond Noisy Labeling Functions

Building upon the strategy for managing noisy LFs, a natural extension arises: how to identify and mitigate the impact of not just noisy *labeling functions*, but also noisy or problematic *examples* within a weakly supervised dataset? Addressing noisy examples is crucial as individual data points can be inherently flawed, mislabeled even by consensus, or outliers that negatively impact model training and generalization.

A powerful technique to tackle this challenge is the application of **Influence Functions**.  Moving beyond the issue of noisy LFs, Influence Functions provide a methodology to analyze the impact of individual *training examples* on a trained model's parameters and predictions.

**Influence Functions: A Tool for Identifying Noisy Examples**

Influence Functions, fundamentally, quantify how much a model's parameters and predictions would change if a particular training example were upweighted or removed from the training dataset. with respect to weak supervision and potentially noisy labels, Influence Functions become invaluable for:

-   **Detecting Anomalous Influence**: Identifying training examples that exert disproportionately large influence on the model's behavior. Examples with unusually high influence are candidates for being noisy, incorrect, or outliers.
-   **Pinpointing Detrimental Examples**: Determining if the influence of a specific example is *detrimental* to the model's performance, particularly on a validation set.  Negative influence on validation performance is a strong indicator of a noisy or problematic example.
-   **Guiding Data Refinement**: By ranking training examples based on their influence scores, we can systematically identify and investigate the most influential examples. These examples become prime candidates for manual review, correction, or removal from the training set, leading to a more refined and higher-quality dataset.

We will explore the steps involved in applying Influence Functions to address noisy examples in weak supervision in our next class.

## Takeaways
- **Weak Supervision for Data Scarcity:**  Advanced label models are crucial for effectively employing weak supervision techniques to train models when labeled data is limited.

- **Label Model Diversity:**  Different label models offer various approaches to label aggregation, each with its own strengths, assumptions, and computational properties. The choice of model depends on the specific needs of the task.

- **Beyond Simple Aggregation:**  Moving beyond simple majority voting, advanced label models like HyperLM, Dawid & Skene, FlyingSquid, and Crowdlab offer more sophisticated methods to handle noisy and conflicting labels, leading to improved label quality.

- **Quality over Quantity of LFs:** The quality and reliability of labeling functions are as important as the number of LFs. Identifying and removing noisy or less effective LFs (as demonstrated with Crowdlab) can improve the overall performance of the weak supervision pipeline.

- **Interaction of Label and End Models:** Combining label models for noise reduction with discriminative end models for generalization is a powerful example in weak supervision. This two-stage approach leverages the strengths of both types of models for stable performance.

- **Addressing Noisy Data Broadly:**  Noisy data can manifest in both labeling functions and individual data examples. Techniques like Crowdlab and Influence Functions are essential tools for identifying and mitigating different sources of noise in weakly supervised learning.


# Questions

1. What is Programmatic Weak Supervision (PWS) and why is label aggregation critical challenge in this context?

2. How does the Majority Label Voter model operate within weak supervision pipelines?

3. What are the key components of the Hyper Label Model (HyperLM) and how do its unsupervised and semi-supervised training approaches differ?

4. In what way does the Dawid & Skene model utilize the Expectation-Maximization (EM) algorithm to infer true labels from multiple labeling functions?

5. How does the Snorkel Generative Model produce probabilistic labels from noisy labeling function outputs, and how does it contrast with the later MeTaL model?

6. What is the central concept behind the FlyingSquid model and how is triplet decomposition used to efficiently estimate labeling function accuracies?

7. How does Crowdlab combine labeling function outputs with classifier predictions to generate high-quality consensus labels?

8. What preprocessing steps are applied to the label matrix (L_train) to handle abstentions and improve the overall quality of labeling function outputs?

9. Why is it important to pair a label model with a downstream "end model" in weak supervision applications?

10. How does Crowdlab enable the refinement of labeling function sets?

`Answers are commented inside this cell.`


<!-- 
1. Programmatic Weak Supervision (PWS) uses multiple labeling functions (LFs) to automatically generate noisy labels for training data. Label aggregation is critical because the outputs from various LFs are inherently noisy and may conflict, so combining them effectively is essential to obtain reliable labels.

2. The Majority Label Voter model aggregates the outputs from multiple LFs by simply taking the majority vote for each data point. In cases of ties, a random tie-break policy is typically applied, making it a straightforward baseline for label aggregation.

3. HyperLM utilizes a Graph Neural Network (GNN) to approximate the optimal aggregation of LF outputs. In its unsupervised mode, it is trained on synthetic data to learn how to aggregate LF signals without ground truth labels, while in the semi-supervised mode, it is fine-tuned with a small set of known labels to improve the inferred label quality.

4. The Dawid & Skene model treats each LF as an annotator characterized by a confusion matrix that reflects its reliability. Using the EM algorithm, it alternates between estimating the posterior probabilities of the true labels (E-step) and updating the confusion matrices and class priors (M-step) to better model the noise and derive accurate label estimates.

5. The Snorkel Generative Model estimates LF accuracies and correlations from noisy outputs by constructing a probabilistic factor graph that models the relationship between the true labels and LF outputs. Unlike the MeTaL model that introduces more advanced mechanisms and joint optimization strategies, the generative model primarily focuses on combining LF votes based on estimated accuracies.

6. Facing the computational challenges of weak supervision, the FlyingSquid model uses a closed-form solution based on triplet decomposition, where agreements among triplets of LFs are analyzed to directly compute the accuracies of each LF. This approach avoids iterative optimization and significantly improves computational efficiency.

7. Crowdlab integrates LF outputs with the predictions of a classifier through a weighted ensemble method. It assigns weights to each LF based on its quality (using measures like agreement with the consensus and prediction confidence) and fuses these with cross-validated classifier probabilities to produce reliable consensus labels.

8. The notebook first removes LFs that abstain on all rows and then filters out rows where every LF abstains. Besides, it converts abstain signals (originally represented as -1) into NaN values to assist subsequent quality analyses and consensus label computation.

9. Pairing a label model with a downstream end model is vital because label models excel at aggregating noisy signals from multiple LFs but are often not designed for generalizing to unseen data. In contrast, end models (e.g., discriminative classifiers) trained on refined labels can learn to generalize to new, real-world instances, using the strengths of both approaches.

10. Crowdlab enables LF refinement by computing quality scores for each labeling function—based on their agreement with consensus labels and prediction confidence—and then ranking them. Label functions with scores below a selected quantile threshold are identified as noisy and can be removed, which in turn improves the overall quality of the aggregated labels. -->
