# Annotation Error Detection
## Datacentric AI - Learning with Limited Label using Weak Supervision and Uncertainty-Aware Training
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)


> Parts of this notebook were adapted from the [MIT Introduction to Data-Centric AI](https://dcai.csail.mit.edu) course taught by the amazing folks from [Cleanlab](https://cleanlab.ai/)

## Summary

### Keypoints
- Annotation Error Detection (AED) is crucial for identifying and correcting mislabeled instances in datasets, improving the quality of training data for machine learning models.

- AED methods can be categorized into two types: Flaggers (binary classification of errors) and Scorers (assigning error likelihood scores).

- Label noise can be classified into three main types: Uniform/Symmetric, Systematic/Asymmetric, and Instance-Dependent, each with different repercussions for model training.

- The concept of a label noise transition matrix is fundamental to Confident Learning, capturing the probabilities of observing incorrect labels given the true labels.

- Retagging is a technique used in both Weak Supervision Learning (WSL) and traditional supervised learning to refine datasets by leveraging model predictions to identify potential label errors.

- Confident Learning uses out-of-sample predictions and thresholded confidences to estimate the label noise transition matrix and identify potentially mislabeled instances.

- Real-world datasets, even those considered clean, often contain significant levels of label noise, which can substantially impact model performance and evaluation metrics.

### Takeaways
- Data quality is paramount in machine learning; addressing label noise through techniques like AED and Confident Learning can lead to more robust and reliable models.

- The presence of label noise in both training and test sets necessitates careful consideration in model evaluation and comparison, as traditional metrics may be misleading.

- Integrating AED into the data preparation pipeline, especially in weak supervision contexts, can significantly enhance the quality of training data and subsequent model performance.

- Understanding the nature and extent of label noise in datasets is crucial for developing effective strategies to mitigate its impact on model training and evaluation.

- Techniques like retagging and Confident Learning provide data-centric approaches to improving machine learning outcomes, emphasizing the importance of high-quality data over increasingly complex models.

## Notation

Understanding the notations used in discussions about noisy labels and their true counterparts is crucial for grasping the basic concepts. Below, I provide a detailed explanation of each notation and its significance.

### Key Notations and Definitions

- **$\tilde{y}$**: This symbol represents the observed label, which may be noisy. In practical scenarios, the labels we get from data collection processes are often prone to errors, hence termed "noisy."

- **$y^*$**: This denotes the unobserved, latent, or correct label. It is the true label that we aim to infer or predict, which remains hidden due to noise in the observed data.

### Sets and Counts

- **$X_{\tilde{y}=i, y^*=j}$**: This set includes all examples where the noisy observed label is $i$, but the actual true label is $j$. By examining these sets, we can identify instances where the observed label deviates from the true label.

- **$C_{\tilde{y}=i, y^*=j}$**: This notation represents the count of examples in the set $X_{\tilde{y}=i, y^*=j}$. In other words, it is the number of instances where the observed label is $i$, but the true label is $j$. These counts help in estimating the noise distribution.

### Probabilities and Distributions

- **$p(\tilde{y}=i, y^*=j)$**: This is the joint probability distribution of noisy labels and true labels. It provides the probability that an instance has an observed label $i$ and a true label $j$. We can estimate this joint distribution by normalizing the counts $C_{\tilde{y}=i, y^*=j}$.

- **$p(\tilde{y}=i \mid y^*=j)$**: Known as the transition probability, this conditional probability gives the likelihood that a true label $j$ is flipped to a noisy observed label $i$. It is crucial for understanding how noise affects the labeling process and for developing methods to correct or account for this noise.

### Detailed Explanation

To further clarify these concepts, let's consider a practical example:

Suppose we have a dataset of images labeled with animal categories (e.g., cats, dogs, horses). Due to some mislabeling during data collection, some images of cats might be incorrectly labeled as dogs.

- **Noisy Label ($\tilde{y}$)**: The label assigned during data collection, which may be incorrect.
- **True Label ($y^*$)**: The actual category the image belongs to.

For instance, if we have an image of a cat that was incorrectly labeled as a dog:
- **$X_{\tilde{y}=dog, y^*=cat}$**: This set would include this misclassified image.
- **$C_{\tilde{y}=dog, y^*=cat}$**: This count would increase by one for this misclassified image.

By examining these counts across the dataset, we can estimate the joint distribution $p(\tilde{y}=i, y^*=j)$ and the transition probabilities $p(\tilde{y}=i \mid y^*=j)$, which help us understand and potentially correct the noise in our labels.

> **Note**: The transition probability $p(\tilde{y}=i \mid y^*=j)$ is essential for designing algorithms that can handle noisy labels. By estimating these probabilities, we can adjust our models to better reflect the true labels, improving the overall performance and reliability of our predictions.
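
Written out in the notation above, the normalization step takes the following form. The hat marks an estimate, and the index names $k$ and $l$ are introduced here only for the sums. In practice $y^*$ is unobserved, so the counts themselves have to be estimated first.

$$
\hat{p}(\tilde{y}=i, y^*=j) = \frac{C_{\tilde{y}=i, y^*=j}}{\sum_{k}\sum_{l} C_{\tilde{y}=k, y^*=l}},
\qquad
\hat{p}(\tilde{y}=i \mid y^*=j) = \frac{C_{\tilde{y}=i, y^*=j}}{\sum_{k} C_{\tilde{y}=k, y^*=j}}
$$

The joint divides each count by the grand total, while the transition probability divides it by the total for true class $j$ only.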

### FAQ

- **Why is it important to distinguish between $\tilde{y}$ and $y^*$?**
    - Distinguishing between observed and true labels is crucial because the presence of noise can significantly impact the performance of machine learning models. By understanding the noise structure, we can develop methods to mitigate its effects.

- **How do we estimate $p(\tilde{y}=i, y^*=j)$?**
    - We estimate this joint distribution by normalizing the counts $C_{\tilde{y}=i, y^*=j}$, which gives us the proportion of instances with each combination of observed and true labels.


## Sources of Noisy Labels

Noisy labels can stem from multiple factors:

1. **Human Error**
    - Accidental misclassification (e.g., clicking the wrong button)
    - Mistakes due to fatigue or lack of attention
    - Incompetence or insufficient domain knowledge

2. **Measurement Issues**
    - Imprecise or faulty data collection tools
    - Inconsistent measurement methodologies

3. **Algorithmic Errors**
    - Propagation of errors from other ML models used in data labeling (e.g., [a weak supervision pipeline](Notebook_03.ipynb))
    - Automated labeling systems with innate biases or flaws

4. **Data Corruption**
    - Technical glitches during data storage or transfer
    - Malicious tampering with datasets

5. **Subjective Interpretations**
    - Ambiguous cases where multiple labels could apply
    - Cultural or contextual differences in label interpretation

## The Concept of Label Flipping

Label flipping occurs when an instance is assigned an incorrect label, effectively "flipping" it from its true category to another. This phenomenon is central to understanding noisy labels.

**Examples of Label Flipping:**

- Visual Misclassification: An image of a bus labeled as a car
- Sentiment Analysis Error: A positive review ("Muito bom, gostei bastante!") incorrectly tagged as negative.
- NER Mislabeling: A named entity "Elias Jacob" mistakenly left untagged in some instances.

## Quantifying Label Noise

To understand the extent of label noise, we can arrange the counts $C_{\tilde{y}, y^*}$ in a confusion-matrix-style table that compares true labels ($y^*$, columns) with observed, potentially noisy labels ($\tilde{y}$, rows).

<div align="center">

| $C_{\tilde{y}, y^*}$ | $y^*=\text{bus}$ | $y^*=\text{car}$ | $y^*=\text{bike}$ |
|------------------------|--------------------|--------------------|--------------------|
| $\tilde{y}=\text{bus}$ | 90 | 35 | 25 |
| $\tilde{y}=\text{car}$ | 50 | 70 | 5 |
| $\tilde{y}=\text{bike}$ | 30 | 15 | 75 |

</div>

**Interpreting the Matrix:**
- Diagonal elements represent correctly labeled instances
- Off-diagonal elements indicate label flips
- For example, 50 instances of true "bus" were mislabeled as "car"
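
As a minimal sketch, the counts in the table can be turned into a joint distribution and per-class transition probabilities with plain Python. The variable names are illustrative, and the table assumes the true labels are known, which is rarely the case outside a toy example.

```python
# Counts from the table above: rows are observed labels, columns are true labels.
classes = ["bus", "car", "bike"]
counts = [
    [90, 35, 25],  # observed = bus
    [50, 70, 5],   # observed = car
    [30, 15, 75],  # observed = bike
]
n = len(classes)
total = sum(sum(row) for row in counts)

# Joint distribution p(observed = i, true = j): divide every cell by the grand total.
joint = [[c / total for c in row] for row in counts]

# Transition probabilities p(observed = i | true = j): divide each cell by its column sum.
col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
transition = [[counts[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

# Overall fraction of flipped labels: the off-diagonal mass of the joint.
noise_fraction = 1.0 - sum(joint[i][i] for i in range(n))

print(f"flipped labels: {noise_fraction:.1%}")
print(f"p(observed=car | true=bus) = {transition[1][0]:.3f}")
```

With these counts, 160 of the 395 examples (about 40.5%) sit off the diagonal, and 50 of the 170 true buses (about 29.4%) carry the observed label car.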

## Consequences of Noisy Labels

1. **Model Performance:** Noisy labels can lead to decreased model accuracy and generalization.
2. **Training Difficulties:** Models may struggle to learn true patterns, instead fitting to noise.
3. **Evaluation Challenges:** Noisy test sets can provide inaccurate assessments of model performance.
4. **Bias Introduction:** Systematic label noise can introduce or enhance biases in the model.
5. **Resource Waste:** Training on noisy data can waste computational resources and time.

## Understanding Uncertainty
In machine learning, **uncertainty** reflects a model's lack of confidence in its predictions. It signifies the possibility of errors and provides crucial information about the reliability of a model's output. Understanding and managing uncertainty is essential for developing robust machine learning models.

Two main types of uncertainty impact a model's predictions:

### 1. Aleatoric Uncertainty

**Aleatoric uncertainty** stems from inherent randomness or noise in the data itself. This type of uncertainty is irreducible, even with a perfect model, because it reflects the stochasticity of the process being modeled.

- **Example:** In predicting house prices, factors like unpredictable market fluctuations or slight variations between similar houses introduce aleatoric uncertainty. Even with thorough data and a perfect model, the exact price cannot be predicted with certainty because of these variations.

### 2. Epistemic Uncertainty

**Epistemic uncertainty** arises from limitations in the model's knowledge or representation of the fundamental relationship between input features and target variables. This uncertainty can be reduced by improving the model, such as using a more complex architecture or providing more training data.

- **Example:** A model trained to classify images of cats and dogs might struggle with images of kittens or puppies if the training data lacked these examples. This uncertainty stems from the model's limited knowledge and can be reduced by incorporating more diverse training data.

### The Role of Label Noise in Disambiguating Uncertainty

In real-world scenarios, distinguishing between aleatoric and epistemic uncertainty can be challenging. Observed errors might originate from noisy data (aleatoric) or an imperfect model (epistemic). **Label noise**, where training data labels are incorrect, further complicates this distinction.

To disentangle these uncertainties, a **label noise process assumption** is often employed. This assumption posits a specific mechanism for how label noise corrupts the true labels.

#### Class-Conditional Label Noise Model

A common assumption is the **class-conditional label noise** model, which assumes that label noise depends only on the true class label and not on the input features:

$$
p(\tilde{y} \mid y^*; x) = p(\tilde{y} \mid y^*)
$$

Here:

- $ \tilde{y} $ represents the observed (potentially noisy) label.
- $ y^* $ denotes the true latent label.
- $ x $ represents the input features.

This assumption simplifies the problem by suggesting that label noise introduces a consistent pattern of errors across all data points belonging to the same class, regardless of their specific features. It allows us to model the noise using a transition matrix $ p(\tilde{y} \mid y^*) $, capturing the probabilities of labels being flipped between different classes.

By incorporating a label noise process assumption, we can separate the impact of noisy labels (aleatoric uncertainty) from the model's built-in limitations (epistemic uncertainty). This separation is crucial for developing techniques to mitigate the effects of noise and improve the reliability of machine learning models.
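
To make the class-conditional assumption concrete, here is a minimal simulation sketch. The function name `corrupt_labels`, the three-class noise table, and the sample sizes are all hypothetical choices for illustration.

```python
import random

def corrupt_labels(true_labels, noise_given_true, seed=0):
    """Sample one observed label per true label.

    noise_given_true[j][i] holds p(observed = i | true = j), so each inner
    list must sum to 1. The input features x are never consulted, which is
    exactly what the class-conditional assumption states.
    """
    rng = random.Random(seed)
    classes = list(range(len(noise_given_true)))
    return [
        rng.choices(classes, weights=noise_given_true[j], k=1)[0]
        for j in true_labels
    ]

# Hypothetical noise process: class 0 leaks into class 1, class 2 is clean.
noise_given_true = [
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
true_labels = [0] * 5000 + [1] * 5000 + [2] * 5000
observed = corrupt_labels(true_labels, noise_given_true)

# Empirical flip rate for class 0; it should land close to the specified 0.2.
flipped_0 = sum(1 for y, o in zip(true_labels, observed) if y == 0 and o != 0) / 5000
print(f"empirical flip rate for class 0: {flipped_0:.3f}")
```

Because the sampler only looks at the true class, every example of a given class faces the same flip probabilities, no matter how easy or hard it is to label.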

## Types of Label Noise in Machine Learning

Label noise can significantly affect the performance of machine learning models. Understanding the types of label noise is essential for developing robust models and noise-handling techniques. There are three primary types of label noise:

### 1. Uniform/Symmetric Class-Conditional Label Noise

This type of noise assumes that mislabeling occurs with equal probability across all incorrect classes.

- **Definition:** For any incorrect label $ i $ and true label $ j $, the probability of mislabeling is constant:

$$
p(\tilde{y} = i \mid y^* = j) = \epsilon, \quad \forall i \neq j
$$

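Here $ \epsilon $ is the probability of flipping to one specific incorrect class, so with $ K $ classes a label is flipped with total probability $ (K-1)\epsilon $ and kept with probability $ 1 - (K-1)\epsilon $.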

- **Characteristics:**
    - Simplest form of label noise.
    - Assumes uniform distribution of errors across classes.
    - Often used as a baseline in noise-robust learning studies.

- **Implications:**
    - Easier to model and correct.
    - Does not accurately represent real-world noise patterns in complex datasets.

> **Note:** While widely used in research, this model may oversimplify noise patterns in practical applications.

### 2. Systematic/Asymmetric Class-Conditional Label Noise

This noise model allows for varying probabilities of mislabeling between different class pairs.

- **Definition:** The probability of mislabeling depends on the true label $ j $ and can vary for each incorrect label $ i $:

$$
p(\tilde{y} = i \mid y^* = j)
$$

- **Characteristics:**
    - More realistic representation of real-world label noise.
    - Models common confusion patterns (e.g., confusion between visually similar classes).

- **Implications:**
    - Requires more sophisticated modeling techniques.
    - Captures nuanced error patterns in data labeling processes.
    - More challenging to correct but potentially more effective in real-world scenarios.

### 3. Instance-Dependent Label Noise

This advanced noise model considers both the true class and the instance features when determining mislabeling probabilities.

- **Definition:** Mislabeling probability depends on both the true class and the instance features $ x $:

$$
p(\tilde{y} = i \mid y^* = j, x)
$$

- **Characteristics:**
    - Most complex and potentially most realistic noise model.
    - Accounts for instances more likely to be mislabeled based on specific attributes.

- **Implications:**
    - Requires significant assumptions about data distribution.
    - Challenging to model and correct.
    - Often impractical due to complexity.

> **Important:** While instance-dependent noise may accurately represent real-world label noise, its complexity often makes it impractical for many applications.

### Comparative Example

To illustrate these noise types, consider an image classification task for animals:

- **Uniform Noise:** A cat is equally likely to be mislabeled as a dog, bird, or any other animal.
- **Asymmetric Noise:** A cat is more likely to be mislabeled as a dog (a visually similar class) than as a bird.
- **Instance-Dependent Noise:** A blurry image of a cat is more likely to be mislabeled than a clear, well-lit image due to its specific attributes.
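
The three cases can be sketched as functions that return a distribution over observed labels. All probabilities, the `blur` feature, and the scaling rule below are made-up values for illustration, not measurements from any dataset.

```python
CLASSES = ["cat", "dog", "bird"]

def uniform_noise(true_label, epsilon=0.1):
    """Uniform: every wrong class receives the same probability epsilon."""
    wrong = [c for c in CLASSES if c != true_label]
    dist = {c: epsilon for c in wrong}
    dist[true_label] = 1.0 - epsilon * len(wrong)
    return dist

# Hypothetical asymmetric table p(observed | true), one row per true class.
ASYMMETRIC = {
    "cat": {"cat": 0.80, "dog": 0.18, "bird": 0.02},
    "dog": {"cat": 0.15, "dog": 0.83, "bird": 0.02},
    "bird": {"cat": 0.01, "dog": 0.01, "bird": 0.98},
}

def asymmetric_noise(true_label):
    """Asymmetric: flip probabilities depend on the specific class pair."""
    return dict(ASYMMETRIC[true_label])

def instance_dependent_noise(true_label, blur):
    """Instance-dependent: flips also depend on a feature, here blur in [0, 1]."""
    # Hypothetical rule: sharp images halve the flip rates, very blurry ones double them.
    scale = 0.5 + 1.5 * blur
    dist = {c: p * scale for c, p in ASYMMETRIC[true_label].items() if c != true_label}
    dist[true_label] = 1.0 - sum(dist.values())
    return dist

print(uniform_noise("cat"))
print(asymmetric_noise("cat"))
print(instance_dependent_noise("cat", blur=0.9))
```

Only the third function takes an argument describing the individual image, which is what separates instance-dependent noise from the two class-conditional models.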


In real-world datasets like ImageNet, we often observe asymmetric label noise. For example, many images of **wild boars** are mislabeled as **pigs**, and vice versa. This is a clear example of systematic noise, where certain classes are more likely to be confused due to their visual similarities.

<p align="center">
<img src="images/label_errors_pig.png" width="100%" height="100%" alt="Wild Boar and Pig Label Errors"/>
</p>

If label noise were uniform, we would expect the mislabeling to be evenly distributed across all classes. However, the systematic nature of the noise indicates that certain classes are more likely to be confused with specific others, leading to asymmetric noise patterns.

> **Note**: Despite the prevalence of asymmetric label noise in real-world datasets, many noise-robust learning studies still rely on the uniform noise assumption because it is simple to model. This contributes to myths like "neural networks are robust to label noise": the claim is often tested on datasets with uniform noise, which does not reflect real-world scenarios. This gap between research assumptions and real-world noise patterns highlights the importance of developing noise-handling techniques that can address more complex noise structures.

## Where Does Label Noise Come From?

### Noise in Data

Machine learning models often encounter various types of noise in input data. While these are important to understand, this lecture will focus on a different type of noise. For context, some common data noise types include:

- **Visual Noise**: Blurry or distorted images. For example, an image of a sidewalk that is so blurry that it's hard to distinguish the details.
- **Adversarial Examples**: Intentionally manipulated inputs designed to fool models, such as a slightly altered image of a car that a model misclassifies as a bicycle.
- **Textual Noise**: Typos, misspellings, or grammatical errors in text data, like a sentence with multiple spelling errors making it harder for a language model to understand.
- **Audio Noise**: Background sounds or distortions in audio recordings, such as a conversation recorded with loud traffic noise in the background.

### Annotator Label Noise

This lecture focuses on annotator label noise, which occurs during the data labeling process. Unlike data noise, label noise affects the target variable or class assignment.

Consider a scenario where an image of a toy car is labeled by different annotators. One annotator labels it as a "Sports Car," while two others label it as a "Toy Car." This inconsistency in labels introduces noise into the dataset.

> **Key Point**: In the context of Confident Learning (CL), we assume:
> 1. Labels are noisy, not the data itself.
> 2. Each example has only one annotation.

This simplification allows us to focus on addressing label noise specifically, which is a common challenge in real-world machine learning applications.

## Approaches to Learning with Noisy Labels

### Model-Centric Methods: "Change the Loss"

These methods focus on modifying the model's learning process to account for noisy labels:

1. **Using Loss from Another Network**
    - Techniques like Co-Teaching exploit additional networks to guide the learning process.
    - These methods often involve training multiple models simultaneously, with each model helping to identify and mitigate the impact of noisy labels on the other.

2. **Direct Loss Modification**
    - Approaches like SCE-loss alter the loss function itself.
    - These modifications aim to make the loss more robust to label noise, often by changing how the model penalizes misclassifications.

3. **Importance Reweighting**
    - This strategy involves assigning different weights to training examples based on their likelihood of having correct labels.
    - The goal is to reduce the influence of potentially mislabeled data points during training.
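
As a rough sketch of importance reweighting, the loss below is a weighted mean of per-example cross-entropy terms. The function name, the toy probabilities, and the weight of 0.1 are assumptions for illustration. How the weights are obtained is a separate question that each reweighting method answers differently.

```python
import math

def weighted_cross_entropy(pred_probs, labels, weights, eps=1e-12):
    """Weighted mean of -log p(label), normalized by the total weight."""
    total = sum(
        w * -math.log(max(probs[y], eps))
        for probs, y, w in zip(pred_probs, labels, weights)
    )
    return total / sum(weights)

# Toy predicted probabilities for three examples in a two-class problem.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 1, 1]  # the third label disagrees strongly with the model

uniform_loss = weighted_cross_entropy(pred_probs, labels, [1.0, 1.0, 1.0])
reweighted_loss = weighted_cross_entropy(pred_probs, labels, [1.0, 1.0, 0.1])

print(f"uniform weights:    {uniform_loss:.3f}")
print(f"suspect example at 0.1: {reweighted_loss:.3f}")
```

Down-weighting the third example shrinks its share of the loss, so a possibly mislabeled point pulls less on the gradient during training.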

### Data-Centric Methods: "Change the Data"

These approaches focus on improving the quality of the training data itself:

1. **Identifying Label Errors**
    - Techniques are developed to detect and flag potential mislabeled examples in datasets.
    - This can involve statistical methods, model predictions, or even manual review processes.

2. **Learning with Cleaned Data**
    - Once label errors are identified, the training process can be adjusted in several ways:
        - Removing mislabeled data points
        - Correcting labels where possible
        - Assigning lower weights to potentially noisy examples

> **Lecture Focus**: We will primarily explore data-centric methods for handling noisy labels, emphasizing techniques to identify and address label errors in datasets.

Having established the importance of data-centric methods for handling noisy labels—particularly through the identification and correction of label errors—we now turn our attention to the specialized techniques that make this possible. Detecting mislabeled instances is a critical step in improving data quality, which directly impacts the performance and reliability of machine learning models.

This brings us to Annotation Error Detection (AED), a collection of methodologies designed to systematically identify potential errors in data annotations. AED plays a critical role in the data curation process by ensuring that models are trained on accurate and trustworthy data. By focusing on the annotations themselves, AED methods complement the data-centric approach by providing tools to clean and refine datasets effectively.

## Introduction to Annotation Error Detection (AED)

In Machine Learning, high-quality annotated datasets are the cornerstone of model training and evaluation. However, even meticulously curated datasets can harbor errors and inconsistencies in their annotations, potentially jeopardizing model performance and leading to inaccurate conclusions about a model's capabilities. This is where Annotation Error Detection (AED) comes into play.

AED focuses on automatically identifying potential errors or inconsistencies within labeled datasets. Its primary objective is to assist human annotators and dataset creators in refining data quality. By pinpointing instances that require manual review and potential correction, AED streamlines the annotation process and enhances the reliability of the dataset.

### Types of Annotation Errors

<p align="center">
<img src="images/label_errors1.png" width="100%" height="100%" />
</p>
<p align="center">
<a href="https://labelerrors.com">Source</a>
</p>


AED targets various types of annotation errors, including:

1. **Incorrect Labels:** Instances where the assigned label is fundamentally wrong. For example, in sentiment analysis, a positive review might be mistakenly labeled as negative.

2. **Inconsistencies:** Situations where similar items are labeled differently across the dataset. This inconsistency can arise from subjective interpretations or evolving annotation guidelines. For example, in named entity recognition, the same person's name might be tagged as a person in some instances and left untagged in others.

3. **Ambiguities:** Cases where multiple valid interpretations exist, but the annotation scheme only allows for one label. This ambiguity can stem from built-in complexities in the data or limitations in the annotation guidelines. For example, a sentence could be interpreted as both sarcastic and genuine, leading to ambiguity in sentiment analysis.


### Categories of AED Methods

AED methods can be broadly classified into two categories:

1. **Flaggers:** These methods make binary decisions, classifying annotations as either correct or incorrect. They essentially flag potentially erroneous instances for further review.

2. **Scorers:** These approaches assign a score to each instance, reflecting the likelihood of an annotation error. This score helps prioritize instances for manual inspection, focusing on those with higher error probabilities.

### Techniques for AED

Various techniques have been developed for AED, including:

- **Model-based methods:** These methods use machine learning models to identify potential errors. For example, a classifier trained on one part of the data can predict labels for held-out instances, and a low predicted probability for the given label suggests a possible annotation error.

- **Variation-based methods:** These techniques exploit similarities in surface forms to detect inconsistencies. For example, if similar sentences have different labels, it might indicate an annotation error.

- **Ensemble methods:** These approaches combine multiple AED methods to improve detection accuracy. By leveraging the strengths of different methods, ensemble techniques can provide a more robust error detection mechanism.

- **Vector space proximity methods:** These methods apply dense embeddings to identify anomalies. For example, instances that are far away from other instances with the same label in the embedding space might indicate potential errors.
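
One way to picture the vector space proximity idea is a nearest-neighbor disagreement score: the fraction of an instance's closest neighbors that carry a different label. This is a simplified sketch with hypothetical 2-D "embeddings", not a specific published method. It uses brute-force distances, so it only suits small examples.

```python
import math

def knn_disagreement_scores(embeddings, labels, k=3):
    """Score each instance by the share of its k nearest neighbors with another label."""
    scores = []
    for i, (x, y) in enumerate(zip(embeddings, labels)):
        # Distances to every other point, closest first (ties broken by index).
        dists = sorted(
            (math.dist(x, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        disagreements = sum(1 for lab in neighbor_labels if lab != y)
        scores.append(disagreements / max(len(neighbor_labels), 1))
    return scores

# Two well-separated clusters; the fourth point sits in the first cluster
# but carries the label of the second one.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

scores = knn_disagreement_scores(embeddings, labels)
suspect = max(range(len(scores)), key=lambda i: scores[i])
print(f"most suspicious index: {suspect}, score: {scores[suspect]:.2f}")
```

In this toy data the fourth point receives the highest score because all of its nearest neighbors carry the other label.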

### Applications of AED

AED has proven valuable across various ML tasks, including:

- **Document Classification:** Identifying misclassified instances.
- **Named Entity Recognition:** Detecting inconsistencies or errors in entity tagging.
- **Image Classification:** Flagging images with incorrect labels.
- **Pixel-wise Segmentation:** Highlighting regions with annotation discrepancies.
- **Regression Tasks:** Identifying outliers or incorrectly labeled data points.

AED has become an indispensable tool in the data curation pipeline, ensuring the reliability and quality of training data for machine learning models.

### Formal Definition

Given an annotated dataset $D$ consisting of instances $(x_i, \tilde{y_i})$, where $x_i$ represents an input (e.g., a sentence or token) and $\tilde{y_i}$ its corresponding **observed** label, AED aims to identify a subset of instances $E \subseteq D$ where the observed label $\tilde{y_i}$ is likely to differ from the true latent label $y_i^*$.

### Key Components of the AED Task

1. **Input:** An annotated dataset, typically without access to the true latent labels ($y_i^*$) or any additional clean data with the same annotation scheme. This constraint highlights the challenge of AED, as it often operates in a weakly supervised setting where only the observed labels are available for error detection.

2. **Output:** AED methods produce different outputs depending on their category:
    - **Flaggers:** A set $E$ containing instances $(x_i, \tilde{y_i})$ flagged as erroneous: $E = \{(x_i, \tilde{y_i}) \mid \text{is\_error}(x_i, \tilde{y_i}) = \text{True}\}$.
    - **Scorers:** A ranked list $L$ of instances $(x_i, \tilde{y_i})$ with corresponding error likelihood scores $s_i$: $L = [(x_i, \tilde{y_i}, s_i)]$. Higher scores indicate a higher likelihood of error.

3. **Granularity:** AED can be applied at various levels of granularity:
    - **Document or Sentence Level:** Suitable for tasks like text classification.
    - **Token Level:** Applicable to tasks such as part-of-speech tagging.
    - **Span Level:** Relevant for tasks like named entity recognition.

4. **Error Types:** AED aims to detect a range of annotation issues, including incorrect labels, inconsistencies in labeling similar instances, and ambiguous cases with multiple valid interpretations, all with respect to the unknown true latent label $y_i^*$.

5. **Evaluation:** Evaluation metrics for AED depend on the method category:
    - **Flaggers:** Precision, Recall, and F1 score, commonly used in binary classification tasks.
    - **Scorers:** Average Precision, Precision@k, and Recall@k, which are relevant for ranking tasks.
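
The metrics above can be sketched in a few lines of plain Python. The helpers assume that instances are identified by integer index and that a set of truly erroneous indices is available, which in practice requires a manually verified sample. The function names and the toy inputs are illustrative.

```python
def precision_recall_f1(flagged, true_errors):
    """Flagger metrics: compare the flagged set against the true error set."""
    flagged, true_errors = set(flagged), set(true_errors)
    tp = len(flagged & true_errors)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true_errors) if true_errors else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def _ranking(scores):
    """Indices sorted from highest to lowest error score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def precision_at_k(scores, true_errors, k):
    """Scorer metric: share of true errors among the k highest-scored instances."""
    return sum(1 for i in _ranking(scores)[:k] if i in true_errors) / k

def average_precision(scores, true_errors):
    """Scorer metric: mean of precision values at the rank of each true error."""
    hits, total = 0, 0.0
    for rank, i in enumerate(_ranking(scores), start=1):
        if i in true_errors:
            hits += 1
            total += hits / rank
    return total / len(true_errors) if true_errors else 0.0

scores = [0.9, 0.1, 0.8, 0.3, 0.7]   # toy error scores for five instances
true_errors = {0, 4}                 # indices known to be mislabeled

print(precision_recall_f1({0, 2}, true_errors))
print(precision_at_k(scores, true_errors, k=2))
print(average_precision(scores, true_errors))
```

Flagger metrics treat every flagged instance equally, while the scorer metrics reward rankings that place true errors near the top of the review queue.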

#### Challenges in AED

1. **Lack of Ground Truth:** In real-world scenarios, the true latent labels ($y_i^*$) for potentially erroneous instances are often unknown, making it challenging to definitively assess the accuracy of AED methods.

2. **Class Imbalance:** Correctly labeled instances typically far outnumber incorrect ones, leading to a class imbalance problem. This imbalance can bias AED methods towards favoring the majority class (correctly labeled instances) and result in lower recall for the minority class (erroneous instances).

3. **Task and Domain Specificity:** Different NLP tasks and domains may require tailored AED approaches. For example, an AED method designed for sentiment analysis might not be directly applicable to named entity recognition.

4. **Distinguishing Errors from Valid Edge Cases or Ambiguities:** AED methods need to differentiate between genuine annotation errors and valid edge cases or instances with built-in ambiguity. Misclassifying these edge cases as errors can lead to unnecessary manual review and hinder the efficiency of the annotation process.

#### Distinction from Other Data Quality Tasks

While related to other data quality tasks like noise-robust learning and data cleaning, AED serves a distinct purpose. Noise-robust learning focuses on developing models that are resilient to noise in the training data, while data cleaning aims to identify and correct errors in the data itself. In contrast, AED specifically targets potential discrepancies between the observed labels ($\tilde{y_i}$), which are typically assigned by human annotators, and the true latent labels ($y_i^*$). AED acts as a crucial step in the data curation pipeline, preceding model training or evaluation. By identifying annotation errors and supporting their correction, AED contributes significantly to improving the reliability and accuracy of NLP models trained on these datasets.

#### Mathematical Formulation

For a given instance $(x_i, \tilde{y_i}) \in D$, an AED method $f$ aims to:

1. **Flaggers:** Determine whether an instance is likely erroneous, meaning $\tilde{y_i}$ likely differs from $y_i^*$:
$f(x_i, \tilde{y_i}) = \begin{cases}
1 & \text{if } (x_i, \tilde{y_i}) \text{ is likely erroneous} \\
0 & \text{otherwise}
\end{cases}$

2. **Scorers:** Assign an error likelihood score:
$f(x_i, \tilde{y_i}) = s_i \in [0, 1]$, where higher values indicate a higher likelihood of error, meaning $\tilde{y_i}$ is more likely to differ from $y_i^*$.

The overarching goal of AED is to maximize the detection of true errors while minimizing false positives:
$$\text{argmax}_f \, F1(\{(x_i, \tilde{y_i}) \mid f(x_i, \tilde{y_i}) = 1\}, E_{true})$$

where $E_{true}$ represents the set of truly erroneous instances, which is typically unknown in practice. This objective highlights the trade-off between identifying as many true errors as possible (high recall) and keeping the number of correctly labeled instances flagged as errors low (high precision).
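
As one simple illustration of the two function types, the sketch below scores each instance by one minus the model's predicted probability for its observed label, then thresholds that score to obtain a flagger. It assumes out-of-sample predicted probabilities are already available. The threshold of 0.5 and the function names are hypothetical, and this baseline is only one of many possible choices for $f$.

```python
def error_scores(pred_probs, observed_labels):
    """Scorer: s_i = 1 - p(observed label | x_i), a value in [0, 1]."""
    return [1.0 - probs[y] for probs, y in zip(pred_probs, observed_labels)]

def flag_errors(scores, threshold=0.5):
    """Flagger: 1 if the score exceeds the (hypothetical) threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]

# Hypothetical out-of-sample predicted probabilities for three instances.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]
observed_labels = [0, 1, 1]

scores = error_scores(pred_probs, observed_labels)
flags = flag_errors(scores)

for s, f in zip(scores, flags):
    print(f"score={s:.2f} flagged={f}")
```

Any scorer can be turned into a flagger this way, but the reverse does not hold: a flagger's binary output carries no ranking information.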

### Refining Weak Supervision with Annotation Error Detection

Applying Annotation Error Detection (AED) as a final step in a weak supervision pipeline refines the quality of the generated labels, leading to a more robust and reliable training dataset for downstream machine learning models.

#### How AED Enhances Weak Supervision

Instead of directly using the potentially noisy labels from the weak supervision's label model, AED treats these labels as probabilistic suggestions. This approach acknowledges the innate uncertainty in weak supervision and leverages AED techniques to identify and address potential inconsistencies.

1. **Weak Labels as Input:** The label model in a weak supervision framework assigns a weak label to each instance, reflecting its confidence in the classification. AED uses these weak labels, denoted as ($\tilde{y_i}$), as the foundation for its analysis. This is in contrast to traditional supervised learning, where labels are treated as absolute truths.

2. **Detecting Potential Errors:** AED techniques pinpoint potential errors within the weak labels. They focus on identifying instances where the weak label is likely to deviate from the true latent label ($y_i^*$). By flagging these instances for further review, AED helps refine the dataset and improve the quality of the training data.

3. **Dataset Refinement Strategies:** AED's error detection insights drive targeted actions to refine the dataset:

    - **Manual Review and Correction:** Instances flagged as potentially erroneous are reviewed by human experts who can correct mislabeled instances or provide additional insights. This is particularly valuable for complex cases where automated methods might struggle.
    - **Selective Data Exclusion:** Instances with a high likelihood of error, based on the AED analysis, can be temporarily or permanently excluded from the training dataset. This prevents the propagation of noisy labels during model training and leads to a cleaner dataset.
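
The two refinement strategies can be sketched as a routing step over scored instances. The record layout, the field name `error_score`, and both thresholds are hypothetical and would need tuning for a real pipeline.

```python
def split_by_error_score(records, review_threshold=0.5, exclude_threshold=0.9):
    """Route each record to keep, manual review, or exclusion by its error score."""
    keep, review, exclude = [], [], []
    for rec in records:
        score = rec["error_score"]
        if score >= exclude_threshold:
            exclude.append(rec)      # very likely wrong: drop from training
        elif score >= review_threshold:
            review.append(rec)       # uncertain: send to a human expert
        else:
            keep.append(rec)         # likely fine: train on it as is
    # Reviewers see the most suspicious instances first.
    review.sort(key=lambda r: r["error_score"], reverse=True)
    return keep, review, exclude

# Toy weakly labeled records with scores from some AED scorer.
records = [
    {"id": "a", "weak_label": 1, "error_score": 0.05},
    {"id": "b", "weak_label": 0, "error_score": 0.62},
    {"id": "c", "weak_label": 1, "error_score": 0.95},
    {"id": "d", "weak_label": 0, "error_score": 0.71},
]

keep, review, exclude = split_by_error_score(records)
print([r["id"] for r in keep], [r["id"] for r in review], [r["id"] for r in exclude])
```

Keeping a separate review queue means that exclusion stays reserved for the clearest cases, while borderline instances get a human decision instead of being silently dropped.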

#### Benefits of Incorporating AED into Weak Supervision

Integrating AED into a weak supervision pipeline offers several key advantages:

- **Enhanced Data Quality:** By proactively identifying and addressing potential labeling errors, AED significantly improves the overall quality and reliability of the labeled dataset.
- **Increased Model Trustworthiness:** Training downstream machine learning models on a dataset refined by AED results in more accurate and reliable predictions. This is because the model learns from a cleaner, more consistent dataset.
- **Targeted Weak Supervision Improvement:** AED provides valuable feedback on the performance of the weak supervision pipeline itself. By highlighting areas of weakness, such as poorly performing labeling functions or systematic biases in the label model, AED guides targeted improvements for more effective weak supervision.

## Our Dataset

We'll load the dataset from our WSL Pipeline Notebook and perform Annotation Error Detection on it. We'll take the weakly annotated dataset and apply AED techniques to identify potential errors or inconsistencies in the labels. This process will help refine the dataset and enhance the quality of the training data for downstream tasks.

We'll compare the performance of different AED methods and evaluate how effectively each one produces a dataset that can train high-performing machine learning models.

In [1]:
import pandas as pd
df_test = pd.read_parquet('data/b2w/test_cleaned_with_labels.parquet')
df_train = pd.read_parquet('outputs/ws-pipeline/df_train_weakly_labeled.parquet')
df_dev = pd.read_parquet('data/b2w/dev.parquet')

In [2]:
df_train.head()

Unnamed: 0,source,review_id,text,label_snorkel,label_majority_vote
0,b2w,79d9a98a62d9adff5e2c8e2bed824e4d524695e0a1e235...,nao gostei do produto! - o acabamento e muito ...,1,1
1,b2w,5177b7800f360f47ccd69afa43def2180777c1f6a3b26d...,"produto nao funciona - produto nao funcio, vei...",0,0
2,b2w,fc0cc6d9c2e4539762936bcb5b0e855df6b5e08229b00d...,"nao recebi, portanto nao conheco o produto - p...",0,0
3,b2w,b3a8b907623ceece9aef15896c82a6ea3d932be9f8a856...,maravilhoso - parabens pela eficiencia na entr...,1,1
4,b2w,5e1611aae145617b04b247314421e2d65ef9d7800b2e95...,decepcionado - relogio com a mesma qualidade d...,1,0


Let's establish a reference point: we first train a model on the weak labels as they are, and later compare its performance with that of models trained after the AED process.

In [3]:
# Import TfidfVectorizer from sklearn for converting text data into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer


In [4]:
from helpers.classification import train_and_evaluate_classification_models, print_classification_metrics

tfidf = TfidfVectorizer(
    ngram_range=(1, 2), 
    strip_accents='unicode', 
    lowercase=True, 
    max_features=1000, 
    min_df=3
)

X_train = tfidf.fit_transform(df_train['text'])
y_train = df_train['label_majority_vote']

X_dev = tfidf.transform(df_dev['text'])
y_dev = df_dev['label']

X_test = tfidf.transform(df_test['text'])
y_test = df_test['label']



In [5]:
df_results, classification_reports = train_and_evaluate_classification_models(X_train, y_train)

Model: Calibrated-LSVC - F1: 0.9761 - Balanced Accuracy: 0.9682 - Accuracy: 0.9761 - Matthews Correlation Coefficient: 0.9366 - Elapsed time: 11.31s
              precision    recall  f1-score   support

           0       0.95      0.95      0.95     24385
           1       0.98      0.98      0.98     72396

    accuracy                           0.98     96781
   macro avg       0.97      0.97      0.97     96781
weighted avg       0.98      0.98      0.98     96781

[[23217  1168]
 [ 1143 71253]]
******************** 

Model: Logistic Regression - F1: 0.9674 - Balanced Accuracy: 0.9707 - Accuracy: 0.9674 - Matthews Correlation Coefficient: 0.9173 - Elapsed time: 1.34s
              precision    recall  f1-score   support

           0       0.90      0.98      0.94     24385
           1       0.99      0.96      0.98     72396

    accuracy                           0.97     96781
   macro avg       0.95      0.97      0.96     96781
weighted avg       0.97      0.97      0.97   



Model: Random Forest - F1: 0.9671 - Balanced Accuracy: 0.9533 - Accuracy: 0.9671 - Matthews Correlation Coefficient: 0.9122 - Elapsed time: 28.22s
              precision    recall  f1-score   support

           0       0.94      0.93      0.93     24385
           1       0.98      0.98      0.98     72396

    accuracy                           0.97     96781
   macro avg       0.96      0.95      0.96     96781
weighted avg       0.97      0.97      0.97     96781

[[22569  1816]
 [ 1370 71026]]
******************** 

Model: XGBoost - F1: 0.9748 - Balanced Accuracy: 0.9648 - Accuracy: 0.9748 - Matthews Correlation Coefficient: 0.9330 - Elapsed time: 191.64s
              precision    recall  f1-score   support

           0       0.95      0.94      0.95     24385
           1       0.98      0.98      0.98     72396

    accuracy                           0.97     96781
   macro avg       0.97      0.96      0.97     96781
weighted avg       0.97      0.97      0.97     96781

[[2



Model: K-Nearest Neighbors - F1: 0.8517 - Balanced Accuracy: 0.7183 - Accuracy: 0.8517 - Matthews Correlation Coefficient: 0.5773 - Elapsed time: 97.33s
              precision    recall  f1-score   support

           0       0.92      0.45      0.60     24385
           1       0.84      0.99      0.91     72396

    accuracy                           0.85     96781
   macro avg       0.88      0.72      0.76     96781
weighted avg       0.86      0.85      0.83     96781

[[10964 13421]
 [  936 71460]]
******************** 

Model: Decision Tree - F1: 0.9441 - Balanced Accuracy: 0.9284 - Accuracy: 0.9441 - Matthews Correlation Coefficient: 0.8525 - Elapsed time: 180.14s
              precision    recall  f1-score   support

           0       0.88      0.90      0.89     24385
           1       0.97      0.96      0.96     72396

    accuracy                           0.94     96781
   macro avg       0.92      0.93      0.93     96781
weighted avg       0.94      0.94      0.94   

In [6]:
df_results.sort_values(by='Matthews Correlation Coefficient', ascending=False)

Unnamed: 0,Model,F1,Balanced Accuracy,Accuracy,Matthews Correlation Coefficient,Elapsed Time,Confusion Matrix,Classification Report
0,Calibrated-LSVC,0.976121,0.968157,0.976121,0.936632,11.312945,[[23217 1168]\n [ 1143 71253]],precision recall f1-score ...
3,XGBoost,0.974819,0.964798,0.974819,0.932983,191.644645,[[23034 1351]\n [ 1086 71310]],precision recall f1-score ...
1,Logistic Regression,0.967411,0.970684,0.967411,0.917253,1.342237,[[23831 554]\n [ 2600 69796]],precision recall f1-score ...
8,Extra Trees,0.968827,0.956876,0.968827,0.917031,37.225035,[[22746 1639]\n [ 1378 71018]],precision recall f1-score ...
2,Random Forest,0.96708,0.953302,0.96708,0.912205,28.221164,[[22569 1816]\n [ 1370 71026]],precision recall f1-score ...
4,SGD,0.962555,0.967927,0.962555,0.90617,1.489239,[[23867 518]\n [ 3106 69290]],precision recall f1-score ...
5,Naive Bayes,0.948027,0.936555,0.948027,0.863829,0.544594,[[22274 2111]\n [ 2919 69477]],precision recall f1-score ...
7,Decision Tree,0.94409,0.92839,0.94409,0.852477,180.144789,[[21867 2518]\n [ 2893 69503]],precision recall f1-score ...
6,K-Nearest Neighbors,0.851655,0.718346,0.851655,0.577318,97.330122,[[10964 13421]\n [ 936 71460]],precision recall f1-score ...


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, matthews_corrcoef

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Fit the Logistic Regression model on the training data
model_lr.fit(X_train, y_train)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

Metric                                   Score
Accuracy Score:                        0.95579
Balanced Accuracy Score:               0.94520
F1 Score (weighted):                   0.95568
Cohen Kappa Score:                     0.89465
Matthews Correlation Coefficient:      0.89471

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.92      0.93      7871
           1       0.96      0.97      0.97     18187

    accuracy                           0.96     26058
   macro avg       0.95      0.95      0.95     26058
weighted avg       0.96      0.96      0.96     26058


Confusion Matrix:

Predicted      0       1     All
True                            
0          7,229     642   7,871
1            510  17,677  18,187
All        7,739  18,319  26,058


## Retagging with End-Model Predictions

In a **Weak Supervision Learning (WSL)** pipeline, the ultimate goal is to train an **end-model** that can accurately classify new, unseen data. Typically, this model is trained on a dataset where the labels might be noisy or weak due to the nature of weak supervision. However, we can employ a technique known as **retagging** to enhance the quality of our dataset before finalizing our end-model.

**Retagging** involves using the predictions of an interim model—often the end-model trained on a subset of the data—to identify and correct potential annotation errors within the dataset itself. By comparing the model's predictions to the existing labels, we can flag discrepancies that may indicate mislabeling or inconsistencies.

### Why Retagging Matters

Weakly labeled data is common in machine learning tasks where obtaining high-quality annotations is costly or impractical. These weak labels can introduce noise into the training process, potentially degrading model performance. Retagging serves as a practical method to improve label quality by leveraging the model's own predictive capabilities to refine the dataset.

> **Analogy**: Consider retagging similar to using a spellchecker to proofread a text document. The initial document (your dataset) may contain typos (annotation errors). The spellchecker (your model) flags words it suspects are misspelled. By reviewing these suggestions, you can correct the typos, resulting in a more accurate and polished document (dataset).

By updating the labels based on the model's predictions, we create an enhanced dataset that more accurately reflects the true underlying patterns. This can lead to significant improvements in the performance of downstream models trained on this data.

> **Note**: Retagging is particularly useful when dealing with large datasets where manual re-annotation is not feasible.

### Recommended Reading

- For insights into how retagging compares with other annotation error detection techniques, consider reading [this study](https://arxiv.org/abs/2206.02280). It provides valuable comparisons of different methods for improving weakly labeled data.
- For a detailed description of the retagging approach and its implementation, refer to the original [Retag Paper](https://aclanthology.org/W00-1907/).

### How to Implement Retagging

To perform retagging, follow these steps:

1. **Train the Model on a Subset**: Split your dataset and train your model on a subset to ensure that the predictions are out-of-sample for the remaining data.
2. **Obtain Out-of-Sample Predictions**: Use the trained model to generate predictions on the portion of the data not used in training.
3. **Compare Predictions with Existing Labels**: Identify instances where the model's predictions differ from the existing labels.
4. **Flag and Review Discrepancies**: Treat these discrepancies as potential annotation errors and consider updating the labels accordingly.

By implementing this process, you can iteratively improve the quality of your dataset.

> **Important Note**: To avoid overfitting and ensure that the predictions are reliable, it is crucial to use out-of-sample predictions. This is commonly achieved through cross-validation techniques. The `cross_val_predict` function from scikit-learn is a practical tool for obtaining out-of-sample predictions across your entire dataset.
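
The four steps above can be sketched on synthetic data, where the injected noise is known and the flags can be checked. This is a minimal illustration under stated assumptions: the toy dataset from `make_classification`, the 10% noise rate, and all variable names are made up for the example and are not the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, well-separated binary data (an assumption made for illustration)
X_toy, y_true = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, flip_y=0.0, random_state=0,
)

# Inject 10% label noise so we know exactly which labels are wrong
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y_true), size=100, replace=False)
y_noisy = y_true.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]

# Steps 1-2: out-of-sample predictions for every instance via cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred_oos = cross_val_predict(LogisticRegression(max_iter=1000), X_toy, y_noisy, cv=cv)

# Steps 3-4: flag instances whose prediction disagrees with the given label
flagged = y_pred_oos != y_noisy
is_noisy = y_noisy != y_true
print(f"flagged: {flagged.sum()}, truly noisy among flagged: {(flagged & is_noisy).sum()}")
```

Because each prediction comes from a model that never saw that instance's label, a disagreement is evidence about the label rather than an artifact of memorization.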

### Addressing Potential Misconceptions

- **Does Retagging Replace Manual Annotation?**
    - Retagging does not entirely replace the need for manual annotation but serves as a complementary process to improve label quality where manual efforts are insufficient or impractical.

- **Can Retagging Introduce New Errors?**
    - While retagging can correct some annotation errors, it may also introduce new errors if the model's predictions are incorrect. It's essential to balance model confidence and potentially incorporate human review for critical cases.

- **Is Retagging Effective for All Models?**
    - The effectiveness of retagging depends on the initial performance of your model. If the model is not sufficiently accurate, its predictions may not be reliable indicators of annotation errors.

### Practical Considerations

- **Model Confidence**: When deciding whether to update a label based on the model's prediction, consider the confidence level of the prediction. High-confidence predictions are more likely to indicate true errors in the original labels.

- **Iterative Process**: Retagging can be an iterative process. After updating labels, retrain the model and repeat the steps to further refine the dataset.

- **Threshold Setting**: Establish thresholds for discrepancies that warrant label changes. For instance, only retag instances where the model's prediction probability exceeds a certain level.

- **Domain Expertise**: Incorporate insights from domain experts when reviewing flagged discrepancies to ensure that label corrections are valid.
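
The model-confidence and threshold-setting considerations can be made concrete with a small sketch. It assumes out-of-sample class probabilities are already available (for instance from `cross_val_predict` with `method="predict_proba"`) and that labels are integers matching the probability columns. The function name `retag_with_confidence` and the 0.9 threshold are hypothetical.

```python
import numpy as np

def retag_with_confidence(labels, pred_probs, min_confidence=0.9):
    """Replace a label only when the model disagrees with it confidently.

    labels:      array of shape (n,) with the current (possibly noisy) labels
    pred_probs:  array of shape (n, n_classes) with out-of-sample probabilities
    Returns the updated labels and a boolean mask of the changed instances.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    predicted = pred_probs.argmax(axis=1)      # most probable class per instance
    confidence = pred_probs.max(axis=1)        # probability of that class
    changed = (predicted != labels) & (confidence >= min_confidence)
    new_labels = np.where(changed, predicted, labels)
    return new_labels, changed

# Hypothetical labels and out-of-sample probabilities for four instances
labels_demo = np.array([0, 1, 1, 0])
probs_demo = np.array([
    [0.97, 0.03],  # agrees with label 0
    [0.95, 0.05],  # confidently disagrees with label 1 -> retag to 0
    [0.60, 0.40],  # disagrees, but not confidently -> keep label 1
    [0.08, 0.92],  # confidently disagrees with label 0 -> retag to 1
])
new_labels_demo, changed_demo = retag_with_confidence(labels_demo, probs_demo)
print(new_labels_demo, changed_demo)
```

Low-confidence disagreements, like the third instance, are natural candidates for review by a domain expert instead of automatic relabeling.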

In [8]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Initialize stratified k-fold cross-validation with 20 splits
# StratifiedKFold ensures that each fold has the same proportion of classes as the original dataset
cross_validation = StratifiedKFold(n_splits=20, shuffle=True, random_state=271828)

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_retag = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Perform cross-validated predictions on the training data
# 'method="predict"' returns the predicted class labels for each fold
# 'n_jobs=2' uses 2 processors for parallel computation
y_train_retag = cross_val_predict(estimator=model_retag, X=X_train, y=y_train, cv=cross_validation, method="predict", n_jobs=2)

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, matthews_corrcoef

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Fit the Logistic Regression model on the training data
# y_train_retag is the re-tagged training labels obtained from cross-validation
model_lr.fit(X_train, y_train_retag)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

Metric                                   Score
Accuracy Score:                        0.95713
Balanced Accuracy Score:               0.95203
F1 Score (weighted):                   0.95725
Cohen Kappa Score:                     0.89891
Matthews Correlation Coefficient:      0.89901

Classification Report:

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      7871
           1       0.97      0.96      0.97     18187

    accuracy                           0.96     26058
   macro avg       0.95      0.95      0.95     26058
weighted avg       0.96      0.96      0.96     26058


Confusion Matrix:

Predicted      0       1     All
True                            
0          7,392     479   7,871
1            638  17,549  18,187
All        8,030  18,028  26,058


In [10]:
# Dropping indexes where y_train_retag is different from y_train
# This step ensures that only the examples where the re-tagged labels match the original labels are kept
keep_indexes = y_train_retag == y_train
X_train_retag = X_train[keep_indexes]
y_train_retag = y_train[keep_indexes]

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Fit the Logistic Regression model on the filtered training data
model_lr.fit(X_train_retag, y_train_retag)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

Metric                                   Score
Accuracy Score:                        0.95775
Balanced Accuracy Score:               0.94894
F1 Score (weighted):                   0.95771
Cohen Kappa Score:                     0.89960
Matthews Correlation Coefficient:      0.89961

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.93      0.93      7871
           1       0.97      0.97      0.97     18187

    accuracy                           0.96     26058
   macro avg       0.95      0.95      0.95     26058
weighted avg       0.96      0.96      0.96     26058


Confusion Matrix:

Predicted      0       1     All
True                            
0          7,294     577   7,871
1            524  17,663  18,187
All        7,818  18,240  26,058


In [11]:
from tqdm import tqdm, trange
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Loop over 50 iterations to retag the training data
for i in range(50):
    if i == 0:
        # On the first iteration, create copies of the original training data
        y_train_retag = y_train.copy()
        X_train_retag = X_train.copy()

    # Keep track of the dataset size at the start of each iteration
    dataset_size = X_train_retag.shape[0]

    # Initialize stratified k-fold cross-validation with 10 splits
    cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    
    # Initialize a logistic regression model with balanced class weights
    model_retag = LogisticRegression(random_state=i, n_jobs=-1, class_weight='balanced')
    
    # Perform cross-validated predictions on the training data
    y_train_retag_new = cross_val_predict(
        estimator=model_retag, 
        X=X_train_retag, 
        y=y_train_retag, 
        cv=cross_validation, 
        method="predict", 
        n_jobs=2
    )
    
    # Identify indexes where the new predictions match the old labels
    keep_indexes = y_train_retag_new == y_train_retag
    
    # Filter the training data to keep only the consistent samples
    X_train_retag_new = X_train_retag[keep_indexes]
    y_train_retag_new = y_train_retag[keep_indexes]

    # Update the training data for the next iteration
    y_train_retag = y_train_retag_new.copy()
    X_train_retag = X_train_retag_new.copy()

    # Record the dataset size at the end of the iteration and report the reduction
    new_dataset_size = X_train_retag.shape[0]
    print(f'Iteration {i}: {dataset_size} -> {new_dataset_size} (-{(dataset_size - new_dataset_size) / dataset_size:.2%})')


# After 50 iterations, train a final logistic regression model on the retagged data
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')
model_lr.fit(X_train_retag, y_train_retag)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

Iteration 0: 96781 -> 93625 (-3.26%)
Iteration 1: 93625 -> 92873 (-0.80%)
Iteration 2: 92873 -> 92560 (-0.34%)
Iteration 3: 92560 -> 92417 (-0.15%)
Iteration 4: 92417 -> 92344 (-0.08%)
Iteration 5: 92344 -> 92289 (-0.06%)
Iteration 6: 92289 -> 92258 (-0.03%)
Iteration 7: 92258 -> 92230 (-0.03%)
Iteration 8: 92230 -> 92205 (-0.03%)
Iteration 9: 92205 -> 92192 (-0.01%)
Iteration 10: 92192 -> 92185 (-0.01%)
Iteration 11: 92185 -> 92175 (-0.01%)
Iteration 12: 92175 -> 92163 (-0.01%)
Iteration 13: 92163 -> 92153 (-0.01%)
Iteration 14: 92153 -> 92149 (-0.00%)
Iteration 15: 92149 -> 92144 (-0.01%)
Iteration 16: 92144 -> 92137 (-0.01%)
Iteration 17: 92137 -> 92128 (-0.01%)
Iteration 18: 92128 -> 92119 (-0.01%)
Iteration 19: 92119 -> 92113 (-0.01%)
Iteration 20: 92113 -> 92108 (-0.01%)
Iteration 21: 92108 -> 92107 (-0.00%)
Iteration 22: 92107 -> 92105 (-0.00%)
Iteration 23: 92105 -> 92101 (-0.00%)
Iteration 24: 92101 -> 92099 (-0.00%)
Iteration 25: 92099 -> 92096 (-0.00%)
Iteration 26: 92096 ->

The table below compares the number of test-set misclassifications made by the end-model trained on the original weakly labeled dataset (WS) with the number made after retagging with the end-model predictions (WS + Retag):

| Dataset | Errors | Improvement |
|---------|--------|-------------|
| WS| 1152 | |
| WS + Retag| 1016 | 11.8% |

This is a meaningful reduction in test errors, especially considering the simplicity of the retagging process. By leveraging the model's predictive power to identify and filter out likely annotation errors, we improve the dataset's quality and, consequently, the performance of the end-model.
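
The figures in the table can be reproduced with a few lines of arithmetic. The baseline count is the sum of the off-diagonal cells of the baseline confusion matrix reported earlier (642 and 510); the retagged total is taken as reported in the table. Variable names are illustrative.

```python
# Test-set misclassifications of the baseline end-model (weak labels only):
# the two off-diagonal cells of its confusion matrix
errors_ws = 642 + 510

# Total reported in the table for the model trained after iterative retagging
errors_ws_retag = 1016

# Relative reduction in the number of test errors
relative_improvement = (errors_ws - errors_ws_retag) / errors_ws
print(f"{errors_ws} -> {errors_ws_retag} ({relative_improvement:.1%} fewer errors)")
```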

### Retagging in Traditional Supervised Learning

While commonly associated with weak supervision, **retagging** can also be a valuable technique in traditional supervised learning. In this context, it involves leveraging the predictions of an intermediate model to refine the original training labels, ultimately improving the quality of the dataset used to train the final model.

This process can be particularly helpful in addressing several common issues present in real-world datasets:

* **Annotation Errors:** Human annotators can make mistakes. An intermediate model can help identify and potentially correct these errors, leading to a more accurate gold standard.
* **Label Inconsistencies:** Large datasets annotated by multiple individuals can suffer from inconsistencies in label application. Retagging can help standardize labels and improve consistency.
* **Ambiguities:** Some data points may be inherently ambiguous or fall into gray areas within the labeling schema. Retagging can capitalize on the intermediate model's learned representations to potentially resolve these ambiguities more consistently.

Essentially, retagging in traditional supervised learning uses an iterative approach to improve the training data's quality. By using an initial model to highlight potential issues and refine labels, we aim to train a final model on a cleaner, more accurate dataset, leading to improved performance.

**Think of it this way:** Imagine you're training someone to identify different tree species. Providing them with a field guide with some errors or inconsistencies would make their learning process harder. Retagging is like having an experienced botanist review and correct the field guide based on their initial understanding, resulting in a more reliable resource for the learner.

Let's implement retagging in our dataset, but now we'll use the original labels, not the ones coming from the weak supervision pipeline.

In [12]:
df_train_full_supervision = pd.read_parquet('data/b2w/data_with_labels.parquet')
df_train_full_supervision.head()

Unnamed: 0,label,source,review_id,text
0,1,b2w,283982d6d87c11e9eca43ab0499ea49e804944aea3ab12...,"Bom pedido - Chegou dentro do prazo, e com bom..."
1,1,olist,742de54b6da4e794db56a8e0857ecb49,"Produto entregue antes do prazo, e em perfeita..."
2,0,b2w,a3837b7f7d18fb9b977fe4dec9653737a0f72221eb4140...,"pessimo - não recebi o produto certo, em vez d..."
3,1,b2w,dc76bdd02c963842a7803d92818b89b79c55c756021f83...,"Compraria de novo! - Adquiri o produto, que me..."
4,1,b2w,0c0cdf342d676c350202f455cec14d52824d2f6c75dfde...,"Ótimo produto, superou expectativas - A máquin..."


In [13]:
# Transform the text data in the training set using TF-IDF vectorization
X_train_full_supervision = tfidf.transform(df_train_full_supervision['text'])

# Extract the labels from the training set
y_train_full_supervision = df_train_full_supervision['label']

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Fit the Logistic Regression model on the transformed training data
model_lr.fit(X_train_full_supervision, y_train_full_supervision)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

Metric                                   Score
Accuracy Score:                        0.97034
Balanced Accuracy Score:               0.97197
F1 Score (weighted):                   0.97054
Cohen Kappa Score:                     0.93064
Matthews Correlation Coefficient:      0.93123

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.98      0.95      7871
           1       0.99      0.97      0.98     18187

    accuracy                           0.97     26058
   macro avg       0.96      0.97      0.97     26058
weighted avg       0.97      0.97      0.97     26058


Confusion Matrix:

Predicted      0       1     All
True                            
0          7,683     188   7,871
1            585  17,602  18,187
All        8,268  17,790  26,058


In [14]:
# Create copies of the original training data
y_train_retag = y_train_full_supervision.copy()
X_train_retag = X_train_full_supervision.copy()

# Record the dataset size before filtering
dataset_size = X_train_retag.shape[0]

# Initialize stratified k-fold cross-validation with 10 splits
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=271828)

# Initialize a logistic regression model with balanced class weights
model_retag = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Perform cross-validated predictions on the training data
y_train_retag_new = cross_val_predict(
    estimator=model_retag, 
    X=X_train_retag, 
    y=y_train_retag, 
    cv=cross_validation, 
    method="predict", 
    n_jobs=2
)

# Identify indexes where the new predictions match the old labels
keep_indexes = y_train_retag_new == y_train_retag

# Filter the training data to keep only the consistent samples
X_train_retag_new = X_train_retag[keep_indexes]
y_train_retag_new = y_train_retag[keep_indexes]

# Replace the training data with the filtered version
y_train_retag = y_train_retag_new.copy()
X_train_retag = X_train_retag_new.copy()

# Record the dataset size after filtering and report the reduction
new_dataset_size = X_train_retag.shape[0]
print(f'{dataset_size} -> {new_dataset_size} (-{(dataset_size - new_dataset_size) / dataset_size:.2%})')

# Train a final logistic regression model on the filtered data (single retagging pass)
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')
model_lr.fit(X_train_retag, y_train_retag)

# Predict the labels for the test set
y_test_pred = model_lr.predict(X_test)

print_classification_metrics(y_test, y_test_pred)

107521 -> 102106 (-5.04%)
Metric                                   Score
Accuracy Score:                        0.96723
Balanced Accuracy Score:               0.97007
F1 Score (weighted):                   0.96750
Cohen Kappa Score:                     0.92364
Matthews Correlation Coefficient:      0.92455

Classification Report:

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      7871
           1       0.99      0.96      0.98     18187

    accuracy                           0.97     26058
   macro avg       0.95      0.97      0.96     26058
weighted avg       0.97      0.97      0.97     26058


Confusion Matrix:

Predicted      0       1     All
True                            
0          7,692     179   7,871
1            675  17,512  18,187
All        8,367  17,691  26,058


| Dataset | Errors | Improvement |
|---------|--------|-------------|
| Full Supervision| 773 | |
| Full Supervision + Retag| 854 | -10.5% |

This time, retagging did not help: the number of test errors rose from 773 to 854. This can happen, so it is important to evaluate the impact of retagging in each specific case.

## Theoretical Introduction to Confident Learning

Noisy labels are a prevalent challenge in machine learning. **Confident Learning** is a framework designed to train models effectively in the presence of such noise. It hinges on the idea of a **label noise transition matrix**, which models the probabilities of encountering incorrect labels so that their impact can be mitigated during training.

### Key Concepts in Confident Learning

1. **Label Noise Transition Matrix:** This matrix forms the bedrock of Confident Learning. It captures the probabilities of observing a particular label given the true underlying label. For instance, imagine a binary classification problem where the true label is "positive." The transition matrix would quantify the probabilities of observing a "positive" label (correct) and a "negative" label (incorrect) in our dataset. By estimating these probabilities, we gain valuable insights into the nature and extent of label noise.

2. **Confidence Thresholding:** Not all predictions are created equal. Confident Learning leverages this by employing confidence thresholding during training. This technique involves setting a predefined threshold and only considering predictions that exceed this threshold. By focusing on high-confidence predictions, we minimize the influence of potentially mislabeled data points, leading to a more robust learning process.

3. **Model Agnostic:** Confident Learning is model-agnostic, meaning it can be applied to a wide range of machine learning models. This flexibility allows practitioners to exploit the benefits of Confident Learning across various domains and applications.

4. **Datacentric Approach:** Confident Learning emphasizes the importance of data quality in model training. By addressing label noise directly at the data level, this approach ensures that models learn from clean, reliable data, leading to improved performance and generalization.

For a deeper treatment, see the [Confident Learning paper](https://arxiv.org/pdf/1911.00068).
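
To make confidence thresholding concrete, the sketch below follows the per-class thresholding described in the Confident Learning paper: the threshold for class $j$ is the average predicted probability of class $j$ among examples labeled $j$, and each example is counted under the most probable class that clears its threshold. It is a simplified illustration, not the cleanlab implementation. It assumes integer labels `0..K-1` and out-of-sample probabilities, and the function and variable names are hypothetical.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples by (given label, confidently predicted class).

    The threshold t_j is the mean predicted probability of class j over the
    examples whose given label is j. An example lands in cell (i, j) when its
    given label is i and j is the most probable class among those whose
    probability reaches its threshold. Examples that clear no threshold are
    left out of the counts.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    above = pred_probs >= thresholds                  # shape (n, n_classes)
    masked = np.where(above, pred_probs, -np.inf)     # ignore classes below threshold
    confident_class = masked.argmax(axis=1)
    has_confident_class = above.any(axis=1)
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for given, predicted in zip(labels[has_confident_class],
                                confident_class[has_confident_class]):
        counts[given, predicted] += 1
    return counts, thresholds

# Hypothetical given labels and out-of-sample probabilities
given_labels = np.array([0, 0, 0, 1, 1, 1])
oos_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but confidently predicted as 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.35, 0.65],  # clears neither threshold, so it is not counted
])
cj_counts, cj_thresholds = confident_joint(given_labels, oos_probs)
print(cj_thresholds)
print(cj_counts)
```

Off-diagonal counts, such as the single example in cell `(0, 1)` here, are the candidates for label errors. Normalizing the counts gives an estimate of the joint distribution of observed and true labels.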

### Understanding the Label Noise Transition Matrix

The **label noise transition matrix** is a fundamental concept in Confident Learning. It provides a structured way to model and address label noise in machine learning datasets. This matrix captures the probabilities of observing different labels given the true underlying labels, enabling us to quantify and mitigate the impact of noisy labels during model training.

#### Observed vs. True Labels

To understand the label noise transition matrix, let's first look at an example of the counts of observed labels ($\tilde{y}$) compared to the true labels ($y^*$) in a confusion matrix format:

<div align="center">

| $C_{\tilde{y}, y^*}$ | $y^*=\text{bus}$ | $y^*=\text{car}$ | $y^*=\text{bike}$ |
|----------------------|------------------|------------------|------------------|
| $\tilde{y}=\text{bus}$ | 90 | 35 | 25 |
| $\tilde{y}=\text{car}$ | 50 | 70 | 5 |
| $\tilde{y}=\text{bike}$ | 30 | 15 | 75 |

</div>

#### Transition Matrix

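Dividing each count by the total number of examples (395) gives the estimated joint distribution of observed and true labels, shown below. Each entry is the fraction of all examples with that combination of observed and true label. The conditional noise transition matrix $P(\tilde{y} \mid y^*)$ follows from it by dividing each column by its sum.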

<div align="center">

| $\hat{Q}(\tilde{y} , y^*)$ | $y^*=\text{bus}$ | $y^*=\text{car}$ | $y^*=\text{bike}$ |
|----------------------|------------------|------------------|------------------|
| $\tilde{y}=\text{bus}$ | 0.22 | 0.09 | 0.06 |
| $\tilde{y}=\text{car}$ | 0.13 | 0.18 | 0.01 |
| $\tilde{y}=\text{bike}$ | 0.08 | 0.04 | 0.19 |

</div>
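
As a quick check of the normalization, the sketch below rebuilds the probability table from the counts table above. The variable names are illustrative, and the printed values can differ from the table in the last digit because of rounding.

```python
import numpy as np

# Counts C[observed, true] copied from the counts table above
# (rows: observed bus/car/bike, columns: true bus/car/bike).
counts = np.array([
    [90, 35, 25],
    [50, 70, 5],
    [30, 15, 75],
])

# Dividing by the grand total gives the joint distribution Q(observed, true).
joint = counts / counts.sum()

print(counts.sum())        # total number of examples
print(np.round(joint, 2))  # compare with the probability table above
```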

#### Key Insights

- **Diagonal Elements**: These are the joint probabilities that the observed label matches the true label. For example, $\hat{Q}(\tilde{y}=\text{bus}, y^*=\text{bus}) = 0.22$.
- **Off-Diagonal Elements**: These are the joint probabilities of label flips, where the observed label differs from the true label. For instance, the probability that an example is truly a "bus" and is labeled "car" is $\hat{Q}(\tilde{y}=\text{car}, y^*=\text{bus}) = 0.13$.

#### Applications

Understanding the label noise transition matrix allows us to:

1. **Identify Label Errors**: By analyzing the off-diagonal elements, we can detect inconsistencies and potential errors in the dataset.
2. **Learn with Noisy Labels**: This matrix helps in designing algorithms that can handle noisy labels more effectively by providing insights into the nature and extent of the noise.
3. **Detect Ontological Issues**: It can reveal fundamental issues in how datasets are labeled, such as ambiguous class definitions or overlapping categories.

> **Example**: Consider the entry $\hat{Q}(\tilde{y}=\text{car}, y^*=\text{bus}) = 0.13$. It means that about 13% of all examples are true "bus" instances labeled as "car". Conditioned on the true label, the flip rate is higher: $50 / 170 \approx 0.29$ of the true "bus" examples carry the "car" label. Such insights can guide us in refining our data collection and labeling processes to improve overall data quality.

#### Joint Distribution

From the joint distribution of observed and true labels, we can derive marginal and conditional probabilities. This deeper statistical understanding is crucial for:

- **Improving Model Training**: By using clean data or appropriately weighting the noisy labels during training.
- **Related Work**: Many approaches in machine learning depend on knowing the prior distributions and transition matrices to adjust for label noise.


> **Note**: Confident Learning's key contribution is solving for the joint distribution of true and observed labels, providing essential statistics for learning with noisy labels. This approach allows for more accurate modeling and correction of label noise, improving the robustness and reliability of machine learning models.
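
To make the link between the joint distribution and the other statistics concrete, here is a minimal sketch that derives the marginals, the conditional noise transition matrix, and an overall noise rate from the bus/car/bike example. It assumes the true labels are known, as in that illustrative table; the variable names are not from the original notebook.

```python
import numpy as np

# Joint distribution Q[observed, true], rebuilt from the example counts.
counts = np.array([[90, 35, 25], [50, 70, 5], [30, 15, 75]])
joint = counts / counts.sum()

# Marginals: sum the joint over the other variable.
p_true = joint.sum(axis=0)      # p(y*), one value per column
p_observed = joint.sum(axis=1)  # p(observed label), one value per row

# Noise transition matrix p(observed | true): normalize each column.
# Broadcasting divides column j by p_true[j].
transition = joint / p_true

# Overall noise rate: probability mass off the diagonal.
noise_rate = 1.0 - np.trace(joint)

print(np.round(p_true, 3))
print(np.round(transition, 3))
print(round(float(noise_rate), 3))
```

Each column of `transition` sums to 1, which is what distinguishes the conditional matrix from the joint table, whose entries sum to 1 over the whole matrix.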

### How to Estimate $P(\tilde{y} , y^*)$

To estimate the joint distribution $P(\tilde{y} , y^*)$ of observed and true labels, from which the label noise transition matrix follows, we'll need two key components:

- **Noisy labels $\tilde{y}$**: The labels observed in the dataset.
- **Predicted probabilities $\hat{p}(\tilde{y}; x, \theta)$**: The model's predicted probabilities for each class given the input $x$ and model parameters $\theta$.

#### Step 1: Obtain Out-of-Sample Predictions

To estimate the transition matrix, we'll first need to obtain out-of-sample predictions from our model. Because Confident Learning is model-agnostic, we can use anything from a simple logistic regression to a complex neural network as our model, as long as it provides predicted probabilities for each class.

To obtain out-of-sample predictions, we'll need to use cross-validation to ensure that the predictions are not influenced by the data used for training. This step is crucial to ensure that our estimates are reliable and not overfit to the training data.

In [15]:
# Initialize stratified k-fold cross-validation with 20 splits
# StratifiedKFold ensures that each fold has the same proportion of classes as the original dataset
cross_validation = StratifiedKFold(n_splits=20, shuffle=True, random_state=271828)

# Initialize a Logistic Regression model with balanced class weights
# 'random_state' ensures reproducibility, 'n_jobs=-1' uses all available processors
model_lr = LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced')

# Perform cross-validated predictions on the training data
# 'method="predict_proba"' returns the predicted probabilities for each class
# 'n_jobs=2' uses 2 processors for parallel computation
y_train_preds = cross_val_predict(estimator=model_lr, X=X_train, y=y_train, cv=cross_validation, method="predict_proba", n_jobs=2)

# Display the first 5 rows of the predicted probabilities
y_train_preds[:5]

array([[4.66928707e-01, 5.33071293e-01],
       [9.99857512e-01, 1.42488429e-04],
       [9.99725257e-01, 2.74742632e-04],
       [1.49207705e-02, 9.85079229e-01],
       [6.64148555e-01, 3.35851445e-01]])


#### Step 2: Compute the Thresholded Confidences

**Key idea:** Find thresholds as a proxy for the machine’s self-confidence, on average, for each task/class $ j $.

$$
t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y} = j; x, \theta)
$$

This formula is used to compute the threshold $ t_j $ for a particular class $ j $ in the context of confident learning. Here's what each part represents:

- $ t_j $: The threshold for class $ j $. This threshold serves as a proxy for the machine's self-confidence in its predictions for class $ j $.

- $ |X_{\tilde{y}=j}| $: The number of data points in the set $ X $ whose observed (given) label is class $ j $. Here, $ \tilde{y} $ denotes the observed, possibly noisy, label, not the model's prediction.

- $ X_{\tilde{y}=j} $: The set of data points whose observed label is class $ j $.

- $ \sum_{x \in X_{\tilde{y}=j}} $: This summation indicates that we are summing over all data points $ x $ in the set $ X_{\tilde{y}=j} $.

- $ \hat{p}(\tilde{y} = j; x, \theta) $: The predicted probability that data point $ x $ belongs to class $ j $, given the model parameters $ \theta $.


**Explanation**:

- **Class-specific Threshold Calculation:** The formula calculates an average confidence score for each class $ j $. This average confidence score serves as a threshold $ t_j $.
- **Summation of Confidence Scores:** For each data point $ x $ whose observed label is class $ j $, we take the predicted probability $ \hat{p}(\tilde{y} = j; x, \theta) $ that the model assigns to class $ j $.
- **Averaging:** We sum these predicted probabilities for all data points in $ X_{\tilde{y}=j} $ and then divide by the number of data points $ |X_{\tilde{y}=j}| $ to get the average confidence score.
- **Purpose:** This average confidence score $ t_j $ acts as a threshold to determine how confident the model is, on average, about its predictions for class $ j $. If the model's confidence for a particular prediction exceeds this threshold, it can be considered a confident prediction.

In [16]:
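# Calculate the threshold for the negative class as the mean predicted probability of class 0
# Note: this averages over all training examples, whereas the formula for t_j above
# averages only over the examples whose observed label is j
t_negative = y_train_preds.mean(axis=0)[0]

# Calculate the threshold for the positive class in the same way (mean predicted probability of class 1)
t_positive = y_train_preds.mean(axis=0)[1]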

# Print the calculated threshold for the negative class
print(f'Threshold for negative class: {t_negative:.4f}')

# Print the calculated threshold for the positive class
print(f'Threshold for positive class: {t_positive:.4f}')

Threshold for negative class: 0.2798
Threshold for positive class: 0.7202
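
The cell above takes the column means over all training examples. The formula for $t_j$ is stricter: it averages the probability of class $j$ only over the examples whose observed label is $j$. A minimal sketch of that version follows, on hypothetical toy values rather than the notebook's data; `per_class_thresholds` is an illustrative helper name, not a library function.

```python
import numpy as np

def per_class_thresholds(labels, pred_probs):
    """Average self-confidence t_j over the examples whose observed label is j."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]
    return np.array([
        pred_probs[labels == j, j].mean() for j in range(n_classes)
    ])

# Toy data (hypothetical values, not taken from the notebook's dataset).
toy_labels = np.array([0, 0, 1, 1, 1])
toy_probs = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.7, 0.3],
])

thresholds = per_class_thresholds(toy_labels, toy_probs)
print(thresholds)              # t_0 = (0.9 + 0.6) / 2, t_1 = (0.8 + 0.7 + 0.3) / 3
print(toy_probs.mean(axis=0))  # column means over all rows, for comparison
```

With this definition the thresholds need not sum to 1, unlike the column means of a binary probability matrix. On the notebook's data the call would presumably be `per_class_thresholds(y_train.to_numpy(), y_train_preds)`, assuming `y_train` is a pandas Series of integer labels.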


#### Step 3: Compute the Error Sets

Next, for each pair of classes $(i, j)$ we collect the data points whose observed label is $i$ but which the model confidently assigns to class $j$. For $i \neq j$ these are the likely label errors:

$$
\hat{X}_{\tilde{y}=i, y^*=j} = \{ x \in X_{\tilde{y}=i} : \hat{p}(\tilde{y} = j; x, \theta) \geq t_j \}
$$

**Explanation**:

- **$ \hat{X}_{\tilde{y}=i, y^*=j} $**: The set of data points $x$ whose observed label is class $i$ but which have a high predicted probability of actually belonging to class $j$.
- **$ X_{\tilde{y}=i} $**: The set of data points whose observed label is class $i$.
- **$ \hat{p}(\tilde{y} = j; x, \theta) \geq t_j $**: The condition that the predicted probability of class $j$ for data point $x$ is greater than or equal to the threshold $t_j$.

**Purpose**:

- Identify data points that are likely mislabeled by comparing the model's predictions against the computed thresholds.
- These error sets help in estimating the true label noise transition matrix by providing insights into where the model's predictions diverge from the noisy labels.
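
The set definition translates almost directly into a boolean mask. The sketch below uses hypothetical toy values and an illustrative helper `error_set`; it is meant to show the mechanics, not to reproduce the notebook's results.

```python
import numpy as np

# Toy data (hypothetical): observed labels and out-of-sample probabilities.
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.95, 0.05],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model is confident it is class 1
    [0.15, 0.85],
    [0.30, 0.70],
    [0.90, 0.10],  # labeled 1, but the model is confident it is class 0
])

# Per-class thresholds, averaged over the examples with observed label j.
thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(2)])

def error_set(i, j):
    """Indices of examples labeled i whose probability for class j reaches t_j."""
    mask = (labels == i) & (pred_probs[:, j] >= thresholds[j])
    return np.flatnonzero(mask)

print(thresholds)
print(error_set(0, 1))  # labeled 0, confidently class 1
print(error_set(1, 0))  # labeled 1, confidently class 0
```

The sets with `i == j` contain the examples whose observed label the model confidently confirms; their sizes fill the diagonal of the counts matrix.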

Through these steps, we can estimate the joint distribution $P(\tilde{y}, y^*)$ of observed and true labels, from which the label noise transition matrix follows.

In [17]:
# Create boolean masks for negative and positive examples in the training data
negative_mask = y_train == 0  # Mask for negative examples (class 0)
positive_mask = y_train == 1  # Mask for positive examples (class 1)

# Get the indices of negative and positive examples using the masks
negative_indices = y_train[negative_mask].index  # Indices of negative examples
positive_indices = y_train[positive_mask].index  # Indices of positive examples

# Create boolean masks for negative and positive examples below their respective thresholds
# y_train_preds[negative_indices][:, 0] extracts the predicted probabilities for class 0 for negative examples
negative_below_threshold_mask = y_train_preds[negative_indices][:, 0] < t_negative  # Mask for negative examples below threshold
# y_train_preds[positive_indices][:, 1] extracts the predicted probabilities for class 1 for positive examples
positive_below_threshold_mask = y_train_preds[positive_indices][:, 1] < t_positive  # Mask for positive examples below threshold

# Get the indices of negative and positive examples that are below the threshold
negative_below_threshold_indices = negative_indices[negative_below_threshold_mask]  # Indices of negative examples below threshold
positive_below_threshold_indices = positive_indices[positive_below_threshold_mask]  # Indices of positive examples below threshold

# Get the indices of examples that are above the threshold for both classes
# This includes all indices that are not in negative_below_threshold_indices or positive_below_threshold_indices
above_threshold_indices = [
    i for i in range(len(y_train)) 
    if i not in negative_below_threshold_indices and i not in positive_below_threshold_indices
]

# Display the first 5 indices of negative examples below the threshold, positive examples below the threshold, and examples above the threshold
negative_below_threshold_indices[:5], positive_below_threshold_indices[:5], above_threshold_indices[:10]

(Index([779, 965, 1430, 1431, 1721], dtype='int64'),
 Index([0, 9, 25, 37, 55], dtype='int64'),
 [1, 2, 3, 4, 5, 6, 7, 8, 10, 11])

In [18]:
import numpy as np

# Define mappings between integer labels and their string representations
int_to_str = {0: 'Negative', 1: 'Positive'}
str_to_int = {v: k for k, v in int_to_str.items()}  # Reverse mapping from string to integer

# Calculate the threshold for each class as the mean predicted probability of that class
# (averaged over all training examples); a class counts as confidently predicted when its score reaches this value
thresholds_for_each_class = {i: y_train_preds.mean(axis=0)[i] for i in range(y_train_preds.shape[1])}

# Combine indices of examples below the threshold and above the threshold
# Select the first 2 examples from each category for demonstration
example_indices = list(negative_below_threshold_indices[:2]) + list(positive_below_threshold_indices[:2]) + list(above_threshold_indices[:10])

# Initialize a counts matrix: rows are observed labels, columns are confidently predicted classes
counts_matrix = np.zeros((len(int_to_str), len(int_to_str)))

# Print the threshold for each class
for class_idx, class_name in int_to_str.items():
    print(f'Threshold for class {class_name}: {thresholds_for_each_class[class_idx]:.4f}')

# Iterate over the selected example indices
for idx in example_indices:
    # Get the observed (possibly noisy) label and predicted confidence scores for the current example
    observed_label = y_train[idx]
    confidence_scores = y_train_preds[idx]
    
    # Print the index and text of the current example
    print(f'Index: {idx}')
    print(f'Text: \"{df_train.loc[idx, "text"]}\"')
    print(f'True Label: {int_to_str[observed_label]}')
    
    predicted_class = None

    # Print confidence scores for each class and determine the predicted class
    for i, confidence_score in enumerate(confidence_scores):
        output_str = f'Confidence score for class {int_to_str[i]}: {confidence_score:.4f}'
        if confidence_score >= thresholds_for_each_class[i]:
            predicted_class = i
            output_str += ' (Threshold met)'
        else:
            output_str += ' (Threshold not met)'
        print(output_str)
        
    # Update the counts matrix: row = observed label, column = class whose threshold was met
    counts_matrix[observed_label, predicted_class] += 1

    # Print the predicted class
    print(f'Predicted class: {int_to_str[predicted_class]}\n')
    print('-' * 50)

# Print the counts matrix to show the distribution of observed labels vs. confidently predicted classes
print(counts_matrix)

Threshold for class Negative: 0.2798
Threshold for class Positive: 0.7202
Index: 779
Text: "rapidez - so faltou o outro que foi comprado junto"
True Label: Negative
Confidence score for class Negative: 0.2380 (Threshold not met)
Confidence score for class Positive: 0.7620 (Threshold met)
Predicted class: Positive

--------------------------------------------------
Index: 965
Text: "engano - me sinto enganada qualidade nota 0 fiquei muito triste quando abri o produto"
True Label: Negative
Confidence score for class Negative: 0.2740 (Threshold not met)
Confidence score for class Positive: 0.7260 (Threshold met)
Predicted class: Positive

--------------------------------------------------
Index: 0
Text: "nao gostei do produto! - o acabamento e muito bom. mas a casa nao fica montada! encaixes frouxos! ja tentei colar com tudo! nao consegui! pelo preco esperava muito mais!"
True Label: Positive
Confidence score for class Negative: 0.4669 (Threshold met)
Confidence score for class Positive: 

In [19]:
import numpy as np
import pandas as pd

def counts_matrix_to_dataframe(confusion_matrix, labels, name='C_{ŷ,y*}'):
    # Convert the confusion matrix (a numpy array) into a pandas DataFrame
    df = pd.DataFrame(confusion_matrix, index=labels, columns=labels)
    # Set the name of the index to the provided name (default is 'C_{ŷ,y*}')
    df.index.name = name
    return df

# Define the labels for the confusion matrix, corresponding to the classes
labels = ["negative", "positive"]

# Convert the counts matrix (confusion matrix) to a pandas DataFrame for better readability
df_counts_matrix = counts_matrix_to_dataframe(counts_matrix, labels)

# Display the DataFrame to visualize the confusion matrix
df_counts_matrix

| C_{ŷ,y*} | negative | positive |
|----------|----------|----------|
| negative | 4.0 | 2.0 |
| positive | 2.0 | 6.0 |


In [20]:
counts_matrix_to_dataframe(counts_matrix / counts_matrix.sum(), labels, name='Q̂_{ŷ,y*}')

| Q̂_{ŷ,y*} | negative | positive |
|----------|----------|----------|
| negative | 0.285714 | 0.142857 |
| positive | 0.142857 | 0.428571 |

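The counting procedure generalizes to any number of classes. The sketch below is one possible vectorized version under stated assumptions: thresholds are averaged per observed class, an example meeting several thresholds is assigned to the qualifying class with the highest probability, and an example meeting none is skipped. The function name `confident_joint` and the toy values are illustrative. The published method additionally calibrates the counts so that row sums match the observed label counts; that step is omitted here.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples by (observed label, confidently predicted class)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class thresholds: mean self-confidence of the examples labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )

    # Keep only the probabilities that reach their class threshold.
    eligible = np.where(pred_probs >= thresholds, pred_probs, -np.inf)
    confident = np.isfinite(eligible).any(axis=1)  # at least one class qualifies
    best_class = eligible.argmax(axis=1)           # most probable qualifying class

    counts = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(counts, (labels[confident], best_class[confident]), 1)
    return counts

# Toy 3-class data (hypothetical values).
toy_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.08, 0.88, 0.04],  # labeled 0, confidently class 1
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],  # below every threshold, skipped
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
    [0.40, 0.35, 0.25],  # below every threshold, skipped
])

toy_counts = confident_joint(toy_labels, toy_probs)
print(toy_counts)
print(toy_counts / toy_counts.sum())  # normalized estimate of the joint
```

Off-diagonal cells of `toy_counts` point at candidate label errors; here the single off-diagonal count corresponds to the third example.
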

## The Pervasiveness and Impact of Label Noise in Real-World Datasets

Label noise is a widespread issue in real-world datasets, affecting both training and test sets. [Northcutt et al., 2021](http://arxiv.org/abs/2103.14749) used **Confident Learning** to estimate the prevalence of label noise in popular datasets and assess its impact on model performance.

### Prevalence of Label Noise in Popular Datasets

The study analyzed several widely used datasets, revealing varying levels of label noise:

<div align="center">

| Dataset | CL Guessed | MTurk Checked | Validated Errors | Estimated Errors | % Error |
|-------------|------------|---------------|------------------|------------------|---------|
| MNIST | 100 | 100 (100%) | 15 | — | 0.15% |
| CIFAR-10 | 275 | 275 (100%) | 54 | — | 0.54% |
| CIFAR-100 | 2,235 | 2,235 (100%) | 585 | — | 5.85% |
| Caltech-256 | 4,643 | 400 (8.6%) | 65 | 754 | 2.46% |
| ImageNet* | 5,440 | 5,440 (100%) | 2,916 | — | 5.83% |
| QuickDraw | 6,825,383 | 2,500 (0.04%) | 1,870 | 5,105,386 | 10.12% |
| 20news | 93 | 93 (100%) | 82 | — | 1.11% |
| IMDB | 1,310 | 1,310 (100%) | 725 | — | 2.90% |
| Amazon | 533,249 | 1,000 (0.2%) | 732 | 390,338 | 3.90% |
| AudioSet | 307 | 307 (100%) | 275 | — | 1.35% |

</div>

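*Note: "CL Guessed" refers to the number of samples suspected to be mislabeled by Confident Learning. "MTurk Checked" indicates the number of samples verified by human annotators on Amazon Mechanical Turk. The asterisk on ImageNet indicates that the validation set was used, because the ImageNet test labels are not publicly available.*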

**Key Observations:**

- **Widespread Label Noise:** Even datasets considered clean, like MNIST, contain mislabeled samples.
- **Varied Noise Levels:** Label noise ranges from as low as 0.15% in MNIST to over 10% in QuickDraw.
- **Impact on Large Datasets:** Large-scale datasets like QuickDraw and Amazon Reviews have significant absolute numbers of label errors due to their size.
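
For the three datasets that were only partially checked, the "Estimated Errors" column is consistent with a simple extrapolation: the fraction of checked samples that were validated as errors, scaled up to all samples flagged by Confident Learning. This is a reading of the table rather than a procedure quoted from the paper, and `extrapolate_errors` is an illustrative helper.

```python
# (dataset, CL guessed, MTurk checked, validated errors), copied from the table above.
partially_checked = [
    ("Caltech-256", 4_643, 400, 65),
    ("QuickDraw", 6_825_383, 2_500, 1_870),
    ("Amazon", 533_249, 1_000, 732),
]

def extrapolate_errors(cl_guessed, checked, validated):
    """Scale the validated fraction of the checked sample up to all CL guesses."""
    # Integer arithmetic: multiply first, then floor-divide.
    return cl_guessed * validated // checked

estimates = {
    name: extrapolate_errors(guessed, checked, validated)
    for name, guessed, checked, validated in partially_checked
}
print(estimates)
```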

### Impact on Model Performance

Noisy labels adversely affect machine learning models in several ways:

- **Learning Incorrect Associations:** Models may learn to associate incorrect features with labels, leading to poor predictive performance.
- **Reduced Accuracy:** The presence of mislabeled data can lower the model's accuracy on both training and unseen data.
- **Overfitting to Noise:** Models might overfit to noisy labels, capturing noise rather than the underlying patterns.

*Example:* If images of cats are mislabeled as dogs, a model might learn erroneous features that distinguish dogs, negatively impacting its ability to correctly classify new images.

### Repercussions of Label Errors in Test Data

Label noise in test sets poses significant challenges:

- **Inaccurate Performance Metrics:** Evaluation metrics become unreliable when based on incorrect labels, misleading stakeholders about the model's true capabilities.
- **Misguided Model Comparisons:** Comparing models using noisy test data may favor models that perform well on mislabeled samples rather than genuinely robust models.
- **False Confidence:** Overestimating a model's performance can lead to deploying inadequate models in critical applications.

*Analogy:* Testing a student's knowledge with an exam containing wrong answers in the answer key would not accurately assess their understanding.
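
A small constructed example shows how a noisy test set can reverse a model comparison. All arrays below are hypothetical: two of the ten test labels are wrong in the noisy version, Model A reproduces those mistakes, and Model B predicts the corrected classes.

```python
import numpy as np

# Hypothetical toy test set: the labels at indices 2 and 7 are wrong in the noisy version.
noisy_labels     = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
corrected_labels = np.array([1, 1, 1, 0, 1, 0, 1, 1, 1, 0])

# Model A agrees with the mislabeled entries; Model B agrees with the corrections.
model_a = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
model_b = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 0])

def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return float(np.mean(predictions == labels))

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(preds, noisy_labels), accuracy(preds, corrected_labels))
```

Model A looks better on the noisy labels and Model B looks better on the corrected ones, so the choice of test labels decides which model would be selected.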

### Addressing Potential Misconceptions

- **"Minor Noise is Insignificant":** Even small percentages of label noise can substantially impact model performance, especially in sensitive domains like healthcare.
- **"Models Naturally Handle Noise":** While some algorithms are robust to random noise, relying on this without addressing the root cause is risky and can lead to suboptimal models. Also, as shown before, label noise can be systematic and not random.
- **"Data Cleaning is Too Costly":** Investing in data quality upfront saves resources in the long run by reducing the need for complex models to compensate for noisy data.

### Key Insights

- **Improved Model Selection:** Relying solely on accuracy measured against the original, uncorrected test labels may lead to choosing inferior models. Accuracy measured against corrected test labels provides a more reliable basis for model evaluation.

- **Data Quality Matters:** High-quality, accurately labeled data is crucial for both training and evaluating models. Techniques like Confident Learning help enhance data quality, leading to better-performing models.

*Example:* In a sentiment analysis task, mislabeling positive reviews as negative can confuse the model, leading to poor performance in distinguishing sentiments.

# Questions

1. What are the two main categories of Annotation Error Detection (AED) methods, and how do they differ in their output?

2. Explain the concept of "label flipping" and provide two examples of how it might occur in different machine learning tasks.

3. What is the difference between aleatoric and epistemic uncertainty in machine learning?

4. Describe the three primary types of label noise and provide a real-world example of how asymmetric label noise might occur in an image classification task.

5. How does the concept of a "label noise process assumption" help in disentangling aleatoric and epistemic uncertainty?

6. What is the purpose of "retagging" in the context of a Weak Supervision Learning (WSL) pipeline, and how does it improve the quality of the training dataset?

7. Explain how Confident Learning leverages the concept of a "label noise transition matrix" to address the issue of noisy labels in datasets.

8. Why is it crucial to obtain out-of-sample predictions when estimating the label noise transition matrix using Confident Learning?

9. What are the potential consequences of having label errors present in the test set used to evaluate a machine learning model?

10. How does the incorporation of Annotation Error Detection (AED) enhance the process of Weak Supervision?

`Answers are commented inside this cell.`

<!-- 1. The two main categories of AED methods are **Flaggers** and **Scorers**. Flaggers make binary classifications, labeling instances as either correct or incorrect. Scorers, on the other hand, assign a score to each instance, reflecting the likelihood of an annotation error.

2. **Label flipping** occurs when an instance is assigned an incorrect label, effectively switching it from its true category to another. For example:
- In **image classification**, an image of a bus might be incorrectly labeled as a car.
- In **sentiment analysis**, a positive review could be mistakenly tagged as negative.

3. **Aleatoric uncertainty** stems from innate randomness or noise in the data itself, while **epistemic uncertainty** arises from limitations in the model's knowledge or representation of the data.

4. The three primary types of label noise are:
- **Uniform/Symmetric:** Mislabeling occurs with equal probability across all incorrect classes.
- **Systematic/Asymmetric:** Mislabeling probabilities vary between class pairs.
- **Instance-Dependent:** Mislabeling depends on both the true class and the instance features.
**Example of Asymmetric Noise:** In image classification, a cat might be more likely to be mislabeled as a dog (visually similar) than a bird.

5. The **label noise process assumption** helps separate the effects of noisy labels (aleatoric uncertainty) from the model's built-in limitations (epistemic uncertainty). By assuming a specific mechanism for how noise corrupts labels, we can better isolate and address its impact.

6. **Retagging** in WSL uses the predictions of an interim model to identify and correct potential annotation errors in the dataset. By comparing model predictions to existing labels, discrepancies are flagged, potentially indicating mislabeling and enabling dataset refinement.

7. The **label noise transition matrix** quantifies the probabilities of observing a particular label given the true (latent) label. Confident Learning estimates the joint distribution of observed and true labels, from which this matrix follows, and uses it to model the noise process and reduce the influence of noisy labels.

8. Out-of-sample predictions are essential to avoid overfitting and ensure that the estimated transition matrix generalizes well to unseen data. If predictions were made on the training data, the matrix might capture noise specific to that data, leading to inaccurate noise modeling.

9. Label errors in the test set can lead to inaccurate performance metrics, misleading model comparisons, and false confidence in a model's capabilities. This can have significant consequences, especially when deploying models in real-world applications.

10. AED enhances Weak Supervision by identifying potential errors in the labels generated by the weak supervision process. This identification allows for targeted actions like manual review and correction or data exclusion, resulting in a more refined and reliable training dataset for downstream machine learning models. -->