# Theoretical Introduction to Data-Centric AI and Weakly Supervised Learning
## Learning with Limited Labels: Weak Supervision and Uncertainty-Aware Training
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Course Logistics: A Few Things You Should Know

### Ask Questions
Please let me know how things are progressing for you throughout the course.

### Basics
**Office hours**: If you need to meet, please [email me](mailto:elias.jacob@ufrn.br).


### Requirements for This Course
To succeed in this course, you should have:

- Access to the dataset and the code provided.
- A foundational understanding of machine learning.
- Basic knowledge of natural language processing (NLP).
- Proficiency in Python.
- Familiarity with Jupyter Notebooks.
- Basic knowledge of statistics.
- An understanding of data science and machine learning concepts and stacks.

### Teaching Approach
#### Top-Down Method
I will be employing a *top-down* teaching method, which contrasts with the traditional *bottom-up* approach. In a *bottom-up* approach, you typically learn all the individual components first and then gradually combine them into more complex structures. This approach often leads to students losing motivation, lacking a sense of the "big picture," and not knowing what they will need in practice.

According to Harvard Professor David Perkins in his book [Making Learning Whole](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719), effective learning can be likened to learning a sport like baseball. Children are not required to memorize all the rules and understand every technical detail before they start playing. Instead, they begin by playing with a general understanding and gradually learn more rules and details over time.

#### Focus on Functionality
At the beginning of this course, prioritize understanding what things **do**, not necessarily what they **are**. You will encounter some "black boxes"—concepts or tools that we use without fully explaining them upfront. Later, we will get into the lower-level details.

#### Learning by Doing and Explaining
Research indicates that people learn best by:
1. **Doing**: Engaging in coding and building projects.
2. **Explaining**: Articulating what they've learned, either by writing or helping others.

I will guide you through building projects and encourage you to explain these concepts to your peers.

#### Learning as a Team Sport
Studies show that teamwork significantly enhances learning. Therefore, I encourage you to:
- Ask questions.
- Answer questions from fellow students.
- Collaborate on building projects.

If you receive a request for help, consider it an opportunity to solidify your understanding by teaching others.

### Course Materials
All course materials can be accessed at the course's GitHub repository.


### Final Grade
Your final grade will be based on a project that ideally employs data from a real-life project you are working on. The project will be evaluated on the following criteria:

- **Technical Quality**: The effectiveness and accuracy of your implementation.
- **Creativity**: The innovation and uniqueness of your approach.
- **Usefulness**: The practical application and benefit of your project.
- **Presentation**: The clarity and professionalism of your project presentation.

#### FAQ
> - The project must be completed individually.
> - Submit a link to a shared folder containing your code, data, and report. Use virtual environments and a `requirements.txt` file to assist running your code.
> - The project is due 15 days after the course ends.
> - Submit your project through SIGAA.

## Summary

### Keypoints

- Data-Centric AI is a paradigm shift in AI development, focusing on the critical role of data quality and quantity rather than solely on model architecture improvements.

- The three key principles of Data-Centric AI are:
1. It centers around data as the primary driver of AI success.
2. It needs to be programmatic to overcome the limitations of manual data labeling.
3. It requires the inclusion of subject matter expertise in the development process.

- Weak supervision is a powerful technique for addressing the challenge of limited labeled data, leveraging domain knowledge and heuristics to create large-scale labeled datasets efficiently.

- There are three primary types of weak supervision:
1. Incomplete supervision: Only a subset of the training data is labeled.
2. Inexact supervision: Labels are less precise or fine-grained than ideal.
3. Inaccurate supervision: Labels contain errors or noise.

- Weak supervision can achieve comparable or even superior performance to traditional supervised learning in various tasks, often requiring about twice as much weakly labeled data to match the performance of manually annotated datasets.

- Aggregating multiple, potentially noisy labeling sources is a critical aspect of weak supervision, with techniques ranging from simple majority voting to advanced methods like graph neural networks and matrix completion.

- Even widely used benchmark datasets contain significant labeling errors, underscoring the need for techniques to automatically detect and correct annotation errors in both human-annotated and weakly supervised datasets.

### Takeaways

- The focus in AI development is shifting from model architecture to data quality and management, emphasizing the importance of data curation, augmentation, and governance in achieving high-performance AI systems.

- Weak supervision offers a scalable and cost-effective alternative to traditional manual data labeling, particularly valuable in domains where expert knowledge is required or when dealing with large-scale datasets.

- The effectiveness of weak supervision challenges the notion that only carefully hand-labeled datasets can produce high-quality machine learning models, opening up new possibilities for AI development in data-scarce or expertise-intensive domains.

- Integrating subject matter experts into the AI development process is crucial for creating effective labeling functions and ensuring the relevance and accuracy of the resulting models.

- The asymptotic scaling behavior of weak supervision methods matches that of traditional supervised learning, suggesting that effective use of unlabeled data can yield strong predictive performance similar to fully supervised methods.

- As AI systems become more prevalent in critical applications, understanding and mitigating the impact of label noise and errors is increasingly important for ensuring the reliability and trustworthiness of these systems.

# Data-Centric AI: Shifting Focus from Models to Data

## The Paradigm Shift in AI Development

In recent years, the field of Artificial Intelligence has undergone a significant transformation. While much attention has been focused on developing sophisticated models and architectures, a new paradigm is emerging: Data-Centric AI. This approach recognizes that **data, not just algorithms, is the key driver of AI success**.



<div style="text-align: center;">
<img src="images/model_vs_datacentric.jpeg" width="70%" height="70%">
</div>

<br><br>
Consider the following two perspectives, assuming that $\text{AI} = \text{Code} + \text{Data}$:

- **Conventional model-centric approach**
- Historically, AI development has been model-centric, with a primary focus on designing and optimizing algorithms.
- Researchers have dedicated substantial effort to developing complex neural network architectures and training procedures.

- **Emerging data-centric approach**
- The data-centric approach emphasizes the critical role of data in AI development.
- Recent advances in AI are often attributed to the availability of large, high-quality datasets rather than architectural innovations.

Consider the evolution of language models, particularly the GPT series developed by OpenAI:

<div style="text-align: center;">
<img src="images/gpt_evolution.png" width="70%" height="70%">
</div>


The [figure](http://arxiv.org/abs/2303.10158) above illustrates the evolution of the GPT series, from GPT-1 to GPT-3. Notably, the primary driver of performance improvements is the **scale of the model and the training data**. While architectural enhancements have been made, the key factor behind the models' success is the vast amount of data used to train them.

- On the **left**, large and high-quality training data are the driving force of recent successes of GPT models, while model architectures remain similar, except for more model weights.
- On the **right**, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.



## Key Principles of Data-Centric AI

### Principle 1: It Centers Around Data
1. **Data as the Primary Driver**

- The adage "Garbage In, Garbage Out" (GIGO) is particularly relevant in AI, emphasizing the critical role of data quality.
- Recent AI breakthroughs are largely attributed to the quality and quantity of data, rather than model architecture improvements.

2. **The Power of Large Language Models**
- With the advent of powerful models like GPT-3, the focus has shifted from model design to prompt engineering.
- These models demonstrate that with sufficient scale and data, many complex tasks can be solved through clever prompting rather than architectural changes.


3. **The "Solved" Problem of Deep Learning Architectures**
- Many researchers argue that the fundamental challenges in designing deep learning architectures have been largely addressed.
- See the image below: over the last few years, [performance on IMDB sentiment analysis](https://paperswithcode.com/sota/sentiment-analysis-on-imdb) has improved only modestly. After 5 years and millions of dollars invested, accuracy rose from 96.0% to 96.68%.

<div style="text-align: center;">
<img src="images/imdb_sentiment_analysis.png" width="70%" height="70%">
</div>

- The emphasis now is on applying and fine-tuning existing architectures rather than creating entirely new ones.
- The "problem" of designing deep learning architectures is largely "solved". Today, you can load a pre-trained model with a few lines of code and get state-of-the-art results. The real challenge is getting the right data to train your model.

> <video width="720" controls>
> <source src="images/febraban1.mp4" type="video/mp4">
> </video>
>


Maybe a little too much? Let's see.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load a pre-trained Portuguese BERT checkpoint from the Hugging Face Hub.
checkpoint = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
```

Maybe he was right. 😅


4. **Data Hunger of Deep Learning Models**
- Modern deep learning models require vast amounts of data to achieve high performance.
- This data dependency has led to a shift in focus towards data acquisition, curation, and management.

### Principle 2: It Needs to be Programmatic

#### The Challenge of Manual Data Labeling

In the current landscape of AI development, a significant bottleneck exists in the form of manual data labeling. This process often requires:

- Large teams of human labelers
- Extensive time and resources
- Specialized knowledge for domain-specific tasks

While this approach has been necessary, it presents several critical limitations:

1. **Scalability Issues**: As datasets grow larger and more complex, manual labeling becomes increasingly impractical.
2. **Cost Inefficiency**: Hiring and maintaining large teams of labelers is expensive and often unsustainable for many organizations.
3. **Time Constraints**: Manual labeling is time-consuming, slowing down the development and iteration of AI models.

#### The Impracticality for Real-World Applications

For many real-world AI applications, especially those dealing with sensitive or specialized data, manual labeling faces additional challenges:

- **Data Privacy Concerns**: In fields like healthcare or law, data often contains sensitive personal information that cannot be easily shared with labelers.
- **Expertise Requirements**: Certain domains require highly specialized knowledge for accurate labeling:
- Legal cases need judges or lawyers
- Medical data requires physicians or specialists
- Technical fields may require subject matter experts

> **Example**: Imagine developing an AI for rare disease diagnosis. Each case would need to be labeled by a specialist in that particular rare condition, making the process extremely time-consuming and expensive. This data may also be sensitive and require strict privacy measures, further complicating the labeling process.

#### The Problem of Evolving Information

One of the most significant issues with manual labeling is its static nature in a dynamic world:

- **Rapid Information Changes**: New laws, medical discoveries, or technological advancements can quickly render existing labels obsolete.
- **Continuous Updates Needed**: Adding new classes or modifying existing ones requires re-labeling large portions of the dataset.

> **Analogy**: Think of manual labeling like printing a paper encyclopedia. By the time it's complete, some entries are already outdated, and adding new information requires reprinting the entire set.

#### The Call for Programmatic Solutions

To address these challenges, the field is moving towards more programmatic approaches to AI development:

1. **Automated Labeling Techniques**: Developing algorithms that can label data with minimal human intervention.
2. **Transfer Learning**: Employing knowledge from pre-trained models to reduce the need for extensive labeled datasets.
3. **Active Learning**: Implementing systems that intelligently select the most informative samples for labeling, reducing overall labeling requirements.
4. **Synthetic Data Generation**: Creating artificial datasets that mimic real-world data characteristics.
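
To make the active learning item above more concrete, here is a minimal sketch of uncertainty sampling, one common way to pick "the most informative samples". It assumes a model has already produced class probabilities for an unlabeled pool; the function names and the probability values are hypothetical and exist only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool_probs, budget):
    """Return the indices of the `budget` most uncertain pool items.

    `pool_probs` maps an item index to the class probabilities predicted
    by some already-trained model (hypothetical values in this sketch).
    """
    ranked = sorted(pool_probs, key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled documents.
pool_probs = {
    0: [0.98, 0.02],  # confident prediction
    1: [0.55, 0.45],  # very uncertain
    2: [0.80, 0.20],
    3: [0.50, 0.50],  # maximally uncertain
}

to_label = select_for_labeling(pool_probs, budget=2)
print(to_label)  # the two items the model is least sure about
```

Only the selected items would be sent to a human annotator; the rest of the pool stays unlabeled until the model is retrained and the ranking is recomputed.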

These programmatic approaches offer several advantages:

- **Scalability**: Can handle large and growing datasets more efficiently.
- **Adaptability**: Easier to update and modify as new information becomes available.
- **Cost-Effectiveness**: Reduces the need for large teams of human labelers.
- **Speed**: Allows for faster development and iteration of AI models.

To summarize:


| Labeling approach | Speed | Cost | Adaptability |
|---------------|-------|-----------|--------------|
| Manual Labels | Slow | Expensive | Static |
| Programmatic Labels | Fast | Cheap | Dynamic |

### Principle 3: It Needs to Include Subject Matter Expertise in the Loop

#### The Role of Subject Matter Experts (SMEs)

In real-world machine learning projects, the involvement of Subject Matter Experts (SMEs) is not just beneficial—it's essential. Their deep domain knowledge provides critical context and insights that can significantly enhance the quality and relevance of your ML solutions.

- **Project Reflection**: Consider your current or recent ML projects. How much time did you spend communicating with SMEs? This interaction is often more extensive than many anticipate, highlighting the importance of effective collaboration.

#### SMEs as Team Members

It's important to shift your perspective on SMEs:

- **Not outsiders, but central team members**: SMEs should be viewed as indispensable parts of your ML project team.
- **Develop communication skills**: Learning to effectively communicate with SMEs is as important as your technical skills. This involves:
- Translating technical concepts into domain-specific language
- Actively listening to their insights and concerns
- Asking targeted questions to extract relevant information

#### Navigating Domain Complexity

As an ML practitioner, you'll encounter diverse industries and domains:

- **Acknowledge limitations**: It's impossible to be an expert in every field your projects touch.
- **Embrace continuous learning**: Don't let the vastness of domain knowledge discourage you. Each project is an opportunity to expand your understanding.
- **Knowledge encoding**: Your primary task is to effectively translate SME knowledge into your ML models and processes.

> **Key Point**: The goal is not to become an expert in every domain, but to develop the skills to extract, understand, and incorporate domain expertise into your ML solutions.

#### Strategies for Effective SME Integration

1. **Regular check-ins**: Schedule frequent meetings with SMEs throughout the project lifecycle.
2. **Collaborative problem definition**: Involve SMEs in defining project objectives and success criteria.
3. **Data interpretation**: Use SME insights to better understand data nuances and anomalies.
4. **Feature engineering**: Work with SMEs to identify and create meaningful features that capture domain-specific nuances.
5. **Model evaluation**: Incorporate SME feedback in assessing model performance beyond just metrics.
6. **Iterative refinement**: Use SME input to continuously improve your models and approaches.

#### Overcoming Challenges

- **Bridging knowledge gaps**: Develop strategies to effectively communicate across disciplinary boundaries.
- **Managing expectations**: Clearly articulate the capabilities and limitations of ML to SMEs.
- **Balancing perspectives**: Learn to reconcile potentially conflicting viewpoints between technical feasibility and domain-specific requirements.


## The Data-Centric Approach

> "Data is both the key bottleneck and interface to developing AI today"

This quote encapsulates the essence of Data-Centric AI. It suggests that:

- **Data as a Bottleneck**: The limitation in AI development is often not computational power or model sophistication, but the availability of high-quality, relevant data.
- **Data as an Interface**: Data serves as the primary means through which we interact with and shape AI systems.

## Implications and Strategies

1. **Focus on Data Quality**
- Prioritize collecting, cleaning, and curating high-quality datasets.
- Implement robust data validation and verification processes.

2. **Data Augmentation and Synthesis**
- Develop techniques to artificially expand datasets while maintaining relevance and quality.
- Explore methods for generating synthetic data to address scarcity in certain domains.

3. **Efficient Data Labeling**
- Investigate techniques like weak supervision and semi-supervised learning to reduce the need for extensive manual labeling.
- Develop smart labeling tools that capitalize on existing models to assist in the labeling process.

4. **Data Governance and Ethics**
- Establish clear protocols for data collection, usage, and storage.
- Address ethical concerns related to data bias, privacy, and representation.

5. **Continuous Data Improvement**
- Implement systems for ongoing data refinement and updates.
- Develop metrics to assess and improve data quality over time.


## Challenges and Considerations

- **Balancing Quantity and Quality**: While more data is generally beneficial, it's crucial to maintain data quality and relevance.
- **Domain Specificity**: The effectiveness of Data-Centric AI can vary across different domains and applications.
- **Computational Resources**: Managing and processing large datasets requires significant computational resources.
- **Interdisciplinary Approach**: Effective Data-Centric AI often requires collaboration between domain experts, data scientists, and AI researchers.

> By embracing a Data-Centric approach, AI practitioners can potentially achieve significant improvements in model performance and applicability, often surpassing gains from model architecture tweaks alone. This shift represents a fundamental change in how we approach AI development, emphasizing the critical role of data in shaping the future of artificial intelligence.

# I don't have enough data, what can I do?

<div style="text-align: center;">
<img src="images/insufficient_labeled_training_data.png" width="70%" height="70%">
</div>

## From Supervised to Weakly Supervised Learning

### Understanding Fully Supervised Learning

Fully supervised learning (or just traditional supervision) is a fundamental concept in machine learning where we aim to learn a function that maps input features to output labels using a labeled dataset. This approach forms the basis for many practical applications in data science and machine learning.

#### Key Components of Fully Supervised Learning

1. **Function Mapping**: The key goal is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$
- $\mathcal{X}$ represents the input space (features)
- $\mathcal{Y}$ represents the output space (labels)
- $f$ is the function that establishes the relationship between $\mathcal{X}$ and $\mathcal{Y}$

2. **Dataset Structure**: We work with a dataset $\mathcal{D} = \{(x_1, y_1), ..., (x_m, y_m)\}$
- Each pair $(x_i, y_i)$ represents an input-output example
- $x_i \in \mathcal{X}$: An instance from the input space
- $y_i \in \mathcal{Y}$: The corresponding label from the output space

3. **Output Space Variations**: The nature of $\mathcal{Y}$ defines the type of supervised learning task:
- Binary Classification: $\mathcal{Y} = \{\text{Yes}, \text{No}\}$
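- Multiclass Classification: $\mathcal{Y} = \{\text{Class}_1, \text{Class}_2, ..., \text{Class}_n\}$ (exactly one class per instance)
- Multilabel Classification: $\mathcal{Y} = \{0, 1\}^n$ (any subset of the $n$ classes per instance)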
- Regression: $\mathcal{Y} = \mathbb{R}$ (real numbers)

#### The Learning Process

1. **Data Collection**: Gather a representative dataset with input-output pairs.
2. **Feature Selection**: Identify relevant attributes of the input space.
3. **Model Selection**: Choose an appropriate algorithm or model architecture.
4. **Training**: Use the labeled data to optimize the model's parameters.
5. **Validation**: Assess the model's performance on unseen data.
6. **Fine-tuning**: Adjust hyperparameters or model structure as needed.

#### Objective Function

In supervised learning, we typically aim to minimize a loss function $\mathcal{L}$:

$$\min_{\theta} \frac{1}{m} \sum_{i=1}^m \mathcal{L}(f_{\theta}(x_i), y_i)$$

Where:
- $\theta$ represents the model parameters
- $f_{\theta}(x_i)$ is the model's prediction for input $x_i$
- $\mathcal{L}(f_{\theta}(x_i), y_i)$ measures the discrepancy between the prediction and true label

> You could read that as "Find the model settings that, on average, make our predictions as close as possible to the correct answers for all our training examples."
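
The objective above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the model is a one-dimensional linear function $f_\theta(x) = wx + b$, the loss is the squared error, and the toy dataset, learning rate, and function names are made up for this example.

```python
def squared_loss(prediction, target):
    """Loss for a single example."""
    return (prediction - target) ** 2


def empirical_risk(theta, data):
    """Average loss of the linear model f_theta(x) = w * x + b over the dataset."""
    w, b = theta
    return sum(squared_loss(w * x + b, y) for x, y in data) / len(data)


def gradient_step(theta, data, lr):
    """One step of gradient descent on the empirical risk."""
    w, b = theta
    m = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / m
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / m
    return (w - lr * grad_w, b - lr * grad_b)


# Toy dataset generated from y = 2x + 1 (made up for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

theta = (0.0, 0.0)
initial_risk = empirical_risk(theta, data)
for _ in range(2000):
    theta = gradient_step(theta, data, lr=0.05)
final_risk = empirical_risk(theta, data)

print(f"risk before: {initial_risk:.3f}, after: {final_risk:.6f}")
print(f"learned w = {theta[0]:.3f}, b = {theta[1]:.3f}")
```

Each call to `gradient_step` moves $\theta$ in the direction that lowers the average loss, which is exactly the $\min_\theta$ in the formula.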

#### Advantages and Challenges

**Advantages:**
- High accuracy when sufficient labeled data is available
- Clear evaluation metrics based on predicted vs. actual labels
- Well-established theoretical foundations and practical techniques

**Challenges:**
- Requires large amounts of labeled data, which can be expensive and time-consuming to obtain
- May struggle with generalization if training data is not representative
- Can be prone to overfitting, especially with complex models and limited data

> **Note**: While fully supervised learning is powerful, it's important to recognize its limitations, particularly the need for large labeled datasets. This realization has led to the development of weakly supervised and unsupervised learning techniques, which we'll explore in subsequent sections.

## Types of Weak Supervision

> *“Weakly supervised learning is an umbrella term covering a variety of studies that attempt to construct predictive models by learning with weak supervision.”*

Weak supervision in machine learning comes in several forms, each presenting unique challenges and methodologies. The three primary types are **incomplete supervision**, **inexact supervision**, and **inaccurate supervision**.


### Incomplete Supervision

<img src="images/incomplete_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

Incomplete supervision occurs when only a subset of the training data is labeled. This is a common scenario in real-world applications where labeling data can be expensive, time-consuming, or impractical.

- **Characteristics**:
- Only a subset of data points have labels.
- A large portion of the data remains unlabeled.
- Formally, we have $l$ labeled data points and $u$ unlabeled data points, with $l \ll u$, totaling $m = l + u$ data points.

$$ \mathcal{D} = \{(x_1, y_1), ..., (x_l, y_l), (x_{l+1}, \emptyset), ..., (x_m, \emptyset)\} $$

- **Example**: In a dataset of medical images, only a small fraction might be annotated by experts due to the high cost and time required for labeling.
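
As a rough sketch of how such a dataset might look in code, the snippet below stores missing labels as `None` and then applies a very simple pseudo-labeling step: unlabeled points that lie close to a class centroid inherit that class. The one-dimensional features, the distance threshold, and all function names are assumptions made for this illustration (and the toy pool is far too small to satisfy $l \ll u$); real semi-supervised methods are considerably more careful.

```python
# Dataset in the form above: labeled pairs first, then pairs whose label is missing.
# Features are 1-D numbers and all values are made up for illustration.
dataset = [
    (1.0, "low"), (1.2, "low"), (8.9, "high"), (9.3, "high"),  # l = 4 labeled
    (0.8, None), (1.5, None), (9.0, None), (8.4, None), (5.2, None), (1.1, None),  # u = 6
]

labeled = [(x, y) for x, y in dataset if y is not None]
unlabeled = [x for x, y in dataset if y is None]


def class_centroids(pairs):
    """Mean feature value per class, computed from the labeled pairs only."""
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def pseudo_label(x, centroids, max_distance):
    """Nearest-centroid label, or None when no centroid is close enough."""
    label = min(centroids, key=lambda y: abs(x - centroids[y]))
    return label if abs(x - centroids[label]) <= max_distance else None


centroids = class_centroids(labeled)
pseudo = [(x, pseudo_label(x, centroids, max_distance=1.0)) for x in unlabeled]
newly_labeled = [(x, y) for x, y in pseudo if y is not None]

print(centroids)
print(newly_labeled)  # the ambiguous point 5.2 stays unlabeled
```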



### Inexact Supervision

<img src="images/inexact_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

Inexact supervision refers to situations where the labels provided are less precise or fine-grained than ideal. This often involves higher-level or aggregate information rather than specific, instance-level labels.

- **Characteristics**:
- Labels are present but lack precision or detail.
- Often involves group or set-level annotations rather than individual instance labels.
- May provide partial information about the target variable.

- **Common Forms**:
1. **Multiple Instance Learning (MIL)**:
- Bags of instances are labeled, not individual instances.
- Formally, for a bag $B_i = \{x_{i1}, x_{i2}, ..., x_{in}\}$, we have:
$$ y_i = \begin{cases}
1 & \text{if } \exists x_{ij} \in B_i : y_{ij} = 1 \\
0 & \text{otherwise}
\end{cases} $$

2. **Label Proportions**:
- Only the proportion of positive instances in a group is known.
- For a group $G_k$ with $n_k$ instances, we have:
$$ p_k = \frac{1}{n_k} \sum_{x_i \in G_k} y_i $$
where $p_k$ is the known proportion of positive instances in group $k$.

3. **Coarse-Grained Labels**:
- Labels are provided at a higher level of granularity than desired.
- If $Y$ is the set of fine-grained labels and $Z$ is the set of coarse-grained labels:
$$ z = f(y), \text{ where } y \in Y, z \in Z, \text{ and } |Z| < |Y| $$


- **Examples**:
- **Image Classification**: An image is labeled as containing a cat, but the exact location of the cat is unknown.
- **Text Analysis**: A document is labeled as positive or negative, but the specific sentences contributing to this sentiment are not indicated.
- **Epidemiology**: Data on disease prevalence in a population without individual diagnoses.
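
The three forms of inexact supervision above can be illustrated with a short sketch. The instance-level labels shown here are the ones we would *not* observe in practice; only the derived bag labels, proportions, and coarse labels would be available for training. The bag contents, the fine-to-coarse mapping, and the function names are hypothetical.

```python
def bag_label(instance_labels):
    """MIL rule: a bag is positive if at least one instance is positive."""
    return 1 if any(y == 1 for y in instance_labels) else 0


def label_proportion(instance_labels):
    """Fraction of positive instances in a group (the p_k of the formula)."""
    return sum(instance_labels) / len(instance_labels)


# Hypothetical fine-to-coarse mapping z = f(y), with fewer coarse than fine labels.
FINE_TO_COARSE = {"siamese": "cat", "persian": "cat", "beagle": "dog", "poodle": "dog"}


def coarse_label(fine_label):
    """Map a fine-grained label to its coarse-grained parent."""
    return FINE_TO_COARSE[fine_label]


# Hidden instance-level labels (what we wish we had) ...
bags = {"bag_a": [0, 0, 1, 0], "bag_b": [0, 0, 0, 0]}

# ... and the weaker signals that are actually observed.
observed_bag_labels = {name: bag_label(ys) for name, ys in bags.items()}
observed_proportions = {name: label_proportion(ys) for name, ys in bags.items()}
observed_coarse = [coarse_label(y) for y in ["siamese", "beagle", "persian"]]

print(observed_bag_labels)
print(observed_proportions)
print(observed_coarse)
```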


> **Note**: Inexact supervision is often used as part of a more complex pipeline rather than in isolation.


### Inaccurate Supervision

<img src="images/inaccurate_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

Inaccurate supervision occurs when the labels provided in the training data contain errors or noise. This means some of the "ground truth" labels used to train the model are incorrect.

Formally, it is identical to the formulation of fully supervised learning, but you can't trust the $y_i$ labels.
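We can model the label noise with a transition matrix $T$ whose entry $T_{ij} = P(\tilde{y} = i \mid y = j)$ is the probability that true class $j$ is observed as class $i$. The distribution of the observed (noisy) label $\tilde{y}$ is then $T$ applied to the distribution of the true label $y$: $p(\tilde{y}) = T\,p(y)$.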

- **Sources of Inaccuracy**:
- Human error in manual labeling.
- Faulty automated labeling processes.
- Inherent ambiguity in the task.
- Deliberate noise injection.

- **Types of Label Noise**:
1. **Random Noise**:
- Labels are randomly flipped or misassigned.
- For binary classification, the noise transition matrix can be represented as:
$$ T = \begin{bmatrix}
1-\rho_1 & \rho_2 \\
\rho_1 & 1-\rho_2
\end{bmatrix} $$
where $\rho_1$ and $\rho_2$ are the probabilities of flipping labels from 0 to 1 and 1 to 0, respectively.

2. **Systematic Noise**:
- Errors follow a pattern, often due to consistent misunderstandings or biases.
- Can be modeled as a function of the true label:
$$ P(\tilde{y} | y) = g(y) $$
where $\tilde{y}$ is the observed (noisy) label and $y$ is the true label.

3. **Instance-Dependent Noise**:
- The likelihood of a label being incorrect depends on the features of the instance.
- Can be modeled as:
$$ P(\tilde{y} | y, x) = h(y, x) $$
where $x$ represents the features of the instance.

- **Challenges**:
- Risk of the model learning and amplifying errors in the training data.
- Difficulty in distinguishing between true patterns and noise.
- Potential for overfitting to noisy labels.
- Reduced model performance and generalization ability.

- **Importance in Real-World Applications**:
- Most real-world datasets contain some degree of label noise.
- Understanding and addressing inaccurate supervision is crucial for developing reliable machine learning models.
- Particularly important in fields like medical diagnosis or autonomous driving where errors can have serious consequences.

- **Research Directions**:
- Developing better methods to estimate the noise transition matrix $T$.
- Creating more robust loss functions that account for label noise, e.g., by incorporating the estimated noise rates: $\mathcal{L}_{\text{robust}}(\theta) = \mathbb{E}_{(x,\tilde{y})}[\ell(f_\theta(x), \tilde{y}) \mid T]$
- Exploring theoretical bounds on learning under different noise models.
- Confident learning: learning with noisy labels by leveraging the confidence of the model on the training data.
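
To see the binary transition matrix in action, the sketch below builds $T$ from two made-up flip rates, computes the distribution of observed labels, simulates a noisy dataset, and recovers an estimate of $T$ by counting. It reads $T$ column by column (column $j$ holds the distribution of observed labels for true class $j$), and it assumes access to true labels for the estimation step, which in practice would only be available for a small trusted subset. All names and values are hypothetical.

```python
import random

# Column-stochastic transition matrix: T[i][j] = P(observed = i | true = j).
# rho_1 = P(0 -> 1) and rho_2 = P(1 -> 0); the values are made up.
rho_1, rho_2 = 0.1, 0.3
T = [
    [1 - rho_1, rho_2],
    [rho_1, 1 - rho_2],
]


def noisy_distribution(T, true_dist):
    """Distribution of observed labels: T applied to the true-label distribution."""
    n = len(true_dist)
    return [sum(T[i][j] * true_dist[j] for j in range(n)) for i in range(len(T))]


def corrupt(y, T, rng):
    """Sample an observed binary label for true label y using column y of T."""
    return 1 if rng.random() < T[1][y] else 0


def estimate_T(true_labels, noisy_labels, n_classes=2):
    """Estimate T from paired labels by counting (needs some trusted labels)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for y, y_tilde in zip(true_labels, noisy_labels):
        counts[y_tilde][y] += 1
    col_totals = [sum(counts[i][j] for i in range(n_classes)) for j in range(n_classes)]
    return [[counts[i][j] / col_totals[j] for j in range(n_classes)] for i in range(n_classes)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_labels = [rng.randint(0, 1) for _ in range(20_000)]
noisy_labels = [corrupt(y, T, rng) for y in true_labels]
T_hat = estimate_T(true_labels, noisy_labels)

print(noisy_distribution(T, [0.5, 0.5]))  # observed class balance for balanced true classes
print([[round(v, 3) for v in row] for row in T_hat])
```

Because $\rho_1 \neq \rho_2$, the noise is asymmetric and shifts the observed class balance away from the true one, which is one reason naive training on noisy labels can bias a model.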


> **Focus**: While we'll focus primarily on incomplete and inaccurate supervision, the concepts are similar for inexact supervision.

## Overcoming the Unavailability of Labels: Weak Supervision

Weak supervision is a powerful technique for addressing the challenge of limited labeled data in machine learning. This approach leverages domain knowledge and heuristics to create large-scale labeled datasets efficiently.

### Key Concepts of Weak Supervision

1. **Programmatic Labeling**:
- Labels are generated through heuristics that encapsulate domain expertise.
- These heuristics act as labeling functions, automatically assigning labels to data points.

2. **Weak Labels to Strong Classifiers**:
- Classifiers trained on weak labels can generalize beyond the specific patterns in the heuristics.
- This generalization ability is crucial for capturing complex relationships in the data.

3. **Quantity vs. Quality Trade-off**:
- As a rule of thumb, roughly twice as much weakly labeled data is needed to match the performance of a manually annotated dataset; the exact ratio depends on the task and on the quality of the heuristics.
- However, the cost-efficiency of weak labeling often outweighs this requirement.

4. **Scalability**:
- A significant advantage of weak supervision is its scalability.
- The cost difference between generating 1 or 1 million labels is minimal, making it highly efficient for large datasets.
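
A labeling function is just a small piece of code that votes for a class or abstains. The sketch below shows what a few of them might look like for a hypothetical "is this customer message a complaint?" task. It is plain Python rather than the API of any particular library; the heuristics, the example texts, and the use of `-1` for abstaining are assumptions chosen for illustration.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_refund(text):
    """Heuristic: complaints often mention refunds."""
    return POSITIVE if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    """Heuristic: messages of thanks are rarely complaints."""
    return NEGATIVE if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    """Heuristic: repeated exclamation marks suggest an upset customer."""
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def apply_labeling_functions(texts, lfs=LABELING_FUNCTIONS):
    """Return a label matrix with one row per text and one column per function."""
    return [[lf(text) for lf in lfs] for text in texts]


texts = [
    "I want a refund, this is unacceptable!!",
    "Thank you for the quick delivery.",
    "Where is my order?",
]
label_matrix = apply_labeling_functions(texts)
for text, votes in zip(texts, label_matrix):
    print(votes, text)
```

Running the same functions over one document or one million only changes the loop length, which is why the marginal cost of an extra label is so low. The third text receives no vote at all, a reminder that heuristics rarely cover the whole dataset.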

### Effectiveness of Weak Supervision

Empirical evidence supports the efficacy of weak supervision:

<div style="text-align: center;">
<img src="images/ws1.png" width="70%" height="70%">
</div>

<div style="text-align: center;">
<img src="images/ws2.png" width="70%" height="70%">
</div>

<div style="text-align: center;">
<img src="images/ws3.png" width="70%" height="70%">
</div>

These images demonstrate that weak supervision can achieve comparable or even superior performance to traditional supervised learning in various tasks.

### Aggregating Noisy Labels

A critical aspect of weak supervision is combining multiple, potentially noisy labeling sources:

1. **Simple Methods**:
- Majority voting is a straightforward approach but may not capture complex relationships between labeling functions.

2. **Advanced Techniques**:
- Graph neural networks and matrix completion methods offer more sophisticated aggregation.
- These techniques can model elaborate dependencies and correlations among labeling functions.

3. **Unsupervised Accuracy Estimation**:
- It's possible to estimate the accuracy and correlations of heuristics without ground truth labels.
- This capability is crucial for assessing the reliability of weak labels.

4. **Practical Implementation**:
- Typically, a small set (400-600) of manually labeled observations is used to calibrate and validate the weak labeling process.
- This calibration set helps in refining heuristics and assessing their performance.

5. **Probabilistic Labeling**:
- The aggregation process results in a single probabilistic label for each observation.
- This probabilistic approach captures the uncertainty inherent in weak labeling.
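
To make the contrast between majority voting and probabilistic labeling concrete, here is a minimal NumPy sketch. It is not the label model from any particular library. It assumes that the labeling functions are conditionally independent, that their errors are spread evenly over the wrong classes, that the class prior is uniform, and that each function's accuracy is already known (in practice it would be estimated). The names `majority_vote` and `weighted_soft_labels` are hypothetical.

```python
import numpy as np

ABSTAIN = -1  # assumed convention: -1 means the labeling function did not vote


def majority_vote(label_matrix, n_classes=2):
    """Hard label per row; ABSTAIN when there are no votes or the top count is tied."""
    L = np.asarray(label_matrix)
    counts = np.stack([(L == c).sum(axis=1) for c in range(n_classes)], axis=1)
    top = counts.max(axis=1)
    is_tie = (counts == top[:, None]).sum(axis=1) > 1
    labels = counts.argmax(axis=1)
    labels[is_tie] = ABSTAIN
    return labels


def weighted_soft_labels(label_matrix, accuracies, n_classes=2):
    """Probabilistic label per row under the independence assumptions stated above.

    `accuracies` must lie strictly between 0 and 1.
    """
    L = np.asarray(label_matrix)
    acc = np.asarray(accuracies, dtype=float)
    log_scores = np.zeros((L.shape[0], n_classes))
    for c in range(n_classes):
        agree = L == c
        disagree = (L != c) & (L != ABSTAIN)
        # A vote for c contributes log(acc); a vote against c contributes
        # the log of the error mass assigned to each wrong class.
        log_scores[:, c] = (
            agree * np.log(acc) + disagree * np.log((1 - acc) / (n_classes - 1))
        ).sum(axis=1)
    log_scores -= log_scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(log_scores)
    return probs / probs.sum(axis=1, keepdims=True)


# Rows are observations, columns are three hypothetical labeling functions.
label_matrix = [
    [1, 1, 0],
    [0, ABSTAIN, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, ABSTAIN],
]
hard_labels = majority_vote(label_matrix)
soft_labels = weighted_soft_labels(label_matrix, accuracies=[0.9, 0.6, 0.6])
print(hard_labels)
print(soft_labels.round(3))
```

In the last row the two votes conflict, so majority voting abstains, while the weighted version leans toward the more accurate labeling function and still expresses its uncertainty as a probability.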

### Reliability of Weakly Labeled Datasets

An important question arises: Can we trust weakly labeled datasets? This inquiry leads to a more fundamental question about the reliability of human-annotated datasets.

> **Key Insight**: Even widely used benchmark datasets contain significant labeling errors.

Estimated test-set label error rates in popular datasets ([Northcutt et al., 2021](http://arxiv.org/abs/2103.14749)):
- CIFAR-100: 5.85%
- ImageNet: 5.83%
- Google QuickDraw: 10.12%
- IMDB Reviews: 2.9%
- Amazon Reviews: 3.9%

These findings underscore the need for techniques to automatically detect and correct annotation errors, regardless of whether the labels come from human annotators or weak supervision pipelines.
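
One way such detection can work is to compare each given label against out-of-sample predicted probabilities from a classifier. The sketch below is a simplified illustration loosely inspired by the per-class thresholding idea used in confident learning; it is not the implementation of any library. The function name `flag_likely_label_errors` is hypothetical, and the code assumes every class appears at least once among the given labels.

```python
import numpy as np


def flag_likely_label_errors(pred_probs, given_labels):
    """Return a boolean mask marking examples whose given label looks suspicious.

    pred_probs: (n_examples, n_classes) out-of-sample predicted probabilities.
    given_labels: (n_examples,) integer labels, possibly noisy.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    given_labels = np.asarray(given_labels)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: the average confidence in class c
    # among the examples that were labeled c.
    thresholds = np.array(
        [pred_probs[given_labels == c, c].mean() for c in range(n_classes)]
    )

    # An example "confidently belongs" to class c if its probability
    # for c reaches that class's threshold.
    confident = pred_probs >= thresholds
    masked = np.where(confident, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_confident_class = confident.any(axis=1)

    # Flag examples that confidently belong to a class other than their label.
    return has_confident_class & (confident_class != given_labels)


# Toy probabilities and labels, made up for illustration.
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.95, 0.05],
]
given_labels = [0, 0, 1, 1, 1]
suspicious = flag_likely_label_errors(pred_probs, given_labels)
print(suspicious)
```

Only the last example is flagged: it is labeled 1, yet the model is more confident that it is class 0 than it is for the typical class-0 example. Flagged items are candidates for review, not proven errors.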

### Theoretical Foundation

The effectiveness of weak supervision and error detection is grounded in the non-random nature of label noise. As demonstrated by [Angluin and Laird (1988)](https://dl.acm.org/doi/10.1023/A%3A1022873112823):

$$ P(\tilde{y}_{\text{tree photo}} \mid y^*_{\text{flower photo}}) > P(\tilde{y}_{\text{tree photo}} \mid y^*_{\text{computer photo}}) $$

This inequality illustrates that mislabeling is more likely between similar classes (e.g., tree and flower) than dissimilar ones (e.g., tree and computer).
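
Class-conditional noise of this kind is often summarized as a noise transition matrix whose rows are true classes and whose columns are observed labels. The small sketch below uses made-up probabilities, chosen only to illustrate the inequality; they are not estimates from any dataset.

```python
import numpy as np

classes = ["tree", "flower", "computer"]

# Hypothetical noise transition matrix: entry [i, j] is the probability that
# an example whose true class is i ends up with observed label j.
noise = np.array(
    [
        [0.90, 0.09, 0.01],  # true tree
        [0.12, 0.87, 0.01],  # true flower
        [0.01, 0.01, 0.98],  # true computer
    ]
)
assert np.allclose(noise.sum(axis=1), 1.0)  # each row is a distribution

tree = classes.index("tree")
p_tree_given_flower = noise[classes.index("flower"), tree]
p_tree_given_computer = noise[classes.index("computer"), tree]

print(p_tree_given_flower > p_tree_given_computer)
```

Because the off-diagonal mass is concentrated on similar classes, the noise has structure that a model can learn and exploit, which is what makes error detection feasible.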

### Real-World Application

A practical example of weak supervision's effectiveness comes from a project I worked on for the Brazilian National Council of Justice and the United Nations Development Programme:

- **Scope**: Legal data classification for environmental crimes
- **Scale**: 135,668 data points labeled across 20 classification tasks
- **Efficiency**: Completed within days using a single legal expert to design labeling functions
- **Performance**: Our labels were correct within the 93-100% range, which is better than most benchmark datasets from the [Northcutt et al., 2021](http://arxiv.org/abs/2103.14749) study above.

This case study demonstrates the power of weak supervision in rapidly generating large-scale labeled datasets in specialized domains with minimal expert involvement.

### Asymptotic Scaling

- [Ratner et al.](http://arxiv.org/abs/1810.02840) show that, as the number of unlabeled data points $ n $ increases, the generalization error decreases at a rate proportional to $ n^{-\frac{1}{2}} $.
- This scaling behavior is significant because it matches the rate found in traditional supervised learning, where the generalization error also typically scales as $ n^{-\frac{1}{2}} $ with respect to the number of labeled data points.
- The key result is that even though you are primarily using unlabeled data (which has been labeled through the label model $ \hat{\mu} $), you achieve the same asymptotic error scaling as if you were using labeled data directly.
- This finding is valuable because it suggests that unlabeled data, when combined with a robust label model, can yield predictive performance comparable to that of traditional supervised learning.
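
As a worked illustration of what this rate means in practice, suppose the error bound behaves like $ C \, n^{-\frac{1}{2}} $, where $ C $ is a hypothetical problem-dependent constant introduced here only for the arithmetic. Quadrupling the amount of data then halves the bound:

$$ \frac{C \, (4n)^{-\frac{1}{2}}}{C \, n^{-\frac{1}{2}}} = 4^{-\frac{1}{2}} = \frac{1}{2} $$

Likewise, going from $ 10^4 $ to $ 10^6 $ data points shrinks the bound by a factor of $ \sqrt{100} = 10 $. The constant cancels in the ratio, so the comparison says nothing about the absolute error, only about how fast it falls as more data is added.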

# Questions

1. What are the three key principles of Data-Centric AI?

2. How does the Data-Centric approach differ from the conventional model-centric approach in AI development?

3. Why is programmatic labeling important in Data-Centric AI?

4. What role do Subject Matter Experts (SMEs) play in Data-Centric AI projects?

5. What are the three primary types of weak supervision?

6. How does incomplete supervision differ from inaccurate supervision?

7. What is the typical trade-off between quantity and quality in weak supervision?

8. How does the asymptotic scaling of weak supervision compare to traditional supervised learning?

9. What challenge does manual data labeling present in real-world AI applications?

10. How effective is weak supervision compared to traditional supervised learning?

`Answers are commented inside this cell.`

<!-- 1. The three key principles of Data-Centric AI are: 1) It centers around data as the primary driver of AI success, 2) It needs to be programmatic to overcome limitations of manual labeling, and 3) It requires the inclusion of subject matter expertise in the development process.

2. The Data-Centric approach focuses on data quality and quantity as the key drivers of AI success, whereas the conventional model-centric approach primarily focuses on designing and optimizing algorithms and model architectures.

3. Programmatic labeling is important because it allows for scalable, efficient, and adaptable data labeling, overcoming the limitations of manual labeling such as cost, time, and the ability to update labels as information evolves.

4. Subject Matter Experts (SMEs) are crucial in Data-Centric AI projects as they provide deep domain knowledge, help in defining project objectives, assist in data interpretation, and contribute to feature engineering and model evaluation.

5. The three primary types of weak supervision are incomplete supervision, inexact supervision, and inaccurate supervision.

6. Incomplete supervision occurs when only a subset of the training data is labeled, while inaccurate supervision refers to situations where the provided labels contain errors or noise.

7. In weak supervision, typically twice as much weakly labeled data is needed to match the performance of manually annotated datasets. However, the cost-efficiency of weak labeling often outweighs this requirement.

8. The asymptotic scaling of weak supervision matches that of traditional supervised learning, with generalization error decreasing at a rate proportional to n^(-1/2) as the number of unlabeled data points increases.

9. Manual data labeling presents challenges in scalability, cost efficiency, time constraints, and the need for specialized knowledge in domain-specific tasks. It's often impractical for large-scale or sensitive data applications.

10. Weak supervision can achieve comparable or even superior performance to traditional supervised learning in various tasks, as demonstrated by empirical evidence and real-world applications. -->