# Before we start:

## Course Logistics: A Few Things You Should Know

### Ask Questions
Please let me know how things are progressing for you throughout the course.

### Basics
**Office hours**: If you need to meet, please [email me](mailto:elias.jacob@ufrn.br).


### Requirements for This Course
To succeed in this course, you should have:

- Access to the dataset and the code provided.
- A foundational understanding of machine learning.
- Basic knowledge of natural language processing (NLP).
- Proficiency in Python.
- Familiarity with Jupyter Notebooks.
- Basic knowledge of statistics.
- An understanding of data science and machine learning concepts and stacks.

### Teaching Approach
#### Top-Down Method
I will be using a *top-down* teaching method, which contrasts with the traditional *bottom-up* approach. In a *bottom-up* approach, you typically learn all the individual components first and then gradually combine them into more complex structures. This approach often leads to students losing motivation, lacking a sense of the "big picture," and not knowing what they will need in practice.

According to Harvard Professor David Perkins in his book [Making Learning Whole](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719), effective learning can be likened to learning a sport like baseball. Children are not required to memorize all the rules and understand every technical detail before they start playing. Instead, they begin by playing with a general understanding and gradually learn more rules and details over time.

#### Focus on Functionality
At the beginning of this course, prioritize understanding what things **do**, not necessarily what they **are**. You will encounter some "black boxes", that is, concepts or tools that we use without fully explaining them upfront. Later, we will get into the lower-level details.

#### Learning by Doing and Explaining
Research indicates that people learn best by:
1. **Doing**: Engaging in coding and building projects.
2. **Explaining**: Articulating what they've learned, either by writing or helping others.

I will guide you through building projects and encourage you to explain these concepts to your peers.

#### Learning as a Team Sport
Studies show that teamwork significantly enhances learning. Therefore, I encourage you to:
- Ask questions.
- Answer questions from fellow students.
- Collaborate on building projects.

If you receive a request for help, consider it an opportunity to solidify your understanding by teaching others.

### Course Materials
All course materials can be accessed at the course's GitHub repository.


### Final Project

Your final project will be evaluated based on:

- **Technical Quality:** Robustness and efficacy of your implementation.
- **Creativity:** Originality of your approach.
- **Usefulness:** Practical applicability.
- **Presentation:** Effectiveness of your project showcase.
- **Documentation:** Clarity and thoroughness of your report.

### Project Guidelines

- **Individual Work:** The project must be completed individually.
- **Submission:** Provide a link to a GitHub repository or shared folder with your code, data, and report. Use virtual environments along with a `requirements.txt` file to ensure reproducibility.
- **Deadline:** Refer to the syllabus.
- **Presentation:** Prepare a 10-minute presentation to demonstrate your project.
- **Submission Platform:** Use the designated platform (e.g., SIGAA).


----

# Theoretical Introduction to Data-Centric AI and Weakly Supervised Learning
## IMD3011 - Datacentric AI
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)


## Key Points
- **Data-Centric Shift:**  
  Modern AI development increasingly emphasizes the importance of data quality and scale over constant modifications of model architectures. High-quality, well-prepared datasets are foundational to achieving reliable AI performance.

- **Programmatic Labeling:**  
  Moving away from purely manual annotation, programmatic methods, including automated and heuristic labeling, provide scalable, cost-efficient, and adaptable solutions. These methods address real-world challenges like rapidly evolving data and limited expert availability.

- **Subject Matter Expertise (SME):**  
  Integrating SMEs in the AI development process is essential. Their domain knowledge refines data interpretation, feature engineering, and model evaluation, ensuring that model outputs are both accurate and practically relevant.

- **Weak Supervision Types:**  
  Weak supervision can be categorized into:
  - **Incomplete Supervision:** Only a subset of data is labeled.
  - **Inexact Supervision:** Labels are imprecise or aggregated (e.g., multiple instance learning).
  - **Inaccurate Supervision:** Labeled data contains noise and errors.
  
- **Label Aggregation and Trade-offs:**  
  Weak supervision typically needs more data than full supervision (often about twice as many weakly labeled samples as fully labeled ones), but it can reach comparable performance by aggregating label sources and representing uncertainty with probabilistic labels.

- **Asymptotic Scaling and Model Robustness:**  
  The generalization error in weak supervision decreases at a rate proportional to $ n^{-1/2} $, where $ n $ is the number of training samples, mirroring traditional supervised learning. Efficient use of unlabeled data and sophisticated aggregation methods further enhance robustness.

## Learning Goals

By the end of this section, students will be able to:

1) Define and differentiate between model-centric and data-centric approaches to artificial intelligence development.

2) Explain the rationale behind the shift towards data-centric AI, emphasizing the importance of data quality and scale for achieving performance improvements.

3) Describe how the evolution of the GPT series exemplifies the principles of data-centric AI and the impact of increasing data quantity and quality.

4) Articulate the role of prompt engineering in data-centric AI and explain its significance when working with large, pre-trained models.

5) Discuss the consequences of adopting a data-centric approach on various aspects of AI development, such as data collection, model training strategies, and deployment workflows.


## Data-Centric AI: Shifting Focus from Models to Data

This section discusses the recent change in artificial intelligence research, where the emphasis is increasingly on the quality and scale of data rather than just on the design of model architectures.

### Model-Centric vs. Data-Centric Approaches

Consider the formula:

$$
\text{AI} = \text{Code} + \text{Data}
$$

This expression highlights that both the algorithms (code) and the training data have important roles in achieving success with AI.

#### Conventional Model-Centric Approach

- **Focus on Algorithms:**  
  Traditionally, efforts in AI research have prioritized the development and improvement of model architectures. Researchers have invested significant time in creating complex neural networks and refining training techniques.

- **Limited Attention to Data Quality:**  
  While data is always a component, its role has often been viewed as secondary to the design of advanced models.

#### Emerging Data-Centric Approach

- **Emphasis on High-Quality Data:**  
  The data-centric perspective argues that substantial improvements in AI performance mainly come from the availability of large and high-quality datasets. As a result, even modest changes in how data is handled can lead to noticeable enhancements in AI outcomes.

- **Simplicity in Model Architecture After a Certain Point:**  
  With powerful models that can learn effectively from data, the task of adapting solutions to various problems may only require altering the input data (often called prompt engineering) rather than changing the network design. In other words, once the model reaches a certain level of capacity, the data fed during both training and inference becomes the primary tool for achieving specific tasks.

### The GPT Series as an Example

The evolution of the GPT series, developed by OpenAI, illustrates these ideas:

<div style="text-align: center;">
<img src="images/gpt_evolution.png" width="70%" height="70%">
</div>

- **Left Side of the Figure:**  
  This side of the figure shows that improvements in GPT models' performance are largely due to the increase in the amount and quality of the training data. While the architecture has been refined, it has remained mostly similar in design, with the main change being the increase in the number of parameters.

- **Right Side of the Figure:**  
  When the model is sufficiently large and well-trained, it requires only minor adjustments to the input prompts (inference data) to perform a diverse range of tasks. This means that after the initial heavy lifting during training, the model remains fixed while the data used in inference drives its functionality.
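
The point about inference data can be illustrated with a minimal sketch. Nothing here calls a real model: `TASK_TEMPLATES` and `build_prompt` are hypothetical names introduced only for this example, and the template wording is an assumption. The sketch shows that switching tasks changes only the text sent to a fixed model.

```python
# Hypothetical prompt templates: one frozen model, many tasks.
# Only the inference-time input changes; no weights are updated.
TASK_TEMPLATES = {
    "sentiment": "Classify the sentiment of this review as positive or negative:\n{text}",
    "summary": "Summarize the following text in one sentence:\n{text}",
    "translation": "Translate the following text to Portuguese:\n{text}",
}


def build_prompt(task: str, text: str) -> str:
    """Return the prompt for `task`; raises KeyError for unknown tasks."""
    return TASK_TEMPLATES[task].format(text=text)


review = "The movie was surprisingly good."
prompts = {task: build_prompt(task, review) for task in TASK_TEMPLATES}

for task, prompt in prompts.items():
    print(f"--- {task} ---\n{prompt}\n")
```

Each prompt would be passed to the same pre-trained model; adding a new task means adding a template, not retraining.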

### Important Points to Consider

> **Key Takeaways:**
> - **Data plays a central role** in the success of modern AI systems.
> - **High-quality and large-scale data** have contributed more to performance gains than radical changes in model architecture alone.
> - **Prompt engineering** becomes a useful tool when working with models that are already powerful.

### Addressing Potential Questions

1. **Why shift focus from models to data?**  
   The shift is driven by the realization that even with similar model architectures, significant performance improvements are achieved by using better and more abundant training data. This approach also simplifies the work required during deployment by relying on prompt engineering rather than constantly redesigning the models.

2. **What are the consequences for AI development?**  
   Emphasizing data quality may lead to increased investment in data collection, cleaning, and augmentation. Researchers and engineers might allocate resources more toward ensuring that training sets are diverse, representative, and of high quality.

3. **How does this change affect model training?**  
   Large-scale and high-quality datasets allow models to learn more from the data itself. Once a model reaches a certain level of capability, minor modifications to input data can tailor its performance across various tasks without altering the basic architecture.

### Visualizing the Concept

The image below reinforces the discussion by comparing the traditional focus on model improvement with the modern emphasis on data quality:

<div style="text-align: center;">
<img src="images/model_vs_datacentric.jpeg" width="70%" height="70%">
</div>

The accompanying figure from [this paper](http://arxiv.org/abs/2303.10158) demonstrates that even as model complexity increases, the primary factor behind performance improvements is the scale and quality of training data.

## Key Principles of Data-Centric AI

### Principle 1: It Centers Around Data
1. **Data as the Primary Driver**

    - The adage "Garbage In, Garbage Out" (GIGO) is particularly relevant in AI, emphasizing the critical role of data quality.
    - Recent AI breakthroughs are largely attributed to the quality and quantity of data, rather than model architecture improvements.
2. **The Power of Large Language Models**
    - With the advent of powerful models like GPT-3, the focus has shifted from model design to prompt engineering.
    - These models demonstrate that with sufficient scale and data, many complex tasks can be solved through clever prompting rather than architectural changes.

3. **Data Hunger of Deep Learning Models**
    - Modern deep learning models require vast amounts of data to achieve high performance.
    - This data dependency has led to a shift in focus towards data acquisition, curation, and management.

4. **The "Solved" Problem of Deep Learning Architectures**
    - Many researchers argue that the fundamental challenges in designing deep learning architectures have been largely addressed.
    - See the image below: over the past several years, [performance on IMDB sentiment analysis](https://paperswithcode.com/sota/sentiment-analysis-on-imdb) has improved only modestly. After 5 years and millions of dollars invested, accuracy has increased from 96.0% to 96.68%.
    - The emphasis now is on applying and fine-tuning existing architectures rather than creating entirely new ones.
    - In this sense, the "problem" of designing deep learning architectures is "solved". Today, you can load a pre-trained model with a few lines of code and get state-of-the-art results. The real challenge is getting the right data to train your model.

<div style="text-align: center;">
<img src="images/imdb_sentiment_analysis.png" width="70%" height="70%">
</div>

> <div style="text-align: center;">
> <video width="720" controls>
> <source src="images/febraban1.mp4" type="video/mp4">
> </video>
> </div>


Maybe a little too much? Let's see.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load a pre-trained Portuguese BERT and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")
```

Maybe he was right, after all. 😅

## Principle 2: It Needs to be Programmatic

### Challenges with Manual Data Labeling

Manual data labeling, while traditional, has several difficulties that affect the development of AI systems:

- **Scalability:**  
  As dataset sizes increase, the time required for manual labeling grows approximately as  
  $$
  T = N \times t_{\text{label}}
  $$
  where $ N $ is the number of samples and $ t_{\text{label}} $ is the time spent per sample. With large $ N $, manual labeling becomes impractical.

- **Cost Inefficiency:**  
  Hiring and managing sizeable teams of human labelers can demand extensive resources. This cost can limit projects, especially when frequent dataset updates are necessary.

- **Time Constraints:**  
  The entire manual process demands significant time investment. This delay slows the overall cycle of AI model development and updates.
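
To make the scalability formula $ T = N \times t_{\text{label}} $ concrete, here is a small worked calculation. The 30 seconds per sample is an assumed figure chosen only for illustration, and `manual_labeling_hours` is a hypothetical helper.

```python
def manual_labeling_hours(n_samples: int, seconds_per_label: float) -> float:
    """Total labeling effort in hours, following T = N * t_label."""
    return n_samples * seconds_per_label / 3600


# Assumed figure, for illustration only: 30 seconds per sample.
for n in (1_000, 100_000, 1_000_000):
    hours = manual_labeling_hours(n, seconds_per_label=30)
    print(f"N = {n:>9,}: {hours:>10,.1f} labeler-hours")
```

Under this assumption, one million samples require roughly 8,333 labeler-hours, which is why purely manual labeling stops being practical as $ N $ grows.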

### Specific Real-World Limitations

Manual labeling presents added challenges in certain domains:

- **Data Privacy:**  
  In areas like healthcare or law, the data may include sensitive personal information that cannot be easily shared with human labelers. This restriction complicates the process and may require additional safeguards.

- **Specialized Expertise:**  
  Some projects require labelers with specific domain knowledge. For instance:
  
  - Legal document classification might require insights from lawyers.
  - Medical data annotation often needs input from qualified physicians.
  - Technical content might demand experts in the field.

> **Example:**  
> Developing an AI system to diagnose rare diseases would necessitate input from medical specialists for each case. This makes manual labeling not only slow and expensive but also subject to strict data privacy protocols.

### The Issue of Evolving Information

Manual labeling can also struggle with keeping pace with changes in information:

- **Outdated Labels:**  
  Fields such as law, medicine, or technology may experience rapid changes. As new laws, discoveries, or technologies emerge, previously labeled data may become less accurate or entirely irrelevant.

- **Need for Continuous Updates:**  
  When new categories must be added or existing ones updated, large parts of the dataset may need re-labeling.  
  > **Analogy:**  
  > This is like printing an encyclopedia: by the time it is complete, some information is already outdated, requiring a complete revision to stay current.

### Moving Toward Programmatic Solutions

To overcome these challenges, the AI community is increasingly using programmatic methods for data labeling:

1. **Automated Labeling:**  
   Developing algorithms that can assign labels with minimal human oversight. This method reduces both the cost and time required.

2. **Transfer Learning:**  
   Using models pre-trained on large datasets to reduce the volume of new labeled data needed for specific tasks.

3. **Active Learning:**  
   Implementing systems that identify and select the most informative samples for human review. This approach minimizes the overall labeling effort while maximizing learning efficiency.

4. **Synthetic Data Generation:**  
   Creating artificial data that mimics real-world properties can help augment real datasets, addressing limitations in data quantity and diversity.

### Benefits of Programmatic Approaches

These automated methods offer several clear advantages:

- **Speed:**  
  Labeling can be performed much faster than with manual methods.
  
- **Cost:**  
  Reduced dependency on large labeling teams leads to lower overall expenses.

- **Adaptability:**  
  Automated systems can be updated quickly as new information becomes available, ensuring the labeling process remains current.

- **Scalability:**  
  Programmatic labeling scales efficiently with data volume, accommodating rapid growth in dataset size.

<br><br>

| Labeling Approach     | Speed | Cost       | Adaptability |
|-----------------------|-------|------------|--------------|
| Manual Labeling       | Slow  | Expensive  | Static       |
| Programmatic Labeling | Fast  | Affordable | Dynamic      |



## Principle 3: It Needs to Include Subject Matter Expertise in the Loop

### The Importance of Subject Matter Experts (SMEs)

In many machine learning projects, the input of Subject Matter Experts (SMEs) is essential. Their specific knowledge and insights bring an understanding of the domain that can greatly improve the accuracy and applicability of AI systems.

- **Critical Contributions:**  
  SMEs offer context that extends beyond data and algorithms. They can clarify ambiguous data points, provide guidance on potential pitfalls, and help align model outputs with real-world requirements.

- **Team Incorporation:**  
  Rather than being seen as external consultants, SMEs should be regarded as key members of the project team. Their continuous participation shapes decisions on model design, data interpretation, and feature engineering.

### Effective Collaboration with SMEs

Working effectively with SMEs involves clear communication and mutual respect:

- **Clear Communication:**  
  - Translate technical terms into language that is easily understood by experts in the domain.  
  - Listen carefully to their feedback and use it to refine model objectives.  
  - Ask specific questions to uncover hidden assumptions in the data or problem setup.

- **Regular Interaction:**  
  Frequent discussions throughout the project help ensure that the insights from domain experts remain aligned with the model's development. For example, scheduling regular check-ins helps both technical and domain teams stay informed about ongoing adjustments.

### Navigating Domain Complexity

Machine learning practitioners often work across a variety of fields, which may include healthcare, finance, legal, or technology, among others. Consider the following points:

- **Recognize Limits:**  
  It is unrealistic to be an expert in every field. Acknowledging this fact encourages a learning mindset and emphasizes the value of expert input.

- **Continuous Learning:**  
  Every project serves as an opportunity to learn more about a new domain. Over time, the process of incorporating SME insights into technical models enhances your capability to adapt to various subjects.

- **Encoding Knowledge:**  
  The challenge is to incorporate domain knowledge into your models effectively. This may involve using established domain theories to guide data preprocessing, feature selection, and model evaluation. In some cases, the SME contributions can be represented mathematically. For example, if a feature $ x_i $ is informed by domain insights, one can express the importance of that feature as a weight $ w_i $ in the following linear model:
  $$
  y = \sum_{i=1}^{n} w_i x_i + b
  $$
  where the weights $ w_i $ are tuned not only through standard optimization methods but also adjusted based on SME feedback.
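
The linear model above can be sketched as follows. All numbers are made up, and multiplying learned weights by an SME-provided factor (`sme_scale`, a hypothetical name) is only one naive way to encode expert feedback, assumed here for illustration; it is not a recommended procedure.

```python
def linear_predict(weights, features, bias):
    """Compute y = sum_i w_i * x_i + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical weights from a standard fit (values are made up).
learned_weights = [0.8, 0.1, -0.3]
bias = 0.5

# Hypothetical SME feedback: the expert considers the second feature more
# important than the fit suggests and the third feature irrelevant.
sme_scale = [1.0, 2.0, 0.0]
adjusted_weights = [w * s for w, s in zip(learned_weights, sme_scale)]

x = [1.0, 2.0, 3.0]
y_learned = linear_predict(learned_weights, x, bias)
y_adjusted = linear_predict(adjusted_weights, x, bias)
print(f"learned: {y_learned:.2f}, SME-adjusted: {y_adjusted:.2f}")
```

In practice, expert knowledge is more often encoded through feature design, constraints, or priors than by editing fitted weights directly.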

### Strategies for Integrating SME Insights

To maximize the benefits of SME participation, consider the following approaches:

1. **Frequent Check-Ins:**  
   Schedule regular meetings with SMEs throughout the project lifecycle. This ongoing interaction helps address issues as they arise.

2. **Collaborative Problem Definition:**  
   Involve SMEs in determining project aims and defining success criteria. Their input ensures that the project goals are both realistic and relevant to the domain.

3. **Data Interpretation:**  
   Use SME insights to help interpret data, especially when dealing with subtle or ambiguous statistical patterns. Their understanding can clarify why certain anomalies arise.

4. **Feature Engineering:**  
   Work together to select or create features that capture domain-specific nuances. This might involve identifying key indicators or combining variables in a meaningful way.

5. **Model Evaluation:**  
   Go beyond common performance metrics by incorporating SME observations into the evaluation process. Their feedback can reveal performance issues that numbers alone might not show.

6. **Iterative Refinement:**  
   As the project evolves, continue to refine the model based on ongoing SME input. This iterative process helps align technical outcomes with practical expectations.

### Addressing Common Challenges

Despite the benefits, integrating SME feedback can come with challenges:

- **Bridging Knowledge Gaps:**  
  Develop strategies that make communication between technical teams and domain experts easier. Clear, jargon-free dialogue is key.

- **Managing Expectations:**  
  Clearly state what machine learning can and cannot achieve. This helps align SME expectations with the model’s capabilities.

- **Balancing Technical and Domain Perspectives:**  
  Find a middle ground where technical feasibility and domain-specific requirements meet. Align both perspectives to achieve solutions that work well statistically and practically.


## The Data-Centric Approach

> "Data is both the key bottleneck and interface to developing AI today"

This statement highlights two important ideas:

- **Data as a Bottleneck**: Often, the main limitation in advancing AI is not the computational power or the model's design but rather the availability of high-quality, relevant data.
- **Data as an Interface**: Data acts as the key element through which we shape and interact with AI systems.



## Key Strategies for Data-Centric AI

To enhance AI development by focusing on data, consider the following strategies:

1. **Focus on Data Quality**
   - Place emphasis on collecting and curating data that is both accurate and relevant.
   - Establish and follow clear procedures for data validation and verification.
   - **Example:** Imagine training a predictive model. The performance of the model depends on how well the training data represents real-world conditions, which is similar to how clear and precise measurements in an experiment lead to more reliable results.

2. **Data Augmentation and Synthesis**
   - Use techniques to expand existing datasets without compromising quality.
   - Explore methods to generate synthetic data, especially when data is limited in particular areas.
   - **Mathematical View:** Let $ D $ represent the available data. Augmentation aims to create an enhanced dataset $ D' $ such that:
     $$
     D' = D \cup A(D)
     $$
     where $ A(D) $ represents the augmented data generated from the original dataset.

3. **Efficient Data Labeling**
   - Consider methods like weak supervision and semi-supervised learning to minimize the need for extensive manual labeling.
   - Develop or use tools that assist in the labeling process by incorporating predictions from existing models.
   - **Analogy:** Think of labeling as tagging books in a library; efficient systems reduce the overall workload while still ensuring that every book is properly categorized.

4. **Data Governance and Ethics**
   - Create clear guidelines for how data is collected, used, and stored.
   - Address concerns such as data bias, ensuring that data collection covers a fair and diverse sample, and safeguard privacy.
   - **Note:** Integrating ethics into data management helps prevent unintended consequences in AI applications.

5. **Continuous Data Improvement**
   - Establish processes for regular updates and improvements to the dataset.
   - Develop metrics and techniques for ongoing assessment of data quality.
   - **Example:** Similar to how students improve their learning materials with feedback over a term, the data used in AI can be continually refined to improve model performance.
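
The augmentation formula from strategy 2, $ D' = D \cup A(D) $, can be sketched with Python sets. The synonym table and the `augment` function are hypothetical and deliberately tiny; synonym replacement is just one possible choice for $ A $.

```python
# Hypothetical synonym table; a real project would use a curated resource.
SYNONYMS = {"good": "great", "bad": "poor", "movie": "film"}


def augment(example):
    """A(x): swap known words for synonyms, keeping the label unchanged."""
    text, label = example
    new_text = " ".join(SYNONYMS.get(word, word) for word in text.split())
    return (new_text, label)


D = {("a good movie", "pos"), ("a bad plot", "neg"), ("just fine", "pos")}

# A(D): keep only augmented examples that differ from the originals.
A_D = {augment(example) for example in D} - D

# D' = D ∪ A(D)
D_prime = D | A_D
print(len(D), len(A_D), len(D_prime))
```

The sketch assumes that swapping synonyms preserves the label, which should be checked for each task before trusting the augmented data.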



## Challenges and Considerations

Several challenges may arise when adopting a data-centric approach:

- **Balancing Quantity and Quality**
  - While adding more data can be beneficial, it is essential to ensure that new data maintains or improves the overall quality.
  - **Question to Consider:** How do we ensure that increased data volume does not compromise accuracy?

- **Domain Specificity**
  - The benefits of a data-focused strategy may vary depending on the application area. Some domains might require specialized data collection and curation methods.
  - **Clarification:** Different fields may have unique data requirements. For instance, medical data demands strict privacy and high accuracy, while social media data might prioritize volume and trend extraction.

- **Computational Resources**
  - Managing and processing large datasets requires considerable computational power.
  - **Consideration:** Ensure that the available infrastructure can support the data operations needed.

- **Interdisciplinary Collaboration**
  - Effective data management in AI often calls for the combined efforts of data scientists, domain experts, and AI researchers.
  - **Example:** In environmental modeling, collaboration between meteorologists and data experts can lead to better data collection practices, ensuring that models more accurately reflect climate patterns.

> **Important:** Shifting to a data-centric approach in AI emphasizes the importance of data quality and relevance. Often, improvements in data can lead to more significant gains in model performance than changes in model architecture. This change in focus can lead to more stable and reliable AI systems across various applications.

## I Don't Have Enough Data! What Can I Do?

<div style="text-align: center;">
  <img src="images/insufficient_labeled_training_data.png" width="70%" height="70%">
</div>


### 1. Expert Hand-Labeling

**Overview:**  
- Involves subject matter experts (SMEs) manually annotating data.
- **Traditional Supervision:** Label randomly sampled data points, with no strategy for choosing which ones to annotate.
- **Active Learning:** Select the most informative or uncertain data points first to reduce labeling efforts.

**Considerations:**
- **Accuracy:** High-quality labels benefiting from deep domain knowledge.
- **Cost and Time:** Manual processes can be slow and expensive.
- **Scalability:** Not well-suited for extremely large datasets.

**Key Points:**
- When using expert labeling, think of your labeled data as a set of input-label pairs:  
  $$
  D_{\text{labeled}} = \{(x_i, y_i) \}_{i=1}^{n}
  $$
  Here, $n$ is often small due to resource constraints.
- Active learning strategies seek to maximize the utility of each labeled data point.
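
One common way to pick "the most informative" points is uncertainty sampling. The sketch below ranks unlabeled samples by the entropy of the model's predicted class distribution. The document ids and probabilities are invented, and `select_for_labeling` is a hypothetical helper, not part of any library.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, k):
    """Return the ids of the k samples the model is least sure about."""
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs on the unlabeled pool: id -> class probabilities.
pool_predictions = {
    "doc_a": [0.98, 0.02],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.70, 0.30],
    "doc_d": [0.50, 0.50],
}

to_label = select_for_labeling(pool_predictions, k=2)
print(to_label)
```

Here the expert would be asked to label `doc_d` and `doc_b` first, since the model's predictions for them are closest to a coin flip.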

---

### 2. Weak Supervision

**Overview:**  
Weak supervision refers to techniques that generate labels with lower precision but at significantly reduced cost and effort.

**Methods:**
- **Programmatic Supervision:**  
  - Develop rules, patterns, or heuristics to label data automatically.
  - Rules can be defined as functions $ f: X \to Y $ that map features to approximate labels.
- **Crowdsourcing:**  
  - Distribute labeling tasks among many contributors.
  - Often involves aggregating responses from multiple non-expert labelers to approximate expert quality.
- **Heuristic Supervision:**  
  - Use existing metadata or domain-specific rules to assign labels without manual review of every sample.

**Considerations:**
- **Trade-Off:** Lower per-instance accuracy may be offset by scale and speed.
- **Aggregate Reliability:** Techniques such as majority voting can help improve overall label quality.
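
As a minimal sketch of programmatic supervision with majority voting, the code below defines three rules $ f: X \to Y $ for a toy spam/ham task and aggregates their votes. The rules, the label names, and the `ABSTAIN` convention are assumptions made for this example; dedicated weak supervision libraries use more sophisticated aggregation.

```python
from collections import Counter

ABSTAIN = None  # a rule returns ABSTAIN when it has no opinion


# Hypothetical labeling functions f: X -> Y for a spam/ham task.
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_has_link(text):
    return "spam" if "http" in text.lower() else ABSTAIN


def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_has_link, lf_greeting]


def majority_vote(text):
    """Aggregate rule votes; return (label, vote share) or (ABSTAIN, 0.0)."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


for msg in ["Free prize at http://example.com", "Hello, lunch tomorrow?", "Meeting at 3pm"]:
    print(msg, "->", majority_vote(msg))
```

The vote share is a crude stand-in for a probabilistic label: a 2-to-1 vote is less trustworthy than a unanimous one. Ties are broken by whichever label was seen first, which a real system should handle explicitly.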

---

### 3. Semi-Supervised Learning

**Overview:**  
Semi-supervised learning is effective when you have a small set of labeled data combined with a much larger pool of unlabeled data.

**Techniques:**
- **Utilizing Unlabeled Data:**  
  - Incorporate unlabeled data, $D_{\text{unlabeled}} = \{ x_j \}_{j=1}^{m}$, to learn structure or patterns in the data.
  - Combine with the labeled set to improve overall model performance.
- **Loss Function Incorporation:**  
  - Training often involves a combined loss function that includes both supervised and unsupervised components:
    $$
    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{supervised}} + \lambda \, \mathcal{L}_{\text{unsupervised}}
    $$
    where $\lambda$ controls the influence of the unlabeled data.

**Considerations:**
- **Generalization:** Using unlabeled data affords the model additional context, potentially reducing overfitting on the small labeled subset.
- **Implementation:** Requires algorithms designed to extract useful features from unlabeled examples, such as consistency regularization or pseudo-labeling.
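
The combined loss can be sketched with pseudo-labeling as the unsupervised term. This is a simplified illustration: the predicted probabilities are invented, and the confidence `threshold` and the default value of `lam` (standing in for $\lambda$) are assumptions rather than recommended settings.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def total_loss(labeled, unlabeled, lam=0.5, threshold=0.9):
    """L_total = L_supervised + lam * L_unsupervised (pseudo-labeling variant).

    labeled:   list of (predicted class probabilities, true class index)
    unlabeled: list of predicted class probabilities
    """
    l_sup = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)

    # Pseudo-labeling: treat confident predictions as if they were labels.
    confident = [p for p in unlabeled if max(p) >= threshold]
    if confident:
        l_unsup = sum(cross_entropy(p, p.index(max(p))) for p in confident) / len(confident)
    else:
        l_unsup = 0.0

    return l_sup + lam * l_unsup, l_sup, l_unsup


labeled = [([0.8, 0.2], 0), ([0.3, 0.7], 1)]
unlabeled = [[0.95, 0.05], [0.6, 0.4]]  # only the first passes the threshold

loss, l_sup, l_unsup = total_loss(labeled, unlabeled, lam=0.5)
print(f"supervised = {l_sup:.3f}, unsupervised = {l_unsup:.3f}, total = {loss:.3f}")
```

Setting `lam=0` recovers purely supervised training, while larger values let the unlabeled pool pull harder on the model.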

---

### 4. Transfer Learning

**Overview:**  
Transfer learning involves taking a model trained on one task and adapting it to another task with limited labeled data.

**Methods:**
- **Pre-training and Fine-tuning:**  
  - Begin with a model pre-trained on a related dataset.
  - Adapt the model to your target domain by fine-tuning on your specific (and typically small) labeled dataset.
- **Multi-Task Learning:**  
  - Train models on multiple related tasks simultaneously.
  - Share learned representations across tasks, which can improve performance when target data is scarce.
- **Using Models Trained on Similar Domains:**  
  - If available, use models that have been developed on analogous problems.
  - Even if the data quality varies, these models can provide a useful starting point.

**Considerations:**
- **Domain Similarity:**  
  - The success of transfer learning depends on how similar the source and target domains are.
- **Data Efficiency:**  
  - Transfer learning can be especially powerful when labeled data is very limited, effectively reducing the requirement for large amounts of task-specific example data.

---

### Strategic Considerations

When deciding on a strategy for addressing insufficient labeled training data, consider the following factors:

- **Resources Available:**  
  - Time, labor, expertise, and computing resources.
- **Domain Complexity:**  
  - The intricacy of the subject matter may dictate a need for expert labels.
- **Precision Requirements:**  
  - Critical applications might demand higher quality labels.
- **Computational Capabilities:**  
  - More sophisticated methods like semi-supervised or transfer learning might require additional computing power.

**Recommendation:**
- Often, combining multiple strategies will yield the best results. For example, expert hand-labeling on a key subset of the data can be combined with weak supervision to label the remaining data, and semi-supervised learning can further capitalize on any unlabeled examples.

> **Note:**  
> There is no one-size-fits-all approach to data scarcity. The effectiveness of each method will vary depending on the context. It is important to assess your data quality and continuously refine your approach in light of model performance and emerging insights.

## From Supervised to Weakly Supervised Learning

### Fully Supervised Learning

Fully supervised learning is a foundational machine learning method that trains a model on a dataset in which every example has an associated label. The goal is to learn a mapping from inputs to outputs that can be applied to new, unseen data.

#### Key Concepts

1. **Function Mapping**

   - **Definition:** Learn a function $ f: \mathcal{X} \rightarrow \mathcal{Y} $.
     - $ \mathcal{X} $ is the input space (e.g., images, text, or sensor readings).
     - $ \mathcal{Y} $ is the output space (e.g., categories, binary values, or continuous numbers).
   - **Example:** In image classification, $ \mathcal{X} $ represents images while $ \mathcal{Y} $ represents a set of labels such as "cat" or "dog".

2. **Dataset Structure**

   - **Training Data:** A labeled dataset denoted by
     $$
     \mathcal{D} = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}
     $$
     Here, each $ (x_i, y_i) $ is an individual example.
     - $ x_i \in \mathcal{X} $: The input feature.
     - $ y_i \in \mathcal{Y} $: The corresponding label.
   - **Role of Data:** The dataset must be representative of the problem space to ensure the model generalizes well to unseen cases.

3. **Variants of Supervised Learning Tasks**

   - **Binary Classification:** $ \mathcal{Y} = \{\text{Yes}, \text{No}\} $.
   - **Multiclass Classification:** $ \mathcal{Y} = \{\text{Class}_1, \text{Class}_2, \dots, \text{Class}_n\} $.
   - **Regression:** $ \mathcal{Y} = \mathbb{R} $ (real numbers).

#### The Learning Process

1. **Data Collection:**  
   Gather an extensive and representative dataset that includes both input features and corresponding labels.

2. **Feature Selection:**  
   Identify and choose relevant features from the input data that contribute to predicting the output accurately.

3. **Model Selection:**  
   Decide on an appropriate algorithm or model structure capable of capturing the relationship between inputs and outputs.

4. **Training:**  
   Optimize the model's parameters on the training data to reduce the difference between the model's predictions and the actual labels.

5. **Validation:**  
   Evaluate the model on a separate set of data that was not used during training. This step helps to check how well the model generalizes.

6. **Fine-Tuning:**  
   Adjust parameters or modify the model structure based on performance feedback until a satisfactory level of accuracy is achieved.
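
As a rough illustration of the training and validation steps, the sketch below fits a one-parameter threshold classifier on a toy dataset and then checks it on held-out data. The data generator, the helper names (`make_toy_data`, `fit_threshold`, `accuracy`), and the 80/20 split are all assumptions made for this example, not part of the course material.

```python
import random

def make_toy_data(n, seed=0):
    """Hypothetical toy data: the label is 1 when a slightly noisy copy of x exceeds 0.5."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x + rng.gauss(0, 0.05) > 0.5)
        data.append((x, y))
    return data

def accuracy(threshold, data):
    """Fraction of examples where the rule 'x > threshold' matches the label."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)

def fit_threshold(train, candidates):
    """'Training' here means picking the candidate with the best training accuracy."""
    return max(candidates, key=lambda t: accuracy(t, train))

data = make_toy_data(200)            # examples are already in random order
split = int(0.8 * len(data))
train, valid = data[:split], data[split:]

candidates = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best = fit_threshold(train, candidates)
train_acc = accuracy(best, train)
valid_acc = accuracy(best, valid)    # measured on data not used for fitting

print(f"threshold={best:.2f} train_acc={train_acc:.2f} valid_acc={valid_acc:.2f}")
```

A large gap between `train_acc` and `valid_acc` would be the signal to revisit the model or its settings, which is what the fine-tuning step refers to.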

#### Objective Function

To train the model, we minimize a loss function that measures the error between the model's predictions and the true labels. This is usually expressed as:

$$
\min_{\theta} \frac{1}{m} \sum_{i=1}^m \mathcal{L}(f_{\theta}(x_i), y_i)
$$

- $ \theta $ represents the model parameters.
- $ f_{\theta}(x_i) $ is the prediction for the input $ x_i $.
- $ \mathcal{L}(f_{\theta}(x_i), y_i) $ is the loss associated with the prediction.

> **Note:** This equation means that we are trying to find the best model settings that, on average, make our predictions as close as possible to the actual labels.
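
To make the objective concrete, here is a minimal sketch that minimizes the average loss with plain gradient descent. It assumes a linear model $ f_{\theta}(x) = wx + b $, squared error as the loss $ \mathcal{L} $, and toy data generated from $ y = 2x + 1 $; these choices, the learning rate, and the function names are illustrative assumptions.

```python
# Toy data generated from y = 2x + 1 (an assumption made for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def mean_squared_loss(w, b):
    """Empirical risk: average of L(f_theta(x_i), y_i) with squared error as L."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b):
    """Partial derivatives of the mean squared loss with respect to w and b."""
    m = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / m
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / m
    return dw, db

w, b = 0.0, 0.0                      # theta = (w, b), starting from zero
initial_loss = mean_squared_loss(w, b)
learning_rate = 0.05

for _ in range(2000):
    dw, db = gradient(w, b)
    w -= learning_rate * dw          # step against the gradient
    b -= learning_rate * db

final_loss = mean_squared_loss(w, b)
print(f"w={w:.3f} b={b:.3f} loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

On this noise-free toy problem the parameters approach the values used to generate the data; with real data the minimum loss is generally above zero.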

#### Advantages and Challenges

- **Advantages:**
  - **High Accuracy:** With a sufficient amount of accurately labeled data, supervised models can reach high levels of accuracy.
  - **Clear Evaluation Metrics:** Since true labels are known, performance metrics are straightforward to compute (e.g., accuracy, precision, recall).
  - **Theoretical and Practical Support:** There is a strong foundation in both theory and applied methods for fully supervised learning.

- **Challenges:**
  - **Data Requirements:** A large amount of labeled data is needed, which can be costly and time-consuming to collect.
  - **Generalization Risk:** If the training data does not fully represent the problem space, the model might not perform well on new data.
  - **Overfitting:** Complex models may learn the training data too well, including its noise, which can reduce performance on unseen data.

> **Reminder:** Fully supervised learning is effective when there is plenty of labeled data. In situations where obtaining such data is difficult, alternative methods like weakly supervised or unsupervised learning may provide viable solutions.


## Types of Weak Supervision

> *“Weakly supervised learning is an umbrella term covering a variety of studies that attempt to construct predictive models by learning with weak supervision.”*

Weak supervision in machine learning refers to scenarios where the available training labels are not as complete, precise, or error-free as in fully supervised learning. This section categorizes weak supervision into three primary types: **Incomplete Supervision**, **Inexact Supervision**, and **Inaccurate Supervision**.



### Incomplete Supervision

<img src="images/incomplete_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

**Definition:**  
Incomplete supervision occurs when only a subset of the available training data is labeled. In many practical applications, obtaining labels for every data point can be very expensive or time-consuming. As a result, a large portion of the data remains unlabeled.

**Characteristics:**
- Only a fraction of the data points have labels.
- The unlabeled portion is often much larger than the labeled portion.
  
**Formal Representation:**  
If we denote the number of labeled examples by $ l $ and the number of unlabeled examples by $ u $, with the total being $ m = l + u $, the dataset can be written as:

$$
\mathcal{D} = \{(x_1, y_1), \dots, (x_l, y_l), (x_{l+1}, \emptyset), \dots, (x_m, \emptyset)\}
$$

**Example:**  
In a collection of medical images, only a small number may be annotated by experts due to the high cost and expertise required for labeling.
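
The formal representation can be mirrored directly in code. In this sketch, `None` stands in for the missing label $ \emptyset $, and the unlabeled points are then given pseudo-labels by copying the label of the nearest labeled example. The feature vectors and the `nearest_label` helper are hypothetical, and nearest-neighbor pseudo-labeling is only one simple way to use unlabeled data.

```python
# Each entry is (features, label); None marks an unlabeled example.
dataset = [
    ([0.10, 0.20], 0),
    ([0.90, 0.80], 1),
    ([0.20, 0.10], None),
    ([0.80, 0.90], None),
    ([0.15, 0.25], None),
]

labeled = [(x, y) for x, y in dataset if y is not None]
unlabeled = [x for x, y in dataset if y is None]
l, u, m = len(labeled), len(unlabeled), len(dataset)

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_label(x, labeled_examples):
    """Copy the label of the closest labeled example (a crude pseudo-label)."""
    closest = min(labeled_examples, key=lambda pair: squared_distance(x, pair[0]))
    return closest[1]

pseudo_labels = [nearest_label(x, labeled) for x in unlabeled]
print(f"l={l}, u={u}, m={m}, pseudo_labels={pseudo_labels}")
```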



### Inexact Supervision

<img src="images/inexact_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

**Definition:**  
Inexact supervision refers to situations in which the available labels are coarser than the task ideally requires. Instead of detailed, instance-specific information, the labels supply higher-level summaries or aggregated data.

**Characteristics:**
- Labels are less precise, often providing only partial or aggregate information.
- Annotations may be available at a group level rather than for each individual instance.

#### Common Forms of Inexact Supervision

1. **Multiple Instance Learning (MIL):**  
   - **Concept:** Data is grouped into "bags" of instances and the entire bag is labeled rather than each instance.
   - **Mathematical Formulation:**  
     For a bag $ B_i = \{x_{i1}, x_{i2}, \dots, x_{in}\} $, the bag label $ y_i $ is defined as:
     $$
     y_i =
     \begin{cases}
     1 & \text{if there exists } x_{ij} \in B_i \text{ such that } y_{ij} = 1, \\
     0 & \text{otherwise.}
     \end{cases}
     $$

2. **Label Proportions:**  
   - **Concept:** Instead of individual instance labels, the proportion of positive instances in each group is known.
   - **Mathematical Formulation:**  
     For a group $ G_k $ with $ n_k $ instances, the proportion is given as:
     $$
     p_k = \frac{1}{n_k} \sum_{x_i \in G_k} y_i
     $$
     where $ p_k $ represents the known proportion of positive instances.

3. **Coarse-Grained Labels:**  
   - **Concept:** Labels are provided at a more general level rather than being specific to each fine-grained category.
   - **Mathematical Formulation:**  
     Let $ Y $ be the set of fine-grained labels, and $ Z $ be the set of coarse-grained labels such that:
     $$
     z = f(y), \quad \text{with } y \in Y,\, z \in Z, \quad \text{and } |Z| < |Y|.
     $$
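
The three formulations above translate into a few lines of code. This sketch uses hidden instance labels only to show how the weak label is derived from them; in practice the learner sees just the bag label, the proportion, or the coarse label. The function names and the `FINE_TO_COARSE` mapping are made up for illustration.

```python
def mil_bag_label(instance_labels):
    """MIL: a bag is positive if at least one of its instances is positive."""
    return int(any(label == 1 for label in instance_labels))

def label_proportion(instance_labels):
    """Label proportions: p_k is the mean of the binary labels in the group."""
    return sum(instance_labels) / len(instance_labels)

# Coarse-grained labels: a many-to-one mapping z = f(y) with |Z| < |Y|.
FINE_TO_COARSE = {
    "siamese": "cat",
    "persian": "cat",
    "beagle": "dog",
    "poodle": "dog",
}

def coarse_label(fine_label):
    return FINE_TO_COARSE[fine_label]

print(mil_bag_label([0, 0, 1]))        # one positive instance makes the bag positive
print(label_proportion([1, 0, 1, 0]))  # half of the group is positive
print(coarse_label("beagle"))
```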

**Examples:**
- **Image Classification:** An image might be labeled as "cat," but the precise location of the cat within the image is not provided.
- **Text Analysis:** A document might be classified as having a positive sentiment without indicating which specific parts contribute to this sentiment.
- **Epidemiology:** Reports may provide the prevalence of a disease in a region without recording individual-level diagnoses.

> **Note:** Inexact supervision is often integrated with other methods in a larger pipeline rather than used on its own.



### Inaccurate Supervision

<img src="images/inaccurate_supervision.png" width="70%" height="70%" style="display: block; margin-left: auto; margin-right: auto;">

**Definition:**  
Inaccurate supervision arises when the training labels contain errors or noise. Although the dataset is formally structured as in fully supervised learning, the labels $ y_i $ cannot be fully trusted due to their built-in inaccuracy.

**Sources of Inaccuracy:**
- **Human Error:** Mistakes made during manual labeling.
- **Automated Labeling Errors:** Faulty processes in automated systems.
- **Built-In Ambiguity:** Certain tasks are ambiguous by nature.
- **Deliberate Noise Injection:** Sometimes noise is intentionally introduced during training.

**Modeling Label Noise:**  
A common approach is to model the noise with a transition matrix $ T $, whose entry $ T_{ij} = P(\tilde{y} = i \mid y = j) $ is the probability that true label $ j $ is observed as label $ i $. If the true label $ y $ is one-hot encoded, the distribution of the noisy label $ \tilde{y} $ is:
$$
P(\tilde{y}) = T y
$$

#### Types of Label Noise

1. **Random Noise:**
   - **Description:** Labels are flipped or misassigned at random.
   - **Binary Classification Example:** The noise transition matrix for binary labels might be:
     $$
     T = \begin{bmatrix}
     1-\rho_1 & \rho_2 \\
     \rho_1 & 1-\rho_2
     \end{bmatrix}
     $$
     Here, $ \rho_1 $ is the probability of flipping a label from 0 to 1 and $ \rho_2 $ is the probability of flipping from 1 to 0.

2. **Systematic Noise:**
   - **Description:** Errors occur in a patterned or biased manner, often due to consistent misunderstandings.
   - **Modeling Approach:**
     $$
     P(\tilde{y} | y) = g(y)
     $$
     where $ g(y) $ represents a function that captures the error pattern based on the true label $ y $.

3. **Instance-Dependent Noise:**
   - **Description:** The mislabeling probability varies with the features of each instance.
   - **Modeling Approach:**
     $$
     P(\tilde{y} | y, x) = h(y, x)
     $$
     Here, $ x $ denotes the features of the instance, indicating that the likelihood of a label being noisy depends on the instance itself.
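
A short simulation can make the binary transition matrix tangible. The sketch below builds $ T $ from two flip rates, corrupts a list of clean labels, and measures the empirical flip rates. The specific rates (0.1 and 0.3), the sample size, and the helper names are assumptions chosen for the example.

```python
import random

def transition_matrix(rho_1, rho_2):
    """Column j holds P(noisy label | true label = j), as in the matrix T above."""
    return [[1 - rho_1, rho_2],
            [rho_1, 1 - rho_2]]

def corrupt(y, T, rng):
    """Sample a noisy label for true label y using column y of T."""
    p_noisy_is_1 = T[1][y]
    return int(rng.random() < p_noisy_is_1)

rng = random.Random(0)
T = transition_matrix(rho_1=0.1, rho_2=0.3)

n_per_class = 5000
true_labels = [0] * n_per_class + [1] * n_per_class
noisy_labels = [corrupt(y, T, rng) for y in true_labels]

pairs = list(zip(true_labels, noisy_labels))
flip_0_to_1 = sum(noisy == 1 for true, noisy in pairs if true == 0) / n_per_class
flip_1_to_0 = sum(noisy == 0 for true, noisy in pairs if true == 1) / n_per_class

print(f"0 -> 1: {flip_0_to_1:.3f} (target 0.1), 1 -> 0: {flip_1_to_0:.3f} (target 0.3)")
```

Instance-dependent noise would replace the fixed column lookup with a probability that also depends on the features of each example.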

**Challenges Introduced by Inaccurate Supervision:**
- The risk of the model learning erroneous patterns from noisy labels.
- Difficulties in distinguishing reliable patterns from noise.
- The potential for overfitting to incorrect labels.
- Reduced performance and generalization ability.

**Research Directions:**
- Better methods for estimating the noise transition matrix $ T $.
- Development of robust loss functions that account for label noise, for example:
  $$
  \mathcal{L}_{\text{robust}}(\theta) = \mathbb{E}_{(x,\tilde{y})}\left[\ell(f_\theta(x), \tilde{y}) \mid T \right]
  $$
- Investigation of theoretical bounds for learning with noisy data.
- Confident learning methods, which use the model's predicted probabilities to identify and handle likely label errors.
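
One way a loss function can account for label noise is to pass the model's clean-class probabilities through $ T $ before comparing them with the noisy label, an idea often called forward correction in the label-noise literature. The sketch below is a minimal version that assumes $ T $ is known and uses the convention $ T_{ij} = P(\tilde{y} = i \mid y = j) $; it is one possible instance of a noise-aware loss, not necessarily the one the formula above has in mind.

```python
import math

def forward_corrected_loss(probs, noisy_label, T):
    """Cross-entropy on the noisy label after mapping clean-class probabilities
    through T, where T[i][j] = P(noisy = i | true = j)."""
    noisy_probs = [
        sum(T[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(T))
    ]
    return -math.log(noisy_probs[noisy_label])

identity = [[1.0, 0.0], [0.0, 1.0]]   # no noise: reduces to plain cross-entropy
T = [[0.9, 0.3], [0.1, 0.7]]          # assumed rates: 10% of 0s and 30% of 1s flip

probs = [0.2, 0.8]                    # model believes the true class is 1
observed = 0                          # but the (possibly noisy) label says 0

plain = -math.log(probs[observed])
corrected = forward_corrected_loss(probs, observed, T)
print(f"plain={plain:.3f} corrected={corrected:.3f}")
```

Because flips from 1 to 0 are common under this `T`, the corrected loss penalizes the model less for disagreeing with an observed 0 than plain cross-entropy does.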


## Overcoming the Unavailability of Labels: Weak Supervision

Weak supervision offers a useful method to overcome the problem of limited labeled data in machine learning. It uses domain expertise and a set of rules or heuristics to generate labels for large datasets quickly and efficiently.

### Key Concepts of Weak Supervision

1. **Programmatic Labeling**
   - **Heuristic Labeling Functions:** Instead of manual annotations, labels are produced by applying rules that represent domain knowledge. These rules assign labels automatically to data points based on specific criteria.
   - **Efficiency:** This strategy allows practitioners to generate vast numbers of labels rapidly, which is especially valuable when expert human labeling is costly or time-consuming.

2. **Generalization from Weak Labels**
   - **Learning Beyond Simple Patterns:** Although the initial labels come from heuristics that encode specific domain rules, classifiers trained on these labels can learn to capture more complex relationships within the data.
   - **Transition from Weak to Strong:** Over time, the model may generalize well beyond the limitations of the original labeling functions.

3. **Quantity vs. Quality Trade-off**
   - **Data Volume:** Empirical results indicate that about twice as many weakly labeled samples may be needed to achieve the performance seen with manually labeled data.
   - **Cost Benefits:** Despite the requirement for more data, the significant reduction in labeling cost justifies the approach.

4. **Scalability**
   - **Large-Scale Applications:** The method is highly scalable: once the labeling functions are written, generating a million labels costs little more than generating a single one.
   - **Cost Efficiency:** This efficiency makes weak supervision accessible for projects involving large datasets.
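
To give a feel for programmatic labeling, here is a minimal sketch of heuristic labeling functions for a hypothetical spam-detection task. Each function either votes for a class or abstains. The rules, the `ABSTAIN`/`HAM`/`SPAM` constants, and the example texts are invented for illustration; real labeling functions would encode actual domain knowledge and are usually much less crude.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_free(text):
    """Vote SPAM when the word 'free' appears anywhere in the text."""
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    """Vote SPAM when the text has three or more exclamation marks."""
    return SPAM if text.count("!") >= 3 else ABSTAIN

def lf_greeting(text):
    """Vote HAM when the text opens with a greeting (a deliberately crude prefix check)."""
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]

def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]

texts = [
    "Win a FREE phone now!!!",
    "Hello, are we still meeting tomorrow?",
    "Quarterly report attached.",
]
label_matrix = apply_lfs(texts)

# Coverage: share of texts that received at least one non-abstain vote.
coverage = sum(any(v != ABSTAIN for v in row) for row in label_matrix) / len(texts)

for text, row in zip(texts, label_matrix):
    print(row, text)
print(f"coverage={coverage:.2f}")
```

Applying the same functions to a million texts only requires a longer list, which is the scalability argument made above.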

### Effectiveness of Weak Supervision

Experimental evidence supports the practical use of weak supervision:

<div style="text-align: center;">
<img src="images/ws1.png" width="70%" height="70%">
</div>

<div style="text-align: center;">
<img src="images/ws2.png" width="70%" height="70%">
</div>

<div style="text-align: center;">
<img src="images/ws3.png" width="70%" height="70%">
</div>

These images show that weak supervision can achieve performance similar to, or even better than, traditional supervised learning methods in various tasks.

### Aggregating Noisy Labels

A key challenge of weak supervision is how to combine labels from various heuristic sources, which might be noisy or partially conflicting:

1. **Simple Aggregation Methods**
   - **Majority Voting:** A basic method where the label selected by most heuristics is chosen. However, it treats every rule as equally reliable and independent, so it ignores differences in accuracy and correlations between rules.

2. **Advanced Aggregation Techniques**
   - **Graph Neural Networks and Matrix Completion:** These advanced strategies better capture the dependencies among the different labeling functions, leading to more reliable aggregated labels.

3. **Unsupervised Accuracy Estimation**
   - **Estimating Performance Without Ground Truth:** It is possible to estimate the accuracy and interdependencies of heuristics even when no true labels are available. This estimation helps to flag which rules are more reliable.

4. **Calibration with a Small Labeled Set**
   - **Validation:** Typically, a small set of manually labeled observations (around 400-600) is used to calibrate the labeling functions and verify their performance.
   - **Refinement:** This set is critical to adjust the heuristics and ensure that the resulting labels are trustworthy.

5. **Probabilistic Labeling**
   - **Confidence Estimates:** The outcome of the aggregation is a probabilistic label for each observation, which reflects the uncertainty inherent in using weak labels.
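
The simplest aggregation strategies can be sketched in a few lines. The example below takes a label matrix (rows are observations, columns are labeling-function votes, with `-1` meaning abstain) and produces both a hard majority-vote label and a crude probabilistic label based on vote shares. The matrix values and function names are assumptions for illustration, and the vote-share estimate is a stand-in for a proper label model, which would also weigh each rule by its estimated accuracy.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(votes):
    """Most common non-abstain vote; ABSTAIN when nobody votes or the top two tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def vote_fractions(votes, n_classes=2):
    """Crude probabilistic label: share of non-abstain votes per class (uniform if none)."""
    active = [v for v in votes if v != ABSTAIN]
    if not active:
        return [1 / n_classes] * n_classes
    return [active.count(c) / len(active) for c in range(n_classes)]

label_matrix = [
    [1, 1, -1],     # two votes for class 1
    [0, 1, 1],      # conflicting votes, class 1 wins
    [-1, -1, -1],   # every rule abstains
    [0, 1, -1],     # tie between the classes
]

hard_labels = [majority_vote(row) for row in label_matrix]
soft_labels = [vote_fractions(row) for row in label_matrix]

for row, hard, soft in zip(label_matrix, hard_labels, soft_labels):
    print(row, hard, [round(p, 2) for p in soft])
```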


### Asymptotic Scaling Behavior

The scaling behavior of weak supervision has a mathematical foundation:

- **Generalization Error:** Studies, such as those by [Ratner et al.](http://arxiv.org/abs/1810.02840), demonstrate that as the number of unlabeled data points $ n $ increases, the generalization error decreases at a rate of $ O(n^{-\frac{1}{2}}) $. This rate is similar to that seen in traditional supervised learning with manually labeled data.
- **Label Model Impact:** When unlabeled data is used in combination with a well-calibrated label model $ \hat{\mu} $, the performance converges to that of models trained with fully labeled datasets.

This mathematical insight reassures us that practical use of weak supervision can yield reliable predictive performance when sufficient data is available.
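
As a quick worked example of what this rate implies, suppose (purely for illustration) that the error behaves like $ \varepsilon(n) = C\, n^{-\frac{1}{2}} $ for some constant $ C $, where $ n $ is the number of unlabeled data points. Then quadrupling the data halves the error:

$$
\frac{\varepsilon(4n)}{\varepsilon(n)} = \frac{C\,(4n)^{-\frac{1}{2}}}{C\,n^{-\frac{1}{2}}} = 4^{-\frac{1}{2}} = \frac{1}{2}
$$

By the same reasoning, reducing the error by a factor of 10 would require roughly 100 times more data. The constant $ C $ and the exact form of $ \varepsilon(n) $ are assumptions here; the cited result is a statement about the order of the rate, not about specific constants.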


### Reliability of Weakly Labeled Datasets

A common concern is the trustworthiness of weakly labeled datasets. It is important to recognize that human-annotated datasets also contain errors. For example, studies report the following label error rates for widely used datasets ([Northcutt et al, 2021](http://arxiv.org/abs/2103.14749)):

- **CIFAR-100:** 5.85%
- **ImageNet:** 5.83%
- **Google QuickDraw:** 10.12%
- **IMDB Reviews:** 2.9%
- **Amazon Reviews:** 3.9%

> **Note:** Even benchmark datasets widely accepted in the research community are not free from labeling errors, emphasizing the necessity for methods that can identify and correct such mistakes in both human and weak supervision pipelines.

### Theoretical Foundation

The method's reliability is based on the observation that labeling errors are not random. Mislabeling is more common between similar classes. For example, consider the following inequality from [Angluin and Laird (1988)](https://dl.acm.org/doi/10.1023/A%3A1022873112823):

$$
P(\tilde{y}_{\text{tree photo}} \mid y^*_{\text{flower photo}}) > P(\tilde{y}_{\text{tree photo}} \mid y^*_{\text{computer photo}})
$$

This inequality implies that mislabeling between classes such as trees and flowers (which share similarities) is more likely compared to mislabeling between trees and computers (which are very different).

### Real-World Application

An illustrative case comes from a project I led with the Brazilian National Council of Justice and the United Nations Development Programme:

- **Application:** Classification of legal data related to environmental crimes.
- **Data Volume:** 135,668 data points across 20 classification tasks.
- **Efficiency:** A single legal expert created the labeling functions, completing the process in just a few days.
- **Performance:** The generated labels achieved a correctness rate between 93% and 100%, exceeding the quality found in benchmark datasets referenced in [Northcutt et al, 2021](http://arxiv.org/abs/2103.14749).

This example shows how weak supervision can be effectively applied even in specialized fields with minimal expert input.


## Takeaways
- **Data Quality Over Model Complexity:**  
  The performance of AI systems greatly benefits from investments in data curation, augmentation, and governance rather than continuous innovations in model design. Simple improvements in the input data can drive significant performance gains.

- **Scalable and Cost-Efficient Approaches:**  
  Embracing programmatic labeling and weak supervision techniques allows for the creation of large, effective datasets without the heavy costs and time commitments of manual labeling. This makes AI development more practical for large-scale and sensitive applications.

- **Collaborative and Iterative Process:**  
  Combining automated methods with continuous SME input leads to more nuanced, accurate models. Regular feedback and teamwork not only simplify the labeling process but also ensure that AI systems remain aligned with real-world requirements.

- **Adaptability in a Dynamic Environment:**  
  The data-centric approach supports continuous updates and improvements. This adaptability is vital in domains where information evolves rapidly, ensuring that AI systems remain relevant and accurate over time.

- **Broad Applicability of Weak Supervision:**  
  Empirical evidence and theoretical insights demonstrate that weak supervision is not merely a stopgap for when labeled data is scarce. It can even outperform traditional methods in various contexts by effectively utilizing large volumes of weakly labeled data.

# Questions

1. What is the main shift in focus in modern AI development, moving from model-centric to data-centric approaches, and why is this shift occurring?

2.  Contrast the conventional model-centric approach to AI development with the emerging data-centric approach, highlighting their primary focuses and assumptions.

3.  Explain why programmatic labeling is becoming a necessary principle in data-centric AI, outlining the limitations of manual data labeling that it addresses.

4.  Describe the essential role of Subject Matter Experts (SMEs) in data-centric AI development and how their involvement enhances the quality and relevance of AI systems.

5.  According to the principles of data-centric AI, what are the key strategies for enhancing AI development by focusing on data, beyond just collecting more data?

6.  Define weak supervision in the context of machine learning and explain its purpose in overcoming the limitations of fully supervised learning.

7.  Categorize and briefly describe the three main types of weak supervision as defined in the notebook content.

8.  Explain the concept of programmatic labeling functions in weak supervision and how they contribute to efficient data labeling.

9.  What is the key trade-off between data quantity and quality in weak supervision, and how is comparable performance achieved despite potentially noisy labels?

10.  Based on the notebook content, what are the key benefits of adopting a data-centric approach to AI development, and how does it impact the overall AI development lifecycle?

`Answers are commented inside this cell.`

<!-- 1. The main shift is from focusing primarily on model architecture improvements to emphasizing data quality and scale. This shift is occurring because significant performance improvements are now being realized through better and more abundant training data, even with relatively simpler models.

2. The model-centric approach prioritizes developing and improving model architectures, with data being secondary. The data-centric approach, conversely, emphasizes that substantial AI performance improvements come from high-quality, large datasets, making data handling of primary importance.

3. Programmatic labeling is necessary due to the challenges of manual labeling, including limitations in scalability, cost inefficiency, time constraints, and difficulties in handling data privacy, specialized expertise needs, and evolving information.

4. Subject Matter Experts (SMEs) are essential because their domain knowledge refines data interpretation, feature engineering, and model evaluation, ensuring that model outputs are accurate and practically relevant. They provide context beyond algorithms and data alone.

5. Key strategies include focusing on data quality, data augmentation and synthesis, efficient data labeling methods like weak supervision, data governance and ethics, and continuous data improvement through regular updates and assessments.

6. Weak supervision refers to training models with labels that are incomplete, inexact, or inaccurate compared to fully supervised learning. It aims to reduce the cost and effort of obtaining large, high-quality labeled datasets.

7. The three types are: Incomplete Supervision, where only a subset of data is labeled; Inexact Supervision, where labels are imprecise or aggregated (e.g., multiple instance learning); and Inaccurate Supervision, where labeled data contains noise and errors.

8. Programmatic labeling functions involve using rules, heuristics, or automated methods based on domain knowledge to assign labels to data points. This allows for rapid generation of labels, making the labeling process more efficient and adaptable compared to manual labeling.

9. The trade-off is that weak supervision may require a larger dataset compared to fully supervised learning. However, by utilizing aggregation techniques and probabilistic labels to adjust for uncertainty, comparable performance can be achieved with significantly reduced labeling costs and time.

10. Key benefits include improved AI system performance through better data, scalable and cost-efficient data creation, enhanced model robustness and reliability, adaptability to dynamic environments, and a shift towards a more practical and efficient AI development lifecycle centered around data quality and iterative improvement. -->
