## About last time

- Research is **competitive, self-referential and commercialized**

- A lot of research is produced every day, mostly in the form of **article-shaped units**

- This focus on articles leaves behind other equally important aspects like **data, code, and other kinds of artefacts**

- Moreover, competitiveness and the lack of adequate funding lead to several issues, such as the **evaluation** of produced research (*publish or perish*), research **quality**, and **lack** of incentives.

- **Peer reviewing** should address some of these problems (e.g., research quality) but this is often not the case

- This situation calls for movements focused on **quality and reproducibility** of research, often embracing **different definitions** of these terms

## About this lecture

Reproducibility can target **different aspects** of the research pipeline.

From the conceptualization of ideas to collecting resources and doing experiments.

In this lecture, we are going to cover several aspects of data collection since it often represents the **backbone** of most produced machine learning research.

### What about data

In particular, we will cover the following aspects, which (hopefully) might be new to many of you.

- Annotation paradigms
- Requirements for collecting and annotating data
- Issues and risks when collecting data
- Evaluating annotation quality

### Why should you care about data?

Even if you are never going to annotate or collect novel resources, you will **very likely** deal with some sort of data.

$\rightarrow$ This means that you will still need to **inspect** and understand the limits of the data you are using in relation to the hypothesis you want to corroborate in your experiments!

### Data-centric AI

A recent trend **is emerging** centred on **data quality** rather than model design: [Data-centric AI](https://datacentricai.org/)

$\rightarrow$ a suite covering different aspects of data design and collection: from labeling to augmentation

#### Why?

Because collecting high-quality data is the **biggest slice of the effort cake** when solving a particular task! (*the cake is a lie!*)

Models **always find a shortcut** to solve a task: biases, correlations, cues 

$\rightarrow$ More generally: **Internal** and **external** validity of experiments [[Liao et al., 2021](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf)]

Reproducibility of findings should **avoid such leakages** in experimenting

## Let's first check your experience!

It is time for another [Google Form](https://forms.gle/uQBhrgoDM52VbKA99)! (**5 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_data-collection.png" width="600" alt='Data collection'/>
</div>
</center>

## Annotation Paradigms

Data annotation covers several aspects, such as annotator eligibility, agreement resolution, human evaluation, recommendations for mitigating harmful biases, and data annotation methodology [[Ruggeri et al., 2024]](https://arxiv.org/abs/2406.14099v2)

Annotation paradigms are a way to frame data annotation methodology where the focus is on the **intended purpose** of collected data.

In brief, an annotation paradigm can be thought of as a set of recommendations on **how to** structure the annotation process.

### Types of Annotation Paradigms

So far, we identify three main annotation paradigms.

- Prescriptive [[Rottger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/)
- Descriptive  [[Rottger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/)
- Perspectivist [[Cabitza et al., 2023]](https://ojs.aaai.org/index.php/AAAI/article/view/25840)

### An important disclaimer

Bear in mind that **no paradigm is inherently superior**!

$\rightarrow$ Explicitly aiming for one specific annotation paradigm is beneficial because it makes clear what an annotated dataset can and should be used for.

Annotation paradigms are applicable to **all** data annotation.

They can be used to compare existing datasets, and to make and communicate decisions about **how** new datasets are annotated as well as **how** annotator disagreement can be interpreted.

## Descriptive Annotation Paradigm

[[Rottger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/) argue that one particular aspect of collecting data is how many beliefs (concerning a certain task) should be modeled.

The descriptive annotation paradigm **encourages** annotator subjectivity to create datasets as granular surveys of
individual beliefs.

Descriptive data annotation thus allows for the capturing and modelling of **different beliefs**.

### Does it make sense?

Descriptive data annotation captures a multiplicity of beliefs in data labels, much like a very granular survey would.

#### Example [[Jiang et al., 2021]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256762)

Perceptions about the harmfulness of sexually explicit language vary strongly across the eight countries in their sample.

In contrast, support for mass murder or human trafficking is seen as very harmful across all countries.

$\rightarrow$ Descriptive data annotation can help to identify which entries are more **subjective**

### The good thing(s)

Annotator-level labels from descriptive data annotation have been shown to be a rich source of information for model training

$\rightarrow$ they can be used to **separately** model annotators’ beliefs.

[[Davani et al., 2021]](https://aclanthology.org/2022.tacl-1.6/) show that for tasks like abuse detection multi-annotator model architectures
outperform standard single-label approaches on single label prediction

Annotators can be studied by **grouping them** into clusters based on sociodemographic attributes [[Al Kuwatly et al., 2020]](https://aclanthology.org/2020.alw-1.21/) or polarisation
measures [[Akhtar et al., 2020]](https://ojs.aaai.org/index.php/HCOMP/article/view/7473) derived from annotator labels.

Models can be trained directly on **soft labels** (i.e., distributions of labels given by annotators), rather than hard one-hot ground truth vectors.

Descriptive data annotation facilitates **model evaluation** that accounts for different beliefs about how a model should behave.

$\rightarrow$ comparing a model prediction to a descriptive label distribution, the crowd truth can help estimate how acceptable the prediction would be to users.
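The idea of soft labels and distribution-aware evaluation can be sketched in a few lines. This is only a minimal illustration, not the setup of any of the cited works: the class names, the votes, and the helper names `soft_label` and `cross_entropy` are hypothetical.

```python
from collections import Counter
import math

def soft_label(annotations, classes):
    """Turn one entry's annotator labels into a probability distribution."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical task: four annotators label a single entry.
classes = ["hateful", "not_hateful"]
votes = ["hateful", "hateful", "not_hateful", "hateful"]

target = soft_label(votes, classes)   # distribution over classes, not a one-hot vector

# Two hypothetical model predictions for the same entry.
confident = [0.99, 0.01]    # ignores the minority belief
calibrated = [0.75, 0.25]   # mirrors the annotator distribution

loss_confident = cross_entropy(target, confident)
loss_calibrated = cross_entropy(target, calibrated)
```

Under these assumptions, the prediction that mirrors the annotator distribution receives the lower loss, whereas a one-hot target would have rewarded the over-confident one.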

### The bad thing(s)

Dataset creators must decide **who their data aims to represent**, by establishing a clear population of interest.

For instance, ask women journalists to annotate harassment targeted at them [[Arora et al., 2020]](https://aclanthology.org/2020.alw-1.2/)

Dataset creators must consider whether representativeness can practically **be achieved**.

Capturing a representative distribution of beliefs for each entry requires dozens, if not hundreds, of annotators recruited from the population of interest.

$\rightarrow$ To mitigate this, we might introduce information sharing across groups of annotators, where annotator behaviour updates group-specific priors rather than being considered in isolation

The absence of a (specified) ground truth label **complicates the interpretation** of any observed annotator disagreement.

It may be due to:

- (**desirable**) a genuine difference in beliefs

- (**undesirable**) annotator error

$\rightarrow$ this affects agreement measures, in the sense that they can **only measure subjectiveness** and not task difficulty, annotator performance, or data quality.

Descriptive annotation is fundamentally **misaligned** with standard NLP methods that rely on single gold standard labels

When datasets are constructed to be granular surveys of beliefs, reducing those beliefs to a single label, through
majority voting or otherwise, goes directly **against that purpose**

$\rightarrow$ Aggregating labels conceals informative disagreements and risks discarding minority beliefs

### Hands on session!

It is time to test your skills (and patience) as an annotator.

You will be asked to annotate a few examples concerning hate speech.

Please, click on this [Google Form](https://forms.gle/PEhH5xH2dcAEuMTe9) to annotate (**10 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_descriptive.png" width="600" alt='Descriptive hands on'/>
</div>
</center>

## Prescriptive Annotation Paradigm

The prescriptive annotation paradigm **discourages** annotator subjectivity and instead tasks annotators with encoding one specific belief.

This shared belief is formulated in the **annotation guidelines**.

### Does it make sense?

Encoding one specific belief in a dataset through data annotation is **difficult**.

$\rightarrow$ still advantageous for many practical applications

#### Example

Social media companies moderate content on their platforms according to specific and extensive content policies

$\rightarrow$ This illustrates that even for highly subjective tasks, where different model behaviours are plausible and valid, one specific behaviour may be practically desirable.

### The good thing(s)

In the prescriptive paradigm, annotator disagreements are a **call to action** when

-  the annotation guidelines are not correctly applied by annotators

-  the guidelines themselves are inadequate

$\rightarrow$ **Quality assurance** is a laborious but structured process, with inter-annotator agreement as a
useful, albeit noisy, measure of dataset quality.

More importantly, the one belief that annotators are tasked with applying is **made visible and explicit** in the
annotation guidelines.

Prescriptive annotation guidelines can provide detailed insights into **how datasets were created**, which can then
inform their downstream use.

### The bad thing(s)

Creating guidelines for prescriptive data annotation is **difficult** because it requires topical knowledge and
familiarity with the data that is to be annotated.

Guidelines would ideally provide a clear judgment on **every possible entry**, but in practice, such perfectly comprehensive guidelines can only be approximated.

Moreover, creating guidelines for prescriptive data annotation requires deciding **which one belief to encode** in the dataset.

This can be a complex process that risks disregarding non-majority beliefs if marginalised people are
not included in it.

Annotators need to **be familiar with annotation guidelines** to apply them correctly, which may require additional training, especially if guidelines are long and complex.

During annotation, annotators will need to **refer back to the guidelines**, which requires giving them sufficient time per entry and providing a well-designed annotation interface.

Annotator subjectivity can be discouraged, **but not eliminated**. 

Inevitable gaps in guidelines leave annotators no choice but to apply their personal judgement for some entries, and even when there is explicit guidance, implicit biases may persist.

To address this issue, dataset creators should work with groups of annotators that are diverse in terms of sociodemographic characteristics and personal experiences, even when annotator subjectivity is discouraged.

### Hands on session!

It is time to test your skills (and patience) as an annotator.

You will be asked to annotate a few examples concerning hate speech.

Please, click on this [Google Form](https://forms.gle/tvVmFVrLySxNgYQC7) to annotate (**15 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_prescriptive.png" width="600" alt='Prescriptive hands on'/>
</div>
</center>

## A simple remark on the prescriptive annotation paradigm [[Ruggeri et al., 2024]](https://arxiv.org/abs/2406.14099v2)

All the existing annotation methodologies following these paradigms are centered on annotating
data samples with **class labels**.

When considering a paradigm that relies on annotation guidelines like the prescriptive one, there are four main limitations

- an **information loss** during data annotation, where the specific component(s) of the guidelines used by annotators are not explicitly recorded.

-  the **lack of transparency** in evaluating annotator adherence to the guidelines, making it difficult to assess whether annotation decisions were based on the provided instructions or subjective interpretations.

- the task-specific annotations, which tie annotations to a task-specific class set, **limit the reuse of annotated data** for other tasks unless undergoing additional human annotation efforts.

- **model adherence to guidelines**, where machine learning (ML) models are trained with class supervision only despite using data annotated according to specific guidelines, thus making it unclear to what extent model learning aligns with the task description conveyed by guidelines.

## Guideline-centered annotation methodology (GCAM)

<center>
<div>
<img src="../Images/Lecture-3/GCAM.png" width="600" alt='GCAM'/>
</div>
</center>

### The good thing(s)

There are three major properties of GCAM addressing the issues of the standard annotation methodology (SAM).

-  Annotators are **unaware** of the class set $C$ and the class grounding function $r$.

$\rightarrow$ This enforces adherence by definition to the guidelines during the annotation process, as the annotators do not have access to any other information on the task

- GCAM enables using the annotated dataset for **different task annotations** that share common guidelines.

<center>
<div>
<img src="../Images/Lecture-3/GCAM-task-formulations.png" width="600" alt='GCAM-task-formulations'/>
</div>
</center>

- Datasets created following GCAM **report explicitly the knowledge** encoded in the guidelines $G$.

$\rightarrow$ This allows ML models to be effectively trained to leverage this information

$\rightarrow$ prevents the model from learning spurious correlations between $x$ and the class subset $C_x$, as
the model has no access to $C_x$ during training

$\rightarrow$ decouples $G$ and $C$ allowing for using the model for different task formulations, as its predictions can be mapped to different class sets relying on the corresponding class grounding functions

$\rightarrow$ we can evaluate the model’s performance by assessing its alignment with $G$ by analyzing disparities between predicted and ground-truth $G_x$, fostering a deeper comprehension of the model.
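The decoupling of $G$ and $C$ can be illustrated with a small sketch. Everything below is an assumption made for illustration: the guideline identifiers (`g1`, `g2`, `g3`), the two class sets, the dictionary representation of the class grounding function $r$, and the use of Jaccard overlap as an alignment score are not taken from the paper.

```python
# Hypothetical class grounding functions r: guideline id -> class.
# The same guidelines G are grounded to two different class sets C.
r_binary = {"g1": "hateful", "g2": "hateful", "g3": "not_hateful"}
r_fine = {"g1": "threat", "g2": "slur", "g3": "none"}

def ground(guideline_subset, grounding):
    """Map an annotated guideline subset G_x to a class subset C_x via r."""
    return {grounding[g] for g in guideline_subset}

def jaccard(predicted, gold):
    """Overlap between predicted and ground-truth guideline subsets."""
    return len(predicted & gold) / len(predicted | gold)

# Annotators only record which guidelines apply to each sample x.
annotations = {"x1": {"g1", "g2"}, "x2": {"g3"}}

# The same annotations serve two task formulations without re-annotating.
binary_labels = {x: ground(g_x, r_binary) for x, g_x in annotations.items()}
fine_labels = {x: ground(g_x, r_fine) for x, g_x in annotations.items()}

# Alignment with G: compare a hypothetical model prediction to the gold G_x.
alignment = jaccard({"g1"}, annotations["x1"])
```

In this toy setting, a model that predicts only `g1` for `x1` would still yield the correct binary class, yet the overlap score shows that it missed one of the guidelines annotators relied on.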

### The bad thing(s)

While GCAM can reach comparable IAA to SAM, it requires **more than 2x the time** and effort for annotators.

$\rightarrow$ this can significantly limit the size of collected datasets, which is the price of fine-grained control over annotations.

## The perspectivist annotation paradigm

The annotation process is often performed in terms of a **majority vote**; however, this has
often proved problematic

$\rightarrow$ This is done to get rid of **disagreement**

$\rightarrow$ The main idea behind disagreement removal rests on an ideal of truth according to which a “higher-quality ground truth is one in which multiple humans provide the same annotation for the same examples”

Perspectivism **counters** the removal of disagreement and, consequently, the assumption of correctness of traditionally aggregated gold-standard datasets

It proposes the adoption of methods that preserve divergence of opinions and **integrate multiple perspectives**
in the ground truthing process of ML development

Thus, the perspectivist paradigm is centred on how annotations should be handled for an intended purpose, independently of how they were collected.

### Disagreement always occurs

Real-world settings show that disagreement is **unavoidable and essentially irreducible**, especially when the
objects to classify are so complex that most of the raters can actually get them wrong, and the real experts are a minority [[Basile, 2021]](https://link.springer.com/chapter/10.1007/978-3-030-77091-4_26) [[Cabitza et al., 2019]](https://journals.sagepub.com/doi/10.1177/1460458218824705).

### Two types of perspectivism

We distinguish two types of perspectivism depending on how annotations are handled.

#### Weak perspectivism

Collect multiple annotations (i.e., labels) per sample to account for (i) annotators' errors and (ii) task variability and then **aggregate them** into a single label for training models.

#### Strong perspectivism

Collect multiple annotations (i.e., labels) per sample to account for (i) annotators' errors and (ii) task variability and then **keep them all** in the subsequent phases of training or benchmarking of the classification models.
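The difference between the two types can be sketched as two ways of handling the same raw annotations. This is a minimal sketch with hypothetical labels and helper names (`weak_aggregate`, `strong_keep`, `entropy`); entropy is used here only as one possible way to surface disagreement.

```python
from collections import Counter
import math

def weak_aggregate(labels):
    """Weak perspectivism: collapse all annotations into one majority label.

    Ties are broken by first occurrence, which is an arbitrary choice.
    """
    return Counter(labels).most_common(1)[0][0]

def strong_keep(labels):
    """Strong perspectivism: keep the full label distribution for later stages."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}

def entropy(distribution):
    """Shannon entropy (in bits) of a label distribution; 0 means full agreement."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

# Two hypothetical entries annotated by different numbers of annotators.
item_a = ["offensive", "offensive", "offensive", "offensive"]
item_b = ["offensive", "offensive", "not_offensive", "not_offensive", "offensive"]

# After aggregation both entries look identical...
weak_a, weak_b = weak_aggregate(item_a), weak_aggregate(item_b)

# ...while the preserved distributions reveal that item_b is contested.
strong_a, strong_b = strong_keep(item_a), strong_keep(item_b)
```

Both entries receive the same aggregated label, so only the preserved distributions allow a downstream model or benchmark to distinguish the unanimous entry from the contested one.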

### The good thing(s)

- Provide a tool to recognize phenomena (i.e., tasks) which exhibit a **natural ambiguity**

- Take into account **different beliefs** which are often attributed as noise

- To develop models that can leverage **label uncertainty**

- Report about **annotators' confidence** in annotations $\rightarrow$ useful for majority voting and data quality assessment

### The bad thing(s)

- Involve **several annotators** to cover different perspectives with statistical significance

- Involve annotators from **heterogeneous backgrounds** (e.g., culture)

-  **Incompatibility** with standard ML approaches, which are usually not designed to take into account multiple perspectives or annotations, so ad-hoc ML methods need to be designed

- More complex **validation**, due to the absence of a uniquely defined ground truth.

## Data selection

The data on which you carry out your experiments is **extremely important**.

- What does it contain?

- Where was the data collected?

- Are there any harmful limitations/bias/spurious correlations? $\rightarrow$ **Pre-processing on Aharoni et al., 2014** [💬].

- **Popularity != quality** $\rightarrow$ several popular datasets have been shown to have significant limitations [[Paullada et al., 2021](https://www.sciencedirect.com/science/article/pii/S2666389921001847); [Koch et al., 2021](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/3b8a614226a953a8cd9526fca6fe9ba5-Paper-round2.pdf)]

- Can you rely on some domain experts?

### Data size

- Are there any **requirements** on data size?


- **Low amount** of data? What does 'low' mean?

- Does the dataset limit your intended empirical analysis?

- Class imbalance [[Haixian et al., 2017](https://www.sciencedirect.com/science/article/pii/S0957417416307175?via%3Dihub)]

- Duplicates $\rightarrow$ ([💬] CIFAR 14% relative performance drop [[Barz et al., 2020](https://www.mdpi.com/2313-433X/6/6/41)])
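Two of the checks above, duplicates across splits and class imbalance, are cheap to run before any experiment. The sketch below assumes a text dataset and exact matching after a simple normalisation; the example data and helper names are hypothetical, and near-duplicates (such as the image duplicates discussed for CIFAR) would require a similarity measure instead.

```python
from collections import Counter

def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def cross_split_duplicates(train_texts, test_texts):
    """Return test entries whose normalised text also appears in the training split."""
    seen = {normalise(t) for t in train_texts}
    return [t for t in test_texts if normalise(t) in seen]

def imbalance_ratio(labels):
    """Ratio between the most and least frequent class counts."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical toy splits and labels.
train = ["A cat sits.", "Dogs bark loudly", "Birds fly"]
test = ["a cat  sits.", "Fish swim"]
labels = ["pos", "neg", "neg", "neg", "neg"]

leaked = cross_split_duplicates(train, test)  # test entries also seen in training
ratio = imbalance_ratio(labels)               # majority count / minority count
```

A non-empty `leaked` list means that test performance partly measures memorisation, and a large `ratio` suggests that accuracy alone may be a misleading metric.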

## Dataset collection

According to [[Paullada et al., 2021](https://www.sciencedirect.com/science/article/pii/S2666389921001847)], a **wide majority** of datasets and related data collection processes have **major flaws**.


- **Trend** of collecting large-scale datasets (quantity over quality) $\rightarrow$ *"If it is available to us, we ingest it"*

- No **clear understanding** of what these datasets/benchmarks **measure**.

- Affected by **spurious correlations** that deep learning models exploit as **shortcuts** and **overfit** to [[Geirhos et al., 2020](https://www.nature.com/articles/s42256-020-00257-z); [Storks et al. 2020](https://arxiv.org/pdf/1904.01172.pdf); [Schlegel et al., 2020](https://arxiv.org/pdf/2005.14709.pdf)] $\rightarrow$ undermine a dataset's utility (the hypothesis of reflecting human reasoning capabilities)

- **Improper** task formulation (I/O) $\rightarrow$ **Personal traits prediction from images** [💬]

- Annotators' **biases**, usually due to **lack of** guidelines or dataset documentation  $\rightarrow$ **re-building ImageNet** [💬]

- Hard to scrutinize $\rightarrow$ **abattoir effect** [[Raji et al., 2020](https://slideslive.com/38947414/harms-from-ai-research)] degrading data quality and hiding ethical issues

- Entirely focused on **benchmarking metrics** $\rightarrow$ neglecting consumption, cost, fairness, model size, limitations, multiple gold standards, etc. [[Tedeschi et al., 2023]](https://aclanthology.org/2023.acl-long.697/)

- Simple heuristics $\rightarrow$ ([💬] VQA imbalance [[Goyal et al., 2017](https://openaccess.thecvf.com/content_cvpr_2017/papers/Goyal_Making_the_v_CVPR_2017_paper.pdf)])

## Are we modeling the annotator?

[[Geva et al., 2019](https://aclanthology.org/D19-1107.pdf)] show that models overfit **on annotator-specific cues**, especially when free natural language text is allowed

<center>
<div>
<img src="../Images/Lecture-3/modeling_annotators.png" width="400" alt='modeling_annotators'/>
</div>
</center>

They advocate for annotation processes where **different annotators** are used for training and test splits, respectively

$\rightarrow$ this is **particularly important** when considering small/medium-sized datasets with a restricted number of annotators

As the number of annotators **increases**, the model is less prone to exploit annotator-specific cues
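An annotator-disjoint split can be sketched as follows. This is a minimal illustration rather than the procedure used in the cited work: the `annotator` field, the toy examples, and the function name are assumptions.

```python
import random

def annotator_disjoint_split(examples, test_fraction=0.25, seed=0):
    """Split examples so that no annotator contributes to both train and test."""
    annotators = sorted({ex["annotator"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(annotators)
    # Hold out whole annotators rather than individual examples.
    n_test = max(1, round(len(annotators) * test_fraction))
    test_annotators = set(annotators[:n_test])
    train = [ex for ex in examples if ex["annotator"] not in test_annotators]
    test = [ex for ex in examples if ex["annotator"] in test_annotators]
    return train, test

# Hypothetical dataset: 12 examples written by 4 annotators.
examples = [
    {"text": f"example {i}", "annotator": f"ann_{i % 4}"} for i in range(12)
]
train, test = annotator_disjoint_split(examples)

train_ids = {ex["annotator"] for ex in train}
test_ids = {ex["annotator"] for ex in test}
```

Since `train_ids` and `test_ids` never overlap, a model that merely memorises an annotator's writing style gains no advantage on the test split.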

## Important aspects of Annotation

There are several aspects that concern the annotation process, which are often not discussed enough.

### Iterative guideline refinement

In real-world scenarios, guidelines are subject to **continuous refinement** with human-in-the-loop helping in addressing edge cases as data is collected

Consequently, data annotation is required **at each** guideline refinement iteration, leading to substantial efforts and costs.

This process is repeated until convergence or until the time budget is reached.

### Edge case discovery and resolution

Preliminary annotation studies and annotation iterations are mainly devised to evaluate the coverage and effectiveness of guidelines.

Coverage and effectiveness are measured based on the number of difficult examples detected (i.e., topics of discussion for which high disagreement is observed)

$\rightarrow$ these examples are often denoted as **edge cases**, which must be addressed by updating guidelines and repeating the annotation phase.

## Discussion phases

Annotators undergo **eligibility tests** to assess their adequacy to participate in an annotation study

Likewise, annotators engage in discussions to compare their annotations and further assess their understanding of the task (and, if any, adherence to the guidelines)

In many cases, discussions are not sufficient to resolve disagreements (if resolution is the purpose of the discussion)

$\rightarrow$ Involve additional annotators to resolve the conflict

## How to measure success?

Inter-annotator agreement to the rescue!

- Cohen's kappa for pairs of annotators
- Krippendorff's alpha, Fleiss's kappa for multiple annotators
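Both kappa coefficients compare the observed agreement with the agreement expected by chance. The sketch below is a plain-Python illustration with hypothetical labels; it assumes complete annotations (every annotator labels every item) and does not handle the degenerate case where expected agreement equals 1.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def fleiss_kappa(item_labels):
    """Fleiss's kappa; item_labels holds one equally sized label list per item."""
    n_items = len(item_labels)
    n_raters = len(item_labels[0])
    totals = Counter()
    per_item = []
    for labels in item_labels:
        counts = Counter(labels)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item.append(agreeing / (n_raters * (n_raters - 1)))
    observed = sum(per_item) / n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (observed - expected) / (1 - expected)

# Hypothetical annotations ("h" = hateful, "n" = not hateful).
ann_1 = ["h", "h", "n", "n", "h", "n"]
ann_2 = ["h", "h", "n", "h", "h", "n"]
kappa_pair = cohen_kappa(ann_1, ann_2)

items = [["h", "h", "h"], ["n", "n", "n"], ["h", "h", "n"], ["n", "n", "h"]]
kappa_multi = fleiss_kappa(items)
```

In the pairwise example the annotators agree on five of six items, yet kappa is noticeably lower than the raw agreement rate because half of that agreement is expected by chance.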



Avoid mistakes during annotation!

**Annotation leakage** between annotators during annotation (*sounds trivial, right?*) $\rightarrow$ ([💬] Annotating in the legal domain)

## Concluding Remarks

- Data quality is paramount in machine learning $\rightarrow$ don't take quality for granted!

- Data annotation should always reflect an intended use of data $\rightarrow$ annotation paradigms 

- Data collection is inherently biased $\rightarrow$ when are biases harmful?

- There are several aspects concerning data annotation that should not be neglected

- There are several ways to measure annotator agreement $\rightarrow$ make sure you know what this agreement indicates!

# Any questions?

<center>
<div>
<img src="../Images/Lecture-2/jojo-arrivederci.gif" width="1000" alt='JOJO_arrivederci'/>
</div>
</center>