## About last time

- Research is competitive, self-referential and commercialized

- A lot of research is produced every day, mostly in the form of article-shaped units

- This focus on articles leaves behind other equally important aspects like data, code, and other kinds of artefacts

- Moreover, competitiveness and the lack of adequate funding lead to several issues, such as how produced research is evaluated (*publish or perish*), research quality, and lack of incentives.

- Peer reviewing should address some of these problems (e.g., research quality) but this is often not the case

- This situation calls for movements focused on quality and reproducibility of research, often embracing different definitions of these terms

## About this lecture

Reproducibility can target different aspects of the research pipeline.

It spans from the conceptualization of ideas to collecting resources and running experiments.

In this lecture, we are going to cover several aspects of data collection since it often represents the backbone of most produced machine learning research.

### What about data

In particular, we will cover the following aspects, which (hopefully) might be new to many of you.

- Annotation paradigms
- Requirements for collecting and annotating data
- Issues and risks when collecting data
- Evaluating annotation quality

### Why should you care about data?

Even if you are never going to annotate or collect novel resources, you will very likely deal with some sort of data.

$\rightarrow$ This means that you will still need to inspect and understand the limits of the data you are using in relation to the hypothesis you want to corroborate in your experiments!

## Let's first check your experience!

It is time for another [Google Form](https://forms.gle/uQBhrgoDM52VbKA99)! (**5 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_data-collection.png" width="600" alt='Data collection'/>
</div>
</center>

## Annotation Paradigms

Data annotation covers several aspects, such as annotator eligibility, agreement resolution, human evaluation, recommendations for mitigating harmful biases, and data annotation methodology [[Ruggeri et al., 2024]](https://arxiv.org/abs/2406.14099v2).

Annotation paradigms are a way to frame data annotation methodology where the focus is on the intended purpose of collected data.

In brief, an annotation paradigm can be thought of as a set of recommendations on how to structure the annotation process.

### Types of Annotation Paradigms

So far, we identify three main annotation paradigms.

- Prescriptive [[Röttger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/)
- Descriptive [[Röttger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/)
- Perspectivist [[Cabitza et al., 2023]](https://ojs.aaai.org/index.php/AAAI/article/view/25840)

### An important disclaimer

Bear in mind that no paradigm is inherently superior!

$\rightarrow$ Explicitly aiming for one specific annotation paradigm is beneficial because it makes clear what an annotated dataset can and should be used for.

## Descriptive Annotation Paradigm

### Hands on session!

It is time to test your skills (and patience) as an annotator.

You will be asked to annotate a few examples concerning hate speech.

Please, click on this [Google Form]() to annotate (**15 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_descriptive.png" width="600" alt='Descriptive hands on'/>
</div>
</center>

## Prescriptive Annotation Paradigm

[[Röttger et al., 2022]](https://aclanthology.org/2022.naacl-main.13/) argue that one particular aspect of collecting data is how many beliefs (concerning a certain task) should be modeled.

The prescriptive annotation paradigm discourages annotator subjectivity and instead tasks annotators with encoding one specific belief.

This shared belief is formulated in the annotation guidelines.

### Does it make sense?

Encoding one specific belief in a dataset through data annotation is difficult.

$\rightarrow$ still advantageous for many practical applications

#### Example

Social media companies moderate content on their platforms according to specific and extensive content policies

$\rightarrow$ This illustrates that even for highly subjective tasks, where different model behaviours are plausible and valid, one specific behaviour may be practically desirable.

### The good thing(s)

In the prescriptive paradigm, annotator disagreements are a call to action

-  the annotation guidelines were not correctly applied by annotators

-  the guidelines themselves were inadequate

$\rightarrow$ Quality assurance is a laborious but structured process, with inter-annotator agreement as a
useful, albeit noisy, measure of dataset quality.
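As a rough illustration of treating disagreements as a call to action, the sketch below lists the entries on which annotators did not reach unanimity, least-agreed first, so that they can be checked against the guidelines. The `items_to_review` helper, the item ids, and the toy labels are assumptions made for this example only.

```python
from collections import Counter


def items_to_review(annotations):
    """Return (item_id, agreement) pairs for non-unanimous items, least agreed first.

    `annotations` maps an item id to the list of labels given by its annotators.
    Agreement is the share of annotators who chose the most frequent label.
    """
    flagged = []
    for item_id, labels in annotations.items():
        top_count = Counter(labels).most_common(1)[0][1]
        agreement = top_count / len(labels)
        if agreement < 1.0:
            flagged.append((item_id, agreement))
    return sorted(flagged, key=lambda pair: pair[1])


# Toy annotations (hypothetical): three annotators per entry.
annotations = {
    "a": ["hate", "hate", "hate"],
    "b": ["hate", "not-hate", "hate"],
    "c": ["hate", "not-hate", "unsure"],
}

for item_id, agreement in items_to_review(annotations):
    print(f"{item_id}: agreement={agreement:.2f}")
```

Each flagged entry then calls for one of the two diagnoses above: the guidelines were misapplied, or the guidelines did not cover the case.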

More importantly, the one belief that annotators are tasked with applying is made visible and explicit in the
annotation guidelines.

Prescriptive annotation guidelines can provide detailed insights into how datasets were created, which can then
inform their downstream use.

### The bad thing(s)

Creating guidelines for prescriptive data annotation is difficult because it requires topical knowledge and
familiarity with the data that is to be annotated.

Guidelines would ideally provide a clear judgment on every possible entry, but in practice, such perfectly comprehensive guidelines can only be approximated.

Moreover, creating guidelines for prescriptive data annotation requires deciding which one belief to encode in the dataset.

This can be a complex process that risks disregarding non-majority beliefs if marginalised people are
not included in it.

Annotators need to be familiar with annotation guidelines to apply them correctly, which may require additional training, especially if guidelines are long and complex.

During annotation, annotators will need to refer back to the guidelines, which requires giving them sufficient time per entry and providing a well-designed annotation interface.

Annotator subjectivity can be discouraged, but not eliminated. 

Inevitable gaps in guidelines leave annotators no choice but to apply their personal judgement for some entries, and even when there is explicit guidance, implicit biases may persist.

To address this issue, dataset creators should work with groups of annotators that are diverse in terms of sociodemographic characteristics and personal experiences, even when annotator subjectivity is discouraged.

### Hands on session!

It is time to test your skills (and patience) as an annotator.

You will be asked to annotate a few examples concerning hate speech.

Please, click on this [Google Form]() to annotate (**15 mins**)

<center>
<div>
<img src="../Images/Lecture-3/qsn_prescriptive.png" width="600" alt='Prescriptive hands on'/>
</div>
</center>

### Guidelines

<center>
<div>
<img src="../Images/Lecture-3/prescriptive-guidelines.png" width="600" alt='Prescriptive guidelines'/>
</div>
</center>

## A simple remark on the prescriptive annotation paradigm

## The perspectivist annotation paradigm

## Step 1 - The data

- Data selection
- Understand your data
- Data size and distribution

### Data selection

The data on which you carry out your experiments is **extremely important**.

- What does it contain?

- Where was the data collected?

- Are there any limitations/bias/spurious correlations? $\rightarrow$ **The Tank problem** [💬]; **Pre-processing on Aharoni et al., 2014** [💬].

- **Popularity != quality** $\rightarrow$ several popular datasets have been shown to have significant limitations [[Paullada et al., 2021](https://www.sciencedirect.com/science/article/pii/S2666389921001847); [Koch et al., 2021](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/3b8a614226a953a8cd9526fca6fe9ba5-Paper-round2.pdf)]

- Is the data directly available or built via code?

- Are you going to collect any data?

### Understand your data

**Data is your friend**: you may spot relevant patterns that can guide your modeling. 

$\rightarrow$ However, **only look at training data!**

Otherwise, you might **impose some biases** that limit the generalization capabilities of your approach: **data leakage** [❗]
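A minimal sketch of the "only look at training data" rule, assuming toy numeric samples and hypothetical helper names (`split_train_test`, `fit_standardizer`, `standardize`): the split happens first, and every statistic used for modeling is estimated on the training portion only.

```python
import random
from statistics import mean, pstdev


def split_train_test(samples, test_ratio=0.2, seed=42):
    """Shuffle once with a fixed seed, then hold out the first `test_ratio` share."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(train):
    """Estimate normalization statistics on the training split only."""
    mu, sigma = mean(train), pstdev(train)
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply statistics that were fitted elsewhere (here: on the training split)."""
    return [(v - mu) / sigma for v in values]


data = [float(x) for x in range(100)]  # toy data, purely illustrative
train, test = split_train_test(data)

mu, sigma = fit_standardizer(train)       # statistics come from train only
train_std = standardize(train, mu, sigma)
test_std = standardize(test, mu, sigma)   # test is transformed, never used for fitting
```

Fitting the statistics on the full dataset instead would let information from the test split flow into the model inputs, which is one common form of data leakage.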

**Domain experts** are a valuable resource for understanding your data!

### Data size

- Are there any **requirements** on data size?


- **Low amount** of data? What does 'low' mean?

- Do you need specific solutions concerning data size? $\rightarrow$ cross-validation, data-augmentation [[Wong et al., 2016](https://ieeexplore.ieee.org/document/7797091); [Shorten and Khoshgoftaar, 2019](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0)], transfer learning, simple models.

- Class imbalance [[Haixiang et al., 2017](https://www.sciencedirect.com/science/article/pii/S0957417416307175?via%3Dihub)]
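Two of the points above, cross-validation and class imbalance, can be sketched in a few lines. The helpers below are illustrative assumptions rather than a prescribed recipe: `balanced_class_weights` uses the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, and `stratified_folds` keeps the class proportions roughly equal across folds.

```python
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy imbalanced label set (hypothetical): 90 negatives, 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10

weights = balanced_class_weights(labels)
folds = stratified_folds(labels, k=5)

for fold_id, fold in enumerate(folds):
    n_pos = sum(labels[i] == "pos" for i in fold)
    print(f"fold {fold_id}: {len(fold)} samples, {n_pos} positives")
```

Without stratification, a random fold of 20 samples could easily contain no positive example at all, which makes per-fold scores for the minority class meaningless.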

## Dataset collection

According to [[Paullada et al., 2021](https://www.sciencedirect.com/science/article/pii/S2666389921001847)], a **wide majority** of datasets and related data collection processes have **major flaws**.


- **Trend** towards collecting large-scale datasets (quantity over quality) $\rightarrow$ *"If it is available to us, we ingest it"*

- No **clear understanding** of what these datasets/benchmarks **measure**.

- Affected by **spurious correlations** that deep learning models exploit as **shortcuts** and **overfit** to [[Geirhos et al., 2020](https://www.nature.com/articles/s42256-020-00257-z); [Storks et al., 2020](https://arxiv.org/pdf/1904.01172.pdf); [Schlegel et al., 2020](https://arxiv.org/pdf/2005.14709.pdf)] $\rightarrow$ undermines a dataset's utility (the hypothesis of reflecting human reasoning capabilities)

- **Improper** task formulation (I/O) $\rightarrow$ **Personal traits prediction from images** [💬]

- Annotators' **biases**, usually due to **lack of** guidelines or dataset documentation $\rightarrow$ **re-building ImageNet** [💬]

- Hard to scrutinize $\rightarrow$ **abattoir effect** [[Raji et al., 2020](https://slideslive.com/38947414/harms-from-ai-research)] degrading data quality and hiding ethical issues

- Entirely focused on **benchmarking metrics** $\rightarrow$ neglecting consumption, cost, fairness, model size, limitations, multiple gold standards, etc.

- (Scientific) progress is made by **trusted research**, where that trust rests on research being reproducible.

## Dataset creation

What might **go wrong**? How to follow a **correct** approach?

- Initial motivation: do we really need to create a new dataset?

- What data to collect?

- For what purpose(s)?

- **How to collect data**? annotators' selection, training, guidelines definition, pilot studies

- How to tell whether a data collection was **'successful'**? Size, quality, coverage, biases, imbalance

- Can we estimate human performance?

- Ethical considerations

### Data-centric AI

A recent trend **is emerging** centred on **data quality** rather than model design: [Data-centric AI](https://datacentricai.org/)

$\rightarrow$ a suite covering different aspects of data design and collection: from labeling to augmentation

#### Why?

Because collecting high-quality data is the **biggest slice of the effort cake** when solving a particular task! (*the cake is a lie!*)

Models **always find a shortcut** to solve a task: biases, correlations, cues 

$\rightarrow$ More in general: **Internal** and **external** validity of experiments [[Liao et al., 2021](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf)]

Reproducible findings require **avoiding such leakages** during experimentation

### Data collection

Data collection is a **time-consuming** process whose *success* also depends on the **complexity** of the task

It is **crucial** to bridge the gap between the 'theoretical aspects' of a problem and its 'implementation' (a dataset) $\rightarrow$ biases are around the corner!

Some bad quality indicators (**non-exhaustive**):

- Low inter-annotator agreement
- Duplicates $\rightarrow$ ([💬] CIFAR 14% relative performance drop [[Barz et al., 2020](https://www.mdpi.com/2313-433X/6/6/41)])
- Labeling errors
- Annotators' bias
- Incorrect feature selection
- Simple heuristics $\rightarrow$ ([💬] VQA imbalance [[Goyal et al., 2017](https://openaccess.thecvf.com/content_cvpr_2017/papers/Goyal_Making_the_v_CVPR_2017_paper.pdf)])

<center>
<div>
<img src="Images/Lecture-2/jojo-yare-yare-daze.gif" width="1200" alt='JOJ_YAREYARE'/>
</div>
</center>
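Some of these indicators can be checked automatically. As a hedged example for the duplicates case, the sketch below flags test samples that also occur in the training split after light normalization. It only catches (near-)exact textual copies and is not the near-duplicate image procedure of Barz et al.; the `fingerprint` and `find_cross_split_duplicates` helpers and the toy sentences are assumptions for illustration.

```python
import hashlib


def fingerprint(text):
    """Hash a lightly normalized version of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_cross_split_duplicates(train, test):
    """Return the test samples whose fingerprint also occurs in the training split."""
    seen = {fingerprint(sample) for sample in train}
    return [sample for sample in test if fingerprint(sample) in seen]


# Toy splits (hypothetical): the first test sentence differs from a training
# sentence only in casing and spacing.
train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the cat  sat on the mat.", "Birds sing at dawn."]

duplicates = find_cross_split_duplicates(train, test)
print(duplicates)
```

Test samples flagged this way inflate reported performance, since the model has already seen them during training.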

#### How to assign a label to a sample? [❓]

We consider **multiple annotators** and assign them samples to annotate

#### How to do that? [❓]

We need a set of **instructions**!

### Example: Subjectivity Detection

Consider the task of annotating a sentence as subjective or objective (w.r.t. its writer)

I (the author) assign you (the annotator) the following guideline:

*''A sentence is subjective if it conveys an opinion''*

#### Do you think this is detailed enough? Would you understand it? [❓]

We need to build appropriate annotation guidelines to allow:
   - High-quality annotation $\rightarrow$ annotators **fully understand** your task definition
   - **Reproducibility** (recall ImageNet reconstruction issues)

<center>
<div>
<img src="Images/Lecture-2/subjectivity.png" width="850" alt='subjectivity'/>
</div>
</center>

How do we achieve that? [❓]

- Follow a specific annotation paradigm (e.g., prescriptive vs descriptive [[Röttger et al., 2022](https://aclanthology.org/2022.naacl-main.13.pdf)])
- Multiple pilot studies $\rightarrow$ for **edge case discovery** and **resolution**
- Guidelines refinement
- Discussion phases

**Note**: it takes a lot of time!

**Note**: biases are **always there** $\rightarrow$ limited sources, limited data, limited time span, arbitrary guidelines definition, etc.

### How to measure success?

Inter-annotator agreement to the rescue!

- Cohen's kappa for pairs of annotators
- Krippendorff's alpha, Fleiss's kappa for multiple annotators
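To make the two kappa coefficients concrete, here is a small pure-Python sketch under stated assumptions: categorical labels, no missing annotations, and (for Fleiss' kappa) the same number of ratings per item. The function names and the toy subjective/objective labels (`"s"`/`"o"`) are illustrative; established libraries offer tested implementations that should be preferred in practice.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item, all of equal length."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("Every item needs the same number (>= 2) of ratings.")
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of agreeing annotator pairs for this item.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # a single label was used for every rating
    return (p_bar - p_e) / (1 - p_e)


annotator_1 = ["s", "s", "o", "o", "s", "o", "s", "o"]
annotator_2 = ["s", "o", "o", "o", "s", "o", "s", "s"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5 (observed 0.75, chance 0.5)

three_annotators = [["s", "s", "o"], ["o", "o", "s"], ["s", "s", "s"]]
print(fleiss_kappa(three_annotators))
```

Both coefficients correct the raw agreement rate for the agreement expected by chance, which is why 75% raw agreement in the first example only yields a kappa of 0.5.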



Avoid mistakes during annotation!

**Annotation leakage** between annotators during annotation (*sounds trivial, right?*) $\rightarrow$ ([💬] Annotating in the legal domain)

### How to ease data maintenance?

We have already seen datasheets $\rightarrow$ that's the way to do it

In addition:

- Proper documentation $\rightarrow$ ([💬] IBM2015 corpus [[Rinott et al., 2015](https://aclanthology.org/D15-1050.pdf)])
- Guidelines availability
- Reporting individual annotators' labels $\rightarrow$ quality assurance [[Röttger et al., 2022](https://aclanthology.org/2022.naacl-main.13.pdf), [Teruel et al., 2018](https://aclanthology.org/L18-1640.pdf)]

## Modeling uncertainty in subjective tasks

[[Davani et al., 2022](https://aclanthology.org/2022.tacl-1.6.pdf)] propose to address annotation disagreement by modeling multiple annotations with a multi-task approach $\rightarrow$ majority voting is not representative of the individual annotations

<center>
<div>
<img src="Images/Lecture-2/modeling_disagreement.png" width="1600" alt='modeling_disagreement'/>
</div>
</center>

<center>
<div>
<img src="Images/Lecture-2/modeling_uncerntainty_example.png" width="1800" alt='modeling_uncerntainty_example'/>
</div>
</center>

The multi-task approach **better correlates** with uncertainty estimation of annotators' labels
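The sketch below is not the multi-task model of Davani et al.; it only illustrates, on toy data, what majority voting discards. Two entries receive the same aggregated label although annotators fully agree on one and disagree on the other. The helper names, label names, and the entropy-based disagreement score are assumptions chosen for this example.

```python
import math
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def label_distribution(labels, classes):
    """Share of annotators choosing each class, in the order given by `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def normalized_entropy(distribution):
    """Shannon entropy divided by log(#classes): 0 = full agreement, 1 = maximal disagreement."""
    if len(distribution) < 2:
        return 0.0
    entropy = sum(-p * math.log(p) for p in distribution if p > 0)
    return entropy / math.log(len(distribution))


classes = ["hate", "not-hate"]
annotations = {
    "item-1": ["hate", "hate", "hate"],
    "item-2": ["hate", "hate", "not-hate"],
}

for item_id, labels in annotations.items():
    dist = label_distribution(labels, classes)
    print(item_id, majority_vote(labels), dist, round(normalized_entropy(dist), 3))
```

Keeping the per-annotator labels (or at least their distribution) is what makes uncertainty estimates of this kind possible at all.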

## Are we modeling the annotator?

[[Geva et al., 2019](https://aclanthology.org/D19-1107.pdf)] show that models overfit **on annotators' specific cues**, especially when free natural language text is allowed

<center>
<div>
<img src="Images/Lecture-2/modeling_annotators.png" width="900" alt='modeling_annotators'/>
</div>
</center>

They advocate for annotation processes where **different annotators** are used for training and test splits, respectively

$\rightarrow$ this is **particularly important** when considering small/medium-sized datasets with a restricted number of annotators

As the number of annotators **increases**, the model is less prone to exploit annotators' specific cues
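An annotator-disjoint split can be sketched as follows, assuming each sample records who annotated it. The `split_by_annotator` helper and the `text`/`label`/`annotator` keys are hypothetical names for this example and are not taken from Geva et al.

```python
import random


def split_by_annotator(samples, test_ratio=0.25, seed=0):
    """Split samples so that no annotator contributes to both train and test.

    `samples` is a list of dicts with (hypothetical) keys 'text', 'label', 'annotator'.
    """
    annotators = sorted({sample["annotator"] for sample in samples})
    rng = random.Random(seed)
    rng.shuffle(annotators)
    n_test = max(1, round(len(annotators) * test_ratio))
    test_annotators = set(annotators[:n_test])
    train = [s for s in samples if s["annotator"] not in test_annotators]
    test = [s for s in samples if s["annotator"] in test_annotators]
    return train, test


# Toy dataset (hypothetical): four annotators, two samples each.
samples = [
    {"text": f"sentence {i}", "label": i % 2, "annotator": f"ann-{i % 4 + 1}"}
    for i in range(8)
]

train, test = split_by_annotator(samples)
print(sorted({s["annotator"] for s in train}))
print(sorted({s["annotator"] for s in test}))
```

The split is made over annotators rather than over samples, so the resulting test share in terms of samples depends on how many entries each held-out annotator produced.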

# Any questions?

<center>
<div>
<img src="../Images/Lecture-2/jojo-arrivederci.gif" width="1000" alt='JOJO_arrivederci'/>
</div>
</center>