## About last time

- Data is a crucial part of machine learning $\rightarrow$ always **check the data** you use!

- Annotation paradigms encode recommendations concerning the **intended purpose** of collected data

- Prescriptive paradigm encourages **one shared belief**; Descriptive and perspectivist encourage **different distinct beliefs**

- Data collection is difficult, time-consuming and **always** biased $\rightarrow$ in which cases are biases harmful?

- The trend of using as much data as possible while focusing on a limited set of metrics $\rightarrow$ high risk of **limited analysis**

## About this lecture

We talked a lot about data and yet, certainly, not enough.

However, there are other aspects of a machine learning pipeline where talking about reproducibility is essential: modeling, experimenting, and evaluation.

### What we are going to talk about

- Data partitioning
- Data leakage
- Seeding
- Performance comparison
- Metrics

## Data Partitioning

Data partitioning is a **crucial step** since model performance is computed on the resulting **dev** and **test** splits.

Suppose you have some available data (e.g., you created it)

[❓] **How** should we split data?

[❓] For **what purpose** are we splitting data?

It is common practice to collect a dataset and build train, dev and test splits

These splits are usually built by following a specific criterion

- Order of collection
- Time order
- Domain-specific information (e.g., different speakers, topics, sources)
- Random!

These splits are formally known as **standard splits**

This process is perfectly fine: we have the data, we follow the splits and train/evaluate our models of interest
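
As a minimal sketch of two of the criteria above, the snippet below builds train/dev/test splits either by keeping the existing order (e.g., order of collection or time order) or by shuffling first. The function names, the 80/10/10 ratios and the toy data are assumptions made for illustration only.

```python
import random

def ordered_split(samples, ratios=(0.8, 0.1, 0.1)):
    """Cut samples into train/dev/test while keeping their current order."""
    n = len(samples)
    n_train = round(n * ratios[0])
    n_dev = round(n * ratios[1])
    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]  # whatever is left goes to test
    return train, dev, test

def random_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle a copy of the samples with a fixed seed, then cut it."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return ordered_split(shuffled, ratios)

data = list(range(100))  # toy dataset: sample ids in collection order
train, dev, test = ordered_split(data)
r_train, r_dev, r_test = random_split(data, seed=42)
```

Both functions return disjoint splits that cover the whole dataset; only the criterion that decides which sample lands where changes.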

### Are standard splits good?

[❗] [[Gorman & Bedrick, 2019](https://aclanthology.org/P19-1267.pdf)] showed that there's a **relevant discrepancy** between standard and random splits

In particular, they argue in favor of **random splits** to obtain a better estimate of model performance on new data samples (i.e., a hypothetical real-world scenario)

[❗] However, [[Søgaard et al., 2021](https://aclanthology.org/2021.eacl-main.156v2.pdf)] showed that **random splits** (and other data partitioning approaches) still provide an **over-estimate** of model performance

<center>
<div>
<img src="../Images/Lecture-4/talk_about_random_splits.png" width="2500" alt='talk_about_random_splits'/>
</div>
</center>

- Random and standard splits **under-estimate** error on new samples
- Heuristic and adversarial splits sometimes **under-estimate** error on new samples

### What should we do?

[[Søgaard et al., 2021](https://aclanthology.org/2021.eacl-main.156v2.pdf)] propose to:

1. Consider/build **multiple diverse test splits** with different biases $\rightarrow$ better evaluation
2. If (1) is not possible, it is better to consider **biased splits** to better approximate real-world performance

### Is train-dev-test partitioning correct?

[[Rob van der Goot, 2021](https://aclanthology.org/2021.emnlp-main.368.pdf)] notices

- A **clear mismatch** between classical and neural models
- **Over-estimation** on dev splits pushes researchers to evaluate on test splits only $\rightarrow$ overfitting and fast expiration of test splits

**Overfitting**: design decisions (e.g., hyper-parameters) $\rightarrow$ *bias from research design* [[Hovy & Prabhumoye, 2021](https://compass.onlinelibrary.wiley.com/doi/epdf/10.1111/lnc3.12432)]

### Model comparison

If we want to compare model A to model B, we calibrate A and B on the dev split and then evaluate them on the test split. $\rightarrow$ model architecture search

In particular, we pick the best epoch for each model based on the dev split $\rightarrow$ **early stopping**
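
A minimal sketch of early stopping on dev scores, assuming higher is better and a hypothetical `patience` parameter (number of epochs without improvement before stopping). The dev scores are invented for illustration.

```python
def early_stopping(dev_scores, patience=2):
    """Return (best_epoch, best_score) seen before training is stopped."""
    best_epoch, best_score, waited = 0, float("-inf"), 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_score

dev_f1 = [0.61, 0.68, 0.72, 0.71, 0.70, 0.75]  # made-up per-epoch dev scores
picked = early_stopping(dev_f1, patience=2)
```

Note that the picked epoch depends on `patience`: with a larger value, training would reach the last epoch, which has a higher dev score. This is one of the design decisions that gets tuned on the dev split.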

### Dev split over-estimation

If we calibrate on the dev split $\rightarrow$ **overly optimistic** performance on the dev split



Thus, regarding **model picking**

- If on the dev split: **over-estimation** since model picking and calibration are **both done** on the dev split
- If on the test split: **overfitting of design decisions**, leading to faster obsolescence of the test split

#### What can be done?

[[Rob van der Goot, 2021](https://aclanthology.org/2021.emnlp-main.368.pdf)] proposes a *tune split*

1. Early stop on tune split
2. Hyper-parameter calibration on dev split
3. Validate results on test split

This also allows a **fair comparison** with classical models since they don't use a dev split during training
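
The partitioning with an extra tune split can be sketched as follows. The 70/10/10/10 ratios and the function name are assumptions, not values taken from the paper.

```python
import random

def four_way_split(samples, ratios=(0.7, 0.1, 0.1), seed=0):
    """Split into train / tune / dev / test; test receives the remainder."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_tune = round(n * ratios[1])
    n_dev = round(n * ratios[2])
    return {
        "train": shuffled[:n_train],                    # fit model parameters
        "tune": shuffled[n_train:n_train + n_tune],     # early stopping
        "dev": shuffled[n_train + n_tune:n_train + n_tune + n_dev],  # hyper-parameters
        "test": shuffled[n_train + n_tune + n_dev:],    # final validation
    }

splits = four_way_split(range(1000))
```

Each split has exactly one role, so the dev score is no longer the quantity that early stopping was optimized for.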

<table><tr>
<td> <img src="../Images/Lecture-4/tune_split_test_performance.png" width="1100" alt='tune_split_test_performance'/> </td>
<td> <img src="../Images/Lecture-4/tune_split_dev_test_performance.png" width="1100" alt='tune_split_dev_test_performance'/> </td>
</tr></table>

### Downsides [❓]




- **More** data required!


- It is possible to **add** tune split to train data for final evaluation on the test split (early stop on dev)
    - Similar to **shared tasks**: use train + dev
    - Similar to **cross-lingual/domain** setups: source dataset dev split for model picking $\rightarrow$ **don't use all** target languages [[Artetxe et al., 2020](https://aclanthology.org/2020.acl-main.658.pdf)]

### The importance of biases

<table><tr>
<td> <img src="../Images/Lecture-4/different_time_samples.png" width="1100" alt='different_time_samples'/> </td>
<td> <img src="../Images/Lecture-4/temporal_drift.png" width="1100" alt='temporal_drift'/> </td>
</tr></table>

## Data Leakage

*''Data leakage is a spurious relationship between the independent variables and the target variable that arises as an artifact of the data collection, sampling, or pre-processing strategy.''* [[Kapoor & Narayanan, 2022](https://arxiv.org/pdf/2207.07048.pdf)]

#### Examples

- Perform variable scaling/normalization using the whole data set.
- Feature selection before data partitioning.
- Dimensionality reduction before data partitioning.
- Calibrate and test your model on the same dev/test set.
- Data augmentation before data partitioning
- Random splits with time series data or look-ahead bias
- Biased data partitioning
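
To make the first example concrete, here is a minimal sketch of standardization done the leaky way (statistics from train + test) and the safe way (statistics from train only). The toy values and helper names are assumptions for illustration.

```python
import statistics

def fit_scaler(values):
    """Estimate mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously estimated statistics."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the test samples influence the statistics used to scale the training data.
leaky_mean, leaky_std = fit_scaler(train + test)

# Safe: statistics come from the training split and are re-used on the test split.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
```

The same pattern (fit on train, apply to dev/test) holds for feature selection, dimensionality reduction and data augmentation.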

<center>
<div>
<img src="../Images/Lecture-4/data_leakage.png" width="2000" alt='data_leakage'/>
</div>
</center>

<center>
<div>
<img src="../Images/Lecture-4/kapoor_leakage.png" width="2000" alt='kapoor_leakage'/>
</div>
</center>

### [L2] Illegitimate features

The model has access to features that should **not be legitimately available** for use.

The judgement of whether the use of a given feature is legitimate for a modeling task requires **domain knowledge** and can be highly **problem specific**!

Use **domain expertise** to decide which features are suitable.

### [L3] Test set not drawn from the distribution of scientific interest

The distribution of data on which the performance of an ML model is evaluated **differs from** the distribution of data about which the scientific claims are made.

The performance of the model on the test set **does not correspond** to its performance on data drawn from the distribution of scientific interest.

### [L3.1] Temporal Leakage

The test set should **not contain** any data from a date before the training set.

If the test set contains data from before the training set, the model is built using data *''from the future"* that it should not have access to during training.

$\rightarrow$ Avoid random splitting and check for potential **look-ahead bias** [[Cerqueira et al., 2020](https://link.springer.com/article/10.1007/s10994-020-05910-7); [Wang & Ruf, 2022](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3836631)]
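
A minimal sketch of a time-aware split and of a sanity check for temporal leakage. The record layout (a `date` field) and the cutoff are hypothetical.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Train on records strictly before `cutoff`, test on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test

def has_temporal_leakage(train, test):
    """True if some test record is older than the newest training record."""
    if not train or not test:
        return False
    newest_train = max(r["date"] for r in train)
    return any(r["date"] < newest_train for r in test)

# Toy data: one document per month of 2020
records = [{"date": date(2020, m, 1), "text": f"doc-{m}"} for m in range(1, 13)]
train, test = temporal_split(records, cutoff=date(2020, 10, 1))
```

Running `has_temporal_leakage` on splits produced by a random shuffle of time-stamped data is a cheap way to spot look-ahead bias early.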

### [L3.2] Non-independence between train and test samples

Nonindependence between train and test samples constitutes leakage, unless the scientific claim is about a distribution that has the same dependence structure.

It is **quite common** that train and test samples come from the same people or unit (histopathology case [💬] [[Oner et al., 2020](https://www.medrxiv.org/content/10.1101/2020.04.23.20076406v1)])

There are solutions like 'block cross-validation' [[Roberts et al., 2017](https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881); [Valavi et al., 2021](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13107)]

$\rightarrow$ Non-independence is a **hard problem** since we might not know the **underlying dependency structure** of the task.
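
When the dependency structure is known (e.g., several samples per patient), a simple group-based split keeps all samples of a unit on the same side. This is a simplified sketch, not block cross-validation itself; the `patient` field and the 20% test ratio are assumptions.

```python
import random

def group_split(samples, group_of, test_ratio=0.2, seed=0):
    """Assign whole groups (e.g., patients) to either train or test."""
    groups = sorted({group_of(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_ratio))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test

# Toy data: 10 patients with 3 slides each
samples = [{"patient": p, "slide": s} for p in range(10) for s in range(3)]
train, test = group_split(samples, group_of=lambda s: s["patient"])
```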

### [L3.3] Sampling bias in test distribution

- **Spatial bias**: choosing data from a geographic location but making claims about model performance in other locations as well

- **Selection bias**: choosing a non-representative subset of the dataset for evaluation (autism case [💬], pneumonia prediction [💬])

### There are a lot of other biases (*a small memo*)

- **Self-selection bias**: People with specific characteristics are more likely to agree to take part in a study than others.

- **Non-response bias**: People who refuse to participate or drop out from a study systematically differ from those who take part.

- **Undercoverage bias**: Some members of a population are inadequately represented in the sample.

- **Survivorship bias**: Successful observations, people and objects are more likely to be represented in the sample than unsuccessful ones.

<center>
<div>
<img src="../Images/Lecture-4/Survivorship-bias.svg" width="850" alt='survivorship-bias'/>
</div>
</center>

## Seeding

Pseudo-random number generation is a **critical aspect** in developing experiments.

Choosing a random seed fixes the pseudo-random number generation process.

[❓] What processes are affected by a random seed? Have you ever fixed seeds?



- Neural network initialization
- Optimization
- Neural network prediction
- Data pipeline (e.g., data partitioning)
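
A toy sketch of how one seed controls several of the steps listed above. It uses only Python's `random` module; in a real pipeline each library (e.g., NumPy, PyTorch, TensorFlow) has its own generator that must be seeded separately. `run_experiment` is a hypothetical stand-in for a full training run.

```python
import random

def run_experiment(seed):
    """Simulate the seed-dependent parts of a pipeline."""
    rng = random.Random(seed)
    data = list(range(10))
    rng.shuffle(data)                                  # data pipeline (ordering/partitioning)
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]  # parameter initialization
    return data, weights

same_a = run_experiment(13)
same_b = run_experiment(13)
other = run_experiment(14)
```

With the same seed, the two runs are identical; changing the seed changes both the data order and the initial weights.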

[❓] How should we use random seeds?

1. Model selection $\rightarrow$ hyper-parameter calibration
2. Ensemble creation
3. Sensitivity analysis
4. Single fixed seed $\rightarrow$ throughout the whole model pipeline
5. Performance comparison

According to [[Bethard, 2022](https://arxiv.org/pdf/2210.13393.pdf)], based on an analysis of **85** ACL papers, points [1-3] are **safe** approaches, while points [4-5] **hide some critical risks**.

In particular, Bethard found that 48 out of 85 papers (**~56%**) follow risky approaches. 

This trend is **unaffected by time**! $\rightarrow$ recent papers are **still** making the same errors

### Model selection

A random seed is just **another hyper-parameter** of the model.

It is perfectly fine to search for the best random seed, as we do for other hyper-parameters.


*''...Training multiple models with randomized initializations and use as the final model the one which achieved the best performance on the validation set''* [[Björne & Salakoski, 2018](https://aclanthology.org/W18-2311.pdf)]

*''The test results are derived from the 1-best random seed on the validation set''* [[Kuncoro et al., 2020](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00345/96469/Syntactic-Structure-Distillation-Pretraining-for)]

### Ensemble creation

Ensembling is a standard strategy to **achieve increased performance** by combining multiple models

One can create an ensemble by training the same model with **different random seeds** and obtain predictions via voting

*''...We ensemble five distinct models each initialized with a different random seed''* [[Nicolai et al., 2017](https://aclanthology.org/K17-2008.pdf)]

*''Our model is composed of the ensemble of 8 single models. The hyperparameters and the training procedure used in each single model are the same except the random seed''* [[Yang & Wang, 2019](https://aclanthology.org/W19-4421.pdf)]
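
A minimal sketch of majority voting over the predictions of the same model trained with different seeds. The predictions are invented for illustration.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label predictions by majority vote, example by example."""
    ensemble = []
    for votes in zip(*predictions_per_model):
        # On ties, Counter.most_common keeps the label that was seen first.
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

# One list of predictions per seed (made-up outputs)
preds = [
    ["pos", "neg", "pos", "neg"],  # seed 1
    ["pos", "pos", "pos", "neg"],  # seed 2
    ["neg", "neg", "pos", "pos"],  # seed 3
]
ensemble_preds = majority_vote(preds)
```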

###  Sensitivity Analysis

Since neural networks may be **sensitive to random initialization**, one could study this aspect by considering multiple seed runs.

*''...examine the expected variance in attention-produced weights by initializing multiple training sequences with different random seeds''* [[Wiegreffe & Pinter, 2019](https://aclanthology.org/D19-1002.pdf)]

*''Our model shows a lower standard deviation on each task, which means our model is less sensitive to random seeds than other models''* [[Hua et al., 2021](https://aclanthology.org/2021.naacl-main.258.pdf)]
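
A sensitivity analysis can be as simple as summarizing the scores of several seed runs. The sketch below uses made-up scores for two hypothetical models.

```python
import statistics

def summarize_runs(scores):
    """Summarize the scores obtained by one model over several seeds."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }

scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # model A, five seeds (invented)
scores_b = [0.84, 0.70, 0.88, 0.75, 0.86]  # model B, five seeds (invented)
summary_a = summarize_runs(scores_a)
summary_b = summarize_runs(scores_b)
```

Here the two models have similar means, but B has a much larger spread, i.e., it is more sensitive to the seed.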

### Single fixed seed

Pick a single random seed to fix the whole model pipeline **to ensure reproducibility**.

[❓] Why is this risky?

- Doesn't **necessarily guarantee reproducibility** $\rightarrow$ some libraries (e.g., TensorFlow) may **not support** seed fixing for all operations (it also depends on the **library version**)

- Seed fixing **implies no calibration** for such a hyper-parameter $\rightarrow$ performance **under-estimation**

What should be done instead

- Calibrate the random seed like any other hyper-parameter [[Dodge et al., 2020](https://arxiv.org/pdf/2002.06305.pdf)]
- Low-resource scenario? Random search [[Bergstra & Bengio, 2012](https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)]

### Performance comparison

For model comparison, one should prefer **considering distributions of model performance** and not single point estimates to draw more **reliable conclusions**.

[❗] However, the trend is to just consider **random seed variations** to obtain such distributions

*''We re-ran both implementations multiple times, each time only changing the seed value of the random number generator''* [[Reimers & Gurevych, 2017](https://aclanthology.org/D17-1035.pdf)]

*''Indeed, the best approach is to stop reporting single-value results, and instead report the distribution of results from a range of seeds. Doing so allows for a fairer comparison across models''* [[Crane, 2018](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00018/43441/Questionable-Answers-in-Question-Answering)]

[❓] Why is this risky?

1. **Sub-optimal models** if the goal is to compare the best possible model A from one architecture to the best possible model B from another architecture (e.g., leaderboards)

2. **Biased slice of family** if the goal is to compare the family of models A to the family of models B $\rightarrow$ we should consider **all** hyper-parameters

What should be done instead

- For (1), we should also optimize for the best possible seed $\rightarrow$ we can still consider distributions rather than point estimates by **considering multiple test splits**.
- For (2), we should sample different models from a family by considering **all hyper-parameters** and not just random seeds.

## Model Evaluation

Having **valid results** from which to **draw reliable** conclusions is a fundamental aspect of your experimental setting [[Lones, 2023](https://arxiv.org/pdf/2108.02497.pdf)].

*Current SOTA achieves 94% accuracy, while our model achieves 95.2%* $\rightarrow$ **new SOTA!** [[Hooker, 1995](https://link.springer.com/article/10.1007/BF02430364)]

[❓] What could have gone differently? 

- Training/evaluation on different **data partitions**
- Different **training dataset**
- Different **training regularization**
- **Mismatch** in hyper-parameter optimization $\rightarrow$ I know you have calibrated your model only!

### What to do?

- **Re-implement** all models or use available code (*good luck!*)

- Perform a **fair hyper-parameter** calibration

- Evaluation on an **appropriate** test set: no overlap with training data, representative of population $\rightarrow$ **Images with same weather conditions** [💬]

- In case of data-augmentation, perform it on **training data only**

- Perform model evaluation **multiple times** [[Blagec et al., 2020](https://arxiv.org/pdf/2008.02577.pdf)]: multiple data partitions (**input sensitivity**); multiple seeds (**random initialization sensitivity**) $\rightarrow$ under/over-estimation of model's performance.
    - It becomes increasingly likely that the best model just happens to **over-fit the test set**, and doesn't necessarily **generalize** any better than the other models.  **The Persuasive Essays dataset** [💬]

- Cross-validation (**stratified** in case of class imbalance) $\rightarrow$ keep a separate test set for final evaluation

- Pick **correct** evaluation metrics: different metrics give different perspectives [[Wagstaff, 2012](https://icml.cc/2012/papers/298.pdf)]

- Statistical tests

- Check your results $\rightarrow$ In some domains it is very important to understand which errors your model makes

- Consider strong simple baselines $\rightarrow$ is there any advantage?
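
As an illustration of the stratified cross-validation point above, the following sketch builds folds that preserve class proportions. It is a simplified stand-in for library implementations; the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of sample indices with roughly equal label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = ["neg"] * 90 + ["pos"] * 10  # imbalanced toy dataset
folds = stratified_folds(labels, k=5)
```

Each fold serves once as the evaluation fold while the others are used for training; a separate test set should still be held out for the final evaluation.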

## Reporting

To effectively contribute to knowledge, you need to provide a **complete picture** of your work, covering both what worked and what **didn't**. 

$\rightarrow$ Trade-offs are common.

### Over-statements

A common mistake is to make **general** statements that are **not supported** by the data used to train and evaluate models.

- If your model does really well on one dataset, it doesn't **necessarily mean** that it will do well on other datasets.
- There's always a limit of what can be **inferred from** an experimental study $\rightarrow$ sampling error/bias, datasets overlap, data quality.

### Statistical significance shotgun

As opposed to most scientific disciplines, statistical analysis **is seldom conducted** in machine learning-based research [[Forde & Paganini, 2019](https://arxiv.org/pdf/1904.10922.pdf); [Henderson et al., 2018](https://aaai.org/papers/11694-deep-reinforcement-learning-that-matters/)]

Statistical tests are **error-prone**: some may **underestimate** significance while some others may **overestimate** it.

$\rightarrow$ A positive test **doesn't always** indicate that something is significant, and a negative test doesn't necessarily mean that something isn't significant.

#### Threshold selection and abuse

Statisticians are increasingly arguing that it is better **not to** use thresholds and just report p-values for **interpretation** [[Amrhein et al., 2019](https://www.nature.com/articles/d41586-019-00857-9)]

To give a better indication of whether something is important, we can measure **effect size** [[Betensky, 2019](https://www.tandfonline.com/doi/epdf/10.1080/00031305.2018.1529624?needAccess=true&role=button); [Benavoli et al., 2017](https://jmlr.org/papers/volume18/16-305/16-305.pdf)].

#### Which one to use?

**[Pairwise comparison]** Student's t-test (if normally distributed), Mann-Whitney U test (more general) [[Raschka, 2020](https://arxiv.org/pdf/1811.12808.pdf); [Carrasco et al., 2020](https://www.sciencedirect.com/science/article/pii/S2210650219302639)]

**[Multiple comparisons]** Multiple pairwise comparisons can lead to overly-optimistic interpretations of significance (using the test set multiple times) $\rightarrow$ **multiplicity effect** or **data dredging** or **p-hacking** [[Head et al., 2015](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106)]

## Statistical significance

A standard tool to assess that experimental results are **not coincidental**.

[❓] Which statistical tool should we use? Under which conditions should we use it?

[[Dror et al., 2018](https://aclanthology.org/P18-1128.pdf)] surveyed **~200** papers from ACL'17 and found that statistical significance is often **not reported** or **wrongly used**.

<center>
<div>
<img src="../Images/Lecture-4/dror_survey.png" width="300" alt='dror_survey'/>
</div>
</center>

### Our goal is

Make sure the **difference** between two algorithms on a single comparison **is not coincidental**.

<center>
<div>
<img src="../Images/Lecture-4/statistics_scooby-doo.jpg" width="300" alt='statistics_scooby-doo'/>
</div>
</center>

### Preliminaries

We have two algorithms: $A$ and $B$

We have a dataset $X$

We have an evaluation measure $\mathcal{M}$ $\rightarrow$ $\mathcal{M}(\cdot, X)$ is its value on dataset $X$ for algorithm $\cdot$

We define the **difference** in performance between $A$ and $B$ as:

$$
\begin{equation} \label{eq:delta}
    \delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X)
\end{equation}
$$

Then the statistical hypothesis testing problem is:

$$
\begin{align}
    H_0: \delta(X) \le 0 \\
    H_1: \delta(X) > 0
\end{align}
$$

We want to show that the observed result would be very unlikely if $H_0$ were true, so that we can **reject it** and accept our desired hypothesis $H_1$.

To do so, we compute the p-value: the probability, under $H_0$, of obtaining a result **equal to or more extreme** than what was actually observed:

$$
\begin{equation}
    P \left ( \delta(Y) \ge \delta_{observed} | H_0 \right ) \quad (\delta_{observed} \, \text{is derived from Eq. \eqref{eq:delta}})
\end{equation}
$$

where $Y$ is a random variable over possible observations

The smaller the p-value, the higher the significance $\rightarrow$ the stronger the evidence against $H_0$

To reject $H_0$ one should define a threshold $\alpha$, the **significance level**: reject only if p-value $< \alpha$

#### Error types

- Type I: $H_0$ is rejected when it is actually true
- Type II: $H_0$ is not rejected although it should be

### Parametric vs non-parametric tests

Statistical significance depends on two notions: $\mathcal{M}$ and the distribution of $\delta(X)$

- If the distribution of $\delta(X)$ **is known** (we call the distribution parameters $\theta$): **parametric tests** are the way to ensure low probability of Type II error
- Otherwise, we rely on **non-parametric** tests (less powerful but statistically sound)

#### How to know the distribution of $\delta(X)$?

- Shapiro-Wilk test
- Kolmogorov-Smirnov test
- Anderson-Darling test
- $\dots$

### Parametric tests

Typically assume the normal distribution

#### Paired Student's t-test

Assesses whether the **population means** of two sets of measurements differ from each other

Assumes that both samples come from a normal distribution (in the paired setting, that the per-example differences are normally distributed)

##### When is it applied?

On evaluation measures like accuracy, UAS (unlabeled attachment score) and LAS (labeled attachment score) (used in **structured tasks**) $\rightarrow$ compute the mean of correct predictions per example

Usually based on the idea of the Central Limit Theorem $\rightarrow$ when the number of individual predictions (e.g., words in a sentence) **is large enough**
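
For reference, a sketch of the test statistic, writing $d_i = \mathcal{M}(A, x_i) - \mathcal{M}(B, x_i)$ for the per-example difference on the $n$ test examples (the symbols $d_i$, $\bar{d}$ and $s_d$ are notation introduced here for illustration):

$$
\begin{equation}
    \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
    s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( d_i - \bar{d} \right)^2}, \qquad
    t = \frac{\bar{d}}{s_d / \sqrt{n}}
\end{equation}
$$

If the differences are normally distributed and their true mean is zero, $t$ follows a Student's t distribution with $n - 1$ degrees of freedom, from which the p-value is read.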

### Non-parametric tests

There are two families

- **Sampling-free**: Does **not** consider the actual values of the evaluation measures
- **Sampling-based**: Does consider the values of the measures



#### Sampling-free

Only considers **the number of cases** in which each of the algorithms performs better than the other

Lower statistical power than sampling-based but **far less** computationally intensive

- Sign test
- McNemar's test / Cochran's Q test
- Wilcoxon signed-rank test

#### Sign test

**Test statistic**: number of examples for which A is better than B but **ignores the extent of the difference**!

**Null hypothesis**: given a new pair of measurements $(a_i, b_i)$ then $a_i$ and $b_i$ are **equally likely** to be larger than the other

**Assumptions**:

- Data samples are i.i.d.
- The differences come from a continuous distribution (e.g., accuracy metric)
- The values are ordered
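
A minimal sketch of a one-sided exact sign test. Ties are discarded, which is a common convention assumed here; the per-example scores are invented.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """One-sided exact sign test: does A beat B more often than chance?"""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # ties are discarded
    # P(X >= wins_a) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n
    return wins_a, wins_b, p_value

# Made-up per-example scores: A is better on 9 out of 10 examples
scores_a = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9, 0.7, 0.8, 0.5]
scores_b = [0.8, 0.7, 0.6, 0.8, 0.5, 0.7, 0.8, 0.6, 0.7, 0.6]
wins_a, wins_b, p_value = sign_test(scores_a, scores_b)  # p_value = 11/1024
```

Only the number of wins matters: making A's margins ten times larger would leave the p-value unchanged.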

#### McNemar's test

A **special case** of sign test for **binary classification**

**Null hypothesis**: marginal probability for each outcome is the same for both A and B $\rightarrow$ A and B are expected to make the **same proportions** of correct/incorrect predictions

Cochran's Q test generalizes McNemar's test to **multi-class classification** setups
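
A sketch of the exact (binomial) two-sided variant of McNemar's test, which only looks at the examples where exactly one of the two models is correct. Gold labels and predictions are invented for illustration.

```python
from math import comb

def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs."""
    # b: A correct and B wrong; c: A wrong and B correct
    b = sum(pa == g and pb != g for g, pa, pb in zip(gold, pred_a, pred_b))
    c = sum(pa != g and pb == g for g, pa, pb in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree in correctness
    # Under H0, each discordant pair favors A or B with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

gold = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
flip = lambda y: 1 - y
pred_a = gold[:10] + [flip(y) for y in gold[10:]]                  # A wrong on the last 2
pred_b = gold[:2] + [flip(y) for y in gold[2:10]] + gold[10:]      # B wrong on 8 examples
b, c, p_value = mcnemar_exact(gold, pred_a, pred_b)
```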

#### Wilcoxon's signed-rank test

More powerful than previous methods.

**Null hypothesis**: the differences follow a symmetric distribution around zero

#### Sampling-based

- Permutation/randomization test
- Paired bootstrap

#### Pitman's permutation test

Computes statistical significance under all possible labellings (**permutations**) of the test set

   - Compute original sum of differences between A and B: $d_0$
   - For all permutations $r$
       - Randomly swap $\mathcal{M}(A, x_i)$ with $\mathcal{M}(B, x_i)$ up to $N$ times
       - Compute sum of differences between A and B: $d_r$
   - The p-value is the ratio of times where $d_r \ge d_0$, i.e., where the permuted difference is at least as extreme as the observed one

**Exponentially large** number of possible permutations

In practice, we use **approximations**: a pre-defined limited number of permutations are drawn without replacement
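
A minimal Monte Carlo sketch of the paired permutation test. Swapping the two scores of an example is the same as flipping the sign of their difference. As a simplification, permutations are drawn independently here (so repeats are possible), and the `+1` correction is a common convention; both are assumptions of this sketch.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Approximate one-sided p-value for 'A is better than B'."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_0 = sum(diffs)  # observed sum of differences
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly swap A and B on each example (sign flip of the difference).
        d_r = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if d_r >= d_0:
            at_least_as_extreme += 1
    # The +1 terms count the observed assignment itself.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up per-example scores: A is consistently better on 10 examples
p_better = paired_permutation_test([0.9] * 10, [0.8] * 10)
# Identical systems: every permutation is as extreme as the observation
p_equal = paired_permutation_test([0.5] * 5, [0.5] * 5, n_permutations=100)
```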

#### Paired bootstrap test

Differs from Pitman's test by **allowing replacements**: an example from the original test data can appear more than once in a sample

Uses the samples as surrogate populations for the purpose of approximating the sampling distribution of the statistic

Less effective for **small** test sets.
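
A sketch of one simple variant of the paired bootstrap: resample test examples with replacement and count how often A fails to beat B on the resampled set. Other variants exist; the decision rule, the number of resamples and the scores below are assumptions for illustration.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=2_000, seed=0):
    """Fraction of bootstrap resamples on which A is not better than B."""
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_samples):
        # Draw n example indices with replacement; the same index is used for
        # both systems, which keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta <= 0:
            not_better += 1
    return not_better / n_samples

p_clear = paired_bootstrap([0.9, 0.8, 0.7, 0.9], [0.8, 0.7, 0.6, 0.8])
p_same = paired_bootstrap([0.5, 0.6, 0.7], [0.5, 0.6, 0.7])
```

With only a handful of test examples, the resamples contain little variety, which is one way to see why the method is less effective on small test sets.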

### Which one to use?

<center>
<div>
<img src="../Images/Lecture-4/statistical_significance.png" width="600" alt='statistical_significance'/>
</div>
</center>

### Open problems

#### Dependent observations 

In many cases, samples are **not** i.i.d. (e.g., sentences from the same document)

[❗] **Hard to quantify** the nature of the dependence between (test) samples

#### Cross-validation

[❗] Test splits of different folds are **not independent**

A possible solution is to use $K_{Bonferroni}$ estimator [[Dror et al., 2017](https://aclanthology.org/Q17-1033.pdf)]
  1. Calculate p-value **for each fold** separately
  2. Perform replicability analysis for dependent datasets with $K_{Bonferroni}$
  3. If the analysis rejects the null hypothesis in **all folds** the results should be significant

## Replicability analysis

In many cases we experimentally compare a model A to another model B on several datasets

However, **aggregating individual statistical significance tests** over multiple datasets is error-prone $\rightarrow$ the probability of making one or more false claims is **very high**!

[[Dror et al., 2017](https://aclanthology.org/Q17-1033.pdf)] propose a method for

- Counting: for how many datasets does a given algorithm outperform another?
- Identification: what are these datasets?

### Counting

Recall the statistical hypothesis testing problem:

$$
\begin{align}
    H_0: \delta(X) \le 0 \\
    H_1: \delta(X) > 0
\end{align}
$$

If we have $N$ datasets, we have to test for rejecting $H_0$ **$N$ times** $\rightarrow$ it is likely to make some erroneous rejections!

For instance, if significance level is $\alpha = 0.05$, we have 5% chance to make an erroneous rejection. If $N = 100$, we can expect to make around 5 wrong rejections.

The probability of **making at least one** erroneous rejection is $1 - (1 - 0.05)^{100} \approx 0.994$
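
The arithmetic above can be checked with a few lines, under the same assumption that the $N$ tests are independent. The Bonferroni-corrected threshold ($\alpha / N$) is shown as one standard remedy; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one erroneous rejection over independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

fwer = familywise_error_rate(0.05, 100)            # about 0.994
expected_false_rejections = 0.05 * 100             # about 5
corrected = familywise_error_rate(bonferroni_alpha(0.05, 100), 100)
```

With the corrected per-test level, the probability of at least one erroneous rejection stays below the original $\alpha$.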

#### Partial conjunction test

We consider different $H_0$ and $H_1$

$$
\begin{align}
    H_0^{u/N}: k < u \\
    H_1^{u/N}: k \ge u
\end{align}
$$

where $k$ is the true unknown number of false null hypotheses $\rightarrow$ number of datasets where A is **truly better** than B

This problem translates to *"Are **at least** $u$ out of $N$ null hypotheses false?''*

There are two estimators for $k$ (i.e., for addressing counting)

- Bonferroni: for **dependent** datasets
- Fisher: for **independent** datasets (requires independence but more powerful than Bonferroni)

By computing the estimator, we can estimate our $\hat{k}$, meaning that A is better than B in **at least** $\hat{k}$ datasets with a confidence level $1 - \alpha$.
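
A rough sketch of the Bonferroni-based counting estimator. It assumes the partial conjunction p-value takes the form $(N - u + 1) \cdot p_{(u)}$, with $p_{(u)}$ the $u$-th smallest p-value, made monotone in $u$, and that $\hat{k}$ is the largest $u$ whose adjusted p-value is at most $\alpha$. Check these formulas against the paper before relying on them; the p-values are invented.

```python
def bonferroni_partial_conjunction(p_values, u):
    """Assumed Bonferroni p-value for 'at least u of the N nulls are false'."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    return min(1.0, (n - u + 1) * p_sorted[u - 1])

def count_estimator(p_values, alpha=0.05):
    """Largest u whose (monotone) partial conjunction p-value is <= alpha."""
    k_hat, running_max = 0, 0.0
    for u in range(1, len(p_values) + 1):
        # Enforce monotonicity: the p-value for u cannot be below that for u - 1.
        running_max = max(running_max, bonferroni_partial_conjunction(p_values, u))
        if running_max <= alpha:
            k_hat = u
    return k_hat

# Made-up per-dataset p-values for "A is better than B"
k_hat = count_estimator([0.001, 0.002, 0.04, 0.3], alpha=0.05)
```

Note that the third dataset has $p = 0.04 < \alpha$ when considered alone, yet it is not counted once multiplicity is taken into account.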

### Is null-hypothesis significance test the only option?

<center>
<div>
<img src="../Images/Lecture-4/comparison.png" width="550" alt='comparison'/>
</div>
</center>

What is the correct way to interpret empirical observation in terms of the **superiority** of one system over another?

While $S_1$ has higher accuracy than $S_2$ in both cases, the gap is **moderate** and the datasets are of **limited size**

[[Azer et al., 2020](https://aclanthology.org/2020.acl-main.506.pdf)] make at least **four statistically distinct hypotheses**, each supported by a **different statistical evaluation**

#### H1 (null hypothesis)

**Assuming** $S_1$ and $S_2$ have inherently **identical** accuracy, the probability (p-value) of making a hypothetical observation with an accuracy gap at least as large as the empirical observation (here, 3.5%) is at most 5% (making us 95% confident that the above assumption is false).

#### H2 (confidence intervals)

**Assuming** $S_1$ and $S_2$ have inherently **identical** accuracy, the empirical accuracy gap (here, 3.5%) is
larger than the maximum possible gap (confidence interval) that could hypothetically be observed with a
probability of over 5% (making us 95% confident that the above assumption is false).

#### H3 (posterior intervals)

**Assume a prior belief** (a probability distribution) w.r.t. the inherent accuracy of typical systems. Given the
empirically observed accuracies, the **probability** (posterior interval) that the inherent accuracy of S1 exceeds that of S2 **by a margin** of 1% is at least 95%.

#### H4 (Bayes factor)

**Assume a prior belief** (a probability distribution) w.r.t. the inherent accuracies of typical systems. Given the empirically observed accuracies, **the odds** increase by a factor in favor of the hypothesis that the inherent accuracy of S1 exceeds that of S2 **by a margin** of 1%

H1 and H2 can be tested with **p-value-based** methods $\rightarrow$ operate over the probability space of a test statistic ($\delta$) over observations

H3 and H4 are based on **Bayesian inference** $\rightarrow$ operate directly over the probability space of inherent accuracy (rather than observations)

Due to **time reasons**, we don't explore these methods in detail. $\rightarrow$ a dedicated course would be needed! (*I suck at statistics*)

I **recommend** further reading [[Azer et al., 2020](https://aclanthology.org/2020.acl-main.506.pdf)] as it clearly describes all possible misconceptions, misuses and statistical tools when comparing two models on some observed data.

<center>
<div>
<img src="../Images/Lecture-4/azer_survey.png" width="1700" alt='azer_survey'/>
</div>
</center>

### Takeaways

1. p-values **do not provide probability** estimates on two systems **being different** (or equal). They can only allow you to conclude that one system **is better than** the other without taking into account **the extent** of the difference between the systems (binary thinking)

2. A common **misconception** is: *"if p-value $< 0.05$, the null hypothesis has only a 5% chance of being true''* $\rightarrow$ this is false since we are defining p-value with the **assumption** of null hypothesis being true

3. Another **misconception** is: *"if p-value $> 0.05$, there is no difference between the two systems''* $\rightarrow$ a large p-value only means that the null hypothesis **is consistent with observations**. It does **not** tell you anything about the **likelihood** of the null hypothesis

4. Another misconception is: *"a statistically significant result (p-value $< 0.05$) indicates a **notable difference** between the two systems''* $\rightarrow$ p-value only indicates **strict superiority** and provides **no information** about the **margin of the effect** 

5. **Posterior intervals** generally provide a useful summary as they capture **probabilistic estimates** of the correctness of the hypothesis. For instance, we can say "with probability 0.95, model A's accuracy is two percent higher than that of B" $\rightarrow$ A outperforms B

## Sticking to widely known reporting standards

[[Marie et al., 2021](https://aclanthology.org/2021.acl-long.566.pdf)] analyze **769** papers concerning machine translation and observe that:

- BLEU metric is used as a reference metric in **99%** of the papers, while there are **more than 100 metrics** that better correlate with human judgements than BLEU $\rightarrow$ different metric choice leads to **different model rankings**!

- Significance testing is **only used in 65%** of the papers

- Just considering 2019-2020, **40%** of the papers copied results of previous work $\rightarrow$ it is often unclear whether copied and proper results are **comparable**! (metrics may have **parameters**)
    - For instance, a metric may depend on the tokenizer used and the pre-processing pipeline
    - Previous work may hide **coding tricks** [[Liao et al., 2021](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf)]

- Just considering 2019-2020, **38%** of the papers claimed superiority of a particular method but used different data (e.g., pre-processing pipeline)

More generally, there are several factors that may lead to a method being perceived as superior to another: *benchmark lottery* [[Dehghani et al., 2021](https://arxiv.org/pdf/2107.07002.pdf)]

## Aggregating results via averaging

<center>
<div>
<img src="../Images/Lecture-4/BT_example.png" width="600" alt='BT_example'/>
</div>
</center>

[[Peyrard et al., 2021](https://aclanthology.org/2021.acl-long.179.pdf)] argue that **averaging across samples** on the same test set for comparing two models is not a **robust** approach

They propose the **Bradley-Terry (BT)** model to account for **sample-level pairing**: compares systems for each test instance and estimates the latent strength of systems based on **how frequently** one system scores higher than another

Pairing is particularly important when there are **different types of examples** in the data (e.g., harder vs. easier)

Compared to BT:

- Mean aggregation is **not robust** to outliers
- Mean and median aggregation **do not take into account order of scores**
- BT is the paired variant of median (while median is the **outlier-resistant** version of mean)
- BT relates to Fisher's sign test for statistical significance

BT **decision** simply translates to computing the median of the paired (per-instance) differences between the metric scores of models A and B: $BT = Median \left ( \mathcal{M}(A) - \mathcal{M}(B) \right )$

If $BT > 0$, then A is **better than** B.
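
A small sketch contrasting mean aggregation with the paired view. For two systems and no ties, the fraction of instances won by A is the maximum-likelihood estimate of the BT probability that A beats B. The scores are invented, with one outlier instance on which B is much better.

```python
import statistics

def compare(scores_a, scores_b):
    """Compare two systems with unpaired and paired aggregations."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins_a = sum(d > 0 for d in diffs)
    wins_b = sum(d < 0 for d in diffs)
    decided = wins_a + wins_b  # ties are ignored
    return {
        "mean_diff": statistics.mean(scores_a) - statistics.mean(scores_b),
        "paired_median": statistics.median(diffs),
        "win_rate_a": wins_a / decided if decided else 0.5,
    }

# A is slightly better on 4 instances; B is far better on one outlier.
scores_a = [0.60, 0.62, 0.58, 0.61, 0.10]
scores_b = [0.55, 0.57, 0.53, 0.56, 0.90]
result = compare(scores_a, scores_b)
```

Mean aggregation prefers B because of the single outlier, while the paired median and the win rate both prefer A.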

<center>
<div>
<img src="../Images/Lecture-4/BT_discrepancy.png" width="1000" alt='BT_discrepancy'/>
</div>
</center>

<center>
<div>
<img src="../Images/Lecture-4/BT_benchmark.png" width="1600" alt='BT_benchmark'/>
</div>
</center>

## Concluding Remarks

- Data partitioning is a very delicate step that could lead to over-estimations!

- Data leakage can take different forms and is quite common! $\rightarrow$ pay attention to what you do!

- Random seeds should be treated as any other hyper-parameter $\rightarrow$ evaluate on more test splits!

- Statistical significance is often abused and erroneously applied $\rightarrow$ we really need to understand it!

# Any questions?

<center>
<div>
<img src="../Images/Lecture-2/jojo-arrivederci.gif" width="1000" alt='JOJO_arrivederci'/>
</div>
</center>