## About last time

- Reproducibility (computational, of findings) issues affect a wide variety of research papers

- A recent and emerging trend is pushing towards open research communities

- Guidelines, datasheets, model sheets, checklists are proposed to address reproducibility issues

- For several issues like data leakage there exist specific solutions

- Coding plays a crucial part in defining reproducible and robust experiments
    - Reproducible: same data and tools
    - Robust: same data but different tools (e.g., code re-implementation)

## A quick Legend (memo)

[❓] $\rightarrow$ This is a **question for you** (the audience)

[❗] $\rightarrow$ A **very important** piece of information to remember!

[💬] $\rightarrow$ A (funny?) **anecdote** will be told

[[XX et al., YEAR](https://www.youtube.com/watch?v=dQw4w9WgXcQ)] $\rightarrow$ A reference (**click on the reference** to see the paper) 


## About this lecture

We explore reproducibility issues and data leakage in more detail

- Dataset creation

- Data processing

- Model setup

- Reporting

- **Coding**: the next 3 lectures are about this!

We are going to consider some real examples!

## Dataset creation

What might go wrong? How to follow a correct approach?

- Initial motivation: do we really need to create a new dataset?

- What data to collect?

- For what purpose(s)?

- How to collect data? Annotator selection, training, guideline definition, pilot studies

- How to judge whether a data collection was 'successful'? Size, quality, coverage, biases, imbalance

- Ethical considerations

## Data Partitioning

Data partitioning is a crucial step, since model performance is computed on the resulting dev and test splits.

Suppose you have some available data (e.g., you created it)

[❓] How should we split data?

[❓] For what purpose are we splitting data?

It is common practice to collect a dataset and build train, dev and test splits

These splits are usually built by following a specific criterion

- Order of collection
- Time order
- Metadata (e.g., different speakers)
- Random!

These splits are formally known as **standard splits**

This process is perfectly fine: we have the data, we follow the splits and train/evaluate our models of interest
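
As a minimal sketch of two of the criteria above (random and time order), assuming toy samples with a hypothetical `year` field and 80/10/10 proportions that are not prescribed by any specific dataset:

```python
import random

def cut(items, dev_frac=0.1, test_frac=0.1):
    """Cut an already ordered list into train/dev/test (test takes the tail)."""
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    n_train = len(items) - n_dev - n_test
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]

def random_split(samples, seed=0):
    """Random criterion: shuffle with a local, seeded generator, then cut."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return cut(shuffled)

def time_ordered_split(samples, time_key):
    """Time-order criterion: oldest samples for training, newest for testing."""
    return cut(sorted(samples, key=time_key))

# Toy data (assumption): 100 samples, each tagged with a collection year
samples = [{"id": i, "year": 2000 + i % 20} for i in range(100)]

train_r, dev_r, test_r = random_split(samples, seed=13)
train_t, dev_t, test_t = time_ordered_split(samples, time_key=lambda s: s["year"])
```

The two functions return splits of the same size, but the samples that end up in the test split (and hence the reported performance) can differ a lot.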

### Are standard splits good?

[❗] [[Gorman & Bedrick, 2019](https://aclanthology.org/P19-1267.pdf)] showed that there is a substantial discrepancy between results obtained on standard splits and on random splits

In particular, they argue in favor of random splits to obtain a better estimate of model performance on new data samples (i.e., a hypothetical real-world scenario)

However, [[Søgaard et al., 2021](https://aclanthology.org/2021.eacl-main.156v2.pdf)] showed that random splits (and other data partitioning approaches) still over-estimate model performance

<center>
<div>
<img src="Images/Lecture-2/talk_about_random_splits.png" width="2000" alt='talk_about_random_splits'/>
</div>
</center>

- Random and standard splits under-estimate error on new samples
- Heuristic and adversarial splits sometimes under-estimate error on new samples

### What should we do?

[[Søgaard et al., 2021](https://aclanthology.org/2021.eacl-main.156v2.pdf)] propose to:

1. Consider/build multiple diverse test splits with different biases $\rightarrow$ better evaluation
2. If (1) is not possible, it is better to consider biased splits to better approximate real-world performance
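
A minimal sketch of point (1), assuming toy samples and a hypothetical `predict` function standing in for a trained model. It is not the procedure used in the paper: it only illustrates evaluating one model on several test splits built with different biases (here: random vs. text length).

```python
import random

def random_test_split(samples, test_frac=0.2, seed=0):
    """Unbiased split: shuffle, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def length_biased_split(samples, test_frac=0.2):
    """Heuristic split: train on the shortest texts, test on the longest ones."""
    ordered = sorted(samples, key=lambda s: len(s["text"].split()))
    n_test = int(len(ordered) * test_frac)
    return ordered[:len(ordered) - n_test], ordered[len(ordered) - n_test:]

def accuracy(predict, test_set):
    """Fraction of test samples for which the prediction matches the label."""
    return sum(predict(s["text"]) == s["label"] for s in test_set) / len(test_set)

# Toy data (assumption): texts of 1..20 tokens, label 1 for texts longer than 10 tokens
samples = [{"text": " ".join(["tok"] * n), "label": int(n > 10)} for n in range(1, 21)]

def predict(text):
    # Hypothetical stand-in for a trained model: always predicts class 0
    return 0

test_splits = {
    "random": random_test_split(samples)[1],
    "length-biased": length_biased_split(samples)[1],
}
scores = {name: accuracy(predict, split) for name, split in test_splits.items()}
```

Reporting `scores` for every split, rather than for a single one, shows how much the estimate depends on the way the test data was selected.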

#### The importance of biases

<table><tr>
<td> <img src="Images/Lecture-2/different_time_samples.png" width="1100" alt='different_time_samples'/> </td>
<td> <img src="Images/Lecture-2/temporal_drift.png" width="1100" alt='temporal_drift'/> </td>
</tr></table>

## Seeding

Pseudo-random number generation is a critical aspect of developing experiments.

Choosing a random seed fixes the pseudo-random number generation process.
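
As a minimal illustration with Python's standard `random` module (the helper `draw` is hypothetical): two generators created with the same seed produce the same sequence, while a different seed gives a different one.

```python
import random

def draw(seed, n=5):
    """Draw n pseudo-random numbers from a generator initialized with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

first_run = draw(seed=42)
second_run = draw(seed=42)   # same seed: identical sequence
other_run = draw(seed=7)     # different seed: different sequence
```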

[❓] What processes are affected by a random seed? Have you ever fixed seeds?



- Neural network initialization
- Optimization
- Neural network prediction
- Data pipeline (e.g., data partitioning)

[❓] How should we use random seeds?

1. Model selection $\rightarrow$ hyper-parameter calibration
2. Ensemble creation
3. Sensitivity analysis
4. Single fixed seed $\rightarrow$ throughout the whole model pipeline
5. Performance comparison

According to [[Bethard, 2022](https://arxiv.org/pdf/2210.13393.pdf)], based on an analysis of 85 ACL papers, points [1-3] are safe approaches, while points [4-5] hide some critical risks.

In particular, Bethard found that 48 out of 85 papers follow risky approaches.

This trend has not changed over time!

### Model selection

A random seed is just another hyper-parameter of the model.

It is perfectly fine to search for the best random seed, as for any other hyper-parameter.

*"...Training multiple models with randomized initializations and use as the final model the one which achieved the best performance on the validation set''* [[Björne & Salakoski, 2018](https://aclanthology.org/W18-2311.pdf)]

*"The test results are derived from the 1-best random seed on the validation set''* [[Kuncoro et al., 2020](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00345/96469/Syntactic-Structure-Distillation-Pretraining-for)]
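
A minimal sketch of seed selection on the validation set. `train_and_validate` is a hypothetical stand-in for a full training run, and its scores are simulated rather than taken from any real model.

```python
import random

def train_and_validate(seed):
    """Hypothetical training run: returns a simulated validation accuracy."""
    rng = random.Random(seed)
    return 0.80 + 0.05 * rng.random()  # seed-dependent score in [0.80, 0.85)

candidate_seeds = [0, 1, 2, 3, 4]
val_scores = {seed: train_and_validate(seed) for seed in candidate_seeds}

# The seed is picked on validation data only, like any other hyper-parameter
best_seed = max(val_scores, key=val_scores.get)
```

Only the model trained with `best_seed` would then be evaluated on the test split.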

### Ensemble creation

Ensembling is a standard strategy to achieve increased performance by combining multiple models

One can create an ensemble by training the same model with different random seeds and obtain predictions via voting

*"...We ensemble five distinct models each initialized with a different random seed''* [[Nicolai et al., 2017](https://aclanthology.org/K17-2008.pdf)]

*" Our model is composed of the ensemble of 8 single models. The hyperparameters and the training procedure used in each single model are the same except the random seed''* [[Yang & Wang, 2019](https://aclanthology.org/W19-4421.pdf)]
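
A minimal sketch of a seed-based ensemble with majority voting. `train_model` is a hypothetical stand-in that returns a toy threshold classifier whose threshold depends on the seed; a real setup would train the same architecture once per seed.

```python
import random
from collections import Counter

def train_model(seed):
    """Hypothetical training run: a toy classifier with a seed-dependent threshold."""
    rng = random.Random(seed)
    threshold = rng.uniform(-0.5, 0.5)
    return lambda x: int(x > threshold)

def majority_vote(models, x):
    """Return the label predicted by most models for input x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Same model, five different seeds (an odd number avoids ties for binary labels)
models = [train_model(seed) for seed in range(5)]
predictions = [majority_vote(models, x) for x in (-1.0, 1.0)]  # → [0, 1]
```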

### Sensitivity Analysis

Since neural networks may be sensitive to random initialization, one could study this aspect by considering multiple seed runs

*"...examine the expected variance in attention-produced weights by initializing multiple training sequences with different random seeds''* [[Wiegreffe & Pinter, 2019](https://aclanthology.org/D19-1002.pdf)]

*"Our model shows a lower standard deviation on each task, which means our model is less sensitive to random seeds than other models''* [[Hua et al., 2021](https://aclanthology.org/2021.naacl-main.258.pdf)]
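
A minimal sketch of a seed sensitivity analysis: run the same experiment with several seeds and report mean and standard deviation. `run_experiment` is hypothetical and returns simulated scores.

```python
import random
import statistics

def run_experiment(seed):
    """Hypothetical full run (training + evaluation): returns a simulated score."""
    rng = random.Random(seed)
    return 0.85 + rng.gauss(0.0, 0.01)

seeds = list(range(10))
scores = [run_experiment(seed) for seed in seeds]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"{mean_score:.3f} ± {std_score:.3f} over {len(seeds)} seeds")
```

A large standard deviation relative to the gap between two models suggests that the difference may be explained by the seed alone.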

### Single fixed seed

Pick a single random seed to fix the whole model pipeline to ensure reproducibility.

[❓] Why is this risky?

- Doesn't necessarily guarantee reproducibility $\rightarrow$ some libraries (e.g., TensorFlow) may not support seed fixing for all operations (it also depends on the library version)

- Seed fixing implies no calibration for such a hyper-parameter $\rightarrow$ performance under-estimation

What should be done instead?

- Calibrate the random seed like any other hyper-parameter [[Dodge et al., 2020](https://arxiv.org/pdf/2002.06305.pdf)]
- Low-resource scenario? Random search [[Bergstra & Bengio, 2012](https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)]
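
A minimal sketch of random search in which the seed is sampled together with the other hyper-parameters. The search space (`learning_rate`, `dropout`), the number of trials and the scoring function `train_and_validate` are all assumptions made for illustration.

```python
import random

def sample_config(rng):
    """Sample one configuration; the seed is treated like any other hyper-parameter."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -2),  # log-uniform in [1e-5, 1e-2]
        "dropout": rng.uniform(0.0, 0.5),
        "seed": rng.randrange(10_000),
    }

def train_and_validate(config):
    """Hypothetical training run: returns a simulated validation score."""
    rng = random.Random(config["seed"])
    return 0.8 - abs(config["dropout"] - 0.2) + 0.02 * rng.random()

search_rng = random.Random(0)  # seeds the search itself, not the models
trials = [sample_config(search_rng) for _ in range(20)]
best_config = max(trials, key=train_and_validate)
```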

### Performance comparison

For model comparison, one should prefer considering distributions of model performance and not single point estimates to draw more reliable conclusions.

[❗] However, the common practice is to vary only the random seed to obtain such distributions

*"We re-ran both implementations multiple times, each time only changing the seed value of the random number generator''* [[Reimers & Gurevych, 2017](https://aclanthology.org/D17-1035.pdf)]

*"Indeed, the best approach is to stop reporting single-value results, and instead report the distribution of results from a range of seeds. Doing so allows for a fairer comparison across models''* [[Crane, 2018](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00018/43441/Questionable-Answers-in-Question-Answering)]

[❓] Why is this risky?

1. Sub-optimal models if the goal is to compare the best possible model A from one architecture to the best possible model B from another architecture (e.g., leaderboards)

2. Biased slice of the family if the goal is to compare the family of models A to the family of models B $\rightarrow$ we should consider all hyper-parameters

What should be done instead?

- For (1), we should also optimize for the best possible seed $\rightarrow$ we can still consider distributions rather than point estimates by considering multiple test splits.
- For (2), we should sample different models from a family by considering all hyper-parameters and not just random seeds.
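
A minimal sketch for case (1): two already selected models (best seed included) are compared through score distributions obtained on multiple test splits. The per-sample outcomes `correct_a` and `correct_b` are simulated, and drawing test splits as random subsets of a pooled test set is an assumption, not a prescribed procedure.

```python
import random
import statistics

def draw_test_splits(n_samples, n_splits=20, split_size=50, seed=0):
    """Draw several test splits as lists of sample indices (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(range(n_samples), split_size) for _ in range(n_splits)]

def accuracy_per_split(correct, splits):
    """Accuracy of one fixed model on each test split."""
    return [sum(correct[i] for i in split) / len(split) for split in splits]

# Simulated per-sample outcomes (1 = correct) for the two selected models
data_rng = random.Random(1)
correct_a = [int(data_rng.random() < 0.85) for _ in range(500)]
correct_b = [int(data_rng.random() < 0.80) for _ in range(500)]

# Both models are scored on the same splits, so the distributions are comparable
splits = draw_test_splits(n_samples=500)
scores_a = accuracy_per_split(correct_a, splits)
scores_b = accuracy_per_split(correct_b, splits)

print(f"A: {statistics.mean(scores_a):.3f} ± {statistics.stdev(scores_a):.3f}")
print(f"B: {statistics.mean(scores_b):.3f} ± {statistics.stdev(scores_b):.3f}")
```

For case (2), the same comparison would be run over models sampled from each family across all hyper-parameters, not over a single selected model.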