## Why are you here?

According to the submitted Google form data, you gave the following reasons:


![google_form_true](Images/Lecture-1/google_form_true.png)


## All you liars!

My predictions were **wrong**, since I expected a 100% response rate for the ``CFUs (Yay!!!)`` answer!!


![SegmentLocal](Images/Lecture-1/google_form_liar.gif "segment")



## Anyway.. Why are you here?

- Do you often struggle to reproduce a baseline for your work?

- Do you get different results every time you run your experiments?

- Is your work often criticized for the methodology?

- Are you often confused about what kind of experiment to carry out?

- Are you looking for a sound motivation to explain that your 1-point F1-score improvement is a new SOTA?

- Are you looking for some ways to improve your spaghetti code?

- Are you already tired of this list?

**$\rightarrow$ This course might shed some light on these topics and help you!**

## A quick Legend

[❓] $\rightarrow$ This is a question for you (the audience)

[❗] $\rightarrow$ A very important piece of information to remember!

[💬] $\rightarrow$ An anecdote will be told

[XX et al., YEAR] $\rightarrow$ A reference (the bibliography is at the end of the presentation) 


## Motivational Intro

Let's consider a general, relatable researcher. We call this researcher **FR**.

FR has just started their PhD and is particularly enthusiastic: a lot of ideas, stuff to read, things to try.

Intuitively, **FR has no idea of what comes next..**


Suppose now that FR has a little, yet interesting, idea about their favourite research domain.

[❓] What does FR do?

#### Scientific Method [🤓]

- Specify a hypothesis
- Run some experiments
- Analyze the results
- Draw some conclusions

#### Results 

[❗] Results have to be **repeatable** and **reproducible** to be valid and reliable $\rightarrow$ **consistent** results!

## A blind walk in the research forest

FR remembers this stuff from their past studies and considers this methodology fairly standard and easy to follow.

FR develops their idea, starts thinking about the experimental setup and what to code...

However... some wild issues appear!

![SegmentLocal](Images/Lecture-1/oh-my-god-omg.gif "segment")

### w/ Data

- Some interesting datasets for their study are **not available**... (*private only or premium subscription required*)
- The available datasets are not **well-documented or maintained** (*good luck!*)
- FR can't find the **datasheet** of the dataset (*err.. is that an insult? [🐉]*)
- Some of the datasets have significant **flaws/biases** **$\leftarrow$ to be continued...**
- Can't reproduce the same dataset by following the paper

### w/ Competitor models

- The code of some competitor models is not available (*error 404*)
- The code of some competitor models is outdated, horrifying, full of errors/bugs and incomplete (*pick any combination*)
- The code of some competitor models is not exactly what you inferred after reading the paper (*tricksters*)
- Missing hyper-parameter specs and training configuration
- FR finally managed to run/re-implement the code for competitor models but can't reproduce the same results (*bugs?*)

### w/ Experimental setup

- Improper use of statistical analysis
- Over-claiming of results
- Unfair comparison with other models, methods, etc.
- Wrong experiments (not aligned with FR's thesis)


## A well-known issue!

According to [Baker et al., 2016]:

"*More than 70% of researchers failed in their attempt to reproduce another researcher's experiments, and over 50% failed to reproduce one of their own experiments*"

We observe a trend where research (in computer science) is a frenetic race to publish, often leading to significant flaws in the proposed work.

Moreover...

"*There's a bias in the field towards publishing positive results (rather than negative ones). Indeed, the evidence threshold for publishing a new positive finding is much lower than that for invalidating a previous finding.*"

### Some observations

There's a growing interest in improving reproducibility across scientific disciplines: guidelines, recommendations [McNutt, 2014].

#### The Open Science movement

"*Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks*" [Vicente-Saez & Martinez-Fuentes, 2018]

$\rightarrow$ code, data, scientific communications, research artifacts should be made publicly available [Sonnenburg et al., 2007]

#### Stodden's Best Practices Wiki

*It's now completely acceptable for a researcher to say - here is what I did, and here are my results. [...] The presentation's core isn't about the struggle to root out error but is instead a sales pitch: an enthusiastic presentation of ideas and a breezy demo of an implementation* [Donoho et al., 2009]

- Make your data, code, supplementary material and other artifacts publicly available
- Provide version control for data and code
- Make sure you cite original authors if using external data, code, etc...


#### New SOTA?

We always want to achieve new SOTA: propose solutions that lead to higher metric scores.

However, it is hard to establish whether the aspect of a model that is claimed to improve performance is indeed the factor behind the higher score.

$\rightarrow$ A few studies show that newly proposed methods are often no better than previous implementations after a more thorough hyper-parameter search [Lucic et al., 2018; Melis et al., 2017] or with a different initialization [Bouthillier et al., 2019; Henderson et al., 2018]

#### Documentation

[Raff, 2019] showed that 63.5% of 255 manuscripts were successfully reproduced given proper documentation.

Strikingly, this study found that when authors provided assistance, 85% of results were reproduced (compared to 4% when authors didn't respond)

$\rightarrow$ Selection bias

$\rightarrow$ Reporting problem?

#### Statistical Analysis

As opposed to most scientific disciplines, statistical analysis is seldom conducted in machine learning-based research [Forde & Paganini, 2019; Henderson et al., 2018]

## Reproducible Research [❗]

![reproducible_research](Images/Lecture-1/reproducible_research.png)

- <u>**Reproducible**</u>: re-doing an experiment using the **same data** and **same analytical tools**
- **Replicable**: considers **different data** (presumably, from the same distribution)
- <u>**Robust**</u>: assumes the **same data** but **different analysis** (e.g., code re-implementation, different hardware, different architecture, etc..)
- **Generalisable**: leads to the **same conclusions** despite considering **different data** and **different analytical tools**

## What are we going to cover

- Setting up a correct and coherent experimental setup
- Define and provide a reproducible and robust implementation

### What does it mean?

- Contextualize your contribution
- Data understanding
- Model experimenting
- Model evaluation and analysis

## Step 1 - The data

- Data selection
- Understand your data
- Data size and distribution

### Data selection

You may come up with the fanciest contribution of your life, but it will still matter less than the data on which you run your experiments.

- What does it contain?
- Where was the data collected?
- Are there any limitations, biases, or spurious correlations? $\rightarrow$ **The Tank problem** [💬]; **One of my first data pre-processing attempts on Aharoni et al., 2014** [💬].
- **Popularity != quality** $\rightarrow$ several popular datasets have been shown to have significant limitations [Paullada et al., 2020]
- Is the data directly available or built via code?

#### Understand your data

Data is your friend: you may spot relevant patterns that can guide your modeling. However, **only look at training data!**

Otherwise, you might impose some biases that limit the generalization capabilities of your approach: **data leakage** [❗]

Domain experts are a valuable resource for understanding your data!

#### Data size

- Low amount of data? $\rightarrow$ cross-validation, data-augmentation [Wong et al., 2016; Shorten and Khoshgoftaar, 2019] (**after data partitioning**), transfer learning, simple models.
- Class imbalance [Haixian et al., 2017]
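A minimal sketch of how cross-validation and class imbalance can be handled together: a stratified k-fold splitter written with the standard library only. The function name `stratified_k_fold` and the toy labels are assumptions made for illustration; they are not taken from the cited works.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds roughly preserve class ratios."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the shuffled indices of each class round-robin across the folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold is used once as the held-out part.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy imbalanced labels (illustrative only): 20 negatives, 5 positives.
labels = [0] * 20 + [1] * 5
splits = list(stratified_k_fold(labels, k=5))

for train_idx, test_idx in splits:
    positives = sum(labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, positives in test={positives}")
```

With a plain random split, some folds could easily end up with no positive sample at all; dealing each class separately avoids that.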


## Dataset collection

According to [Paullada et al., 2021], a wide majority of datasets and related data collection processes have major flaws.

- Trend towards collecting large-scale datasets (quantity over quality) $\rightarrow$ *"If it is available to us, we ingest it"*
- No clear understanding of what these datasets/benchmarks measure.
- Affected by spurious correlations that deep learning models exploit as shortcuts and overfit to [Geirhos et al., 2020; Storks et al., 2020; Schlegel et al., 2020] $\rightarrow$ this undermines a dataset's utility (the hypothesis of reflecting human reasoning capabilities) $\rightarrow$ some solutions concern human guidelines [Srivastava et al., 2020] and data collection recommendations [Gardner et al., 2019].
- Improper task formulation (I/O) $\rightarrow$ **Personal traits prediction from images** [💬]
- Annotators' biases $\rightarrow$ usually due to lack of guidelines or dataset documentation (**re-building ImageNet** [💬])
- Hard to scrutinize $\rightarrow$ abattoir effect [Raji et al., 2020], worsening data quality and hiding ethical issues
- Entirely focused on benchmarking metrics $\rightarrow$ neglecting consumption, cost, fairness, model size, limitations, multiple gold standards, etc.



## Data Leakage [Kapoor & Narayanan, 2022]

*Data leakage is a spurious relationship between the independent variables and the target variable that arises as an artifact of the data collection, sampling, or pre-processing strategy.*

#### Examples

- Perform variable scaling/normalization using the whole data set.
- Feature selection before data partitioning.
- Dimensionality reduction before data partitioning.
- Calibrate and test your model on the same dev/test set.
- Data augmentation before data partitioning
- Random splits with time series data, or look-ahead bias [Cerqueira et al., 2020; Wang & Ruf, 2022]
- Biased data partitioning

**$\rightarrow$ make sure you define distinct validation and test splits** [❗]
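To make the first example concrete, here is a minimal sketch of leakage-free normalization: partition first, estimate the scaling statistics on the training split only, then reuse them on the held-out split. The helper names (`train_test_split`, `fit_scaler`, `apply_scaler`) and the toy feature are assumptions for illustration.

```python
import random
import statistics


def train_test_split(values, test_fraction=0.25, seed=0):
    """Shuffle once, then cut: partitioning happens before any statistic is computed."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Estimate mean and standard deviation from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Standardize values with statistics that were estimated elsewhere."""
    return [(v - mean) / std for v in values]


# Toy single feature (illustrative only).
feature = [float(v) for v in range(1, 21)]
train, test = train_test_split(feature)

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)   # test reuses the training statistics

print(f"train mean after scaling: {statistics.mean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.mean(test_scaled):.3f}")
```

The leaky version would call `fit_scaler(feature)` on the whole data set, letting test values influence how the training data is transformed. The same order (split, then fit, then apply) holds for feature selection, dimensionality reduction and data augmentation.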

## Leakage Taxonomy



## Model Evaluation

Having valid results from which to draw reliable conclusions is a fundamental aspect of your experimental setting [Lones, 2023].

- Evaluate models on an **appropriate** test set: no overlap with training data, representative of population $\rightarrow$ **Images with same weather conditions** [💬]
- In case of data-augmentation, perform it on training data only [Vandewiele et al., 2021]
- Perform model evaluation multiple times: multiple data partitions (**input sensitivity**); multiple seeds (**random initialization sensitivity**) $\rightarrow$ a single run can under- or over-estimate the model's performance.
- Cross-validation (stratified in case of class imbalance) $\rightarrow$ keep a separate test set for final evaluation
- Pick correct evaluation metrics: avoid accuracy with imbalanced datasets! [Haixian et al., 2017]
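A small sketch of why accuracy is misleading on imbalanced data: a classifier that always predicts the majority class looks great on accuracy and useless on a class-balanced metric. Balanced accuracy (the mean of per-class recall) is used here as one possible alternative; it is an assumption of this example, not a metric prescribed by the slides.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy imbalanced test set (illustrative only): 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(f"accuracy:          {accuracy(y_true, y_pred):.2f}")           # 0.95
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")  # 0.50
```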

## Model Comparison

*Current SOTA achieves 94% accuracy, while our model achieves 95.2%* $\rightarrow$ new SOTA! 

What could have gone differently? [❓]

- Training/evaluation on different data partitions
- Different training dataset
- Different training regularization
- Mismatch in hyper-parameter optimization $\rightarrow$ I know you have calibrated only your own model!

What to do?

- Re-implement all models or use available code (good luck!)
- Perform a fair hyper-parameter calibration
- Define an extensive experimental setup: multiple model evaluations
- Statistical tests **$\leftarrow$ to be continued...**
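A minimal sketch of "multiple model evaluations": run every model under the same list of seeds and report mean and standard deviation instead of a single number. `run_experiment` is a hypothetical placeholder that only simulates a score; the model names and base scores are made up for illustration and should be replaced by your real training and evaluation pipeline.

```python
import random
import statistics


def run_experiment(model_name, seed):
    """Hypothetical stand-in for a full train/evaluate run.

    It returns a simulated score so the sketch is runnable;
    replace it with your actual pipeline.
    """
    rng = random.Random(f"{model_name}-{seed}")
    base = {"baseline": 0.940, "ours": 0.945}[model_name]  # made-up values
    return base + rng.gauss(0.0, 0.01)


# The same seeds are used for every model to keep the comparison fair.
seeds = [0, 1, 2, 3, 4]
report = {}
for model_name in ("baseline", "ours"):
    scores = [run_experiment(model_name, seed) for seed in seeds]
    report[model_name] = (statistics.mean(scores), statistics.stdev(scores))

for model_name, (mean, std) in report.items():
    print(f"{model_name}: {mean:.3f} +/- {std:.3f} over {len(seeds)} seeds")
```

If the gap between the two means is smaller than the spread across seeds, claiming a new SOTA from a single run is hard to justify.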

## Statistical Tests

Should we use statistical analysis to compare models? How to do so?

**[Pairwise comparison]** Student's t-test (if normally distributed), Mann-Whitney U test (more general) [Raschka, 2020; Carrasco et al., 2020]
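A rough sketch of a pairwise comparison with the Mann-Whitney U test, written with the standard library only. It relies on the normal approximation without tie or continuity correction, which is a simplification assumed here; for real analyses a vetted statistics package is the safer choice. The scores are made-up values standing in for results over 8 seeds.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Simplifications (assumptions of this sketch): no tie correction
    and no continuity correction.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Count the pairs in which the value from sample_a beats the one from sample_b
    # (a tied pair counts as half).
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    u_b = n_a * n_b - u_a
    u = min(u_a, u_b)

    # Mean and standard deviation of U under the null hypothesis.
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / std_u  # z <= 0 because u is the smaller statistic
    p_value = min(2 * NormalDist().cdf(z), 1.0)
    return u, p_value


# Made-up scores over 8 seeds (illustrative only).
ours = [0.952, 0.948, 0.955, 0.950, 0.947, 0.953, 0.949, 0.951]
baseline = [0.940, 0.943, 0.938, 0.945, 0.941, 0.939, 0.944, 0.942]

u, p_value = mann_whitney_u(ours, baseline)
print(f"U = {u}, p = {p_value:.4f}")
```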

**[Multiple comparisons]** Multiple pairwise comparisons can lead to overly optimistic interpretations of significance (using the test set multiple times) $\rightarrow$ **multiplicity effect** $\rightarrow$ **data dredging or p-hacking** [Head et al., 2015]
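A tiny worked example of the multiplicity effect: the chance of getting at least one false positive grows quickly with the number of tests. The sketch assumes the tests are independent, which is rarely exactly true in practice, so treat the numbers as an illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for n_tests in (1, 5, 10, 20):
    rate = family_wise_error_rate(alpha, n_tests)
    print(f"{n_tests:2d} tests -> P(at least one false positive) = {rate:.3f}")
```

With a 0.05 threshold, ten independent comparisons already give roughly a 40% chance of at least one spurious "significant" result.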

## Multiple Benchmarks

In some domains, it is a standard approach to use benchmark datasets to evaluate new models

### Major drawback

As more and more models are evaluated on the same benchmark, it becomes increasingly likely that the best model just happens to over-fit the test set and doesn't necessarily generalize any better than the other models. **The Persuasive Essays dataset** [💬]

$\rightarrow$ Don't assume that a small increase in performance is a significant and general contribution!
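A small simulation of this drawback, under simplifying assumptions: 50 hypothetical models share exactly the same true accuracy, yet the "leaderboard winner" on a finite test set still looks better than it really is. All numbers (50 models, 200 test samples, 80% true accuracy) are made up for illustration.

```python
import random


def simulate_leaderboard(n_models, n_test, true_accuracy, seed=0):
    """Score equally good models on a finite test set; only chance separates them."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        # Each test sample is answered correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        scores.append(correct / n_test)
    return scores


true_accuracy = 0.80
scores = simulate_leaderboard(n_models=50, n_test=200, true_accuracy=true_accuracy)
best = max(scores)

print(f"true accuracy of every model: {true_accuracy:.3f}")
print(f"best observed score:          {best:.3f}")
print(f"mean observed score:          {sum(scores) / len(scores):.3f}")
```

The gap between the best observed score and the true accuracy is pure selection noise: no model here is actually better than the others.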

## Reporting results

To effectively contribute to knowledge, you need to provide a complete picture of your work, covering both what worked and what didn't. $\rightarrow$ Trade-offs are common.

### Transparency

Share your data, code, results in an accessible way. 

- It adds confidence to your work
- It allows easier and fair comparison
- It forces you to act thoughtfully and carefully: document your steps, write clean code, fill checklists

### Rigorous evaluation

A better, more rigorous model evaluation and comparison is done by considering multiple datasets and evaluation metrics [Blagec et al., 2020].

- Gives a more complete picture of the model's performance
- Different metrics give different perspectives $\rightarrow$ increased transparency
- In some domains it is very important to know which errors your model makes

### Over-statements

A common mistake is to make general statements that are not supported by the data used to train and evaluate models.

- If your model does really well on one dataset, it doesn't necessarily mean that it will do well on other datasets.
- There's always a limit to what can be inferred from an experimental study $\rightarrow$ sampling error/bias, dataset overlap, data quality.

### Statistical significance shotgun

Statistical tests are error-prone: some may underestimate significance while some others may overestimate it.

$\rightarrow$ A positive test doesn't always indicate that something is significant, and a negative test doesn't necessarily mean that something isn't significant.

#### Threshold selection and abuse

Statisticians are increasingly arguing that it is better not to use thresholds and just report p-values for interpretation.

To give a better indication of whether something is important, we can measure **effect size** [Betensky, 2019; Benavoli et al., 2017].
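As an illustration, here is a minimal sketch of one common effect size measure, Cohen's d (difference of means divided by the pooled standard deviation). Picking Cohen's d is an assumption of this example; the cited works discuss effect sizes more broadly. The scores are made-up values.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d: mean difference expressed in units of pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_std = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_std


# Made-up scores over 5 seeds (illustrative only).
ours = [0.951, 0.947, 0.955, 0.949, 0.953]
baseline = [0.948, 0.944, 0.952, 0.946, 0.950]

print(f"Cohen's d = {cohens_d(ours, baseline):.2f}")
```

Unlike a p-value, this number says how large the difference is relative to the run-to-run variability, which is often what matters when deciding whether an improvement is worth reporting.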

### Error Analysis

Metric reporting is complementary to model inspection as the former doesn't give insight into what the model has actually learnt.

$\rightarrow$ We don't want to just improve metric performance, but rather generate knowledge and understanding to share with the research community. [❗]

Some insights could be gathered through (if your model is not explainable):

- Human evaluation
- Explainable AI tools [Li et al., 2020; Angelov et al., 2021]