NoiseBench is a benchmark for measuring the impact of real label noise on named entity recognition. It is based on a subset of the English CoNLL-03 dataset and provides 7 label sets: 1 clean (ground-truth) set and 6 noisy variants:
- Clean
- Expert noise
- Crowd noise
- Crowd noise (best-case)
- Distant supervision noise
- Weak supervision noise
- LLM noise
We provide the annotation-only files in `data/annotations`. The annotations follow the IOB2 scheme. Due to the license of the Reuters Corpus that CoNLL-03 is based on, the tokens in the included sentences are masked with `[TOK]`. We use the CleanCoNLL annotations as the ground truth.
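For illustration, the sketch below shows one way the masked annotations could be recombined with a local CleanCoNLL copy. The column layout (whitespace-separated, one token per line, label in the last column, blank lines between sentences) and the file paths are assumptions on our part; `scripts/generate_data_files.py` remains the authoritative implementation.

```python
# Illustrative sketch only: re-attach real tokens to the [TOK]-masked
# annotation files. Paths and column layout are assumptions; see
# scripts/generate_data_files.py for the actual implementation.
def unmask(annotations_path: str, cleanconll_path: str, out_path: str) -> None:
    with open(annotations_path) as ann, open(cleanconll_path) as src, \
         open(out_path, "w") as out:
        for ann_line, src_line in zip(ann, src):
            if not ann_line.strip():       # blank line = sentence boundary
                out.write("\n")
                continue
            token = src_line.split()[0]    # real token from CleanCoNLL
            label = ann_line.split()[-1]   # IOB2 label from the annotation file
            out.write(f"{token}\t{label}\n")
```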
The script `create_noisebench.sh` generates the NoiseBench dataset variants in `data/noisebench`.

- Run the script:

  ```bash
  bash create_noisebench.sh
  ```

Alternatively (if the `git clone` command from Option 1 is not available):

- Download the full CleanCoNLL dataset into the `data/cleanconll` folder, following the instructions at https://github.com/flairNLP/CleanCoNLL.git.
- Create the noisy datasets:

  ```bash
  python scripts/generate_data_files.py
  ```
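As a quick sanity check (not part of the repository), you can count the sentences in each generated file; the `data/noisebench` layout below is an assumption based on the output folder named above.

```python
# Quick sanity check (illustrative): count sentences per generated file,
# assuming CoNLL-style files with blank lines separating sentences.
from pathlib import Path

for path in sorted(Path("data/noisebench").iterdir()):
    if path.is_file():
        blocks = path.read_text(encoding="utf-8").split("\n\n")
        print(f"{path.name}: {sum(1 for b in blocks if b.strip())} sentences")
```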
- Install the requirements:

  ```bash
  conda create -n noisebench python=3.10
  conda activate noisebench
  pip install -r requirements.txt
  ```
- Run the main experiment script:

  ```bash
  python main.py --config configs/exp1_real_noise.json
  ```
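If you want to inspect or adjust the experiment settings first, the config is a plain JSON file and can be pretty-printed before launching a run (no specific field names are assumed here):

```python
# Pretty-print the experiment configuration before running it.
import json

with open("configs/exp1_real_noise.json") as f:
    config = json.load(f)
print(json.dumps(config, indent=2, sort_keys=True))
```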
- Run the simulated noise generation (a generic sketch of label flipping follows below):

  ```bash
  python scripts/calculate_data_overviews.py
  python scripts/create_simulated_noisy_sets.py
  ```
- Run the main experiment script:

  ```bash
  python main.py --config configs/exp1_simulated_noise.json
  ```
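For intuition, simulated label noise is commonly created by randomly flipping tags at a target noise rate. The sketch below shows plain uniform flipping over the CoNLL-03 tag set; it is only a hedged illustration of the general technique, not a description of what `scripts/create_simulated_noisy_sets.py` actually does (which may, for example, flip whole entity spans rather than single tags).

```python
# Uniform label flipping (illustrative): each tag is replaced by a
# different, uniformly chosen tag with probability `noise_rate`.
import random

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def flip_labels(tags: list[str], noise_rate: float = 0.1, seed: int = 42) -> list[str]:
    rng = random.Random(seed)
    return [
        rng.choice([l for l in LABELS if l != tag]) if rng.random() < noise_rate
        else tag
        for tag in tags
    ]

# Example: flip each tag with 50% probability (deterministic given the seed).
print(flip_labels(["O", "B-PER", "I-PER", "O", "B-LOC"], noise_rate=0.5))
```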