
Train - test leakage #13

Open
YannDubs opened this issue Jul 18, 2023 · 5 comments
Comments

@YannDubs
Hi, I looked into the OOD results, and many examples in the test sets also appear in the train sets. E.g., DengueFilipino has identical train and test sets, and KirundiNews has 90% overlap...

To reproduce:

from data import *
dataloaders = dict(DengueFilipino=load_filipino, 
                   KirundiNews=load_kirnews, 
                   KinyarwandaNews=load_kinnews, 
                   SwahiliNews=load_swahili)

for data_name, loader in dataloaders.items():
    train, test = loader()
    overlap = 1 - len(set(test) - set(train)) / len(set(test))
    print(data_name, f"train<->test overlap: {overlap * 100:.1f}%")
DengueFilipino train<->test overlap: 100.0%
KirundiNews train<->test overlap: 90.4%
KinyarwandaNews train<->test overlap: 23.8%
SwahiliNews train<->test overlap: 0.5%
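For reference, the overlap metric computed above can be checked on any pair of splits without the repo's loaders. A minimal, self-contained sketch with toy data (the examples here are placeholders, not the actual datasets):

```python
def overlap_fraction(train, test):
    """Fraction of unique test examples that also appear verbatim in train.

    Matches the formula above: 1 - |test \ train| / |test|, on de-duplicated sets.
    """
    train_set, test_set = set(train), set(test)
    return 1 - len(test_set - train_set) / len(test_set)

# toy example: both unique test items appear in train -> 100% overlap
train = [("doc a", 0), ("doc b", 1), ("doc c", 0)]
test = [("doc a", 0), ("doc b", 1)]
print(f"{overlap_fraction(train, test) * 100:.1f}%")  # 100.0%
```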
@kts

kts commented Jul 18, 2023

Wow. It looks like the issue is in the original Hugging Face datasets. There are lots of duplicates, too.

Here are the stats using only Hugging Face's datasets.load_dataset():

from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [

    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),

    ("kirnews",
     ["kinnews_kirnews","kirnews_cleaned"],
     ['label','title','content']),
    
    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label','title','content']),

    ("swahili",
     ['swahili_news'],
     ['label','text']),

]

lines = []
for name,args,keys in ds_info:

    ds = load_dataset(*args)

    # convert to list-of-tuples:
    train = [tuple([item[key] for key in keys]) for item in ds['train']]
    test  = [tuple([item[key] for key in keys]) for item in ds['test']]

    lines.append(name)

    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:",        len(train)),
        ("train unique:", len(set(train))),
        ("test:",         len(test)),
        ("test unique:",  len(set(test))),
        
        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")
    
print("\n".join(lines))
filipino
-------------------  ----  ------
train:               4015
train unique:        3947
test:                4015
test unique:         3947
train/test overlap:  3947  100.0%
-------------------  ----  ------


kirnews
-------------------  ----  -----
train:               3689
train unique:        1791
test:                 923
test unique:          698
train/test overlap:   631  90.4%
-------------------  ----  -----


kinnews
-------------------  -----  -----
train:               17014
train unique:         9199
test:                 4254
test unique:          2702
train/test overlap:    643  23.8%
-------------------  -----  -----


swahili
-------------------  -----  ----
train:               22207
train unique:        22207
test:                 7338
test unique:          7338
train/test overlap:     34  0.5%
-------------------  -----  ----
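Given stats like the above, leaked and duplicated rows can be dropped in one pass before evaluation. A minimal, stdlib-only sketch (clean_splits and the field names are illustrative placeholders, not the datasets' actual schema or this repo's code):

```python
def clean_splits(train, test, keys):
    """De-duplicate train, then drop any test row that also appears
    (field-for-field, over `keys`) in train."""
    def sig(row):
        return tuple(row[k] for k in keys)

    seen = set()
    train_clean = []
    for row in train:
        s = sig(row)
        if s not in seen:
            seen.add(s)
            train_clean.append(row)

    test_clean = [row for row in test if sig(row) not in seen]
    return train_clean, test_clean

# toy example: one duplicate in train, one leaked row in test
train = [{"text": "a", "label": 0}, {"text": "a", "label": 0}, {"text": "b", "label": 1}]
test = [{"text": "a", "label": 0}, {"text": "c", "label": 1}]
tr, te = clean_splits(train, test, keys=["text", "label"])
print(len(tr), len(te))  # 2 1
```

Note this only catches exact duplicates; near-duplicates (e.g., differing whitespace) would need normalization first.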

@ljvmiranda921

ljvmiranda921 commented Jul 18, 2023

I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (basically, the train and test sets are a 1:1 match).

I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

@bazingagin
Owner

Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the dataset issue with DengueFilipino. Thanks @kts for verifying the Hugging Face dataset issue. People should be aware of it and use the original links for those datasets. I will redo the Filipino experiment using the original link @ljvmiranda921 provided.
I will also check whether the KirundiNews overlap is present in the original dataset.
Thanks again!

@bazingagin
Owner

[Screenshot (2023-07-31): results on the original DengueFilipino dataset]

Here are the results using the original DengueFilipino dataset.
I also checked the original Kirundi dataset; it still has the data contamination issue.

@maoxuxu

maoxuxu commented Jul 18, 2024

> I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (basically, the train and test sets are a 1:1 match).
>
> I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

Hello, could you provide the Filipino dataset? I cannot download it from the original link. Thank you very much.
