
add substring decontamination #16

Merged — 3 commits merged into main on Oct 27, 2022
Conversation

RaymondLi0 (Contributor)
Exact-substring match for decontamination #13

This removes 336 files from the python-permissive dataset and 292 from the python-permissive-dedup dataset. All these removals match HumanEval samples.
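The exact-substring check described here can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the helper names and the benchmark strings are hypothetical (in the real PR the strings would be built from the HumanEval samples).

```python
# Hypothetical benchmark strings to screen for; in the PR these would be
# derived from HumanEval samples rather than hard-coded.
BENCHMARK_SNIPPETS = [
    "def fib(n):",
    "return sorted(numbers)",
]

def is_contaminated(source: str) -> bool:
    """True if any benchmark snippet appears verbatim in the source file."""
    return any(snippet in source for snippet in BENCHMARK_SNIPPETS)

def decontaminate(files):
    """Drop every file with an exact-substring match against the benchmark."""
    return [f for f in files if not is_contaminated(f)]
```

Files flagged by `is_contaminated` would be the ones counted in the removal numbers above.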

loubnabnl (Collaborator) left a comment

Great, thank you for adding this! I just left a comment about a missing file from which you import the HumanEval reader. It would also be helpful if you could add a README describing what is needed to run the code.
@ChenghaoMou, we can probably add instructions for running your code in that README too (I forgot to mention it in your PR).

from tqdm import tqdm
from multiprocessing import Pool

from human_eval.data import read_problems
Collaborator

I think the file human_eval is missing (btw we also have HumanEval on datasets for the future :) )

Contributor Author

Oh, actually I installed https://github.com/openai/human-eval directly as a package. I could modify the code to use HF datasets instead, yes.
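The switch to HF datasets mentioned here could look roughly like the sketch below. The dataset id `openai_humaneval` and its `prompt`/`canonical_solution` fields are the ones on the Hugging Face hub; the helper function itself is hypothetical.

```python
def humaneval_substrings(problems):
    """Build the strings to screen for by concatenating each HumanEval
    task's prompt with its canonical solution."""
    return [p["prompt"] + p["canonical_solution"] for p in problems]

# In the PR's setting, the problems would come from the hub instead of the
# openai/human-eval package, e.g.:
#   from datasets import load_dataset
#   problems = load_dataset("openai_humaneval", split="test")
#   substrings = humaneval_substrings(problems)
```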

assert len(data) == 500

# Checksum / version issues here
# dataset = load_dataset("mbpp", split="test")
Collaborator

I also noticed that; you will need to upgrade your datasets version (if the error still happens, there is the ignore_verifications=True argument).

loubnabnl (Collaborator) left a comment

Thanks for adding datasets!

@RaymondLi0 RaymondLi0 merged commit 04e5828 into main Oct 27, 2022
@RaymondLi0 RaymondLi0 deleted the substring-decontamination branch October 27, 2022 17:40
@ChenghaoMou ChenghaoMou mentioned this pull request Nov 1, 2022