
add substring decontamination #16

Merged — 3 commits merged into main on Oct 27, 2022
Conversation

RaymondLi0 (Contributor)
Exact-substring match for decontamination #13

This removes 336 files from the python-permissive dataset and 292 from the python-permissive-dedup dataset. All these removals match HumanEval samples.
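The exact-substring check described here can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the helper names and the benchmark strings are hypothetical (in the real PR the strings would be built from the HumanEval samples).

```python
# Hypothetical benchmark strings to screen for; in the PR these would be
# derived from HumanEval samples rather than hard-coded.
BENCHMARK_SNIPPETS = [
    "def fib(n):",
    "return sorted(numbers)",
]

def is_contaminated(source: str) -> bool:
    """True if any benchmark snippet appears verbatim in the source file."""
    return any(snippet in source for snippet in BENCHMARK_SNIPPETS)

def decontaminate(files):
    """Drop every file with an exact-substring match against the benchmark."""
    return [f for f in files if not is_contaminated(f)]
```

Files flagged by `is_contaminated` would be the ones counted in the removal numbers above.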

loubnabnl (Collaborator) left a comment

Great, thank you for adding this! I just left a comment about a missing file from which you import the HumanEval reader. It would also be helpful if you could add a README describing what is needed to run the code.
@ChenghaoMou, we can probably add instructions for running your code in that README too (I forgot to mention it in your PR).

from tqdm import tqdm
from multiprocessing import Pool

from human_eval.data import read_problems
Collaborator

I think the file human_eval is missing (btw we also have HumanEval on datasets for the future :) )

Contributor Author

Oh, actually I installed https://github.com/openai/human-eval directly as a package. I could modify the code to use HF datasets instead, yes.
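The switch to HF datasets mentioned here could look roughly like the sketch below. The dataset id `openai_humaneval` and its `prompt`/`canonical_solution` fields are the ones on the Hugging Face hub; the helper function itself is hypothetical.

```python
def humaneval_substrings(problems):
    """Build the strings to screen for by concatenating each HumanEval
    task's prompt with its canonical solution."""
    return [p["prompt"] + p["canonical_solution"] for p in problems]

# In the PR's setting, the problems would come from the hub instead of the
# openai/human-eval package, e.g.:
#   from datasets import load_dataset
#   problems = load_dataset("openai_humaneval", split="test")
#   substrings = humaneval_substrings(problems)
```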

assert len(data) == 500

# Checksum / version issues here
# dataset = load_dataset("mbpp", split="test")
Collaborator

I also noticed that; you will need to upgrade your datasets version (if the error still happens, there is the ignore_verifications=True argument).

loubnabnl (Collaborator) left a comment

Thanks for adding datasets!

@RaymondLi0 RaymondLi0 merged commit 04e5828 into main Oct 27, 2022
@RaymondLi0 RaymondLi0 deleted the substring-decontamination branch October 27, 2022 17:40
@ChenghaoMou ChenghaoMou mentioned this pull request Nov 1, 2022