huggingface dataset load error #4

Open
heyongxin233 opened this issue Mar 19, 2024 · 1 comment

Comments

@heyongxin233

I got an error when loading the dataset with the Hugging Face datasets library, as follows:

datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 2 new columns ({'split', 'index'})

This happened while the json dataset builder was generating data using

hf://datasets/ZachW/MGTDetect_CoCo/gpt3.5-davinci3/gpt3.5-Mixed-davinci3/gpt3.5_mixed_1000_train.jsonl (at revision aa49f92a8667f5a704ff576c728765c236940c6c)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)

My code:

from datasets import load_dataset
dataset = load_dataset("ZachW/MGTDetect_CoCo")

@YichenZW
Owner

Hi yongxin, I would suggest reading the files with json.loads() directly instead of load_dataset(); see L13 in preprocess/extract_keywords.py. The two extra columns you mentioned ('split' and 'index') are used by the crawler to log the source of the human-written text.
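
For readers following along, here is a minimal sketch of the json.loads() approach. It is not the code from preprocess/extract_keywords.py. It assumes the .jsonl file has already been downloaded locally and holds one JSON object per line. The helper name load_jsonl and its drop_keys parameter are hypothetical, introduced only for illustration.

```python
import json
from pathlib import Path


def load_jsonl(path, drop_keys=()):
    """Read a JSON Lines file into a list of dicts, one per non-empty line.

    `load_jsonl` and `drop_keys` are illustrative names, not part of the repo.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Optionally discard columns that only some files carry,
            # e.g. the crawler's 'split' and 'index' fields.
            for key in drop_keys:
                record.pop(key, None)
            records.append(record)
    return records


# Example call, assuming a local copy of the file named in the error message:
# rows = load_jsonl("gpt3.5_mixed_1000_train.jsonl", drop_keys=("split", "index"))
```

Because each line is parsed independently, files with differing columns load without the schema check that raises DatasetGenerationCastError.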
