huggingface dataset load error #4

heyongxin233 · 2024-03-19T03:55:57Z

I got an error when loading the dataset with the Hugging Face datasets library:

datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 2 new columns ({'split', 'index'})

This happened while the json dataset builder was generating data using

hf://datasets/ZachW/MGTDetect_CoCo/gpt3.5-davinci3/gpt3.5-Mixed-davinci3/gpt3.5_mixed_1000_train.jsonl (at revision aa49f92a8667f5a704ff576c728765c236940c6c)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)

My code:

from datasets import load_dataset
dataset = load_dataset("ZachW/MGTDetect_CoCo")

YichenZW · 2024-03-26T07:45:54Z

Hi yongxin, I would suggest reading the .jsonl files directly with json.loads() instead of load_dataset. You can refer to L13 in preprocess/extract_keywords.py. The 2 extra columns you mentioned ('split' and 'index') are used by the crawler to log the source of the human-written text.
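For anyone hitting the same error, here is a minimal sketch of the json.loads() approach. It is not the code from preprocess/extract_keywords.py (that file is not shown in this thread); the helper names read_jsonl and drop_columns are hypothetical, and it assumes you have already downloaded one of the .jsonl files to a local path. Dropping 'split' and 'index' is optional and only matters if you want every file to end up with the same keys.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, one per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def drop_columns(records, columns=("split", "index")):
    """Return copies of the records without the given keys, if present.

    'split' and 'index' are the two extra columns named in the error
    message; records that lack them are returned unchanged.
    """
    return [{k: v for k, v in r.items() if k not in columns} for r in records]


# Example usage (assumes the file was downloaded to the working directory):
# rows = read_jsonl("gpt3.5_mixed_1000_train.jsonl")
# rows = drop_columns(rows)
```

Because each line is parsed independently into a plain dict, files whose records carry different keys can be read without the column-matching check that load_dataset applies.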
