
add a api load dataset from [huggingface datasets] #11126

Closed

simplew2011 opened this issue May 17, 2024 · 4 comments

Labels
needs info (Needs further information from the user)

Comments

@simplew2011

@jrbourbeau (Member)

Thanks for the issue @simplew2011. We already support loading datasets from Hugging Face via fsspec's hf:// support (see https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html). For example:

import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/wikimedia/wikipedia/20231101.en")
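(A side note not from the original thread: the hf:// protocol is implemented by the huggingface_hub package's fsspec filesystem, so huggingface_hub needs to be installed for the snippet above to work.)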

Can you say more about what you're looking for? It could be that things already work.

jrbourbeau added the needs info (Needs further information from the user) label and removed the needs triage (Needs a response from a contributor) label on May 17, 2024
@simplew2011 (Author) commented May 20, 2024

  • Currently, it seems only JSONL files are supported? Loading a JSON file fails:

    • json
    [
    {"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"},
    {"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    ]
    
    • jsonl
    {"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    {"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    
    import json
    import dask.bag as db

    b_bg = db.read_text('examples/demos/datasets.jsonl').map(json.loads)  # ok
    b_bg1 = db.read_text('examples/demos/datasets.json').map(json.loads)  # fails with JSONDecodeError when computed
    
  • I have a usage scenario where I have already loaded raw data (such as local JSON and JSONL files) with the datasets library. I would like to convert it directly to a Dask DataFrame in memory instead of converting it to Parquet first (see the sketch after this list).
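A minimal sketch of that in-memory conversion (an editor's illustration, not from the thread; the file path and partition count are placeholders):

import dask.dataframe as dd
from datasets import load_dataset

# Load local JSONL with the Hugging Face datasets library.
ds = load_dataset("json", data_files="examples/demos/datasets.jsonl", split="train")

# Dataset.to_pandas() materializes the records in memory; from_pandas then
# wraps them in a Dask DataFrame without a Parquet round trip.
df = dd.from_pandas(ds.to_pandas(), npartitions=2)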

@jrbourbeau (Member)

Currently, it seems only JSONL files are supported? Loading a JSON file fails

See this related Stack Overflow question and answer: https://stackoverflow.com/questions/44889526/dask-bag-jsondecodeerror-when-reading-multiline-json-arrays. In short, the read_text function interprets every line of your file as a separate element.
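One workaround for bags (an editor's sketch, not from the thread) is to parse each whole file yourself and build the bag from the parsed records:

import glob
import json

import dask.bag as db

def load_records(path):
    # Parse the entire file as one JSON array instead of line by line.
    with open(path) as f:
        return json.load(f)

# One element per file path, then flatten the arrays into individual records.
b = db.from_sequence(glob.glob("examples/demos/*.json")).map(load_records).flatten()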

I would like to convert it directly to a Dask DataFrame in memory instead of converting it to Parquet first

Maybe you'll be better off just using Dask DataFrame's JSON reader? With data files like this:

data/0.json:

[
{"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"},
{"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
]

data/1.json:

[
{"id": 2, "text": "https://docs.dask.org/en/latest/bag-api.html"},
{"id": 3, "text": "https://docs.dask.org/en/latest/bag-api.html"}
]

You can read the data like this:

import dask.dataframe as dd

df = dd.read_json("data/*.json", lines=False)
print(f"{df.compute() = }")
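(For the JSONL layout from the earlier comment, the same reader should work with lines=True, e.g. dd.read_json("data/*.jsonl", lines=True), where each line is parsed as one record.)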

@simplew2011 (Author)

thanks
