
add a api load dataset from [huggingface datasets] #11126

Closed

simplew2011 opened this issue May 17, 2024 · 4 comments

Labels
needs info (Needs further information from the user)

Comments

@simplew2011

@jrbourbeau (Member)

Thanks for the issue @simplew2011. We already support loading datasets from Hugging Face via fsspec's hf:// support (see https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html). For example:

import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/wikimedia/wikipedia/20231101.en")
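(A side note not from the original thread: the hf:// protocol is implemented by the huggingface_hub package's fsspec filesystem, so huggingface_hub needs to be installed for the snippet above to work.)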

Can you say more about what you're looking for? It could be that things already work.

jrbourbeau added the needs info (Needs further information from the user) label and removed the needs triage (Needs a response from a contributor) label on May 17, 2024
@simplew2011 (Author) commented May 20, 2024

  • Currently, it seems only JSONL files are supported? Loading a JSON file fails:

    • json
    [
    {"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"},
    {"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    ]
    
    • jsonl
    {"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    {"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
    
    import json
    import dask.bag as db

    b_bg = db.read_text('examples/demos/datasets.jsonl').map(json.loads)  # ok
    b_bg1 = db.read_text('examples/demos/datasets.json').map(json.loads)  # fails with JSONDecodeError when computed
    
  • I have a usage scenario where I have already loaded raw data (such as local JSON and JSONL files) with the datasets library. I would like to convert it directly to a Dask DataFrame in memory instead of converting it to Parquet first (see the sketch after this list).
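A minimal sketch of that in-memory conversion (an editor's illustration, not from the thread; the file path and partition count are placeholders):

import dask.dataframe as dd
from datasets import load_dataset

# Load local JSONL with the Hugging Face datasets library.
ds = load_dataset("json", data_files="examples/demos/datasets.jsonl", split="train")

# Dataset.to_pandas() materializes the records in memory; from_pandas then
# wraps them in a Dask DataFrame without a Parquet round trip.
df = dd.from_pandas(ds.to_pandas(), npartitions=2)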

@jrbourbeau (Member)

Currently, it seems only JSONL files are supported? Loading a JSON file fails

See this related Stack Overflow question and answer: https://stackoverflow.com/questions/44889526/dask-bag-jsondecodeerror-when-reading-multiline-json-arrays. In short, the read_text function interprets every line of your file as a separate element.
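One workaround for bags (an editor's sketch, not from the thread) is to parse each whole file yourself and build the bag from the parsed records:

import glob
import json

import dask.bag as db

def load_records(path):
    # Parse the entire file as one JSON array instead of line by line.
    with open(path) as f:
        return json.load(f)

# One element per file path, then flatten the arrays into individual records.
b = db.from_sequence(glob.glob("examples/demos/*.json")).map(load_records).flatten()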

I would like to convert it directly to a Dask DataFrame in memory instead of converting it to Parquet first

Maybe you'll be better off just using Dask DataFrame's JSON reader? With data files like this:

data/0.json:

[
{"id": 0, "text": "https://docs.dask.org/en/latest/bag-api.html"},
{"id": 1, "text": "https://docs.dask.org/en/latest/bag-api.html"}
]

data/1.json:

[
{"id": 2, "text": "https://docs.dask.org/en/latest/bag-api.html"},
{"id": 3, "text": "https://docs.dask.org/en/latest/bag-api.html"}
]

You can read the data like this:

import dask.dataframe as dd

df = dd.read_json("data/*.json", lines=False)
print(f"{df.compute() = }")
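(For the JSONL layout from the earlier comment, the same reader should work with lines=True, e.g. dd.read_json("data/*.jsonl", lines=True), where each line is parsed as one record.)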

@simplew2011 (Author)

thanks
