
Hugging Face Hub integration #1227

Open
jorritsandbrink opened this issue Apr 16, 2024 · 3 comments
jorritsandbrink (Collaborator) commented Apr 16, 2024

Feature description

  • I want to read files from a dataset hosted on the Hugging Face Hub as a dlt source
  • I want to write files to a dataset hosted on the Hugging Face Hub as a dlt destination

Supported file types

While the HF Hub can host any file type (it's a Git repo under the hood), it's probably wise to limit initial support to dlt's currently supported tabular data file formats: csv, jsonl, and parquet. We could extend support later if users show interest.

Approach

In line with Hugging Face's direction, it makes sense to treat the HF Hub as a cloud storage bucket similar to Amazon S3, Azure Blob Storage, and Google Cloud Storage. This should also make it relatively straightforward to implement the desired functionality, as the huggingface_hub Python library implements fsspec, which dlt can already handle.
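As a rough illustration of the fsspec angle (not part of any implementation yet), the sketch below lists and reads files from a Hub dataset repo with huggingface_hub's HfFileSystem; the repo id jorritsandbrink/dlt_dev, the file name, and the HF_TOKEN environment variable are placeholders/assumptions:

# Minimal sketch: treating the HF Hub as an fsspec filesystem via huggingface_hub.
import os

from huggingface_hub import HfFileSystem

# A token is only needed for private repos or writes; public datasets can be read anonymously.
fs = HfFileSystem(token=os.environ.get("HF_TOKEN"))

# Dataset repos are addressed as "datasets/<user>/<repo>".
files = fs.ls("datasets/jorritsandbrink/dlt_dev", detail=False)
print(files)

# Files can be opened like any other fsspec path.
with fs.open("datasets/jorritsandbrink/dlt_dev/foo.csv", "r") as f:
    print(f.read())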

Supported write dispositions

Using this approach, HF Hub will fall under the filesystem source/destination category. The filesystem destination supports the append and replace write dispositions, but not merge. We can implement merge later if users show interest (that would be for the filesystem destination as a whole, not just HF Hub).
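To make the proposal concrete, here is a sketch of what writing to the Hub through the existing filesystem destination could look like. The hf:// bucket URL scheme is hypothetical (it does not exist in dlt today); the rest uses dlt's current filesystem destination API:

# Hypothetical sketch only: the "hf://" bucket_url scheme is the proposed integration,
# not something dlt supports today.
import os

import dlt

# Proposed: point the filesystem destination at a HF Hub dataset repo.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "hf://datasets/jorritsandbrink/dlt_dev"

pipeline = dlt.pipeline(
    pipeline_name="hf_hub_demo",
    destination="filesystem",
    dataset_name="demo",
)

# append and replace are the write dispositions the filesystem destination supports today.
pipeline.run(
    [{"foo": 1, "bar": 2}, {"foo": 3, "bar": 4}],
    table_name="numbers",
    loader_file_format="parquet",
    write_disposition="append",
)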

Missing functionality

  • Hugging Face credentials abstraction
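For the credentials abstraction above, a hypothetical sketch of what a dlt credentials spec for the Hub might look like; the class name and fields are made up, while configspec and CredentialsConfiguration are the existing dlt building blocks used by other credential types:

# Hypothetical sketch of a Hugging Face credentials spec for dlt.
from typing import Optional

from dlt.common.configuration import configspec
from dlt.common.configuration.specs import CredentialsConfiguration
from dlt.common.typing import TSecretStrValue


@configspec
class HfHubCredentials(CredentialsConfiguration):
    # A user access token from https://huggingface.co/settings/tokens
    token: TSecretStrValue = None
    # Optional custom endpoint, e.g. for private Hub deployments
    endpoint: Optional[str] = None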

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

No response

Proposed solution

No response

Related issues

No response

@jorritsandbrink added the enhancement, source, and destination labels Apr 16, 2024
@jorritsandbrink self-assigned this Apr 16, 2024
rudolfix (Collaborator) commented:
@jorritsandbrink looks like a good first step.

I want to read files from a dataset hosted on the Hugging Face Hub as a dlt source

I'm curious, however, how the HF datasets library would behave with the data we produce. If you create a few parquet files whose schema evolves (we append columns at the end) and push them to the same dataset, what happens when you request them with load_dataset? Can I upload CSVs and see them as a single parquet file on the client side? Also, what about streaming?

from datasets import load_dataset

dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)

So:

I want to access my dataset, train on it, and stream it like any other HF dataset.

jorritsandbrink (Collaborator, Author) commented Apr 18, 2024

@rudolfix

  1. Schema evolution does not seem to be supported. I did a simple test with CSV and Parquet. Multiple files can be handled, but only if they contain the same column names (an error is thrown otherwise). I couldn't find a config option to enable it.

from datasets import load_dataset

load_dataset("jorritsandbrink/dlt_dev", data_files=["foo.csv", "baz.csv"], sep=";")  # baz.csv has extra column "bla"

Error:

DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 1 new columns ({' bla'})

This happened while the csv dataset builder was generating data using

hf://datasets/jorritsandbrink/dlt_dev/baz.csv (at revision 6dab0737041dfeef3cc8446f61a1ecd059bec7e0)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)

It is, however, possible to load only the initial set of columns (ignoring the new column) by specifying features:

from datasets import Features, Value, load_dataset

load_dataset(
    "jorritsandbrink/dlt_dev",
    data_files=["foo.csv", "baz.csv"],
    features=Features(
        {"foo": Value(dtype="int64", id=None), "bar": Value(dtype="int64", id=None), "baz": Value(dtype="int64", id=None)}
    ),
    sep=";"
)  # baz.csv has extra column "bla"

Result:

DatasetDict({
    train: Dataset({
        features: ['foo', ' bar', ' baz'],
        num_rows: 2
    })
})

Specifying a column in features that is not present in all data files causes an error.

I'd say we restrict schema evolution to prevent uploading datasets that are difficult to use (we don't want to place the burden of specifying features on the user). Schema contracts provide this functionality (a rough sketch follows after this list), but do we want to use them while they're still in an experimental phase?

  2. I don't know what you mean by "see them as a single parquet on the client side", but streaming from multiple CSV files seems to work.
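To illustrate the schema-contract idea from point 1, a minimal sketch (the resource name and data are made up; schema_contract is existing, though still experimental, dlt API) of freezing columns so that evolved files never reach the Hub:

# Minimal sketch of restricting column evolution with dlt schema contracts.
import dlt


@dlt.resource(name="events", schema_contract={"columns": "freeze"})
def events():
    yield {"foo": 1, "bar": 2}
    # A row with a new column would now raise a contract violation instead of
    # silently evolving the schema of the files we upload:
    # yield {"foo": 3, "bar": 4, "bla": 5}


pipeline = dlt.pipeline(pipeline_name="contract_demo", destination="filesystem")
pipeline.run(events(), loader_file_format="parquet")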

Edit:

Okay, I learned there's a parquet-converter bot that asynchronously converts the files in the dataset to Parquet in a dedicated branch called refs/convert/parquet.

rudolfix (Collaborator) commented Apr 24, 2024

@jorritsandbrink It seems we'd need to bring the files locally, merge them using Arrow or DuckDB, and then emit a unified dataset. This seems both very useful and doable. However, we need more information on what HF users actually need, because this is IMO quite an investment.
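As a sketch of the "bring files locally and merge" idea (the file names are placeholders), DuckDB can union parquet files with evolving schemas by column name and emit a single unified file:

# Sketch: unify parquet files whose schemas evolved (columns appended) into one dataset.
# union_by_name aligns columns by name and fills missing ones with NULL.
import duckdb

duckdb.sql(
    """
    COPY (
        SELECT * FROM read_parquet(['foo.parquet', 'baz.parquet'], union_by_name = true)
    ) TO 'unified.parquet' (FORMAT parquet)
    """
)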

@rudolfix removed the source label Apr 25, 2024