Hugging Face Hub integration #1227
Comments
@jorritsandbrink looks like a good first step.

I'm however curious how the HF datasets library would behave with the data we produce. If you create a few parquet files where the schema evolves (we append columns at the end), push them to the same dataset, and then request them with `load_dataset` like so:

```python
from datasets import Features, Value, load_dataset

load_dataset("jorritsandbrink/dlt_dev", data_files=["foo.csv", "baz.csv"], sep=";")  # baz.csv has extra column "bla"
```

Error:

It is however possible to load only the initial set of columns, ignoring the new column, by specifying `features`:

```python
load_dataset(
    "jorritsandbrink/dlt_dev",
    data_files=["foo.csv", "baz.csv"],
    features=Features(
        {"foo": Value(dtype="int64", id=None), "bar": Value(dtype="int64", id=None), "baz": Value(dtype="int64", id=None)}
    ),
    sep=";",
)  # baz.csv has extra column "bla"
```

Result:

```python
DatasetDict({
    train: Dataset({
        features: ['foo', ' bar', ' baz'],
        num_rows: 2
    })
})
```

Specifying a column in

I'd say we restrict schema evolution to prevent uploading datasets that are difficult to use (we don't want to place the burden of specifying `features` on the user).

Edit: Okay, I learned there's a
@jorritsandbrink seems that we'd need to bring the files locally and merge them using arrow or duckdb, then emit a unified dataset. This seems both very useful and doable. However, we need to get more info on what the HF users need, because this is IMO quite an investment.
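A minimal sketch of that idea, assuming the files sit in a Hub dataset repo (the repo id below is made up): pull the parquet files locally with `snapshot_download`, then let DuckDB unify the evolving schemas with `union_by_name`.

```python
import duckdb
from huggingface_hub import snapshot_download

# Download only the parquet files of the (hypothetical) dataset repo.
local_dir = snapshot_download(
    repo_id="jorritsandbrink/dlt_dev",
    repo_type="dataset",
    allow_patterns="*.parquet",
)

# union_by_name=true aligns evolving schemas: columns missing from
# older files come back as NULL instead of raising an error.
unified = duckdb.sql(
    f"SELECT * FROM read_parquet('{local_dir}/**/*.parquet', union_by_name=true)"
).arrow()
```

The resulting `pyarrow.Table` has the superset schema and could then be re-uploaded as a single, easy-to-load dataset.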
Feature description
- `dataset` hosted on the Hugging Face Hub as a `dlt` source
- `dataset` hosted on the Hugging Face Hub as a `dlt` destination

Supported file types
While the HF Hub can host any filetype (it's a Git repo under the hood), it's probably wise to limit initial support to `dlt`'s currently supported tabular data file formats: `csv`, `jsonl`, and `parquet`. We could extend later if users show interest.

Approach
In line with Hugging Face's direction, it makes sense to treat the HF Hub as a cloud storage bucket similar to Amazon S3, Azure Blob Storage, and Google Cloud Storage. This should also make it relatively straightforward to implement the desired functionality, as the `huggingface_hub` Python library implements `fsspec`, which `dlt` can already handle.
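Concretely, the `fsspec` implementation referred to above is `huggingface_hub.HfFileSystem`. A small sketch (the repo id is made up):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()  # picks up a cached HF token, if any

# Same fsspec API surface as s3fs/gcsfs/adlfs.
print(fs.ls("datasets/jorritsandbrink/dlt_dev", detail=False))

# Writing a file, as a filesystem destination would do under the hood.
with fs.open("datasets/jorritsandbrink/dlt_dev/new_file.jsonl", "w") as f:
    f.write('{"foo": 1, "bar": 2}\n')
```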
Supported write dispositions
Using this approach, HF Hub will fall under the `filesystem` source/destination category. The `filesystem` destination supports the `append` and `replace` write dispositions, but not `merge`. We can implement `merge` later if users show interest (that would be for the `filesystem` destination as a whole, not just HF Hub).
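For illustration, a hedged sketch of the two supported dispositions on today's `filesystem` destination (pipeline, dataset, and table names are placeholders; pointing the destination at the Hub is exactly what this issue proposes, not something that works yet):

```python
import dlt

# bucket_url is read from config, e.g. the env var
# DESTINATION__FILESYSTEM__BUCKET_URL (a local dir or s3://... today).
pipeline = dlt.pipeline(
    pipeline_name="hub_demo",
    destination="filesystem",
    dataset_name="dlt_dev",
)

rows = [{"foo": 1, "bar": 2}]

# append: each run adds new files next to the existing ones.
pipeline.run(rows, table_name="events", write_disposition="append")

# replace: each run first drops the table's previous files.
pipeline.run(rows, table_name="events", write_disposition="replace")
```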
Missing functionality
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
No response
Proposed solution
No response
Related issues
No response