Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for HuggingFace to httpfs #11831

Merged
merged 16 commits into from
May 13, 2024
Merged

Conversation

samansmink
Copy link
Contributor

@samansmink samansmink commented Apr 25, 2024

This PR adds native support for HuggingFace urls to query hugging face datasets directly from DuckDB.

Features

  • hf:// url format support (including specifying the branch)
  • hf:// url globbing
  • HUGGINGFACE secret type
    • manually provided token
    • fetched automatically from ~/.cache/huggingface/token
  • autoload httpfs extension on:
    • huggingface secret creation
    • query any hf:// url
  • supports pagination
  • supports duckdb's retry mechanism

How to use

(optionally) load credentials using a token

CREATE SECRET hf1 (TYPE HUGGINGFACE, TOKEN 'hf_my_very_secret_token');

(optionally) load credentials using ~/.cache/huggingface/token

CREATE SECRET hf2 (TYPE HUGGINGFACE, PROVIDER cache);

Now query some data (if secret was created, this can be private data)

SELECT * FROM parquet_scan('hf://datasets/me/my_data/some_file/test.parquet')
SELECT * FROM parquet_scan('hf://datasets/me/my_data/some_file/with/a/glob/**/t[eab]st.parquet')

To query from custom branches:

SELECT * FROM parquet_scan('hf://datasets/me@my-branch/my_data/some_file/test.parquet')
SELECT * FROM parquet_scan('hf://datasets/me@~parquet/my_data/some_file/test.parquet')

Testing

Everything is tested that was added, but it is a bit rudimentary still and could use improvement: I added a test that queries some dataset in my huggingface account with my token. This should be improved by moving a token for a test account into DuckDB CI then creating a dedicated job for this.

Also I tested the pagination by exposing a limit param manually. This forces HF's API to return the list query results one by one

Future work

  • fully implement generic bearer token support and add tests for this

@samansmink
Copy link
Contributor Author

hey @lhoestq @severo @Wauplin, I have a first version working in this PR, happy to take any feedback here!

extension/httpfs/hffs.cpp Outdated Show resolved Hide resolved
extension/httpfs/hffs.cpp Outdated Show resolved Hide resolved
extension/httpfs/hffs.cpp Show resolved Hide resolved
extension/httpfs/hffs.cpp Show resolved Hide resolved
Copy link

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks amazing :) Nothing to add to what @severo and @Wauplin said except minor comments

extension/httpfs/hffs.cpp Show resolved Hide resolved
extension/httpfs/hffs.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 1, 2024 11:02
@samansmink samansmink marked this pull request as ready for review May 2, 2024 11:36
@samansmink
Copy link
Contributor Author

I've processed all comments, If CI passes this is good to go from my side.

Thanks for all the reviews!

extension/httpfs/hffs.cpp Outdated Show resolved Hide resolved
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 2, 2024 11:47
@samansmink samansmink marked this pull request as ready for review May 2, 2024 11:48
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 2, 2024 11:53
@samansmink samansmink marked this pull request as ready for review May 2, 2024 12:12
@Mytherin
Copy link
Collaborator

Mytherin commented May 3, 2024

Can you solve the merge conflict?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 6, 2024 08:03
@samansmink samansmink marked this pull request as ready for review May 6, 2024 08:09
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 6, 2024 10:38
@samansmink samansmink marked this pull request as ready for review May 8, 2024 11:42
@Mytherin Mytherin merged commit b9ecd8d into duckdb:main May 13, 2024
42 of 43 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 13, 2024
Merge pull request duckdb/duckdb#11831 from samansmink/hugging-face-fs
Merge pull request duckdb/duckdb#11981 from Tishj/python_optional_numpy
Merge pull request duckdb/duckdb#11980 from Tishj/pyfs_needs_gil
@severo
Copy link

severo commented May 15, 2024

Thanks for supporting hf://! What are the next steps regarding the docs and the release?

@AndreaFrancis
Copy link

Hi @samansmink, I loved the native HF implementation and am preparing some documents to share on the Dataset Viewer page: https://huggingface.co/docs/datasets-server/duckdb (still in progress).
I was trying the credential_chain provider, but it seems to have been an issue the first time, or maybe I am using it incorrectly.

These are the steps to reproduce:

  1. Access to duckdb
  2. Ensure that there is a configured token in ~/.cache/huggingface/token or HF_TOKEN is set
  3. Try to create the secret using credential_chain
    Invalid Input Error: Secret provider 'credential_chain' not found for type 'huggingface'
  4. Now, try to execute another HF operation like CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token'); or try to get some files using the hf httpfs
  5. Try to create the secret using credential_chain again -> OK

See the following example:
image

It works after another httpfs usage for HF. I'm not sure if this is expected or not, but it might lead to confusion for users trying to set the token the very first time.

@samansmink
Copy link
Contributor Author

@AndreaFrancis ah good catch, PR is up here #12112

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants