Add support for HuggingFace to httpfs #11831
Looks good from my side
I've processed all comments. If CI passes, this is good to go from my side. Thanks for all the reviews!
Can you solve the merge conflict?
Thanks!
Merge pull request duckdb/duckdb#11831 from samansmink/hugging-face-fs
Merge pull request duckdb/duckdb#11981 from Tishj/python_optional_numpy
Merge pull request duckdb/duckdb#11980 from Tishj/pyfs_needs_gil
Thanks for supporting hf://! What are the next steps regarding the docs and the release?
Hi @samansmink, I loved the native HF implementation and am preparing some documents to share on the Dataset Viewer page: https://huggingface.co/docs/datasets-server/duckdb (still in progress). These are the steps to reproduce:
It works after another httpfs usage for HF. I'm not sure if this is expected or not, but it might lead to confusion for users trying to set the token the very first time.
@AndreaFrancis ah good catch, PR is up here #12112
This PR adds native support for Hugging Face URLs, so that Hugging Face datasets can be queried directly from DuckDB.
Features
- hf:// URL format support (including specifying the branch)
- hf:// URL globbing
- HUGGINGFACE secret type
- ~/.cache/huggingface/token
- hf:// URL
How to use
(optionally) load credentials using a token
CREATE SECRET hf1 (TYPE HUGGINGFACE, TOKEN 'hf_my_very_secret_token');
(optionally) load credentials using ~/.cache/huggingface/token
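As a sketch of this step, assuming the credential_chain secret provider is what reads the cached token file (the secret name hf2 is illustrative, not taken from the PR):

```sql
-- Assumption: the credential_chain provider picks up the token stored in
-- ~/.cache/huggingface/token, so no token appears in the statement itself.
-- The secret name hf2 is a placeholder.
CREATE SECRET hf2 (TYPE HUGGINGFACE, PROVIDER credential_chain);
```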
Now query some data (if secret was created, this can be private data)
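A minimal sketch of such a query, assuming the hf://datasets/<user>/<dataset>/<path> layout. The names my_user and my_dataset and the file paths are hypothetical placeholders:

```sql
-- Read a single file from a dataset repository (placeholder names).
SELECT *
FROM 'hf://datasets/my_user/my_dataset/data.parquet'
LIMIT 10;

-- Globbing: scan every Parquet file in the repository, at any depth.
SELECT count(*)
FROM 'hf://datasets/my_user/my_dataset/**/*.parquet';
```

If the repository is private, a HUGGINGFACE secret such as the one created earlier needs to exist before the query runs.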
To query from custom branches:
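For illustration, assuming the branch is appended to the dataset name with an @ separator (my_user, my_dataset, and my_branch are hypothetical placeholders):

```sql
-- Same layout as a default-branch query, with @<branch> after the dataset name.
SELECT *
FROM 'hf://datasets/my_user/my_dataset@my_branch/data.parquet';
```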
Testing
Everything that was added is tested, but the tests are still a bit rudimentary and could be improved: I added a test that queries a dataset in my Hugging Face account using my token. This should be improved by moving a token for a test account into DuckDB CI and then creating a dedicated job for it.
I also tested the pagination by manually exposing a limit parameter, which forces HF's API to return the list query results one at a time.
Future work