Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792
Hi @alamb. I am still working on this: completing the UTs and E2Es and fixing bad code style... Could you please do a quick preliminary review of the ideas behind... I did not find a simple way to implement this other than creating a wrapper
b797d28 to 4591734
This is awesome, can you add a section here? https://github.com/apache/datafusion/blob/main/docs/source/user-guide/cli/datasources.md
This is now ready for review. I have completed most of the refinement of the initial code.
)
STORED AS parquet
LOCATION "hf://datasets/cais/mmlu/astronomy/";
```
This is currently not working because:
- The Hugging Face user access token is case-sensitive.
- A previous change forces every option value to lowercase: https://github.com/apache/datafusion/pull/9723/files.
I will look into the history to see whether this is feasible.
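To illustrate the conflict described above, here is a minimal Python sketch. It is not DataFusion's actual code, and the option key `hf.user_access_token` and the token value are hypothetical. It only shows how lowercasing every option value corrupts a case-sensitive token, while lowercasing only the keys preserves it:

```python
def normalize_options(options: dict[str, str], lowercase_values: bool) -> dict[str, str]:
    """Lowercase option keys; lowercase the values too only if requested."""
    return {
        key.lower(): (value.lower() if lowercase_values else value)
        for key, value in options.items()
    }


# Hypothetical option key and token value, for illustration only.
raw = {"HF.USER_ACCESS_TOKEN": "hf_AbCdEf123"}

# Behavior that breaks authentication: the token's case is lost.
lowered = normalize_options(raw, lowercase_values=True)

# Behavior a case-sensitive token needs: keys normalized, value untouched.
preserved = normalize_options(raw, lowercase_values=False)

print(lowered["hf.user_access_token"])
print(preserved["hf.user_access_token"])
```

Under this assumption, any fix would need to keep key normalization while leaving values such as tokens untouched.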
I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)
Ya, I first saw this on my LinkedIn timeline, and I am glad to be part of this awesome project. I will pause updating this PR, since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I can update it in the following iterations.
If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )
@@ -0,0 +1,9 @@
select count(*) from "hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet";
😮 -- tests too!
I saw that DuckDB builds its own organizations and datasets, instead of depending on random existing ones, to make its CI safe. It would be too early for this PR to do so... It is still at an early stage.
Ya. Re-implementing the datastore and associated facilities takes a lot of code. Do you think the
Sorry for the delay -- I plan to review this tomorrow
Sorry, I lost my GitHub connection for a couple of days and just returned. Also, thanks. Please take your time.
This is still on my list, but I am behind in my reviews
Which issue does this PR close?
Closes #10720
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?