Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792

Open
wants to merge 20 commits into
base: main

xinlifoobar
Contributor

@xinlifoobar xinlifoobar commented Jun 4, 2024

Which issue does this PR close?

Closes #10720

Rationale for this change

What changes are included in this PR?

Are these changes tested?

# xinli @ arch-dev in ~/source/repos/datafusion/datafusion-cli on git:dev/xinli/hfstore o [12:16:09] 
$ ./target/debug/datafusion-cli
DataFusion CLI v38.0.0
> SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet';
+-------+
| count |
+-------+
| 5     |
+-------+
1 row(s) fetched. 
Elapsed 2.469 seconds.

> create external table test stored as parquet location "hf://datasets/cais/mmlu/astronomy/";
0 row(s) fetched. 
Elapsed 1.398 seconds.

> select count(*) from test;
+----------+
| COUNT(*) |
+----------+
| 173      |
+----------+
1 row(s) fetched. 
Elapsed 1.199 seconds.

> select * from test limit 2
;
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| question                                                                             | subject   | choices                                                                                                                                                                                                                                                                                                                          | answer |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| You cool a blackbody to half its original temperature. How does its spectrum change? | astronomy | [Power emitted is 1/16 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/4 times as high; peak emission wavelength is 2 times longer., Power emitted is 1/4 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/16 times as high; peak emission wavelength is 2 times longer.] | 3      |
| What drives differentiation?                                                         | astronomy | [Spontaneous emission from radioactive atoms., The minimization of gravitational potential energy., Thermally induced collisions., Plate tectonics.]                                                                                                                                                                             | 1      |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
2 row(s) fetched. 
Elapsed 1.098 seconds.

Are there any user-facing changes?

@xinlifoobar xinlifoobar changed the title [Draft][DONNOT MERGE] Impl for HF_Store Feat: Implement hf:// / "hugging face" integration in datafusion-cli Jun 7, 2024
@xinlifoobar xinlifoobar marked this pull request as ready for review June 7, 2024 04:15
@xinlifoobar
Contributor Author

Hi @alamb. I am still working on this: completing the unit tests and end-to-end tests and cleaning up the code style. Could you please do a quick preliminary review of the underlying idea? I did not find a simpler way to implement this than writing a wrapper ObjectStore impl on top of HttpStore.
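
A minimal sketch of the path translation such a wrapper might perform before delegating to an HTTP-based store. This is not the code from this PR: `hf_to_https` is a hypothetical helper, and the `https://huggingface.co/<repo type>/<owner>/<name>/resolve/main/<path>` layout (including the fixed `main` revision) is an assumption about how files are downloaded from the Hugging Face Hub. Directory locations such as `hf://datasets/cais/mmlu/astronomy/` would also need a listing mechanism, which this sketch does not cover.

```rust
/// Hypothetical helper: translate an `hf://` location into an HTTPS URL.
/// Assumes the `resolve/<revision>` download layout and the `main` revision.
fn hf_to_https(location: &str) -> Option<String> {
    // Reject anything that is not an hf:// location.
    let rest = location.strip_prefix("hf://")?;

    // Split into repo type, owner, repo name, and the remaining file path.
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?; // e.g. "datasets"
    let owner = parts.next()?; // e.g. "cais"
    let name = parts.next()?; // e.g. "mmlu"
    let path = parts.next().unwrap_or(""); // e.g. "astronomy/dev-00000-of-00001.parquet"

    if repo_type.is_empty() || owner.is_empty() || name.is_empty() {
        return None;
    }

    Some(format!(
        "https://huggingface.co/{}/{}/{}/resolve/main/{}",
        repo_type, owner, name, path
    ))
}
```

Under these assumptions, `hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet` maps to `https://huggingface.co/datasets/cais/mmlu/resolve/main/astronomy/dev-00000-of-00001.parquet`, and any location without the `hf://` prefix yields `None`.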

Contributor

@edmondop edmondop left a comment

@xinlifoobar
Contributor Author

Hi @alamb. I am still working on this: completing the unit tests and end-to-end tests and cleaning up the code style. Could you please do a quick preliminary review of the underlying idea? I did not find a simpler way to implement this than writing a wrapper ObjectStore impl on top of HttpStore.

This is ready for review now. I have finished most of the cleanup of the initial code.

)
STORED AS parquet
LOCATION "hf://datasets/cais/mmlu/astronomy/";
```
Contributor Author

This currently does not work because:

  1. The Hugging Face user access token is case-sensitive.
  2. A previous change forces every option value to lower case: https://github.com/apache/datafusion/pull/9723/files.

I will look into the history of that change to see whether this is feasible.
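
A small sketch of why these two points conflict. Both functions and the token value are hypothetical stand-ins for illustration; they are not DataFusion or Hugging Face APIs, and the token is not a real credential.

```rust
/// Hypothetical stand-in for option handling that lowercases every value.
fn normalize_option_value(value: &str) -> String {
    value.to_lowercase()
}

/// Hypothetical stand-in for a server-side check: exact, case-sensitive match.
fn token_matches(expected: &str, supplied: &str) -> bool {
    expected == supplied
}
```

With a made-up mixed-case token such as `hf_ExampleTokenABC`, `normalize_option_value` returns `hf_exampletokenabc`, so `token_matches` fails even though the user supplied the correct token.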

@alamb
Contributor

alamb commented Jun 11, 2024

I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

@xinlifoobar
Contributor Author

I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

Yeah, I first saw this on my LinkedIn timeline, and I am glad to be part of this awesome project.

I would like to pause updates to this PR, since IMO it is extremely large and difficult to review. Let me know your thoughts and I can update it in later iterations.

@xinlifoobar xinlifoobar reopened this Jun 13, 2024
@alamb
Contributor

alamb commented Jun 15, 2024

I would like to pause updates to this PR, since IMO it is extremely large and difficult to review. Let me know your thoughts and I can update it in later iterations.

If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

@@ -0,0 +1,9 @@
select count(*) from "hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet";
Contributor

😮 -- tests too!

Contributor Author

I saw that DuckDB created its own organization and datasets, instead of depending on arbitrary existing ones, to keep its CI stable. It is too early for this PR to do the same; it is still at an early stage.

@xinlifoobar
Contributor Author

I would like to pause updates to this PR, since IMO it is extremely large and difficult to review. Let me know your thoughts and I can update it in later iterations.

If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

Yeah, re-implementing the data store and its associated facilities takes a lot of code. Do you think the ObjectStore approach is the right way to go? If so, I could split part of the code out into a datafusion-contrib repo.

@alamb
Contributor

alamb commented Jun 28, 2024

Sorry for the delay -- I plan to review this tomorrow

@xinlifoobar
Contributor Author

Sorry for the delay -- I plan to review this tomorrow

Sorry, I lost my GitHub connection for a couple of days and have just returned. Thanks as well; please take your time.

@alamb
Contributor

alamb commented Jul 7, 2024

This is still on my list, but I am behind in my reviews

Development

Successfully merging this pull request may close these issues.

Implement hf:// / "hugging face" integration in datafusion-cli
3 participants