feat(query): add native format in fuse table #9279

Merged
merged 6 commits into from
Dec 19, 2022


sundy-li
Member

@sundy-li sundy-li commented Dec 18, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Summary about this PR

  1. bump arrow2 version
  2. Introduce an experimental native storage_format in fuse

Closes #issue

@sundy-li sundy-li marked this pull request as draft December 18, 2022 06:03
@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Dec 18, 2022
@sundy-li
Member Author

sundy-li commented Dec 18, 2022

How to use:

 create table tmp (a int) ENGINE=FUSE STORAGE_FORMAT='native';

It shows a large performance improvement on the hits dataset:

MySQL [(none)]> select count(1),  max(URL) from hits;
+----------+-----------------------------------------+
| count(1) | max(url)                                |
+----------+-----------------------------------------+
| 99997497 | https://yugra-advert2792270][to]=&input |
+----------+-----------------------------------------+
1 row in set (4.491 sec)

MySQL [(none)]> select count(1),  max(URL) from hits_native;
+----------+-----------------------------------------+
| count(1) | max(url)                                |
+----------+-----------------------------------------+
| 99997497 | https://yugra-advert2792270][to]=&input |
+----------+-----------------------------------------+
1 row in set (0.506 sec)
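
For reference, the two timings above work out to roughly an 8.9x speedup for this one query. A minimal sketch of the arithmetic follows; the variable names are illustrative, and it assumes that `hits` is the default parquet-based fuse table and `hits_native` is the same data stored with STORAGE_FORMAT='native':

```python
# Timings copied from the two query results above, in seconds.
# Assumption: `hits` uses the default parquet-based fuse format,
# `hits_native` uses STORAGE_FORMAT='native'.
parquet_seconds = 4.491
native_seconds = 0.506

# Ratio of the two wall-clock times for this single query.
speedup = parquet_seconds / native_seconds
print(f"native is about {speedup:.1f}x faster on this query")
```

This covers a single query only, so it is not comparable to an average taken over the whole benchmark.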


@sundy-li sundy-li marked this pull request as ready for review December 19, 2022 09:06
@sundy-li
Member Author

On average, 1.85x faster on the hits dataset:

image

@BohuTANG
Member

BohuTANG commented Dec 19, 2022

Based on S3:
image

On Q7 and Q8, native is faster than parquet. For the other queries the difference is not obvious, and a number of them are even slower than parquet.

Q7: SELECT MIN(EventDate), MAX(EventDate) FROM hits;
Q8: SELECT AdvEngineID, COUNT(*) FROM hits WHERE AdvEngineID <> 0 GROUP BY AdvEngineID ORDER BY COUNT(*) DESC;

@sundy-li
Member Author

sundy-li commented Dec 19, 2022

On Q7 and Q8, native is faster than parquet. For the other queries the difference is not obvious, and a number of them are even slower than parquet.

Currently, the native format doesn't have the prewhere optimization that the parquet source has. This is an initial implementation that has no negative effect on the default parquet-based fuse engine; we can add prewhere later.

Later, I'd like to design a page filter index that is faster than the parquet page index.

@BohuTANG
Member

@sundy-li

Some tests are deleted (like 09_fuse_engine/09_0001_remote_insert_v2). What is the reason for this?

@sundy-li
Member Author

@sundy-li

Some tests are deleted (like 09_fuse_engine/09_0001_remote_insert_v2). What is the reason for this?

It's duplicated.

Member

@BohuTANG BohuTANG left a comment

🚀

@BohuTANG BohuTANG merged commit 6ced33d into datafuselabs:main Dec 19, 2022