Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: change default value of use_parquet2 to 0. #13754

Merged
merged 2 commits into from Nov 23, 2023

Conversation

youngsofun
Copy link
Member

@youngsofun youngsofun commented Nov 17, 2023

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

we do not hope user to use this option unless nessary.

now only affect copy from parquet and select parquet.
later when other component mig to parquet rs, may resue this setting with diff meaning, or use a more specific setting name?

if any one feel there are some risk to use parquet_rs as default?we can stop and test and review more.
@b41sh @sundy-li about the data type
@RinChanNOWWW @Dousir9 about the read progress


This change is Reviewable

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Nov 17, 2023
@Dousir9
Copy link
Member

Dousir9 commented Nov 17, 2023

I downloaded the parquet file and located the problem:
panicked at src/common/arrow/src/arrow/io/parquet/read/deserialize/binary/utils.rs:73:46: called Result::unwrap() on an Err value: Overflow

@Dousir9
Copy link
Member

Dousir9 commented Nov 17, 2023

2 vCPU 13.0GB

use parquet2=1:

截屏2023-11-17 14 20 29

use parquet2=0:

截屏2023-11-17 14 36 48

@BohuTANG
Copy link
Member

2 vCPU 13.0GB

use parquet2=1:

截屏2023-11-17 14 20 29 ## use parquet2=0: 截屏2023-11-17 14 36 48

Looks parquet_rs is still bad in poor hardware.

@BohuTANG
Copy link
Member

Let's draft this PR.

@BohuTANG BohuTANG marked this pull request as draft November 17, 2023 09:58
@sundy-li
Copy link
Member

sundy-li commented Nov 17, 2023

looks parquet_rs is still bad in poor hardware.

But In better hardware, it may have better performance. And �I think correctness is high priority than performance, parquet2 lacked much features or bugs which were hard to fix.

Such as :

Cell In[15], line 3
      1 ctx.sql(
      2         "select * from billing where line_item_line_item_type = 'Usage' "
----> 3     ).collect()

RuntimeError: DataFrame collect error: Unimplemented. Code: 1002, Text = arrow: Decoding Int64 "PlainDictionary"-encoded optional , index-filtered parquet pages.

which is found by @everpcpc

@PsiACE
Copy link
Member

PsiACE commented Nov 17, 2023

I think correctness is high priority than performance, parquet2 lacked much features or bugs which were hard to fix.

I agree with this point. In addition, with the further maturity of the Arrow Rust ecosystem, it can be expected that the performance of Parquet will improve, while the maintainability of Parquet2 is almost no longer guaranteed.

@BohuTANG
Copy link
Member

@sundy-li agreed, I understand your situation. let's continue this PR and keep the use_parquet setting.

@BohuTANG BohuTANG marked this pull request as ready for review November 17, 2023 23:08
@youngsofun
Copy link
Member Author

unloading is using parquet2,
parquet_rs reader does not recognize variant types in unloaded parquet.
this is not expected since they are using the same mechanism.
cc @b41sh

@b41sh
Copy link
Member

b41sh commented Nov 18, 2023

unloading is using parquet2, parquet_rs reader does not recognize variant types in unloaded parquet. this is not expected since they are using the same mechanism. cc @b41sh

I'll follow up to fix this.

@BohuTANG BohuTANG added the ci-cloud Build docker image for cloud test label Nov 23, 2023
Copy link
Contributor

Docker Image for PR

  • tag: pr-13754-60a2a23

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG BohuTANG merged commit 1b6d8fd into datafuselabs:main Nov 23, 2023
87 checks passed
andylokandy pushed a commit to andylokandy/databend that referenced this pull request Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read remote parquet file Result::unwrap() on an Err value: Overflow
6 participants