Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize decimal: reduce validation when construct the decimal array or cast to the decimal array #2313

Closed
1 of 4 tasks
liukun4515 opened this issue Aug 4, 2022 · 6 comments
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@liukun4515
Copy link
Contributor

liukun4515 commented Aug 4, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When we use the ballista/datafusion to execute some query( the table has may decimal data type), and perf the cpu.

The validation of decimal cost a lot of cpu about 5% from below picture.

image

Describe the solution you'd like

Describe alternatives you've considered

Additional context

@liukun4515 liukun4515 added the enhancement Any new improvement worthy of a entry in the changelog label Aug 4, 2022
@liukun4515 liukun4515 self-assigned this Aug 4, 2022
@liukun4515
Copy link
Contributor Author

In the arrow-rs, there are many places to generate the decimal array.
If the precision/range of the target decimal is larger than the source decimal/data value, we don't need to validation the generated decimal array.

For example, Reading the decimal(n,0) from parquet int64 column and the n is greater equal to 18, we don't need to verify the result of the decimal array, because the value from int64 will not be overflow the target precision.

From above the method, we use less cpu for decimal data type.

@tustvold
Copy link
Contributor

tustvold commented Aug 4, 2022

Also potentially related to this, the way decimal data is currently read from parquet is hopelessly inefficient #2318. I keep meaning to fix it, but I haven't got to it yet. Perhaps I can find some time this weekend... 🤔

@liukun4515
Copy link
Contributor Author

Also potentially related to this, the way decimal data is currently read from parquet is hopelessly inefficient #2318. I keep meaning to fix it, but I haven't got to it yet. Perhaps I can find some time this weekend... 🤔

Your optimization will improve the performance of reading decimal data.
This issue may reduce unnecessary validation for generated decimal array.

@liukun4515
Copy link
Contributor Author

Decimal also can be deserialized from INT32 or INT64 type of parquet

@alamb
Copy link
Contributor

alamb commented Aug 9, 2022

This PR #2383 may have improved performance as well

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2022

We no longer perform validation except where explicitly opted in to

@tustvold tustvold closed this as completed Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants