[WIP][SPARK-14098][SQL] Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called #15219
Test build #65835 has finished for PR 15219 at commit
Test build #65837 has finished for PR 15219 at commit
Test build #65838 has finished for PR 15219 at commit
Test build #65870 has finished for PR 15219 at commit
Test build #65871 has finished for PR 15219 at commit
Test build #65872 has finished for PR 15219 at commit
Test build #65884 has finished for PR 15219 at commit
Test build #65887 has finished for PR 15219 at commit
Test build #65973 has finished for PR 15219 at commit
Test build #66081 has finished for PR 15219 at commit
Test build #66085 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #66111 has finished for PR 15219 at commit
Test build #66122 has finished for PR 15219 at commit
Test build #66179 has finished for PR 15219 at commit
@davies would it be possible to review this?
Test build #66314 has finished for PR 15219 at commit
Test build #66325 has finished for PR 15219 at commit
@davies could you please review this? cc @cloud-fan
Test build #66575 has finished for PR 15219 at commit
Test build #71692 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71711 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71719 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71745 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71765 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71787 has finished for PR 15219 at commit
Jenkins, retest this please
Test build #71832 has finished for PR 15219 at commit
## What changes were proposed in this pull request?

This pull request adds test cases for the following cases:
- keep all data types with null or without null
- access `CachedBatch` disabling whole stage codegen
- access only some columns in `CachedBatch`

This PR is a part of apache#15219. Here are the motivations for adding these tests. When apache#15219 is enabled, the first two cases are handled by specialized (generated) code. The third one is a pitfall.

In general, even for now, it would be helpful to increase test coverage.

## How was this patch tested?

The added test suites themselves.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#15462 from kiszk/columnartestsuites.
…ctor/ColumnarBatch

## What changes were proposed in this pull request?

This PR refactors the code generation part that gets data from `ColumnVector` and `ColumnarBatch` by using a trait `ColumnarBatchScan` for ease of reuse. This is because this part will be reused by several components (e.g. parquet reader, Dataset.cache, and others) since `ColumnarBatch` will be a first-class citizen.

This PR is a part of apache#15219. In advance, this PR makes the code generation for `ColumnVector` and `ColumnarBatch` reusable as a trait. In general, this is very useful for other components from the reusability point of view, too.

## How was this patch tested?

Tested with existing test suites.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#15467 from kiszk/columnarrefactor.
Test build #72538 has finished for PR 15219 at commit
@kiszk This PR has been stale for a long time. If we don't plan to continue working on it in the near future, could you please temporarily close it? We can always reopen the PR when it's a good time to resume work on the issue. WDYT? Thanks!
@jiangxb1987 Thank you for pinging me. Sure, since we are working on this feature in other PRs, I will close this.
## What changes were proposed in this pull request?

Here is a design document for this change. This PR is derived from #11956 and #13899.

I am splitting this PR into multiple smaller child PRs (#15462, #15467, #15468) for ease of review.

This PR implements a new in-memory cache feature used by `DataFrame.cache` and `Dataset.cache`. The following is the basic design of this PR, with suggestions from @davies.

- `ColumnarBatch` with `ColumnVector` that are common data representations for columnar storage
- `ColumnVector` in `ColumnarBatch` depends on its data type
- `ColumnVector` for the in-memory cache by whole-stage codegen.
- `ColumnVector` to keep `UnsafeArrayData`
- `ColumnVector`
- `byte[]` for `UnsafeArrayData` and compressed data

The advantages of this PR are improved runtime performance and a refactoring to use common components.
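To make the idea of a type-specific column backed by a primitive array more concrete, here is a minimal, self-contained Java sketch. It is not Spark's `ColumnVector` or the code this PR generates: `ColumnSketch`, `IntColumn`, `sumDirect`, and `sumViaRows` are hypothetical names chosen for illustration only. It contrasts reading values straight from the column with first copying them into boxed row objects.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ColumnSketch {
    /** Hypothetical stand-in for a type-specialized column: one primitive array plus a null mask. */
    public static final class IntColumn {
        private final int[] values;
        private final boolean[] isNull;

        public IntColumn(int[] values, boolean[] isNull) {
            this.values = values;
            this.isNull = isNull;
        }

        public int numRows() { return values.length; }
        public boolean isNullAt(int row) { return isNull[row]; }
        public int getInt(int row) { return values[row]; }
    }

    /** Direct access: read the primitive array in a tight loop, with no per-row objects. */
    public static long sumDirect(IntColumn col) {
        long sum = 0;
        for (int row = 0; row < col.numRows(); row++) {
            if (!col.isNullAt(row)) {
                sum += col.getInt(row);
            }
        }
        return sum;
    }

    /** Row-based path: copy every value into a boxed row object first, then iterate over the rows. */
    public static long sumViaRows(IntColumn col) {
        List<Object[]> rows = new ArrayList<>();
        for (int row = 0; row < col.numRows(); row++) {
            Integer boxed = col.isNullAt(row) ? null : Integer.valueOf(col.getInt(row));
            rows.add(new Object[] { boxed });
        }
        long sum = 0;
        Iterator<Object[]> it = rows.iterator();
        while (it.hasNext()) {
            Object value = it.next()[0];
            if (value != null) {
                sum += (Integer) value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        IntColumn col = new IntColumn(
            new int[] {1, 2, 0, 4},
            new boolean[] {false, false, true, false});
        System.out.println(sumDirect(col));
        System.out.println(sumViaRows(col));
    }
}
```

Both methods compute the same sum; the second one allocates a row object and a boxed value per row, which is the kind of extra copy a direct read from the column avoids.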
For performance, this PR eliminates many virtual calls and data copies in the old code paths. In particular, for 2-a., this PR avoids copying data from the columnar storage `CachedBatch` to a row-based iterator.

For refactoring, a follow-up PR may remove the unused components for `CachedBatchBytes`.

## Options
A `ColumnVector` for all primitive data types in `ColumnarBatch` can be compressed. Currently, there are two ways to enable compression:

- set the property `spark.sql.inMemoryColumnarStorage.compressed` to `true` (default is `true`), or
- use one of the storage levels `MEMORY_ONLY_SER`, `MEMORY_ONLY_SER_2`, `MEMORY_AND_DISK_SER`, or `MEMORY_AND_DISK_SER_2`.

The compression scheme is specified by the property `spark.sql.inMemoryColumnarStorage.compression.codec` (default is `lz4`).

## Performance results
- 3.4x for building a `CachedColumnarBatch` and getting values (in 1 and 2-a)

ToDo: a follow-up PR will perform compression for complex data types such as arrays.

- 1.2x for getting values (only in 2-a)
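Returning to the compression options listed under Options, the rule described there (compression is on if the property is `true` or if a `_SER` storage level is used, with `lz4` as the default codec) could be read as the following Java sketch. This is only one interpretation of the description, not the PR's implementation: the class `CompressionOptionsSketch` and its methods are hypothetical, storage levels are modeled as plain strings, and the configuration is a plain `Map` rather than a Spark session configuration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CompressionOptionsSketch {
    // Property names as given in the PR description.
    public static final String COMPRESSED_KEY = "spark.sql.inMemoryColumnarStorage.compressed";
    public static final String CODEC_KEY = "spark.sql.inMemoryColumnarStorage.compression.codec";

    // Storage levels that the description lists as enabling compression (modeled as strings here).
    public static final Set<String> SERIALIZED_LEVELS = new HashSet<>(Arrays.asList(
        "MEMORY_ONLY_SER", "MEMORY_ONLY_SER_2", "MEMORY_AND_DISK_SER", "MEMORY_AND_DISK_SER_2"));

    /** Assumed rule: either the property (default "true") or a _SER storage level enables compression. */
    public static boolean compressionEnabled(Map<String, String> conf, String storageLevel) {
        boolean byProperty = Boolean.parseBoolean(conf.getOrDefault(COMPRESSED_KEY, "true"));
        return byProperty || SERIALIZED_LEVELS.contains(storageLevel);
    }

    /** Codec name, falling back to the default "lz4" stated in the description. */
    public static String codec(Map<String, String> conf) {
        return conf.getOrDefault(CODEC_KEY, "lz4");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(COMPRESSED_KEY, "false");
        System.out.println(compressionEnabled(conf, "MEMORY_ONLY"));
        System.out.println(compressionEnabled(conf, "MEMORY_ONLY_SER"));
        System.out.println(codec(conf));
    }
}
```

Under this reading, turning the property off does not disable compression for a cache persisted with one of the `_SER` storage levels.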
Here is an example program

`CachedColumnarBatch`

2.a Generated code for getting values from `CachedColumnarBatch` in code generated by whole stage codegen (primary path)

2.b Generated code for copying values from `CachedColumnarBatch` in a generated iterator

## How was this patch tested?

Added new tests