WIP: add datafusion based parquet reader #312

jiacai2050 · 2022-10-18T04:07:24Z

Which issue does this PR close?

Closes #291

Rationale for this change

As described in #291. This PR also fixes the object store cache, which is not working:
after #14, the parquet reader reads all bytes out regardless of whether they are already cached.

What changes are included in this PR?

Replace the hand-rolled parquet reader with datafusion's ParquetExec, and add CachableParquetFileReader to implement a row-group level cache.
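
To illustrate the row-group level cache idea, here is a minimal std-only sketch. It is not the code in this PR: `RangeReader`, `InMemoryFile`, and `CachedRowGroupReader` are hypothetical names, and the row-group byte ranges are supplied by hand where a real reader would take them from the parquet footer. It only shows that a cache hit skips the underlying read.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for an object store: returns the bytes in `range`.
trait RangeReader {
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory "file" that counts how often the store is actually read.
struct InMemoryFile {
    data: Vec<u8>,
    reads: usize,
}

impl RangeReader for InMemoryFile {
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8> {
        self.reads += 1;
        self.data[range].to_vec()
    }
}

/// Caches the raw bytes of each row group, keyed by row-group index.
struct CachedRowGroupReader<R: RangeReader> {
    inner: R,
    /// Byte range of each row group (assumed known up front in this sketch).
    row_groups: Vec<Range<usize>>,
    cache: HashMap<usize, Vec<u8>>,
    cache_hit: usize,
    cache_miss: usize,
}

impl<R: RangeReader> CachedRowGroupReader<R> {
    fn new(inner: R, row_groups: Vec<Range<usize>>) -> Self {
        Self {
            inner,
            row_groups,
            cache: HashMap::new(),
            cache_hit: 0,
            cache_miss: 0,
        }
    }

    /// Returns the bytes of row group `idx`, reading the store only on a miss.
    /// Returns `None` if `idx` is not a known row group.
    fn row_group(&mut self, idx: usize) -> Option<&[u8]> {
        let range = self.row_groups.get(idx)?.clone();
        if self.cache.contains_key(&idx) {
            self.cache_hit += 1;
        } else {
            self.cache_miss += 1;
            let bytes = self.inner.read_range(range);
            self.cache.insert(idx, bytes);
        }
        self.cache.get(&idx).map(|b| b.as_slice())
    }
}
```

Requesting the same row group twice through `row_group` should touch the underlying store once; the `cache_hit` and `cache_miss` counters mirror the fields logged in the reviewed `reader.rs` hunk.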

Are there any user-facing changes?

No

How is this change tested?

Using existing unit tests.

ShiKaiWi · 2022-10-25T03:01:29Z

analytic_engine/src/sst/parquet/builder.rs

@@ -264,7 +264,14 @@ mod tests {
            };

            let mut reader = ParquetSstReader::new(&sst_file_path, &store, &sst_reader_options);
-            assert_eq!(reader.meta_data().await.unwrap(), &sst_meta);
+            let sst_meta_readback = {
+                // size of SstMetaData is not what this file's size, so overwrite it


Add a `FIXME:` prefix.

ShiKaiWi · 2022-10-25T06:28:44Z

analytic_engine/src/sst/parquet/reader.rs


+use super::encoding::{self, ParquetDecoder};


Avoid the relative import path.

ShiKaiWi · 2022-10-25T06:45:43Z

analytic_engine/src/sst/parquet/reader.rs

+            self.metadata_size_hint, self.cache_hit, self.cache_miss, self.metrics.bytes_scanned.value()
+        );
+    }
+}


A newline is missing here.

ShiKaiWi · 2022-10-25T06:48:33Z

common_types/src/time.rs

+
+    /// Creates expression like:
+    /// start <= time && time < end
+    pub fn df_expr(&self, column_name: impl AsRef<str>) -> Expr {


Suggested change

-    pub fn df_expr(&self, column_name: impl AsRef<str>) -> Expr {
+    pub fn to_df_expr(&self, column_name: impl AsRef<str>) -> Expr {
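
For reference, the half-open predicate `start <= time && time < end` from the doc comment can be sketched without datafusion. This is a simplified model, not the `TimeRange` in `common_types`: the struct, its `i64` fields, and both methods are assumptions made for illustration, and the predicate is rendered as a string rather than a datafusion `Expr`.

```rust
/// Hypothetical half-open time range `[start, end)` over integer timestamps.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TimeRange {
    start: i64,
    end: i64,
}

impl TimeRange {
    /// Evaluates `start <= time && time < end` directly.
    fn contains(&self, time: i64) -> bool {
        self.start <= time && time < self.end
    }

    /// Renders the same predicate as text for the given column name.
    fn to_predicate_string(&self, column_name: impl AsRef<str>) -> String {
        format!(
            "{start} <= {col} AND {col} < {end}",
            start = self.start,
            col = column_name.as_ref(),
            end = self.end
        )
    }
}
```

The start bound is inclusive and the end bound exclusive, so adjacent ranges never both match the same timestamp.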

jiacai2050 · 2022-10-27T06:13:09Z

In my local environment, performance regresses when adopting this new reader, so further investigation is required before merging this.

Tested sst file: 104,022,899 rows

old: 11709 ms (this excludes the file read, since the file is read out in advance)
new: 38986 ms
new without read: 23208 ms
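
The relative slowdown implied by these timings can be worked out with a small helper. The function and constant names are illustrative only; the three numbers are the ones reported above.

```rust
/// Timings reported in the comment above, in milliseconds.
const OLD_MS: u64 = 11_709;
const NEW_MS: u64 = 38_986;
const NEW_WITHOUT_READ_MS: u64 = 23_208;

/// Ratio of a new timing to a baseline timing (greater than 1.0 means slower).
fn slowdown(baseline_ms: u64, new_ms: u64) -> f64 {
    new_ms as f64 / baseline_ms as f64
}

/// Time attributable to the file read in the new reader, by subtraction.
fn new_read_ms() -> u64 {
    NEW_MS - NEW_WITHOUT_READ_MS
}
```

With these inputs, `slowdown(OLD_MS, NEW_MS)` is roughly 3.3 and `slowdown(OLD_MS, NEW_WITHOUT_READ_MS)` is roughly 2.0, so the new reader is about twice as slow even with the read excluded on both sides.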

Related issue: apache/arrow-rs#2916

jiacai2050 force-pushed the new-parquet-reader branch from 2a50baf to b303920 Compare October 18, 2022 04:12

jiacai2050 mentioned this pull request Oct 18, 2022

support scan parquet file in reverse order #313

Closed

jiacai2050 force-pushed the new-parquet-reader branch 4 times, most recently from e38659a to 4f4de4f Compare October 18, 2022 10:59

jiacai2050 requested review from chunshao90 and ShiKaiWi October 18, 2022 11:03

jiacai2050 force-pushed the new-parquet-reader branch 6 times, most recently from 13d99c0 to c557007 Compare October 19, 2022 02:39

jiacai2050 mentioned this pull request Oct 19, 2022

SstMetaData's size is 0 after flush #321

Closed

jiacai2050 force-pushed the new-parquet-reader branch 9 times, most recently from 6115035 to 6243fcf Compare October 21, 2022 04:26

ShiKaiWi reviewed Oct 25, 2022

jiacai2050 added 5 commits October 26, 2022 14:31

add new parquet reader

74dff12

compile fine, but have nested runtime

d1ff3c6

add record batch projector

6f2eefb

remove unused snippets

25c548d

fix tests

5b8bc47

jiacai2050 added 7 commits October 26, 2022 14:31

fix benchmark

a548575

fix clippy

57b1ff7

fix test harness

6644dbb

change config data path

868469f

add comments, remove serialized_reader

ba896c7

add metrics

525ce7b

add log for test

fb08a09

jiacai2050 force-pushed the new-parquet-reader branch 3 times, most recently from c1f7c51 to 6d3851d Compare October 26, 2022 07:38

add threaded reader

1427794

jiacai2050 force-pushed the new-parquet-reader branch from 6d3851d to 1427794 Compare October 26, 2022 08:04

jiacai2050 changed the title ~~feat: add datafusion based parquet reader~~ WIP: add datafusion based parquet reader Oct 27, 2022

This was referenced Oct 31, 2022

feat: add async parquet reader #355

Merged

Tracking issue for query performance #363

Closed

jiacai2050 closed this Nov 9, 2022
