Use Iceberg writers for Parquet data written from Spark. #63
@aokolnychyi, this PR avoids using Spark's WriteSupport for Parquet. It would be interesting to see which option performs better. My initial testing shows this new write path takes about 48 seconds for 500,000 random records, where the Spark WriteSupport takes about 52 seconds.
@rdblue I'll give it a try in a few days.
I have a local benchmark that compares writing 5,000,000 records in Parquet format using gzip compression via Iceberg and via the Spark native file source. Here are the results with this PR:
This PR is also critical because Iceberg will collect proper metadata per file (e.g., min/max values). As of now, we only keep track of the number of rows during writes via the Iceberg Spark data source.
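The per-file min/max metadata is what makes file pruning possible at scan-planning time. A minimal sketch of the idea, using a simplified stand-in for the real manifest data (the class and method names below are hypothetical, not Iceberg's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows why recording lower/upper bounds per data file
// lets a planner skip files whose value range cannot satisfy a predicate
// such as `col > literal`.
public class MinMaxPruning {
  static final class FileStats {
    final String path;
    final long min;
    final long max;

    FileStats(String path, long min, long max) {
      this.path = path;
      this.min = min;
      this.max = max;
    }
  }

  // Returns only the files that could contain rows matching col > literal.
  public static List<String> filesForGreaterThan(List<FileStats> files, long literal) {
    List<String> matches = new ArrayList<>();
    for (FileStats f : files) {
      // If the file's upper bound is <= literal, no row in it can match,
      // so the file is skipped without being opened.
      if (f.max > literal) {
        matches.add(f.path);
      }
    }
    return matches;
  }
}
```

Without the min/max stats, every file would have to be read to evaluate the predicate.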
Thanks @aokolnychyi! Looks like it isn't faster, but at least it isn't slower.
I will share the benchmarking code later and we can discuss whether the test is valid. Overall, I saw that performance on flat data is very close, but Iceberg is a bit faster on nested data. I have more benchmarks for the read path, and there are some interesting findings there as well.
```java
private final int precision;
private final int scale;
private final int length;
private final ThreadLocal<byte[]> bytes;
```
At one point would multiple threads access the same writer? Does this truly need to be a `ThreadLocal`?
Wonder if we can achieve something similar by making our own byte buffer pool that's instantiated here. Can make the lifecycle of the reused byte buffers more explicit.
These writer classes are used to build a tree that can consume objects to be written to Parquet files. Those trees are expensive to build because they require traversing the table schema. That means we may want to cache and reuse them later, possibly from multiple threads. Because the function they perform is stateless, it would be easy to later assume that these are thread-safe. That's why I went ahead and made them thread-safe.
We could change how this is done, but it is one byte array allocated per thread, so I think this light-weight solution is fine.
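The pattern being discussed can be sketched as follows, assuming a simplified writer (the class and method names here are illustrative, not from the PR): each thread lazily gets its own reusable byte array, so a cached writer tree can be shared across threads without per-value allocation.

```java
// Hypothetical sketch of a writer that reuses a per-thread buffer so
// shared writer instances stay thread-safe.
public class FixedBufferWriter {
  private final int length;
  private final ThreadLocal<byte[]> bytes;

  public FixedBufferWriter(int length) {
    this.length = length;
    // withInitial allocates one reusable array per thread, on first use
    this.bytes = ThreadLocal.withInitial(() -> new byte[length]);
  }

  // Encodes the value big-endian into this thread's reusable buffer.
  public byte[] fill(long value) {
    byte[] buffer = bytes.get(); // same array on every call from this thread
    for (int i = length - 1; i >= 0; i--) {
      buffer[i] = (byte) (value & 0xFF);
      value >>>= 8;
    }
    return buffer;
  }
}
```

The returned buffer is only valid until the next call on the same thread, which is the explicit-lifecycle concern raised above; the trade-off is zero allocation per record after the first.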
```java
@Override
public ParquetValueWriter<?> map(GroupType map,
                                 ParquetValueWriter<?> keyWriter,
```
Shouldn't indentation for this and many other methods in this class be 4 spaces in from the `public` modifier on the previous line? Also might want `map` to be on its own line. In general, the formatting for method declarations throughout seems off.
The rule I was using was to indent to the start of arguments, and if any one argument could not fit in the space, to move all arguments back to 4 spaces from the method definition's start. I'm fine with updating this convention.
Out of curiosity, how many args do we want to keep per line? I've seen a few args per line in other places.
We should standardize this. It would be good to just pick a rule.
We tested this change locally and did not notice any issues so far. It would be nice to see this merged once @mccheah's comments are addressed.
@rdblue Thanks for this PR. I applied it to latest and ran my functional tests. Works great! Confirmed that top-level stats are now kept in manifests and filters can prune files :-) File pruning on predicates is a critical feature for us. Looking forward to seeing this merged. P.S. The master branch has diverged slightly, so the PR patch didn't apply cleanly and some slight changes were needed. No biggie, but wanted to let you know.
```java
case INT_8:
case INT_16:
case INT_32:
case INT_64:
```
I realize that INT96 is deprecated per https://issues.apache.org/jira/browse/PARQUET-323 and I don't want to encourage people to use it in Iceberg. But if people do want to rewrite data with INT96 values into Iceberg (as INT64), would we run into this code? If so, shouldn't we handle it by truncating INT96 to INT64?
No, by the time this code gets INT96 values, they will be converted to binary, fixed, or a long. That's the engine's job because Iceberg won't be used to read INT96 values from files. So another read implementation produces values, then the engine hands them to Iceberg using a known type. No need to handle it here.
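For context, the conversion the engine would perform is mechanical. A hedged sketch (the class and method names are illustrative; the 12-byte layout, 8 bytes nanos-of-day followed by 4 bytes Julian day, both little-endian, is the Parquet INT96 timestamp encoding):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative helper showing how an engine could turn a Parquet INT96
// timestamp into a long (microseconds since the Unix epoch) before
// handing it to Iceberg as a known type.
public class Int96Conversion {
  // Julian day number of 1970-01-01, the Unix epoch
  private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;
  private static final long MICROS_PER_DAY = 86_400_000_000L;

  public static long int96ToMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();                    // first 8 bytes
    long julianDay = Integer.toUnsignedLong(buf.getInt()); // last 4 bytes
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000;
  }
}
```

This is lossy only in that sub-microsecond precision is dropped, which matches the point above: it is the reading engine's job, not Iceberg's.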
(force-pushed from 08ee543 to ee0330d)
@mccheah, @prodeezy, @aokolnychyi, I've rebased and updated this PR for your review comments. Please have another look when you get a chance. Thank you!
(force-pushed from ee0330d to 61d3195)
Minor nit: it would be nice to update the PR title to also reflect that this change adds top-level column metrics to Iceberg.
```java
@Override
public ParquetValueWriter<?> message(MessageType message,
                                     List<ParquetValueWriter<?>> fieldWriters) {
```
Style: fix indentation to standard (+2 spaces) for all methods in this file.
There's discussion on this further down. We should decide on a style for long method arguments and add rules to enforce it. I have generally used this style, but I know others have used 4 spaces and one argument per line. Some places where this style can't fit single arguments have used 4 spaces from the method definition and as many arguments per line as possible... so while I would say that this is pretty normal for the codebase, we need to clean it up.
Also, pending some style fixes, this looks good to me.
@prodeezy: what does this PR change that affects top-level column metrics? This shouldn't change metrics because they are taken from the footer after the file is written, which doesn't have much to do with how records are deconstructed and written. #136 addresses the missing nested metrics problem, right?
I think @prodeezy meant that the metrics are not actually fetched from the footer as of today. Correct me if I am wrong, but we still use … Anyway, I believe collecting min/max stats is a side effect of switching to Iceberg writers. The main feature is that we actually switch to Iceberg writers, so it makes sense to keep the PR name as it is.
I didn't know that was broken! Thanks for pointing it out. I've opened #137 to fix the bug in the
Thanks for the reviews, everyone! I'm merging this.
This removes the use of Spark's Parquet WriteSupport and replaces it with Parquet writers like those used for Iceberg generics and Avro records.