Support customizing the location where data is written in Spark #6
Conversation
```java
    .orElse(table.properties().getOrDefault(
        TableProperties.WRITE_NEW_DATA_LOCATION,
        new Path(new Path(table.location()), "data").toString()));
return Optional.of(new Writer(table, lazyConf(), format, dataLocation));
```
Instead of adding parameters to Writer whenever a change like this is made, I'd rather pass the options into Writer and handle them there. The `dataLocation` method could do this work instead of moving it outside the Writer class.
I think doing options processing from a `Map<String, String>` inside a constructor is a bit of an antipattern. Consider, for example, writing a unit test for this class in the future. If we pass the `Writer` constructor only a `HashMap`, the unit test would have to construct that `HashMap` in a specific way, i.e. knowing what key-value pairs the constructor is expecting.
Perhaps we can have a builder object that acts as a factory: it accepts the `Map` and returns the `Writer`. The `Writer` constructor accepts the builder object and copies the set fields on the builder into its own fields.
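A rough sketch of that shape (a sketch only; the class and method names here are hypothetical, not code from this PR):

```java
// Hypothetical builder: it alone knows which option keys exist, so a unit
// test can set typed fields directly instead of assembling a HashMap with
// the right key-value pairs.
public static class WriterBuilder {
  private final Table table;
  private String dataLocation;

  WriterBuilder(Table table) {
    this.table = table;
  }

  // Factory side: translate the raw options map into typed fields.
  WriterBuilder fromOptions(Map<String, String> options) {
    this.dataLocation = options.get("write.folder-storage.path");
    return this;
  }

  // Typed setter a unit test can call directly.
  WriterBuilder dataLocation(String location) {
    this.dataLocation = location;
    return this;
  }

  Writer build() {
    return new Writer(this); // Writer copies the set fields off the builder
  }
}
```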
What I think is strange is passing the location of a write into the writer when we're passing table into the writer. Why isn't that logic entirely handled in the writer? The normal case is for the write location to come from table config. I'm not even sure that we should allow overriding the write location in Spark's write properties. What is the use case there?
I like your reasoning about not passing options as a map to make testing clear in general, but doing it here just shifts the concern to a different test. The test case is that setting "write.folder-storage.path" in Spark options changes the location of output files. A test that passes in the location can validate that the location is respected, but what we actually want to do is test that the table's location defaults, or is set by the table property, or (maybe) is set by Spark options.
I think for our use case we can have the write location specified in the table property. That would be sufficient. I also don't see the downside of introducing the extra flexibility of allowing the override to be specified in data source options, but we could defer the feature until later.
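For reference, keeping the resolution logic entirely inside `Writer`, as suggested above, might look like this (a sketch based on the diff at the top of this thread; it assumes Spark's `DataSourceOptions`, whose `get` returns an `Optional<String>`):

```java
// Precedence: Spark write option, then table property, then the default
// "data" folder under the table location.
private String dataLocation() {
  return options.get("write.folder-storage.path")
      .orElse(table.properties().getOrDefault(
          TableProperties.WRITE_NEW_DATA_LOCATION,
          new Path(new Path(table.location()), "data").toString()));
}
```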
```java
import com.netflix.iceberg.hadoop.HadoopTables;
import com.netflix.iceberg.spark.SparkSchemaUtil;
import com.netflix.iceberg.types.CheckCompatibility;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
```
Nit: imports should be a single block, not separated by newlines.
```java
// This only applies to files written after this property is set. Files previously written aren't relocated to
// reflect this parameter.
public static final String WRITE_NEW_DATA_LOCATION = "write.data.location";
```
Should this property mirror the existing `write.object-storage.path` property? Maybe `write.folder-storage.path` would be better.
Also, I would like to see a comment about what happens when this isn't set. That behavior should be to default to a `data` folder under the table location. These properties should be respected by all writers, so we want to have them well documented.
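Folding those suggestions together might produce something like this (a sketch; the rename and the Javadoc wording are proposals from this thread, not the merged code):

```java
/**
 * Location for new data files.
 * <p>
 * This only applies to files written after the property is set; previously
 * written files are not relocated. When unset, writers default to a "data"
 * folder under the table location. All writers should respect this property.
 */
public static final String WRITE_NEW_DATA_LOCATION = "write.folder-storage.path";
```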
```diff
@@ -71,7 +72,9 @@ public static void stopSpark() {
 public void testBasicWrite() throws IOException {
   File parent = temp.newFolder("parquet");
   File location = new File(parent, "test");
+  File dataLocation = new File(parent, "test-data");
```
This test should not be modified because it tests the default behavior. There should be an additional test for behavior with the new property. I'd also like to see a test that sets the property on the table instead of in the write options, and another one where both are set and the write option takes precedence.
Also, I think this should probably go into a more general test suite. This one is specific to Parquet, but you're testing the behavior for any file format.
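The extra cases described above could take roughly this shape (a JUnit sketch; `writeRecords` and `assertDataFilesUnder` are hypothetical helpers, not the suite's actual API):

```java
@Test
public void testTablePropertyDataLocation() throws IOException {
  File propertyLocation = new File(parent, "property-data");
  table.updateProperties()
      .set(TableProperties.WRITE_NEW_DATA_LOCATION, propertyLocation.toString())
      .commit();
  writeRecords(table);                    // no Spark write option set
  assertDataFilesUnder(propertyLocation); // the table property alone applies
}

@Test
public void testWriteOptionOverridesTableProperty() throws IOException {
  File propertyLocation = new File(parent, "property-data");
  File optionLocation = new File(parent, "option-data");
  table.updateProperties()
      .set(TableProperties.WRITE_NEW_DATA_LOCATION, propertyLocation.toString())
      .commit();
  writeRecords(table, ImmutableMap.of(    // option set alongside the property
      "write.folder-storage.path", optionLocation.toString()));
  assertDataFilesUnder(optionLocation);   // the write option takes precedence
}
```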
```java
@Override
public String toString() {
  return String.format("IcebergWrite(table=%s, type=%s, format=%s)",
      table, table.schema().asStruct(), format);
}

```
Nit: whitespace-only change.
```java
private void writeAndValidateWithLocations(
    Schema schema,
    boolean setTablePropertyDataLocation,
```
Hm, rereading this I think it tries too hard to reuse code in exchange for the antipattern of using boolean switches. This can be written more idiomatically.
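One idiomatic alternative, sketched here with illustrative names: drop the boolean parameters and let each test state its own setup, sharing only a helper that takes the expected location explicitly:

```java
@Test
public void testDefaultDataLocation() throws IOException {
  Table table = createTable(location);
  // No property, no option: files default to <table location>/data.
  writeAndValidate(table, new File(location, "data"));
}

@Test
public void testTablePropertyDataLocation() throws IOException {
  Table table = createTable(location);
  table.updateProperties()
      .set(TableProperties.WRITE_NEW_DATA_LOCATION, dataLocation.toString())
      .commit();
  writeAndValidate(table, dataLocation);
}
```

Each test then reads as a plain statement of its scenario instead of threading flags through one shared method.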
Addressed the comments; this is ready for another round of review. Also made the test cleaner.
```java
@Parameterized.Parameters
public static Object[][] parameters() {
  return new Object[][] {
      new Object[] { "parquet" },
      new Object[] { "parquet" },
```
Why did you add these?
Looks good to me, other than the duplicate test cases.
Merged. Thanks @mccheah!
This adds a new table property, `write.folder-storage.path`, that controls the location of new data files.
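As a usage sketch (the paths, the `df` DataFrame, and the source name are placeholders): the property can be set on the table through Iceberg's standard `updateProperties()` API so every writer respects it, or overridden for a single Spark write via an option, which takes precedence when both are set:

```java
// Set the location on the table itself.
table.updateProperties()
    .set("write.folder-storage.path", "s3://bucket/warehouse/custom-data")
    .commit();

// Or override it for one Spark write.
df.write()
    .format("iceberg")
    .option("write.folder-storage.path", "s3://bucket/warehouse/custom-data")
    .save(tableLocation);
```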
Note that we only use this in the Spark writer, but this also has to be worked into the other integrations.